Building a Local AI Coding Assistant with NVIDIA Spark and Claude Code Router
Learn how to set up a powerful local coding assistant using NVIDIA Spark's vLLM inference, the Qwen3-Coder model, and Claude Code Router for seamless model switching.

Overview
This guide walks you through setting up a local AI coding assistant that combines the power of NVIDIA's Spark platform with the flexibility of Claude Code Router. By the end, you'll have a system that can route coding tasks to different AI models, including the powerful Qwen3-Coder-30B model running locally via vLLM.
What you'll achieve:
- Run state-of-the-art coding models locally using NVIDIA Spark
- Integrate with Claude Code for seamless AI-assisted development
- Configure intelligent model routing based on task type
- Save on API costs while maintaining high-quality code assistance
Prerequisites
Before starting this project, you should have:
- Basic command line experience
- Understanding of Docker containers
- Familiarity with environment variables and API keys
- Experience with Node.js and npm
- NVIDIA GPU with CUDA support (recommended: 24GB+ VRAM for 30B model)
Software Requirements
| Software | Version | Purpose |
|---|---|---|
| Docker | Latest | Container runtime for vLLM |
| NVIDIA Container Toolkit | Latest | GPU access in containers |
| Node.js | 18+ | Claude Code Router runtime |
| Claude Code | Latest | AI coding assistant |
| NVIDIA Spark Account | - | Access to model inference |
Materials
Hardware
- NVIDIA GPU (RTX 3090/4090/5090 or better recommended)
- 32GB+ System RAM
- 100GB+ free disk space
Software & Services
- NVIDIA DGX Spark
- Claude Code Router npm package
- Docker and NVIDIA Container Toolkit
Step 1: Set Up NVIDIA Spark vLLM
First, we'll set up the vLLM inference server using NVIDIA Spark's optimized container.
Pull the vLLM Container
docker pull nvcr.io/nvidia/vllm:25.09-py3
Launch vLLM with GPU Support
Start the vLLM container with GPU access and your chosen model:
docker run -it -d --name vllm-spark \
--gpus all -p 8000:8000 \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Key parameters explained:
- --gpus all: Grants access to all available GPUs
- -p 8000:8000: Exposes the vLLM API server on port 8000
- -e HF_TOKEN=<your_huggingface_token>: Your Hugging Face token for downloading gated/private models (get it from your HF Settings)
- -v ~/.cache/huggingface:/root/.cache/huggingface: Mounts the local Hugging Face cache to avoid re-downloading models
- --tensor-parallel-size 1: Number of GPUs to use (increase for multi-GPU systems)
- --dtype auto: Automatically selects the best precision for your hardware
- --api-key: (optional) Sets authentication for the vLLM API (use any string you want)
Note: The HF_TOKEN is required only for gated or private models. For public models, you can omit the -e HF_TOKEN line.
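Several of the flags above (-e HF_TOKEN, --tensor-parallel-size, --dtype, --api-key) are not part of the minimal command shown earlier. If you need them, a variant of the launch command could look like the sketch below; the token and API key values are placeholders you must replace:
# Variant launch including the optional flags; <your_huggingface_token> and <your_api_key> are placeholders
docker run -it -d --name vllm-spark \
--gpus all -p 8000:8000 \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-e HF_TOKEN=<your_huggingface_token> \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--tensor-parallel-size 1 \
--dtype auto \
--api-key <your_api_key>
Note that the -e HF_TOKEN flag goes on docker run (before the image name), while the vLLM flags go after the model name.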
Verify the Server is Running
Watch the container logs for successful startup:
docker logs -f vllm-spark
Look for messages indicating:
- Model loading completion
- Server startup on port 8000
- GPU memory allocation details
You should see output like:
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
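You can also query the OpenAI-compatible /v1/models endpoint to confirm the model is loaded and to see the exact model ID you will reference later (run this on the server, or swap in your server's hostname):
# List the models the vLLM server is currently serving
curl http://localhost:8000/v1/models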
Step 2: Test vLLM Inference
Before integrating with Claude Code Router, verify the vLLM server responds correctly.
Basic API Test
curl -X POST http://spark1.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_VLLM_API_KEY" \
-d '{
"model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
"messages": [
{
"role": "user",
"content": "Write a Python function to calculate fibonacci numbers"
}
],
"max_tokens": 500
}'
Expected Result: You should receive a JSON response with a code completion from the Qwen model. If you did not set --api-key when launching vLLM, you can omit the Authorization header.
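To pull out just the generated code from that JSON response, you can pipe the same request through jq (assuming jq is installed); the response follows the standard OpenAI chat-completion shape:
# Same request, extracting only the assistant's reply (requires jq)
curl -s -X POST http://spark1.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_VLLM_API_KEY" \
-d '{"model": "Qwen/Qwen3-Coder-30B-A3B-Instruct", "messages": [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}], "max_tokens": 500}' \
| jq -r '.choices[0].message.content'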
Step 3: Install Claude Code Router
Now we'll install the router that will connect Claude Code to your local vLLM instance.
# Install Claude Code (if not already installed)
npm install -g @anthropic-ai/claude-code
# Install Claude Code Router
npm install -g @musistudio/claude-code-router
Verify Installation
ccr --version
Step 4: Configure Claude Code Router
Create the router configuration that defines how to connect to your local vLLM server.
Create Configuration Directory
mkdir -p ~/.claude-code-router
Configure Router Settings
Create ~/.claude-code-router/config.json:
{
"LOG": false,
"LOG_LEVEL": "debug",
"CLAUDE_PATH": "",
"HOST": "127.0.0.1",
"PORT": 3456,
"APIKEY": "",
"API_TIMEOUT_MS": "600000",
"PROXY_URL": "",
"transformers": [],
"Providers": [
{
"name": "spark1",
"api_base_url": "http://spark1.local:8000/v1/chat/completions",
"api_key": "notused",
"models": ["Qwen/Qwen3-Coder-30B-A3B-Instruct"]
}
],
"Router": {
"default": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
"background": "",
"think": "",
"longContext": "",
"longContextThreshold": 200000,
"webSearch": "",
"image": ""
},
"CUSTOM_ROUTER_PATH": ""
}
Configuration notes:
- Replace the "notused" api_key value with the API key you set via --api-key when launching the container (if you did not set one, any placeholder string works)
- Adjust API_TIMEOUT_MS and longContextThreshold based on your use case
- Modify the routing strategy in the Router section based on your preferences (see the example below)
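For example, to send background, reasoning, and long-context tasks to the same local model, you could fill in the other routes using the provider,model format shown in the default route (a sketch based on the config above):
"Router": {
"default": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
"background": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
"think": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
"longContext": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
"longContextThreshold": 200000,
"webSearch": "",
"image": ""
}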
Step 5: Launch Claude Code with Router
Start Claude Code using the router configuration:
ccr code
This launches Claude Code with intelligent routing to your local vLLM instance.
Verify Routing
Ask Claude Code to perform a coding task and observe which model responds. With the configuration above, requests (including background tasks and complex code generation) are routed to your local Qwen instance via the default route.
Step 6: Optimize Performance
Adjust vLLM Settings
Depending on your hardware and usage patterns, you may want to tune vLLM parameters:
# TODO: Adjust these parameters based on your GPU memory and performance needs
# Stop and remove the previous container first: docker stop vllm-spark && docker rm vllm-spark
docker run -it -d --gpus all \
--name vllm-spark \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--dtype auto
Key parameters to adjust:
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0)
- --max-model-len: Maximum sequence length (affects memory usage)
- --tensor-parallel-size: Number of GPUs to use
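Before choosing a --gpu-memory-utilization value, it helps to check how much VRAM is actually free. A quick query using standard nvidia-smi flags:
# Show GPU name plus total and used memory in MiB
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv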
Monitor Resource Usage
# Watch GPU utilization
nvidia-smi -l 1
# Monitor container logs
docker logs -f vllm-spark
Troubleshooting
Issue: vLLM Container Fails to Start
Symptoms: Container exits immediately or shows CUDA errors
Solutions:
- Verify NVIDIA drivers are installed: nvidia-smi
- Check the NVIDIA Container Toolkit: docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
- Ensure sufficient GPU memory for the model
- Try a smaller model first (e.g., Qwen2.5-Math-1.5B-Instruct), as shown below
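A minimal sanity-check launch with the smaller model mentioned above might look like this (the container name here is an arbitrary example):
# Test launch with a small model to confirm GPU access and the container toolkit work
docker run -it -d --name vllm-test \
--gpus all -p 8000:8000 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve Qwen/Qwen2.5-Math-1.5B-Instruct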
Issue: Claude Code Router Not Connecting
Symptoms: Router fails to start or can't reach vLLM
Solutions:
- Verify vLLM is running: curl http://localhost:8000/health
- Check that the API key in the config matches the one used when starting vLLM
- Ensure no firewall is blocking port 8000
- Review router logs: ccr code --verbose
Issue: Slow Inference Performance
Symptoms: Long wait times for responses
Solutions:
- Reduce --max-model-len to decrease memory usage
- Increase --gpu-memory-utilization if you have headroom
- Use --dtype half or --dtype bfloat16 for faster inference
- Monitor GPU usage with nvidia-smi to identify bottlenecks
Issue: Out of Memory Errors
Symptoms: CUDA OOM errors in vLLM logs
Solutions:
- Decrease --max-model-len
- Lower --gpu-memory-utilization
- Use a smaller model variant
- Enable tensor parallelism across multiple GPUs if available
Advanced Configuration
Note: Running multiple models requires starting separate vLLM containers on different ports.
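For instance, you could start a second container serving a smaller coder model on port 8001 and register it as another provider. The container name, port, and model below are illustrative choices, not requirements:
# Second vLLM container on port 8001 (example name, port, and model)
docker run -it -d --name vllm-spark-small \
--gpus all -p 8001:8001 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8001
Then add a matching entry to the Providers array in ~/.claude-code-router/config.json, following the same shape as the spark1 entry, and reference it from a route such as "background":
{
"name": "spark1-small",
"api_base_url": "http://spark1.local:8001/v1/chat/completions",
"api_key": "notused",
"models": ["Qwen/Qwen2.5-Coder-7B-Instruct"]
}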
Next Steps
Now that you have a working local AI coding setup, consider:
- Experiment with different models: Try other models from Hugging Face
- Optimize routing rules: Fine-tune which tasks go to which models
- Add monitoring: Set up logging and performance tracking
- Integrate with CI/CD: Use ccr in GitHub Actions for automated code review
- Create custom workflows: Build specialized transformers for your team's needs
Related Projects
- Setting up a local LLM inference server
- Optimizing AI model performance on consumer hardware
- Building custom AI coding workflows
Conclusion
You now have a powerful local AI coding assistant that combines the best of both worlds: the versatility of Claude Code with the cost-effectiveness and privacy of local inference. The Qwen3-Coder-30B model provides excellent code generation capabilities, and the router gives you flexibility to use the right model for each task.
As you use this setup, experiment with the routing configuration to find what works best for your workflow. The ability to run models locally while maintaining access to cloud-based models gives you unprecedented control over your AI-assisted development environment.
Have questions or improvements? Feel free to reach out or contribute to the discussion!