Building a Local AI Coding Assistant with NVIDIA Spark and Claude Code Router

Learn how to set up a powerful local coding assistant using NVIDIA Spark's vLLM inference, the Qwen3-Coder model, and Claude Code Router for seamless model switching.

Difficulty: Intermediate
Time: 2-3 hours
Published: 10/25/2025
Technologies: NVIDIA Spark, vLLM, Claude Code, Docker, Node.js

Prerequisites

  • NVIDIA GPU with CUDA support
  • Basic Docker experience
  • Node.js 18+ installed
  • Understanding of command line operations

Required Materials

  • NVIDIA GeForce RTX 5090 (x1) - $3,359.00
  • NVIDIA DGX Spark (x1) - $4,000.00

Estimated Total: $7,359.00

Overview

This guide walks you through setting up a local AI coding assistant that combines the power of NVIDIA's Spark platform with the flexibility of Claude Code Router. By the end, you'll have a system that can route coding tasks to different AI models, including the powerful Qwen3-Coder-30B model running locally via vLLM.

What you'll achieve:

  • Run state-of-the-art coding models locally using NVIDIA Spark
  • Integrate with Claude Code for seamless AI-assisted development
  • Configure intelligent model routing based on task type
  • Save on API costs while maintaining high-quality code assistance

Prerequisites

Before starting this project, you should have:

  • Basic command line experience
  • Understanding of Docker containers
  • Familiarity with environment variables and API keys
  • Experience with Node.js and npm
  • NVIDIA GPU with CUDA support (recommended: 24GB+ VRAM for 30B model)

Software Requirements

Software                 | Version | Purpose
Docker                   | Latest  | Container runtime for vLLM
NVIDIA Container Toolkit | Latest  | GPU access in containers
Node.js                  | 18+     | Claude Code Router runtime
Claude Code              | Latest  | AI coding assistant
NVIDIA Spark Account     | -       | Access to model inference

Materials

Hardware

  • NVIDIA GPU (RTX 3090/4090/5090 or better recommended)
  • 32GB+ System RAM
  • 100GB+ free disk space

Software & Services

  • NVIDIA DGX Spark
  • Claude Code Router npm package
  • Docker and NVIDIA Container Toolkit

Step 1: Set Up NVIDIA Spark vLLM

First, we'll set up the vLLM inference server using NVIDIA Spark's optimized container.

Pull the vLLM Container

docker pull nvcr.io/nvidia/vllm:25.09-py3

Launch vLLM with GPU Support

Start the vLLM container with GPU access and your chosen model:

docker run -it -d --name vllm-spark \
  --gpus all -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Key parameters explained:

  • --gpus all - Grants access to all available GPUs
  • -p 8000:8000 - Exposes the vLLM API server on port 8000
  • -v "$HOME/.cache/huggingface:/root/.cache/huggingface" - Mounts the local Hugging Face cache so models aren't re-downloaded
  • --enable-auto-tool-choice and --tool-call-parser qwen3_coder - Enable tool calling with the parser Qwen3-Coder expects

Optional flags you can add (see the extended example below):

  • -e HF_TOKEN=<your_huggingface_token> - Hugging Face token for downloading gated or private models (get it from your HF account settings); public models like this one don't need it
  • --tensor-parallel-size - Number of GPUs to use (increase for multi-GPU systems)
  • --dtype auto - Automatically selects the best precision for your hardware
  • --api-key - Sets authentication for the vLLM API (use any string you want)
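
If you do need the optional flags, the launch command can be extended as shown below. This is a sketch that assumes you want to pass a Hugging Face token and protect the API with a key of your choosing; substitute your own values for the placeholders:

# Extended launch command with the optional flags added (placeholders are yours to fill in)
docker run -it -d --name vllm-spark \
  --gpus all -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e HF_TOKEN=<your_huggingface_token> \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --dtype auto \
  --api-key <your_chosen_api_key>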

Verify the Server is Running

Watch the container logs for successful startup:

docker logs -f vllm-spark

Look for messages indicating:

  • Model loading completion
  • Server startup on port 8000
  • GPU memory allocation details

You should see output like:

INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
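
As an additional check once startup completes, you can query the OpenAI-compatible models endpoint; it should list the model you loaded (include an Authorization header only if you set --api-key):

curl http://localhost:8000/v1/models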

Step 2: Test vLLM Inference

Before integrating with Claude Code Router, verify the vLLM server responds correctly. In the examples below, spark1.local is the hostname of the machine running vLLM; replace it with your own hostname, or use localhost if you are testing from the same machine. The Authorization header is only needed if you launched vLLM with --api-key.

Basic API Test

curl -X POST http://spark1.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_VLLM_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function to calculate fibonacci numbers"
      }
    ],
    "max_tokens": 500
  }'

Expected Result: You should receive a JSON response with a code completion from the Qwen model.
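
For quicker inspection, you can pipe the same request through jq to print only the generated code (this assumes jq is installed; add the Authorization header back if you set --api-key):

curl -s -X POST http://spark1.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    "max_tokens": 500
  }' | jq -r '.choices[0].message.content'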

Step 3: Install Claude Code Router

Now we'll install the router that will connect Claude Code to your local vLLM instance.

# Install Claude Code (if not already installed)
npm install -g @anthropic-ai/claude-code
 
# Install Claude Code Router
npm install -g @musistudio/claude-code-router

Verify Installation

ccr --version
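
You can also confirm the Claude Code CLI itself is available:

claude --version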

Step 4: Configure Claude Code Router

Create the router configuration that defines how to connect to your local vLLM server.

Create Configuration Directory

mkdir -p ~/.claude-code-router

Configure Router Settings

Create ~/.claude-code-router/config.json:

{
  "LOG": false,
  "LOG_LEVEL": "debug",
  "CLAUDE_PATH": "",
  "HOST": "127.0.0.1",
  "PORT": 3456,
  "APIKEY": "",
  "API_TIMEOUT_MS": "600000",
  "PROXY_URL": "",
  "transformers": [],
  "Providers": [
    {
      "name": "spark1",
      "api_base_url": "http://spark1.local:8000/v1/chat/completions",
      "api_key": "notused",
      "models": ["Qwen/Qwen3-Coder-30B-A3B-Instruct"]
    }
  ],
  "Router": {
    "default": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "background": "",
    "think": "",
    "longContext": "",
    "longContextThreshold": 200000,
    "webSearch": "",
    "image": ""
  },
  "CUSTOM_ROUTER_PATH": ""
}

Configuration notes:

  • api_base_url must point at your vLLM server; change spark1.local to your own hostname, or use localhost if the router runs on the same machine
  • If you launched vLLM with --api-key, put that value in the provider's api_key field (the "notused" placeholder only works when the server has no key set)
  • Adjust the Router entries (background, think, longContext, webSearch, image) to control which provider and model handle each task type
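
A malformed config file is a common cause of router startup problems, so it can help to confirm the JSON parses cleanly before launching (this check assumes jq is installed; python3 -m json.tool works as an alternative):

jq . ~/.claude-code-router/config.json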

Step 5: Launch Claude Code with Router

Start Claude Code using the router configuration:

ccr code

This launches Claude Code with intelligent routing to your local vLLM instance.

Verify Routing

Ask Claude Code to perform a coding task and observe which model responds. For background tasks and complex code generation, it should automatically route to your local Qwen instance based on the router configuration.
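
One simple way to confirm that requests are reaching your local server rather than a cloud model is to tail the vLLM container logs in a second terminal while you issue a prompt; new request log lines should appear for each completion:

# Run in a separate terminal while prompting Claude Code
docker logs -f vllm-spark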

Step 6: Optimize Performance

Adjust vLLM Settings

Depending on your hardware and usage patterns, you may want to tune vLLM parameters:

# Adjust these parameters based on your GPU memory and performance needs
docker run -it -d --name vllm-spark \
  --gpus all -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --dtype auto

Key parameters to adjust:

  • --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0)
  • --max-model-len: Maximum sequence length (affects memory usage)
  • --tensor-parallel-size: Number of GPUs to use
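
Note that Docker won't reuse a container name that is still registered, so before re-launching with new settings, stop and remove the previous container:

docker rm -f vllm-spark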

Monitor Resource Usage

# Watch GPU utilization
nvidia-smi -l 1
 
# Monitor container logs
docker logs -f vllm-spark
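
For a more compact view of memory headroom (useful when tuning --gpu-memory-utilization), nvidia-smi can report just the memory counters on an interval:

# Report used vs. total GPU memory every 5 seconds
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5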

Troubleshooting

Issue: vLLM Container Fails to Start

Symptoms: Container exits immediately or shows CUDA errors

Solutions:

  • Verify NVIDIA drivers are installed: nvidia-smi
  • Check NVIDIA Container Toolkit: docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
  • Ensure sufficient GPU memory for the model
  • Try a smaller model first (e.g., Qwen/Qwen2.5-Coder-1.5B-Instruct)

Issue: Claude Code Router Not Connecting

Symptoms: Router fails to start or can't reach vLLM

Solutions:

  • Verify vLLM is running: curl http://localhost:8000/health
  • Check API key in config matches the one used when starting vLLM
  • Ensure no firewall blocking localhost:8000
  • Review router logs: set "LOG": true in ~/.claude-code-router/config.json and restart the router

Issue: Slow Inference Performance

Symptoms: Long wait times for responses

Solutions:

  • Reduce --max-model-len to decrease memory usage
  • Increase --gpu-memory-utilization if you have headroom
  • Use --dtype half or --dtype bfloat16 for faster inference
  • Monitor GPU usage with nvidia-smi to identify bottlenecks

Issue: Out of Memory Errors

Symptoms: CUDA OOM errors in vLLM logs

Solutions:

  • Decrease --max-model-len
  • Lower --gpu-memory-utilization
  • Use a smaller model variant
  • Enable tensor parallelism across multiple GPUs if available

Advanced Configuration

Note: Running multiple models requires starting separate vLLM containers on different ports.
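
As a sketch of what that might look like, the command below starts a second, smaller model on port 8001 (Qwen/Qwen2.5-Coder-7B-Instruct is used here purely as an example). Because both servers share GPU memory, you will likely need to lower --gpu-memory-utilization on each. You would then add a second entry to the Providers array in config.json pointing at port 8001 and reference it in the Router section (for example, as the background model).

# Hypothetical second vLLM instance serving a smaller model on port 8001
docker run -it -d --name qwen-small \
  --gpus all -p 8001:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --gpu-memory-utilization 0.3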

Next Steps

Now that you have a working local AI coding setup, consider:

  1. Experiment with different models: Try other models from Hugging Face
  2. Optimize routing rules: Fine-tune which tasks go to which models
  3. Add monitoring: Set up logging and performance tracking
  4. Integrate with CI/CD: Use ccr in GitHub Actions for automated code review
  5. Create custom workflows: Build specialized transformers for your team's needs

Conclusion

You now have a powerful local AI coding assistant that combines the best of both worlds: the versatility of Claude Code with the cost-effectiveness and privacy of local inference. The Qwen3-Coder-30B model provides excellent code generation capabilities, and the router gives you flexibility to use the right model for each task.

As you use this setup, experiment with the routing configuration to find what works best for your workflow. The ability to run models locally while maintaining access to cloud-based models gives you unprecedented control over your AI-assisted development environment.


Have questions or improvements? Feel free to reach out or contribute to the discussion!