Building a Local AI Coding Assistant with NVIDIA Spark and Claude Code Router

Learn how to set up a powerful local coding assistant using NVIDIA Spark's vLLM inference, the Qwen3-Coder model, and Claude Code Router for seamless model switching.

Difficulty: Intermediate
Time: 2-3 hours
Published: 10/25/2025
Technologies: NVIDIA Spark, vLLM, Claude Code, Docker, Node.js

Prerequisites

  • NVIDIA GPU with CUDA support
  • Basic Docker experience
  • Node.js 18+ installed
  • Understanding of command line operations

Required Materials

  • NVIDIA GeForce RTX 5090 (x1) - $3,359.00
  • NVIDIA DGX Spark (x1) - $4,000.00

Estimated Total: $7,359.00

Overview

This guide walks you through setting up a local AI coding assistant that combines the power of NVIDIA's Spark platform with the flexibility of Claude Code Router. By the end, you'll have a system that can route coding tasks to different AI models, including the powerful Qwen3-Coder-30B model running locally via vLLM.

What you'll achieve:

  • Run state-of-the-art coding models locally using NVIDIA Spark
  • Integrate with Claude Code for seamless AI-assisted development
  • Configure intelligent model routing based on task type
  • Save on API costs while maintaining high-quality code assistance

Prerequisites

Before starting this project, you should have:

  • Basic command line experience
  • Understanding of Docker containers
  • Familiarity with environment variables and API keys
  • Experience with Node.js and npm
  • NVIDIA GPU with CUDA support (recommended: 24GB+ VRAM for 30B model)

Software Requirements

Software                 | Version | Purpose
Docker                   | Latest  | Container runtime for vLLM
NVIDIA Container Toolkit | Latest  | GPU access in containers
Node.js                  | 18+     | Claude Code Router runtime
Claude Code              | Latest  | AI coding assistant
NVIDIA Spark Account     | -       | Access to model inference

Materials

Hardware

  • NVIDIA GPU (RTX 3090/4090/5090 or better recommended)
  • 32GB+ System RAM
  • 100GB+ free disk space

Software & Services

  • NVIDIA DGX Spark
  • Claude Code Router npm package
  • Docker and NVIDIA Container Toolkit

Step 1: Set Up NVIDIA Spark vLLM

First, we'll set up the vLLM inference server using NVIDIA Spark's optimized container.

Pull the vLLM Container

docker pull nvcr.io/nvidia/vllm:25.09-py3

Launch vLLM with GPU Support

Start the vLLM container with GPU access and your chosen model:

docker run -it -d --name vllm-spark \
  --gpus all -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Key parameters explained:

  • --gpus all - Grants access to all available GPUs
  • -p 8000:8000 - Exposes the vLLM API server on port 8000
  • -v "$HOME/.cache/huggingface:/root/.cache/huggingface" - Mounts the local Hugging Face cache so models aren't re-downloaded
  • --enable-auto-tool-choice and --tool-call-parser qwen3_coder - Enable tool calling with the parser Qwen3-Coder expects

Optional flags you can add (see the extended example below):

  • -e HF_TOKEN=<your_huggingface_token> - Hugging Face token for downloading gated or private models (get it from your HF account settings); public models like this one don't need it
  • --tensor-parallel-size - Number of GPUs to use (increase for multi-GPU systems)
  • --dtype auto - Automatically selects the best precision for your hardware
  • --api-key - Sets authentication for the vLLM API (use any string you want)
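
If you do need the optional flags, the launch command can be extended as shown below. This is a sketch that assumes you want to pass a Hugging Face token and protect the API with a key of your choosing; substitute your own values for the placeholders:

# Extended launch command with the optional flags added (placeholders are yours to fill in)
docker run -it -d --name vllm-spark \
  --gpus all -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e HF_TOKEN=<your_huggingface_token> \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --dtype auto \
  --api-key <your_chosen_api_key>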

Verify the Server is Running

Watch the container logs for successful startup:

docker logs -f vllm-spark

Look for messages indicating:

  • Model loading completion
  • Server startup on port 8000
  • GPU memory allocation details

You should see output like:

INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
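
As an additional check once startup completes, you can query the OpenAI-compatible models endpoint; it should list the model you loaded (include an Authorization header only if you set --api-key):

curl http://localhost:8000/v1/models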

Step 2: Test vLLM Inference

Before integrating with Claude Code Router, verify the vLLM server responds correctly. In the examples below, spark1.local is the hostname of the machine running vLLM; replace it with your own hostname, or use localhost if you are testing from the same machine. The Authorization header is only needed if you launched vLLM with --api-key.

Basic API Test

curl -X POST http://spark1.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_VLLM_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function to calculate fibonacci numbers"
      }
    ],
    "max_tokens": 500
  }'

Expected Result: You should receive a JSON response with a code completion from the Qwen model.
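
For quicker inspection, you can pipe the same request through jq to print only the generated code (this assumes jq is installed; add the Authorization header back if you set --api-key):

curl -s -X POST http://spark1.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    "max_tokens": 500
  }' | jq -r '.choices[0].message.content'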

Step 3: Install Claude Code Router

Now we'll install the router that will connect Claude Code to your local vLLM instance.

# Install Claude Code (if not already installed)
npm install -g @anthropic-ai/claude-code
 
# Install Claude Code Router
npm install -g @musistudio/claude-code-router

Verify Installation

ccr --version
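
You can also confirm the Claude Code CLI itself is available:

claude --version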

Step 4: Configure Claude Code Router

Create the router configuration that defines how to connect to your local vLLM server.

Create Configuration Directory

mkdir -p ~/.claude-code-router

Configure Router Settings

Create ~/.claude-code-router/config.json:

{
  "LOG": false,
  "LOG_LEVEL": "debug",
  "CLAUDE_PATH": "",
  "HOST": "127.0.0.1",
  "PORT": 3456,
  "APIKEY": "",
  "API_TIMEOUT_MS": "600000",
  "PROXY_URL": "",
  "transformers": [],
  "Providers": [
    {
      "name": "spark1",
      "api_base_url": "http://spark1.local:8000/v1/chat/completions",
      "api_key": "notused",
      "models": ["Qwen/Qwen3-Coder-30B-A3B-Instruct"]
    }
  ],
  "Router": {
    "default": "spark1,Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "background": "",
    "think": "",
    "longContext": "",
    "longContextThreshold": 200000,
    "webSearch": "",
    "image": ""
  },
  "CUSTOM_ROUTER_PATH": ""
}

Configuration notes:

  • api_base_url must point at your vLLM server; change spark1.local to your own hostname, or use localhost if the router runs on the same machine
  • If you launched vLLM with --api-key, put that value in the provider's api_key field (the "notused" placeholder only works when the server has no key set)
  • Adjust the Router entries (background, think, longContext, webSearch, image) to control which provider and model handle each task type
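
A malformed config file is a common cause of router startup problems, so it can help to confirm the JSON parses cleanly before launching (this check assumes jq is installed; python3 -m json.tool works as an alternative):

jq . ~/.claude-code-router/config.json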

Step 5: Launch Claude Code with Router

Start Claude Code using the router configuration:

ccr code

This launches Claude Code with intelligent routing to your local vLLM instance.

Verify Routing

Ask Claude Code to perform a coding task and observe which model responds. For background tasks and complex code generation, it should automatically route to your local Qwen instance based on the router configuration.
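
One simple way to confirm that requests are reaching your local server rather than a cloud model is to tail the vLLM container logs in a second terminal while you issue a prompt; new request log lines should appear for each completion:

# Run in a separate terminal while prompting Claude Code
docker logs -f vllm-spark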

Step 6: Optimize Performance

Adjust vLLM Settings

Depending on your hardware and usage patterns, you may want to tune vLLM parameters:

# Adjust these parameters based on your GPU memory and performance needs
docker run -it -d --name vllm-spark \
  --gpus all -p 8000:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --dtype auto

Key parameters to adjust:

  • --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0)
  • --max-model-len: Maximum sequence length (affects memory usage)
  • --tensor-parallel-size: Number of GPUs to use
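
Note that Docker won't reuse a container name that is still registered, so before re-launching with new settings, stop and remove the previous container:

docker rm -f vllm-spark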

Monitor Resource Usage

# Watch GPU utilization
nvidia-smi -l 1
 
# Monitor container logs
docker logs -f vllm-spark
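
For a more compact view of memory headroom (useful when tuning --gpu-memory-utilization), nvidia-smi can report just the memory counters on an interval:

# Report used vs. total GPU memory every 5 seconds
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5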

Troubleshooting

Issue: vLLM Container Fails to Start

Symptoms: Container exits immediately or shows CUDA errors

Solutions:

  • Verify NVIDIA drivers are installed: nvidia-smi
  • Check NVIDIA Container Toolkit: docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
  • Ensure sufficient GPU memory for the model
  • Try a smaller model first (e.g., Qwen/Qwen2.5-Coder-1.5B-Instruct)

Issue: Claude Code Router Not Connecting

Symptoms: Router fails to start or can't reach vLLM

Solutions:

  • Verify vLLM is running: curl http://localhost:8000/health
  • Check API key in config matches the one used when starting vLLM
  • Ensure no firewall blocking localhost:8000
  • Review router logs: set "LOG": true in ~/.claude-code-router/config.json and restart the router

Issue: Slow Inference Performance

Symptoms: Long wait times for responses

Solutions:

  • Reduce --max-model-len to decrease memory usage
  • Increase --gpu-memory-utilization if you have headroom
  • Use --dtype half or --dtype bfloat16 for faster inference
  • Monitor GPU usage with nvidia-smi to identify bottlenecks

Issue: Out of Memory Errors

Symptoms: CUDA OOM errors in vLLM logs

Solutions:

  • Decrease --max-model-len
  • Lower --gpu-memory-utilization
  • Use a smaller model variant
  • Enable tensor parallelism across multiple GPUs if available

Advanced Configuration

Note: Running multiple models requires starting separate vLLM containers on different ports.
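
As a sketch of what that might look like, the command below starts a second, smaller model on port 8001 (Qwen/Qwen2.5-Coder-7B-Instruct is used here purely as an example). Because both servers share GPU memory, you will likely need to lower --gpu-memory-utilization on each. You would then add a second entry to the Providers array in config.json pointing at port 8001 and reference it in the Router section (for example, as the background model).

# Hypothetical second vLLM instance serving a smaller model on port 8001
docker run -it -d --name qwen-small \
  --gpus all -p 8001:8000 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --gpu-memory-utilization 0.3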

Next Steps

Now that you have a working local AI coding setup, consider:

  1. Experiment with different models: Try other models from Hugging Face
  2. Optimize routing rules: Fine-tune which tasks go to which models
  3. Add monitoring: Set up logging and performance tracking
  4. Integrate with CI/CD: Use ccr in GitHub Actions for automated code review
  5. Create custom workflows: Build specialized transformers for your team's needs

Conclusion

You now have a powerful local AI coding assistant that combines the best of both worlds: the versatility of Claude Code with the cost-effectiveness and privacy of local inference. The Qwen3-Coder-30B model provides excellent code generation capabilities, and the router gives you flexibility to use the right model for each task.

As you use this setup, experiment with the routing configuration to find what works best for your workflow. The ability to run models locally while maintaining access to cloud-based models gives you unprecedented control over your AI-assisted development environment.


Have questions or improvements? Feel free to reach out or contribute to the discussion!