# llama-server with CUDA on WSL2

Guide to deploying `llama-server` (llama.cpp) as a systemd service on WSL2 with full NVIDIA GPU offload via CUDA. Configured in **router mode** to serve multiple GGUF models on-demand (with optional MTP speculative decoding) via an OpenAI-compatible API.

**Target environment:**

- WSL2 (Ubuntu 24.04 Noble)
- NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
- No separate CUDA toolkit install required to _run_; only needed when building

---

## Why not Ollama?

Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime. New model architectures (like `qwen35` / Qwen3-Next) may not be supported until Ollama syncs its fork. `llama-server` from upstream llama.cpp supports them as soon as the architecture lands in the main branch.

**Ollama does nothing special** beyond: bundling `libggml-cuda.so` alongside its runner and setting `PATH` to include `/usr/lib/wsl/lib` (the WSL2 CUDA driver passthrough). No flash-attention env vars, no special flags. We replicate this.

---

## Prerequisites

```bash
# Verify WSL2 CUDA driver passthrough is working
ls /usr/lib/wsl/lib/libcuda.so.1   # must exist
nvidia-smi                          # must show your GPU
```

---

## Step 1 — Install CUDA toolkit and build dependencies

> Only needed once per machine to compile llama.cpp. Not needed at runtime.

```bash
sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git
```

Ubuntu 24.04 ships CUDA 12.0 in the `multiverse` repo. This is sufficient to build llama.cpp with CUDA support even when the runtime driver is newer (e.g. CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from [NVIDIA's own APT repo](https://developer.nvidia.com/cuda-downloads) to get a more recent toolkit.

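
The "older toolkit, newer driver" rule can be checked mechanically once the toolkit is installed. The following is a minimal sketch, assuming `nvcc --version` prints a `release X.Y,` line and the `nvidia-smi` banner contains `CUDA Version: X.Y`; the function names are illustrative and not part of any tool.

```bash
#!/usr/bin/env bash
# Sketch: confirm the build toolkit's CUDA version does not exceed the
# version the WSL2 driver advertises. All function names are hypothetical.

# Succeeds when version $1 <= version $2 (GNU sort -V ordering).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Reads nvcc output on stdin; extracts "12.0" from a line such as
# "Cuda compilation tools, release 12.0, V12.0.140".
toolkit_version() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Reads nvidia-smi output on stdin; extracts "13.1" from "CUDA Version: 13.1".
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Only runs when both tools are present on this machine.
if command -v nvcc >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
  tk="$(nvcc --version | toolkit_version)"
  drv="$(nvidia-smi | driver_cuda_version)"
  if version_le "$tk" "$drv"; then
    echo "OK: toolkit $tk <= driver CUDA $drv"
  else
    echo "WARNING: toolkit $tk is newer than driver CUDA $drv" >&2
  fi
fi
```

If the warning fires, the binary may have been built against a newer CUDA than the driver supports; building with the older distro toolkit avoids that.
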
Verify the compiler is available:

```bash
nvcc --version
```

---

## Step 2 — Clone and build llama.cpp from source

```bash
# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
cd /tmp/llama-build

# Configure with CUDA backend
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF

# Build (uses all cores; takes 10-15 min on a 12-core CPU)
cmake --build build --config Release -j$(nproc)
```

After the build completes you should see `build/bin/llama-server`.

---

## Step 3 — Install to /opt/llama-server

```bash
sudo mkdir -p /opt/llama-server

# Copy the server binary
sudo cp build/bin/llama-server /opt/llama-server/

# Copy all shared libraries (b9144+ puts them all in build/bin/)
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
sudo cp -P build/bin/libmtmd*.so* /opt/llama-server/ 2>/dev/null || true

# Register the directory so transitive .so dependencies resolve
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
sudo ldconfig
```

> **Note (b9144+):** The library layout changed — all `.so` files now live in
> `build/bin/` (not `build/ggml/src/` or `build/src/`). When upgrading, copy
> with `-P` to preserve versioned symlinks and overwrite the old ones.

---

## Step 4 — Create the start script

Run llama-server in **router mode** — no `--model` flag. Models are loaded on-demand from `~/models/` when a request names them. Switching models requires no restart and no `sudo`: just change the `model` field in `opencode.json`.

```bash
sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
#!/bin/bash
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/llama-server
exec /opt/llama-server/llama-server \
  --models-dir /home/dev/models \
  --models-max 1 \
  --models-preset /home/dev/models/presets.ini \
  --host 127.0.0.1 \
  --port 8080
SCRIPT
sudo chmod +x /opt/llama-server/start.sh
```

**Key router flags:**

- `--models-dir` — directory scanned for GGUF files. Flat `.gguf` files become model IDs using the filename **without** `.gguf`. Subdirectories become model IDs using the directory name (used for multimodal models with a separate mmproj file — see _Multimodal models_ below).
- `--models-max 1` — only one model resident at a time. When a different model is requested, the current one is evicted and the new one loads (cold-start delay). With 12GB VRAM this is required.
- `--models-preset` — path to `presets.ini` for global defaults and per-model overrides. All inference flags belong here, not in `start.sh`.

**Per-model settings via `presets.ini`**

All inference flags (`ctx-size`, `n-predict`, `n-gpu-layers`, `flash-attn`, `threads`, `parallel`, `jinja`, `spec-type`, etc.) live in `~/models/presets.ini`, not in `start.sh`. The `[*]` section sets defaults inherited by every model; named sections override individual keys.

Section names must match the router's model ID — the filename **without** `.gguf`. Using the `.gguf` suffix in a section name creates a duplicate entry in the model list.

```ini
version = 1

[*]
n-gpu-layers = 99
flash-attn = on
threads = 8
parallel = 1

[Qwen_Qwen3-14B-Q4_K_M]
ctx-size = 32768
n-predict = 4096

[OmniCoder-2-9B.Q8_0]
ctx-size = 32768
n-predict = 4096

[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```

> **Note:** The router reads `presets.ini` **once at service startup** — it is
> not watched for changes. After editing it, run
> `sudo systemctl restart llama-server` to apply the new settings.
> Any currently-loaded model will be evicted and must cold-reload on the next
> request (~10–60 s).

**On GPU layer offload:** Hybrid inference (some layers on CPU, some on GPU) is significantly slower than full-GPU due to CPU↔GPU memory transfers each forward pass. For interactive use, prefer models that fit entirely in VRAM. MoE models (like Qwen3.6-35B-A3B) are an exception — their sparse activation means active computation per token is only ~3B parameters regardless of total model size, so partial CPU offload is less painful than with a dense model of the same file size. See the _Model choice_ section below.

---

## Step 5 — Create the systemd service

```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (OmniCoder 2 / qwen35)
After=network-online.target

[Service]
ExecStart=/opt/llama-server/start.sh
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```

> **Note:** The `PATH` includes `/usr/lib/wsl/lib` — this is what exposes the
> CUDA driver (`libcuda.so.1`) to the process in WSL2. Without this, the CUDA
> backend will load but fail to initialize the device.

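
A quick way to confirm the unit really carries `/usr/lib/wsl/lib` in its `PATH` is to read the value back from systemd rather than from the file you edited. This is a minimal sketch; `has_wsl_lib` and `unit_path` are hypothetical helper names, and the parsing assumes no `Environment=` value contains spaces.

```bash
#!/usr/bin/env bash
# Sketch: check that the service environment includes the WSL2 driver directory.

# Returns 0 when a colon-separated path list contains /usr/lib/wsl/lib.
has_wsl_lib() {
  case ":$1:" in
    *:/usr/lib/wsl/lib:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads `systemctl show -p Environment <unit>` output on stdin and prints
# only the PATH value (other variables such as LD_LIBRARY_PATH are skipped).
unit_path() {
  sed 's/^Environment=//' | tr ' ' '\n' | sed -n 's/^PATH=//p' | head -n 1
}

# Only runs where systemd knows about the unit.
if command -v systemctl >/dev/null 2>&1 && systemctl cat llama-server >/dev/null 2>&1; then
  p="$(systemctl show -p Environment llama-server | unit_path)"
  if has_wsl_lib "$p"; then
    echo "OK: /usr/lib/wsl/lib is in the service PATH"
  else
    echo "MISSING: /usr/lib/wsl/lib is not in the service PATH" >&2
  fi
fi
```

Matching on `:dir:` boundaries avoids false positives from similarly named directories.
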

---

## Step 6 — Verify GPU offload

```bash
# Check service is running
systemctl status llama-server

# Health endpoint
curl -s http://127.0.0.1:8080/health
# → {"status":"ok"}

# Watch GPU memory in another terminal during a request
watch -n1 nvidia-smi

# Quick inference test (in router mode, "model" must be a real model ID)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | python3 -m json.tool
```

During inference, `nvidia-smi` should show:

- GPU-Util: 80-100%
- GPU Memory: ~10-11GB used (model weights + KV cache)
- CPU: near idle

```bash
# Quick inference test (node instead of python3)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
```

---

## Step 7 — Configure OpenCode

Edit `~/.config/opencode/opencode.json` to add the provider. Model IDs are the filenames **without** `.gguf` (or the subdirectory name for multimodal models). The `limit` values here inform opencode's context window tracking; the actual server-side limits are set in `presets.ini`.

```json
"llama-server": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "llama-server",
  "options": { "baseURL": "http://127.0.0.1:8080/v1" },
  "models": {
    "Qwen_Qwen3-14B-Q4_K_M": {
      "name": "Qwen3 14B Q4 (fast)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen_Qwen3.6-27B-Q4_K_M": {
      "name": "Qwen3.6 27B Q4 (deep)",
      "tools": true,
      "limit": { "context": 16384, "output": 4096 }
    },
    "OmniCoder-2-9B.Q8_0": {
      "name": "OmniCoder 2 9B Q8 (vision)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
      "name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
      "tools": true,
      "limit": { "context": 8192, "output": 4096 }
    }
  }
}
```

In the project-level `opencode.json`, set the active model per agent:

```json
"agent": {
  "orchestrator": { "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M" }
}
```

---

## Model choice for RTX 3080 12GB

Pick based on what fits **entirely** in VRAM — hybrid inference (model too large for VRAM) is 4–8× slower and makes interactive use painful. MoE models are an exception; see note below the table.

| Model                           | File size | Fits in 12GB? | Speed (est.)  | Notes                                                                                    |
| ------------------------------- | --------- | ------------- | ------------- | ---------------------------------------------------------------------------------------- |
| Qwen3-8B Q4_K_M                 | ~5 GB     | ✅ fully      | ~25–35 tok/s  | Fast; weaker reasoning                                                                   |
| **Qwen3-14B Q4_K_M**            | ~8.5 GB   | ✅ fully      | ~12–18 tok/s  | **Daily driver** — fast interactive use, good instruction following                      |
| OmniCoder-2-9B Q8_0             | ~9.5 GB   | ✅ fully      | ~15–20 tok/s  | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj                |
| **Qwen3.6-27B Q4_K_M**          | 17 GB     | ⚠️ partial    | ~4–8 tok/s    | **Deep reasoning** — better at vague/complex tasks; slow due to CPU offload              |
| **Qwen3.6-35B-A3B IQ3_S (MTP)** | 13.6 GB   | ⚠️ partial    | ~20–35 tok/s† | **MoE + MTP** — sparse activation (~3B active params); needs MTP-format GGUF (byteshape) |
| Qwen3-32B Q4_K_M                | ~20 GB    | ❌            | —             | Won't fit                                                                                |

† MoE speed estimate with `--spec-type draft-mtp`. Despite 13.6 GB file size, only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The sparse feed-forward experts make active-parameter compute comparable to a 3B dense model.

All models sit in `~/models/` simultaneously and are swapped on-demand by the router. Cold-swap time is ~10s (9–14B) / ~30–45s (27B+).

Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):

```bash
mkdir -p ~/models
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf
```

> **⚠️ Use HuggingFace GGUFs, not Ollama blobs for qwen35-architecture models.**
> Ollama's converter outputs different tensor names and per-layer KV-head arrays
> that are incompatible with llama.cpp's `qwen35` model loader. Symptoms:
> `missing tensor 'blk.0.ssm_dt'`, `check_tensor_dims: wrong shape`, or
> `rope.dimension_sections has wrong array length`. Always download from
> bartowski or unsloth on HuggingFace for these models.

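
The "fits in 12GB?" column in the table above can be approximated for any downloaded file by comparing its size, plus some headroom for KV cache and compute buffers, against total VRAM. The sketch below is a rough heuristic only: the 1536 MiB default headroom is an assumption rather than a measured value, the helper names are hypothetical, and the real requirement grows with `ctx-size`.

```bash
#!/usr/bin/env bash
# Sketch: estimate whether a GGUF fits entirely in VRAM.

# fits_in_vram FILE_MIB VRAM_MIB [HEADROOM_MIB]
# Succeeds when file size plus headroom is within total VRAM.
# The 1536 MiB default headroom is an assumed figure, tune it for your context size.
fits_in_vram() {
  local size_mib="$1" vram_mib="$2" headroom_mib="${3:-1536}"
  [ $(( size_mib + headroom_mib )) -le "$vram_mib" ]
}

# Prints a file's size in whole MiB (GNU stat).
file_mib() {
  echo $(( $(stat -c %s "$1") / 1048576 ))
}

# Only runs on a machine with a GPU and a ~/models directory.
# Flat *.gguf files only; subdirectory (multimodal) models are not scanned.
if command -v nvidia-smi >/dev/null 2>&1 && [ -d "$HOME/models" ]; then
  vram="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
  for f in "$HOME"/models/*.gguf; do
    [ -f "$f" ] || continue
    if fits_in_vram "$(file_mib "$f")" "$vram"; then
      echo "fits:    $f"
    else
      echo "partial: $f"
    fi
  done
fi
```

A model reported as `partial` will still load, but some layers end up on the CPU with the slowdown described above.
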

---

## Switching models

With router mode, switching requires **no restart and no `sudo`**. Place GGUFs in `~/models/` and reference them by model ID in `opencode.json`.

### Add a model

```bash
# Download to ~/models/ — filename without .gguf becomes the model ID
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf
```

Then add a section to `~/models/presets.ini` (name = filename without `.gguf`):

```ini
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```

And register it in `~/.config/opencode/opencode.json`:

```json
"Qwen_Qwen3.6-27B-Q4_K_M": {
  "name": "Qwen3.6 27B Q4 (deep)",
  "tools": true,
  "limit": { "context": 16384, "output": 4096 }
}
```

### Switch active model

Edit `opencode.json` (project-level or `~/.config/opencode/opencode.json`) and change the agent's `model` to `llama-server/` followed by the model ID:

```json
"agent": {
  "orchestrator": { "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M" }
}
```

The next request triggers a cold load of the new model (~10–30s for 14B, ~30–60s for 27B+). No service restart needed. `--models-max 1` ensures the previous model is evicted from VRAM automatically.

To switch from the CLI without editing files:

```bash
opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"
```

### Multimodal models

For models with a separate vision encoder (mmproj), use a **subdirectory** in `~/models/`. The directory name becomes the model ID; llama.cpp auto-detects any file whose name starts with `mmproj` as the projector.

```
~/models/
  OmniCoder-2-9B.Q8_0/            ← model ID = "OmniCoder-2-9B.Q8_0"
    OmniCoder-2-9B.Q8_0.gguf      ← main weights
    mmproj-Q8_0.gguf              ← vision projector (auto-detected)
```

### List available models

```bash
# See what's in ~/models/ (all are immediately usable as model IDs)
ls ~/models/

# See what's currently loaded
curl -s http://127.0.0.1:8080/v1/models | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"

# Force a rescan (picks up newly added model files)
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"
```

### Auto-restart on presets.ini change

The router caches `presets.ini` at startup, so any edit requires a service restart to take effect. You can automate this with a systemd **path unit** that watches the file and triggers a restart whenever it is written:

```bash
sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server when presets.ini changes

[Path]
PathChanged=/home/dev/models/presets.ini

[Install]
WantedBy=default.target
EOF

sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server (triggered by presets.ini change)

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart llama-server
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server-presets.path
```

After this, saving `~/models/presets.ini` automatically restarts the service (~3 s) and the next inference request cold-loads the model with the new settings. The restart is intentionally disruptive — the currently-loaded model is evicted — so only enable this if disruptive restarts on every presets save are acceptable.

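
The naming rules used throughout this section (a flat `.gguf` file loses its suffix, a subdirectory contributes its own name, and `presets.ini` section names must not end in `.gguf`) can be sketched as a small offline check. This mirrors the rules as this guide describes them, not the router's actual implementation, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: derive model IDs from a models directory and lint presets.ini.

# Prints the model ID for one directory entry:
#   flat *.gguf file -> filename without ".gguf"
#   subdirectory     -> directory name
# Anything else (e.g. presets.ini) prints nothing.
model_id_for() {
  local entry="${1%/}"
  if [ -d "$entry" ]; then
    basename "$entry"
  elif [ -f "$entry" ] && [ "${entry##*.}" = "gguf" ]; then
    basename "$entry" .gguf
  fi
}

# Prints one model ID per line for every entry in directory $1.
list_model_ids() {
  local dir="$1" entry
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue
    model_id_for "$entry"
  done
}

# Prints preset section names that wrongly keep the ".gguf" suffix.
bad_preset_sections() {
  sed -n 's/^\[\(.*\.gguf\)]$/\1/p' "$1"
}

# Only runs when the guide's layout exists on this machine.
if [ -d "$HOME/models" ]; then
  list_model_ids "$HOME/models"
  if [ -f "$HOME/models/presets.ini" ]; then
    bad_preset_sections "$HOME/models/presets.ini" | sed 's/^/section should drop .gguf: /'
  fi
fi
```

Comparing this list against the `id` values returned by `/v1/models` is a quick way to spot a model the router did not pick up.
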

---

## MTP speculative decoding

Multi-Token Prediction (MTP) lets the model predict several tokens per forward pass using draft heads baked into the model weights — no separate draft model needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to ~25–35 tok/s on RTX 3080) while preserving output quality.

**Requirements:**

1. **b9279+ binary** — `--spec-type draft-mtp` was added around this build. Verify:

   ```bash
   /opt/llama-server/llama-server --help | grep spec-type   # must list draft-mtp
   ```

2. **MTP-format GGUF** — standard bartowski/unsloth quants do not include MTP heads. Use byteshape's dedicated MTP GGUFs:

   ```bash
   # IQ3_S (13.6 GB) — best quality/size for 12 GB VRAM with slight CPU offload
   wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
     -O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf

   # IQ2_S (10 GB) — fully fits in VRAM; heavier quantization
   wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
     -O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
   ```

**`presets.ini` section for MTP:**

```ini
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
ctx-size = 32768
n-predict = 4096
spec-type = draft-mtp
spec-draft-p-min = 0.75
spec-draft-n-max = 3
```

- `spec-draft-p-min` — minimum draft token acceptance probability. 0.75 is a good starting point; lower values accept more speculative tokens (faster but may diverge from non-speculative output).
- `spec-draft-n-max` — maximum tokens to speculate per step. 3 is the sweet spot for Qwen3.6 MTP; higher values have diminishing returns and add overhead.

**Note:** ik_llama.cpp (a fork) achieves ~10–20% higher throughput with MTP than official llama.cpp due to a more optimized MTP head implementation. Official llama.cpp MTP is still significantly faster than non-speculative inference and is the simpler setup.

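
To check whether MTP actually helps on your hardware, you can compare tokens per second with and without the `spec-*` keys in `presets.ini`. The sketch below estimates throughput from wall-clock time and the `usage.completion_tokens` field of the OpenAI-style response. It assumes the server fills in that field; `bench_model`, `tok_per_sec`, and `completion_tokens` are hypothetical helper names, and the figure includes prompt processing (and model load on a cold request), so run it twice and read the second result.

```bash
#!/usr/bin/env bash
# Sketch: rough tokens-per-second measurement against the local server.

# tok_per_sec TOKENS SECONDS -> prints tokens/second with one decimal.
tok_per_sec() {
  awk -v t="$1" -v s="$2" 'BEGIN { if (s > 0) printf "%.1f\n", t / s; else print "n/a" }'
}

# Reads a chat-completions JSON response on stdin and prints
# usage.completion_tokens (first match only).
completion_tokens() {
  sed -n 's/.*"completion_tokens"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# bench_model MODEL_ID -> sends one request and prints the estimate.
# Defined here but not called; needs a running server on 127.0.0.1:8080.
bench_model() {
  local model="$1" start end body
  start="$(date +%s.%N)"
  body="$(curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"Count from 1 to 50.\"}],\"max_tokens\":200}")"
  end="$(date +%s.%N)"
  tok_per_sec "$(echo "$body" | completion_tokens)" \
    "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}
```

For example, `bench_model Qwen3.6-35B-A3B-IQ3_S-3.06bpw` run once with the MTP keys present and once with them commented out (restarting the service in between) gives a like-for-like comparison.
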

---

## Troubleshooting

### Active model keeps resetting to the configured default

Known opencode bug [#28735](https://github.com/anomalyco/opencode/issues/28735) (open as of May 2026): when a background subagent result is delivered back into the main session, the active model resets to whatever `orchestrator.model` is configured in `opencode.json`. This means any model switch made via `-m` flag or the TUI selector gets silently reverted whenever a tool call or subagent completes.

**Workaround:** keep `orchestrator.model` in `opencode.json` set to the model you actually want to use. The reset lands on the configured model, so if it matches your intent there's no observable effect.

---

### `no backends are loaded` at startup

The backend `.so` plugins must be in the same directory as the binary, or on `LD_LIBRARY_PATH`. The `start.sh` script sets this explicitly.

### `make_cpu_buft_list: no CPU backend found`

Install `libgomp1` (OpenMP runtime — required by the CPU backend):

```bash
sudo apt-get install -y libgomp1
```

### CUDA device not found / GPU not offloading

- Confirm `/usr/lib/wsl/lib` is in `PATH` or `LD_LIBRARY_PATH` for the process
- Run `nvidia-smi` as the service user: `sudo -u ollama nvidia-smi`
- Check `journalctl -u llama-server -n 50` for lines like `ggml_cuda_init: CUDA not found`

### High CPU / fan noise at idle

- Remove `--no-mmap` if present (forces 9GB into RAM on startup)
- Check `--parallel` (`parallel` in `presets.ini`) isn't set high (1 is fine for single-user use)
- llama-server is permanently loaded; fans will spin during model load (~30s) then drop to zero at idle — this is expected behavior

### `qwen35` architecture errors (rope, tensor shape, missing tensor)

These errors all indicate an **incompatible GGUF source**:

- `rope.dimension_sections has wrong array length; expected 4, got 3` — Ollama stores a 3-element array; llama.cpp (before a patch) expects 4.
- `missing tensor 'blk.0.ssm_dt'` or `blk.0.ssm_dt.bias` — Ollama omits the `.bias` suffix that HuggingFace-converted GGUFs use (or vice versa).
- `check_tensor_dims: wrong shape` on `blk.N.attn_k.weight` — Ollama's converter stores `head_count_kv` as a per-layer array; llama.cpp's qwen35 model loader expects a scalar.

**Solution:** use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama blobs for any `qwen35`-architecture model. See _Model choice_ above.

### Upgrading llama.cpp (replacing binaries while service is running)

The service holds the binary open; `cp` will fail with `Text file busy`. Always stop the service first:

```bash
sudo systemctl stop llama-server
sudo cp build/bin/llama-server /opt/llama-server/
sudo cp -P build/bin/lib*.so* /opt/llama-server/
sudo systemctl start llama-server
```

### Model file permissions (service runs as `ollama` user)

Files downloaded as your user aren't readable by the `ollama` service user:

```bash
# Make model file readable by all
sudo chmod o+r ~/models/MyModel.gguf

# Make the directory traversable
sudo chmod o+x ~ ~/models
```

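
When a permission problem persists, it helps to see which path component is blocking access. The sketch below walks from the model file up to `/` and reports anything the "other" permission class cannot use. It inspects mode bits only (via GNU `stat -c %A`), so ACLs and group membership are ignored; `check_other_access` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: find which path component blocks a non-owner service user.

# check_other_access FILE
# Prints one line per problem; prints nothing when every directory on the
# path has o+x and the file itself has o+r. Returns 2 if stat fails.
check_other_access() {
  local path="$1" mode
  mode="$(stat -c %A "$path")" || return 2
  # %A looks like "-rw-r--r--"; index 7 is the "other" read bit.
  [ "${mode:7:1}" = "r" ] || echo "not readable by others: $path"
  path="$(dirname "$path")"
  while [ "$path" != "/" ] && [ "$path" != "." ]; do
    mode="$(stat -c %A "$path")" || return 2
    # Index 9 is the "other" execute bit ("x", or "t" when sticky is also set).
    case "${mode:9:1}" in
      x|t) ;;
      *) echo "not traversable by others: $path" ;;
    esac
    path="$(dirname "$path")"
  done
}
```

Running `check_other_access ~/models/MyModel.gguf` with no output means the mode bits are fine; `sudo -u ollama test -r ~/models/MyModel.gguf` then confirms access as the actual service user.
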