dotfiles/.agents/docs/llama-server-cuda-wsl2.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
2026-05-22 13:13:43 -04:00


llama-server with CUDA on WSL2

Guide to deploying llama-server (llama.cpp) as a systemd service on WSL2 with full NVIDIA GPU offload via CUDA. Configured in router mode to serve multiple GGUF models on-demand (with optional MTP speculative decoding) via an OpenAI-compatible API.

Target environment:

  • WSL2 (Ubuntu 24.04 Noble)
  • NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
  • No separate CUDA toolkit install required to run; only needed when building

Why not Ollama?

Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime. New model architectures (like qwen35 / Qwen3-Next) may not be supported until Ollama syncs its fork. llama-server from upstream llama.cpp supports them as soon as the architecture lands in the main branch.

Beyond that, Ollama does nothing special: it bundles libggml-cuda.so alongside its runner and sets PATH to include /usr/lib/wsl/lib (the WSL2 CUDA driver passthrough). It sets no flash-attention env vars and no special flags. This guide replicates that setup.


Prerequisites

# Verify WSL2 CUDA driver passthrough is working
ls /usr/lib/wsl/lib/libcuda.so.1   # must exist
nvidia-smi                          # must show your GPU
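
The same check can be wrapped in a small function for use in setup scripts. This is a minimal sketch; `check_wsl_cuda` is a hypothetical helper name, and the default directory is simply the passthrough path used above.

```shell
# check_wsl_cuda is a hypothetical helper (not part of WSL or llama.cpp).
# It reports whether libcuda.so.1 exists in the given directory,
# defaulting to the WSL2 passthrough directory used in this guide.
check_wsl_cuda() {
  local libdir="${1:-/usr/lib/wsl/lib}"
  if [ ! -e "$libdir/libcuda.so.1" ]; then
    echo "missing: $libdir/libcuda.so.1"
    return 1
  fi
  echo "found: $libdir/libcuda.so.1"
}
```

A possible use is `check_wsl_cuda && nvidia-smi -L`, which lists the visible GPUs only when the driver library is present.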

Step 1 — Install CUDA toolkit and build dependencies

Only needed once per machine to compile llama.cpp. Not needed at runtime.

sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git

Ubuntu 24.04 ships CUDA 12.0 in the multiverse repo. This is sufficient to build llama.cpp with CUDA support even when the runtime driver is newer (e.g. CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from NVIDIA's own APT repo to get a more recent toolkit.

Verify the compiler is available:

nvcc --version

Step 2 — Clone and build llama.cpp from source

# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
cd /tmp/llama-build

# Configure with CUDA backend
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF

# Build (uses all cores; takes 10-15 min on a 12-core CPU)
cmake --build build --config Release -j$(nproc)

After the build completes you should see build/bin/llama-server.


Step 3 — Install to /opt/llama-server

sudo mkdir -p /opt/llama-server

# Copy the server binary
sudo cp build/bin/llama-server /opt/llama-server/

# Copy all shared libraries (b9144+ puts them all in build/bin/)
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
sudo cp -P build/bin/libmtmd*.so*  /opt/llama-server/ 2>/dev/null || true

# Register the directory so transitive .so dependencies resolve
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
sudo ldconfig

Note (b9144+): The library layout changed — all .so files now live in build/bin/ (not build/ggml/src/ or build/src/). When upgrading, copy with -P to preserve versioned symlinks and overwrite the old ones.
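
One way to confirm that the copied libraries resolve after running ldconfig is to filter ldd output for unresolved entries. The sketch below assumes the usual glibc ldd output format (`name => not found`); `unresolved_libs` is a hypothetical helper name.

```shell
# unresolved_libs is a hypothetical helper: it reads `ldd` output on stdin
# and prints the name of every dependency reported as "not found".
unresolved_libs() {
  awk '/=> not found/ {print $1}'
}

# Example (run after Step 3; prints nothing when everything resolves):
#   ldd /opt/llama-server/llama-server | unresolved_libs
#   ldd /opt/llama-server/libggml-cuda.so | unresolved_libs
```

Checking libggml-cuda.so as well as the binary is worthwhile if the backend is loaded at runtime rather than linked directly, since its CUDA dependencies would not appear in the binary's own ldd output.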


Step 4 — Create the start script

Run llama-server in router mode — no --model flag. Models are loaded on-demand from ~/models/ when a request names them. Switching models requires no restart and no sudo: just change the model field in opencode.json.

sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
#!/bin/bash
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/llama-server
exec /opt/llama-server/llama-server \
  --models-dir /home/dev/models \
  --models-max 1 \
  --models-preset /home/dev/models/presets.ini \
  --host 127.0.0.1 \
  --port 8080
SCRIPT
sudo chmod +x /opt/llama-server/start.sh

Key router flags:

  • --models-dir — directory scanned for GGUF files. Flat .gguf files become model IDs using the filename without .gguf. Subdirectories become model IDs using the directory name (used for multimodal models with a separate mmproj file — see Multimodal models below).
  • --models-max 1 — only one model resident at a time. When a different model is requested, the current one is evicted and the new one loads (cold-start delay). With 12GB VRAM this is required.
  • --models-preset — path to presets.ini for global defaults and per-model overrides. All inference flags belong here, not in start.sh.

Per-model settings via presets.ini

All inference flags (ctx-size, n-predict, n-gpu-layers, flash-attn, threads, parallel, jinja, spec-type, etc.) live in ~/models/presets.ini, not in start.sh. The [*] section sets defaults inherited by every model; named sections override individual keys.

Section names must match the router's model ID — the filename without .gguf. Using the .gguf suffix in a section name creates a duplicate entry in the model list.

version = 1

[*]
n-gpu-layers = 99
flash-attn = on
threads = 8
parallel = 1

[Qwen_Qwen3-14B-Q4_K_M]
ctx-size = 32768
n-predict = 4096

[OmniCoder-2-9B.Q8_0]
ctx-size = 32768
n-predict = 4096

[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
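
Since a section name that does not match a model ID is easy to miss, a small check can compare the two before restarting the service. This is a sketch under the naming rules described above (flat files lose the .gguf suffix, subdirectories use the directory name); the function names are hypothetical and the INI parsing is deliberately simple.

```shell
# Hypothetical helpers, not part of llama.cpp.

# Print the model IDs implied by a models directory:
# flat *.gguf files -> filename without .gguf; subdirectories -> directory name.
model_ids() {
  local dir="$1" entry
  for entry in "$dir"/*; do
    if [ -d "$entry" ]; then
      basename "$entry"
    elif [ -f "$entry" ] && [ "${entry##*.}" = "gguf" ]; then
      basename "$entry" .gguf
    fi
  done
}

# Print section names from an INI file, excluding the [*] defaults section.
preset_sections() {
  sed -n 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$1" | grep -vx '\*'
}

# Print one line for every preset section that matches no model ID.
unmatched_presets() {
  local models_dir="$1" presets="$2" ids section
  ids=$(model_ids "$models_dir")
  preset_sections "$presets" | while IFS= read -r section; do
    printf '%s\n' "$ids" | grep -qxF -- "$section" \
      || echo "no model for section: [$section]"
  done
}
```

For example, `unmatched_presets ~/models ~/models/presets.ini` prints nothing when every section has a matching model, and flags a section such as `[MyModel.gguf]` that still carries the suffix.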

Note: The router reads presets.ini once at service startup; it is not watched for changes. After editing it, run sudo systemctl restart llama-server to apply the new settings. Any currently-loaded model will be evicted and must cold-reload on the next request (~10-60 s).

On GPU layer offload: Hybrid inference (some layers on CPU, some on GPU) is significantly slower than full-GPU due to CPU↔GPU memory transfers each forward pass. For interactive use, prefer models that fit entirely in VRAM. MoE models (like Qwen3.6-35B-A3B) are an exception — their sparse activation means active computation per token is only ~3B parameters regardless of total model size, so partial CPU offload is less painful than with a dense model of the same file size. See the Model choice section below.


Step 5 — Create the systemd service

sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (OmniCoder 2 / qwen35)
After=network-online.target

[Service]
ExecStart=/opt/llama-server/start.sh
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server

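Note: The PATH includes /usr/lib/wsl/lib, the directory where WSL2 exposes the NVIDIA driver passthrough (libcuda.so.1, nvidia-smi). PATH itself only affects executable lookup; the dynamic linker normally finds libcuda.so.1 through the ld.so cache, which WSL2 populates via /etc/ld.so.conf.d/ld.wsl.conf. If the CUDA backend loads but fails to initialize the device, confirm that entry exists or add /usr/lib/wsl/lib to LD_LIBRARY_PATH in start.sh.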


Step 6 — Verify GPU offload

# Check service is running
systemctl status llama-server

# Health endpoint
curl -s http://127.0.0.1:8080/health
# → {"status":"ok"}

# Watch GPU memory in another terminal during a request
watch -n1 nvidia-smi

# Quick inference test
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | python3 -m json.tool

During inference, nvidia-smi should show:

  • GPU-Util: 80-100%
  • GPU Memory: ~10-11GB used (model weights + KV cache)
  • CPU: near idle

# Quick inference test (node instead of python3)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
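
When these checks run from a script right after systemctl start, the server may not be listening yet. A polling loop around the /health endpoint shown above is one way to handle that. This is a sketch; `wait_for_health` is a hypothetical helper, and the default URL and attempt count are assumptions.

```shell
# wait_for_health is a hypothetical helper: poll a health URL once per second
# until curl succeeds or the attempt budget is exhausted.
wait_for_health() {
  local url="${1:-http://127.0.0.1:8080/health}" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors, -s silences progress output
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "ready after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

In router mode a healthy response most likely means only that the router is up, not that a model is loaded, so the first inference request can still incur a cold-load delay.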

Step 7 — Configure OpenCode

Edit ~/.config/opencode/opencode.json to add the provider. Model IDs are the filenames without .gguf (or the subdirectory name for multimodal models). The limit values here inform opencode's context window tracking; the actual server-side limits are set in presets.ini.

"llama-server": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "llama-server",
  "options": { "baseURL": "http://127.0.0.1:8080/v1" },
  "models": {
    "Qwen_Qwen3-14B-Q4_K_M": {
      "name": "Qwen3 14B Q4 (fast)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen_Qwen3.6-27B-Q4_K_M": {
      "name": "Qwen3.6 27B Q4 (deep)",
      "tools": true,
      "limit": { "context": 16384, "output": 4096 }
    },
    "OmniCoder-2-9B.Q8_0": {
      "name": "OmniCoder 2 9B Q8 (vision)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
      "name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
      "tools": true,
      "limit": { "context": 8192, "output": 4096 }
    }
  }
}

In the project-level opencode.json, set the active model per agent:

"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}

Model choice for RTX 3080 12GB

Pick based on what fits entirely in VRAM: hybrid inference (model too large for VRAM) is 4-8× slower and makes interactive use painful. MoE models are an exception; see the note below the table.

| Model | File size | Fits in 12GB? | Speed (est.) | Notes |
|---|---|---|---|---|
| Qwen3-8B Q4_K_M | ~5 GB | fully | ~25-35 tok/s | Fast; weaker reasoning |
| Qwen3-14B Q4_K_M | ~8.5 GB | fully | ~12-18 tok/s | Daily driver: fast interactive use, good instruction following |
| OmniCoder-2-9B Q8_0 | ~9.5 GB | fully | ~15-20 tok/s | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj |
| Qwen3.6-27B Q4_K_M | 17 GB | ⚠️ partial | ~4-8 tok/s | Deep reasoning: better at vague/complex tasks; slow due to CPU offload |
| Qwen3.6-35B-A3B IQ3_S (MTP) | 13.6 GB | ⚠️ partial | ~20-35 tok/s† | MoE + MTP: sparse activation (~3B active params); needs MTP-format GGUF (byteshape) |
| Qwen3-32B Q4_K_M | ~20 GB | Won't fit | | |

† MoE speed estimate with --spec-type draft-mtp. Despite 13.6 GB file size, only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The sparse feed-forward experts make active-parameter compute comparable to a 3B dense model.

All models sit in ~/models/ simultaneously and are swapped on-demand by the router. Cold-swap time is ~10 s (9-14B) / ~30-45 s (27B+).
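
A rough way to apply the "fits entirely in VRAM" rule before downloading is to compare the file size against total VRAM plus a reserve for KV cache and runtime buffers. The sketch below is only an estimate; `vram_overflow_mib` is a hypothetical helper and the reserve value is an assumption you would tune.

```shell
# vram_overflow_mib is a hypothetical helper: estimate how many MiB of a model
# would spill to CPU. Arguments (all in MiB): file size, total VRAM, and an
# optional reserve for KV cache and buffers (default 0). Prints 0 if it fits.
vram_overflow_mib() {
  local file_mib="$1" vram_mib="$2" reserve_mib="${3:-0}"
  local over=$((file_mib + reserve_mib - vram_mib))
  [ "$over" -gt 0 ] || over=0
  echo "$over"
}

# Example inputs (not run here):
#   file_mib=$(du -m ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf | cut -f1)
#   vram_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
#   vram_overflow_mib "$file_mib" "$vram_mib" 1500
```

With no reserve, a 13.6 GB file on a 12 GB card gives roughly 1.6 GB of overflow, which lines up with the footnote above; real usage also depends on context size and KV cache type.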

Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):

mkdir -p ~/models
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf

⚠️ Use HuggingFace GGUFs, not Ollama blobs for qwen35-architecture models. Ollama's converter outputs different tensor names and per-layer KV-head arrays that are incompatible with llama.cpp's qwen35 model loader. Symptoms: missing tensor 'blk.0.ssm_dt', check_tensor_dims: wrong shape, or rope.dimension_sections has wrong array length. Always download from bartowski or unsloth on HuggingFace for these models.


Switching models

With router mode, switching requires no restart and no sudo. Place GGUFs in ~/models/ and reference them by model ID in opencode.json.

Add a model

# Download to ~/models/ — filename without .gguf becomes the model ID
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf

Then add a section to ~/models/presets.ini (name = filename without .gguf):

[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096

And register it in ~/.config/opencode/opencode.json:

"Qwen_Qwen3.6-27B-Q4_K_M": {
  "name": "Qwen3.6 27B Q4 (deep)",
  "tools": true,
  "limit": { "context": 16384, "output": 4096 }
}

Switch active model

Edit opencode.json (project-level or ~/.config/opencode/opencode.json) and change the agent's model to llama-server/<model-id>:

"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}

The next request triggers a cold load of the new model (~10-30 s for 14B, ~30-60 s for 27B+). No service restart needed. --models-max 1 ensures the previous model is evicted from VRAM automatically.

To switch from the CLI without editing files:

opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"

Multimodal models

For models with a separate vision encoder (mmproj), use a subdirectory in ~/models/. The directory name becomes the model ID; llama.cpp auto-detects any file whose name starts with mmproj as the projector.

~/models/
  OmniCoder-2-9B.Q8_0/          ← model ID = "OmniCoder-2-9B.Q8_0"
    OmniCoder-2-9B.Q8_0.gguf    ← main weights
    mmproj-Q8_0.gguf             ← vision projector (auto-detected)

List available models

# See what's in ~/models/ (all are immediately usable as model IDs)
ls ~/models/

# See what's currently loaded
curl -s http://127.0.0.1:8080/v1/models | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"

# Force a rescan (picks up newly added model files)
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"

Auto-restart on presets.ini change

The router caches presets.ini at startup, so any edit requires a service restart to take effect. You can automate this with a systemd path unit that watches the file and triggers a restart whenever it is written:

sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server when presets.ini changes

[Path]
PathChanged=/home/dev/models/presets.ini

[Install]
WantedBy=default.target
EOF

sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server (triggered by presets.ini change)

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart llama-server
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server-presets.path

After this, saving ~/models/presets.ini automatically restarts the service (~3 s) and the next inference request cold-loads the model with the new settings. The restart is intentionally disruptive — the currently-loaded model is evicted — so only enable this if disruptive restarts on every presets save are acceptable.


MTP speculative decoding

Multi-Token Prediction (MTP) lets the model predict several tokens per forward pass using draft heads baked into the model weights, so no separate draft model is needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to ~25-35 tok/s on RTX 3080) while preserving output quality.

Requirements:

  1. b9279+ binary: --spec-type draft-mtp was added in this era. Verify:
    /opt/llama-server/llama-server --help | grep spec-type
    # must list draft-mtp
    
  2. MTP-format GGUF — standard bartowski/unsloth quants do not include MTP heads. Use byteshape's dedicated MTP GGUFs:
    # IQ3_S (13.6 GB) — best quality/size for 12 GB VRAM with slight CPU offload
    wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
      -O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf
    # IQ2_S (10 GB) — fully fits in VRAM; heavier quantization
    wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
      -O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
    

presets.ini section for MTP:

[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
ctx-size = 32768
n-predict = 4096
spec-type = draft-mtp
spec-draft-p-min = 0.75
spec-draft-n-max = 3

  • spec-draft-p-min: minimum draft token acceptance probability. 0.75 is a good starting point; lower values accept more speculative tokens (faster but may diverge from non-speculative output).
  • spec-draft-n-max: maximum tokens to speculate per step. 3 is the sweet spot for Qwen3.6 MTP; higher values have diminishing returns and add overhead.
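
When tuning these two values it helps to compare throughput before and after a change. A minimal sketch: divide the number of generated tokens by the wall-clock time of a request. `tokens_per_second` is a hypothetical helper; the token count is assumed to come from the OpenAI-style usage.completion_tokens field of the response.

```shell
# tokens_per_second is a hypothetical helper: given a generated-token count
# and an elapsed time in milliseconds, print throughput with one decimal.
# Fails (non-zero exit, no output) if the elapsed time is not positive.
tokens_per_second() {
  awk -v t="$1" -v ms="$2" \
    'BEGIN { if (ms <= 0) exit 1; printf "%.1f\n", t * 1000 / ms }'
}
```

For instance, `tokens_per_second 200 8000` corresponds to 200 tokens in 8 seconds. Wall-clock time also includes prompt processing and any cold load, so measure on a warm model and use the same prompt for both runs.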

Note: ik_llama.cpp (a fork) achieves ~10-20% higher throughput with MTP than official llama.cpp due to a more optimized MTP head implementation. Official llama.cpp MTP is still significantly faster than non-speculative inference and is the simpler setup.


Troubleshooting

Active model keeps resetting to the configured default

Known opencode bug #28735 (open as of May 2026): when a background subagent result is delivered back into the main session, the active model resets to whatever orchestrator.model is configured in opencode.json. This means any model switch made via -m flag or the TUI selector gets silently reverted whenever a tool call or subagent completes.

Workaround: keep orchestrator.model in opencode.json set to the model you actually want to use. The reset lands on the configured model, so if it matches your intent there's no observable effect.


no backends are loaded at startup

The backend .so plugins must be in the same directory as the binary, or on LD_LIBRARY_PATH. The start.sh script sets this explicitly.

make_cpu_buft_list: no CPU backend found

Install libgomp1 (OpenMP runtime — required by the CPU backend):

sudo apt-get install -y libgomp1

CUDA device not found / GPU not offloading

  • Confirm /usr/lib/wsl/lib is in PATH or LD_LIBRARY_PATH for the process
  • Run nvidia-smi as the service user: sudo -u ollama nvidia-smi
  • Check journalctl -u llama-server -n 50 for lines like ggml_cuda_init: CUDA not found
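
The first check in the list can be scripted against the environment of the running service. This is a sketch; `path_has_dir` is a hypothetical helper that tests a colon-separated search path for an exact directory entry.

```shell
# path_has_dir is a hypothetical helper: succeed if the colon-separated
# search path in $1 (PATH or LD_LIBRARY_PATH style) contains directory $2
# as an exact entry.
path_has_dir() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (reads the PATH of the running service process; not run here):
#   pid=$(systemctl show -p MainPID --value llama-server)
#   svc_path=$(sudo cat "/proc/$pid/environ" | tr '\0' '\n' | sed -n 's/^PATH=//p')
#   path_has_dir "$svc_path" /usr/lib/wsl/lib && echo "wsl lib dir present"
```

Wrapping both sides in colons avoids false matches on prefixes such as /usr/lib/wsl/lib64.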

High CPU / fan noise at idle

  • Remove --no-mmap if present (forces 9GB into RAM on startup)
  • Check that parallel (--parallel / -np on the command line) isn't set high; the value of 1 used in presets.ini is fine for single-user use
  • llama-server is permanently loaded; fans will spin during model load (~30s) then drop to zero at idle — this is expected behavior

qwen35 architecture errors (rope, tensor shape, missing tensor)

These errors all indicate an incompatible GGUF source:

  • rope.dimension_sections has wrong array length; expected 4, got 3 — Ollama stores a 3-element array; llama.cpp (before a patch) expects 4.
  • missing tensor 'blk.0.ssm_dt' or blk.0.ssm_dt.bias — Ollama omits the .bias suffix that HuggingFace-converted GGUFs use (or vice versa).
  • check_tensor_dims: wrong shape on blk.N.attn_k.weight — Ollama's converter stores head_count_kv as a per-layer array; llama.cpp's qwen35 model loader expects a scalar.

Solution: use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama blobs for any qwen35-architecture model. See Model choice above.

Upgrading llama.cpp (replacing binaries while service is running)

The service holds the binary open; cp will fail with Text file busy. Always stop the service first:

sudo systemctl stop llama-server
sudo cp build/bin/llama-server /opt/llama-server/
sudo cp -P build/bin/lib*.so* /opt/llama-server/
sudo systemctl start llama-server

Model file permissions (service runs as ollama user)

Files downloaded as your user aren't readable by the ollama service user:

# Make model file readable by all
sudo chmod o+r ~/models/MyModel.gguf
# Make the directory traversable
sudo chmod o+x ~ ~/models
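
To find which model files still need the chmod, the permission bits can be inspected directly. This is a sketch assuming GNU stat (as on Ubuntu); `world_readable` is a hypothetical helper.

```shell
# world_readable is a hypothetical helper: succeed if "others" have read
# permission on the file, based on the last digit of the octal mode reported
# by GNU stat. Returns 2 if stat fails (for example, file not found).
world_readable() {
  local mode
  mode=$(stat -c '%a' "$1" 2>/dev/null) || return 2
  # Strip everything but the final character (the "others" digit)
  case "${mode#"${mode%?}"}" in
    4|5|6|7) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): list model files the service user could not read
#   for f in ~/models/*.gguf; do world_readable "$f" || echo "not readable: $f"; done
```

A more direct test, if sudo is available, is `sudo -u ollama test -r ~/models/MyModel.gguf`, which also accounts for directory traversal permissions.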