llama-server with CUDA on WSL2
Guide to deploying llama-server (llama.cpp) as a systemd service on WSL2 with
full NVIDIA GPU offload via CUDA. Configured in router mode to serve
multiple GGUF models on-demand (with optional MTP speculative decoding) via an
OpenAI-compatible API.
Target environment:
- WSL2 (Ubuntu 24.04 Noble)
- NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
- No separate CUDA toolkit install required to run; only needed when building
Why not Ollama?
Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime.
New model architectures (like qwen35 / Qwen3-Next) may not be supported until
Ollama syncs its fork. llama-server from upstream llama.cpp supports them as
soon as the architecture lands in the main branch.
Ollama does nothing special beyond: bundling libggml-cuda.so alongside its
runner and setting PATH to include /usr/lib/wsl/lib (the WSL2 CUDA driver
passthrough). No flash-attention env vars, no special flags. We replicate this.
Prerequisites
# Verify WSL2 CUDA driver passthrough is working
ls /usr/lib/wsl/lib/libcuda.so.1 # must exist
nvidia-smi # must show your GPU
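For scripted setups, the library check above can be wrapped in a small preflight function. This is only a sketch: the name has_wsl_libcuda and its optional directory argument are illustrative and not part of WSL or llama.cpp.

```sh
# Hypothetical helper: succeeds if libcuda.so.1 exists in the given directory.
# The default is the WSL2 passthrough directory used throughout this guide.
has_wsl_libcuda() {
  [ -e "${1:-/usr/lib/wsl/lib}/libcuda.so.1" ]
}

# Usage:
#   has_wsl_libcuda && command -v nvidia-smi >/dev/null \
#     && echo "CUDA passthrough looks OK"
```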
Step 1 — Install CUDA toolkit and build dependencies
Only needed once per machine to compile llama.cpp. Not needed at runtime.
sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git
Ubuntu 24.04 ships CUDA 12.0 in the multiverse repo. This is sufficient to
build llama.cpp with CUDA support even when the runtime driver is newer (e.g.
CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from
NVIDIA's own APT repo to get a more recent toolkit.
Verify the compiler is available:
nvcc --version
Step 2 — Clone and build llama.cpp from source
# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
cd /tmp/llama-build
# Configure with CUDA backend
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF
# Build (uses all cores; takes 10-15 min on a 12-core CPU)
cmake --build build --config Release -j$(nproc)
After the build completes you should see build/bin/llama-server.
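The minimum tags mentioned in the clone comments (b9144 for qwen35, b9279 for MTP) can be checked mechanically when pinning a build. A minimal sketch, assuming tags always have the form b followed by digits; tag_at_least is a hypothetical helper name.

```sh
# Hypothetical helper: compares llama.cpp release tags of the form bNNNN.
# Succeeds if tag $1 is at or above the minimum tag $2.
tag_at_least() {
  [ "${1#b}" -ge "${2#b}" ]
}

# Usage:
#   tag_at_least b9279 b9144 && echo "qwen35 architecture supported"
#   tag_at_least b9279 b9279 && echo "draft-mtp available"
```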
Step 3 — Install to /opt/llama-server
sudo mkdir -p /opt/llama-server
# Copy the server binary
sudo cp build/bin/llama-server /opt/llama-server/
# Copy all shared libraries (b9144+ puts them all in build/bin/)
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
sudo cp -P build/bin/libmtmd*.so* /opt/llama-server/ 2>/dev/null || true
# Register the directory so transitive .so dependencies resolve
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
sudo ldconfig
Note (b9144+): The library layout changed: all .so files now live in build/bin/ (not build/ggml/src/ or build/src/). When upgrading, copy with -P to preserve versioned symlinks and overwrite the old ones.
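After copying, it can be worth confirming that every shared library resolves. The sketch below filters ldd output for unresolved entries; find_missing_libs is an illustrative name, and it assumes the usual "name => not found" line format that glibc's ldd prints.

```sh
# Hypothetical helper: reads `ldd` output on stdin and prints the name of
# every shared library the dynamic linker could not resolve.
find_missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage (prints nothing when all libraries resolve):
#   ldd /opt/llama-server/llama-server | find_missing_libs
```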
Step 4 — Create the start script
Run llama-server in router mode — no --model flag. Models are loaded
on-demand from ~/models/ when a request names them. Switching models requires
no restart and no sudo: just change the model field in opencode.json.
sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
#!/bin/bash
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/llama-server
exec /opt/llama-server/llama-server \
--models-dir /home/dev/models \
--models-max 1 \
--models-preset /home/dev/models/presets.ini \
--host 127.0.0.1 \
--port 8080
SCRIPT
sudo chmod +x /opt/llama-server/start.sh
Key router flags:
- --models-dir: directory scanned for GGUF files. Flat .gguf files become model IDs using the filename without .gguf. Subdirectories become model IDs using the directory name (used for multimodal models with a separate mmproj file; see Multimodal models below).
- --models-max 1: only one model resident at a time. When a different model is requested, the current one is evicted and the new one loads (cold-start delay). With 12GB VRAM this is required.
- --models-preset: path to presets.ini for global defaults and per-model overrides. All inference flags belong here, not in start.sh.
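The naming rule for --models-dir can be previewed without starting the server. The function below is a sketch that mirrors the rule as described here (flat .gguf file: filename without the suffix; subdirectory: directory name). It is not the router's own code, and list_model_ids is a hypothetical name.

```sh
# Hypothetical helper: prints the model IDs this guide's naming rule would
# produce for a models directory. Runs in a subshell so variables stay local.
list_model_ids() (
  for entry in "$1"/*; do
    if [ -d "$entry" ]; then
      basename "$entry"                     # subdirectory: name is the ID
    elif [ -f "$entry" ]; then
      case "$entry" in
        *.gguf) basename "$entry" .gguf ;;  # flat file: strip .gguf
      esac                                  # other files (presets.ini) are skipped
    fi
  done
)

# Usage:
#   list_model_ids /home/dev/models
```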
Per-model settings via presets.ini
All inference flags (ctx-size, n-predict, n-gpu-layers, flash-attn,
threads, parallel, jinja, spec-type, etc.) live in
~/models/presets.ini, not in start.sh. The [*] section sets defaults
inherited by every model; named sections override individual keys.
Section names must match the router's model ID — the filename without
.gguf. Using the .gguf suffix in a section name creates a duplicate entry in
the model list.
version = 1
[*]
n-gpu-layers = 99
flash-attn = on
threads = 8
parallel = 1
[Qwen_Qwen3-14B-Q4_K_M]
ctx-size = 32768
n-predict = 4096
[OmniCoder-2-9B.Q8_0]
ctx-size = 32768
n-predict = 4096
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
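Because a .gguf suffix in a section name creates a duplicate entry, a quick lint of presets.ini can catch the mistake before a restart. A minimal sketch; gguf_suffixed_sections is an illustrative name, and it assumes each section header sits alone on its line.

```sh
# Hypothetical helper: prints every section name in the given INI file that
# ends in .gguf (these should be renamed to the bare model ID).
gguf_suffixed_sections() {
  awk '/^\[.*\.gguf\]$/ { print substr($0, 2, length($0) - 2) }' "$1"
}

# Usage (no output means all section names look fine):
#   gguf_suffixed_sections ~/models/presets.ini
```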
Note: The router reads presets.ini once at service startup; it is not watched for changes. After editing it, run sudo systemctl restart llama-server to apply the new settings. Any currently-loaded model will be evicted and must cold-reload on the next request (~10–60 s).
On GPU layer offload: Hybrid inference (some layers on CPU, some on GPU) is significantly slower than full-GPU due to CPU↔GPU memory transfers each forward pass. For interactive use, prefer models that fit entirely in VRAM. MoE models (like Qwen3.6-35B-A3B) are an exception — their sparse activation means active computation per token is only ~3B parameters regardless of total model size, so partial CPU offload is less painful than with a dense model of the same file size. See the Model choice section below.
Step 5 — Create the systemd service
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (OmniCoder 2 / qwen35)
After=network-online.target
[Service]
ExecStart=/opt/llama-server/start.sh
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"
[Install]
WantedBy=default.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
Note: The PATH includes /usr/lib/wsl/lib, which is what exposes the CUDA driver (libcuda.so.1) to the process in WSL2. Without this, the CUDA backend will load but fail to initialize the device.
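To confirm that a PATH-style value really contains the WSL2 directory, a small membership test is enough. This is a sketch; path_contains is a hypothetical helper that could be called at the top of start.sh.

```sh
# Hypothetical helper: succeeds if the colon-separated list $1 contains the
# exact directory $2.
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage inside start.sh:
#   path_contains "$PATH" /usr/lib/wsl/lib || echo "warning: WSL lib dir missing from PATH" >&2
```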
Step 6 — Verify GPU offload
# Check service is running
systemctl status llama-server
# Health endpoint
curl -s http://127.0.0.1:8080/health
# → {"status":"ok"}
# Watch GPU memory in another terminal during a request
watch -n1 nvidia-smi
# Quick inference test
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
| python3 -m json.tool
During inference, nvidia-smi should show:
- GPU-Util: 80-100%
- GPU Memory: ~10-11GB used (model weights + KV cache)
- CPU: near idle
# Quick inference test (node instead of python3)
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
| node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
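Right after a restart the health endpoint may not answer yet, so scripted checks usually need a short retry loop. A minimal sketch: wait_until is a hypothetical helper, and the usage line assumes /health returns a non-success HTTP status (or refuses the connection) until the server is ready.

```sh
# Hypothetical helper: retries a command up to $1 times, sleeping $2 seconds
# between attempts. Succeeds as soon as the command succeeds.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Usage: poll for up to ~60 s before running the inference test.
#   wait_until 30 2 curl -sf http://127.0.0.1:8080/health || echo "server not ready"
```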
Step 7 — Configure OpenCode
Edit ~/.config/opencode/opencode.json to add the provider. Model IDs are the
filenames without .gguf (or the subdirectory name for multimodal models).
The limit values here inform opencode's context window tracking; the actual
server-side limits are set in presets.ini.
"llama-server": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server",
"options": { "baseURL": "http://127.0.0.1:8080/v1" },
"models": {
"Qwen_Qwen3-14B-Q4_K_M": {
"name": "Qwen3 14B Q4 (fast)",
"tools": true,
"limit": { "context": 32768, "output": 4096 }
},
"Qwen_Qwen3.6-27B-Q4_K_M": {
"name": "Qwen3.6 27B Q4 (deep)",
"tools": true,
"limit": { "context": 16384, "output": 4096 }
},
"OmniCoder-2-9B.Q8_0": {
"name": "OmniCoder 2 9B Q8 (vision)",
"tools": true,
"limit": { "context": 32768, "output": 4096 }
},
"Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
"name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
"tools": true,
"limit": { "context": 8192, "output": 4096 }
}
}
}
In the project-level opencode.json, set the active model per agent:
"agent": {
"orchestrator": {
"model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
}
}
Model choice for RTX 3080 12GB
Pick based on what fits entirely in VRAM — hybrid inference (model too large for VRAM) is 4–8× slower and makes interactive use painful. MoE models are an exception; see note below the table.
| Model | File size | Fits in 12GB? | Speed (est.) | Notes |
|---|---|---|---|---|
| Qwen3-8B Q4_K_M | ~5 GB | ✅ fully | ~25–35 tok/s | Fast; weaker reasoning |
| Qwen3-14B Q4_K_M | ~8.5 GB | ✅ fully | ~12–18 tok/s | Daily driver — fast interactive use, good instruction following |
| OmniCoder-2-9B Q8_0 | ~9.5 GB | ✅ fully | ~15–20 tok/s | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj |
| Qwen3.6-27B Q4_K_M | 17 GB | ⚠️ partial | ~4–8 tok/s | Deep reasoning — better at vague/complex tasks; slow due to CPU offload |
| Qwen3.6-35B-A3B IQ3_S (MTP) | 13.6 GB | ⚠️ partial | ~20–35 tok/s† | MoE + MTP — sparse activation (~3B active params); needs MTP-format GGUF (byteshape) |
| Qwen3-32B Q4_K_M | ~20 GB | ❌ | — | Won't fit |
† MoE speed estimate with --spec-type draft-mtp. Despite 13.6 GB file size,
only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The
sparse feed-forward experts make active-parameter compute comparable to a 3B
dense model.
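The "fits in 12GB?" column can be approximated with simple arithmetic: file size plus some headroom for the KV cache and CUDA buffers must not exceed VRAM. The sketch below works in MiB. The fits_in_vram name and the 2048 MiB headroom are assumptions (roughly the gap between the ~8.5 GB file and the ~10-11GB usage noted in Step 6), not measured values.

```sh
# Hypothetical helper: fits_in_vram FILE_MIB VRAM_MIB HEADROOM_MIB
# Succeeds if the model file plus headroom fits in VRAM. Headroom is a guess
# for KV cache and buffers; tune it for your ctx-size.
fits_in_vram() {
  [ $(( $1 + $3 )) -le "$2" ]
}

# Usage with a real file (size converted from bytes to MiB):
#   size_mib=$(( $(wc -c < ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf) / 1048576 ))
#   fits_in_vram "$size_mib" 12288 2048 && echo "should fit fully"
```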
All models sit in ~/models/ simultaneously and are swapped on-demand by the
router. Cold-swap time is ~10s (9–14B) / ~30–45s (27B+).
Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):
mkdir -p ~/models
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
-O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf
⚠️ Use HuggingFace GGUFs, not Ollama blobs, for qwen35-architecture models. Ollama's converter outputs different tensor names and per-layer KV-head arrays that are incompatible with llama.cpp's qwen35 model loader. Symptoms: missing tensor 'blk.0.ssm_dt', check_tensor_dims: wrong shape, or rope.dimension_sections has wrong array length. Always download from bartowski or unsloth on HuggingFace for these models.
Switching models
With router mode, switching requires no restart and no sudo. Place GGUFs
in ~/models/ and reference them by model ID in opencode.json.
Add a model
# Download to ~/models/ — filename without .gguf becomes the model ID
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
-O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf
Then add a section to ~/models/presets.ini (name = filename without .gguf):
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
And register it in ~/.config/opencode/opencode.json:
"Qwen_Qwen3.6-27B-Q4_K_M": {
"name": "Qwen3.6 27B Q4 (deep)",
"tools": true,
"limit": { "context": 16384, "output": 4096 }
}
Switch active model
Edit opencode.json (project-level or ~/.config/opencode/opencode.json) and
change the agent's model to llama-server/<model-id>:
"agent": {
"orchestrator": {
"model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
}
}
The next request triggers a cold load of the new model (~10–30s for 14B, ~30–60s
for 27B+). No service restart needed. --models-max 1 ensures the previous
model is evicted from VRAM automatically.
To switch from the CLI without editing files:
opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"
Multimodal models
For models with a separate vision encoder (mmproj), use a subdirectory in
~/models/. The directory name becomes the model ID; llama.cpp auto-detects any
file whose name starts with mmproj as the projector.
~/models/
OmniCoder-2-9B.Q8_0/ ← model ID = "OmniCoder-2-9B.Q8_0"
OmniCoder-2-9B.Q8_0.gguf ← main weights
mmproj-Q8_0.gguf ← vision projector (auto-detected)
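To check that a model subdirectory follows this layout, you can list the files that would be picked up as projectors. This sketch only applies the "name starts with mmproj" rule stated above; find_mmproj is an illustrative name and not a llama.cpp command.

```sh
# Hypothetical helper: prints the files in directory $1 whose names start
# with "mmproj". Runs in a subshell so the loop variable stays local.
find_mmproj() (
  for f in "$1"/mmproj*; do
    if [ -f "$f" ]; then
      basename "$f"
    fi
  done
)

# Usage (empty output means no projector would be detected):
#   find_mmproj ~/models/OmniCoder-2-9B.Q8_0
```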
List available models
# See what's in ~/models/ (all are immediately usable as model IDs)
ls ~/models/
# See what's currently loaded
curl -s http://127.0.0.1:8080/v1/models | node -e \
"process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"
# Force a rescan (picks up newly added model files)
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
"process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"
Auto-restart on presets.ini change
The router caches presets.ini at startup, so any edit requires a service
restart to take effect. You can automate this with a systemd path unit that
watches the file and triggers a restart whenever it is written:
sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server when presets.ini changes
[Path]
PathChanged=/home/dev/models/presets.ini
[Install]
WantedBy=default.target
EOF
sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server (triggered by presets.ini change)
[Service]
Type=oneshot
ExecStart=/bin/systemctl restart llama-server
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server-presets.path
After this, saving ~/models/presets.ini automatically restarts the service (~3
s) and the next inference request cold-loads the model with the new settings.
The restart is intentionally disruptive — the currently-loaded model is evicted
— so only enable this if disruptive restarts on every presets save are
acceptable.
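If restarting on every save is too disruptive, one option is to restart only when the file content actually changed. The sketch below compares a checksum against one stored from the previous run. The presets_changed name and the state-file location are assumptions, and wiring it into the oneshot service (for example via a wrapper script) is left to the reader.

```sh
# Hypothetical helper: presets_changed FILE STATE_FILE
# Succeeds (and updates STATE_FILE) if FILE's checksum differs from the one
# recorded last time; fails with status 1 if the content is unchanged.
presets_changed() (
  new=$(cksum < "$1") || exit 2
  old=$(cat "$2" 2>/dev/null || true)
  if [ "$new" = "$old" ]; then
    exit 1
  fi
  printf '%s\n' "$new" > "$2"
)

# Usage in a wrapper script (state path is illustrative):
#   presets_changed /home/dev/models/presets.ini /var/tmp/presets.cksum \
#     && systemctl restart llama-server
```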
MTP speculative decoding
Multi-Token Prediction (MTP) lets the model predict several tokens per forward pass using draft heads baked into the model weights — no separate draft model needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to ~25–35 tok/s on RTX 3080) while preserving output quality.
Requirements:
- b9279+ binary: --spec-type draft-mtp was added in this era. Verify:
/opt/llama-server/llama-server --help | grep spec-type # must list draft-mtp
- MTP-format GGUF: standard bartowski/unsloth quants do not include MTP heads. Use byteshape's dedicated MTP GGUFs:
# IQ3_S (13.6 GB): best quality/size for 12 GB VRAM with slight CPU offload
wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
-O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf
# IQ2_S (10 GB): fully fits in VRAM; heavier quantization
wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
-O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
presets.ini section for MTP:
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
ctx-size = 32768
n-predict = 4096
spec-type = draft-mtp
spec-draft-p-min = 0.75
spec-draft-n-max = 3
- spec-draft-p-min: minimum draft token acceptance probability. 0.75 is a good starting point; lower values accept more speculative tokens (faster but may diverge from non-speculative output).
- spec-draft-n-max: maximum tokens to speculate per step. 3 is the sweet spot for Qwen3.6 MTP; higher values have diminishing returns and add overhead.
Note: ik_llama.cpp (a fork) achieves ~10–20% higher throughput with MTP than official llama.cpp due to a more optimized MTP head implementation. Official llama.cpp MTP is still significantly faster than non-speculative inference and is the simpler setup.
Troubleshooting
Active model keeps resetting to the configured default
Known opencode bug #28735
(open as of May 2026): when a background subagent result is delivered back into
the main session, the active model resets to whatever orchestrator.model is
configured in opencode.json. This means any model switch made via -m flag or
the TUI selector gets silently reverted whenever a tool call or subagent
completes.
Workaround: keep orchestrator.model in opencode.json set to the model
you actually want to use. The reset lands on the configured model, so if it
matches your intent there's no observable effect.
no backends are loaded at startup
The backend .so plugins must be in the same directory as the binary, or on
LD_LIBRARY_PATH. The start.sh script sets this explicitly.
make_cpu_buft_list: no CPU backend found
Install libgomp1 (OpenMP runtime — required by the CPU backend):
sudo apt-get install -y libgomp1
CUDA device not found / GPU not offloading
- Confirm /usr/lib/wsl/lib is in PATH or LD_LIBRARY_PATH for the process
- Run nvidia-smi as the service user: sudo -u ollama nvidia-smi
- Check journalctl -u llama-server -n 50 for lines like ggml_cuda_init: CUDA not found
High CPU / fan noise at idle
- Remove --no-mmap if present (forces 9GB into RAM on startup)
- Check --n-parallel isn't set high (default 1 is fine for single-user use)
- llama-server is permanently loaded; fans will spin during model load (~30s) then drop to zero at idle. This is expected behavior.
qwen35 architecture errors (rope, tensor shape, missing tensor)
These errors all indicate an incompatible GGUF source:
- rope.dimension_sections has wrong array length; expected 4, got 3: Ollama stores a 3-element array; llama.cpp (before a patch) expects 4.
- missing tensor 'blk.0.ssm_dt' or blk.0.ssm_dt.bias: Ollama omits the .bias suffix that HuggingFace-converted GGUFs use (or vice versa).
- check_tensor_dims: wrong shape on blk.N.attn_k.weight: Ollama's converter stores head_count_kv as a per-layer array; llama.cpp's qwen35 model loader expects a scalar.
Solution: use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama
blobs for any qwen35-architecture model. See Model choice above.
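The error strings from the troubleshooting entries so far can be matched against the service log in one pass. This is a rough sketch: triage_log is a hypothetical helper, the patterns are only the messages quoted in this guide, and real log lines may be worded differently.

```sh
# Hypothetical helper: reads log lines on stdin and prints a one-line hint
# for each line that matches an error message quoted in this guide.
triage_log() {
  while IFS= read -r line; do
    case "$line" in
      *"no backends are loaded"*)
        echo "backends: keep the .so plugins next to the binary or on LD_LIBRARY_PATH" ;;
      *"no CPU backend found"*)
        echo "cpu: install libgomp1" ;;
      *"CUDA not found"*)
        echo "cuda: check /usr/lib/wsl/lib and run nvidia-smi as the service user" ;;
      *"rope.dimension_sections has wrong array length"*|*"missing tensor"*|*"check_tensor_dims: wrong shape"*)
        echo "gguf: use a HuggingFace GGUF instead of an Ollama blob" ;;
    esac
  done
}

# Usage:
#   journalctl -u llama-server -n 50 --no-pager | triage_log
```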
Upgrading llama.cpp (replacing binaries while service is running)
The service holds the binary open; cp will fail with Text file busy. Always
stop the service first:
sudo systemctl stop llama-server
sudo cp build/bin/llama-server /opt/llama-server/
sudo cp -P build/bin/lib*.so* /opt/llama-server/
sudo systemctl start llama-server
Model file permissions (service runs as ollama user)
Files downloaded as your user aren't readable by the ollama service user:
# Make model file readable by all
sudo chmod o+r ~/models/MyModel.gguf
# Make the directory traversable
sudo chmod o+x ~ ~/models
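To see which path component is blocking the service user, the permission bits can be walked from the file up to the root. A minimal sketch that only inspects the "other" bits reported by ls -ld; it ignores ACLs and group membership, expects an absolute path, and other_can_read is an illustrative name.

```sh
# Hypothetical helper: checks that "others" can read the file $1 and traverse
# every parent directory. Prints the first blocking path and fails otherwise.
other_can_read() (
  path=$1
  case "$(ls -ld "$path" 2>/dev/null | cut -c8)" in
    r) ;;
    *) echo "not readable by others: $path"; exit 1 ;;
  esac
  dir=$(dirname "$path")
  while :; do
    # Column 10 is the "other execute" bit; t means sticky plus execute.
    case "$(ls -ld "$dir" 2>/dev/null | cut -c10)" in
      x|t) ;;
      *) echo "not traversable by others: $dir"; exit 1 ;;
    esac
    case "$dir" in /|.) break ;; esac
    dir=$(dirname "$dir")
  done
)

# Usage:
#   other_can_read /home/dev/models/MyModel.gguf
```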