dotfiles/.agents/docs/llama-server-cuda-wsl2.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
2026-05-22 13:13:43 -04:00

# llama-server with CUDA on WSL2
Guide to deploying `llama-server` (llama.cpp) as a systemd service on WSL2 with
full NVIDIA GPU offload via CUDA. Configured in **router mode** to serve
multiple GGUF models on-demand (with optional MTP speculative decoding) via an
OpenAI-compatible API.
**Target environment:**
- WSL2 (Ubuntu 24.04 Noble)
- NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
- No separate CUDA toolkit install required to _run_; only needed when building
---
## Why not Ollama?
Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime.
New model architectures (like `qwen35` / Qwen3-Next) may not be supported until
Ollama syncs its fork. `llama-server` from upstream llama.cpp supports them as
soon as the architecture lands in the main branch.
**Ollama does nothing special** beyond: bundling `libggml-cuda.so` alongside its
runner and setting `PATH` to include `/usr/lib/wsl/lib` (the WSL2 CUDA driver
passthrough). No flash-attention env vars, no special flags. We replicate this.
---
## Prerequisites
```bash
# Verify WSL2 CUDA driver passthrough is working
ls /usr/lib/wsl/lib/libcuda.so.1 # must exist
nvidia-smi # must show your GPU
```
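The two checks above can be folded into a small preflight function for reuse in setup scripts. This is a minimal sketch: `check_wsl_cuda` is a hypothetical helper name, and the optional directory argument exists only so the check can be pointed somewhere other than the default path.

```bash
# Hypothetical helper: report whether the WSL2 driver passthrough looks usable.
# $1 (optional) overrides the library directory; the default is the path above.
check_wsl_cuda() {
  local lib_dir="${1:-/usr/lib/wsl/lib}"
  if [ ! -e "$lib_dir/libcuda.so.1" ]; then
    echo "missing: $lib_dir/libcuda.so.1"
    return 1
  fi
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "missing: nvidia-smi not on PATH"
    return 1
  fi
  echo "ok"
}
```

Calling `check_wsl_cuda` with no arguments checks the standard location; a non-zero exit status means one of the prerequisites is not met.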
---
## Step 1 — Install CUDA toolkit and build dependencies
> Only needed once per machine to compile llama.cpp. Not needed at runtime.
```bash
sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git
```
Ubuntu 24.04 ships CUDA 12.0 in the `multiverse` repo. This is sufficient to
build llama.cpp with CUDA support even when the runtime driver is newer (e.g.
CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from
[NVIDIA's own APT repo](https://developer.nvidia.com/cuda-downloads) to get a
more recent toolkit.
Verify the compiler is available:
```bash
nvcc --version
```
---
## Step 2 — Clone and build llama.cpp from source
```bash
# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
cd /tmp/llama-build
# Configure with CUDA backend
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF
# Build (uses all cores; takes 10-15 min on a 12-core CPU)
cmake --build build --config Release -j$(nproc)
```
After the build completes you should see `build/bin/llama-server`.
---
## Step 3 — Install to /opt/llama-server
```bash
sudo mkdir -p /opt/llama-server
# Copy the server binary
sudo cp build/bin/llama-server /opt/llama-server/
# Copy all shared libraries (b9144+ puts them all in build/bin/)
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
sudo cp -P build/bin/libmtmd*.so* /opt/llama-server/ 2>/dev/null || true
# Register the directory so transitive .so dependencies resolve
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
sudo ldconfig
```
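After `ldconfig`, it can be worth confirming that every shared library the binary links against actually resolves. A minimal sketch, assuming the usual `ldd` output format in which unresolved entries read `libname => not found`; `missing_libs` is a hypothetical helper name.

```bash
# Hypothetical helper: read `ldd` output on stdin and print the names of
# libraries the dynamic linker could not resolve.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage (prints nothing when everything resolves):
#   ldd /opt/llama-server/llama-server | missing_libs
```

Backends that are loaded as plugins at run time would not appear in `ldd` output, so an empty result is necessary but not sufficient.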
> **Note (b9144+):** The library layout changed — all `.so` files now live in
> `build/bin/` (not `build/ggml/src/` or `build/src/`). When upgrading, copy
> with `-P` to preserve versioned symlinks and overwrite the old ones.
---
## Step 4 — Create the start script
Run llama-server in **router mode** — no `--model` flag. Models are loaded
on-demand from `~/models/` when a request names them. Switching models requires
no restart and no `sudo`: just change the `model` field in `opencode.json`.
```bash
sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
#!/bin/bash
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/llama-server
exec /opt/llama-server/llama-server \
  --models-dir /home/dev/models \
  --models-max 1 \
  --models-preset /home/dev/models/presets.ini \
  --host 127.0.0.1 \
  --port 8080
SCRIPT
sudo chmod +x /opt/llama-server/start.sh
```
**Key router flags:**
- `--models-dir` — directory scanned for GGUF files. Flat `.gguf` files become
model IDs using the filename **without** `.gguf`. Subdirectories become model
IDs using the directory name (used for multimodal models with a separate
mmproj file — see _Multimodal models_ below).
- `--models-max 1` — only one model resident at a time. When a different model
is requested, the current one is evicted and the new one loads (cold-start
delay). With 12GB VRAM this is required.
- `--models-preset` — path to `presets.ini` for global defaults and per-model
overrides. All inference flags belong here, not in `start.sh`.
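
The model ID rule described for `--models-dir` can be sketched as a small function, which is handy when scripting against the router. `model_id` is a hypothetical helper that mirrors the two rules stated above; it is not part of llama.cpp.

```bash
# Hypothetical helper: print the model ID the router would be expected to
# assign to a path under --models-dir, following the two rules above.
model_id() {
  local path="${1%/}"        # drop a trailing slash, if any
  local name="${path##*/}"   # last path component
  if [ -d "$path" ]; then
    echo "$name"             # subdirectory: ID is the directory name
  else
    echo "${name%.gguf}"     # flat file: ID is the filename minus .gguf
  fi
}
```

For example, `model_id ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf` prints `Qwen_Qwen3-14B-Q4_K_M`, which is the string to use in requests and in `presets.ini` section names.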
**Per-model settings via `presets.ini`**
All inference flags (`ctx-size`, `n-predict`, `n-gpu-layers`, `flash-attn`,
`threads`, `parallel`, `jinja`, `spec-type`, etc.) live in
`~/models/presets.ini`, not in `start.sh`. The `[*]` section sets defaults
inherited by every model; named sections override individual keys.
Section names must match the router's model ID — the filename **without**
`.gguf`. Using the `.gguf` suffix in a section name creates a duplicate entry in
the model list.
```ini
version = 1
[*]
n-gpu-layers = 99
flash-attn = on
threads = 8
parallel = 1
[Qwen_Qwen3-14B-Q4_K_M]
ctx-size = 32768
n-predict = 4096
[OmniCoder-2-9B.Q8_0]
ctx-size = 32768
n-predict = 4096
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```
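Because a mismatched section name fails quietly, a quick consistency check between `presets.ini` and the models directory can save a debugging session. The sketch below is an assumption-laden helper (`check_preset_sections` is a hypothetical name): it treats every `[name]` header other than `[*]` as a model ID and looks for either `name.gguf` or a `name/` subdirectory.

```bash
# Hypothetical helper: list presets.ini sections that have no matching model.
# $1 = path to presets.ini, $2 = models directory
check_preset_sections() {
  local ini="$1" dir="$2" name
  # Extract "[name]" headers; everything else in the file is ignored.
  sed -n 's/^\[\(.*\)][[:space:]]*$/\1/p' "$ini" | while IFS= read -r name; do
    [ "$name" = "*" ] && continue   # [*] holds global defaults, not a model
    if [ ! -f "$dir/$name.gguf" ] && [ ! -d "$dir/$name" ]; then
      echo "no model for section: [$name]"
    fi
  done
}

# Usage:
#   check_preset_sections ~/models/presets.ini ~/models
```

A section written as `[MyModel.gguf]` would be reported here, since the check looks for `MyModel.gguf.gguf`; that is the same mistake that produces a duplicate entry in the model list.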
> **Note:** The router reads `presets.ini` **once at service startup** — it is
> not watched for changes. After editing it, run
> `sudo systemctl restart llama-server` to apply the new settings. Any
> currently-loaded model will be evicted and must cold-reload on the next
> request (~10-60 s).
**On GPU layer offload:** Hybrid inference (some layers on CPU, some on GPU) is
significantly slower than full-GPU due to CPU↔GPU memory transfers each forward
pass. For interactive use, prefer models that fit entirely in VRAM. MoE models
(like Qwen3.6-35B-A3B) are an exception — their sparse activation means active
computation per token is only ~3B parameters regardless of total model size, so
partial CPU offload is less painful than with a dense model of the same file
size. See the _Model choice_ section below.
---
## Step 5 — Create the systemd service
```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (OmniCoder 2 / qwen35)
After=network-online.target
[Service]
ExecStart=/opt/llama-server/start.sh
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"
[Install]
WantedBy=default.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```
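> **Note:** The `PATH` includes `/usr/lib/wsl/lib`, the directory where WSL2
> exposes the driver passthrough (`libcuda.so.1`, `nvidia-smi`). The dynamic
> linker does not consult `PATH` when resolving shared libraries, so if the CUDA
> backend loads but fails to initialize the device, also confirm the directory
> is visible to the linker (`ldconfig -p | grep libcuda`) or add it to
> `LD_LIBRARY_PATH` in `start.sh`.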
---
## Step 6 — Verify GPU offload
```bash
# Check service is running
systemctl status llama-server
# Health endpoint
curl -s http://127.0.0.1:8080/health
# → {"status":"ok"}
# Watch GPU memory in another terminal during a request
watch -n1 nvidia-smi
# Quick inference test
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | python3 -m json.tool
```
During inference you should see (GPU figures from `nvidia-smi`; CPU load from `top` or similar):
- GPU-Util: 80-100%
- GPU Memory: ~10-11GB used (model weights + KV cache)
- CPU: near idle
```bash
# Quick inference test (node instead of python3)
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
| node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
```
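Right after a restart the server may not answer immediately, so scripts that follow `systemctl start` with a request can race it. A minimal polling sketch follows; `wait_for_health` is a hypothetical helper, and it assumes the health endpoint returns the `{"status":"ok"}` body shown above once the server is ready.

```bash
# Hypothetical helper: poll the health endpoint until it reports ok or the
# timeout (in seconds) elapses.
# $1 = health URL (optional), $2 = timeout in seconds (optional)
wait_for_health() {
  local url="${1:-http://127.0.0.1:8080/health}" timeout="${2:-60}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -s --max-time 2 "$url" 2>/dev/null | grep -q '"status":"ok"'; then
      echo "ready after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "not ready after ${timeout}s"
  return 1
}

# Usage:
#   sudo systemctl restart llama-server && wait_for_health
```

The reported wait time counts polling iterations, so it is approximate; it does not include the cold load of a model, which only happens on the first request that names it.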
---
## Step 7 — Configure OpenCode
Edit `~/.config/opencode/opencode.json` to add the provider. Model IDs are the
filenames **without** `.gguf` (or the subdirectory name for multimodal models).
The `limit` values here inform opencode's context window tracking; the actual
server-side limits are set in `presets.ini`.
```json
"llama-server": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "llama-server",
  "options": { "baseURL": "http://127.0.0.1:8080/v1" },
  "models": {
    "Qwen_Qwen3-14B-Q4_K_M": {
      "name": "Qwen3 14B Q4 (fast)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen_Qwen3.6-27B-Q4_K_M": {
      "name": "Qwen3.6 27B Q4 (deep)",
      "tools": true,
      "limit": { "context": 16384, "output": 4096 }
    },
    "OmniCoder-2-9B.Q8_0": {
      "name": "OmniCoder 2 9B Q8 (vision)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
      "name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
      "tools": true,
      "limit": { "context": 8192, "output": 4096 }
    }
  }
}
```
In the project-level `opencode.json`, set the active model per agent:
```json
"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}
```
---
## Model choice for RTX 3080 12GB
Pick based on what fits **entirely** in VRAM: hybrid inference (model too large
for VRAM) is 4-8× slower and makes interactive use painful. MoE models are an
exception; see the note below the table.
| Model | File size | Fits in 12GB? | Speed (est.) | Notes |
| ------------------------------- | --------- | ------------- | ------------- | ---------------------------------------------------------------------------------------- |
| Qwen3-8B Q4_K_M                 | ~5 GB     | ✅ fully      | ~25-35 tok/s  | Fast; weaker reasoning                                                                   |
| **Qwen3-14B Q4_K_M**            | ~8.5 GB   | ✅ fully      | ~12-18 tok/s  | **Daily driver**: fast interactive use, good instruction following                       |
| OmniCoder-2-9B Q8_0             | ~9.5 GB   | ✅ fully      | ~15-20 tok/s  | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj                |
| **Qwen3.6-27B Q4_K_M**          | 17 GB     | ⚠️ partial    | ~4-8 tok/s    | **Deep reasoning**: better at vague/complex tasks; slow due to CPU offload               |
| **Qwen3.6-35B-A3B IQ3_S (MTP)** | 13.6 GB   | ⚠️ partial    | ~20-35 tok/s† | **MoE + MTP**: sparse activation (~3B active params); needs MTP-format GGUF (byteshape)  |
| Qwen3-32B Q4_K_M | ~20 GB | ❌ | — | Won't fit |
† MoE speed estimate with `--spec-type draft-mtp`. Despite 13.6 GB file size,
only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The
sparse feed-forward experts make active-parameter compute comparable to a 3B
dense model.
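
The ~1.6 GB figure in the footnote is simply file size minus VRAM capacity. A rough sketch of that arithmetic follows; `vram_overflow_gb` is a hypothetical helper, and it deliberately ignores KV cache and CUDA runtime overhead, so the real spill to CPU will be somewhat larger.

```bash
# Hypothetical helper: estimate how many GB of weights cannot fit in VRAM.
# $1 = GGUF file size in GB, $2 = VRAM capacity in GB
vram_overflow_gb() {
  LC_ALL=C awk -v f="$1" -v v="$2" 'BEGIN {
    d = f - v
    if (d < 0) d = 0          # model fits: nothing overflows
    printf "%.1f\n", d
  }'
}

vram_overflow_gb 13.6 12   # prints 1.6 (the MoE IQ3_S file on a 12 GB card)
vram_overflow_gb 8.5 12    # prints 0.0 (Qwen3-14B Q4_K_M fits)
```

The same calculation gives 5.0 for the 17 GB dense 27B model, which is consistent with its much lower throughput in the table.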
All models sit in `~/models/` simultaneously and are swapped on-demand by the
router. Cold-swap time is ~10 s (9-14B) / ~30-45 s (27B+).
Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):
```bash
mkdir -p ~/models
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
-O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf
```
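An interrupted or redirected download can leave an HTML error page or a truncated file under the `.gguf` name. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check is possible before pointing the router at the file. `is_gguf` is a hypothetical helper name; it only checks the magic bytes, not that the file is complete.

```bash
# Hypothetical helper: succeed only if the file starts with the GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage:
#   is_gguf ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf || echo "not a GGUF file"
```

Since `wget -c` resumes partial downloads, re-running the original command is usually the simplest fix when the check fails on a truncated file.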
> **⚠️ Use HuggingFace GGUFs, not Ollama blobs for qwen35-architecture models.**
> Ollama's converter outputs different tensor names and per-layer KV-head arrays
> that are incompatible with llama.cpp's `qwen35` model loader. Symptoms:
> `missing tensor 'blk.0.ssm_dt'`, `check_tensor_dims: wrong shape`, or
> `rope.dimension_sections has wrong array length`. Always download from
> bartowski or unsloth on HuggingFace for these models.
---
## Switching models
With router mode, switching requires **no restart and no `sudo`**. Place GGUFs
in `~/models/` and reference them by model ID in `opencode.json`.
### Add a model
```bash
# Download to ~/models/ — filename without .gguf becomes the model ID
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
-O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf
```
Then add a section to `~/models/presets.ini` (name = filename without `.gguf`):
```ini
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```
And register it in `~/.config/opencode/opencode.json`:
```json
"Qwen_Qwen3.6-27B-Q4_K_M": {
  "name": "Qwen3.6 27B Q4 (deep)",
  "tools": true,
  "limit": { "context": 16384, "output": 4096 }
}
```
### Switch active model
Edit `opencode.json` (project-level or `~/.config/opencode/opencode.json`) and
change the agent's `model` to `llama-server/<model-id>`:
```json
"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}
```
The next request triggers a cold load of the new model (~10-30 s for 14B,
~30-60 s for 27B+). No service restart needed. `--models-max 1` ensures the
previous model is evicted from VRAM automatically.
To switch from the CLI without editing files:
```bash
opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"
```
### Multimodal models
For models with a separate vision encoder (mmproj), use a **subdirectory** in
`~/models/`. The directory name becomes the model ID; llama.cpp auto-detects any
file whose name starts with `mmproj` as the projector.
```
~/models/
  OmniCoder-2-9B.Q8_0/            ← model ID = "OmniCoder-2-9B.Q8_0"
    OmniCoder-2-9B.Q8_0.gguf      ← main weights
    mmproj-Q8_0.gguf              ← vision projector (auto-detected)
```
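Arranging downloaded files into this layout can be scripted. The sketch below assumes the convention shown above (directory and main weights share a name, projector filename starts with `mmproj`); `stage_multimodal` is a hypothetical helper, not something llama.cpp provides.

```bash
# Hypothetical helper: move a weights file and its projector into the
# subdirectory layout shown above, and print the resulting model ID.
# $1 = models directory, $2 = main weights .gguf, $3 = projector .gguf
stage_multimodal() {
  local models_dir="$1" weights="$2" mmproj="$3" id
  id="$(basename "$weights" .gguf)"
  mkdir -p "$models_dir/$id"
  mv "$weights" "$models_dir/$id/$id.gguf"
  # The projector's filename must start with "mmproj" to be auto-detected,
  # so add the prefix when the download does not already have it.
  case "$(basename "$mmproj")" in
    mmproj*) mv "$mmproj" "$models_dir/$id/" ;;
    *)       mv "$mmproj" "$models_dir/$id/mmproj-$(basename "$mmproj")" ;;
  esac
  echo "$id"
}

# Usage:
#   stage_multimodal ~/models ~/Downloads/OmniCoder-2-9B.Q8_0.gguf ~/Downloads/mmproj-Q8_0.gguf
```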
### List available models
```bash
# See what's in ~/models/ (all are immediately usable as model IDs)
ls ~/models/
# See what's currently loaded
curl -s http://127.0.0.1:8080/v1/models | node -e \
"process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"
# Force a rescan (picks up newly added model files)
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
"process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"
```
### Auto-restart on presets.ini change
The router caches `presets.ini` at startup, so any edit requires a service
restart to take effect. You can automate this with a systemd **path unit** that
watches the file and triggers a restart whenever it is written:
```bash
sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server when presets.ini changes
[Path]
PathChanged=/home/dev/models/presets.ini
[Install]
WantedBy=default.target
EOF
sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server (triggered by presets.ini change)
[Service]
Type=oneshot
ExecStart=/bin/systemctl restart llama-server
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server-presets.path
```
After this, saving `~/models/presets.ini` automatically restarts the service (~3
s) and the next inference request cold-loads the model with the new settings.
The restart is intentionally disruptive — the currently-loaded model is evicted
— so only enable this if disruptive restarts on every presets save are
acceptable.
---
## MTP speculative decoding
Multi-Token Prediction (MTP) lets the model predict several tokens per forward
pass using draft heads baked into the model weights — no separate draft model
needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to
~25-35 tok/s on RTX 3080) while preserving output quality.
**Requirements:**
1. **b9279+ binary**: `--spec-type draft-mtp` was added around this build. Verify:
```bash
/opt/llama-server/llama-server --help | grep spec-type
# must list draft-mtp
```
2. **MTP-format GGUF** — standard bartowski/unsloth quants do not include MTP
heads. Use byteshape's dedicated MTP GGUFs:
```bash
# IQ3_S (13.6 GB) — best quality/size for 12 GB VRAM with slight CPU offload
wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
-O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf
# IQ2_S (10 GB) — fully fits in VRAM; heavier quantization
wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
-O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
```
**`presets.ini` section for MTP:**
```ini
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
ctx-size = 32768
n-predict = 4096
spec-type = draft-mtp
spec-draft-p-min = 0.75
spec-draft-n-max = 3
```
- `spec-draft-p-min` — minimum draft token acceptance probability. 0.75 is a
good starting point; lower values accept more speculative tokens (faster but
may diverge from non-speculative output).
- `spec-draft-n-max` — maximum tokens to speculate per step. 3 is the sweet spot
for Qwen3.6 MTP; higher values have diminishing returns and add overhead.
**Note:** ik_llama.cpp (a fork) achieves ~10-20% higher throughput with MTP than
official llama.cpp due to a more optimized MTP head implementation. Official
llama.cpp MTP is still significantly faster than non-speculative inference and
is the simpler setup.
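
To see whether MTP actually helps on a given machine, it is useful to measure throughput with and without `spec-type` set. The helper below only does the division; `tokens_per_second` is a hypothetical name, and obtaining the token count is left to the caller (for example from the `usage.completion_tokens` field of the response, assuming the server reports it).

```bash
# Hypothetical helper: tokens generated / wall-clock seconds, one decimal place.
# $1 = number of generated tokens, $2 = elapsed seconds
tokens_per_second() {
  LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN {
    if (s <= 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.1f\n", n / s
  }'
}

tokens_per_second 300 10   # prints 30.0
```

Wall-clock timing of a whole request also includes prompt processing, so the result understates pure generation speed; comparing the same prompt across both configurations keeps the comparison fair.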
---
## Troubleshooting
### Active model keeps resetting to the configured default
Known opencode bug [#28735](https://github.com/anomalyco/opencode/issues/28735)
(open as of May 2026): when a background subagent result is delivered back into
the main session, the active model resets to whatever `orchestrator.model` is
configured in `opencode.json`. This means any model switch made via `-m` flag or
the TUI selector gets silently reverted whenever a tool call or subagent
completes.
**Workaround:** keep `orchestrator.model` in `opencode.json` set to the model
you actually want to use. The reset lands on the configured model, so if it
matches your intent there's no observable effect.
---
### `no backends are loaded` at startup
The backend `.so` plugins must be in the same directory as the binary, or on
`LD_LIBRARY_PATH`. The `start.sh` script sets this explicitly.
### `make_cpu_buft_list: no CPU backend found`
Install `libgomp1` (OpenMP runtime — required by the CPU backend):
```bash
sudo apt-get install -y libgomp1
```
### CUDA device not found / GPU not offloading
- Confirm `/usr/lib/wsl/lib` is in `PATH` or `LD_LIBRARY_PATH` for the process
- Run `nvidia-smi` as the service user: `sudo -u ollama nvidia-smi`
- Check `journalctl -u llama-server -n 50` for lines like
`ggml_cuda_init: CUDA not found`
### High CPU / fan noise at idle
- Remove `--no-mmap` if present (forces 9GB into RAM on startup)
- Check `--parallel` (`parallel` in `presets.ini`) isn't set high (1 is fine for single-user use)
- llama-server is permanently loaded; fans will spin during model load (~30s)
then drop to zero at idle — this is expected behavior
### `qwen35` architecture errors (rope, tensor shape, missing tensor)
These errors all indicate an **incompatible GGUF source**:
- `rope.dimension_sections has wrong array length; expected 4, got 3` — Ollama
stores a 3-element array; llama.cpp (before a patch) expects 4.
- `missing tensor 'blk.0.ssm_dt'` or `blk.0.ssm_dt.bias` — Ollama omits the
`.bias` suffix that HuggingFace-converted GGUFs use (or vice versa).
- `check_tensor_dims: wrong shape` on `blk.N.attn_k.weight` — Ollama's converter
stores `head_count_kv` as a per-layer array; llama.cpp's qwen35 model loader
expects a scalar.
**Solution:** use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama
blobs for any `qwen35`-architecture model. See _Model choice_ above.
### Upgrading llama.cpp (replacing binaries while service is running)
The service holds the binary open; `cp` will fail with `Text file busy`. Always
stop the service first:
```bash
sudo systemctl stop llama-server
sudo cp build/bin/llama-server /opt/llama-server/
sudo cp -P build/bin/lib*.so* /opt/llama-server/
sudo systemctl start llama-server
```
### Model file permissions (service runs as `ollama` user)
Files downloaded as your user aren't readable by the `ollama` service user:
```bash
# Make model file readable by all
sudo chmod o+r ~/models/MyModel.gguf
# Make the directory traversable
sudo chmod o+x ~ ~/models
```
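When a model still fails to load after these changes, the usual cause is a parent directory that is not traversable. The sketch below walks from the file up to `/` and reports each component that blocks access for "other" users; `check_other_access` is a hypothetical helper and assumes GNU `stat` and `realpath`, as shipped with Ubuntu.

```bash
# Hypothetical helper: report why a file may be unreadable to another user.
# Checks o+r on the file and o+x on every parent directory.
check_other_access() {
  local path perms
  path="$(realpath "$1")"
  perms="$(stat -c %A "$path")"           # e.g. -rw-r--r--
  [ "${perms:7:1}" = "r" ] || echo "not world-readable: $path"
  path="$(dirname "$path")"
  while [ "$path" != "/" ]; do
    perms="$(stat -c %A "$path")"         # e.g. drwxr-x---
    # "t" means sticky bit plus execute (as on /tmp), which is also fine.
    [ "${perms:9:1}" = "x" ] || [ "${perms:9:1}" = "t" ] \
      || echo "not world-traversable: $path"
    path="$(dirname "$path")"
  done
}

# Usage:
#   check_other_access ~/models/MyModel.gguf
```

This only inspects the "other" permission bits; access granted through group membership or ACLs is not considered, so a reported line is a hint rather than proof.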