# llama-server with CUDA on WSL2

Guide to deploying `llama-server` (llama.cpp) as a systemd service on WSL2 with
full NVIDIA GPU offload via CUDA. Configured in **router mode** to serve
multiple GGUF models on-demand (with optional MTP speculative decoding) via an
OpenAI-compatible API.

**Target environment:**

- WSL2 (Ubuntu 24.04 Noble)
- NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
- No separate CUDA toolkit install required to _run_; only needed when building

---

## Why not Ollama?

Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime.
New model architectures (like `qwen35` / Qwen3-Next) may not be supported until
Ollama syncs its fork. `llama-server` from upstream llama.cpp supports them as
soon as the architecture lands in the main branch.

**Ollama does nothing special** beyond: bundling `libggml-cuda.so` alongside its
runner and setting `PATH` to include `/usr/lib/wsl/lib` (the WSL2 CUDA driver
passthrough). No flash-attention env vars, no special flags. We replicate this.

---

## Prerequisites

```bash
# Verify WSL2 CUDA driver passthrough is working
ls /usr/lib/wsl/lib/libcuda.so.1   # must exist
nvidia-smi                          # must show your GPU
```
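
For use in scripts, the library check above can be wrapped in a small preflight function. This is a minimal sketch: `check_wsl_cuda` and its optional directory argument are illustrative names, not part of llama.cpp or WSL.

```bash
# Preflight sketch: succeed only if the WSL2 driver library is present.
# check_wsl_cuda is a hypothetical helper; the directory argument exists
# only so the check can be pointed at another location.
check_wsl_cuda() {
  local libdir="${1:-/usr/lib/wsl/lib}"
  if [ ! -e "$libdir/libcuda.so.1" ]; then
    echo "missing: $libdir/libcuda.so.1" >&2
    return 1
  fi
  echo "found: $libdir/libcuda.so.1"
}

# Usage: check_wsl_cuda && nvidia-smi
```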

---

## Step 1 — Install CUDA toolkit and build dependencies

> Only needed once per machine to compile llama.cpp. Not needed at runtime.

```bash
sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git
```

Ubuntu 24.04 ships CUDA 12.0 in the `multiverse` repo. This is sufficient to
build llama.cpp with CUDA support even when the runtime driver is newer (e.g.
CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from
[NVIDIA's own APT repo](https://developer.nvidia.com/cuda-downloads) to get a
more recent toolkit.

Verify the compiler is available:

```bash
nvcc --version
```

---

## Step 2 — Clone and build llama.cpp from source

```bash
# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
cd /tmp/llama-build

# Configure with CUDA backend
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF

# Build (uses all cores; takes 10-15 min on a 12-core CPU)
cmake --build build --config Release -j$(nproc)
```

After the build completes you should see `build/bin/llama-server`.

---

## Step 3 — Install to /opt/llama-server

```bash
sudo mkdir -p /opt/llama-server

# Copy the server binary
sudo cp build/bin/llama-server /opt/llama-server/

# Copy all shared libraries (b9144+ puts them all in build/bin/)
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
sudo cp -P build/bin/libmtmd*.so*  /opt/llama-server/ 2>/dev/null || true

# Register the directory so transitive .so dependencies resolve
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
sudo ldconfig
```
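
To confirm that the installed binary resolves all of its shared libraries after `ldconfig`, one option is to scan `ldd` output for `not found` entries. A minimal sketch; `list_missing_libs` is a hypothetical helper name, and it reads `ldd` output on stdin.

```bash
# Print the name of every library that ldd reports as "not found".
# Reads ldd output on stdin, so it can be exercised without a real binary.
list_missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage (empty output means every dependency resolves):
#   ldd /opt/llama-server/llama-server | list_missing_libs
```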

> **Note (b9144+):** The library layout changed — all `.so` files now live in
> `build/bin/` (not `build/ggml/src/` or `build/src/`). When upgrading, copy
> with `-P` to preserve versioned symlinks and overwrite the old ones.

---

## Step 4 — Create the start script

Run llama-server in **router mode** — no `--model` flag. Models are loaded
on-demand from `~/models/` when a request names them. Switching models requires
no restart and no `sudo`: just change the `model` field in `opencode.json`.

```bash
sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
#!/bin/bash
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/llama-server
exec /opt/llama-server/llama-server \
  --models-dir /home/dev/models \
  --models-max 1 \
  --models-preset /home/dev/models/presets.ini \
  --host 127.0.0.1 \
  --port 8080
SCRIPT
sudo chmod +x /opt/llama-server/start.sh
```

**Key router flags:**

- `--models-dir` — directory scanned for GGUF files. Flat `.gguf` files become
  model IDs using the filename **without** `.gguf`. Subdirectories become model
  IDs using the directory name (used for multimodal models with a separate
  mmproj file — see _Multimodal models_ below).
- `--models-max 1` — only one model resident at a time. When a different model
  is requested, the current one is evicted and the new one loads (cold-start
  delay). With 12GB VRAM this is required.
- `--models-preset` — path to `presets.ini` for global defaults and per-model
  overrides. All inference flags belong here, not in `start.sh`.
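
The ID rule for `--models-dir` can be sketched as a shell function. It only mirrors the naming rule as described above, for illustration; `model_id_for` is a hypothetical helper and says nothing about how the router is implemented internally.

```bash
# Derive the router model ID for an entry in the models directory,
# following the rule described above:
#   flat file  foo.gguf  -> "foo"
#   directory  bar/      -> "bar"
# Anything else (e.g. presets.ini) is not a model entry and returns 1.
model_id_for() {
  local entry="${1%/}"        # drop a trailing slash, if any
  local name="${entry##*/}"   # last path component
  if [ -d "$entry" ]; then
    printf '%s\n' "$name"
  elif [ "${name%.gguf}" != "$name" ]; then
    printf '%s\n' "${name%.gguf}"
  else
    return 1
  fi
}

# Usage: for f in ~/models/*; do model_id_for "$f"; done
```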

**Per-model settings via `presets.ini`**

All inference flags (`ctx-size`, `n-predict`, `n-gpu-layers`, `flash-attn`,
`threads`, `parallel`, `jinja`, `spec-type`, etc.) live in
`~/models/presets.ini`, not in `start.sh`. The `[*]` section sets defaults
inherited by every model; named sections override individual keys.

Section names must match the router's model ID — the filename **without**
`.gguf`. Using the `.gguf` suffix in a section name creates a duplicate entry in
the model list.

```ini
version = 1

[*]
n-gpu-layers = 99
flash-attn = on
threads = 8
parallel = 1

[Qwen_Qwen3-14B-Q4_K_M]
ctx-size = 32768
n-predict = 4096

[OmniCoder-2-9B.Q8_0]
ctx-size = 32768
n-predict = 4096

[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```
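
Because a `.gguf` suffix in a section name creates a duplicate entry rather than an error, it may be worth linting `presets.ini` before restarting. A minimal sketch; `lint_preset_sections` is a hypothetical helper that checks only the suffix rule described above.

```bash
# Print every section header in a presets file whose name ends in ".gguf".
# Exit status is 1 if any offending section was found, 0 otherwise.
lint_preset_sections() {
  awk '
    /^\[.*\.gguf\][ \t]*$/ { print; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Usage: lint_preset_sections ~/models/presets.ini
```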

> **Note:** The router reads `presets.ini` **once at service startup** — it is
> not watched for changes. After editing it, run
> `sudo systemctl restart llama-server` to apply the new settings. Any
> currently-loaded model will be evicted and must cold-reload on the next
> request (~10–60 s).

**On GPU layer offload:** Hybrid inference (some layers on CPU, some on GPU) is
significantly slower than full-GPU due to CPU↔GPU memory transfers each forward
pass. For interactive use, prefer models that fit entirely in VRAM. MoE models
(like Qwen3.6-35B-A3B) are an exception — their sparse activation means active
computation per token is only ~3B parameters regardless of total model size, so
partial CPU offload is less painful than with a dense model of the same file
size. See the _Model choice_ section below.

---

## Step 5 — Create the systemd service

```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (router mode)
After=network-online.target

[Service]
ExecStart=/opt/llama-server/start.sh
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```
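
After `systemctl start`, the server may need a moment before it accepts requests. The sketch below polls the server's `/health` endpoint (the same one used in Step 6) at the address from `start.sh`. `wait_for_health` and its retry defaults are illustrative choices, not part of llama.cpp.

```bash
# Poll the health endpoint until it answers with a success status or the
# retries run out. curl -f makes HTTP errors (>= 400) count as failures.
# $1 = base URL (default matches start.sh), $2 = max attempts (default 30).
wait_for_health() {
  local url="${1:-http://127.0.0.1:8080}" tries="${2:-30}" i
  for i in $(seq 1 "$tries"); do
    if curl -sf -o /dev/null "$url/health"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "not ready after $tries attempts" >&2
  return 1
}

# Usage: sudo systemctl start llama-server && wait_for_health
```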

> **Note:** The `PATH` includes `/usr/lib/wsl/lib` — this is what exposes the
> CUDA driver (`libcuda.so.1`) to the process in WSL2. Without this, the CUDA
> backend will load but fail to initialize the device.

---

## Step 6 — Verify GPU offload

```bash
# Check service is running
systemctl status llama-server

# Health endpoint
curl -s http://127.0.0.1:8080/health
# → {"status":"ok"}

# Watch GPU memory in another terminal during a request
watch -n1 nvidia-smi

# Quick inference test
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | python3 -m json.tool
```

During inference, expect roughly:

- GPU-Util (`nvidia-smi`): 80-100%
- GPU memory (`nvidia-smi`): ~10-11 GB used (model weights + KV cache)
- CPU (`top` or `htop`): near idle
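
For a scriptable version of the memory check, `nvidia-smi` can print used and total VRAM as CSV. The helper below is a sketch (`vram_percent` is a hypothetical name) that turns that output into a percentage.

```bash
# Convert "used, total" MiB pairs (one line per GPU) into a whole-number
# percentage. Lines that do not have exactly two fields are ignored.
vram_percent() {
  awk -F', *' 'NF == 2 && $2 > 0 { printf "%d%%\n", ($1 * 100) / $2 }'
}

# Usage:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | vram_percent
```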

```bash
# Quick inference test (node instead of python3)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
```

---

## Step 7 — Configure OpenCode

Edit `~/.config/opencode/opencode.json` to add the provider. Model IDs are the
filenames **without** `.gguf` (or the subdirectory name for multimodal models).
The `limit` values here inform opencode's context window tracking; the actual
server-side limits are set in `presets.ini`.

```json
"llama-server": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "llama-server",
  "options": { "baseURL": "http://127.0.0.1:8080/v1" },
  "models": {
    "Qwen_Qwen3-14B-Q4_K_M": {
      "name": "Qwen3 14B Q4 (fast)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen_Qwen3.6-27B-Q4_K_M": {
      "name": "Qwen3.6 27B Q4 (deep)",
      "tools": true,
      "limit": { "context": 16384, "output": 4096 }
    },
    "OmniCoder-2-9B.Q8_0": {
      "name": "OmniCoder 2 9B Q8 (vision)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
      "name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
      "tools": true,
      "limit": { "context": 8192, "output": 4096 }
    }
  }
}
```

In the project-level `opencode.json`, set the active model per agent:

```json
"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}
```

---

## Model choice for RTX 3080 12GB

Pick based on what fits **entirely** in VRAM — hybrid inference (model too large
for VRAM) is 4–8× slower and makes interactive use painful. MoE models are an
exception; see note below the table.

| Model                           | File size | Fits in 12GB? | Speed (est.)  | Notes                                                                                    |
| ------------------------------- | --------- | ------------- | ------------- | ---------------------------------------------------------------------------------------- |
| Qwen3-8B Q4_K_M                 | ~5 GB     | ✅ fully      | ~25–35 tok/s  | Fast; weaker reasoning                                                                   |
| **Qwen3-14B Q4_K_M**            | ~8.5 GB   | ✅ fully      | ~12–18 tok/s  | **Daily driver** — fast interactive use, good instruction following                      |
| OmniCoder-2-9B Q8_0             | ~9.5 GB   | ✅ fully      | ~15–20 tok/s  | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj                |
| **Qwen3.6-27B Q4_K_M**          | 17 GB     | ⚠️ partial    | ~4–8 tok/s    | **Deep reasoning** — better at vague/complex tasks; slow due to CPU offload              |
| **Qwen3.6-35B-A3B IQ3_S (MTP)** | 13.6 GB   | ⚠️ partial    | ~20–35 tok/s† | **MoE + MTP** — sparse activation (~3B active params); needs MTP-format GGUF (byteshape) |
| Qwen3-32B Q4_K_M                | ~20 GB    | ❌            | —             | Won't fit                                                                                |

† MoE speed estimate with `--spec-type draft-mtp`. Despite 13.6 GB file size,
only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The
sparse feed-forward experts make active-parameter compute comparable to a 3B
dense model.

All models sit in `~/models/` simultaneously and are swapped on-demand by the
router. Cold-swap time is roughly 10–30 s for 9–14B models and 30–60 s for 27B+.

Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):

```bash
mkdir -p ~/models
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf
```
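
A quick sanity check after downloading: every GGUF file begins with the four ASCII bytes `GGUF`, so a saved HTML error page or an unrelated file is easy to spot. A minimal sketch; `is_gguf` is a hypothetical helper name.

```bash
# Succeed if the file starts with the GGUF magic bytes.
# This catches wrong content (e.g. an HTML error page) but does not
# detect a truncated download, since the header would still be intact.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage: is_gguf ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf || echo "not a GGUF file"
```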

> **⚠️ Use HuggingFace GGUFs, not Ollama blobs for qwen35-architecture models.**
> Ollama's converter outputs different tensor names and per-layer KV-head arrays
> that are incompatible with llama.cpp's `qwen35` model loader. Symptoms:
> `missing tensor 'blk.0.ssm_dt'`, `check_tensor_dims: wrong shape`, or
> `rope.dimension_sections has wrong array length`. Always download from
> bartowski or unsloth on HuggingFace for these models.

---

## Switching models

With router mode, switching requires **no restart and no `sudo`**. Place GGUFs
in `~/models/` and reference them by model ID in `opencode.json`.

### Add a model

```bash
# Download to ~/models/ — filename without .gguf becomes the model ID
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf
```

Then add a section to `~/models/presets.ini` (name = filename without `.gguf`):

```ini
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```

And register it in `~/.config/opencode/opencode.json`:

```json
"Qwen_Qwen3.6-27B-Q4_K_M": {
  "name": "Qwen3.6 27B Q4 (deep)",
  "tools": true,
  "limit": { "context": 16384, "output": 4096 }
}
```

### Switch active model

Edit `opencode.json` (project-level or `~/.config/opencode/opencode.json`) and
change the agent's `model` to `llama-server/<model-id>`:

```json
"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}
```

The next request triggers a cold load of the new model (~10–30s for 14B, ~30–60s
for 27B+). No service restart needed. `--models-max 1` ensures the previous
model is evicted from VRAM automatically.

To switch from the CLI without editing files:

```bash
opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"
```

### Multimodal models

For models with a separate vision encoder (mmproj), use a **subdirectory** in
`~/models/`. The directory name becomes the model ID; llama.cpp auto-detects any
file whose name starts with `mmproj` as the projector.

```
~/models/
  OmniCoder-2-9B.Q8_0/          ← model ID = "OmniCoder-2-9B.Q8_0"
    OmniCoder-2-9B.Q8_0.gguf    ← main weights
    mmproj-Q8_0.gguf             ← vision projector (auto-detected)
```
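
Setting up that layout by hand is easy to get wrong, so here is a minimal sketch that does it. `stage_multimodal` is a hypothetical helper; it assumes the model ID should be the weights filename without `.gguf`, matching the example above.

```bash
# Move a weights file and its projector into <models_dir>/<model-id>/,
# where <model-id> is the weights filename without ".gguf".
# The projector is renamed to start with "mmproj" if it does not already.
stage_multimodal() {
  local models_dir="$1" weights="$2" mmproj="$3"
  local id proj_name
  id="$(basename "$weights" .gguf)"
  proj_name="$(basename "$mmproj")"
  case "$proj_name" in
    mmproj*) ;;
    *) proj_name="mmproj-$proj_name" ;;
  esac
  mkdir -p "$models_dir/$id" &&
    mv "$weights" "$models_dir/$id/" &&
    mv "$mmproj" "$models_dir/$id/$proj_name"
}

# Usage:
#   stage_multimodal ~/models ./OmniCoder-2-9B.Q8_0.gguf ./mmproj-Q8_0.gguf
```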

### List available models

```bash
# See what's in ~/models/ (each .gguf file or subdirectory maps to a model ID)
ls ~/models/

# See what's currently loaded
curl -s http://127.0.0.1:8080/v1/models | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"

# Force a rescan (picks up newly added model files)
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"
```

### Auto-restart on presets.ini change

The router caches `presets.ini` at startup, so any edit requires a service
restart to take effect. You can automate this with a systemd **path unit** that
watches the file and triggers a restart whenever it is written:

```bash
sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server when presets.ini changes

[Path]
PathChanged=/home/dev/models/presets.ini

[Install]
WantedBy=default.target
EOF

sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server (triggered by presets.ini change)

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart llama-server
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server-presets.path
```

After this, saving `~/models/presets.ini` automatically restarts the service
(about 3 s), and the next inference request cold-loads the model with the new
settings. The restart is disruptive by design: the currently loaded model is
evicted. Enable this only if a restart on every save of `presets.ini` is
acceptable.

---

## MTP speculative decoding

Multi-Token Prediction (MTP) lets the model predict several tokens per forward
pass using draft heads baked into the model weights — no separate draft model
needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to
~25–35 tok/s on RTX 3080) while preserving output quality.

**Requirements:**

1. **b9279+ binary**: `--spec-type draft-mtp` requires build b9279 or later. Verify:
   ```bash
   /opt/llama-server/llama-server --help | grep spec-type
   # must list draft-mtp
   ```
2. **MTP-format GGUF** — standard bartowski/unsloth quants do not include MTP
   heads. Use byteshape's dedicated MTP GGUFs:
   ```bash
   # IQ3_S (13.6 GB) — best quality/size for 12 GB VRAM with slight CPU offload
   wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
     -O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf
   # IQ2_S (10 GB) — fully fits in VRAM; heavier quantization
   wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
     -O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
   ```

**`presets.ini` section for MTP:**

```ini
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
ctx-size = 32768
n-predict = 4096
spec-type = draft-mtp
spec-draft-p-min = 0.75
spec-draft-n-max = 3
```

- `spec-draft-p-min`: minimum confidence a draft token needs in order to be
  proposed. 0.75 is a good starting point; lower values draft more tokens per
  step, but more of them are rejected during verification, which costs speed
  rather than changing the output.
- `spec-draft-n-max`: maximum tokens to speculate per step. 3 is the sweet spot
  for Qwen3.6 MTP; higher values have diminishing returns and add overhead.
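
The diminishing returns from a larger `spec-draft-n-max` can be illustrated with the usual simplified model of speculative decoding: if each drafted token were accepted independently with probability α, a step that drafts n tokens would yield (1 − α^(n+1)) / (1 − α) tokens on average. This is an idealization, and α here is a hypothetical acceptance rate, not the `spec-draft-p-min` threshold. A sketch of the arithmetic:

```bash
# Expected tokens produced per verification step under the simplified
# model above. $1 = assumed per-token acceptance rate, $2 = drafted tokens.
expected_tokens() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (a >= 1) { printf "%.2f\n", n + 1; exit }   # every draft accepted
    printf "%.2f\n", (1 - a ^ (n + 1)) / (1 - a)
  }'
}

# Usage: expected_tokens 0.75 3
```

Under this model, with α = 0.75 the expected yield is 2.73 tokens per step at n = 3 and 3.29 at n = 5, so two extra drafted tokens add only about half a token.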

**Note:** ik_llama.cpp (a fork) achieves ~10–20% higher throughput with MTP than
official llama.cpp due to a more optimized MTP head implementation. Official
llama.cpp MTP is still significantly faster than non-speculative inference and
is the simpler setup.

---

## Troubleshooting

### Active model keeps resetting to the configured default

Known opencode bug [#28735](https://github.com/anomalyco/opencode/issues/28735)
(open as of May 2026): when a background subagent result is delivered back into
the main session, the active model resets to whatever `orchestrator.model` is
configured in `opencode.json`. This means any model switch made via `-m` flag or
the TUI selector gets silently reverted whenever a tool call or subagent
completes.

**Workaround:** keep `orchestrator.model` in `opencode.json` set to the model
you actually want to use. The reset lands on the configured model, so if it
matches your intent there's no observable effect.

---

### `no backends are loaded` at startup

The backend `.so` plugins must be in the same directory as the binary, or on
`LD_LIBRARY_PATH`. The `start.sh` script sets this explicitly.

### `make_cpu_buft_list: no CPU backend found`

Install `libgomp1` (OpenMP runtime — required by the CPU backend):

```bash
sudo apt-get install -y libgomp1
```

### CUDA device not found / GPU not offloading

- Confirm `/usr/lib/wsl/lib` is in `PATH` or `LD_LIBRARY_PATH` for the process
- Run `nvidia-smi` as the service user: `sudo -u ollama nvidia-smi`
- Check `journalctl -u llama-server -n 50` for lines like
  `ggml_cuda_init: CUDA not found`

### High CPU / fan noise at idle

- Remove `--no-mmap` if present (forces 9GB into RAM on startup)
- Check that `parallel` in `presets.ini` (the `--parallel` flag) isn't set high; the `1` set in `[*]` is fine for single-user use
- llama-server is permanently loaded; fans will spin during model load (~30s)
  then drop to zero at idle — this is expected behavior

### `qwen35` architecture errors (rope, tensor shape, missing tensor)

These errors all indicate an **incompatible GGUF source**:

- `rope.dimension_sections has wrong array length; expected 4, got 3` — Ollama
  stores a 3-element array; llama.cpp (before a patch) expects 4.
- `missing tensor 'blk.0.ssm_dt'` or `blk.0.ssm_dt.bias` — Ollama omits the
  `.bias` suffix that HuggingFace-converted GGUFs use (or vice versa).
- `check_tensor_dims: wrong shape` on `blk.N.attn_k.weight` — Ollama's converter
  stores `head_count_kv` as a per-layer array; llama.cpp's qwen35 model loader
  expects a scalar.

**Solution:** use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama
blobs for any `qwen35`-architecture model. See _Model choice_ above.
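
The log signatures from the troubleshooting sections above can be checked in one pass. A minimal sketch; `diagnose_log` is a hypothetical helper, and the hints are short restatements of the fixes described in this guide.

```bash
# Read log text on stdin and print one hint per known error signature.
# Signatures and hints are taken from the troubleshooting sections above.
diagnose_log() {
  local log pattern hint
  log="$(cat)"
  while IFS='|' read -r pattern hint; do
    case "$log" in
      *"$pattern"*) printf '%s\n' "$hint" ;;
    esac
  done <<'EOF' | awk '!seen[$0]++'   # drop repeated hints
no backends are loaded|backend .so files are not next to the binary or on LD_LIBRARY_PATH
make_cpu_buft_list: no CPU backend found|install libgomp1
ggml_cuda_init: CUDA not found|check /usr/lib/wsl/lib and run nvidia-smi as the service user
rope.dimension_sections has wrong array length|incompatible GGUF source; use a HuggingFace GGUF
missing tensor 'blk.0.ssm_dt|incompatible GGUF source; use a HuggingFace GGUF
check_tensor_dims: wrong shape|incompatible GGUF source; use a HuggingFace GGUF
EOF
}

# Usage: journalctl -u llama-server -n 200 --no-pager | diagnose_log
```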

### Upgrading llama.cpp (replacing binaries while service is running)

The service holds the binary open; `cp` will fail with `Text file busy`. Always
stop the service first:

```bash
sudo systemctl stop llama-server
sudo cp build/bin/llama-server /opt/llama-server/
sudo cp -P build/bin/lib*.so* /opt/llama-server/
sudo systemctl start llama-server
```

### Model file permissions (service runs as `ollama` user)

Files downloaded as your user may not be readable by the `ollama` service user,
typically because the home directory is not traversable by other users:

```bash
# Make model file readable by all
sudo chmod o+r ~/models/MyModel.gguf
# Make the directory traversable
sudo chmod o+x ~ ~/models
```
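
The most direct check is `sudo -u ollama test -r <file>`. Where switching users is not convenient, the mode bits can be inspected instead. A minimal sketch using GNU `stat`; `other_can_read` and `other_can_traverse` are hypothetical helpers, and they ignore ACLs and group membership.

```bash
# Inspect the "other" permission bits as printed by GNU stat, e.g.
# "-rw-r--r--" for a file or "drwxr-x--x" for a directory.

# True if the "other" class has read permission on the file.
other_can_read() {
  local mode
  mode="$(stat -c '%A' "$1")" || return 2
  case "$mode" in
    ???????r*) return 0 ;;
    *) return 1 ;;
  esac
}

# True if the "other" class may traverse (execute) the directory.
other_can_traverse() {
  local mode
  mode="$(stat -c '%A' "$1")" || return 2
  case "$mode" in
    ?????????[xt]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   other_can_traverse ~ && other_can_traverse ~/models &&
#     other_can_read ~/models/MyModel.gguf || echo "fix permissions"
```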