# llama-server with CUDA on WSL2

Guide to deploying `llama-server` (llama.cpp) as a systemd service on WSL2 with
full NVIDIA GPU offload via CUDA. Configured in **router mode** to serve
multiple GGUF models on-demand (with optional MTP speculative decoding) via an
OpenAI-compatible API.

**Target environment:**

- WSL2 (Ubuntu 24.04 Noble)
- NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
- No separate CUDA toolkit install required to _run_; only needed when building

---

## Why not Ollama?

Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime.
New model architectures (like `qwen35` / Qwen3-Next) may not be supported until
Ollama syncs its fork. `llama-server` from upstream llama.cpp supports them as
soon as the architecture lands in the main branch.

**Ollama does nothing special** beyond: bundling `libggml-cuda.so` alongside its
runner and setting `PATH` to include `/usr/lib/wsl/lib` (the WSL2 CUDA driver
passthrough). No flash-attention env vars, no special flags. We replicate this.

---

## Prerequisites

```bash
# Verify WSL2 CUDA driver passthrough is working
ls /usr/lib/wsl/lib/libcuda.so.1   # must exist
nvidia-smi                         # must show your GPU
```

---

## Step 1 — Install CUDA toolkit and build dependencies

> Only needed once per machine to compile llama.cpp. Not needed at runtime.

```bash
sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git
```

Ubuntu 24.04 ships CUDA 12.0 in the `multiverse` repo. This is sufficient to
build llama.cpp with CUDA support even when the runtime driver is newer (e.g.
CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from
[NVIDIA's own APT repo](https://developer.nvidia.com/cuda-downloads) to get a
more recent toolkit.

Verify the compiler is available:

```bash
nvcc --version
```

---

## Step 2 — Clone and build llama.cpp from source

```bash
# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
cd /tmp/llama-build

# Configure with CUDA backend
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF

# Build (uses all cores; takes 10-15 min on a 12-core CPU)
cmake --build build --config Release -j$(nproc)
```

After the build completes you should see `build/bin/llama-server`.

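The tag requirements noted in the clone comments (b9144+ for `qwen35`, b9279+ for MTP) can be checked mechanically before a long build. This is a minimal sketch, assuming release tags always have the form `b<number>`; `tag_at_least` is a hypothetical helper name, not part of llama.cpp.

```bash
# Hypothetical helper: succeed if build tag $1 is at least the minimum tag $2.
# Assumes both tags look like "b9279" (a "b" followed by a build number).
tag_at_least() {
  local have="${1#b}" need="${2#b}"
  [ "$have" -ge "$need" ]
}

# Example (not run here): read the tag of the shallow clone and test it.
#   tag=$(git -C /tmp/llama-build describe --tags)
#   tag_at_least "$tag" b9279 && echo "new enough for MTP speculative decoding"
```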

---

## Step 3 — Install to /opt/llama-server

```bash
sudo mkdir -p /opt/llama-server

# Copy the server binary
sudo cp build/bin/llama-server /opt/llama-server/

# Copy all shared libraries (b9144+ puts them all in build/bin/)
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
sudo cp -P build/bin/libmtmd*.so* /opt/llama-server/ 2>/dev/null || true

# Register the directory so transitive .so dependencies resolve
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
sudo ldconfig
```

> **Note (b9144+):** The library layout changed — all `.so` files now live in
> `build/bin/` (not `build/ggml/src/` or `build/src/`). When upgrading, copy
> with `-P` to preserve versioned symlinks and overwrite the old ones.

---

## Step 4 — Create the start script

Run llama-server in **router mode** — no `--model` flag. Models are loaded
on-demand from `~/models/` when a request names them. Switching models requires
no restart and no `sudo`: just change the `model` field in `opencode.json`.

```bash
sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
#!/bin/bash
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/llama-server
exec /opt/llama-server/llama-server \
  --models-dir /home/dev/models \
  --models-max 1 \
  --models-preset /home/dev/models/presets.ini \
  --host 127.0.0.1 \
  --port 8080
SCRIPT
sudo chmod +x /opt/llama-server/start.sh
```

**Key router flags:**

- `--models-dir` — directory scanned for GGUF files. Flat `.gguf` files become
  model IDs using the filename **without** `.gguf`. Subdirectories become model
  IDs using the directory name (used for multimodal models with a separate
  mmproj file — see _Multimodal models_ below).
- `--models-max 1` — only one model resident at a time. When a different model
  is requested, the current one is evicted and the new one loads (cold-start
  delay). With 12GB VRAM this is required.
- `--models-preset` — path to `presets.ini` for global defaults and per-model
  overrides. All inference flags belong here, not in `start.sh`.

**Per-model settings via `presets.ini`**

All inference flags (`ctx-size`, `n-predict`, `n-gpu-layers`, `flash-attn`,
`threads`, `parallel`, `jinja`, `spec-type`, etc.) live in
`~/models/presets.ini`, not in `start.sh`. The `[*]` section sets defaults
inherited by every model; named sections override individual keys.

Section names must match the router's model ID — the filename **without**
`.gguf`. Using the `.gguf` suffix in a section name creates a duplicate entry in
the model list.

```ini
version = 1

[*]
n-gpu-layers = 99
flash-attn = on
threads = 8
parallel = 1

[Qwen_Qwen3-14B-Q4_K_M]
ctx-size = 32768
n-predict = 4096

[OmniCoder-2-9B.Q8_0]
ctx-size = 32768
n-predict = 4096

[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```

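Because a mistyped section name fails silently (it just creates an extra entry), it can help to lint `presets.ini` against the models directory. The sketch below is an illustration under stated assumptions: `preset_sections` and `unmatched_sections` are hypothetical helper names, and the matching rule (a `<section>.gguf` file or a `<section>/` subdirectory) is taken from the description above rather than from llama.cpp's own code.

```bash
# Hypothetical helper: print every section name in a presets file, one per line.
preset_sections() {
  sed -n 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$1"
}

# Hypothetical helper: print section names with no matching model on disk.
# $1 = presets file, $2 = models directory. "[*]" (defaults) is skipped.
unmatched_sections() {
  local presets="$1" models_dir="$2" section
  preset_sections "$presets" | while IFS= read -r section; do
    [ "$section" = "*" ] && continue
    if [ ! -f "$models_dir/$section.gguf" ] && [ ! -d "$models_dir/$section" ]; then
      printf '%s\n' "$section"
    fi
  done
}
```

Running `unmatched_sections ~/models/presets.ini ~/models` should print nothing when every section lines up; a section written with the `.gguf` suffix shows up in the output.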

> **Note:** The router reads `presets.ini` **once at service startup** — it is
> not watched for changes. After editing it, run
> `sudo systemctl restart llama-server` to apply the new settings. Any
> currently-loaded model will be evicted and must cold-reload on the next
> request (~10–60 s).

**On GPU layer offload:** Hybrid inference (some layers on CPU, some on GPU) is
significantly slower than full-GPU due to CPU↔GPU memory transfers each forward
pass. For interactive use, prefer models that fit entirely in VRAM. MoE models
(like Qwen3.6-35B-A3B) are an exception — their sparse activation means active
computation per token is only ~3B parameters regardless of total model size, so
partial CPU offload is less painful than with a dense model of the same file
size. See the _Model choice_ section below.

---

## Step 5 — Create the systemd service

```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (OmniCoder 2 / qwen35)
After=network-online.target

[Service]
ExecStart=/opt/llama-server/start.sh
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```

> **Note:** The `PATH` includes `/usr/lib/wsl/lib` — this is what exposes the
> CUDA driver (`libcuda.so.1`) to the process in WSL2. Without this, the CUDA
> backend will load but fail to initialize the device.

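Since the unit's `PATH` is one long string, it is easy to drop `/usr/lib/wsl/lib` by accident when editing it. A minimal sketch of a check follows; `path_has_dir` is a hypothetical helper name, and the commented usage assumes the service is already running so that its environment can be read from `/proc`.

```bash
# Hypothetical helper: succeed if the colon-separated list $1 contains directory $2.
path_has_dir() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): inspect the PATH the running service actually received.
#   pid=$(systemctl show llama-server --property=MainPID --value)
#   svc_path=$(sudo cat "/proc/$pid/environ" | tr '\0' '\n' | sed -n 's/^PATH=//p')
#   path_has_dir "$svc_path" /usr/lib/wsl/lib || echo "/usr/lib/wsl/lib missing from PATH"
```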

---

## Step 6 — Verify GPU offload

```bash
# Check service is running
systemctl status llama-server

# Health endpoint
curl -s http://127.0.0.1:8080/health
# → {"status":"ok"}

# Watch GPU memory in another terminal during a request
watch -n1 nvidia-smi

# Quick inference test (in router mode, "model" must be a real model ID)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | python3 -m json.tool
```

During inference, `nvidia-smi` should show:

- GPU-Util: 80-100%
- GPU Memory: ~10-11GB used (model weights + KV cache)
- CPU: near idle

```bash
# Quick inference test (node instead of python3)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
  | node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
```

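Right after `systemctl start` (or a restart) the health endpoint may not answer yet, so scripted checks usually need to wait for it. The sketch below polls until the response contains `"status":"ok"`, the body shown above. `wait_for_health` is a hypothetical helper name, and the exact-string match assumes the server keeps that compact JSON formatting.

```bash
# Hypothetical helper: poll the health endpoint until it reports "ok".
# $1 = URL, $2 = maximum number of attempts (about one per second).
wait_for_health() {
  local url="$1" tries="$2" i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s "$url" | grep -q '"status":"ok"'; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (not run here):
#   wait_for_health http://127.0.0.1:8080/health 30 && echo "server is up"
```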

---

## Step 7 — Configure OpenCode

Edit `~/.config/opencode/opencode.json` to add the provider. Model IDs are the
filenames **without** `.gguf` (or the subdirectory name for multimodal models).
The `limit` values here inform opencode's context window tracking; the actual
server-side limits are set in `presets.ini`.

```json
"llama-server": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "llama-server",
  "options": { "baseURL": "http://127.0.0.1:8080/v1" },
  "models": {
    "Qwen_Qwen3-14B-Q4_K_M": {
      "name": "Qwen3 14B Q4 (fast)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen_Qwen3.6-27B-Q4_K_M": {
      "name": "Qwen3.6 27B Q4 (deep)",
      "tools": true,
      "limit": { "context": 16384, "output": 4096 }
    },
    "OmniCoder-2-9B.Q8_0": {
      "name": "OmniCoder 2 9B Q8 (vision)",
      "tools": true,
      "limit": { "context": 32768, "output": 4096 }
    },
    "Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
      "name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
      "tools": true,
      "limit": { "context": 8192, "output": 4096 }
    }
  }
}
```

In the project-level `opencode.json`, set the active model per agent:

```json
"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}
```

---

## Model choice for RTX 3080 12GB

Pick based on what fits **entirely** in VRAM — hybrid inference (model too large
for VRAM) is 4–8× slower and makes interactive use painful. MoE models are an
exception; see note below the table.

| Model                           | File size | Fits in 12GB? | Speed (est.)  | Notes                                                                                    |
| ------------------------------- | --------- | ------------- | ------------- | ---------------------------------------------------------------------------------------- |
| Qwen3-8B Q4_K_M                 | ~5 GB     | ✅ fully      | ~25–35 tok/s  | Fast; weaker reasoning                                                                   |
| **Qwen3-14B Q4_K_M**            | ~8.5 GB   | ✅ fully      | ~12–18 tok/s  | **Daily driver** — fast interactive use, good instruction following                      |
| OmniCoder-2-9B Q8_0             | ~9.5 GB   | ✅ fully      | ~15–20 tok/s  | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj                |
| **Qwen3.6-27B Q4_K_M**          | 17 GB     | ⚠️ partial    | ~4–8 tok/s    | **Deep reasoning** — better at vague/complex tasks; slow due to CPU offload              |
| **Qwen3.6-35B-A3B IQ3_S (MTP)** | 13.6 GB   | ⚠️ partial    | ~20–35 tok/s† | **MoE + MTP** — sparse activation (~3B active params); needs MTP-format GGUF (byteshape) |
| Qwen3-32B Q4_K_M                | ~20 GB    | ❌            | —             | Won't fit                                                                                |

† MoE speed estimate with `--spec-type draft-mtp`. Despite 13.6 GB file size,
only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The
sparse feed-forward experts make active-parameter compute comparable to a 3B
dense model.

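The ~1.6 GB figure in the footnote is simply file size minus VRAM (13.6 GB − 12 GB). The sketch below applies the same subtraction to any row of the table. `vram_overflow_gb` is a hypothetical helper name, and the result is only a lower bound, since it ignores the KV cache and runtime buffers that also occupy VRAM.

```bash
# Hypothetical helper: rough lower bound on how much of a model spills to CPU.
# $1 = GGUF file size in GB, $2 = VRAM in GB. Ignores KV cache and buffers.
vram_overflow_gb() {
  awk -v file="$1" -v vram="$2" 'BEGIN {
    over = file - vram
    if (over < 0) over = 0
    printf "%.1f\n", over
  }'
}

# Example (not run here), using file sizes from the table above:
#   vram_overflow_gb 13.6 12   # Qwen3.6-35B-A3B IQ3_S
#   vram_overflow_gb 17 12     # Qwen3.6-27B Q4_K_M
```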

All models sit in `~/models/` simultaneously and are swapped on-demand by the
router. Cold-swap time is ~10s (9–14B) / ~30–45s (27B+).

Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):

```bash
mkdir -p ~/models
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf
```

> **⚠️ Use HuggingFace GGUFs, not Ollama blobs for qwen35-architecture models.**
> Ollama's converter outputs different tensor names and per-layer KV-head arrays
> that are incompatible with llama.cpp's `qwen35` model loader. Symptoms:
> `missing tensor 'blk.0.ssm_dt'`, `check_tensor_dims: wrong shape`, or
> `rope.dimension_sections has wrong array length`. Always download from
> bartowski or unsloth on HuggingFace for these models.

---

## Switching models

With router mode, switching requires **no restart and no `sudo`**. Place GGUFs
in `~/models/` and reference them by model ID in `opencode.json`.

### Add a model

```bash
# Download to ~/models/ — filename without .gguf becomes the model ID
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
  -O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf
```

Then add a section to `~/models/presets.ini` (name = filename without `.gguf`):

```ini
[Qwen_Qwen3.6-27B-Q4_K_M]
ctx-size = 16384
n-predict = 4096
```

And register it in `~/.config/opencode/opencode.json`:

```json
"Qwen_Qwen3.6-27B-Q4_K_M": {
  "name": "Qwen3.6 27B Q4 (deep)",
  "tools": true,
  "limit": { "context": 16384, "output": 4096 }
}
```

### Switch active model

Edit `opencode.json` (project-level or `~/.config/opencode/opencode.json`) and
change the agent's `model` to `llama-server/<model-id>`:

```json
"agent": {
  "orchestrator": {
    "model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
  }
}
```

The next request triggers a cold load of the new model (~10–30s for 14B, ~30–60s
for 27B+). No service restart needed. `--models-max 1` ensures the previous
model is evicted from VRAM automatically.

To switch from the CLI without editing files:

```bash
opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"
```

### Multimodal models

For models with a separate vision encoder (mmproj), use a **subdirectory** in
`~/models/`. The directory name becomes the model ID; llama.cpp auto-detects any
file whose name starts with `mmproj` as the projector.

```
~/models/
  OmniCoder-2-9B.Q8_0/            ← model ID = "OmniCoder-2-9B.Q8_0"
    OmniCoder-2-9B.Q8_0.gguf      ← main weights
    mmproj-Q8_0.gguf              ← vision projector (auto-detected)
```

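To confirm that a model subdirectory follows this layout before pointing the router at it, the naming rule can be mirrored with `find`. This is a sketch only: `find_mmproj` is a hypothetical helper name, and it reproduces the "name starts with `mmproj`" rule as described above rather than llama.cpp's actual detection code.

```bash
# Hypothetical helper: list files in model directory $1 whose name starts
# with "mmproj", i.e. the files the naming rule above would treat as projectors.
find_mmproj() {
  find "$1" -maxdepth 1 -type f -name 'mmproj*' | sort
}
```

For the layout shown above, `find_mmproj ~/models/OmniCoder-2-9B.Q8_0` should print a single path; no output means the projector file is missing or misnamed.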
### List available models

```bash
# See what's in ~/models/ (all are immediately usable as model IDs)
ls ~/models/

# See what's currently loaded
curl -s http://127.0.0.1:8080/v1/models | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"

# Force a rescan (picks up newly added model files)
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
  "process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"
```

### Auto-restart on presets.ini change

The router caches `presets.ini` at startup, so any edit requires a service
restart to take effect. You can automate this with a systemd **path unit** that
watches the file and triggers a restart whenever it is written:

```bash
sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server when presets.ini changes

[Path]
PathChanged=/home/dev/models/presets.ini

[Install]
WantedBy=default.target
EOF

sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
[Unit]
Description=Restart llama-server (triggered by presets.ini change)

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart llama-server
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server-presets.path
```

After this, saving `~/models/presets.ini` automatically restarts the service (~3
s) and the next inference request cold-loads the model with the new settings.
The restart is intentionally disruptive — the currently-loaded model is evicted
— so only enable this if disruptive restarts on every presets save are
acceptable.

---

## MTP speculative decoding

Multi-Token Prediction (MTP) lets the model predict several tokens per forward
pass using draft heads baked into the model weights — no separate draft model
needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to
~25–35 tok/s on RTX 3080) while preserving output quality.

**Requirements:**

1. **b9279+ binary** — `--spec-type draft-mtp` was added in this era. Verify:

   ```bash
   /opt/llama-server/llama-server --help | grep spec-type
   # must list draft-mtp
   ```

2. **MTP-format GGUF** — standard bartowski/unsloth quants do not include MTP
   heads. Use byteshape's dedicated MTP GGUFs:

   ```bash
   # IQ3_S (13.6 GB) — best quality/size for 12 GB VRAM with slight CPU offload
   wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
     -O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf

   # IQ2_S (10 GB) — fully fits in VRAM; heavier quantization
   wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
     -O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
   ```

**`presets.ini` section for MTP:**

```ini
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
ctx-size = 32768
n-predict = 4096
spec-type = draft-mtp
spec-draft-p-min = 0.75
spec-draft-n-max = 3
```

- `spec-draft-p-min` — minimum draft token acceptance probability. 0.75 is a
  good starting point; lower values accept more speculative tokens (faster but
  may diverge from non-speculative output).
- `spec-draft-n-max` — maximum tokens to speculate per step. 3 is the sweet spot
  for Qwen3.6 MTP; higher values have diminishing returns and add overhead.

**Note:** ik_llama.cpp (a fork) achieves ~10–20% higher throughput with MTP than
official llama.cpp due to a more optimized MTP head implementation. Official
llama.cpp MTP is still significantly faster than non-speculative inference and
is the simpler setup.

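The throughput figures above are estimates, so it is worth measuring on your own hardware with and without `spec-type = draft-mtp`. The sketch below only does the division; `tokens_per_second` is a hypothetical helper name. The commented usage assumes the chat completion response carries a `usage.completion_tokens` field, and that the model is already loaded so that cold-load time is not counted.

```bash
# Hypothetical helper: generated tokens divided by elapsed seconds.
# $1 = number of completion tokens, $2 = wall-clock seconds (must be > 0).
tokens_per_second() {
  awk -v n="$1" -v s="$2" 'BEGIN {
    if (s <= 0) exit 1
    printf "%.1f\n", n / s
  }'
}

# Example (not run here): 300 tokens generated in 12 seconds.
#   tokens_per_second 300 12
# Wall-clock time also includes prompt processing, so treat the result as a
# conservative figure and compare runs that use the same prompt.
```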

---

## Troubleshooting

### Active model keeps resetting to the configured default

Known opencode bug [#28735](https://github.com/anomalyco/opencode/issues/28735)
(open as of May 2026): when a background subagent result is delivered back into
the main session, the active model resets to whatever `orchestrator.model` is
configured in `opencode.json`. This means any model switch made via `-m` flag or
the TUI selector gets silently reverted whenever a tool call or subagent
completes.

**Workaround:** keep `orchestrator.model` in `opencode.json` set to the model
you actually want to use. The reset lands on the configured model, so if it
matches your intent there's no observable effect.

---

### `no backends are loaded` at startup

The backend `.so` plugins must be in the same directory as the binary, or on
`LD_LIBRARY_PATH`. The `start.sh` script sets this explicitly.

### `make_cpu_buft_list: no CPU backend found`

Install `libgomp1` (OpenMP runtime — required by the CPU backend):

```bash
sudo apt-get install -y libgomp1
```

### CUDA device not found / GPU not offloading

- Confirm `/usr/lib/wsl/lib` is in `PATH` or `LD_LIBRARY_PATH` for the process
- Run `nvidia-smi` as the service user: `sudo -u ollama nvidia-smi`
- Check `journalctl -u llama-server -n 50` for lines like
  `ggml_cuda_init: CUDA not found`

### High CPU / fan noise at idle

- Remove `--no-mmap` if present (forces 9GB into RAM on startup)
- Check that `parallel` in `presets.ini` (the `--parallel` flag) isn't set high
  (the value 1 used in this guide is fine for single-user use)
- llama-server is permanently loaded; fans will spin during model load (~30s)
  then drop to zero at idle — this is expected behavior

### `qwen35` architecture errors (rope, tensor shape, missing tensor)

These errors all indicate an **incompatible GGUF source**:

- `rope.dimension_sections has wrong array length; expected 4, got 3` — Ollama
  stores a 3-element array; llama.cpp (before a patch) expects 4.
- `missing tensor 'blk.0.ssm_dt'` or `blk.0.ssm_dt.bias` — Ollama omits the
  `.bias` suffix that HuggingFace-converted GGUFs use (or vice versa).
- `check_tensor_dims: wrong shape` on `blk.N.attn_k.weight` — Ollama's converter
  stores `head_count_kv` as a per-layer array; llama.cpp's qwen35 model loader
  expects a scalar.

**Solution:** use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama
blobs for any `qwen35`-architecture model. See _Model choice_ above.

### Upgrading llama.cpp (replacing binaries while service is running)

The service holds the binary open; `cp` will fail with `Text file busy`. Always
stop the service first:

```bash
sudo systemctl stop llama-server
sudo cp build/bin/llama-server /opt/llama-server/
sudo cp -P build/bin/lib*.so* /opt/llama-server/
sudo systemctl start llama-server
```

### Model file permissions (service runs as `ollama` user)

Files downloaded as your user aren't readable by the `ollama` service user:

```bash
# Make model file readable by all
sudo chmod o+r ~/models/MyModel.gguf
# Make the directory traversable
sudo chmod o+x ~ ~/models
```

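To see which files or directories still block the service user, the permission bits can be checked directly. This is a minimal sketch assuming GNU `stat` (as shipped with Ubuntu); `others_can_read` and `others_can_traverse` are hypothetical helper names.

```bash
# Hypothetical helper: succeed if "others" have read permission on $1.
# stat -c '%A' prints a string such as "-rw-r--r--"; character 8 is others-read.
others_can_read() {
  case "$(stat -c '%A' "$1")" in
    ???????r??*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeed if "others" may traverse directory $1
# (character 10 is others-execute; "t" means execute plus the sticky bit).
others_can_traverse() {
  case "$(stat -c '%A' "$1")" in
    ?????????[xt]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): report model files the service user cannot read.
#   for f in ~/models/*.gguf; do others_can_read "$f" || echo "not readable: $f"; done
```

The same check applies to `~/models/presets.ini`, which the service also has to read at startup.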