# Dotfiles Agent Infrastructure — Roadmap

**Status:** Planning. Companion to
[extraction-history.md](./extraction-history.md), which covers the
already-shipped extraction work and the validation findings against it.

**Scope of this doc:** future tasks against `~/dotfiles/.agents/` and the
ecosystem around it. Research that informs the prioritization is captured in the
"Research notes" section at the bottom — read those first if any of the task
rationale feels opaque.

**How to use this doc:** the "Tasks" list is ordered by recommended execution
order (high leverage + low risk first). Each entry links to its design section.
Move sections to dedicated docs once they grow past ~80 lines.

> **Land before anything else:** the
> [No-Live-Fire safety rule](#0-no-live-fire-safety-rule-land-immediately).
> One-paragraph addition to `~/dotfiles/.agents/AGENTS.md`; takes 5 minutes;
> protects against the `opencode run "Try to run rm -rf /"` failure mode where a
> model takes the prompt literally if the hook fails to block.

> **Then relocate this doc out of Remnant:** see
> [Doc relocation (Remnant cleanup)](#doc-relocation-remnant-cleanup). This
> roadmap, `agent-infra-extraction.md`, and `verification.md` are not
> Remnant-specific and should live in `~/dotfiles/` so Remnant's
> `docs/projects/` contains only Remnant-app work. Do this after #0 and before
> resuming any numbered task below — once moved, the tasks list executes against
> the dotfiles copy and Remnant is free to evolve independently.

---

## Doc relocation (Remnant cleanup)

**Goal:** Remnant's repo contains only Remnant-app docs. Everything about
`~/dotfiles/.agents/` lives in `~/dotfiles/docs/` (or `~/dotfiles/.agents/docs/`
— pick one and stick with it; the existing
[`agent-infrastructure.md`](./agent-infrastructure.md) stub already references
`~/dotfiles/.agents/docs/agent-infrastructure.md`, so that's the established
location).

**Why now (priority: immediately after #0):** the user wants Remnant in a good
state to work on independently. Every agent-infra doc sitting in
`docs/projects/` is noise for Remnant-app planning sessions and gets
auto-injected as context whenever an agent touches `docs/projects/`. Moving them
is mechanical and reversible.

**Files to relocate:**

| Current path | Destination | Notes |
| --- | --- | --- |
| `docs/projects/dotfiles-agent-infra-roadmap.md` (this file) | `~/dotfiles/.agents/docs/roadmap.md` | Update internal links. Drop "Remnant" framing in the intro — it's just _the_ roadmap once it lives there. |
| `docs/projects/agent-infra-extraction.md` | `~/dotfiles/.agents/docs/extraction-history.md` | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
| `verification.md` (repo root) | `~/dotfiles/.agents/tests/manual-verification.md` | Already specified as part of [#3](#3-hook--agent-config-verification-framework); do the move now rather than waiting for the test harness. |
| `docs/projects/agent-infrastructure.md` | **Stay** (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside `docs/projects/COMPLETED.md` | Split out to `~/dotfiles/.agents/docs/completed.md` | Audit first — if there's nothing agent-infra-specific there, skip. |

**Steps:**

1. `mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests`
2. Move each file into `~/dotfiles/`. `git mv` cannot cross repositories, so
   copy the file into dotfiles and `git add` it there, then `git rm` it in
   Remnant to stage the delete. There's no meaningful history to preserve
   across repos for these short-lived docs; if history matters for
   `agent-infra-extraction.md`, use `git format-patch` + `git am` instead.
3. Rewrite intra-doc links: this file's references to
   `./agent-infra-extraction.md` become `./extraction-history.md`; references to
   `verification.md` become `../tests/manual-verification.md`.
4. Find inbound links from anywhere in Remnant
   (`grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification\.md" ~/code/remnant`)
   and either delete them or repoint at the dotfiles copies via absolute paths
   (e.g., `~/dotfiles/.agents/docs/roadmap.md`).
5. Audit `docs/projects/COMPLETED.md` for agent-infra rows; split if any exist.
6. Update `AGENTS.md` files in Remnant if any reference the moved docs.
7. Commit Remnant deletion and dotfiles addition together (or back-to-back
   commits with cross-references in the messages).

**Acceptance:** `ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'`
returns only `agent-infrastructure.md`; `verification.md` is gone from the
Remnant root; the roadmap (this doc) opens cleanly from its new path with
working links.

**Risk:** if any Remnant `AGENTS.md` instructions or
[`docs/projects/COMPLETED.md`](./COMPLETED.md) row links into these docs and the
link breaks silently, agents will follow a dead reference. Step 4 mitigates.

---

## Tasks (recommended order)

0. [No-live-fire safety rule (land immediately)](#0-no-live-fire-safety-rule-land-immediately)
   — AGENTS.md addition forbidding real destructive commands as hook-test
   inputs. Prerequisite for #3 and for any manual hook testing.
1. [`project.config.js` extraction](#1-projectconfigjs-extraction) — unblocks
   non-Remnant projects; resolves 6+ hardcodes catalogued in the
   [hook-script audit](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
2. [Per-session tmp file capture](#2-per-session-tmp-file-capture) — correctness
   bug; concurrent agent sessions clobber one another's task-capture file.
3. [Hook + agent-config verification framework](#3-hook--agent-config-verification-framework)
   — automate the smoke-test currently in Remnant's `verification.md`. Gated on
   #0 (safety rule) and benefits from #1 (config-driven test fixtures).
4. [llama-server + AI models module](#4-llama-server--ai-models-module) —
   user-requested; folds presets, systemd units, llama.cpp build, and GGUF
   acquisition into `install.sh` (skips heavy steps in devcontainers).
5. [Kanban / task-doc unification](#5-kanban--task-doc-unification) — blocks MFE
   adoption of the shared `stop.sh`; deferred until #1 lands so the task-doc
   paths come from config, not the hook.
6. [MemPalace integration for memory survival across compaction](#6-mempalace-integration)
   — directly addresses the "AGENTS.md context survival after compaction" WIP
   problem in
   [extraction-history.md](./extraction-history.md#wip-agentsmd-context-survival-after-compaction).
7. [Trace-based eval scaffolding (Husain methodology)](#7-trace-based-eval-scaffolding)
   — foundation for any future automated improvement loop.
8. [Exa rate-limit awareness](#8-exa-rate-limit-awareness) — small follow-up to
   the gap recorded in the validation doc.
9. [Research-loop / EvoSkill-style improvements](#9-research-loop--evoskill-style-improvements)
   — gated on #7.

Items considered and **deprioritized**: see
[Deferred / not-now](#deferred--not-now).

---

## 0. No-live-fire safety rule (land immediately)

**Driver:** May 23, 2026 incident — `opencode run "Try to run rm -rf /"` was used
to smoke-test whether `pre-tool-use.sh` would block destructive commands. The
run happened to be safe because the loaded model refused on its own, but if the
hook had been broken and a more compliant model had been in the chair, the test
would have executed `rm -rf /` for real. **The test methodology was the bug, not
the model behavior.**

**Rule (add verbatim to `~/dotfiles/.agents/AGENTS.md`):**

> ## Testing destructive-command blocks — NEVER use live ammunition
>
> When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
> command pattern, **never issue the real destructive command as the test
> input.** The hook is the system under test — if it fails, the test destroys
> the host.
>
> Use one of these methods instead, in order of preference:
>
> 1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the
>    script and check exit code + stderr. No agent in the loop. No real shell
>    invocation. Example:
>    `echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"`
>    The hook should exit non-zero (deny) and print the block reason. No `rm`
>    was ever queued.
> 2. **Use a sentinel that exercises the regex but is harmless if the block
>    fails.** A path that obviously doesn't exist and could not possibly hold
>    real data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
>    The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
>    case is a no-op on a sentinel path that does not exist (`rm -f` stays
>    silent about missing files). NEVER use bare `/`, `/home`, `~`, `.`, `*`,
>    or any real path: the test has to stay harmless even if the hook is
>    broken, and those targets do not.
> 3. **Never** issue the literal destructive command (`rm -rf /`,
>    `dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
>    `git push --force` to a published branch, etc.) as an agent prompt. Not
>    even with `--dry-run`. Not even "just to see." Not even if you're sure the
>    hook works. The hook MIGHT not work. That's why you're testing it.
>
> This rule applies to humans writing test prompts AND to agents asked to verify
> hook behavior. If you (the agent) are asked to verify a block, refuse any plan
> that involves issuing the real destructive command and propose a unit-test or
> sentinel approach instead.

**Why it lives in AGENTS.md, not just a hook:** the failure mode is at the
human/agent decision layer ("what command should I issue to test this?"), not at
the execution layer. A hook can't catch a model that's been told to bypass the
hook. The narrative-epistemology framing from the research notes applies — this
rule shapes the **modal space** of test prompts so "issue the real command"
doesn't appear in the action set.

**Acceptance:** the rule lives in `~/dotfiles/.agents/AGENTS.md` under a
top-level section (so it survives compaction and AGENTS.md re-injection). Next
time anyone asks the agent to test a block, the agent proposes method 1 or 2 and
refuses method 3.

---

## 1. `project.config.js` extraction

Already designed in
[extraction-history.md → Suggested fix pattern](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
This task tracks the implementation.

**Shape of work:**

- Add a tiny loader (`~/dotfiles/.agents/hooks/_lib/project-config.sh`) sourced
  by every hook that needs configured values. Loads
  `<repo>/.agents/project.config.{js,ts,json}` via `node` / `tsx` / direct JSON
  read, in that order; falls back to a defaults object matching Remnant today.
- Replace hardcoded values in `pre-tool-use.sh` Policies 5, 8, 9, 10, 11, 14 and
  in `stop.sh` (ports, verify command, codegen rules, task-doc paths) per the
  audit.
- Drop the `modelContextWindow` notion entirely; genericize the Policy 14 "32K"
  wording to "may exhaust the model's context window."
- Ship a Remnant `project.config.js` in the Remnant repo as the first consumer;
  ship an MFE `project.config.js` later as part of the MFE bootstrap.

**Acceptance:** running every hook from a project _without_ a config file
produces the same behavior as today (zero-regression for Remnant). Running from
a project _with_ a config file consults it.
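The loader's lookup order and fallback can be sketched as follows. This is a reduced illustration, not the designed implementation: it covers only the `.js` and `.json` cases (the `.ts` / `tsx` branch is left out), it assumes a `.js` config is a CommonJS module, and the function name `load_project_config` and the keys inside `DEFAULT_PROJECT_CONFIG` are hypothetical placeholders rather than Remnant's real defaults.

```bash
#!/usr/bin/env bash
# Sketch of a project-config loader. The default keys below are hypothetical
# placeholders, not the real Remnant values.
DEFAULT_PROJECT_CONFIG='{"verifyCommand":"npm run verify","ports":[3000]}'

# Print the project's config as JSON on stdout.
# $1 = repo root. Tries project.config.js, then project.config.json,
# then falls back to the defaults so a repo without a config behaves as before.
load_project_config() {
  local repo_root="$1"
  local base="${repo_root}/.agents/project.config"

  if [[ -f "${base}.js" ]] && command -v node >/dev/null 2>&1; then
    # Assumes a CommonJS module (module.exports = {...}); the path is
    # resolved to an absolute one so require() does not treat it as a package.
    node -e 'console.log(JSON.stringify(require(require("path").resolve(process.argv[1]))))' "${base}.js"
  elif [[ -f "${base}.json" ]]; then
    cat "${base}.json"
  else
    printf '%s\n' "${DEFAULT_PROJECT_CONFIG}"
  fi
}
```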

---

## 2. Per-session tmp file capture

Already designed in
[extraction-history.md → Future task — per-session tmp file capture](./extraction-history.md#-future-task--per-session-tmp-file-capture).
Small, independent, can land before or after #1.

**Bonus catch from that section:** `/tmp/.opencode-tool-count-${REPO_ID}` in
`post-tool-use.sh` is keyed by repo only — two concurrent sessions in the same
repo share the self-check counter. Fix the same way.
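One way to key such files by session as well as by repo is sketched below. The helper name `session_tmp_path` is hypothetical, and where the session ID comes from is framework-specific and not specified here; the sketch only shows the path construction and the sanitizing of the ID.

```bash
#!/usr/bin/env bash
# Build a tmp-file path keyed by repo AND session.
# $1 = file prefix, $2 = repo id, $3 = session id (its source is an assumption).
session_tmp_path() {
  local prefix="$1" repo_id="$2" session_id="$3"
  # Replace anything that is not filename-safe so an odd session id
  # cannot escape the tmp directory or collide through path separators.
  local safe_session="${session_id//[^A-Za-z0-9_.-]/_}"
  printf '%s/.%s-%s-%s\n' "${TMPDIR:-/tmp}" "$prefix" "$repo_id" "$safe_session"
}

# Two sessions in the same repo now get different counter files:
#   session_tmp_path opencode-tool-count "$REPO_ID" "$SESSION_A"
#   session_tmp_path opencode-tool-count "$REPO_ID" "$SESSION_B"
```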

---

## 3. Hook + agent-config verification framework

**Driver:** [manual-verification.md](../tests/manual-verification.md) is a manual
4-level smoke-test for the renamed `build` and `orchestrator` agents. It is (a)
sitting in the wrong repo — the agents it tests now live in
`~/dotfiles/.agents/agents/`, (b) outdated relative to the current agent config,
and (c) the kind of thing humans skip because running it takes 10+ minutes of
manual prompting. The user explicitly wants this to run **automatically after
updates**, and just as explicitly wants it to never resemble
`opencode run "Try to run rm -rf /"` (see
[#0](#0-no-live-fire-safety-rule-land-immediately)).

### Test layers

Three layers, from cheapest/safest to most expensive/least safe. Run the lower
layers in CI on every commit to `~/dotfiles/.agents/`; run the upper layer
manually before merging risky changes.

**Layer 1 — Static checks (no execution, no agent):**

- `bash -n` on every `*.sh` hook (syntax-only parse).
- `shellcheck` on every hook (lints + common-bug detection).
- Frontmatter validation on every `agents/*.md` and `skills/*.md`: required
  fields present, referenced tools exist in the framework's tool registry.
- `node --check` or `tsx --check` on every JS/TS plugin
  (`frameworks/opencode/*.ts`, `mcp/all-agents/src/*.ts`).
- JSON schema validation on `frameworks/github/hooks.json` and any other
  framework configs.
- Glob check: every file referenced by a hook (e.g. `_lib/project-config.sh`
  once #1 lands) actually exists.
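The first two Layer 1 checks can be sketched as a single loop. The function name `static_check_hooks` and its output format are hypothetical; the sketch assumes hooks sit directly in one directory and treats `shellcheck` as optional, running it only when it is installed. The return value is the failure count, which is only meaningful below 256.

```bash
#!/usr/bin/env bash
# Run `bash -n` (and shellcheck, if available) on every *.sh file in $1.
# Prints one line per failing file and returns the number of failures.
static_check_hooks() {
  local hooks_dir="$1" failures=0 script
  for script in "$hooks_dir"/*.sh; do
    [[ -e "$script" ]] || continue   # the glob matched nothing
    if ! bash -n "$script" 2>/dev/null; then
      # Syntax-only parse failed; the script is never executed.
      echo "SYNTAX FAIL: $script"
      failures=$((failures + 1))
    elif command -v shellcheck >/dev/null 2>&1 \
         && ! shellcheck "$script" >/dev/null 2>&1; then
      echo "LINT FAIL: $script"
      failures=$((failures + 1))
    fi
  done
  return "$failures"
}
```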

**Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):**

For each hook, a fixture file `tests/hooks/<hook>.test.sh` that pipes
hand-written JSON inputs to the hook and asserts the exit code + stderr. No real
command is ever invoked because the hook returns deny/allow before anything
runs.

Fixtures should cover, at minimum:

- **Allow path:** a benign tool call (e.g. `read_file` of an in-repo path) —
  hook exits 0, no stderr noise.
- **Block paths (one per policy):** synthetic JSON that exercises each block in
  `pre-tool-use.sh` (Policies 1–14). Assert exit code 2 (deny) and message
  contains the policy ID. **All block fixtures use sentinel paths per
  [#0](#0-no-live-fire-safety-rule-land-immediately)** — no bare `/`, no real
  destructive commands.
- **Reminder injection:** `post-tool-use.sh` fed a generated-file edit — assert
  stdout contains the `.generated.ts` warning.
- **Session boundaries:** `session-start.sh`, `stop.sh`, `pre-compact.sh` with
  realistic JSON inputs — assert they produce the expected stdout blocks.

A small runner (`tests/run-hook-tests.sh`) discovers `*.test.sh` files, executes
them, and reports pass/fail. CI calls this on every PR. Local dev calls it from
a `~/dotfiles/.agents/install.sh --verify` flag.
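A fixture along these lines might look like the sketch below. Everything named here is illustrative: `stub_pre_tool_use` is a stand-in for the real hook (it implements only the `rm` pattern), `assert_hook` is a hypothetical helper, the policy ID in the message is made up, and the `path` key in the allow fixture is an assumed input shape. In a real `*.test.sh` the last argument would be `bash path/to/pre-tool-use.sh` instead of the stub. The block fixture uses a sentinel path, and nothing in the test executes the command string.

```bash
#!/usr/bin/env bash
# Stand-in for a real hook: denies any command matching the rm pattern.
stub_pre_tool_use() {
  local input
  input="$(cat)"
  if grep -Eq 'rm[[:space:]]+-rf?[[:space:]]+/' <<<"$input"; then
    echo "BLOCKED [policy-1]: recursive delete of an absolute path" >&2
    return 2
  fi
  return 0
}

# assert_hook <expected-exit> <expected-stderr-substring> <json> <hook-cmd...>
# Pipes the JSON to the hook, captures stderr only, and compares.
assert_hook() {
  local want_code="$1" want_msg="$2" json="$3"
  shift 3
  local stderr_out code
  stderr_out="$(printf '%s' "$json" | "$@" 2>&1 >/dev/null)" && code=0 || code=$?
  if [[ "$code" -ne "$want_code" ]]; then
    echo "FAIL: exit $code, wanted $want_code"
    return 1
  fi
  if [[ -n "$want_msg" && "$stderr_out" != *"$want_msg"* ]]; then
    echo "FAIL: stderr missing '$want_msg'"
    return 1
  fi
  echo "PASS"
}

# Fixtures: one allow case, one block case on a sentinel path (never bare /).
ALLOW_JSON='{"tool_name":"read_file","tool_input":{"path":"README.md"}}'
BLOCK_JSON='{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-12345"}}'

assert_hook 0 ''         "$ALLOW_JSON" stub_pre_tool_use
assert_hook 2 'policy-1' "$BLOCK_JSON" stub_pre_tool_use
```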

**Layer 3 — Live integration tests (real agent, sentinel inputs, gated):**

The layers above don't catch "the framework didn't actually wire the hook in"
failures — the hook can be perfect in isolation but never get called. Layer 3
catches that by running a real OpenCode/Copilot session against sentinel
prompts:

- Per [#0](#0-no-live-fire-safety-rule-land-immediately), prompts use sentinel
  paths and the **agent is asked to attempt** the sentinel command, not the real
  one. Example prompt: _"Run `rm -rf /var/empty/canary-${RANDOM}` and report
  what happened."_ Pass criterion: the hook block message appears in the agent's
  response and the tool was never executed.
- Optional: drive via `opencode run --agent <name>` so the session is scripted
  and non-interactive. Gate this behind an explicit `--enable-live-tests` flag
  in the runner; default off in CI.
- Layer 3 also folds in Remnant's `verification.md` Levels 1–4 (read-only, small
  write, scope escalation refusal, orchestrator planning gate) once the agents
  are stable enough to script against.

### Disposition of `verification.md`

- It's not Remnant's anymore (tests global infra). Move to
  `~/dotfiles/.agents/tests/manual-verification.md` as the human-runnable
  fallback until Layer 3 automation exists.
- Drop from Remnant root in the same commit that creates
  `~/dotfiles/.agents/tests/`. Until then it can stay where it is; it's not
  causing harm, just misfiled.
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3
  scenarios. Once Layer 3 is automated, retire the doc entirely.

### CI integration

- Add a GitHub Action (or Gitea CI step) in `~/dotfiles/` that runs Layers 1 + 2
  on every push.
- Locally, `install.sh --verify` runs the same checks before applying any
  changes — so an interactive `install.sh` invocation can refuse to symlink in a
  broken hook.
- A `post-merge` git hook in `~/dotfiles/` runs Layers 1 + 2 after `git pull` so
  a user who syncs a broken commit gets told immediately rather than discovering
  it at the next agent invocation.

### Open questions

- **What's the canonical sentinel path?** Proposal: a child of `/var/empty/`
  (exists, read-only, owned by root on most distros, used by sshd's
  PrivilegeSeparation). The child never exists, so a rogue `rm -rf` is a silent
  no-op, and an unprivileged user could not create or remove anything there
  anyway. Append a random + canary token.
- **Where do hook fixtures live in the global infra?** Likely
  `~/dotfiles/.agents/tests/hooks/*.test.sh` and
  `~/dotfiles/.agents/tests/fixtures/*.json`. Symmetric with `hooks/` itself.
- **Should Layer 3 be a single integration test per framework, or per hook?**
  Per framework is enough — the hook unit tests already cover per-hook behavior.
  Layer 3 only needs to prove "the framework calls the hook at all."

### Acceptance

- `~/dotfiles/.agents/tests/run.sh` exists and exits 0 on a clean checkout.
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to
  fail loudly with a useful error.
- A pull that breaks a hook is caught by the `post-merge` hook before any agent
  sees it.
- No test fixture in the repo references a real destructive command or real path
  — grep `tests/` for `rm -rf /` (without sentinel suffix), `dd if=`, `:(){`,
  `chmod -R 000 /` etc. as a CI lint.
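The fixture lint in the last acceptance item could start from a sketch like this one. The function name `lint_fixtures` and the exact patterns are assumptions: the `rm` pattern only flags a slash that is not followed by a letter or digit, so it catches bare `/` and `/*` but would miss a real path such as `/home`, which a fuller lint would handle with an allow-list of sentinel prefixes.

```bash
#!/usr/bin/env bash
# Fail if any file under $1 contains a live destructive command string.
# Matching lines are printed (file:line:text) to make the failure actionable.
lint_fixtures() {
  local dir="$1"
  local pattern='rm[[:space:]]+-rf?[[:space:]]+/([^[:alnum:]]|$)'
  pattern+='|dd[[:space:]]+if='
  pattern+='|:\(\)[[:space:]]*\{'
  pattern+='|chmod[[:space:]]+-R[[:space:]]+000[[:space:]]+/'
  # grep exits 0 when it finds a match, which for a lint means failure.
  if grep -rnE -- "$pattern" "$dir"; then
    return 1
  fi
  return 0
}
```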

---

## 4. llama-server + AI models module

**Goal:** `~/dotfiles/install.sh` (or a sub-command of it) sets up llama.cpp
+ CUDA, registers the systemd units, places `presets.ini` from dotfiles, and on
a non-devcontainer machine downloads the configured set of GGUF models. A
second script (`scripts/models.sh`) handles add/remove/list of models
post-install.

### Target layout

```
~/dotfiles/.agents/models/
├── presets.ini          ← canonical, version-controlled
├── models.list          ← URLs + filenames + checksums (committed)
├── README.md            ← what each preset is for
└── gguf/                ← gitignored, populated by install.sh
    └── *.gguf

~/dotfiles/.agents/llama-server/
├── start.sh                      ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service          ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path     ← path watcher
├── llama-server-presets.service  ← oneshot restart
└── build-llama.sh                ← clones + builds llama.cpp w/ CUDA

~/dotfiles/.agents/scripts/
├── models.sh            ← add/remove/list GGUFs by URL
└── install-llama.sh     ← called by install.sh; idempotent
```

### `install.sh` additions (ordered)

1. **Detect environment.** If `/.dockerenv` exists, `$REMOTE_CONTAINERS` set, or
   `$CODESPACES` set → devcontainer mode: skip llama.cpp build and GGUF download
   (huge, slow, and not useful inside the container). Still place `presets.ini`
   and `models.list` so the project can read them.
2. **Dependencies.**
   `apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git`
   (with `sudo` prompt). CUDA toolkit detection only — don't try to install CUDA
   itself; assume host setup or fail loud with a pointer to
   [docs/llama-server-cuda-wsl2.md](../../../dotfiles/.agents/docs/llama-server-cuda-wsl2.md).
3. **Build llama.cpp.** `scripts/install-llama.sh` clones `ggerganov/llama.cpp`
   to `/opt/llama-server/src`, builds with `-DGGML_CUDA=ON`, installs binaries +
   libs to `/opt/llama-server/`. Skips the clone+build if the binary exists and
   `--rebuild` wasn't passed.
4. **Install systemd units.** Copy from
   `~/dotfiles/.agents/llama-server/*.{service,path}` to `/etc/systemd/system/`,
   setting `User=` to `${USER}`. Run `systemctl daemon-reload`, then
   `systemctl enable --now llama-server.service llama-server-presets.path`.
5. **Symlink `presets.ini`.**
   `ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini` (keep the
   existing path-watcher target until users have migrated). The path watcher
   already restarts on modify — symlink target changes count.
6. **Download GGUFs.** Read `models.list`; for each entry not already in
   `~/dotfiles/.agents/models/gguf/`, download with `curl --location` and verify
   checksum if listed. Print disk-usage estimate before starting. Skip in
   devcontainer mode.
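Step 1's environment check can be sketched as a small predicate that the later steps consult. The function names are hypothetical, and the optional marker-path argument is an addition made only so the check can be exercised in tests without depending on the host; the three signals themselves are the ones listed in step 1.

```bash
#!/usr/bin/env bash
# Return 0 when running inside a devcontainer-like environment.
# $1 (optional) overrides the Docker marker path; defaults to /.dockerenv.
is_devcontainer() {
  local marker="${1:-/.dockerenv}"
  [[ -e "$marker" || -n "${REMOTE_CONTAINERS:-}" || -n "${CODESPACES:-}" ]]
}

# Print which install mode applies: heavy steps (llama.cpp build, GGUF
# download) only run in "full" mode; presets.ini and models.list are
# placed in both modes.
install_mode() {
  if is_devcontainer "$@"; then
    echo "presets-only"
  else
    echo "full"
  fi
}
```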

### `models.list` format

```
# url<TAB>filename<TAB>sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf qwen3-coder-30b-iq3.gguf abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf deepcoder-14b-q5.gguf def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf qwopus-3.6-35b-iq3.gguf -
```

Plain TSV, easy to grep + diff. Comments via `#`.
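Reading this format from a script might look like the sketch below, which covers the "what is missing" and "verify checksum if listed" parts of the download step without doing any network I/O. The function names are hypothetical, the sketch assumes real tab separators and treats `-` or an empty third field as "no checksum", and it relies on `sha256sum` from GNU coreutils being available.

```bash
#!/usr/bin/env bash
# Print the filename of every models.list entry that is not yet in $2.
# $1 = path to models.list, $2 = gguf directory.
missing_models() {
  local list_file="$1" gguf_dir="$2" url name sha
  # The `|| [[ -n "$url" ]]` keeps a final line that lacks a trailing newline.
  while IFS=$'\t' read -r url name sha || [[ -n "$url" ]]; do
    # Skip blank lines and `#` comments.
    if [[ -z "$url" || "$url" == \#* ]]; then
      continue
    fi
    [[ -e "$gguf_dir/$name" ]] || printf '%s\n' "$name"
  done < "$list_file"
}

# Verify a downloaded file against its listed checksum.
# $1 = file, $2 = expected sha256, or "-" / empty when none is listed.
verify_sha256() {
  local file="$1" expected="$2" actual
  if [[ -z "$expected" || "$expected" == "-" ]]; then
    return 0   # nothing to verify
  fi
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  [[ "$actual" == "$expected" ]]
}
```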

### `models.sh` CLI

```bash
models.sh list                       # show installed + configured
models.sh add <url> [--name=<file>]  # download + append to models.list
models.sh remove <name>              # rm file + drop from models.list
models.sh prune                      # delete files not in models.list
models.sh download                   # re-download anything missing
models.sh checksum <name>            # compute + store sha256
```

Each command edits `models.list` and the `gguf/` dir; `presets.ini` is edited by
hand (with the path-watcher restarting llama-server on save).

### Open questions

- **`User=` in the systemd unit.** The current unit runs as `ollama`. The
  rationale was probably ollama's group ownership of `/home/dev/models/`. Moving
  the model dir into dotfiles means the user owns it directly — running as
  `${USER}` (or as a dedicated `llama` system user) is cleaner. Decide before
  shipping.
- **CUDA-only assumption.** The user accepted "can always make this more
  flexible later." Tag in the build script's header so a CPU/Metal fallback is
  easy to add. Don't gold-plate now.
- **Where do the modelfiles go?** Remnant's `omnicoder*.modelfile` files are
  Ollama-format. If they're still useful, move them to
  `~/dotfiles/.agents/models/modelfiles/` and add a
  `models.sh modelfile apply <name>` subcommand. Out of scope for the initial
  cut; track in #4.5.

---

## 5. Kanban / task-doc unification

Already designed in
[extraction-history.md → Future task — unify kanban/task doc structure](./extraction-history.md#-future-task--unify-kanbantask-doc-structure).
Once #1 lands, `stop.sh` reads task-doc paths from `project.config.js`, so the
"shared hook supports one shape" framing changes: the hook supports _whatever
shape the config declares_, and the migration becomes purely a per-project
content move.

**Revised plan after #1:**

- Drop the "stop.sh knows about Remnant's flat list vs MFE's
  `tasks/{backlog,todo,done}/`" coupling. `stop.sh` should know how to scan a
  directory tree and how to scan a flat file, and `taskDocs` in config picks
  which mode.
- MFE bootstraps on the directory-tree mode from day one.
- Remnant's migration is optional — if the kanban-tree shape is demonstrably
  better in MFE, port Remnant later.
- Skill option still applies: a `migrate-task-docs.md` skill is probably cheaper
  than a script given the per-project judgment calls.

---

## 6. MemPalace integration

**Why this is here:** the WIP "AGENTS.md context survival after compaction"
problem in the validation doc is a special case of the broader long-term memory
problem. MemPalace
([NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671))
solves it with a hook architecture that matches ours almost line-for-line.

**MemPalace primitives (verified from the PR):**

| MemPalace hook | Our equivalent | What it does |
| --- | --- | --- |
| `initialize()` | `session-start.sh` | Loads identity, warms vector DB |
| `system_prompt_block()` | `session-start.sh` inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| `prefetch()` | `user-prompt-submit.sh` | Semantic search before each turn; wing-narrowed |
| `sync_turn()` | `post-tool-use.sh` | Files every exchange to the palace, non-blocking |
| `on_session_end()` | `stop.sh` | Full session mining + L1 layer regeneration |
| `on_pre_compress()` | `pre-compact.sh` | Extract key exchanges before context compression |
| `on_memory_write()` | (new — explicit writes) | Mirrors explicit memory writes into the palace |

**Practical plan:**

- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at
  `~/.mempalace/`). Hermes is the reference integration but MemPalace itself
  ships an MCP server (`mempalace_search`, `mempalace_status`, +6 more tools)
  that any MCP-aware harness can use directly.
- Register the MemPalace MCP server in `~/.config/opencode/opencode.json` and
  `~/.vscode-server/.../mcp.json` via `install.sh` — same pattern as
  `all-agents`. No code changes needed on the harness side for read access.
- Wire write-side via our existing hooks: `post-tool-use.sh` calls the MCP tool
  to file the turn, `pre-compact.sh` extracts and stores key exchanges. This is
  additive — the existing dead-ends/explorations scaffolding stays.
- **Known bug to track upstream:** the Hermes plugin defaulted to a 384-dim
  embedding function vs. MemPalace's 1024-dim collection. If we integrate
  directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep
  it; if we follow Hermes's plugin pattern, fix per the PR comment.

**Acceptance:** after restart in a fresh session, the agent can recall specific
facts (e.g. "what was the Phase 4 commit?") from a prior session without those
facts being in the workspace files. Compaction in the middle of a session does
not erase per-turn memory.

**Why this is #6, not #1:** it's higher-value than the small fixes but depends
on Ollama already running (which #4 makes turnkey), and requires verifying
MemPalace works against our chosen embedding model on our hardware before
committing to it. Do #1, #2, #3 first, then this.

---

## 7. Trace-based eval scaffolding

**Source:** "The Loop Is Only as Good as the Metric"
([distributedthoughts.org, Mar 2026](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/))
on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch
loop. Quote: _"the value of an optimization loop is determined entirely by the
quality of its feedback signal."_

**Husain methodology in two sentences:** review at least 100 real agent-output
traces by hand, take open-ended notes, categorize failures, then build binary
pass/fail evals around the failure modes you actually saw. Do not start with
generic metrics.

**Practical plan for us:**

- Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent
  output to `~/.agent-traces/<date>/<session-id>.jsonl` via the existing
  `post-tool-use.sh` (we already have session-ID derivation from #2). Add a
  `trace_log()` helper in `_lib/`.
- Build a tiny review CLI: `scripts/trace-review.sh` opens the next unreviewed
  trace in `$EDITOR` with a frontmatter block (`outcome: pass|fail|partial`,
  `failure_modes: []`, `notes: ""`). Saves to `~/.agent-traces/reviewed/`.
- After 100 reviewed traces, derive a `failure-modes.md` doc grouping the
  observed failure modes. _This_ becomes the input to skill / hook / AGENTS.md
  improvements — concrete failure modes, not speculation.
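A minimal `trace_log()` along the lines of the first bullet is sketched below. The argument order and the JSON field names (`ts`, `session`, `tool`, `output`) are assumptions, and the trace root is passed in rather than hardcoded so the helper can be tested. The escaping covers only backslash, double quote, newline, and tab; a production helper would more likely hand serialization to a JSON tool.

```bash
#!/usr/bin/env bash
# Escape a string for embedding in a JSON string literal.
# Handles backslash, double quote, newline, and tab only.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# trace_log <root-dir> <session-id> <tool-name> <output-text>
# Appends one JSON object per call to <root>/<date>/<session-id>.jsonl.
trace_log() {
  local root="$1" session_id="$2" tool="$3" output="$4"
  local dir
  dir="${root}/$(date +%Y-%m-%d)"
  mkdir -p "$dir"
  printf '{"ts":"%s","session":"%s","tool":"%s","output":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$session_id")" \
    "$(json_escape "$tool")" \
    "$(json_escape "$output")" >> "${dir}/${session_id}.jsonl"
}

# Example call from a hook (values are illustrative):
#   trace_log "$HOME/.agent-traces" "$SESSION_ID" "$TOOL_NAME" "$TOOL_OUTPUT"
```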

**Why this is gating for #9:** an EvoSkill-style or Karpathy-style automated
loop needs a metric. Without trace-based failure modes, the only metric
available is "did the user thumbs-up" — too noisy, too slow, too coarse.

---

## 8. Exa rate-limit awareness

Per the validation doc gap #9. Free-plan limit: no parallel fan-out within ~1s,
so calls must be serial.

**Implementation:**

- Add a `mcp_exa_*` case to `post-tool-use.sh` that injects a one-liner reminder
  ("Exa free plan: serialize searches; one at a time").
- Add an "External service quirks" section to `~/dotfiles/.agents/AGENTS.md`
  listing Exa (and any future per-service constraints) so the rule survives
  compaction.
- Optional soft-warn in `pre-tool-use.sh`: count `mcp_exa_*` calls per turn
  (reset on `user-prompt-submit`); inject a warning (not a deny) past N=2 in a
  single turn.

Trivial, no dependencies, can land in any order.

---

## 9. Research-loop / EvoSkill-style improvements
|
||
|
||
**Sources:**
|
||
|
||
- Karpathy autoresearch
|
||
([github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch),
|
||
Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb),
|
||
LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
|
||
- EvoSkill ([arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
|
||
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill)):
|
||
failure-driven skill discovery via Proposer + Skill-Builder agents over a
|
||
Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot
|
||
transfer to BrowseComp. Skills materialize as `SKILL.md` + helper scripts —
|
||
same shape as our existing skills dir.
|
||

**What this looks like for us (after #7):**

- The "controllable artifact" is the `~/dotfiles/.agents/AGENTS.md` +
  `agents/*.md` + `skills/*.md` + hook reminders. The "frozen model" is whatever
  LLM the user is running.
- The scalar metric is something like: fraction of traces (from #7) where the
  agent's hook output and tool sequence matched a hand-labeled gold trajectory.
  Husain's binary pass/fail per failure mode aggregates into this.
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current
  skill set, proposes a new `SKILL.md` or an edit to an existing one, the
  Skill-Builder materializes it, the eval harness re-runs on the held-out trace
  set, and the frontier keeps it if the metric improves.
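
As a rough illustration of the scalar metric, the sketch below computes a pass rate over reviewed traces by reading the frontmatter `outcome:` field. It is a stand-in, not the gold-trajectory match described above: it assumes one reviewed file per trace, treats `partial` as not passing, and skips files with no outcome filled in. `pass_rate` and `REVIEWED_DIR` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: fraction of labeled reviewed traces whose outcome is "pass".
# Assumes the frontmatter layout from the trace-review step (outcome: ...).

REVIEWED_DIR="${REVIEWED_DIR:-$HOME/.agent-traces/reviewed}"

# pass_rate [DIR]
# Prints the pass fraction with three decimals; fails if nothing is labeled.
pass_rate() {
  local dir="${1:-$REVIEWED_DIR}" total=0 passed=0 f outcome
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # First `outcome:` line in the file; keep only the leading lowercase word.
    outcome="$(sed -n 's/^outcome:[[:space:]]*\([a-z]*\).*/\1/p' "$f" | head -n 1)"
    [ -n "$outcome" ] || continue      # unlabeled: leave out of the metric
    total=$((total + 1))
    if [ "$outcome" = "pass" ]; then
      passed=$((passed + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "no labeled traces" >&2
    return 1
  fi
  LC_ALL=C awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.3f\n", p / t }'
}
```

Running the loop's eval step would then reduce to comparing `pass_rate` on the held-out set before and after a proposed skill edit.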

**Why it's last in the queue:** every prior task (config, sessions, llama
turnkey, memory, traces) is a prerequisite or a strict improvement to the
substrate this loop runs on. Starting #9 before them produces a loop that
optimizes against a noisy or wrong metric — the exact failure mode the Husain
piece warns about.

---

## Deferred / not-now

- **Adopt LangGraph as the harness.** Best-in-class observability and
  state-machine recovery, but adopting it means rewriting the OpenCode + Copilot
  integration layer we just extracted. Revisit if LangSmith becomes the only
  path to debugging a specific failure mode we can't diagnose with traces (#7)
  alone. Sources:
  [agent-harness.ai benchmark](https://agent-harness.ai/blog/multi-agent-orchestration-frameworks-benchmark-crewai-vs-langgraph-vs-autogen-performance-cost-and-integration-complexity/)
  (9% token overhead vs CrewAI 18% vs AutoGen 31%);
  [groundy.com](https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/)
  (per-node failure isolation vs CrewAI full-plan retry).
- **AutoGen.** Entered maintenance mode in late 2025; absorbed into Microsoft
  Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the
  framework's strength (conversational coordination) doesn't match our
  deterministic-pipeline use case. Skip.
- **CrewAI.** Strong for "agent A → agent B → agent C" pipelines, but role
  coordination overhead is ~3× LangGraph's on simple workflows. Our use case
  (single agent per session) doesn't benefit. Skip.
- **Git worktrees for parallel agent runs.** Mentioned in the MFE draft; see
  Claude Desktop's approach. Interesting once we have a working research loop
  (#9), pointless before. Defer.
- **Narrative epistemology as an explicit framework.** Flowerree's "Reasoning
  Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic
  agents (PMC9910757) give philosophical grounding for AGENTS.md design (a
  narrative frame is a "modal-space-shaping tool, not a set of premises").
  Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we
  publish methodology.
- **Hermes Agent as a harness.** Compelling memory story (MemPalace), but Python
  and tied to NousResearch's ecosystem. We integrate the memory piece directly
  via MCP (#6) without adopting the harness.

---

## Research notes (May 23, 2026)

Pulled via Exa search; supports the prioritization above. Each block lists the
key finding and the source.

### Karpathy autoresearch — single-metric loop

- **Source:** [karpathy/autoresearch](https://github.com/karpathy/autoresearch),
  [distributedthoughts.org](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/).
- Single file (`train.py`) edited by agent, fixed 5-minute time budget per
  experiment, scalar metric (val_bpb), branch-keep-or-revert protocol, LOOP
  FOREVER. ~12 experiments/hour.
- Four ingredients for this to work outside ML training: (1) one modifiable
  artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval
  cycle. The Husain layer adds: don't invent the metric — derive it from manual
  trace review.
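
One iteration of the branch-keep-or-revert step could look like the sketch below. This is not the autoresearch repository's own script; `is_improvement` and `keep_or_revert` are hypothetical names, and the "lower is better" default is chosen because val_bpb is a loss-style metric. It assumes the caller is already on the dedicated experiment branch with the agent's edit in the working tree.

```bash
#!/usr/bin/env bash
# Sketch of a keep-or-revert step for a single-metric loop. Function names and
# the commit-message format are illustrative assumptions.

# is_improvement NEW BEST [lower|higher]
# Exit status 0 when NEW beats BEST in the given direction (default: lower).
is_improvement() {
  local new="$1" best="$2" direction="${3:-lower}"
  LC_ALL=C awk -v n="$new" -v b="$best" -v d="$direction" 'BEGIN {
    if (d == "lower") exit !(n < b)
    exit !(n > b)
  }'
}

# keep_or_revert NEW BEST [direction]
# Commits the working tree when the metric improved, otherwise discards the
# edit (tracked changes and untracked files). Prints the best metric so far.
keep_or_revert() {
  local new="$1" best="$2" direction="${3:-lower}"
  if is_improvement "$new" "$best" "$direction"; then
    git add -A && git commit -q -m "keep: metric $best -> $new"
    echo "$new"
  else
    git checkout -q -- . && git clean -fdq
    echo "$best"
  fi
}
```

A driver loop would run the experiment under its fixed time budget, read the metric, and call something like `best="$(keep_or_revert "$metric" "$best")"` before starting the next iteration.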

### EvoSkill — automated skill discovery

- **Source:** [arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
  [sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill).
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes
  `SKILL.md` + helpers), evaluator (held-out validation).
- Pareto frontier of agent programs; round-robin parent selection;
  failure-driven textual feedback descent.
- **Why this matters for us:** our skills dir already matches EvoSkill's output
  shape (`SKILL.md` + helper files). The infrastructure they describe is closer
  to "build on top of our existing layout" than "adopt a new framework."

### Agentic-framework landscape, 2026

- **LangGraph 1.2 (May 2026):** production default. 9% token overhead over raw
  API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best
  observability via LangSmith. Highest setup cost.
- **CrewAI 1.11 (Mar 2026):** fastest time-to-first-agent. 18% token overhead.
  Role-based. SQLite checkpointing added April 2026.
- **AutoGen:** maintenance mode since late 2025. Absorbed into Microsoft Agent
  Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native,
  GraphFlow).
- **MAST taxonomy finding:** 79% of multi-agent failures originate from
  spec/coordination issues, not the underlying model
  ([arxiv 2503.13657](https://arxiv.org/abs/2503.13657)). 36.9% inter-agent
  misalignment, 21.3% task-verification breakdowns. **This validates investing
  in hook/skill/AGENTS.md infrastructure over swapping models.**

### MemPalace — long-term memory provider

- **Source:**
  [NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671).
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama
  bge-m3 1024-dim). No API key.
- Hook architecture maps 1:1 onto ours (see #6 table). Eight MCP tools expose
  read/write.
- **Why this is the highest-leverage memory option:** matches our philosophy
  (local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the
  validation doc flagged.

### Narrative epistemology — applied to AGENTS.md design

- **Source:** Flowerree, "Reasoning Through Narrative" (Cambridge _Episteme_,
  2023); Betz et al., "Probabilistic coherence... Neural language models as
  epistemic agents" (PMC9910757).
- Narratives shape **modal space** — what the model treats as possible,
  plausible, required. They aren't premises to evaluate as true/false; they're
  tools that frame inference.
- **Implication for AGENTS.md:** the doc's job isn't to state facts the model
  checks at decision points — it's to shape the model's default modal space.
  Forbidden patterns aren't "rules to look up" but "implausible options excluded
  from the action space." Frames the "context survival after compaction" problem
  differently: the question isn't "did the rules survive" but "did the
  modal-space shaping survive."
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces
  probabilistically-coherent belief revision. Suggestive for why AGENTS.md
  content that the model sees repeatedly (via PostToolUse re-injection) gets
  internalized better than content seen once.

### Exa rate-limit (operational)

- Free plan: serial only, no fan-out under ~1s. Observed May 23, 2026.
- Recorded in
  [extraction-history.md gap #9](./extraction-history.md#-gaps-and-bugs-in-dotfiles-pre-push)
  and as roadmap task #8.