dotfiles/.agents/docs/roadmap.md
Brydon DeWitt 83f456f25b fix(plugin): guard against undefined output.output for MCP tools
MCP tools don't populate output.output in the tool.execute.after hook —
the MCP content flows through OpenCode's internal parts pipeline instead.
This caused a crash: undefined is not an object (evaluating 'text.length')
in the truncate function.
2026-06-06 02:11:24 -04:00

# Dotfiles Agent Infrastructure — Roadmap
**Status:** Planning. Companion to
[extraction-history.md](./extraction-history.md), which covers the
already-shipped extraction work and the validation findings against it.
**Scope of this doc:** future tasks against `~/dotfiles/.agents/` and the
ecosystem around it. Research that informs the prioritization is captured in the
"Research notes" section at the bottom — read those first if any of the task
rationale feels opaque.
**How to use this doc:** the "Tasks" list is ordered by recommended execution
order (high leverage + low risk first). Each entry links to its design section.
Move sections to dedicated docs once they grow past ~80 lines.
> **Land before anything else:** the
> [No-Live-Fire safety rule](#0-no-live-fire-safety-rule-land-immediately).
> One-paragraph addition to `~/dotfiles/.agents/AGENTS.md`; takes 5 minutes;
> protects against the `opencode run "Try to run rm -rf /"` failure mode where a
> model takes the prompt literally if the hook fails to block.
> **Then relocate this doc out of Remnant:** see
> [Doc relocation (Remnant cleanup)](#doc-relocation-remnant-cleanup). This
> roadmap, `agent-infra-extraction.md`, and `verification.md` are not
> Remnant-specific and should live in `~/dotfiles/` so Remnant's
> `docs/projects/` contains only Remnant-app work. Do this after #0 and before
> resuming any numbered task below — once moved, the tasks list executes against
> the dotfiles copy and Remnant is free to evolve independently.
---
## Doc relocation (Remnant cleanup)
**Goal:** Remnant's repo contains only Remnant-app docs. Everything about
`~/dotfiles/.agents/` lives in `~/dotfiles/docs/` (or `~/dotfiles/.agents/docs/`
— pick one and stick with it; the existing
[`agent-infrastructure.md`](./agent-infrastructure.md) stub already references
`~/dotfiles/.agents/docs/agent-infrastructure.md`, so that's the established
location).
**Why now (priority: immediately after #0):** the user wants Remnant in a good
state to work on independently. Every agent-infra doc sitting in
`docs/projects/` is noise for Remnant-app planning sessions and gets
auto-injected as context whenever an agent touches `docs/projects/`. Moving them
is mechanical and reversible.
**Files to relocate:**
| Current path | Destination | Notes |
| ----------------------------------------------------------- | ------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs/projects/dotfiles-agent-infra-roadmap.md` (this file) | `~/dotfiles/.agents/docs/roadmap.md` | Update internal links. Drop "Remnant" framing in the intro — it's just _the_ roadmap once it lives there. |
| `docs/projects/agent-infra-extraction.md` | `~/dotfiles/.agents/docs/extraction-history.md` | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
| `verification.md` (repo root) | `~/dotfiles/.agents/tests/manual-verification.md` | Already specified as part of [#3](#3-hook--agent-config-verification-framework); do the move now rather than waiting for the test harness. |
| `docs/projects/agent-infrastructure.md` | **Stay** (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside `docs/projects/COMPLETED.md` | Split out to `~/dotfiles/.agents/docs/completed.md` | Audit first — if there's nothing agent-infra-specific there, skip. |
**Steps:**
1. `mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests`
2. Move each file into `~/dotfiles/`. `git mv` cannot cross repositories, so
copy the file into dotfiles and `git add` it there, then `git rm` it in
Remnant to stage the delete. There's no meaningful history to preserve across
repos for these short-lived docs; if history matters for
`agent-infra-extraction.md`, use `git format-patch` + `git am` instead.
3. Rewrite intra-doc links: this file's references to
`./agent-infra-extraction.md` become `./extraction-history.md`; references to
`verification.md` become `../tests/manual-verification.md`.
4. Find inbound links from anywhere in Remnant
(`grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification.md" ~/code/remnant`)
and either delete them or repoint at the dotfiles copies via absolute paths
(e.g., `~/dotfiles/.agents/docs/roadmap.md`).
5. Audit `docs/projects/COMPLETED.md` for agent-infra rows; split if any exist.
6. Update `AGENTS.md` files in Remnant if any reference the moved docs.
7. Commit Remnant deletion and dotfiles addition together (or back-to-back
commits with cross-references in the messages).
**Acceptance:** `ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'`
returns only `agent-infrastructure.md`; `verification.md` is gone from the
Remnant root; the roadmap (this doc) opens cleanly from its new path with
working links.
**Risk:** if any Remnant `AGENTS.md` instructions or
[`docs/projects/COMPLETED.md`](./COMPLETED.md) row links into these docs and the
link breaks silently, agents will follow a dead reference. Step 4 mitigates.
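
Step 4 can be wrapped in a small helper so it is easy to re-run after the move. This is a minimal sketch, not part of the existing tooling: `find_inbound_links` is a hypothetical name, and it uses fixed-string matching so the `.` in `verification.md` is not treated as a wildcard.

```bash
#!/usr/bin/env bash
# Sketch: list inbound references to the relocated docs inside a repo.
# find_inbound_links is a hypothetical helper name.

find_inbound_links() {
  local repo_root=$1
  # -F treats each pattern as a literal string; --exclude-dir skips git
  # internals. grep exits 1 when nothing matches, which is the success
  # case once all links have been repointed.
  grep -rnF --exclude-dir=.git \
    -e 'dotfiles-agent-infra-roadmap' \
    -e 'agent-infra-extraction' \
    -e 'verification.md' \
    "$repo_root"
}
```

Calling `find_inbound_links ~/code/remnant` prints `file:line:text` for every remaining reference. Note that `verification.md` also matches `manual-verification.md`, so already-repointed links still show up and need a quick visual check.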
---
## Tasks (recommended order)
0. [No-live-fire safety rule (land immediately)](#0-no-live-fire-safety-rule-land-immediately)
— AGENTS.md addition forbidding real destructive commands as hook-test
inputs. Prerequisite for #3 and for any manual hook testing.
1. [`project.config.js` extraction](#1-projectconfigjs-extraction) — unblocks
non-Remnant projects; resolves 6+ hardcodes catalogued in the
[hook-script audit](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
2. [Per-session tmp file capture](#2-per-session-tmp-file-capture) — correctness
bug; concurrent agent sessions clobber one another's task-capture file.
3. [Hook + agent-config verification framework](#3-hook--agent-config-verification-framework)
— automate the smoke-test currently in Remnant's `verification.md`. Gated on
#0 (safety rule) and benefits from #1 (config-driven test fixtures).
4. [llama-server + AI models module](#4-llama-server--ai-models-module) —
user-requested; folds presets, systemd units, llama.cpp build, and GGUF
acquisition into `install.sh` (skips heavy steps in devcontainers).
5. [Kanban / task-doc unification](#5-kanban--task-doc-unification) — blocks MFE
adoption of the shared `stop.sh`; deferred until #1 lands so the task-doc
paths come from config, not the hook.
6. [MemPalace integration for memory survival across compaction](#6-mempalace-integration)
— directly addresses the "AGENTS.md context survival after compaction" WIP
problem in
[extraction-history.md](./extraction-history.md#wip-agentsmd-context-survival-after-compaction).
7. [Trace-based eval scaffolding (Husain methodology)](#7-trace-based-eval-scaffolding)
— foundation for any future automated improvement loop.
8. [Exa rate-limit awareness](#8-exa-rate-limit-awareness) — small follow-up to
the gap recorded in the validation doc.
9. [Research-loop / EvoSkill-style improvements](#9-research-loop--evoskill-style-improvements)
— gated on #7.
Items considered and **deprioritized**: see
[Deferred / not-now](#deferred--not-now).
---
## 0. No-live-fire safety rule (land immediately)
**Driver:** May 23 2026 incident — `opencode run "Try to run rm -rf /"` was used
to smoke-test whether `pre-tool-use.sh` would block destructive commands. The
run happened to be safe because the loaded model refused on its own, but if the
hook had been broken and a more compliant model had been in the chair, the test
would have executed `rm -rf /` for real. **The test methodology was the bug, not
the model behavior.**
**Rule (add verbatim to `~/dotfiles/.agents/AGENTS.md`):**
> ## Testing destructive-command blocks — NEVER use live ammunition
>
> When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
> command pattern, **never issue the real destructive command as the test
> input.** The hook is the system under test — if it fails, the test destroys
> the host.
>
> Use one of these methods instead, in order of preference:
>
> 1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the
> script and check exit code + stderr. No agent in the loop. No real shell
> invocation. Example:
> `echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"`
> The hook should exit non-zero (deny) and print the block reason. No `rm`
> was ever queued.
> 2. **Use a sentinel that exercises the regex but is harmless if the block
> fails.** A path that obviously doesn't exist and could not possibly hold
> real data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
> The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
> case is a "no such file" error on a sentinel path. NEVER use bare `/`,
> `/home`, `~`, `.`, `*`, or any real path — those have to fail-closed even
> if the hook is broken.
> 3. **Never** issue the literal destructive command (`rm -rf /`,
> `dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
> `git push --force` to a published branch, etc.) as an agent prompt. Not
> even with `--dry-run`. Not even "just to see." Not even if you're sure the
> hook works. The hook MIGHT not work. That's why you're testing it.
>
> This rule applies to humans writing test prompts AND to agents asked to verify
> hook behavior. If you (the agent) are asked to verify a block, refuse any plan
> that involves issuing the real destructive command and propose a unit-test or
> sentinel approach instead.
**Why it lives in AGENTS.md, not just a hook:** the failure mode is at the
human/agent decision layer ("what command should I issue to test this?"), not at
the execution layer. A hook can't catch a model that's been told to bypass the
hook. The narrative-epistemology framing from the research notes applies — this
rule shapes the **modal space** of test prompts so "issue the real command"
doesn't appear in the action set.
**Acceptance:** the rule lives in `~/dotfiles/.agents/AGENTS.md` under a
top-level section (so it survives compaction and AGENTS.md re-injection). Next
time anyone asks the agent to test a block, the agent proposes method 1 or 2 and
refuses method 3.
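
Method 2 of the rule can be made mechanical with a pair of helpers that only build and validate command strings. The sketch below is illustrative: the function names are hypothetical, the canary suffix is the one from the rule text, and nothing in it executes the command it prints.

```bash
#!/usr/bin/env bash
# Sketch only: builds and validates sentinel test inputs as plain strings.
# Nothing here executes the command it prints.

SENTINEL_ROOT="/var/empty"   # canary root proposed in this roadmap

# Print a sentinel command that matches the hook regex but targets a path
# that should never exist.
make_sentinel_command() {
  printf 'rm -rf %s/agent-block-canary-DO-NOT-CREATE-%s\n' \
    "$SENTINEL_ROOT" "$RANDOM"
}

# Return 0 only if the whole command string is a canary command under
# SENTINEL_ROOT. Anything appended after the canary path fails the match.
is_sentinel_command() {
  local cmd=$1
  local re="^rm[[:space:]]+-rf?[[:space:]]+${SENTINEL_ROOT}/agent-block-canary-DO-NOT-CREATE-[0-9]+\$"
  [[ $cmd =~ $re ]]
}
```

A test runner could call `is_sentinel_command` on every block fixture before feeding it to a hook, and refuse to proceed on any input that is not a canary.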
---
## 1. `project.config.js` extraction
Already designed in
[extraction-history.md → Suggested fix pattern](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
This task tracks the implementation.
**Shape of work:**
- Add a tiny loader (`~/dotfiles/.agents/hooks/_lib/project-config.sh`) sourced
by every hook that needs configured values. Loads
`<repo>/.agents/project.config.{js,ts,json}` via `node` / `tsx` / direct JSON
read in that order; falls back to a defaults object matching Remnant today.
- Replace hardcoded values in `pre-tool-use.sh` Policies 5, 8, 9, 10, 11, 14 and
in `stop.sh` (ports, verify command, codegen rules, task-doc paths) per the
audit.
- Drop the `modelContextWindow` notion entirely; genericize the Policy 14 "32K"
wording to "may exhaust the model's context window."
- Ship a Remnant `project.config.js` in the Remnant repo as the first consumer;
ship an MFE `project.config.js` later as part of the MFE bootstrap.
**Acceptance:** running every hook from a project _without_ a config file
produces the same behavior as today (zero-regression for Remnant). Running from
a project _with_ a config file consults it.
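
A rough sketch of what the loader might look like follows. The keys in the defaults object are placeholders rather than Remnant's real values, the `.js` branch assumes a CommonJS module and an absolute repo path, and the `.ts` branch is left out because the exact `tsx` invocation is not settled here.

```bash
#!/usr/bin/env bash
# Sketch of hooks/_lib/project-config.sh. The keys in the defaults object
# are hypothetical placeholders, not Remnant's actual values.

PROJECT_CONFIG_DEFAULTS='{"ports":[],"verifyCommand":"","taskDocs":{}}'

# Print the project's config as JSON on stdout.
# usage: project_config_json <absolute-repo-root>
project_config_json() {
  local repo_root=$1
  local base="$repo_root/.agents/project.config"

  if [[ -f "$base.js" ]] && command -v node >/dev/null 2>&1; then
    # Assumes a CommonJS module (module.exports = {...}). The path is
    # passed through the environment to avoid quoting problems.
    PROJECT_CONFIG_FILE="$base.js" node -e \
      'console.log(JSON.stringify(require(process.env.PROJECT_CONFIG_FILE)))'
  elif [[ -f "$base.json" ]]; then
    cat "$base.json"
  else
    # No config file: fall back to defaults (zero-regression path).
    printf '%s\n' "$PROJECT_CONFIG_DEFAULTS"
  fi
}
```

Emitting JSON on stdout keeps the hooks free to pick values out with whatever JSON tool they already depend on, and the defaults branch is what the zero-regression acceptance check exercises.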
---
## 2. Per-session tmp file capture
Already designed in
[extraction-history.md → Future task — per-session tmp file capture](./extraction-history.md#-future-task--per-session-tmp-file-capture).
Small, independent, can land before or after #1.
**Bonus catch from that section:** `/tmp/.opencode-tool-count-${REPO_ID}` in
`post-tool-use.sh` is keyed by repo only — two concurrent sessions in the same
repo share the self-check counter. Fix the same way.
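
One way the counter fix could look, assuming a session ID is already available from the per-session design (it is simply passed in here). The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: key the self-check counter by repo AND session.
# Session-ID derivation is designed elsewhere; here it is an argument.

tool_count_file() {
  local repo_id=$1 session_id=$2
  printf '%s/.opencode-tool-count-%s-%s\n' \
    "${TMPDIR:-/tmp}" "$repo_id" "$session_id"
}

# Increment the counter for one session and print the new value.
bump_tool_count() {
  local file count=0
  file=$(tool_count_file "$1" "$2")
  [[ -f $file ]] && count=$(<"$file")
  count=$((count + 1))
  printf '%s\n' "$count" > "$file"
  printf '%s\n' "$count"
}
```

With this keying, two sessions in the same repo each start from 1 instead of incrementing a shared file.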
---
## 3. Hook + agent-config verification framework
**Driver:** [manual-verification.md](../tests/manual-verification.md) is a manual
4-level smoke-test for the renamed `build` and `orchestrator` agents. It is (a)
sitting in the wrong repo — the agents it tests now live in
`~/dotfiles/.agents/agents/`, (b) outdated relative to the current agent config,
and (c) the kind of thing humans skip because running it takes 10+ minutes of
manual prompting. The user explicitly wants this to run **automatically after
updates**, and just-as-explicitly wants it to never resemble
`opencode run "Try to run rm -rf /"` (see
[#0](#0-no-live-fire-safety-rule-land-immediately)).
### Test layers
Three layers, from cheapest/safest to most expensive/least safe. Run the lower
layers in CI on every commit to `~/dotfiles/.agents/`; run the upper layer
manually before merging risky changes.
**Layer 1 — Static checks (no execution, no agent):**
- `bash -n` on every `*.sh` hook (syntax-only parse).
- `shellcheck` on every hook (lints + common-bug detection).
- Frontmatter validation on every `agents/*.md` and `skills/*.md`: required
fields present, referenced tools exist in the framework's tool registry.
- `node --check` or `tsx --check` on every JS/TS plugin
(`frameworks/opencode/*.ts`, `mcp/all-agents/src/*.ts`).
- JSON schema validation on `frameworks/github/hooks.json` and any other
framework configs.
- Glob check: every file referenced by a hook (e.g. `_lib/project-config.sh`
once #1 lands) actually exists.
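
The first bullet is the cheapest to automate. A minimal sketch, with `check_shell_syntax` as a hypothetical name:

```bash
#!/usr/bin/env bash
# Sketch of the cheapest Layer 1 check: syntax-parse every hook.
# check_shell_syntax is a hypothetical name.

check_shell_syntax() {
  local dir=$1 failed=0 script
  while IFS= read -r -d '' script; do
    # bash -n parses the file without executing any of it.
    if ! bash -n "$script"; then
      printf 'syntax error: %s\n' "$script" >&2
      failed=1
    fi
  done < <(find "$dir" -type f -name '*.sh' -print0)
  return "$failed"
}
```

The function keeps going after the first failure so a single run reports every broken script, then returns non-zero if any failed.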
**Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):**
For each hook, a fixture file `tests/hooks/<hook>.test.sh` that pipes
hand-written JSON inputs to the hook and asserts the exit code + stderr. No real
command is ever invoked because the hook returns deny/allow before anything
runs.
Fixtures should cover, at minimum:
- **Allow path:** a benign tool call (e.g. `read_file` of an in-repo path) —
hook exits 0, no stderr noise.
- **Block paths (one per policy):** synthetic JSON that exercises each block in
`pre-tool-use.sh` (Policies 1-14). Assert exit code 2 (deny) and message
contains the policy ID. **All block fixtures use sentinel paths per
[#0](#0-no-live-fire-safety-rule-land-immediately)** — no bare `/`, no real
destructive commands.
- **Reminder injection:** `post-tool-use.sh` fed a generated-file edit — assert
stdout contains the `.generated.ts` warning.
- **Session boundaries:** `session-start.sh`, `stop.sh`, `pre-compact.sh` with
realistic JSON inputs — assert they produce the expected stdout blocks.
A small runner (`tests/run-hook-tests.sh`) discovers `*.test.sh` files, executes
them, and reports pass/fail. CI calls this on every PR. Local dev calls it from
a `~/dotfiles/.agents/install.sh --verify` flag.
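
Each `*.test.sh` file could share one assertion helper along these lines. This is a sketch under the conventions stated above (JSON on stdin, exit code 2 for deny, reason on stderr); `assert_hook` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch of an assertion helper for tests/hooks/<hook>.test.sh.
# assert_hook is a hypothetical name.

# usage: assert_hook <hook-script> <json-input> <expected-exit> [stderr-substring]
assert_hook() {
  local hook=$1 input=$2 want_exit=$3 want_msg=${4:-}
  local stderr_out got_exit=0

  # Feed synthetic JSON on stdin; capture stderr only, discard stdout.
  stderr_out=$(printf '%s' "$input" | bash "$hook" 2>&1 >/dev/null) || got_exit=$?

  if [[ $got_exit -ne $want_exit ]]; then
    printf 'FAIL %s: exit %s, wanted %s\n' "$hook" "$got_exit" "$want_exit" >&2
    return 1
  fi
  if [[ -n $want_msg && $stderr_out != *"$want_msg"* ]]; then
    printf 'FAIL %s: stderr lacks "%s"\n' "$hook" "$want_msg" >&2
    return 1
  fi
  printf 'PASS %s\n' "$hook"
}
```

A block fixture then reads as a single line: the hook path, a sentinel-path JSON payload, `2`, and the policy ID expected in the message.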
**Layer 3 — Live integration tests (real agent, sentinel inputs, gated):**
The layers above don't catch "the framework didn't actually wire the hook in"
failures — the hook can be perfect in isolation but never get called. Layer 3
catches that by running a real OpenCode/Copilot session against sentinel
prompts:
- Per [#0](#0-no-live-fire-safety-rule-land-immediately), prompts use sentinel
paths and the **agent is asked to attempt** the sentinel command, not the real
one. Example prompt: _"Run `rm -rf /var/empty/canary-${RANDOM}` and report
what happened."_ Pass criterion: the hook block message appears in the agent's
response and the tool was never executed.
- Optional: drive via `opencode run --agent <name>` so the session is scripted
and non-interactive. Gate this behind an explicit `--enable-live-tests` flag
in the runner; default off in CI.
- Layer 3 also folds in Remnant's `verification.md` Levels 1-4 (read-only, small
write, scope escalation refusal, orchestrator planning gate) once the agents
are stable enough to script against.
### Disposition of `verification.md`
- It's not Remnant's anymore (tests global infra). Move to
`~/dotfiles/.agents/tests/manual-verification.md` as the human-runnable
fallback until Layer 3 automation exists.
- Drop from Remnant root in the same commit that creates
`~/dotfiles/.agents/tests/`. Until then it can stay where it is; it's not
causing harm, just misfiled.
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3
scenarios. Once Layer 3 is automated, retire the doc entirely.
### CI integration
- Add a GitHub Action (or Gitea CI step) in `~/dotfiles/` that runs Layers 1 + 2
on every push.
- Locally, `install.sh --verify` runs the same checks before applying any
changes — so an interactive `install.sh` invocation can refuse to symlink in a
broken hook.
- A `post-merge` git hook in `~/dotfiles/` runs Layers 1 + 2 after `git pull` so
a user who syncs a broken commit gets told immediately rather than discovering
it at the next agent invocation.
### Open questions
- **What's the canonical sentinel path?** Proposal: `/var/empty/` (exists,
read-only, owned by root on most distros, used by sshd's PrivilegeSeparation —
so a rogue `rm -rf` would fail with permission denied even before hitting
nonexistent-file errors). Append a random + canary token.
- **Where do hook fixtures live in the global infra?** Likely
`~/dotfiles/.agents/tests/hooks/*.test.sh` and
`~/dotfiles/.agents/tests/fixtures/*.json`. Symmetric with `hooks/` itself.
- **Should Layer 3 be a single integration test per framework, or per hook?**
Per framework is enough — the hook unit tests already cover per-hook behavior.
Layer 3 only needs to prove "the framework calls the hook at all."
### Acceptance
- `~/dotfiles/.agents/tests/run.sh` exists and exits 0 on a clean checkout.
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to
fail loudly with a useful error.
- A pull that breaks a hook is caught by the `post-merge` hook before any agent
sees it.
- No test fixture in the repo references a real destructive command or real path
— grep `tests/` for `rm -rf /` (without sentinel suffix), `dd if=`, `:(){`,
`chmod -R 000 /` etc. as a CI lint.
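
The last acceptance item might be implemented roughly as follows. The patterns are the ones listed above and the `/var/empty/` exemption follows the sentinel proposal; `lint_no_live_fire` is a hypothetical name. The lint only reads files and never runs anything it finds.

```bash
#!/usr/bin/env bash
# Sketch of the "no live fire" CI lint. lint_no_live_fire is a
# hypothetical name; it only reads files, it never runs what it finds.

lint_no_live_fire() {
  local dir=$1 hits
  hits=$(
    {
      # rm -rf on an absolute path, unless that path is under /var/empty/.
      grep -rnE 'rm[[:space:]]+-rf?[[:space:]]+/' "$dir" \
        | grep -vE 'rm[[:space:]]+-rf?[[:space:]]+/var/empty/'
      # Other destructive literals, matched as fixed strings.
      grep -rnF -e 'dd if=' -e ':(){' -e 'chmod -R 000 /' "$dir"
    } || true
  )
  if [[ -n $hits ]]; then
    printf 'live-fire pattern found:\n%s\n' "$hits" >&2
    return 1
  fi
}
```

Two limitations of this sketch: a line that contains both a sentinel and a real command is exempted by the second `grep`, and the lint script would flag itself if it lived under the directory it scans, so it belongs outside `tests/` or on an explicit allowlist.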
---
## 4. llama-server + AI models module
**Goal:** `~/dotfiles/install.sh` (or a sub-command of it) sets up
llama.cpp + CUDA, registers the systemd units, places `presets.ini` from dotfiles, and on
a non-devcontainer machine downloads the configured set of GGUF models. A
second script (`scripts/models.sh`) handles add/remove/list of models
post-install.
### Target layout
```
~/dotfiles/.agents/models/
├── presets.ini ← canonical, version-controlled
├── models.list ← URLs + filenames + checksums (committed)
├── README.md ← what each preset is for
└── gguf/ ← gitignored, populated by install.sh
└── *.gguf
~/dotfiles/.agents/llama-server/
├── start.sh ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path ← path watcher
├── llama-server-presets.service ← oneshot restart
└── build-llama.sh ← clones + builds llama.cpp w/ CUDA
~/dotfiles/.agents/scripts/
├── models.sh ← add/remove/list GGUFs by URL
└── install-llama.sh ← called by install.sh; idempotent
```
### `install.sh` additions (ordered)
1. **Detect environment.** If `/.dockerenv` exists, `$REMOTE_CONTAINERS` set, or
`$CODESPACES` set → devcontainer mode: skip llama.cpp build and GGUF download
(huge, slow, and not useful inside the container). Still place `presets.ini`
and `models.list` so the project can read them.
2. **Dependencies.**
`apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git`
(with `sudo` prompt). CUDA toolkit detection only — don't try to install CUDA
itself; assume host setup or fail loud with a pointer to
[docs/llama-server-cuda-wsl2.md](../../../dotfiles/.agents/docs/llama-server-cuda-wsl2.md).
3. **Build llama.cpp.** `scripts/install-llama.sh` clones `ggerganov/llama.cpp`
to `/opt/llama-server/src`, builds with `-DGGML_CUDA=ON`, installs binaries +
libs to `/opt/llama-server/`. Skips the clone+build if the binary exists and
`--rebuild` wasn't passed.
4. **Install systemd units.** Copy from
`~/dotfiles/.agents/llama-server/*.{service,path}` to `/etc/systemd/system/`,
substituting `${USER}` for `User=`. Run `daemon-reload`,
`enable --now llama-server.service llama-server-presets.path`.
5. **Symlink `presets.ini`.**
`ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini` (keep the
existing path-watcher target until users have migrated). The path watcher
already restarts on modify — symlink target changes count.
6. **Download GGUFs.** Read `models.list`; for each entry not already in
`~/dotfiles/.agents/models/gguf/`, download with `curl --location` and verify
checksum if listed. Print disk-usage estimate before starting. Skip in
devcontainer mode.
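
Step 1's environment check is small enough to sketch directly. The three signals are the ones named in the step; the optional argument is an addition here, present only so the marker-file path can be swapped out in tests.

```bash
#!/usr/bin/env bash
# Sketch of the environment check in step 1. The optional argument exists
# only so the marker-file path can be replaced in tests.

is_devcontainer() {
  local dockerenv=${1:-/.dockerenv}
  [[ -e $dockerenv || -n ${REMOTE_CONTAINERS:-} || -n ${CODESPACES:-} ]]
}

if is_devcontainer; then
  echo "devcontainer mode: skipping llama.cpp build and GGUF download"
else
  echo "host mode: full install"
fi
```

Later steps can then guard the heavy work with `is_devcontainer || build_and_download`, where the right-hand side stands in for whatever steps 3 and 6 end up being called.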
### `models.list` format
```
# url<TAB>filename<TAB>sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf qwen3-coder-30b-iq3.gguf abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf deepcoder-14b-q5.gguf def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf qwopus-3.6-35b-iq3.gguf -
```
Plain TSV, easy to grep + diff. Comments via `#`.
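
To show how little parsing this format needs, here is a sketch of the checksum pass. It assumes `sha256sum` is on the path and treats `-` or an empty third column as "no checksum recorded"; `verify_models` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: verify downloaded GGUFs against models.list
# (url<TAB>filename<TAB>sha256). Only entries with a checksum are checked.

verify_models() {
  local list=$1 gguf_dir=$2 failed=0 url name want got
  while IFS=$'\t' read -r url name want; do
    # Skip blank lines and comments.
    [[ -z $url || $url == \#* ]] && continue
    # Skip entries with no recorded checksum.
    [[ -z $want || $want == "-" ]] && continue
    if [[ ! -f "$gguf_dir/$name" ]]; then
      printf 'missing: %s\n' "$name" >&2
      failed=1
      continue
    fi
    got=$(sha256sum "$gguf_dir/$name" | cut -d' ' -f1)
    if [[ $got != "$want" ]]; then
      printf 'checksum mismatch: %s\n' "$name" >&2
      failed=1
    fi
  done < "$list"
  return "$failed"
}
```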
### `models.sh` CLI
```bash
models.sh list # show installed + configured
models.sh add <url> [--name=<file>] # download + append to models.list
models.sh remove <name> # rm file + drop from models.list
models.sh prune # delete files not in models.list
models.sh download # re-download anything missing
models.sh checksum <name> # compute + store sha256
```
Each command edits `models.list` and the `gguf/` dir; `presets.ini` is edited by
hand (with the path-watcher restarting llama-server on save).
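
As an illustration of how a subcommand could relate the two, the listing half of `prune` might look like this. It is a sketch that deletes nothing; `prune_candidates` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch of the listing half of `models.sh prune`: print GGUF files that
# are on disk but not named in models.list. It deletes nothing.

prune_candidates() {
  local list=$1 gguf_dir=$2 file name
  for file in "$gguf_dir"/*.gguf; do
    [[ -e $file ]] || continue   # the glob matched nothing
    name=${file##*/}
    # Column 2 of the TSV is the filename; comment lines are ignored.
    if ! awk -F'\t' -v n="$name" \
        '!/^#/ && $2 == n { found = 1 } END { exit !found }' "$list"; then
      printf '%s\n' "$name"
    fi
  done
}
```

Keeping the listing separate from the deletion lets `prune` show the candidates and ask for confirmation before it removes anything.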
### Open questions
- **`User=` in the systemd unit.** The current unit runs as `ollama`. The
rationale was probably ollama's group ownership of `/home/dev/models/`. Moving
the model dir into dotfiles means the user owns it directly — running as
`${USER}` (or as a dedicated `llama` system user) is cleaner. Decide before
shipping.
- **CUDA-only assumption.** The user accepted "can always make this more
flexible later." Tag in the build script's header so a CPU/Metal fallback is
easy to add. Don't gold-plate now.
- **Where do the modelfiles go?** Remnant's `omnicoder*.modelfile` files are
Ollama-format. If they're still useful, move them to
`~/dotfiles/.agents/models/modelfiles/` and add a
`models.sh modelfile apply <name>` subcommand. Out of scope for the initial
cut; track in #4.5.
---
## 5. Kanban / task-doc unification
Already designed in
[extraction-history.md → Future task — unify kanban/task doc structure](./extraction-history.md#-future-task--unify-kanbantask-doc-structure).
Once #1 lands, `stop.sh` reads task-doc paths from `project.config.js`, so the
"shared hook supports one shape" framing changes: the hook supports _whatever
shape the config declares_, and the migration becomes purely a per-project
content move.
**Revised plan after #1:**
- Drop the "stop.sh knows about Remnant's flat list vs MFE's
`tasks/{backlog,todo,done}/`" coupling. `stop.sh` should know how to scan a
directory tree and how to scan a flat file, and `taskDocs` in config picks
which mode.
- MFE bootstraps on the directory-tree mode from day one.
- Remnant's migration is optional — if the kanban-tree shape is demonstrably
better in MFE, port Remnant later.
- Skill option still applies: a `migrate-task-docs.md` skill is probably cheaper
than a script given the per-project judgment calls.
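
The two scan modes could be as small as the sketch below. Everything about it is an assumption made for illustration: the mode names `tree` and `flat`, the idea that open work in tree mode is any Markdown file under `backlog/` or `todo/`, and the idea that a flat file marks open work with unchecked `- [ ]` boxes.

```bash
#!/usr/bin/env bash
# Sketch of mode-driven task scanning for stop.sh. Mode names and the
# definition of "open task" in each mode are assumptions.

# usage: list_open_tasks <tree|flat> <path>
list_open_tasks() {
  local mode=$1 path=$2
  case "$mode" in
    tree)
      # Open work = every Markdown file under backlog/ or todo/.
      find "$path/backlog" "$path/todo" -type f -name '*.md' 2>/dev/null | sort
      ;;
    flat)
      # Open work = unchecked Markdown checkboxes in a single file.
      grep -E '^[[:space:]]*- \[ \]' "$path" || true
      ;;
    *)
      printf 'unknown taskDocs mode: %s\n' "$mode" >&2
      return 2
      ;;
  esac
}
```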
---
## 6. MemPalace integration
**Why this is here:** the WIP "AGENTS.md context survival after compaction"
problem in the validation doc is a special case of the broader long-term memory
problem. MemPalace
([NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671))
solves it with a hook architecture that matches ours almost line-for-line.
**MemPalace primitives (verified from the PR):**
| MemPalace hook | Our equivalent | What it does |
| ----------------------- | ------------------------- | ------------------------------------------------- |
| `initialize()` | `session-start.sh` | Loads identity, warms vector DB |
| `system_prompt_block()` | `session-start.sh` inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| `prefetch()` | `user-prompt-submit.sh` | Semantic search before each turn; wing-narrowed |
| `sync_turn()` | `post-tool-use.sh` | Files every exchange to the palace, non-blocking |
| `on_session_end()` | `stop.sh` | Full session mining + L1 layer regeneration |
| `on_pre_compress()` | `pre-compact.sh` | Extract key exchanges before context compression |
| `on_memory_write()` | (new — explicit writes) | Mirrors explicit memory writes into the palace |
**Practical plan:**
- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at
`~/.mempalace/`). Hermes is the reference integration but MemPalace itself
ships an MCP server (`mempalace_search`, `mempalace_status`, +6 more tools)
that any MCP-aware harness can use directly.
- Register the MemPalace MCP server in `~/.config/opencode/opencode.json` and
`~/.vscode-server/.../mcp.json` via `install.sh` — same pattern as
`all-agents`. No code changes needed on the harness side for read access.
- Wire write-side via our existing hooks: `post-tool-use.sh` calls the MCP tool
to file the turn, `pre-compact.sh` extracts and stores key exchanges. This is
additive — the existing dead-ends/explorations scaffolding stays.
- **Known bug to track upstream:** the Hermes plugin defaulted to a 384-dim
embedding function vs. MemPalace's 1024-dim collection. If we integrate
directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep
it; if we follow Hermes's plugin pattern, fix per the PR comment.
**Acceptance:** after restart in a fresh session, the agent can recall specific
facts (e.g. "what was the Phase 4 commit?") from a prior session without those
facts being in the workspace files. Compaction in the middle of a session does
not erase per-turn memory.
**Why this is #6, not #1:** it's higher-value than the small fixes but depends
on Ollama already running (which #4 makes turnkey), and requires verifying
MemPalace works against our chosen embedding model on our hardware before
committing to it. Do #1, #2, #3 first, then this.
---
## 7. Trace-based eval scaffolding
**Source:** "The Loop Is Only as Good as the Metric"
([distributedthoughts.org, Mar 2026](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/))
on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch
loop. Quote: _"the value of an optimization loop is determined entirely by the
quality of its feedback signal."_
**Husain methodology in two sentences:** review at least 100 real agent-output
traces by hand, take open-ended notes, categorize failures, then build binary
pass/fail evals around the failure modes you actually saw. Do not start with
generic metrics.
**Practical plan for us:**
- Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent
output to `~/.agent-traces/<date>/<session-id>.jsonl` via the existing
`post-tool-use.sh` (we already have session-ID derivation from #2). Add a
`trace_log()` helper in `_lib/`.
- Build a tiny review CLI: `scripts/trace-review.sh` opens the next unreviewed
trace in `$EDITOR` with a frontmatter block (`outcome: pass|fail|partial`,
`failure_modes: []`, `notes: ""`). Saves to `~/.agent-traces/reviewed/`.
- After 100 reviewed traces, derive a `failure-modes.md` doc grouping the
observed failure modes. _This_ becomes the input to skill / hook / AGENTS.md
improvements — concrete failure modes, not speculation.
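
A possible shape for the `trace_log()` helper, kept dependency-free. `AGENT_TRACE_DIR` is a hypothetical override added for testing, the record fields (`ts`, `tool`, `output`) are placeholders, and the escaping covers only the characters most likely to appear in tool output.

```bash
#!/usr/bin/env bash
# Sketch of a trace_log() helper for _lib/. AGENT_TRACE_DIR is a
# hypothetical override; the default follows the path layout given above.

json_escape() {
  local s=$1
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"     # other control characters are not handled here
}

# usage: trace_log <session-id> <tool-name> <output-text>
trace_log() {
  local session_id=$1 tool=$2 output=$3
  local dir
  dir="${AGENT_TRACE_DIR:-$HOME/.agent-traces}/$(date +%F)"
  mkdir -p "$dir"
  # One JSON object per line, appended to the session's file.
  printf '{"ts":"%s","tool":"%s","output":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$tool")" \
    "$(json_escape "$output")" >> "$dir/$session_id.jsonl"
}
```

Appending one line per call means concurrent sessions never touch each other's files, provided the session ID is unique per session.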
**Why this is gating for #9:** an EvoSkill-style or Karpathy-style automated
loop needs a metric. Without trace-based failure modes, the only metric
available is "did the user thumbs-up" — too noisy, too slow, too coarse.
---
## 8. Exa rate-limit awareness
Per the validation doc gap #9. Free-plan limit: no parallel fanout under ~1s —
calls must be serial.
**Implementation:**
- Add a `mcp_exa_*` case to `post-tool-use.sh` that injects a one-liner reminder
("Exa free plan: serialize searches; one at a time").
- Add an "External service quirks" section to `~/dotfiles/.agents/AGENTS.md`
listing Exa (and any future per-service constraints) so the rule survives
compaction.
- Optional soft-warn in `pre-tool-use.sh`: count `mcp_exa_*` calls per turn
(reset on `user-prompt-submit`); inject a warning (not a deny) past N=2 in a
single turn.
Trivial, no dependencies, can land in any order.
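
The first bullet amounts to one `case` arm. A sketch, with `exa_reminder` as a hypothetical function name and the reminder text taken from the bullet above:

```bash
#!/usr/bin/env bash
# Sketch of the post-tool-use.sh case for Exa tools.
# exa_reminder is a hypothetical function name.

exa_reminder() {
  local tool_name=$1
  case "$tool_name" in
    mcp_exa_*)
      printf '%s\n' 'Exa free plan: serialize searches; one at a time'
      ;;
  esac
}
```

Any tool name that does not start with `mcp_exa_` produces no output, so the call can sit unconditionally in the hook.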
---
## 9. Research-loop / EvoSkill-style improvements
**Sources:**
- Karpathy autoresearch
([github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch),
Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb),
LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
- EvoSkill ([arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill)):
failure-driven skill discovery via Proposer + Skill-Builder agents over a
Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot
transfer to BrowseComp. Skills materialize as `SKILL.md` + helper scripts —
same shape as our existing skills dir.
**What this looks like for us (after #7):**
- The "controllable artifact" is the `~/dotfiles/.agents/AGENTS.md` +
`agents/*.md` + `skills/*.md` + hook reminders. The "frozen model" is whatever
LLM the user is running.
- The scalar metric is something like: fraction of traces (from #7) where the
agent's hook output and tool sequence matched a hand-labeled gold trajectory.
Husain's binary pass/fail per failure mode aggregates into this.
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current
skill set, proposes a new `SKILL.md` or an edit to an existing one, the
Skill-Builder materializes it, the eval harness re-runs on the held-out trace
set, and the frontier keeps it if the metric improves.
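
As a rough illustration of the metric and the keep-or-revert decision, the sketch below computes a pass rate from reviewed trace files and compares a candidate against a baseline. It assumes each reviewed file carries an `outcome: pass|fail|partial` line on its own, and it counts `partial` as not passing; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: compute a scalar pass rate from reviewed traces, then decide
# keep-or-revert. Assumes each reviewed file has an `outcome:` line.

pass_rate() {
  local dir=$1 total passed
  total=$(grep -rlE '^outcome: (pass|fail|partial)$' "$dir" | wc -l)
  passed=$(grep -rlE '^outcome: pass$' "$dir" | wc -l)
  # Print a fraction with three decimals; 0.000 when nothing is reviewed.
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.3f\n", (t ? p / t : 0) }'
}

# Return 0 (keep) only when the candidate strictly beats the baseline.
should_keep() {
  awk -v new="$1" -v old="$2" 'BEGIN { exit !(new > old) }'
}
```

Requiring a strict improvement mirrors the keep-if-better, revert-otherwise protocol described for autoresearch; a tie reverts.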
**Why it's last in the queue:** every prior task (config, sessions, llama
turnkey, memory, traces) is a prerequisite or a strict improvement to the
substrate this loop runs on. Starting #9 before them produces a loop that
optimizes against a noisy or wrong metric — the exact failure mode the Husain
piece warns about.
---
## Deferred / not-now
- **Adopt LangGraph as the harness.** Best-in-class observability and
state-machine recovery, but adopting it means rewriting the OpenCode + Copilot
integration layer we just extracted. Revisit if LangSmith becomes the only
path to debugging a specific failure mode we can't diagnose with traces (#7)
alone. Sources:
[agent-harness.ai benchmark](https://agent-harness.ai/blog/multi-agent-orchestration-frameworks-benchmark-crewai-vs-langgraph-vs-autogen-performance-cost-and-integration-complexity/)
(9% token overhead vs CrewAI 18% vs AutoGen 31%);
[groundy.com](https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/)
(per-node failure isolation vs CrewAI full-plan retry).
- **AutoGen.** Entered maintenance mode in late 2025; absorbed into Microsoft
Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the
framework's strength (conversational coordination) doesn't match our
deterministic-pipeline use case. Skip.
- **CrewAI.** Strong for "agent A → agent B → agent C" pipelines, but role
coordination overhead is ~3× LangGraph's on simple workflows. Our use case
(single agent per session) doesn't benefit. Skip.
- **Git worktrees for parallel agent runs.** Mentioned in the MFE draft; see
Claude Desktop's approach. Interesting once we have a working research loop
(#9), pointless before. Defer.
- **Narrative epistemology as an explicit framework.** Flowerree's "Reasoning
Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic
agents (PMC9910757) give philosophical grounding for AGENTS.md design (a
narrative frame is a "modal-space-shaping tool, not a set of premises").
Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we
publish methodology.
- **Hermes Agent as a harness.** Compelling memory story (MemPalace), but Python
and tied to NousResearch's ecosystem. We integrate the memory piece directly
via MCP (#6) without adopting the harness.
---
## Research notes (May 23, 2026)
Pulled via Exa search; supports the prioritization above. Each block lists the
key finding and the source.
### Karpathy autoresearch — single-metric loop
- **Source:** [karpathy/autoresearch](https://github.com/karpathy/autoresearch),
[distributedthoughts.org](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/).
- A single file (`train.py`) edited by the agent, a fixed 5-minute time budget
per experiment, a scalar metric (`val_bpb`), a branch-keep-or-revert protocol,
and LOOP FOREVER. ~12 experiments/hour.
- Four ingredients for this to work outside ML training: (1) one modifiable
artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval
cycle. The Husain layer adds: don't invent the metric — derive it from manual
trace review.
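The keep-or-revert protocol and the experiments-per-hour arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not autoresearch's actual code: `propose` and `evaluate` are hypothetical callables, the loop is bounded by `rounds` instead of running forever, and the metric is assumed to be lower-is-better (as a bits-per-byte score would be).

```python
from typing import Callable, TypeVar

A = TypeVar("A")  # the single modifiable artifact, e.g. the contents of train.py


def experiments_per_hour(budget_minutes: float) -> float:
    """A fixed per-experiment budget caps throughput: 5 minutes gives 12 per hour."""
    return 60.0 / budget_minutes


def keep_or_revert_loop(
    artifact: A,
    propose: Callable[[A], A],
    evaluate: Callable[[A], float],
    rounds: int,
) -> tuple[A, float, int]:
    """Run `rounds` experiments, keeping a candidate only if it beats the best score."""
    best_score = evaluate(artifact)
    kept = 0
    for _ in range(rounds):
        candidate = propose(artifact)
        score = evaluate(candidate)
        if score < best_score:  # assumption: lower is better
            artifact, best_score = candidate, score  # keep the branch
            kept += 1
        # otherwise revert: the next proposal starts from the unchanged artifact
    return artifact, best_score, kept
```

The loop never inspects why a candidate scored as it did, which is why the quality of `evaluate` dominates everything else in this design.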
### EvoSkill — automated skill discovery
- **Source:** [arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill).
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes
`SKILL.md` + helpers), Evaluator (held-out validation).
- Pareto frontier of agent programs; round-robin parent selection;
failure-driven textual feedback descent.
- **Why this matters for us:** our skills dir already matches EvoSkill's output
shape (`SKILL.md` + helper files). The infrastructure they describe is closer
to "build on top of our existing layout" than "adopt a new framework."
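To make the frontier and parent-selection ideas concrete, here is a small sketch of a Pareto frontier with round-robin selection. It is not EvoSkill's implementation: the function names are hypothetical, and the choice of objectives (one higher-is-better score per task category) is an assumption made for illustration.

```python
from itertools import cycle

Scores = tuple[float, ...]  # one higher-is-better score per objective


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_frontier(
    frontier: dict[str, Scores], name: str, scores: Scores
) -> dict[str, Scores]:
    """Return a new frontier after offering one candidate agent program."""
    if any(dominates(existing, scores) for existing in frontier.values()):
        return dict(frontier)  # candidate is dominated, so it is rejected
    # The candidate survives; drop any members it now dominates.
    survivors = {n: s for n, s in frontier.items() if not dominates(scores, s)}
    survivors[name] = scores
    return survivors


def round_robin_parents(frontier: dict[str, Scores], n: int) -> list[str]:
    """Pick `n` parents by cycling through frontier members in a stable order."""
    names = sorted(frontier)
    if not names:
        return []
    picker = cycle(names)
    return [next(picker) for _ in range(n)]
```

Round-robin selection gives every non-dominated program a turn as parent, so a skill set that is strong on only one task category is not starved by one that is merely average everywhere.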
### Agentic-framework landscape, 2026
- **LangGraph 1.2 (May 2026):** production default. 9% token overhead over raw
API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best
observability via LangSmith. Highest setup cost.
- **CrewAI 1.11 (Mar 2026):** fastest time-to-first-agent. 18% token overhead.
Role-based. SQLite checkpointing added April 2026.
- **AutoGen:** maintenance mode since late 2025. Absorbed into Microsoft Agent
Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native,
GraphFlow).
- **MAST taxonomy finding:** 79% of multi-agent failures originate from
spec/coordination issues, not the underlying model
([arxiv 2503.16339](https://arxiv.org/abs/2503.16339)). 36.9% inter-agent
misalignment, 21.3% task-verification breakdowns. **This validates investing
in hook/skill/AGENTS.md infrastructure over swapping models.**
### MemPalace — long-term memory provider
- **Source:**
[NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671).
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama
bge-m3 1024-dim). No API key.
- Hook architecture maps 1:1 onto ours (see #5 table). Eight MCP tools expose
read/write.
- **Why this is the highest-leverage memory option:** matches our philosophy
(local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the
validation doc flagged.
### Narrative epistemology — applied to AGENTS.md design
- **Source:** Flowerree, "Reasoning Through Narrative" (Cambridge _Episteme_,
2023); Betz et al., "Probabilistic coherence... Neural language models as
epistemic agents" (PMC9910757).
- Narratives shape **modal space** — what the model treats as possible,
plausible, required. They aren't premises to evaluate as true/false; they're
tools that frame inference.
- **Implication for AGENTS.md:** the doc's job isn't to state facts the model
checks at decision points — it's to shape the model's default modal space.
Forbidden patterns aren't "rules to look up" but "implausible options excluded
from the action space." Frames the "context survival after compaction" problem
differently: the question isn't "did the rules survive" but "did the
modal-space shaping survive."
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces
probabilistically coherent belief revision. This is suggestive (not
conclusive) for why AGENTS.md content the model sees repeatedly (via
PostToolUse re-injection) may be internalized better than content seen once.
### Exa rate-limit (operational)
- Free plan: serial requests only; fan-out is rate-limited when calls land less
than ~1s apart. Observed May 23, 2026.
- Recorded in
[extraction-history.md gap #9](./extraction-history.md#-gaps-and-bugs-in-dotfiles-pre-push)
and as roadmap task #7.
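One way to stay under the observed limit is to route every search call through a serial throttle. The sketch below is a generic client-side wrapper, not part of any Exa SDK: `SerialThrottle` is a hypothetical name, and the 1.0-second default simply mirrors the ~1s observation above, so it may need tuning.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SerialThrottle:
    """Run calls one at a time, spacing their start times by a minimum interval."""

    def __init__(
        self,
        min_interval_s: float = 1.0,  # assumption: taken from the ~1s observation
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        """Wait out the remainder of the interval, then invoke `fn`."""
        if self._last_start is not None:
            wait = self._min_interval_s - (self._clock() - self._last_start)
            if wait > 0:
                self._sleep(wait)
        self._last_start = self._clock()
        return fn()
```

Usage would look like `throttle.call(lambda: search(query))`, where `search` stands in for whatever function issues the request. The clock and sleep functions are injectable so the spacing logic can be tested without real delays.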