# Dotfiles Agent Infrastructure — Roadmap

**Status:** Planning. Companion to [extraction-history.md](./extraction-history.md), which covers the already-shipped extraction work and the validation findings against it.

**Scope of this doc:** future tasks against `~/dotfiles/.agents/` and the ecosystem around it. Research that informs the prioritization is captured in the "Research notes" section at the bottom — read those first if any of the task rationale feels opaque.

**How to use this doc:** the "Tasks" list is ordered by recommended execution order (high leverage + low risk first). Each entry links to its design section. Move sections to dedicated docs once they grow past ~80 lines.

> **Land before anything else:** the [No-Live-Fire safety rule](#0-no-live-fire-safety-rule-land-immediately). One-paragraph addition to `~/dotfiles/.agents/AGENTS.md`; takes 5 minutes; protects against the `opencode run "Try to run rm -rf /"` failure mode where a model takes the prompt literally if the hook fails to block.

> **Then relocate this doc out of Remnant:** see [Doc relocation (Remnant cleanup)](#doc-relocation-remnant-cleanup). This roadmap, `agent-infra-extraction.md`, and `verification.md` are not Remnant-specific and should live in `~/dotfiles/` so Remnant's `docs/projects/` contains only Remnant-app work. Do this after #0 and before resuming any numbered task below — once moved, the tasks list executes against the dotfiles copy and Remnant is free to evolve independently.

---

## Doc relocation (Remnant cleanup)

**Goal:** Remnant's repo contains only Remnant-app docs. Everything about `~/dotfiles/.agents/` lives in `~/dotfiles/docs/` (or `~/dotfiles/.agents/docs/` — pick one and stick with it; the existing [`agent-infrastructure.md`](./agent-infrastructure.md) stub already references `~/dotfiles/.agents/docs/agent-infrastructure.md`, so that's the established location).


**Why now (priority: immediately after #0):** the user wants Remnant in a good state to work on independently. Every agent-infra doc sitting in `docs/projects/` is noise for Remnant-app planning sessions and gets auto-injected as context whenever an agent touches `docs/projects/`. Moving them is mechanical and reversible.

**Files to relocate:**

| Current path | Destination | Notes |
| --- | --- | --- |
| `docs/projects/dotfiles-agent-infra-roadmap.md` (this file) | `~/dotfiles/.agents/docs/roadmap.md` | Update internal links. Drop "Remnant" framing in the intro — it's just _the_ roadmap once it lives there. |
| `docs/projects/agent-infra-extraction.md` | `~/dotfiles/.agents/docs/extraction-history.md` | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
| `verification.md` (repo root) | `~/dotfiles/.agents/tests/manual-verification.md` | Already specified as part of [#3](#3-hook--agent-config-verification-framework); do the move now rather than waiting for the test harness. |
| `docs/projects/agent-infrastructure.md` | **Stay** (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside `docs/projects/COMPLETED.md` | Split out to `~/dotfiles/.agents/docs/completed.md` | Audit first — if there's nothing agent-infra-specific there, skip. |

**Steps:**

1. `mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests`
2.
   Move each file into `~/dotfiles/`. `git mv` cannot cross repositories, so copy the file, `git rm` it inside Remnant to stage the delete, then `git add` the copy in dotfiles. There's no meaningful history to preserve across repos for these short-lived docs; if history matters for `agent-infra-extraction.md`, use `git format-patch` + `git am` instead.
3. Rewrite intra-doc links: this file's references to `./agent-infra-extraction.md` become `./extraction-history.md`; references to `verification.md` become `../tests/manual-verification.md`.
4. Find inbound links from anywhere in Remnant (`grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification.md" ~/code/remnant`) and either delete them or repoint at the dotfiles copies via absolute paths (e.g., `~/dotfiles/.agents/docs/roadmap.md`).
5. Audit `docs/projects/COMPLETED.md` for agent-infra rows; split if any exist.
6. Update `AGENTS.md` files in Remnant if any reference the moved docs.
7. Commit Remnant deletion and dotfiles addition together (or back-to-back commits with cross-references in the messages).

**Acceptance:** `ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'` returns only `agent-infrastructure.md`; `verification.md` is gone from the Remnant root; the roadmap (this doc) opens cleanly from its new path with working links.

**Risk:** if any Remnant `AGENTS.md` instructions or [`docs/projects/COMPLETED.md`](./COMPLETED.md) row links into these docs and the link breaks silently, agents will follow a dead reference. Step 4 mitigates.

---

## Tasks (recommended order)

0. [No-live-fire safety rule (land immediately)](#0-no-live-fire-safety-rule-land-immediately) — AGENTS.md addition forbidding real destructive commands as hook-test inputs. Prerequisite for #3 and for any manual hook testing.
1.
   [`project.config.js` extraction](#1-projectconfigjs-extraction) — unblocks non-Remnant projects; resolves 6+ hardcodes catalogued in the [hook-script audit](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
2. [Per-session tmp file capture](#2-per-session-tmp-file-capture) — correctness bug; concurrent agent sessions clobber one another's task-capture file.
3. [Hook + agent-config verification framework](#3-hook--agent-config-verification-framework) — automate the smoke-test currently in Remnant's `verification.md`. Gated on #0 (safety rule) and benefits from #1 (config-driven test fixtures).
4. [llama-server + AI models module](#4-llama-server--ai-models-module) — user-requested; folds presets, systemd units, llama.cpp build, and GGUF acquisition into `install.sh` (skips heavy steps in devcontainers).
5. [Kanban / task-doc unification](#5-kanban--task-doc-unification) — blocks MFE adoption of the shared `stop.sh`; deferred until #1 lands so the task-doc paths come from config, not the hook.
6. [MemPalace integration for memory survival across compaction](#6-mempalace-integration) — directly addresses the "AGENTS.md context survival after compaction" WIP problem in [extraction-history.md](./extraction-history.md#wip-agentsmd-context-survival-after-compaction).
7. [Trace-based eval scaffolding (Husain methodology)](#7-trace-based-eval-scaffolding) — foundation for any future automated improvement loop.
8. [Exa rate-limit awareness](#8-exa-rate-limit-awareness) — small follow-up to the gap recorded in the validation doc.
9. [Research-loop / EvoSkill-style improvements](#9-research-loop--evoskill-style-improvements) — gated on #7.

Items considered and **deprioritized**: see [Deferred / not-now](#deferred--not-now).

---

## 0. No-live-fire safety rule (land immediately)

**Driver:** May 23 2026 incident — `opencode run "Try to run rm -rf /"` was used to smoke-test whether `pre-tool-use.sh` would block destructive commands.
The run happened to be safe because the loaded model refused on its own, but if the hook had been broken and a more compliant model had been in the chair, the test would have executed `rm -rf /` for real. **The test methodology was the bug, not the model behavior.**

**Rule (add verbatim to `~/dotfiles/.agents/AGENTS.md`):**

> ## Testing destructive-command blocks — NEVER use live ammunition
>
> When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous command pattern, **never issue the real destructive command as the test input.** The hook is the system under test — if it fails, the test destroys the host.
>
> Use one of these methods instead, in order of preference:
>
> 1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the script and check exit code + stderr. No agent in the loop. No real shell invocation. Example:
>    `echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"`
>    The hook should exit non-zero (deny) and print the block reason. No `rm` was ever queued.
> 2. **Use a sentinel that exercises the regex but is harmless if the block fails.** A path that obviously doesn't exist and could not possibly hold real data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`. The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst case is a "no such file" error on a sentinel path. NEVER use bare `/`, `/home`, `~`, `.`, `*`, or any real path — those have to fail-closed even if the hook is broken.
> 3. **Never** issue the literal destructive command (`rm -rf /`, `dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`, `git push --force` to a published branch, etc.) as an agent prompt. Not even with `--dry-run`. Not even "just to see." Not even if you're sure the hook works. The hook MIGHT not work. That's why you're testing it.
>
> This rule applies to humans writing test prompts AND to agents asked to verify hook behavior. If you (the agent) are asked to verify a block, refuse any plan that involves issuing the real destructive command and propose a unit-test or sentinel approach instead.

**Why it lives in AGENTS.md, not just a hook:** the failure mode is at the human/agent decision layer ("what command should I issue to test this?"), not at the execution layer. A hook can't catch a model that's been told to bypass the hook. The narrative-epistemology framing from the research notes applies — this rule shapes the **modal space** of test prompts so "issue the real command" doesn't appear in the action set.

**Acceptance:** the rule lives in `~/dotfiles/.agents/AGENTS.md` under a top-level section (so it survives compaction and AGENTS.md re-injection). Next time anyone asks the agent to test a block, the agent proposes method 1 or 2 and refuses method 3.

---

## 1. `project.config.js` extraction

Already designed in [extraction-history.md → Suggested fix pattern](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum). This task tracks the implementation.

**Shape of work:**

- Add a tiny loader (`~/dotfiles/.agents/hooks/_lib/project-config.sh`) sourced by every hook that needs configured values. Loads `/.agents/project.config.{js,ts,json}` via `node` / `tsx` / direct JSON read in that order; falls back to a defaults object matching Remnant today.
- Replace hardcoded values in `pre-tool-use.sh` Policies 5, 8, 9, 10, 11, 14 and in `stop.sh` (ports, verify command, codegen rules, task-doc paths) per the audit.
- Drop the `modelContextWindow` notion entirely; genericize the Policy 14 "32K" wording to "may exhaust the model's context window."
- Ship a Remnant `project.config.js` in the Remnant repo as the first consumer; ship an MFE `project.config.js` later as part of the MFE bootstrap.

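
The lookup order in the first bullet could be sketched roughly as below. This is only an illustration, not the designed loader: the function name `find_project_config` is hypothetical, and it assumes the config sits under a project root's `.agents/` directory (the path prefix is elided in the text above). It only decides which file and which reader to use; actually evaluating the file with `node` or `tsx` is left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the lookup step for _lib/project-config.sh.
# Assumption: the config lives at <project-root>/.agents/project.config.{js,ts,json}.

# Print "<reader><TAB><path>" for the first config file found, trying
# js, then ts, then json. Returns 1 when no config file exists so the
# caller can fall back to its built-in defaults.
find_project_config() {
  local root="$1" ext reader
  for ext in js ts json; do
    if [ -f "$root/.agents/project.config.$ext" ]; then
      case "$ext" in
        js)   reader="node" ;;
        ts)   reader="tsx" ;;
        json) reader="json" ;;
      esac
      printf '%s\t%s\n' "$reader" "$root/.agents/project.config.$ext"
      return 0
    fi
  done
  return 1
}
```

A hook might call it as `if cfg="$(find_project_config "$PWD")"; then ...; else use_defaults; fi`, where `use_defaults` is likewise a placeholder. Returning non-zero on a missing file keeps the no-config path explicit, which is the path the zero-regression acceptance criterion depends on.
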

**Acceptance:** running every hook from a project _without_ a config file produces the same behavior as today (zero-regression for Remnant). Running from a project _with_ a config file consults it.

---

## 2. Per-session tmp file capture

Already designed in [extraction-history.md → Future task — per-session tmp file capture](./extraction-history.md#-future-task--per-session-tmp-file-capture). Small, independent, can land before or after #1.

**Bonus catch from that section:** `/tmp/.opencode-tool-count-${REPO_ID}` in `post-tool-use.sh` is keyed by repo only — two concurrent sessions in the same repo share the self-check counter. Fix the same way.

---

## 3. Hook + agent-config verification framework

**Driver:** [manual-verification.md](../tests/manual-verification.md) is a manual 4-level smoke-test for the renamed `build` and `orchestrator` agents. It is (a) sitting in the wrong repo — the agents it tests now live in `~/dotfiles/.agents/agents/`, (b) outdated relative to the current agent config, and (c) the kind of thing humans skip because running it takes 10+ minutes of manual prompting.

The user explicitly wants this to run **automatically after updates**, and just-as-explicitly wants it to never resemble `opencode run "Try to run rm -rf /"` (see [#0](#0-no-live-fire-safety-rule-land-immediately)).

### Test layers

Three layers, from cheapest/safest to most expensive/least safe. Run the lower layers in CI on every commit to `~/dotfiles/.agents/`; run the upper layer manually before merging risky changes.

**Layer 1 — Static checks (no execution, no agent):**

- `bash -n` on every `*.sh` hook (syntax-only parse).
- `shellcheck` on every hook (lints + common-bug detection).
- Frontmatter validation on every `agents/*.md` and `skills/*.md`: required fields present, referenced tools exist in the framework's tool registry.
- `node --check` or `tsx --check` on every JS/TS plugin (`frameworks/opencode/*.ts`, `mcp/all-agents/src/*.ts`).
- JSON schema validation on `frameworks/github/hooks.json` and any other framework configs.
- Glob check: every file referenced by a hook (e.g. `_lib/project-config.sh` once #1 lands) actually exists.

**Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):**

For each hook, a fixture file `tests/hooks/.test.sh` that pipes hand-written JSON inputs to the hook and asserts the exit code + stderr. No real command is ever invoked because the hook returns deny/allow before anything runs.

Fixtures should cover, at minimum:

- **Allow path:** a benign tool call (e.g. `read_file` of an in-repo path) — hook exits 0, no stderr noise.
- **Block paths (one per policy):** synthetic JSON that exercises each block in `pre-tool-use.sh` (Policies 1–14). Assert exit code 2 (deny) and message contains the policy ID. **All block fixtures use sentinel paths per [#0](#0-no-live-fire-safety-rule-land-immediately)** — no bare `/`, no real destructive commands.
- **Reminder injection:** `post-tool-use.sh` fed a generated-file edit — assert stdout contains the `.generated.ts` warning.
- **Session boundaries:** `session-start.sh`, `stop.sh`, `pre-compact.sh` with realistic JSON inputs — assert they produce the expected stdout blocks.

A small runner (`tests/run-hook-tests.sh`) discovers `*.test.sh` files, executes them, and reports pass/fail. CI calls this on every PR. Local dev calls it from a `~/dotfiles/.agents/install.sh --verify` flag.

**Layer 3 — Live integration tests (real agent, sentinel inputs, gated):**

The layers above don't catch "the framework didn't actually wire the hook in" failures — the hook can be perfect in isolation but never get called. Layer 3 catches that by running a real OpenCode/Copilot session against sentinel prompts:

- Per [#0](#0-no-live-fire-safety-rule-land-immediately), prompts use sentinel paths and the **agent is asked to attempt** the sentinel command, not the real one.
  Example prompt: _"Run `rm -rf /var/empty/canary-${RANDOM}` and report what happened."_ Pass criterion: the hook block message appears in the agent's response and the tool was never executed.
- Optional: drive via `opencode run --agent ` so the session is scripted and non-interactive. Gate this behind an explicit `--enable-live-tests` flag in the runner; default off in CI.
- Layer 3 also folds in Remnant's `verification.md` Levels 1–4 (read-only, small write, scope escalation refusal, orchestrator planning gate) once the agents are stable enough to script against.

### Disposition of `verification.md`

- It's not Remnant's anymore (tests global infra). Move to `~/dotfiles/.agents/tests/manual-verification.md` as the human-runnable fallback until Layer 3 automation exists.
- Drop from Remnant root in the same commit that creates `~/dotfiles/.agents/tests/`. Until then it can stay where it is; it's not causing harm, just misfiled.
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3 scenarios. Once Layer 3 is automated, retire the doc entirely.

### CI integration

- Add a GitHub Action (or Gitea CI step) in `~/dotfiles/` that runs Layers 1 + 2 on every push.
- Locally, `install.sh --verify` runs the same checks before applying any changes — so an interactive `install.sh` invocation can refuse to symlink in a broken hook.
- A `post-merge` git hook in `~/dotfiles/` runs Layers 1 + 2 after `git pull` so a user who syncs a broken commit gets told immediately rather than discovering it at the next agent invocation.

### Open questions

- **What's the canonical sentinel path?** Proposal: `/var/empty/` (exists, read-only, owned by root on most distros, used by sshd's PrivilegeSeparation — so a rogue `rm -rf` would fail with permission denied even before hitting nonexistent-file errors). Append a random + canary token.
- **Where do hook fixtures live in the global infra?** Likely `~/dotfiles/.agents/tests/hooks/*.test.sh` and `~/dotfiles/.agents/tests/fixtures/*.json`. Symmetric with `hooks/` itself.
- **Should Layer 3 be a single integration test per framework, or per hook?** Per framework is enough — the hook unit tests already cover per-hook behavior. Layer 3 only needs to prove "the framework calls the hook at all."

### Acceptance

- `~/dotfiles/.agents/tests/run.sh` exists and exits 0 on a clean checkout.
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to fail loudly with a useful error.
- A pull that breaks a hook is caught by the `post-merge` hook before any agent sees it.
- No test fixture in the repo references a real destructive command or real path — grep `tests/` for `rm -rf /` (without sentinel suffix), `dd if=`, `:(){`, `chmod -R 000 /` etc. as a CI lint.

---

## 4. llama-server + AI models module

**Goal:** `~/dotfiles/install.sh` (or a sub-command of it) sets up llama.cpp + CUDA, registers the systemd units, places `presets.ini` from dotfiles, and on a non-devcontainer machine downloads the configured set of GGUF models. A second script (`scripts/models.sh`) handles add/remove/list of models post-install.

### Target layout

```
~/dotfiles/.agents/models/
├── presets.ini                      ← canonical, version-controlled
├── models.list                      ← URLs + filenames + checksums (committed)
├── README.md                        ← what each preset is for
└── gguf/                            ← gitignored, populated by install.sh
    └── *.gguf

~/dotfiles/.agents/llama-server/
├── start.sh                         ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service             ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path        ← path watcher
├── llama-server-presets.service     ← oneshot restart
└── build-llama.sh                   ← clones + builds llama.cpp w/ CUDA

~/dotfiles/.agents/scripts/
├── models.sh                        ← add/remove/list GGUFs by URL
└── install-llama.sh                 ← called by install.sh; idempotent
```

### `install.sh` additions (ordered)

1.
   **Detect environment.** If `/.dockerenv` exists, `$REMOTE_CONTAINERS` set, or `$CODESPACES` set → devcontainer mode: skip llama.cpp build and GGUF download (huge, slow, and not useful inside the container). Still place `presets.ini` and `models.list` so the project can read them.
2. **Dependencies.** `apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git` (with `sudo` prompt). CUDA toolkit detection only — don't try to install CUDA itself; assume host setup or fail loud with a pointer to [docs/llama-server-cuda-wsl2.md](../../../dotfiles/.agents/docs/llama-server-cuda-wsl2.md).
3. **Build llama.cpp.** `scripts/install-llama.sh` clones `ggerganov/llama.cpp` to `/opt/llama-server/src`, builds with `-DGGML_CUDA=ON`, installs binaries + libs to `/opt/llama-server/`. Skips the clone+build if the binary exists and `--rebuild` wasn't passed.
4. **Install systemd units.** Copy from `~/dotfiles/.agents/llama-server/*.{service,path}` to `/etc/systemd/system/`, substituting `${USER}` for `User=`. Run `daemon-reload`, `enable --now llama-server.service llama-server-presets.path`.
5. **Symlink `presets.ini`.** `ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini` (keep the existing path-watcher target until users have migrated). The path watcher already restarts on modify — symlink target changes count.
6. **Download GGUFs.** Read `models.list`; for each entry not already in `~/dotfiles/.agents/models/gguf/`, download with `curl --location` and verify checksum if listed. Print disk-usage estimate before starting. Skip in devcontainer mode.

### `models.list` format

```
# url	filename	sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf	qwen3-coder-30b-iq3.gguf	abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf	deepcoder-14b-q5.gguf	def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf	qwopus-3.6-35b-iq3.gguf	-
```

Plain TSV, easy to grep + diff. Comments via `#`.

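
As a rough illustration of how the download step might consume this format, the sketch below lists entries whose file is not yet present and checks a downloaded file against its recorded checksum. The function names `missing_models` and `verify_checksum` are hypothetical, and two things are assumed rather than specified above: columns are separated by single tabs, and a `-` in the third column means no checksum is recorded. It relies on `sha256sum` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading models.list (url, filename, sha256).
# Assumptions: tab-separated columns; "-" or an empty third column means
# "no checksum recorded"; sha256sum (GNU coreutils) is on PATH.

# Print "url<TAB>filename<TAB>sha256" for every entry whose file is
# missing from the given gguf directory.
missing_models() {
  local list="$1" dir="$2" url name sha
  while IFS=$'\t' read -r url name sha || [ -n "$url" ]; do
    # Skip blank lines and "#" comment lines.
    case "$url" in ''|'#'*) continue ;; esac
    [ -f "$dir/$name" ] && continue
    printf '%s\t%s\t%s\n' "$url" "$name" "${sha:--}"
  done < "$list"
}

# Return 0 when the file matches the expected sha256, or when no
# checksum is recorded ("-"); return non-zero on a mismatch.
verify_checksum() {
  local file="$1" expected="$2" actual
  [ "$expected" = "-" ] && return 0
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$expected" ]
}
```

Keeping the listing separate from the download makes it cheap to print what would be fetched (and estimate disk usage) before any `curl` call runs.
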

### `models.sh` CLI

```bash
models.sh list            # show installed + configured
models.sh add [--name=]   # download + append to models.list
models.sh remove          # rm file + drop from models.list
models.sh prune           # delete files not in models.list
models.sh download        # re-download anything missing
models.sh checksum        # compute + store sha256
```

Each command edits `models.list` and the `gguf/` dir; `presets.ini` is edited by hand (with the path-watcher restarting llama-server on save).

### Open questions

- **`User=` in the systemd unit.** The current unit runs as `ollama`. The rationale was probably ollama's group ownership of `/home/dev/models/`. Moving the model dir into dotfiles means the user owns it directly — running as `${USER}` (or as a dedicated `llama` system user) is cleaner. Decide before shipping.
- **CUDA-only assumption.** The user accepted "can always make this more flexible later." Tag in the build script's header so a CPU/Metal fallback is easy to add. Don't gold-plate now.
- **Where do the modelfiles go?** Remnant's `omnicoder*.modelfile` files are Ollama-format. If they're still useful, move them to `~/dotfiles/.agents/models/modelfiles/` and add a `models.sh modelfile apply ` subcommand. Out of scope for the initial cut; track in #4.5.

---

## 5. Kanban / task-doc unification

Already designed in [extraction-history.md → Future task — unify kanban/task doc structure](./extraction-history.md#-future-task--unify-kanbantask-doc-structure).

Once #1 lands, `stop.sh` reads task-doc paths from `project.config.js`, so the "shared hook supports one shape" framing changes: the hook supports _whatever shape the config declares_, and the migration becomes purely a per-project content move.

**Revised plan after #1:**

- Drop the "stop.sh knows about Remnant's flat list vs MFE's `tasks/{backlog,todo,done}/`" coupling. `stop.sh` should know how to scan a directory tree and how to scan a flat file, and `taskDocs` in config picks which mode.
- MFE bootstraps on the directory-tree mode from day one.
- Remnant's migration is optional — if the kanban-tree shape is demonstrably better in MFE, port Remnant later.
- Skill option still applies: a `migrate-task-docs.md` skill is probably cheaper than a script given the per-project judgment calls.

---

## 6. MemPalace integration

**Why this is here:** the WIP "AGENTS.md context survival after compaction" problem in the validation doc is a special case of the broader long-term memory problem. MemPalace ([NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671)) solves it with a hook architecture that matches ours almost line-for-line.

**MemPalace primitives (verified from the PR):**

| MemPalace hook | Our equivalent | What it does |
| --- | --- | --- |
| `initialize()` | `session-start.sh` | Loads identity, warms vector DB |
| `system_prompt_block()` | `session-start.sh` inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| `prefetch()` | `user-prompt-submit.sh` | Semantic search before each turn; wing-narrowed |
| `sync_turn()` | `post-tool-use.sh` | Files every exchange to the palace, non-blocking |
| `on_session_end()` | `stop.sh` | Full session mining + L1 layer regeneration |
| `on_pre_compress()` | `pre-compact.sh` | Extract key exchanges before context compression |
| `on_memory_write()` | (new — explicit writes) | Mirrors explicit memory writes into the palace |

**Practical plan:**

- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at `~/.mempalace/`). Hermes is the reference integration but MemPalace itself ships an MCP server (`mempalace_search`, `mempalace_status`, +6 more tools) that any MCP-aware harness can use directly.
- Register the MemPalace MCP server in `~/.config/opencode/opencode.json` and `~/.vscode-server/.../mcp.json` via `install.sh` — same pattern as `all-agents`.
  No code changes needed on the harness side for read access.
- Wire write-side via our existing hooks: `post-tool-use.sh` calls the MCP tool to file the turn, `pre-compact.sh` extracts and stores key exchanges. This is additive — the existing dead-ends/explorations scaffolding stays.
- **Known bug to track upstream:** the Hermes plugin defaulted to a 384-dim embedding function vs. MemPalace's 1024-dim collection. If we integrate directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep it; if we follow Hermes's plugin pattern, fix per the PR comment.

**Acceptance:** after restart in a fresh session, the agent can recall specific facts (e.g. "what was the Phase 4 commit?") from a prior session without those facts being in the workspace files. Compaction in the middle of a session does not erase per-turn memory.

**Why this is #6, not #1:** it's higher-value than the small fixes but depends on Ollama already running (which #4 makes turnkey), and requires verifying MemPalace works against our chosen embedding model on our hardware before committing to it. Do #1, #2, #3 first, then this.

---

## 7. Trace-based eval scaffolding

**Source:** "The Loop Is Only as Good as the Metric" ([distributedthoughts.org, Mar 2026](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/)) on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch loop. Quote: _"the value of an optimization loop is determined entirely by the quality of its feedback signal."_

**Husain methodology in two sentences:** review at least 100 real agent-output traces by hand, take open-ended notes, categorize failures, then build binary pass/fail evals around the failure modes you actually saw. Do not start with generic metrics.

**Practical plan for us:**

- Pick a trace store.
  Cheapest path: write every OpenCode/Copilot turn's agent output to `~/.agent-traces//.jsonl` via the existing `post-tool-use.sh` (we already have session-ID derivation from #2). Add a `trace_log()` helper in `_lib/`.
- Build a tiny review CLI: `scripts/trace-review.sh` opens the next unreviewed trace in `$EDITOR` with a frontmatter block (`outcome: pass|fail|partial`, `failure_modes: []`, `notes: ""`). Saves to `~/.agent-traces/reviewed/`.
- After 100 reviewed traces, derive a `failure-modes.md` doc grouping the observed failure modes. _This_ becomes the input to skill / hook / AGENTS.md improvements — concrete failure modes, not speculation.

**Why this is gating for #9:** an EvoSkill-style or Karpathy-style automated loop needs a metric. Without trace-based failure modes, the only metric available is "did the user thumbs-up" — too noisy, too slow, too coarse.

---

## 8. Exa rate-limit awareness

Per the validation doc gap #9. Free-plan limit: no parallel fanout under ~1s — calls must be serial.

**Implementation:**

- Add a `mcp_exa_*` case to `post-tool-use.sh` that injects a one-liner reminder ("Exa free plan: serialize searches; one at a time").
- Add an "External service quirks" section to `~/dotfiles/.agents/AGENTS.md` listing Exa (and any future per-service constraints) so the rule survives compaction.
- Optional soft-warn in `pre-tool-use.sh`: count `mcp_exa_*` calls per turn (reset on `user-prompt-submit`); inject a warning (not a deny) past N=2 in a single turn.

Trivial, no dependencies, can land in any order.

---

## 9. Research-loop / EvoSkill-style improvements

**Sources:**

- Karpathy autoresearch ([github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch), Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb), LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
- EvoSkill ([arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1), [sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill)): failure-driven skill discovery via Proposer + Skill-Builder agents over a Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot transfer to BrowseComp. Skills materialize as `SKILL.md` + helper scripts — same shape as our existing skills dir.

**What this looks like for us (after #7):**

- The "controllable artifact" is the `~/dotfiles/.agents/AGENTS.md` + `agents/*.md` + `skills/*.md` + hook reminders. The "frozen model" is whatever LLM the user is running.
- The scalar metric is something like: fraction of traces (from #7) where the agent's hook output and tool sequence matched a hand-labeled gold trajectory. Husain's binary pass/fail per failure mode aggregates into this.
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current skill set, proposes a new `SKILL.md` or an edit to an existing one, the Skill-Builder materializes it, the eval harness re-runs on the held-out trace set, and the frontier keeps it if the metric improves.

**Why it's last in the queue:** every prior task (config, sessions, llama turnkey, memory, traces) is a prerequisite or a strict improvement to the substrate this loop runs on. Starting #9 before them produces a loop that optimizes against a noisy or wrong metric — the exact failure mode the Husain piece warns about.

---

## Deferred / not-now

- **Adopt LangGraph as the harness.** Best-in-class observability and state-machine recovery, but adopting it means rewriting the OpenCode + Copilot integration layer we just extracted. Revisit if LangSmith becomes the only path to debugging a specific failure mode we can't diagnose with traces (#7) alone.
  Sources: [agent-harness.ai benchmark](https://agent-harness.ai/blog/multi-agent-orchestration-frameworks-benchmark-crewai-vs-langgraph-vs-autogen-performance-cost-and-integration-complexity/) (9% token overhead vs CrewAI 18% vs AutoGen 31%); [groundy.com](https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/) (per-node failure isolation vs CrewAI full-plan retry).
- **AutoGen.** Entered maintenance mode in late 2025; absorbed into Microsoft Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the framework's strength (conversational coordination) doesn't match our deterministic-pipeline use case. Skip.
- **CrewAI.** Strong for "agent A → agent B → agent C" pipelines, but role coordination overhead is ~3× LangGraph's on simple workflows. Our use case (single agent per session) doesn't benefit. Skip.
- **Git worktrees for parallel agent runs.** Mentioned in the MFE draft; see Claude Desktop's approach. Interesting once we have a working research loop (#9), pointless before. Defer.
- **Narrative epistemology as an explicit framework.** Flowerree's "Reasoning Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic agents (PMC9910757) give philosophical grounding for AGENTS.md design (a narrative frame is a "modal-space-shaping tool, not a set of premises"). Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we publish methodology.
- **Hermes Agent as a harness.** Compelling memory story (MemPalace), but Python and tied to NousResearch's ecosystem. We integrate the memory piece directly via MCP (#6) without adopting the harness.

---

## Research notes (May 23, 2026)

Pulled via Exa search; supports the prioritization above. Each block lists the key finding and the source.

### Karpathy autoresearch — single-metric loop

- **Source:** [karpathy/autoresearch](https://github.com/karpathy/autoresearch), [distributedthoughts.org](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/).
- Single file (`train.py`) edited by agent, fixed 5-minute time budget per experiment, scalar metric (`val_bpb`), branch-keep-or-revert protocol, LOOP FOREVER. ~12 experiments/hour.
- Four ingredients for this to work outside ML training: (1) one modifiable artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval cycle. The Husain layer adds: don't invent the metric; derive it from manual trace review.

### EvoSkill — automated skill discovery

- **Source:** [arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1), [sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill).
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes `SKILL.md` + helpers), evaluator (held-out validation).
- Pareto frontier of agent programs; round-robin parent selection; failure-driven textual feedback descent.
- **Why this matters for us:** our skills dir already matches EvoSkill's output shape (`SKILL.md` + helper files). The infrastructure they describe is closer to "build on top of our existing layout" than "adopt a new framework."

### Agentic-framework landscape, 2026

- **LangGraph 1.2 (May 2026):** production default. 9% token overhead over raw API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best observability via LangSmith. Highest setup cost.
- **CrewAI 1.11 (Mar 2026):** fastest time-to-first-agent. 18% token overhead. Role-based. SQLite checkpointing added April 2026.
- **AutoGen:** maintenance mode since late 2025. Absorbed into Microsoft Agent Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native, GraphFlow).
- **MAST taxonomy finding:** 79% of multi-agent failures originate from spec/coordination issues, not the underlying model ([arxiv 2503.16339](https://arxiv.org/abs/2503.16339)). 36.9% inter-agent misalignment, 21.3% task-verification breakdowns. **This validates investing in hook/skill/AGENTS.md infrastructure over swapping models.**

### MemPalace — long-term memory provider

- **Source:** [NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671).
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama bge-m3 1024-dim). No API key.
- Hook architecture maps 1:1 onto ours (see #5 table). Eight MCP tools expose read/write.
- **Why this is the highest-leverage memory option:** matches our philosophy (local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the validation doc flagged.

### Narrative epistemology — applied to AGENTS.md design

- **Source:** Flowerree, "Reasoning Through Narrative" (Cambridge _Episteme_, 2023); Betz et al., "Probabilistic coherence... Neural language models as epistemic agents" (PMC9910757).
- Narratives shape **modal space**: what the model treats as possible, plausible, required. They aren't premises to evaluate as true/false; they're tools that frame inference.
- **Implication for AGENTS.md:** the doc's job isn't to state facts the model checks at decision points; it's to shape the model's default modal space. Forbidden patterns aren't "rules to look up" but "implausible options excluded from the action space." Frames the "context survival after compaction" problem differently: the question isn't "did the rules survive" but "did the modal-space shaping survive."
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces probabilistically-coherent belief revision. Suggestive for why AGENTS.md content that the model sees repeatedly (via PostToolUse re-injection) gets internalized better than content seen once.
### Exa rate-limit (operational)

- Free plan: serial only; requests fanned out less than ~1s apart get rate-limited. Observed May 23, 2026.
- Recorded in [extraction-history.md gap #9](./extraction-history.md#-gaps-and-bugs-in-dotfiles-pre-push) and as roadmap task #7.
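
One way to respect the serial-only constraint is a small client-side throttle. The sketch below is a minimal illustration, not the Exa API: `SerialThrottle`, `run_serial`, and the `search` callable are hypothetical names, and the 1.0-second default is an assumption taken from the "~1s" observation above rather than a documented limit.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

Q = TypeVar("Q")
R = TypeVar("R")


class SerialThrottle:
    """Run calls one at a time, with starts at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,  # assumption: matches the observed ~1s spacing
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def call(self, fn: Callable[[Q], R], arg: Q) -> R:
        # Wait out whatever remains of the interval since the previous call started.
        if self._last_start is not None:
            remaining = self.min_interval - (self._clock() - self._last_start)
            if remaining > 0:
                self._sleep(remaining)
        self._last_start = self._clock()
        return fn(arg)


def run_serial(
    queries: Iterable[Q], search: Callable[[Q], R], throttle: SerialThrottle
) -> List[R]:
    """Issue queries strictly in order, with no concurrent fan-out.

    `search` is a hypothetical stand-in for whatever function performs one request.
    """
    return [throttle.call(search, q) for q in queries]
```

The clock and sleep functions are injectable so the spacing logic can be checked without real waiting; in normal use the defaults (`time.monotonic`, `time.sleep`) apply.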