# Dotfiles Agent Infrastructure — Roadmap

**Status:** Planning. Companion to
[extraction-history.md](./extraction-history.md), which covers the
already-shipped extraction work and the validation findings against it.

**Scope of this doc:** future tasks against `~/dotfiles/.agents/` and the
ecosystem around it. Research that informs the prioritization is captured in the
"Research notes" section at the bottom — read those first if any of the task
rationale feels opaque.

**How to use this doc:** the "Tasks" list is sorted in recommended execution
order (high leverage + low risk first). Each entry links to its design section.
Move sections to dedicated docs once they grow past ~80 lines.

> **Land before anything else:** the
> [No-Live-Fire safety rule](#0-no-live-fire-safety-rule-land-immediately).
> One-paragraph addition to `~/dotfiles/.agents/AGENTS.md`; takes 5 minutes;
> protects against the `opencode run "Try to run rm -rf /"` failure mode where a
> model takes the prompt literally if the hook fails to block.

> **Then relocate this doc out of Remnant:** see
> [Doc relocation (Remnant cleanup)](#doc-relocation-remnant-cleanup). This
> roadmap, `agent-infra-extraction.md`, and `verification.md` are not
> Remnant-specific and should live in `~/dotfiles/` so Remnant's
> `docs/projects/` contains only Remnant-app work. Do this after #0 and before
> resuming any numbered task below — once moved, the tasks list executes against
> the dotfiles copy and Remnant is free to evolve independently.

---

## Doc relocation (Remnant cleanup)

**Goal:** Remnant's repo contains only Remnant-app docs. Everything about
`~/dotfiles/.agents/` lives in `~/dotfiles/docs/` (or `~/dotfiles/.agents/docs/`
— pick one and stick with it; the existing
[`agent-infrastructure.md`](./agent-infrastructure.md) stub already references
`~/dotfiles/.agents/docs/agent-infrastructure.md`, so that's the established
location).

**Why now (priority: immediately after #0):** the user wants Remnant in a good
state to work on independently. Every agent-infra doc sitting in
`docs/projects/` is noise for Remnant-app planning sessions and gets
auto-injected as context whenever an agent touches `docs/projects/`. Moving them
is mechanical and reversible.

**Files to relocate:**

| Current path                                                | Destination                                            | Notes                                                                                                                                                                |
| ----------------------------------------------------------- | ------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs/projects/dotfiles-agent-infra-roadmap.md` (this file) | `~/dotfiles/.agents/docs/roadmap.md`                   | Update internal links. Drop "Remnant" framing in the intro — it's just _the_ roadmap once it lives there.                                                            |
| `docs/projects/agent-infra-extraction.md`                   | `~/dotfiles/.agents/docs/extraction-history.md`        | Validation log for the already-shipped extraction. Keep as historical record; not active planning.                                                                   |
| `verification.md` (repo root)                               | `~/dotfiles/.agents/tests/manual-verification.md`      | Already specified as part of [#3](#3-hook--agent-config-verification-framework); do the move now rather than waiting for the test harness.                           |
| `docs/projects/agent-infrastructure.md`                     | **Stay** (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside `docs/projects/COMPLETED.md`     | Split out to `~/dotfiles/.agents/docs/completed.md`    | Audit first — if there's nothing agent-infra-specific there, skip.                                                                                                   |

**Steps:**

1. `mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests`
2. Move each file into `~/dotfiles/`. `git mv` cannot cross repositories, so
   copy the file into dotfiles and `git add` it there, then `git rm` it inside
   Remnant to stage the delete. There's no meaningful history to preserve
   across repos for these short-lived docs; if history matters for
   `agent-infra-extraction.md`, use `git format-patch` + `git am` instead.
3. Rewrite intra-doc links: this file's references to
   `./agent-infra-extraction.md` become `./extraction-history.md`; references to
   `verification.md` become `../tests/manual-verification.md`.
4. Find inbound links from anywhere in Remnant
   (`grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification\.md" ~/code/remnant`)
   and either delete them or repoint at the dotfiles copies via absolute paths
   (e.g., `~/dotfiles/.agents/docs/roadmap.md`).
5. Audit `docs/projects/COMPLETED.md` for agent-infra rows; split if any exist.
6. Update `AGENTS.md` files in Remnant if any reference the moved docs.
7. Commit Remnant deletion and dotfiles addition together (or back-to-back
   commits with cross-references in the messages).

**Acceptance:** `ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'`
returns only `agent-infrastructure.md`; `verification.md` is gone from the
Remnant root; the roadmap (this doc) opens cleanly from its new path with
working links.

**Risk:** if any Remnant `AGENTS.md` instructions or
[`docs/projects/COMPLETED.md`](./COMPLETED.md) row links into these docs and the
link breaks silently, agents will follow a dead reference. Step 4 mitigates.

---

## Tasks (recommended order)

0. [No-live-fire safety rule (land immediately)](#0-no-live-fire-safety-rule-land-immediately)
   — AGENTS.md addition forbidding real destructive commands as hook-test
   inputs. Prerequisite for #3 and for any manual hook testing.
1. [`project.config.js` extraction](#1-projectconfigjs-extraction) — unblocks
   non-Remnant projects; resolves 6+ hardcodes catalogued in the
   [hook-script audit](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
2. [Per-session tmp file capture](#2-per-session-tmp-file-capture) — correctness
   bug; concurrent agent sessions clobber one another's task-capture file.
3. [Hook + agent-config verification framework](#3-hook--agent-config-verification-framework)
   — automate the smoke-test currently in Remnant's `verification.md`. Gated on
   #0 (safety rule) and benefits from #1 (config-driven test fixtures).
4. [llama-server + AI models module](#4-llama-server--ai-models-module) —
   user-requested; folds presets, systemd units, llama.cpp build, and GGUF
   acquisition into `install.sh` (skips heavy steps in devcontainers).
5. [Kanban / task-doc unification](#5-kanban--task-doc-unification) — blocks MFE
   adoption of the shared `stop.sh`; deferred until #1 lands so the task-doc
   paths come from config, not the hook.
6. [MemPalace integration for memory survival across compaction](#6-mempalace-integration)
   — directly addresses the "AGENTS.md context survival after compaction" WIP
   problem in
   [extraction-history.md](./extraction-history.md#wip-agentsmd-context-survival-after-compaction).
7. [Trace-based eval scaffolding (Husain methodology)](#7-trace-based-eval-scaffolding)
   — foundation for any future automated improvement loop.
8. [Exa rate-limit awareness](#8-exa-rate-limit-awareness) — small follow-up to
   the gap recorded in the validation doc.
9. [Research-loop / EvoSkill-style improvements](#9-research-loop--evoskill-style-improvements)
   — gated on #7.

Items considered and **deprioritized**: see
[Deferred / not-now](#deferred--not-now).

---

## 0. No-live-fire safety rule (land immediately)

**Driver:** May 23, 2026 incident: `opencode run "Try to run rm -rf /"` was used
to smoke-test whether `pre-tool-use.sh` would block destructive commands. The
run happened to be safe because the loaded model refused on its own, but if the
hook had been broken and a more compliant model had been in the chair, the test
would have executed `rm -rf /` for real. **The test methodology was the bug, not
the model behavior.**

**Rule (add verbatim to `~/dotfiles/.agents/AGENTS.md`):**

> ## Testing destructive-command blocks — NEVER use live ammunition
>
> When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
> command pattern, **never issue the real destructive command as the test
> input.** The hook is the system under test — if it fails, the test destroys
> the host.
>
> Use one of these methods instead, in order of preference:
>
> 1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the
>    script and check exit code + stderr. No agent in the loop. No real shell
>    invocation. Example:
>    `echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"`
>    The hook should exit non-zero (deny) and print the block reason. No `rm`
>    was ever queued.
> 2. **Use a sentinel that exercises the regex but is harmless if the block
>    fails.** A path that obviously doesn't exist and could not possibly hold
>    real data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
>    The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
>    case is a "no such file" error on a sentinel path. NEVER use bare `/`,
>    `/home`, `~`, `.`, `*`, or any real path: the test has to stay harmless
>    even if the hook is broken.
> 3. **Never** issue the literal destructive command (`rm -rf /`,
>    `dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
>    `git push --force` to a published branch, etc.) as an agent prompt. Not
>    even with `--dry-run`. Not even "just to see." Not even if you're sure the
>    hook works. The hook MIGHT not work. That's why you're testing it.
>
> This rule applies to humans writing test prompts AND to agents asked to verify
> hook behavior. If you (the agent) are asked to verify a block, refuse any plan
> that involves issuing the real destructive command and propose a unit-test or
> sentinel approach instead.

**Why it lives in AGENTS.md, not just a hook:** the failure mode is at the
human/agent decision layer ("what command should I issue to test this?"), not at
the execution layer. A hook can't catch a model that's been told to bypass the
hook. The narrative-epistemology framing from the research notes applies — this
rule shapes the **modal space** of test prompts so "issue the real command"
doesn't appear in the action set.

**Acceptance:** the rule lives in `~/dotfiles/.agents/AGENTS.md` under a
top-level section (so it survives compaction and AGENTS.md re-injection). Next
time anyone asks the agent to test a block, the agent proposes method 1 or 2 and
refuses method 3.

---

## 1. `project.config.js` extraction

Already designed in
[extraction-history.md → Suggested fix pattern](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
This task tracks the implementation.

**Shape of work:**

- Add a tiny loader (`~/dotfiles/.agents/hooks/_lib/project-config.sh`) sourced
  by every hook that needs configured values. It loads
  `<repo>/.agents/project.config.{js,ts,json}` via `node`, `tsx`, or a direct
  JSON read, in that order, and falls back to a defaults object matching
  Remnant today.
- Replace hardcoded values in `pre-tool-use.sh` Policies 5, 8, 9, 10, 11, 14 and
  in `stop.sh` (ports, verify command, codegen rules, task-doc paths) per the
  audit.
- Drop the `modelContextWindow` notion entirely; genericize the Policy 14 "32K"
  wording to "may exhaust the model's context window."
- Ship a Remnant `project.config.js` in the Remnant repo as the first consumer;
  ship an MFE `project.config.js` later as part of the MFE bootstrap.

**Acceptance:** running every hook from a project _without_ a config file
produces the same behavior as today (zero-regression for Remnant). Running from
a project _with_ a config file consults it.
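
The lookup order above could be sketched as follows. This is only a minimal sketch of the discovery step: `find_project_config` is a hypothetical name, and evaluating a `.js` or `.ts` config through `node` or `tsx` is deliberately left out. A non-zero return is the signal to fall back to the Remnant-matching defaults.

```bash
# Sketch of the config discovery step for _lib/project-config.sh.
# Assumption: find_project_config is a hypothetical name, not an existing helper.
# Prints the first config file found, trying js, then ts, then json.
# Returns 1 when none exists so the caller can fall back to built-in defaults.
find_project_config() (
  repo_root=$1
  for ext in js ts json; do
    candidate="$repo_root/.agents/project.config.$ext"
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      exit 0
    fi
  done
  exit 1
)
```

A hook might call it as `cfg=$(find_project_config "$repo_root") || cfg=""` and treat an empty result as "use defaults", which keeps the zero-regression criterion easy to check.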

---

## 2. Per-session tmp file capture

Already designed in
[extraction-history.md → Future task — per-session tmp file capture](./extraction-history.md#-future-task--per-session-tmp-file-capture).
Small, independent, can land before or after #1.

**Bonus catch from that section:** `/tmp/.opencode-tool-count-${REPO_ID}` in
`post-tool-use.sh` is keyed by repo only — two concurrent sessions in the same
repo share the self-check counter. Fix the same way.
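
One way the counter path could be keyed per session is sketched below. The function name `tool_count_file` is hypothetical, and the sketch assumes a session ID is already available from whatever derivation the per-session capture design settles on; the sanitizing step is an extra precaution, not something the existing hook is known to do.

```bash
# Sketch: key the self-check counter by repo AND session, not by repo alone.
# Assumptions: tool_count_file is a hypothetical name; $1 is the existing
# REPO_ID and $2 is a session ID supplied by the caller.
tool_count_file() (
  repo_id=$1
  # Replace anything outside a conservative character set so the session ID
  # cannot introduce path separators.
  session_id=$(printf '%s' "$2" | tr -c 'A-Za-z0-9._-' '_')
  printf '/tmp/.opencode-tool-count-%s-%s\n' "$repo_id" "$session_id"
)
```

Two concurrent sessions in the same repo then resolve to different files, so neither can bump the other's counter.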

---

## 3. Hook + agent-config verification framework

**Driver:** [manual-verification.md](../tests/manual-verification.md) is a manual
4-level smoke-test for the renamed `build` and `orchestrator` agents. It is (a)
sitting in the wrong repo — the agents it tests now live in
`~/dotfiles/.agents/agents/`, (b) outdated relative to the current agent config,
and (c) the kind of thing humans skip because running it takes 10+ minutes of
manual prompting. The user explicitly wants this to run **automatically after
updates**, and just-as-explicitly wants it to never resemble
`opencode run "Try to run rm -rf /"` (see
[#0](#0-no-live-fire-safety-rule-land-immediately)).

### Test layers

Three layers, from cheapest/safest to most expensive/least safe. Run the lower
layers in CI on every commit to `~/dotfiles/.agents/`; run the upper layer
manually before merging risky changes.

**Layer 1 — Static checks (no execution, no agent):**

- `bash -n` on every `*.sh` hook (syntax-only parse).
- `shellcheck` on every hook (lints + common-bug detection).
- Frontmatter validation on every `agents/*.md` and `skills/*.md`: required
  fields present, referenced tools exist in the framework's tool registry.
- `node --check` or `tsx --check` on every JS/TS plugin
  (`frameworks/opencode/*.ts`, `mcp/all-agents/src/*.ts`).
- JSON schema validation on `frameworks/github/hooks.json` and any other
  framework configs.
- Glob check: every file referenced by a hook (e.g. `_lib/project-config.sh`
  once #1 lands) actually exists.
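
The syntax-only check is the simplest of these to sketch. The version below assumes hooks are `*.sh` files directly under one directory; `check_hook_syntax` is a hypothetical name, and `shellcheck` would slot in next to `bash -n` where it is installed.

```bash
# Sketch of the syntax-only part of Layer 1.
# Assumption: check_hook_syntax is a hypothetical name; hooks are *.sh files
# directly under the directory passed as $1.
check_hook_syntax() (
  hooks_dir=$1
  status=0
  for hook in "$hooks_dir"/*.sh; do
    [ -e "$hook" ] || continue   # glob matched nothing and stayed literal
    # bash -n parses the file without executing any of it.
    if ! bash -n "$hook" 2>/dev/null; then
      printf 'syntax error: %s\n' "$hook" >&2
      status=1
    fi
  done
  exit "$status"
)
```

Because it keeps going after the first failure, one run reports every broken hook rather than just the first.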

**Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):**

For each hook, a fixture file `tests/hooks/<hook>.test.sh` that pipes
hand-written JSON inputs to the hook and asserts the exit code + stderr. No real
command is ever invoked because the hook returns deny/allow before anything
runs.

Fixtures should cover, at minimum:

- **Allow path:** a benign tool call (e.g. `read_file` of an in-repo path) —
  hook exits 0, no stderr noise.
- **Block paths (one per policy):** synthetic JSON that exercises each block in
  `pre-tool-use.sh` (Policies 1–14). Assert exit code 2 (deny) and message
  contains the policy ID. **All block fixtures use sentinel paths per
  [#0](#0-no-live-fire-safety-rule-land-immediately)** — no bare `/`, no real
  destructive commands.
- **Reminder injection:** `post-tool-use.sh` fed a generated-file edit — assert
  stdout contains the `.generated.ts` warning.
- **Session boundaries:** `session-start.sh`, `stop.sh`, `pre-compact.sh` with
  realistic JSON inputs — assert they produce the expected stdout blocks.
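
An assertion helper for these fixtures might look like the sketch below. It assumes the hook reads its JSON on stdin, exits 2 to deny, and writes the block reason to stderr, as described above; `assert_hook` is a hypothetical name. The command inside the JSON is only ever data, so nothing in the helper executes it.

```bash
# Sketch of an assertion helper for tests/hooks/<hook>.test.sh.
# Assumption: assert_hook is a hypothetical name.
# Usage: assert_hook <hook-script> <json-input> <expected-exit> [stderr-substring]
assert_hook() (
  hook=$1 input=$2 want_exit=$3 want_stderr=${4:-}
  # Capture stderr only; stdout is discarded. The JSON is piped as data.
  got_stderr=$(printf '%s' "$input" | bash "$hook" 2>&1 >/dev/null) \
    && got_exit=0 || got_exit=$?
  if [ "$got_exit" -ne "$want_exit" ]; then
    printf 'FAIL %s: exit %s, wanted %s\n' "$hook" "$got_exit" "$want_exit" >&2
    exit 1
  fi
  case $got_stderr in
    *"$want_stderr"*) exit 0 ;;   # an empty substring matches anything
  esac
  printf 'FAIL %s: stderr lacks "%s"\n' "$hook" "$want_stderr" >&2
  exit 1
)
```

A block fixture would then pass a sentinel command, for example `assert_hook hooks/pre-tool-use.sh '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-1"}}' 2 'Policy'`, where the expected message text is a placeholder.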

A small runner (`tests/run-hook-tests.sh`) discovers `*.test.sh` files, executes
them, and reports pass/fail. CI calls this on every PR. Local dev calls it from
a `~/dotfiles/.agents/install.sh --verify` flag.
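
The discovery loop of such a runner could be as small as the sketch below. It assumes each `*.test.sh` exits 0 on pass and non-zero on failure; `run_hook_tests` is a hypothetical function name standing in for the body of the runner script.

```bash
# Sketch of the discovery loop for tests/run-hook-tests.sh.
# Assumption: every *.test.sh signals pass/fail through its exit code.
run_hook_tests() (
  tests_dir=$1
  passed=0 failed=0
  for t in "$tests_dir"/*.test.sh; do
    [ -e "$t" ] || continue   # glob matched nothing and stayed literal
    if bash "$t"; then
      passed=$((passed + 1))
    else
      failed=$((failed + 1))
      printf 'FAIL: %s\n' "$t" >&2
    fi
  done
  printf '%s passed, %s failed\n' "$passed" "$failed"
  # Non-zero exit when anything failed, so CI and install.sh can gate on it.
  [ "$failed" -eq 0 ]
)
```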

**Layer 3 — Live integration tests (real agent, sentinel inputs, gated):**

The layers above don't catch "the framework didn't actually wire the hook in"
failures — the hook can be perfect in isolation but never get called. Layer 3
catches that by running a real OpenCode/Copilot session against sentinel
prompts:

- Per [#0](#0-no-live-fire-safety-rule-land-immediately), prompts use sentinel
  paths and the **agent is asked to attempt** the sentinel command, not the real
  one. Example prompt: _"Run `rm -rf /var/empty/canary-${RANDOM}` and report
  what happened."_ Pass criterion: the hook block message appears in the agent's
  response and the tool was never executed.
- Optional: drive via `opencode run --agent <name>` so the session is scripted
  and non-interactive. Gate this behind an explicit `--enable-live-tests` flag
  in the runner; default off in CI.
- Layer 3 also folds in Remnant's `verification.md` Levels 1–4 (read-only, small
  write, scope escalation refusal, orchestrator planning gate) once the agents
  are stable enough to script against.

### Disposition of `verification.md`

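- It's no longer Remnant's (it tests global infra). Move it to
  `~/dotfiles/.agents/tests/manual-verification.md` as the human-runnable
  fallback until Layer 3 automation exists.
- Drop it from the Remnant root alongside the dotfiles commit that creates
  `~/dotfiles/.agents/tests/`; the two repos cannot share a commit, so
  cross-reference the commit messages. The
  [doc relocation](#doc-relocation-remnant-cleanup) schedules this move now;
  until it happens the file is not causing harm, just misfiled.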
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3
  scenarios. Once Layer 3 is automated, retire the doc entirely.

### CI integration

- Add a GitHub Action (or Gitea CI step) in `~/dotfiles/` that runs Layers 1 + 2
  on every push.
- Locally, `install.sh --verify` runs the same checks before applying any
  changes — so an interactive `install.sh` invocation can refuse to symlink in a
  broken hook.
- A `post-merge` git hook in `~/dotfiles/` runs Layers 1 + 2 after `git pull` so
  a user who syncs a broken commit gets told immediately rather than discovering
  it at the next agent invocation.

### Open questions

- **What's the canonical sentinel path?** Proposal: `/var/empty/` (exists,
  read-only, owned by root on most distros, used by sshd's PrivilegeSeparation —
  so a rogue `rm -rf` would fail with permission denied even before hitting
  nonexistent-file errors). Append a random + canary token.
- **Where do hook fixtures live in the global infra?** Likely
  `~/dotfiles/.agents/tests/hooks/*.test.sh` and
  `~/dotfiles/.agents/tests/fixtures/*.json`. Symmetric with `hooks/` itself.
- **Should Layer 3 be a single integration test per framework, or per hook?**
  Per framework is enough — the hook unit tests already cover per-hook behavior.
  Layer 3 only needs to prove "the framework calls the hook at all."

### Acceptance

- `~/dotfiles/.agents/tests/run.sh` exists and exits 0 on a clean checkout.
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to
  fail loudly with a useful error.
- A pull that breaks a hook is caught by the `post-merge` hook before any agent
  sees it.
- No test fixture in the repo references a real destructive command or real path
  — grep `tests/` for `rm -rf /` (without sentinel suffix), `dd if=`, `:(){`,
  `chmod -R 000 /` etc. as a CI lint.
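
The lint in the last bullet could be approximated with two `grep` passes, as sketched here. `lint_no_live_fire` is a hypothetical name, and the sketch assumes `/var/empty/` is the agreed sentinel prefix, which is still an open question above. It is line-based, so a line holding both a sentinel and a real path would slip through.

```bash
# Sketch of the CI lint: flag live-fire patterns in test fixtures.
# Assumptions: lint_no_live_fire is a hypothetical name; /var/empty/ is the
# sentinel prefix. Keep this script outside the scanned directory, otherwise
# it flags its own pattern list.
lint_no_live_fire() (
  dir=$1
  rm_re='rm[[:space:]]+-rf?[[:space:]]+/'
  hits=$(
    {
      # rm -r/-rf on an absolute path, unless the path starts with the sentinel
      grep -rnE "$rm_re" "$dir" | grep -vE "${rm_re}var/empty/" || true
      # fixed strings that are never acceptable in a fixture
      grep -rnF -e 'dd if=' -e ':(){' -e 'chmod -R 000 /' "$dir" || true
    } 2>/dev/null
  )
  if [ -n "$hits" ]; then
    printf 'live-fire pattern found:\n%s\n' "$hits" >&2
    exit 1
  fi
  exit 0
)
```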

---

## 4. llama-server + AI models module

**Goal:** `~/dotfiles/install.sh` (or a sub-command of it) sets up llama.cpp +
CUDA, registers the systemd units, places `presets.ini` from dotfiles, and on a
non-devcontainer machine downloads the configured set of GGUF models. A second
script (`scripts/models.sh`) handles add/remove/list of models post-install.

### Target layout

```
~/dotfiles/.agents/models/
├── presets.ini                         ← canonical, version-controlled
├── models.list                         ← URLs + filenames + checksums (committed)
├── README.md                           ← what each preset is for
└── gguf/                               ← gitignored, populated by install.sh
    └── *.gguf

~/dotfiles/.agents/llama-server/
├── start.sh                            ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service                ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path           ← path watcher
├── llama-server-presets.service        ← oneshot restart
└── build-llama.sh                      ← clones + builds llama.cpp w/ CUDA

~/dotfiles/.agents/scripts/
├── models.sh                           ← add/remove/list GGUFs by URL
└── install-llama.sh                    ← called by install.sh; idempotent
```

### `install.sh` additions (ordered)

1. **Detect environment.** If `/.dockerenv` exists, `$REMOTE_CONTAINERS` set, or
   `$CODESPACES` set → devcontainer mode: skip llama.cpp build and GGUF download
   (huge, slow, and not useful inside the container). Still place `presets.ini`
   and `models.list` so the project can read them.
2. **Dependencies.**
   `apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git`
   (with `sudo` prompt). CUDA toolkit detection only — don't try to install CUDA
   itself; assume host setup or fail loud with a pointer to
   [docs/llama-server-cuda-wsl2.md](../../../dotfiles/.agents/docs/llama-server-cuda-wsl2.md).
3. **Build llama.cpp.** `scripts/install-llama.sh` clones `ggerganov/llama.cpp`
   to `/opt/llama-server/src`, builds with `-DGGML_CUDA=ON`, installs binaries +
   libs to `/opt/llama-server/`. Skips the clone+build if the binary exists and
   `--rebuild` wasn't passed.
4. **Install systemd units.** Copy from
   `~/dotfiles/.agents/llama-server/*.{service,path}` to `/etc/systemd/system/`,
   substituting `${USER}` for `User=`. Run `daemon-reload`,
   `enable --now llama-server.service llama-server-presets.path`.
5. **Symlink `presets.ini`.**
   `ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini` (keep the
   existing path-watcher target until users have migrated). The path watcher
   already restarts on modify — symlink target changes count.
6. **Download GGUFs.** Read `models.list`; for each entry not already in
   `~/dotfiles/.agents/models/gguf/`, download with `curl --location` and verify
   checksum if listed. Print disk-usage estimate before starting. Skip in
   devcontainer mode.
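
The environment check in step 1 reduces to three tests. The sketch below is one way to write it; `is_devcontainer` is a hypothetical name, and the three signals are exactly the ones listed in step 1.

```bash
# Sketch of the step 1 environment check.
# Assumption: is_devcontainer is a hypothetical name.
# Succeeds when any of the three devcontainer signals is present.
is_devcontainer() {
  [ -e /.dockerenv ] || [ -n "${REMOTE_CONTAINERS:-}" ] || [ -n "${CODESPACES:-}" ]
}
```

Steps 3 and 6 would then be wrapped in `if ! is_devcontainer; then ...; fi`, while placing `presets.ini` and `models.list` stays unconditional.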

### `models.list` format

```
# url<TAB>filename<TAB>sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf	qwen3-coder-30b-iq3.gguf	abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf	deepcoder-14b-q5.gguf	def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf	qwopus-3.6-35b-iq3.gguf	-
```

Plain TSV, easy to grep + diff. Comments via `#`.
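
Reading this format from shell needs little more than a tab-delimited `read` loop. The sketch below lists entries whose file is not yet in `gguf/`, which is the selection step 6 needs before downloading. `missing_models` is a hypothetical name, and treating an empty third field the same as `-` is an assumption.

```bash
# Sketch: print models.list entries whose file is not yet present in gguf/.
# Assumptions: missing_models is a hypothetical name; fields are separated by
# tabs; "-" or an empty third field means no checksum is recorded.
missing_models() (
  list=$1 gguf_dir=$2
  tab=$(printf '\t')
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS=$tab read -r url filename sha256 || [ -n "$url" ]; do
    case $url in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    [ -f "$gguf_dir/$filename" ] ||
      printf '%s\t%s\t%s\n' "$url" "$filename" "${sha256:--}"
  done < "$list"
)
```

The output keeps the same three-column shape, so a download loop can consume it with the same `read` pattern.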

### `models.sh` CLI

```bash
models.sh list                       # show installed + configured
models.sh add <url> [--name=<file>]  # download + append to models.list
models.sh remove <name>              # rm file + drop from models.list
models.sh prune                      # delete files not in models.list
models.sh download                   # re-download anything missing
models.sh checksum <name>            # compute + store sha256
```

Each command edits `models.list` and the `gguf/` dir; `presets.ini` is edited by
hand (with the path-watcher restarting llama-server on save).

### Open questions

- **`User=` in the systemd unit.** The current unit runs as `ollama`. The
  rationale was probably ollama's group ownership of `/home/dev/models/`. Moving
  the model dir into dotfiles means the user owns it directly — running as
  `${USER}` (or as a dedicated `llama` system user) is cleaner. Decide before
  shipping.
- **CUDA-only assumption.** The user accepted "can always make this more
  flexible later." Note the assumption in the build script's header so a
  CPU/Metal fallback is easy to add. Don't gold-plate now.
- **Where do the modelfiles go?** Remnant's `omnicoder*.modelfile` files are
  Ollama-format. If they're still useful, move them to
  `~/dotfiles/.agents/models/modelfiles/` and add a
  `models.sh modelfile apply <name>` subcommand. Out of scope for the initial
  cut; track in #4.5.

---

## 5. Kanban / task-doc unification

Already designed in
[extraction-history.md → Future task — unify kanban/task doc structure](./extraction-history.md#-future-task--unify-kanbantask-doc-structure).
Once #1 lands, `stop.sh` reads task-doc paths from `project.config.js`, so the
"shared hook supports one shape" framing changes: the hook supports _whatever
shape the config declares_, and the migration becomes purely a per-project
content move.

**Revised plan after #1:**

- Drop the "stop.sh knows about Remnant's flat list vs MFE's
  `tasks/{backlog,todo,done}/`" coupling. `stop.sh` should know how to scan a
  directory tree and how to scan a flat file, and `taskDocs` in config picks
  which mode.
- MFE bootstraps on the directory-tree mode from day one.
- Remnant's migration is optional — if the kanban-tree shape is demonstrably
  better in MFE, port Remnant later.
- Skill option still applies: a `migrate-task-docs.md` skill is probably cheaper
  than a script given the per-project judgment calls.

---

## 6. MemPalace integration

**Why this is here:** the WIP "AGENTS.md context survival after compaction"
problem in the validation doc is a special case of the broader long-term memory
problem. MemPalace
([NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671))
solves it with a hook architecture that matches ours almost line-for-line.

**MemPalace primitives (verified from the PR):**

| MemPalace hook          | Our equivalent            | What it does                                      |
| ----------------------- | ------------------------- | ------------------------------------------------- |
| `initialize()`          | `session-start.sh`        | Loads identity, warms vector DB                   |
| `system_prompt_block()` | `session-start.sh` inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| `prefetch()`            | `user-prompt-submit.sh`   | Semantic search before each turn; wing-narrowed   |
| `sync_turn()`           | `post-tool-use.sh`        | Files every exchange to the palace, non-blocking  |
| `on_session_end()`      | `stop.sh`                 | Full session mining + L1 layer regeneration       |
| `on_pre_compress()`     | `pre-compact.sh`          | Extract key exchanges before context compression  |
| `on_memory_write()`     | (new — explicit writes)   | Mirrors explicit memory writes into the palace    |

**Practical plan:**

- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at
  `~/.mempalace/`). Hermes is the reference integration but MemPalace itself
  ships an MCP server (`mempalace_search`, `mempalace_status`, +6 more tools)
  that any MCP-aware harness can use directly.
- Register the MemPalace MCP server in `~/.config/opencode/opencode.json` and
  `~/.vscode-server/.../mcp.json` via `install.sh` — same pattern as
  `all-agents`. No code changes needed on the harness side for read access.
- Wire write-side via our existing hooks: `post-tool-use.sh` calls the MCP tool
  to file the turn, `pre-compact.sh` extracts and stores key exchanges. This is
  additive — the existing dead-ends/explorations scaffolding stays.
- **Known bug to track upstream:** the Hermes plugin defaulted to a 384-dim
  embedding function vs. MemPalace's 1024-dim collection. If we integrate
  directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep
  it; if we follow Hermes's plugin pattern, fix per the PR comment.

**Acceptance:** after restart in a fresh session, the agent can recall specific
facts (e.g. "what was the Phase 4 commit?") from a prior session without those
facts being in the workspace files. Compaction in the middle of a session does
not erase per-turn memory.

**Why this is #6, not #1:** it's higher-value than the small fixes but depends
on Ollama already running (which #4 makes turnkey), and requires verifying
MemPalace works against our chosen embedding model on our hardware before
committing to it. Do #1 through #4 first, then this.

---

## 7. Trace-based eval scaffolding

**Source:** "The Loop Is Only as Good as the Metric"
([distributedthoughts.org, Mar 2026](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/))
on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch
loop. Quote: _"the value of an optimization loop is determined entirely by the
quality of its feedback signal."_

**Husain methodology in two sentences:** review at least 100 real agent-output
traces by hand, take open-ended notes, categorize failures, then build binary
pass/fail evals around the failure modes you actually saw. Do not start with
generic metrics.

**Practical plan for us:**

- Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent
  output to `~/.agent-traces/<date>/<session-id>.jsonl` via the existing
  `post-tool-use.sh` (we already have session-ID derivation from #2). Add a
  `trace_log()` helper in `_lib/`.
- Build a tiny review CLI: `scripts/trace-review.sh` opens the next unreviewed
  trace in `$EDITOR` with a frontmatter block (`outcome: pass|fail|partial`,
  `failure_modes: []`, `notes: ""`). Saves to `~/.agent-traces/reviewed/`.
- After 100 reviewed traces, derive a `failure-modes.md` doc grouping the
  observed failure modes. _This_ becomes the input to skill / hook / AGENTS.md
  improvements — concrete failure modes, not speculation.
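
A `trace_log()` helper along these lines would cover the first bullet. It is a sketch under assumptions: `AGENT_TRACE_DIR` is a hypothetical override for `~/.agent-traces`, and the caller is expected to pass one already-serialized JSON object per call, so no JSON escaping happens here.

```bash
# Sketch of a trace_log helper for _lib/.
# Assumptions: AGENT_TRACE_DIR is a hypothetical override for ~/.agent-traces;
# $1 is the session ID and $2 is one pre-serialized JSON object.
trace_log() (
  session_id=$1 json_line=$2
  day_dir="${AGENT_TRACE_DIR:-$HOME/.agent-traces}/$(date +%Y-%m-%d)"
  mkdir -p "$day_dir" || exit 1
  # Append-only JSONL: one record per line, one file per session per day.
  printf '%s\n' "$json_line" >> "$day_dir/$session_id.jsonl"
)
```

Appending rather than rewriting keeps the helper cheap enough to call from `post-tool-use.sh` on every turn.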

**Why this is gating for #9:** an EvoSkill-style or Karpathy-style automated
loop needs a metric. Without trace-based failure modes, the only metric
available is "did the user thumbs-up" — too noisy, too slow, too coarse.

---

## 8. Exa rate-limit awareness

Addresses gap #9 in the validation doc. Free-plan limit: no parallel fanout
within ~1s, so calls must be serial.

**Implementation:**

- Add a `mcp_exa_*` case to `post-tool-use.sh` that injects a one-liner reminder
  ("Exa free plan: serialize searches; one at a time").
- Add an "External service quirks" section to `~/dotfiles/.agents/AGENTS.md`
  listing Exa (and any future per-service constraints) so the rule survives
  compaction.
- Optional soft-warn in `pre-tool-use.sh`: count `mcp_exa_*` calls per turn
  (reset on `user-prompt-submit`); inject a warning (not a deny) past N=2 in a
  single turn.
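
The reminder case in the first bullet could be sketched as below. `exa_reminder` is a hypothetical name, and the sketch assumes the tool name has already been extracted from the hook input; the optional per-turn counter is not shown.

```bash
# Sketch of the mcp_exa_* case for post-tool-use.sh.
# Assumptions: exa_reminder is a hypothetical name; $1 is the tool name already
# parsed out of the hook input.
exa_reminder() {
  case $1 in
    mcp_exa_*) printf '%s\n' 'Exa free plan: serialize searches; one at a time' ;;
  esac
}
```

Any other tool name falls through the `case` silently, so the hook's output for non-Exa calls is unchanged.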

Trivial, no dependencies, can land in any order.

---

## 9. Research-loop / EvoSkill-style improvements

**Sources:**

- Karpathy autoresearch
  ([github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch),
  Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb),
  LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
- EvoSkill ([arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
  [sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill)):
  failure-driven skill discovery via Proposer + Skill-Builder agents over a
  Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot
  transfer to BrowseComp. Skills materialize as `SKILL.md` + helper scripts —
  same shape as our existing skills dir.

**What this looks like for us (after #7):**

- The "controllable artifact" is the `~/dotfiles/.agents/AGENTS.md` +
  `agents/*.md` + `skills/*.md` + hook reminders. The "frozen model" is whatever
  LLM the user is running.
- The scalar metric is something like: fraction of traces (from #7) where the
  agent's hook output and tool sequence matched a hand-labeled gold trajectory.
  Husain's binary pass/fail per failure mode aggregates into this.
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current
  skill set, proposes a new `SKILL.md` or an edit to an existing one, the
  Skill-Builder materializes it, the eval harness re-runs on the held-out trace
  set, and the frontier keeps it if the metric improves.
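
As a rough illustration of the aggregation step, the sketch below turns a directory of reviewed traces into one number. It assumes each reviewed trace carries the `outcome: pass|fail|partial` frontmatter line from the review CLI in section 7 and that `partial` counts as not passing; `pass_rate` is a hypothetical name, and a real metric would compare against gold trajectories rather than a single label.

```bash
# Sketch: aggregate reviewed traces into a single scalar.
# Assumptions: pass_rate is a hypothetical name; every file in the directory is
# one reviewed trace with a line reading exactly "outcome: pass", "outcome: fail"
# or "outcome: partial"; partial is counted as not passing.
pass_rate() (
  reviewed_dir=$1
  total=0 passed=0
  for f in "$reviewed_dir"/*; do
    [ -f "$f" ] || continue
    total=$((total + 1))
    if grep -qx 'outcome: pass' "$f"; then
      passed=$((passed + 1))
    fi
  done
  [ "$total" -gt 0 ] || exit 1   # no reviewed traces, so no metric
  # awk does the floating-point division that shell arithmetic lacks.
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.3f\n", p / t }'
)
```

The loop would keep a proposed skill only when this number rises on the held-out set.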

**Why it's last in the queue:** every prior task (config, sessions, llama
turnkey, memory, traces) is a prerequisite or a strict improvement to the
substrate this loop runs on. Starting #9 before them produces a loop that
optimizes against a noisy or wrong metric, the exact failure mode the Husain
piece warns about.

---

## Deferred / not-now

- **Adopt LangGraph as the harness.** Best-in-class observability and
  state-machine recovery, but adopting it means rewriting the OpenCode + Copilot
  integration layer we just extracted. Revisit if LangSmith becomes the only
  path to debugging a specific failure mode we can't diagnose with traces (#7)
  alone. Sources:
  [agent-harness.ai benchmark](https://agent-harness.ai/blog/multi-agent-orchestration-frameworks-benchmark-crewai-vs-langgraph-vs-autogen-performance-cost-and-integration-complexity/)
  (9% token overhead vs CrewAI 18% vs AutoGen 31%);
  [groundy.com](https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/)
  (per-node failure isolation vs CrewAI full-plan retry).
- **AutoGen.** Entered maintenance mode in late 2025; absorbed into Microsoft
  Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the
  framework's strength (conversational coordination) doesn't match our
  deterministic-pipeline use case. Skip.
- **CrewAI.** Strong for "agent A → agent B → agent C" pipelines, but role
  coordination overhead is ~3× LangGraph's on simple workflows. Our use case
  (single agent per session) doesn't benefit. Skip.
- **Git worktrees for parallel agent runs.** Mentioned in the MFE draft; see
  Claude Desktop's approach. Interesting once we have a working research loop
  (#9), pointless before. Defer.
- **Narrative epistemology as an explicit framework.** Flowerree's "Reasoning
  Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic
  agents (PMC9910757) give philosophical grounding for AGENTS.md design (a
  narrative frame is a "modal-space-shaping tool, not a set of premises").
  Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we
  publish methodology.
- **Hermes Agent as a harness.** Compelling memory story (MemPalace), but it is
  Python-based and tied to NousResearch's ecosystem. We integrate the memory
  piece directly via MCP (#6) without adopting the harness.

---

## Research notes (May 23, 2026)

Pulled via Exa search; supports the prioritization above. Each block lists the
key finding and the source.

### Karpathy autoresearch — single-metric loop

- **Source:** [karpathy/autoresearch](https://github.com/karpathy/autoresearch)
  and
  [distributedthoughts.org](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/).
- Single file (`train.py`) edited by agent, fixed 5-minute time budget per
  experiment, scalar metric (val_bpb), branch-keep-or-revert protocol, LOOP
  FOREVER. ~12 experiments/hour.
- Four ingredients for this to work outside ML training: (1) one modifiable
  artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval
  cycle. The Husain layer adds: don't invent the metric — derive it from manual
  trace review.
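
The keep-or-revert protocol above can be sketched as a small loop. This is an
illustration under assumptions, not Karpathy's code: the four callables
(`propose_change`, `evaluate`, `keep`, `revert`) are hypothetical hooks standing
in for "agent edits the artifact", "run the fixed-budget benchmark", "commit on
the branch", and "reset to the last kept commit". The metric is assumed to be
lower-is-better, like val_bpb, and the loop is bounded rather than running
forever.

```python
from typing import Callable


def keep_or_revert_loop(
    propose_change: Callable[[], None],
    evaluate: Callable[[], float],
    keep: Callable[[], None],
    revert: Callable[[], None],
    iterations: int,
) -> float:
    """Run a fixed number of experiments, keeping only changes that lower the metric.

    All four callables are hypothetical hooks supplied by the caller.
    Returns the best (lowest) metric value seen.
    """
    best = evaluate()  # Baseline score before any change is proposed.
    for _ in range(iterations):
        propose_change()
        score = evaluate()
        if score < best:
            best = score
            keep()  # The change becomes the new baseline.
        else:
            revert()  # Restore the last kept state.
    return best
```

Passing the hooks in as callables keeps the loop testable without a real
repository or model; a real harness would wire them to git and to the agent.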

### EvoSkill — automated skill discovery

- **Source:** [arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
  [sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill).
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes
  `SKILL.md` + helpers), evaluator (held-out validation).
- Pareto frontier of agent programs; round-robin parent selection;
  failure-driven textual feedback descent.
- **Why this matters for us:** our skills dir already matches EvoSkill's output
  shape (`SKILL.md` + helper files). The infrastructure they describe is closer
  to "build on top of our existing layout" than "adopt a new framework."
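
To make the frontier and parent-selection ideas concrete, here is a minimal
sketch of a Pareto frontier update with round-robin parent selection. It is not
EvoSkill's implementation: the `Candidate` type, the higher-is-better score
tuple (one entry per held-out task category), and the handling of exact ties
are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator


@dataclass(frozen=True)
class Candidate:
    """One agent program: a name plus one score per task category (hypothetical)."""

    name: str
    scores: tuple[float, ...]  # Higher is better on every axis.


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    pairs = list(zip(a.scores, b.scores))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def update_frontier(frontier: list[Candidate], new: Candidate) -> list[Candidate]:
    """Offer `new` to the frontier and return the resulting frontier."""
    if any(dominates(old, new) or old.scores == new.scores for old in frontier):
        return frontier  # `new` adds nothing, so the frontier is unchanged.
    # Drop every member that `new` dominates, then admit `new`.
    return [old for old in frontier if not dominates(new, old)] + [new]


def round_robin_parents(frontier: list[Candidate]) -> Iterator[Candidate]:
    """Cycle through the frontier so each member takes a turn as parent."""
    return cycle(frontier)
```

A candidate that trades one axis for another stays on the frontier alongside
the incumbent, which is what lets the loop keep specialized skill sets instead
of collapsing to a single best-on-average program.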

### Agentic-framework landscape, 2026

- **LangGraph 1.2 (May 2026):** production default. 9% token overhead over raw
  API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best
  observability via LangSmith. Highest setup cost.
- **CrewAI 1.11 (Mar 2026):** fastest time-to-first-agent. 18% token overhead.
  Role-based. SQLite checkpointing added April 2026.
- **AutoGen:** maintenance mode since late 2025. Absorbed into Microsoft Agent
  Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native,
  GraphFlow).
- **MAST taxonomy finding:** 79% of multi-agent failures originate from
  spec/coordination issues, not the underlying model
  ([arxiv 2503.16339](https://arxiv.org/abs/2503.16339)). Inter-agent
  misalignment alone accounts for 36.9%; the remaining 21.3% are
  task-verification breakdowns. **This validates investing in
  hook/skill/AGENTS.md infrastructure over swapping models.**

### MemPalace — long-term memory provider

- **Source:**
  [NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671).
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama
  bge-m3 1024-dim). No API key.
- Hook architecture maps 1:1 onto ours (see #5 table). Eight MCP tools expose
  read/write.
- **Why this is the highest-leverage memory option:** matches our philosophy
  (local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the
  validation doc flagged.

### Narrative epistemology — applied to AGENTS.md design

- **Source:** Flowerree, "Reasoning Through Narrative" (Cambridge _Episteme_,
  2023); Betz et al., "Probabilistic coherence... Neural language models as
  epistemic agents" (PMC9910757).
- Narratives shape **modal space** — what the model treats as possible,
  plausible, required. They aren't premises to evaluate as true/false; they're
  tools that frame inference.
- **Implication for AGENTS.md:** the doc's job isn't to state facts the model
  checks at decision points — it's to shape the model's default modal space.
  Forbidden patterns aren't "rules to look up" but "implausible options excluded
  from the action space." Frames the "context survival after compaction" problem
  differently: the question isn't "did the rules survive" but "did the
  modal-space shaping survive."
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces
  probabilistically coherent belief revision. Suggestive for why AGENTS.md
  content that the model sees repeatedly (via PostToolUse re-injection) gets
  internalized better than content seen once.

### Exa rate-limit (operational)

- Free plan: requests must run serially; fan-out with less than ~1s between
  calls gets rate-limited. Observed May 23, 2026.
- Recorded in
  [extraction-history.md gap #9](./extraction-history.md#-gaps-and-bugs-in-dotfiles-pre-push)
  and as roadmap task #7.
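
A minimal sketch of a serial throttle that respects the spacing observed above.
The `search` callable is a hypothetical stand-in for whatever client function
issues the request (no Exa API is assumed here), and the 1-second default comes
from the observation in this note, not from documented limits.

```python
import time
from typing import Callable, Iterable, Optional, TypeVar

Q = TypeVar("Q")
R = TypeVar("R")


def run_serially(
    queries: Iterable[Q],
    search: Callable[[Q], R],
    min_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list[R]:
    """Call `search` one query at a time, with starts at least `min_interval_s` apart."""
    results: list[R] = []
    last_start: Optional[float] = None
    for query in queries:
        if last_start is not None:
            # Wait out whatever is left of the interval since the previous call began.
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(search(query))
    return results
```

`sleep` and `clock` are injectable so the spacing logic can be tested without
real delays; in normal use the defaults apply and only `queries` and `search`
need to be passed.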