Dotfiles Agent Infrastructure — Roadmap
Status: Planning. Companion to extraction-history.md, which covers the already-shipped extraction work and the validation findings against it.
Scope of this doc: future tasks against ~/dotfiles/.agents/ and the
ecosystem around it. Research that informs the prioritization is captured in the
"Research notes" section at the bottom — read those first if any of the task
rationale feels opaque.
How to use this doc: the "Tasks" list is ordered by recommended execution order (high leverage + low risk first). Each entry links to its design section. Move sections to dedicated docs once they grow past ~80 lines.
Land before anything else: the No-Live-Fire safety rule. One-paragraph addition to
~/dotfiles/.agents/AGENTS.md; takes 5 minutes; protects against the opencode run "Try to run rm -rf /" failure mode, where a model takes the prompt literally if the hook fails to block.
Then relocate this doc out of Remnant: see Doc relocation (Remnant cleanup). This roadmap,
agent-infra-extraction.md, and verification.md are not Remnant-specific and should live in ~/dotfiles/ so Remnant's docs/projects/ contains only Remnant-app work. Do this after #0 and before resuming any numbered task below. Once moved, the tasks list executes against the dotfiles copy and Remnant is free to evolve independently.
Doc relocation (Remnant cleanup)
Goal: Remnant's repo contains only Remnant-app docs. Everything about
~/dotfiles/.agents/ lives in ~/dotfiles/docs/ (or ~/dotfiles/.agents/docs/
— pick one and stick with it; the existing
agent-infrastructure.md stub already references
~/dotfiles/.agents/docs/agent-infrastructure.md, so that's the established
location).
Why now (priority: immediately after #0): the user wants Remnant in a good
state to work on independently. Every agent-infra doc sitting in
docs/projects/ is noise for Remnant-app planning sessions and gets
auto-injected as context whenever an agent touches docs/projects/. Moving them
is mechanical and reversible.
Files to relocate:
| Current path | Destination | Notes |
|---|---|---|
| docs/projects/dotfiles-agent-infra-roadmap.md (this file) | ~/dotfiles/.agents/docs/roadmap.md | Update internal links. Drop "Remnant" framing in the intro; it's just the roadmap once it lives there. |
| docs/projects/agent-infra-extraction.md | ~/dotfiles/.agents/docs/extraction-history.md | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
| verification.md (repo root) | ~/dotfiles/.agents/tests/manual-verification.md | Already specified as part of #3; do the move now rather than waiting for the test harness. |
| docs/projects/agent-infrastructure.md | Stay (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside docs/projects/COMPLETED.md | Split out to ~/dotfiles/.agents/docs/completed.md | Audit first: if there's nothing agent-infra-specific there, skip. |
Steps:
1. mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests
2. Move each file into ~/dotfiles/. git mv cannot cross repositories, so use git rm inside Remnant to stage the delete, then a fresh git add in dotfiles. There's no meaningful history to preserve across repos for these short-lived docs; if history matters for agent-infra-extraction.md, use git format-patch and git am instead.
3. Rewrite intra-doc links: this file's references to ./agent-infra-extraction.md become ./extraction-history.md; references to verification.md become ../tests/manual-verification.md.
4. Find inbound links from anywhere in Remnant (grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification.md" ~/code/remnant) and either delete them or repoint at the dotfiles copies via absolute paths (e.g., ~/dotfiles/.agents/docs/roadmap.md).
5. Audit docs/projects/COMPLETED.md for agent-infra rows; split if any exist.
6. Update AGENTS.md files in Remnant if any reference the moved docs.
7. Commit Remnant deletion and dotfiles addition together (or back-to-back commits with cross-references in the messages).
Acceptance: ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'
returns only agent-infrastructure.md; verification.md is gone from the
Remnant root; the roadmap (this doc) opens cleanly from its new path with
working links.
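The acceptance check above can be wrapped in a function so it fails loudly instead of relying on someone reading ls output. This is a minimal sketch; the function name relocation_leftovers is hypothetical and not part of the existing scripts.

```shell
# relocation_leftovers DIR   (hypothetical helper name)
# Prints every entry in DIR whose name matches 'agent' or 'dotfiles'
# (case-insensitive), except the one file that is meant to stay.
# Returns 0 when nothing unexpected is left, 1 otherwise.
relocation_leftovers() {
  local dir="$1" leftovers
  # '|| true' keeps an empty result from counting as a failure
  leftovers=$(ls "$dir" | grep -iE 'agent|dotfiles' \
    | grep -vxF 'agent-infrastructure.md' || true)
  if [ -n "$leftovers" ]; then
    printf '%s\n' "$leftovers"
    return 1
  fi
  return 0
}
```

Usage would look like relocation_leftovers ~/code/remnant/docs/projects, with a non-zero exit meaning at least one agent-infra doc still needs to move.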
Risk: if any Remnant AGENTS.md instructions or
docs/projects/COMPLETED.md row links into these docs and the
link breaks silently, agents will follow a dead reference. Step 4 mitigates.
Tasks (recommended order)
0. No-live-fire safety rule (land immediately): AGENTS.md addition forbidding real destructive commands as hook-test inputs. Prerequisite for #3 and for any manual hook testing.
1. project.config.js extraction: unblocks non-Remnant projects; resolves 6+ hardcodes catalogued in the hook-script audit.
2. Per-session tmp file capture: correctness bug; concurrent agent sessions clobber one another's task-capture file.
3. Hook + agent-config verification framework: automate the smoke-test currently in Remnant's verification.md. Gated on #0 (safety rule) and benefits from #1 (config-driven test fixtures).
4. llama-server + AI models module: user-requested; folds presets, systemd units, llama.cpp build, and GGUF acquisition into install.sh (skips heavy steps in devcontainers).
5. Kanban / task-doc unification: blocks MFE adoption of the shared stop.sh; deferred until #1 lands so the task-doc paths come from config, not the hook.
6. MemPalace integration for memory survival across compaction: directly addresses the "AGENTS.md context survival after compaction" WIP problem in extraction-history.md.
7. Trace-based eval scaffolding (Husain methodology): foundation for any future automated improvement loop.
8. Exa rate-limit awareness: small follow-up to the gap recorded in the validation doc.
9. Research-loop / EvoSkill-style improvements: gated on #7.
Items considered and deprioritized: see Deferred / not-now.
0. No-live-fire safety rule (land immediately)
Driver: May 23 2026 incident — opencode run "Try to run rm -rf /" was used
to smoke-test whether pre-tool-use.sh would block destructive commands. The
run happened to be safe because the loaded model refused on its own, but if the
hook had been broken and a more compliant model had been in the chair, the test
would have executed rm -rf / for real. The test methodology was the bug, not
the model behavior.
Rule (add verbatim to ~/dotfiles/.agents/AGENTS.md):
Testing destructive-command blocks: NEVER use live ammunition

When verifying that pre-tool-use.sh (or any other hook) blocks a dangerous command pattern, never issue the real destructive command as the test input. The hook is the system under test: if it fails, the test destroys the host.

Use one of these methods instead, in order of preference:

1. Unit-test the hook directly. Pipe synthetic hook-input JSON to the script and check exit code + stderr. No agent in the loop. No real shell invocation. Example: echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?" The hook should exit non-zero (deny) and print the block reason. No rm was ever queued.
2. Use a sentinel that exercises the regex but is harmless if the block fails. A path that obviously doesn't exist and could not possibly hold real data: rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}. The hook pattern (rm\s+-rf?\s+/) matches; if the block fails, the worst case is a no-op on a nonexistent sentinel path (rm -f ignores missing files). NEVER use bare /, /home, ~, ., *, or any real path. Those have to fail-closed even if the hook is broken.
3. Never issue the literal destructive command (rm -rf /, dd if=/dev/zero of=/dev/sda, :(){ :|:& };:, chmod -R 000 /, git push --force to a published branch, etc.) as an agent prompt. Not even with --dry-run. Not even "just to see." Not even if you're sure the hook works. The hook MIGHT not work. That's why you're testing it.

This rule applies to humans writing test prompts AND to agents asked to verify hook behavior. If you (the agent) are asked to verify a block, refuse any plan that involves issuing the real destructive command and propose a unit-test or sentinel approach instead.
Why it lives in AGENTS.md, not just a hook: the failure mode is at the human/agent decision layer ("what command should I issue to test this?"), not at the execution layer. A hook can't catch a model that's been told to bypass the hook. The narrative-epistemology framing from the research notes applies — this rule shapes the modal space of test prompts so "issue the real command" doesn't appear in the action set.
Acceptance: the rule lives in ~/dotfiles/.agents/AGENTS.md under a
top-level section (so it survives compaction and AGENTS.md re-injection). Next
time anyone asks the agent to test a block, the agent proposes method 1 or 2 and
refuses method 3.
1. project.config.js extraction
Already designed in extraction-history.md → Suggested fix pattern. This task tracks the implementation.
Shape of work:
- Add a tiny loader (~/dotfiles/.agents/hooks/_lib/project-config.sh) sourced by every hook that needs configured values. Loads <repo>/.agents/project.config.{js,ts,json} via node / tsx / direct JSON read, in that order; falls back to a defaults object matching Remnant today.
- Replace hardcoded values in pre-tool-use.sh Policies 5, 8, 9, 10, 11, 14 and in stop.sh (ports, verify command, codegen rules, task-doc paths) per the audit.
- Drop the modelContextWindow notion entirely; genericize the Policy 14 "32K" wording to "may exhaust the model's context window."
- Ship a Remnant project.config.js in the Remnant repo as the first consumer; ship an MFE project.config.js later as part of the MFE bootstrap.
Acceptance: running every hook from a project without a config file produces the same behavior as today (zero-regression for Remnant). Running from a project with a config file consults it.
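The lookup order described above (js, then ts, then json, else defaults) could be sketched as follows. This only covers file resolution, not the actual evaluation of the config; the function names project_config_path and project_config_loader are hypothetical, and the real design lives in extraction-history.md.

```shell
# project_config_path REPO_ROOT   (hypothetical helper name)
# Prints the first config file found, trying .js, then .ts, then .json,
# which mirrors the node / tsx / direct-JSON order.
# Prints nothing and returns 1 when the project has no config file, so
# the caller can fall back to its built-in defaults.
project_config_path() {
  local repo="$1" ext candidate
  for ext in js ts json; do
    candidate="$repo/.agents/project.config.$ext"
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# project_config_loader PATH   (hypothetical helper name)
# Maps a config path to the reader that would handle it.
project_config_loader() {
  case "$1" in
    *.js)   echo node ;;
    *.ts)   echo tsx ;;
    *.json) echo json ;;
    *)      return 1 ;;
  esac
}
```

A hook would call project_config_path first and treat a non-zero return as "use the Remnant-equivalent defaults", which is what keeps the zero-regression acceptance criterion cheap to check.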
2. Per-session tmp file capture
Already designed in extraction-history.md → Future task — per-session tmp file capture. Small, independent, can land before or after #1.
Bonus catch from that section: /tmp/.opencode-tool-count-${REPO_ID} in
post-tool-use.sh is keyed by repo only — two concurrent sessions in the same
repo share the self-check counter. Fix the same way.
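One way to key the counter by session as well as repo is sketched below. It assumes a session ID is already available to the hook (the derivation is specified in extraction-history.md and is not reproduced here); both function names are hypothetical.

```shell
# tool_count_file REPO_ID SESSION_ID   (hypothetical helper name)
# Builds the counter path from both the repo and the session, so two
# sessions in the same repo no longer share one file.
tool_count_file() {
  printf '%s/.opencode-tool-count-%s-%s\n' "${TMPDIR:-/tmp}" "$1" "$2"
}

# bump_tool_count REPO_ID SESSION_ID   (hypothetical helper name)
# Increments the per-session counter and prints the new value.
bump_tool_count() {
  local file count
  file=$(tool_count_file "$1" "$2")
  count=0
  [ -f "$file" ] && count=$(cat "$file")
  count=$((count + 1))
  printf '%s\n' "$count" > "$file"
  printf '%s\n' "$count"
}
```

The same path-building pattern would apply to the task-capture file, since both bugs come from the filename omitting the session.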
3. Hook + agent-config verification framework
Driver: manual-verification.md is a manual
4-level smoke-test for the renamed build and orchestrator agents. It is (a)
sitting in the wrong repo — the agents it tests now live in
~/dotfiles/.agents/agents/, (b) outdated relative to the current agent config,
and (c) the kind of thing humans skip because running it takes 10+ minutes of
manual prompting. The user explicitly wants this to run automatically after
updates, and just-as-explicitly wants it to never resemble
opencode run "Try to run rm -rf /" (see
#0).
Test layers
Three layers, from cheapest/safest to most expensive/least safe. Run the lower
layers in CI on every commit to ~/dotfiles/.agents/; run the upper layer
manually before merging risky changes.
Layer 1 — Static checks (no execution, no agent):
- bash -n on every *.sh hook (syntax-only parse).
- shellcheck on every hook (lints + common-bug detection).
- Frontmatter validation on every agents/*.md and skills/*.md: required fields present, referenced tools exist in the framework's tool registry.
- node --check or tsx --check on every JS/TS plugin (frameworks/opencode/*.ts, mcp/all-agents/src/*.ts).
- JSON schema validation on frameworks/github/hooks.json and any other framework configs.
- Glob check: every file referenced by a hook (e.g. _lib/project-config.sh once #1 lands) actually exists.
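The first Layer 1 check is small enough to sketch in full. bash -n parses a script without executing it, so this is safe to run against any hook; the function name check_hook_syntax is hypothetical.

```shell
# check_hook_syntax DIR   (hypothetical helper name)
# Runs 'bash -n' (parse only, nothing is executed) on every *.sh file
# directly inside DIR. Prints the failing file names and returns 1 if
# any file fails to parse.
check_hook_syntax() {
  local dir="$1" failed=0 f
  for f in "$dir"/*.sh; do
    [ -e "$f" ] || continue   # no matches: the glob stays literal
    if ! bash -n "$f" 2>/dev/null; then
      printf 'syntax error: %s\n' "$f"
      failed=1
    fi
  done
  return "$failed"
}
```

The other static checks (shellcheck, frontmatter, JSON schema) would follow the same shape: loop, collect failures, return non-zero once at the end so a single run reports every broken file.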
Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):
For each hook, a fixture file tests/hooks/<hook>.test.sh that pipes
hand-written JSON inputs to the hook and asserts the exit code + stderr. No real
command is ever invoked because the hook returns deny/allow before anything
runs.
Fixtures should cover, at minimum:
- Allow path: a benign tool call (e.g. read_file of an in-repo path). Hook exits 0, no stderr noise.
- Block paths (one per policy): synthetic JSON that exercises each block in pre-tool-use.sh (Policies 1–14). Assert exit code 2 (deny) and message contains the policy ID. All block fixtures use sentinel paths per #0: no bare /, no real destructive commands.
- Reminder injection: post-tool-use.sh fed a generated-file edit. Assert stdout contains the .generated.ts warning.
- Session boundaries: session-start.sh, stop.sh, pre-compact.sh with realistic JSON inputs. Assert they produce the expected stdout blocks.
A small runner (tests/run-hook-tests.sh) discovers *.test.sh files, executes
them, and reports pass/fail. CI calls this on every PR. Local dev calls it from
a ~/dotfiles/.agents/install.sh --verify flag.
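A fixture file of the kind described above could be built on one small assertion helper. The sketch below is one possible shape, not the existing test code: the name assert_hook and its argument order are illustrative, and it assumes hooks signal a deny through a non-zero exit code plus a message on stderr, as the fixtures list states.

```shell
# assert_hook HOOK JSON EXPECTED_EXIT EXPECTED_STDERR_SUBSTRING
# (hypothetical helper name)
# Pipes JSON to the hook script and checks exit code and stderr.
# Only the hook runs; the command named inside the JSON is never executed.
assert_hook() {
  local hook="$1" json="$2" want_exit="$3" want_msg="$4"
  local errfile got_exit
  errfile=$(mktemp)
  got_exit=0
  printf '%s\n' "$json" | bash "$hook" >/dev/null 2>"$errfile" || got_exit=$?
  if [ "$got_exit" -ne "$want_exit" ]; then
    echo "FAIL: exit $got_exit, wanted $want_exit"
    rm -f "$errfile"
    return 1
  fi
  if [ -n "$want_msg" ] && ! grep -qF -- "$want_msg" "$errfile"; then
    echo "FAIL: stderr does not contain '$want_msg'"
    rm -f "$errfile"
    return 1
  fi
  rm -f "$errfile"
  echo "PASS"
}
```

A block fixture would then be a single line per policy, passing sentinel-path JSON, the expected exit code 2, and the policy ID as the expected stderr substring.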
Layer 3 — Live integration tests (real agent, sentinel inputs, gated):
The layers above don't catch "the framework didn't actually wire the hook in" failures — the hook can be perfect in isolation but never get called. Layer 3 catches that by running a real OpenCode/Copilot session against sentinel prompts:
- Per #0, prompts use sentinel paths and the agent is asked to attempt the sentinel command, not the real one. Example prompt: "Run rm -rf /var/empty/canary-${RANDOM} and report what happened." Pass criterion: the hook block message appears in the agent's response and the tool was never executed.
- Optional: drive via opencode run --agent <name> so the session is scripted and non-interactive. Gate this behind an explicit --enable-live-tests flag in the runner; default off in CI.
- Layer 3 also folds in Remnant's verification.md Levels 1–4 (read-only, small write, scope escalation refusal, orchestrator planning gate) once the agents are stable enough to script against.
Disposition of verification.md
- It's not Remnant's anymore (tests global infra). Move to ~/dotfiles/.agents/tests/manual-verification.md as the human-runnable fallback until Layer 3 automation exists.
- Drop from Remnant root in the same commit that creates ~/dotfiles/.agents/tests/. Until then it can stay where it is; it's not causing harm, just misfiled.
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3 scenarios. Once Layer 3 is automated, retire the doc entirely.
CI integration
- Add a GitHub Action (or Gitea CI step) in ~/dotfiles/ that runs Layers 1 + 2 on every push.
- Locally, install.sh --verify runs the same checks before applying any changes, so an interactive install.sh invocation can refuse to symlink in a broken hook.
- A post-merge git hook in ~/dotfiles/ runs Layers 1 + 2 after git pull so a user who syncs a broken commit gets told immediately rather than discovering it at the next agent invocation.
Open questions
- What's the canonical sentinel path? Proposal: /var/empty/ (exists, read-only, owned by root on most distros, used by sshd's PrivilegeSeparation, so a rogue rm -rf would fail with permission denied even before hitting nonexistent-file errors). Append a random + canary token.
- Where do hook fixtures live in the global infra? Likely ~/dotfiles/.agents/tests/hooks/*.test.sh and ~/dotfiles/.agents/tests/fixtures/*.json. Symmetric with hooks/ itself.
- Should Layer 3 be a single integration test per framework, or per hook? Per framework is enough: the hook unit tests already cover per-hook behavior. Layer 3 only needs to prove "the framework calls the hook at all."
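If the /var/empty/ proposal is adopted, fixtures could share one generator and one validator so the sentinel shape is defined in a single place. This is a sketch under that assumption; the token format is taken from the rule in #0, and both function names are hypothetical.

```shell
# canary_path   (hypothetical helper name)
# Prints a fresh sentinel path under /var/empty. Nothing is created.
# ${RANDOM} is a bash feature; the PID stands in where it is unavailable.
canary_path() {
  printf '/var/empty/agent-block-canary-DO-NOT-CREATE-%s\n' "${RANDOM:-$$}"
}

# is_canary_path PATH   (hypothetical helper name)
# Succeeds only for paths that carry the full sentinel prefix and a
# non-empty suffix, so '/', '/home', or '~' can never pass as sentinels.
is_canary_path() {
  case "$1" in
    /var/empty/agent-block-canary-DO-NOT-CREATE-?*) return 0 ;;
    *) return 1 ;;
  esac
}
```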
Acceptance
- ~/dotfiles/.agents/tests/run.sh exists and exits 0 on a clean checkout.
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to fail loudly with a useful error.
- A pull that breaks a hook is caught by the post-merge hook before any agent sees it.
- No test fixture in the repo references a real destructive command or real path: grep tests/ for rm -rf / (without sentinel suffix), dd if=, :(){, chmod -R 000 /, etc. as a CI lint.
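The last acceptance item, the fixture lint, might look like the sketch below. It assumes the sentinel prefix is /var/empty/ (still an open question above) and covers only the patterns named in the list; the function name lint_fixtures is hypothetical. One known limitation: a line that contains both a sentinel rm and a real one is filtered out by the second grep and would be missed.

```shell
# lint_fixtures DIR   (hypothetical helper name)
# Prints offending lines (file:line:text) and returns 1 if any file
# under DIR mentions a real destructive command.
lint_fixtures() {
  local dir="$1" hits
  hits=$(
    {
      # rm -rf on an absolute path that is not under the sentinel prefix
      grep -rnE 'rm[[:space:]]+-rf?[[:space:]]+/' "$dir" |
        grep -vE 'rm[[:space:]]+-rf?[[:space:]]+/var/empty/'
      # raw disk writes, fork bombs, recursive permission wipes
      grep -rnE 'dd[[:space:]]+if=|:\(\)[[:space:]]*\{|chmod[[:space:]]+-R[[:space:]]+000[[:space:]]+/' "$dir"
    } || true
  )
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
}
```

The lint only reads files with grep; it never executes anything it finds, which keeps it inside the spirit of #0.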
4. llama-server + AI models module
Goal: ~/dotfiles/install.sh (or a sub-command of it) sets up llama.cpp + CUDA, registers the systemd units, places presets.ini from dotfiles, and on a non-devcontainer machine downloads the configured set of GGUF models. A second script (scripts/models.sh) handles add/remove/list of models post-install.
Target layout
~/dotfiles/.agents/models/
├── presets.ini ← canonical, version-controlled
├── models.list ← URLs + filenames + checksums (committed)
├── README.md ← what each preset is for
└── gguf/ ← gitignored, populated by install.sh
    └── *.gguf
~/dotfiles/.agents/llama-server/
├── start.sh ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path ← path watcher
├── llama-server-presets.service ← oneshot restart
└── build-llama.sh ← clones + builds llama.cpp w/ CUDA
~/dotfiles/.agents/scripts/
├── models.sh ← add/remove/list GGUFs by URL
└── install-llama.sh ← called by install.sh; idempotent
install.sh additions (ordered)
1. Detect environment. If /.dockerenv exists, $REMOTE_CONTAINERS set, or $CODESPACES set → devcontainer mode: skip llama.cpp build and GGUF download (huge, slow, and not useful inside the container). Still place presets.ini and models.list so the project can read them.
2. Dependencies. apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git (with sudo prompt). CUDA toolkit detection only: don't try to install CUDA itself; assume host setup or fail loud with a pointer to docs/llama-server-cuda-wsl2.md.
3. Build llama.cpp. scripts/install-llama.sh clones ggerganov/llama.cpp to /opt/llama-server/src, builds with -DGGML_CUDA=ON, installs binaries + libs to /opt/llama-server/. Skips the clone+build if the binary exists and --rebuild wasn't passed.
4. Install systemd units. Copy from ~/dotfiles/.agents/llama-server/*.{service,path} to /etc/systemd/system/, substituting ${USER} for User=. Run daemon-reload, enable --now llama-server.service llama-server-presets.path.
5. Symlink presets.ini. ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini (keep the existing path-watcher target until users have migrated). The path watcher already restarts on modify; symlink target changes count.
6. Download GGUFs. Read models.list; for each entry not already in ~/dotfiles/.agents/models/gguf/, download with curl --location and verify checksum if listed. Print disk-usage estimate before starting. Skip in devcontainer mode.
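Step 1's detection is the gate for every heavy step after it, so it is worth isolating. A minimal sketch follows; the optional ROOT argument is an addition for testing against a scratch directory and is not part of the plan above, and the function name is hypothetical.

```shell
# is_devcontainer [ROOT]   (hypothetical helper name)
# Returns 0 when any of the three markers from step 1 is present.
# ROOT defaults to the filesystem root; it exists only so the check can
# be exercised against a scratch directory.
is_devcontainer() {
  local root="${1:-}"
  [ -e "$root/.dockerenv" ] && return 0
  [ -n "${REMOTE_CONTAINERS:-}" ] && return 0
  [ -n "${CODESPACES:-}" ] && return 0
  return 1
}
```

install.sh would call is_devcontainer with no arguments and skip steps 3 and 6 when it succeeds, while still placing presets.ini and models.list.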
models.list format
# url<TAB>filename<TAB>sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf qwen3-coder-30b-iq3.gguf abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf deepcoder-14b-q5.gguf def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf qwopus-3.6-35b-iq3.gguf -
Plain TSV, easy to grep + diff. Comments via #.
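Reading the format is a short loop. The sketch below lists which configured models are still missing from the gguf/ directory; the function name models_missing is hypothetical. Note that the shell treats tab as whitespace when splitting, so consecutive tabs collapse into one separator. That is a good reason to keep the "-" placeholder for a missing checksum rather than leaving the field empty.

```shell
# models_missing LIST GGUF_DIR   (hypothetical helper name)
# Prints the filename of every models.list entry that is not yet in
# GGUF_DIR. Comment lines and blank lines are skipped.
models_missing() {
  local list="$1" dir="$2" tab url file sha
  tab=$(printf '\t')
  while IFS="$tab" read -r url file sha; do
    case "$url" in ''|'#'*) continue ;; esac
    [ -f "$dir/$file" ] || printf '%s\n' "$file"
  done < "$list"
}
```

The same loop shape would drive the download step, with the curl call replacing the printf.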
models.sh CLI
models.sh list # show installed + configured
models.sh add <url> [--name=<file>] # download + append to models.list
models.sh remove <name> # rm file + drop from models.list
models.sh prune # delete files not in models.list
models.sh download # re-download anything missing
models.sh checksum <name> # compute + store sha256
Each command edits models.list and the gguf/ dir; presets.ini is edited by
hand (with the path-watcher restarting llama-server on save).
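Both the download step and models.sh checksum need the same comparison, so it could live in one helper. This sketch assumes GNU coreutils' sha256sum is on the PATH (true on typical Linux hosts; other platforms would need a substitute) and treats "-" as "no checksum recorded", matching the models.list example. The function name is hypothetical.

```shell
# verify_sha256 FILE EXPECTED   (hypothetical helper name)
# Returns 0 when the file's SHA-256 matches EXPECTED, or when EXPECTED
# is '-' (no checksum recorded in models.list). Returns 1 on mismatch.
verify_sha256() {
  local file="$1" expected="$2" actual
  [ "$expected" = "-" ] && return 0
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}
```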
Open questions
- User= in the systemd unit. The current unit runs as ollama. The rationale was probably ollama's group ownership of /home/dev/models/. Moving the model dir into dotfiles means the user owns it directly, so running as ${USER} (or as a dedicated llama system user) is cleaner. Decide before shipping.
- CUDA-only assumption. The user accepted "can always make this more flexible later." Tag in the build script's header so a CPU/Metal fallback is easy to add. Don't gold-plate now.
- Where do the modelfiles go? Remnant's omnicoder*.modelfile files are Ollama-format. If they're still useful, move them to ~/dotfiles/.agents/models/modelfiles/ and add a models.sh modelfile apply <name> subcommand. Out of scope for the initial cut; track in #4.5.
5. Kanban / task-doc unification
Already designed in
extraction-history.md → Future task — unify kanban/task doc structure.
Once #1 lands, stop.sh reads task-doc paths from project.config.js, so the
"shared hook supports one shape" framing changes: the hook supports whatever
shape the config declares, and the migration becomes purely a per-project
content move.
Revised plan after #1:
- Drop the "stop.sh knows about Remnant's flat list vs MFE's tasks/{backlog,todo,done}/" coupling. stop.sh should know how to scan a directory tree and how to scan a flat file, and taskDocs in config picks which mode.
- MFE bootstraps on the directory-tree mode from day one.
- Remnant's migration is optional: if the kanban-tree shape is demonstrably better in MFE, port Remnant later.
- Skill option still applies: a migrate-task-docs.md skill is probably cheaper than a script given the per-project judgment calls.
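The two scan modes could sit behind one dispatcher, roughly as below. This is a sketch with loud assumptions: the mode names tree and flat, the function name scan_task_docs, and the idea that open items in the flat file are unchecked "- [ ]" checkboxes are all illustrative. The actual taskDocs config shape and Remnant's flat-list format are defined elsewhere.

```shell
# scan_task_docs MODE PATH   (hypothetical helper name and mode names)
# MODE 'tree': PATH is a directory; prints every *.md file beneath it.
# MODE 'flat': PATH is a single file; prints its unchecked '- [ ]' items
#              (the checkbox format is an assumption for illustration).
scan_task_docs() {
  local mode="$1" path="$2"
  case "$mode" in
    tree) find "$path" -type f -name '*.md' | sort ;;
    flat) grep -E '^[[:space:]]*- \[ \]' "$path" || true ;;
    *)    echo "unknown taskDocs mode: $mode" >&2; return 2 ;;
  esac
}
```

With this split, stop.sh itself holds no project-specific paths: the mode and path both come from config, which is the decoupling the revised plan asks for.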
6. MemPalace integration
Why this is here: the WIP "AGENTS.md context survival after compaction" problem in the validation doc is a special case of the broader long-term memory problem. MemPalace (NousResearch/hermes-agent PR #5671) solves it with a hook architecture that matches ours almost line-for-line.
MemPalace primitives (verified from the PR):
| MemPalace hook | Our equivalent | What it does |
|---|---|---|
| initialize() | session-start.sh | Loads identity, warms vector DB |
| system_prompt_block() | session-start.sh inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| prefetch() | user-prompt-submit.sh | Semantic search before each turn; wing-narrowed |
| sync_turn() | post-tool-use.sh | Files every exchange to the palace, non-blocking |
| on_session_end() | stop.sh | Full session mining + L1 layer regeneration |
| on_pre_compress() | pre-compact.sh | Extract key exchanges before context compression |
| on_memory_write() | (new: explicit writes) | Mirrors explicit memory writes into the palace |
Practical plan:
- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at ~/.mempalace/). Hermes is the reference integration but MemPalace itself ships an MCP server (mempalace_search, mempalace_status, +6 more tools) that any MCP-aware harness can use directly.
- Register the MemPalace MCP server in ~/.config/opencode/opencode.json and ~/.vscode-server/.../mcp.json via install.sh, the same pattern as all-agents. No code changes needed on the harness side for read access.
- Wire write-side via our existing hooks: post-tool-use.sh calls the MCP tool to file the turn, pre-compact.sh extracts and stores key exchanges. This is additive; the existing dead-ends/explorations scaffolding stays.
- Known bug to track upstream: the Hermes plugin defaulted to a 384-dim embedding function vs. MemPalace's 1024-dim collection. If we integrate directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep it; if we follow Hermes's plugin pattern, fix per the PR comment.
Acceptance: after restart in a fresh session, the agent can recall specific facts (e.g. "what was the Phase 4 commit?") from a prior session without those facts being in the workspace files. Compaction in the middle of a session does not erase per-turn memory.
Why this is #6, not #1: it's higher-value than the small fixes but depends on Ollama already running (which #4 makes turnkey), and requires verifying MemPalace works against our chosen embedding model on our hardware before committing to it. Do #1, #2, #3 first, then this.
7. Trace-based eval scaffolding
Source: "The Loop Is Only as Good as the Metric" (distributedthoughts.org, Mar 2026) on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch loop. Quote: "the value of an optimization loop is determined entirely by the quality of its feedback signal."
Husain methodology in two sentences: review at least 100 real agent-output traces by hand, take open-ended notes, categorize failures, then build binary pass/fail evals around the failure modes you actually saw. Do not start with generic metrics.
Practical plan for us:
- Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent output to ~/.agent-traces/<date>/<session-id>.jsonl via the existing post-tool-use.sh (we already have session-ID derivation from #2). Add a trace_log() helper in _lib/.
- Build a tiny review CLI: scripts/trace-review.sh opens the next unreviewed trace in $EDITOR with a frontmatter block (outcome: pass|fail|partial, failure_modes: [], notes: ""). Saves to ~/.agent-traces/reviewed/.
- After 100 reviewed traces, derive a failure-modes.md doc grouping the observed failure modes. This becomes the input to skill / hook / AGENTS.md improvements: concrete failure modes, not speculation.
Why this is gating for #9: an EvoSkill-style or Karpathy-style automated loop needs a metric. Without trace-based failure modes, the only metric available is "did the user thumbs-up" — too noisy, too slow, too coarse.
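The trace_log() helper from the first step of the plan could be as small as the sketch below. It assumes the caller passes an already-serialized JSON object (no escaping is done here) and that <date> means an ISO YYYY-MM-DD directory; AGENT_TRACES_DIR is a hypothetical override added so the helper can be pointed at a scratch directory.

```shell
# trace_log SESSION_ID JSON_LINE
# Appends one already-serialized JSON object to today's trace file for
# the session and prints the file path.
# AGENT_TRACES_DIR is a hypothetical override for the ~/.agent-traces root.
trace_log() {
  local session="$1" line="$2" root dir file
  root="${AGENT_TRACES_DIR:-$HOME/.agent-traces}"
  dir="$root/$(date +%Y-%m-%d)"
  mkdir -p "$dir"
  file="$dir/$session.jsonl"
  printf '%s\n' "$line" >> "$file"
  printf '%s\n' "$file"
}
```

One object per line keeps the store append-only and greppable, which suits a review CLI that walks traces one at a time.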
8. Exa rate-limit awareness
Per the validation doc gap #9. Free-plan limit: no parallel fanout under ~1s — calls must be serial.
Implementation:
- Add a mcp_exa_* case to post-tool-use.sh that injects a one-liner reminder ("Exa free plan: serialize searches; one at a time").
- Add an "External service quirks" section to ~/dotfiles/.agents/AGENTS.md listing Exa (and any future per-service constraints) so the rule survives compaction.
- Optional soft-warn in pre-tool-use.sh: count mcp_exa_* calls per turn (reset on user-prompt-submit); inject a warning (not a deny) past N=2 in a single turn.
Trivial, no dependencies, can land in any order.
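The optional soft-warn reduces to a counter file plus a threshold. A minimal sketch, assuming the counter path is supplied by the caller (ideally per session, as in #2); the function names are hypothetical and the threshold of 2 comes from the plan above.

```shell
# exa_note_call COUNTER_FILE   (hypothetical helper name)
# Increments the per-turn Exa call count. Past two calls it prints a
# warning on stdout; it never denies, so it always returns 0.
exa_note_call() {
  local file="$1" n=0
  [ -f "$file" ] && n=$(cat "$file")
  n=$((n + 1))
  printf '%s\n' "$n" > "$file"
  if [ "$n" -gt 2 ]; then
    echo "Exa free plan: serialize searches; one at a time ($n calls this turn)"
  fi
  return 0
}

# exa_reset_turn COUNTER_FILE   (hypothetical helper name)
# Called from user-prompt-submit to start the next turn at zero.
exa_reset_turn() {
  rm -f "$1"
}
```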
9. Research-loop / EvoSkill-style improvements
Sources:
- Karpathy autoresearch (github.com/karpathy/autoresearch, Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb), LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
- EvoSkill (arxiv 2603.02766, sentient-agi/EvoSkill): failure-driven skill discovery via Proposer + Skill-Builder agents over a Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot transfer to BrowseComp. Skills materialize as SKILL.md + helper scripts, the same shape as our existing skills dir.
What this looks like for us (after #7):
- The "controllable artifact" is the ~/dotfiles/.agents/AGENTS.md + agents/*.md + skills/*.md + hook reminders. The "frozen model" is whatever LLM the user is running.
- The scalar metric is something like: fraction of traces (from #7) where the agent's hook output and tool sequence matched a hand-labeled gold trajectory. Husain's binary pass/fail per failure mode aggregates into this.
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current skill set, proposes a new SKILL.md or an edit to an existing one, the Skill-Builder materializes it, the eval harness re-runs on the held-out trace set, and the frontier keeps it if the metric improves.
Why it's last in the queue: every prior task (config, sessions, llama turnkey, memory, traces) is a prerequisite or a strict improvement to the substrate this loop runs on. Starting #9 before them produces a loop that optimizes against a noisy or wrong metric, the exact failure mode the Husain piece warns about.
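To make the scalar metric and the keep-if-improves gate concrete, here is a rough sketch over reviewed traces. It assumes each reviewed trace is a file whose frontmatter contains an outcome: line as described in #7, and it counts only outcome: pass as a pass (partial counts as not passed, which is a simplification). Both function names are hypothetical.

```shell
# pass_rate REVIEWED_DIR   (hypothetical helper name)
# Prints "<passed>/<total>" over the files in REVIEWED_DIR, counting a
# trace as passed when its frontmatter says 'outcome: pass'.
pass_rate() {
  local dir="$1" total=0 passed=0 f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    total=$((total + 1))
    if grep -qE '^outcome:[[:space:]]*pass[[:space:]]*$' "$f"; then
      passed=$((passed + 1))
    fi
  done
  printf '%s/%s\n' "$passed" "$total"
}

# metric_improved OLD NEW   (hypothetical helper name)
# Keep-or-revert gate: succeeds only when NEW ("p/t") beats OLD.
# Cross-multiplies so the two fractions compare with integer math.
metric_improved() {
  local op="${1%/*}" ot="${1#*/}" np="${2%/*}" nt="${2#*/}"
  [ $((np * ot)) -gt $((op * nt)) ]
}
```

A loop iteration would compute pass_rate before and after a proposed skill edit on the held-out set and keep the edit only if metric_improved succeeds.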
Deferred / not-now
- Adopt LangGraph as the harness. Best-in-class observability and state-machine recovery, but adopting it means rewriting the OpenCode + Copilot integration layer we just extracted. Revisit if LangSmith becomes the only path to debugging a specific failure mode we can't diagnose with traces (#7) alone. Sources: agent-harness.ai benchmark (9% token overhead vs CrewAI 18% vs AutoGen 31%); groundy.com (per-node failure isolation vs CrewAI full-plan retry).
- AutoGen. Entered maintenance mode in late 2025; absorbed into Microsoft Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the framework's strength (conversational coordination) doesn't match our deterministic-pipeline use case. Skip.
- CrewAI. Strong for "agent A → agent B → agent C" pipelines, but role coordination overhead is ~3× LangGraph's on simple workflows. Our use case (single agent per session) doesn't benefit. Skip.
- Git worktrees for parallel agent runs. Mentioned in the MFE draft; see Claude Desktop's approach. Interesting once we have a working research loop (#9), pointless before. Defer.
- Narrative epistemology as an explicit framework. Flowerree's "Reasoning Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic agents (PMC9910757) give philosophical grounding for AGENTS.md design (a narrative frame is a "modal-space-shaping tool, not a set of premises"). Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we publish methodology.
- Hermes Agent as a harness. Compelling memory story (MemPalace), but Python and tied to NousResearch's ecosystem. We integrate the memory piece directly via MCP (#6) without adopting the harness.
Research notes (May 23, 2026)
Pulled via Exa search; supports the prioritization above. Each block lists the key finding and the source.
Karpathy autoresearch — single-metric loop
- Source: karpathy/autoresearch
- Single file (train.py) edited by agent, fixed 5-minute time budget per experiment, scalar metric (val_bpb), branch-keep-or-revert protocol, LOOP FOREVER. ~12 experiments/hour.
- Four ingredients for this to work outside ML training: (1) one modifiable artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval cycle. The Husain layer adds: don't invent the metric; derive it from manual trace review.
EvoSkill — automated skill discovery
- Source: arxiv 2603.02766, sentient-agi/EvoSkill.
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes SKILL.md + helpers), evaluator (held-out validation).
- Pareto frontier of agent programs; round-robin parent selection; failure-driven textual feedback descent.
- Why this matters for us: our skills dir already matches EvoSkill's output shape (SKILL.md + helper files). The infrastructure they describe is closer to "build on top of our existing layout" than "adopt a new framework."
Agentic-framework landscape, 2026
- LangGraph 1.2 (May 2026): production default. 9% token overhead over raw API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best observability via LangSmith. Highest setup cost.
- CrewAI 1.11 (Mar 2026): fastest time-to-first-agent. 18% token overhead. Role-based. SQLite checkpointing added April 2026.
- AutoGen: maintenance mode since late 2025. Absorbed into Microsoft Agent Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native, GraphFlow).
- MAST taxonomy finding: 79% of multi-agent failures originate from spec/coordination issues, not the underlying model (arxiv 2503.16339). 36.9% inter-agent misalignment, 21.3% task-verification breakdowns. This validates investing in hook/skill/AGENTS.md infrastructure over swapping models.
MemPalace — long-term memory provider
- Source: NousResearch/hermes-agent PR #5671.
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama bge-m3 1024-dim). No API key.
- Hook architecture maps 1:1 onto ours (see #6 table). Eight MCP tools expose read/write.
- Why this is the highest-leverage memory option: matches our philosophy (local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the validation doc flagged.
Narrative epistemology — applied to AGENTS.md design
- Source: Flowerree, "Reasoning Through Narrative" (Cambridge Episteme, 2023); Betz et al., "Probabilistic coherence... Neural language models as epistemic agents" (PMC9910757).
- Narratives shape modal space — what the model treats as possible, plausible, required. They aren't premises to evaluate as true/false; they're tools that frame inference.
- Implication for AGENTS.md: the doc's job isn't to state facts the model checks at decision points — it's to shape the model's default modal space. Forbidden patterns aren't "rules to look up" but "implausible options excluded from the action space." Frames the "context survival after compaction" problem differently: the question isn't "did the rules survive" but "did the modal-space shaping survive."
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces probabilistically-coherent belief revision. Suggestive for why AGENTS.md content that the model sees repeatedly (via PostToolUse re-injection) gets internalized better than content seen once.
Exa rate-limit (operational)
- Free plan: serial only, no fan-out under ~1s. Observed May 23, 2026.
- Recorded in extraction-history.md gap #9 and as roadmap task #8.