dotfiles/.agents/docs/roadmap.md
Brydon DeWitt 83f456f25b fix(plugin): guard against undefined output.output for MCP tools
MCP tools don't populate output.output in the tool.execute.after hook —
the MCP content flows through OpenCode's internal parts pipeline instead.
This caused a crash: undefined is not an object (evaluating 'text.length')
in the truncate function.
2026-06-06 02:11:24 -04:00

Dotfiles Agent Infrastructure — Roadmap

Status: Planning. Companion to extraction-history.md, which covers the already-shipped extraction work and the validation findings against it.

Scope of this doc: future tasks against ~/dotfiles/.agents/ and the ecosystem around it. Research that informs the prioritization is captured in the "Research notes" section at the bottom — read those first if any of the task rationale feels opaque.

How to use this doc: the "Tasks" list is ordered by recommended execution order (high leverage + low risk first). Each entry links to its design section. Move sections to dedicated docs once they grow past ~80 lines.

Land before anything else: the No-Live-Fire safety rule. One-paragraph addition to ~/dotfiles/.agents/AGENTS.md; takes 5 minutes; protects against the opencode run "Try to run rm -rf /" failure mode where a model takes the prompt literally if the hook fails to block.

Then relocate this doc out of Remnant: see Doc relocation (Remnant cleanup). This roadmap, agent-infra-extraction.md, and verification.md are not Remnant-specific and should live in ~/dotfiles/ so Remnant's docs/projects/ contains only Remnant-app work. Do this after #0 and before resuming any numbered task below — once moved, the tasks list executes against the dotfiles copy and Remnant is free to evolve independently.


Doc relocation (Remnant cleanup)

Goal: Remnant's repo contains only Remnant-app docs. Everything about ~/dotfiles/.agents/ lives in ~/dotfiles/docs/ (or ~/dotfiles/.agents/docs/ — pick one and stick with it; the existing agent-infrastructure.md stub already references ~/dotfiles/.agents/docs/agent-infrastructure.md, so that's the established location).

Why now (priority: immediately after #0): the user wants Remnant in a good state to work on independently. Every agent-infra doc sitting in docs/projects/ is noise for Remnant-app planning sessions and gets auto-injected as context whenever an agent touches docs/projects/. Moving them is mechanical and reversible.

Files to relocate:

| Current path | Destination | Notes |
| --- | --- | --- |
| docs/projects/dotfiles-agent-infra-roadmap.md (this file) | ~/dotfiles/.agents/docs/roadmap.md | Update internal links. Drop "Remnant" framing in the intro; it's just the roadmap once it lives there. |
| docs/projects/agent-infra-extraction.md | ~/dotfiles/.agents/docs/extraction-history.md | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
| verification.md (repo root) | ~/dotfiles/.agents/tests/manual-verification.md | Already specified as part of #3; do the move now rather than waiting for the test harness. |
| docs/projects/agent-infrastructure.md | Stay (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside docs/projects/COMPLETED.md | Split out to ~/dotfiles/.agents/docs/completed.md | Audit first; if there's nothing agent-infra-specific there, skip. |

Steps:

  1. mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests
  2. Move each file into ~/dotfiles/. git mv cannot cross repositories, so copy the file, git rm it inside Remnant to stage the delete, then git add the copy in dotfiles. There's no meaningful history to preserve across repos for these short-lived docs; if history matters for agent-infra-extraction.md, use git format-patch + git am instead.
  3. Rewrite intra-doc links: this file's references to ./agent-infra-extraction.md become ./extraction-history.md; references to verification.md become ../tests/manual-verification.md.
  4. Find inbound links from anywhere in Remnant (grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification.md" ~/code/remnant) and either delete them or repoint at the dotfiles copies via absolute paths (e.g., ~/dotfiles/.agents/docs/roadmap.md).
  5. Audit docs/projects/COMPLETED.md for agent-infra rows; split if any exist.
  6. Update AGENTS.md files in Remnant if any reference the moved docs.
  7. Commit Remnant deletion and dotfiles addition together (or back-to-back commits with cross-references in the messages).
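
The inbound-link sweep in step 4 can be sketched as a small helper. This is a minimal sketch, not the project's script: the function name find_inbound_links is hypothetical, and the only change from the grep in step 4 is that the dot in verification.md is escaped so it matches literally.

```shell
# Hypothetical helper for step 4: print every line under a root directory
# that still mentions one of the relocated docs.
find_inbound_links() {
  root="$1"
  # -r recurse, -n line numbers, -E extended regex; skip .git internals.
  # "|| true" turns grep's no-match exit status (1) into success so callers
  # running under "set -e" are not aborted by a clean result.
  grep -rnE --exclude-dir=.git \
    'dotfiles-agent-infra-roadmap|agent-infra-extraction|verification\.md' \
    "$root" || true
}
```

Usage would be find_inbound_links ~/code/remnant. Note that the last alternative also matches manual-verification.md, so links that were already repointed still show up and need a quick eyeball.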

Acceptance: ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles' returns only agent-infrastructure.md; verification.md is gone from the Remnant root; the roadmap (this doc) opens cleanly from its new path with working links.

Risk: if any Remnant AGENTS.md instructions or docs/projects/COMPLETED.md row links into these docs and the link breaks silently, agents will follow a dead reference. Step 4 mitigates.


Tasks

  0. No-live-fire safety rule (land immediately): AGENTS.md addition forbidding real destructive commands as hook-test inputs. Prerequisite for #3 and for any manual hook testing.
  1. project.config.js extraction: unblocks non-Remnant projects; resolves 6+ hardcodes catalogued in the hook-script audit.
  2. Per-session tmp file capture: correctness bug; concurrent agent sessions clobber one another's task-capture file.
  3. Hook + agent-config verification framework: automate the smoke-test currently in Remnant's verification.md. Gated on #0 (safety rule) and benefits from #1 (config-driven test fixtures).
  4. llama-server + AI models module: user-requested; folds presets, systemd units, llama.cpp build, and GGUF acquisition into install.sh (skips heavy steps in devcontainers).
  5. Kanban / task-doc unification: blocks MFE adoption of the shared stop.sh; deferred until #1 lands so the task-doc paths come from config, not the hook.
  6. MemPalace integration for memory survival across compaction: directly addresses the "AGENTS.md context survival after compaction" WIP problem in extraction-history.md.
  7. Trace-based eval scaffolding (Husain methodology): foundation for any future automated improvement loop.
  8. Exa rate-limit awareness: small follow-up to the gap recorded in the validation doc.
  9. Research-loop / EvoSkill-style improvements: gated on #7.

Items considered and deprioritized: see Deferred / not-now.


0. No-live-fire safety rule (land immediately)

Driver: May 23 2026 incident — opencode run "Try to run rm -rf /" was used to smoke-test whether pre-tool-use.sh would block destructive commands. The run happened to be safe because the loaded model refused on its own, but if the hook had been broken and a more compliant model had been in the chair, the test would have executed rm -rf / for real. The test methodology was the bug, not the model behavior.

Rule (add verbatim to ~/dotfiles/.agents/AGENTS.md):

Testing destructive-command blocks — NEVER use live ammunition

When verifying that pre-tool-use.sh (or any other hook) blocks a dangerous command pattern, never issue the real destructive command as the test input. The hook is the system under test — if it fails, the test destroys the host.

Use one of these methods instead, in order of preference:

  1. Unit-test the hook directly. Pipe synthetic hook-input JSON to the script and check exit code + stderr. No agent in the loop. No real shell invocation. Example: echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?" The hook should exit non-zero (deny) and print the block reason. No rm was ever queued.
  2. Use a sentinel that exercises the regex but is harmless if the block fails. A path that obviously doesn't exist and could not possibly hold real data: rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}. The hook pattern (rm\s+-rf?\s+/) matches; if the block fails, the worst case is a "no such file" error on a sentinel path. NEVER use bare /, /home, ~, ., *, or any real path — those have to fail-closed even if the hook is broken.
  3. Never issue the literal destructive command (rm -rf /, dd if=/dev/zero of=/dev/sda, :(){ :|:& };:, chmod -R 000 /, git push --force to a published branch, etc.) as an agent prompt. Not even with --dry-run. Not even "just to see." Not even if you're sure the hook works. The hook MIGHT not work. That's why you're testing it.

This rule applies to humans writing test prompts AND to agents asked to verify hook behavior. If you (the agent) are asked to verify a block, refuse any plan that involves issuing the real destructive command and propose a unit-test or sentinel approach instead.

Why it lives in AGENTS.md, not just a hook: the failure mode is at the human/agent decision layer ("what command should I issue to test this?"), not at the execution layer. A hook can't catch a model that's been told to bypass the hook. The narrative-epistemology framing from the research notes applies — this rule shapes the modal space of test prompts so "issue the real command" doesn't appear in the action set.

Acceptance: the rule lives in ~/dotfiles/.agents/AGENTS.md under a top-level section (so it survives compaction and AGENTS.md re-injection). Next time anyone asks the agent to test a block, the agent proposes method 1 or 2 and refuses any plan that issues the literal destructive command (item 3).
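
Methods 1 and 2 can be exercised end to end without touching the real hook. The sketch below is illustrative only: demo_hook is a hypothetical stand-in for pre-tool-use.sh (it greps the raw payload rather than parsing JSON), and sentinel_cmd and assert_denied are made-up helper names. Exit code 2 is assumed to mean deny, matching the Layer 2 fixture description in #3.

```shell
# Hypothetical stand-in for pre-tool-use.sh: reads hook-input JSON on stdin,
# denies (exit 2) if it contains "rm -r[f] /...", allows (exit 0) otherwise.
demo_hook() {
  payload=$(cat)
  if printf '%s' "$payload" | grep -Eq 'rm[[:space:]]+-rf?[[:space:]]+/'; then
    echo "BLOCKED: recursive delete of an absolute path" >&2
    return 2
  fi
  return 0
}

# Method 2: build a sentinel command whose target cannot hold real data.
# Falls back to the shell PID where RANDOM is unavailable.
sentinel_cmd() {
  echo "rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM:-$$}"
}

# Method 1: pipe a synthetic payload to a hook and succeed only if it denies.
# Nothing is executed; the command exists only as a JSON string.
assert_denied() {
  hook="$1"; cmd="$2"
  status=0
  printf '{"tool_name":"run_in_terminal","tool_input":{"command":"%s"}}' "$cmd" \
    | "$hook" 2>/dev/null || status=$?
  [ "$status" -eq 2 ]
}
```

A fixture would then read assert_denied demo_hook "$(sentinel_cmd)", with the real hook path substituted for demo_hook once the payload shape is confirmed against the framework.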


1. project.config.js extraction

Already designed in extraction-history.md → Suggested fix pattern. This task tracks the implementation.

Shape of work:

  • Add a tiny loader (~/dotfiles/.agents/hooks/_lib/project-config.sh) sourced by every hook that needs configured values. Loads <repo>/.agents/project.config.{js,ts,json} via node, tsx, or a direct JSON read, in that order; falls back to a defaults object matching Remnant today.
  • Replace hardcoded values in pre-tool-use.sh Policies 5, 8, 9, 10, 11, 14 and in stop.sh (ports, verify command, codegen rules, task-doc paths) per the audit.
  • Drop the modelContextWindow notion entirely; genericize the Policy 14 "32K" wording to "may exhaust the model's context window."
  • Ship a Remnant project.config.js in the Remnant repo as the first consumer; ship an MFE project.config.js later as part of the MFE bootstrap.

Acceptance: running every hook from a project without a config file produces the same behavior as today (zero-regression for Remnant). Running from a project with a config file consults it.
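
The lookup order the loader follows can be sketched on its own, separate from actually evaluating the config. This is a sketch under assumptions: resolve_project_config is a hypothetical name, and it only reports which loader and file would be used; running node or tsx against the file is left out because the config's module format is not settled here.

```shell
# Hypothetical resolver: prints "<loader><TAB><path>" for the first config
# file found, in the order js -> ts -> json, or "defaults" if none exists.
resolve_project_config() {
  repo="$1"
  base="$repo/.agents/project.config"
  if   [ -f "$base.js" ];   then printf 'node\t%s\n' "$base.js"
  elif [ -f "$base.ts" ];   then printf 'tsx\t%s\n'  "$base.ts"
  elif [ -f "$base.json" ]; then printf 'json\t%s\n' "$base.json"
  else echo "defaults"
  fi
}
```

A hook would branch on the first field. The "defaults" result is what keeps the zero-regression acceptance criterion testable: a repo with no config file must take that branch every time.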


2. Per-session tmp file capture

Already designed in extraction-history.md → Future task — per-session tmp file capture. Small, independent, can land before or after #1.

Bonus catch from that section: /tmp/.opencode-tool-count-${REPO_ID} in post-tool-use.sh is keyed by repo only — two concurrent sessions in the same repo share the self-check counter. Fix the same way.
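
A minimal sketch of the fix for the counter file, assuming a session ID is already available to the hook (how it is derived is covered in extraction-history.md, not here). The helper names are hypothetical, and TMPDIR is honored only so the sketch can be tried outside /tmp.

```shell
# Hypothetical helper: build a counter path keyed by repo AND session so two
# sessions in the same repo no longer share one file.
tool_count_file() {
  repo_id="$1"; session_id="$2"
  echo "${TMPDIR:-/tmp}/.opencode-tool-count-${repo_id}-${session_id}"
}

# Increment the counter for one session and print the new value.
bump_tool_count() {
  file=$(tool_count_file "$1" "$2")
  count=0
  if [ -f "$file" ]; then count=$(cat "$file"); fi
  count=$((count + 1))
  echo "$count" > "$file"
  echo "$count"
}
```

Two sessions calling bump_tool_count with the same repo ID but different session IDs now count independently.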


3. Hook + agent-config verification framework

Driver: manual-verification.md is a manual 4-level smoke-test for the renamed build and orchestrator agents. It is (a) sitting in the wrong repo — the agents it tests now live in ~/dotfiles/.agents/agents/, (b) outdated relative to the current agent config, and (c) the kind of thing humans skip because running it takes 10+ minutes of manual prompting. The user explicitly wants this to run automatically after updates, and just-as-explicitly wants it to never resemble opencode run "Try to run rm -rf /" (see #0).

Test layers

Three layers, from cheapest/safest to most expensive/least safe. Run the lower layers in CI on every commit to ~/dotfiles/.agents/; run the upper layer manually before merging risky changes.

Layer 1 — Static checks (no execution, no agent):

  • bash -n on every *.sh hook (syntax-only parse).
  • shellcheck on every hook (lints + common-bug detection).
  • Frontmatter validation on every agents/*.md and skills/*.md: required fields present, referenced tools exist in the framework's tool registry.
  • node --check or tsx --check on every JS/TS plugin (frameworks/opencode/*.ts, mcp/all-agents/src/*.ts).
  • JSON schema validation on frameworks/github/hooks.json and any other framework configs.
  • Glob check: every file referenced by a hook (e.g. _lib/project-config.sh once #1 lands) actually exists.
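
The first Layer 1 check is small enough to sketch directly. static_check_hooks is a hypothetical name; it covers only the bash -n item from the list above and assumes hooks sit flat in one directory.

```shell
# Hypothetical Layer 1 helper: syntax-check every *.sh in a directory with
# "bash -n" (parse only, nothing is executed). Returns non-zero if any file
# fails to parse.
static_check_hooks() {
  dir="$1"
  failed=0
  for f in "$dir"/*.sh; do
    [ -e "$f" ] || continue        # directory had no *.sh files
    if bash -n "$f" 2>/dev/null; then
      echo "ok   $f"
    else
      echo "FAIL $f"
      failed=1
    fi
  done
  return "$failed"
}
```

The shellcheck, frontmatter, and schema checks would slot in as further loops with the same ok/FAIL reporting.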

Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):

For each hook, a fixture file tests/hooks/<hook>.test.sh that pipes hand-written JSON inputs to the hook and asserts the exit code + stderr. No real command is ever invoked because the hook returns deny/allow before anything runs.

Fixtures should cover, at minimum:

  • Allow path: a benign tool call (e.g. read_file of an in-repo path) — hook exits 0, no stderr noise.
  • Block paths (one per policy): synthetic JSON that exercises each block in pre-tool-use.sh (Policies 1–14). Assert exit code 2 (deny) and message contains the policy ID. All block fixtures use sentinel paths per #0: no bare /, no real destructive commands.
  • Reminder injection: post-tool-use.sh fed a generated-file edit — assert stdout contains the .generated.ts warning.
  • Session boundaries: session-start.sh, stop.sh, pre-compact.sh with realistic JSON inputs — assert they produce the expected stdout blocks.

A small runner (tests/run-hook-tests.sh) discovers *.test.sh files, executes them, and reports pass/fail. CI calls this on every PR. Local dev calls it from a ~/dotfiles/.agents/install.sh --verify flag.
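
One possible shape for the runner, as a sketch rather than the final script. It assumes each *.test.sh signals failure through its exit code; the function name and the PASS/FAIL output format are illustrative.

```shell
# Hypothetical sketch of tests/run-hook-tests.sh: run every *.test.sh in a
# directory, print one PASS/FAIL line per file plus a summary, and return
# non-zero if any test failed.
run_hook_tests() {
  dir="$1"
  pass=0; fail=0
  for t in "$dir"/*.test.sh; do
    [ -e "$t" ] || continue        # no test files found
    if bash "$t" >/dev/null 2>&1; then
      echo "PASS $(basename "$t")"; pass=$((pass + 1))
    else
      echo "FAIL $(basename "$t")"; fail=$((fail + 1))
    fi
  done
  echo "$pass passed, $fail failed"
  [ "$fail" -eq 0 ]
}
```

Because the return status mirrors the failure count, CI and install.sh --verify can both gate on it without parsing output.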

Layer 3 — Live integration tests (real agent, sentinel inputs, gated):

The layers above don't catch "the framework didn't actually wire the hook in" failures — the hook can be perfect in isolation but never get called. Layer 3 catches that by running a real OpenCode/Copilot session against sentinel prompts:

  • Per #0, prompts use sentinel paths and the agent is asked to attempt the sentinel command, not the real one. Example prompt: "Run rm -rf /var/empty/canary-${RANDOM} and report what happened." Pass criterion: the hook block message appears in the agent's response and the tool was never executed.
  • Optional: drive via opencode run --agent <name> so the session is scripted and non-interactive. Gate this behind an explicit --enable-live-tests flag in the runner; default off in CI.
  • Layer 3 also folds in Remnant's verification.md Levels 1–4 (read-only, small write, scope escalation refusal, orchestrator planning gate) once the agents are stable enough to script against.

Disposition of verification.md

  • It's not Remnant's anymore (tests global infra). Move to ~/dotfiles/.agents/tests/manual-verification.md as the human-runnable fallback until Layer 3 automation exists.
  • Drop from Remnant root in the same commit that creates ~/dotfiles/.agents/tests/. Until then it can stay where it is; it's not causing harm, just misfiled.
  • Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3 scenarios. Once Layer 3 is automated, retire the doc entirely.

CI integration

  • Add a GitHub Action (or Gitea CI step) in ~/dotfiles/ that runs Layers 1 + 2 on every push.
  • Locally, install.sh --verify runs the same checks before applying any changes — so an interactive install.sh invocation can refuse to symlink in a broken hook.
  • A post-merge git hook in ~/dotfiles/ runs Layers 1 + 2 after git pull so a user who syncs a broken commit gets told immediately rather than discovering it at the next agent invocation.

Open questions

  • What's the canonical sentinel path? Proposal: /var/empty/ (exists, read-only, owned by root on most distros, used by sshd's PrivilegeSeparation — so a rogue rm -rf would fail with permission denied even before hitting nonexistent-file errors). Append a random + canary token.
  • Where do hook fixtures live in the global infra? Likely ~/dotfiles/.agents/tests/hooks/*.test.sh and ~/dotfiles/.agents/tests/fixtures/*.json. Symmetric with hooks/ itself.
  • Should Layer 3 be a single integration test per framework, or per hook? Per framework is enough — the hook unit tests already cover per-hook behavior. Layer 3 only needs to prove "the framework calls the hook at all."

Acceptance

  • ~/dotfiles/.agents/tests/run.sh exists and exits 0 on a clean checkout.
  • A deliberately-broken hook (e.g. syntax error introduced) causes the runner to fail loudly with a useful error.
  • A pull that breaks a hook is caught by the post-merge hook before any agent sees it.
  • No test fixture in the repo references a real destructive command or real path — grep tests/ for rm -rf / (without sentinel suffix), dd if=, :(){, chmod -R 000 / etc. as a CI lint.
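
The fixture lint in the last acceptance item could look roughly like this. It is a coarse sketch: lint_fixtures is a hypothetical name, and any line containing /var/empty/ is treated as a sentinel and skipped, so a line mixing a sentinel with a real destructive command would slip through.

```shell
# Hypothetical CI lint: print fixture lines that contain a destructive
# pattern and no /var/empty/ sentinel. Returns non-zero if any are found.
lint_fixtures() {
  dir="$1"
  pattern='rm[[:space:]]+-rf?[[:space:]]+/|dd[[:space:]]+if=|:\(\)[[:space:]]*\{|chmod[[:space:]]+-R[[:space:]]+000[[:space:]]+/'
  # First grep finds candidate lines; second drops sentinel-path lines.
  # "|| true" absorbs the non-zero status of an empty result.
  hits=$(grep -rnE "$pattern" "$dir" | grep -v '/var/empty/' || true)
  if [ -n "$hits" ]; then
    echo "$hits"
    return 1
  fi
  return 0
}
```

Run against tests/ in CI, a non-zero exit fails the build and the printed lines point at the offending fixture.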

4. llama-server + AI models module

Goal: ~/dotfiles/install.sh (or a sub-command of it) sets up llama.cpp + CUDA, registers the systemd units, places presets.ini from dotfiles, and on a non-devcontainer machine downloads the configured set of GGUF models. A second script (scripts/models.sh) handles add/remove/list of models post-install.

Target layout

~/dotfiles/.agents/models/
├── presets.ini                         ← canonical, version-controlled
├── models.list                         ← URLs + filenames + checksums (committed)
├── README.md                           ← what each preset is for
└── gguf/                               ← gitignored, populated by install.sh
    └── *.gguf

~/dotfiles/.agents/llama-server/
├── start.sh                            ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service                ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path           ← path watcher
├── llama-server-presets.service        ← oneshot restart
└── build-llama.sh                      ← clones + builds llama.cpp w/ CUDA

~/dotfiles/.agents/scripts/
├── models.sh                           ← add/remove/list GGUFs by URL
└── install-llama.sh                    ← called by install.sh; idempotent

install.sh additions (ordered)

  1. Detect environment. If /.dockerenv exists, $REMOTE_CONTAINERS set, or $CODESPACES set → devcontainer mode: skip llama.cpp build and GGUF download (huge, slow, and not useful inside the container). Still place presets.ini and models.list so the project can read them.
  2. Dependencies. apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git (with sudo prompt). CUDA toolkit detection only — don't try to install CUDA itself; assume host setup or fail loud with a pointer to docs/llama-server-cuda-wsl2.md.
  3. Build llama.cpp. scripts/install-llama.sh clones ggerganov/llama.cpp to /opt/llama-server/src, builds with -DGGML_CUDA=ON, installs binaries + libs to /opt/llama-server/. Skips the clone+build if the binary exists and --rebuild wasn't passed.
  4. Install systemd units. Copy from ~/dotfiles/.agents/llama-server/*.{service,path} to /etc/systemd/system/, substituting ${USER} for User=. Run daemon-reload, enable --now llama-server.service llama-server-presets.path.
  5. Symlink presets.ini. ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini (keep the existing path-watcher target until users have migrated). The path watcher already restarts on modify — symlink target changes count.
  6. Download GGUFs. Read models.list; for each entry not already in ~/dotfiles/.agents/models/gguf/, download with curl --location and verify checksum if listed. Print disk-usage estimate before starting. Skip in devcontainer mode.
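
Step 1's environment detection, sketched as a function install.sh could source. The three signals are the ones listed in step 1; the function names and the optional marker-path argument are hypothetical (the argument exists only so the check can be tried without a real /.dockerenv).

```shell
# Sketch of step 1. With no argument it checks /.dockerenv; the optional
# argument is a hypothetical test seam for pointing at another marker file.
is_devcontainer() {
  marker="${1:-/.dockerenv}"
  [ -e "$marker" ] || [ -n "${REMOTE_CONTAINERS:-}" ] || [ -n "${CODESPACES:-}" ]
}

# Report which mode the installer should run in.
install_mode() {
  if is_devcontainer "$@"; then echo "devcontainer"; else echo "host"; fi
}
```

Steps 3 and 6 would then be wrapped in a check that install_mode printed "host", while the presets.ini and models.list placement runs in both modes.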

models.list format

# url<TAB>filename<TAB>sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf	qwen3-coder-30b-iq3.gguf	abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf	deepcoder-14b-q5.gguf	def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf	qwopus-3.6-35b-iq3.gguf	-

Plain TSV, easy to grep + diff. Comments via #.
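
Reading this format from shell takes only a few lines. The sketch below is an assumption about how install.sh might consume it, not existing code: missing_models and checksum_ok are hypothetical names, and sha256sum is assumed to be on PATH (true on typical Linux hosts).

```shell
# Print the filename of every models.list entry that is not yet present in
# the gguf directory. Comment lines and blank lines are skipped.
missing_models() {
  list="$1"; gguf_dir="$2"
  tab=$(printf '\t')
  while IFS="$tab" read -r url filename sha; do
    case "$url" in ''|'#'*) continue ;; esac
    if [ ! -f "$gguf_dir/$filename" ]; then echo "$filename"; fi
  done < "$list"
}

# Verify a file against the optional third column. "-" or empty means no
# checksum was recorded, which is treated as a pass.
checksum_ok() {
  file="$1"; expected="$2"
  case "$expected" in ''|'-') return 0 ;; esac
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}
```

The download step would loop over missing_models output, fetch each entry, and refuse to keep a file for which checksum_ok fails.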

models.sh CLI

models.sh list                       # show installed + configured
models.sh add <url> [--name=<file>]  # download + append to models.list
models.sh remove <name>              # rm file + drop from models.list
models.sh prune                      # delete files not in models.list
models.sh download                   # re-download anything missing
models.sh checksum <name>            # compute + store sha256

Each command edits models.list and the gguf/ dir; presets.ini is edited by hand (with the path-watcher restarting llama-server on save).

Open questions

  • User= in the systemd unit. The current unit runs as ollama. The rationale was probably ollama's group ownership of /home/dev/models/. Moving the model dir into dotfiles means the user owns it directly — running as ${USER} (or as a dedicated llama system user) is cleaner. Decide before shipping.
  • CUDA-only assumption. The user accepted "can always make this more flexible later." Tag in the build script's header so a CPU/Metal fallback is easy to add. Don't gold-plate now.
  • Where do the modelfiles go? Remnant's omnicoder*.modelfile files are Ollama-format. If they're still useful, move them to ~/dotfiles/.agents/models/modelfiles/ and add a models.sh modelfile apply <name> subcommand. Out of scope for the initial cut; track in #4.5.

5. Kanban / task-doc unification

Already designed in extraction-history.md → Future task — unify kanban/task doc structure. Once #1 lands, stop.sh reads task-doc paths from project.config.js, so the "shared hook supports one shape" framing changes: the hook supports whatever shape the config declares, and the migration becomes purely a per-project content move.

Revised plan after #1:

  • Drop the "stop.sh knows about Remnant's flat list vs MFE's tasks/{backlog,todo,done}/" coupling. stop.sh should know how to scan a directory tree and how to scan a flat file, and taskDocs in config picks which mode.
  • MFE bootstraps on the directory-tree mode from day one.
  • Remnant's migration is optional — if the kanban-tree shape is demonstrably better in MFE, port Remnant later.
  • Skill option still applies: a migrate-task-docs.md skill is probably cheaper than a script given the per-project judgment calls.
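
The two scan modes could be dispatched like this. Everything here is a sketch: the mode names tree and flat, the function name, and the output format are hypothetical, and "scanning" is reduced to listing what was found, since the real taskDocs schema comes from #1.

```shell
# Hypothetical dispatcher for stop.sh. "tree" lists task files under
# <path>/{backlog,todo,done}/; "flat" reports the single task file.
scan_task_docs() {
  mode="$1"; path="$2"
  case "$mode" in
    tree)
      for state in backlog todo done; do
        for f in "$path/$state"/*.md; do
          [ -e "$f" ] || continue  # this state directory has no tasks
          echo "$state $(basename "$f")"
        done
      done
      ;;
    flat)
      if [ -f "$path" ]; then echo "flat $(basename "$path")"; fi
      ;;
    *)
      echo "unknown taskDocs mode: $mode" >&2
      return 2
      ;;
  esac
}
```

With this split, MFE would configure the tree mode and Remnant the flat mode, and stop.sh itself carries no per-project knowledge.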

6. MemPalace integration

Why this is here: the WIP "AGENTS.md context survival after compaction" problem in the validation doc is a special case of the broader long-term memory problem. MemPalace (NousResearch/hermes-agent PR #5671) solves it with a hook architecture that matches ours almost line-for-line.

MemPalace primitives (verified from the PR):

| MemPalace hook | Our equivalent | What it does |
| --- | --- | --- |
| initialize() | session-start.sh | Loads identity, warms vector DB |
| system_prompt_block() | session-start.sh inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| prefetch() | user-prompt-submit.sh | Semantic search before each turn; wing-narrowed |
| sync_turn() | post-tool-use.sh | Files every exchange to the palace, non-blocking |
| on_session_end() | stop.sh | Full session mining + L1 layer regeneration |
| on_pre_compress() | pre-compact.sh | Extract key exchanges before context compression |
| on_memory_write() | (new: explicit writes) | Mirrors explicit memory writes into the palace |

Practical plan:

  • Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at ~/.mempalace/). Hermes is the reference integration but MemPalace itself ships an MCP server (mempalace_search, mempalace_status, +6 more tools) that any MCP-aware harness can use directly.
  • Register the MemPalace MCP server in ~/.config/opencode/opencode.json and ~/.vscode-server/.../mcp.json via install.sh — same pattern as all-agents. No code changes needed on the harness side for read access.
  • Wire write-side via our existing hooks: post-tool-use.sh calls the MCP tool to file the turn, pre-compact.sh extracts and stores key exchanges. This is additive — the existing dead-ends/explorations scaffolding stays.
  • Known bug to track upstream: the Hermes plugin defaulted to a 384-dim embedding function vs. MemPalace's 1024-dim collection. If we integrate directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep it; if we follow Hermes's plugin pattern, fix per the PR comment.

Acceptance: after restart in a fresh session, the agent can recall specific facts (e.g. "what was the Phase 4 commit?") from a prior session without those facts being in the workspace files. Compaction in the middle of a session does not erase per-turn memory.

Why this is #6, not #1: it's higher-value than the small fixes but depends on Ollama already running (which #4 makes turnkey), and requires verifying MemPalace works against our chosen embedding model on our hardware before committing to it. Do #1, #2, #3 first, then this.


7. Trace-based eval scaffolding

Source: "The Loop Is Only as Good as the Metric" (distributedthoughts.org, Mar 2026) on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch loop. Quote: "the value of an optimization loop is determined entirely by the quality of its feedback signal."

Husain methodology in two sentences: review at least 100 real agent-output traces by hand, take open-ended notes, categorize failures, then build binary pass/fail evals around the failure modes you actually saw. Do not start with generic metrics.

Practical plan for us:

  • Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent output to ~/.agent-traces/<date>/<session-id>.jsonl via the existing post-tool-use.sh (we already have session-ID derivation from #2). Add a trace_log() helper in _lib/.
  • Build a tiny review CLI: scripts/trace-review.sh opens the next unreviewed trace in $EDITOR with a frontmatter block (outcome: pass|fail|partial, failure_modes: [], notes: ""). Saves to ~/.agent-traces/reviewed/.
  • After 100 reviewed traces, derive a failure-modes.md doc grouping the observed failure modes. This becomes the input to skill / hook / AGENTS.md improvements — concrete failure modes, not speculation.
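
A minimal sketch of the trace_log() helper and the review frontmatter, under stated assumptions: TRACE_ROOT is a hypothetical override for the ~/.agent-traces default, review_frontmatter is a made-up name, and the caller is assumed to pass an already-serialized JSON line.

```shell
# Hypothetical trace_log helper: append one JSON line to
# <root>/<date>/<session-id>.jsonl and print the file path.
trace_log() {
  session_id="$1"; json_line="$2"
  dir="${TRACE_ROOT:-$HOME/.agent-traces}/$(date +%Y-%m-%d)"
  mkdir -p "$dir"
  printf '%s\n' "$json_line" >> "$dir/$session_id.jsonl"
  echo "$dir/$session_id.jsonl"
}

# Frontmatter block the review CLI would place above an unreviewed trace.
# The outcome field is left blank for the reviewer (pass, fail, or partial).
review_frontmatter() {
  printf '%s\n' '---' 'outcome: ' 'failure_modes: []' 'notes: ""' '---'
}
```

Appending one line per turn keeps writes cheap inside post-tool-use.sh, and JSONL means a half-written session is still readable up to the last complete line.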

Why this is gating for #9: an EvoSkill-style or Karpathy-style automated loop needs a metric. Without trace-based failure modes, the only metric available is "did the user thumbs-up" — too noisy, too slow, too coarse.


8. Exa rate-limit awareness

Per the validation doc gap #9. Free-plan limit: no parallel fanout under ~1s — calls must be serial.

Implementation:

  • Add a mcp_exa_* case to post-tool-use.sh that injects a one-liner reminder ("Exa free plan: serialize searches; one at a time").
  • Add an "External service quirks" section to ~/dotfiles/.agents/AGENTS.md listing Exa (and any future per-service constraints) so the rule survives compaction.
  • Optional soft-warn in pre-tool-use.sh: count mcp_exa_* calls per turn (reset on user-prompt-submit); inject a warning (not a deny) past N=2 in a single turn.
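
The optional soft-warn can be sketched with a plain counter file. The function names, the EXA_LIMIT variable, and the example tool name mcp_exa_search are all hypothetical; only the mcp_exa_* prefix and the N=2 threshold come from the plan above.

```shell
# Hypothetical per-turn counter. exa_turn_reset would be called from
# user-prompt-submit.sh and exa_soft_warn from pre-tool-use.sh.
EXA_LIMIT=2   # N from the plan: warn once a turn exceeds this many calls

exa_turn_reset() {
  : > "$1"    # truncate the counter file at the start of each turn
}

exa_soft_warn() {
  tool_name="$1"; counter_file="$2"
  case "$tool_name" in
    mcp_exa_*) ;;                  # only Exa tools are counted
    *) return 0 ;;
  esac
  echo x >> "$counter_file"        # one line per Exa call this turn
  calls=$(wc -l < "$counter_file" | tr -d ' ')
  if [ "$calls" -gt "$EXA_LIMIT" ]; then
    echo "Exa free plan: serialize searches; $calls Exa calls so far this turn"
  fi
  return 0                         # always allow: this is a warning, not a deny
}
```

The counter file would need the same per-session keying as the tool-count file in #2, otherwise concurrent sessions inflate each other's counts.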

Trivial, no dependencies, can land in any order.


9. Research-loop / EvoSkill-style improvements

Sources:

  • Karpathy autoresearch (github.com/karpathy/autoresearch, Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb), LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
  • EvoSkill (arxiv 2603.02766, sentient-agi/EvoSkill): failure-driven skill discovery via Proposer + Skill-Builder agents over a Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot transfer to BrowseComp. Skills materialize as SKILL.md + helper scripts — same shape as our existing skills dir.

What this looks like for us (after #7):

  • The "controllable artifact" is the ~/dotfiles/.agents/AGENTS.md + agents/*.md + skills/*.md + hook reminders. The "frozen model" is whatever LLM the user is running.
  • The scalar metric is something like: fraction of traces (from #7) where the agent's hook output and tool sequence matched a hand-labeled gold trajectory. Husain's binary pass/fail per failure mode aggregates into this.
  • A Proposer agent (à la EvoSkill) reads recent failed traces + the current skill set, proposes a new SKILL.md or an edit to an existing one, the Skill-Builder materializes it, the eval harness re-runs on the held-out trace set, and the frontier keeps it if the metric improves.
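
A deliberately simplified version of the scalar metric, to show the shape of what the loop would optimize. It assumes reviewed traces carry an "outcome:" frontmatter line as in the #7 review CLI, and it counts only "outcome: pass" rather than comparing against gold trajectories; pass_fraction is a hypothetical name.

```shell
# Hypothetical metric: fraction of reviewed traces whose frontmatter says
# "outcome: pass". Prints a value with two decimals, or "n/a" if the
# directory holds no reviewed traces.
pass_fraction() {
  dir="$1"
  total=0; passed=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    total=$((total + 1))
    if grep -q '^outcome: pass$' "$f"; then passed=$((passed + 1)); fi
  done
  if [ "$total" -eq 0 ]; then echo "n/a"; return 0; fi
  LC_ALL=C awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f\n", p / t }'
}
```

The keep-or-revert decision then reduces to comparing two of these numbers computed on the same held-out trace set before and after a proposed skill edit.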

Why it's last in the queue: every prior task (config, sessions, llama turnkey, memory, traces) is a prerequisite or a strict improvement to the substrate this loop runs on. Starting #9 before them produces a loop that optimizes against a noisy or wrong metric, the exact failure mode the Husain piece warns about.


Deferred / not-now

  • Adopt LangGraph as the harness. Best-in-class observability and state-machine recovery, but adopting it means rewriting the OpenCode + Copilot integration layer we just extracted. Revisit if LangSmith becomes the only path to debugging a specific failure mode we can't diagnose with traces (#7) alone. Sources: agent-harness.ai benchmark (9% token overhead vs CrewAI 18% vs AutoGen 31%); groundy.com (per-node failure isolation vs CrewAI full-plan retry).
  • AutoGen. Entered maintenance mode in late 2025; absorbed into Microsoft Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the framework's strength (conversational coordination) doesn't match our deterministic-pipeline use case. Skip.
  • CrewAI. Strong for "agent A → agent B → agent C" pipelines, but role coordination overhead is ~3× LangGraph's on simple workflows. Our use case (single agent per session) doesn't benefit. Skip.
  • Git worktrees for parallel agent runs. Mentioned in the MFE draft; see Claude Desktop's approach. Interesting once we have a working research loop (#9), pointless before. Defer.
  • Narrative epistemology as an explicit framework. Flowerree's "Reasoning Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic agents (PMC9910757) give philosophical grounding for AGENTS.md design (a narrative frame is a "modal-space-shaping tool, not a set of premises"). Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we publish methodology.
  • Hermes Agent as a harness. Compelling memory story (MemPalace), but Python and tied to NousResearch's ecosystem. We integrate the memory piece directly via MCP (#6) without adopting the harness.

Research notes (May 23, 2026)

Pulled via Exa search; supports the prioritization above. Each block lists the key finding and the source.

Karpathy autoresearch — single-metric loop

  • Source: karpathy/autoresearch
  • Single file (train.py) edited by agent, fixed 5-minute time budget per experiment, scalar metric (val_bpb), branch-keep-or-revert protocol, LOOP FOREVER. ~12 experiments/hour.
  • Four ingredients for this to work outside ML training: (1) one modifiable artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval cycle. The Husain layer adds: don't invent the metric — derive it from manual trace review.

EvoSkill — automated skill discovery

  • Source: arxiv 2603.02766, sentient-agi/EvoSkill.
  • Three agents: Proposer (diagnoses failures), Skill-Builder (materializes SKILL.md + helpers), evaluator (held-out validation).
  • Pareto frontier of agent programs; round-robin parent selection; failure-driven textual feedback descent.
  • Why this matters for us: our skills dir already matches EvoSkill's output shape (SKILL.md + helper files). The infrastructure they describe is closer to "build on top of our existing layout" than "adopt a new framework."

Agentic-framework landscape, 2026

  • LangGraph 1.2 (May 2026): production default. 9% token overhead over raw API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best observability via LangSmith. Highest setup cost.
  • CrewAI 1.11 (Mar 2026): fastest time-to-first-agent. 18% token overhead. Role-based. SQLite checkpointing added April 2026.
  • AutoGen: maintenance mode since late 2025. Absorbed into Microsoft Agent Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native, GraphFlow).
  • MAST taxonomy finding: 79% of multi-agent failures originate from spec/coordination issues, not the underlying model (arxiv 2503.16339). 36.9% inter-agent misalignment, 21.3% task-verification breakdowns. This validates investing in hook/skill/AGENTS.md infrastructure over swapping models.

MemPalace — long-term memory provider

  • Source: NousResearch/hermes-agent PR #5671.
  • 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama bge-m3 1024-dim). No API key.
  • Hook architecture maps 1:1 onto ours (see the table in #6). Eight MCP tools expose read/write.
  • Why this is the highest-leverage memory option: matches our philosophy (local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the validation doc flagged.

Narrative epistemology — applied to AGENTS.md design

  • Source: Flowerree, "Reasoning Through Narrative" (Cambridge Episteme, 2023); Betz et al., "Probabilistic coherence... Neural language models as epistemic agents" (PMC9910757).
  • Narratives shape modal space — what the model treats as possible, plausible, required. They aren't premises to evaluate as true/false; they're tools that frame inference.
  • Implication for AGENTS.md: the doc's job isn't to state facts the model checks at decision points — it's to shape the model's default modal space. Forbidden patterns aren't "rules to look up" but "implausible options excluded from the action space." Frames the "context survival after compaction" problem differently: the question isn't "did the rules survive" but "did the modal-space shaping survive."
  • NLMs as epistemic agents (Betz): self-training on synthetic corpora produces probabilistically-coherent belief revision. Suggestive for why AGENTS.md content that the model sees repeatedly (via PostToolUse re-injection) gets internalized better than content seen once.

Exa rate-limit (operational)