dotfiles/.agents/docs/agent-infrastructure.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config
2026-05-22 13:13:43 -04:00


Agent Infrastructure

Shared agent infrastructure for VS Code Copilot and OpenCode — brainstorm agent, research agent, nudge instructions, hooks, skills, and MCP server. Project-specific overlays live in each project's .agents/ directory.

See also: docs/research/ai-coding-best-practices.md — research synthesis covering the Prompt/Context/Harness taxonomy, failure modes, enforcement hierarchy, small-model harness patterns, and all primary-source citations that underpin the design decisions here.

Current State

Architecture Overview

The infrastructure is tool-agnostic: canonical sources live in .agents/ and a generator (npm run generate:agents) distributes them to .github/agents/, .github/skills/, .opencode/agents/, .opencode/skills/. Edit the .agents/ sources; never edit the generated output directories (they are .gitignored and blocked by pre-tool-use policy).

.agents/
├── AGENTS.md                        # Root design doc + enforcement hierarchy
├── agents/                          # Agent definitions (canonical)
│   ├── brainstorm.md
│   ├── research.md
│   └── build-local.md               # OmniCoder 9B via Ollama
├── hooks/                           # Shared bash hooks (delegated by all harnesses)
│   ├── pre-tool-use.sh              # Hard blocks (terminal cmds + file-path policies)
│   ├── post-tool-use.sh             # Self-check counter + methodology reminders
│   ├── session-start.sh             # Inject project state at session start
│   ├── user-prompt-submit.sh        # Per-turn nudge detection + task capture
│   ├── pre-compact.sh               # Export state before context summarization
│   └── stop.sh                      # Session-end verification
└── skills/
    └── research/SKILL.md            # Research methodology (any agent can load)

Generated output (do not edit — regenerated by npm run generate:agents):

  • .github/agents/ — VS Code Copilot agent files
  • .github/skills/ — VS Code Copilot skill files
  • .opencode/agents/ — OpenCode agent files
  • .opencode/skills/ — OpenCode skill files

Harness integration:

  • VS Code Copilot: .github/agent-support.json — maps 4 hook events to the shared bash scripts in .agents/hooks/
  • OpenCode: .opencode/plugins/agent-support.ts — TypeScript plugin that shells out to the same bash scripts

Brainstorm Agent

  • 4-phase workflow: Quick Frame → Diverge → Converge → Capture & Hand Off
  • 6 techniques: Rapid Ideation, SCAMPER, Worst Possible Idea, How Might We, Inversion/Pre-mortem, Constraint Flipping
  • Counterbalances Opus 4.6 overthinking tendency
  • Phase 2 includes "push past the obvious" nudge (Zhao et al. 2024: LLMs fall short on originality, excel at elaboration — first ideas are "average")
  • Phase 4 routes to @research for investigation, default agent for implementation
  • Creates exploration files at docs/explorations/<name>.md and session memory notes

Research Agent

  • Two orientations that compose recursively:
    • Understand (Grounded Theory): open coding → constant comparison → axial coding → memo → saturation check
    • Diagnose (Strong Inference + Satisficing): 5-factor triage gates between satisficing (low risk) and full falsification (high risk)
  • 5-factor triage: reversibility, blast radius, confidence, novelty, time cost
  • Timing awareness: time prefix on unknown commands, session/repo memory for baselines, timing feeds into triage decisions
  • Investigation files at docs/explorations/<name>.md
  • Techniques reference: Five Whys, Delta Debugging, Rubber Duck
  • Delegates evidence-gathering to Explore subagent, keeps analytical thinking local

Nudge Instructions

  • Brainstorm nudge: triggers on hesitation/overthinking language ('wait', 'actually', 'hmm', 'overcomplicating', etc.)
  • Research nudge: triggers on debugging/investigation language ('why is this broken', 'how does this work', 'root cause', etc.)
  • Both are non-intrusive single-sentence suggestions, only fire once per topic

Tool Mapping (Copilot ↔ OpenCode)

| Copilot | OpenCode equivalent |
| --- | --- |
| `AGENTS.md` (root + nested) | `AGENTS.md` (root, native; nested via `instructions` glob in `opencode.json`) |
| `.github/agents/*.agent.md` | `.opencode/agents/*.md` (frontmatter: `description`, `mode`, `model`, `temperature`, `permission`) |
| `.github/skills/<name>/SKILL.md` | `.opencode/skills/<name>/SKILL.md` — also reads `.agents/skills/` and `.claude/skills/` |
| `.github/instructions/*.instructions.md` (`applyTo`) | No direct equivalent — fold into `AGENTS.md` stubs or `instructions` glob |
| `.github/hooks/*.sh` (JSON-configured shell) | `.opencode/plugins/*.ts` (TS modules, event-driven) — shells out via Bun's `$` |
| `runSubagent` / Explore agent | Built-in `general` and `explore` subagents; @-mention syntax |
| `vscode_askQuestions` | No equivalent — OpenCode uses agent's natural turn-taking |

OpenCode plugin event mapping:

| Copilot hook | OpenCode event |
| --- | --- |
| `SessionStart` | `session.created` |
| `PreToolUse` | `tool.execute.before` |
| `PostToolUse` | `tool.execute.after` |
| `PreCompact` | `experimental.session.compacting` |
| `Stop` | `session.idle` (closest equivalent) |
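
The mapping above can be kept as a single lookup so that plugin code and tests agree on it. This is a minimal sketch, not the contents of the real plugin: the constant and helper names are hypothetical, and hooks without a listed mapping (such as the user-prompt-submit hook) simply return `undefined`.

```typescript
// Copilot hook name -> OpenCode plugin event, copied from the table above.
const COPILOT_TO_OPENCODE_EVENT = {
  SessionStart: "session.created",
  PreToolUse: "tool.execute.before",
  PostToolUse: "tool.execute.after",
  PreCompact: "experimental.session.compacting",
  Stop: "session.idle", // closest equivalent, not an exact match
} as const;

type CopilotHook = keyof typeof COPILOT_TO_OPENCODE_EVENT;

// Hypothetical helper: returns undefined for hooks with no listed mapping.
function openCodeEventFor(hook: string): string | undefined {
  const known = Object.prototype.hasOwnProperty.call(COPILOT_TO_OPENCODE_EVENT, hook);
  return known ? COPILOT_TO_OPENCODE_EVENT[hook as CopilotHook] : undefined;
}
```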

Research Foundation

For full research depth, citations, and failure-mode analysis, see docs/research/ai-coding-best-practices.md. The list below records the specific papers and frameworks that shaped the design decisions in this project.

Methodologies and papers that informed the design:

  • Grounded Theory (Glaser & Strauss): build understanding from data, not assumptions. Applied to code-reading in the Understand orientation.
  • Strong Inference (Platt 1964): multiple competing hypotheses → crucial experiments → eliminate. Applied to the Diagnose orientation.
  • Satisficing (Simon 1956): accept "good enough" when optimization cost exceeds benefit. Gates between cheap confirmation and expensive falsification.
  • Dual Process Theory (Kahneman): System 1 (fast, pattern-matching) vs System 2 (slow, analytical). System 1 is more accurate in familiar domains, which informs the triage decision.
  • Zhao et al. 2024 (arxiv): LLMs fall short on originality, excel at elaboration. First ideas are "average." Informs brainstorm agent's "push past the obvious" nudge.
  • "Lost in the Middle" (Liu et al. 2023): LLMs attend best to beginning/end of context. Informs hook design — inject at context tail for high attention.
  • Delta Debugging: binary search the change space between passing/failing cases. Logic behind git bisect.
  • Five Whys: iterative causal chain tracing. Starting point for hypothesis generation, not sole diagnostic method.
  • Ronacher "Agent Design Is Still Hard": reinforce methodology after every tool call at context tail. Structural injection outperforms relying on instructions in the system prompt.
  • Think-Anywhere (Jiang et al. arXiv:2603.29957, Mar 2026, Peking U + Tongyi Lab): LLMs trained to invoke <think> blocks at any token position during code generation, not just upfront. SOTA on LeetCode/LiveCodeBench with fewer total tokens. The motivating insight: a model can plan correctly at the start but introduce an off-by-one bug mid-implementation — only mid-loop reasoning catches it. Applied here: the research agent's investigation checklist includes "Re-evaluate hypothesis at every tool-call boundary." For Claude 4 models, interleaved thinking makes this automatic. Complements Plan-and-Solve: upfront decomposition where structure is clear, mid-execution re-evaluation when intermediate results change what to do next.
  • Anthropic interleaved thinking (Claude 4 + adaptive thinking): Claude Sonnet 4.6+ and Opus 4.6+ automatically insert thinking blocks between tool calls. No separate implementation needed — agent instruction design drives it. The research agent's "Re-evaluate at every tool-call boundary" instruction explicitly activates this behavior.
  • Prompt/Context/Harness framework (Alibaba Cloud, Apr 2026): Names the three engineering layers. Prompt = task expression (stateless). Context = what the model sees (AGENTS.md, skills, tools — engineering target is progressive disclosure). Harness = system constraints + verification loops (hooks, permission gates, sub-agent isolation). Diagnostic map: wrong output → Prompt; hallucinated fact → Context; wrong tool selected → Context (fix description); task drift → Harness (sub-agent boundary); destructive action → Harness (permission hook). LangChain improved Terminal Bench 2.0 from 52.8% → 66.5% by changing Harness alone.
  • Context engineering (Rajasekaran et al., Anthropic, Sep 2025): Formally distinguishes context engineering from prompt engineering. Key principles: (a) just-in-time context — agents hold references and load on demand, not upfront; (b) structured note-taking (NOTES.md) as external working memory for long sequential tasks; (c) every new token depletes attention budget — validates the <60-line AGENTS.md ceiling; (d) compaction strategy: maximize recall first, then improve precision.

MCP Server Lifecycle Hooks — Protocol Status (May 2026)

The .agents/mcp/ server exposes prompts and tools to agents via the MCP protocol. A recurring question: can the MCP server react to session lifecycle events (session start/end, tool-use boundaries)?

Current protocol state

No lifecycle hooks exist in the MCP protocol. The spec defines three phases only: initialize → operation → shutdown. There is no session.created, post-tool-call, or session.ended notification. This gap is why session awareness currently lives in the OpenCode plugin layer (.opencode/plugins/agent-support.ts) rather than the MCP server — OpenCode exposes session.created, session.idle, session.compacted, session.deleted, and tool.execute.before/after events natively to plugins.

Active work in the MCP spec

SEP-2624: Interceptors for the Model Context Protocol (PR #2624)

The most organized effort. Supersedes SEP-1763 (closed as completed). Proposes Interceptors as a new MCP primitive — two types: validators (inspect, return pass/fail) and mutators (transform context payloads) — discoverable and invocable via interceptors/list and interceptor/invoke JSON-RPC methods. These fire at protocol-level operation events: tools/call, prompts/get, resources/read, sampling/createMessage, elicitation/create. Not session-start/stop hooks, but before/after wrapping for every operation.

There is now a formal Interceptors Working Group (Bloomberg + Saxo Bank engineers, biweekly cadence). Reference implementations in progress for Go and C# SDKs. Experimental repo: modelcontextprotocol/experimental-ext-interceptors. Charter: modelcontextprotocol.io/community/interceptors/charter.

SEP-2282: Server-Declared Behavioural Hooks (PR #2282)

Smaller, separate open PR. Proposes servers declare context injections in ServerCapabilities — text injected into the agent's context at client-side lifecycle events (session start, post-tool-use, session end). The contract is "here's context the model should have at this moment," not code execution. More directly analogous to our OpenCode session.created / session.idle patterns. Currently unsponsored — needs a maintainer to pick it up.

What to watch

  • Primary: PR #2624 + experimental-ext-interceptors repo
  • Secondary: PR #2282 (closest to session-lifecycle hooks)
  • Label filter: SEP label on the modelcontextprotocol repo
  • Milestone: 2026-06-30-RC is the next spec revision window

Implication for this project

Until interceptors land in a shipping spec version and the TypeScript SDK, the session lifecycle pattern stays at the OpenCode plugin layer. When SEP-2282 or an equivalent lands, the MCP server could self-register context injection hooks during initialize, removing the need for tool-specific plugin code.


Model Scale Profiles

Different model sizes require different infrastructure strategies. The failure modes are different, so the mitigations are different.

Large-scale API models (Claude Sonnet / Opus)

Primary failure modes: overthinking, sycophancy, verbosity, tendency to add unrequested features or comments.

Infrastructure strategy:

  • Advisory methodology + structural reinforcement (hooks, circuit breakers)
  • PostToolUse self-check nudges every ~15 calls
  • PreToolUse hard blocks for high-risk operations
  • Subagent delegation for isolated tasks (parent Opus → child Sonnet/Haiku)

Smaller-scale local models (OmniCoder 9B via Ollama)

Primary failure modes (different from "low reasoning" — OmniCoder uses Qwen3 thinking blocks natively):

  • Narrower training distribution (Python/JS heavy)
  • Quantization degradation: JSON schema compliance drops as context fills
  • Tool-call history is the primary context consumer — responses must be truncated aggressively
  • Instruction drift: fewer attention heads (32 vs 64 in 32B) means system prompt recall degrades faster

Infrastructure strategy:

  • PostToolUse response truncation at ~1500 tokens (plugin layer, not bash hook)
  • PreToolUse JSON validation with schema-specific error messages
  • Context pressure injection at ≥70% fill (~22K/32K tokens)
  • steps: 20 cap + ask permission gates for natural checkpoints
  • explore subagent delegation to reduce context pressure on the main agent
  • NOTES.md working memory pattern enforced in agent body
  • No web tool — keeps context lean
  • Reasoning guidance: "Hold references; load on demand" explicit in agent body
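
The context-pressure rule in the list above reduces to a small piece of arithmetic: 70% of 32,768 is roughly 22,938 tokens, in line with the ~22K figure. The sketch below assumes the plugin already has a used-token count available; the function name and the warning wording are illustrative, not taken from the real plugin.

```typescript
// Values from the strategy list above; names are illustrative.
const CONTEXT_LIMIT_TOKENS = 32_768; // update when the local model changes
const PRESSURE_THRESHOLD = 0.7;

// Returns a warning string once fill reaches the threshold, otherwise null.
function contextPressureWarning(
  usedTokens: number,
  limit: number = CONTEXT_LIMIT_TOKENS,
): string | null {
  const fill = usedTokens / limit;
  if (fill < PRESSURE_THRESHOLD) return null;
  const pct = Math.round(fill * 100);
  // Hypothetical wording; the real injected text is not shown in this document.
  return `CONTEXT PRESSURE: ~${pct}% of ${limit} tokens used. Write findings to NOTES.md and delegate reads to the explore subagent.`;
}
```

Passing the limit as a parameter keeps the threshold calibrated if the context window changes, rather than relying on a constant that silently goes stale.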

OmniCoder 2 Orchestration — Pending Work

Full historical rationale and audit findings were maintained in docs/projects/local-ai-orchestration.md (deleted May 2026 after merge). The plan used an orchestrator-workers pattern with structural edit: deny enforcement on the orchestrator. All OpenCode config values verified against opencode.ai/docs (May 2026).

Goals

  1. All agents run on ollama/arch-omni2-9b — no cloud fallback
  2. User can type vague prompts; the system decomposes and delegates automatically
  3. Context windows are isolated per subagent (no shared state bleed)
  4. Changes scale forward: switching to cloud means changing model strings, not architecture

Pending Changes

Quick wins — under 5 minutes each, no testing required

    • [CRITICAL] Fix <tool\*call> typo in omnicoder2.modelfile — markdown-escape artifact; malformed opening tag paired with correct closing tag. Highest-leverage change; everything below depends on reliable tool-call JSON.
    • Mark canonical/deprecated modelfiles: # CANONICAL header on omnicoder2.modelfile; # DEPRECATED on omnicoder.modelfile; omnicoder-v2.modelfile.template deleted (was dead code — v2 now served from HuggingFace path).
    • Add compaction.reserved: 3000 to opencode.json — default 10,000 fires compaction too early given ~8–12K baseline context.
    • Fix pre-compact.sh prettier call — removes npx prettier which violates pre-tool-use Policy 1 (self-violating policy).
    • MCP server error handling — wrap server.connect(transport) in try/catch with stderr + process.exit(1).

Short session — 15–30 minutes each, bounded scope

    • Fix stop.sh JSON escaping — replace sed-based escaping with printf '%b' | node JSON.stringify pattern used in every other hook.
    • Per-session PostToolUse counter — repo-scoped path /tmp/.opencode-tool-count-<repo-hash> (derived from REPO_ROOT via md5sum); prevents cross-repo contamination; session-start.sh resets it at session begin.
    • Shrink compaction prompt to ~120 words (in .opencode/plugins/agent-support.ts) — shorter instructions free bandwidth for the 9B to actually summarize.
    • Update .agents/agents/build-local.md for v2 — pagination 100 → 50 lines; rule 4 now says "recipient not dispatcher"; rule 7 scope-check says "tell the user, do not self-decompose".

Depends on orchestrator being proven first

    • Trim root AGENTS.md to ~60 lines — reduced from 435 lines to 45 lines; all architecture rationale, code examples, quick task table, and project context removed; cross-cutting rules and quality gate preserved (May 2026).
    • PostToolUse weighted counter — reads (read_file, grep, list) +0.25; writes/shell +1; keeps 15-call SELF-CHECK from firing mid-investigation sweep. Depends on #7 (per-session counter) first.

      **Implementation** (`.agents/hooks/post-tool-use.sh`): bash has no
      float arithmetic — scale to integers: reads +1, writes/shell +4,
      threshold 60 (equivalent to 15 effective write-units). Read-class
      tools: `read_file`, `grep_search`, `list_dir`, `file_search`,
      `semantic_search`, `explore_subagent`. Write/shell-class: all
      `*_string_in_file`, `create_file`, `run_in_terminal`. Replace the
      single `COUNT=$((COUNT + 1))` with a `case "$TOOL_NAME"` block that
      saves `PREV=$COUNT`, then does `COUNT=$((COUNT + 1))` for reads and
      `COUNT=$((COUNT + 4))` for writes/shell. Change the self-check
      condition from `(( COUNT % 15 == 0 ))` to a crossing check,
      `(( COUNT / 60 > PREV / 60 ))`: with mixed +1/+4 steps the counter
      can jump over a multiple of 60 (58 → 62), so `(( COUNT % 60 == 0 ))`
      would silently skip that SELF-CHECK.
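
The integer scaling can be checked in isolation before touching the bash hook. The sketch below models the same arithmetic in TypeScript; it assumes any tool not in the read-class list counts as write/shell, and it fires when the counter crosses a multiple of 60 rather than when it lands exactly on one, because mixed +1/+4 steps can jump from 58 to 62.

```typescript
// Read-class tools from the implementation note above.
const READ_TOOLS = new Set([
  "read_file", "grep_search", "list_dir",
  "file_search", "semantic_search", "explore_subagent",
]);
const READ_WEIGHT = 1;  // 0.25 effective units, scaled by 4
const WRITE_WEIGHT = 4; // 1 effective unit, scaled by 4
const THRESHOLD = 60;   // 15 effective units, scaled by 4

// Assumption: anything that is not read-class is treated as write/shell.
function weightFor(toolName: string): number {
  return READ_TOOLS.has(toolName) ? READ_WEIGHT : WRITE_WEIGHT;
}

// Returns the new count and whether a multiple of THRESHOLD was crossed.
function advance(count: number, toolName: string): { count: number; selfCheck: boolean } {
  const next = count + weightFor(toolName);
  const selfCheck = Math.floor(next / THRESHOLD) > Math.floor(count / THRESHOLD);
  return { count: next, selfCheck };
}
```

In bash the equivalent comparison is plain integer division on the counter before and after the increment.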
      
    • PostToolUse reminder priority filter — emit at most 2 reminders per tool call; priority: SELF-CHECK > DEBUGGING > path-scoped > tool-specific. Depends on #11.

      **Implementation** (`.agents/hooks/post-tool-use.sh`): replace the
      current single `context` string accumulator with an indexed array
      `reminders=()`. Each block appends `reminders+=("$msg")` in priority
      order (SELF-CHECK first, DEBUGGING second, BFF/QUALITY GATE third,
      RENAME fourth). At output time: join only the first 2 elements.
      Append with `\n\n` separator. Blocks that didn't fire don't append,
      so the cap is natural.
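
The priority filter is easy to express as a pure function, which also makes it testable outside the hook. This is a sketch of the selection logic only, using the four priority classes named in the item above; the type and function names are hypothetical.

```typescript
type ReminderKind = "SELF-CHECK" | "DEBUGGING" | "path-scoped" | "tool-specific";

interface Reminder {
  kind: ReminderKind;
  text: string;
}

// Highest priority first, as stated in the item above.
const PRIORITY: ReminderKind[] = ["SELF-CHECK", "DEBUGGING", "path-scoped", "tool-specific"];
const MAX_REMINDERS = 2;

// Keeps the two highest-priority reminders that fired and joins them with a blank line.
function selectReminders(fired: Reminder[]): string {
  return [...fired]
    .sort((a, b) => PRIORITY.indexOf(a.kind) - PRIORITY.indexOf(b.kind))
    .slice(0, MAX_REMINDERS)
    .map((r) => r.text)
    .join("\n\n");
}
```

The bash version gets the same ordering for free by appending to the array in priority order, so no sort is needed there.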
      
    • Broaden PostToolUse truncation to all ollama/ agents (.opencode/plugins/agent-support.ts); differentiate limit: orchestrator 2,500 tokens vs workers 1,500. Minor until orchestrator exists.

      **Implementation**: rename `BUILD_LOCAL_MAX_RESPONSE_TOKENS` →
      `LOCAL_WORKER_MAX_TOKENS = 1500`; add
      `LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500`. In `tool.execute.after`, the
      existing `isLocalAgent` check covers all `ollama/` agents via
      `input.model.startsWith('ollama/')`. Add a second check:
      `input.agent === 'local-orchestrator'` → use orchestrator limit, else
      worker limit. The `agent` field is available in `tool.execute.after`
      (confirmed working for `build-local`).
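
A sketch of how the two limits might be selected and applied. The `agent` and `model` fields follow the implementation note above; the input interface is a hypothetical subset of the real hook input, and the 4-characters-per-token estimate is an assumption standing in for whatever token counting the plugin actually uses.

```typescript
const LOCAL_WORKER_MAX_TOKENS = 1500;
const LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500;
const CHARS_PER_TOKEN = 4; // assumption: rough heuristic, not a real tokenizer

// Hypothetical subset of the tool.execute.after input.
interface ToolAfterInput {
  agent: string;
  model: string;
}

// Returns the token limit for local agents, or null when no truncation applies.
function maxTokensFor(input: ToolAfterInput): number | null {
  if (!input.model.startsWith("ollama/")) return null;
  return input.agent === "local-orchestrator"
    ? LOCAL_ORCHESTRATOR_MAX_TOKENS
    : LOCAL_WORKER_MAX_TOKENS;
}

// Truncates tool output to the agent's limit and marks the cut.
function truncateForAgent(output: string, input: ToolAfterInput): string {
  const limit = maxTokensFor(input);
  if (limit === null) return output;
  const maxChars = limit * CHARS_PER_TOKEN;
  if (output.length <= maxChars) return output;
  return output.slice(0, maxChars) + `\n[truncated to ~${limit} tokens]`;
}
```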
      
    • Create .agents/agents/local-orchestrator.md — primary agent with edit: deny, write: deny, bash: deny; whitelist task to build-local, research, brainstorm only.

      **Implementation**: new file modeled on `build-local.md`. Role: receive
      high-level goal, decompose into bounded subtasks, show decomposition to
      user before dispatching, delegate via `task` subagent. Permission
      block in `opencode.json` `agent.local-orchestrator`:
      `{ "edit": "deny", "write": "deny", "bash": "deny" }`. Agent body
      rules: (1) read project root `AGENTS.md` first; (2) produce a task
      list and confirm with user before dispatching; (3) one `task` call per
      subtask, wait for result; (4) never attempt to edit files directly —
      if a subtask requires context the worker needs, inject it via the
      `task` prompt, not by reading files yourself; (5) after all subtasks,
      report summary to user.
      
    • Set default_agent: "local-orchestrator" in opencode.json — Done May 2026. Key is default_agent (snake_case, confirmed from opencode.ai/config.json schema). local-orchestrator has mode: all so it qualifies as a primary agent.

Done

  • Soften opus-deep.modelfile directive — file deleted (May 2026); DeepSeek R1 available online when needed; OmniCoder 2 is the sole local model.

Known Tradeoffs

| Tradeoff | Impact | Mitigation |
| --- | --- | --- |
| Instructions glob trimmed to root AGENTS.md only | Agents miss project-specific patterns for subdirectories unless they read nested AGENTS.md explicitly | Add reminder in orchestrator + build-local agent body: "check nested AGENTS.md before working in subdirectories" |
| Same model for all roles | Orchestrator, worker, compaction agent are all same weights with different prompts | Structural `edit: deny` is the safety net; circuit breakers limit runaway loops |
| No cloud fallback | If task is too complex for 9B, no escalation path | Orchestrator includes "ask the user for direction" rule; user can switch to Copilot |
| Latency | Sequential dispatch: orchestrator decomposes → build-local runs → returns. ~2× wall time vs. direct build-local | Acceptable for local dev; no VRAM multiplier since Ollama keeps weights hot |
| Reminder-stacking cap | 2-per-call priority filter (pending work above) drops lower-priority warnings | Skipped reminders fire on next call if condition holds |

Cloud Migration Path

When ready to add a cloud model, only opencode.json changes:

{
  "model": "ollama/arch-omni2-9b",
  "agent": {
    "local-orchestrator": {
      "model": "anthropic/claude-haiku-4-5"
    }
  }
}

Schema verified against opencode.ai/docs/agents/ (May 2026). The tools key inside agent configs is deprecated in favour of permission — the orchestrator definition uses permission, so it is current. The agent.{name}.model key is the correct per-agent override mechanism.


Ecosystem Gap — Contextual AGENTS.md Injection

During local AI work (May 2026) we hit a fundamental limitation: OpenCode's instructions glob in opencode.json loads all matched files upfront into every session. For a 9B local model with a 32K context window, loading all of apps/*/AGENTS.md and packages/*/AGENTS.md at startup consumes ~30–40% of the context budget before the first message, triggering early compaction and degrading quality.

The correct behaviour — injecting only the AGENTS.md relevant to the file being edited — does not exist natively in OpenCode or its plugin ecosystem. The closest community plugin (opencode-skillful, 295 stars) is archived as of Feb 2026 and still requires the model to explicitly call skill_find/skill_use; it provides no path-triggered structural injection.

Open tasks

    • Assess: is filling this ecosystem gap worth the effort? — Before building a contextual-injection plugin, evaluate: (a) Is OpenCode actively used for serious local AI coding work, or is the community primarily cloud-model users for whom context cost is irrelevant? (b) Are there better local AI coding stacks (e.g. Aider + litellm, Cursor local mode, VS Code Copilot + Ollama) where this problem is already solved? (c) Is the tool.execute.before event stable enough to build on? Target: 30-minute research session, concrete go/no-go recommendation.
    • Review + write up our issues and fixes as an ecosystem contribution — If the gap is worth filling: document the context-bleed problem, the early-compaction root cause, our hook-based mitigation, and the remaining structural gap. Publish as a GitHub issue on the OpenCode repo and/or an npm plugin (opencode-contextual-rules?) implementing tool.execute.before path-triggered AGENTS.md injection. Depends on #16 go/no-go.
    • Trim .agents/AGENTS.md — Done May 2026. Condensed from 12,584 → 10,507 bytes (43 lines removed). Trimmed: Hook Architecture Principle block (redirected to item 22 in project doc), Deferred Loading example + "why not" paragraph, session-start/stop hook prose, outdated generate-agents.ts references in Skills/Agents sections. Agent body files updated to prompt-body-only convention (see items 25/26).
    • Block bash bypass of read pagination — Done May 2026. Added Policy 14 to pre-tool-use.sh: blocks cat/head/tail/jq reads of apps/*/package.json and packages/*/package.json. Scope limited to package.json (confirmed live bypass vector); general .ts/.md bash reads are not yet blocked (lower-urgency gap). Pattern verified with Node.js unit test — exact bypass command cat apps/api/package.json | jq is caught by P1.
    • Improve explore-first scope detection — Policy 14 blocks manage_todo_list with ≥4 items, but OmniCoder sometimes starts with Explore/find before planning, bypassing the check. Options: (a) block explore_subagent when the query looks like a multi-file discovery sweep (glob patterns for source files across multiple dirs); (b) add a pre-tool-use check on run_in_terminal that denies find commands spanning the whole repo when the task hasn't been scoped yet; (c) rely on the todo-list check firing when planning eventually happens (current behavior — catches it late but still before edits start).
    • Remove debug logging from plugin after verified cycle — Done May 2026. Removed the full-input dump block from tool.execute.before in plugin.ts (/tmp/plugin-debug.jsonl appender). Guards verified via opencode export session transcript inspection — no longer need the dump file. Hook error logger (/tmp/plugin-hook-errors.log) kept as it only fires on failures, not every call.
    • Refactor hook scripts to be platform-agnostic — currently pre-tool-use.sh parses Copilot-specific JSON and outputs Copilot-specific permissionDecision JSON. plugin.ts implements duplicate guards inline rather than calling the script. This means OpenCode and Copilot guards can drift (confirmed May 2026: Policy 14 in pre-tool-use.sh had no effect on OpenCode bash tool calls).

      **Design target**: scripts accept normalized env vars (`TOOL_NAME`,
      `COMMAND`, `FILE_PATH`), exit non-zero with plain-text denial reason
      on stdout. Callers normalize input and translate output to their
      native denial format. Tracked in `.agents/AGENTS.md` Hook Architecture
      Principle section.
      
      **Audit required first**: review all hook scripts for Copilot-specific
      assumptions before refactoring.
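
One way to picture the design target: a single policy function takes normalized input and returns a plain decision, and each harness translates that decision into its own denial format. The sketch below models that contract in TypeScript rather than bash. The regex is an illustrative stand-in for the package.json read policy, the Copilot JSON shape is an assumption (only the `permissionDecision` key is named in this document), and the OpenCode adapter assumes a `tool.execute.before` handler blocks a call by throwing.

```typescript
interface NormalizedCall {
  toolName: string;
  command?: string;
  filePath?: string;
}

type Decision = { allowed: true } | { allowed: false; reason: string };

// Illustrative policy only: bash reads of workspace package.json files.
const PACKAGE_JSON_READ =
  /\b(cat|head|tail|jq)\b[^|;&]*\b(apps|packages)\/[^/\s]+\/package\.json\b/;

// Shared policy: knows nothing about Copilot or OpenCode.
function evaluate(call: NormalizedCall): Decision {
  if (call.command && PACKAGE_JSON_READ.test(call.command)) {
    return { allowed: false, reason: "Use the paginated read tool for package.json" };
  }
  return { allowed: true };
}

// Hypothetical Copilot-side adapter; the exact output shape is an assumption.
function toCopilotOutput(d: Decision): string {
  return JSON.stringify(
    d.allowed
      ? { permissionDecision: "allow" }
      : { permissionDecision: "deny", reason: d.reason },
  );
}

// OpenCode-side adapter: assumes throwing from the handler blocks the tool call.
function enforceInOpenCode(d: Decision): void {
  if (!d.allowed) throw new Error(d.reason);
}
```

With both adapters calling the same `evaluate`, a new policy cannot take effect in one harness and be missing from the other.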
      
    • Question-drift marker in user-prompt-submit.sh — when the model has committed to a prior position and follow-up questions are being misread through that lens, prepend a disambiguation marker at the prompt tail. Detected pattern: model answers "no" or "not possible" in a prior turn → subsequent turns interpreted as defense of that position. See §2.1 ("Position-anchored priming") in the research doc.

      **Implementation**: in `user-prompt-submit.sh`, read the last N turns
      of `$TRANSCRIPT_PATH` (injected by OpenCode's native hook env) and
      look for a prior committed "no/impossible/can't" response within the
      last 3 model turns. If detected, append to `ADDITIONAL_CONTEXT`:
      `CURRENT QUESTION (answer only this — not the prior exchange): [prompt
      text]`. The key is repeating the user's exact question at the tail,
      after the marker, to counteract lost-in-the-middle effects. Fallback
      trigger: user prompt contains "that's not what I asked" / "you're
      answering the wrong question" / "I said" → always inject marker
      regardless of transcript scan.
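
The detection logic above can be prototyped as a pure function before wiring it to the transcript. In this sketch the refusal regex is a deliberately crude heuristic, the caller is assumed to have already extracted model turns as plain strings, and the marker wording is adapted from the implementation note.

```typescript
// Crude heuristics; both patterns are assumptions to be tuned against real transcripts.
const REFUSAL = /\b(no|not possible|impossible|can't|cannot)\b/i;
const USER_CORRECTION =
  /that's not what I asked|you're answering the wrong question|\bI said\b/i;
const LOOKBACK_TURNS = 3;

// Returns the marker to append at the prompt tail, or null when no drift signal is found.
function driftMarker(prompt: string, modelTurns: string[]): string | null {
  const recent = modelTurns.slice(-LOOKBACK_TURNS);
  const committed = recent.some((turn) => REFUSAL.test(turn));
  if (!committed && !USER_CORRECTION.test(prompt)) return null;
  // Repeats the user's exact question after the marker, at the context tail.
  return `CURRENT QUESTION (answer only this, not the prior exchange): ${prompt}`;
}
```

Matching a bare "no" will produce false positives, so a real implementation would likely restrict the scan to the opening sentence of each model turn.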
      
    • Review all custom agent files for local-model-specific framing — Done May 2026. build-local.md reframed: dropped "OmniCoder", "9B", "Ollama", "Qwen3 thinking blocks", "32K tokens total"; replaced with model-agnostic equivalents. research.md and brainstorm.md verified clean — no model/provider mentions. local-orchestrator.md was fixed earlier this session. All four agent body files are now model-agnostic.
    • Failure-mode routing in SELF-CHECK — when the periodic SELF-CHECK fires in post-tool-use.sh, if a recent terminal failure or test failure is also present in the same turn, classify the failure type and inject the matched intervention rather than generic "step back." Reference: failure-mode routing table in §3.5 of the research doc.

      **Implementation**: in the SELF-CHECK block, if `context` already
      contains `DEBUGGING REMINDER` (i.e., test/terminal failure co-occurred
      this turn), append a classification hint:
      `FAILURE TYPE HINT: If this is a test/build failure → Reflexion loop
      (fix based on test output). If convention violation → grep for the
      pattern and inject a canonical example. If wrong file/directory → stop
      and re-read the project structure. Do not default to "try harder."`.
      Low implementation cost — pure text append with a conditional on
      `$context`.
      
    • Audit agent .md files for OpenCode-specific frontmatter — Done May 2026. Audit result: only local-orchestrator.md had OpenCode frontmatter keys (mode, model, permission). brainstorm.md, build-local.md, research.md were already plain markdown. Went with option (b): stripped mode/model/permission from local-orchestrator.md; moved mode: all into opencode.json (model + permission were already there). Kept description in frontmatter as it is neutral and self-documenting. Body files are now prompt-body only — valid in both OpenCode and Copilot.
    • plugin.ts local-agent detection uses provider prefix, not agent name: tool.execute.after detects local agents via input.model.startsWith('ollama/'). This is provider-specific: if the model is served via a different backend (e.g. llama-server/, lmstudio/), truncation silently stops working. Fix: detect by agent name (input.agent.includes('build-local')) only, removing the ollama/ fallback. The input.agent field is available in tool.execute.after (confirmed May 2026).
    • plugin.ts context pressure threshold is hardcoded to 32,768 tokens: CONTEXT_LIMIT_TOKENS = 32768 assumes OmniCoder 9B's context window. If the local model changes, the threshold silently drifts out of calibration. Options: (a) read from opencode.json model config if OpenCode exposes it to plugins; (b) make it a top-of-file constant with a comment to update when changing models; (c) accept the drift as low-severity (threshold is advisory only — context pressure warnings are informational, not blocking). Option (b) is the minimum; option (a) is ideal if OpenCode exposes model metadata to plugins.
    • Move permission out of local-orchestrator.md frontmatter — Done May 2026 as part of item 25. mode: all added to opencode.json agent entry. model and permission were already in opencode.json. opencode.json is now the single source of truth for all runtime config; .md files are prompt-body only.
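
The provider-prefix problem in the plugin.ts item above is easy to reproduce with two small predicates. The marker list below is an assumption; the item itself only names `build-local`.

```typescript
// Agent-name markers that should receive local-model handling (illustrative list).
const LOCAL_AGENT_MARKERS = ["build-local"];

// Proposed check: depends only on the agent name.
function isLocalAgent(agent: string | undefined): boolean {
  if (!agent) return false;
  return LOCAL_AGENT_MARKERS.some((marker) => agent.includes(marker));
}

// Current check, kept here only to show the failure mode.
function isLocalAgentByProvider(model: string): boolean {
  return model.startsWith("ollama/");
}

// Same agent, same weights, different backend: the provider check misses it.
const missedByProvider = !isLocalAgentByProvider("llama-server/arch-omni2-9b-native");
const caughtByName = isLocalAgent("build-local");
```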

Testing & Regression

Research summary (May 2026): No pre-existing tool exactly fits this use case. Existing tools (RagaAI Catalyst, AgentEvalKit, agent-eval-arena, intent-eval-lab, j-rig-skill-binary-eval) focus on LLM output quality, hallucination detection, or cross-runtime behavior scoring — not config file structure or policy enforcement regression. The closest analogue is j-rig-skill-binary-eval (binary pass/fail criteria across 7 layers), which uses the same conceptual approach we'd want here. Our testing is bespoke by necessity: we're testing configuration files, shell scripts, and specific policy enforcement behaviors, not general LLM response quality.

Two layers of testing:

| Layer | What it tests | Cost | When to run |
| --- | --- | --- | --- |
| Config + policy unit tests | Schema validity, hook regex correctness | None (no model) | Always — CI, pre-commit |
| CLI integration smoke tests | Actual enforcement via `opencode run` | Local model only | On-demand; local model must be running |

Cloud agents excluded from integration tests: opencode run with a cloud model (Copilot, Anthropic) incurs API costs and rate limits. Tests must detect the active model and skip if it's not a local provider.
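
A minimal sketch of the skip guard. The prefix list is an assumption built from provider names that appear elsewhere in this document, and how the active model string is obtained (config file, environment variable) is left to the caller.

```typescript
// Provider prefixes treated as local; an assumption, extend as backends are added.
const LOCAL_PROVIDER_PREFIXES = ["ollama/", "llama-server/", "lmstudio/"];

function isLocalModel(model: string): boolean {
  return LOCAL_PROVIDER_PREFIXES.some((prefix) => model.startsWith(prefix));
}

// Integration tests run only when a model is configured and it is local.
function shouldRunIntegrationTests(activeModel: string | undefined): boolean {
  return activeModel !== undefined && isLocalModel(activeModel);
}
```

In a vitest suite this could gate the whole file with `describe.skipIf(!shouldRunIntegrationTests(model))`, so cloud-configured environments skip the suite instead of failing it.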

Open tasks

    • Config + policy unit test suite — test config file structure and hook regex patterns without invoking any model. Implementation:

      a. **`opencode.json` schema validation**: the file references
         `"$schema": "https://opencode.ai/config.json"` — validate it using
         `ajv` (already used in the monorepo) against the live schema or a
         cached copy. Catches permission typos, unknown agent keys,
         unsupported field values.
      
      b. **Hook JSON structure validation**: write a schema for the hooks
         JSON format and run ajv against
         `.agents/frameworks/github/hooks.json`.
         `.agents/frameworks/opencode/plugin.ts` is TypeScript and already
         type-checked, so it needs no separate schema.
      
      c. **Hook policy regex unit tests**: extract every regex used in
         `pre-tool-use.sh` into a `tests/hooks.test.ts` file and run it
         with `vitest`. For each policy, define 2-3 input strings that
         SHOULD match and 2-3 that SHOULD NOT. Policy 14 already has an
         informal Node.js test from this session — formalize it.
      
      d. **Agent `.md` frontmatter validator**: check that no agent file
         under `.agents/agents/` has frontmatter keys other than
         `description`. Catches regression when someone adds `model:` or
         `permission:` back to a body file.
      
      **Suggested location**: `.agents/tests/` or root `test/agents/`.
      **Stack**: vitest (already in monorepo), ajv (already available), Node
      built-ins. No new dependencies needed.
      
    • CLI integration smoke tests (local model only) — use opencode run in non-interactive mode to verify enforcement is actually firing via the real runtime. These tests exercise the plugin + hook wiring end-to-end.

      **Command shape**:
      ```
      opencode run "prompt" --agent build-local \
        --model llama-server/arch-omni2-9b-native \
        --format json
      ```
      
      **Assertions via `opencode export`**: after each run, export the
      session with `opencode export <sessionID> 2>/dev/null` and parse the
      JSON transcript. Assert on `parts` array: tool calls that SHOULD have
      been blocked appear with error/denied status; tool calls that SHOULD
      have passed completed normally.
      
      **Test cases to start with** (all verified real enforcement gaps):
      1. Attempt to `read` a nested `package.json` (e.g. `apps/api/package.json`) → BLOCKED by plugin
         package.json guard
      2. Attempt to `read` a source file with no `limit` → BLOCKED by
         pagination guard
      3. Attempt to `read` a source file with `limit: 51` → BLOCKED
      4. Attempt to `read` a docs file with `limit: 501` → BLOCKED
      5. Attempt to `read` a docs file with `limit: 50` → PASSES
      6. Bash command `cat apps/api/package.json` → BLOCKED by pre-tool-use
         Policy 14 (substitute your project's equivalent nested package.json)
      
      **Guard rail**: skip all tests if `llama-server` is not reachable at
      `http://127.0.0.1:8080/v1`. Do not run against cloud models. Add
      an env var `AGENT_INTEGRATION_TESTS=1` required to enable (off by
      default, never runs in standard `npm test`).
      
      **Suggested location**: `.agents/tests/integration/`.
      **Stack**: Node.js test runner or vitest, `opencode` CLI in PATH.
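For item (d) of the unit test suite above, the frontmatter check can be written without a YAML dependency, because only top-level key names matter. The sketch below assumes frontmatter is a leading block delimited by `---` lines; `frontmatterKeys` and `disallowedKeys` are hypothetical helper names.

```typescript
// Returns the top-level keys found in a leading `---` frontmatter block.
// Indented (nested) lines are ignored because the pattern anchors at column 0.
function frontmatterKeys(markdown: string): string[] {
  const lines = markdown.split(/\r?\n/);
  if (lines[0] !== "---") return [];
  const keys: string[] = [];
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i] ?? "";
    if (line === "---") break; // closing delimiter
    const match = /^([A-Za-z_][\w-]*):/.exec(line);
    if (match) keys.push(match[1] ?? "");
  }
  return keys;
}

// Keys present in the frontmatter that are not on the allowlist.
function disallowedKeys(
  markdown: string,
  allowed: readonly string[] = ["description"],
): string[] {
  return frontmatterKeys(markdown).filter((key) => !allowed.includes(key));
}
```

A vitest case would read each file under `.agents/agents/` and expect `disallowedKeys(text)` to be empty, so a reintroduced `model:` or `permission:` key fails the suite by name.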
      

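Test cases 2 to 5 in the integration list above imply a 50-line ceiling for source files, a 500-line ceiling for docs files, and a block on reads with no `limit`. The sketch below encodes that as a pure function so the same table can drive a unit test. How the real plugin decides that a path is a docs file is not stated here, so `isDocsFile` is an assumption (a `.md` extension or a `docs/` path segment), and all names are hypothetical.

```typescript
const SOURCE_MAX_LINES = 50; // implied by test cases 2 and 3
const DOCS_MAX_LINES = 500; // implied by test case 4

type Verdict = "PASSES" | "BLOCKED";

// Assumption: docs are Markdown files or anything under a docs/ directory.
function isDocsFile(filePath: string): boolean {
  return filePath.endsWith(".md") || /(^|\/)docs\//.test(filePath);
}

// Mirrors the read tool's input shape: { filePath, limit? }.
function checkReadLimit(filePath: string, limit?: number): Verdict {
  if (limit === undefined) return "BLOCKED"; // unbounded read
  const max = isDocsFile(filePath) ? DOCS_MAX_LINES : SOURCE_MAX_LINES;
  return limit <= max ? "PASSES" : "BLOCKED";
}

// Table form of test cases 2 to 5; the file paths are illustrative only.
const paginationCases: Array<[string, number | undefined, Verdict]> = [
  ["src/index.ts", undefined, "BLOCKED"],
  ["src/index.ts", 51, "BLOCKED"],
  ["docs/guide.md", 501, "BLOCKED"],
  ["docs/guide.md", 50, "PASSES"],
];
```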
Verified facts (May 2026)

  • OpenCode's read tool input schema is { filePath: string, limit?: number, offset?: number } — NOT startLine/endLine. Confirmed via plugin debug logging of real tool calls.
  • tool.execute.before input contains only { tool, sessionID, callID }. It does NOT include agent or model, so plugin-layer gating cannot filter by agent. Confirmed via plugin debug logging.
  • OpenCode has its own native hook system that calls pre-tool-use.sh directly for tools like run_in_terminal, replace_string_in_file, etc. This is completely separate from the plugin's runHook calls. The native hook payload includes timestamp, hook_event_name, session_id, transcript_path, tool_use_id, and cwd — fields the plugin never sends. The plugin runHook is a second call, layered on top.
  • Bun shell $ API does not have a .stdin() method. The correct API for piping stdin is $`cmd < ${Buffer.from(text)}`. .stdin(text) throws TypeError: $`...`.stdin is not a function, which was caught by runHook's catch block and returned ''. This caused the plugin's runHook to silently no-op for every call with stdinJson since the plugin was first written: hook enforcement (all 12 policies) was never running via the plugin path. It only ran via OpenCode's native hook system for the tools OpenCode natively supports. Confirmed via /tmp/plugin-hook-errors.log.
  • The silent catch in runHook is dangerous. It masked the Bun .stdin() bug entirely. Always log hook failures to a debug file during development; remove only after enforcement is verified working.
  • Plugin-layer enforcement works for read after fixing the Bun stdin API. The read tool fires tool.execute.before in the plugin, which calls runHook('pre-tool-use.sh', ...) via < ${Buffer.from(...)}, which applies Policy 13 (50-line limit). Verified: bare read (no limit) → BLOCKED; read with limit: 50 → passes. (May 2026)
  • Plugin load failure: unescaped regex slashes caused silent syntax error. plugin-debug.jsonl was empty even after the Bun stdin fix because the plugin file itself failed to parse. Line 84 had /(^|/)(apps|packages)/[^/]+/... — forward slashes inside the regex literal were not escaped, producing a JS syntax error at parse time. Bun silently drops plugins that fail to import. Fixed to /(^|\/)(apps|packages)\/[^/]+\/.... The fix also corrected the pagination guard to use limit/offset (not startLine/endLine) and added an unbounded-read block (limit === undefined). All three guards verified working in a live session (May 2026).
  • Package.json read guard verified working. local-orchestrator attempting to read apps/*/package.json and packages/*/package.json → BLOCKED by plugin. Root package.json read correctly passes. (May 2026)
  • Policy 14 (manage_todo_list ≥ 4 items) catches some but not all broad task attempts. OmniCoder sometimes proceeds directly to Explore/find without calling manage_todo_list first, bypassing the policy. When it does plan with the todo tool before acting, the deny fires correctly.
  • OmniCoder comprehension failure: prompt ambiguity → wrong directory. Given "refactor the five hook files", OmniCoder ran a glob for *hook* files and found .husky/ hooks instead of .agents/hooks/. The correct files were in the grep output from the Explore subagent but were not selected. Root cause: the model lacks enough context about the repo layout to disambiguate "hook files" without explicit path guidance. Mitigation: be explicit in prompts ("the five .agents/hooks/*.sh files").
  • OpenCode agent permission config requires a .opencode/agents/<name>.md file. Without a matching markdown file, opencode.json's agent.<name>.permission config is silently ignored — the agent is unknown to OpenCode and runs as a nameless build-agent alias. The markdown file must exist in .opencode/agents/ (or ~/.config/opencode/agents/). Confirmed by test run where @local-orchestrator edited files despite permission.edit: "deny" in JSON config; fixed by creating .opencode/agents/local-orchestrator.md symlink. (May 2026)
  • "write" is NOT a valid OpenCode permission key. Use "edit" instead — it covers write, edit, and apply_patch tools. "write": "deny" is silently ignored. Valid top-level permission keys include: read, edit, glob, grep, list, bash, task, skill, lsp, question, webfetch, websearch, external_directory, doom_loop, todowrite. Confirmed from opencode.ai/docs/permissions (May 2026).
  • default_agent key is snake_case in opencode.json (not defaultAgent). Confirmed from opencode.ai/docs/config (May 2026).
  • tools: false is deprecated. The current approach for per-agent tool restriction is permission: { edit: "deny" }. The old tools: false still works but is documented as legacy. Confirmed from opencode.ai/docs/agents (May 2026).
  • Broken symlinks are silent. OpenCode does not error on a broken .opencode/agents/ symlink — it just skips the agent silently. The agent won't appear in opencode agent list and all opencode.json permission config for it is ignored. Always verify with cat .opencode/agents/<name>.md | head -5 (should print content, not a "No such file" error) and opencode agent list (agent should appear with correct deny rules). The correct symlink depth from .opencode/agents/ is ../../.agents/agents/<name>.md (two levels), not three.
  • opencode agent list is the authoritative verification command. Run it after any agent config change to confirm: (a) the agent appears by name, (b) its mode is correct (all/primary/subagent), and (c) deny rules appear at the bottom of its permission list. Missing agent = broken symlink or YAML parse error. Present but missing deny rules = frontmatter not parsed correctly or wrong key names. (May 2026)
  • @mention routing only works at session start. If you send any message that gets answered by the current primary agent first, then send @local-orchestrator ..., the TUI passes the full message text to the current model (Build/OmniCoder) which treats @local-orchestrator as freeform text and answers it itself. Always open a fresh session and make @agent-name the very first message. Alternatively, use opencode run --agent local-orchestrator "..." from the CLI for reliable agent-scoped invocation. Tab-switching to a custom all-mode agent in an existing session works correctly.
  • edit: deny on local-orchestrator is working correctly. When given an edit task, the orchestrator correctly avoided using replace_string_in_file and instead used the task tool to delegate to a subagent. This is the expected behaviour. Confirmed May 2026.
  • task tool has a JSON serialization limit. OmniCoder 9B caused an Unterminated string error by embedding the entire contents of multiple package.json files as a literal string inside the task prompt JSON. The task tool prompt is serialized as JSON; very long strings truncate and produce parse errors. Mitigation: instruct the orchestrator in its system prompt to tell workers which files to read rather than quoting file contents inline. This has been added to local-orchestrator.md. (May 2026)
  • ollama/arch-omni2-9b is the wrong model identifier for the llama-server instance. The correct ID is llama-server/arch-omni2-9b-native (verify with opencode models | grep arch). Using the wrong ID causes an immediate "cannot load model" error when the agent is invoked. Fixed in opencode.json and local-orchestrator.md frontmatter. (May 2026)
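Because an unrecognized permission key such as "write" is silently ignored, a unit test can flag it before it reaches the runtime. The sketch below checks an agent's permission object against the keys listed in the facts above. That list is introduced with "include", so it may not be exhaustive, and a flagged key means "not in the documented list" rather than "definitely invalid". `undocumentedPermissionKeys` is a hypothetical helper name.

```typescript
// Keys taken from the permission-key fact above (May 2026); possibly incomplete.
const DOCUMENTED_PERMISSION_KEYS = [
  "read", "edit", "glob", "grep", "list", "bash", "task", "skill",
  "lsp", "question", "webfetch", "websearch", "external_directory",
  "doom_loop", "todowrite",
] as const;

// Returns every key in `permission` that is not in the documented list.
function undocumentedPermissionKeys(
  permission: Record<string, unknown>,
): string[] {
  const known = new Set<string>(DOCUMENTED_PERMISSION_KEYS);
  return Object.keys(permission).filter((key) => !known.has(key));
}
```

For example, `undocumentedPermissionKeys({ write: "deny", edit: "deny" })` reports only `write`, which is the typo described above.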

Open Issues

Known bugs and stale claims identified during code review (see deleted agent-infrastructure-review.md and agent-infrastructure-review-pass2.md for full context). Items are not yet fixed unless marked (resolved).

CRITICAL — description: empty in all generated agent/skill files

scripts/generate-agents.ts uses a hand-rolled YAML parser that silently drops descriptions when they are written in block-scalar form (value on the next line under the key). Every generated file in .github/agents/, .github/skills/, .opencode/agents/, .opencode/skills/ has a blank description: field.

description: is the primary routing signal for Copilot's SkillsContextComputer and OpenCode's agent dispatch. Explicitly @-mentioning an agent by name still works; description-triggered auto-routing does not.

Fix: Inline the description strings in the canonical .agents/ source files (change block-scalar to key: 'value' format). The existing parser handles inline strings correctly. Add a generate:agents:check assertion that every generated file has a non-empty description:.
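The proposed generate:agents:check assertion could be as small as the sketch below. It treats a file as failing when the frontmatter has no `description:` line, when the value is empty, or when the value is only a block-scalar indicator (`|` or `>`), the form the hand-rolled parser drops. `hasInlineDescription` is a hypothetical name, and this is not the generator's actual parser.

```typescript
// True when the leading frontmatter carries a non-empty, inline description.
function hasInlineDescription(fileText: string): boolean {
  const lines = fileText.split(/\r?\n/);
  if (lines[0] !== "---") return false;
  for (let i = 1; i < lines.length && lines[i] !== "---"; i++) {
    const match = /^description:(.*)$/.exec(lines[i] ?? "");
    if (!match) continue;
    // Strip surrounding whitespace and one pair of matching quotes.
    const value = (match[1] ?? "")
      .trim()
      .replace(/^(['"])(.*)\1$/, "$2")
      .trim();
    // A bare "|" or ">" (optionally with +/-) starts a block scalar.
    return value !== "" && !/^[|>][+-]?$/.test(value);
  }
  return false;
}
```

Running this over both the canonical sources and the generated files would catch the regression at either end.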

MEDIUM — printf '%s' regression in hooks breaks \n rendering (resolved)

.agents/hooks/post-tool-use.sh, session-start.sh, and user-prompt-submit.sh were reported to use printf '%s' "$context" | node -e '...' to JSON-escape the context variable. %s does not interpret \n escape sequences, so multi-line context strings (SELF-CHECK, DEBUGGING REMINDER, BFF REMINDER) would arrive at the model as single lines with literal \n characters.

Verified fixed (May 2026): all three hooks already use printf '%b'.

LOW — arXiv citation 2603.29957 unverified (resolved)

arXiv:2603.29957 (Jiang et al. 2026, "Think-Anywhere") appears in .agents/agents/research.md, .agents/agents/brainstorm.md, and the Research Foundation section above. Verify the ID resolves at https://arxiv.org/abs/2603.29957 and fix all references if it doesn't.

Verified real (May 2026): "Think Anywhere in Code Generation" by Xue Jiang, Tianyu Zhang, Ge Li et al., submitted March 31, 2026, revised April 27, 2026 (v3), cs.SE. All existing citations are correct.

LOW — .claude/ false claims in tool-agnostic-agent-infra.md (resolved)

The file docs/projects/tool-agnostic-agent-infra.md no longer exists — already deleted. No action needed.