Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)

- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config

2026-05-22 13:13:43 -04:00


Agent Infrastructure

Shared agent infrastructure for VS Code Copilot and OpenCode — brainstorm agent, research agent, nudge instructions, hooks, skills, and MCP server. Project-specific overlays live in each project's .agents/ directory.

See also: docs/research/ai-coding-best-practices.md — research synthesis covering the Prompt/Context/Harness taxonomy, failure modes, enforcement hierarchy, small-model harness patterns, and all primary-source citations that underpin the design decisions here.

Current State

Architecture Overview

The infrastructure is tool-agnostic: canonical sources live in .agents/ and a generator (npm run generate:agents) distributes them to .github/agents/, .github/skills/, .opencode/agents/, .opencode/skills/. Edit the .agents/ sources; never edit the generated output directories (they are .gitignored and blocked by pre-tool-use policy).

.agents/
├── AGENTS.md                        # Root design doc + enforcement hierarchy
├── agents/                          # Agent definitions (canonical)
│   ├── brainstorm.md
│   ├── research.md
│   └── build-local.md               # OmniCoder 9B via Ollama
├── hooks/                           # Shared bash hooks (delegated by all harnesses)
│   ├── pre-tool-use.sh              # Hard blocks (terminal cmds + file-path policies)
│   ├── post-tool-use.sh             # Self-check counter + methodology reminders
│   ├── session-start.sh             # Inject project state at session start
│   ├── user-prompt-submit.sh        # Per-turn nudge detection + task capture
│   ├── pre-compact.sh               # Export state before context summarization
│   └── stop.sh                      # Session-end verification
└── skills/
    └── research/SKILL.md            # Research methodology (any agent can load)

Generated output (do not edit — regenerated by npm run generate:agents):

.github/agents/ — VS Code Copilot agent files
.github/skills/ — VS Code Copilot skill files
.opencode/agents/ — OpenCode agent files
.opencode/skills/ — OpenCode skill files

Harness integration:

VS Code Copilot: .github/agent-support.json — maps 4 hook events to the shared bash scripts in .agents/hooks/
OpenCode: .opencode/plugins/agent-support.ts — TypeScript plugin that shells out to the same bash scripts

Brainstorm Agent

4-phase workflow: Quick Frame → Diverge → Converge → Capture & Hand Off
6 techniques: Rapid Ideation, SCAMPER, Worst Possible Idea, How Might We, Inversion/Pre-mortem, Constraint Flipping
Counterbalances Opus 4.6's tendency to overthink
Phase 2 includes "push past the obvious" nudge (Zhao et al. 2024: LLMs fall short on originality, excel at elaboration — first ideas are "average")
Phase 4 routes to @research for investigation and to the default agent for implementation
Creates exploration files at docs/explorations/<name>.md and session memory notes

Research Agent

Two orientations that compose recursively:
- Understand (Grounded Theory): open coding → constant comparison → axial coding → memo → saturation check
- Diagnose (Strong Inference + Satisficing): 5-factor triage gates between satisficing (low risk) and full falsification (high risk)
5-factor triage: reversibility, blast radius, confidence, novelty, time cost
Timing awareness: time prefix on unknown commands, session/repo memory for baselines, timing feeds into triage decisions
Investigation files at docs/explorations/<name>.md
Techniques reference: Five Whys, Delta Debugging, Rubber Duck
Delegates evidence-gathering to Explore subagent, keeps analytical thinking local

Nudge Instructions

Brainstorm nudge: triggers on hesitation/overthinking language ('wait', 'actually', 'hmm', 'overcomplicating', etc.)
Research nudge: triggers on debugging/investigation language ('why is this broken', 'how does this work', 'root cause', etc.)
Both are non-intrusive, single-sentence suggestions that fire only once per topic
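The trigger logic above can be sketched as a small pure function. This is a minimal illustration, not the hook's actual code: the phrase lists are limited to the examples quoted above, and the `detectNudge` name, the `alreadyFired` bookkeeping, and checking research phrases before brainstorm phrases are all assumptions.

```typescript
// Hypothetical sketch of nudge detection. Patterns cover only the example
// phrases listed above; the real lists are longer ("etc.").
type NudgeKind = "brainstorm" | "research";

const NUDGE_PATTERNS: { kind: NudgeKind; pattern: RegExp }[] = [
  // Assumption: investigation language wins over hesitation language, so
  // "wait, why is this broken" routes to research rather than brainstorm.
  { kind: "research", pattern: /why is this broken|how does this work|root cause/i },
  { kind: "brainstorm", pattern: /\b(wait|actually|hmm|overcomplicating)\b/i },
];

// Returns the nudge to suggest, or null. `alreadyFired` is mutated so that
// each nudge kind fires at most once per topic.
function detectNudge(prompt: string, alreadyFired: string[], topic: string): NudgeKind | null {
  for (const { kind, pattern } of NUDGE_PATTERNS) {
    const key = kind + ":" + topic;
    if (pattern.test(prompt) && alreadyFired.indexOf(key) === -1) {
      alreadyFired.push(key);
      return kind;
    }
  }
  return null;
}
```

Keeping the fired-set outside the function makes the once-per-topic rule easy to persist in whatever session state the hook already writes.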

Tool Mapping (Copilot ↔ OpenCode)

| Copilot | OpenCode equivalent |
| --- | --- |
| `AGENTS.md` (root + nested) | `AGENTS.md` (root, native; nested via `instructions` glob in `opencode.json`) |
| `.github/agents/*.agent.md` | `.opencode/agents/*.md` (frontmatter: `description`, `mode`, `model`, `temperature`, `permission`) |
| `.github/skills/<name>/SKILL.md` | `.opencode/skills/<name>/SKILL.md`; also reads `.agents/skills/` and `.claude/skills/` |
| `.github/instructions/*.instructions.md` (`applyTo`) | No direct equivalent; fold into AGENTS.md stubs or `instructions` glob |
| `.github/hooks/*.sh` (JSON-configured shell) | `.opencode/plugins/*.ts` (TS modules, event-driven); shells out via Bun's `$` |
| `runSubagent` / `Explore` agent | Built-in `general` and `explore` subagents; `@`-mention syntax |
| `vscode_askQuestions` | No equivalent; OpenCode uses the agent's natural turn-taking |

OpenCode plugin event mapping:

| Copilot hook | OpenCode event |
| --- | --- |
| `SessionStart` | `session.created` |
| `PreToolUse` | `tool.execute.before` |
| `PostToolUse` | `tool.execute.after` |
| `PreCompact` | `experimental.session.compacting` |
| `Stop` | `session.idle` (closest equivalent) |
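As a rough sketch, the event mapping can be held as a lookup table so a plugin resolves which shared script to run for a given OpenCode event. The table values come from the mapping above; the helper names and the PascalCase-to-kebab-case file naming rule are assumptions inferred from the script names under `.agents/hooks/`.

```typescript
// Mapping copied from the table above.
const COPILOT_TO_OPENCODE: Record<string, string> = {
  SessionStart: "session.created",
  PreToolUse: "tool.execute.before",
  PostToolUse: "tool.execute.after",
  PreCompact: "experimental.session.compacting",
  Stop: "session.idle", // closest equivalent, not an exact match
};

// Hypothetical helper: reverse lookup from an OpenCode event to the hook name.
function hookForOpenCodeEvent(eventName: string): string | undefined {
  for (const hook of Object.keys(COPILOT_TO_OPENCODE)) {
    if (COPILOT_TO_OPENCODE[hook] === eventName) return hook;
  }
  return undefined;
}

// Hypothetical helper: "PreToolUse" -> "pre-tool-use.sh" (assumed naming rule).
function scriptForHook(hook: string): string {
  return hook.replace(/([a-z])([A-Z])/g, "$1-$2").toLowerCase() + ".sh";
}
```

Events with no Copilot counterpart, such as `session.deleted`, return `undefined` and can be ignored by the caller.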

Research Foundation

For full research depth, citations, and failure-mode analysis, see docs/research/ai-coding-best-practices.md. The list below records the specific papers and frameworks that shaped the design decisions in this project.

Methodologies and papers that informed the design:

Grounded Theory (Glaser & Strauss): build understanding from data, not assumptions. Applied to code-reading in the Understand orientation.
Strong Inference (Platt 1964): multiple competing hypotheses → crucial experiments → eliminate. Applied to the Diagnose orientation.
Satisficing (Simon 1956): accept "good enough" when optimization cost exceeds benefit. Gates between cheap confirmation and expensive falsification.
Dual Process Theory (Kahneman): System 1 (fast, pattern-matching) vs System 2 (slow, analytical). System 1 more accurate in familiar domains. Informs the triage decision.
Zhao et al. 2024 (arXiv): LLMs fall short on originality, excel at elaboration. First ideas are "average." Informs the brainstorm agent's "push past the obvious" nudge.
"Lost in the Middle" (Liu et al. 2023): LLMs attend best to beginning/end of context. Informs hook design — inject at context tail for high attention.
Delta Debugging: binary search the change space between passing/failing cases. Logic behind git bisect.
Five Whys: iterative causal chain tracing. Starting point for hypothesis generation, not sole diagnostic method.
Ronacher "Agent Design Is Still Hard": reinforce methodology after every tool call at context tail. Structural injection outperforms relying on instructions in the system prompt.
Think-Anywhere (Jiang et al. arXiv:2603.29957, Mar 2026, Peking U + Tongyi Lab): LLMs trained to invoke <think> blocks at any token position during code generation, not just upfront. SOTA on LeetCode/LiveCodeBench with fewer total tokens. The motivating insight: a model can plan correctly at the start but introduce an off-by-one bug mid-implementation — only mid-loop reasoning catches it. Applied here: the research agent's investigation checklist includes "Re-evaluate hypothesis at every tool-call boundary." For Claude 4 models, interleaved thinking makes this automatic. Complements Plan-and-Solve: upfront decomposition where structure is clear, mid-execution re-evaluation when intermediate results change what to do next.
Anthropic interleaved thinking (Claude 4 + adaptive thinking): Claude Sonnet 4.6+ and Opus 4.6+ automatically insert thinking blocks between tool calls. No separate implementation needed — agent instruction design drives it. The research agent's "Re-evaluate at every tool-call boundary" instruction explicitly activates this behavior.
Prompt/Context/Harness framework (Alibaba Cloud, Apr 2026): Names the three engineering layers. Prompt = task expression (stateless). Context = what the model sees (AGENTS.md, skills, tools — engineering target is progressive disclosure). Harness = system constraints + verification loops (hooks, permission gates, sub-agent isolation). Diagnostic map: wrong output → Prompt; hallucinated fact → Context; wrong tool selected → Context (fix description); task drift → Harness (sub-agent boundary); destructive action → Harness (permission hook). LangChain improved Terminal Bench 2.0 from 52.8% → 66.5% by changing Harness alone.
Context engineering (Rajasekaran et al., Anthropic, Sep 2025): Formally distinguishes context engineering from prompt engineering. Key principles: (a) just-in-time context — agents hold references and load on demand, not upfront; (b) structured note-taking (NOTES.md) as external working memory for long sequential tasks; (c) every new token depletes attention budget — validates the <60-line AGENTS.md ceiling; (d) compaction strategy: maximize recall first, then improve precision.
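The diagnostic map from the Prompt/Context/Harness entry can be written down as a lookup table, which makes the routing explicit. The symptom-to-layer pairs are taken from that entry; the type names and the `fix` field are illustrative only.

```typescript
type Layer = "Prompt" | "Context" | "Harness";

type Symptom =
  | "wrong output"
  | "hallucinated fact"
  | "wrong tool selected"
  | "task drift"
  | "destructive action";

// Symptom -> layer to fix, with the remedy named in the entry where one is given.
const DIAGNOSTIC_MAP: Record<Symptom, { layer: Layer; fix?: string }> = {
  "wrong output": { layer: "Prompt" },
  "hallucinated fact": { layer: "Context" },
  "wrong tool selected": { layer: "Context", fix: "fix the tool description" },
  "task drift": { layer: "Harness", fix: "sub-agent boundary" },
  "destructive action": { layer: "Harness", fix: "permission hook" },
};

function layerFor(symptom: Symptom): Layer {
  return DIAGNOSTIC_MAP[symptom].layer;
}
```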

MCP Server Lifecycle Hooks — Protocol Status (May 2026)

The .agents/mcp/ server exposes prompts and tools to agents via the MCP protocol. A recurring question: can the MCP server react to session lifecycle events (session start/end, tool-use boundaries)?

Current protocol state

No lifecycle hooks exist in the MCP protocol. The spec defines three phases only: initialize → operation → shutdown. There is no session.created, post-tool-call, or session.ended notification. This gap is why session awareness currently lives in the OpenCode plugin layer (.opencode/plugins/agent-support.ts) rather than the MCP server — OpenCode exposes session.created, session.idle, session.compacted, session.deleted, and tool.execute.before/after events natively to plugins.

Active work in the MCP spec

SEP-2624: Interceptors for the Model Context Protocol (PR #2624)

The most organized effort. Supersedes SEP-1763 (closed as completed). Proposes Interceptors as a new MCP primitive — two types: validators (inspect, return pass/fail) and mutators (transform context payloads) — discoverable and invocable via interceptors/list and interceptor/invoke JSON-RPC methods. These fire at protocol-level operation events: tools/call, prompts/get, resources/read, sampling/createMessage, elicitation/create. Not session-start/stop hooks, but before/after wrapping for every operation.

There is now a formal Interceptors Working Group (Bloomberg + Saxo Bank engineers, biweekly cadence). Reference implementations in progress for Go and C# SDKs. Experimental repo: modelcontextprotocol/experimental-ext-interceptors. Charter: modelcontextprotocol.io/community/interceptors/charter.

SEP-2282: Server-Declared Behavioural Hooks (PR #2282)

Smaller, separate open PR. Proposes servers declare context injections in ServerCapabilities — text injected into the agent's context at client-side lifecycle events (session start, post-tool-use, session end). The contract is "here's context the model should have at this moment," not code execution. More directly analogous to our OpenCode session.created / session.idle patterns. Currently unsponsored — needs a maintainer to pick it up.

What to watch

Primary: PR #2624 + experimental-ext-interceptors repo
Secondary: PR #2282 (closest to session-lifecycle hooks)
Label filter: SEP label on the modelcontextprotocol repo
Milestone: 2026-06-30-RC is the next spec revision window

Implication for this project

Until interceptors land in a shipping spec version and the TypeScript SDK, the session lifecycle pattern stays at the OpenCode plugin layer. When SEP-2282 or an equivalent lands, the MCP server could self-register context injection hooks during initialize, removing the need for tool-specific plugin code.

Model Scale Profiles

Different model sizes fail in different ways, so each needs its own infrastructure strategy and mitigations.

Large-scale API models (Claude Sonnet / Opus)

Primary failure modes: overthinking, sycophancy, verbosity, tendency to add unrequested features or comments.

Infrastructure strategy:

Advisory methodology + structural reinforcement (hooks, circuit breakers)
PostToolUse self-check nudges every ~15 calls
PreToolUse hard blocks for high-risk operations
Subagent delegation for isolated tasks (parent Opus → child Sonnet/Haiku)

Smaller-scale local models (OmniCoder 9B via Ollama)

Primary failure modes (different from "low reasoning" — OmniCoder uses Qwen3 thinking blocks natively):

Narrower training distribution (Python/JS heavy)
Quantization degradation: JSON schema compliance drops as context fills
Tool-call history is the primary context consumer — responses must be truncated aggressively
Instruction drift: fewer attention heads (32 vs 64 in 32B) means system prompt recall degrades faster

Infrastructure strategy:

PostToolUse response truncation at ~1500 tokens (plugin layer, not bash hook)
PreToolUse JSON validation with schema-specific error messages
Context pressure injection at ≥70% fill (~22K/32K tokens)
steps: 20 cap + ask permission gates for natural checkpoints
explore subagent delegation to reduce context pressure on the main agent
NOTES.md working memory pattern enforced in agent body
No web tool — keeps context lean
Reasoning guidance: "Hold references; load on demand" explicit in agent body
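A minimal sketch of the context pressure check, assuming a 32,768-token window and a 70% threshold as described above (70% of 32,768 is roughly 22.9K tokens). The 4-characters-per-token estimate and the warning wording are assumptions, not the plugin's actual values.

```typescript
const CONTEXT_LIMIT_TOKENS = 32768; // assumed window of the local model
const PRESSURE_THRESHOLD = 0.7;     // warn at >= 70% fill

// Assumption: a crude estimate of ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns a warning to inject once the fill ratio reaches the threshold, else null.
function contextPressureWarning(usedTokens: number, limit: number = CONTEXT_LIMIT_TOKENS): string | null {
  const fill = usedTokens / limit;
  if (fill < PRESSURE_THRESHOLD) return null;
  return (
    "CONTEXT PRESSURE: " + Math.round(fill * 100) + "% of " + limit +
    " tokens used. Write findings to NOTES.md and delegate further exploration."
  );
}
```

Passing the limit as a parameter keeps the threshold easy to recalibrate if the local model changes.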

OmniCoder 2 Orchestration — Pending Work

Full historical rationale and audit findings were maintained in docs/projects/local-ai-orchestration.md (deleted May 2026 after merge). The plan used an orchestrator-workers pattern with structural edit: deny enforcement on the orchestrator. All OpenCode config values verified against opencode.ai/docs (May 2026).

Goals

All agents run on ollama/arch-omni2-9b — no cloud fallback
User can type vague prompts; the system decomposes and delegates automatically
Context windows are isolated per subagent (no shared state bleed)
Changes scale forward: switching to cloud means changing model strings, not architecture

Pending Changes

Quick wins — under 5 minutes each, no testing required

- [CRITICAL] Fix <tool\*call> typo in omnicoder2.modelfile — markdown-escape artifact; malformed opening tag paired with correct closing tag. Highest-leverage change; everything below depends on reliable tool-call JSON.
- Mark canonical/deprecated modelfiles — # CANONICAL header on omnicoder2.modelfile; # DEPRECATED on omnicoder.modelfile; omnicoder-v2.modelfile.template deleted (was dead code — v2 now served from HuggingFace path).
- Add compaction.reserved: 3000 to opencode.json — default 10,000 fires compaction too early given ~8–12K baseline context.
- Fix pre-compact.sh prettier call — removes npx prettier which violates pre-tool-use Policy 1 (self-violating policy).
- MCP server error handling — wrap server.connect(transport) in try/catch with stderr + process.exit(1).

Short session — 15–30 minutes each, bounded scope

- Fix stop.sh JSON escaping — replace sed-based escaping with printf '%b' | node JSON.stringify pattern used in every other hook.
- Per-session PostToolUse counter — repo-scoped path /tmp/.opencode-tool-count-<repo-hash> (derived from REPO_ROOT via md5sum); prevents cross-repo contamination; session-start.sh resets it at session begin.
- Shrink compaction prompt to ~120 words (in .opencode/plugins/agent-support.ts) — shorter instructions free bandwidth for the 9B to actually summarize.
- Update .agents/agents/build-local.md for v2 — pagination 100 → 50 lines; rule 4 now says "recipient not dispatcher"; rule 7 scope-check says "tell the user, do not self-decompose".
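To illustrate why the hooks route messages through `JSON.stringify` rather than hand-rolled escaping, here is a small sketch. The `additionalContext` field name is hypothetical; the point is only that quotes, backslashes, and newlines survive a round trip.

```typescript
// Hypothetical payload builder: JSON.stringify handles every character that
// sed-style escaping tends to miss (quotes, backslashes, control characters).
function buildHookPayload(message: string): string {
  return JSON.stringify({ additionalContext: message });
}

// Round-trip helper used to confirm that nothing was lost or double-escaped.
function readHookPayload(payload: string): string {
  return JSON.parse(payload).additionalContext;
}
```

A message such as `line1\nsaid "hi" \ done` comes back byte-for-byte identical after `readHookPayload(buildHookPayload(msg))`, and the serialized payload stays on a single line.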

Depends on orchestrator being proven first

- Trim root AGENTS.md to ~60 lines — reduced from 435 lines to 45 lines; all architecture rationale, code examples, quick task table, and project context removed; cross-cutting rules and quality gate preserved (May 2026).

PostToolUse weighted counter — reads (read_file, grep, list) +0.25; writes/shell +1; keeps 15-call SELF-CHECK from firing mid-investigation sweep. Depends on #7 (per-session counter) first.

**Implementation** (`.agents/hooks/post-tool-use.sh`): bash has no
float arithmetic, so scale to integers: reads +1, writes/shell +4,
threshold 60 (equivalent to 15 effective write-units). Read-class
tools: `read_file`, `grep_search`, `list_dir`, `file_search`,
`semantic_search`, `explore_subagent`. Write/shell-class: all
`*_string_in_file`, `create_file`, `run_in_terminal`. Replace the
single `COUNT=$((COUNT + 1))` with a `case "$TOOL_NAME"` block that
does `COUNT=$((COUNT + 1))` for reads and `COUNT=$((COUNT + 4))` for
writes/shell. Replace the self-check condition
`(( COUNT % 15 == 0 ))` with a threshold test, `(( COUNT >= 60 ))`
followed by `COUNT=$((COUNT - 60))`. A plain `(( COUNT % 60 == 0 ))`
would miss the check whenever a +4 step jumps over a multiple of 60
(for example 58 → 62).
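The weighting logic can be sketched in TypeScript to check the arithmetic before porting it to bash. Because a +4 increment can step over an exact multiple of 60, this sketch fires when the count crosses the threshold and carries the remainder forward instead of testing `% 60 == 0`. Treating every tool outside the read list as write-class is an assumption.

```typescript
const READ_TOOLS = [
  "read_file", "grep_search", "list_dir",
  "file_search", "semantic_search", "explore_subagent",
];
const READ_WEIGHT = 1;           // 0.25 effective units, scaled by 4
const WRITE_WEIGHT = 4;          // 1 effective unit, scaled by 4
const SELF_CHECK_THRESHOLD = 60; // 15 effective write-units

function toolWeight(toolName: string): number {
  // Assumption: anything not in the read list counts as write/shell.
  return READ_TOOLS.indexOf(toolName) !== -1 ? READ_WEIGHT : WRITE_WEIGHT;
}

// Returns the new persisted count and whether SELF-CHECK should fire now.
function recordToolCall(count: number, toolName: string): { count: number; selfCheck: boolean } {
  const next = count + toolWeight(toolName);
  if (next >= SELF_CHECK_THRESHOLD) {
    return { count: next - SELF_CHECK_THRESHOLD, selfCheck: true };
  }
  return { count: next, selfCheck: false };
}
```

With these weights, 15 consecutive writes trigger one self-check, while a sweep of reads needs 60 calls to do the same.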

PostToolUse reminder priority filter — emit at most 2 reminders per tool call; priority: SELF-CHECK > DEBUGGING > path-scoped > tool-specific. Depends on #11.

**Implementation** (`.agents/hooks/post-tool-use.sh`): replace the
current single `context` string accumulator with an indexed array
`reminders=()`. Each block appends `reminders+=("$msg")` in priority
order (SELF-CHECK first, DEBUGGING second, BFF/QUALITY GATE third,
RENAME fourth). At output time: join only the first 2 elements.
Append with `\n\n` separator. Blocks that didn't fire don't append,
so the cap is natural.
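The priority filter amounts to "append in priority order, keep the first two". A minimal sketch of that shape follows; the `FiredReminders` field names are illustrative, and the order follows the list above.

```typescript
const MAX_REMINDERS = 2;

// Each field holds the reminder text if that block fired this call.
interface FiredReminders {
  selfCheck?: string;
  debugging?: string;
  qualityGate?: string;
  rename?: string;
}

function buildReminderContext(fired: FiredReminders): string {
  const reminders: string[] = [];
  // Append in priority order; blocks that did not fire append nothing.
  if (fired.selfCheck) reminders.push(fired.selfCheck);
  if (fired.debugging) reminders.push(fired.debugging);
  if (fired.qualityGate) reminders.push(fired.qualityGate);
  if (fired.rename) reminders.push(fired.rename);
  // Keep only the two highest-priority reminders.
  return reminders.slice(0, MAX_REMINDERS).join("\n\n");
}
```

Dropped reminders are not queued; they reappear only if their condition still holds on a later call.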

Broaden PostToolUse truncation to all ollama/ agents (.opencode/plugins/agent-support.ts); differentiate limit: orchestrator 2,500 tokens vs workers 1,500. Minor until orchestrator exists.

**Implementation**: rename `BUILD_LOCAL_MAX_RESPONSE_TOKENS` →
`LOCAL_WORKER_MAX_TOKENS = 1500`; add
`LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500`. In `tool.execute.after`, the
existing `isLocalAgent` check covers all `ollama/` agents via
`input.model.startsWith('ollama/')`. Add a second check:
`input.agent === 'local-orchestrator'` → use orchestrator limit, else
worker limit. The `agent` field is available in `tool.execute.after`
(confirmed working for `build-local`).
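A sketch of the limit selection and truncation, under stated assumptions: the `ToolAfterInput` shape only mirrors the `agent` and `model` fields named above, and tokens are approximated as 4 characters each because the plugin's real token counting is not shown here.

```typescript
const LOCAL_WORKER_MAX_TOKENS = 1500;
const LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500;
const CHARS_PER_TOKEN = 4; // assumption: crude token estimate

// Hypothetical subset of the tool.execute.after input.
interface ToolAfterInput {
  agent: string;
  model: string;
}

// Returns the token cap for local agents, or null when no truncation applies.
function maxResponseTokens(input: ToolAfterInput): number | null {
  const isLocalAgent = input.model.indexOf("ollama/") === 0; // same test as startsWith('ollama/')
  if (!isLocalAgent) return null;
  return input.agent === "local-orchestrator"
    ? LOCAL_ORCHESTRATOR_MAX_TOKENS
    : LOCAL_WORKER_MAX_TOKENS;
}

function truncateResponse(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[truncated: response exceeded ~" + maxTokens + " tokens]";
}
```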

Create .agents/agents/local-orchestrator.md — primary agent with edit: deny, write: deny, bash: deny; whitelist task to build-local, research, brainstorm only.

**Implementation**: new file modeled on `build-local.md`. Role: receive
high-level goal, decompose into bounded subtasks, show decomposition to
user before dispatching, delegate via `task` subagent. Permission
block in `opencode.json` `agent.local-orchestrator`:
`{ "edit": "deny", "write": "deny", "bash": "deny" }`. Agent body
rules: (1) read project root `AGENTS.md` first; (2) produce a task
list and confirm with user before dispatching; (3) one `task` call per
subtask, wait for result; (4) never attempt to edit files directly —
if a subtask requires context the worker needs, inject it via the
`task` prompt, not by reading files yourself; (5) after all subtasks,
report summary to user.

- ~~Set default_agent: "local-orchestrator" in opencode.json~~ — Done May 2026. Key is default_agent (snake_case, confirmed from opencode.ai/config.json schema). local-orchestrator has mode: all so it qualifies as a primary agent.

Done

~~Soften opus-deep.modelfile directive~~ — file deleted (May 2026); DeepSeek R1 available online when needed; OmniCoder 2 is the sole local model.

Known Tradeoffs

| Tradeoff | Impact | Mitigation |
| --- | --- | --- |
| Instructions glob trimmed to root `AGENTS.md` only | Agents miss project-specific patterns for subdirectories unless they read nested `AGENTS.md` explicitly | Add reminder in orchestrator + build-local agent body: "check nested `AGENTS.md` before working in subdirectories" |
| Same model for all roles | Orchestrator, worker, compaction agent are all same weights with different prompts | Structural `edit: deny` is the safety net; circuit breakers limit runaway loops |
| No cloud fallback | If task is too complex for 9B, no escalation path | Orchestrator includes "ask the user for direction" rule; user can switch to Copilot |
| Latency | Sequential dispatch: orchestrator decomposes → build-local runs → returns. ~2× wall time vs. direct build-local | Acceptable for local dev; no VRAM multiplier since Ollama keeps weights hot |
| Reminder-stacking cap | 2-per-call priority filter (pending work above) drops lower-priority warnings | Skipped reminders fire on next call if condition holds |

Cloud Migration Path

When ready to add a cloud model, only opencode.json changes:

{
  "model": "ollama/arch-omni2-9b",
  "agent": {
    "local-orchestrator": {
      "model": "anthropic/claude-haiku-4-5"
    }
  }
}

Schema verified against opencode.ai/docs/agents/ (May 2026). The tools key inside agent configs is deprecated in favour of permission — the orchestrator definition uses permission, so it is current. The agent.{name}.model key is the correct per-agent override mechanism.

Ecosystem Gap — Contextual AGENTS.md Injection

During local AI work (May 2026) we hit a fundamental limitation: OpenCode's instructions glob in opencode.json loads all matched files upfront into every session. For a 9B local model with a 32K context window, loading all of apps/*/AGENTS.md and packages/*/AGENTS.md at startup consumes ~30–40% of the context budget before the first message, triggering early compaction and degrading quality.

The correct behaviour — injecting only the AGENTS.md relevant to the file being edited — does not exist natively in OpenCode or its plugin ecosystem. The closest community plugin (opencode-skillful, 295 stars) is archived as of Feb 2026 and still requires the model to explicitly call skill_find/skill_use; it provides no path-triggered structural injection.

Open tasks

- Assess: is filling this ecosystem gap worth the effort? — Before building a contextual-injection plugin, evaluate: (a) Is OpenCode actively used for serious local AI coding work, or is the community primarily cloud-model users for whom context cost is irrelevant? (b) Are there better local AI coding stacks (e.g. Aider + litellm, Cursor local mode, VS Code Copilot + Ollama) where this problem is already solved? (c) Is the tool.execute.before event stable enough to build on? Target: 30-minute research session, concrete go/no-go recommendation.
- Review + write up our issues and fixes as an ecosystem contribution — If the gap is worth filling: document the context-bleed problem, the early-compaction root cause, our hook-based mitigation, and the remaining structural gap. Publish as a GitHub issue on the OpenCode repo and/or an npm plugin (opencode-contextual-rules?) implementing tool.execute.before path-triggered AGENTS.md injection. Depends on #16 go/no-go.
- ~~Trim .agents/AGENTS.md~~ — Done May 2026. Condensed from 12,584 → 10,507 bytes (43 lines removed). Trimmed: Hook Architecture Principle block (redirected to item 22 in project doc), Deferred Loading example + "why not" paragraph, session-start/stop hook prose, outdated generate-agents.ts references in Skills/Agents sections. Agent body files updated to prompt-body-only convention (see items 25/26).
- ~~Block bash bypass of read pagination~~ — Done May 2026. Added Policy 14 to pre-tool-use.sh: blocks cat/head/tail/jq reads of apps/*/package.json and packages/*/package.json. Scope limited to package.json (confirmed live bypass vector); general .ts/.md bash reads are not yet blocked (lower-urgency gap). Pattern verified with Node.js unit test — exact bypass command cat apps/api/package.json | jq is caught by P1.
- Improve explore-first scope detection — Policy 14 blocks manage_todo_list with ≥4 items, but OmniCoder sometimes starts with Explore/find before planning, bypassing the check. Options: (a) block explore_subagent when the query looks like a multi-file discovery sweep (glob patterns for source files across multiple dirs); (b) add a pre-tool-use check on run_in_terminal that denies find commands spanning the whole repo when the task hasn't been scoped yet; (c) rely on the todo-list check firing when planning eventually happens (current behavior — catches it late but still before edits start).
- ~~Remove debug logging from plugin after verified cycle~~ — Done May 2026. Removed the full-input dump block from tool.execute.before in plugin.ts (/tmp/plugin-debug.jsonl appender). Guards verified via opencode export session transcript inspection — no longer need the dump file. Hook error logger (/tmp/plugin-hook-errors.log) kept as it only fires on failures, not every call.

Refactor hook scripts to be platform-agnostic — currently pre-tool-use.sh parses Copilot-specific JSON and outputs Copilot-specific permissionDecision JSON. plugin.ts implements duplicate guards inline rather than calling the script. This means OpenCode and Copilot guards can drift (confirmed May 2026: Policy 14 in pre-tool-use.sh had no effect on OpenCode bash tool calls).

**Design target**: scripts accept normalized env vars (`TOOL_NAME`,
`COMMAND`, `FILE_PATH`), exit non-zero with plain-text denial reason
on stdout. Callers normalize input and translate output to their
native denial format. Tracked in `.agents/AGENTS.md` Hook Architecture
Principle section.

**Audit required first**: review all hook scripts for Copilot-specific
assumptions before refactoring.
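One way to picture the design target: the script sees only normalized variables and reports a plain exit code plus a reason, and each caller translates that into its own denial format. Everything below is a sketch. The `permissionDecisionReason` field and the idea that the OpenCode side denies by raising an `Error` are assumptions to verify during the audit.

```typescript
// What a normalized hook invocation yields, per the design target.
interface HookResult {
  exitCode: number;
  stdout: string; // plain-text denial reason when exitCode is non-zero
}

// Environment handed to the script by any caller.
function hookEnv(toolName: string, command: string, filePath: string): Record<string, string> {
  return { TOOL_NAME: toolName, COMMAND: command, FILE_PATH: filePath };
}

// Copilot-side translation (field names beyond permissionDecision are assumed).
function toCopilotDecision(result: HookResult): { permissionDecision: string; permissionDecisionReason?: string } {
  if (result.exitCode === 0) return { permissionDecision: "allow" };
  return { permissionDecision: "deny", permissionDecisionReason: result.stdout.trim() };
}

// OpenCode-side translation (assumed: the plugin throws the returned Error to block).
function toOpenCodeError(result: HookResult): Error | null {
  return result.exitCode === 0 ? null : new Error(result.stdout.trim());
}
```

With this split, a new policy added to the script reaches both harnesses without touching either translator.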

Question-drift marker in user-prompt-submit.sh — when the model has committed to a prior position and follow-up questions are being misread through that lens, prepend a disambiguation marker at the prompt tail. Detected pattern: model answers "no" or "not possible" in a prior turn → subsequent turns interpreted as defense of that position. See §2.1 ("Position-anchored priming") in the research doc.

**Implementation**: in `user-prompt-submit.sh`, read the last N turns
of `$TRANSCRIPT_PATH` (injected by OpenCode's native hook env) and
look for a prior committed "no/impossible/can't" response within the
last 3 model turns. If detected, append to `ADDITIONAL_CONTEXT`:
`CURRENT QUESTION (answer only this — not the prior exchange): [prompt
text]`. The key is repeating the user's exact question at the tail,
after the marker, to counteract lost-in-the-middle effects. Fallback
trigger: user prompt contains "that's not what I asked" / "you're
answering the wrong question" / "I said" → always inject marker
regardless of transcript scan.
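The detection rule can be prototyped as a pure function before wiring it to the transcript file. The regular expressions below are deliberately crude placeholders (an assumption, not the hook's patterns), and the transcript is passed in as an array of model turns rather than read from `$TRANSCRIPT_PATH`.

```typescript
// Hypothetical patterns for a committed negative answer and for user pushback.
const COMMITTED_NO = /\b(no|not possible|impossible|can't|cannot)\b/i;
const USER_PUSHBACK = /that's not what I asked|you're answering the wrong question|\bI said\b/i;

// "\u2014" reproduces the dash in the marker text quoted above.
const MARKER_PREFIX = "CURRENT QUESTION (answer only this \u2014 not the prior exchange): ";

// Returns the text to append to ADDITIONAL_CONTEXT, or null when no drift is suspected.
function driftMarker(prompt: string, modelTurns: string[]): string | null {
  const lastThree = modelTurns.slice(-3);
  const committed = lastThree.some(function (turn) {
    return COMMITTED_NO.test(turn);
  });
  if (committed || USER_PUSHBACK.test(prompt)) {
    // Repeat the user's exact question after the marker, at the tail.
    return MARKER_PREFIX + prompt;
  }
  return null;
}
```

A bare `no` match will produce false positives (for example "no problem"), so a real implementation would likely want a tighter pattern.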

- ~~Review all custom agent files for local-model-specific framing~~ — Done May 2026. build-local.md reframed: dropped "OmniCoder", "9B", "Ollama", "Qwen3 thinking blocks", "32K tokens total"; replaced with model-agnostic equivalents. research.md and brainstorm.md verified clean — no model/provider mentions. local-orchestrator.md was fixed earlier this session. All four agent body files are now model-agnostic.

Failure-mode routing in SELF-CHECK — when the periodic SELF-CHECK fires in post-tool-use.sh, if a recent terminal failure or test failure is also present in the same turn, classify the failure type and inject the matched intervention rather than generic "step back." Reference: failure-mode routing table in §3.5 of the research doc.

**Implementation**: in the SELF-CHECK block, if `context` already
contains `DEBUGGING REMINDER` (i.e., test/terminal failure co-occurred
this turn), append a classification hint:
`FAILURE TYPE HINT: If this is a test/build failure → Reflexion loop
(fix based on test output). If convention violation → grep for the
pattern and inject a canonical example. If wrong file/directory → stop
and re-read the project structure. Do not default to "try harder."`.
Low implementation cost — pure text append with a conditional on
`$context`.
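The conditional append is small enough to show in full. This sketch mirrors the bash logic in TypeScript for readability; the function name and the `selfCheckFired` flag are illustrative.

```typescript
const FAILURE_TYPE_HINT =
  "FAILURE TYPE HINT: If this is a test/build failure → Reflexion loop " +
  "(fix based on test output). If convention violation → grep for the " +
  "pattern and inject a canonical example. If wrong file/directory → stop " +
  "and re-read the project structure. Do not default to \"try harder.\"";

// Appends the hint only when SELF-CHECK fired and a debugging reminder
// is already present in this turn's context.
function withFailureHint(context: string, selfCheckFired: boolean): string {
  if (selfCheckFired && context.indexOf("DEBUGGING REMINDER") !== -1) {
    return context + "\n\n" + FAILURE_TYPE_HINT;
  }
  return context;
}
```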

- ~~Audit agent .md files for OpenCode-specific frontmatter~~ — Done May 2026. Audit result: only local-orchestrator.md had OpenCode frontmatter keys (mode, model, permission). brainstorm.md, build-local.md, research.md were already plain markdown. Went with option (b): stripped mode/model/permission from local-orchestrator.md; moved mode: all into opencode.json (model + permission were already there). Kept description in frontmatter as it is neutral and self-documenting. Body files are now prompt-body only — valid in both OpenCode and Copilot.
- plugin.ts local-agent detection uses provider prefix, not agent name — tool.execute.after detects local agents via input.model.startsWith('ollama/'). This is provider-specific: if the model is served via a different backend (e.g. llama-server/, lmstudio/), truncation silently stops working. Fix: detect by agent name (input.agent.includes('build-local')) only, removing the ollama/ fallback. The input.agent field is available in tool.execute.after (confirmed May 2026).
- plugin.ts context pressure threshold is hardcoded to 32,768 tokens — CONTEXT_LIMIT_TOKENS = 32768 assumes OmniCoder 9B's context window. If the local model changes, the threshold silently drifts out of calibration. Options: (a) read from opencode.json model config if OpenCode exposes it to plugins; (b) make it a top-of-file constant with a comment to update when changing models; (c) accept the drift as low-severity (threshold is advisory only — context pressure warnings are informational, not blocking). Option (b) is the minimum; option (a) is ideal if OpenCode exposes model metadata to plugins.
- ~~Move permission out of local-orchestrator.md frontmatter~~ — Done May 2026 as part of item 25. mode: all added to opencode.json agent entry. model and permission were already in opencode.json. opencode.json is now the single source of truth for all runtime config; .md files are prompt-body only.

Testing & Regression

Research summary (May 2026): No pre-existing tool exactly fits this use case. Existing tools (RagaAI Catalyst, AgentEvalKit, agent-eval-arena, intent-eval-lab, j-rig-skill-binary-eval) focus on LLM output quality, hallucination detection, or cross-runtime behavior scoring — not config file structure or policy enforcement regression. The closest analogue is j-rig-skill-binary-eval (binary pass/fail criteria across 7 layers), which uses the same conceptual approach we'd want here. Our testing is bespoke by necessity: we're testing configuration files, shell scripts, and specific policy enforcement behaviors, not general LLM response quality.

Two layers of testing:

| Layer | What it tests | Cost | When to run |
| --- | --- | --- | --- |
| Config + policy unit tests | Schema validity, hook regex correctness | None (no model) | Always (CI, pre-commit) |
| CLI integration smoke tests | Actual enforcement via `opencode run` | Local model only | On-demand; local model must be running |

Cloud agents excluded from integration tests — opencode run with a cloud model (Copilot, Anthropic) incurs API costs and rate limits. Tests must detect the active model and skip if it's not a local provider.

Open tasks

Config + policy unit test suite — test config file structure and hook regex patterns without invoking any model. Implementation:

a. **`opencode.json` schema validation**: the file references
   `"$schema": "https://opencode.ai/config.json"` — validate it using
   `ajv` (already used in the monorepo) against the live schema or a
   cached copy. Catches permission typos, unknown agent keys,
   unsupported field values.

b. **Hook JSON structure validation**: validate
   `.agents/frameworks/github/hooks.json` and
   `.agents/frameworks/opencode/plugin.ts` (TypeScript, already
   type-checked). Write a schema for the hooks JSON format and run
   ajv on it.

c. **Hook policy regex unit tests**: extract every regex used in
   `pre-tool-use.sh` into a `tests/hooks.test.ts` file and run it
   with `vitest`. For each policy, define 2–3 input strings that
   SHOULD match and 2–3 that SHOULD NOT. Policy 14 already has an
   informal Node.js test from this session — formalize it.

d. **Agent `.md` frontmatter validator**: check that no agent file
   under `.agents/agents/` has frontmatter keys other than
   `description`. Catches regression when someone adds `model:` or
   `permission:` back to a body file.

**Suggested location**: `.agents/tests/` or root `test/agents/`.
**Stack**: vitest (already in monorepo), ajv (already available), Node
built-ins. No new dependencies needed.
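
As a rough illustration of item (d), the check could look something like the sketch below. It assumes frontmatter is delimited by `---` lines at the top of each file and that top-level keys start at column 0. `frontmatterKeys` and `disallowedKeys` are hypothetical names, not existing helpers in the repo.

```typescript
// Hypothetical helper: list the top-level frontmatter keys of an agent file.
// Assumption: frontmatter is the block between the first two `---` lines.
function frontmatterKeys(markdown: string): string[] {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return [];
  const end = lines.findIndex((l, i) => i > 0 && l.trim() === "---");
  if (end === -1) return [];

  const keys: string[] = [];
  for (const line of lines.slice(1, end)) {
    // Only column-0 keys count; indented lines belong to a nested value.
    const m = /^([A-Za-z_][\w-]*)\s*:/.exec(line);
    if (m) keys.push(m[1]);
  }
  return keys;
}

// Returns every key that is not on the allow-list (only `description` by default).
function disallowedKeys(markdown: string, allowed: string[] = ["description"]): string[] {
  return frontmatterKeys(markdown).filter((k) => !allowed.includes(k));
}
```

A vitest case could read each file under `.agents/agents/` and expect `disallowedKeys(content)` to be empty, so a reintroduced `model:` or `permission:` key fails the suite by name.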

CLI integration smoke tests (local model only) — use opencode run in non-interactive mode to verify enforcement is actually firing via the real runtime. These tests exercise the plugin + hook wiring end-to-end.

**Command shape**:
```
opencode run "prompt" --agent build-local \
  --model llama-server/arch-omni2-9b-native \
  --format json
```

**Assertions via `opencode export`**: after each run, export the
session with `opencode export <sessionID> 2>/dev/null` and parse the
JSON transcript. Assert on `parts` array: tool calls that SHOULD have
been blocked appear with error/denied status; tool calls that SHOULD
have passed completed normally.

**Test cases to start with** (all verified real enforcement gaps):
1. Attempt to `read` a nested `package.json` (e.g. `apps/api/package.json`) → BLOCKED by plugin
   package.json guard
2. Attempt to `read` a source file with no `limit` → BLOCKED by
   pagination guard
3. Attempt to `read` a source file with `limit: 51` → BLOCKED
4. Attempt to `read` a docs file with `limit: 501` → BLOCKED
5. Attempt to `read` a docs file with `limit: 50` → PASSES
6. Bash command `cat apps/api/package.json` → BLOCKED by pre-tool-use
   Policy 14 (substitute your project's equivalent nested package.json)
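
The `read` cases above (1 to 5) can be written down as a small expected-outcome table that the integration tests compare against. This is only a sketch of the expectations, not the guard implementation: the nested-`package.json` pattern, the docs/source classification, and the 500-line docs ceiling (inferred from case 4) are all assumptions, and `expectedReadVerdict` is a hypothetical name.

```typescript
type ReadInput = { filePath: string; limit?: number; offset?: number };
type Verdict = "BLOCKED" | "PASSES";

// Assumed pattern for illustration only; the real guard lives in plugin.ts.
const NESTED_PACKAGE_JSON = /(^|\/)(apps|packages)\/[^/]+\/package\.json$/;

// Assumed classification: anything under docs/ or ending in .md counts as docs.
const isDocs = (p: string): boolean => p.startsWith("docs/") || p.endsWith(".md");

function expectedReadVerdict(input: ReadInput): Verdict {
  if (NESTED_PACKAGE_JSON.test(input.filePath)) return "BLOCKED"; // case 1
  if (input.limit === undefined) return "BLOCKED"; // case 2: unbounded read
  const max = isDocs(input.filePath) ? 500 : 50; // cases 3-5
  return input.limit <= max ? "PASSES" : "BLOCKED";
}

// Table of the five read cases; file paths other than case 1 are placeholders.
const readCases: Array<[ReadInput, Verdict]> = [
  [{ filePath: "apps/api/package.json", limit: 50 }, "BLOCKED"],
  [{ filePath: "src/index.ts" }, "BLOCKED"],
  [{ filePath: "src/index.ts", limit: 51 }, "BLOCKED"],
  [{ filePath: "docs/guide.md", limit: 501 }, "BLOCKED"],
  [{ filePath: "docs/guide.md", limit: 50 }, "PASSES"],
];
```

Keeping the expectations in one table makes it easier to see which guard a failing smoke test was supposed to exercise.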

**Guard rail**: skip all tests if `llama-server` is not reachable at
`http://127.0.0.1:8080/v1`. Do not run against cloud models. Add
an env var `AGENT_INTEGRATION_TESTS=1` required to enable (off by
default, never runs in standard `npm test`).
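
One way the guard rail might be expressed is a single function that returns a skip reason (or `null` to run). This is a sketch: the `llama-server/` prefix as the marker of a local provider and the `/models` probe path are assumptions, and `integrationSkipReason` and `isReachable` are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: local providers are identified by model-ID prefix.
const LOCAL_PROVIDER_PREFIXES = ["llama-server/"];

// Returns a human-readable reason to skip, or null when tests may run.
function integrationSkipReason(env: Env, modelId: string, serverReachable: boolean): string | null {
  if (env.AGENT_INTEGRATION_TESTS !== "1") return "AGENT_INTEGRATION_TESTS=1 not set";
  if (!LOCAL_PROVIDER_PREFIXES.some((p) => modelId.startsWith(p))) {
    return `model ${modelId} is not a local provider`;
  }
  if (!serverReachable) return "llama-server not reachable";
  return null;
}

// Best-effort reachability probe; any network error counts as unreachable.
// Assumption: the server answers GET <baseUrl>/models (OpenAI-compatible API).
async function isReachable(baseUrl: string, timeoutMs = 2000): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/models`, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false;
  }
}
```

A test file could call `isReachable("http://127.0.0.1:8080/v1")` once in setup and skip the whole suite when `integrationSkipReason(process.env, model, reachable)` is non-null, which keeps the opt-in and the cloud exclusion in one place.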

**Suggested location**: `.agents/tests/integration/`.
**Stack**: Node.js test runner or vitest, `opencode` CLI in PATH.

Verified facts (May 2026)

OpenCode's read tool input schema is { filePath: string, limit?: number, offset?: number } — NOT startLine/endLine. Confirmed via plugin debug logging of real tool calls.
tool.execute.before input contains only { tool, sessionID, callID }. It does NOT include agent or model, so plugin-layer gating cannot filter by agent. Confirmed via plugin debug logging.
OpenCode has its own native hook system that calls pre-tool-use.sh directly for tools like run_in_terminal, replace_string_in_file, etc. This is completely separate from the plugin's runHook calls. The native hook payload includes timestamp, hook_event_name, session_id, transcript_path, tool_use_id, and cwd — fields the plugin never sends. The plugin runHook is a second call, layered on top.
Bun shell $ API does not have a .stdin() method. The correct API for piping stdin is $`cmd < ${Buffer.from(text)}`. Calling .stdin(text) throws TypeError: $`...`.stdin is not a function, which was caught by runHook's catch block and returned ''. This caused the plugin's runHook to silently no-op for every call with stdinJson since the plugin was first written: hook enforcement (all 12 policies) was never running via the plugin path. It only ran via OpenCode's native hook system for the tools OpenCode natively supports. Confirmed via /tmp/plugin-hook-errors.log.
The silent catch in runHook is dangerous. It masked the Bun .stdin() bug entirely. Always log hook failures to a debug file during development; remove only after enforcement is verified working.
Plugin-layer enforcement works for read after fixing the Bun stdin API. The read tool fires tool.execute.before in the plugin, which calls runHook('pre-tool-use.sh', ...) via < ${Buffer.from(...)}, which applies Policy 13 (50-line limit). Verified: bare read (no limit) → BLOCKED; read with limit: 50 → passes. (May 2026)
Plugin load failure: unescaped regex slashes caused silent syntax error. plugin-debug.jsonl was empty even after the Bun stdin fix because the plugin file itself failed to parse. Line 84 had /(^|/)(apps|packages)/[^/]+/... — forward slashes inside the regex literal were not escaped, producing a JS syntax error at parse time. Bun silently drops plugins that fail to import. Fixed to /(^|\/)(apps|packages)\/[^/]+\/.... The fix also corrected the pagination guard to use limit/offset (not startLine/endLine) and added an unbounded-read block (limit === undefined). All three guards verified working in a live session (May 2026).
Package.json read guard verified working. local-orchestrator attempting to read apps/*/package.json and packages/*/package.json → BLOCKED by plugin. Root package.json read correctly passes. (May 2026)
Policy 14 (manage_todo_list ≥ 4 items) catches some but not all broad task attempts. OmniCoder sometimes proceeds directly to Explore/find without calling manage_todo_list first, bypassing the policy. When it does plan with the todo tool before acting, the deny fires correctly.
OmniCoder comprehension failure: prompt ambiguity → wrong directory. Given "refactor the five hook files", OmniCoder ran a glob for *hook* files and found .husky/ hooks instead of .agents/hooks/. The correct files were in the grep output from the Explore subagent but were not selected. Root cause: the model lacks enough context about the repo layout to disambiguate "hook files" without explicit path guidance. Mitigation: be explicit in prompts ("the five .agents/hooks/*.sh files").
OpenCode agent permission config requires a .opencode/agents/<name>.md file. Without a matching markdown file, opencode.json's agent.<name>.permission config is silently ignored — the agent is unknown to OpenCode and runs as a nameless build-agent alias. The markdown file must exist in .opencode/agents/ (or ~/.config/opencode/agents/). Confirmed by test run where @local-orchestrator edited files despite permission.edit: "deny" in JSON config; fixed by creating .opencode/agents/local-orchestrator.md symlink. (May 2026)
"write" is NOT a valid OpenCode permission key. Use "edit" instead — it covers write, edit, and apply_patch tools. "write": "deny" is silently ignored. Valid top-level permission keys include: read, edit, glob, grep, list, bash, task, skill, lsp, question, webfetch, websearch, external_directory, doom_loop, todowrite. Confirmed from opencode.ai/docs/permissions (May 2026).
default_agent key is snake_case in opencode.json (not defaultAgent). Confirmed from opencode.ai/docs/config (May 2026).
tools: false is deprecated. The current approach for per-agent tool restriction is permission: { edit: "deny" }. The old tools: false still works but is documented as legacy. Confirmed from opencode.ai/docs/agents (May 2026).
Broken symlinks are silent. OpenCode does not error on a broken .opencode/agents/ symlink — it just skips the agent silently. The agent won't appear in opencode agent list and all opencode.json permission config for it is ignored. Always verify with cat .opencode/agents/<name>.md | head -5 (should print content, not a "No such file" error) and opencode agent list (agent should appear with correct deny rules). The correct symlink depth from .opencode/agents/ is ../../.agents/agents/<name>.md (two levels), not three.
opencode agent list is the authoritative verification command. Run it after any agent config change to confirm: (a) the agent appears by name, (b) its mode is correct (all/primary/subagent), and (c) deny rules appear at the bottom of its permission list. Missing agent = broken symlink or YAML parse error. Present but missing deny rules = frontmatter not parsed correctly or wrong key names. (May 2026)
@mention routing only works at session start. If you send any message that gets answered by the current primary agent first, then send @local-orchestrator ..., the TUI passes the full message text to the current model (Build/OmniCoder) which treats @local-orchestrator as freeform text and answers it itself. Always open a fresh session and make @agent-name the very first message. Alternatively, use opencode run --agent local-orchestrator "..." from the CLI for reliable agent-scoped invocation. Tab-switching to a custom all-mode agent in an existing session works correctly.
edit: deny on local-orchestrator is working correctly. When given an edit task, the orchestrator correctly avoided using replace_string_in_file and instead used the task tool to delegate to a subagent. This is the expected behaviour. Confirmed May 2026.
task tool has a JSON serialization limit. OmniCoder 9B caused an Unterminated string error by embedding the entire contents of multiple package.json files as a literal string inside the task prompt JSON. The task tool prompt is serialized as JSON; very long strings truncate and produce parse errors. Mitigation: instruct the orchestrator in its system prompt to tell workers which files to read rather than quoting file contents inline. This has been added to local-orchestrator.md. (May 2026)
ollama/arch-omni2-9b is the wrong model identifier for the llama-server instance. The correct ID is llama-server/arch-omni2-9b-native (verify with opencode models | grep arch). Using the wrong ID causes an immediate "cannot load model" error when the agent is invoked. Fixed in opencode.json and local-orchestrator.md frontmatter. (May 2026)
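
Because an invalid permission key such as "write" is silently ignored, a unit test that flags unknown keys may be worth having. The sketch below uses the key list recorded in the fact above. That list is described as what the valid keys "include", so it may not be exhaustive, and `unknownPermissionKeys` is a hypothetical helper.

```typescript
// Keys taken from the verified-facts entry above; possibly incomplete.
const KNOWN_PERMISSION_KEYS = new Set<string>([
  "read", "edit", "glob", "grep", "list", "bash", "task", "skill", "lsp",
  "question", "webfetch", "websearch", "external_directory", "doom_loop",
  "todowrite",
]);

// Returns permission keys that are not in the known set (e.g. "write").
function unknownPermissionKeys(permission: Record<string, unknown>): string[] {
  return Object.keys(permission).filter((k) => !KNOWN_PERMISSION_KEYS.has(k));
}
```

Running this over every `agent.<name>.permission` object in `opencode.json` would turn a silently ignored `"write": "deny"` into a named test failure.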

Open Issues

Known bugs and stale claims identified during code review (see deleted agent-infrastructure-review.md and agent-infrastructure-review-pass2.md for full context). Not yet fixed.

CRITICAL — `description:` empty in all generated agent/skill files

scripts/generate-agents.ts uses a hand-rolled YAML parser that silently drops descriptions when they are written in block-scalar form (value on the next line under the key). Every generated file in .github/agents/, .github/skills/, .opencode/agents/, .opencode/skills/ has a blank description: field.

description: is the primary routing signal for Copilot's SkillsContextComputer and OpenCode's agent dispatch. Explicitly @-mentioning an agent by name still works; description-triggered auto-routing does not.

Fix: Inline the description strings in the canonical .agents/ source files (change block-scalar to key: 'value' format). The existing parser handles inline strings correctly. Add a generate:agents:check assertion that every generated file has a non-empty description:.
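
A minimal sketch of what the `generate:agents:check` assertion could verify is shown below. It reads the `description:` value in both inline and block-scalar form, so the check keeps working whichever form the sources use. It assumes `---`-delimited frontmatter and is not a full YAML parser; `descriptionOf` is a hypothetical name.

```typescript
// Extract the description from a markdown file's frontmatter, or "" if absent/blank.
function descriptionOf(markdown: string): string {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return "";
  const end = lines.findIndex((l, i) => i > 0 && l.trim() === "---");
  if (end === -1) return "";

  const fm = lines.slice(1, end);
  const idx = fm.findIndex((l) => /^description\s*:/.test(l));
  if (idx === -1) return "";

  const inline = fm[idx].replace(/^description\s*:\s*/, "").trim();
  // Inline form: anything other than an empty value or a block indicator (| or >).
  if (inline !== "" && !/^[|>][+-]?$/.test(inline)) {
    return inline.replace(/^(['"])(.*)\1$/, "$2"); // strip matching quotes
  }

  // Block form: collect the indented lines that follow the key.
  const block: string[] = [];
  for (const line of fm.slice(idx + 1)) {
    if (/^\s+\S/.test(line)) block.push(line.trim());
    else if (line.trim() === "") continue;
    else break; // next top-level key ends the block
  }
  return block.join(" ");
}
```

The check would then fail the build for any generated file where `descriptionOf(content)` is an empty string.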

MEDIUM — `printf '%s'` regression in hooks breaks `\n` rendering (resolved)

.agents/hooks/post-tool-use.sh, session-start.sh, and user-prompt-submit.sh use printf '%s' "$context" | node -e '...' to JSON-escape the context variable. %s does not interpret \n escape sequences, so multi-line context strings (SELF-CHECK, DEBUGGING REMINDER, BFF REMINDER) arrive at the model as single lines with literal \n characters.

Verified fixed (May 2026): all three hooks already use printf '%b'.

LOW — arXiv citation `2603.29957` unverified (resolved)

arXiv:2603.29957 (Jiang et al. 2026, "Think-Anywhere") appears in .agents/agents/research.md, .agents/agents/brainstorm.md, and the Research Foundation section above. Verify the ID resolves at https://arxiv.org/abs/2603.29957 and fix all references if it doesn't.

Verified real (May 2026): "Think Anywhere in Code Generation" by Xue Jiang, Tianyu Zhang, Ge Li et al., submitted March 31, 2026, revised April 27, 2026 (v3), cs.SE. All existing citations are correct.

LOW — `.claude/` false claims in `tool-agnostic-agent-infra.md` (resolved)

The file docs/projects/tool-agnostic-agent-infra.md no longer exists — already deleted. No action needed.
