
# Agent Infrastructure

Shared agent infrastructure for VS Code Copilot and OpenCode — brainstorm
agent, research agent, nudge instructions, hooks, skills, and MCP server.
Project-specific overlays live in each project's `.agents/` directory.

> **See also:**
> [`docs/research/ai-coding-best-practices.md`](../research/ai-coding-best-practices.md)
> — research synthesis covering the Prompt/Context/Harness taxonomy, failure
> modes, enforcement hierarchy, small-model harness patterns, and all
> primary-source citations that underpin the design decisions here.

## Current State

### Architecture Overview

The infrastructure is **tool-agnostic**: canonical sources live in `.agents/`
and a generator (`npm run generate:agents`) distributes them to
`.github/agents/`, `.github/skills/`, `.opencode/agents/`, `.opencode/skills/`.
Edit the `.agents/` sources; never edit the generated output directories (they
are `.gitignore`d and blocked by pre-tool-use policy).

```
.agents/
├── AGENTS.md                        # Root design doc + enforcement hierarchy
├── agents/                          # Agent definitions (canonical)
│   ├── brainstorm.md
│   ├── research.md
│   └── build-local.md               # OmniCoder 9B via Ollama
├── hooks/                           # Shared bash hooks (delegated by all harnesses)
│   ├── pre-tool-use.sh              # Hard blocks (terminal cmds + file-path policies)
│   ├── post-tool-use.sh             # Self-check counter + methodology reminders
│   ├── session-start.sh             # Inject project state at session start
│   ├── user-prompt-submit.sh        # Per-turn nudge detection + task capture
│   ├── pre-compact.sh               # Export state before context summarization
│   └── stop.sh                      # Session-end verification
└── skills/
    └── research/SKILL.md            # Research methodology (any agent can load)
```

Generated output (do not edit — regenerated by `npm run generate:agents`):

- `.github/agents/` — VS Code Copilot agent files
- `.github/skills/` — VS Code Copilot skill files
- `.opencode/agents/` — OpenCode agent files
- `.opencode/skills/` — OpenCode skill files

Harness integration:

- **VS Code Copilot**: `.github/agent-support.json` — maps 4 hook events to the
  shared bash scripts in `.agents/hooks/`
- **OpenCode**: `.opencode/plugins/agent-support.ts` — TypeScript plugin that
  shells out to the same bash scripts

### Brainstorm Agent

- 4-phase workflow: Quick Frame → Diverge → Converge → Capture & Hand Off
- 6 techniques: Rapid Ideation, SCAMPER, Worst Possible Idea, How Might We,
  Inversion/Pre-mortem, Constraint Flipping
- Counterbalances Opus 4.6's tendency to overthink
- Phase 2 includes "push past the obvious" nudge (Zhao et al. 2024: LLMs fall
  short on originality, excel at elaboration — first ideas are "average")
- Phase 4 routes to `@research` for investigation and to the default agent for
  implementation
- Creates exploration files at `docs/explorations/<name>.md` and session memory
  notes

### Research Agent

- Two orientations that compose recursively:
  - **Understand** (Grounded Theory): open coding → constant comparison → axial
    coding → memo → saturation check
  - **Diagnose** (Strong Inference + Satisficing): 5-factor triage gates between
    satisficing (low risk) and full falsification (high risk)
- 5-factor triage: reversibility, blast radius, confidence, novelty, time cost
- Timing awareness: `time` prefix on unknown commands, session/repo memory for
  baselines, timing feeds into triage decisions
- Investigation files at `docs/explorations/<name>.md`
- Techniques reference: Five Whys, Delta Debugging, Rubber Duck
- Delegates evidence-gathering to Explore subagent, keeps analytical thinking
  local
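
The triage gate can be pictured as a small scoring function. This is only a sketch: the function name `triage`, the 0/1 rating per factor, and the "two or more high-risk factors" cutoff are all assumptions made for illustration; the agent definition may weigh the five factors differently.

```bash
#!/usr/bin/env bash
# Hypothetical triage gate. Each argument rates one factor as 0 (low risk) or
# 1 (high risk), in this order:
#   reversibility, blast radius, confidence, novelty, time cost
# The ">= 2 high-risk factors" cutoff is an assumption, not the agent's rule.
triage() {
  local risk=0 factor
  for factor in "$@"; do
    risk=$(( risk + factor ))
  done
  if (( risk >= 2 )); then
    echo "falsify"     # full Strong Inference: competing hypotheses, crucial tests
  else
    echo "satisfice"   # accept the first explanation that fits the evidence
  fi
}
```

For example, `triage 1 0 1 0 0` (hard to reverse, low confidence) selects the falsification path, while `triage 0 0 0 1 0` stays with satisficing.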

### Nudge Instructions

- Brainstorm nudge: triggers on hesitation/overthinking language ('wait',
  'actually', 'hmm', 'overcomplicating', etc.)
- Research nudge: triggers on debugging/investigation language ('why is this
  broken', 'how does this work', 'root cause', etc.)
- Both are non-intrusive single-sentence suggestions that fire only once per
  topic
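
A minimal sketch of how the trigger matching could look in `user-prompt-submit.sh`, assuming a plain case-insensitive phrase match. The function name `classify_nudge` is hypothetical, the phrase lists contain only the examples quoted above, and the once-per-topic suppression is not shown.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide which nudge (if any) a user prompt should trigger.
# Phrase lists reuse only the examples quoted in the doc; the real lists are longer.
classify_nudge() {
  local prompt
  # Lowercase the prompt so matching is case-insensitive.
  prompt=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')

  # Research triggers are multi-word phrases, so a substring match is enough.
  case "$prompt" in
    *"why is this broken"*|*"how does this work"*|*"root cause"*)
      echo "research"
      return 0
      ;;
  esac

  # Brainstorm triggers are single words; -w avoids matching "waiting" for "wait".
  if printf '%s' "$prompt" | grep -Eqw 'wait|actually|hmm|overcomplicating'; then
    echo "brainstorm"
    return 0
  fi

  echo "none"
}
```

Checking the research phrases first is a design choice of this sketch: they are more specific, so a prompt such as "wait, what is the root cause?" is routed to research rather than brainstorm.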

### Tool Mapping (Copilot ↔ OpenCode)

| Copilot                                              | OpenCode equivalent                                                                                |
| ---------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| `AGENTS.md` (root + nested)                          | `AGENTS.md` (root, native; nested via `instructions` glob in `opencode.json`)                      |
| `.github/agents/*.agent.md`                          | `.opencode/agents/*.md` (frontmatter: `description`, `mode`, `model`, `temperature`, `permission`) |
| `.github/skills/<name>/SKILL.md`                     | `.opencode/skills/<name>/SKILL.md` — also reads `.agents/skills/` and `.claude/skills/`            |
| `.github/instructions/*.instructions.md` (`applyTo`) | No direct equivalent — fold into AGENTS.md stubs or `instructions` glob                            |
| `.github/hooks/*.sh` (JSON-configured shell)         | `.opencode/plugins/*.ts` (TS modules, event-driven) — shells out via Bun's `$`                     |
| `runSubagent` / `Explore` agent                      | Built-in `general` and `explore` subagents; `@`-mention syntax                                     |
| `vscode_askQuestions`                                | No equivalent — OpenCode uses agent's natural turn-taking                                          |

OpenCode plugin event mapping:

| Copilot hook   | OpenCode event                      |
| -------------- | ----------------------------------- |
| `SessionStart` | `session.created`                   |
| `PreToolUse`   | `tool.execute.before`               |
| `PostToolUse`  | `tool.execute.after`                |
| `PreCompact`   | `experimental.session.compacting`   |
| `Stop`         | `session.idle` (closest equivalent) |

## Research Foundation

> For full research depth, citations, and failure-mode analysis, see
> [`docs/research/ai-coding-best-practices.md`](../research/ai-coding-best-practices.md).
> The list below records the specific papers and frameworks that shaped the
> design decisions in this project.

Methodologies and papers that informed the design:

- **Grounded Theory** (Glaser & Strauss): build understanding from data, not
  assumptions. Applied to code-reading in the Understand orientation.
- **Strong Inference** (Platt 1964): multiple competing hypotheses → crucial
  experiments → eliminate. Applied to the Diagnose orientation.
- **Satisficing** (Simon 1956): accept "good enough" when optimization cost
  exceeds benefit. Gates between cheap confirmation and expensive falsification.
- **Dual Process Theory** (Kahneman): System 1 (fast, pattern-matching) vs
  System 2 (slow, analytical). System 1 more accurate in familiar domains.
  Informs the triage decision.
- **Zhao et al. 2024** (arxiv): LLMs fall short on originality, excel at
  elaboration. First ideas are "average." Informs brainstorm agent's "push past
  the obvious" nudge.
- **"Lost in the Middle"** (Liu et al. 2023): LLMs attend best to beginning/end
  of context. Informs hook design — inject at context tail for high attention.
- **Delta Debugging**: binary search the change space between passing/failing
  cases. Logic behind `git bisect`.
- **Five Whys**: iterative causal chain tracing. Starting point for hypothesis
  generation, not sole diagnostic method.
- **Ronacher "Agent Design Is Still Hard"**: reinforce methodology after every
  tool call at context tail. Structural injection outperforms relying on
  instructions in the system prompt.
- **Think-Anywhere** (Jiang et al. arXiv:2603.29957, Mar 2026, Peking U + Tongyi
  Lab): LLMs trained to invoke `<think>` blocks at any token position during
  code generation, not just upfront. SOTA on LeetCode/LiveCodeBench with fewer
  total tokens. The motivating insight: a model can plan correctly at the start
  but introduce an off-by-one bug mid-implementation — only mid-loop reasoning
  catches it. **Applied here**: the research agent's investigation checklist
  includes "Re-evaluate hypothesis at every tool-call boundary." For Claude 4
  models, interleaved thinking makes this automatic. Complements Plan-and-Solve:
  upfront decomposition where structure is clear, mid-execution re-evaluation
  when intermediate results change what to do next.
- **Anthropic interleaved thinking** (Claude 4 + adaptive thinking): Claude
  Sonnet 4.6+ and Opus 4.6+ automatically insert thinking blocks between tool
  calls. No separate implementation needed — agent instruction design drives it.
  The research agent's "Re-evaluate hypothesis at every tool-call boundary"
  instruction is written to take advantage of this behavior.
- **Prompt/Context/Harness framework** (Alibaba Cloud, Apr 2026): Names the
  three engineering layers. Prompt = task expression (stateless). Context = what
  the model sees (AGENTS.md, skills, tools — engineering target is progressive
  disclosure). Harness = system constraints + verification loops (hooks,
  permission gates, sub-agent isolation). Diagnostic map: wrong output → Prompt;
  hallucinated fact → Context; wrong tool selected → Context (fix description);
  task drift → Harness (sub-agent boundary); destructive action → Harness
  (permission hook). LangChain improved Terminal Bench 2.0 results from 52.8% to
  66.5% by changing the Harness alone.
- **Context engineering** (Rajasekaran et al., Anthropic, Sep 2025): Formally
  distinguishes context engineering from prompt engineering. Key principles: (a)
  just-in-time context — agents hold references and load on demand, not upfront;
  (b) structured note-taking (NOTES.md) as external working memory for long
  sequential tasks; (c) every new token depletes attention budget — validates
  the <60-line AGENTS.md ceiling; (d) compaction strategy: maximize recall
  first, then improve precision.

## MCP Server Lifecycle Hooks — Protocol Status (May 2026)

The `.agents/mcp/` server exposes prompts and tools to agents via the MCP
protocol. A recurring question: can the MCP server react to session lifecycle
events (session start/end, tool-use boundaries)?

### Current protocol state

**No lifecycle hooks exist in the MCP protocol.** The spec defines three phases
only: `initialize → operation → shutdown`. There is no `session.created`,
`post-tool-call`, or `session.ended` notification. This gap is why session
awareness currently lives in the OpenCode plugin layer
(`.opencode/plugins/agent-support.ts`) rather than the MCP server — OpenCode
exposes `session.created`, `session.idle`, `session.compacted`,
`session.deleted`, and `tool.execute.before/after` events natively to plugins.

### Active work in the MCP spec

**SEP-2624: Interceptors for the Model Context Protocol**
([PR #2624](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2624))

The most organized effort. Supersedes SEP-1763 (closed as completed). Proposes
**Interceptors** as a new MCP primitive — two types: **validators** (inspect,
return pass/fail) and **mutators** (transform context payloads) — discoverable
and invocable via `interceptors/list` and `interceptor/invoke` JSON-RPC methods.
These fire at protocol-level operation events: `tools/call`, `prompts/get`,
`resources/read`, `sampling/createMessage`, `elicitation/create`. Not
session-start/stop hooks, but before/after wrapping for every operation.

There is now a formal **Interceptors Working Group** (Bloomberg + Saxo Bank
engineers, biweekly cadence). Reference implementations in progress for Go and
C# SDKs. Experimental repo:
[modelcontextprotocol/experimental-ext-interceptors](https://github.com/modelcontextprotocol/experimental-ext-interceptors).
Charter:
[modelcontextprotocol.io/community/interceptors/charter](https://modelcontextprotocol.io/community/interceptors/charter).

**SEP-2282: Server-Declared Behavioural Hooks**
([PR #2282](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2282))

Smaller, separate open PR. Proposes servers declare **context injections** in
`ServerCapabilities` — text injected into the agent's context at client-side
lifecycle events (session start, post-tool-use, session end). The contract is
"here's context the model should have at this moment," not code execution. More
directly analogous to our OpenCode `session.created` / `session.idle` patterns.
Currently unsponsored — needs a maintainer to pick it up.

### What to watch

- **Primary**:
  [PR #2624](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2624) +
  experimental-ext-interceptors repo
- **Secondary**:
  [PR #2282](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2282)
  (closest to session-lifecycle hooks)
- **Label filter**:
  [`SEP` label](https://github.com/modelcontextprotocol/modelcontextprotocol/issues?q=label%3ASEP)
  on the modelcontextprotocol repo
- **Milestone**: `2026-06-30-RC` is the next spec revision window

### Implication for this project

Until interceptors land in a shipping spec version and the TypeScript SDK, the
session lifecycle pattern stays at the OpenCode plugin layer. When SEP-2282 or
an equivalent lands, the MCP server could self-register context injection hooks
during `initialize`, removing the need for tool-specific plugin code.

---

## Model Scale Profiles

Different model sizes fail in different ways, so each profile below pairs its
failure modes with matching mitigations.

### Large-scale API models (Claude Sonnet / Opus)

**Primary failure modes**: overthinking, sycophancy, verbosity, tendency to add
unrequested features or comments.

**Infrastructure strategy**:

- Advisory methodology + structural reinforcement (hooks, circuit breakers)
- PostToolUse self-check nudges every ~15 calls
- PreToolUse hard blocks for high-risk operations
- Subagent delegation for isolated tasks (parent Opus → child Sonnet/Haiku)

### Smaller-scale local models (OmniCoder 9B via Ollama)

**Primary failure modes** (different from "low reasoning" — OmniCoder uses Qwen3
thinking blocks natively):

- Narrower training distribution (Python/JS heavy)
- Quantization degradation: JSON schema compliance drops as context fills
- Tool-call history is the primary context consumer — responses must be
  truncated aggressively
- Instruction drift: fewer attention heads (32 vs 64 in 32B) means system prompt
  recall degrades faster

**Infrastructure strategy**:

- PostToolUse response truncation at ~1500 tokens (plugin layer, not bash hook)
- PreToolUse JSON validation with schema-specific error messages
- Context pressure injection at ≥70% fill (~22K/32K tokens)
- `steps: 20` cap + `ask` permission gates for natural checkpoints
- `explore` subagent delegation to reduce context pressure on the main agent
- `NOTES.md` working memory pattern enforced in agent body
- No `web` tool — keeps context lean
- Reasoning guidance: "Hold references; load on demand" explicit in agent body
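
To make the 70% threshold concrete, here is the arithmetic as a shell sketch. The real check lives in the TypeScript plugin; the function name `context_under_pressure` is hypothetical, and the 32,768-token window is assumed from the "32K" figure above.

```bash
#!/usr/bin/env bash
# Assumed values: a 32,768-token window and a 70% pressure threshold.
CONTEXT_LIMIT_TOKENS=32768
PRESSURE_PERCENT=70

# Succeeds (exit 0) once the estimated token count reaches the threshold.
# Integer math only: used * 100 >= limit * percent, i.e. used >= 22,938 here.
context_under_pressure() {
  local used_tokens=$1
  (( used_tokens * 100 >= CONTEXT_LIMIT_TOKENS * PRESSURE_PERCENT ))
}
```

With these values the threshold sits just under 23K tokens, slightly above the rounded "~22K" figure. Cross-multiplying keeps the comparison in integers, which matters because bash has no floating-point arithmetic.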

---

## OmniCoder 2 Orchestration — Pending Work

> Full historical rationale and audit findings were maintained in
> `docs/projects/local-ai-orchestration.md` (deleted May 2026 after merge). The
> plan used an orchestrator-workers pattern with structural `edit: deny`
> enforcement on the orchestrator. All OpenCode config values verified against
> opencode.ai/docs (May 2026).

### Goals

1. All agents run on `ollama/arch-omni2-9b` — no cloud fallback
2. User can type vague prompts; the system decomposes and delegates
   automatically
3. Context windows are isolated per subagent (no shared state bleed)
4. Changes scale forward: switching to cloud means changing model strings, not
   architecture

### Pending Changes

#### Quick wins — under 5 minutes each, no testing required

1. - [x] **[CRITICAL] Fix `<tool\*call>` typo in `omnicoder2.modelfile`** —
         markdown-escape artifact; malformed opening tag paired with correct
         closing tag. Highest-leverage change; everything below depends on
         reliable tool-call JSON.
2. - [x] **Mark canonical/deprecated modelfiles** — `# CANONICAL` header on
         `omnicoder2.modelfile`; `# DEPRECATED` on `omnicoder.modelfile`;
         `omnicoder-v2.modelfile.template` deleted (was dead code — v2 now
         served from HuggingFace path).
3. - [x] **Add `compaction.reserved: 3000` to `opencode.json`** — default 10,000
         fires compaction too early given ~8–12K baseline context.
4. - [x] **Fix `pre-compact.sh` prettier call** — removes `npx prettier` which
         violates pre-tool-use Policy 1 (self-violating policy).
5. - [x] **MCP server error handling** — wrap `server.connect(transport)` in
         try/catch with stderr + `process.exit(1)`.

#### Short session — 15–30 minutes each, bounded scope

6. - [x] **Fix `stop.sh` JSON escaping** — replace `sed`-based escaping with
         `printf '%b' | node JSON.stringify` pattern used in every other hook.
7. - [x] **Per-session PostToolUse counter** — repo-scoped path
         `/tmp/.opencode-tool-count-<repo-hash>` (derived from REPO_ROOT via
         md5sum); prevents cross-repo contamination; session-start.sh resets it
         at session begin.
8. - [x] **Shrink compaction prompt to ~120 words** (in
         `.opencode/plugins/agent-support.ts`) — shorter instructions free
         bandwidth for the 9B to actually summarize.
9. - [x] **Update `.agents/agents/build-local.md` for v2** — pagination 100 → 50
         lines; rule 4 now says "recipient not dispatcher"; rule 7 scope-check
         says "tell the user, do not self-decompose".
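
The repo-scoped counter path from item 7 could be derived along these lines. This is a sketch under assumptions: the function names are hypothetical, `md5sum` from GNU coreutils is assumed to be on `PATH`, and the hooks may compute the hash differently.

```bash
#!/usr/bin/env bash
# Derive a per-repo counter file from the repo root so two repos never share a
# counter. Assumes md5sum (GNU coreutils) is available.
counter_file_for_repo() {
  local repo_root=$1 hash
  hash=$(printf '%s' "$repo_root" | md5sum | cut -d' ' -f1)
  echo "/tmp/.opencode-tool-count-${hash}"
}

# What session-start.sh would do at session begin: reset the counter to zero.
reset_counter() {
  echo 0 > "$(counter_file_for_repo "$1")"
}
```

A hook would call `counter_file_for_repo "$REPO_ROOT"` once and then read and update that file on each tool call.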

#### Depends on orchestrator being proven first

10. - [x] **Trim root `AGENTS.md` to ~60 lines** — reduced from 435 lines to 45
          lines; all architecture rationale, code examples, quick task table,
          and project context removed; cross-cutting rules and quality gate
          preserved (May 2026).
11. - [x] **PostToolUse weighted counter** — reads (`read_file`, `grep`, `list`)
          +0.25; writes/shell +1; keeps 15-call SELF-CHECK from firing
          mid-investigation sweep. Depends on #7 (per-session counter) first.

          **Implementation** (`.agents/hooks/post-tool-use.sh`): bash has no
          float arithmetic — scale to integers: reads +1, writes/shell +4,
          threshold 60 (equivalent to 15 effective write-units). Read-class
          tools: `read_file`, `grep_search`, `list_dir`, `file_search`,
          `semantic_search`, `explore_subagent`. Write/shell-class: all
          `*_string_in_file`, `create_file`, `run_in_terminal`. Replace the
          single `COUNT=$((COUNT + 1))` with a `case "$TOOL_NAME"` block that
          does `COUNT=$((COUNT + 1))` for reads and `COUNT=$((COUNT + 4))` for
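          writes/shell. Change the self-check condition from
          `(( COUNT % 15 == 0 ))` to `(( COUNT % 60 == 0 ))`. Caveat: with mixed
          +1/+4 steps the total can jump over a multiple of 60 (58 → 62), so an
          exact-modulo test can miss the check; a crossing test
          (`(( COUNT >= 60 ))`, then `COUNT=$((COUNT - 60))`) avoids that.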

12. - [x] **PostToolUse reminder priority filter** — emit at most 2 reminders
          per tool call; priority: SELF-CHECK > DEBUGGING > path-scoped >
          tool-specific. Depends on #11.

          **Implementation** (`.agents/hooks/post-tool-use.sh`): replace the
          current single `context` string accumulator with an indexed array
          `reminders=()`. Each block appends `reminders+=("$msg")` in priority
          order (SELF-CHECK first, DEBUGGING second, BFF/QUALITY GATE third,
          RENAME fourth). At output time, join only the first 2 elements with a
          `\n\n` separator. Blocks that didn't fire don't append, so the cap
          only matters when more than 2 reminders fire.

13. - [x] **Broaden PostToolUse truncation to all `ollama/` agents**
          (`.opencode/plugins/agent-support.ts`); differentiate limit:
          orchestrator 2,500 tokens vs workers 1,500. Minor until orchestrator
          exists.

          **Implementation**: rename `BUILD_LOCAL_MAX_RESPONSE_TOKENS` →
          `LOCAL_WORKER_MAX_TOKENS = 1500`; add
          `LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500`. In `tool.execute.after`, the
          existing `isLocalAgent` check covers all `ollama/` agents via
          `input.model.startsWith('ollama/')`. Add a second check:
          `input.agent === 'local-orchestrator'` → use orchestrator limit, else
          worker limit. The `agent` field is available in `tool.execute.after`
          (confirmed working for `build-local`).

14. - [x] **Create `.agents/agents/local-orchestrator.md`** — primary agent with
          `edit: deny`, `write: deny`, `bash: deny`; whitelist `task` to
          `build-local`, `research`, `brainstorm` only.

          **Implementation**: new file modeled on `build-local.md`. Role: receive
          high-level goal, decompose into bounded subtasks, show decomposition to
          user before dispatching, delegate via `task` subagent. Permission
          block in `opencode.json` `agent.local-orchestrator`:
          `{ "edit": "deny", "write": "deny", "bash": "deny" }`. Agent body
          rules: (1) read project root `AGENTS.md` first; (2) produce a task
          list and confirm with user before dispatching; (3) one `task` call per
          subtask, wait for result; (4) never attempt to edit files directly —
          if a subtask requires context the worker needs, inject it via the
          `task` prompt, not by reading files yourself; (5) after all subtasks,
          report summary to user.

15. - [x] ~~**Set `default_agent: "local-orchestrator"` in `opencode.json`**~~ —
          Done May 2026. Key is `default_agent` (snake_case, confirmed from
          `opencode.ai/config.json` schema). `local-orchestrator` has
          `mode: all` so it qualifies as a primary agent.
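
The weighted counter (item 11) and the two-reminder cap (item 12) can be sketched together. This is an illustration rather than the contents of `post-tool-use.sh`: the function names are hypothetical, unknown tools are treated as write-class by assumption, and the sketch uses a threshold-crossing test instead of an exact-modulo test.

```bash
#!/usr/bin/env bash
# Integer weights: reads +1, writes/shell +4, threshold 60 (= 15 write-units).
SELF_CHECK_THRESHOLD=60
COUNT=${COUNT:-0}

# Echo the weight for a tool name. The read-class list follows item 11; treating
# every other tool as write-class is an assumption of this sketch.
tool_weight() {
  case "$1" in
    read_file|grep_search|list_dir|file_search|semantic_search|explore_subagent)
      echo 1 ;;
    *)
      echo 4 ;;
  esac
}

# Add the tool's weight to COUNT and set SELF_CHECK_DUE=1 when the running total
# reaches the threshold, carrying the remainder forward. A crossing test is used
# because a +4 step can jump over an exact multiple of 60.
bump_counter() {
  COUNT=$(( COUNT + $(tool_weight "$1") ))
  SELF_CHECK_DUE=0
  if (( COUNT >= SELF_CHECK_THRESHOLD )); then
    SELF_CHECK_DUE=1
    COUNT=$(( COUNT - SELF_CHECK_THRESHOLD ))
  fi
}

# Join at most the first two reminders with a blank line between them. Callers
# pass reminders in priority order, so lower-priority ones are the ones dropped.
join_top_reminders() {
  local -a reminders=("$@")
  local out="" i
  for (( i = 0; i < ${#reminders[@]} && i < 2; i++ )); do
    if [ -n "$out" ]; then
      out+=$'\n\n'
    fi
    out+="${reminders[i]}"
  done
  printf '%s' "$out"
}
```

In a hook, each reminder block would append to an array in priority order and the final output would come from `join_top_reminders "${reminders[@]}"`.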

#### Done

- [x] ~~**Soften `opus-deep.modelfile` directive**~~ — file deleted (May 2026);
      DeepSeek R1 available online when needed; OmniCoder 2 is the sole local
      model.

### Known Tradeoffs

| Tradeoff                                           | Impact                                                                                                          | Mitigation                                                                                                         |
| -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| Instructions glob trimmed to root `AGENTS.md` only | Agents miss project-specific patterns for subdirectories unless they read nested `AGENTS.md` explicitly         | Add reminder in orchestrator + build-local agent body: "check nested `AGENTS.md` before working in subdirectories" |
| Same model for all roles                           | Orchestrator, worker, compaction agent are all same weights with different prompts                              | Structural `edit: deny` is the safety net; circuit breakers limit runaway loops                                    |
| No cloud fallback                                  | If task is too complex for 9B, no escalation path                                                               | Orchestrator includes "ask the user for direction" rule; user can switch to Copilot                                |
| Latency                                            | Sequential dispatch: orchestrator decomposes → build-local runs → returns. ~2× wall time vs. direct build-local | Acceptable for local dev; no VRAM multiplier since Ollama keeps weights hot                                        |
| Reminder-stacking cap                              | 2-per-call priority filter (pending work above) drops lower-priority warnings                                   | Skipped reminders fire on next call if condition holds                                                             |

### Cloud Migration Path

When ready to add a cloud model, only `opencode.json` changes:

```json
{
  "model": "ollama/arch-omni2-9b",
  "agent": {
    "local-orchestrator": {
      "model": "anthropic/claude-haiku-4-5"
    }
  }
}
```

Schema verified against opencode.ai/docs/agents/ (May 2026). The `tools` key
inside agent configs is deprecated in favour of `permission` — the orchestrator
definition uses `permission`, so it is current. The `agent.{name}.model` key is
the correct per-agent override mechanism.

---

## Ecosystem Gap — Contextual AGENTS.md Injection

During local AI work (May 2026) we hit a fundamental limitation: OpenCode's
`instructions` glob in `opencode.json` loads **all matched files upfront** into
every session. For a 9B local model with a 32K context window, loading all of
`apps/*/AGENTS.md` and `packages/*/AGENTS.md` at startup consumes ~30–40% of the
context budget before the first message, triggering early compaction and
degrading quality.

The correct behaviour — injecting only the AGENTS.md relevant to the file being
edited — does not exist natively in OpenCode or its plugin ecosystem. The
closest community plugin (`opencode-skillful`, 295 stars) is archived as of Feb
2026 and still requires the model to explicitly call `skill_find`/`skill_use`;
it provides no path-triggered structural injection.

### Open tasks

16. - [ ] **Assess: is filling this ecosystem gap worth the effort?** — Before
          building a contextual-injection plugin, evaluate: (a) Is OpenCode
          actively used for serious local AI coding work, or is the community
          primarily cloud-model users for whom context cost is irrelevant? (b)
          Are there better local AI coding stacks (e.g. Aider + litellm, Cursor
          local mode, VS Code Copilot + Ollama) where this problem is already
          solved? (c) Is the `tool.execute.before` event stable enough to build
          on? Target: 30-minute research session, concrete go/no-go
          recommendation.

17. - [ ] **Review + write up our issues and fixes as an ecosystem
          contribution** — If the gap is worth filling: document the
          context-bleed problem, the early-compaction root cause, our hook-based
          mitigation, and the remaining structural gap. Publish as a GitHub
          issue on the OpenCode repo and/or an npm plugin
          (`opencode-contextual-rules`?) implementing `tool.execute.before`
          path-triggered AGENTS.md injection. Depends on #16 go/no-go.

18. - [x] ~~**Trim `.agents/AGENTS.md`**~~ — Done May 2026. Condensed from
          12,584 → 10,507 bytes (43 lines removed). Trimmed: Hook Architecture
          Principle block (redirected to item 22 in project doc), Deferred
          Loading example + "why not" paragraph, session-start/stop hook prose,
          outdated `generate-agents.ts` references in Skills/Agents sections.
          Agent body files updated to prompt-body-only convention (see items
          25/26).

19. - [x] ~~**Block bash bypass of read pagination**~~ — Done May 2026. Added
          Policy 14 to `pre-tool-use.sh`: blocks `cat`/`head`/`tail`/`jq` reads
          of `apps/*/package.json` and `packages/*/package.json`. Scope limited
          to package.json (confirmed live bypass vector); general `.ts`/`.md`
          bash reads are not yet blocked (lower-urgency gap). Pattern verified
          with Node.js unit test — exact bypass command
          `cat apps/api/package.json | jq` is caught by P1.

20. - [ ] **Improve explore-first scope detection** — Policy 14 blocks
          `manage_todo_list` with ≥4 items, but OmniCoder sometimes starts with
          `Explore`/`find` before planning, bypassing the check. Options: (a)
          block `explore_subagent` when the query looks like a multi-file
          discovery sweep (glob patterns for source files across multiple dirs);
          (b) add a pre-tool-use check on `run_in_terminal` that denies `find`
          commands spanning the whole repo when the task hasn't been scoped yet;
          (c) rely on the todo-list check firing when planning eventually
          happens (current behavior — catches it late but still before edits
          start).

21. - [x] ~~**Remove debug logging from plugin after verified cycle**~~ — Done
          May 2026. Removed the full-input dump block from `tool.execute.before`
          in `plugin.ts` (`/tmp/plugin-debug.jsonl` appender). Guards verified
          via `opencode export` session transcript inspection — no longer need
          the dump file. Hook error logger (`/tmp/plugin-hook-errors.log`) kept
          as it only fires on failures, not every call.

22. - [ ] **Refactor hook scripts to be platform-agnostic** — currently
          `pre-tool-use.sh` parses Copilot-specific JSON and outputs
          Copilot-specific `permissionDecision` JSON. `plugin.ts` implements
          duplicate guards inline rather than calling the script. This means
          OpenCode and Copilot guards can drift (confirmed May 2026: Policy 14
          in `pre-tool-use.sh` had no effect on OpenCode `bash` tool calls).

          **Design target**: scripts accept normalized env vars (`TOOL_NAME`,
          `COMMAND`, `FILE_PATH`), exit non-zero with plain-text denial reason
          on stdout. Callers normalize input and translate output to their
          native denial format. Tracked in `.agents/AGENTS.md` Hook Architecture
          Principle section.

          **Audit required first**: review all hook scripts for Copilot-specific
          assumptions before refactoring.
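
A minimal sketch of the design target in item 22, assuming repo-relative paths in `FILE_PATH`. The function name `check_policy` is hypothetical, and the single policy shown (generated output directories are read-only) stands in for the full policy set; `TOOL_NAME` and `COMMAND` would be read the same way by other policies.

```bash
#!/usr/bin/env bash
# Hypothetical platform-neutral policy check.
# Input:  normalized env vars TOOL_NAME, COMMAND, FILE_PATH (only FILE_PATH is
#         used by the one policy shown here).
# Output: on denial, a plain-text reason on stdout and a non-zero exit status.
check_policy() {
  local path=${FILE_PATH:-}

  # Generated output must not be edited; the canonical sources are in .agents/.
  case "$path" in
    .github/agents/*|.github/skills/*|.opencode/agents/*|.opencode/skills/*)
      echo "Denied: $path is generated output; edit the source under .agents/ instead."
      return 1
      ;;
  esac

  return 0
}
```

Each caller would normalize its own tool input into these variables and translate a non-zero exit into its native denial format, so the policy logic exists in one place only.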

23. - [ ] **Question-drift marker in `user-prompt-submit.sh`** — when the model
          has committed to a prior position and follow-up questions are being
          misread through that lens, prepend a disambiguation marker at the
          prompt tail. Detected pattern: model answers "no" or "not possible" in
          a prior turn → subsequent turns interpreted as defense of that
          position. See §2.1 ("Position-anchored priming") in the research doc.

          **Implementation**: in `user-prompt-submit.sh`, read the last N turns
          of `$TRANSCRIPT_PATH` (injected by OpenCode's native hook env) and
          look for a prior committed "no/impossible/can't" response within the
          last 3 model turns. If detected, append to `ADDITIONAL_CONTEXT`:
          `CURRENT QUESTION (answer only this — not the prior exchange): [prompt
          text]`. The key is repeating the user's exact question at the tail,
          after the marker, to counteract lost-in-the-middle effects. Fallback
          trigger: user prompt contains "that's not what I asked" / "you're
          answering the wrong question" / "I said" → always inject marker
          regardless of transcript scan.

24. - [x] ~~**Review all custom agent files for local-model-specific framing**~~
          — Done May 2026. `build-local.md` reframed: dropped "OmniCoder", "9B",
          "Ollama", "Qwen3 thinking blocks", "32K tokens total"; replaced with
          model-agnostic equivalents. `research.md` and `brainstorm.md` verified
          clean — no model/provider mentions. `local-orchestrator.md` was fixed
          earlier this session. All four agent body files are now
          model-agnostic.

25. - [ ] **Failure-mode routing in SELF-CHECK** — when the periodic SELF-CHECK
          fires in `post-tool-use.sh`, if a recent terminal failure or test
          failure is also present in the same turn, classify the failure type
          and inject the matched intervention rather than generic "step back."
          Reference: failure-mode routing table in §3.5 of the research doc.

          **Implementation**: in the SELF-CHECK block, if `context` already
          contains `DEBUGGING REMINDER` (i.e., test/terminal failure co-occurred
          this turn), append a classification hint:
          `FAILURE TYPE HINT: If this is a test/build failure → Reflexion loop
          (fix based on test output). If convention violation → grep for the
          pattern and inject a canonical example. If wrong file/directory → stop
          and re-read the project structure. Do not default to "try harder."`.
          Low implementation cost — pure text append with a conditional on
          `$context`.

26. - [x] ~~**Audit agent `.md` files for OpenCode-specific frontmatter**~~ —
          Done May 2026. Audit result: only `local-orchestrator.md` had OpenCode
          frontmatter keys (`mode`, `model`, `permission`). `brainstorm.md`,
          `build-local.md`, `research.md` were already plain markdown. Went with
          option (b): stripped `mode`/`model`/`permission` from
          `local-orchestrator.md`; moved `mode: all` into `opencode.json`
          (model + permission were already there). Kept `description` in
          frontmatter as it is neutral and self-documenting. Body files are now
          prompt-body only — valid in both OpenCode and Copilot.

27. - [ ] **`plugin.ts` local-agent detection uses provider prefix, not agent
          name** — `tool.execute.after` detects local agents via
          `input.model.startsWith('ollama/')`. This is provider-specific: if the
          model is served via a different backend (e.g. `llama-server/`,
          `lmstudio/`), truncation silently stops working. Fix: detect by agent
          name (`input.agent.includes('build-local')`) only, removing the
          `ollama/` fallback. The `input.agent` field is available in
          `tool.execute.after` (confirmed May 2026).

28. - [ ] **`plugin.ts` context pressure threshold is hardcoded to 32,768
          tokens** — `CONTEXT_LIMIT_TOKENS = 32768` assumes OmniCoder 9B's
          context window. If the local model changes, the threshold silently
          drifts out of calibration. Options: (a) read from `opencode.json`
          model config if OpenCode exposes it to plugins; (b) make it a
          top-of-file constant with a comment to update when changing models;
          (c) accept the drift as low-severity, since the threshold is advisory
          only (context pressure warnings are informational, not blocking).
          Option (b) is the minimum; option (a) is preferable if it turns out
          to be available.

29. - [x] ~~**Move `permission` out of `local-orchestrator.md` frontmatter**~~ —
          Done May 2026 as part of item 26. `mode: all` added to `opencode.json`
          agent entry. `model` and `permission` were already in `opencode.json`.
          `opencode.json` is now the single source of truth for all runtime
          config; `.md` files are prompt-body only.
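
Items 27 and 28 could look roughly like the sketch below. It is an illustration
under stated assumptions, not the current `plugin.ts`: the `ToolAfterInput`
shape models only the two fields discussed in item 27, and
`LOCAL_AGENT_MARKERS`, `isLocalAgent`, and `contextPressure` are hypothetical
names.

```typescript
// Only the two fields discussed in item 27 are modeled here. The real
// `tool.execute.after` input has more fields (shape assumed, not verified).
interface ToolAfterInput {
  agent?: string;
  model?: string;
}

// Item 28, option (b): a single top-of-file constant.
// UPDATE THIS when the local model (and its context window) changes.
const CONTEXT_LIMIT_TOKENS = 32_768;

// Hypothetical list of agent-name fragments that mark a local agent.
const LOCAL_AGENT_MARKERS = ["build-local"];

/** Item 27: detect local agents by agent name only, never by provider prefix. */
function isLocalAgent(input: ToolAfterInput): boolean {
  const agent = input.agent ?? "";
  return LOCAL_AGENT_MARKERS.some((marker) => agent.includes(marker));
}

/** Fraction of the assumed context window already used (advisory only). */
function contextPressure(usedTokens: number): number {
  return usedTokens / CONTEXT_LIMIT_TOKENS;
}
```

Because `input.model` is never consulted, swapping the serving backend (for
example from `ollama/` to `llama-server/`) leaves the detection unchanged.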

---

## Testing & Regression

**Research summary (May 2026):** No pre-existing tool exactly fits this use
case. Existing tools (RagaAI Catalyst, AgentEvalKit, agent-eval-arena,
intent-eval-lab, j-rig-skill-binary-eval) focus on LLM output quality,
hallucination detection, or cross-runtime behavior scoring — not config file
structure or policy enforcement regression. The closest analogue is
`j-rig-skill-binary-eval` (binary pass/fail criteria across 7 layers), which
uses the same conceptual approach we'd want here. Our testing is bespoke by
necessity: we're testing configuration files, shell scripts, and specific policy
enforcement behaviors, not general LLM response quality.

**Two layers of testing:**

| Layer                       | What it tests                           | Cost             | When to run                            |
| --------------------------- | --------------------------------------- | ---------------- | -------------------------------------- |
| Config + policy unit tests  | Schema validity, hook regex correctness | None (no model)  | Always — CI, pre-commit                |
| CLI integration smoke tests | Actual enforcement via `opencode run`   | Local model only | On-demand; local model must be running |

**Cloud agents excluded from integration tests** — `opencode run` with a cloud
model (Copilot, Anthropic) incurs API costs and rate limits. Tests must detect
the active model and skip if it's not a local provider.
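
One way to implement the skip is a pure predicate on the model ID's provider
prefix. The sketch below is illustrative only: `isLocalModel`,
`integrationSkipReason`, and the allowlist are hypothetical names, the prefixes
are the ones mentioned elsewhere in this document, and how the active model ID
is obtained is deliberately left out.

```typescript
// Hypothetical allowlist of provider prefixes treated as local.
// Adjust to match the providers configured in opencode.json.
const LOCAL_PROVIDER_PREFIXES = ["llama-server/", "ollama/", "lmstudio/"];

/** Returns true when the model ID belongs to an allowlisted local provider. */
function isLocalModel(modelId: string): boolean {
  return LOCAL_PROVIDER_PREFIXES.some((prefix) => modelId.startsWith(prefix));
}

/** Returns a human-readable skip reason, or null when the suite may run. */
function integrationSkipReason(modelId: string | undefined): string | null {
  if (!modelId) return "no model ID available";
  if (!isLocalModel(modelId)) return `"${modelId}" is not a local provider`;
  return null;
}
```

An allowlist fails closed: an unrecognized provider is skipped rather than run,
which keeps an unexpected cloud model from incurring API costs.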

### Open tasks

30. - [ ] **Config + policy unit test suite** — test config file structure and
          hook regex patterns without invoking any model. Implementation:

          a. **`opencode.json` schema validation**: the file references
             `"$schema": "https://opencode.ai/config.json"` — validate it using
             `ajv` (already used in the monorepo) against the live schema or a
             cached copy. Catches permission typos, unknown agent keys,
             unsupported field values.

          b. **Hook JSON structure validation**: write a schema for the hooks
             JSON format and run ajv against
             `.agents/frameworks/github/hooks.json`.
             `.agents/frameworks/opencode/plugin.ts` is TypeScript and is
             already type-checked.

          c. **Hook policy regex unit tests**: extract every regex used in
             `pre-tool-use.sh` into a `tests/hooks.test.ts` file and run it
             with `vitest`. For each policy, define 2–3 input strings that
             SHOULD match and 2–3 that SHOULD NOT. Policy 14 already has an
             informal Node.js test from this session — formalize it.

          d. **Agent `.md` frontmatter validator**: check that no agent file
             under `.agents/agents/` has frontmatter keys other than
             `description`. Catches regression when someone adds `model:` or
             `permission:` back to a body file.

          **Suggested location**: `.agents/tests/` or root `test/agents/`.
          **Stack**: vitest (already in monorepo), ajv (already available), Node
          built-ins. No new dependencies needed.

31. - [ ] **CLI integration smoke tests (local model only)** — use
          `opencode run` in non-interactive mode to verify enforcement is
          actually firing via the real runtime. These tests exercise the
          plugin + hook wiring end-to-end.

          **Command shape**:
          ```
          opencode run "prompt" --agent build-local \
            --model llama-server/arch-omni2-9b-native \
            --format json
          ```

          **Assertions via `opencode export`**: after each run, export the
          session with `opencode export <sessionID> 2>/dev/null` and parse the
          JSON transcript. Assert on `parts` array: tool calls that SHOULD have
          been blocked appear with error/denied status; tool calls that SHOULD
          have passed completed normally.

          **Test cases to start with** (all verified real enforcement gaps):
          1. Attempt to `read` a nested `package.json` (e.g.
             `apps/api/package.json`) → BLOCKED by plugin package.json guard
          2. Attempt to `read` a source file with no `limit` → BLOCKED by
             pagination guard
          3. Attempt to `read` a source file with `limit: 51` → BLOCKED
          4. Attempt to `read` a docs file with `limit: 501` → BLOCKED
          5. Attempt to `read` a docs file with `limit: 50` → PASSES
          6. Bash command `cat apps/api/package.json` → BLOCKED by pre-tool-use
             Policy 14 (substitute your project's equivalent nested package.json)

          **Guard rail**: skip all tests if `llama-server` is not reachable at
          `http://127.0.0.1:8080/v1`. Do not run against cloud models. Require
          the env var `AGENT_INTEGRATION_TESTS=1` to enable the suite (off by
          default, so it never runs in standard `npm test`).

          **Suggested location**: `.agents/tests/integration/`.
          **Stack**: Node.js test runner or vitest, `opencode` CLI in PATH.
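
The frontmatter validator from item 30(d) can be sketched as a pure function
over file contents. This is a minimal illustration, not an existing module:
`frontmatterKeys` and `disallowedFrontmatterKeys` are hypothetical names, the
parsing is a deliberately simple line scan rather than a full YAML parser, and
reading the files under `.agents/agents/` is left to the test harness.

```typescript
/** Extract top-level keys from a leading `---` frontmatter block, if any. */
function frontmatterKeys(markdown: string): string[] {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return []; // no frontmatter: prompt-body only
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return []; // unterminated block: treated as no frontmatter
  const keys: string[] = [];
  for (const line of lines.slice(1, end)) {
    // Top-level keys start at column 0; indented lines belong to nested values.
    const key = /^([A-Za-z_][\w-]*)\s*:/.exec(line)?.[1];
    if (key) keys.push(key);
  }
  return keys;
}

// Per item 30(d), `description` is the only key allowed in agent body files.
const ALLOWED_KEYS = new Set(["description"]);

/** Returns the disallowed frontmatter keys found in one agent file. */
function disallowedFrontmatterKeys(markdown: string): string[] {
  return frontmatterKeys(markdown).filter((key) => !ALLOWED_KEYS.has(key));
}
```

A vitest case would call `disallowedFrontmatterKeys` on each agent file and
expect an empty array, so a reintroduced `model:` or `permission:` key fails
with the offending key names in the output.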

### Verified facts (May 2026)

- OpenCode's `read` tool input schema is
  `{ filePath: string, limit?: number, offset?: number }` — NOT
  `startLine`/`endLine`. Confirmed via plugin debug logging of real tool calls.
- `tool.execute.before` input contains only `{ tool, sessionID, callID }`. It
  does NOT include `agent` or `model`, so plugin-layer gating cannot filter by
  agent. Confirmed via plugin debug logging.
- **OpenCode has its own native hook system** that calls `pre-tool-use.sh`
  directly for tools like `run_in_terminal`, `replace_string_in_file`, etc. This
  is completely separate from the plugin's `runHook` calls. The native hook
  payload includes `timestamp`, `hook_event_name`, `session_id`,
  `transcript_path`, `tool_use_id`, and `cwd` — fields the plugin never sends.
  The plugin `runHook` is a _second_ call, layered on top.
- **Bun shell `$` API does not have a `.stdin()` method.** The correct API for
  piping stdin is `` $`cmd < ${Buffer.from(text)}` ``. Calling `.stdin(text)`
  throws ``TypeError: $`...`.stdin is not a function``, which was caught by
  `runHook`'s `catch` block and returned `''`. As a result, the plugin's
  `runHook` silently no-opped for every call with `stdinJson` since the plugin
  was first written: hook enforcement (all 12 policies) was never running via
  the plugin path. It only ran via OpenCode's native hook system for the tools
  OpenCode natively supports. Confirmed via `/tmp/plugin-hook-errors.log`.
- **The silent `catch` in `runHook` is dangerous.** It masked the Bun `.stdin()`
  bug entirely. Always log hook failures to a debug file during development;
  remove only after enforcement is verified working.
- **Plugin-layer enforcement works for `read`** after fixing the Bun stdin API.
  The `read` tool fires `tool.execute.before` in the plugin, which calls
  `runHook('pre-tool-use.sh', ...)` via `< ${Buffer.from(...)}`, which applies
  Policy 13 (50-line limit). Verified: bare `read` (no limit) → BLOCKED; `read`
  with `limit: 50` → passes. (May 2026)
- **Plugin load failure: unescaped regex slashes caused silent syntax error.**
  `plugin-debug.jsonl` was empty even after the Bun stdin fix because the plugin
  file itself failed to parse. Line 84 had `/(^|/)(apps|packages)/[^/]+/...` —
  forward slashes inside the regex literal were not escaped, producing a JS
  syntax error at parse time. Bun silently drops plugins that fail to import.
  Fixed to `/(^|\/)(apps|packages)\/[^/]+\/...`. The fix also corrected the
  pagination guard to use `limit`/`offset` (not `startLine`/`endLine`) and added
  an unbounded-read block (`limit === undefined`). All three guards verified
  working in a live session (May 2026).
- **Package.json read guard verified working.** `local-orchestrator` attempting
  to read `apps/*/package.json` and `packages/*/package.json` → BLOCKED by
  plugin. Root `package.json` read correctly passes. (May 2026)
- **Policy 14 (`manage_todo_list` ≥ 4 items) catches some but not all broad task
  attempts.** OmniCoder sometimes proceeds directly to `Explore`/`find` without
  calling `manage_todo_list` first, bypassing the policy. When it does plan with
  the todo tool before acting, the deny fires correctly.
- **OmniCoder comprehension failure: prompt ambiguity → wrong directory.** Given
  "refactor the five hook files", OmniCoder ran a glob for `*hook*` files and
  found `.husky/` hooks instead of `.agents/hooks/`. The correct files were in
  the grep output from the Explore subagent but were not selected. Root cause:
  the model lacks enough context about the repo layout to disambiguate "hook
  files" without explicit path guidance. Mitigation: be explicit in prompts
  ("the five `.agents/hooks/*.sh` files").
- **OpenCode agent `permission` config requires a `.opencode/agents/<name>.md`
  file.** Without a matching markdown file, `opencode.json`'s
  `agent.<name>.permission` config is silently ignored — the agent is unknown to
  OpenCode and runs as a nameless build-agent alias. The markdown file must
  exist in `.opencode/agents/` (or `~/.config/opencode/agents/`). Confirmed by
  test run where `@local-orchestrator` edited files despite
  `permission.edit: "deny"` in JSON config; fixed by creating
  `.opencode/agents/local-orchestrator.md` symlink. (May 2026)
- **`"write"` is NOT a valid OpenCode permission key.** Use `"edit"` instead —
  it covers `write`, `edit`, and `apply_patch` tools. `"write": "deny"` is
  silently ignored. Valid top-level permission keys include: `read`, `edit`,
  `glob`, `grep`, `list`, `bash`, `task`, `skill`, `lsp`, `question`,
  `webfetch`, `websearch`, `external_directory`, `doom_loop`, `todowrite`.
  Confirmed from `opencode.ai/docs/permissions` (May 2026).
- **`default_agent` key is snake_case** in `opencode.json` (not `defaultAgent`).
  Confirmed from `opencode.ai/docs/config` (May 2026).
- **`tools: false` is deprecated.** The current approach for per-agent tool
  restriction is `permission: { edit: "deny" }`. The old `tools: false` still
  works but is documented as legacy. Confirmed from `opencode.ai/docs/agents`
  (May 2026).
- **Broken symlinks are silent.** OpenCode does not error on a broken
  `.opencode/agents/` symlink — it just skips the agent silently. The agent
  won't appear in `opencode agent list` and all `opencode.json` permission
  config for it is ignored. Always verify with
  `cat .opencode/agents/<name>.md | head -5` (should print content, not a "No
  such file" error) and `opencode agent list` (agent should appear with correct
  deny rules). The correct symlink depth from `.opencode/agents/` is
  `../../.agents/agents/<name>.md` (two levels), not three.
- **`opencode agent list` is the authoritative verification command.** Run it
  after any agent config change to confirm: (a) the agent appears by name, (b)
  its mode is correct (`all`/`primary`/`subagent`), and (c) `deny` rules appear
  at the bottom of its permission list. Missing agent = broken symlink or YAML
  parse error. Present but missing deny rules = frontmatter not parsed correctly
  or wrong key names. (May 2026)
- **`@mention` routing only works at session start.** If you send any message
  that gets answered by the current primary agent first, then send
  `@local-orchestrator ...`, the TUI passes the full message text to the current
  model (Build/OmniCoder) which treats `@local-orchestrator` as freeform text
  and answers it itself. Always open a **fresh session** and make `@agent-name`
  the very first message. Alternatively, use
  `opencode run --agent local-orchestrator "..."` from the CLI for reliable
  agent-scoped invocation. **Tab-switching to a custom `all`-mode agent in an
  existing session works correctly.**
- **`edit: deny` on `local-orchestrator` is working correctly.** When given an
  edit task, the orchestrator correctly avoided using `replace_string_in_file`
  and instead used the `task` tool to delegate to a subagent. This is the
  expected behaviour. Confirmed May 2026.
- **`task` tool has a JSON serialization limit.** OmniCoder 9B caused an
  `Unterminated string` error by embedding the entire contents of multiple
  `package.json` files as a literal string inside the `task` prompt JSON. The
  `task` tool prompt is serialized as JSON; very long strings truncate and
  produce parse errors. Mitigation: instruct the orchestrator in its system
  prompt to tell workers _which files to read_ rather than quoting file contents
  inline. This has been added to `local-orchestrator.md`. (May 2026)
- **`ollama/arch-omni2-9b` is the wrong model identifier for the llama-server
  instance.** The correct ID is `llama-server/arch-omni2-9b-native` (verify with
  `opencode models | grep arch`). Using the wrong ID causes an immediate "cannot
  load model" error when the agent is invoked. Fixed in `opencode.json` and
  `local-orchestrator.md` frontmatter. (May 2026)
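
The three `read` guards described above (nested `package.json` block,
unbounded-read block, line-limit check) can be summarized as one decision
function. This is a sketch under assumptions, not the plugin's code: the regex
is an illustrative stand-in for the plugin's actual pattern, the 500-line docs
limit and the `isDocsFile` classification are inferred from the smoke-test
cases, and `checkRead` is a hypothetical name.

```typescript
// Input schema as verified above: filePath plus optional limit/offset.
interface ReadInput {
  filePath: string;
  limit?: number;
  offset?: number;
}

// Illustrative pattern only; the plugin's actual regex is not reproduced here.
const NESTED_PACKAGE_JSON = /(^|\/)(apps|packages)\/[^/]+\/package\.json$/;

// 50 lines is the limit cited for source reads. The docs limit and the docs
// classification below are assumptions based on the smoke-test cases.
const SOURCE_LIMIT = 50;
const DOCS_LIMIT = 500;

function isDocsFile(filePath: string): boolean {
  return filePath.endsWith(".md") || /(^|\/)docs\//.test(filePath);
}

/** Returns a denial reason, or null when the read is allowed. */
function checkRead(input: ReadInput): string | null {
  if (NESTED_PACKAGE_JSON.test(input.filePath)) {
    return "nested package.json reads are blocked";
  }
  if (input.limit === undefined) {
    return "unbounded read: pass an explicit limit";
  }
  const max = isDocsFile(input.filePath) ? DOCS_LIMIT : SOURCE_LIMIT;
  if (input.limit > max) {
    return `limit ${input.limit} exceeds ${max} lines`;
  }
  return null;
}
```

Keeping the decision in a pure function like this would let the unit-test layer
cover the same cases as the CLI smoke tests without a running model.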

## Open Issues

Known bugs and stale claims identified during code review (see deleted
`agent-infrastructure-review.md` and `agent-infrastructure-review-pass2.md` for
full context). Items not marked "(resolved)" are not yet fixed.

### CRITICAL — `description:` empty in all generated agent/skill files

`scripts/generate-agents.ts` uses a hand-rolled YAML parser that silently drops
descriptions when they are written in block-scalar form (value on the next line
under the key). Every generated file in `.github/agents/`, `.github/skills/`,
`.opencode/agents/`, `.opencode/skills/` has a blank `description:` field.

`description:` is the primary routing signal for Copilot's
`SkillsContextComputer` and OpenCode's agent dispatch. Explicitly `@`-mentioning
an agent by name still works; description-triggered auto-routing does not.

**Fix**: Inline the description strings in the canonical `.agents/` source files
(change block-scalar to `key: 'value'` format). The existing parser handles
inline strings correctly. Add a `generate:agents:check` assertion that every
generated file has a non-empty `description:`.
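
The `generate:agents:check` assertion could be built on a small predicate like
the one below. It is a sketch, not part of `scripts/generate-agents.ts`:
`hasNonEmptyDescription` is a hypothetical name, and it assumes the frontmatter
is delimited by `---` lines with `description:` as a top-level key.

```typescript
/** True when the frontmatter carries a non-empty inline `description:` value. */
function hasNonEmptyDescription(fileContent: string): boolean {
  const lines = fileContent.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return false; // no frontmatter at all
  for (const line of lines.slice(1)) {
    if (line.trim() === "---") break; // end of frontmatter
    const match = /^description:\s*(.*)$/.exec(line);
    if (match) {
      // Strip one pair of surrounding quotes, if present.
      const value = (match[1] ?? "").trim().replace(/^['"]|['"]$/g, "");
      // A bare block-scalar indicator means the text sits on later lines,
      // which is the form the hand-rolled parser drops.
      const isBlockIndicator = /^[|>][+-]?$/.test(value);
      return value.length > 0 && !isBlockIndicator;
    }
  }
  return false; // key missing entirely
}
```

Run against the canonical `.agents/` sources, the same predicate also flags
descriptions still written in block-scalar form before they reach the generator.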

### MEDIUM — ~~`printf '%s'` regression in hooks breaks `\n` rendering~~ (resolved)

~~`.agents/hooks/post-tool-use.sh`, `session-start.sh`, and
`user-prompt-submit.sh` use `printf '%s' "$context" | node -e '...'` to
JSON-escape the context variable. `%s` does not interpret `\n` escape sequences,
so multi-line context strings (SELF-CHECK, DEBUGGING REMINDER, BFF REMINDER)
arrive at the model as single lines with literal `\n` characters.~~

**Verified fixed** (May 2026): all three hooks already use `printf '%b'`.

### LOW — ~~arXiv citation `2603.29957` unverified~~ (resolved)

~~`arXiv:2603.29957` (Jiang et al. 2026, "Think-Anywhere") appears in
`.agents/agents/research.md`, `.agents/agents/brainstorm.md`, and the Research
Foundation section above. Verify the ID resolves at
`https://arxiv.org/abs/2603.29957` and fix all references if it doesn't.~~

**Verified real** (May 2026): "Think Anywhere in Code Generation" by Xue Jiang,
Tianyu Zhang, Ge Li et al., submitted March 31, 2026, revised April 27, 2026
(v3), cs.SE. All existing citations are correct.

### LOW — ~~`.claude/` false claims in `tool-agnostic-agent-infra.md`~~ (resolved)

The file `docs/projects/tool-agnostic-agent-infra.md` no longer exists — already
deleted. No action needed.