dotfiles/.agents/docs/agent-infrastructure.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config
2026-05-22 13:13:43 -04:00

# Agent Infrastructure
Shared agent infrastructure for VS Code Copilot and OpenCode — brainstorm
agent, research agent, nudge instructions, hooks, skills, and MCP server.
Project-specific overlays live in each project's `.agents/` directory.
> **See also:**
> [`docs/research/ai-coding-best-practices.md`](../research/ai-coding-best-practices.md)
> — research synthesis covering the Prompt/Context/Harness taxonomy, failure
> modes, enforcement hierarchy, small-model harness patterns, and all
> primary-source citations that underpin the design decisions here.
## Current State
### Architecture Overview
The infrastructure is **tool-agnostic**: canonical sources live in `.agents/`
and a generator (`npm run generate:agents`) distributes them to
`.github/agents/`, `.github/skills/`, `.opencode/agents/`, `.opencode/skills/`.
Edit the `.agents/` sources; never edit the generated output directories (they
are `.gitignore`d and blocked by pre-tool-use policy).
```
.agents/
├── AGENTS.md                     # Root design doc + enforcement hierarchy
├── agents/                       # Agent definitions (canonical)
│   ├── brainstorm.md
│   ├── research.md
│   └── build-local.md            # OmniCoder 9B via Ollama
├── hooks/                        # Shared bash hooks (delegated by all harnesses)
│   ├── pre-tool-use.sh           # Hard blocks (terminal cmds + file-path policies)
│   ├── post-tool-use.sh          # Self-check counter + methodology reminders
│   ├── session-start.sh          # Inject project state at session start
│   ├── user-prompt-submit.sh     # Per-turn nudge detection + task capture
│   ├── pre-compact.sh            # Export state before context summarization
│   └── stop.sh                   # Session-end verification
└── skills/
    └── research/SKILL.md         # Research methodology (any agent can load)
```
Generated output (do not edit — regenerated by `npm run generate:agents`):
- `.github/agents/` — VS Code Copilot agent files
- `.github/skills/` — VS Code Copilot skill files
- `.opencode/agents/` — OpenCode agent files
- `.opencode/skills/` — OpenCode skill files
Harness integration:
- **VS Code Copilot**: `.github/agent-support.json` — maps 4 hook events to the
shared bash scripts in `.agents/hooks/`
- **OpenCode**: `.opencode/plugins/agent-support.ts` — TypeScript plugin that
shells out to the same bash scripts
### Brainstorm Agent
- 4-phase workflow: Quick Frame → Diverge → Converge → Capture & Hand Off
- 6 techniques: Rapid Ideation, SCAMPER, Worst Possible Idea, How Might We,
Inversion/Pre-mortem, Constraint Flipping
- Counterbalances Opus 4.6 overthinking tendency
- Phase 2 includes "push past the obvious" nudge (Zhao et al. 2024: LLMs fall
short on originality, excel at elaboration — first ideas are "average")
- Phase 4 routes to `@research` for investigation, default agent for
implementation
- Creates exploration files at `docs/explorations/<name>.md` and session memory
notes
### Research Agent
- Two orientations that compose recursively:
  - **Understand** (Grounded Theory): open coding → constant comparison → axial
    coding → memo → saturation check
  - **Diagnose** (Strong Inference + Satisficing): 5-factor triage gates between
    satisficing (low risk) and full falsification (high risk)
- 5-factor triage: reversibility, blast radius, confidence, novelty, time cost
- Timing awareness: `time` prefix on unknown commands, session/repo memory for
baselines, timing feeds into triage decisions
- Investigation files at `docs/explorations/<name>.md`
- Techniques reference: Five Whys, Delta Debugging, Rubber Duck
- Delegates evidence-gathering to Explore subagent, keeps analytical thinking
local
### Nudge Instructions
- Brainstorm nudge: triggers on hesitation/overthinking language ('wait',
'actually', 'hmm', 'overcomplicating', etc.)
- Research nudge: triggers on debugging/investigation language ('why is this
broken', 'how does this work', 'root cause', etc.)
- Both are non-intrusive single-sentence suggestions, only fire once per topic
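
The trigger logic above can be sketched as a small phrase matcher. This is a minimal illustration, not the project's actual nudge instructions: the names `detectNudge`, `TRIGGERS`, and `alreadyNudged` are hypothetical, the phrase lists contain only the examples quoted above, and "topic" is assumed to be a caller-supplied string.

```typescript
// Hypothetical sketch: names and phrase lists are illustrative only.
type Nudge = "brainstorm" | "research";

const TRIGGERS: Record<Nudge, string[]> = {
  // Hesitation / overthinking language (examples quoted in the list above).
  brainstorm: ["wait", "actually", "hmm", "overcomplicating"],
  // Debugging / investigation language.
  research: ["why is this broken", "how does this work", "root cause"],
};

// Topics that have already received a nudge, keyed as "<nudge>:<topic>".
const alreadyNudged = new Set<string>();

function detectNudge(message: string, topic: string): Nudge | null {
  const text = message.toLowerCase();
  // The multi-word research phrases are checked first because they are
  // more specific than the single-word hesitation markers.
  for (const nudge of ["research", "brainstorm"] as const) {
    const hit = TRIGGERS[nudge].some((phrase) =>
      new RegExp(`\\b${phrase}\\b`).test(text),
    );
    const key = `${nudge}:${topic}`;
    if (hit && !alreadyNudged.has(key)) {
      alreadyNudged.add(key); // fire at most once per topic
      return nudge;
    }
  }
  return null;
}
```

Word-boundary matching keeps "waiting" from triggering on "wait"; a production matcher would likely need a richer notion of topic than a plain string key.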
### Tool Mapping (Copilot ↔ OpenCode)
| Copilot | OpenCode equivalent |
| ---------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| `AGENTS.md` (root + nested) | `AGENTS.md` (root, native; nested via `instructions` glob in `opencode.json`) |
| `.github/agents/*.agent.md` | `.opencode/agents/*.md` (frontmatter: `description`, `mode`, `model`, `temperature`, `permission`) |
| `.github/skills/<name>/SKILL.md` | `.opencode/skills/<n>/SKILL.md` — also reads `.agents/skills/` and `.claude/skills/` |
| `.github/instructions/*.instructions.md` (`applyTo`) | No direct equivalent — fold into AGENTS.md stubs or `instructions` glob |
| `.github/hooks/*.sh` (JSON-configured shell) | `.opencode/plugins/*.ts` (TS modules, event-driven) — shells out via Bun's `$` |
| `runSubagent` / `Explore` agent | Built-in `general` and `explore` subagents; `@`-mention syntax |
| `vscode_askQuestions` | No equivalent — OpenCode uses agent's natural turn-taking |
OpenCode plugin event mapping:
| Copilot hook | OpenCode event |
| -------------- | ----------------------------------- |
| `SessionStart` | `session.created` |
| `PreToolUse` | `tool.execute.before` |
| `PostToolUse` | `tool.execute.after` |
| `PreCompact` | `experimental.session.compacting` |
| `Stop` | `session.idle` (closest equivalent) |
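
One way to keep this mapping in a single place is a typed lookup table. The sketch below is an assumption-laden illustration: `HOOK_TO_OPENCODE_EVENT`, `HOOK_TO_SCRIPT`, and `scriptForEvent` are hypothetical names, and the hook-to-script pairing is inferred from the file names under `.agents/hooks/`, not taken from the actual plugin.

```typescript
type CopilotHook =
  | "SessionStart"
  | "PreToolUse"
  | "PostToolUse"
  | "PreCompact"
  | "Stop";

// Mirrors the event mapping table above.
const HOOK_TO_OPENCODE_EVENT: Record<CopilotHook, string> = {
  SessionStart: "session.created",
  PreToolUse: "tool.execute.before",
  PostToolUse: "tool.execute.after",
  PreCompact: "experimental.session.compacting",
  Stop: "session.idle", // closest equivalent, not an exact match
};

// Assumed pairing, inferred from the script names in .agents/hooks/.
const HOOK_TO_SCRIPT: Record<CopilotHook, string> = {
  SessionStart: "session-start.sh",
  PreToolUse: "pre-tool-use.sh",
  PostToolUse: "post-tool-use.sh",
  PreCompact: "pre-compact.sh",
  Stop: "stop.sh",
};

// Resolve which shared script an OpenCode event should delegate to.
function scriptForEvent(event: string): string | undefined {
  const hooks = Object.keys(HOOK_TO_OPENCODE_EVENT) as CopilotHook[];
  const hook = hooks.find((h) => HOOK_TO_OPENCODE_EVENT[h] === event);
  return hook ? HOOK_TO_SCRIPT[hook] : undefined;
}
```

Events without a Copilot counterpart (for example `session.deleted`) resolve to `undefined`, so a caller can ignore them explicitly.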
## Research Foundation
> For full research depth, citations, and failure-mode analysis, see
> [`docs/research/ai-coding-best-practices.md`](../research/ai-coding-best-practices.md).
> The list below records the specific papers and frameworks that shaped the
> design decisions in this project.
Methodologies and papers that informed the design:
- **Grounded Theory** (Glaser & Strauss): build understanding from data, not
assumptions. Applied to code-reading in the Understand orientation.
- **Strong Inference** (Platt 1964): multiple competing hypotheses → crucial
experiments → eliminate. Applied to the Diagnose orientation.
- **Satisficing** (Simon 1956): accept "good enough" when optimization cost
exceeds benefit. Gates between cheap confirmation and expensive falsification.
- **Dual Process Theory** (Kahneman): System 1 (fast, pattern-matching) vs
System 2 (slow, analytical). System 1 more accurate in familiar domains.
Informs the triage decision.
- **Zhao et al. 2024** (arxiv): LLMs fall short on originality, excel at
elaboration. First ideas are "average." Informs brainstorm agent's "push past
the obvious" nudge.
- **"Lost in the Middle"** (Liu et al. 2023): LLMs attend best to beginning/end
of context. Informs hook design — inject at context tail for high attention.
- **Delta Debugging**: binary search the change space between passing/failing
cases. Logic behind `git bisect`.
- **Five Whys**: iterative causal chain tracing. Starting point for hypothesis
generation, not sole diagnostic method.
- **Ronacher "Agent Design Is Still Hard"**: reinforce methodology after every
tool call at context tail. Structural injection outperforms relying on
instructions in the system prompt.
- **Think-Anywhere** (Jiang et al. arXiv:2603.29957, Mar 2026, Peking U + Tongyi
Lab): LLMs trained to invoke `<think>` blocks at any token position during
code generation, not just upfront. SOTA on LeetCode/LiveCodeBench with fewer
total tokens. The motivating insight: a model can plan correctly at the start
but introduce an off-by-one bug mid-implementation — only mid-loop reasoning
catches it. **Applied here**: the research agent's investigation checklist
includes "Re-evaluate hypothesis at every tool-call boundary." For Claude 4
models, interleaved thinking makes this automatic. Complements Plan-and-Solve:
upfront decomposition where structure is clear, mid-execution re-evaluation
when intermediate results change what to do next.
- **Anthropic interleaved thinking** (Claude 4 + adaptive thinking): Claude
Sonnet 4.6+ and Opus 4.6+ automatically insert thinking blocks between tool
calls. No separate implementation needed — agent instruction design drives it.
The research agent's "Re-evaluate at every tool-call boundary" instruction
explicitly activates this behavior.
- **Prompt/Context/Harness framework** (Alibaba Cloud, Apr 2026): Names the
three engineering layers. Prompt = task expression (stateless). Context = what
the model sees (AGENTS.md, skills, tools — engineering target is progressive
disclosure). Harness = system constraints + verification loops (hooks,
permission gates, sub-agent isolation). Diagnostic map: wrong output → Prompt;
hallucinated fact → Context; wrong tool selected → Context (fix description);
task drift → Harness (sub-agent boundary); destructive action → Harness
(permission hook). LangChain improved Terminal Bench 2.0 from 52.8% → 66.5% by
changing Harness alone.
- **Context engineering** (Rajasekaran et al., Anthropic, Sep 2025): Formally
distinguishes context engineering from prompt engineering. Key principles: (a)
just-in-time context — agents hold references and load on demand, not upfront;
(b) structured note-taking (NOTES.md) as external working memory for long
sequential tasks; (c) every new token depletes attention budget — validates
the <60-line AGENTS.md ceiling; (d) compaction strategy: maximize recall
first, then improve precision.
## MCP Server Lifecycle Hooks — Protocol Status (May 2026)
The `.agents/mcp/` server exposes prompts and tools to agents via the MCP
protocol. A recurring question: can the MCP server react to session lifecycle
events (session start/end, tool-use boundaries)?
### Current protocol state
**No lifecycle hooks exist in the MCP protocol.** The spec defines three phases
only: `initialize → operation → shutdown`. There is no `session.created`,
`post-tool-call`, or `session.ended` notification. This gap is why session
awareness currently lives in the OpenCode plugin layer
(`.opencode/plugins/agent-support.ts`) rather than the MCP server: OpenCode
exposes `session.created`, `session.idle`, `session.compacted`,
`session.deleted`, and `tool.execute.before/after` events natively to plugins.
### Active work in the MCP spec
**SEP-2624: Interceptors for the Model Context Protocol**
([PR #2624](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2624))
The most organized effort. Supersedes SEP-1763 (closed as completed). Proposes
**Interceptors** as a new MCP primitive with two types, **validators** (inspect,
return pass/fail) and **mutators** (transform context payloads), discoverable
and invocable via `interceptors/list` and `interceptor/invoke` JSON-RPC methods.
These fire at protocol-level operation events: `tools/call`, `prompts/get`,
`resources/read`, `sampling/createMessage`, `elicitation/create`. Not
session-start/stop hooks, but before/after wrapping for every operation.
There is now a formal **Interceptors Working Group** (Bloomberg + Saxo Bank
engineers, biweekly cadence). Reference implementations in progress for Go and
C# SDKs. Experimental repo:
[modelcontextprotocol/experimental-ext-interceptors](https://github.com/modelcontextprotocol/experimental-ext-interceptors).
Charter:
[modelcontextprotocol.io/community/interceptors/charter](https://modelcontextprotocol.io/community/interceptors/charter).
**SEP-2282: Server-Declared Behavioural Hooks**
([PR #2282](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2282))
Smaller, separate open PR. Proposes that servers declare **context injections**
in `ServerCapabilities`: text injected into the agent's context at client-side
lifecycle events (session start, post-tool-use, session end). The contract is
"here's context the model should have at this moment," not code execution. More
directly analogous to our OpenCode `session.created` / `session.idle` patterns.
Currently unsponsored; needs a maintainer to pick it up.
### What to watch
- **Primary**:
[PR #2624](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2624) +
experimental-ext-interceptors repo
- **Secondary**:
[PR #2282](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2282)
(closest to session-lifecycle hooks)
- **Label filter**:
[`SEP` label](https://github.com/modelcontextprotocol/modelcontextprotocol/issues?q=label%3ASEP)
on the modelcontextprotocol repo
- **Milestone**: `2026-06-30-RC` is the next spec revision window
### Implication for this project
Until interceptors land in a shipping spec version and the TypeScript SDK, the
session lifecycle pattern stays at the OpenCode plugin layer. When SEP-2282 or
an equivalent lands, the MCP server could self-register context injection hooks
during `initialize`, removing the need for tool-specific plugin code.
---
## Model Scale Profiles
Different model sizes require different infrastructure strategies. The failure
modes are different, so the mitigations are different.
### Large-scale API models (Claude Sonnet / Opus)
**Primary failure modes**: overthinking, sycophancy, verbosity, tendency to add
unrequested features or comments.
**Infrastructure strategy**:
- Advisory methodology + structural reinforcement (hooks, circuit breakers)
- PostToolUse self-check nudges every ~15 calls
- PreToolUse hard blocks for high-risk operations
- Subagent delegation for isolated tasks (parent Opus → child Sonnet/Haiku)
### Smaller-scale local models (OmniCoder 9B via Ollama)
**Primary failure modes** (different from "low reasoning"; OmniCoder uses Qwen3
thinking blocks natively):
- Narrower training distribution (Python/JS heavy)
- Quantization degradation: JSON schema compliance drops as context fills
- Tool-call history is the primary context consumer, so responses must be
truncated aggressively
- Instruction drift: fewer attention heads (32 vs 64 in 32B) means system prompt
recall degrades faster
**Infrastructure strategy**:
- PostToolUse response truncation at ~1500 tokens (plugin layer, not bash hook)
- PreToolUse JSON validation with schema-specific error messages
- Context pressure injection at 70% fill (~22K/32K tokens)
- `steps: 20` cap + `ask` permission gates for natural checkpoints
- `explore` subagent delegation to reduce context pressure on the main agent
- `NOTES.md` working memory pattern enforced in agent body
- No `web` tool: keeps context lean
- Reasoning guidance: "Hold references; load on demand" explicit in agent body
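
The truncation and context-pressure items above can be sketched as two pure functions. This is an illustration under stated assumptions, not the plugin's code: the 4-characters-per-token estimate is a rough stand-in for a real tokenizer, and the function names and message wording are hypothetical.

```typescript
// Assumption: ~4 characters per token; a real plugin would use the
// model's tokenizer instead of this estimate.
const CHARS_PER_TOKEN = 4;
const MAX_RESPONSE_TOKENS = 1500; // "~1500 tokens" from the list above
const CONTEXT_LIMIT_TOKENS = 32768; // 32K window
const PRESSURE_RATIO = 0.7; // warn at 70% fill (~22K tokens)

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Cut an oversized tool response and say how much was dropped.
function truncateToolResponse(
  text: string,
  maxTokens: number = MAX_RESPONSE_TOKENS,
): string {
  if (estimateTokens(text) <= maxTokens) return text;
  const kept = text.slice(0, maxTokens * CHARS_PER_TOKEN);
  const dropped = estimateTokens(text) - maxTokens;
  return `${kept}\n[truncated ~${dropped} tokens; re-read with a narrower range]`;
}

// Return an advisory message once usage reaches the pressure threshold.
function contextPressureWarning(usedTokens: number): string | null {
  const threshold = Math.floor(CONTEXT_LIMIT_TOKENS * PRESSURE_RATIO);
  if (usedTokens < threshold) return null;
  const pct = Math.round((usedTokens / CONTEXT_LIMIT_TOKENS) * 100);
  return `CONTEXT PRESSURE: ~${pct}% of the window used. Save findings to NOTES.md.`;
}
```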
---
## OmniCoder 2 Orchestration — Pending Work
> Full historical rationale and audit findings were maintained in
> `docs/projects/local-ai-orchestration.md` (deleted May 2026 after merge). The
> plan used an orchestrator-workers pattern with structural `edit: deny`
> enforcement on the orchestrator. All OpenCode config values verified against
> opencode.ai/docs (May 2026).
### Goals
1. All agents run on `ollama/arch-omni2-9b`; no cloud fallback
2. User can type vague prompts; the system decomposes and delegates
automatically
3. Context windows are isolated per subagent (no shared state bleed)
4. Changes scale forward: switching to cloud means changing model strings, not
architecture
### Pending Changes
#### Quick wins — under 5 minutes each, no testing required
1. - [x] **[CRITICAL] Fix `<tool\*call>` typo in `omnicoder2.modelfile`**:
markdown-escape artifact; malformed opening tag paired with correct
closing tag. Highest-leverage change; everything below depends on
reliable tool-call JSON.
2. - [x] **Mark canonical/deprecated modelfiles**: `# CANONICAL` header on
`omnicoder2.modelfile`; `# DEPRECATED` on `omnicoder.modelfile`;
`omnicoder-v2.modelfile.template` deleted (was dead code; v2 is now
served from the HuggingFace path).
3. - [x] **Add `compaction.reserved: 3000` to `opencode.json`**: the default of
10,000 fires compaction too early given the ~8-12K baseline context.
4. - [x] **Fix `pre-compact.sh` prettier call**: removes `npx prettier`, which
violates pre-tool-use Policy 1 (a self-violating policy).
5. - [x] **MCP server error handling**: wrap `server.connect(transport)` in
try/catch with stderr + `process.exit(1)`.
#### Short session: 15-30 minutes each, bounded scope
6. - [x] **Fix `stop.sh` JSON escaping**: replace `sed`-based escaping with the
`printf '%b' | node JSON.stringify` pattern used in every other hook.
7. - [x] **Per-session PostToolUse counter**: repo-scoped path
`/tmp/.opencode-tool-count-<repo-hash>` (derived from REPO_ROOT via
md5sum); prevents cross-repo contamination; session-start.sh resets it
at session begin.
8. - [x] **Shrink compaction prompt to ~120 words** (in
`.opencode/plugins/agent-support.ts`): shorter instructions free
bandwidth for the 9B to actually summarize.
9. - [x] **Update `.agents/agents/build-local.md` for v2**: pagination 100 → 50
lines; rule 4 now says "recipient not dispatcher"; rule 7 scope-check
says "tell the user, do not self-decompose".
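
To illustrate why item 6 prefers `JSON.stringify` over `sed`-based escaping, the sketch below round-trips a string containing quotes, a newline, a tab, and backslashes. The payload field name `additionalContext` and the function `toHookOutput` are hypothetical; the real hook output shape may differ.

```typescript
// Hypothetical payload shape: the field name is illustrative only.
function toHookOutput(context: string): string {
  // JSON.stringify escapes quotes, backslashes, newlines, and control
  // characters, which hand-rolled sed substitutions tend to miss.
  return JSON.stringify({ additionalContext: context });
}

const tricky = 'Test "auth" failed:\n\tpath C:\\repo\\src';
const encoded = toHookOutput(tricky);
const decoded = JSON.parse(encoded) as { additionalContext: string };

// True when the string survives encoding and decoding unchanged.
const roundTrips = decoded.additionalContext === tricky;
```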
#### Depends on orchestrator being proven first
10. - [x] **Trim root `AGENTS.md` to ~60 lines**: reduced from 435 lines to 45
lines; all architecture rationale, code examples, quick task table,
and project context removed; cross-cutting rules and quality gate
preserved (May 2026).
11. - [x] **PostToolUse weighted counter**: reads (`read_file`, `grep`, `list`)
+0.25; writes/shell +1; keeps the 15-call SELF-CHECK from firing in the
middle of an investigation sweep. Depends on #7 (per-session counter) first.
**Implementation** (`.agents/hooks/post-tool-use.sh`): bash has no
float arithmetic, so scale to integers: reads +1, writes/shell +4,
threshold 60 (equivalent to 15 effective write-units). Read-class
tools: `read_file`, `grep_search`, `list_dir`, `file_search`,
`semantic_search`, `explore_subagent`. Write/shell-class: all
`*_string_in_file`, `create_file`, `run_in_terminal`. Replace the
single `COUNT=$((COUNT + 1))` with a `case "$TOOL_NAME"` block that
does `COUNT=$((COUNT + 1))` for reads and `COUNT=$((COUNT + 4))` for
writes/shell. Change the self-check condition from
`(( COUNT % 15 == 0 ))` to `(( COUNT % 60 == 0 ))`.
12. - [x] **PostToolUse reminder priority filter**: emit at most 2 reminders
per tool call; priority: SELF-CHECK > DEBUGGING > path-scoped >
tool-specific. Depends on #11.
**Implementation** (`.agents/hooks/post-tool-use.sh`): replace the
current single `context` string accumulator with an indexed array
`reminders=()`. Each block appends `reminders+=("$msg")` in priority
order (SELF-CHECK first, DEBUGGING second, BFF/QUALITY GATE third,
RENAME fourth). At output time: join only the first 2 elements.
Append with `\n\n` separator. Blocks that didn't fire don't append,
so the cap is natural.
13. - [x] **Broaden PostToolUse truncation to all `ollama/` agents**
(`.opencode/plugins/agent-support.ts`); differentiate limit:
orchestrator 2,500 tokens vs workers 1,500. Minor until orchestrator
exists.
**Implementation**: rename `BUILD_LOCAL_MAX_RESPONSE_TOKENS` →
`LOCAL_WORKER_MAX_TOKENS = 1500`; add
`LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500`. In `tool.execute.after`, the
existing `isLocalAgent` check covers all `ollama/` agents via
`input.model.startsWith('ollama/')`. Add a second check:
`input.agent === 'local-orchestrator'` → use orchestrator limit, else
worker limit. The `agent` field is available in `tool.execute.after`
(confirmed working for `build-local`).
14. - [x] **Create `.agents/agents/local-orchestrator.md`** — primary agent with
`edit: deny`, `write: deny`, `bash: deny`; whitelist `task` to
`build-local`, `research`, `brainstorm` only.
**Implementation**: new file modeled on `build-local.md`. Role: receive
high-level goal, decompose into bounded subtasks, show decomposition to
user before dispatching, delegate via `task` subagent. Permission
block in `opencode.json` `agent.local-orchestrator`:
`{ "edit": "deny", "write": "deny", "bash": "deny" }`. Agent body
rules: (1) read project root `AGENTS.md` first; (2) produce a task
list and confirm with user before dispatching; (3) one `task` call per
subtask, wait for result; (4) never attempt to edit files directly —
if a subtask requires context the worker needs, inject it via the
`task` prompt, not by reading files yourself; (5) after all subtasks,
report summary to user.
15. - [x] ~~**Set `default_agent: "local-orchestrator"` in `opencode.json`**~~
Done May 2026. Key is `default_agent` (snake_case, confirmed from
`opencode.ai/config.json` schema). `local-orchestrator` has
`mode: all` so it qualifies as a primary agent.
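
The weighted counter (item 11) and the two-reminder cap (item 12) can be sketched in TypeScript for readability; the actual hook is bash. Function names are hypothetical. One deliberate variation: item 11 describes a `COUNT % 60 == 0` test, but with mixed increments of 1 and 4 the count can step over an exact multiple, so this sketch fires whenever a multiple of the threshold is crossed. That is an assumption about the intended behavior, not a description of the shipped script.

```typescript
// Read-class tools cost 1 unit; everything else (writes/shell) costs 4.
const READ_TOOLS = new Set([
  "read_file",
  "grep_search",
  "list_dir",
  "file_search",
  "semantic_search",
  "explore_subagent",
]);
const SELF_CHECK_THRESHOLD = 60; // 15 effective write-units * 4

function toolWeight(toolName: string): number {
  return READ_TOOLS.has(toolName) ? 1 : 4;
}

interface CounterStep {
  count: number;
  selfCheck: boolean;
}

// Variation on the modulo test: fire when a multiple of the threshold
// is crossed, so a +4 step from 58 to 62 still triggers the check.
function advanceCounter(count: number, toolName: string): CounterStep {
  const next = count + toolWeight(toolName);
  const selfCheck =
    Math.floor(next / SELF_CHECK_THRESHOLD) >
    Math.floor(count / SELF_CHECK_THRESHOLD);
  return { count: next, selfCheck };
}

// Reminders are appended in priority order; only the first `cap` are
// emitted, joined with a blank line between them.
function joinReminders(reminders: string[], cap: number = 2): string {
  return reminders.slice(0, cap).join("\n\n");
}
```

Because blocks that did not fire never append, the cap only drops reminders when three or more conditions hold on the same call.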
#### Done
- [x] ~~**Soften `opus-deep.modelfile` directive**~~ — file deleted (May 2026);
DeepSeek R1 available online when needed; OmniCoder 2 is the sole local
model.
### Known Tradeoffs
| Tradeoff | Impact | Mitigation |
| -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| Instructions glob trimmed to root `AGENTS.md` only | Agents miss project-specific patterns for subdirectories unless they read nested `AGENTS.md` explicitly | Add reminder in orchestrator + build-local agent body: "check nested `AGENTS.md` before working in subdirectories" |
| Same model for all roles | Orchestrator, worker, compaction agent are all same weights with different prompts | Structural `edit: deny` is the safety net; circuit breakers limit runaway loops |
| No cloud fallback | If task is too complex for 9B, no escalation path | Orchestrator includes "ask the user for direction" rule; user can switch to Copilot |
| Latency | Sequential dispatch: orchestrator decomposes → build-local runs → returns. ~2× wall time vs. direct build-local | Acceptable for local dev; no VRAM multiplier since Ollama keeps weights hot |
| Reminder-stacking cap | 2-per-call priority filter (pending work above) drops lower-priority warnings | Skipped reminders fire on next call if condition holds |
### Cloud Migration Path
When ready to add a cloud model, only `opencode.json` changes:
```json
{
"model": "ollama/arch-omni2-9b",
"agent": {
"local-orchestrator": {
"model": "anthropic/claude-haiku-4-5"
}
}
}
```
Schema verified against opencode.ai/docs/agents/ (May 2026). The `tools` key
inside agent configs is deprecated in favour of `permission` — the orchestrator
definition uses `permission`, so it is current. The `agent.{name}.model` key is
the correct per-agent override mechanism.
---
## Ecosystem Gap — Contextual AGENTS.md Injection
During local AI work (May 2026) we hit a fundamental limitation: OpenCode's
`instructions` glob in `opencode.json` loads **all matched files upfront** into
every session. For a 9B local model with a 32K context window, loading all of
`apps/*/AGENTS.md` and `packages/*/AGENTS.md` at startup consumes ~30-40% of the
context budget before the first message, triggering early compaction and
degrading quality.
The correct behaviour — injecting only the AGENTS.md relevant to the file being
edited — does not exist natively in OpenCode or its plugin ecosystem. The
closest community plugin (`opencode-skillful`, 295 stars) is archived as of Feb
2026 and still requires the model to explicitly call `skill_find`/`skill_use`;
it provides no path-triggered structural injection.
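
As a rough sketch of what path-triggered injection might look like, the helper below walks up from the edited file and picks the nearest nested `AGENTS.md` that has not been injected yet. Nothing here exists in OpenCode today: `agentsFileCandidates` and `pickInjection` are hypothetical names, and the sets of existing and already-injected files are assumed to be supplied by the caller.

```typescript
// Hypothetical helper: given a repo-relative file path, list the
// AGENTS.md files that could apply, nearest directory first, ending at
// the repo root.
function agentsFileCandidates(filePath: string): string[] {
  const parts = filePath.split("/").filter((p) => p.length > 0);
  parts.pop(); // drop the file name, keep its directories
  const candidates: string[] = [];
  for (let depth = parts.length; depth >= 0; depth--) {
    const dir = parts.slice(0, depth).join("/");
    candidates.push(dir ? `${dir}/AGENTS.md` : "AGENTS.md");
  }
  return candidates;
}

// Pick the nearest nested AGENTS.md that exists and has not been
// injected yet. The root file is skipped because it is loaded natively.
function pickInjection(
  filePath: string,
  existing: Set<string>,
  alreadyInjected: Set<string>,
): string | null {
  for (const candidate of agentsFileCandidates(filePath)) {
    if (candidate === "AGENTS.md") continue;
    if (existing.has(candidate) && !alreadyInjected.has(candidate)) {
      return candidate;
    }
  }
  return null;
}
```

Tracking already-injected files keeps the same instructions from being re-added on every edit in the same directory, which matters most for the 32K-window case described above.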
### Open tasks
16. - [ ] **Assess: is filling this ecosystem gap worth the effort?** — Before
building a contextual-injection plugin, evaluate: (a) Is OpenCode
actively used for serious local AI coding work, or is the community
primarily cloud-model users for whom context cost is irrelevant? (b)
Are there better local AI coding stacks (e.g. Aider + litellm, Cursor
local mode, VS Code Copilot + Ollama) where this problem is already
solved? (c) Is the `tool.execute.before` event stable enough to build
on? Target: 30-minute research session, concrete go/no-go
recommendation.
17. - [ ] **Review + write up our issues and fixes as an ecosystem
contribution** — If the gap is worth filling: document the
context-bleed problem, the early-compaction root cause, our hook-based
mitigation, and the remaining structural gap. Publish as a GitHub
issue on the OpenCode repo and/or an npm plugin
(`opencode-contextual-rules`?) implementing `tool.execute.before`
path-triggered AGENTS.md injection. Depends on #16 go/no-go.
18. - [x] ~~**Trim `.agents/AGENTS.md`**~~ — Done May 2026. Condensed from
12,584 → 10,507 bytes (43 lines removed). Trimmed: Hook Architecture
Principle block (redirected to item 22 in project doc), Deferred
Loading example + "why not" paragraph, session-start/stop hook prose,
outdated `generate-agents.ts` references in Skills/Agents sections.
Agent body files updated to prompt-body-only convention (see items
25/26).
19. - [x] ~~**Block bash bypass of read pagination**~~ — Done May 2026. Added
Policy 14 to `pre-tool-use.sh`: blocks `cat`/`head`/`tail`/`jq` reads
of `apps/*/package.json` and `packages/*/package.json`. Scope limited
to package.json (confirmed live bypass vector); general `.ts`/`.md`
bash reads are not yet blocked (lower-urgency gap). Pattern verified
with Node.js unit test — exact bypass command
`cat apps/api/package.json | jq` is caught by P1.
20. - [ ] **Improve explore-first scope detection** — Policy 14 blocks
`manage_todo_list` with ≥4 items, but OmniCoder sometimes starts with
`Explore`/`find` before planning, bypassing the check. Options: (a)
block `explore_subagent` when the query looks like a multi-file
discovery sweep (glob patterns for source files across multiple dirs);
(b) add a pre-tool-use check on `run_in_terminal` that denies `find`
commands spanning the whole repo when the task hasn't been scoped yet;
(c) rely on the todo-list check firing when planning eventually
happens (current behavior — catches it late but still before edits
start).
21. - [x] ~~**Remove debug logging from plugin after verified cycle**~~ — Done
May 2026. Removed the full-input dump block from `tool.execute.before`
in `plugin.ts` (`/tmp/plugin-debug.jsonl` appender). Guards verified
via `opencode export` session transcript inspection — no longer need
the dump file. Hook error logger (`/tmp/plugin-hook-errors.log`) kept
as it only fires on failures, not every call.
22. - [ ] **Refactor hook scripts to be platform-agnostic** — currently
`pre-tool-use.sh` parses Copilot-specific JSON and outputs
Copilot-specific `permissionDecision` JSON. `plugin.ts` implements
duplicate guards inline rather than calling the script. This means
OpenCode and Copilot guards can drift (confirmed May 2026: Policy 14
in `pre-tool-use.sh` had no effect on OpenCode `bash` tool calls).
**Design target**: scripts accept normalized env vars (`TOOL_NAME`,
`COMMAND`, `FILE_PATH`), exit non-zero with plain-text denial reason
on stdout. Callers normalize input and translate output to their
native denial format. Tracked in `.agents/AGENTS.md` Hook Architecture
Principle section.
**Audit required first**: review all hook scripts for Copilot-specific
assumptions before refactoring.
23. - [ ] **Question-drift marker in `user-prompt-submit.sh`** — when the model
has committed to a prior position and follow-up questions are being
misread through that lens, prepend a disambiguation marker at the
prompt tail. Detected pattern: model answers "no" or "not possible" in
a prior turn → subsequent turns interpreted as defense of that
position. See §2.1 ("Position-anchored priming") in the research doc.
**Implementation**: in `user-prompt-submit.sh`, read the last N turns
of `$TRANSCRIPT_PATH` (injected by OpenCode's native hook env) and
look for a prior committed "no/impossible/can't" response within the
last 3 model turns. If detected, append to `ADDITIONAL_CONTEXT`:
`CURRENT QUESTION (answer only this — not the prior exchange): [prompt
text]`. The key is repeating the user's exact question at the tail,
after the marker, to counteract lost-in-the-middle effects. Fallback
trigger: user prompt contains "that's not what I asked" / "you're
answering the wrong question" / "I said" → always inject marker
regardless of transcript scan.
24. - [x] ~~**Review all custom agent files for local-model-specific framing**~~
— Done May 2026. `build-local.md` reframed: dropped "OmniCoder", "9B",
"Ollama", "Qwen3 thinking blocks", "32K tokens total"; replaced with
model-agnostic equivalents. `research.md` and `brainstorm.md` verified
clean — no model/provider mentions. `local-orchestrator.md` was fixed
earlier this session. All four agent body files are now
model-agnostic.
25. - [ ] **Failure-mode routing in SELF-CHECK** — when the periodic SELF-CHECK
fires in `post-tool-use.sh`, if a recent terminal failure or test
failure is also present in the same turn, classify the failure type
and inject the matched intervention rather than generic "step back."
Reference: failure-mode routing table in §3.5 of the research doc.
**Implementation**: in the SELF-CHECK block, if `context` already
contains `DEBUGGING REMINDER` (i.e., test/terminal failure co-occurred
this turn), append a classification hint:
`FAILURE TYPE HINT: If this is a test/build failure → Reflexion loop
(fix based on test output). If convention violation → grep for the
pattern and inject a canonical example. If wrong file/directory → stop
and re-read the project structure. Do not default to "try harder."`.
Low implementation cost — pure text append with a conditional on
`$context`.
26. - [x] ~~**Audit agent `.md` files for OpenCode-specific frontmatter**~~
Done May 2026. Audit result: only `local-orchestrator.md` had OpenCode
frontmatter keys (`mode`, `model`, `permission`). `brainstorm.md`,
`build-local.md`, `research.md` were already plain markdown. Went with
option (b): stripped `mode`/`model`/`permission` from
`local-orchestrator.md`; moved `mode: all` into `opencode.json`
(model + permission were already there). Kept `description` in
frontmatter as it is neutral and self-documenting. Body files are now
prompt-body only — valid in both OpenCode and Copilot.
27. - [ ] **`plugin.ts` local-agent detection uses provider prefix, not agent
name** — `tool.execute.after` detects local agents via
`input.model.startsWith('ollama/')`. This is provider-specific: if the
model is served via a different backend (e.g. `llama-server/`,
`lmstudio/`), truncation silently stops working. Fix: detect by agent
name (`input.agent.includes('build-local')`) only, removing the
`ollama/` fallback. The `input.agent` field is available in
`tool.execute.after` (confirmed May 2026).
28. - [ ] **`plugin.ts` context pressure threshold is hardcoded to 32,768
tokens** — `CONTEXT_LIMIT_TOKENS = 32768` assumes OmniCoder 9B's
context window. If the local model changes, the threshold silently
drifts out of calibration. Options: (a) read from `opencode.json`
model config if OpenCode exposes it to plugins; (b) make it a
top-of-file constant with a comment to update when changing models;
(c) accept the drift as low-severity (threshold is advisory only —
context pressure warnings are informational, not blocking). Option (b)
is the minimum; option (a) is ideal if OpenCode exposes model metadata
to plugins.
29. - [x] ~~**Move `permission` out of `local-orchestrator.md` frontmatter**~~
Done May 2026 as part of item 26. `mode: all` added to `opencode.json`
agent entry. `model` and `permission` were already in `opencode.json`.
`opencode.json` is now the single source of truth for all runtime
config; `.md` files are prompt-body only.
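
The fix proposed in item 27 could be sketched as follows. This is a minimal illustration, not the plugin's actual code: it assumes the `tool.execute.after` input exposes `agent` as a string (as recorded in that item), and the `ToolAfterInput` type and `LOCAL_AGENT_NAMES` list are hypothetical names introduced here. Matching on agent name alone means a change of backend prefix (`ollama/`, `llama-server/`, `lmstudio/`) no longer affects detection.

```typescript
// Hypothetical shape: only the field this sketch needs. The real
// `tool.execute.after` input carries more fields than this.
interface ToolAfterInput {
  agent?: string;
}

// Assumption: local agents are identified by a name fragment such as
// "build-local". Adjust the list to match your own agent names.
const LOCAL_AGENT_NAMES: readonly string[] = ["build-local"];

// Returns true when the calling agent is a local agent, independent of
// which provider prefix the model ID happens to use.
function isLocalAgent(input: ToolAfterInput): boolean {
  const agent = input.agent ?? "";
  return LOCAL_AGENT_NAMES.some((name) => agent.includes(name));
}
```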
---
## Testing & Regression
**Research summary (May 2026):** No pre-existing tool exactly fits this use
case. Existing tools (RagaAI Catalyst, AgentEvalKit, agent-eval-arena,
intent-eval-lab, j-rig-skill-binary-eval) focus on LLM output quality,
hallucination detection, or cross-runtime behavior scoring — not config file
structure or policy enforcement regression. The closest analogue is
`j-rig-skill-binary-eval` (binary pass/fail criteria across 7 layers), which
uses the same conceptual approach we'd want here. Our testing is bespoke by
necessity: we're testing configuration files, shell scripts, and specific policy
enforcement behaviors, not general LLM response quality.
**Two layers of testing:**
| Layer | What it tests | Cost | When to run |
| --------------------------- | --------------------------------------- | ---------------- | -------------------------------------- |
| Config + policy unit tests | Schema validity, hook regex correctness | None (no model) | Always — CI, pre-commit |
| CLI integration smoke tests | Actual enforcement via `opencode run` | Local model only | On-demand; local model must be running |
**Cloud agents are excluded from integration tests:** `opencode run` with a
cloud model (Copilot, Anthropic) incurs API costs and rate limits. Tests must
detect the active model and skip if it's not a local provider.
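
One way the skip check could look, as a sketch under stated assumptions: model IDs are taken to have the form `<provider>/<model>` (as in `llama-server/arch-omni2-9b-native`), the `LOCAL_PROVIDERS` set is an illustrative list rather than anything OpenCode defines, and `shouldRunIntegrationTests` is a hypothetical helper that also honours the `AGENT_INTEGRATION_TESTS=1` opt-in described in the smoke-test task.

```typescript
// Assumption: these provider prefixes denote locally served models.
const LOCAL_PROVIDERS: ReadonlySet<string> = new Set([
  "llama-server",
  "ollama",
  "lmstudio",
]);

// Returns true when the model ID has a provider prefix from the local set.
function isLocalModel(modelId: string): boolean {
  const slash = modelId.indexOf("/");
  if (slash <= 0) return false; // no provider prefix: treat as non-local
  return LOCAL_PROVIDERS.has(modelId.slice(0, slash));
}

// Hypothetical gate: the environment is passed in so the function stays
// pure and easy to unit-test. Both conditions must hold.
function shouldRunIntegrationTests(
  modelId: string,
  env: Record<string, string | undefined>,
): boolean {
  return env["AGENT_INTEGRATION_TESTS"] === "1" && isLocalModel(modelId);
}
```

A test file would call the gate once at the top and skip the whole suite when it returns `false`, so a cloud model is never invoked by accident.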
### Open tasks
30. - [ ] **Config + policy unit test suite** — test config file structure and
hook regex patterns without invoking any model. Implementation:
a. **`opencode.json` schema validation**: the file references
`"$schema": "https://opencode.ai/config.json"` — validate it using
`ajv` (already used in the monorepo) against the live schema or a
cached copy. Catches permission typos, unknown agent keys,
unsupported field values.
b. **Hook JSON structure validation**: validate
`.agents/frameworks/github/hooks.json` and
`.agents/frameworks/opencode/plugin.ts` (TypeScript, already
type-checked). Write a schema for the hooks JSON format and run ajv on
it.
c. **Hook policy regex unit tests**: extract every regex used in
`pre-tool-use.sh` into a `tests/hooks.test.ts` file and run it
with `vitest`. For each policy, define 2-3 input strings that
SHOULD match and 2-3 that SHOULD NOT. Policy 14 already has an
informal Node.js test from this session — formalize it.
d. **Agent `.md` frontmatter validator**: check that no agent file
under `.agents/agents/` has frontmatter keys other than
`description`. Catches regression when someone adds `model:` or
`permission:` back to a body file.
**Suggested location**: `.agents/tests/` or root `test/agents/`.
**Stack**: vitest (already in monorepo), ajv (already available), Node
built-ins. No new dependencies needed.
31. - [ ] **CLI integration smoke tests (local model only)** — use
`opencode run` in non-interactive mode to verify enforcement is
actually firing via the real runtime. These tests exercise the
plugin + hook wiring end-to-end.
**Command shape**:
```
opencode run "prompt" --agent build-local \
--model llama-server/arch-omni2-9b-native \
--format json
```
**Assertions via `opencode export`**: after each run, export the
session with `opencode export <sessionID> 2>/dev/null` and parse the
JSON transcript. Assert on `parts` array: tool calls that SHOULD have
been blocked appear with error/denied status; tool calls that SHOULD
have passed completed normally.
**Test cases to start with** (all verified real enforcement gaps):
1. Attempt to `read` a nested `package.json` (e.g.
`apps/api/package.json`) → BLOCKED by plugin package.json guard
2. Attempt to `read` a source file with no `limit` → BLOCKED by
pagination guard
3. Attempt to `read` a source file with `limit: 51` → BLOCKED
4. Attempt to `read` a docs file with `limit: 501` → BLOCKED
5. Attempt to `read` a docs file with `limit: 50` → PASSES
6. Bash command `cat apps/api/package.json` → BLOCKED by pre-tool-use
Policy 14 (substitute your project's equivalent nested package.json)
**Guard rail**: skip all tests if `llama-server` is not reachable at
`http://127.0.0.1:8080/v1`. Do not run against cloud models. Add
an env var `AGENT_INTEGRATION_TESTS=1` required to enable (off by
default, never runs in standard `npm test`).
**Suggested location**: `.agents/tests/integration/`.
**Stack**: Node.js test runner or vitest, `opencode` CLI in PATH.
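
The frontmatter validator from item 30(d) could be sketched as a pure function that a `vitest` case calls for each agent file. This is an illustration under assumptions: frontmatter is taken to be a leading block delimited by `---` lines with top-level `key:` entries starting in column 0, and `disallowedFrontmatterKeys` is a hypothetical name, not an existing helper in this repo.

```typescript
// Returns every top-level frontmatter key other than `description`.
// An empty result means the file is prompt-body only (or has no frontmatter).
function disallowedFrontmatterKeys(markdown: string): string[] {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return []; // no frontmatter block

  // Find the closing delimiter; an unterminated block is ignored here.
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return [];

  const bad: string[] = [];
  for (const line of lines.slice(1, end)) {
    // Only column-0 keys count; indented lines belong to a nested value.
    const key = /^([A-Za-z_][\w-]*):/.exec(line)?.[1];
    if (key !== undefined && key !== "description") bad.push(key);
  }
  return bad;
}
```

A regression such as someone re-adding `model:` or `permission:` to a body file then shows up as a non-empty array, which the test can report by key name.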
### Verified facts (May 2026)
- OpenCode's `read` tool input schema is
`{ filePath: string, limit?: number, offset?: number }` — NOT
`startLine`/`endLine`. Confirmed via plugin debug logging of real tool calls.
- `tool.execute.before` input contains only `{ tool, sessionID, callID }`. It
does NOT include `agent` or `model`, so plugin-layer gating cannot filter by
agent. Confirmed via plugin debug logging.
- **OpenCode has its own native hook system** that calls `pre-tool-use.sh`
directly for tools like `run_in_terminal`, `replace_string_in_file`, etc. This
is completely separate from the plugin's `runHook` calls. The native hook
payload includes `timestamp`, `hook_event_name`, `session_id`,
`transcript_path`, `tool_use_id`, and `cwd` — fields the plugin never sends.
The plugin `runHook` is a _second_ call, layered on top.
- **Bun shell `$` API does not have a `.stdin()` method.** The correct API for
piping stdin is `` $`cmd < ${Buffer.from(text)}` ``. Calling `.stdin(text)`
throws ``TypeError: $`...`.stdin is not a function``, which was caught by
`runHook`'s `catch` block and returned `''`. This caused the plugin's
`runHook` to silently no-op for every call with `stdinJson` since the plugin
was first written: hook enforcement (all 12 policies) was never running via
the plugin path. It only ran via OpenCode's native hook system for the tools
OpenCode natively supports. Confirmed via `/tmp/plugin-hook-errors.log`.
- **The silent `catch` in `runHook` is dangerous.** It masked the Bun `.stdin()`
bug entirely. Always log hook failures to a debug file during development;
remove only after enforcement is verified working.
- **Plugin-layer enforcement works for `read`** after fixing the Bun stdin API.
The `read` tool fires `tool.execute.before` in the plugin, which calls
`runHook('pre-tool-use.sh', ...)` via `< ${Buffer.from(...)}`, which applies
Policy 13 (50-line limit). Verified: bare `read` (no limit) → BLOCKED; `read`
with `limit: 50` → passes. (May 2026)
- **Plugin load failure: unescaped regex slashes caused silent syntax error.**
`plugin-debug.jsonl` was empty even after the Bun stdin fix because the plugin
file itself failed to parse. Line 84 had `/(^|/)(apps|packages)/[^/]+/...` —
forward slashes inside the regex literal were not escaped, producing a JS
syntax error at parse time. Bun silently drops plugins that fail to import.
Fixed to `/(^|\/)(apps|packages)\/[^/]+\/...`. The fix also corrected the
pagination guard to use `limit`/`offset` (not `startLine`/`endLine`) and added
an unbounded-read block (`limit === undefined`). All three guards verified
working in a live session (May 2026).
- **Package.json read guard verified working.** `local-orchestrator` attempting
to read `apps/*/package.json` and `packages/*/package.json` → BLOCKED by
plugin. Root `package.json` read correctly passes. (May 2026)
- **Policy 14 (`manage_todo_list` ≥ 4 items) catches some but not all broad task
attempts.** OmniCoder sometimes proceeds directly to `Explore`/`find` without
calling `manage_todo_list` first, bypassing the policy. When it does plan with
the todo tool before acting, the deny fires correctly.
- **OmniCoder comprehension failure: prompt ambiguity → wrong directory.** Given
"refactor the five hook files", OmniCoder ran a glob for `*hook*` files and
found `.husky/` hooks instead of `.agents/hooks/`. The correct files were in
the grep output from the Explore subagent but were not selected. Root cause:
the model lacks enough context about the repo layout to disambiguate "hook
files" without explicit path guidance. Mitigation: be explicit in prompts
("the five `.agents/hooks/*.sh` files").
- **OpenCode agent `permission` config requires a `.opencode/agents/<name>.md`
file.** Without a matching markdown file, `opencode.json`'s
`agent.<name>.permission` config is silently ignored — the agent is unknown to
OpenCode and runs as a nameless build-agent alias. The markdown file must
exist in `.opencode/agents/` (or `~/.config/opencode/agents/`). Confirmed by
test run where `@local-orchestrator` edited files despite
`permission.edit: "deny"` in JSON config; fixed by creating
`.opencode/agents/local-orchestrator.md` symlink. (May 2026)
- **`"write"` is NOT a valid OpenCode permission key.** Use `"edit"` instead —
it covers `write`, `edit`, and `apply_patch` tools. `"write": "deny"` is
silently ignored. Valid top-level permission keys include: `read`, `edit`,
`glob`, `grep`, `list`, `bash`, `task`, `skill`, `lsp`, `question`,
`webfetch`, `websearch`, `external_directory`, `doom_loop`, `todowrite`.
Confirmed from `opencode.ai/docs/permissions` (May 2026).
- **`default_agent` key is snake_case** in `opencode.json` (not `defaultAgent`).
Confirmed from `opencode.ai/docs/config` (May 2026).
- **`tools: false` is deprecated.** The current approach for per-agent tool
restriction is `permission: { edit: "deny" }`. The old `tools: false` still
works but is documented as legacy. Confirmed from `opencode.ai/docs/agents`
(May 2026).
- **Broken symlinks are silent.** OpenCode does not error on a broken
`.opencode/agents/` symlink — it just skips the agent silently. The agent
won't appear in `opencode agent list` and all `opencode.json` permission
config for it is ignored. Always verify with
`cat .opencode/agents/<name>.md | head -5` (should print content, not a "No
such file" error) and `opencode agent list` (agent should appear with correct
deny rules). The correct symlink depth from `.opencode/agents/` is
`../../.agents/agents/<name>.md` (two levels), not three.
- **`opencode agent list` is the authoritative verification command.** Run it
after any agent config change to confirm: (a) the agent appears by name, (b)
its mode is correct (`all`/`primary`/`subagent`), and (c) `deny` rules appear
at the bottom of its permission list. Missing agent = broken symlink or YAML
parse error. Present but missing deny rules = frontmatter not parsed correctly
or wrong key names. (May 2026)
- **`@mention` routing only works at session start.** If you send any message
that gets answered by the current primary agent first, then send
`@local-orchestrator ...`, the TUI passes the full message text to the current
model (Build/OmniCoder) which treats `@local-orchestrator` as freeform text
and answers it itself. Always open a **fresh session** and make `@agent-name`
the very first message. Alternatively, use
`opencode run --agent local-orchestrator "..."` from the CLI for reliable
agent-scoped invocation. **Tab-switching to a custom `all`-mode agent in an
existing session works correctly.**
- **`edit: deny` on `local-orchestrator` is working correctly.** When given an
edit task, the orchestrator correctly avoided using `replace_string_in_file`
and instead used the `task` tool to delegate to a subagent. This is the
expected behaviour. Confirmed May 2026.
- **`task` tool has a JSON serialization limit.** OmniCoder 9B caused an
`Unterminated string` error by embedding the entire contents of multiple
`package.json` files as a literal string inside the `task` prompt JSON. The
`task` tool prompt is serialized as JSON; very long strings truncate and
produce parse errors. Mitigation: instruct the orchestrator in its system
prompt to tell workers _which files to read_ rather than quoting file contents
inline. This has been added to `local-orchestrator.md`. (May 2026)
- **`ollama/arch-omni2-9b` is the wrong model identifier for the llama-server
instance.** The correct ID is `llama-server/arch-omni2-9b-native` (verify with
`opencode models | grep arch`). Using the wrong ID causes an immediate "cannot
load model" error when the agent is invoked. Fixed in `opencode.json` and
`local-orchestrator.md` frontmatter. (May 2026)
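
The pagination guard described in the facts above can be sketched as a pure check over the `read` tool's arguments. This is an illustration, not the code in `plugin.ts`: the 50-line source limit comes from Policy 13, the 500-line docs limit is inferred from the smoke-test cases (`limit: 501` blocked), and treating Markdown files as "docs" is an assumption made only for this example. `checkRead` and `isDocsFile` are hypothetical names.

```typescript
// Input shape of OpenCode's `read` tool, as recorded in the facts above.
interface ReadArgs {
  filePath: string;
  limit?: number;
  offset?: number;
}

const SOURCE_LIMIT = 50; // Policy 13
const DOCS_LIMIT = 500; // inferred from the smoke-test cases

// Assumption: "docs" means Markdown files. The real classification may differ.
function isDocsFile(filePath: string): boolean {
  return filePath.toLowerCase().endsWith(".md");
}

// Returns a denial reason, or null when the read is allowed.
function checkRead(args: ReadArgs): string | null {
  if (args.limit === undefined) {
    return "unbounded read: pass an explicit limit";
  }
  const max = isDocsFile(args.filePath) ? DOCS_LIMIT : SOURCE_LIMIT;
  if (args.limit > max) {
    return `limit ${args.limit} exceeds ${max} lines for ${args.filePath}`;
  }
  return null;
}
```

Keeping the decision in a function with no I/O means the same cases listed for the CLI smoke tests can also run as unit tests without a model.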
## Open Issues
Known bugs and stale claims identified during code review (see deleted
`agent-infrastructure-review.md` and `agent-infrastructure-review-pass2.md` for
full context). Entries marked "(resolved)" are kept for the record; the rest
are not yet fixed.
### CRITICAL — `description:` empty in all generated agent/skill files
`scripts/generate-agents.ts` uses a hand-rolled YAML parser that silently drops
descriptions when they are written in block-scalar form (value on the next line
under the key). Every generated file in `.github/agents/`, `.github/skills/`,
`.opencode/agents/`, `.opencode/skills/` has a blank `description:` field.
`description:` is the primary routing signal for Copilot's
`SkillsContextComputer` and OpenCode's agent dispatch. Explicitly `@`-mentioning
an agent by name still works; description-triggered auto-routing does not.
**Fix**: Inline the description strings in the canonical `.agents/` source files
(change block-scalar to `key: 'value'` format). The existing parser handles
inline strings correctly. Add a `generate:agents:check` assertion that every
generated file has a non-empty `description:`.
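
The proposed `generate:agents:check` assertion could be built on a check like the one below. It is a sketch only: `hasInlineDescription` is a hypothetical name, the function does not use the repo's own parser, and it deliberately treats a block-scalar indicator (`|` or `>`) as a failure because the fix above relies on inline values.

```typescript
// Returns true when the frontmatter contains `description:` with a
// non-empty value on the same line as the key.
function hasInlineDescription(markdown: string): boolean {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return false; // no frontmatter block

  for (const line of lines.slice(1)) {
    if (line.trim() === "---") break; // end of frontmatter
    const match = /^description:\s*(.*)$/.exec(line);
    if (match) {
      // Strip one pair of surrounding quotes so `''` counts as empty.
      const value = (match[1] ?? "").trim().replace(/^['"]|['"]$/g, "");
      // `|` or `>` (with optional modifiers) means the text is on later lines.
      const isBlockScalar = /^[|>][+-]?\d*$/.test(value);
      return value !== "" && !isBlockScalar;
    }
  }
  return false; // key not present
}
```

Running this over both the canonical `.agents/` sources and the generated files would catch the blank-description case on either side of the generator.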
### MEDIUM — ~~`printf '%s'` regression in hooks breaks `\n` rendering~~ (resolved)
~~`.agents/hooks/post-tool-use.sh`, `session-start.sh`, and
`user-prompt-submit.sh` use `printf '%s' "$context" | node -e '...'` to
JSON-escape the context variable. `%s` does not interpret `\n` escape sequences,
so multi-line context strings (SELF-CHECK, DEBUGGING REMINDER, BFF REMINDER)
arrive at the model as single lines with literal `\n` characters.~~
**Verified fixed** (May 2026): all three hooks already use `printf '%b'`.
### LOW — ~~arXiv citation `2603.29957` unverified~~ (resolved)
~~`arXiv:2603.29957` (Jiang et al. 2026, "Think-Anywhere") appears in
`.agents/agents/research.md`, `.agents/agents/brainstorm.md`, and the Research
Foundation section above. Verify the ID resolves at
`https://arxiv.org/abs/2603.29957` and fix all references if it doesn't.~~
**Verified real** (May 2026): "Think Anywhere in Code Generation" by Xue Jiang,
Tianyu Zhang, Ge Li et al., submitted March 31, 2026, revised April 27, 2026
(v3), cs.SE. All existing citations are correct.
### LOW — ~~`.claude/` false claims in `tool-agnostic-agent-infra.md`~~ (resolved)
The file `docs/projects/tool-agnostic-agent-infra.md` no longer exists — already
deleted. No action needed.