fix(plugin): guard against undefined output.output for MCP tools

MCP tools don't populate output.output in the tool.execute.after hook —
the MCP content flows through OpenCode's internal parts pipeline instead.
This caused a crash: undefined is not an object (evaluating 'text.length')
in the truncate function.
Brydon DeWitt 2026-06-06 02:11:24 -04:00
parent 14c132a4c9
commit 83f456f25b
20 changed files with 2610 additions and 544 deletions


@ -287,3 +287,42 @@ Some things cannot be unified and live in tool-specific locations:
dispatch coordinator. The `<!-- @local -->` / `<!-- @cloud -->` blocks in
`orchestrator.md` encode this distinction. See §3.4 of
[docs/research/ai-coding-best-practices.md](../docs/research/ai-coding-best-practices.md).
## Testing destructive-command blocks — NEVER use live ammunition
When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
command pattern, **never issue the real destructive command as the test input.**
The hook is the system under test — if it fails, the test destroys the host.
Use one of these methods instead, in order of preference:
1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the script
and check exit code + stderr. No agent in the loop. No real shell invocation.
Example:
```bash
echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' \
  | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"
```
The hook should exit non-zero (deny) and print the block reason. No `rm` was
ever queued.
2. **Use a sentinel path that exercises the regex but is harmless if the block
fails.** A path that obviously doesn't exist and could not possibly hold real
data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
case is a no-op on a path that does not exist (`-f` suppresses the "no such
file" error). **NEVER** use bare `/`, `/home`, `~`, `.`, `*`, or any real
path: the test must stay harmless even if the hook is broken.
3. **Never** issue the literal destructive command (`rm -rf /`,
`dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
`git push --force` to a published branch, etc.) as an agent prompt. Not even
with `--dry-run`. Not even "just to see." Not even if you're sure the hook
works. **The hook MIGHT not work. That's why you're testing it.**
This rule applies to humans writing test prompts AND to agents asked to verify
hook behavior. If you (the agent) are asked to verify a block, **refuse any
plan that involves issuing the real destructive command** and propose a
unit-test or sentinel approach instead.
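
The sentinel approach in method 2 can also be checked without a shell at all. The sketch below is a minimal illustration in Python, assuming the hook's pattern is exactly the `rm\s+-rf?\s+/` quoted above. The helper name `would_block` is hypothetical and this is not the hook's actual implementation. It only matches strings and never executes a command.

```python
import re

# Pattern quoted in the section above; assumed to be what the hook uses.
BLOCK_PATTERN = re.compile(r"rm\s+-rf?\s+/")


def would_block(command: str) -> bool:
    """Return True if the command text matches the destructive-rm pattern.

    Hypothetical helper: it inspects the string only and runs nothing.
    """
    return BLOCK_PATTERN.search(command) is not None


# Sentinel path from method 2: exercises the regex, harmless if a block fails.
SENTINEL = "rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}"

if __name__ == "__main__":
    for cmd in (SENTINEL, "npm test"):
        print(f"{cmd!r}: blocked={would_block(cmd)}")
```

A check like this confirms that the sentinel exercises the pattern before the real hook is ever invoked. It does not replace piping synthetic JSON to the hook itself, since the hook may apply additional logic beyond this one regex.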


@ -1,34 +1,42 @@
--- ---
description: description:
'Decomposes high-level goals into bounded subtasks and delegates to build, "Decomposes high-level goals into bounded subtasks and delegates to build,
research, or brainstorm. Never edits files directly.' research, or brainstorm. Delegates file edits to workers."
--- ---
# Orchestrator # Orchestrator
You decompose high-level goals into bounded subtasks and dispatch them to You decompose high-level goals into focused, bounded subtasks and dispatch them to
specialist workers. You do **not** write code or edit files — your output is a specialist workers. You write delegation plans and summarize results. Your output is a
delegation plan and a summary of results. delegation plan and a summary of results.
## Context Management
You have a limited context window, and so do your workers. Workers hit their context limit and return a summary. Reassess and break the work down further. To address context loss between phases you MUST:
1. Delegate only focused, bounded subtasks (one file, one concern, one directory at a time)
2. Ask workers to summarize, diff, or answer specific questions
3. A worker returning partial or incomplete results is incomplete. Re-delegate the missing pieces.
4. Tasks involving many files split into phases: read phase → analysis phase → synthesis phase. Each phase gets its own worker
5. Split tasks requiring >200 lines into research phase + build phase.
6. A failed phase or truncated output → STOP. Report the failure.
## Constraints ## Constraints
- **No file edits.** You cannot use editing tools (`replace_string_in_file`, - **File edits go through `build`.** Editing tools (`replace_string_in_file`,
`create_file`, etc.). If you find yourself wanting to edit a file, that's a `create_file`, etc.) route through `build`. File edits are a subtask for `build`.
subtask for `build`. - **Terminal commands go through `build`.** Build or test results go through `build`. **Exception:**
- **No shell commands.** You cannot run terminal commands. If you need a build
or test result, dispatch to `build` and ask it to report back. **Exception:**
you MAY use `run_in_terminal` to write to `/tmp/.last-user-prompt.txt` (TASK you MAY use `run_in_terminal` to write to `/tmp/.last-user-prompt.txt` (TASK
CAPTURE). This single path is exempt — the Stop hook reads it to verify every CAPTURE). This single path is exempt — the Stop hook reads it to verify every
question was answered. question was answered.
- **Delegate; don't implement.** Your only tool for task execution is `task` - **Delegate only.** Your only tool for task execution is `task`
(OpenCode) or subagent dispatch. You reason and plan; workers act. (OpenCode) or subagent dispatch. You reason and plan; workers act.
<!-- @local --> <!-- @local -->
- **NEVER read files under `apps/` or `packages/`** — this is enforced at the - **Read files under `apps/` or `packages/` through a worker.** This is enforced at the
plugin layer and will throw. Reading these auto-loads nested `AGENTS.md` files plugin layer and will throw. Reading these auto-loads nested `AGENTS.md` files
and is expensive for a small context window. If you need to know what's in a and is expensive for a small context window. Package reads go through a
package.json, source file, or anything under those directories, delegate to a worker with `task`.
worker with `task` and ask the worker to read it and report what you need. - **Root reads only.** Read top-level files (`README.md`, root
- **Root reads only.** You may read top-level files (`README.md`, root
`AGENTS.md`, root `package.json`) and files under `docs/`. Everything else goes `AGENTS.md`, root `package.json`) and files under `docs/`. Everything else goes
through a worker. through a worker.
<!-- @endlocal --> <!-- @endlocal -->
@ -38,8 +46,7 @@ through a worker.
### 1. Understand the goal ### 1. Understand the goal
Read the project root `AGENTS.md` first. Identify which areas of the codebase Read the project root `AGENTS.md` first. Identify which areas of the codebase
are involved. If the goal touches `apps/` or `packages/`, note the relevant are involved. Note the relevant package for goals touching `apps/` or `packages/` so workers know to check nested `AGENTS.md` files.
package so workers know to check nested `AGENTS.md` files.
### 2. Decompose into bounded subtasks ### 2. Decompose into bounded subtasks
@ -61,19 +68,17 @@ Plan:
Proceed? Proceed?
``` ```
Wait for explicit confirmation. Do not start dispatching speculatively. Wait for explicit confirmation before dispatching.
<!-- @local --> <!-- @local -->
### 4. Dispatch one subtask at a time ### 4. Dispatch one subtask at a time
Use `task` to dispatch each subtask to the appropriate worker. Pass all context Use `task` to dispatch each subtask to the appropriate worker. Pass all context
the worker needs in the task prompt — do not expect the worker to read shared the worker needs in the task prompt — the worker reads only what is in the prompt.
state.
**Keep task prompts short.** The `task` tool has a JSON serialization limit. **Keep task prompts short.** The `task` tool has a JSON serialization limit.
Never quote file contents or dependency lists inline in a task prompt. Instead, Tell the worker _which files to read_ and _what to do_. Example:
tell the worker _which files to read_ and _what to do_. Example:
- ❌ - ❌
`"Read package.json — here are the deps: { ... 500 lines ... }. Update README."` `"Read package.json — here are the deps: { ... 500 lines ... }. Update README."`
@ -98,8 +103,7 @@ Apply the standard plan-act-verify loop:
- Complete one subtask fully before starting the next - Complete one subtask fully before starting the next
- Run the quality gate (`npm run build:strict` or `npm test && npm run lint`) - Run the quality gate (`npm run build:strict` or `npm test && npm run lint`)
after the final edit after the final edit
- If a subtask fails twice with the same error, stop and report rather than - A subtask failing twice with the same error → STOP. Report the failure.
retrying
Workers available as slash commands if you want to hand off reasoning mode: Workers available as slash commands if you want to hand off reasoning mode:
@ -117,16 +121,14 @@ After all subtasks complete, summarize results for the user:
## When to escalate ## When to escalate
If a subtask fails twice from the same worker with the same error: A subtask failing twice from the same worker with the same error → STOP:
- Report to the user rather than retrying - Report to the user. No retry.
- State what the worker attempted and what went wrong - State what the worker attempted and what went wrong.
- Ask whether to try a different approach or switch to a different agent - Ask whether to try a different approach or switch to a different agent.
<!-- @local --> <!-- @local -->
If the overall task turns out to be beyond local model capability (reasoning A task beyond local model capability (reasoning failure, repeated hallucination) → STOP. Recommend the user switch to the default Copilot agent.
failure, repeated hallucination), recommend the user switch to the default
Copilot agent.
<!-- @endlocal --> <!-- @endlocal -->


@ -1,328 +1,184 @@
--- ---
description: "Use when investigating, debugging, diagnosing, understanding unfamiliar code, tracing behavior, root cause analysis, or systematic exploration. Use when the user says 'why is this broken', 'how does this work', 'what changed', 'trace', 'investigate', 'root cause', 'figure out', 'something\'s wrong', 'regression', or needs to build a mental model before making changes." description: "Use when investigating, debugging, diagnosing, understanding unfamiliar code, tracing behavior, root cause analysis, or systematic exploration. Use when the user says 'why is this broken', 'how does this work', 'what changed', 'trace', 'investigate', 'root cause', 'figure out', 'something's wrong', 'regression', or needs to build a mental model before making changes."
--- ---
# Research Agent # Research Agent
You are a systematic investigator. Your job is to help the user build accurate You are a systematic investigator. Build accurate understanding and diagnose
understanding of code and diagnose problems through disciplined, evidence-based problems through disciplined, evidence-based reasoning.
reasoning.
## Core Philosophy ## Core Philosophy
**Evidence over intuition. Systematic over ad-hoc. Record everything.** **Evidence over intuition. Systematic over ad-hoc. Record everything.**
You exist because LLMs naturally pattern-match from training data and latch onto LLMs pattern-match from training data and latch onto the first plausible
the first plausible explanation. Your role is to COUNTERBALANCE that tendency by explanation. Counterbalance that: require evidence before conclusions, consider
requiring evidence before conclusions, considering alternatives before alternatives before committing, record findings so they persist.
committing, and recording what you learn so it persists.
Do NOT guess when you can verify. Do NOT assume the first explanation is Verify before guessing. Record findings — they are the investigation's memory.
correct. Do NOT skip recording findings — your notes are the investigation's
memory. ## First Action
Call `load_research-methodology` via MCP to load the methodology index.
## Loading Skills
Skills are loaded via MCP tool calls, not `read_file`. This makes skills work
cross-framework (Copilot, OpenCode, Claude Code, etc.).
- `load_research-methodology` — loads the methodology index
- `load_research-setup` — loads the setup checklist
- `load_research-triage` — loads the triage table
- `load_research-execution` — loads execution rules
Load each phase's skill just in time during the investigation.
## Two Orientations ## Two Orientations
Every investigation draws from two complementary orientations. You switch Switch fluidly between them, often multiple times per chain of reasoning.
between them fluidly — often multiple times in a single chain of reasoning.
### Understand Orientation (Grounded Theory) ### Understand (Grounded Theory)
**Goal**: Build a mental model of how something works, from the code itself. Build mental models from the code, not from assumptions.
Grounded Theory's core principle applies: build understanding from the data (the 1. **Open coding** — read code, name what you see
code), not from assumptions about what the code should do. 2. **Constant comparison** — compare new observations against earlier ones
3. **Axial coding** — connect categories, trace data flows
4. **Memo** — write session notes as you go
5. **Saturation check** — stop reading when files confirm existing patterns
**Process** (iterative, not linear): Apply Understand to: "How does X work?", "What's the architecture of Y?", "Why was it
built this way?", "I need to understand this before changing it."
1. **Open coding** — Read code and name what you see. Functions, patterns, data ### Diagnose (Strong Inference + Satisficing)
flows, dependencies. Don't categorize yet — just observe and label.
2. **Constant comparison** — As you read more, compare new observations against
earlier ones. Do patterns emerge? Do earlier assumptions still hold?
3. **Axial coding** — Connect the categories. How do the pieces relate? What
calls what? What data flows where?
4. **Memo** — Write down what you're learning as you go (session memory). These
notes are for you and for anyone who picks up this investigation later.
5. **Saturation check** — Are you still finding new patterns? If the last few
files confirmed what you already knew, you've saturated — stop reading and
synthesize.
**When to use**: "How does X work?", "What's the architecture of Y?", "Why was Test multiple hypotheses, not just the most likely one. But satisfice when
it built this way?", "I need to understand this before changing it." stakes are low.
### Diagnose Orientation (Strong Inference + Satisficing) **Simple check first** — log a single statement if it answers the question.
Escalate when the result is unexpected.
**Goal**: Determine why something isn't working as expected. **Triage** — assess risk across five factors:
Strong Inference's principle: never test a single hypothesis — confirmation bias | Factor | Low Risk | High Risk |
will make you see what you expect. But Satisficing's principle: don't | ----------------- | --------------------------- | ------------------------------ |
over-invest in rigor when the stakes are low. | Reversibility | Easy to undo | Hard to reverse |
| Blast radius | One file/function | Many systems, shared state |
| Confidence | Familiar, clear evidence | Novel, ambiguous |
| Novelty | Seen this before | Never encountered |
| Time cost | Known baselines | Unknown — measure first |
**Simple check first** — before applying any methodology, ask: "Can I answer **All low risk → Satisfice**: test the most likely hypothesis, stop if confirmed.
this with a single log/print statement?" If the question is "what value does X
have here?" or "does this code path execute?" — just log and look. Only escalate
when the result is unexpected or the print doesn't answer the question.
**Triage** - if the simple check didn't resolve it, quickly assess: **Any high risk → Strong Inference**: generate 2-3 different hypotheses, design
a discriminating test, eliminate by evidence, iterate on what remains.
| Factor | Low Risk | High Risk | Apply Diagnose to: "Why does X fail?", "What changed?", "This worked yesterday",
| ----------------- | -------------------------------- | ------------------------------ | regression diagnosis, behavior verification.
| **Reversibility** | Easy to undo if wrong | Hard to reverse (data, deploy) |
| **Blast radius** | One file/function | Many systems, shared state |
| **Confidence** | Familiar pattern, clear evidence | Novel, ambiguous symptoms |
| **Novelty** | Seen this before | Never encountered |
| **Time cost** | Check timing baselines in memory | Unknown = measure first |
**Low risk (all factors) → Satisfice**:
- Test the single most likely hypothesis first
- If confirmed, you're done — move on
- This is the "run a quick test" path
**Any factor signals high risk → Strong Inference**:
- Generate 2-3 genuinely different hypotheses for the same symptom
- Design a test that discriminates between them (a test whose result differs
depending on which hypothesis is true)
- Run the discriminating test
- Eliminate hypotheses based on evidence, not preference
- Iterate with refined hypotheses on whatever remains
**When to use**: "Why does X fail?", "What changed?", "This worked yesterday",
"Is this actually slow?", regression diagnosis, behavior verification.
### Mode Switching ### Mode Switching
These orientations compose recursively. A single investigation often flows: Follow the question, not the mode:
``` ```
Understand → spot anomaly → Triage → Diagnose → need more context → Understand → ... Understand → spot anomaly → Triage → Diagnose → need context → Understand → ...
``` ```
Follow the question, not the mode. When you're understanding and hit something
unexpected, switch to diagnosis. When you're diagnosing and realize you lack
context, switch to understanding. Don't force a single mode.
## Investigation Checklist ## Investigation Checklist
**Re-evaluate at every tool-call boundary.** The root cause emerges during Re-evaluate at every tool-call boundary. Root causes emerge during investigation,
investigation, not before it. Plan-and-Solve applies to the initial framing not before it.
(divide the task into investigation steps); Think-Anywhere (Jiang et al.,
arXiv:2603.29957) applies to pivoting as evidence accumulates — intermediate
results change what to do next. For Claude 4 models, interleaved thinking makes
this automatic; consciously invoke it for other models.
Before every hypothesis cycle: Before every hypothesis cycle:
- [ ] **Hypothesis written** (one sentence: "I believe X because Y") - [ ] **Hypothesis written** — "I believe X because Y"
- [ ] **Falsification criterion written** ("if wrong, I'd expect to see \_\_\_") - [ ] **Falsification criterion written** — "if wrong, I'd expect to see ___"
- [ ] **Falsification test run BEFORE confirmation test** - [ ] **Falsification test run BEFORE confirmation test**
- [ ] **Result recorded** (ELIMINATED with reason, or CONFIRMED with evidence) - [ ] **Result recorded** — ELIMINATED with reason, or CONFIRMED with evidence
- [ ] **Hypothesis re-evaluated at this tool-call boundary**
- [ ] **All traces/instrumentation removed before next hypothesis**
## Circuit Breakers ## Circuit Breakers
Investigations can spiral. These hard stops prevent waste: 1. 5+ attempts without falsifying = STOP and report (one attempt = one hypothesis tested with a falsification criterion)
2. 3+ edits to same file without passing test = STOP and rethink (count each saved edit to the same file)
1. **5+ attempts without falsifying a hypothesis = STOP.** Report what you've 3. any untested guess = STOP and write hypothesis first (no changes without a written hypothesis and falsification criterion)
learned and what you've ruled out. Let the user decide next steps. 4. 2 failures at same abstraction level = go UP one level (same file, same module, or same layer)
2. **3+ edits to the same file without a passing test = STOP.** You're likely
fixing symptoms, not the cause. Step back and re-examine your assumptions.
3. **If you feel the urge to "just try something" = STOP.** Write the hypothesis
first. If you can't articulate what you expect to learn, you shouldn't run
the test.
4. **Two failures at the same level of abstraction = go UP one level.** The
problem may not be where you're looking.
## Context Management ## Context Management
Your methodology will degrade after ~15 tool calls. This is normal — context Methodology degrades after ~15 tool calls — normal, not a failure. Counteract:
competition causes tactical details to crowd out strategic instructions. It's a
known phenomenon, not a personal failure. Counteract it:
- **Re-read your investigation file and dead-ends every ~10 tool calls** to - Re-read investigation file and dead-ends every ~10 tool calls
avoid re-testing eliminated hypotheses - On drift toward guess-and-check, pause. Re-read notes, re-engage.
- **If you feel yourself drifting toward guess-and-check**, that's the signal — - Create or update the investigation file in long sessions
pause, re-read your notes, and re-engage the methodology - Hold references; load on demand. Context is a finite budget.
- **When a session gets long**, create or update the investigation file so a
fresh context can continue with your findings intact
- **Hold references; load on demand.** Do not read files you don't need yet.
Context is a finite budget with diminishing returns.
## Timing Awareness ## Timing Awareness
Agent context windows have no natural sense of how long commands take. This Agent context windows lack time perception. Measure before committing:
creates a blind spot — you might suggest "just run the full test suite" without
knowing if that's 2 seconds or 5 minutes.
### Capture - Prefix diagnostic commands with `time` when no baseline exists: `time npm test`
- Capture output to `/tmp/<descriptive_name>.txt` for later grep
**Always prefix diagnostic terminal commands with `time`** when you don't have a - Record in `/memories/session/timings.md` (current session) and
recorded baseline for that command type in this project. `/memories/repo/timings.md` (stabilized baselines)
- **<5s**: run freely. **>30s**: read/reason first. **Unknown**: measure first.
```bash
time npm test
time npm run lint
time npm run build
```
Once you know the baseline, drop the `time` prefix for commands you run
repeatedly.
**Capture output to temp files** for commands that produce significant output,
so you can grep later without re-running:
```bash
time npm test 2>&1 | tee /tmp/test_output.txt
grep -i "error\|fail" /tmp/test_output.txt
```
Name temp files descriptively: `/tmp/build_main.txt`, `/tmp/test_core.txt`,
`/tmp/lint_output.txt`.
### Record
**Session memory** (`/memories/session/timings.md`): Raw observations from the
current investigation. Quick and disposable.
```markdown
## Timings observed
- `npm test` — 47s
- `npm run lint` — 8s
- single test file — ~3s
```
**Repo memory** (`/memories/repo/timings.md`): Stabilized baselines useful
across sessions. Update when:
- No baseline exists yet for a command type
- A session observation meaningfully differs from the recorded baseline
- A new command type is discovered
### Use
Timing knowledge feeds into triage and mode switching:
- **Fast command (<5s)**: Low barrier to "just run it" — satisficing is nearly
free
- **Slow command (>30s)**: Prefer reading/reasoning first unless confidence is
low
- **Unknown timing**: Measure first before committing to a test-heavy strategy
## Investigation Files ## Investigation Files
For non-trivial investigations (anything that spans more than a few exchanges), Create tracking files for non-trivial investigations so findings persist.
create a tracking file so findings persist and others can pick up the work.
**Location**: `docs/explorations/<name>.md` Location: `docs/explorations/<name>.md`
```markdown
# Investigation: <Title>
**Status**: investigating | diagnosed | resolved | abandoned **Orientation**:
understand | diagnose | mixed **Created**: <date> **Last Updated**: <date>
## Question
<What are we trying to understand or fix? One or two sentences.>
## What We Know
<Confirmed facts. Evidence-backed only. Update as investigation progresses.>
## Hypotheses
- **[timestamp] Hypothesis:** [one sentence: "I believe X because Y"]
**Falsification:** [what you'd expect if wrong] **Result:**
[TESTING/ELIMINATED/CONFIRMED] — [why, in one sentence]
## Investigation Log
### <date><brief title>
- Orientation: understand | diagnose
- What was examined/tested:
- What was found:
- What this means:
- Next step:
## Timing Notes
<Any notable timing observations from this investigation.>
## Open Questions
- <Things we still need to figure out>
```
## Session Memory ## Session Memory
For every investigation, create or update a session memory note: Create or update `/memories/session/research-<topic>.md` for every investigation:
**`/memories/session/research-<topic>.md`** - Question being investigated
Include:
- The question being investigated
- Key findings so far - Key findings so far
- Current hypotheses and their status - Current hypotheses and their status
- What's been ruled out and why - What has been ruled out and why
This ensures subagents or fresh conversations can pick up where you left off This ensures subagents or fresh conversations continue without re-reading.
without re-reading the entire codebase.
## Delegation Rules ## Delegation Rules
**You direct the investigation. Subagents gather specific evidence.** You direct the investigation. Subagents gather specific evidence.
Use the Explore subagent for bounded fact-finding: Use Explore for bounded fact-finding: "Find all callers of `functionName`",
"Check middleware before this route", "List files importing `@cantrips/remnant-core`".
- "Find all callers of `functionName` in the codebase" You form hypotheses, interpret evidence, decide next steps. Subagents retrieve
- "Check what middleware runs before this route handler"
- "List all files that import from `@cantrips/remnant-core`"
Do NOT delegate analytical thinking to subagents. You form the hypotheses, you
interpret the evidence, you decide what to investigate next. Subagents retrieve
facts. facts.
## Token Discipline ## Token Discipline
Investigations can consume enormous context. Guard against this: 1. Delegate bulk reading to Explore
2. Record findings in session memory — notes survive context limits
1. **Delegate bulk reading to Explore** — don't read 20 files yourself 3. Stop and create the investigation file in long investigations
2. **Record findings in session memory** — your notes survive context limits 4. Prefer targeted reads — read the specific function, not the whole file
3. **If an investigation is going long**, stop and create the investigation file 5. Use timing data to avoid wasting tokens on slow commands
so a fresh context can continue with your findings intact
4. **Prefer targeted reads** — read the specific function, not the whole file
5. **Use timing data** to avoid wasting tokens waiting on slow commands
## Techniques Reference ## Techniques Reference
### Five Whys (use within Diagnose) ### Five Whys (within Diagnose)
Trace causal chains by asking "why?" iteratively. Useful for symptoms with Trace causal chains iteratively. A starting point for hypothesis generation, not
non-obvious root causes. But be aware of its limitations — it tends toward the sole diagnostic method. Limitations: tends toward single causes, bounded by
single causes and can't go beyond your current knowledge. Use it as a _starting current knowledge.
point_ for hypothesis generation, not as the sole diagnostic method.
### Delta Debugging (use within Diagnose) ### Delta Debugging (within Diagnose)
When you have a failing case and a passing case, systematically narrow the Narrow the difference between a failing and passing case. Binary search the
difference. Binary search the change space. This is the logic behind change space. The logic behind `git bisect` — most efficient for "it used to
`git bisect` and is the most efficient approach when the problem is "it used to work" problems.
work."
### Rubber Duck (use within Understand) ### Rubber Duck (within Understand)
When stuck, explain the system step by step in writing. The act of articulating Explain the system step by step in writing. Articulating forces confrontation
forces you to confront gaps in your understanding. Your session memory notes with gaps in understanding. Session memory notes serve this purpose.
serve this purpose — writing them IS the rubber duck process.
## What You Are NOT ## Boundaries
- You are NOT a brainstorming agent. Don't generate loose ideas — investigate. You investigate: gather evidence, form hypotheses, test them, report findings.
- You are NOT an implementation agent. Don't write production code. Hand off implementation, brainstorming, and planning to other agents.
- You are NOT a planning agent. Don't create detailed project plans.
You are a detective. You gather evidence, form hypotheses, test them, and report
findings. Then you hand off to whoever acts on those findings.


@ -740,6 +740,166 @@ What works, in descending order of effectiveness:
What does **not** work: negative constraints ("do not read all files"), repeated What does **not** work: negative constraints ("do not read all files"), repeated
reminders (degrade quickly), or soft caps embedded in the prompt. reminders (degrade quickly), or soft caps embedded in the prompt.
### 4.6a Conditional vs Imperative Prompt Design
> **Status:** Research synthesis. Captures an empirical finding from agent
> prompt analysis and its implications for prompt design.
>
> **Audience:** Engineers designing agent system prompts, AGENTS.md files,
> hook scripts, and enforcement layers.
---
#### The Problem: Conditional Steps Let Models Skip
A 328-line research agent prompt was analyzed for structural patterns and found
to be **60% conditional** — the majority of its instructions took the form
"when X, do Y." The downstream consequence: the model routinely exercised
discretion to decide X didn't apply, silently skipping entire sections of the
prompt. The agent was not failing to follow instructions; it was following
conditional instructions by choosing the branch that required less work.
This is not a model bug — it is a prompt design failure. Conditional steps hand
the model a discretionary on-ramp to skip compliance. The model's optimization
function is "complete the user's task efficiently," not "follow every step of
the prompt verbatim." When a step says "when X, do Y," the model's first
question is "does X hold?" — and it has strong incentives to answer "no."
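
The text does not say how the 60% figure was measured. As a rough sketch of one way to estimate such a ratio, the Python below counts instruction lines that open with a conditional word. The opener list, the definition of an "instruction line", and the function name `conditional_ratio` are all assumptions made for illustration, not the method used in the analysis.

```python
import re

# Assumed heuristic: a line is "conditional" if, after any list marker,
# it opens with one of these words. The list is illustrative, not exhaustive.
CONDITIONAL_OPENER = re.compile(r"^(when|if|unless|once|whenever)\b", re.IGNORECASE)
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+\.)\s+")


def conditional_ratio(prompt: str) -> float:
    """Fraction of non-blank, non-heading lines that open with a conditional."""
    lines = []
    for raw in prompt.splitlines():
        # Drop a leading list marker, then leading emphasis or quote characters.
        text = LIST_MARKER.sub("", raw).strip().lstrip("*_\"'")
        if text and not text.startswith("#"):
            lines.append(text)
    if not lines:
        return 0.0
    hits = sum(1 for text in lines if CONDITIONAL_OPENER.match(text))
    return hits / len(lines)
```

Counting per line is crude: a conditional that starts mid-line or wraps across lines is missed, so the result is best read as a lower-bound estimate that flags prompts worth reviewing by hand.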
---
#### Conditional vs Imperative: The Contrast
**Conditional pattern (fragile):**
> "When you encounter a test failure, first read the failing test, then check
> the relevant source file."
What happens: the model declares "I already know what's wrong" and skips
straight to editing. X = "encounter a test failure" is interpreted narrowly —
the model has encountered the *error output*, not the *test file*, so the
condition is not met.
**Imperative pattern (robust):**
> "Read the failing test. Then check the relevant source file."
What happens: the model reads the test before any other action. There is no
condition to evaluate, no discretion to exercise.
The difference is structural, not semantic. Both express the same intent; only
the imperative form removes the model's ability to opt out.
---
#### Why Conditionals Fail
Three mechanisms operate simultaneously:
1. **Discretion by design.** A conditional step contains a gate ("when X") that
the model must evaluate. Evaluation requires judgment, and judgment is
exercised toward the path of least effort. The model is not being lazy; it is
optimizing for task completion, not process compliance.
2. **Narrow interpretation of conditions.** The model interprets conditionals
narrowly to justify skipping them. "When you encounter a test failure" means
"when you have the test file open," not "when the test output is in context."
The condition becomes a self-fulfilling prophecy: the step is skipped because
the condition is defined to require the step's output.
3. **Efficiency optimization over process compliance.** The model's training
objective is to produce useful outputs, not to follow process. A conditional
step gives the model a legitimate-sounding rationale for skipping a step it
judges unnecessary — and the model is usually right that the step is
unnecessary for that specific case, which reinforces the skipping behavior.
---
#### The Fix
Three complementary strategies, ordered by reliability:
**1. Make instructions imperative.**
Replace every "when X, do Y" with "do Y." The model executes the step regardless
of its judgment about whether it's needed. This is the single highest-leverage
change to an agent prompt — converting conditionals to imperatives reduces
skipped steps dramatically.
Example transformation:
| Before (conditional) | After (imperative) |
| --------------------------------------------------- | ----------------------------------------- |
| "When editing a use case, check for `throw`" | "Check for `throw` before editing a use case" |
| "If the build fails, read the error first" | "Read the build error before any edit" |
| "When you see a TODO, resolve it" | "Resolve every TODO you encounter" |
| "If the test output mentions a file, read that file" | "Read the file mentioned in the test output" |
**2. Move genuine conditions to PreToolUse hooks.**
Some constraints are genuinely conditional — "block `npx` but allow `npm`" —
and conditional logic in the prompt is the wrong place for them. PreToolUse
hooks are structural enforcement: they fire on every tool call, evaluate the
condition deterministically, and deny before the model can opt out. The
condition is still evaluated, but the evaluation is in code, not in the model's
discretion.
This maps directly to the enforcement hierarchy (§3.6): **must-do constraints
belong in hooks** where they are structural and inescapable; **should-do
process steps belong imperative in the prompt** where the model has no
discretion to skip them.
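As a minimal sketch of "the condition is evaluated in code", the decision function below denies `npx` and allows `npm`. The input shape follows the synthetic hook-input JSON used elsewhere in this repository (`tool_name`, `tool_input.command`); the names `evaluateToolCall`, `ToolCallInput`, and `HookVerdict` are hypothetical, and a real hook would still need to read stdin and translate the verdict into an exit code.

```typescript
// Shape mirrors the synthetic hook-input JSON used to unit-test hooks;
// fields beyond tool_name and tool_input.command are out of scope here.
interface ToolCallInput {
  tool_name: string;
  tool_input: { command?: string };
}

interface HookVerdict {
  allow: boolean;
  reason: string;
}

// Tool names that carry a shell command (both appear in this document).
const SHELL_TOOL_NAMES: string[] = ["bash", "run_in_terminal"];

// Deny shell commands that contain npx as a standalone word; allow npm.
function evaluateToolCall(input: ToolCallInput): HookVerdict {
  if (SHELL_TOOL_NAMES.indexOf(input.tool_name) === -1) {
    return { allow: true, reason: "not a shell tool" };
  }
  const command: string = input.tool_input.command ?? "";
  // npx at the start, after whitespace, or after a shell separator.
  const invokesNpx: boolean = /(^|[;&|]\s*|\s)npx(\s|$)/.test(command);
  return invokesNpx
    ? { allow: false, reason: "npx is blocked; use an npm script instead" }
    : { allow: true, reason: "command permitted" };
}
```

The model never sees the branch: the same input always produces the same verdict, which is the property the prompt-level conditional lacks.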
**3. Add commit phrases ("Say STEP 1 DONE").**
For multi-step processes where the model must acknowledge completion of each
step before proceeding, add explicit acknowledgment phrases. The pattern:
> "Read the failing test. Say TEST READ DONE. Then check the relevant source
> file. Say SOURCE READ DONE."
Why this works: the acknowledgment phrase creates a visible boundary. The model
cannot skip the preceding step without producing the acknowledgment, and the
acknowledgment itself is a token cost the model has no incentive to avoid. This
is a lightweight form of chain-of-thought verification that doesn't rely on
self-critique (which Huang et al. show is unreliable).
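Because the acknowledgment is visible text, a harness can check for it. The sketch below assumes the harness has the model's text output available as a single string; `firstMissingCommitPhrase` is a hypothetical name.

```typescript
// Returns the first acknowledgment phrase that is missing or out of order,
// or null when every phrase appears in the expected sequence.
function firstMissingCommitPhrase(
  transcript: string,
  expectedPhrases: string[]
): string | null {
  let searchFrom: number = 0;
  for (const phrase of expectedPhrases) {
    const foundAt: number = transcript.indexOf(phrase, searchFrom);
    if (foundAt === -1) return phrase;
    // Later phrases must appear after this one.
    searchFrom = foundAt + phrase.length;
  }
  return null;
}
```

This only confirms that the phrase was produced in order. Confirming that the step itself ran still requires the tool-call log.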
---
#### Tie to the Enforcement Hierarchy
The enforcement hierarchy from §3.6 provides the decision rule for where
conditional logic belongs:
```
Permission-layer denial ← Tool not available. No discretion.
PreToolUse hard block ← Structural. Condition evaluated in code.
PostToolUse path-check ← Fires after the action. Context tail.
Nested AGENTS.md at path ← Always-on for scope. No condition evaluation.
Stop / SessionStart inject ← Broad reminders. Degrades under context pressure.
Root AGENTS.md sections ← Context-start only. Degraded by lost-in-the-middle.
```
Conditional instructions in the prompt occupy the weakest position in this
hierarchy: they sit in the root AGENTS.md, fire once at session start, and
require the model to evaluate a condition — exactly the setup for
lost-in-the-middle degradation combined with discretionary skipping.
**The decision rule:**
- If the constraint **must hold** regardless of model judgment (no `npx`, no
`throw`, no edits to generated files), it belongs in a hook — PreToolUse or
permission-layer denial. The condition is evaluated in code, not by the model.
- If the constraint is a **process step** that should always execute (read the
test, check for `throw`, resolve TODOs), it belongs imperative in the prompt —
no condition, no discretion.
- If the constraint is a **recommendation** that depends on context (use BFF
pattern for client pages), it belongs in a PostToolUse path-check — fires at
the right moment, in the high-attention context tail, scoped to the relevant
path.
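The rule above is small enough to write down as a lookup. This is only a restatement of the three bullets in code form; the type and function names are hypothetical.

```typescript
type ConstraintKind = "must-hold" | "process-step" | "recommendation";

// One enforcement location per constraint kind, as stated in the rule above.
const ENFORCEMENT_LAYER: Record<ConstraintKind, string> = {
  "must-hold": "PreToolUse hook or permission-layer denial",
  "process-step": "imperative instruction in the prompt",
  "recommendation": "PostToolUse path-check",
};

// Examples from the text: "no npx" is must-hold, "read the test" is a
// process step, "use BFF pattern for client pages" is a recommendation.
function placeConstraint(kind: ConstraintKind): string {
  return ENFORCEMENT_LAYER[kind];
}
```

Note that no kind maps to "conditional instruction in the prompt".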
Conditionals in prompts are a design smell. They indicate the author is trying
to use the weakest enforcement mechanism for a constraint that should live in a
stronger layer.
### 4.7 Compaction strategy
The Anthropic guidance, replicated independently elsewhere: **first maximize
@ -1227,6 +1387,306 @@ Do not begin with filler phrases like 'Okay, let me...' or 'The user
wants...'."_ — measurably trims reasoning length without affecting reasoning
quality. The win compounds on a 32k context.
# 20–30B Model Class: The Practical Sweet Spot
> **Status:** Operational reference, not a survey. Captures what has been
> observed running 20–30B models as local agent drivers through mid-2026.
>
> **Audience:** Engineers deploying local agentic harnesses who need concrete
> failure modes and countermeasures for the 20–30B class, not first-time
> quantization users.
>
> **Self-evaluation:** This document is opinionated and deliberately concrete;
> model-specific claims are date-stamped because they age within months.
---
## 1. The 20–30B Class Defined
Models in the 20–30B parameter range (**Qwen3-32B-dense**, **Qwopus3.6-27B**,
**GLM-4-32B**) occupy a distinct position in the local deployment landscape. They
are large enough to hold meaningful instruction context and tool-call fidelity
without collapsing under quantization, yet small enough to run on consumer
hardware (single 24GB GPU at Q4, or dual-GPU setups with headroom). This class
has failure modes that are shared **neither** by frontier models **nor** by
sub-14B models.
| Dimension | Sub-14B class | 20–30B class | Frontier (≥200B) |
| --- | --- | --- | --- |
| **Instruction drift** | Immediate (4–8 turns) | Delayed (10–15 turns) | Resistant |
| **Plan invention** | Poor (hallucinates steps) | Unreliable (skips, invents) | Strong |
| **Tool-call fidelity** | Breaks under load | Degrades gradually | Robust |
| **Context budget** | Collapses early | Degrades gradually | Stretches far |
| **VRAM at Q4** | ≤12 GB | ≤24 GB | Not feasible |
The 20–30B class is **not frontier** and **not small**. It sits between two
established playbooks, and applying either playbook produces suboptimal results.
---
## 2. Failure Modes
### 2.1 Instruction Drift at Tool Call 10–15
The defining characteristic of this class is that it **starts strong and degrades
predictably**. A 27B model loaded with a 2k-token system prompt will follow all
rules faithfully for roughly 10–15 tool calls, then rules begin to drop. Not
catastrophically (as sub-14B models do at turn 4), but enough to produce
drift: the model stops checking lint before committing, stops writing to
NOTES.md, stops using `read` before `edit`.
**Mechanism.** The system prompt sits at the head of the context. By tool call
10–15, the accumulated conversation has pushed it far from the high-attention
tail, into a zone where recall is graded, not binary. The model hasn't "forgotten"
the rules; it is attending to them less than to the immediate conversation
tail.
**What works:**
- **Periodic system-prompt echo every 8–10 calls** via `PostToolUse` hook
injection. A compressed version of the most-critical rules (3–5 bullets)
reappears at the context tail, restoring attention to constraints before
drift sets in. This is the single most impactful harness change for this
class: it reduces drift-related errors by roughly an order of magnitude in
observed sessions.
- **Tail-positioned critical rules.** Place the few rules that matter most
(e.g., "read before edit", "run lint before commit") at the _end_ of the
system prompt, not the beginning. The tail survives longer.
**What does not work:** negative constraints ("DO NOT forget to check lint"),
repeated reminders in the user prompt (they degrade after 2–3 repetitions),
or asking the model to "re-read the instructions" (it won't).
### 2.2 Plan-Invention Failure
When asked to invent a multi-step plan from scratch, 20–30B models frequently
produce plans that are **structurally incomplete** (missing dependency edges),
**overconfident** (assuming APIs exist without checking), or **hallucinatory**
(inventing intermediate steps that serve no purpose). This is the class's
hardest intrinsic limitation, because plan generation is the single most demanding
reasoning task an agent must perform.
**What works:**
- **Blueprint injection.** Instead of asking the model to invent a plan, inject
a structured blueprint at the prompt tail. A blueprint is a task-type-keyed
skeleton: "debug → read error → locate source → read file → hypothesize →
verify → fix → test." The model fills in the slots rather than inventing the
structure. This maps directly to the blueprint-guided execution pattern
(Han et al., [arXiv:2506.08669](https://arxiv.org/abs/2506.08669)).
- **Exploration subagent with blueprint handoff.** A larger orchestrator model
(or even the same model in a fresh context with higher `num_predict`) generates
the blueprint; the 20–30B model executes it. The context firewall between
subagents means the execution agent never sees the planning mess.
**What does not work:** asking the model to "think step by step" before acting
— this just produces a long chain that still misses the dependency.
### 2.3 Long CoT Degradation
Hassid et al. ([arXiv:2505.17813](https://arxiv.org/abs/2505.17813),
"Don't Overthink it") directly tested chain-of-thought length within a single
question and found that **the shortest chains are up to 34.5% more accurate than
the longest**. This effect is pronounced at the 20–30B scale: extended thinking
tokens do not accumulate reasoning; they accumulate noise. The model begins
repeating itself, inventing irrelevant intermediate steps, or drifting into
explanation mode rather than planning mode.
**What works:**
- **Cap reasoning-trace lengths** at inference time (`num_predict` on `<think>`
blocks). A practical cap for 20–30B models is 800–1200 thinking tokens per
call: enough for a plan, not enough for a treatise.
- **Short-m@k with ≤3 chains.** Generate `k` reasoning chains in parallel,
halt when the first `m` finish, take majority vote. At 20–30B, three chains
is the practical ceiling; more chains eat VRAM without accuracy gain.
Short chains with majority voting match or beat one long chain on
accuracy while using fewer total thinking tokens.
**What does not work:** budget forcing (extending a single chain to consume a
fixed token budget). Budget forcing is a frontier-model technique; at 20–30B it
produces verbose, less-accurate chains.
### 2.4 The "Not Frontier, Not Small" Gap
The 20–30B class falls between two established deployment playbooks:
- **Frontier playbooks** assume robust tool-call fidelity, strong plan invention,
and deep context. A 20–30B model cannot sustain these assumptions past turn 10.
- **Small-model playbooks** assume immediate instruction collapse, severe
hallucination, and subagent-only deployment. A 20–30B model is far more
capable than these playbooks allow for.
Applying frontier patterns (long sessions, deep reasoning, no scaffolding) to
20–30B models produces gradual failure. Applying small-model patterns (extreme
task slicing, no primary-agent role) wastes the model's actual capability.
---
## 3. Harness Patterns
### 3.1 Periodic System-Prompt Echo (every 8–10 calls)
**Mechanism.** A `PostToolUse` hook counts tool calls and injects a compressed
rules reminder at the context tail every 8–10 calls. The reminder is 3–5
bullets covering the most-critical constraints:
```
[HOOK INJECTION: post-tool-use] System reminder:
- Read a file before editing it
- Run lint before committing
- Write findings to NOTES.md after each step
```
**Why it works.** The tail of the context is the high-attention zone (Liu et al.,
[arXiv:2307.03172](https://arxiv.org/abs/2307.03172)). Re-injecting rules at the
tail restores attention to constraints before drift sets in. The original system
prompt at the head is still there — this is not a replacement, it's a reinforcement.
**Implementation note.** The hook must be terse. A 200-token reminder every 8
calls is injected 12 times in a 100-call session, adding about 2,400 tokens:
manageable. A 500-token reminder adds about 6,000 tokens over the same session,
which is not.
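The counting and the overhead arithmetic can be sketched in a few lines. This is a minimal sketch, assuming the hook keeps a per-session call counter; how that counter is persisted between hook invocations is not shown, and all three function names are hypothetical.

```typescript
// Number of reminder injections in a session: one after every
// `intervalCalls` tool calls.
function reminderInjections(totalCalls: number, intervalCalls: number): number {
  return Math.floor(totalCalls / intervalCalls);
}

// Total tokens the reminders add to the context over a session.
function reminderOverheadTokens(
  totalCalls: number,
  intervalCalls: number,
  reminderTokens: number
): number {
  return reminderInjections(totalCalls, intervalCalls) * reminderTokens;
}

// True when the hook should inject the reminder after this tool call.
function shouldEchoRules(callCount: number, intervalCalls: number): boolean {
  return callCount > 0 && callCount % intervalCalls === 0;
}
```

For a 100-call session with an interval of 8, `reminderInjections` gives 12, so a 200-token reminder costs 2,400 tokens and a 500-token reminder costs 6,000.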
### 3.2 Blueprint Injection
**Mechanism.** When the orchestrator classifies the task type, inject a
structured blueprint at the prompt tail. The blueprint is a task-type-keyed
skeleton, not a plan for this specific task. The model fills in the slots:
```
## Task Blueprint: Debug
1. Read the error message
2. Locate the source file
3. Read the relevant section
4. Form a hypothesis
5. Verify with a targeted read or test
6. Apply a minimal fix
7. Run the build / test
```
**Why it works.** Plan invention is the 20–30B class's weakest reasoning mode.
Blueprints replace invention with execution, the model's strong suit. Han et
al. ([arXiv:2506.08669](https://arxiv.org/abs/2506.08669)) show this pattern
improves accuracy on GSM8K, MBPP, and BBH with no additional training.
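A task-type-keyed lookup is enough to implement the injection. In the sketch below only the "debug" steps come from this document; the registry shape and the names `TASK_BLUEPRINTS` and `injectBlueprint` are assumptions for illustration.

```typescript
// Task-type-keyed skeletons. Only the "debug" steps come from the text
// above; the registry shape itself is an illustrative assumption.
const TASK_BLUEPRINTS: Record<string, string[]> = {
  debug: [
    "Read the error message",
    "Locate the source file",
    "Read the relevant section",
    "Form a hypothesis",
    "Verify with a targeted read or test",
    "Apply a minimal fix",
    "Run the build / test",
  ],
};

// Appends the blueprint for `taskType` at the prompt tail. Unknown task
// types return the prompt unchanged rather than inventing a plan.
function injectBlueprint(prompt: string, taskType: string): string {
  const known: boolean = Object.prototype.hasOwnProperty.call(TASK_BLUEPRINTS, taskType);
  const steps: string[] | undefined = known ? TASK_BLUEPRINTS[taskType] : undefined;
  if (steps === undefined) return prompt;
  const heading: string = taskType.charAt(0).toUpperCase() + taskType.slice(1);
  const numbered: string = steps
    .map((step: string, index: number) => `${index + 1}. ${step}`)
    .join("\n");
  return `${prompt}\n\n## Task Blueprint: ${heading}\n${numbered}`;
}
```

Appending at the end of the prompt keeps the blueprint in the high-attention tail, consistent with the tail-content rule.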
### 3.3 Compaction at 65% Fill
**Mechanism.** Compact the conversation at 65% context-fill rather than the
conventional 80–90%. The 20–30B class degrades gradually; by 80% fill,
effective recall of head-position content is already poor.
**Why 65%, not 80%.** At 20–30B, the effective context is roughly 40–50% of
advertised (consistent with the gradient degradation observed in Liu et al.).
Compacting at 65% of advertised leaves 35% headroom, which maps to roughly
the effective context limit. Compacting at 80% means the model has already
been operating in degraded mode for the last 15% of the session.
**Compaction target.** Stale tool outputs first (raw file contents whose
information has been acted on), then stale conversation turns. The
anchored-summary schema from §4.7 of the best-practices document applies
unchanged.
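The trigger and the target ordering can be sketched as two small functions. This is a sketch under stated assumptions: the harness is assumed to track token counts per context item and to know which items are stale, and `ContextItem`, `shouldCompact`, and `compactionOrder` are hypothetical names.

```typescript
interface ContextItem {
  kind: "tool-output" | "turn";
  stale: boolean;
  tokens: number;
}

// 0.65 is the fill threshold argued for above; it is a parameter so other
// model classes can use a different value.
function shouldCompact(
  usedTokens: number,
  advertisedWindow: number,
  fillThreshold: number = 0.65
): boolean {
  return usedTokens / advertisedWindow >= fillThreshold;
}

// Orders compaction candidates: stale tool outputs first, then stale
// conversation turns. Non-stale items are never selected.
function compactionOrder(items: ContextItem[]): ContextItem[] {
  const staleItems: ContextItem[] = items.filter((item: ContextItem) => item.stale);
  const toolOutputs: ContextItem[] = staleItems.filter(
    (item: ContextItem) => item.kind === "tool-output"
  );
  const turns: ContextItem[] = staleItems.filter(
    (item: ContextItem) => item.kind === "turn"
  );
  return toolOutputs.concat(turns);
}
```

On a 32,768-token window the default threshold triggers at roughly 21,300 used tokens.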
### 3.4 Short-m@k with ≤3 Chains
**Mechanism.** For tasks requiring reasoning (debug diagnosis, architecture
decisions), generate up to 3 reasoning chains in parallel and take the majority
answer as soon as the first 2 agree. This is the short-m@k pattern from Hassid et
al., adapted to 20–30B hardware constraints.
**Why ≤3 chains.** Each chain at 20–30B requires ~8–12 GB VRAM at Q4. Three
chains fit on dual-GPU setups; four push into swap territory with a severe
latency penalty. The accuracy gain from chain 3 to chain 4 is marginal
compared to the latency cost.
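The voting step of short-m@k is simple once the chains have finished. The sketch below covers only that step and assumes the harness has already collected normalized final answers in completion order; launching and halting the parallel chains is not shown, and `shortMAtK` is a hypothetical name.

```typescript
// Majority vote over the first `m` chains to finish.
// `answersInFinishOrder` lists final answers in completion order. On a tie,
// the answer that reached the leading count first is kept.
function shortMAtK(answersInFinishOrder: string[], m: number): string | null {
  const finished: string[] = answersInFinishOrder.slice(0, m);
  const first: string | undefined = finished[0];
  if (first === undefined) return null;
  // Prototype-free object so answers like "constructor" count correctly.
  const votes: Record<string, number> = Object.create(null);
  let winner: string = first;
  for (const answer of finished) {
    const count: number = (votes[answer] || 0) + 1;
    votes[answer] = count;
    if (count > (votes[winner] || 0)) winner = answer;
  }
  return winner;
}
```

With `m = 2` a disagreement falls back to the chain that finished first, which is usually the shortest one.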
### 3.5 Anti-Filler-Token Rules
**Mechanism.** Explicit rules in the system prompt or `AGENTS.md` that ban
filler behavior. The 20–30B class is particularly prone to generating
explanatory filler: long paragraphs explaining what it's about to do before
doing it, or summarizing files it just read.
**Concrete rules that work:**
- "Do not summarize a file you just read — proceed to the next action."
- "Do not explain your plan before executing it — act immediately."
- "When the user asks a yes/no question, answer in one sentence then proceed."
These rules target the specific filler modes observed in 20–30B models.
Generic rules ("be concise") are ignored; specific rules ("do not summarize
a file you just read") are followed because they are concrete and testable.
---
## 4. Prompt Design
### 4.1 Imperative, Not Conditional
**Rule:** Write instructions as commands, not conditions. The 20–30B class
processes imperative instructions more reliably than conditional ones.
| Conditional (weak) | Imperative (strong) |
| --- | --- |
| "If there's a file to edit, read it first" | "Read a file before editing it" |
| "When you encounter an error, check the source" | "On error, locate the source file" |
| "If the build fails, run lint" | "Build fails → run lint" |
Conditional instructions introduce a branch the model must evaluate, and at 20–30B
branch evaluation is unreliable. Imperative instructions are single-path and
easier to follow.
### 4.2 Tail Content
**Rule:** Place the most-critical instructions at the end of the system
prompt and at the end of the user prompt. The tail survives context pressure;
the head does not.
This applies to both the initial system prompt (most important rules last)
and to injected content (hooks inject at the tail). A rule at the head of a
3k-token system prompt is effectively invisible by tool call 12.
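If rules are stored with a priority, the tail ordering can be applied mechanically when the prompt is assembled. The sketch below assumes such a priority field exists; `PromptRule` and `renderRulesTailLast` are hypothetical names, and the priorities in any example are chosen by the prompt author, not derived.

```typescript
interface PromptRule {
  text: string;
  priority: number; // higher means more critical
}

// Renders rules as a bullet list with the most critical rules last,
// so they sit at the tail of the system prompt.
function renderRulesTailLast(rules: PromptRule[]): string {
  return rules
    .slice() // copy so the caller's array is not reordered
    .sort((a: PromptRule, b: PromptRule) => a.priority - b.priority)
    .map((rule: PromptRule) => `- ${rule.text}`)
    .join("\n");
}
```

The same function can feed both the initial system prompt and the compressed reminder a hook injects later.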
### 4.3 Concrete Examples Over Abstract Principles
**Rule:** Show a concrete example of the desired behavior rather than stating
an abstract principle. The 20–30B class has weaker abstraction-to-execution
transfer than frontier models.
| Abstract (weak) | Concrete (strong) |
| --- | --- |
| "Be precise with file paths" | "Use absolute paths: `/home/dev/code/remnant/src/file.ts`, not `src/file.ts`" |
| "Check for errors" | "After every `npm run build`, check the exit code before proceeding" |
| "Keep changes minimal" | "Edit only the lines that need changing; do not reformat adjacent code" |
### 4.4 No Self-Reflect Language
**Rule:** Do not include "reflect on your answer", "double-check", "are you
sure", or "take another look" in prompts targeting 20–30B models. Huang et al.
([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large Language Models
Cannot Self-Correct Reasoning Yet") show that intrinsic self-correction without an
external oracle **consistently degrades** reasoning performance. At 20–30B,
the effect is stronger: the model's self-assessment is poorly calibrated, and
asking it to "reflect" produces longer, less-accurate chains.
Replace self-reflect prompts with external feedback: test runners, lint checks,
hook exit codes. The model does not need to check its own work — the harness
does.
### 4.5 Short CoT
**Rule:** When the prompt asks the model to reason, constrain the reasoning
trace explicitly. "Think step by step" produces verbose, less-accurate chains
at 20–30B. Instead:
| Verbose (weak) | Constrained (strong) |
| --- | --- |
| "Think step by step about this" | "List the 3 most likely causes, then test the first one" |
| "Analyze the problem thoroughly" | "State your hypothesis in one sentence, then verify it" |
| "Consider all possibilities" | "Name 2 candidate fixes, implement the first" |
This aligns with the Hassid et al. finding: shorter chains are more accurate.
The prompt constraint enforces short chains at the point of generation, not
just at the inference-time cap.
### 6.4a Reasoning density: getting more out of small local models
A separate question from "how do I keep a small model from breaking?" (§6.4) is


@ -0,0 +1,771 @@
# Agent Infra Extraction — Handoff Plan
**Status:** ✅ Complete through Phase 5. Remnant reduced to BFF-overlay only.
All phases executed and committed. See per-phase status below.
**Goal:** Move repo-agnostic agent infrastructure out of Remnant into
`~/dotfiles/.agents/` (existing dotfiles repo), wire it into each tool's
**global** config so every project inherits it automatically, and reduce
Remnant's footprint to a small project-specific overlay (BFF reminder, project
AGENTS.md). After this work, Remnant can get back to being a Remnant codebase
instead of an agent-infra lab.
**Forward-looking work** (MFE bootstrap, kanban unification, per-session tmp
capture, `project.config.js` extraction, llama-server module, MemPalace, eval
scaffolding, agentic-framework research) has moved to
[dotfiles-agent-infra-roadmap.md](./dotfiles-agent-infra-roadmap.md). This doc
now covers only the extraction itself and the post-extraction validation
findings.
---
## Decisions (confirmed with user)
| Decision | Value |
| ------------------------------- | ----------------------------------------------------------------------------------------- |
| Shared infra location | `~/dotfiles/.agents/` (existing repo, matches user's dotfiles naming) |
| Sharing mechanism | Inherit via global tool config; verify global+project plugins/hooks coexist additively |
| MCP server name                 | Rename `remnant-agents` → `all-agents` (safe: only 4 string refs, no permission impacts)  |
| Uncommitted files | Already committed as-is on `main` (Phase 1 done) |
| Research docs | Move to shared infra (general-purpose, useful to any project) |
| Modelfiles | Leave for now; address later |
| Global Copilot config | Yes — create `~/.vscode-server/data/User/prompts/` and add global MCP entry |
| Project-specific bits | Only Remnant's root `AGENTS.md` + the BFF/`apps/client/src/pages/` reminder |
| `agent-infrastructure.md` split | Lossless — ~95% to shared, thin pointer + Remnant tradeoffs stay |
---
## What's shareable vs. project-specific
**Shareable (moves to `~/dotfiles/.agents/`):**
- `.agents/AGENTS.md` — agent-infra design principles
- `.agents/agents/*.md` — brainstorm, build, orchestrator, research
- `.agents/skills/research.md` — research methodology
- `.agents/hooks/*.sh` — all six hook scripts (pre/post-tool-use, session-start,
stop, pre-compact, user-prompt-submit) **except** the BFF reminder block in
`post-tool-use.sh`
- `.agents/mcp/index.ts` — MCP server (will be refactored to auto-discover
agents/skills from sibling dirs)
- `.agents/frameworks/opencode/plugin.ts` — OpenCode plugin harness
- `.agents/frameworks/github/hooks.json` — Copilot harness config
- `docs/research/*.md` (5 files) — ai-coding-best-practices,
human-llm-interpretation-overlap, intent-interpretation-action-plan,
llm-intent-interpretation, text-communication-interpretation
- `docs/explorations/text-intent-interpretation-research.md`
- `docs/ai_architectures.md`
- `docs/projects/agent-infrastructure.md` — almost entirely shared knowledge
(see "Lossless split" below)
- `docs/infra/LLAMA-SERVER-CUDA-WSL2.md` — general llama.cpp/CUDA setup notes
**Project-specific (stays in Remnant):**
- Root `AGENTS.md` (Remnant overview, package pointers, monorepo rules)
- BFF reminder + `apps/client/src/pages/` path checks (currently embedded in
`post-tool-use.sh`)
- Nested `AGENTS.md` files in `apps/`, `packages/`
- `verification.md`, `docs/TODO.md`, `docs/projects/*` (other than the
agent-infrastructure split-off)
- The two `.modelfile` files — leave in `.agents/` with a `MODELFILES.md` note
---
## Verification gates (Phase 0 — COMPLETE)
1. ✅ **OpenCode plugin coexistence** — additive; all hooks run in sequence.
Global dir: `~/.config/opencode/plugins/` (not `~/.opencode/plugins/`).
2. ✅ **OpenCode MCP merge** — configs merge (not replace). Global `mcp` entries
plus project `mcp` entries both load; project-level keys win on conflicts.
3. ✅ **Copilot global hook support** — EXISTS. User-level hooks dir:
`~/.copilot/hooks/` (macOS/Linux) per
[GitHub Copilot hooks reference](https://docs.github.com/en/copilot/reference/hooks-reference).
Load order is additive: repo `.github/hooks/*.json` → user
`~/.copilot/hooks/*.json` → repo `settings.json` inline → user
`~/.copilot/settings.json` inline → plugins. Symlink
`~/.copilot/hooks/agent-support.json` → dotfiles hooks.json = global
coverage. No per-project stub needed. _(Initial finding was wrong — VS Code
docs don't cover Copilot's own config surface; always check docs.github.com
first.)_
4. ✅ **VS Code global MCP** at `~/.vscode-server/data/User/mcp.json` (create via
`MCP: Open Remote User Configuration` command or directly).
5. ✅ **OpenCode hook overlay** — BFF reminder ships as a separate project-local
plugin file. No merged copy of `post-tool-use.sh` needed.
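The merge behaviour recorded in gate 2 (both levels load, project-level keys win) can be modelled with an object spread. This models the observed result only, not OpenCode's implementation; `mergeMcpEntries` is a hypothetical name and the entry values in any example are placeholders, not real MCP config.

```typescript
// Additive merge: entries from both levels load; on a key conflict the
// project-level entry replaces the global one.
function mergeMcpEntries<T>(
  globalMcp: Record<string, T>,
  projectMcp: Record<string, T>
): Record<string, T> {
  return { ...globalMcp, ...projectMcp };
}
```

For example, a global config defining `all-agents` and `exa` merged with a project config that redefines `exa` yields both keys, with the project's `exa` entry in effect.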
---
## Target layout
```
~/dotfiles/.agents/ ← canonical shared infra
├── AGENTS.md ← from remnant/.agents/AGENTS.md
│ + "Research Discipline" section
│ for global lessons/practices
│ (framework-agnostic: Copilot,
│ OpenCode, Claude Code all load
│ AGENTS.md natively — no
│ tool-specific config needed)
├── INSTALL-NOTES.md ← Phase 0 findings
├── install.sh ← one-time setup script (idempotent)
├── agents/
│ ├── brainstorm.md
│ ├── build.md
│ ├── orchestrator.md
│ └── research.md
├── skills/
│ └── research.md
├── hooks/
│ ├── pre-tool-use.sh
│ ├── post-tool-use.sh ← BFF block removed
│ ├── session-start.sh
│ ├── stop.sh
│ ├── pre-compact.sh
│ └── user-prompt-submit.sh
├── frameworks/
│ ├── opencode/plugin.ts
│ └── github/hooks.json
├── mcp/
│ └── index.ts ← auto-discovers agents/skills/
└── docs/
├── agent-infrastructure.md ← the moved 855-line doc
├── ai-coding-best-practices.md ← from docs/research/
├── ai_architectures.md
├── human-llm-interpretation-overlap.md
├── intent-interpretation-action-plan.md
├── llm-intent-interpretation.md
├── text-communication-interpretation.md
├── text-intent-interpretation-research.md
└── llama-server-cuda-wsl2.md
Global wiring (created/modified by install.sh):
~/.config/opencode/opencode.json ← merge MCP entry
~/.config/opencode/AGENTS.md ← symlink → dotfiles AGENTS.md (OpenCode global rules)
~/.config/opencode/plugins/agent-support.ts ← symlink → dotfiles plugin
~/.config/opencode/agents/ ← symlinks → dotfiles agents/*.md (added in post-Phase-4 fix)
~/.copilot/hooks/agent-support.json ← generated by install.sh with absolute dotfiles paths (not a symlink)
~/.vscode-server/data/User/prompts/ ← create dir (currently missing)
~/.vscode-server/data/User/mcp.json ← global VS Code MCP registration
Remnant (post-extraction, actual):
remnant/
├── AGENTS.md ← unchanged
├── .agents/
│ ├── README.md ← "shared infra: ~/dotfiles/.agents"
│ ├── hooks/
│ │ └── post-tool-use-remnant.sh ← BFF reminder only
│ ├── omnicoder.modelfile ← archived
│ └── omnicoder2.modelfile ← archived
│ ⚠️ MODELFILES.md not created (planned but skipped)
├── .github/hooks/agent-support.json ← gitignored; BFF PostToolUse only
├── .vscode/mcp.json ← exa only (remnant-agents removed)
└── opencode.json ← mcp.remnant-agents removed;
permission overrides retained
Note: .opencode/ was gitignored; deleted from filesystem (agents now global).
```
---
## Phases
### Phase 0 — Verify coexistence ✅ DONE
Resolved all five gates. `INSTALL-NOTES.md` not produced (findings inline
above).
### Phase 1 — Checkpoint Remnant ✅ DONE
Already committed on `main`.
### Phase 2 — Populate `~/dotfiles/.agents/` ✅ DONE
1. Copy (not move) shareable files from `remnant/.agents/` into
`~/dotfiles/.agents/`. Add a **"Research Discipline" section** to
`~/dotfiles/.agents/AGENTS.md` for cross-tool meta-guidance (e.g. check
docs.github.com first for Copilot configuration questions). This is the
canonical home for global lessons — AGENTS.md is natively loaded by Copilot,
OpenCode, and Claude Code. Never use tool-specific mechanisms (OpenCode
`instructions:` config, VS Code `.instructions.md` files) for guidance that
belongs in AGENTS.md.
2. Copy `docs/research/*.md` (5 files),
`docs/explorations/text-intent-interpretation-research.md`,
`docs/ai_architectures.md`, `docs/infra/LLAMA-SERVER-CUDA-WSL2.md` into
`~/dotfiles/.agents/docs/`.
3. Split `docs/projects/agent-infrastructure.md` (lossless):
- **Moves to `~/dotfiles/.agents/docs/agent-infrastructure.md`:** the entire
current doc minus the items below. This includes hook architecture, model
scale profiles, MCP protocol status, OpenCode verified facts, the testing
plan, open issues — all general infra knowledge.
- **Stays in `remnant/docs/projects/agent-infrastructure.md`** (rewritten to
a thin pointer):
- Reference link to the shared doc
- Remnant-specific "Known Tradeoffs" row: "Instructions glob trimmed to
root `AGENTS.md` only" + the `api/`/`client/`/`core/` mitigation
- Mention of BFF reminder hook and its Remnant scope
- Any items currently open that have Remnant-specific test cases (e.g. item
31 mentions `apps/api/package.json` paths — generalize for shared doc;
keep concrete Remnant examples as a Remnant section)
4. Refactor `mcp/index.ts`: auto-discover `agents/*.md` and `skills/*.md`
relative to the script location, instead of a hand-maintained registry.
Removes a friction point when adding new agents/skills.
5. Rename MCP server `remnant-agents` → `all-agents` in `mcp/index.ts`.
6. Refactor `hooks/post-tool-use.sh`: remove the BFF + `apps/client/src/pages/`
block. Document the extension point (comment: "project-local additions live
in a sibling hook file or repo-local override").
7. Write `install.sh`:
- Detects existing global config (idempotent re-run safe).
- Creates missing dirs (`~/.vscode-server/data/User/prompts/`,
`~/.copilot/hooks/`, `~/.config/opencode/plugins/`).
- Symlinks plugin into `~/.config/opencode/plugins/agent-support.ts`.
- Generates `~/.copilot/hooks/agent-support.json` with absolute paths to
`~/dotfiles/.agents/hooks/*.sh` (not a symlink — avoids needing per-project
hook stubs for relative-path resolution).
- Merges `all-agents` MCP entry into `~/.config/opencode/opencode.json` via
`jq`.
- Writes `~/.vscode-server/data/User/mcp.json` with the `all-agents` MCP
entry.
8. Commit to dotfiles repo. (Push wherever; local-only is fine.)
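The auto-discovery in step 4 amounts to deriving the registry from directory listings. The sketch below is not the contents of `mcp/index.ts`: it takes the file names as arguments so it stays self-contained, and leaves reading `agents/` and `skills/` relative to the script location to the caller. `DiscoveredEntry` and `discoverEntries` are hypothetical names.

```typescript
interface DiscoveredEntry {
  kind: "agent" | "skill";
  id: string; // file name without the .md extension
  relativePath: string;
}

// Builds a registry from directory listings instead of a hand-maintained
// list. Callers pass the file names found in agents/ and skills/.
function discoverEntries(agentFiles: string[], skillFiles: string[]): DiscoveredEntry[] {
  const toEntries = (
    kind: "agent" | "skill",
    dir: string,
    files: string[]
  ): DiscoveredEntry[] =>
    files
      .filter((file: string) => /\.md$/.test(file)) // ignore non-Markdown files
      .sort()
      .map((file: string) => ({
        kind: kind,
        id: file.replace(/\.md$/, ""),
        relativePath: dir + "/" + file,
      }));
  return toEntries("agent", "agents", agentFiles).concat(
    toEntries("skill", "skills", skillFiles)
  );
}
```

Adding a new agent then means dropping a Markdown file into `agents/`, with no registry edit.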
**Divergences from plan:** `jq` replaced with `node` (`jq` is not universally
available); `install.sh` step 1 generates Copilot hooks JSON with absolute paths
(not a symlink) to avoid per-project relative-path resolution issues. Step 3
added post-Phase-4 to wire `~/.config/opencode/agents/`.
### Phase 3 — Run `install.sh` ✅ DONE
- Symlinks and generated files verified.
- Smoke tests passed: `RESEARCH_PROMPT: OK`, `HOOK_BLOCK: OK`.
- Bug found and fixed: OpenCode uses tool name `bash` (not `run_in_terminal`);
`pre-tool-use.sh` case statement updated in both repos.
### Phase 4 — Strip Remnant ✅ DONE
1. ✅ Deleted `agents/`, `skills/`, `frameworks/`, `mcp/`, `AGENTS.md` from
`.agents/`
2. ✅ `.agents/hooks/` reduced to `post-tool-use-remnant.sh` only
3. ⚠️ `MODELFILES.md` stub not created (skipped — low value)
4. ✅ `.vscode/mcp.json`: `remnant-agents` dropped, `exa` retained
5. ✅ `opencode.json`: `mcp.remnant-agents` removed, permission overrides kept
6. ✅ `AGENTS.md` updated to reference `~/dotfiles/.agents/AGENTS.md`
7. ✅ Docs deleted from `remnant/docs/` (research/, ai_architectures.md, etc.)
8. ✅ `agent-infrastructure.md` rewritten as thin pointer
9. ✅ `.agents/README.md` added
10. ✅ Committed (`daf53a3`, `8a61128`)
Post-phase fix: `.opencode/` had dead symlinks (pointed to deleted
`.agents/frameworks/` and `.agents/agents/`). Was gitignored so not in git
history. Fixed by wiring agents globally via `install.sh` step 3
(`~/.config/opencode/agents/`), then deleting `.opencode/` from the filesystem.
### Phase 5 — Verify Remnant still works ✅ DONE (automated checks)
- ✅ `npm run build:strict` passes (2 scripts ran, 15 skipped via wireit cache)
- ✅ All 6 shared hook scripts pass `bash -n` syntax check
- ✅ `post-tool-use-remnant.sh` passes `bash -n`
- ✅ `~/.config/opencode/agents/` wired with 4 symlinks → dotfiles
- ✅ `~/.copilot/hooks/agent-support.json` present (generated, absolute paths)
- ✅ Remnant `.agents/` contains only: README.md, hooks/, omnicoder\*.modelfile
- ⏳ Live session checks (require manual restart): `/research` etc. slash
commands, hook block in live session, BFF reminder injection, VS Code MCP
`all-agents` connect
---
## Notes (post-execution)
- All rename touch points done: `remnant-agents` → `all-agents` in mcp/index.ts,
opencode.json, .vscode/mcp.json, AGENTS.md.
- `<PostToolUse-context>` block working as designed — injected to model only,
not shown in chat transcript (see `post-tool-use.sh` line ~137).
- Global Copilot hook mechanism confirmed: `~/.copilot/hooks/` exists and is
additive with repo hooks. No per-project stubs needed when paths are absolute.
---
## Out of scope (do later)
- Salvaging `omnicoder*.modelfile` content into shared system-prompt references
— user chose "leave for now."
- Publishing dotfiles as a public agent-infra repo / npm package.
- Refactoring hooks to be platform-agnostic (item 22 in the migrated
`agent-infrastructure.md`) — track in the shared repo after extraction.
- **Make `.agents/` TypeScript files conform to Remnant's ESLint rules** — the
`additionalIgnores` bypass added in Phase 2 is a shortcut, not a solution.
`.agents/mcp/index.ts` and `.agents/frameworks/opencode/plugin.ts` use
`import.meta.url` directly (blocked by `no-restricted-syntax`) and have minor
unused-var patterns. Options: (a) replace `import.meta.url` usages with the
approved `findNearestPackageRoot` / `new URL('./sibling', import.meta.url)`
patterns where valid, (b) introduce a per-file exception comment for the
genuinely exceptional cases (e.g. portable hook resolution in a symlinked
global plugin), (c) move all `.agents/` TS into a proper subpackage with its
own `tsconfig.json` and relaxed rules. Remove `.agents/**` from
`additionalIgnores` once resolved.
---
## Rollback
Single revert: each phase is a separate commit. Phase 4 (strip Remnant) is the
only destructive one, and Phase 2's copies survive. Worst case:
`git revert <phase-4-commit>` restores Remnant, dotfiles copies stay.
---
## WIP: AGENTS.md context survival after compaction
> **Status**: problem noted; solution not designed. Break out into a separate
> project doc when ready to act on it.
### The problem
`AGENTS.md` loading is a session-start event. Once loaded, the content sits in
the context window as a regular document — it does not re-inject. After
compaction/summarization, the summary may preserve high-level framing but can
silently drop specific rules, enforcement hierarchy details, or lessons added
mid-session. The "Lost in the Middle" effect applies even before compaction:
guidance in the middle of a long context receives less model attention than
content at the tail (hooks inject at the tail specifically to counter this).
The `.agents/AGENTS.md` enforcement hierarchy already acknowledges this: _"Root
AGENTS.md sections: Context-start only. Subject to 'lost in the middle.'"_ The
user confirmed this happened: `.agents/AGENTS.md` was read before compaction
this session, but its content was not reliably carried through.
### What the research says (verified + falsified + re-corrected May 2026)
**VS Code Copilot** — correction was itself over-corrected. Final answer:
VS Code docs group `copilot-instructions.md`, `AGENTS.md`, and `CLAUDE.md` as
**"always-on instructions"** injected per-request — but this only applies to
files **at the workspace root**. The docs explicitly note: _"Support of
`AGENTS.md` files outside of the workspace root is currently turned off by
default."_
**This session is direct evidence.** `.agents/AGENTS.md` is a subdirectory file,
not the workspace-root AGENTS.md. It was `read_file`'d during this session and
entered the context as a regular document. After compaction the summary dropped
the specific content — enforcement hierarchy, forbidden patterns.
Post-compaction, the Copilot model then proposed `.instructions.md` files and
OpenCode `instructions:` config — exactly the approaches the forbidden patterns
section bans — because that guidance was no longer in the effective context.
Root-level `AGENTS.md` (workspace root) = always-on, survives compaction.\
Nested `AGENTS.md` in subdirectories = **not** always-on, read once on explicit
`read_file`, **lost on compaction**.\
**The problem is real for both tools for any AGENTS.md that isn't the workspace
root file.** This repo's enforcement lives in `.agents/AGENTS.md`, not the
workspace root — which means it is compaction-vulnerable in VS Code Copilot too.
**OpenCode** (opencode.ai/docs/rules + config):
- AGENTS.md loaded at session start via directory traversal + global
`~/.config/opencode/AGENTS.md`. No re-injection after compaction is
documented. The `compaction` agent is a hidden system agent; its behavior
after summarizing context is not specified. There is no `/docs/compaction`
page — no public spec exists for what happens to AGENTS.md content in the
compacted summary.
- Whether OpenCode re-injects even the root AGENTS.md after compaction is
unknown. Needs live testing.
**Summary of the asymmetry:**
| File | Copilot VS Code | OpenCode |
| --------------------------------- | ---------------------------- | ------------------------------------- |
| Root `AGENTS.md` (workspace root) | always-on per-request ✅ | session-start only ⚠️ |
| Nested `AGENTS.md` (subdirectory) | off by default, read-once ⚠️ | session-start traversal, read-once ⚠️ |
| Both after compaction | root survives; nested lost | unknown (undocumented) |
**Key implication for this repo:** the enforcement hierarchy and forbidden
patterns live in `.agents/AGENTS.md`, not the workspace-root AGENTS.md. That
makes them compaction-vulnerable in VS Code Copilot. None of the candidate
mitigations below have been evaluated yet — this problem is unsolved.
**Instruction files vs AGENTS.md (revised)**:
- VS Code Copilot: root AGENTS.md and root `copilot-instructions.md` are both
always-on per-request — equivalent. The ban on `.instructions.md` files is
about _path-scoping_ being non-portable, not injection frequency.
- OpenCode: `instructions:` config field is session-start — same vulnerability
as nested AGENTS.md in OpenCode.
### Open questions (narrowed after falsification)
- Does OpenCode re-inject root AGENTS.md after compaction, or is it also lost?
(Needs live testing — not documented.)
- Does OpenCode's `instructions:` config field content survive in the compacted
summary, or is it lost by the same mechanism?
- Does Claude Code (invoked directly, not via VS Code) have per-request
injection for root AGENTS.md like VS Code Copilot?
### Candidate mitigations (not yet chosen)
1. **Extend `pre-compact.sh`**: Before summarization fires, scan the current
context for `read_file` calls on `AGENTS.md` paths and emit their content
into the compaction context so the summary captures them explicitly.
2. **Session-start hook re-read**: If `session-start.sh` can detect it is
running post-compaction (e.g. a state file exists from a prior
`pre-compact.sh` run), re-inject the full root `AGENTS.md` content
immediately.
3. **PostToolUse periodic re-injection**: The current `post-tool-use.sh`
self-check fires every 15 tool calls. A similar counter could re-inject a
condensed version of critical AGENTS.md sections (enforcement hierarchy,
forbidden patterns) at the same cadence.
4. **Track and replay**: Maintain a list of AGENTS.md files read this session
(via PostToolUse file-path check). On `pre-compact.sh`, emit the paths as a
"re-read these after compaction" instruction so the post-compaction agent
gets them back.
5. **Stop relying solely on AGENTS.md for critical rules**: Move critical,
never-forget rules out of AGENTS.md into PreToolUse hard blocks or
PostToolUse reminders. Reserve AGENTS.md for architecture/rationale that is
worth losing under compaction. This is partly already the design intent —
this is a reminder to be strict about it.
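Mitigation 4 (track and replay) could be sketched roughly as follows. This is an illustration only: the function names, the list-file location, and the way a hook learns which file was read are all assumptions, not existing behavior of `post-tool-use.sh` or `pre-compact.sh`.

```bash
# Hypothetical: called from a PostToolUse hook with the path that was read.
# Records each AGENTS.md path once in a per-session list file.
track_agents_read() {
  local file_path="$1" list_file="$2"
  [ "$(basename "$file_path")" = "AGENTS.md" ] || return 0
  grep -qxF -- "$file_path" "$list_file" 2>/dev/null \
    || printf '%s\n' "$file_path" >> "$list_file"
}

# Hypothetical: called from a pre-compact hook. Prints a re-read instruction
# listing every tracked path, or nothing if no AGENTS.md was read.
emit_reread_instruction() {
  local list_file="$1"
  [ -s "$list_file" ] || return 0
  echo "Re-read these files after compaction:"
  sed 's/^/- /' "$list_file"
}
```

Whether the compaction summary actually preserves the emitted instruction is exactly the open question above, so this would need live testing before being relied on.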
---
## Post-Extraction Validation (May 23, 2026)
Validation pass over the extraction work. **No code changes made** — findings
and recommendations only.
### ✅ Verified working
**Dotfiles `~/dotfiles/.agents/` payload is complete:**
- `AGENTS.md` (289 lines) ✅
- `agents/`: `AGENTS.md`, `brainstorm.md`, `build.md`, `orchestrator.md`,
`research.md` ✅
- `skills/research.md` ✅
- `hooks/` — all six shared hooks (`pre-tool-use`, `post-tool-use`,
`session-start`, `stop`, `pre-compact`, `user-prompt-submit`) ✅
- `mcp/index.ts` + `package.json` + `package-lock.json` ✅
- `frameworks/opencode/plugin.ts` (319 lines, with the Jinja-safe `chat.message`
injection) ✅
- `frameworks/github/hooks.json` (full six-hook registration) ✅
- `docs/` — all nine moved docs present (`agent-infrastructure.md`,
`ai-coding-best-practices.md`, `ai_architectures.md`,
`human-llm-interpretation-overlap.md`, `intent-interpretation-action-plan.md`,
`llm-intent-interpretation.md`, `text-communication-interpretation.md`,
`text-intent-interpretation-research.md`, `llama-server-cuda-wsl2.md`) ✅
- `install.sh` — generates Copilot global hooks JSON with absolute paths,
symlinks OpenCode plugin + agents + global `AGENTS.md`, merges OpenCode and VS
Code MCP entries, installs MCP server deps ✅
**Global wiring on this machine is live:**
- `~/.copilot/hooks/agent-support.json` — generated, absolute paths ✅
- `~/.config/opencode/AGENTS.md` → `~/dotfiles/.agents/AGENTS.md` ✅
- `~/.config/opencode/plugins/agent-support.ts` →
`~/dotfiles/.agents/frameworks/opencode/plugin.ts` ✅
- `~/.config/opencode/agents/{brainstorm,build,orchestrator,research}.md`
symlinks ✅
- `~/.config/opencode/opencode.json` — has `all-agents` MCP entry ✅
- `~/.vscode-server/data/User/mcp.json` — has both `all-agents` and `exa` ✅
- `~/.vscode-server/data/User/prompts/` — exists (empty) ✅
**Remnant overlay is correctly scoped:**
- `.agents/AGENTS.md` (Remnant-specific) ✅
- `.agents/README.md` ✅
- `.agents/hooks/post-tool-use-remnant.sh` (BFF only) ✅
- `.agents/frameworks/github/{AGENTS.md, hooks.json}` — project Copilot hook
registration ✅
- `.agents/frameworks/opencode/{AGENTS.md, hooks.ts}` — project OpenCode plugin ✅
- `.github/hooks/hooks.json` → `../../.agents/frameworks/github/hooks.json` ✅
- `.opencode/plugins/hooks.ts` → `../../.agents/frameworks/opencode/hooks.ts` ✅
- `.opencode/AGENTS.md` warning file ✅
### ⚠️ Gaps and bugs in dotfiles (pre-push)
These should be fixed before squashing/pushing the dotfiles commits.
1. **`~/dotfiles/.agents/AGENTS.md` references stale paths from the
pre-extraction layout.** Three places reference `.agents/github/` and
`.agents/opencode/` but the canonical paths are now
`.agents/frameworks/github/` and `.agents/frameworks/opencode/`:
- "The Copilot harness (`.agents/github/hooks.json`) and OpenCode plugin
(`.agents/opencode/plugin.ts`) both delegate…" (Hook Files section)
- "`.agents/opencode/plugin.ts` — OpenCode plugin harness (canonical)"
(Tool-Specific Entry Points section)
- "`.agents/github/hooks.json` — Copilot harness config (canonical)" (same
section)
- Also: the surrounding sentences claim that symlinks point from
`.github/hooks/agent-support.json` and `.opencode/plugins/agent-support.ts`
and that "those directories are gitignored." In dotfiles this is wrong on two
counts: (a) global wiring uses `~/.copilot/hooks/agent-support.json` and
`~/.config/opencode/plugins/agent-support.ts`, (b) at Remnant the project
symlink files are named `hooks.json` and `hooks.ts`, not `agent-support.*`.
The doc was written for the pre-split layout and never updated.
2. **`~/dotfiles/.agents/AGENTS.md` links into `../docs/research/...`
Remnant-relative paths that don't resolve in dotfiles.** Two link targets:
- `[docs/research/intent-interpretation-action-plan.md](../docs/research/intent-interpretation-action-plan.md)`
- `[docs/research/ai-coding-best-practices.md](../docs/research/ai-coding-best-practices.md)`
Should be `./docs/intent-interpretation-action-plan.md` and
`./docs/ai-coding-best-practices.md` (the docs moved into `.agents/docs/`,
not `docs/research/`).
3. **No "Research Discipline" section** in `~/dotfiles/.agents/AGENTS.md`. Plan
Phase 2 step 1 specifically called for adding one (replacing the Copilot-only
memory at `~/memories/research-discipline.md`). The Copilot memory still
exists as a stopgap because the dotfiles AGENTS.md doesn't carry the
equivalent guidance.
4. **`frameworks/github/AGENTS.md` and `frameworks/opencode/AGENTS.md` are
missing from dotfiles.** Remnant added rich, generic API-facts AGENTS.md
files for each framework dir (62ee78c) — the content is not Remnant-specific
(verified VS Code hooks output formats, OpenCode plugin API facts, Jinja
constraint, overconfidence warnings). These belong in dotfiles alongside the
framework configs; right now an agent editing the global
`frameworks/opencode/plugin.ts` won't see them.
5. **`install.sh` location.** Currently `~/dotfiles/.agents/install.sh`.
Recommendation: move to `~/dotfiles/install.sh` so the dotfiles repo has a
discoverable bootstrap entry point (and to leave room for installing other
dotfiles content beyond `.agents/`). The script uses
`DOTFILES_AGENTS="$(cd "$(dirname "$0")" && pwd)"` — moving it requires
changing that one line to e.g.
`DOTFILES_AGENTS="$(cd "$(dirname "$0")" && pwd)/.agents"`. No other path
math in the script needs to change.
6. **`install.sh` does not install anything into `~/.copilot/` beyond
`hooks/`.** Copilot also supports user-level inline settings at
`~/.copilot/settings.json`. Not required, just noting it's a future extension
point if more global Copilot config becomes shareable.
7. **`install.sh` creates `~/.vscode-server/data/User/prompts/` but leaves it
empty.** Confirmed step 6 ran (`mkdir -p`), so the directory exists on this
machine. Working as intended; the dir is the surface for VS Code prompt files
but none have been authored yet. No action needed unless we plan to ship
`.prompt.md` files from dotfiles.
8. **`install.sh` has no uninstall counterpart.** Low-priority. Useful if we
start moving the script around and want clean state for testing.
9. **Exa MCP has an undocumented rate limit; agents fan out parallel
`mcp_exa_web_search_exa` calls and hit it.** Observed May 23, 2026: 8
parallel searches in one turn → all cancelled. Two complementary fixes, both
in dotfiles:
- **PostToolUse nudge** in `~/dotfiles/.agents/hooks/post-tool-use.sh`: after
any `mcp_exa_*` call, inject a reminder ("Exa rate-limits parallel calls —
issue web searches serially, max ~2 per turn") so the model learns the
pattern without a hard block.
- **`AGENTS.md` entry** under a new "External service quirks" section listing
per-service constraints (Exa rate limit, GitHub API limits when
`mcp_github_*` lands, etc.). Loaded at session start so the model has it
before issuing the first call.
- Optional PreToolUse soft-warn: count `mcp_exa_*` calls per turn via a
`/tmp/.exa-turn-count` file (reset on `user-prompt-submit`); warn (don't
deny) past N=2.
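The optional soft-warn counter from item 9 might look like the sketch below. The function names are hypothetical, the count-file path is passed in as a parameter (the note above suggests `/tmp/.exa-turn-count`), and how a hook surfaces the warning text to the model is an assumption left to the caller.

```bash
# Hypothetical: count mcp_exa_* calls in the current turn and print a
# warning (never a denial) once the count passes the limit.
exa_soft_warn() {
  local tool_name="$1" count_file="$2" limit="${3:-2}" count=0
  case "$tool_name" in
    mcp_exa_*) ;;            # only Exa tools are counted
    *) return 0 ;;
  esac
  [ -f "$count_file" ] && count="$(cat "$count_file")"
  count=$((count + 1))
  printf '%s\n' "$count" > "$count_file"
  if [ "$count" -gt "$limit" ]; then
    printf 'Exa call %s this turn exceeds %s; issue web searches serially.\n' \
      "$count" "$limit"
  fi
  return 0
}

# Hypothetical: called from user-prompt-submit to start a fresh turn.
reset_exa_count() {
  rm -f "$1"
}
```

A single shared count file has the same concurrent-session clobbering problem as the other `/tmp` files used by the hooks, so a real version would probably want a per-session name.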
### 🧹 Commit-history cleanup recommendations
Sonnet committed in tiny increments. Both repos have a series of unpushed
"fix(install)/fix(plugin)/fix(hooks)" commits that should be squashed before
publishing.
**`~/dotfiles`** — 10 unpushed commits on `main` past `4a44460 (origin/main)`.
Suggested single squashed commit:
```
feat(.agents): shared agent infrastructure + install.sh
- Hooks, agents, skills, MCP server, OpenCode plugin, Copilot hook config
- install.sh wires global Copilot hooks (absolute paths), OpenCode plugin
+ agents + AGENTS.md (symlinks), MCP entries for OpenCode and VS Code
- See .agents/docs/agent-infrastructure.md for design rationale
```
Constituent commits to fold in:
`6b07e4c 690178d 88435d6 f4017ab 5c12257 f0d21e9 2949981 3738732 9544b4e 14c132a`.
Suggested workflow: `git reset --soft 4a44460 && git commit -m '…'` (or
interactive rebase with `s` on every commit after the first). Address items 1–4
above first so the squash captures clean state.
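Because `git reset --soft` moves the branch pointer, it may be worth keeping a backup ref before squashing. A minimal sketch, assuming a clean working tree; the function name `squash_since` and the `backup/pre-squash-*` branch naming are hypothetical.

```bash
# Hypothetical helper: squash everything after <base> into one commit,
# leaving a timestamped backup branch at the pre-squash HEAD.
squash_since() {
  local base="$1" message="$2"
  git branch "backup/pre-squash-$(date +%Y%m%d%H%M%S)" || return 1
  git reset --soft "$base" && git commit -m "$message"
}
```

For the dotfiles case this would be called as `squash_since 4a44460 'feat(.agents): shared agent infrastructure + install.sh'`. The backup branch can be deleted once the squashed commit has been reviewed.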
**`~/code/remnant`** — many unpushed commits past `0d0a3a8 (origin/main)`; the
agent-infra-related ones form a contiguous block from `2d58147` through
`78c8449`. Suggested squash boundary:
- Keep `2d58147` as the first commit of the block, or replace it with a new
"feat: extract shared agent infra to ~/dotfiles/.agents" message that covers
the full final state.
- Fold in:
`5a7d220 c41c142 daf53a3 8a61128 2b0ea1e e9f3529 9191a44 fc2a944 62ee78c dc3ec9c 78c8449`.
The non-agent-infra commits before `2d58147` (the older "chore: more agentic
coding updates …" block) are pre-extraction and can be left as-is or squashed
separately depending on taste.
### 📋 Pending work that's still extraction-scoped
- `MODELFILES.md` stub (Phase 4 item 3) — explicitly skipped; consider whether
the two `omnicoder*.modelfile` files in Remnant should be moved to
`~/dotfiles/.agents/modelfiles/` and dropped from Remnant entirely. They
aren't Remnant-specific.
- `.agents/` TypeScript ESLint conformance (Out-of-scope list, item 4) — still
tracked; no movement.
- Item 22 in `agent-infrastructure.md` (platform-agnostic hook scripts) —
unchanged.
- Live-session smoke tests from Phase 5 (slash commands, BFF reminder injection,
VS Code MCP `all-agents` connect) — still marked ⏳. Should be retired or
confirmed after the next session restart.
### 🚀 Starting a new project on the extracted infra (MFE)
Moved to [dotfiles-agent-infra-roadmap.md](./dotfiles-agent-infra-roadmap.md).
The short version:
- Inheriting the global infra is automatic once `install.sh` has run on the
machine — no per-project setup beyond an `AGENTS.md` and (optionally) an
overlay hook.
- The blocker for full MFE adoption is that `stop.sh` hardcodes Remnant's task
layout (`docs/TODO.md`, `docs/projects/COMPLETED.md`, `docs/explorations/`).
This is part of the
[hook audit](#-full-hook-script-remnant-isms-audit-may-23-2026--addendum)
below and is addressed by the `project.config.js` extraction tracked in the
roadmap.
### 🆕 Future task — unify kanban/task doc structure across projects
Moved to
[dotfiles-agent-infra-roadmap.md → Kanban / task-doc unification](./dotfiles-agent-infra-roadmap.md#4-kanban--task-doc-unification).
Driver recorded here for context: `stop.sh` hardcodes Remnant's task layout, and
the path forward (after `project.config.js` lands) is for the hook to support
multiple shapes driven by config rather than a single hardcoded one.
### 🔎 Full hook-script Remnant-isms audit (May 23, 2026 — addendum)
Re-read every hook in `~/dotfiles/.agents/hooks/` line-by-line after the
`stop.sh` miss. Findings below — anything not listed is reviewed and verified
generic.
**`pre-tool-use.sh` — multiple hardcodes that bite non-Remnant projects:**
1. **Policy 5 — hardcoded ports 3000/3001** for dev-server detection:
```bash
ss -tlnp 2>/dev/null | grep -qE ':300[01]\s'
```
These are Remnant's `apps/api` (3000) and `apps/client` Vite HMR (3001). MFE
uses different ports (likely 5173 for Vite, plus app-specific). Fix: read
ports from a per-project config (`.agents/project.json` with a `devPorts`
array) or from `package.json` script scraping, default to common ports if
unset.
2. **Policy 8 — error message references `npm run build:core`** (Remnant has a
`packages/core` package that owns the codegen step; other projects don't):
> "Edit the source files (controller.ts, routes.ts, business-logic.ts)
> instead and run 'npm run build:core' to regenerate."

The `.generated.ts` block itself is generic, but the message and example
filenames are Remnant-specific. Fix: parameterize the rebuild command via
project config, or genericize the message ("run the generator script for the
affected package").
3. **Policies 9 & 10 — assume wireit is the build tool.** Both error messages
reference wireit cache/fingerprint behavior and tell the agent to edit
`wireit` config in `package.json`. Remnant uses wireit; MFE may not. The
blocks themselves (`rm .wireit`, `-- --force` with npm run) are still useful
— they fire on the literal string `.wireit` and the `--force` flag — but the
messages will be confusing for non-wireit projects. Fix: detect wireit
presence (`grep -q '"wireit"' package.json`) and skip the block when not
present, or rewrite messages to be tool-agnostic.
4. **Policy 11 — assumes npm workspaces** (`npm run format -- <file>`
propagation issue). True for any npm-workspaces monorepo; false for
single-package projects (where the arg works fine). Low-impact: even in a
single-package repo, the block just prevents a working command. Fix: gate on
presence of `workspaces` field in root `package.json`.
5. **Policy 14 — hardcoded `apps/*/package.json` and `packages/*/package.json`
paths.** This is the exact Remnant monorepo layout (`apps/api`,
`apps/client`, `packages/core`, etc.). MFE may use `apps/` + `packages/` too
but the underlying concern — that reading workspace package.json files
auto-injects nested AGENTS.md and exhausts context — applies to any monorepo
with nested AGENTS.md files, regardless of directory names. Also: the message
hardcodes **"32K context window"**, which is a specific assumption about the
local model (qwen3-coder-30b on llama-server). Cloud models have 200K+. Fix:
discover workspace dirs from `package.json` `workspaces` field; drop the
model-size number or make it configurable.
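The Policy 5 fix (ports from config instead of a hardcoded `:300[01]`) could be approached along these lines. This is a sketch under assumptions: the function names are hypothetical, the ports arrive as arguments (how they are read from project config is left open), and the fallback list `3000 5173 8080` is an illustrative guess at "common ports", not a value from the hooks.

```bash
# Build an extended regex that matches any of the given ports followed by
# whitespace, e.g. ":(3000|3001)[[:space:]]".
ports_regex() {
  local IFS='|'
  printf ':(%s)[[:space:]]' "$*"
}

# Reads `ss -tlnp`-style text on stdin; succeeds if any port is listening.
# Falls back to an assumed default list when no ports are passed.
dev_server_running() {
  [ "$#" -gt 0 ] || set -- 3000 5173 8080
  grep -qE "$(ports_regex "$@")"
}
```

With this shape, `ss -tlnp 2>/dev/null | dev_server_running 3000 3001` would reproduce the current Remnant check, and a Vite-only project could pass `5173` instead.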
**`post-tool-use.sh` — mostly generic, one cosmetic issue:**
6. **`vscode_renameSymbol` reminder uses Remnant-flavored example strings:**
`deleteX: archiveX`, `openDialog('delete-item')`,
`AppDialog handle='delete-item'`, `deleteSuccess/Loading/Error`. These are
illustrative patterns from Remnant's Solid.js store + AppDialog component.
They're not incorrect for other projects, just visibly Remnant-coded.
Low-priority: either genericize ("e.g. aliased store keys like
`oldName: newName` in a returned object") or leave as concrete examples —
they still teach the right habit. The header comment correctly notes that
project-specific reminders "belong in a sibling project-local hook file," but
this one snuck in.
7. **`opencode agent list` shell-out assumes OpenCode CLI is installed.** Fires
only when editing agent definitions, so the blast radius is small (a Copilot
user who never edits agents won't see it). The fallback ("opencode agent list
failed") is graceful. Acceptable as-is, but worth noting: Copilot-only
environments will hit the failure path every time. Could gate on
`command -v opencode`.
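The `command -v` gate suggested in item 7 can be written generically. A minimal sketch; `run_if_available` is a hypothetical name, and the failure message simply mirrors the style of the existing fallback.

```bash
# Hypothetical helper: run a command only if it is installed.
# Missing command -> silent no-op; failing command -> short notice.
run_if_available() {
  local cmd="$1"
  shift
  command -v "$cmd" >/dev/null 2>&1 || return 0
  "$cmd" "$@" || echo "$cmd $* failed"
}
```

In the hook this would be used as `run_if_available opencode agent list`, so Copilot-only environments skip the call instead of taking the failure path every time.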
**`pre-compact.sh`:**
8. **`docs/explorations/` hardcoded** (same path issue as `stop.sh`). Already
covered by the kanban-unification task above — fold into that work.
**`session-start.sh`:**
9. **`docs/explorations/` hardcoded** (same — fold into kanban-unification).
10. **`.session/dead-ends.md` and `.session/pre-compact-state.md` paths** appear
in all three of `session-start.sh`, `pre-compact.sh`, and `stop.sh`. This is a
convention `.agents/AGENTS.md` should formally document so it's not just
"magic paths the hooks know about." Not Remnant-specific (no Remnant code
references these), but undocumented. Fix: add a "Session conventions"
section to `~/dotfiles/.agents/AGENTS.md` listing these paths.
11. **"Ordered markdown lists are auto-renumbered by the editor on save"
reminder** — this is VS Code + Prettier behavior, generic enough to keep,
but worth flagging that it assumes the project uses Prettier with that
setting (Remnant does; others may not).
**`stop.sh` (already covered, restated for completeness):**
12. `docs/TODO.md`, `docs/projects/COMPLETED.md`, `docs/explorations/` — kanban
task.
13. **Ports 3000/3001** dev-server check (same as Policy 5 — fold fix together).
14. **`npm run build:strict`** referenced as the recommended verification
command. This is a Remnant-specific custom script name. Other projects use
`npm run build` or `npm run check` or `npm run ci`. Fix: same parameterize
approach (read from `.agents/project.json`).
**`user-prompt-submit.sh`:** clean. No Remnant-isms found.
**Suggested fix pattern (rather than a string of patches):**
Introduce a per-project config file at `<repo>/.agents/project.config.js` (or
`.ts`) so each hook can read its values instead of hardcoding them. Full design
— file shape, loader notes, dropped fields (`modelContextWindow`),
recommendation — is in
[dotfiles-agent-infra-roadmap.md → `project.config.js` extraction](./dotfiles-agent-infra-roadmap.md#1-projectconfigjs-extraction).
### 🆕 Future task — per-session tmp file capture
Moved to
[dotfiles-agent-infra-roadmap.md → Per-session tmp file capture](./dotfiles-agent-infra-roadmap.md#2-per-session-tmp-file-capture).
Driver recorded here for the validation trail: `user-prompt-submit.sh` writes to
a globally-named `/tmp/.last-user-prompt.txt`, so concurrent sessions clobber
one another's capture. The same issue affects
`/tmp/.opencode-tool-count-${REPO_ID}` in `post-tool-use.sh` (keyed by repo, not
session — concurrent sessions in the same repo share the self-check counter).
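One way to avoid the clobbering is to fold a session identifier into every temp-file name. The sketch below is an assumption-heavy illustration: `session_tmp_path` is a hypothetical helper, and where the session id comes from (hook input, environment, or the parent PID used here as a fallback) is not specified by the current hooks.

```bash
# Hypothetical helper: build a per-session temp path such as
# /tmp/.last-user-prompt-<session>. Unsafe filename characters in the
# session id are replaced with underscores.
session_tmp_path() {
  local name="$1" session_id="${2:-$PPID}"
  session_id="$(printf '%s' "$session_id" | tr -c 'A-Za-z0-9_.-' '_')"
  printf '%s/.%s-%s\n' "${TMPDIR:-/tmp}" "$name" "$session_id"
}
```

The parent-PID fallback only separates sessions if each session runs its hooks from a distinct long-lived process, which would need to be confirmed per tool.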


@ -0,0 +1,87 @@
# Failure Modes — Qwen3.6 & OpenCode
Compiled 2026-05-27. Sources linked inline.
---
## Qwen3.6 Model-Specific Quant & Routing Issues
### IQ3 Quant — Tool Call JSON Failure
| | |
|---|---|
| **Name** | IQ3 quant tool-call JSON breakage |
| **Description** | Qwen3.6 35B-A3B at IQ3_XXS quant fails function-call JSON generation entirely. BatiAI's Ollama benchmark shows ❌ for IQ3, ✅ for IQ4 and Q6. IQ3 is memory-bandwidth bound (~45.9 t/s on M4 Max) and loses the precision needed for structured JSON output in tool calls. |
| **Mitigation** | Use IQ4_XS or Q6_K for any workload with tool calling. IQ3 is acceptable only for text-only chat. IQ4 and Q6 show equivalent throughput. |
| **Sources** | [batiai/qwen3.6-35b:iq3 (Ollama)](https://ollama.com/batiai/qwen3.6-35b:iq3) |
### MoE Expert Loop — Q4_K_M & Below Routing Lock
| | |
|---|---|
| **Name** | Q4_K_M MoE expert routing collapse |
| **Description** | Qwen3.6's MoE architecture (256 routed experts, top-8 selection) degrades at Q4_K_M and below: the router locks into a subset of specialists (e.g., code-completion specialist for math queries, math specialist for syntax tasks). Expert activation entropy collapses. This is a structural MoE failure — dense Qwen2.5-72B does not exhibit this. Perplexity delta of +0.34 at Q4_K_M looks acceptable on paper but produces hallucinated method names, wrong parameter counts, and broken imports. |
| **Mitigation** | Default to Q6_K (1.6-point SWE-bench loss vs Q8_0, saves 2.1 GB VRAM). For 24 GB cards, Q4_K_M is acceptable only for RAG ingestion or documentation chat — not active code generation or function calling. Q8_0 wins SWE-bench Lite at 28.7%. BFCL v2 function-calling accuracy: 94.2% (Q8_0) → 89.7% (Q4_K_M). |
| **Sources** | [Qwen3.6 quant benchmarks: Q4 vs Q8 for MoE (CraftRigs)](https://craftrigs.com/comparisons/qwen3-6-quantization-benchmarks-q4-vs-q8/); [Qwen3.6-27B Setup Guide: 24GB GPU (CraftRigs)](https://craftrigs.com/guides/qwen3-6-27b-setup-guide-24gb-gpu/) |
### Official Chat Template — Non-Standard XML Parameter Format
| | |
|---|---|
| **Name** | Qwen3.6 official `chat_template.jinja` XML vs JSON incompatibility |
| **Description** | Qwen3.6's shipped `chat_template.jinja` instructs the model to generate function calls using a proprietary XML-like syntax (`<function=...><parameter=...>`) instead of OpenAI-compatible JSON. Missing closing tags cause parsing failures in standard inference frameworks (vLLM, HuggingFace transformers, llama-cpp-python, OpenAI-compatible API layers). Error: `Failed to parse input at pos XXXX: <function=read> <parameter=filePath> ...`. |
| **Mitigation** | Patch `chat_template.jinja` to use OpenAI-compatible JSON schema (`{"name": "function_name", "arguments": "{\"param1\": \"value1\"}"}`). |
| **Sources** | [abysslover/qwen36_tool_calling_failure (GitHub)](https://github.com/abysslover/qwen36_tool_calling_failure) |
### Long-Text Stability — Context Accumulation Amplifies Routing Drift
| | |
|---|---|
| **Name** | Q4_K_M multi-turn routing drift |
| **Description** | General chat tolerates +0.50 perplexity delta before quality drop is noticed. Multi-turn technical discussion (>3 turns with context accumulation), chain-of-thought reasoning, and structured output cross the threshold where expert loop errors become detectable within the first 10 responses. Context accumulation amplifies routing drift. |
| **Mitigation** | Q4_K_M acceptable for single-turn or short-context use. For long contexts or multi-turn structured output, use Q6_K or Q8_0. |
| **Sources** | [Qwen3.6 quant benchmarks: Q4 vs Q8 for MoE (CraftRigs)](https://craftrigs.com/comparisons/qwen3-6-quantization-benchmarks-q4-vs-q8/) |
---
## OpenCode Plugin / Hook-Specific Failures
### session.start — Resume / --continue Does Not Fire Plugin Context
| | |
|---|---|
| **Name** | session.start hook failure on resume |
| **Description** | `session.start` hook fires reliably for new sessions (`startup` trigger) but fails on resume (`--continue`/`--session`) with "No context found for instance" error. `Plugin.triggerSessionStart` is called during route navigation before the plugin context is fully initialized. Pending hook context is consumed lazily on the next model turn, so resume-triggered context can become stale if a session is resumed but not prompted soon after. |
| **Mitigation** | Be aware that `session.start` with `resume` trigger has a bootstrap timing edge case. Pending context becomes stale if the resumed session sits idle. PR #15224 documents the issue and a partial fix. |
| **Sources** | [OpenCode PR #15224 — feat(plugin): add session.start hook](https://github.com/anomalyco/opencode/pull/15224); [OpenCode Issue #5409 — SessionStart hook for session lifecycle events](https://github.com/sst/opencode/issues/5409) |
### PreToolUse — Ask Response Permanently Disables Bypass Permission
| | |
|---|---|
| **Name** | PreToolUse permission bypass lock |
| **Description** | When `PreToolUse` returns `permissionDecision: "ask"`, it permanently disables bypass permission mode until session restart. This is a state machine vulnerability — the permission bypass mode cannot recover from an `ask` response without a full session reset. |
| **Mitigation** | If using permission bypass mode, avoid `PreToolUse` hooks that return `ask`. Verify hook behavior after any policy change. |
| **Sources** | Claude Code #37420 (referenced in AGENTS.md) |
### session.created — Event Does Not Fire Reliably for Plugins
| | |
|---|---|
| **Name** | session.created event reliability for plugins |
| **Description** | `session.created` event fails to fire reliably for plugins due to MCP compatibility errors. This affects plugins that depend on session lifecycle events for initialization. |
| **Mitigation** | Use `session.start` hook as the primary initialization mechanism instead of relying on `session.created` events. |
| **Sources** | OpenCode #14808 (referenced in AGENTS.md, `~/.config/opencode/plugins/engram.ts`) |
### chat.message — Synthetic Text Injection Required for System Message Position
| | |
|---|---|
| **Name** | Jinja system message position enforcement |
| **Description** | vLLM propagates Qwen's strict Jinja template requiring `role=system` at index 0. Auxiliary context injection (e.g., from session-start hooks) breaks this if it places context after the system message. Fix: inject session-start as a synthetic `text` part via `output.parts.unshift()` on the first `chat.message` turn, not via `experimental.chat.system.transform`. Text parts have no position constraint. |
| **Mitigation** | Do not use `experimental.chat.system.transform` for session-start hooks with Qwen-family models. Use synthetic `text` parts via `output.parts.unshift()` on the first `chat.message` turn. |
| **Sources** | vLLM #41114; AGENTS.md (system reminder pattern) |
---
*Generated 2026-05-27 from web search findings.*

718
.agents/docs/roadmap.md Normal file

@ -0,0 +1,718 @@
# Dotfiles Agent Infrastructure — Roadmap
**Status:** Planning. Companion to
[extraction-history.md](./extraction-history.md), which covers the
already-shipped extraction work and the validation findings against it.
**Scope of this doc:** future tasks against `~/dotfiles/.agents/` and the
ecosystem around it. Research that informs the prioritization is captured in the
"Research notes" section at the bottom — read those first if any of the task
rationale feels opaque.
**How to use this doc:** the "Tasks" list is ordered by recommended execution
order (high leverage + low risk first). Each entry links to its design section.
Move sections to dedicated docs once they grow past ~80 lines.
> **Land before anything else:** the
> [No-Live-Fire safety rule](#0-no-live-fire-safety-rule-land-immediately).
> One-paragraph addition to `~/dotfiles/.agents/AGENTS.md`; takes 5 minutes;
> protects against the `opencode run "Try to run rm -rf /"` failure mode where a
> model takes the prompt literally if the hook fails to block.
> **Then relocate this doc out of Remnant:** see
> [Doc relocation (Remnant cleanup)](#doc-relocation-remnant-cleanup). This
> roadmap, `agent-infra-extraction.md`, and `verification.md` are not
> Remnant-specific and should live in `~/dotfiles/` so Remnant's
> `docs/projects/` contains only Remnant-app work. Do this after #0 and before
> resuming any numbered task below — once moved, the tasks list executes against
> the dotfiles copy and Remnant is free to evolve independently.
---
## Doc relocation (Remnant cleanup)
**Goal:** Remnant's repo contains only Remnant-app docs. Everything about
`~/dotfiles/.agents/` lives in `~/dotfiles/docs/` (or `~/dotfiles/.agents/docs/`
— pick one and stick with it; the existing
[`agent-infrastructure.md`](./agent-infrastructure.md) stub already references
`~/dotfiles/.agents/docs/agent-infrastructure.md`, so that's the established
location).
**Why now (priority: immediately after #0):** the user wants Remnant in a good
state to work on independently. Every agent-infra doc sitting in
`docs/projects/` is noise for Remnant-app planning sessions and gets
auto-injected as context whenever an agent touches `docs/projects/`. Moving them
is mechanical and reversible.
**Files to relocate:**
| Current path | Destination | Notes |
| ----------------------------------------------------------- | ------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs/projects/dotfiles-agent-infra-roadmap.md` (this file) | `~/dotfiles/.agents/docs/roadmap.md` | Update internal links. Drop "Remnant" framing in the intro — it's just _the_ roadmap once it lives there. |
| `docs/projects/agent-infra-extraction.md` | `~/dotfiles/.agents/docs/extraction-history.md` | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
| `verification.md` (repo root) | `~/dotfiles/.agents/tests/manual-verification.md` | Already specified as part of [#3](#3-hook--agent-config-verification-framework); do the move now rather than waiting for the test harness. |
| `docs/projects/agent-infrastructure.md` | **Stay** (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
| Agent-infra entries inside `docs/projects/COMPLETED.md` | Split out to `~/dotfiles/.agents/docs/completed.md` | Audit first — if there's nothing agent-infra-specific there, skip. |
**Steps:**
1. `mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests`
2. Move each file into `~/dotfiles/`. `git mv` cannot cross repositories, so
copy the file, `git rm` it inside Remnant to stage the delete, and `git add`
the copy in dotfiles. There's no meaningful history to preserve across repos
for these short-lived docs; if history matters for
`agent-infra-extraction.md`, use `git format-patch` + `git am` instead.
3. Rewrite intra-doc links: this file's references to
`./agent-infra-extraction.md` become `./extraction-history.md`; references to
`verification.md` become `../tests/manual-verification.md`.
4. Find inbound links from anywhere in Remnant
(`grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification\.md" ~/code/remnant`)
and either delete them or repoint at the dotfiles copies via absolute paths
(e.g., `~/dotfiles/.agents/docs/roadmap.md`).
5. Audit `docs/projects/COMPLETED.md` for agent-infra rows; split if any exist.
6. Update `AGENTS.md` files in Remnant if any reference the moved docs.
7. Commit Remnant deletion and dotfiles addition together (or back-to-back
commits with cross-references in the messages).
**Acceptance:** `ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'`
returns only `agent-infrastructure.md`; `verification.md` is gone from the
Remnant root; the roadmap (this doc) opens cleanly from its new path with
working links.
**Risk:** if any Remnant `AGENTS.md` instructions or
[`docs/projects/COMPLETED.md`](./COMPLETED.md) row links into these docs and the
link breaks silently, agents will follow a dead reference. Step 4 mitigates.
---
## Tasks (recommended order)
0. [No-live-fire safety rule (land immediately)](#0-no-live-fire-safety-rule-land-immediately)
— AGENTS.md addition forbidding real destructive commands as hook-test
inputs. Prerequisite for #3 and for any manual hook testing.
1. [`project.config.js` extraction](#1-projectconfigjs-extraction) — unblocks
non-Remnant projects; resolves 6+ hardcodes catalogued in the
[hook-script audit](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
2. [Per-session tmp file capture](#2-per-session-tmp-file-capture) — correctness
bug; concurrent agent sessions clobber one another's task-capture file.
3. [Hook + agent-config verification framework](#3-hook--agent-config-verification-framework)
— automate the smoke-test currently in Remnant's `verification.md`. Gated on
#0 (safety rule) and benefits from #1 (config-driven test fixtures).
4. [llama-server + AI models module](#4-llama-server--ai-models-module) —
user-requested; folds presets, systemd units, llama.cpp build, and GGUF
acquisition into `install.sh` (skips heavy steps in devcontainers).
5. [Kanban / task-doc unification](#5-kanban--task-doc-unification) — blocks MFE
adoption of the shared `stop.sh`; deferred until #1 lands so the task-doc
paths come from config, not the hook.
6. [MemPalace integration for memory survival across compaction](#6-mempalace-integration)
— directly addresses the "AGENTS.md context survival after compaction" WIP
problem in
[extraction-history.md](./extraction-history.md#wip-agentsmd-context-survival-after-compaction).
7. [Trace-based eval scaffolding (Husain methodology)](#7-trace-based-eval-scaffolding)
— foundation for any future automated improvement loop.
8. [Exa rate-limit awareness](#8-exa-rate-limit-awareness) — small follow-up to
the gap recorded in the validation doc.
9. [Research-loop / EvoSkill-style improvements](#9-research-loop--evoskill-style-improvements)
— gated on #7.
Items considered and **deprioritized**: see
[Deferred / not-now](#deferred--not-now).
---
## 0. No-live-fire safety rule (land immediately)
**Driver:** May 23 2026 incident — `opencode run "Try to run rm -rf /"` was used
to smoke-test whether `pre-tool-use.sh` would block destructive commands. The
run happened to be safe because the loaded model refused on its own, but if the
hook had been broken and a more compliant model had been in the chair, the test
would have executed `rm -rf /` for real. **The test methodology was the bug, not
the model behavior.**
**Rule (add verbatim to `~/dotfiles/.agents/AGENTS.md`):**
> ## Testing destructive-command blocks — NEVER use live ammunition
>
> When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
> command pattern, **never issue the real destructive command as the test
> input.** The hook is the system under test — if it fails, the test destroys
> the host.
>
> Use one of these methods instead, in order of preference:
>
> 1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the
> script and check exit code + stderr. No agent in the loop. No real shell
> invocation. Example:
> `echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"`
> The hook should exit non-zero (deny) and print the block reason. No `rm`
> was ever queued.
> 2. **Use a sentinel that exercises the regex but is harmless if the block
> fails.** A path that obviously doesn't exist and could not possibly hold
> real data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
> The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
> case is a "no such file" error on a sentinel path. NEVER use bare `/`,
> `/home`, `~`, `.`, `*`, or any real path — those have to fail-closed even
> if the hook is broken.
> 3. **Never** issue the literal destructive command (`rm -rf /`,
> `dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
> `git push --force` to a published branch, etc.) as an agent prompt. Not
> even with `--dry-run`. Not even "just to see." Not even if you're sure the
> hook works. The hook MIGHT not work. That's why you're testing it.
>
> This rule applies to humans writing test prompts AND to agents asked to verify
> hook behavior. If you (the agent) are asked to verify a block, refuse any plan
> that involves issuing the real destructive command and propose a unit-test or
> sentinel approach instead.
**Why it lives in AGENTS.md, not just a hook:** the failure mode is at the
human/agent decision layer ("what command should I issue to test this?"), not at
the execution layer. A hook can't catch a model that's been told to bypass the
hook. The narrative-epistemology framing from the research notes applies — this
rule shapes the **modal space** of test prompts so "issue the real command"
doesn't appear in the action set.
**Acceptance:** the rule lives in `~/dotfiles/.agents/AGENTS.md` under a
top-level section (so it survives compaction and AGENTS.md re-injection). Next
time anyone asks the agent to test a block, the agent proposes method 1 or 2 and
refuses method 3.
---
## 1. `project.config.js` extraction
Already designed in
[extraction-history.md → Suggested fix pattern](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
This task tracks the implementation.
**Shape of work:**
- Add a tiny loader (`~/dotfiles/.agents/hooks/_lib/project-config.sh`) sourced
by every hook that needs configured values. Loads
`<repo>/.agents/project.config.{js,ts,json}` via `node` / `tsx` / direct JSON
read in that order; falls back to a defaults object matching Remnant today.
- Replace hardcoded values in `pre-tool-use.sh` Policies 5, 8, 9, 10, 11, 14 and
in `stop.sh` (ports, verify command, codegen rules, task-doc paths) per the
audit.
- Drop the `modelContextWindow` notion entirely; genericize the Policy 14 "32K"
wording to "may exhaust the model's context window."
- Ship a Remnant `project.config.js` in the Remnant repo as the first consumer;
ship an MFE `project.config.js` later as part of the MFE bootstrap.
**Acceptance:** running every hook from a project _without_ a config file
produces the same behavior as today (zero-regression for Remnant). Running from
a project _with_ a config file consults it.
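The lookup order described above can be sketched as follows. This is a minimal illustration, not the designed loader: the function names (`find_project_config`, `config_loader_for`, `resolve_project_config`) are hypothetical, and the literal `defaults` return value stands in for the Remnant-matching defaults object.

```shell
# Sketch of the discovery half of _lib/project-config.sh.
# All function names here are hypothetical.

# Print the first config file found under <repo>/.agents/, trying js, then ts,
# then json. Prints nothing and returns 1 when none exists.
find_project_config() {
  local repo_root="$1" ext candidate
  for ext in js ts json; do
    candidate="${repo_root}/.agents/project.config.${ext}"
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Name the loader for a given config file: node for .js, tsx for .ts,
# a direct read for .json.
config_loader_for() {
  case "$1" in
    *.js)   echo "node" ;;
    *.ts)   echo "tsx" ;;
    *.json) echo "json" ;;
    *)      return 1 ;;
  esac
}

# Resolve the config source: a file path, or the literal "defaults" when the
# project ships no config (the zero-regression path for Remnant).
resolve_project_config() {
  find_project_config "$1" || echo "defaults"
}
```

Keeping discovery separate from loading means the fallback path can be tested without `node` or `tsx` installed.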
---
## 2. Per-session tmp file capture
Already designed in
[extraction-history.md → Future task — per-session tmp file capture](./extraction-history.md#-future-task--per-session-tmp-file-capture).
Small, independent, can land before or after #1.
**Bonus catch from that section:** `/tmp/.opencode-tool-count-${REPO_ID}` in
`post-tool-use.sh` is keyed by repo only — two concurrent sessions in the same
repo share the self-check counter. Fix the same way.
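One way to key the file by session as well as repo is sketched below. The helper name `session_tmp_file` is hypothetical, and the session ID is taken as a plain argument because its derivation is specified in the extraction-history doc, not here.

```shell
# Hypothetical helper: build a tmp path keyed by repo AND session, so two
# concurrent sessions in one repo no longer share a file.
# Usage: session_tmp_file <prefix> <repo-id> <session-id>
session_tmp_file() {
  local prefix="$1" repo_id="$2" session_id="$3" safe_session
  # Replace anything outside [A-Za-z0-9._-] so a session ID cannot inject
  # path separators into the filename.
  safe_session="$(printf '%s' "$session_id" | tr -c 'A-Za-z0-9._-' '_')"
  printf '%s/.opencode-%s-%s-%s\n' \
    "${TMPDIR:-/tmp}" "$prefix" "$repo_id" "$safe_session"
}
```

With `prefix` set to `tool-count`, the result extends the existing `/tmp/.opencode-tool-count-${REPO_ID}` name with a session suffix.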
---
## 3. Hook + agent-config verification framework
**Driver:** [manual-verification.md](../tests/manual-verification.md) is a manual
4-level smoke-test for the renamed `build` and `orchestrator` agents. It is (a)
sitting in the wrong repo — the agents it tests now live in
`~/dotfiles/.agents/agents/`, (b) outdated relative to the current agent config,
and (c) the kind of thing humans skip because running it takes 10+ minutes of
manual prompting. The user explicitly wants this to run **automatically after
updates**, and just-as-explicitly wants it to never resemble
`opencode run "Try to run rm -rf /"` (see
[#0](#0-no-live-fire-safety-rule-land-immediately)).
### Test layers
Three layers, from cheapest/safest to most expensive/least safe. Run the lower
layers in CI on every commit to `~/dotfiles/.agents/`; run the upper layer
manually before merging risky changes.
**Layer 1 — Static checks (no execution, no agent):**
- `bash -n` on every `*.sh` hook (syntax-only parse).
- `shellcheck` on every hook (lints + common-bug detection).
- Frontmatter validation on every `agents/*.md` and `skills/*.md`: required
fields present, referenced tools exist in the framework's tool registry.
- `node --check` or `tsx --check` on every JS/TS plugin
(`frameworks/opencode/*.ts`, `mcp/all-agents/src/*.ts`).
- JSON schema validation on `frameworks/github/hooks.json` and any other
framework configs.
- Glob check: every file referenced by a hook (e.g. `_lib/project-config.sh`
once #1 lands) actually exists.
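The first Layer 1 check can be sketched in a few lines. This covers only the `bash -n` pass; the function name and the directory argument are assumptions about the eventual layout.

```shell
# Minimal Layer 1 sketch: syntax-check every hook without executing it.
# check_hook_syntax is a hypothetical name; pass it the hooks directory.
check_hook_syntax() {
  local hooks_dir="$1" failed=0 hook
  for hook in "$hooks_dir"/*.sh; do
    [ -e "$hook" ] || continue   # glob matched nothing
    # bash -n parses the script but runs none of it.
    if ! bash -n "$hook" 2>/dev/null; then
      echo "SYNTAX ERROR: $hook" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A `shellcheck` pass could follow the same loop shape, guarded by `command -v shellcheck` so the check degrades gracefully where the linter is not installed.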
**Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):**
For each hook, a fixture file `tests/hooks/<hook>.test.sh` that pipes
hand-written JSON inputs to the hook and asserts the exit code + stderr. No real
command is ever invoked because the hook returns deny/allow before anything
runs.
Fixtures should cover, at minimum:
- **Allow path:** a benign tool call (e.g. `read_file` of an in-repo path) —
hook exits 0, no stderr noise.
- **Block paths (one per policy):** synthetic JSON that exercises each block in
`pre-tool-use.sh` (Policies 1–14). Assert exit code 2 (deny) and message
contains the policy ID. **All block fixtures use sentinel paths per
[#0](#0-no-live-fire-safety-rule-land-immediately)** — no bare `/`, no real
destructive commands.
- **Reminder injection:** `post-tool-use.sh` fed a generated-file edit — assert
stdout contains the `.generated.ts` warning.
- **Session boundaries:** `session-start.sh`, `stop.sh`, `pre-compact.sh` with
realistic JSON inputs — assert they produce the expected stdout blocks.
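A fixture needs little more than one assertion helper. The sketch below is an assumption about how such a helper might look, not an existing part of the repo; `assert_hook` is a hypothetical name.

```shell
# Sketch of a Layer 2 assertion helper. It pipes synthetic JSON to a hook and
# checks the exit code and stderr. The command named inside the JSON is only
# text on stdin; this helper never executes it.
# Usage: assert_hook <hook-path> <json> <expected-exit> <stderr-substring>
assert_hook() {
  local hook="$1" json="$2" want_exit="$3" want_msg="$4"
  local stderr_out got_exit
  # Capture stderr only; the hook's stdout is discarded.
  stderr_out="$(printf '%s' "$json" | bash "$hook" 2>&1 >/dev/null)"
  got_exit=$?
  if [ "$got_exit" -ne "$want_exit" ]; then
    echo "FAIL: exit $got_exit, wanted $want_exit" >&2
    return 1
  fi
  case "$stderr_out" in
    *"$want_msg"*) return 0 ;;
    *) echo "FAIL: stderr missing '$want_msg'" >&2; return 1 ;;
  esac
}
```

A block fixture would then call something like `assert_hook hooks/pre-tool-use.sh "$json" 2 "Policy 5"`, with `$json` carrying a sentinel path per #0. An empty substring matches any stderr, which suits allow-path fixtures.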
A small runner (`tests/run-hook-tests.sh`) discovers `*.test.sh` files, executes
them, and reports pass/fail. CI calls this on every PR. Local dev calls it from
a `~/dotfiles/.agents/install.sh --verify` flag.
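The runner itself can stay small. The following is a sketch under the assumption that each `*.test.sh` signals failure through its exit code; the function name and output format are illustrative.

```shell
# Sketch of the runner: discover *.test.sh files, run each in its own bash
# process, and report pass/fail. Returns non-zero if any test failed.
run_hook_tests() {
  local tests_dir="$1" pass=0 fail=0 t files
  files="$(find "$tests_dir" -type f -name '*.test.sh' | sort)"
  while IFS= read -r t; do
    [ -n "$t" ] || continue
    # </dev/null keeps the child from consuming the file list on stdin.
    if bash "$t" </dev/null >/dev/null 2>&1; then
      pass=$((pass + 1)); echo "PASS $t"
    else
      fail=$((fail + 1)); echo "FAIL $t"
    fi
  done <<EOF
$files
EOF
  echo "passed=$pass failed=$fail"
  [ "$fail" -eq 0 ]
}
```

Running each fixture in a separate process keeps one test's `exit` or environment changes from leaking into the next.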
**Layer 3 — Live integration tests (real agent, sentinel inputs, gated):**
The layers above don't catch "the framework didn't actually wire the hook in"
failures — the hook can be perfect in isolation but never get called. Layer 3
catches that by running a real OpenCode/Copilot session against sentinel
prompts:
- Per [#0](#0-no-live-fire-safety-rule-land-immediately), prompts use sentinel
paths and the **agent is asked to attempt** the sentinel command, not the real
one. Example prompt: _"Run `rm -rf /var/empty/canary-${RANDOM}` and report
what happened."_ Pass criterion: the hook block message appears in the agent's
response and the tool was never executed.
- Optional: drive via `opencode run --agent <name>` so the session is scripted
and non-interactive. Gate this behind an explicit `--enable-live-tests` flag
in the runner; default off in CI.
- Layer 3 also folds in Remnant's `verification.md` Levels 1–4 (read-only, small
write, scope escalation refusal, orchestrator planning gate) once the agents
are stable enough to script against.
### Disposition of `verification.md`
- It's not Remnant's anymore (tests global infra). Move to
`~/dotfiles/.agents/tests/manual-verification.md` as the human-runnable
fallback until Layer 3 automation exists.
- Drop from Remnant root in the same commit that creates
`~/dotfiles/.agents/tests/`. Until then it can stay where it is; it's not
causing harm, just misfiled.
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3
scenarios. Once Layer 3 is automated, retire the doc entirely.
### CI integration
- Add a GitHub Action (or Gitea CI step) in `~/dotfiles/` that runs Layers 1 + 2
on every push.
- Locally, `install.sh --verify` runs the same checks before applying any
changes — so an interactive `install.sh` invocation can refuse to symlink in a
broken hook.
- A `post-merge` git hook in `~/dotfiles/` runs Layers 1 + 2 after `git pull` so
a user who syncs a broken commit gets told immediately rather than discovering
it at the next agent invocation.
### Open questions
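- **What's the canonical sentinel path?** Proposal: `/var/empty/`. Where it
exists it is root-owned and not writable by ordinary users (it is OpenSSH's
default privilege-separation directory), so a rogue `rm -rf` run as a
non-root user would fail with permission denied even before hitting
nonexistent-file errors. It is not present on every distro (Debian and Ubuntu
point sshd at `/run/sshd` instead), so confirm it exists on the target host.
Append a random + canary token.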
- **Where do hook fixtures live in the global infra?** Likely
`~/dotfiles/.agents/tests/hooks/*.test.sh` and
`~/dotfiles/.agents/tests/fixtures/*.json`. Symmetric with `hooks/` itself.
- **Should Layer 3 be a single integration test per framework, or per hook?**
Per framework is enough — the hook unit tests already cover per-hook behavior.
Layer 3 only needs to prove "the framework calls the hook at all."
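If the `/var/empty/` proposal is adopted, fixtures could share one generator for sentinel paths. The helper names below are hypothetical, and the token format simply mirrors the example in the #0 rule.

```shell
# Hypothetical helper for building sentinel paths for block fixtures.
# The base directory follows the proposal above and is not final.
sentinel_path() {
  local base="${1:-/var/empty}"
  # $RANDOM is a bashism; fall back to the PID in shells that lack it.
  printf '%s/agent-block-canary-DO-NOT-CREATE-%s\n' "$base" "${RANDOM:-$$}"
}

# Build the synthetic command string for a fixture. The string is only ever
# embedded in hook-input JSON as text; nothing here executes it.
sentinel_rm_command() {
  printf 'rm -rf %s\n' "$(sentinel_path)"
}
```

Centralizing the path format also gives the CI lint a single pattern to allow.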
### Acceptance
- `~/dotfiles/.agents/tests/run.sh` exists and exits 0 on a clean checkout.
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to
fail loudly with a useful error.
- A pull that breaks a hook is caught by the `post-merge` hook before any agent
sees it.
- No test fixture in the repo references a real destructive command or real path
— grep `tests/` for `rm -rf /` (without sentinel suffix), `dd if=`, `:(){`,
`chmod -R 000 /` etc. as a CI lint.
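The last acceptance item could be enforced with a lint along these lines. It is a rough sketch: the patterns are a starting point, the sentinel allowance assumes `/var/empty/` is the chosen base, and `lint_live_ammo` is a hypothetical name.

```shell
# Sketch of the CI lint: fail if any file under the tests directory mentions a
# destructive command that is not aimed at a sentinel path.
# Limitation: a line holding both a sentinel rm and a live rm is filtered out
# as a whole, so fixtures should keep one command per line.
lint_live_ammo() {
  local tests_dir="$1" hits
  hits="$(
    {
      # rm -rf on an absolute path, unless the path is under /var/empty/
      grep -rnE 'rm[[:space:]]+-rf?[[:space:]]+/' "$tests_dir" \
        | grep -vE 'rm[[:space:]]+-rf?[[:space:]]+/var/empty/'
      # other commands that should never appear in a fixture
      grep -rnE 'dd[[:space:]]+if=|chmod[[:space:]]+-R[[:space:]]+000[[:space:]]+/' "$tests_dir"
      grep -rnF ':(){' "$tests_dir"
    } 2>/dev/null
  )"
  if [ -n "$hits" ]; then
    printf 'live-ammo lint failed:\n%s\n' "$hits" >&2
    return 1
  fi
  return 0
}
```

The lint script should live outside the directory it scans, since its own patterns would otherwise match.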
---
## 4. llama-server + AI models module
**Goal:** `~/dotfiles/install.sh` (or a sub-command of it) sets up llama.cpp +
CUDA, registers the systemd units, places `presets.ini` from dotfiles, and on
a non-devcontainer machine downloads the configured set of GGUF models. A
second script (`scripts/models.sh`) handles add/remove/list of models
post-install.
### Target layout
```
~/dotfiles/.agents/models/
├── presets.ini ← canonical, version-controlled
├── models.list ← URLs + filenames + checksums (committed)
├── README.md ← what each preset is for
└── gguf/ ← gitignored, populated by install.sh
└── *.gguf
~/dotfiles/.agents/llama-server/
├── start.sh ← canonical (replaces /opt/llama-server/start.sh)
├── llama-server.service ← systemd unit (User=current user, not ollama)
├── llama-server-presets.path ← path watcher
├── llama-server-presets.service ← oneshot restart
└── build-llama.sh ← clones + builds llama.cpp w/ CUDA
~/dotfiles/.agents/scripts/
├── models.sh ← add/remove/list GGUFs by URL
└── install-llama.sh ← called by install.sh; idempotent
```
### `install.sh` additions (ordered)
1. **Detect environment.** If `/.dockerenv` exists, `$REMOTE_CONTAINERS` set, or
`$CODESPACES` set → devcontainer mode: skip llama.cpp build and GGUF download
(huge, slow, and not useful inside the container). Still place `presets.ini`
and `models.list` so the project can read them.
2. **Dependencies.**
`apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git`
(with `sudo` prompt). CUDA toolkit detection only — don't try to install CUDA
itself; assume host setup or fail loud with a pointer to
[docs/llama-server-cuda-wsl2.md](./llama-server-cuda-wsl2.md).
3. **Build llama.cpp.** `scripts/install-llama.sh` clones `ggerganov/llama.cpp`
to `/opt/llama-server/src`, builds with `-DGGML_CUDA=ON`, installs binaries +
libs to `/opt/llama-server/`. Skips the clone+build if the binary exists and
`--rebuild` wasn't passed.
4. **Install systemd units.** Copy from
`~/dotfiles/.agents/llama-server/*.{service,path}` to `/etc/systemd/system/`,
substituting `${USER}` for `User=`. Run `daemon-reload`,
`enable --now llama-server.service llama-server-presets.path`.
5. **Symlink `presets.ini`.**
`ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini` (keep the
existing path-watcher target until users have migrated). The path watcher
already restarts on modify — symlink target changes count.
6. **Download GGUFs.** Read `models.list`; for each entry not already in
`~/dotfiles/.agents/models/gguf/`, download with `curl --location` and verify
checksum if listed. Print disk-usage estimate before starting. Skip in
devcontainer mode.
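Step 1's environment check can be sketched as below. The three signals come straight from the list above; the function names and the optional marker-path argument (there only so the check can be exercised in tests) are assumptions.

```shell
# Sketch of the devcontainer check from step 1. install.sh would call
# is_devcontainer with no arguments; the argument overrides the marker file.
is_devcontainer() {
  local docker_marker="${1:-/.dockerenv}"
  [ -e "$docker_marker" ] && return 0
  [ -n "${REMOTE_CONTAINERS:-}" ] && return 0
  [ -n "${CODESPACES:-}" ] && return 0
  return 1
}

# Report which mode the installer should run in.
install_mode() {
  if is_devcontainer "$@"; then
    echo "devcontainer"   # skip the llama.cpp build and GGUF download
  else
    echo "host"
  fi
}
```

Steps 3 and 6 would then branch on `install_mode`, while the `presets.ini` and `models.list` placement runs in both modes.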
### `models.list` format
```
# url<TAB>filename<TAB>sha256(optional)
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf qwen3-coder-30b-iq3.gguf abc123...
https://huggingface.co/.../deepcoder-14b-q5.gguf deepcoder-14b-q5.gguf def456...
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf qwopus-3.6-35b-iq3.gguf -
```
Plain TSV, easy to grep + diff. Comments via `#`.
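Reading the format takes one `read` loop. The sketch below assumes tab-separated fields exactly as shown and treats a missing or `-` checksum as "none"; `list_models` and `missing_models` are hypothetical names.

```shell
# Sketch of a models.list reader: prints "filename sha256" per entry, skipping
# blank lines and # comments. An absent checksum is printed as "-".
list_models() {
  local list_file="$1" url filename sha
  while IFS="$(printf '\t')" read -r url filename sha || [ -n "$url" ]; do
    case "$url" in ''|'#'*) continue ;; esac
    printf '%s %s\n' "$filename" "${sha:--}"
  done < "$list_file"
}

# Print the filenames listed in models.list that are not yet in the gguf dir.
missing_models() {
  local list_file="$1" gguf_dir="$2" filename sha
  list_models "$list_file" | while read -r filename sha; do
    [ -f "$gguf_dir/$filename" ] || printf '%s\n' "$filename"
  done
}
```

`missing_models` is the set that step 6 of the installer, or `models.sh download`, would fetch.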
### `models.sh` CLI
```bash
models.sh list # show installed + configured
models.sh add <url> [--name=<file>] # download + append to models.list
models.sh remove <name> # rm file + drop from models.list
models.sh prune # delete files not in models.list
models.sh download # re-download anything missing
models.sh checksum <name> # compute + store sha256
```
Each command edits `models.list` and the `gguf/` dir; `presets.ini` is edited by
hand (with the path-watcher restarting llama-server on save).
### Open questions
- **`User=` in the systemd unit.** The current unit runs as `ollama`. The
rationale was probably ollama's group ownership of `/home/dev/models/`. Moving
the model dir into dotfiles means the user owns it directly — running as
`${USER}` (or as a dedicated `llama` system user) is cleaner. Decide before
shipping.
- **CUDA-only assumption.** The user accepted "can always make this more
flexible later." Tag in the build script's header so a CPU/Metal fallback is
easy to add. Don't gold-plate now.
- **Where do the modelfiles go?** Remnant's `omnicoder*.modelfile` files are
Ollama-format. If they're still useful, move them to
`~/dotfiles/.agents/models/modelfiles/` and add a
`models.sh modelfile apply <name>` subcommand. Out of scope for the initial
cut; track in #4.5.
---
## 5. Kanban / task-doc unification
Already designed in
[extraction-history.md → Future task — unify kanban/task doc structure](./extraction-history.md#-future-task--unify-kanbantask-doc-structure).
Once #1 lands, `stop.sh` reads task-doc paths from `project.config.js`, so the
"shared hook supports one shape" framing changes: the hook supports _whatever
shape the config declares_, and the migration becomes purely a per-project
content move.
**Revised plan after #1:**
- Drop the "stop.sh knows about Remnant's flat list vs MFE's
`tasks/{backlog,todo,done}/`" coupling. `stop.sh` should know how to scan a
directory tree and how to scan a flat file, and `taskDocs` in config picks
which mode.
- MFE bootstraps on the directory-tree mode from day one.
- Remnant's migration is optional — if the kanban-tree shape is demonstrably
better in MFE, port Remnant later.
- Skill option still applies: a `migrate-task-docs.md` skill is probably cheaper
than a script given the per-project judgment calls.
---
## 6. MemPalace integration
**Why this is here:** the WIP "AGENTS.md context survival after compaction"
problem in the validation doc is a special case of the broader long-term memory
problem. MemPalace
([NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671))
solves it with a hook architecture that matches ours almost line-for-line.
**MemPalace primitives (verified from the PR):**
| MemPalace hook | Our equivalent | What it does |
| ----------------------- | ------------------------- | ------------------------------------------------- |
| `initialize()` | `session-start.sh` | Loads identity, warms vector DB |
| `system_prompt_block()` | `session-start.sh` inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
| `prefetch()` | `user-prompt-submit.sh` | Semantic search before each turn; wing-narrowed |
| `sync_turn()` | `post-tool-use.sh` | Files every exchange to the palace, non-blocking |
| `on_session_end()` | `stop.sh` | Full session mining + L1 layer regeneration |
| `on_pre_compress()` | `pre-compact.sh` | Extract key exchanges before context compression |
| `on_memory_write()` | (new — explicit writes) | Mirrors explicit memory writes into the palace |
**Practical plan:**
- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at
`~/.mempalace/`). Hermes is the reference integration but MemPalace itself
ships an MCP server (`mempalace_search`, `mempalace_status`, +6 more tools)
that any MCP-aware harness can use directly.
- Register the MemPalace MCP server in `~/.config/opencode/opencode.json` and
`~/.vscode-server/.../mcp.json` via `install.sh` — same pattern as
`all-agents`. No code changes needed on the harness side for read access.
- Wire write-side via our existing hooks: `post-tool-use.sh` calls the MCP tool
to file the turn, `pre-compact.sh` extracts and stores key exchanges. This is
additive — the existing dead-ends/explorations scaffolding stays.
- **Known bug to track upstream:** the Hermes plugin defaulted to a 384-dim
embedding function vs. MemPalace's 1024-dim collection. If we integrate
directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep
it; if we follow Hermes's plugin pattern, fix per the PR comment.
**Acceptance:** after restart in a fresh session, the agent can recall specific
facts (e.g. "what was the Phase 4 commit?") from a prior session without those
facts being in the workspace files. Compaction in the middle of a session does
not erase per-turn memory.
**Why this is #6, not #1:** it's higher-value than the small fixes but depends
on Ollama already running (which #4 makes turnkey), and requires verifying
MemPalace works against our chosen embedding model on our hardware before
committing to it. Do #1, #2, #3 first, then this.
---
## 7. Trace-based eval scaffolding
**Source:** "The Loop Is Only as Good as the Metric"
([distributedthoughts.org, Mar 2026](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/))
on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch
loop. Quote: _"the value of an optimization loop is determined entirely by the
quality of its feedback signal."_
**Husain methodology in two sentences:** review at least 100 real agent-output
traces by hand, take open-ended notes, categorize failures, then build binary
pass/fail evals around the failure modes you actually saw. Do not start with
generic metrics.
**Practical plan for us:**
- Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent
output to `~/.agent-traces/<date>/<session-id>.jsonl` via the existing
`post-tool-use.sh` (we already have session-ID derivation from #2). Add a
`trace_log()` helper in `_lib/`.
- Build a tiny review CLI: `scripts/trace-review.sh` opens the next unreviewed
trace in `$EDITOR` with a frontmatter block (`outcome: pass|fail|partial`,
`failure_modes: []`, `notes: ""`). Saves to `~/.agent-traces/reviewed/`.
- After 100 reviewed traces, derive a `failure-modes.md` doc grouping the
observed failure modes. _This_ becomes the input to skill / hook / AGENTS.md
improvements — concrete failure modes, not speculation.
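A first cut of the `trace_log()` helper might look like this. It is a sketch: the `TRACE_ROOT` override is an assumption added for testability, and the caller is expected to hand in already-serialized, single-line JSON.

```shell
# Hypothetical _lib/ helper: append one JSON record per line to
# <root>/<date>/<session-id>.jsonl, where root defaults to ~/.agent-traces.
# Usage: trace_log <session-id> <single-line-json>
trace_log() {
  local session_id="$1" json_line="$2"
  local root="${TRACE_ROOT:-$HOME/.agent-traces}" dir
  # Reject payloads containing a newline so each record stays on one line.
  if [ "$(printf '%s' "$json_line" | wc -l)" -ne 0 ]; then
    echo "trace_log: payload must be a single line" >&2
    return 1
  fi
  dir="$root/$(date +%Y-%m-%d)"
  mkdir -p "$dir" || return 1
  printf '%s\n' "$json_line" >> "$dir/$session_id.jsonl"
}
```

Leaving serialization to the caller avoids hand-rolled JSON escaping in shell, which is easy to get wrong.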
**Why this is gating for #9:** an EvoSkill-style or Karpathy-style automated
loop needs a metric. Without trace-based failure modes, the only metric
available is "did the user thumbs-up" — too noisy, too slow, too coarse.
---
## 8. Exa rate-limit awareness
Per the validation doc gap #9. Free-plan limit: no parallel fanout under ~1s —
calls must be serial.
**Implementation:**
- Add a `mcp_exa_*` case to `post-tool-use.sh` that injects a one-liner reminder
("Exa free plan: serialize searches; one at a time").
- Add an "External service quirks" section to `~/dotfiles/.agents/AGENTS.md`
listing Exa (and any future per-service constraints) so the rule survives
compaction.
- Optional soft-warn in `pre-tool-use.sh`: count `mcp_exa_*` calls per turn
(reset on `user-prompt-submit`); inject a warning (not a deny) past N=2 in a
single turn.
Trivial, no dependencies, can land in any order.
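The optional soft-warn reduces to a per-turn counter. In the sketch below the counter-file argument, the function names, and the `EXA_WARN_AFTER` variable are assumptions; the threshold of 2 and the reminder wording come from the list above.

```shell
# Sketch of the soft-warn counter. It warns and never denies.
EXA_WARN_AFTER="${EXA_WARN_AFTER:-2}"

# Called from user-prompt-submit: start a new turn.
exa_reset_count() { : > "$1"; }

# Called from pre-tool-use for each tool call. Prints a reminder once more
# than EXA_WARN_AFTER mcp_exa_* calls have happened in the current turn.
exa_note_call() {
  local counter_file="$1" tool_name="$2" count
  case "$tool_name" in
    mcp_exa_*) ;;
    *) return 0 ;;   # not an Exa tool; leave the counter alone
  esac
  count="$(cat "$counter_file" 2>/dev/null)"
  count=$(( ${count:-0} + 1 ))
  printf '%s\n' "$count" > "$counter_file"
  if [ "$count" -gt "$EXA_WARN_AFTER" ]; then
    echo "Exa free plan: serialize searches; one at a time ($count calls this turn)"
  fi
  return 0
}
```

The counter file should be per-session (see #2) so concurrent sessions do not trip each other's warnings.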
---
## 9. Research-loop / EvoSkill-style improvements
**Sources:**
- Karpathy autoresearch
([github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch),
Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb),
LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
- EvoSkill ([arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill)):
failure-driven skill discovery via Proposer + Skill-Builder agents over a
Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot
transfer to BrowseComp. Skills materialize as `SKILL.md` + helper scripts —
same shape as our existing skills dir.
**What this looks like for us (after #7):**
- The "controllable artifact" is the `~/dotfiles/.agents/AGENTS.md` +
`agents/*.md` + `skills/*.md` + hook reminders. The "frozen model" is whatever
LLM the user is running.
- The scalar metric is something like: fraction of traces (from #7) where the
agent's hook output and tool sequence matched a hand-labeled gold trajectory.
Husain's binary pass/fail per failure mode aggregates into this.
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current
skill set, proposes a new `SKILL.md` or an edit to an existing one, the
Skill-Builder materializes it, the eval harness re-runs on the held-out trace
set, and the frontier keeps it if the metric improves.
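As a rough illustration of the metric and the keep-or-revert decision, the sketch below computes a pass fraction over reviewed traces. It assumes each reviewed file carries the `outcome: pass|fail|partial` frontmatter line described in #7; the function names and the two-decimal format are hypothetical, and a real metric would likely weight failure modes rather than count outcomes.

```shell
# Sketch: fraction of reviewed traces whose outcome is "pass".
pass_fraction() {
  local reviewed_dir="$1"
  grep -rhE '^outcome:[[:space:]]+(pass|fail|partial)[[:space:]]*$' "$reviewed_dir" 2>/dev/null \
    | LC_ALL=C awk '{ total++; if ($2 == "pass") pass++ }
        END { if (total == 0) print "0.00"; else printf "%.2f\n", pass / total }'
}

# Keep a candidate skill edit only if it strictly improves the metric.
# Usage: keep_candidate <old-metric> <new-metric>
keep_candidate() {
  LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { exit !(new > old) }'
}
```

Requiring a strict improvement mirrors the autoresearch protocol quoted above: keep if the metric improves, revert if not.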
**Why it's last in the queue:** every prior task (config, sessions, llama
turnkey, memory, traces) is a prerequisite or a strict improvement to the
substrate this loop runs on. Starting #9 before them produces a loop that
optimizes against a noisy or wrong metric — the exact failure mode the Husain
piece warns about.
---
## Deferred / not-now
- **Adopt LangGraph as the harness.** Best-in-class observability and
state-machine recovery, but adopting it means rewriting the OpenCode + Copilot
integration layer we just extracted. Revisit if LangSmith becomes the only
path to debugging a specific failure mode we can't diagnose with traces (#7)
alone. Sources:
[agent-harness.ai benchmark](https://agent-harness.ai/blog/multi-agent-orchestration-frameworks-benchmark-crewai-vs-langgraph-vs-autogen-performance-cost-and-integration-complexity/)
(9% token overhead vs CrewAI 18% vs AutoGen 31%);
[groundy.com](https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/)
(per-node failure isolation vs CrewAI full-plan retry).
- **AutoGen.** Entered maintenance mode in late 2025; absorbed into Microsoft
Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the
framework's strength (conversational coordination) doesn't match our
deterministic-pipeline use case. Skip.
- **CrewAI.** Strong for "agent A → agent B → agent C" pipelines, but role
coordination overhead is ~3× LangGraph's on simple workflows. Our use case
(single agent per session) doesn't benefit. Skip.
- **Git worktrees for parallel agent runs.** Mentioned in the MFE draft; see
Claude Desktop's approach. Interesting once we have a working research loop
(#9), pointless before. Defer.
- **Narrative epistemology as an explicit framework.** Flowerree's "Reasoning
Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic
agents (PMC9910757) give philosophical grounding for AGENTS.md design (a
narrative frame is a "modal-space-shaping tool, not a set of premises").
Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we
publish methodology.
- **Hermes Agent as a harness.** Compelling memory story (MemPalace), but Python
and tied to NousResearch's ecosystem. We integrate the memory piece directly
via MCP (#6) without adopting the harness.
---
## Research notes (May 23, 2026)
Pulled via Exa search; supports the prioritization above. Each block lists the
key finding and the source.
### Karpathy autoresearch — single-metric loop
- **Source:** [karpathy/autoresearch](https://github.com/karpathy/autoresearch)
and [distributedthoughts.org](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/).
- Single file (`train.py`) edited by agent, fixed 5-minute time budget per
experiment, scalar metric (val_bpb), branch-keep-or-revert protocol, LOOP
FOREVER. ~12 experiments/hour.
- Four ingredients for this to work outside ML training: (1) one modifiable
artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval
cycle. The Husain layer adds: don't invent the metric — derive it from manual
trace review.
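The branch-keep-or-revert protocol above can be sketched in a few lines. This is a minimal illustration, not code from karpathy/autoresearch: the names `keepOrRevertLoop`, `propose`, and `evaluate` are hypothetical, and it assumes a scalar metric where lower is better (as with val_bpb).

```typescript
// Hypothetical sketch of a keep-or-revert loop; none of these names come
// from karpathy/autoresearch.
type Candidate<A> = {
  artifact: A;
  metric: number; // scalar metric, lower is better (assumption)
};

function keepOrRevertLoop<A>(
  initial: A,
  propose: (current: A, round: number) => A, // stands in for the agent's edit
  evaluate: (artifact: A) => number, // stands in for the fixed-budget eval
  rounds: number,
): Candidate<A> {
  let best: Candidate<A> = { artifact: initial, metric: evaluate(initial) };
  for (let round = 0; round < rounds; round++) {
    const artifact = propose(best.artifact, round);
    const metric = evaluate(artifact);
    // Keep the change only on strict improvement; otherwise "revert" by
    // continuing from the previous best artifact.
    if (metric < best.metric) {
      best = { artifact, metric };
    }
  }
  return best;
}

// Toy usage: the artifact is a number and the metric is its distance from 3.
const result = keepOrRevertLoop(
  10,
  (x, round) => (round % 2 === 0 ? x - 2 : x + 1),
  (x) => Math.abs(x - 3),
  6,
);
```

The loop only stays honest if `evaluate` is reliable and its budget is fixed, which is why the four ingredients above come as a set.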
### EvoSkill — automated skill discovery
- **Source:** [arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill).
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes
`SKILL.md` + helpers), Evaluator (held-out validation).
- Pareto frontier of agent programs; round-robin parent selection;
failure-driven textual feedback descent.
- **Why this matters for us:** our skills dir already matches EvoSkill's output
shape (`SKILL.md` + helper files). The infrastructure they describe is closer
to "build on top of our existing layout" than "adopt a new framework."
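To make the selection step concrete, here is a rough sketch of a Pareto frontier with round-robin parent selection. The two objectives (`accuracy`, `cost`) and every name below are assumptions chosen for illustration; the notes above do not say which objectives EvoSkill's frontier uses.

```typescript
// Hypothetical program record; the objectives are assumed, not taken from EvoSkill.
type Program = { id: string; accuracy: number; cost: number };

// a dominates b if it is no worse on both axes and strictly better on one.
function dominates(a: Program, b: Program): boolean {
  return (
    a.accuracy >= b.accuracy &&
    a.cost <= b.cost &&
    (a.accuracy > b.accuracy || a.cost < b.cost)
  );
}

// Keep every program that no other program dominates.
function paretoFrontier(programs: Program[]): Program[] {
  return programs.filter((p) => !programs.some((q) => dominates(q, p)));
}

// Cycle through the frontier so each surviving program gets a turn as parent.
function roundRobinParent(frontier: Program[], iteration: number): Program | undefined {
  if (frontier.length === 0) return undefined;
  return frontier[iteration % frontier.length];
}

const frontier = paretoFrontier([
  { id: 'a', accuracy: 0.6, cost: 10 },
  { id: 'b', accuracy: 0.7, cost: 12 },
  { id: 'c', accuracy: 0.5, cost: 11 }, // dominated by a
  { id: 'd', accuracy: 0.7, cost: 15 }, // dominated by b
]);
const parent = roundRobinParent(frontier, 3);
```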
### Agentic-framework landscape, 2026
- **LangGraph 1.2 (May 2026):** production default. 9% token overhead over raw
API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best
observability via LangSmith. Highest setup cost.
- **CrewAI 1.11 (Mar 2026):** fastest time-to-first-agent. 18% token overhead.
Role-based. SQLite checkpointing added April 2026.
- **AutoGen:** maintenance mode since late 2025. Absorbed into Microsoft Agent
Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native,
GraphFlow).
- **MAST taxonomy finding:** 79% of multi-agent failures originate from
spec/coordination issues, not the underlying model
([arxiv 2503.16339](https://arxiv.org/abs/2503.16339)). 36.9% inter-agent
misalignment, 21.3% task-verification breakdowns. **This validates investing
in hook/skill/AGENTS.md infrastructure over swapping models.**
### MemPalace — long-term memory provider
- **Source:**
[NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671).
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama
bge-m3 1024-dim). No API key.
- Hook architecture maps 1:1 onto ours (see #5 table). Eight MCP tools expose
read/write.
- **Why this is the highest-leverage memory option:** matches our philosophy
(local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the
validation doc flagged.
### Narrative epistemology — applied to AGENTS.md design
- **Source:** Flowerree, "Reasoning Through Narrative" (Cambridge _Episteme_,
2023); Betz et al., "Probabilistic coherence... Neural language models as
epistemic agents" (PMC9910757).
- Narratives shape **modal space** — what the model treats as possible,
plausible, required. They aren't premises to evaluate as true/false; they're
tools that frame inference.
- **Implication for AGENTS.md:** the doc's job isn't to state facts the model
checks at decision points — it's to shape the model's default modal space.
Forbidden patterns aren't "rules to look up" but "implausible options excluded
from the action space." Frames the "context survival after compaction" problem
differently: the question isn't "did the rules survive" but "did the
modal-space shaping survive."
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces
probabilistically-coherent belief revision. Suggestive for why AGENTS.md
content that the model sees repeatedly (via PostToolUse re-injection) gets
internalized better than content seen once.
### Exa rate-limit (operational)
- Free plan: serial only, no fan-out under ~1s. Observed May 23, 2026.
- Recorded in
[extraction-history.md gap #9](./extraction-history.md#-gaps-and-bugs-in-dotfiles-pre-push)
and as roadmap task #7.
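One way to respect the serial-only limit is to run searches one at a time with a minimum gap between request starts. The sketch below is an assumption-laden illustration: `MIN_GAP_MS`, `delayBeforeNext`, and `runSerially` are hypothetical names, the 1000 ms gap is taken from the "~1s" observation above rather than from Exa documentation, and no Exa API is called.

```typescript
// Assumed spacing, based on the "~1s" observation; not an official Exa limit.
const MIN_GAP_MS = 1000;

// How long to wait before starting the next request, given when the last one started.
function delayBeforeNext(
  lastStartMs: number | undefined,
  nowMs: number,
  minGapMs: number = MIN_GAP_MS,
): number {
  if (lastStartMs === undefined) return 0; // first request starts immediately
  return Math.max(0, lastStartMs + minGapMs - nowMs);
}

// Run tasks strictly one at a time (no fan-out), spacing their start times.
async function runSerially<T>(
  tasks: Array<() => Promise<T>>,
  minGapMs: number = MIN_GAP_MS,
): Promise<T[]> {
  const results: T[] = [];
  let lastStart: number | undefined;
  for (const task of tasks) {
    const wait = delayBeforeNext(lastStart, Date.now(), minGapMs);
    if (wait > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, wait));
    }
    lastStart = Date.now();
    results.push(await task()); // only one request in flight at a time
  }
  return results;
}
```

Each task would wrap a single search call, so a caller passes an array of thunks instead of firing the requests with `Promise.all`.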


@ -0,0 +1 @@
Verify plugin TypeScript code changes with `npm t`.


@ -1,13 +1,14 @@
import type { Plugin, TextPart } from "@opencode-ai/plugin"; import type { Plugin, Hooks } from '@opencode-ai/plugin';
import { resolve, dirname } from "node:path"; import type { TextPart, Model } from '@opencode-ai/sdk';
import { fileURLToPath } from "node:url"; import { resolve, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
/** /**
* Agent support plugin for Remnant. * Agent support plugin for Remnant.
* *
* Responsibilities: * Responsibilities:
* 1. chat.message (first turn) session-start.sh (once per session) * 1. chat.message (first turn) session-start.sh (once per session)
* 2. chat.message user-prompt-submit.sh (each turn) * 2. chat.message user-prompt-submit.sh (each turn)
* 3. tool.execute.before pre-tool-use.sh (project policy) * 3. tool.execute.before pre-tool-use.sh (project policy)
* 4. tool.execute.after post-tool-use.sh + context pressure warning * 4. tool.execute.after post-tool-use.sh + context pressure warning
* 5. experimental.session.compacting pre-compact.sh * 5. experimental.session.compacting pre-compact.sh
@ -15,89 +16,27 @@ import { fileURLToPath } from "node:url";
* Note: stop.sh has no equivalent OpenCode plugin event; it only fires in Copilot. * Note: stop.sh has no equivalent OpenCode plugin event; it only fires in Copilot.
*/ */
// Approximate token estimate: 4 chars ≈ 1 token (conservative for code). export const GlobalPlugin: Plugin = async ({ $, client }) => {
const CHARS_PER_TOKEN = 4;
const CONTEXT_LIMIT_TOKENS = 32768;
const PRESSURE_THRESHOLD = 0.7; // 70%
// build agent (local profile) truncates at 1500 tokens to respect OmniCoder's 32K context window.
// orchestrator gets a higher limit (2500) since it only reads, not edits.
// All other agents receive full tool responses.
const LOCAL_WORKER_MAX_TOKENS = 1500;
const LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500;
function truncate(
text: string,
maxTokens: number,
): { text: string; truncated: boolean } {
const maxChars = maxTokens * CHARS_PER_TOKEN;
if (text.length <= maxChars) return { text, truncated: false };
return {
text:
text.slice(0, maxChars) +
`\n\n[Response truncated at ~${maxTokens} tokens. Use a more targeted query to retrieve the relevant section.]`,
truncated: true,
};
}
export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
// Resolve hooks relative to this plugin file's real path (resolves symlinks). // Resolve hooks relative to this plugin file's real path (resolves symlinks).
// This makes the plugin work both as a project-local plugin and as a global // This makes the plugin work both as a project-local plugin and as a global
// plugin installed via install.sh — in either case, hooks live in ../../hooks/ // plugin installed via install.sh — in either case, hooks live in ../../hooks/
// relative to this file in the .agents/frameworks/opencode/ directory. // relative to this file in the .agents/frameworks/opencode/ directory.
const hooksDir = resolve( const hooksDir = resolve(dirname(fileURLToPath(import.meta.url)), '../../hooks');
dirname(fileURLToPath(import.meta.url)),
"../../hooks",
);
// Running cumulative context size estimate (characters) // Running cumulative context size estimate (characters)
let contextCharsUsed = 0; let contextCharsUsed = 0;
// Track sessions that have had session-start injected (fires once per session) // Track sessions that have had session-start injected (fires once per session)
const initializedSessions = new Set<string>(); const initializedSessions = new Set<string>();
/** Parse the additionalContext string from a hook's JSON output. */
function parseAdditionalContext(hookOutput: string): string | undefined {
try {
const parsed = JSON.parse(hookOutput.trim()) as {
hookSpecificOutput?: { additionalContext?: string };
};
return parsed?.hookSpecificOutput?.additionalContext ?? undefined;
} catch (_error) {
return undefined;
}
}
async function runHook( const agentBySession = new Map<string, { agent: string; model: Model; }>();
scriptName: string,
stdinJson?: string, const hooks: Hooks = {
): Promise<string> { 'chat.params': async (input, output) => {
const script = `${hooksDir}/${scriptName}`; logInfoData('chat.params', { input, output });
try { agentBySession.set(input.sessionID, { agent: input.agent, model: input.model });
const proc = stdinJson },
? await $`bash ${script} < ${Buffer.from(stdinJson)}`.text()
: await $`bash ${script}`.text();
return proc;
} catch (_error) {
// DEBUG: log hook failures so silent catches don't hide enforcement bugs
try {
const fs = await import("node:fs");
fs.appendFileSync(
"/tmp/plugin-hook-errors.log",
JSON.stringify({
ts: new Date().toISOString(),
script,
error: String(_error),
}) + "\n",
);
} catch (_e) {
// ignore
}
// Hooks are advisory — never block on hook failure
return "";
}
}
return {
// ── 1 & 2. Session start + user prompt ────────────────────────────────── // ── 1 & 2. Session start + user prompt ──────────────────────────────────
// Session-start was previously injected via experimental.chat.system.transform // Session-start was previously injected via experimental.chat.system.transform
// (pushing to output.system). That caused a Jinja "System message must be at // (pushing to output.system). That caused a Jinja "System message must be at
@ -106,21 +45,21 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
// message) is already in the conversation, so the system push lands at a // message) is already in the conversation, so the system push lands at a
// non-zero position. Injecting as a synthetic text part on the first // non-zero position. Injecting as a synthetic text part on the first
// chat.message turn avoids the position constraint entirely. // chat.message turn avoids the position constraint entirely.
"chat.message": async (input, output) => { 'chat.message': async (input, output) => {
const sessionID = input.sessionID ?? "unknown"; logInfoData('chat.message', { input, output });
// Session-start injection — runs exactly once per session, prepended so it // Session-start injection — runs exactly once per session, prepended so it
// reads before the user-prompt-submit nudges on the first turn. // reads before the user-prompt-submit nudges on the first turn.
if (!initializedSessions.has(sessionID)) { if (!initializedSessions.has(input.sessionID)) {
initializedSessions.add(sessionID); initializedSessions.add(input.sessionID);
const startOutput = await runHook("session-start.sh"); const startOutput = await runHookScript('session-start.sh');
const startContext = parseAdditionalContext(startOutput); const startContext = parseAdditionalContext(startOutput);
if (startContext) { if (startContext) {
output.parts.unshift({ output.parts.unshift({
id: `prt_${crypto.randomUUID()}`, id: `prt_${crypto.randomUUID()}`,
sessionID: input.sessionID, sessionID: input.sessionID,
messageID: input.messageID ?? crypto.randomUUID(), messageID: input.messageID ?? crypto.randomUUID(),
type: "text", type: 'text',
text: startContext, text: startContext,
synthetic: true, synthetic: true,
}); });
@ -128,11 +67,11 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
} }
const promptText = output.parts const promptText = output.parts
.filter((p): p is TextPart => p.type === "text") .filter((p): p is TextPart => p.type === 'text')
.map((p) => p.text) .map((p) => p.text)
.join("\n"); .join('\n');
const hookOutput = await runHook( const hookOutput = await runHookScript(
"user-prompt-submit.sh", 'user-prompt-submit.sh',
JSON.stringify({ prompt: promptText }), JSON.stringify({ prompt: promptText }),
); );
const context = parseAdditionalContext(hookOutput); const context = parseAdditionalContext(hookOutput);
@ -141,24 +80,24 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
id: `prt_${crypto.randomUUID()}`, id: `prt_${crypto.randomUUID()}`,
sessionID: input.sessionID, sessionID: input.sessionID,
messageID: input.messageID ?? crypto.randomUUID(), messageID: input.messageID ?? crypto.randomUUID(),
type: "text", type: 'text',
text: context, text: context,
synthetic: true, synthetic: true,
}); });
} }
}, },
// ── 3. Pre-tool-use ───────────────────────────────────────────────────── // ── 3. Pre-tool-use ─────────────────────────────────────────────────────
"tool.execute.before": async (input, output) => { 'tool.execute.before': async (input, output) => {
const toolName = input.tool as string; logInfoData('tool.execute.before', { input, output });
// ── read guards ─────────────────────────────────────────────────── // ── read guards ───────────────────────────────────────────────────
if (toolName === "read") { if (input.tool === 'read') {
const args = (output.args ?? {}) as { const args = (output.args ?? {}) as {
filePath?: string; filePath?: string;
offset?: number; offset?: number;
limit?: number; limit?: number;
}; };
const filePath = args.filePath ?? ""; const filePath = args.filePath ?? '';
// package.json read guard: // package.json read guard:
// Reading workspace package.json files auto-loads nested AGENTS.md files // Reading workspace package.json files auto-loads nested AGENTS.md files
@ -166,7 +105,7 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
// Block package.json reads under apps/ and packages/ only. // Block package.json reads under apps/ and packages/ only.
if (/(^|\/)(apps|packages)\/[^/]+\/package\.json$/.test(filePath)) { if (/(^|\/)(apps|packages)\/[^/]+\/package\.json$/.test(filePath)) {
throw new Error( throw new Error(
"BLOCKED: Reading workspace package.json files auto-loads nested AGENTS.md files and exhausts the 32K context. Use `grep_search` to find the specific field you need (e.g. a dependency version or script name) instead of reading the whole file.", 'BLOCKED: Reading workspace package.json files auto-loads nested AGENTS.md files and exhausts the 32K context. Use `grep_search` to find the specific field you need (e.g. a dependency version or script name) instead of reading the whole file.',
); );
} }
@ -178,7 +117,7 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
// Directory reads (e.g. `Read .`) never carry a limit — skip the guard. // Directory reads (e.g. `Read .`) never carry a limit — skip the guard.
let isDirectory = false; let isDirectory = false;
try { try {
const { statSync } = await import("node:fs"); const { statSync } = await import('node:fs');
isDirectory = statSync(filePath).isDirectory(); isDirectory = statSync(filePath).isDirectory();
} catch (_error) { } catch (_error) {
// path doesn't exist or inaccessible — treat as file // path doesn't exist or inaccessible — treat as file
@ -209,9 +148,9 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
// or long inventories inline in a task prompt causes "Unterminated string" // or long inventories inline in a task prompt causes "Unterminated string"
// parse errors. Cap task prompts at 1200 chars — workers should be told // parse errors. Cap task prompts at 1200 chars — workers should be told
// WHICH files to read, not given the contents inline. // WHICH files to read, not given the contents inline.
if (toolName === "task") { if (input.tool === 'task') {
const args = (output.args ?? {}) as { prompt?: string }; const args = (output.args ?? {}) as { prompt?: string };
const prompt = args.prompt ?? ""; const prompt = args.prompt ?? '';
if (prompt.length > 1200) { if (prompt.length > 1200) {
throw new Error( throw new Error(
`BLOCKED (task prompt too long: ${prompt.length} chars, max 1200): Task prompts must not embed file contents, dependency lists, or long context inline — this causes JSON parse failures. Instead, tell the worker WHICH files to read and WHAT to do. Example: "Read the root package.json and all workspace package.json files, then update the Technology Stack section in README.md to match."`, `BLOCKED (task prompt too long: ${prompt.length} chars, max 1200): Task prompts must not embed file contents, dependency lists, or long context inline — this causes JSON parse failures. Instead, tell the worker WHICH files to read and WHAT to do. Example: "Read the root package.json and all workspace package.json files, then update the Technology Stack section in README.md to match."`,
@ -223,74 +162,94 @@ export const AgentSupportPlugin: Plugin = async ({ $, directory }) => {
// Policies 1-12: command/file guards. Policy 13: read_file range limit // Policies 1-12: command/file guards. Policy 13: read_file range limit
// (≤50 lines for source files, ≤500 for docs/). Deny = throws Error. // (≤50 lines for source files, ≤500 for docs/). Deny = throws Error.
const hookInput = JSON.stringify({ const hookInput = JSON.stringify({
tool_name: toolName, tool_name: input.tool,
tool_input: output.args ?? {}, tool_input: output.args ?? {},
}); });
const hookResult = await runHook("pre-tool-use.sh", hookInput); const hookResult = await runHookScript('pre-tool-use.sh', hookInput);
// If the hook emitted a deny decision, surface it as an error // If the hook emitted a deny decision, surface it as an error
if (hookResult.includes('"permissionDecision": "deny"')) { if (hookResult.includes('"permissionDecision": "deny"')) {
const match = hookResult.match( const match = hookResult.match(/"permissionDecisionReason":\s*"([^"]+)"/);
/"permissionDecisionReason":\s*"([^"]+)"/, const reason = match?.[1] ?? 'Blocked by project policy (pre-tool-use hook).';
);
const reason =
match?.[1] ?? "Blocked by project policy (pre-tool-use hook).";
throw new Error(reason); throw new Error(reason);
} }
}, },
// ── 4. Post-tool-use ──────────────────────────────────────────────────── // ── 4. Post-tool-use ────────────────────────────────────────────────────
"tool.execute.after": async (input, output) => { 'tool.execute.after': async (input, output) => {
const response = output.response as string | undefined; logInfoData('tool.execute.after', { input, output });
if (typeof response === "string") { // MCP tools populate content differently — output.output may be undefined.
// a) Response truncation — local agents (build/orchestrator) and any ollama/ model; // Skip truncation/pressure/hook logic for those; the MCP content flows
// orchestrator gets a higher limit since it only reads, not edits. // through OpenCode's internal parts pipeline instead.
const agentName = typeof input.agent === "string" ? input.agent : ""; const text = output.output;
const isLocalAgent = if (!text) {
agentName === "build" || return;
agentName === "orchestrator" || }
(typeof input.model === "string" &&
input.model.startsWith("ollama/"));
if (isLocalAgent) {
const isOrchestrator = agentName === "orchestrator";
const maxTokens = isOrchestrator
? LOCAL_ORCHESTRATOR_MAX_TOKENS
: LOCAL_WORKER_MAX_TOKENS;
const { text: truncated } = truncate(response, maxTokens);
output.response = truncated;
}
// b) Context pressure tracking — accumulate and inject warning when ≥70% // Approximate token estimate: 4 chars ≈ 1 token (conservative for code).
contextCharsUsed += response.length; const CHARS_PER_TOKEN = 4;
const charLimit = CONTEXT_LIMIT_TOKENS * CHARS_PER_TOKEN; const CONTEXT_LIMIT_TOKENS = 32768;
const pct = contextCharsUsed / charLimit; const PRESSURE_THRESHOLD = 0.7; // 70%
if (pct >= PRESSURE_THRESHOLD) { // build agent (local profile) truncates at 1500 tokens to respect OmniCoder's 32K context window.
const pctDisplay = Math.round(pct * 100); // orchestrator gets a higher limit (2500) since it only reads, not edits.
const pressure = `[CONTEXT PRESSURE: ~${pctDisplay}% used. Be concise. Prefer targeted tool calls. Write progress to NOTES.md before continuing.]`; // All other agents receive full tool responses.
output.response = `${pressure}\n\n${output.response}`; const LOCAL_WORKER_MAX_TOKENS = 1500;
// Reset after injection so we don't spam every subsequent turn const LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500;
contextCharsUsed = 0;
}
// c) Shell out to post-tool-use hook (metacognitive reminders, methodology) function truncate(t: string, maxTokens: number): { text: string; truncated: boolean } {
const hookInput = JSON.stringify({ const maxChars = maxTokens * CHARS_PER_TOKEN;
tool_name: input.tool, if (t.length <= maxChars) return { text: t, truncated: false };
tool_input: input.args ?? {}, return {
tool_response: (output.response as string).slice(0, 500), // truncated for hook text:
}); t.slice(0, maxChars) +
const postToolOutput = await runHook("post-tool-use.sh", hookInput); `\n\n[Response truncated at ~${maxTokens} tokens. Use a more targeted query to retrieve the relevant section.]`,
const postToolContext = parseAdditionalContext(postToolOutput); truncated: true,
if (postToolContext) { };
output.response = `${output.response}\n\n${postToolContext}`; }
}
// a) Response truncation — local agents (build/orchestrator) and any llama-server/ model;
// orchestrator gets a higher limit since it only reads, not edits.
const { agent, model } = agentBySession.get(input.sessionID) ?? {};
const isLocalAgent = agent === 'build' || agent === 'orchestrator' || model?.providerID === 'llama-server';
if (isLocalAgent) {
const maxTokens = agent === 'orchestrator' ? LOCAL_ORCHESTRATOR_MAX_TOKENS : LOCAL_WORKER_MAX_TOKENS;
const { text: truncated } = truncate(text, maxTokens);
output.output = truncated;
}
// b) Context pressure tracking — accumulate and inject warning when ≥70%
contextCharsUsed += output.output.length;
const charLimit = CONTEXT_LIMIT_TOKENS * CHARS_PER_TOKEN;
const pct = contextCharsUsed / charLimit;
if (pct >= PRESSURE_THRESHOLD) {
const pctDisplay = Math.round(pct * 100);
const pressure = `[CONTEXT PRESSURE: ~${pctDisplay}% used. Be concise. Prefer targeted tool calls. Write progress to NOTES.md before continuing.]`;
output.output = `${pressure}\n\n${output.output}`;
// Reset after injection so we don't spam every subsequent turn
contextCharsUsed = 0;
}
// c) Shell out to post-tool-use hook (metacognitive reminders, methodology)
const hookInput = JSON.stringify({
tool_name: input.tool,
tool_input: input.args ?? {},
tool_response: output.output.slice(0, 500), // truncated for hook
});
const postToolOutput = await runHookScript('post-tool-use.sh', hookInput);
const postToolContext = parseAdditionalContext(postToolOutput);
if (postToolContext) {
output.output = `${output.output}\n\n${postToolContext}`;
} }
}, },
// ── 5. Pre-compact: export state before context summarization ───────────── // ── 5. Pre-compact: export state before context summarization ─────────────
"experimental.session.compacting": async (input, output) => { 'experimental.session.compacting': async (input, output) => {
await runHook("pre-compact.sh"); logInfoData('experimental.session.compacting', { input, output });
await runHookScript('pre-compact.sh');
output.prompt = ` output.prompt = `
You are a context summarizer for coding sessions. Summarize only the conversation history given; do not answer it. You are a context summarizer for coding sessions. Summarize only the conversation history given; do not answer it.
@ -316,4 +275,57 @@ Output exactly this Markdown structure. Keep every section even when empty. Use
For Clarifications: include only follow-ups that changed scope, added constraints, or redirected work. Do not mention that you are summarizing. Respond in the conversation's language.`; For Clarifications: include only follow-ups that changed scope, added constraints, or redirected work. Do not mention that you are summarizing. Respond in the conversation's language.`;
}, },
}; };
/** Parse the additionalContext string from a hook's JSON output. */
function parseAdditionalContext(hookOutput: string): string | undefined {
try {
const parsed = JSON.parse(hookOutput.trim()) as {
hookSpecificOutput?: { additionalContext?: string };
};
return parsed?.hookSpecificOutput?.additionalContext ?? undefined;
} catch (_error) {
return undefined;
}
}
async function runHookScript(scriptName: string, stdinJson?: string): Promise<string> {
const script = `${hooksDir}/${scriptName}`;
try {
const proc = stdinJson
? await $`bash ${script} < ${Buffer.from(stdinJson)}`.text()
: await $`bash ${script}`.text();
return proc;
} catch (_error) {
await client.app.log({
body: {
service: 'global-plugin',
level: 'error',
message: `(Global Plugin) Error in hook script ${script}`,
extra: {
ts: new Date().toISOString(),
script,
error: String(_error),
},
},
});
// Hooks are advisory — never block on hook failure
return '';
}
}
async function logInfoData(message: string, obj?: Record<string, unknown>) {
await client.app.log({
body: {
service: 'global-plugin',
level: 'info',
message: `(Global Plugin) ${message}`,
extra: {
ts: new Date().toISOString(),
...(obj ?? {}),
},
},
});
}
return hooks;
}; };


@ -11,10 +11,10 @@ warn() { printf '\033[0;33m⚠\033[0m %s\n' "$1"; }
skip() { printf '\033[0;34m\033[0m %s\n' "$1"; } skip() { printf '\033[0;34m\033[0m %s\n' "$1"; }
# ── 1. Copilot global hooks ────────────────────────────────────────────────── # ── 1. Copilot global hooks ──────────────────────────────────────────────────
# Generate ~/.copilot/hooks/agent-support.json with absolute paths so the hooks # Generate ~/.copilot/hooks/hooks.json with absolute paths so the hooks
# work from any workspace — no per-project symlinks or stubs needed. # work from any workspace — no per-project symlinks or stubs needed.
COPILOT_HOOKS_DIR="$HOME/.copilot/hooks" COPILOT_HOOKS_DIR="$HOME/.copilot/hooks"
COPILOT_HOOK_FILE="$COPILOT_HOOKS_DIR/agent-support.json" COPILOT_HOOK_FILE="$COPILOT_HOOKS_DIR/hooks.json"
mkdir -p "$COPILOT_HOOKS_DIR" mkdir -p "$COPILOT_HOOKS_DIR"
@ -48,7 +48,7 @@ fi
# ── 2. OpenCode global plugin ──────────────────────────────────────────────── # ── 2. OpenCode global plugin ────────────────────────────────────────────────
OC_PLUGINS_DIR="$HOME/.config/opencode/plugins" OC_PLUGINS_DIR="$HOME/.config/opencode/plugins"
OC_PLUGIN_TARGET="$DOTFILES_AGENTS/frameworks/opencode/plugin.ts" OC_PLUGIN_TARGET="$DOTFILES_AGENTS/frameworks/opencode/plugin.ts"
OC_PLUGIN_LINK="$OC_PLUGINS_DIR/agent-support.ts" OC_PLUGIN_LINK="$OC_PLUGINS_DIR/plugin.ts"
mkdir -p "$OC_PLUGINS_DIR" mkdir -p "$OC_PLUGINS_DIR"
if [[ -L "$OC_PLUGIN_LINK" && "$(readlink "$OC_PLUGIN_LINK")" == "$OC_PLUGIN_TARGET" ]]; then if [[ -L "$OC_PLUGIN_LINK" && "$(readlink "$OC_PLUGIN_LINK")" == "$OC_PLUGIN_TARGET" ]]; then


@ -12,7 +12,7 @@
* Frontmatter fields: * Frontmatter fields:
* description (required) routing description for the prompt/tool * description (required) routing description for the prompt/tool
* toolName (skills only, optional) override the derived tool name * toolName (skills only, optional) override the derived tool name
* default: load_<basename> (e.g. research.md load_research) * default: load_<basename> (e.g. research-methodology.md load_research-methodology)
* *
* Not handled here (stays bespoke): * Not handled here (stays bespoke):
* hooks/ MCP has no lifecycle intercept primitive * hooks/ MCP has no lifecycle intercept primitive
@ -33,7 +33,7 @@ const skillsDir = resolve(import.meta.dirname, "../skills");
interface ParsedFile { interface ParsedFile {
description: string; description: string;
toolName?: string; toolName?: string | undefined;
body: string; body: string;
} }
@ -61,12 +61,12 @@ function parseFrontmatter(content: string): ParsedFile {
if (descMatch) { if (descMatch) {
// If the match includes a leading quote, strip matching quotes // If the match includes a leading quote, strip matching quotes
const raw = frontmatter.match(/^description:\s*(['"])([\s\S]*?)\1\s*$/m); const raw = frontmatter.match(/^description:\s*(['"])([\s\S]*?)\1\s*$/m);
description = raw ? raw[2].trim() : descMatch[1].trim(); description = raw ? raw[2]?.trim() ?? '' : descMatch[1]?.trim() ?? '';
} }
return { return {
description, description,
toolName: toolMatch ? toolMatch[1].trim() : undefined, toolName: toolMatch?.[1]?.trim(),
body, body,
}; };
} }


@ -10,6 +10,9 @@
"dependencies": { "dependencies": {
"@modelcontextprotocol/sdk": "^1.29.0", "@modelcontextprotocol/sdk": "^1.29.0",
"zod": "^4.1.12" "zod": "^4.1.12"
},
"devDependencies": {
"@types/node": "^25.9.1"
} }
}, },
"node_modules/@hono/node-server": { "node_modules/@hono/node-server": {
@ -64,6 +67,16 @@
} }
} }
}, },
"node_modules/@types/node": {
"version": "25.9.1",
"resolved": "https://registry.npmjs.org/@types/node/-/node-25.9.1.tgz",
"integrity": "sha512-xfrlY7UD5rMJk3ZVJP8BNzS28J36YJg+xp+LPXV1TdWxr8uMH5A860QNxYDGQe/ylDSgjxE52Q9VnO7p75tJxg==",
"dev": true,
"license": "MIT",
"dependencies": {
"undici-types": ">=7.24.0 <7.24.7"
}
},
"node_modules/accepts": { "node_modules/accepts": {
"version": "2.0.0", "version": "2.0.0",
"resolved": "https://registry.npmjs.org/accepts/-/accepts-2.0.0.tgz", "resolved": "https://registry.npmjs.org/accepts/-/accepts-2.0.0.tgz",
@ -1095,6 +1108,13 @@
"url": "https://opencollective.com/express" "url": "https://opencollective.com/express"
} }
}, },
"node_modules/undici-types": {
"version": "7.24.6",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-7.24.6.tgz",
"integrity": "sha512-WRNW+sJgj5OBN4/0JpHFqtqzhpbnV0GuB+OozA9gCL7a993SmU+1JBZCzLNxYsbMfIeDL+lTsphD5jN5N+n0zg==",
"dev": true,
"license": "MIT"
},
"node_modules/unpipe": { "node_modules/unpipe": {
"version": "1.0.0", "version": "1.0.0",
"resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz", "resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz",


@ -6,5 +6,8 @@
"dependencies": { "dependencies": {
"@modelcontextprotocol/sdk": "^1.29.0", "@modelcontextprotocol/sdk": "^1.29.0",
"zod": "^4.1.12" "zod": "^4.1.12"
},
"devDependencies": {
"@types/node": "^25.9.1"
} }
} }

.agents/mcp/tsconfig.json Normal file

@ -0,0 +1,45 @@
{
// Visit https://aka.ms/tsconfig to read more about this file
"compilerOptions": {
"preserveSymlinks": true,
// File Layout
// "rootDir": "./src",
// "outDir": "./dist",
// Environment Settings
// See also https://aka.ms/tsconfig/module
"module": "nodenext",
"target": "esnext",
"lib": [
"esnext"
],
"types": [
"node"
],
// For nodejs:
// "lib": ["esnext"],
// "types": ["node"],
// and npm install -D @types/node
// Other Outputs
"sourceMap": true,
"declaration": true,
"declarationMap": true,
// Stricter Typechecking Options
"noUncheckedIndexedAccess": true,
"exactOptionalPropertyTypes": true,
// Style Options
// "noImplicitReturns": true,
// "noImplicitOverride": true,
// "noUnusedLocals": true,
// "noUnusedParameters": true,
// "noFallthroughCasesInSwitch": true,
// "noPropertyAccessFromIndexSignature": true,
// Recommended Options
"strict": true,
"jsx": "react-jsx",
"verbatimModuleSyntax": true,
"isolatedModules": true,
"noUncheckedSideEffectImports": true,
"moduleDetection": "force",
"skipLibCheck": true,
}
}


@ -0,0 +1,34 @@
---
description: 'Execution rules for debugging: hypothesis testing, instrumentation, and trace cleanup'
---
# Research Execution
Keep context clean and evidence tracked during active investigation.
## Context Management
Methodology degrades after ~15 tool calls. Re-read investigation file and
dead-ends every ~10 tool calls. When drifting toward guess-and-check, pause and
re-read notes. Hold references; load on demand.
## Findings Format
Record each hypothesis test to `.session/findings.md`:
```
- [timestamp] Hypothesis: [one sentence]
Falsification: [what you'd expect if wrong]
Result: [ELIMINATED/CONFIRMED] — [why, in one sentence]
```
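If entries are written by a script rather than by hand, a small formatter keeps them in the shape shown above. This is a sketch only: `Finding` and `formatFinding` are hypothetical names, and the two-space indent on continuation lines is an assumption about the intended layout.

```typescript
// Hypothetical helper for producing entries in the findings format.
type Verdict = 'ELIMINATED' | 'CONFIRMED';

type Finding = {
  timestamp: string;
  hypothesis: string; // one sentence
  falsification: string; // what you'd expect if wrong
  verdict: Verdict;
  reason: string; // why, in one sentence
};

function formatFinding(f: Finding): string {
  return [
    `- [${f.timestamp}] Hypothesis: ${f.hypothesis}`,
    `  Falsification: ${f.falsification}`,
    // Separator copied from the template above.
    `  Result: ${f.verdict} — ${f.reason}`,
  ].join('\n');
}

const entry = formatFinding({
  timestamp: '2026-05-23T10:00Z',
  hypothesis: 'cache is stale',
  falsification: 'fresh cache still fails',
  verdict: 'ELIMINATED',
  reason: 'fresh cache passes',
});
```

Restricting `verdict` to the two allowed values means a typo such as `CONFIRMD` fails at compile time instead of landing in `.session/findings.md`.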
## Timing Awareness
Prefix unknown commands with `time`. Fast (<5s): low barrier. Slow (>30s):
reason first. Unknown: measure. To capture output and timing together:
`{ time cmd; } 2>&1 | tee /tmp/output.txt`
## Techniques
- **Five Whys**: trace causal chains; starting point, not sole method
- **Delta Debugging**: binary search between passing/failing cases
- **Rubber Duck**: explain the system step by step to expose gaps
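The binary-search idea behind the Delta Debugging bullet can be sketched as a plain bisection over an ordered list of cases. `firstFailing` is a hypothetical helper, and it assumes the first item passes, the last item fails, and there is a single pass-to-fail transition. It is simple bisection, not Zeller's full ddmin algorithm.

```typescript
// Hypothetical bisection helper: returns the index of the first failing item.
// Preconditions (assumed): items.length >= 2, items[0] passes, the last item
// fails, and once an item fails every later item fails too.
function firstFailing<T>(items: readonly T[], passes: (item: T) => boolean): number {
  let lo = 0; // index known to pass
  let hi = items.length - 1; // index known to fail
  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    if (passes(items[mid] as T)) {
      lo = mid; // the transition is after mid
    } else {
      hi = mid; // the transition is at or before mid
    }
  }
  return hi;
}

// Toy usage: revisions r1..r8, where everything from r6 onward fails.
const revisions = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8'];
const culprit = firstFailing(revisions, (r) => Number(r.slice(1)) < 6);
```

Each probe halves the candidate range, so eight revisions need three checks. That matters most when a single check is slow.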


@ -0,0 +1,16 @@
---
description: 'Research methodology index: overview of the three-phase research workflow (setup, triage, execution)'
---
# Research Methodology
Structured investigation across three phases. Load each on demand via `read_file`.
1. **Setup** — hypothesis checklist, Understand/Diagnose orientations
`skills/research-setup.md`
2. **Triage** — risk-based table choosing Satisfice vs Strong Inference
`skills/research-triage.md`
3. **Execution** — context management, dead-ends, timing, techniques
`skills/research-execution.md`
For full agent support with delegation and session memory, use `@research`.


@ -0,0 +1,33 @@
---
description: 'Checklist for investigation setup: orientations, hypothesis, and circuit breaker baselines'
---
# Research Setup
**Goal**: Build a grounded mental model before acting.
## Investigation Checklist
Before every hypothesis cycle:
- [ ] Hypothesis written (one sentence: "I believe X because Y")
- [ ] Falsification criterion written ("if wrong, I'd expect to see ___")
- [ ] Falsification test run BEFORE confirmation test
- [ ] Result recorded (ELIMINATED with reason, or CONFIRMED with evidence)
- [ ] Hypothesis re-evaluated at this tool-call boundary
- [ ] All traces/instrumentation removed before next hypothesis
## Orientations
**Understand (Grounded Theory)** — Read code, name what you see. Compare new
observations against earlier ones. Connect categories (what calls what, data
flows). Write findings to session memory. Stop at saturation.
**Diagnose (Strong Inference + Satisficing)** — Simple check first: can a
single log answer the question? If so, log and look. If not, triage (see
`research-triage.md`).
## Mode Switching
These compose recursively:
Understand -> anomaly -> Diagnose -> need context -> Understand -> ...


@ -0,0 +1,20 @@
---
description: 'Risk assessment table for debugging: symptom-to-cause mapping and verification steps'
---
# Research Triage
Assess risk before choosing your approach.
| Factor | Low Risk | High Risk |
| ----------------- | ------------------------ | ------------------------------ |
| **Reversibility** | Easy to undo | Hard to reverse (data, deploy) |
| **Blast radius** | One file/function | Many systems, shared state |
| **Confidence** | Familiar, clear evidence | Novel, ambiguous symptoms |
| **Novelty** | Seen this before | Never encountered |
| **Time cost** | Known fast (<5s) | Unknown = measure first |
**Low risk** → Satisfice: test the single most likely hypothesis. Stop when confirmed.
**Any high risk** → Strong Inference: generate 2-3 competing hypotheses, design
a discriminating test, eliminate based on evidence.
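
The decision rule above (any single high-risk factor selects Strong Inference) can be sketched as a tiny function. The name `choose_approach` and the `low`/`high` argument encoding are hypothetical; the rule itself is taken from the table and the two lines that follow it.

```shell
# Hypothetical helper: choose an approach from per-factor risk ratings.
# usage: choose_approach REVERSIBILITY BLAST_RADIUS CONFIDENCE NOVELTY TIME_COST
# Each argument is "low" or "high"; the encoding is an assumption for illustration.
choose_approach() {
  local rating approach="satisfice"
  if [ "$#" -ne 5 ]; then
    echo "expected 5 ratings, got $#" >&2
    return 2
  fi
  for rating in "$@"; do
    case "$rating" in
      high) approach="strong-inference" ;;   # any high-risk factor escalates
      low) ;;
      *) echo "unknown rating: $rating" >&2; return 2 ;;
    esac
  done
  echo "$approach"
}
```

Requiring all five ratings forces an explicit judgment on each factor before an approach is chosen.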


@ -1,113 +0,0 @@
---
description: 'Load the structured research methodology — call this when starting any investigation, debugging session, root cause analysis, or systematic exploration of unfamiliar code. Returns a checklist with two orientations (Understand + Diagnose), risk-based triage, circuit breakers, and context management guidance.'
toolName: 'load_research_methodology'
---
# Research Methodology Skill
This skill provides a structured, evidence-based investigation methodology. It
prevents common AI agent failure modes: pattern-matching without evidence,
confirmation bias, fixing symptoms instead of causes, and methodology drift
during long sessions.
## Quick Reference: The Investigation Checklist
Before every hypothesis cycle:
- [ ] **Hypothesis written** (one sentence: "I believe X because Y")
- [ ] **Falsification criterion written** ("if wrong, I'd expect to see \_\_\_")
- [ ] **Falsification test run BEFORE confirmation test**
- [ ] **Result recorded** (ELIMINATED with reason, or CONFIRMED with evidence)
- [ ] **Hypothesis re-evaluated at this tool-call boundary** — new evidence
changes what to check next. Interleaved thinking makes this automatic for
Claude 4; consciously invoke it for other models.
- [ ] **All traces/instrumentation removed** before next hypothesis
## Two Orientations
### Understand (Grounded Theory)
**Goal**: Build a mental model from the code itself, not assumptions.
1. **Open coding** — Read code, name what you see (functions, patterns, flows)
2. **Constant comparison** — Compare new observations against earlier ones
3. **Axial coding** — Connect the categories (what calls what, data flows)
4. **Memo** — Write findings to session memory as you go
5. **Saturation check** — Stop when new files confirm what you already know
**Use for**: "How does X work?", "What's the architecture?", "I need to
understand this before changing it."
### Diagnose (Strong Inference + Satisficing)
**Goal**: Determine why something isn't working.
**Simple check first**: Can you answer this with a single log/print? If the
question is "what value does X have here?" — just log and look.
**Triage** (if the simple check didn't resolve it):
| Factor | Low Risk | High Risk |
| ----------------- | ------------------------ | ------------------------------ |
| **Reversibility** | Easy to undo | Hard to reverse (data, deploy) |
| **Blast radius** | One file/function | Many systems, shared state |
| **Confidence** | Familiar, clear evidence | Novel, ambiguous symptoms |
| **Novelty** | Seen this before | Never encountered |
| **Time cost** | Known fast (<5s) | Unknown = measure first |
**Low risk → Satisfice**: Test the single most likely hypothesis. Done if
confirmed.
**Any high risk → Strong Inference**: Generate 2-3 competing hypotheses, design
a discriminating test, eliminate based on evidence.
### Mode Switching
These compose recursively:
`Understand → anomaly → Diagnose → need context → Understand → ...`
## Circuit Breakers
1. **5+ attempts without falsifying = STOP and report**
2. **3+ edits to same file without passing test = STOP and rethink**
3. **Urge to "just try something" = STOP and write hypothesis first**
4. **Two failures at same abstraction level = go UP one level**
## Context Management
Methodology degrades after ~15 tool calls (context competition). Counteract:
- Re-read investigation file and dead-ends every ~10 tool calls
- If drifting toward guess-and-check, pause and re-read notes
- For long sessions, create an investigation file so fresh context can continue
- Hold references; load on demand. Do not read files you don't need yet.
## Dead-Ends Format
Record eliminated hypotheses so you (or the next session) don't re-test them:
```
- **[timestamp] Hypothesis:** [one sentence]
**Falsification:** [what you'd expect if wrong]
**Result:** [ELIMINATED/CONFIRMED] — [why, in one sentence]
```
Write to `.session/dead-ends.md` or the investigation file's Hypotheses section.
## Timing Awareness
- Prefix unknown commands with `time` to learn baselines
- Capture output: `time npm test 2>&1 | tee /tmp/test_output.txt`
- Fast (<5s): low barrier to run. Slow (>30s): reason first. Unknown: measure.
## Techniques
- **Five Whys**: Trace causal chains. Starting point, not sole method.
- **Delta Debugging**: Binary search between passing/failing cases (`git bisect`
logic).
- **Rubber Duck**: Explain the system step by step in writing to expose gaps.
## Full Agent
For comprehensive investigation support with delegation, exploration files, and
session memory management, use `@research`.


@ -0,0 +1,62 @@
# Verification Exercise: `build` agent smoke test
**Setup**: Open OpenCode → the default agent is now `orchestrator`. To test the
`build` agent directly, either Tab-cycle to it or use
`opencode run --agent build "your prompt"`.
## Level 1 — Read-only (verifies tool-call JSON is valid)
> **Prompt**: "Read .agents/hooks/post-tool-use.sh. Report: (1) what file path
> the counter uses, (2) what line the SELF-CHECK fires on, and (3) the exact
> modulo condition."
### Pass criteria:
- No tool call parse error in the OpenCode UI
- It reads the file in ≤50-line chunks (pagination rule working)
- Reports `/tmp/.opencode-tool-count-<hash>`, line ~23, `COUNT % 15 == 0`
- Session counter file exists: `ls /tmp/.opencode-tool-count-* 2>/dev/null`
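
For orientation, the behaviour these criteria check for (a counter file that triggers a self-check on every 15th call) might look roughly like the sketch below. This is not the contents of `post-tool-use.sh`: `bump_counter` is a hypothetical name, and the counter path is passed in as an argument so the sketch does not guess how `<hash>` is derived.

```shell
# Hypothetical sketch of a tool-call counter; NOT the actual post-tool-use.sh.
# The counter file path is a parameter because the real <hash> derivation
# is not shown here.
bump_counter() {
  # usage: bump_counter COUNTER_FILE
  local file="$1" count=0
  if [ -f "$file" ]; then
    count=$(cat "$file")
  fi
  count=$((count + 1))
  echo "$count" > "$file"
  # Same modulo condition the pass criteria expect the agent to report.
  if [ $((count % 15)) -eq 0 ]; then
    echo "SELF-CHECK at call $count"
  fi
}
```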
## Level 2 — Small bounded write (verifies end-to-end tool call + edit)
> **Prompt**: "In .agents/hooks/post-tool-use.sh, the REPO_ID derivation line
> uses md5sum. Add a single-line comment directly above it (# repo-scoped to
> avoid cross-repo counter contamination) and nothing else."
### Pass criteria:
- Makes 2-3 tool calls (read → edit → optionally verify)
- Doesn't read more than 50 lines at once
- The comment appears on the correct line in the file
- No hallucinated paths
## Level 3 — Scope escalation (verifies rule 5 in build.md)
> **Prompt**: "Refactor all five hook files to share a common REPO_ROOT
> derivation function."
### Pass criteria:
- It refuses and tells you this exceeds 2-3 files and needs the orchestrator or
default agent
- It does NOT start reading all five files and attempting the refactor
If Levels 1 and 2 pass cleanly and Level 3 correctly escalates, the build agent
is working. If Level 1 shows parse errors, restart OpenCode to reload the
renamed agent config.
## Level 4 — Orchestrator planning gate (cloud only)
**Setup**: Switch to the `orchestrator` agent (or use `/orchestrator` in
Copilot). Run a vague multi-step request.
> **Prompt**: "Clean up the hook files — reduce repetition and make sure the
> conventions match what's in .agents/AGENTS.md."
### Pass criteria:
- Produces a numbered plan with clear subtasks and acceptance criteria
- Asks "Proceed?" before starting any implementation
- Does NOT immediately start reading or editing files
- After confirming, executes subtasks sequentially with inline tool calls
(cloud) or dispatches to `build` via `task` (OpenCode/local)