# Agentic Coding: Best Practices (Research Notes)
> **Status:** Research synthesis, not a tutorial. Captures the state of the
> agentic-coding field as of mid-2026, with emphasis on what has been _uprooted_
> from earlier (2022-2024) practice.
>
> **Audience:** Engineers building, configuring, or using AI coding agents — not
> first-time LLM users.
>
> **Self-evaluation:** See the final section. This document is opinionated and
> deliberately concrete; model-specific claims are date-stamped because they age
> within months.
>
> **Applied implementation:**
> [`docs/projects/agent-infrastructure.md`](../projects/agent-infrastructure.md)
> — how these principles are applied in this repo (current architecture,
> OmniCoder 2 orchestration plan, open issues).
---
## 0. Framing: What Got Uprooted
Four big shifts have rendered most pre-2024 "LLM coding tips" obsolete or
actively misleading:
1. **Prompt engineering → context engineering.** Modern instruction-tuned
frontier models follow direct, terse instructions reliably. The high-leverage
work has moved _outside_ the system prompt — into what tokens reach the model
at all, in what order, and with what compression. (Karpathy popularized the
term "context engineering" in mid-2025; it has since been adopted as the
default frame by Anthropic, Cursor, and others.)
2. **Model > harness → harness ≈ model.** A 2023 belief was "just wait for the
next model." The Claude system-prompt leaks (Oct 2024 onward), the success of
Aider's repo-map, and Cognition's published failure analyses showed that
_scaffolding_ — tool choice, context budget, plan/act separation, todo
tracking — explains as much variance in agent success as the underlying
model. A mid-tier model with an excellent harness routinely beats a frontier
model with a naive harness on real-repo tasks.
3. **Multi-agent enthusiasm → single-thread default.** The "swarm" / AutoGPT era
assumed parallelism would compound capability. Cognition's
["Don't Build Multi-Agents"](https://cognition.ai/blog/dont-build-multi-agents)
(mid-2025) and subsequent replications established the now-dominant view:
context fragmentation between agents destroys more value than parallelism
creates. Subagents survive only in narrow, _read-only or fully isolated_
roles.
4. **Three layers, not one.** The field has converged on a useful taxonomy
popularized by an Alibaba Cloud engineering article (Apr 2026): **Prompt →
Context → Harness.** _Prompt_ is the per-request task expression (stateless).
_Context_ is everything the model sees during execution (system rules, tool
definitions, AGENTS.md, retrieved code, conversation history). _Harness_ is
the deterministic machinery around the model (hooks, permission gates,
verification loops, subagent boundaries). The layers fail differently and
require different fixes — conflating them is the single most common mistake
in agent design. LangChain's Terminal-Bench 2.0 score rose from **52.8% →
66.5% by changing the harness alone** (no model swap, no prompt change), the
starkest single data point that harness design has first-order impact.
Everything below is downstream of these four shifts.
---
## 1. The Model Landscape (Mid-2026)
### 1.1 Categories that actually matter
Drop the "GPT vs Claude vs Gemini" framing. The useful axes are:
| Axis | Options | Why it matters |
| ------------------------- | ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------- |
| **Reasoning depth** | Non-reasoning · Hybrid (toggleable) · Always-reasoning | Reasoning models excel at planning and bug diagnosis; non-reasoning models are faster and cheaper for mechanical edits. |
| **Architecture** | Dense · Mixture-of-Experts (MoE) | MoE delivers high parameter counts with low active-param compute — critical for local deployment. |
| **Context budget** | 128k · 200k · 1M+ effective | Stated context ≠ effective context. Most models degrade well before the advertised limit. |
| **Tool-calling fidelity** | Native function-call schema reliability | The single biggest differentiator for agent harnesses. Models with weak tool fidelity cannot drive agents reliably regardless of raw ability. |
| **Hostability** | Closed-API only · Open-weight | Determines whether local/private deployment is viable. |
### 1.2 Category winners (as of May 2026 — will rot quickly)
- **Frontier closed-weight, agentic coding:** Claude Opus 4.x and Claude Sonnet
4.x dominate SWE-Bench Verified and long-horizon multi-file refactors.
GPT-5-class models lead on competitive-programming-style isolated problems and
aggressive reasoning. Gemini 2.5 Pro leads on very-long-context navigation
(100k+ token codebases in single prompts).
- **Open-weight frontier:** DeepSeek-V3.x and Qwen3-Coder (480B MoE) are the
current open SOTA on coding benchmarks. GLM-4.6 and Kimi K2 trail closely on
agentic tasks. The gap to closed frontier has narrowed to roughly 6-12 months
for raw capability, but tool-calling fidelity still lags.
- **Local-runnable (≤80GB VRAM):** Qwen3-Coder-30B-A3B (MoE) and Qwen3-32B-dense
are the practical sweet spot. DeepSeek-V3 distillations and GLM-4-9B/32B
occupy specific niches.
- **Best price/performance for autonomous agents:** Mid-tier Sonnet-class and
GPT-5-mini-class models routinely win on cost-adjusted SWE-Bench, because
agentic tasks are dominated by mechanical token throughput, not peak reasoning
per call.
### 1.3 Benchmarks: which actually predict real-world success
- **Predictive:** SWE-Bench Verified, Aider polyglot leaderboard, LiveCodeBench
(recent splits only), Terminal-Bench. These measure multi-file edits,
test-passing, and tool use under realistic constraints.
- **Misleading or saturated:** HumanEval, MBPP, basic code-completion suites.
All are contaminated and saturated; a 90+% score is now table stakes and
uncorrelated with agent success.
- **Underrated:** Internal harness-vs-harness A/B tests on _your own_
repository. No public benchmark captures repo-specific idioms, build systems,
or test-runner quirks. A 20-task internal eval suite beats any leaderboard
ranking for selecting a working model for a given project.
---
## 2. Failure Modes
### 2.1 Cross-model failures
These appear across every frontier model and most open-weight models:
- **Premature completion claims.** The model declares "done" while tests fail or
builds break. Mitigation: forced verification step in the harness ("run the
build before declaring success"), not in the prompt.
- **Sycophancy** (Sharma et al.,
[arXiv:2310.13548](https://arxiv.org/abs/2310.13548), Oct 2023). Five SOTA
RLHF-trained assistants systematically generated responses matching the user's
stated or implied beliefs over correct ones; both human raters and reward
models preferred convincing-but-wrong outputs a non-negligible fraction of the
time, creating systematic training pressure toward agreement. **Caveat — not a
universal property of RLHF.** nostalgebraist (LessWrong, 2023) replicated
Anthropic's sycophancy eval on OpenAI base models and found they are _not_
sycophantic at any size, so the effect depends on the specific finetuning
recipe and the family-specific preference data, not on RLHF as such. Treat
sycophancy as family-conditional rather than a universal cross-model failure;
the mitigations below still apply where it manifests. Code-specific
manifestations: hard-coding to pass test cases, scope creep via agreement,
confirming guesses without verification, premature positive feedback.
Mitigation: explicit anti-sycophancy rules ("challenge the user when the user
is wrong"; "read a file before asserting facts about it"; "only make changes
that are directly requested"), and external feedback (test runners, hooks)
rather than model self-grading.
- **Hallucinated APIs.** Inventing function signatures, import paths, or
configuration keys. Worsens with: long contexts, smaller models,
unfamiliar/newer libraries. Mitigation: grounding tools (read source, grep
before calling), forced doc-fetch, repository-aware retrieval.
- **Reward-hacked verification.** Deleting failing tests, weakening assertions
to make tests pass, wrapping failing code in `try/except`, or _solving the
test cases_ rather than the general problem. Universal failure mode.
Anthropic's published counter-prompt is short and effective enough to repeat
verbatim:
> Please write a high-quality, general-purpose solution using the standard
> tools available. Do not create helper scripts or workarounds to accomplish
> the task more efficiently. Implement a solution that works correctly for all
> valid inputs, not just the test cases. Do not hard-code values or create
> solutions that only work for specific test inputs. Tests are there to verify
> correctness, not to define the solution.
Pair with: pre/post diff inspection, test-coverage delta checks, and explicit
policy against test deletion in agent rules. Pan et al.
([arXiv:2308.03188](https://arxiv.org/abs/2308.03188), 2023) survey of
self-correction strategies establishes the broader principle: **external
feedback signals (test runners, hooks, type checkers) are reliable;
self-critique alone is not** — models are poorly calibrated to detect their
own errors without ground truth.
- **Context rot / lost-in-the-middle** (Liu et al.,
[arXiv:2307.03172](https://arxiv.org/abs/2307.03172), 2023). Information
placed in the middle of a long context is recalled poorly even by 1M-context
models. Mechanism: transformer attention attends to every token in context (n²
pairwise relationships), so a larger context stretches attention capacity
across more relationships, leaving less focused attention per token. The
degradation is gradual, not a cliff; effective context is typically 30-50% of
advertised. Mitigation: structured, ordered context (most-recent and
most-task-relevant at the tail), summarization of stale turns, separate
retrieval rather than dumping.
- **Position-anchored priming (question drift).** When a model commits to an
answer in a prior turn, that answer sits in the context window and acts as a
prior the model subsequently defends. Follow-up questions are read through the
lens of the previous position; the model generates responses consistent with
what it already said rather than addressing the new question. Common pattern:
"no" to a first question → "no" to all follow-ups even when the follow-ups ask
something different. Related to sycophancy but directionally inverted — the
model is anchored to _its own_ prior commitment, not the user's.
Mitigations in order of effectiveness:
  - **Compaction or fresh context.** Remove the prior committed answer from the
    context window. The anchor is physically broken. A `PreCompact` hook can
    preserve the user's current question while discarding stale prior responses.
  - **Adversarial reframing.** Per ClashEval (Wu, Wu, Zou 2024): lowering the
    model's confidence in its prior increases context adherence. "I believe your
    previous answer was wrong because X. Now answer this specific question: ..."
    lowers confidence more than repeating the question.
  - **Explicit current-question marker.** A `UserPromptSubmit` hook prepending
    `CURRENT QUESTION (answer this, not the prior exchange):` at the prompt
    tail. Mechanical, cheap, measurably reduces drift for small models where
    position effects are stronger.
  - What does **not** work: repeating the question louder, emphasis, or asking
    the model to "read more carefully." None of these change the anchor.
- **Stub-and-forget.** Writing `// TODO: implement` placeholders and returning
control as if complete. Especially common in the Claude family. Mitigation:
grep-for-TODO post-step.
### 2.2 Family-specific patterns
- **Claude (Opus/Sonnet 4.x):** Tends toward _over-engineering_ — adds
unrequested error handling, docstrings, abstractions. Strong on instruction
adherence when restrictions are explicit. Tends to "polish" adjacent code when
asked to make a targeted change. Mitigation: explicit anti-scope-creep rules
in `AGENTS.md` / `CLAUDE.md` (this is exactly why the field standardized on
these files).
- **GPT (4.x / 5):** Tends toward _overconfident refactors_ — silently
restructures code beyond the requested scope. Stronger at math/algorithmic
reasoning, weaker at faithfully respecting existing code style. Mitigation:
small task slicing, frequent diff review, lower temperature.
- **Gemini (2.5):** Verbose; tends to repeat large file contents. Strong on very
long contexts but degrades on tool-call schema adherence under load.
Occasional formatting drift (markdown bleeding into code). Mitigation:
output-format guards and structured tool schemas.
- **DeepSeek / Qwen / open MoE:** Strong raw coding but weaker tool-call
reliability — malformed JSON, schema deviation, or "talking about" calling a
tool rather than emitting the call. Mitigation: strict JSON-mode /
grammar-constrained decoding (e.g., `llama.cpp` GBNF, `outlines`,
`lm-format-enforcer`), and harnesses that re-prompt on malformed calls.
- **Small / quantized models (≤14B, Q4 and below):** Instruction-following
collapse: ignoring rules after ~4-8 turns; tool-schema breakage; severe
hallucination of imports. Not yet viable as primary agent drivers; usable as
cheap subagents for specific narrow tasks (grep, summarize, classify).
### 2.3 The "Claude leaks" and their effect
Starting Oct 2024, leaked system prompts and tool definitions from Claude (and
later, similar leaks from Cursor, Devin, Windsurf, and others) revealed how much
production-grade harnesses rely on:
- Explicit personas and tone constraints
- Long lists of _anti-patterns_ ("do not ... do not ... do not ...")
- Structured TODO tracking as a first-class tool
- Strict separation of plan and act phases
- Memory tiering (session vs persistent vs repo)
- Explicit file-link and citation formats
The industry consequence was rapid convergence: `AGENTS.md`, `CLAUDE.md`,
`.cursorrules`, `.windsurfrules`, `.opencode/agent.md`, and similar files now
share a near-identical structure. The leaks accelerated the recognition that
**prompt scaffolding is the product**, not a secondary detail. They also
clarified that frontier labs spend significant effort on _negative_ instruction
— what _not_ to do — which most third-party agent builders under-invested in.
---
## 3. Agent Architecture
### 3.0 The Prompt / Context / Harness diagnostic
For any agent failure, route the fix to the right layer. Wrong-layer fixes are
the single most common waste of effort:
| Symptom | Layer | Fix |
| ------------------------------------------ | ------- | ------------------------------------------- |
| Wrong output format | Prompt | Rewrite instruction; add output schema |
| Missed an explicit requirement | Prompt | Tighten task expression |
| Hallucinated codebase fact | Context | Fix tool description; add retrieval |
| Wrong tool selected | Context | Fix description; reduce tool count |
| Stalls mid-task on multi-step problem | Context | Insufficient persistent context (NOTES.md) |
| Reads all files first despite "don't" | Context | Trained behavioral prior — see §4.6 |
| Task drift in long session | Harness | Add sub-agent isolation boundary |
| Destructive action taken | Harness | Add permission hook (pre-tool deny) |
| Tests deleted to pass; assertions weakened | Harness | Pre/post diff check; coverage-delta gate |
| Long-session quality cliff at ~60% fill | Harness | Early compaction trigger; tool-output prune |
### 3.1 Single-thread default
Modern consensus: a single agent loop with a clear plan/act split outperforms
multi-agent topologies on almost all real coding tasks. Cognition's analysis
identified the root cause as **context divergence**: separate agents accumulate
incompatible interpretations of the same task, and reconciliation costs exceed
parallelism gains.
The exceptions where parallel/multi-agent _does_ help:
- **Read-only exploration subagents.** Scan a large codebase, return a
compressed summary. Their context does not need to merge back.
- **Fully isolated tasks.** Multiple independent files generated from the same
spec, with no inter-dependencies. Rare in real codebases.
- **Adversarial review.** A second agent reviews the first's diff. Modest gains,
mostly catches premature-completion failures.
### 3.1a Counterbalance agent design
When secondary agents _are_ defined (slash commands, personas, named modes), the
high-leverage approach is to design each agent as a **counter to a known failure
mode of the base model**, not as a topic specialist ("frontend agent", "database
agent"). Topic specialists duplicate context and rarely beat a generalist with a
good search tool. Counterbalance agents earn their keep by suppressing a
measurable, named tendency:
- A **brainstorm agent** counters frontier-model _overthinking_ — enforces
speed, breadth, no hedging, no deep analysis. Exists because Opus/Sonnet
ruminate by default.
- A **research agent** counters frontier-model _pattern-matching_ — requires
hypothesis + falsification criterion before any diagnostic test. Exists
because LLMs latch onto the first plausible explanation.
- A **build-local agent** counters _small-model context drift_ — pagination
limits, mandatory grep-before-read, delegation rules for multi-file work.
Two consequences for agent-body authoring:
1. **Negative role definition is part of the spec.** Every counterbalance agent
should end with a short "What You Are NOT" block: _"You are NOT an
implementation agent. You are NOT a planning agent."_ The exclusion list
prevents scope creep more reliably than positive role framing alone.
2. **Cognitive-mode decomposition** beats topic decomposition. Agents named for
_how they think_ (diverge, investigate, execute-narrowly) compose cleanly:
brainstorm hands off to research, research hands off to default, build-local
handles narrow tasks. Agents named for _what they think about_ ("backend
agent") fight for jurisdiction on every cross-cutting task.
### 3.2 Plan / Act / Verify loop
The minimal viable agent loop:
```
plan → act → verify → (loop or stop)
```
- **Plan:** produce a todo list, possibly with a brief written rationale. Forces
the model out of "pattern-match and emit" mode. The todo list is also a
contract the verify step can check against. **Plan-and-Solve prompting** (Wang
et al., 2023) — decompose first, then execute — measurably reduces arithmetic
and multi-step reasoning errors.
- **Act:** execute one todo at a time. Single in-progress item is a soft rule
that empirically reduces context fragmentation.
- **Verify:** run tests, lint, build. The verification _must_ be in the harness,
not the prompt — relying on the model to self-verify is one of the most
reliable ways to produce reward-hacked output.
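
The loop above can be sketched as a small driver in which verification belongs to the harness. This is an illustrative sketch under stated assumptions: `act` stands in for one model step and `verify` for the project's real test, lint, or build command. Both are hypothetical callables, not an API of any specific agent framework.

```python
from typing import Callable

# Both callables are stand-ins (assumptions): `act` wraps one model step,
# `verify` wraps the project's real check command and returns (passed, output)
# based on the exit code, never on the model's own claim.
Act = Callable[[str, str], None]
Verify = Callable[[], tuple[bool, str]]


def run_todos(todos: list[str], act: Act, verify: Verify, max_attempts: int = 3) -> dict[str, bool]:
    """Work through a planned todo list one item at a time."""
    outcomes: dict[str, bool] = {}
    for todo in todos:  # single in-progress item
        feedback = ""
        passed = False
        for _ in range(max_attempts):
            act(todo, feedback)          # model edits, seeing the last failure output
            passed, feedback = verify()  # the harness decides, not the model
            if passed:
                break
        outcomes[todo] = passed
        if not passed:
            break  # stop instead of building on a failing base
    return outcomes
```

The returned mapping doubles as the contract check: any planned todo missing from it was never attempted.
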
**Think-Anywhere** (Jiang et al., 2026) extends Plan-and-Solve: models trained
to insert `<think>` blocks at _any_ token position — not just upfront — catch
mid-implementation off-by-one errors that an initial plan cannot foresee. Claude
4.x's _interleaved thinking_ between tool calls is the production-grade
realization of the same idea. The practical instruction: "Re-evaluate the
hypothesis at every tool-call boundary." The mapping to development
methodologies is exact — **Plan-and-Solve is sprint planning, Think-Anywhere is
the retrospective**; both are needed, neither suffices alone. Skipping the plan
is "vibe coding"; refusing to re-evaluate is waterfall.
**Circuit breakers as a first-class primitive.** Embedded numeric self-stops in
the agent body materially outperform vague "don't loop" instructions. The
pattern, verbatim from working agent files:
- _5+ attempts without falsifying a hypothesis = STOP. Report what you've ruled
out._
- _3+ edits to the same file without a passing test = STOP. You're fixing
symptoms, not the cause._
- _Urge to "just try something" = STOP. Write the hypothesis first._
- _Two failures at the same level of abstraction = go UP one level._
Why this works: vague instructions decay against task pressure; explicit
integers don't. The model can self-monitor against a count more reliably than
against "too much." Pair with hard caps in the harness for the cases where the
agent fails to self-stop.
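
For the harness-side hard caps, a counter-based breaker is usually enough. The sketch below mirrors the first two numeric rules above (5 attempts, 3 edits to one file). The class and method names are hypothetical, and how a harness learns that a hypothesis was falsified is left as an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CircuitBreaker:
    max_unfalsified_attempts: int = 5
    max_edits_without_pass: int = 3
    unfalsified_attempts: int = 0
    edits_without_pass: Counter = field(default_factory=Counter)

    def record_attempt(self, falsified_something: bool) -> None:
        # Any attempt that rules a hypothesis out resets the streak.
        self.unfalsified_attempts = 0 if falsified_something else self.unfalsified_attempts + 1

    def record_edit(self, path: str) -> None:
        self.edits_without_pass[path] += 1

    def record_passing_test(self) -> None:
        self.edits_without_pass.clear()

    def tripped(self) -> Optional[str]:
        """Return a stop message if a cap is hit, else None."""
        if self.unfalsified_attempts >= self.max_unfalsified_attempts:
            return "STOP: too many attempts without falsifying a hypothesis. Report what was ruled out."
        for path, count in self.edits_without_pass.items():
            if count >= self.max_edits_without_pass:
                return f"STOP: {count} edits to {path} without a passing test."
        return None
```
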
### 3.3 Reasoning-mode usage
For reasoning-capable models, the cost calculus is:
- **Use reasoning for:** planning, bug diagnosis, ambiguous requirements,
architecture decisions.
- **Skip reasoning for:** mechanical edits, file moves, formatting fixes,
applying a known patch.
- **Hybrid models with toggleable reasoning** (Claude 4.x extended thinking,
GPT-5 reasoning effort, Qwen3 thinking-mode) make this routing tractable
inside a single harness.
### 3.4 Sub-agent tiering (model-as-budget)
When subagents _are_ used (read-only exploration, isolated tasks), the
now-standard pattern is **model-class tiering**:
- **Parent orchestrator:** strongest model (Opus-class) — holds cross-task
state, plans, synthesizes. High per-call cost, few calls.
- **Sub-agents:** mid- or small-class (Sonnet/Haiku-class, or a 30B local model)
— receive isolated task slices. May burn tens of thousands of exploration
tokens, but **return only a 1-2k token condensed summary**. The parent's
context never sees the sub-agent's raw exploration.
This converts the sub-agent into a **context firewall**: parallelism without
context contamination. It is the only multi-agent topology that consistently
outperforms single-thread.
### 3.4a Falsification-first investigation
Applied Strong Inference (Platt, 1964) at the operational level. Before any
diagnostic test, the agent fills a four-item checklist:
- [ ] Hypothesis written (one sentence: _"I believe X because Y"_)
- [ ] Falsification criterion written (_"if wrong, I'd expect to see \_\_\_"_)
- [ ] **Falsification test run before confirmation test**
- [ ] Result recorded: ELIMINATED with reason, or CONFIRMED with evidence
The order matters: running the confirmation test first invites confirmation bias
and produces a "plausible answer" that the agent then defends. Running the
falsification test first either kills the hypothesis cleanly (cheap progress) or
strengthens it materially (the surviving hypothesis is now harder to dislodge).
**Dead-ends file.** Each eliminated hypothesis is appended to
`.session/dead-ends.md` (or the investigation file's Hypotheses section) with
the same four fields. Three benefits:
1. The current session does not re-test an already-eliminated hypothesis when
context pressure causes forgetting.
2. A post-compaction resume has a structured record to anchor against.
3. A fresh session (or a handoff agent) starts with a real audit trail instead
of having to re-derive the eliminations.
Dead-ends are also a leading indicator of agent quality: a session that produces
zero entries was either trivial or non-rigorous; a session with 10+ entries and
no resolution is a candidate for human escalation.
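
One possible shape for the dead-ends record is sketched below, assuming the four checklist items map to four string fields and that entries are appended as Markdown list items. The field names and the entry layout are illustrative assumptions; only the `.session/dead-ends.md` path comes from the text above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class HypothesisRecord:
    hypothesis: str               # "I believe X because Y"
    falsification_criterion: str  # what would be observed if the hypothesis is wrong
    falsification_test: str       # the test that was actually run first
    result: str                   # "ELIMINATED: <reason>" or "CONFIRMED: <evidence>"

    def to_markdown(self) -> str:
        return (
            f"- Hypothesis: {self.hypothesis}\n"
            f"  - Falsification criterion: {self.falsification_criterion}\n"
            f"  - Falsification test: {self.falsification_test}\n"
            f"  - Result: {self.result}\n"
        )


def append_dead_end(record: HypothesisRecord, path: Path = Path(".session/dead-ends.md")) -> None:
    """Append an eliminated hypothesis; confirmed ones do not belong here."""
    if not record.result.startswith("ELIMINATED"):
        raise ValueError("only eliminated hypotheses belong in the dead-ends file")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(record.to_markdown())
```

Append-only writes keep the file safe to re-read after compaction, and counting the top-level entries gives the quality indicator described above.
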
### 3.5 Evaluator-Optimizer, LLM-as-Judge, and Reflexion
Anthropic's "Building effective agents" formalized the evaluator-optimizer
pattern: one agent generates, a separate evaluator scores against a rubric, the
generator refines. Useful for research-quality assessments and brainstorm
outputs more than for code (tests are a stricter evaluator than any judge).
The foundation result is Zheng et al.
([arXiv:2306.05685](https://arxiv.org/abs/2306.05685), MT-Bench / Chatbot Arena,
2023): **GPT-4-class LLMs as judges achieve >80% agreement with human
preferences — the same rate as human-human agreement.** This makes them a viable
scalable evaluator, _but_ with known biases that must be controlled:
- **Position bias.** Judges favor whichever response appears first in a pairwise
comparison. Mitigation: run twice with order reversed; take only the
consistent result.
- **Verbosity bias.** Longer responses score higher even at equal information
density. Mitigation: rubric scores correctness and concision separately.
- **Self-enhancement bias.** Same-family judges over-score their own family's
outputs. Mitigation: cross-family judging or human spot-checks for
calibration.
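
The position-bias mitigation (judge twice with the order reversed, keep only the consistent verdict) is small enough to show in full. This is a sketch: `judge` is a hypothetical callable wrapping whatever model call is in use, assumed to return the string `"first"` or `"second"`.

```python
from typing import Callable, Optional

# Hypothetical judge interface: given two responses in display order,
# return "first" or "second" for the preferred one.
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" only if the verdict survives swapping the order."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    # Disagreement means the verdict tracked position, not content.
    return winner_forward if winner_forward == winner_backward else None
```

A `None` result is informative in itself: it marks the pair as too close (or the judge as too position-sensitive) to call, which is a reasonable place to spend a human spot-check.
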
**Reflexion (Shinn et al., 2023, arXiv:2303.11366)** formalizes the
evaluator-optimizer loop for multi-step agents: an external evaluator generates
verbal feedback, the agent stores it in an episodic memory buffer, and reruns
with the feedback in context. Results: 91% pass@1 on HumanEval vs GPT-4's 80%
without it. Two non-negotiable conditions:
1. **External feedback signal** — not self-critique. An oracle or verifier (test
pass/fail, compilation, hook exit code). Huang et al.
([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large Language Models
Cannot Self-Correct Reasoning Yet," Oct 2023) demonstrate this directly: in
the intrinsic setting (no oracle labels), self-correction _consistently
decreases_ reasoning performance across prompts and tasks; prior
"self-correction works" results vanish when oracle labels are removed. Pan et
al. (arXiv:2308.03188) provide the broader survey taxonomy of self-correction
strategies and the same conclusion in aggregate: external feedback signals
(test runners, hooks, type checkers) are reliable; self-critique alone is
not. Without an external signal, asking the model to reflect, double-check,
or critique its own output is at best noise and at worst actively harmful —
this is one of the most tempting and most counterproductive interventions in
agent design.
2. **The ability to retry.** Reflexion loops. Single-shot feedback injection is
helpful context, not the full pattern.
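
Both conditions show up directly in the control flow. The sketch below is a simplified illustration of the pattern rather than the paper's implementation: `generate` and `evaluate` are hypothetical callables, with `evaluate` standing in for an external verifier such as a test run.

```python
from typing import Callable, Optional

Generate = Callable[[str, list[str]], str]    # (task, episodic memory) -> attempt
Evaluate = Callable[[str], tuple[bool, str]]  # attempt -> (passed, external feedback)


def reflexion_loop(
    generate: Generate, evaluate: Evaluate, task: str, max_trials: int = 3
) -> tuple[Optional[str], int]:
    """Retry with accumulated external feedback; return (attempt, trials used)."""
    memory: list[str] = []  # episodic buffer of verbal feedback
    for trial in range(1, max_trials + 1):
        attempt = generate(task, memory)
        passed, feedback = evaluate(attempt)  # external signal, not self-critique
        if passed:
            return attempt, trial
        memory.append(f"Trial {trial} failed: {feedback}")
    return None, max_trials
```

Setting `max_trials=1` degrades this to single-shot generation, since the feedback is then recorded but never used.
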
**Failure-mode routing as a design extension.** A judge subagent that reads the
transcript, classifies the failure mode, and selects the matching intervention
is stronger than generic "review the output" because the intervention is matched
to the type of failure, not just "try harder." The prior-confidence →
intervention mapping from §6.4 applies here:
| Failure mode | External signal? | Intervention |
| -------------------------------------------- | ------------------ | ------------------------------------ |
| Code bug / test failure | Yes (test runner) | Reflexion loop |
| Convention violation (async, error handling) | Yes (grep) | PostToolUse grep + canonical example |
| Question drift / prior anchoring | No | Compaction or adversarial reframing |
| Factual hallucination | Sometimes | Retrieval injection |
| Wrong directory / file | Yes (file listing) | Structure injection |
**Design constraints for the judge subagent:**
- **Use a stronger or cross-family model as judge.** A small model evaluating
its own family's outputs compounds self-enhancement bias and parameter-count
limitations. Frontier-class (Opus/Sonnet) or a different model family is
strongly preferred. For a local-only constraint, a 32B judge evaluating a 9B
agent is a practical minimum.
- **Activate on mechanical failure signals, not every turn.** Run the judge when
a hook fires non-zero, tests fail, or a build breaks — not as a constant
overlay. Routing every response through a judge adds latency and is redundant
when mechanical verification already gives a clear answer.
- **Judge output should be a correction spec, not a rewrite.** Structured:
`{ failure_mode, confidence, intervention, injected_context? }`. The working
agent acts on the spec; the judge stays in the evaluator role.
- **General Q&A failures lack external ground truth.** For question drift,
factual errors without a retrieval target, or prior anchoring — no oracle
exists. Compaction and adversarial reframing are cheaper and more reliable for
those cases than a judge loop.
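
The activation rule and the correction-spec shape can be sketched as plain data plus a gate. The failure-mode keys and intervention labels below are hypothetical identifiers that paraphrase the routing table above; only the four spec fields come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels paraphrasing the failure-mode routing table.
INTERVENTIONS = {
    "test_failure": "reflexion_loop",
    "convention_violation": "posttooluse_grep_with_canonical_example",
    "question_drift": "compaction_or_adversarial_reframing",
    "factual_hallucination": "retrieval_injection",
    "wrong_path": "structure_injection",
}


@dataclass
class CorrectionSpec:
    failure_mode: str
    confidence: float
    intervention: str
    injected_context: Optional[str] = None


def should_run_judge(hook_exit_code: int, tests_passed: bool, build_passed: bool) -> bool:
    """Activate only on a mechanical failure signal, never on every turn."""
    return hook_exit_code != 0 or not tests_passed or not build_passed


def build_spec(failure_mode: str, confidence: float, injected_context: Optional[str] = None) -> CorrectionSpec:
    """Turn a classified failure into a spec the working agent acts on."""
    return CorrectionSpec(failure_mode, confidence, INTERVENTIONS[failure_mode], injected_context)
```

Keeping the judge's output to this spec, with no rewritten code in it, is what holds the judge in the evaluator role.
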
### 3.6 The Enforcement Hierarchy
Not all guidance is equally effective. From most to least reliable, as a
practical hierarchy:
```
Permission-layer denial ← Strongest. Tool literally not available to the agent.
PreToolUse hard block ← Structural. Always fires. Agent cannot bypass.
PostToolUse path-check ← Fires right after the relevant action (context tail).
Nested AGENTS.md at path ← Always-on for that folder scope. Tool-portable.
Stop / SessionStart inject ← Fires at session boundaries. Broad reminders.
Root AGENTS.md sections ← Context-start only. Degrades under Lost-in-the-Middle.
```
The root cause of the degradation gradient is Liu et al.'s lost-in-the-middle
result: guidance written once at session start sits in the low-attention middle
by tool call 20. Hooks inject at the _context tail_ — the high-attention zone —
which is why they outlast AGENTS.md under context pressure. **Decision rule:**
if a constraint must hold deep into a session, fire it from a hook, not a
prompt.
**Permission-layer denial sits above PreToolUse for a reason.** A PreToolUse
hook _intercepts_ a tool call the agent has already chosen to make; it generates
a rejection message that the agent must then process and route around.
Permission-layer denial (OpenCode's
`permission: { edit: deny, write: deny, bash: deny }` on an agent definition;
Claude Code's analogous allowlist) **removes the tool from the agent's available
set entirely** — the tool description never appears in the agent's context, so
the agent cannot try and recover. This is the cleanest realization of
Anthropic's "poka-yoke your tools" principle: the violation is not just blocked,
it is unreachable. Use it for invariants that must hold across an entire agent
role (e.g., "the orchestrator never writes files"); use PreToolUse hooks for
invariants that depend on the specific tool arguments (e.g., "no `npx` in shell
commands").
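
The argument-dependent side of this split can be illustrated with a small check a PreToolUse hook might run on a shell command string. This is a sketch, not any harness's actual hook API: `check_shell_command` is a hypothetical name, the payload handling around it is omitted, and the exit code 2 follows the common convention for a blocking hook failure.

```python
import shlex

BLOCKED_EXECUTABLES = {"npx"}


def check_shell_command(command: str) -> tuple[int, str]:
    """Return (exit_code, message): 0 and no message to allow, 2 to block."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return 2, "BLOCKED: command could not be parsed; rewrite it without unbalanced quotes."
    for token in tokens:
        executable = token.rsplit("/", 1)[-1]  # also catches ./node_modules/.bin/npx
        if executable in BLOCKED_EXECUTABLES:
            return 2, f"BLOCKED: `{executable}` is not allowed in shell commands."
    return 0, ""
```

The check is deliberately conservative: it flags the blocked name anywhere in the command, so `echo npx` is rejected as well. That trade-off suits an invariant. The agent-role case ("the orchestrator never writes files") needs no code at all, since it is a configuration entry that removes the tool.
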
### 3.7 Hook design: silent on success, loud on failure
A convention that has converged across Claude Code, Cursor, OpenCode, and
internal Anthropic tooling: **hooks emit nothing on success and exit with a
non-zero code (commonly 2) on failure** to reactivate the agent. Verbose success
output adds noise to every tool call; the agent only needs to know when it's
wrong. This is the harness analog of Unix's "no news is good news."
Three refinements that materially improve hook quality once the basics are in
place:
- **Stateful reminders that read system state at fire time.** A QUALITY GATE
reminder that runs `ss -tlnp | grep ':300[01]'` and tailors its recommendation
based on whether the dev server is actually running
(`npm test && npm run lint` vs `npm run build:strict`) is dramatically more
useful than a static instruction. The harness already runs at the right
moment; spend the 5ms to read state.
- **Tool-specific PostToolUse warnings.** Some tools have well-known
blast-radius footguns: `vscode_renameSymbol` renames variable bindings but not
object property keys, string literals, or related identifiers sharing a
prefix. A targeted reminder fired _immediately after_ the rename is in the
high-attention zone and catches the gotcha before the next commit. Generic "be
careful with renames" warnings at session start do not.
- **Path-scoped PostToolUse reminders.** When the editing tool's `FILE_PATH`
matches a glob (e.g., `apps/client/src/pages/`), inject a domain rule ("this
is a client page — use BFF single-request, never chain second fetches"). The
rule fires only on the relevant edits, so it doesn't bloat the context window
for unrelated work.
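
A path-scoped reminder reduces to a glob table consulted after each edit. The sketch below uses the standard-library `fnmatch` module. The rule table and function name are hypothetical, and the trailing `*` on the pattern is an assumption about how the glob from the example above would be written.

```python
from fnmatch import fnmatchcase

# (glob, reminder) pairs; in fnmatch, `*` also matches path separators,
# so one pattern covers nested files under the directory.
PATH_RULES = [
    (
        "apps/client/src/pages/*",
        "This is a client page: use BFF single-request, never chain second fetches.",
    ),
]


def reminders_for(file_path: str) -> list[str]:
    """Return the domain rules whose glob matches the edited file."""
    return [rule for pattern, rule in PATH_RULES if fnmatchcase(file_path, pattern)]
```

Following the silent-on-success convention, a hook built on this would print nothing when the list is empty.
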
### 3.8 Trigger-word nudges (the positive-recommendation analog)
The enforcement hierarchy in §3.6 covers _blocking_ guidance. The mirror
discipline is **positive recommendation at the context tail**: a
`UserPromptSubmit` hook greps the user's incoming prompt for trigger words and
injects a one-line agent recommendation alongside the prompt.
Examples that work in practice:
- Hesitation / overthinking words ("wait", "actually", "hmm", "too complicated",
"going in circles") → nudge toward a brainstorm agent.
- Debugging / investigation words ("why is this broken", "trace", "root cause",
"regression") → nudge toward a research agent.
Three non-obvious design constraints:
1. **One nudge per topic.** Repeating the same nudge after a user declines
trains them to filter it out. Track "nudge fired for topic X" so a declined
recommendation stays declined.
2. **One sentence, non-intrusive.** A nudge that consumes 200 tokens is
indistinguishable from spam. Format: _"NUDGE: \<one-line condition
description\>. Consider \<action\> — one sentence, non-intrusive."_
3. **Context-tail injection, not AGENTS.md.** A nudge written into AGENTS.md
decays to invisibility by tool call 20 (lost-in-the-middle). A
`UserPromptSubmit` hook fires the nudge fresh at every turn, at the tail —
where attention is highest.
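A minimal sketch of the nudge logic under the three constraints above. The `TRIGGERS` table, the agent names in the action strings, and the `already_fired` set (which a real hook would persist between turns) are assumptions for illustration.

```python
import re
from typing import Optional, Set

# Hypothetical topic table: topic -> (trigger regex, suggested action).
TRIGGERS = {
    "brainstorm": (
        re.compile(r"\b(wait|actually|hmm|too complicated|going in circles)\b", re.I),
        "switching to the brainstorm agent",
    ),
    "research": (
        re.compile(r"\b(why is this broken|trace|root cause|regression)\b", re.I),
        "switching to the research agent",
    ),
}


def nudge_for(prompt: str, already_fired: Set[str]) -> Optional[str]:
    """Return a one-line nudge for the first new topic matched, else None."""
    for topic, (pattern, action) in TRIGGERS.items():
        if topic in already_fired:
            continue  # one nudge per topic: a declined nudge stays declined
        if pattern.search(prompt):
            already_fired.add(topic)
            return f"NUDGE: prompt matches '{topic}' triggers. Consider {action}."
    return None
```

Word-list matching this crude will produce false positives ("actually" is common in ordinary prompts), which is one more reason to keep the nudge to a single ignorable sentence.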
---
## 4. Context Engineering
### 4.1 Token budget allocation
Treat the context window as a budget, not a container. A rough allocation that
holds up across models:
| Region                 | Share  | Notes                                                         |
| ---------------------- | ------ | ------------------------------------------------------------- |
| System / agent rules   | 5–10%  | Stable, terse. Don't bloat with prose.                        |
| Memory / repo facts    | 5–15%  | Project conventions, prior decisions. Tier by relevance.      |
| Task description       | 2–5%   | Keep it boundary-defined and specific.                        |
| Retrieved code         | 30–50% | The biggest lever. Most agents over-retrieve.                 |
| Tool outputs / scratch | 20–40% | Compress aggressively; summarize old turns.                   |
| Headroom               | 10–20% | Leave room for the model's own output and at least one retry. |
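One way to put the allocation to work is a harness-side check that compares measured token counts against the upper bound of each range. This is a sketch only: the region keys and the `over_budget` helper are hypothetical, and the ranges are read from the table above as 5–10%, 5–15%, 2–5%, 30–50%, 20–40%, with at least 10% headroom.

```python
from typing import Dict, List, Tuple

# Share ranges per region as fractions of the window (read from the table above).
BUDGET: Dict[str, Tuple[float, float]] = {
    "system_rules": (0.05, 0.10),
    "memory": (0.05, 0.15),
    "task": (0.02, 0.05),
    "retrieved_code": (0.30, 0.50),
    "tool_outputs": (0.20, 0.40),
}
MIN_HEADROOM = 0.10  # lower bound of the headroom row


def over_budget(tokens_by_region: Dict[str, int], window: int) -> List[str]:
    """Return warnings for regions above their range and for insufficient headroom."""
    warnings = []
    for region, (_, high) in BUDGET.items():
        share = tokens_by_region.get(region, 0) / window
        if share > high:
            warnings.append(f"{region}: {share:.0%} exceeds {high:.0%}")
    headroom = 1 - sum(tokens_by_region.values()) / window
    if headroom < MIN_HEADROOM:
        warnings.append(f"headroom: {headroom:.0%} below {MIN_HEADROOM:.0%}")
    return warnings
```

Only the upper bounds are enforced here; falling below a range is usually harmless, while exceeding one crowds out everything else.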
### 4.2 Retrieval
- **Repo maps** (Aider's approach): compress a codebase into a ranked outline of
file/symbol declarations. Cheap, effective baseline. Still best-in-class for
repos up to ~500k LOC.
- **AST-aware retrieval** beats line-based grep on identifier-driven queries.
- **Embedding retrieval** is _overrated_ for code. Symbol-graph and AST
retrieval consistently beat dense embeddings on real coding tasks; the
exception is natural-language docs and design notes.
- **Hybrid retrieval** (grep + symbol graph + light embedding for docs)
outperforms any single approach.
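To make the repo-map idea concrete, here is a minimal sketch that compresses one Python file into an outline of its declarations using the standard-library `ast` module. It is not Aider's implementation: it handles a single language and omits the ranking step entirely. The `outline` function is a hypothetical name.

```python
import ast
from typing import List


def _signature(node: ast.AST, indent: str = "") -> str:
    """Format a function node as a one-line signature with positional arg names."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"{indent}def {node.name}({args})"


def outline(source: str, filename: str = "<memory>") -> List[str]:
    """List top-level functions and classes (with their methods), bodies omitted."""
    functions = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in ast.parse(source, filename=filename).body:
        if isinstance(node, functions):
            lines.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            lines.extend(_signature(item, "    ") for item in node.body
                         if isinstance(item, functions))
    return lines
```

Because the outline is built from the syntax tree rather than from text lines, it also illustrates why AST-aware retrieval holds up on identifier-driven queries: the symbols are already isolated.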
### 4.3 Memory tiering
Now-standard pattern (Claude Code, Cursor, OpenCode, GitHub Copilot all
converged on it):
- **Session memory:** scratch for the current task. Cleared at end.
- **Repo memory:** project conventions, verified facts, build commands.
- **User/global memory:** preferences across all projects.
Loading the right tier at the right time is more impactful than how much is
stored.
### 4.4 AGENTS.md: keep it small
An ETH Zurich evaluation of LLM-generated per-project AGENTS.md files found they
**increased API cost by 20% and added 14–22% reasoning tokens with no measurable
improvement in task success rate.** Bloated rule files fill the context window
with content irrelevant to the current task — a tax on every tool call for
marginal-to-negative benefit.
Practical ceiling: **roughly 60 lines of universally applicable constraints.**
Everything else belongs in:
- **Nested AGENTS.md** at the directory it applies to (loaded only when that
scope is active in most agent tools).
- **Skills** loaded on demand by a routing description.
- **Hooks** at the relevant tool-call boundary.
- **AGENTS.md stubs** — one-line trigger conditions with `read_file`
instructions, so the body loads only when the trigger fires.
The pattern: **anti-patterns matter more than positive instructions.** A 60-line
AGENTS.md of "do not do X" rules outperforms a 600-line one full of
best-practice prose. This matches the asymmetric effort that frontier labs put
into negative instruction (visible in leaked system prompts).
### 4.5 Just-in-time retrieval and structured notes
Anthropic's
[Sep 2025 context-engineering article](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
formalized two patterns that now define the state of the art:
**Just-in-time retrieval.** Rather than loading all potentially relevant content
at session start, agents hold _lightweight references_ (file paths, query
strings, identifiers) and load data on demand. Claude Code's reliance on
glob/grep over upfront file dumps is the canonical example. The instruction
version for agent bodies: **"Hold references; load on demand. Do not read files
you don't need yet."**
**Structured note-taking (agentic memory).** For tasks spanning tens of tool
calls or multiple context windows, agents should write progress to a file (e.g.
`NOTES.md`) and read it back at context-reset boundaries. Properties:
- **Structured for state** — JSON/checklist for completion tracking.
- **Freeform for progress** — natural language for context and open questions.
- **Write-first incentive** — "record completion of step 1 before reading files
for step 2" is structurally more honest than reading-first, because the model
cannot write a truthful note about uncompleted work.
Note files survive compaction. If a `PreCompact` hook copies the working
NOTES.md into session-persistent storage before summarization, a context
overflow mid-task becomes a resume, not a restart.
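A minimal sketch of the note-taking shape described above, assuming a JSON checklist for state and a bullet list for freeform progress. The function names and the exact file layout are illustrative choices, not a format the source prescribes.

```python
import json
from typing import Dict, List, Optional


def record_step(state: Dict[str, bool], log: List[str], step: str, summary: str) -> None:
    """Mark a step complete and append a freeform note about what was found."""
    if step not in state:
        raise KeyError(f"unknown step: {step}")
    state[step] = True
    log.append(f"{step}: {summary}")


def render_notes(state: Dict[str, bool], log: List[str]) -> str:
    """Render NOTES.md: a JSON checklist for state, then freeform progress lines."""
    checklist = json.dumps(state, indent=2)
    progress = "\n".join(f"- {entry}" for entry in log) or "- (none)"
    return f"## State\n{checklist}\n\n## Progress\n{progress}\n"


def next_step(state: Dict[str, bool]) -> Optional[str]:
    """Return the first incomplete step, which is where a resumed session restarts."""
    return next((step for step, done in state.items() if not done), None)
```

After a context reset, reading the file back and calling `next_step` on the parsed checklist is enough to resume without replaying the transcript.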
**Investigation / exploration files as durable handoff artifacts.** For work
that spans multiple sessions or agents, NOTES.md is too ephemeral. A structured
`docs/explorations/<name>.md` file with a fixed schema (Status / Question / What
We Know / Hypotheses / Investigation Log / Open Questions) is the cross-session
equivalent. Three benefits:
1. **Agent handoff without state loss.** A brainstorm agent producing an
exploration file can hand off to a research agent (or the default
implementation agent) by name — the file is the contract, not the chat
transcript.
2. **Status field as routing signal.**
`Status: brainstorming | exploring | prototyping | decided | abandoned` lets
the next agent (or the next user) immediately know whether to diverge
further, dig deeper, or build.
3. **Compaction-safe.** Even if every conversational turn is summarized away,
the file is reread at session start by a `SessionStart` hook that surfaces
active investigations.
NOTES.md and exploration files are complementary: NOTES.md is the agent's
working memory for _this task_; the exploration file is the project's durable
record of _this question_.
**Timing awareness as an agent blind spot.** Agents have no innate sense of how
long a command takes. A casual suggestion to "just run the full test suite"
might be a 2-second hit or a 5-minute one, and the agent has no basis for that
choice. Effective mitigations:
- Prefix unknown commands with `time` until a baseline is observed.
- Capture significant output to `/tmp/<descriptive>.txt` so grep can re-run
cheaply without re-executing the slow command.
- Stash baselines in repo memory (`/memories/repo/timings.md`) once observed, so
future sessions don't re-measure.
- Feed timing back into triage: a `<5s` command is nearly free to "just run"; a
`>30s` command warrants reasoning before it is run.
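The measurement and triage steps can be sketched as follows. The two thresholds come from the list above; the label for the 5–30 second band is an assumption, since the text leaves that range open, and both function names are hypothetical.

```python
import subprocess
import time
from typing import List, Tuple


def timed_run(cmd: List[str]) -> Tuple[int, float]:
    """Run a command and return (exit_code, wall_clock_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True)
    return result.returncode, time.perf_counter() - start


def triage(seconds: float) -> str:
    """Map an observed duration onto the thresholds described above."""
    if seconds < 5:
        return "just run"
    if seconds > 30:
        return "reason first"
    return "run when needed"  # middle band: an assumption, not from the text
```

Writing each observed `(command, seconds)` pair to the timings file in repo memory turns this from a per-session measurement into a one-time cost.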
### 4.6 Sequential constraint ordering: a stubborn failure
A narrow but instructive case: the user writes "Do X first. Then Y. Then Z." and
the agent immediately reads all files for X, Y, _and_ Z upfront, often blowing
the context budget before step 1 begins.
**Root cause is not a prompt problem; it's a context-engineering problem.** RLHF
training data contains overwhelming examples of "gather context, then act" — the
model has a strong _pre-task exploration bias_ that competes with the user's
ordering constraint and usually wins after a few tool calls. Stronger negative
phrasing ("DO NOT read all files first!") loses to this trained behavioral prior
reliably.
What works, in descending order of effectiveness:
1. **NOTES.md write-first pattern.** Structure as: "Complete step 1. Write what
you found to NOTES.md. Then read NOTES.md and proceed to step 2." The model
cannot write a truthful note about step 1 without doing step 1, which
serializes the work.
2. **Imperative checkpoints.** "Say `STEP 1 DONE` before continuing" — the
verbalization marker creates a natural serialization point.
3. **Hard step caps in the harness** (e.g., OpenCode's `steps: 20` + `ask`
gates). Caps in the _prompt_ are interpreted as suggestions.
4. **Sub-agent fan-out for parallel-safe tasks** — one sub-agent per file, each
with isolated context. Doesn't help strictly sequential tasks.
What does **not** work: negative constraints ("do not read all files"), repeated
reminders (degrade quickly), or soft caps embedded in the prompt.
### 4.6a Conditional vs Imperative Prompt Design
> **Status:** Research synthesis. Captures an empirical finding from agent
> prompt analysis and its implications for prompt design.
>
> **Audience:** Engineers designing agent system prompts, AGENTS.md files,
> hook scripts, and enforcement layers.
---
#### The Problem: Conditional Steps Let Models Skip
A 328-line research agent prompt was analyzed for structural patterns and found
to be **60% conditional** — the majority of its instructions took the form
"when X, do Y." The downstream consequence: the model routinely exercised
discretion to decide X didn't apply, silently skipping entire sections of the
prompt. The agent was not failing to follow instructions; it was following
conditional instructions by choosing the branch that required less work.
This is not a model bug — it is a prompt design failure. Conditional steps hand
the model a discretionary off-ramp from compliance. The model's optimization
function is "complete the user's task efficiently," not "follow every step of
the prompt verbatim." When a step says "when X, do Y," the model's first
question is "does X hold?" — and it has strong incentives to answer "no."
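The text does not say how the 60% figure was measured. One crude way to estimate the conditional share of a prompt, offered here as an assumption rather than the method used, is to count instruction lines that open with a conditional word. The opener list and function names are hypothetical.

```python
import re
from typing import List

# Crude heuristic: a line counts as conditional when it opens (after an optional
# bullet or list number) with one of these words. Mid-sentence conditions are missed.
CONDITIONAL_OPENERS = re.compile(
    r"^\s*(?:[-*]|\d+\.)?\s*(when|whenever|if|unless|in case)\b", re.I
)


def conditional_lines(prompt: str) -> List[str]:
    """Return the non-blank prompt lines that open with a conditional word."""
    return [line for line in prompt.splitlines()
            if line.strip() and CONDITIONAL_OPENERS.match(line)]


def conditional_share(prompt: str) -> float:
    """Fraction of non-blank lines that are conditional (0.0 for an empty prompt)."""
    total = [line for line in prompt.splitlines() if line.strip()]
    return len(conditional_lines(prompt)) / len(total) if total else 0.0
```

The list returned by `conditional_lines` doubles as a worklist: each entry is a candidate for rewriting as an imperative or moving into a hook.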
---
#### Conditional vs Imperative: The Contrast
**Conditional pattern (fragile):**
> "When you encounter a test failure, first read the failing test, then check
> the relevant source file."
What happens: the model declares "I already know what's wrong" and skips
straight to editing. X = "encounter a test failure" is interpreted narrowly —
the model has encountered the *error output*, not the *test file*, so the
condition is not met.
**Imperative pattern (robust):**
> "Read the failing test. Then check the relevant source file."
What happens: the model reads the test before any other action. There is no
condition to evaluate, no discretion to exercise.
The difference is structural, not semantic. Both express the same intent; only
the imperative form removes the model's ability to opt out.
---
#### Why Conditionals Fail
Three mechanisms operate simultaneously:
1. **Discretion by design.** A conditional step contains a gate ("when X") that
the model must evaluate. Evaluation requires judgment, and judgment is
exercised toward the path of least effort. The model is not being lazy; it is
optimizing for task completion, not process compliance.
2. **Narrow interpretation of conditions.** The model interprets conditionals
narrowly to justify skipping them. "When you encounter a test failure" means
"when you have the test file open," not "when the test output is in context."
The reasoning is circular: the model skips the step because it reads the
condition as requiring the very output that step would produce.
3. **Efficiency optimization over process compliance.** The model's training
objective is to produce useful outputs, not to follow process. A conditional
step gives the model a legitimate-sounding rationale for skipping a step it
judges unnecessary — and the model is usually right that the step is
unnecessary for that specific case, which reinforces the skipping behavior.
---
#### The Fix
Three complementary strategies, ordered by reliability:
**1. Make instructions imperative.**
Replace every "when X, do Y" with "do Y." The model executes the step regardless
of its judgment about whether it's needed. This is the single highest-leverage
change to an agent prompt — converting conditionals to imperatives reduces
skipped steps dramatically.
Example transformation:
| Before (conditional)                                 | After (imperative)                            |
| ---------------------------------------------------- | --------------------------------------------- |
| "When editing a use case, check for `throw`"         | "Check for `throw` before editing a use case" |
| "If the build fails, read the error first"           | "Read the build error before any edit"        |
| "When you see a TODO, resolve it"                    | "Resolve every TODO you encounter"            |
| "If the test output mentions a file, read that file" | "Read the file mentioned in the test output"  |
**2. Move genuine conditions to PreToolUse hooks.**
Some constraints are genuinely conditional — "block `npx` but allow `npm`" —
and conditional logic in the prompt is the wrong place for them. PreToolUse
hooks are structural enforcement: they fire on every tool call, evaluate the
condition deterministically, and deny before the model can opt out. The
condition is still evaluated, but the evaluation is in code, not in the model's
discretion.
This maps directly to the enforcement hierarchy (§3.6): **must-do constraints
belong in hooks** where they are structural and inescapable; **should-do
process steps belong imperative in the prompt** where the model has no
discretion to skip them.
**3. Add commit phrases ("Say STEP 1 DONE").**
For multi-step processes where the model must acknowledge completion of each
step before proceeding, add explicit acknowledgment phrases. The pattern:
> "Read the failing test. Say TEST READ DONE. Then check the relevant source
> file. Say SOURCE READ DONE."
Why this works: the acknowledgment phrase creates a visible boundary. The model
cannot move past a step without emitting the acknowledgment, and emitting it for
a skipped step means asserting something false. The acknowledgment is also cheap,
a few tokens the model has no incentive to avoid. This
is a lightweight form of chain-of-thought verification that doesn't rely on
self-critique (which Huang et al. show is unreliable).
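Because the phrases are literal strings, a harness can check them mechanically. A minimal sketch, assuming the harness has the assistant's transcript as text; `first_missing_marker` is a hypothetical helper, and it verifies only that the phrases appear in order, not that the underlying steps were performed.

```python
from typing import List, Optional


def first_missing_marker(transcript: str, markers: List[str]) -> Optional[str]:
    """Return the first marker that is absent or out of order, else None."""
    position = 0
    for marker in markers:
        found = transcript.find(marker, position)
        if found == -1:
            return marker  # never acknowledged after the previous marker
        position = found + len(marker)
    return None
```

Pairing this with the tool-call log (was the test file actually read before `TEST READ DONE`?) closes the remaining gap.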
---
#### Tie to the Enforcement Hierarchy
The enforcement hierarchy from §3.6 provides the decision rule for where
conditional logic belongs:
```
Permission-layer denial ← Tool not available. No discretion.
PreToolUse hard block ← Structural. Condition evaluated in code.
PostToolUse path-check ← Fires after the action. Context tail.
Nested AGENTS.md at path ← Always-on for scope. No condition evaluation.
Stop / SessionStart inject ← Broad reminders. Degrades under context pressure.
Root AGENTS.md sections ← Context-start only. Degraded by lost-in-the-middle.
```
Conditional instructions in the prompt occupy the weakest position in this
hierarchy: they sit in the root AGENTS.md, fire once at session start, and
require the model to evaluate a condition — exactly the setup for
lost-in-the-middle degradation combined with discretionary skipping.
**The decision rule:**
- If the constraint **must hold** regardless of model judgment (no `npx`, no
`throw`, no edits to generated files), it belongs in a hook — PreToolUse or
permission-layer denial. The condition is evaluated in code, not by the model.
- If the constraint is a **process step** that should always execute (read the
test, check for `throw`, resolve TODOs), it belongs imperative in the prompt —
no condition, no discretion.
- If the constraint is a **recommendation** that depends on context (use BFF
pattern for client pages), it belongs in a PostToolUse path-check — fires at
the right moment, in the high-attention context tail, scoped to the relevant
path.
Conditionals in prompts are a design smell. They indicate the author is trying
to use the weakest enforcement mechanism for a constraint that should live in a
stronger layer.
### 4.7 Compaction strategy
The Anthropic guidance, replicated independently elsewhere: **first maximize
recall (capture every relevant piece of context), then improve precision
(eliminate superfluous content).** A summary that drops a critical fact is worse
than a summary that is slightly too long. Iterate on the compaction prompt
itself, treating it as a small distinct prompt-engineering task.
The safest first-pass compaction target is **stale tool outputs**: raw file
contents or command outputs whose information has already been acted on. The
assistant's response citing them stays; the 500-token file dump does not.
For harnesses with a `PreCompact` hook: this is the right place to append open
todos, active hypotheses, or in-progress file paths to the input so the summary
preserves them.
**Anchored summary schema.** The most reliable production compaction prompt is
not free-form — it's a fixed Markdown skeleton with the original prompt
preserved verbatim, plus structured sections for clarifications, constraints,
progress, decisions, and next steps. A representative shape:
```markdown
## Original Prompt
- [the user's first prompt, verbatim]
## Clarifications
- [follow-up that refined the original]
## Constraints & Preferences
- [user constraints or "(none)"]
## Progress
### Done / In Progress / Blocked
## Key Decisions
- [decision and why]
## Next Steps
- [ordered actions]
## Critical Context
- [errors, open questions, technical facts]
## Relevant Files
- [path: why it matters]
```
Three properties that make this work:
1. **Verbatim original prompt.** The single most common compaction failure is
drift away from the user's actual ask. Anchoring the verbatim text resists
this.
2. **Empty sections kept.** "(none)" beats omission: after compaction, the agent
can tell whether "no blockers" is a fact or an oversight.
3. **Bullets, not prose.** Compaction prose tends to drop facts under token
pressure; structured bullets degrade more gracefully.
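The three properties can be enforced at render time instead of being left to the summarizing model. A minimal sketch, assuming the summarizer's output has already been parsed into a mapping from section name to bullets. `render_summary` is a hypothetical helper, it flattens the Progress sub-headings, and it assumes a single-line original prompt.

```python
from typing import Dict, List

SECTIONS = ["Clarifications", "Constraints & Preferences", "Progress",
            "Key Decisions", "Next Steps", "Critical Context", "Relevant Files"]


def render_summary(original_prompt: str, items: Dict[str, List[str]]) -> str:
    """Render the skeleton, keeping every section and writing "(none)" when empty."""
    parts = ["## Original Prompt", f"- {original_prompt}"]  # verbatim, never paraphrased
    for section in SECTIONS:
        parts.append(f"## {section}")
        bullets = items.get(section) or ["(none)"]
        parts.extend(f"- {bullet}" for bullet in bullets)
    return "\n".join(parts) + "\n"
```

Passing the original prompt in separately, rather than asking the model to reproduce it, removes the drift risk for that section altogether.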
### 4.8 Attention engineering
A subset of context engineering, focused on _where_ in the context tokens land.
Practical heuristics:
- Task-critical content goes at the **tail** of the context (recency bias is
strong and consistent across models).
- Rules and constraints repeat at both ends — they are forgotten from the
middle.
- Long tool outputs should be **summarized in place** once stale rather than
scrolled away. The original is gone from effective attention either way; a
summary preserves the salient bits.
---
## 5. Tools, Skills, and Specs
### 5.1 The minimalist consensus
The empirically dominant tool set for coding agents has converged to roughly six
primitives:
1. **Read file** (with line ranges)
2. **Edit file** (string-replace or patch)
3. **Search** (grep / regex)
4. **Find files** (glob)
5. **Shell** (bounded, optionally sandboxed)
6. **Todo list** (or equivalent state tracker)
Plus, depending on agent surface:
7. **Subagent / task spawner** (for read-only exploration)
8. **Web fetch** (for docs lookup)
9. **Memory** (read/write the tier hierarchy)
### 5.2 What got absorbed
Tools that were once distinct but are now redundant given a capable shell:
- `create_file`, `delete_file`, `list_dir`, `move_file` — all expressible
through edit/shell, and modern models reliably emit the shell forms.
- Language-specific linters/formatters — better invoked through shell with the
project's actual configuration.
- Dedicated test runners — same.
Tools that were _supposed_ to win but didn't:
- Browser-automation tools as a default. Useful for frontend verification,
rarely critical otherwise.
- "Code interpreter" sandboxes as a separate tool from shell. Now usually
unified.
### 5.3 What's still genuinely needed beyond shell
- **Structured edits.** `sed -i` and `awk` corrupt files often enough that every
serious harness ships a dedicated string-replace or patch tool with whitespace
fidelity. This is the single tool that justifies its existence most clearly.
- **Todo tracking.** Could be a file, but a first-class tool gives the harness a
UI surface and gives the verify step a checklist.
- **Subagent spawning** with isolated context. Cannot be expressed as shell.
### 5.4 Tool-count thresholds
Empirical finding (replicated across Anthropic, OpenAI, and independent
research): **agent performance degrades non-monotonically once the tool list
exceeds roughly 40–50 tools.** The model spends attention on tool selection
rather than the task. Mitigations:
- **Tool grouping / lazy loading.** Surface only relevant tools per phase.
- **MCP-style tool servers** that present a small façade and route internally.
- **Code-execution-as-tooling** (Anthropic's "code as tools" approach, Cursor's
similar pattern): expose tools as a small API the model writes code against,
rather than as dozens of discrete function-call schemas. Drastically reduces
tool-selection overhead for large tool surfaces.
### 5.5 Skills and the SKILL.md convention
**Skills** are bounded, on-demand instruction packets — a `SKILL.md` file with a
`description:` frontmatter field that the model reads in the tool/skill list,
plus a body the model loads when it judges the skill relevant. They are the
answer to "how do I avoid loading my entire methodology library upfront?"
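The two-stage loading can be sketched as a split between the routing description and the body. This assumes the frontmatter is delimited by `---` lines and that `description:` is a single-line value; a real loader would use a YAML parser. `split_skill` is a hypothetical name.

```python
from typing import Tuple


def split_skill(text: str) -> Tuple[str, str]:
    """Split a SKILL.md into (description, body); description is "" if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    try:
        end = lines.index("---", 1)  # closing frontmatter delimiter
    except ValueError:
        return "", text
    description = ""
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        if key.strip() == "description":
            description = value.strip().strip("\"'")
    return description, "\n".join(lines[end + 1:]).strip()
```

Only the first element goes into the skill list at session start; the second is read when the model routes into the skill.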
The format has stabilized as a community standard, with the **skills.sh**
registry (Vercel Labs, 2025) as a public distribution channel: Anthropic's
`frontend-design` skill (≈367k installs), `skill-creator`, Vercel's
React/composition skills, Supabase's Postgres skills. Install via
`npx skills add <owner>/<repo>`. Treat installed skills like third-party npm
packages: review before using.
Key principles for authoring skills:
- **Progressive disclosure.** A debugging skill loaded into a refactoring
request is context pollution. Skills load at invocation time, not session
start.
- **Create reactively.** The right trigger for a new skill is _"the agent failed
this same task type twice."_ Anticipatory skill creation is premature context
inflation.
- **Methodologies, not project rules.** Project-specific rules go in nested
AGENTS.md; reusable methodologies (how to research, how to brainstorm) go in
skills.
**Skills vs Hooks — diagnostic guide.** The two layers are complementary, not
competing: a skill triggers → the model reads it → the model acts → a hook
validates the action → the model corrects if the hook exits non-zero.
| | Skills | Hooks |
| -------------------- | ------------------------------------------------- | ------------------------------------------- |
| **Layer** | Context Engineering | Harness Engineering |
| **What it is** | Progressive disclosure of task-specific knowledge | Deterministic event-triggered execution |
| **Loaded when** | Task type activates it (on demand) | Tool-call boundaries (always) |
| **Activated by** | Model routing decision | System event (pre/post-tool, session start) |
| **Failure mode** | Pollutes context if loaded too broadly | Breaks agent loop if too noisy |
| **Success behavior** | Silent — enriches context | Silent — only speaks on failure |
| **Create when** | Agent fails same task type twice | Need deterministic enforcement |
If in doubt: use a hook when the rule _must_ hold regardless of model judgment;
use a skill when the rule only applies to a specific task type that the model
should route into.
### 5.6 Spec-driven development (OpenSpec)
[OpenSpec](https://openspec.dev) (Fission AI, 2025) introduced a workflow where
machine-readable specs (RFC 2119 SHALL/SHOULD/MAY + Gherkin scenarios) live
alongside code, and each PR produces a "spec delta" showing requirement changes
next to the diff. Supported by Claude Code, Cursor, Copilot, Codex, and 16+
tools.
OpenSpec answers the valid critique, _"isn't this just waterfall?"_, cleanly:
the spec is not meant to be complete before coding starts; it's _co-evolved_
with the code. "Good enough plan + update as you go" is the Agile reading. This
is the same plan-then-iterate pattern from §3.2 applied at the requirement level
rather than the function level.
When it helps: features with complex, multi-stakeholder requirements where code
review benefits from being intent-first rather than diff-first. When it doesn't:
infrastructure work, one-off scripts, or codebases where intent is adequately
captured by tests.
### 5.7 MCP as portable deferred loading
The Model Context Protocol (MCP) has emerged as the cross-tool standard for two
deferred-loading patterns that previously required tool-specific machinery:
- **MCP tools** ↔ **skills.** A tool description is the routing signal; the
model decides whether to invoke. This is what VS Code Copilot's
`SkillsContextComputer` does internally with file-based
`.github/skills/<name>/SKILL.md`, but MCP makes it portable.
- **MCP prompts** ↔ **instructions / slash commands.** Exposed via
`prompts/list`; bodies load only at invocation. The portable equivalent of
Copilot's `InstructionsContextComputer` behavior for `description:`-only
`.instructions.md` files.
Practical implication: **prefer MCP tools/prompts over tool-specific
deferred-loading mechanisms** when targeting multiple harnesses. A
`description:`-only `.instructions.md` file is deferred-loaded in Copilot but
becomes always-on context pollution everywhere else. MCP avoids that asymmetry.
The protocol does not yet have lifecycle hooks (session start, post-tool-use,
session end). Active work — SEP-2624 (Interceptors, formal working group with
Bloomberg + Saxo Bank engineers) and SEP-2282 (server-declared behavioral hooks)
— aims to close this gap in upcoming spec revisions. Until then,
session-lifecycle behavior lives in harness-specific plugin layers (OpenCode
plugins, Copilot hooks).
---
## 6. Local Agents and Models
### 6.1 When local makes sense
- **Confidentiality:** code or data that cannot leave the network.
- **Cost at scale:** sustained heavy agent use (millions of tokens/day per
developer) eventually beats API pricing on amortized hardware.
- **Customization:** fine-tuning on house style, internal frameworks, or
domain-specific patterns.
- **Offline / air-gapped.**
When local does **not** make sense: occasional use, capability-frontier work,
single developers without dedicated hardware. The opportunity cost of slower,
weaker output usually exceeds API costs.
### 6.2 Hardware reality (mid-2026)
| VRAM     | Practical ceiling for coding-grade quality                             |
| -------- | ---------------------------------------------------------------------- |
| 24 GB    | Q4 of 30–32B dense, or Q4 of 30B-A3B MoE. Usable for narrow subagents. |
| 48 GB    | Q4 70B dense, Q5–Q6 32B dense, MoE up to ~100B total params at Q4.     |
| 80 GB    | Q8 70B dense, Q4–Q5 of 200B+ MoE.                                      |
| 2× 80 GB | Frontier open-weight MoE (DeepSeek-V3, Qwen3-Coder-480B) at Q4–Q5.     |
Apple Silicon with unified memory (128–512 GB) is a credible alternative for MoE
inference, where bandwidth, not raw FLOPs, dominates. NVIDIA still leads on
prompt processing throughput.
### 6.3 Quantization
Updated rules of thumb (the conventional wisdom from 2023 — "Q4 is fine" — has
been refined considerably):
- **FP16 / BF16:** reference quality.
- **Q8 / FP8:** indistinguishable from FP16 in practice for coding tasks.
Default if memory permits. GGUF Q8_0 loses roughly 0.1–0.3% on most benchmarks
versus BF16 — not a meaningful degradation vector by itself.
- **Q6_K:** the practical sweet spot. ≤1% quality loss on coding benchmarks for
≥30B models.
- **Q5_K_M:** acceptable for ≥30B. Visible degradation below 14B.
- **Q4_K_M:** the lowest viable quant for serious coding agents on ≥30B models.
Below this, tool-call fidelity collapses faster than raw output quality.
- **AWQ / GPTQ:** for GPU-only inference, often higher quality than equivalent
GGUF Q4 due to per-channel calibration.
- **KV-cache quantization (Q8 KV) is often higher-leverage than weight
quantization** for long-context coding tasks. Underused; under-documented in
2024-era guides. **Critical reality:** with FP16 KV cache, a 9B model at 32k
context burns ≈4 GB just for KV — the KV cache, not weight precision, is the
dominant runtime memory constraint at long contexts. Quantize it.
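The KV-cache figure follows from simple arithmetic: two tensors (keys and values) per layer, each holding `kv_heads * head_dim` values per token. The sketch below uses an assumed shape of 36 layers, 8 KV heads, and a head dimension of 128, which is typical of an 8–9B grouped-query-attention model but is not taken from the source. The 8-bit figure ignores the small per-block scale overhead of real quantization formats.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float) -> float:
    """Keys plus values: 2 tensors per layer, each kv_heads * head_dim per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3
# Assumed model shape (illustrative only); 32k context as in the text above.
SHAPE = dict(layers=36, kv_heads=8, head_dim=128, context_tokens=32_768)
fp16 = kv_cache_bytes(bytes_per_value=2, **SHAPE)
q8 = kv_cache_bytes(bytes_per_value=1, **SHAPE)
print(f"FP16 KV: {fp16 / GIB:.2f} GiB, 8-bit KV: {q8 / GIB:.2f} GiB")
# prints "FP16 KV: 4.50 GiB, 8-bit KV: 2.25 GiB"
```

The cost scales linearly with context length, so the same model at 64k context would need twice as much, which is why the KV cache rather than the weights sets the ceiling for long sessions.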
### 6.4 Small-model failure modes and harness mitigations
For any agent driving a ≤14B model (quantized or not), the failure surface is
distinct from frontier models. The model's _parameter count_ is the primary
cause; quantization is a minor amplifier. The most important patterns:
**Instruction drift past ~12k tokens.** Rules stated in the system prompt hold
for the first 5–10 tool calls, then erode. Smaller models have fewer attention
heads (Qwen3-8B: 32 heads vs Qwen3-32B's 64), so per-token attention fidelity
degrades faster as context length grows. Mitigations:
- **Tool-response history pruning** (PostToolUse hook). Once a tool result has
been acted on, clear its raw content; keep the assistant's citation. The
single highest-leverage harness change for small models.
- **Compaction trigger at 60% fill** (not the default 80–90%). Small models hit
the quality cliff earlier; aggressive compaction keeps each window shorter and
fresher.
- **Periodic system-prompt echo.** Every N tool calls, inject the 3 most
critical rules at the context tail as a `<reminder>` block.
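A minimal sketch of tool-response pruning, assuming the history is a list of `{"role", "content"}` dictionaries and approximating "already acted on" as "not among the newest tool results". Both the message shape and the `keep_last` heuristic are assumptions, and `prune_stale_tool_outputs` is a hypothetical name.

```python
from typing import Dict, List

PLACEHOLDER = "[tool output pruned after use]"


def prune_stale_tool_outputs(messages: List[Dict[str, str]],
                             keep_last: int = 1) -> List[Dict[str, str]]:
    """Replace all but the newest `keep_last` tool results with a short placeholder."""
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_indexes[:-keep_last] if keep_last else tool_indexes)
    # Build new dicts so the caller's history is left untouched.
    return [
        {**m, "content": PLACEHOLDER} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Assistant messages pass through unchanged, so the citation of what the tool returned survives even though the raw dump does not.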
**Tool-call JSON malformation.** Smaller models have narrower "format channels"
— less capacity to track content and strict syntax simultaneously, especially in
long contexts. Mitigations:
- **PreToolUse JSON validation with schema-specific errors.** Generic errors
("invalid tool call") cause retry loops; schema-specific errors guide
correction:
```
Tool call JSON was invalid at position 47 (unexpected comma).
Required schema: {"path": string, "limit": number}
```
- **Grammar-constrained decoding.** GBNF (llama.cpp), Outlines, or
lm-format-enforcer pin generation to a valid schema at the decode step. More
reliable than re-prompting.
- **Trim tool responses to minimum fields.** For `read_file`, return content and
line range, not metadata. Fewer tokens per response = less schema to track in
working memory.
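The schema-specific error above can be produced with the standard-library `json` module, which reports the failing position and reason. This is a sketch under assumptions: `READ_FILE_SCHEMA` and `validate_tool_call` are hypothetical, the schema language is just field name to Python type, and the reason text inside the parentheses varies by Python version.

```python
import json
from typing import Dict, Optional

# Hypothetical schema for a read_file tool: field name -> required Python type.
READ_FILE_SCHEMA: Dict[str, type] = {"path": str, "limit": int}


def validate_tool_call(raw: str, schema: Dict[str, type]) -> Optional[str]:
    """Return None if `raw` is valid, else an error that names the exact problem."""
    expected = "{" + ", ".join(f'"{k}": {t.__name__}' for k, t in schema.items()) + "}"
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return (f"Tool call JSON was invalid at position {exc.pos} ({exc.msg}).\n"
                f"Required schema: {expected}")
    if not isinstance(args, dict):
        return f"Tool call must be a JSON object.\nRequired schema: {expected}"
    for key, typ in schema.items():
        if key not in args:
            return f'Missing required field "{key}".\nRequired schema: {expected}'
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(args[key], typ) or isinstance(args[key], bool):
            return f'Field "{key}" must be {typ.__name__}.\nRequired schema: {expected}'
    return None
```

Returning one specific error per attempt, rather than a list, keeps the correction small enough for a model with a narrow format channel to apply.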
**Tool-selection errors past ~15 tools.** Working memory for "which tools exist"
degrades faster than for frontier models. Mitigations: minimum viable tool set;
consistent tool-name prefixes (`file_read`, `file_write`, `file_search`);
PreToolUse name validation that returns the available list on a miss.
**Think-block runaway.** Reasoning-trained small models can emit 2k–5k token
`<think>` blocks for a tool call that needed 50 tokens of reasoning. In a 32k
context, this consumes budget faster than tool outputs. Mitigations:
`num_predict` cap (e.g., 2048) in the modelfile; observability hooks that log
think-block length and flag outliers.
**Context-window cliff at ~20k+.** Output quality drops noticeably (not
catastrophically) past 60–70% fill on a 32k model; the pre-training data was
likely concentrated in shorter sequences. Mitigations: **context-pressure
injection** at ≥70% fill — the harness mechanically prepends:
```
[CONTEXT PRESSURE: ~70% full. Be concise. Prefer targeted tool calls over
broad ones. Write current progress to NOTES.md before proceeding.]
```
plus the early-compaction trigger above.
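Both thresholds reduce to a fill-ratio check that the harness can run before each turn. A minimal sketch, assuming the harness can count used tokens; the function names are hypothetical and the notice text is the one quoted above, with the percentage filled in.

```python
from typing import Optional

NOTICE = ("[CONTEXT PRESSURE: ~{pct}% full. Be concise. Prefer targeted tool calls over "
          "broad ones. Write current progress to NOTES.md before proceeding.]")


def pressure_notice(used_tokens: int, window_tokens: int,
                    threshold: float = 0.70) -> Optional[str]:
    """Return the notice once fill reaches the threshold, else None."""
    fill = used_tokens / window_tokens
    if fill < threshold:
        return None
    return NOTICE.format(pct=round(fill * 100))


def should_compact(used_tokens: int, window_tokens: int, trigger: float = 0.60) -> bool:
    """Early-compaction trigger for small models, using the 60% figure from above."""
    return used_tokens / window_tokens >= trigger
```

With compaction at 60% and the notice at 70%, the notice acts as a backstop for turns where a single large tool output jumps past the compaction point.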
**Training-distribution mismatch.** Most open-weight coding models are heavily
Python/JavaScript. TypeScript-specific patterns (generic constraints,
conditional types, module augmentation, `satisfies`, complex inference) are less
reliable than equivalent Python. Mitigation: SYSTEM directives that force
grounding ("read `tsconfig.json` before asserting TypeScript configuration";
"read existing type definitions before suggesting new ones"), plus
explore-subagent delegation for type-heavy work to isolate the exploration to a
fresh context window.
**Prompt ambiguity → wrong directory (parametric knowledge conflict).** Small
models with narrower training distributions resolve ambiguous nouns ("the five
hook files") to the most common referent in their training data (`.husky/` for
"hook files" in a Node.js repo) rather than the project-specific one
(`.agents/hooks/`). The correct files may appear in tool output but not be
selected. This is a specific instance of **parametric knowledge conflict**: the
model's trained association competes with project-specific context and
frequently wins when prior confidence is high.
Prompt engineering is a subpar fix here. Telling the model "hook files means
`.agents/hooks/`" in AGENTS.md loses to a strong trained prior, especially under
context pressure (lost-in-the-middle degrades instruction recall). Two bodies of
research clarify why and what works instead:
- **ClashEval (Wu, Wu, Zou 2024, arXiv:2404.10198)** benchmarks this exact
tug-of-war across six LLMs. Key finding: the less confident a model is in its
prior, the more likely it is to defer to retrieved context. Corollary:
_specific, concrete contextual evidence_ is far more effective at overriding a
prior than an instruction to prefer context. A file listing showing the actual
paths removes the model's need to resolve the ambiguous noun at all.
- **Onoe et al. (ACL 2023, arXiv:2305.01651)** study knowledge propagation in
LLMs. Finding: gradient-based fine-tuning on new facts ("for this project,
hook files are in `.agents/hooks/`") shows little propagation — the injected
fact does not generalize to new usage patterns. **Prepending entity
definitions in context outperforms parameter-level injection across all
settings.** The practical instruction: inject evidence, don't update weights.
**What works, in order of effectiveness:**
1. **Context grounding via automatic structure injection.** A `UserPromptSubmit`
hook that appends a `<project-file-map>` block to every build-local prompt —
listing actual files under `.agents/`, `.opencode/`, and other
project-specific directories — removes the ambiguity entirely. The model sees
real paths; the trained prior is not consulted. This is the harness analog of
Aider's repo-map (Gauthier 2023), which injects a compressed AST-derived
structure map with every request for the same reason. Implementation: the
hook runs `find .agents -name "*.sh" -o -name "*.md" | sort` and appends the
result as a structured block at the prompt tail.
2. **Automatic disambiguation expansion.** When the hook detects category nouns
("hook", "config", "agent") without an explicit path in the user's prompt,
expand the noun inline before the model sees it. Example: "the hook files" →
"the hook files (`.agents/hooks/pre-tool-use.sh`,
`.agents/hooks/post-tool-use.sh`, ...)". This converts a high-confidence
prior lookup into a zero-ambiguity ground truth.
3. **Explicit path in user prompts.** Still useful as a secondary layer, but
should not be the _only_ mitigation. Include the explicit path when writing
build-local tasks ("the `.agents/hooks/*.sh` files"). Do not rely on the
model inferring project conventions from context alone.
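
Items 1 and 2 can be sketched as three small functions. This is an illustration under stated assumptions, not the repo's hook: the function names are hypothetical, the directory and suffix defaults simply mirror the examples in the list, and the noun-to-path table would have to be supplied by the harness.

```python
from pathlib import Path


def list_project_files(root: str, dirs=(".agents", ".opencode"),
                       suffixes=(".sh", ".md")) -> list[str]:
    """Sorted relative paths of matching files under the given directories."""
    base = Path(root)
    found = []
    for d in dirs:
        if not (base / d).is_dir():
            continue  # skip directories this project does not have
        for p in (base / d).rglob("*"):
            if p.is_file() and p.suffix in suffixes:
                found.append(p.relative_to(base).as_posix())
    return sorted(found)


def render_file_map(paths: list[str]) -> str:
    """Wrap the listing in the <project-file-map> block named in the text."""
    return "<project-file-map>\n" + "\n".join(paths) + "\n</project-file-map>"


def expand_category_nouns(prompt: str, noun_paths: dict[str, list[str]]) -> str:
    """Append concrete paths after a bare category phrase such as 'the hook files'.

    A phrase is left alone when the prompt already names one of its paths.
    """
    for phrase, paths in noun_paths.items():
        if phrase in prompt and not any(p in prompt for p in paths):
            listing = ", ".join(f"`{p}`" for p in paths)
            prompt = prompt.replace(phrase, f"{phrase} ({listing})", 1)
    return prompt
```

A hook would call `expand_category_nouns` on the user's prompt first, then append `render_file_map(list_project_files(repo_root))` at the tail.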
**What does not work:** repeating the mapping in AGENTS.md or system prompts
("hook files live in `.agents/hooks/`") — this is instructional and degrades
under context pressure. Temperature reduction does not help with noun resolution
and may hurt tool-call schema compliance on Qwen3-class models.
**Other forms of parametric knowledge conflict — and whether structure injection
handles them.**
File paths are a _low-to-medium_ confidence prior. The model knows `.husky/` is
common, but doesn't know your specific project layout, so it defers readily to
injected evidence. Structure injection works because the prior is weak. The
following conflict types have _higher_ confidence priors and require different
harness tools. The pattern from ClashEval holds throughout: **match intervention
strength to prior confidence**.
| Conflict type | Example | Prior confidence | Does structure injection help? | What actually works |
| --------------------------- | -------------------------------------------------------------- | ---------------- | ----------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Structural identity**     | `.husky/` vs `.agents/hooks/`                                  | Low–medium       | ✅ Yes — file listing resolves ambiguity                                                        | `UserPromptSubmit` hook appends file map                                                                                                                  |
| **Framework semantics** | React patterns in a Solid.js project | High | ⚠️ Partially — seeing Solid.js files in the map signals the framework, but doesn't show the API | Inline code examples at prompt tail (`createSignal`, `createMemo` shown in use); `PostToolUse` pattern check for React imports |
| **Import path conventions** | `../../packages/core` vs `@cantrips/remnant-core` | Medium | ⚠️ Partially — package.json injection exposes aliases | Inject `tsconfig.json` paths section and package.json `imports`/`exports` map at session start |
| **Async convention** | `async/await` vs the callback pattern this project uses | Very high | ❌ No — file listing doesn't convey behavioral convention | Code example injection (show a canonical callback-pattern function from the codebase); `PostToolUse` grep for `async ` in files that should use callbacks |
| **Error handling** | Throwing exceptions vs returning error results | Very high | ❌ No | Same as async: inject a canonical example; `PreToolUse` or `PostToolUse` grep for `throw new` in use-case files |
| **Command invocations**     | `npx jest` vs `npm test`, `docker-compose` vs `docker compose` | Medium–high      | ❌ No                                                                                           | `PreToolUse` hard block + redirect — the incorrect command is interceptible before execution; this is the cleanest fix because the error is structural |
The general principle: **structure injection handles structural identity
conflicts only.** For semantic, convention, and behavioral conflicts — where the
model has deep training-data confidence in a competing pattern — the effective
interventions are **(a) concrete code examples at the prompt tail** (activates
pattern-matching against actual code rather than fighting a prior with
instructions) and **(b) PostToolUse pattern validation** (catches violations
immediately, in the high-attention context tail). `PreToolUse` blocks are the
right tool only when the incorrect behavior is interceptible as a specific
command or schema.
For the highest-confidence conflicts (async conventions, error handling idioms),
the Onoe et al. finding is most actionable: descriptions in AGENTS.md don't
propagate. A single concrete example from the actual codebase, injected at the
tail, outperforms any amount of prose instruction.
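
The two hook-side interventions from the table can be sketched as pure functions. Everything project-specific here is an assumption for illustration: the redirect table, the `use-cases/` path pattern, and the function names. A real hook would wrap these in whatever input/output protocol the harness defines.

```python
import re
from typing import Optional

# Hypothetical project rules: wrong command prefix -> preferred replacement.
COMMAND_REDIRECTS = {
    "npx jest": "npm test",
    "docker-compose": "docker compose",
}


def check_command(command: str) -> Optional[str]:
    """PreToolUse-style check: return a deny message, or None to allow."""
    stripped = command.strip()
    for wrong, right in COMMAND_REDIRECTS.items():
        if stripped == wrong or stripped.startswith(wrong + " "):
            return f"Blocked: use `{right}` instead of `{wrong}` in this project."
    return None


# Hypothetical convention rules: (path pattern, forbidden content, message).
PATTERN_RULES = [
    (re.compile(r"use-cases/.*\.ts$"), re.compile(r"\bthrow new\b"),
     "Use-case files return error results; do not throw."),
]


def check_written_file(path: str, content: str) -> list[str]:
    """PostToolUse-style check: violation messages for a just-written file."""
    return [msg for path_re, bad_re, msg in PATTERN_RULES
            if path_re.search(path) and bad_re.search(content)]
```

The deny message names the correct command, so the model's next call can succeed without consulting its prior again.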
**Silent catch blocks mask enforcement failures completely.** Any `try/catch`
around a tool-call enforcement path that returns a safe default (e.g., `''`)
will silently disable enforcement when the underlying API changes. This is not a
small-model failure — it affects the harness itself. Mitigation: log all caught
errors to a debug file during development and verify the log is empty before
removing debug code. Never assume a hook or enforcement layer is working;
confirm with a test call.
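
A sketch contrasting the silent pattern with a logged one. The payload shape (`payload["output"]["text"]`) and the `EnforcementLog` class are hypothetical stand-ins; the in-memory list plays the role of the debug file.

```python
class EnforcementLog:
    """In-memory stand-in for a debug file of caught enforcement errors."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def record(self, where: str, exc: Exception) -> None:
        self.entries.append(f"{where}: {exc!r}")


def output_text_silent(payload: dict) -> str:
    """Anti-pattern: a payload shape change turns enforcement off with no trace."""
    try:
        return payload["output"]["text"]
    except Exception:
        return ""


def output_text_logged(payload: dict, log: EnforcementLog) -> str:
    """Same safe default, but every caught error leaves a record."""
    try:
        return payload["output"]["text"]
    except (KeyError, TypeError) as exc:
        log.record("output_text_logged", exc)
        return ""
```

Both functions return `""` when the payload shape changes; only the second makes that visible, which is what allows the "verify the log is empty" check.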
**Scope-detection via todo-list interception.** When a small model attempts a
broad refactor it should not handle, it will typically call `manage_todo_list`
with many items to plan the work. A `PreToolUse` hook that blocks
`manage_todo_list` calls with ≥4 items and returns a specific error message
("this task is too broad — tell the user and stop") consistently causes the
model to report scope and stop, rather than proceeding. This is more reliable
than relying on the model's own Rule 5 compliance. Anthropic's pattern for this
is "guardrails via parallelization" (a separate model screens requests alongside
the working model); a hook-based deny is a lighter-weight equivalent.
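
The deny rule reduces to a few lines. A sketch assuming the hook receives the tool name and its arguments as a dict; the `items` argument name and the function name are assumptions, not a documented schema.

```python
from typing import Optional

TODO_ITEM_LIMIT = 4  # calls with this many items or more are denied


def pre_tool_use(tool_name: str, tool_args: dict) -> Optional[str]:
    """Return a deny message for over-broad todo plans, or None to allow."""
    if tool_name != "manage_todo_list":
        return None
    items = tool_args.get("items", [])  # assumed argument name
    if len(items) >= TODO_ITEM_LIMIT:
        return "This task is too broad. Tell the user and stop."
    return None
```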
**Poka-yoke tool design (Anthropic, 2024).** The harness should make incorrect
tool usage structurally harder, not just instructionally forbidden. Examples:
requiring absolute file paths (eliminates cwd-relative errors), enforcing
`limit` on every `read` call via a blocking hook (eliminates accidental
full-file reads), requiring `explanation` and `goal` fields on terminal calls
(forces pre-action reasoning). These structural constraints outperform
equivalent instruction-only approaches because they fire at the API boundary and
are not subject to instruction drift.
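
The three example constraints can be expressed as one validator that runs before any tool call is dispatched. This is a sketch: the tool names `read` and `terminal` and the `path` argument key are assumed, and POSIX-style paths are assumed for the absolute-path check.

```python
import posixpath


def validate_tool_call(tool_name: str, args: dict) -> list[str]:
    """Return a list of structural violations; an empty list means allow."""
    errors = []
    path = args.get("path")
    if path is not None and not posixpath.isabs(path):
        errors.append(f"path must be absolute, got {path!r}")
    if tool_name == "read" and "limit" not in args:
        errors.append("read requires an explicit `limit`")
    if tool_name == "terminal":
        for field in ("explanation", "goal"):
            if not args.get(field):
                errors.append(f"terminal requires a non-empty `{field}`")
    return errors
```

Returning every violation at once, rather than the first, saves the model a round trip per missing field.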
**Sampling parameters matter more.** Qwen3's documented thinking-mode defaults
are `temperature=0.6, top_p=0.95, top_k=20`, and these are empirically the right
starting point for agentic use as well — lower temperatures (e.g., 0.2) sacrifice
reasoning quality and frequently _hurt_ tool-call schema compliance rather than
helping, because the model has less headroom to escape a local format error.
Earlier guidance suggesting low-temperature defaults for tool-call reliability
does not survive A/B testing on Qwen3-class models; keep the documented
thinking-mode values unless you measure a specific regression.
**Anti-filler-token system prompts.** Reasoning-trained small models tend to
open `<think>` blocks with filler ("Okay, let me think about this...", "The user
wants...") before any real analysis. Each filler opener wastes 50–150 tokens at
the start of every reasoning block, multiplied across tens of tool calls. A
direct system-prompt rule — _"Open `<think>` blocks with substantive analysis.
Do not begin with filler phrases like 'Okay, let me...' or 'The user
wants...'."_ — measurably trims reasoning length without affecting reasoning
quality. The win compounds on a 32k context.
# 20–30B Model Class: The Practical Sweet Spot
> **Status:** Operational reference, not a survey. Captures what has been
> observed running 20–30B models as local agent drivers through mid-2026.
>
> **Audience:** Engineers deploying local agentic harnesses who need concrete
> failure modes and countermeasures for the 20–30B class — not first-time
> quantization users.
>
> **Self-evaluation:** This document is opinionated and deliberately concrete;
> model-specific claims are date-stamped because they age within months.
---
## 1. The 20–30B Class Defined
Models in the 20–30B parameter range — **Qwen3-32B-dense**, **Qwopus3.6-27B**,
**GLM-4-32B** — occupy a unique position in the local deployment landscape. They
are large enough to hold meaningful instruction context and tool-call fidelity
without collapsing under quantization, yet small enough to run on consumer
hardware (single 24GB GPU at Q4, or dual-GPU setups with headroom). This class
has failure modes that are **not** shared by frontier models and **not** shared
by sub-14B models — they are uniquely theirs.
| Dimension | Sub-14B class | 20–30B class | Frontier (≥200B) |
| --- | --- | --- | --- |
| **Instruction drift** | Immediate (4–8 turns) | Delayed (10–15 turns) | Resistant |
| **Plan invention** | Poor (hallucinates steps) | Unreliable (skips, invents) | Strong |
| **Tool-call fidelity** | Breaks under load | Degrades gradually | Robust |
| **Context budget** | Collapses early | Degrades gradually | Stretches far |
| **VRAM at Q4** | ≤12 GB | ≤24 GB | Not feasible |
The 20–30B class is **not frontier** and **not small**. It sits between two
established playbooks, and applying either playbook produces suboptimal results.
---
## 2. Failure Modes
### 2.1 Instruction Drift at Tool Call 10–15
The defining characteristic of this class is that it **starts strong and degrades
predictably**. A 27B model loaded with a 2k-token system prompt will follow all
rules faithfully for roughly 10–15 tool calls — then rules begin to drop. Not
catastrophically (as sub-14B models do at turn 4), but enough to produce
drift: the model stops checking lint before committing, stops writing to
NOTES.md, stops using `read` before `edit`.
**Mechanism.** The system prompt sits at the head of the context. By tool call
10–15, the accumulated conversation has pushed it deep into the effective
attention zone where recall is graded, not binary. The model hasn't "forgotten"
the rules — it's attending to them less than to the immediate conversation
tail.
**What works:**
- **Periodic system-prompt echo every 8–10 calls** via `PostToolUse` hook
injection. A compressed version of the most-critical rules (3–5 bullets)
reappears at the context tail, restoring attention to constraints before
drift sets in. This is the single most impactful harness change for this
class — it reduces drift-related errors by an order of magnitude in
observed sessions.
- **Tail-positioned critical rules.** Place the few rules that matter most
(e.g., "read before edit", "run lint before commit") at the _end_ of the
system prompt, not the beginning. The tail survives longer.
**What does not work:** negative constraints ("DO NOT forget to check lint"),
repeated reminders in the user prompt (they degrade after 2–3 repetitions),
or asking the model to "re-read the instructions" (it won't).
### 2.2 Plan-Invention Failure
When asked to invent a multi-step plan from scratch, 20–30B models frequently
produce plans that are **structurally incomplete** (missing dependency edges),
**overconfident** (assuming APIs exist without checking), or **hallucinatory**
(inventing intermediate steps that serve no purpose). This is the class's
hardest intrinsic limitation — plan generation is the single most demanding
reasoning task an agent must perform.
**What works:**
- **Blueprint injection.** Instead of asking the model to invent a plan, inject
a structured blueprint at the prompt tail. A blueprint is a task-type-keyed
skeleton: "debug → read error → locate source → read file → hypothesize →
verify → fix → test." The model fills in the slots rather than inventing the
structure. This maps directly to the blueprint-guided execution pattern
(Han et al., [arXiv:2506.08669](https://arxiv.org/abs/2506.08669)).
- **Exploration subagent with blueprint handoff.** A larger orchestrator model
(or even the same model in a fresh context with higher `num_predict`) generates
the blueprint; the 20–30B model executes it. The context firewall between
subagents means the execution agent never sees the planning mess.
**What does not work:** asking the model to "think step by step" before acting
— this just produces a long chain that still misses the dependency.
### 2.3 Long CoT Degradation
Hassid et al. ([arXiv:2505.17813](https://arxiv.org/abs/2505.17813),
"Don't Overthink it") directly tested chain-of-thought length within a single
question and found that **the shortest chains are up to 34.5% more accurate than
the longest**. This effect is pronounced at the 20–30B scale: extended thinking
tokens do not accumulate reasoning — they accumulate noise. The model begins
repeating itself, inventing irrelevant intermediate steps, or drifting into
explanation mode rather than planning mode.
**What works:**
- **Cap reasoning-trace lengths** at inference time (`num_predict` on `<think>`
blocks). A practical cap for 20–30B models is 800–1200 thinking tokens per
call — enough for a plan, not enough for a treatise.
- **Short-m@k with ≤3 chains.** Generate `k` reasoning chains in parallel,
halt when the first `m` finish, take majority vote. At 20–30B, three chains
is the practical ceiling — more chains eat VRAM without accuracy gain.
Short chains with majority voting beat one long chain at equal or better
accuracy with fewer total thinking tokens.
**What does not work:** budget forcing (extending a single chain to consume a
fixed token budget). Budget forcing is a frontier-model technique; at 20–30B it
produces verbose, less-accurate chains.
### 2.4 The "Not Frontier, Not Small" Gap
The 20–30B class falls between two established deployment playbooks:
- **Frontier playbooks** assume robust tool-call fidelity, strong plan invention,
and deep context. A 20–30B model cannot sustain these assumptions past turn 10.
- **Small-model playbooks** assume immediate instruction collapse, severe
hallucination, and subagent-only deployment. A 20–30B model is far more
capable than these playbooks allow for.
Applying frontier patterns (long sessions, deep reasoning, no scaffolding) to
20–30B models produces gradual failure. Applying small-model patterns (extreme
task slicing, no primary-agent role) wastes the model's actual capability.
---
## 3. Harness Patterns
### 3.1 Periodic System-Prompt Echo (every 8–10 calls)
**Mechanism.** A `PostToolUse` hook counts tool calls and injects a compressed
rules reminder at the context tail every 8–10 calls. The reminder is 3–5
bullets covering the most-critical constraints:
```
[HOOK INJECTION: post-tool-use] System reminder:
- Read a file before editing it
- Run lint before committing
- Write findings to NOTES.md after each step
```
**Why it works.** The tail of the context is the high-attention zone (Liu et al.,
[arXiv:2307.03172](https://arxiv.org/abs/2307.03172)). Re-injecting rules at the
tail restores attention to constraints before drift sets in. The original system
prompt at the head is still there — this is not a replacement, it's a reinforcement.
**Implementation note.** The hook must be terse. A 200-token reminder every 8
calls fires 12 times in a 100-call session and adds about 2,400 tokens, which is
manageable. A 500-token reminder at the same cadence adds about 6,000 and is not.
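
A minimal sketch of the counter and the overhead arithmetic. `EchoHook` and `echo_overhead` are hypothetical names; the reminder text is the example from this section, and how the returned string reaches the context tail depends on the harness.

```python
from typing import Optional

REMINDER = (
    "[HOOK INJECTION: post-tool-use] System reminder:\n"
    "- Read a file before editing it\n"
    "- Run lint before committing\n"
    "- Write findings to NOTES.md after each step"
)


class EchoHook:
    """Counts tool calls and emits the reminder on every `every`-th call."""

    def __init__(self, every: int = 8, reminder: str = REMINDER) -> None:
        self.every = every
        self.reminder = reminder
        self.calls = 0

    def post_tool_use(self) -> Optional[str]:
        self.calls += 1
        return self.reminder if self.calls % self.every == 0 else None


def echo_overhead(total_calls: int, every: int, reminder_tokens: int) -> int:
    """Total tokens the echo adds over a session of `total_calls` tool calls."""
    return (total_calls // every) * reminder_tokens
```

For a 100-call session at an interval of 8, `total_calls // every` is 12 injections, so the session cost scales linearly with reminder size.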
### 3.2 Blueprint Injection
**Mechanism.** When the orchestrator classifies the task type, inject a
structured blueprint at the prompt tail. The blueprint is a task-type-keyed
skeleton, not a plan for this specific task. The model fills in the slots:
```
## Task Blueprint: Debug
1. Read the error message
2. Locate the source file
3. Read the relevant section
4. Form a hypothesis
5. Verify with a targeted read or test
6. Apply a minimal fix
7. Run the build / test
```
**Why it works.** Plan invention is the 20–30B class's weakest reasoning mode.
Blueprints replace invention with execution — the model's strong suit. Han et
al. ([arXiv:2506.08669](https://arxiv.org/abs/2506.08669)) show this pattern
improves accuracy on GSM8K, MBPP, and BBH with no additional training.
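
A sketch of a blueprint library keyed by task type, using the debug skeleton from this section as its only entry. The `BLUEPRINTS` table and `inject_blueprint` are illustrative names; task classification is assumed to happen upstream in the orchestrator.

```python
BLUEPRINTS = {
    "debug": [
        "Read the error message",
        "Locate the source file",
        "Read the relevant section",
        "Form a hypothesis",
        "Verify with a targeted read or test",
        "Apply a minimal fix",
        "Run the build / test",
    ],
}


def inject_blueprint(prompt: str, task_type: str) -> str:
    """Append the blueprint for `task_type` at the prompt tail, if one exists."""
    steps = BLUEPRINTS.get(task_type)
    if steps is None:
        return prompt  # unknown task type: leave the prompt untouched
    lines = [f"## Task Blueprint: {task_type.capitalize()}"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    return prompt.rstrip() + "\n\n" + "\n".join(lines)
```

Falling through unchanged on an unknown task type keeps a misclassification from injecting a misleading skeleton.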
### 3.3 Compaction at 65% Fill
**Mechanism.** Compact the conversation at 65% context-fill rather than the
conventional 80–90%. The 20–30B class degrades gradually — by 80% fill,
effective recall of head-position content is already poor.
**Why 65%, not 80%.** At 20–30B, the effective context is roughly 40–50% of
advertised (consistent with the gradient degradation observed in Liu et al.).
Compacting at 65% of advertised leaves 35% headroom, which maps to roughly
the effective context limit. Compacting at 80% means the model has already
been operating in degraded mode for the last 15% of the session.
**Compaction target.** Stale tool outputs first (raw file contents whose
information has been acted on), then stale conversation turns. The
anchored-summary schema from §4.7 of the best-practices document applies
unchanged.
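
The trigger and the eviction order can be sketched as follows. The `Entry` record and both function names are assumptions for illustration; how "stale" is decided, and the summarization step itself, are outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str    # "tool_output" or "turn"
    tokens: int
    stale: bool  # True once the information has been acted on


def should_compact(used_tokens: int, window_tokens: int,
                   threshold: float = 0.65) -> bool:
    """True once context fill reaches the compaction threshold."""
    return used_tokens / window_tokens >= threshold


def pick_compaction_victims(entries: list[Entry], tokens_to_free: int) -> list[Entry]:
    """Stale tool outputs first, then stale turns, oldest first in each group."""
    ordered = ([e for e in entries if e.stale and e.kind == "tool_output"]
               + [e for e in entries if e.stale and e.kind == "turn"])
    victims, freed = [], 0
    for entry in ordered:
        if freed >= tokens_to_free:
            break
        victims.append(entry)
        freed += entry.tokens
    return victims
```

Non-stale entries are never selected, so the function can return less than `tokens_to_free` when the context is mostly live material.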
### 3.4 Short-m@k with ≤3 Chains
**Mechanism.** For tasks requiring reasoning (debug diagnosis, architecture
decisions), generate up to 3 reasoning chains in parallel, take majority
vote when the first 2 agree. This is the short-m@k pattern from Hassid et
al., adapted to 20–30B hardware constraints.
**Why ≤3 chains.** Each chain at 20–30B requires ~8–12 GB VRAM at Q4. Three
chains fit on dual-GPU setups; four push into swap territory with severe
latency penalty. The accuracy gain from chain 3 to chain 4 is marginal
compared to the latency cost.
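
A sketch of the control flow only, with threads and sleeps standing in for model calls. `short_m_at_k` and `fake_chain` are hypothetical names; each chain is assumed to be a zero-argument callable that returns a final answer string. Python threads cannot be interrupted, so this version stops waiting for the slow chains rather than truly halting them.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence


def short_m_at_k(chains: Sequence[Callable[[], str]], m: int) -> str:
    """Run k chains in parallel, keep the first m answers, majority-vote them."""
    if not 1 <= m <= len(chains):
        raise ValueError("need 1 <= m <= k")
    answers: list[str] = []
    pool = ThreadPoolExecutor(max_workers=len(chains))
    try:
        futures = [pool.submit(chain) for chain in chains]
        for future in as_completed(futures):
            answers.append(future.result())
            if len(answers) == m:
                break
    finally:
        # Stop waiting; a real harness would also abort the unfinished generations.
        pool.shutdown(wait=False, cancel_futures=True)
    # Ties go to the answer seen first, i.e. the earliest-finishing chain.
    return Counter(answers).most_common(1)[0][0]


def fake_chain(answer: str, delay: float) -> Callable[[], str]:
    """Stand-in for a model call: sleeps for `delay` seconds, then answers."""
    def run() -> str:
        time.sleep(delay)
        return answer
    return run
```

With `k=3, m=2`, the slowest (and typically longest) chain never contributes to the vote, which is the point of the pattern.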
### 3.5 Anti-Filler-Token Rules
**Mechanism.** Explicit rules in the system prompt or `AGENTS.md` that ban
filler behavior. The 20–30B class is particularly prone to generating
explanatory filler — long paragraphs explaining what it's about to do before
doing it, or summarizing files it just read.
**Concrete rules that work:**
- "Do not summarize a file you just read — proceed to the next action."
- "Do not explain your plan before executing it — act immediately."
- "When the user asks a yes/no question, answer in one sentence then proceed."
These rules target the specific filler modes observed in 20–30B models.
Generic rules ("be concise") are ignored; specific rules ("do not summarize
a file you just read") are followed because they are concrete and testable.
---
## 4. Prompt Design
### 4.1 Imperative, Not Conditional
**Rule:** Write instructions as commands, not conditions. The 20–30B class
processes imperative instructions more reliably than conditional ones.
| Conditional (weak) | Imperative (strong) |
| --- | --- |
| "If there's a file to edit, read it first" | "Read a file before editing it" |
| "When you encounter an error, check the source" | "On error, locate the source file" |
| "If the build fails, run lint" | "Build fails → run lint" |
Conditional instructions introduce a branch the model must evaluate — at 20–30B,
branch evaluation is unreliable. Imperative instructions are single-path and
easier to follow.
### 4.2 Tail Content
**Rule:** Place the most-critical instructions at the end of the system
prompt and at the end of the user prompt. The tail survives context pressure;
the head does not.
This applies to both the initial system prompt (most important rules last)
and to injected content (hooks inject at the tail). A rule at the head of a
3k-token system prompt is effectively invisible by tool call 12.
### 4.3 Concrete Examples Over Abstract Principles
**Rule:** Show a concrete example of the desired behavior rather than stating
an abstract principle. The 20–30B class has weaker abstraction-to-execution
transfer than frontier models.
| Abstract (weak) | Concrete (strong) |
| --- | --- |
| "Be precise with file paths" | "Use absolute paths: `/home/dev/code/remnant/src/file.ts`, not `src/file.ts`" |
| "Check for errors" | "After every `npm run build`, check the exit code before proceeding" |
| "Keep changes minimal" | "Edit only the lines that need changing; do not reformat adjacent code" |
### 4.4 No Self-Reflect Language
**Rule:** Do not include "reflect on your answer", "double-check", "are you
sure", or "take another look" in prompts targeting 20–30B models. Huang et al.
([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large Language Models
Cannot Self-Correct Reasoning Yet") show that intrinsic self-correction without an
external oracle **consistently degrades** reasoning performance. At 20–30B,
the effect is stronger — the model's self-assessment is poorly calibrated, and
asking it to "reflect" produces longer, less-accurate chains.
Replace self-reflect prompts with external feedback: test runners, lint checks,
hook exit codes. The model does not need to check its own work — the harness
does.
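
A sketch of the external-oracle loop: the harness runs the check and reports the exit code, so the model is never asked to grade itself. `run_check` and `feedback_message` are hypothetical helpers, and the command at the bottom is a stand-in chosen so the sketch runs anywhere.

```python
import subprocess
import sys


def run_check(cmd: list[str], tail_chars: int = 2000) -> dict:
    """Run an external check and return a verdict the harness can inject."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "output_tail": (proc.stdout + proc.stderr)[-tail_chars:],
    }


def feedback_message(name: str, result: dict) -> str:
    """Format the verdict as a short message for the context tail."""
    if result["ok"]:
        return f"[{name}] passed."
    return f"[{name}] failed with exit code {result['exit_code']}:\n{result['output_tail']}"


# Stand-in command; a real harness would pass e.g. ["npm", "run", "build"].
result = run_check([sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"])
```

Truncating to the output tail keeps a failing build log from flooding the context while preserving the lines most likely to contain the error.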
### 4.5 Short CoT
**Rule:** When the prompt asks the model to reason, constrain the reasoning
trace explicitly. "Think step by step" produces verbose, less-accurate chains
at 20–30B. Instead:
| Verbose (weak) | Constrained (strong) |
| --- | --- |
| "Think step by step about this" | "List the 3 most likely causes, then test the first one" |
| "Analyze the problem thoroughly" | "State your hypothesis in one sentence, then verify it" |
| "Consider all possibilities" | "Name 2 candidate fixes, implement the first" |
This aligns with the Hassid et al. finding: shorter chains are more accurate.
The prompt constraint enforces short chains at the point of generation, not
just at the inference-time cap.
### 6.4a Reasoning density: getting more out of small local models
A separate question from "how do I keep a small model from breaking?" (§6.4) is
"how do I get more reasoning capability out of it without enlarging it?". Recent
research converges on four techniques that are particularly suited to local
deployment, where additional inference passes are cheap and the alternative
(swapping to a frontier model) defeats the reason for going local in the first
place.
**1. Prefer shorter reasoning chains, not longer ones.** The intuitive
assumption that more "thinking" helps was directly tested by Hassid et al.
([arXiv:2505.17813](https://arxiv.org/abs/2505.17813), "Don't Overthink it"):
within a single question, **the shortest chains the model produces are up to
34.5% more accurate than the longest**, and SFT on short chains beats SFT on
long ones. Practical translation:
- Cap reasoning-trace lengths at training time (curate short-CoT data) and at
inference time (`num_predict` on `<think>` blocks, per §6.4).
- For test-time scaling on local hardware, **short-m@k** is the right pattern:
generate `k` reasoning chains in parallel, halt as soon as the first `m`
finish, take majority vote among those `m`. Hassid reports up to 40% fewer
thinking tokens than standard majority voting at equal or better accuracy.
- This contradicts the early-2025 "scale test-time compute by extending one long
chain" framing (e.g., s1's budget forcing,
[arXiv:2501.19393](https://arxiv.org/abs/2501.19393)). Budget forcing works on
32B+ models; on ≤7B models the evidence increasingly favours shorter chains
and parallel sampling. Treat budget forcing as a frontier-model technique.
**2. The Small Model Learnability Gap dictates distillation strategy.** Li et
al. ([arXiv:2502.12143](https://arxiv.org/abs/2502.12143)) found that **models
≤3B do not consistently benefit from long-CoT distillation from larger
reasoners** — they perform _worse_ than when fine-tuned on shorter, simpler
chains better matched to their intrinsic learnability. Their proposed **Mix
Distillation** combines long and short CoT examples (and reasoning from both
larger and smaller teachers) and outperforms either alone. The standard "distill
from the strongest reasoner you can afford" instinct is wrong for ≤3B targets.
For local-driver training (anything in the 0.5–3B regime), the operational rule
is:
- Source ~60–70% of CoT data from teachers ≤14B (or from the target model itself
after a first round). Use larger teachers (≥30B) for the remaining 30–40%,
primarily on harder problems where the smaller teacher is unreliable.
- Curate or rewrite teacher outputs to **median chain length**, not maximum.
LIMO ([arXiv:2502.03387](https://arxiv.org/abs/2502.03387)) showed that 817
strategically-designed "cognitive template" demonstrations beat 100×-larger
CoT corpora at the 32B scale; the same logic applies more strongly at smaller
scales. Quality and chain-length appropriateness dominate quantity.
- The LIMO finding has an important boundary condition the paper states
explicitly: it assumes "domain knowledge has been comprehensively encoded
during pre-training." A 2B model with weaker domain coverage will not match
the same data efficiency — but the directional advice (concise high-quality
chains beat verbose mediocre ones) still holds.
**3. Blueprint-guided execution as an inference-time density booster.** Han et
al. ([arXiv:2506.08669](https://arxiv.org/abs/2506.08669), ICML 2025 TTODLer-FM)
show that **LLM-generated structured reasoning blueprints** — extracted by a
larger model from solved problems and reused as scaffolds — measurably improve
small-model accuracy on GSM8K, MBPP, and BBH, with no additional training. The
blueprint is a high-level step skeleton ("identify the goal → list known
variables → choose the operator type → ..."); the small model fills it in.
For an agentic harness, this maps onto:
- **A blueprint library** keyed by task type (debug, refactor, write-test,
search-and-summarize) injected at the prompt tail when the orchestrator
classifies the request. The small model is no longer asked to invent a plan
from scratch, which is the single hardest thing for it to do reliably; it
executes a known-good plan template instead.
- Pairs well with the explore-subagent pattern (§3.4): the orchestrator can
generate a blueprint, hand it to the subagent, and recover a 1–2k token
summary that's been structurally constrained.
**4. Test-time compute scaling is not free, and its effectiveness scales with
model size.** A persistent failure mode in 2025–2026 deployment writeups is
applying frontier test-time-compute patterns (MCTS, Best-of-N with a verifier,
extended budget-forced thinking) to ≤7B models and reporting flat or negative
results. The Kinetics work and follow-ups consistently find that test-time
compute pays off most above ~10–14B parameters, where attention capacity (not
raw parameter count) becomes the bottleneck. For smaller models:
- **Short-m@k with majority voting** remains net-positive on local hardware
because ternary / small dense inference is cheap. Budget: ≤3 parallel chains.
- **Verifier-guided search (MCTS / Best-of-N + judge)** is rarely worth the cost
unless the verifier is also small and runs on the same device. A 7B verifier
rating a 2B generator's outputs eats the compute budget the small model was
supposed to save.
- **Extended single-chain thinking** is the worst option at this scale — see
point 1.
**Synthesis.** For a sub-7B local model: train on shorter chains, run short-m@k
at inference when accuracy matters, inject blueprints when the task type is
known, and do not import frontier test-time-compute patterns wholesale. The
reasoning-density ceiling for a small model is shaped more by data composition
and inference-time structure than by raw model capability.
### 6.5 Local agent harnesses
- **OpenCode:** the current most-flexible model-agnostic harness. Strong for
routing between local and cloud models in a single workflow. Recommended
default for users who want control.
- **Aider:** still excellent for diff-based coding, particularly with its
repo-map. More limited as a general agent loop.
- **Cline / Continue / Roo Code:** good integrations into VS Code; varying
degrees of model-agnostic configuration.
- **llama.cpp / vLLM / MLX / Ollama:** the inference layer. vLLM dominates for
GPU throughput; llama.cpp for flexibility and CPU/Apple support; MLX for
Mac-native efficiency.
### 6.6 Pre-configured cloud agents vs local-DIY
The honest comparison:
- **Pre-configured wins** on out-of-the-box capability. Cursor, Claude Code,
Windsurf, GitHub Copilot ship with deeply tuned harnesses, hand-curated system
prompts, and routing logic that took teams of engineers months to build. A
naive local setup will not match this without significant effort.
- **Local-DIY wins** on customizability, privacy, cost-at-scale, and willingness
to invest in harness work. The ceiling is higher if you put in the engineering
hours; the floor is much lower.
A pragmatic middle path: pre-configured cloud agent as daily driver, local agent
for confidential work and bulk tasks. OpenCode is well-suited to this hybrid
pattern.
---
## 7. Prompt Engineering: Is It Still Relevant?
Mostly: no, not in the 2022–2023 sense. The techniques that used to deliver
double-digit accuracy improvements either:
- **Got partially baked into the models** (chain-of-thought via reasoning
training, instruction-following via RLHF/RLAIF) — but "baked in" is not the
same as "reliable." Even reasoning-trained CoT inherits and entrenches
pretraining priors via posterior collapse, especially on subjective tasks
(emotion, morality, intent inference —
[arXiv:2409.06173](https://arxiv.org/abs/2409.06173)). Larger
reasoning-trained models can anchor _harder_ to a wrong prior under CoT, not
softer. Treat "the model will reason its way out of a misread" as a weak
intervention, not a built-in safety net.
- **Got moved into the harness** (todo lists, plan/act, structured tool use).
What still matters about prompt construction:
- **Negative constraints.** Frontier labs spend disproportionate effort on "do
not do X" rules. Third-party harnesses under-invest here. Important caveat
from §4.6: negative constraints _lose_ to deeply trained behavioral priors.
They work for novel rules; they fail against "gather context first"-style
instincts. Match the rule to the mechanism.
- **Output-format guarantees.** Structured output, schema-constrained
generation, JSON mode — these still pay off, especially for tool calls.
- **Role/boundary definition for subagents.** Subagent system prompts are still
high-leverage because they shape what compressed report comes back. This is
about defining the _task contract_ and the _return format_, not about
injecting an expertise persona (see persona caveat below).
- **Stable identity across turns.** "You are an agent that..." framing has
little benefit. The folk claim that "consistent voice and persona instructions
reduce drift in long sessions" is uncited and unverified; given that small
variations in persona attributes can produce double-digit accuracy drops
(Principled Personas, EMNLP 2025), treat persona stability as cosmetic, not
load-bearing.
- **Expertise-ladder prompting for _divergent ideation_ (not accuracy).**
Community technique, no canonical paper, **and now in tension with the
persona-prompting empirical literature.** When a brainstorming or design task
risks collapsing to an "average" LLM answer, enumerating solutions across
explicit framings (e.g., _"What would a junior engineer propose? What would a
senior engineer with deep domain knowledge propose differently? What does an
outsider with zero context propose? What assumptions does the senior answer
make that the junior doesn't?"_) can broaden the sample of approaches the
model produces. **Critical scope limit:** recent persona-prompting work
(Principled Personas, EMNLP 2025; Persona is a Double-Edged Sword, IJCNLP
2025; [arXiv:2512.05858](https://arxiv.org/abs/2512.05858)) finds that
low-knowledge personas ("layperson," "outsider," "child") often _reduce_
accuracy on factual / reasoning benchmarks, sometimes substantially. The
ladder is therefore safe as a _divergent-thinking sampler_ (where high
variance is the goal) but **must not** be used as an accuracy improver, an
expertise injector, or the final answer producer. Use it to broaden the
candidate set, then evaluate candidates with the un-personified model under an
external rubric. If you only have budget for one of these two passes, skip the
ladder.
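
The output-format point above can be made concrete with a harness-side check. The following is a minimal sketch, assuming a hypothetical `{"tool": ..., "args": ...}` envelope and a hypothetical `TOOL_SCHEMAS` registry; neither is taken from a specific harness. Schema-constrained decoding happens at generation time, so this only illustrates the after-the-fact validation a harness might still run.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Return (ok, message) for a model-emitted JSON tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "top-level value must be an object"
    tool = call.get("tool")
    schema = TOOL_SCHEMAS.get(tool) if isinstance(tool, str) else None
    if schema is None:
        return False, f"unknown tool: {tool!r}"
    args = call.get("args")
    if not isinstance(args, dict):
        return False, "'args' must be an object"
    for name, expected in schema.items():
        if name not in args:
            return False, f"missing argument: {name}"
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(args[name], bool) or not isinstance(args[name], expected):
            return False, f"argument {name!r} must be {expected.__name__}"
    extra = set(args) - set(schema)
    if extra:
        return False, f"unexpected arguments: {sorted(extra)}"
    return True, "ok"
```

The returned message is written so it can be passed back to the model as the reason for a retry, rather than silently dropping the malformed call.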
What no longer pays off meaningfully:
- Few-shot examples for capable models on common tasks. These often actively
harm output quality via spurious pattern-matching.
- Elaborate "let's think step by step" preambles for reasoning models —
redundant.
- "You are an expert in X" puffery. No measurable effect on frontier models, and
on small models can be actively harmful via persona-attribute sensitivity (see
Principled Personas reference above).
- Asking the model to reflect on or critique its own output without an external
oracle. Per Huang et al. (arXiv:2310.01798), intrinsic self-correction
_degrades_ reasoning performance in the no-oracle setting. The intervention
feels productive (and reads well in transcripts) but the measurable effect on
correctness is negative. Use only when paired with an external verifier.
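
The last point (self-critique only with an external verifier) can be sketched as a retry loop in which the accept/reject decision never comes from the model. This is an illustrative sketch: `generate` and `verify` are hypothetical callables standing in for a model call and an external oracle such as a test run or compiler.

```python
from typing import Callable, Optional


def repair_with_oracle(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry generation only when an external verifier reports a failure.

    generate(feedback) stands in for a model call; feedback is None on the
    first attempt and the verifier's error text afterwards.
    verify(candidate) stands in for an external oracle (tests, compiler,
    type checker); it returns None on success or an error string.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate  # Accepted by the oracle, not by the model.
    return None  # Give up; the caller should escalate rather than loop.
```

The model only ever sees concrete failure text from the oracle; it is never asked an open-ended "are you sure?" question.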
---
## 8. Verification, Sandboxing, and Safety
### 8.1 Verification as harness, not prompt
The most reliable indicator of an agent that works is whether **the harness
forces verification** rather than relying on the model to verify itself. Minimal
verification steps:
- Build/compile after edits.
- Test suite execution.
- Lint and format.
- Diff inspection (does the change touch unrelated areas?).
- Git-status awareness before destructive operations.
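
The minimal steps above could be wired into a harness roughly as follows. This is a sketch under stated assumptions: the `EXAMPLE_STEPS` commands assume a Node project with `build`, `test`, and `lint` scripts defined, and `run_verification` is a hypothetical helper name, not part of any harness API.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    output: str


def run_verification(
    steps: list[tuple[str, list[str]]], cwd: str = "."
) -> list[StepResult]:
    """Run each (name, argv) step in order and stop at the first failure."""
    results = []
    for name, argv in steps:
        try:
            proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
            ok = proc.returncode == 0
            output = (proc.stdout + proc.stderr).strip()
        except FileNotFoundError:
            ok, output = False, f"command not found: {argv[0]}"
        results.append(StepResult(name, ok, output))
        if not ok:
            break  # Later steps are not meaningful once an earlier one fails.
    return results


# Hypothetical step list; substitute the commands your repo actually uses.
EXAMPLE_STEPS = [
    ("build", ["npm", "run", "build"]),
    ("test", ["npm", "test"]),
    ("lint", ["npm", "run", "lint"]),
    ("diff", ["git", "diff", "--stat"]),
    ("status", ["git", "status", "--porcelain"]),
]
```

The harness, not the model, decides when this runs (for example after every edit batch), and a failing `StepResult.output` is what gets surfaced back to the agent.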
Three patterns extend the basics:
- **Block on policy-shaping files.** Some files (`eslint.config.js`,
`tsconfig.json`, deployment configs) shape the rules every other tool call
obeys. Edits should require explicit human review even from a trusted agent —
a PreToolUse hook that denies edits with an explanatory message ("propose the
change; let the user decide") is more reliable than asking the model to
remember.
- **Block on generated files.** Files marked `.generated.ts` (or similar) will
be overwritten on next build; an agent edit silently disappears. A PreToolUse
hard block with a redirect ("edit the generator script, then run
`npm run build:core`") closes the loop instead of relying on the agent to
remember.
- **Block on documented-anti-pattern commands.** `sed -i`, `awk` rewrites of
code files, `rm -rf .wireit`, `npm install` without confirmation,
`npm run build` while the dev server runs (port conflict): all are cheaper to
block at the harness than to instruct against in prose. The block message
should always include the alternative.
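
The three blocking patterns above can be sketched as a single pre-tool-use check. The `(tool, args)` payload shape, the tool names `edit` and `bash`, and the rule tables are hypothetical; real harnesses define their own hook payloads. The file names and messages are taken from the examples in the list.

```python
import re
from fnmatch import fnmatch
from typing import Optional

# Illustrative rule tables; extend them for your own repository.
POLICY_FILES = ["eslint.config.js", "tsconfig.json"]
GENERATED_GLOBS = ["*.generated.ts"]
BLOCKED_COMMANDS = [
    (re.compile(r"\bsed\s+-i\b"),
     "Blocked: `sed -i` on code files. Use the edit tool instead."),
    (re.compile(r"\bnpm\s+install\b"),
     "Blocked: `npm install` needs confirmation. Propose the dependency change."),
]


def pre_tool_use(tool: str, args: dict) -> Optional[str]:
    """Return a denial message, or None to allow the call."""
    if tool == "edit":
        # Compare on the file name only, so nested configs are covered too.
        name = args.get("path", "").replace("\\", "/").rsplit("/", 1)[-1]
        if name in POLICY_FILES:
            return (f"{name} shapes project policy. "
                    "Propose the change; let the user decide.")
        if any(fnmatch(name, glob) for glob in GENERATED_GLOBS):
            return (f"{name} is generated and will be overwritten. "
                    "Edit the generator script, then run `npm run build:core`.")
    if tool == "bash":
        command = args.get("command", "")
        for pattern, message in BLOCKED_COMMANDS:
            if pattern.search(command):
                return message
    return None
```

Every denial message names the alternative, so a blocked call redirects the agent instead of leaving it to guess.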
### 8.2 Sandboxing
- **Container-level isolation** for any agent that runs shell commands
autonomously is now table stakes. Docker, Firecracker microVMs, or
language-level sandboxes.
- **Network policy.** Egress whitelisting prevents prompt-injection-driven
exfiltration.
- **Filesystem scope.** Agents confined to a project directory eliminate a large
class of accidents.
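
As a rough illustration of the filesystem-scope and network-policy points, the checks below show the logic a harness might apply before a file or network operation. The `ALLOWED_HOSTS` set and both function names are assumptions for this sketch; in a real deployment the egress policy belongs at the network layer (proxy or firewall), with harness code as a second line only.

```python
from pathlib import Path
from urllib.parse import urlsplit


def is_within_project(project_root: str, candidate: str) -> bool:
    """True if candidate resolves to a location inside project_root.

    resolve() collapses `..` segments and follows symlinks, so a path that
    escapes through either is rejected.
    """
    root = Path(project_root).resolve()
    target = (root / candidate).resolve()
    return target == root or root in target.parents


# Hypothetical allowlist for illustration.
ALLOWED_HOSTS = {"registry.npmjs.org", "github.com"}


def egress_allowed(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """True if the URL's host is exactly one of the allowed hosts."""
    host = urlsplit(url).hostname
    return host is not None and host.lower() in allowed_hosts
```

Exact host matching is deliberate: suffix or substring matching would accept hosts such as `github.com.evil.example`.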
### 8.3 Prompt injection
The unsolved problem of the field. Tool outputs (fetched web pages, file
contents from third-party repos, search results) can contain instructions that
hijack the agent. Current mitigations are partial:
- Treat tool output as data, never as instructions. Easier said than enforced —
models cannot fully separate the two.
- Egress controls and explicit user confirmation for destructive operations.
- Detection layers (a separate classifier model scanning tool output for
injection patterns) — partial coverage at best.
Assume injection _will_ succeed eventually. Design the blast radius accordingly.
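
One possible reading of "design the blast radius" is to tighten confirmation requirements once untrusted content has entered the context. The sketch below is an assumption-laden illustration, not an established defense: the tool names, the `UNTRUSTED_SOURCES` and `HIGH_RISK_TOOLS` classifications, and the `SessionGuard` class are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tool classification for illustration.
UNTRUSTED_SOURCES = {"web_fetch", "web_search", "read_third_party_file"}
HIGH_RISK_TOOLS = {"bash", "git_push", "http_post"}


@dataclass
class SessionGuard:
    """Tracks whether untrusted content has entered the context."""

    tainted_by: list[str] = field(default_factory=list)

    def record_tool_result(self, tool: str) -> None:
        """Note every tool result that came from an untrusted source."""
        if tool in UNTRUSTED_SOURCES:
            self.tainted_by.append(tool)

    def needs_confirmation(self, tool: str) -> bool:
        """Require human approval for high-risk calls in a tainted session."""
        # Once untrusted text is in context, assume it may carry instructions.
        return bool(self.tainted_by) and tool in HIGH_RISK_TOOLS
```

This does not detect injection; it only limits what a successful injection can do without a human in the loop.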
---
## 9. The Self-Improving Harness
A pattern worth its own section because it's underused: **the harness should get
stronger with every difficult session.** The mechanism is a `Stop` hook that, at
session end, prompts the agent itself to reflect on whether the session was
unusually hard and, if so, what knowledge would have prevented most of the work.
A representative prompt:
> If this session required significant effort (many tool calls, multiple
> dead-ends, complex investigation): ask yourself what information, if it had
> existed at the start, would have prevented most of that work. First, determine
> scope — globally applicable, or specific to certain files / patterns? Then
> lean toward hooks as the solution: hard stops via PreToolUse, PostToolUse
> reminders at the relevant boundary, nested AGENTS.md, PreCompact state save,
> or SessionStart broad reminders. These are all more reliable than root
> AGENTS.md sections (lost-in-the-middle). Record the insight in the right hook
> or instructions file, not just in AGENTS.md.
Why this works:
- **The agent has the freshest signal** about what was painful in this session.
Asking 1–2 hours later loses fidelity.
- **The reflection is gated on effort**, so trivial sessions don't bloat the
rule set with low-value lessons.
- **The placement guidance is built into the prompt**, so the recorded lesson
lands at the right enforcement level (hook ≫ AGENTS.md) instead of defaulting
to the easiest place.
- **Repeated application compounds.** A harness that captures one lesson per
hard session per developer reaches its expressive ceiling fast, then stays
there.
The risk is rule bloat — each session is tempted to record something. Two
guardrails: (a) the prompt explicitly says "only record genuinely new insights";
(b) periodic audits remove rules that no longer fire or whose condition has been
superseded by a better mechanism.
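
The effort gate described above might look like the following. This is a sketch: `SessionStats`, `stop_hook`, and the threshold values are hypothetical placeholders, and the prompt text is a shortened paraphrase of the representative prompt, including the "only record genuinely new insights" guardrail.

```python
from dataclasses import dataclass
from typing import Optional

REFLECTION_PROMPT = (
    "This session required significant effort. What information, if it had "
    "existed at the start, would have prevented most of that work? Determine "
    "scope first, then prefer a hook over a root AGENTS.md section. "
    "Only record genuinely new insights."
)


@dataclass
class SessionStats:
    tool_calls: int
    dead_ends: int


def stop_hook(
    stats: SessionStats, min_tool_calls: int = 40, min_dead_ends: int = 2
) -> Optional[str]:
    """Return the reflection prompt only for unusually hard sessions."""
    # Thresholds are arbitrary placeholders; tune them against real sessions.
    hard = stats.tool_calls >= min_tool_calls or stats.dead_ends >= min_dead_ends
    return REFLECTION_PROMPT if hard else None
```

Returning `None` for easy sessions is what keeps trivial work from adding low-value rules.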
A related pattern is **answer-completeness verification at session end**: the
`Stop` hook re-surfaces the user's last prompt (preserved by the
`UserPromptSubmit` hook) and asks the agent to confirm every distinct question
was addressed, not just the primary task. Cheap to implement; catches the most
common multi-part-prompt failure mode.
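
A minimal sketch of that hook pair, assuming the two hooks can share a small state file. The function names, the state-file format, and the wording of the check are illustrative assumptions; the hook event names come from the text above.

```python
import json
from pathlib import Path
from typing import Optional


def on_user_prompt_submit(state_file: Path, prompt: str) -> None:
    """Persist the latest user prompt so the Stop hook can re-surface it."""
    state_file.write_text(json.dumps({"last_prompt": prompt}), encoding="utf-8")


def on_stop(state_file: Path) -> Optional[str]:
    """Build the completeness check, or return None if no prompt was saved."""
    if not state_file.exists():
        return None
    state = json.loads(state_file.read_text(encoding="utf-8"))
    prompt = state.get("last_prompt", "")
    if not prompt.strip():
        return None
    return (
        "Before finishing, re-read the user's last prompt and confirm that "
        "every distinct question was addressed, not just the primary task:\n\n"
        f"{prompt}"
    )
```

The prompt is quoted back verbatim rather than summarized, so secondary questions are not lost in a paraphrase.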
---
## 10. Operational Guidance (Synthesis)
A pragmatic playbook condensed from the above:
1. **Pick the harness first, model second.** A good harness with a mid-tier
model beats a great model with a bad harness.
2. **Default to a single agent loop with plan/act/verify.** Add subagents only
for read-only exploration or fully isolated tasks.
3. **Treat the context window as a budget.** Retrieve narrowly, summarize
aggressively, place task-critical content at the tail.
4. **Standardize on ~6 tools.** Resist tool proliferation. Use MCP-style façades
or code-as-tools above ~40 tools.
5. **Force verification in the harness.** Never rely on the model to grade
itself.
6. **Write `AGENTS.md` (or equivalent) for your repo.** Anti-patterns matter
more than positive instructions.
7. **Match model class to task.** Reasoning model for planning and diagnosis,
non-reasoning for mechanical work, cheap model for grep/summarize subagents.
8. **For local deployment:** Q6_K weights, Q8 KV cache, MoE for memory
efficiency, grammar-constrained tool calling.
9. **Build a 20-task internal eval suite** specific to your codebase. No public
benchmark substitutes.
10. **Date-stamp your conclusions.** The field moves fast enough that
model-specific advice rots in months.
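
For item 9, an internal eval suite does not need much machinery. The sketch below assumes a hypothetical `agent` callable that maps a prompt to an output string and a per-task `check` function acting as the external oracle; real suites would typically check repository state or test results instead of a returned string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    name: str
    prompt: str
    check: Callable[[str], bool]  # External check on the agent's output.


def run_suite(tasks: list[EvalTask], agent: Callable[[str], str]) -> dict:
    """Run every task through `agent` and report the pass rate."""
    failures = []
    for task in tasks:
        try:
            passed = task.check(agent(task.prompt))
        except Exception:
            passed = False  # A crash counts as a failure, not a skipped task.
        if not passed:
            failures.append(task.name)
    total = len(tasks)
    passed_count = total - len(failures)
    return {
        "total": total,
        "passed": passed_count,
        "pass_rate": passed_count / total if total else 0.0,
        "failures": failures,
    }
```

Re-running the same suite after each harness or model change, and recording the date with the result, ties items 9 and 10 together.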
---
## 11. Self-Evaluation
A frank assessment of this document's strengths and weaknesses, as instructed:
**Strengths**
- Categories are organized around the real axes of decision-making (harness vs
model, local vs cloud, reasoning vs not), not around vendor names, which would
have dated faster.
- Calls out specific failure modes per model family rather than treating all
frontier models as interchangeable.
- Acknowledges what _used_ to be true and has been uprooted, per request.
- Quantization and hardware guidance reflects mid-2026 reality (KV-cache quant,
MoE) rather than the 2023 "Q4 is fine" oversimplification.
- Self-contained: a reader without prior context can use it.
**Weaknesses and risks**
- **Model-specific claims rot fast.** The mid-2026 winners section will likely
be wrong in 3–6 months. The framing should survive longer than the specifics.
- **Citation density is now medium.** Primary sources have been added where
verifiable (Sharma 2310.13548 sycophancy, Liu 2307.03172 lost-in-the-middle,
Pan 2308.03188 self-correction, Zheng 2306.05685 LLM-as-judge, Anthropic Sep
2025 context engineering article). Several claims remain attributed to
community sources or unpublished internal evaluations (LangChain harness
result on Terminal-Bench 2.0, ETH Zurich AGENTS.md cost study, the 40–50 tool
threshold) — directionally trustworthy but a determined reader should verify
before quoting.
- **Possible bias toward the Anthropic / Claude ecosystem.** The author of this
document is a Claude-family model, and the "Claude leaks" framing reflects an
asymmetric leak landscape (Claude prompts leaked more visibly than
competitors'). Other labs do similar scaffolding work; the document implicitly
under-credits this.
- **Local-deployment section is hardware-specific** and will age as consumer
hardware changes (especially NVIDIA generational shifts and Apple's continued
unified-memory pushes).
- **Prompt injection section is appropriately pessimistic** but offers limited
actionable guidance because the field has limited actionable answers. This is
honest but unsatisfying.
- **Benchmarks section** treats Aider polyglot and SWE-Bench Verified as current
ground truth. Both will saturate; the criterion ("predicts your repo's
results") matters more than the named benchmark.
**What I would add with more space**
- A worked example of an `AGENTS.md` derived from the negative-instruction
principle, contrasted with a typical bloated one.
- Concrete numbers on the cost-at-scale crossover point for local hardware vs
API usage (these are knowable with reasonable assumptions).
- A section on fine-tuning vs RAG vs prompt-only customization, with the
cost/benefit thresholds.
- Empirical comparisons of grammar-constrained decoding tools (Outlines / GBNF /
lm-format-enforcer / function-calling-as-grammar) for tool-call reliability on
open-weight models.
**Overall confidence**
- **High** on the four opening shifts, the Prompt/Context/Harness diagnostic,
the cross-model failure modes, and the enforcement hierarchy — all
well-replicated and stable.
- **Medium** on the family-specific failure patterns, the tool-count threshold,
and the small-model harness mitigation set — directionally correct, specific
numbers vary by harness and model.
- **Lower** on specific model winners and exact hardware recommendations —
fast-moving facts.
**Changelog**
- **Revision 2:** integrated repo-internal research notes (Prompt/Context/
Harness taxonomy, ETH Zurich AGENTS.md study, LangChain Terminal-Bench harness
result, LLM-as-judge biases, sub-agent tiering, enforcement hierarchy,
Plan-and-Solve + Think-Anywhere, just-in-time retrieval, NOTES.md pattern,
sequential-constraint-ordering failure, small-model harness mitigations,
skills.sh / SKILL.md, OpenSpec, MCP-as-portable-deferred-loading,
expertise-ladder prompting). Added primary-source citations where available.
- **Revision 3:** added patterns observed in the repository's own agent
configuration (`.agents/`, hooks, modelfiles): counterbalance agent design
(§3.1a), circuit breakers as a first-class primitive (§3.2),
falsification-first investigation and dead-ends file (§3.4a), stateful hooks /
tool-specific PostToolUse warnings / path-scoped reminders (§3.7),
trigger-word nudges as positive-recommendation analog (§3.8), exploration
files as durable handoff artifacts and timing awareness (§4.5), anchored
compaction schema (§4.7), corrected Qwen3 sampling recommendations and
anti-filler-token prompts (§6.4), policy- and generated-file harness blocks
(§8.1), self-improving harness via Stop-hook reflection (new §9), and
outsider-persona expansion to the expertise-ladder prompt (§7). Old §9–10
renumbered to §10–11.
- **Revision 4:** elevated **permission-layer denial** above PreToolUse hard
blocks in the enforcement hierarchy (§3.6). A permission deny on an agent
definition removes the tool from the agent's available-tool set entirely,
rather than rejecting a tool call after the agent has chosen to make it.
Reflects the local-orchestration plan's structural-enforcement primitive
(OpenCode `permission: { edit: deny }`).
- **Revision 5:** added Skills vs Hooks comparison table to §5.5. Folded unique
content from `docs/research/agent-infrastructure.md` (which is now deleted);
everything else in that file was already synthesized in prior revisions.
- **Revision 6:** corrective edits driven by the 2026-05-16
text-intent-interpretation investigation
(`docs/explorations/text-intent-interpretation-research.md`). Three claims
revised against new evidence: (a) §2.1 sycophancy reframed as
model-family-conditional, not a universal RLHF property, citing nostalgebraist
(2023) replication on OpenAI base models; (b) §3.5
intrinsic-self-correction-hurts claim upgraded to cite Huang et al.
(arXiv:2310.01798) as the strong primary source, with Pan et al. retained as
the survey reference, and rewritten to explicitly call out "ask the model to
reflect" as a tempting-but-counterproductive intervention without an external
oracle; (c) §7 expertise-ladder prompting scoped down to divergent ideation
only and explicitly flagged as in tension with persona-prompting empirical
literature (Principled Personas EMNLP 2025; Persona is a Double-Edged Sword
IJCNLP 2025; arXiv:2512.05858); CoT-baked-in claim softened to acknowledge
posterior collapse on subjective tasks (arXiv:2409.06173); "ask the model to
reflect" added to the "no longer pays off" list.