# Agentic Coding: Best Practices (Research Notes)

> **Status:** Research synthesis, not a tutorial. Captures the state of the
> agentic-coding field as of mid-2026, with emphasis on what has been _uprooted_
> from earlier (2022–2024) practice.
>
> **Audience:** Engineers building, configuring, or using AI coding agents — not
> first-time LLM users.
>
> **Self-evaluation:** See the final section. This document is opinionated and
> deliberately concrete; model-specific claims are date-stamped because they age
> within months.
>
> **Applied implementation:**
> [`docs/projects/agent-infrastructure.md`](../projects/agent-infrastructure.md)
> — how these principles are applied in this repo (current architecture,
> OmniCoder 2 orchestration plan, open issues).

---

## 0. Framing: What Got Uprooted

Four big shifts have rendered most pre-2024 "LLM coding tips" obsolete or
actively misleading:

1. **Prompt engineering → context engineering.** Modern instruction-tuned
   frontier models follow direct, terse instructions reliably. The high-leverage
   work has moved _outside_ the system prompt — into what tokens reach the model
   at all, in what order, and with what compression. (Karpathy popularized the
   term "context engineering" in mid-2025; it has since been adopted as the
   default frame by Anthropic, Cursor, and others.)
2. **Model > harness → harness ≈ model.** A 2023 belief was "just wait for the
   next model." The Claude system-prompt leaks (Oct 2024 onward), the success of
   Aider's repo-map, and Cognition's published failure analyses showed that
   _scaffolding_ — tool choice, context budget, plan/act separation, todo
   tracking — explains as much variance in agent success as the underlying
   model. A mid-tier model with an excellent harness routinely beats a frontier
   model with a naive harness on real-repo tasks.
3. **Multi-agent enthusiasm → single-thread default.** The "swarm" / AutoGPT era
   assumed parallelism would compound capability. Cognition's
   ["Don't Build Multi-Agents"](https://cognition.ai/blog/dont-build-multi-agents)
   (mid-2025) and subsequent replications established the now-dominant view:
   context fragmentation between agents destroys more value than parallelism
   creates. Subagents survive only in narrow, _read-only or fully isolated_
   roles.
4. **Three layers, not one.** The field has converged on a useful taxonomy
   popularized by an Alibaba Cloud engineering article (Apr 2026): **Prompt →
   Context → Harness.** _Prompt_ is the per-request task expression (stateless).
   _Context_ is everything the model sees during execution (system rules, tool
   definitions, AGENTS.md, retrieved code, conversation history). _Harness_ is
   the deterministic machinery around the model (hooks, permission gates,
   verification loops, subagent boundaries). The layers fail differently and
   require different fixes — conflating them is the single most common mistake
   in agent design. LangChain's Terminal-Bench 2.0 score rose from **52.8% →
   66.5% by changing the harness alone** (no model swap, no prompt change), the
   starkest single data point that harness design has first-order impact.

Everything below is downstream of these four shifts.

---

## 1. The Model Landscape (Mid-2026)

### 1.1 Categories that actually matter

Drop the "GPT vs Claude vs Gemini" framing. The useful axes are:

| Axis                      | Options                                                | Why it matters                                                                                                                                |
| ------------------------- | ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------- |
| **Reasoning depth**       | Non-reasoning · Hybrid (toggleable) · Always-reasoning | Reasoning models excel at planning and bug diagnosis; non-reasoning models are faster and cheaper for mechanical edits.                       |
| **Architecture**          | Dense · Mixture-of-Experts (MoE)                       | MoE delivers high parameter counts with low active-param compute — critical for local deployment.                                             |
| **Context budget**        | 128k · 200k · 1M+ effective                            | Stated context ≠ effective context. Most models degrade well before the advertised limit.                                                     |
| **Tool-calling fidelity** | Native function-call schema reliability                | The single biggest differentiator for agent harnesses. Models with weak tool fidelity cannot drive agents reliably regardless of raw ability. |
| **Hostability**           | Closed-API only · Open-weight                          | Determines whether local/private deployment is viable.                                                                                        |

### 1.2 Category winners (as of May 2026 — will rot quickly)

- **Frontier closed-weight, agentic coding:** Claude Opus 4.x and Claude Sonnet
  4.x dominate SWE-Bench Verified and long-horizon multi-file refactors.
  GPT-5-class models lead on competitive-programming-style isolated problems and
  aggressive reasoning. Gemini 2.5 Pro leads on very-long-context navigation
  (100k+ token codebases in single prompts).
- **Open-weight frontier:** DeepSeek-V3.x and Qwen3-Coder (480B MoE) are the
  current open SOTA on coding benchmarks. GLM-4.6 and Kimi K2 trail closely on
  agentic tasks. The gap to closed frontier has narrowed to roughly 6–12 months
  for raw capability, but tool-calling fidelity still lags.
- **Local-runnable (≤80GB VRAM):** Qwen3-Coder-30B-A3B (MoE) and Qwen3-32B-dense
  are the practical sweet spot. DeepSeek-V3 distillations and GLM-4-9B/32B
  occupy specific niches.
- **Best price/performance for autonomous agents:** Mid-tier Sonnet-class and
  GPT-5-mini-class models routinely win on cost-adjusted SWE-Bench, because
  agentic tasks are dominated by mechanical token throughput, not peak reasoning
  per call.

### 1.3 Benchmarks: which actually predict real-world success

- **Predictive:** SWE-Bench Verified, Aider polyglot leaderboard, LiveCodeBench
  (recent splits only), Terminal-Bench. These measure multi-file edits,
  test-passing, and tool use under realistic constraints.
- **Misleading or saturated:** HumanEval, MBPP, basic code-completion suites.
  All are contaminated and saturated; a 90+% score is now table stakes and
  uncorrelated with agent success.
- **Underrated:** Internal harness-vs-harness A/B tests on _your own_
  repository. No public benchmark captures repo-specific idioms, build systems,
  or test-runner quirks. A 20-task internal eval suite beats any leaderboard
  ranking for selecting a working model for a given project.

---

## 2. Failure Modes

### 2.1 Cross-model failures

These appear across every frontier model and most open-weight models:

- **Premature completion claims.** The model declares "done" while tests fail or
  builds break. Mitigation: forced verification step in the harness ("run the
  build before declaring success"), not in the prompt.
- **Sycophancy** (Sharma et al.,
  [arXiv:2310.13548](https://arxiv.org/abs/2310.13548), Oct 2023). Five SOTA
  RLHF-trained assistants systematically generated responses matching the user's
  stated or implied beliefs over correct ones; both human raters and reward
  models preferred convincing-but-wrong outputs a non-negligible fraction of the
  time, creating systematic training pressure toward agreement. **Caveat — not a
  universal property of RLHF.** nostalgebraist (LessWrong, 2023) replicated
  Anthropic's sycophancy eval on OpenAI base models and found they are _not_
  sycophantic at any size, so the effect depends on the specific finetuning
  recipe and the family-specific preference data, not on RLHF as such. Treat
  sycophancy as family-conditional rather than a universal cross-model failure;
  the mitigations below still apply where it manifests. Code-specific
  manifestations: hard-coding to pass test cases, scope creep via agreement,
  confirming guesses without verification, premature positive feedback.
  Mitigation: explicit anti-sycophancy rules ("challenge the user when the user
  is wrong"; "read a file before asserting facts about it"; "only make changes
  that are directly requested"), and external feedback (test runners, hooks)
  rather than model self-grading.
- **Hallucinated APIs.** Inventing function signatures, import paths, or
  configuration keys. Worsens with: long contexts, smaller models,
  unfamiliar/newer libraries. Mitigation: grounding tools (read source, grep
  before calling), forced doc-fetch, repository-aware retrieval.
- **Reward-hacked verification.** Deleting failing tests, weakening assertions
  to make tests pass, wrapping failing code in `try/except`, or _solving the
  test cases_ rather than the general problem. Universal failure mode.
  Anthropic's published counter-prompt is short and effective enough to repeat
  verbatim:

  > Please write a high-quality, general-purpose solution using the standard
  > tools available. Do not create helper scripts or workarounds to accomplish
  > the task more efficiently. Implement a solution that works correctly for all
  > valid inputs, not just the test cases. Do not hard-code values or create
  > solutions that only work for specific test inputs. Tests are there to verify
  > correctness, not to define the solution.

  Pair with: pre/post diff inspection, test-coverage delta checks, and explicit
  policy against test deletion in agent rules. Pan et al.'s survey of
  self-correction strategies
  ([arXiv:2308.03188](https://arxiv.org/abs/2308.03188), 2023) establishes the
  broader principle: **external feedback signals (test runners, hooks, type
  checkers) are reliable; self-critique alone is not** — models are poorly
  calibrated to detect their own errors without ground truth.

- **Context rot / lost-in-the-middle** (Liu et al.,
  [arXiv:2307.03172](https://arxiv.org/abs/2307.03172), 2023). Information
  placed in the middle of a long context is recalled poorly even by 1M-context
  models. Mechanism: transformer attention attends to every token in context (n²
  pairwise relationships), so a larger context stretches attention capacity
  across more relationships, leaving less focused attention per token. The
  degradation is gradual, not a cliff; effective context is typically 30–50% of
  advertised. Mitigation: structured, ordered context (most-recent and
  most-task-relevant at the tail), summarization of stale turns, separate
  retrieval rather than dumping.

- **Position-anchored priming (question drift).** When a model commits to an
  answer in a prior turn, that answer sits in the context window and acts as a
  prior the model subsequently defends. Follow-up questions are read through the
  lens of the previous position; the model generates responses consistent with
  what it already said rather than addressing the new question. Common pattern:
  "no" to a first question → "no" to all follow-ups even when the follow-ups ask
  something different. Related to sycophancy but directionally inverted — the
  model is anchored to _its own_ prior commitment, not the user's.

  Mitigations in order of effectiveness:
  - **Compaction or fresh context.** Remove the prior committed answer from the
    context window. The anchor is physically broken. A `PreCompact` hook can
    preserve the user's current question while discarding stale prior responses.
  - **Adversarial reframing.** Per ClashEval (Wu, Wu, Zou 2024): lowering the
    model's confidence in its prior increases context adherence. "I believe your
    previous answer was wrong because X. Now answer this specific question: ..."
    lowers confidence more than repeating the question.
  - **Explicit current-question marker.** A `UserPromptSubmit` hook prepending
    `CURRENT QUESTION (answer this, not the prior exchange):` at the prompt
    tail. Mechanical, cheap, measurably reduces drift for small models where
    position effects are stronger.
  - What does **not** work: repeating the question louder, emphasis, or asking
    the model to "read more carefully." None of these change the anchor.

- **Stub-and-forget.** Writing `// TODO: implement` placeholders and returning
  control as if complete. Especially common in the Claude family. Mitigation:
  grep-for-TODO post-step.

### 2.2 Family-specific patterns

- **Claude (Opus/Sonnet 4.x):** Tends toward _over-engineering_ — adds
  unrequested error handling, docstrings, abstractions. Strong on instruction
  adherence when restrictions are explicit. Tends to "polish" adjacent code when
  asked to make a targeted change. Mitigation: explicit anti-scope-creep rules
  in `AGENTS.md` / `CLAUDE.md` (this is exactly why the field standardized on
  these files).
- **GPT (4.x / 5):** Tends toward _overconfident refactors_ — silently
  restructures code beyond the requested scope. Stronger at math/algorithmic
  reasoning, weaker at faithfully respecting existing code style. Mitigation:
  small task slicing, frequent diff review, lower temperature.
- **Gemini (2.5):** Verbose; tends to repeat large file contents. Strong on very
  long contexts but degrades on tool-call schema adherence under load.
  Occasional formatting drift (markdown bleeding into code). Mitigation:
  output-format guards and structured tool schemas.
- **DeepSeek / Qwen / open MoE:** Strong raw coding but weaker tool-call
  reliability — malformed JSON, schema deviation, or "talking about" calling a
  tool rather than emitting the call. Mitigation: strict JSON-mode /
  grammar-constrained decoding (e.g., `llama.cpp` GBNF, `outlines`,
  `lm-format-enforcer`), and harnesses that re-prompt on malformed calls.
- **Small / quantized models (≤14B, Q4 and below):** Instruction-following
  collapse — ignoring rules after ~4–8 turns; tool-schema breakage; severe
  hallucination of imports. Not yet viable as primary agent drivers; usable as
  cheap subagents for specific narrow tasks (grep, summarize, classify).

### 2.3 The "Claude leaks" and their effect

Starting Oct 2024, leaked system prompts and tool definitions from Claude (and
later, similar leaks from Cursor, Devin, Windsurf, and others) revealed how much
production-grade harnesses rely on:

- Explicit personas and tone constraints
- Long lists of _anti-patterns_ ("do not ... do not ... do not ...")
- Structured TODO tracking as a first-class tool
- Strict separation of plan and act phases
- Memory tiering (session vs persistent vs repo)
- Explicit file-link and citation formats

The industry consequence was rapid convergence: `AGENTS.md`, `CLAUDE.md`,
`.cursorrules`, `.windsurfrules`, `.opencode/agent.md`, and similar files now
share a near-identical structure. The leaks accelerated the recognition that
**prompt scaffolding is the product**, not a secondary detail. They also
clarified that frontier labs spend significant effort on _negative_ instruction
— what _not_ to do — which most third-party agent builders under-invested in.

---

## 3. Agent Architecture

### 3.0 The Prompt / Context / Harness diagnostic

For any agent failure, route the fix to the right layer. Wrong-layer fixes are
the single most common waste of effort:

| Symptom                                    | Layer   | Fix                                         |
| ------------------------------------------ | ------- | ------------------------------------------- |
| Wrong output format                        | Prompt  | Rewrite instruction; add output schema      |
| Missed an explicit requirement             | Prompt  | Tighten task expression                     |
| Hallucinated codebase fact                 | Context | Fix tool description; add retrieval         |
| Wrong tool selected                        | Context | Fix description; reduce tool count          |
| Stalls mid-task on multi-step problem      | Context | Insufficient persistent context (NOTES.md)  |
| Reads all files first despite "don't"      | Context | Trained behavioral prior — see §4.6         |
| Task drift in long session                 | Harness | Add sub-agent isolation boundary            |
| Destructive action taken                   | Harness | Add permission hook (pre-tool deny)         |
| Tests deleted to pass; assertions weakened | Harness | Pre/post diff check; coverage-delta gate    |
| Long-session quality cliff at ~60% fill    | Harness | Early compaction trigger; tool-output prune |

### 3.1 Single-thread default

Modern consensus: a single agent loop with a clear plan/act split outperforms
multi-agent topologies on almost all real coding tasks. Cognition's analysis
identified the root cause as **context divergence**: separate agents accumulate
incompatible interpretations of the same task, and reconciliation costs exceed
parallelism gains.

The exceptions where parallel/multi-agent _does_ help:

- **Read-only exploration subagents.** Scan a large codebase, return a
  compressed summary. Their context does not need to merge back.
- **Fully isolated tasks.** Multiple independent files generated from the same
  spec, with no inter-dependencies. Rare in real codebases.
- **Adversarial review.** A second agent reviews the first's diff. Modest gains,
  mostly catches premature-completion failures.

### 3.1a Counterbalance agent design

When secondary agents _are_ defined (slash commands, personas, named modes), the
high-leverage approach is to design each agent as a **counter to a known failure
mode of the base model**, not as a topic specialist ("frontend agent", "database
agent"). Topic specialists duplicate context and rarely beat a generalist with a
good search tool. Counterbalance agents earn their keep by suppressing a
measurable, named tendency:

- A **brainstorm agent** counters frontier-model _overthinking_ — enforces
  speed, breadth, no hedging, no deep analysis. Exists because Opus/Sonnet
  ruminate by default.
- A **research agent** counters frontier-model _pattern-matching_ — requires
  hypothesis + falsification criterion before any diagnostic test. Exists
  because LLMs latch onto the first plausible explanation.
- A **build-local agent** counters _small-model context drift_ — pagination
  limits, mandatory grep-before-read, delegation rules for multi-file work.

Two consequences for agent-body authoring:

1. **Negative role definition is part of the spec.** Every counterbalance agent
   should end with a short "What You Are NOT" block: _"You are NOT an
   implementation agent. You are NOT a planning agent."_ The exclusion list
   prevents scope creep more reliably than positive role framing alone.
2. **Cognitive-mode decomposition** beats topic decomposition. Agents named for
   _how they think_ (diverge, investigate, execute-narrowly) compose cleanly:
   brainstorm hands off to research, research hands off to default, build-local
   handles narrow tasks. Agents named for _what they think about_ ("backend
   agent") fight for jurisdiction on every cross-cutting task.

### 3.2 Plan / Act / Verify loop

The minimal viable agent loop:

```
plan → act → verify → (loop or stop)
```

- **Plan:** produce a todo list, possibly with a brief written rationale. Forces
  the model out of "pattern-match and emit" mode. The todo list is also a
  contract the verify step can check against. **Plan-and-Solve prompting** (Wang
  et al., 2023) — decompose first, then execute — measurably reduces arithmetic
  and multi-step reasoning errors.
- **Act:** execute one todo at a time. Single in-progress item is a soft rule
  that empirically reduces context fragmentation.
- **Verify:** run tests, lint, build. The verification _must_ be in the harness,
  not the prompt — relying on the model to self-verify is one of the most
  reliable ways to produce reward-hacked output.

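The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular harness's implementation: `act` stands in for the model-driven edit step, `verify_cmd` is whatever command the project uses for tests or builds, and the names `agent_loop` and `run_check` are hypothetical.

```python
import subprocess
from typing import Callable


def run_check(cmd: list[str]) -> bool:
    """Harness-side verification: the exit code is the verdict, not the model's word."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def agent_loop(
    todos: list[str],
    act: Callable[[str], None],
    verify_cmd: list[str],
    max_attempts: int = 3,
) -> dict[str, bool]:
    """Work one todo at a time; mark it done only when verify_cmd exits 0."""
    results: dict[str, bool] = {}
    for todo in todos:
        ok = False
        for _ in range(max_attempts):
            act(todo)                   # model edits files (supplied by the caller)
            ok = run_check(verify_cmd)  # e.g. ["npm", "test"] in a real project
            if ok:
                break
        results[todo] = ok
        if not ok:
            break  # stop rather than build further work on a failing step
    return results
```

The design choice worth noting is that the model never reports success itself: a todo is complete only if the verification command says so, and the loop stops at the first todo that cannot be verified.
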
**Think-Anywhere** (Jiang et al., 2026) extends Plan-and-Solve: models trained
to insert `<think>` blocks at _any_ token position — not just upfront — catch
mid-implementation off-by-one errors that an initial plan cannot foresee. Claude
4.x's _interleaved thinking_ between tool calls is the production-grade
realization of the same idea. The practical instruction: "Re-evaluate the
hypothesis at every tool-call boundary." The mapping to development
methodologies is exact — **Plan-and-Solve is sprint planning, Think-Anywhere is
the retrospective**; both are needed, neither suffices alone. Skipping the plan
is "vibe coding"; refusing to re-evaluate is waterfall.

**Circuit breakers as a first-class primitive.** Embedded numeric self-stops in
the agent body materially outperform vague "don't loop" instructions. The
pattern, verbatim from working agent files:

- _5+ attempts without falsifying a hypothesis = STOP. Report what you've ruled
  out._
- _3+ edits to the same file without a passing test = STOP. You're fixing
  symptoms, not the cause._
- _Urge to "just try something" = STOP. Write the hypothesis first._
- _Two failures at the same level of abstraction = go UP one level._

Why this works: vague instructions decay against task pressure; explicit
integers don't. The model can self-monitor against a count more reliably than
against "too much." Pair with hard caps in the harness for the cases where the
agent fails to self-stop.

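The harness-side hard caps mentioned above can mirror the first two numeric rules directly. The sketch below is illustrative only: the class name `CircuitBreaker`, its methods, and the reset behaviour are assumptions chosen for the example, and the thresholds simply copy the integers from the list.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CircuitBreaker:
    """Harness-side backstop for the numeric self-stops in the agent body."""

    max_edits_without_pass: int = 3  # mirrors "3+ edits to the same file"
    max_attempts: int = 5            # mirrors "5+ attempts without falsifying"
    edits: Counter = field(default_factory=Counter)
    attempts: int = 0

    def record_edit(self, path: str) -> None:
        self.edits[path] += 1

    def record_test_pass(self) -> None:
        self.edits.clear()  # a passing test resets the per-file edit counts

    def record_attempt(self) -> None:
        self.attempts += 1

    def record_falsified(self) -> None:
        self.attempts = 0  # ruling a hypothesis out counts as progress

    def tripped(self) -> Optional[str]:
        """Return a stop reason, or None if the agent may continue."""
        for path, count in self.edits.items():
            if count >= self.max_edits_without_pass:
                return f"{count} edits to {path} without a passing test"
        if self.attempts >= self.max_attempts:
            return f"{self.attempts} attempts without falsifying a hypothesis"
        return None
```

A harness would call `tripped()` before dispatching each tool call and halt the session with the returned reason, which covers the case where the agent ignores its own counters.
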
### 3.3 Reasoning-mode usage

For reasoning-capable models, the cost calculus is:

- **Use reasoning for:** planning, bug diagnosis, ambiguous requirements,
  architecture decisions.
- **Skip reasoning for:** mechanical edits, file moves, formatting fixes,
  applying a known patch.
- **Hybrid models with toggleable reasoning** (Claude 4.x extended thinking,
  GPT-5 reasoning effort, Qwen3 thinking-mode) make this routing tractable
  inside a single harness.

### 3.4 Sub-agent tiering (model-as-budget)

When subagents _are_ used (read-only exploration, isolated tasks), the
now-standard pattern is **model-class tiering**:

- **Parent orchestrator:** strongest model (Opus-class) — holds cross-task
  state, plans, synthesizes. High per-call cost, few calls.
- **Sub-agents:** mid- or small-class (Sonnet/Haiku-class, or a 30B local model)
  — receive isolated task slices. May burn tens of thousands of exploration
  tokens, but **return only a 1–2k token condensed summary**. The parent's
  context never sees the sub-agent's raw exploration.

This converts the sub-agent into a **context firewall**: parallelism without
context contamination. It is the only multi-agent topology that consistently
outperforms single-thread.

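The firewall property is easy to state in code: only the capped summary ever reaches the parent's context. The sketch below assumes `explore` and `summarize` are caller-supplied stand-ins for the sub-agent's model calls; the name `delegate` and the character cap (a rough proxy for the 1–2k token budget at about four characters per token) are illustrative assumptions.

```python
from typing import Callable

SUMMARY_CHAR_CAP = 8_000  # assumption: roughly 2k tokens at ~4 characters per token


def delegate(
    task: str,
    parent_context: list[str],
    explore: Callable[[str], str],
    summarize: Callable[[str], str],
) -> None:
    """Run a sub-agent and append only its capped summary to the parent context."""
    raw = explore(task)       # may be tens of thousands of tokens of exploration
    summary = summarize(raw)  # condensed by the cheaper sub-agent model
    parent_context.append(summary[:SUMMARY_CHAR_CAP])
    # `raw` is dropped here; the parent never sees the exploration transcript.
```

Enforcing the cap in the harness rather than asking the sub-agent to be brief keeps the budget intact even when the smaller model ignores its length instruction.
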
### 3.4a Falsification-first investigation

Applied Strong Inference (Platt, 1964) at the operational level. Before any
diagnostic test, the agent fills a four-item checklist:

- [ ] Hypothesis written (one sentence: _"I believe X because Y"_)
- [ ] Falsification criterion written (_"if wrong, I'd expect to see \_\_\_"_)
- [ ] **Falsification test run before confirmation test**
- [ ] Result recorded: ELIMINATED with reason, or CONFIRMED with evidence

The order matters: running the confirmation test first invites confirmation bias
and produces a "plausible answer" that the agent then defends. Running the
falsification test first either kills the hypothesis cleanly (cheap progress) or
strengthens it materially (the surviving hypothesis is now harder to dislodge).

**Dead-ends file.** Each eliminated hypothesis is appended to
`.session/dead-ends.md` (or the investigation file's Hypotheses section) with
the same four fields. Three benefits:

1. The current session does not re-test an already-eliminated hypothesis when
   context pressure causes forgetting.
2. A post-compaction resume has a structured record to anchor against.
3. A fresh session (or a handoff agent) starts with a real audit trail instead
   of having to re-derive the eliminations.

Dead-ends are also a leading indicator of agent quality: a session that produces
zero entries was either trivial or non-rigorous; a session with 10+ entries and
no resolution is a candidate for human escalation.

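A dead-ends entry can be modelled as a small record with the four checklist fields. The following sketch is one possible shape, not the format used in this repo: the `Hypothesis` class, its field names, and the Markdown layout are assumptions; only the `.session/dead-ends.md` path comes from the text above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Hypothesis:
    statement: str  # "I believe X because Y"
    falsifier: str  # "if wrong, I'd expect to see ..."
    test_run: str   # the falsification test that was executed
    result: str     # "ELIMINATED: <reason>" or "CONFIRMED: <evidence>"

    def to_markdown(self) -> str:
        return (
            f"- **Hypothesis:** {self.statement}\n"
            f"  - Falsification criterion: {self.falsifier}\n"
            f"  - Falsification test: {self.test_run}\n"
            f"  - Result: {self.result}\n"
        )


def record_dead_end(h: Hypothesis, path: Path = Path(".session/dead-ends.md")) -> None:
    """Append an eliminated hypothesis so later sessions do not re-test it."""
    if not h.result.startswith("ELIMINATED"):
        raise ValueError("only eliminated hypotheses belong in the dead-ends file")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(h.to_markdown())
```

Appending rather than rewriting keeps the file usable as an audit trail across compactions and handoffs, and counting its entries gives the quality signal described above.
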
### 3.5 Evaluator-Optimizer, LLM-as-Judge, and Reflexion

Anthropic's "Building effective agents" formalized the evaluator-optimizer
pattern: one agent generates, a separate evaluator scores against a rubric, the
generator refines. Useful for research-quality assessments and brainstorm
outputs more than for code (tests are a stricter evaluator than any judge).

The foundation result is Zheng et al.
([arXiv:2306.05685](https://arxiv.org/abs/2306.05685), MT-Bench / Chatbot Arena,
2023): **GPT-4-class LLMs as judges achieve >80% agreement with human
preferences — the same rate as human-human agreement.** This makes them a viable
scalable evaluator, _but_ with known biases that must be controlled:

- **Position bias.** Judges favor whichever response appears first in a pairwise
  comparison. Mitigation: run twice with order reversed; take only the
  consistent result.
- **Verbosity bias.** Longer responses score higher even at equal information
  density. Mitigation: rubric scores correctness and concision separately.
- **Self-enhancement bias.** Same-family judges over-score their own family's
  outputs. Mitigation: cross-family judging or human spot-checks for
  calibration.

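The position-bias mitigation (judge twice with the order reversed, keep only the consistent result) can be sketched as follows. The `judge` callable is a stand-in for an LLM call and its `"first"` / `"second"` return convention is an assumption made for this example.

```python
from typing import Callable, Optional

# judge(first, second) returns "first" or "second"; a stand-in for an LLM call.
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, a: str, b: str) -> Optional[str]:
    """Judge twice with the order swapped; return "a", "b", or None if inconsistent."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None
```

A judge that always prefers the first slot yields `None` for every pair, which is the desired outcome: an inconsistent verdict is discarded rather than counted as a win for either response.
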
**Reflexion (Shinn et al., 2023, arXiv:2303.11366)** formalizes the
evaluator-optimizer loop for multi-step agents: an external evaluator generates
verbal feedback, the agent stores it in an episodic memory buffer, and reruns
with the feedback in context. Results: 91% pass@1 on HumanEval vs GPT-4's 80%
without it. Two non-negotiable conditions:

1. **External feedback signal** — not self-critique. An oracle or verifier (test
   pass/fail, compilation, hook exit code). Huang et al.
   ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large Language Models
   Cannot Self-Correct Reasoning Yet," Oct 2023) demonstrate this directly: in
   the intrinsic setting (no oracle labels), self-correction _consistently
   decreases_ reasoning performance across prompts and tasks; prior
   "self-correction works" results vanish when oracle labels are removed. Pan et
   al. (arXiv:2308.03188) provide the broader survey taxonomy of self-correction
   strategies and the same conclusion in aggregate: external feedback signals
   (test runners, hooks, type checkers) are reliable; self-critique alone is
   not. Without an external signal, asking the model to reflect, double-check,
   or critique its own output is at best noise and at worst actively harmful —
   this is one of the most tempting and most counterproductive interventions in
   agent design.
2. **The ability to retry.** Reflexion loops. Single-shot feedback injection is
   helpful context, not the full pattern.

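Both conditions show up directly in a sketch of the loop. This is a simplified illustration of the pattern rather than the implementation from the Reflexion paper: `generate` stands in for the model call, `evaluate` for the external verifier (returning `None` on success or a feedback string on failure), and all names here are hypothetical.

```python
from typing import Callable, Optional


def reflexion(
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str], Optional[str]],
    task: str,
    max_trials: int = 3,
) -> tuple[Optional[str], list[str]]:
    """Retry with accumulated external feedback until evaluate() returns None."""
    memory: list[str] = []  # episodic buffer of verbal feedback
    for _ in range(max_trials):
        attempt = generate(task, memory)
        feedback = evaluate(attempt)  # external signal, e.g. failing test output
        if feedback is None:
            return attempt, memory
        memory.append(feedback)  # carried into the next trial's context
    return None, memory  # budget exhausted; caller decides whether to escalate
```

The feedback comes from `evaluate`, never from the generator grading itself, and the bounded retry is what distinguishes the full pattern from one-off feedback injection.
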
**Failure-mode routing as a design extension.** A judge subagent that reads the
transcript, classifies the failure mode, and selects the matching intervention
is stronger than generic "review the output" because the intervention is matched
to the type of failure, not just "try harder." The prior-confidence →
intervention mapping from §6.4 applies here:

| Failure mode                                 | External signal?   | Intervention                         |
| -------------------------------------------- | ------------------ | ------------------------------------ |
| Code bug / test failure                      | Yes (test runner)  | Reflexion loop                       |
| Convention violation (async, error handling) | Yes (grep)         | PostToolUse grep + canonical example |
| Question drift / prior anchoring             | No                 | Compaction or adversarial reframing  |
| Factual hallucination                        | Sometimes          | Retrieval injection                  |
| Wrong directory / file                       | Yes (file listing) | Structure injection                  |

**Design constraints for the judge subagent:**

- **Use a stronger or cross-family model as judge.** A small model evaluating
  its own family's outputs compounds self-enhancement bias and parameter-count
  limitations. Frontier-class (Opus/Sonnet) or a different model family is
  strongly preferred. For a local-only constraint, a 32B judge evaluating a 9B
  agent is a practical minimum.
- **Activate on mechanical failure signals, not every turn.** Run the judge when
  a hook fires non-zero, tests fail, or a build breaks — not as a constant
  overlay. Routing every response through a judge adds latency and is redundant
  when mechanical verification already gives a clear answer.
- **Judge output should be a correction spec, not a rewrite.** Structured:
  `{ failure_mode, confidence, intervention, injected_context? }`. The working
  agent acts on the spec; the judge stays in the evaluator role.
- **General Q&A failures lack external ground truth.** For question drift,
  factual errors without a retrieval target, or prior anchoring — no oracle
  exists. Compaction and adversarial reframing are cheaper and more reliable for
  those cases than a judge loop.

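The correction spec and the failure-mode routing table from this section can be expressed as a small typed record plus a lookup. The four field names come from the spec above; the string identifiers for failure modes and interventions are hypothetical labels invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers; the pairings follow the failure-mode routing table.
INTERVENTIONS = {
    "test_failure": "reflexion_loop",
    "convention_violation": "post_tool_use_grep_with_canonical_example",
    "question_drift": "compaction_or_adversarial_reframing",
    "factual_hallucination": "retrieval_injection",
    "wrong_path": "structure_injection",
}


@dataclass(frozen=True)
class CorrectionSpec:
    failure_mode: str
    confidence: float
    intervention: str
    injected_context: Optional[str] = None


def route(
    failure_mode: str, confidence: float, injected_context: Optional[str] = None
) -> CorrectionSpec:
    """Build the judge's output: a spec for the working agent, never a rewrite."""
    if failure_mode not in INTERVENTIONS:
        raise ValueError(f"unknown failure mode: {failure_mode}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return CorrectionSpec(
        failure_mode, confidence, INTERVENTIONS[failure_mode], injected_context
    )
```

Keeping the mapping in harness code rather than in the judge's prompt means the judge only has to classify; the choice of intervention stays deterministic and reviewable.
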
### 3.6 The Enforcement Hierarchy

Not all guidance is equally effective. From most to least reliable, as a
practical hierarchy:

```
Permission-layer denial     ← Strongest. Tool literally not available to the agent.
PreToolUse hard block       ← Structural. Always fires. Agent cannot bypass.
PostToolUse path-check      ← Fires right after the relevant action (context tail).
Nested AGENTS.md at path    ← Always-on for that folder scope. Tool-portable.
Stop / SessionStart inject  ← Fires at session boundaries. Broad reminders.
Root AGENTS.md sections     ← Context-start only. Degrades under Lost-in-the-Middle.
```

The root cause of the degradation gradient is Liu et al.'s lost-in-the-middle
result: guidance written once at session start sits in the low-attention middle
by tool call 20. Hooks inject at the _context tail_ — the high-attention zone —
which is why they outlast AGENTS.md under context pressure. **Decision rule:**
if a constraint must hold deep into a session, fire it from a hook, not a
prompt.

**Permission-layer denial sits above PreToolUse for a reason.** A PreToolUse
|
||
hook _intercepts_ a tool call the agent has already chosen to make; it generates
|
||
a rejection message that the agent must then process and route around.
|
||
Permission-layer denial (OpenCode's
|
||
`permission: { edit: deny, write: deny, bash: deny }` on an agent definition;
|
||
Claude Code's analogous allowlist) **removes the tool from the agent's available
|
||
set entirely** — the tool description never appears in the agent's context, so
|
||
the agent cannot try and recover. This is the cleanest realization of
|
||
Anthropic's "poka-yoke your tools" principle: the violation is not just blocked,
|
||
it is unreachable. Use it for invariants that must hold across an entire agent
|
||
role (e.g., "the orchestrator never writes files"); use PreToolUse hooks for
|
||
invariants that depend on the specific tool arguments (e.g., "no `npx` in shell
|
||
commands").
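
An argument-dependent PreToolUse check can be sketched as follows. This is a
minimal illustration, not any harness's actual API: it assumes the harness hands
the pending shell command to the hook as a string and treats exit code 2 as
"deny and show the message to the agent". The function name
`check_shell_command` and the message text are hypothetical.

```python
import re
import shlex

# Hypothetical rule: block `npx` as the leading word of any command segment.
BLOCKED = {"npx"}
BLOCK_EXIT_CODE = 2  # assumed "deny and tell the agent why" convention


def check_shell_command(command: str) -> tuple[int, str]:
    """Return (exit_code, message). Silent (0, "") when the command is allowed."""
    # Split on shell separators so `cd app && npx foo` is caught too.
    for segment in re.split(r"&&|\|\||[;|]", command):
        try:
            words = shlex.split(segment)
        except ValueError:
            # Unbalanced quotes: fall back to a plain whitespace split.
            words = segment.split()
        if words and words[0] in BLOCKED:
            return BLOCK_EXIT_CODE, f"Blocked: `{words[0]}` is not allowed here."
    return 0, ""
```

The check lives in code, so the condition ("is this `npx`?") is evaluated the
same way on every call. Role-wide invariants still belong one layer up, in the
permission configuration.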

### 3.7 Hook design: silent on success, loud on failure

A convention that has converged across Claude Code, Cursor, OpenCode, and
internal Anthropic tooling: **hooks emit nothing on success and exit with a
non-zero code (commonly 2) on failure** to reactivate the agent. Verbose success
output adds noise to every tool call; the agent only needs to know when it's
wrong. This is the harness analog of Unix's "no news is good news."

Three refinements that materially improve hook quality once the basics are in
place:

- **Stateful reminders that read system state at fire time.** A QUALITY GATE
  reminder that runs `ss -tlnp | grep ':300[01]'` and tailors its recommendation
  based on whether the dev server is actually running
  (`npm test && npm run lint` vs `npm run build:strict`) is dramatically more
  useful than a static instruction. The harness already runs at the right
  moment; spend the 5ms to read state.
- **Tool-specific PostToolUse warnings.** Some tools have well-known
  blast-radius footguns: `vscode_renameSymbol` renames variable bindings but not
  object property keys, string literals, or related identifiers sharing a
  prefix. A targeted reminder fired _immediately after_ the rename is in the
  high-attention zone and catches the gotcha before the next commit. Generic "be
  careful with renames" warnings at session start do not.
- **Path-scoped PostToolUse reminders.** When the editing tool's `FILE_PATH`
  matches a glob (e.g., `apps/client/src/pages/`), inject a domain rule ("this
  is a client page — use BFF single-request, never chain second fetches"). The
  rule fires only on the relevant edits, so it doesn't bloat the context window
  for unrelated work.
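
The path-scoped refinement can be sketched in a few lines. The rule table and
the function name `reminder_for` are hypothetical; a real hook would read the
edited path from whatever payload its harness provides. The empty-string return
is the "silent on success" case.

```python
from fnmatch import fnmatch

# Hypothetical glob -> reminder table; the rule text mirrors the example above.
PATH_RULES = [
    (
        "apps/client/src/pages/*",
        "This is a client page: use BFF single-request, never chain second fetches.",
    ),
]


def reminder_for(file_path: str) -> str:
    """Return a reminder for the edited path, or "" (silent) when no rule matches."""
    for pattern, reminder in PATH_RULES:
        # fnmatch's `*` also matches `/`, so nested files under pages/ match.
        if fnmatch(file_path, pattern):
            return reminder
    return ""
```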

### 3.8 Trigger-word nudges (the positive-recommendation analog)

The enforcement hierarchy in §3.6 covers _blocking_ guidance. The mirror
discipline is **positive recommendation at the context tail**: a
`UserPromptSubmit` hook greps the user's incoming prompt for trigger words and
injects a one-line agent recommendation alongside the prompt.

Examples that work in practice:

- Hesitation / overthinking words ("wait", "actually", "hmm", "too complicated",
  "going in circles") → nudge toward a brainstorm agent.
- Debugging / investigation words ("why is this broken", "trace", "root cause",
  "regression") → nudge toward a research agent.

Three non-obvious design constraints:

1. **One nudge per topic.** Repeating the same nudge after a user declines
   trains them to filter it out. Track "nudge fired for topic X" so a declined
   recommendation stays declined.
2. **One sentence, non-intrusive.** A nudge that consumes 200 tokens is
   indistinguishable from spam. Format: _"NUDGE: \<one-line condition
   description\>. Consider \<action\> — one sentence, non-intrusive."_
3. **Context-tail injection, not AGENTS.md.** A nudge written into AGENTS.md
   decays to invisibility by tool call 20 (lost-in-the-middle). A
   `UserPromptSubmit` hook fires the nudge fresh at every turn, at the tail —
   where attention is highest.
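
A rough sketch of the first two constraints, assuming the hook can keep a small
set of already-fired topics for the session. The trigger table, the nudge
wording, and the function name `nudges_for` are illustrative, not taken from any
harness.

```python
import re

# Hypothetical trigger table: topic -> (pattern, one-line nudge).
TRIGGERS = {
    "brainstorm": (
        re.compile(r"\b(wait|actually|hmm|too complicated|going in circles)\b", re.I),
        "NUDGE: the prompt sounds hesitant. Consider the brainstorm agent.",
    ),
    "research": (
        re.compile(r"\b(why is this broken|trace|root cause|regression)\b", re.I),
        "NUDGE: this looks like an investigation. Consider the research agent.",
    ),
}


def nudges_for(prompt: str, already_fired: set[str]) -> list[str]:
    """Return at most one new nudge per topic and record it in already_fired."""
    out = []
    for topic, (pattern, nudge) in TRIGGERS.items():
        if topic not in already_fired and pattern.search(prompt):
            already_fired.add(topic)  # a declined nudge stays declined
            out.append(nudge)
    return out
```

Persisting `already_fired` across turns (a session file, for example) is left
out; without it the "one nudge per topic" rule only holds within one process.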

---

## 4. Context Engineering

### 4.1 Token budget allocation

Treat the context window as a budget, not a container. A rough allocation that
holds up across models:

| Region                 | Share  | Notes                                                         |
| ---------------------- | ------ | ------------------------------------------------------------- |
| System / agent rules   | 5–10%  | Stable, terse. Don't bloat with prose.                        |
| Memory / repo facts    | 5–15%  | Project conventions, prior decisions. Tier by relevance.      |
| Task description       | 2–5%   | Keep it boundary-defined and specific.                        |
| Retrieved code         | 30–50% | The biggest lever. Most agents over-retrieve.                 |
| Tool outputs / scratch | 20–40% | Compress aggressively; summarize old turns.                   |
| Headroom               | 10–20% | Leave room for the model's own output and at least one retry. |
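
As a worked example, the table can be turned into concrete token budgets. The
percentages below are one point inside each range, chosen so they sum to 100;
that choice is an assumption made for illustration, not a recommendation from
the table.

```python
# One point inside each range from the table above, summing to 100 (an assumption).
SHARES_PERCENT = {
    "system_rules": 7,
    "memory": 10,
    "task": 3,
    "retrieved_code": 40,
    "tool_outputs": 25,
    "headroom": 15,
}


def allocate(window_tokens: int) -> dict[str, int]:
    """Split a context window into per-region token budgets (integer math)."""
    assert sum(SHARES_PERCENT.values()) == 100
    return {region: window_tokens * pct // 100 for region, pct in SHARES_PERCENT.items()}


budgets = allocate(200_000)
print(budgets["retrieved_code"], budgets["headroom"])  # 80000 30000
```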

### 4.2 Retrieval

- **Repo maps** (Aider's approach): compress a codebase into a ranked outline of
  file/symbol declarations. Cheap, effective baseline. Still best-in-class for
  repos up to ~500k LOC.
- **AST-aware retrieval** beats line-based grep on identifier-driven queries.
- **Embedding retrieval** is _overrated_ for code. Symbol-graph and AST
  retrieval consistently beat dense embeddings on real coding tasks; the
  exception is natural-language docs and design notes.
- **Hybrid retrieval** (grep + symbol graph + light embedding for docs)
  outperforms any single approach.
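
To make "outline of declarations" concrete, here is a toy single-file outline
built on Python's standard `ast` module. It is a sketch of the idea only: it
does no ranking and handles only Python source, so it should not be read as
Aider's implementation. The function name `outline` is hypothetical.

```python
import ast


def outline(source: str, filename: str = "<module>") -> list[str]:
    """List class/function declarations, indented by nesting depth."""
    lines: list[str] = []

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Positional parameter names only; defaults and annotations dropped.
                args = ", ".join(a.arg for a in child.args.args)
                lines.append(f"{'  ' * depth}def {child.name}({args})")
                visit(child, depth + 1)
            elif isinstance(child, ast.ClassDef):
                lines.append(f"{'  ' * depth}class {child.name}")
                visit(child, depth + 1)

    visit(ast.parse(source, filename=filename), 0)
    return lines
```

The output is a fraction of the size of the file it describes, which is the
property that makes a repo map a cheap baseline.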

### 4.3 Memory tiering

Now-standard pattern (Claude Code, Cursor, OpenCode, GitHub Copilot all
converged on it):

- **Session memory:** scratch for the current task. Cleared at end.
- **Repo memory:** project conventions, verified facts, build commands.
- **User/global memory:** preferences across all projects.

Loading the right tier at the right time is more impactful than how much is
stored.

### 4.4 AGENTS.md: keep it small

An ETH Zurich evaluation of LLM-generated per-project AGENTS.md files found they
**increased API cost by 20% and added 14–22% reasoning tokens with no measurable
improvement in task success rate.** Bloated rule files fill the context window
with content irrelevant to the current task — a tax on every tool call for
marginal-to-negative benefit.

Practical ceiling: **roughly 60 lines of universally applicable constraints.**
Everything else belongs in:

- **Nested AGENTS.md** at the directory it applies to (loaded only when that
  scope is active in most agent tools).
- **Skills** loaded on demand by a routing description.
- **Hooks** at the relevant tool-call boundary.
- **AGENTS.md stubs** — one-line trigger conditions with `read_file`
  instructions, so the body loads only when the trigger fires.

The pattern: **anti-patterns matter more than positive instructions.** A 60-line
AGENTS.md of "do not do X" rules outperforms a 600-line one full of
best-practice prose. This matches the asymmetric effort that frontier labs put
into negative instruction (visible in leaked system prompts).

### 4.5 Just-in-time retrieval and structured notes

Anthropic's
[Sep 2025 context-engineering article](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
formalized two patterns that now define the state of the art:

**Just-in-time retrieval.** Rather than loading all potentially relevant content
at session start, agents hold _lightweight references_ (file paths, query
strings, identifiers) and load data on demand. Claude Code's reliance on
glob/grep over upfront file dumps is the canonical example. The instruction
version for agent bodies: **"Hold references; load on demand. Do not read files
you don't need yet."**

**Structured note-taking (agentic memory).** For tasks spanning tens of tool
calls or multiple context windows, agents should write progress to a file (e.g.
`NOTES.md`) and read it back at context-reset boundaries. Properties:

- **Structured for state** — JSON/checklist for completion tracking.
- **Freeform for progress** — natural language for context and open questions.
- **Write-first incentive** — "record completion of step 1 before reading files
  for step 2" is structurally more honest than reading-first, because the model
  cannot write a truthful note about uncompleted work.

Note files survive compaction. If a `PreCompact` hook copies the working
NOTES.md into session-persistent storage before summarization, a context
overflow mid-task becomes a resume, not a restart.
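
One possible shape for the "structured for state, freeform for progress" split
is sketched below. The sidecar file name (`NOTES.state.json`) and the helper
`record_step` are hypothetical; the source only specifies that state is
structured and progress is natural language.

```python
import json
from pathlib import Path


def record_step(notes_path: Path, step: str, summary: str) -> dict:
    """Mark a step done in a JSON state file and append a freeform progress line."""
    state_path = notes_path.with_suffix(".state.json")  # hypothetical sidecar file
    state = json.loads(state_path.read_text()) if state_path.exists() else {"done": []}
    if step not in state["done"]:
        state["done"].append(step)
    state_path.write_text(json.dumps(state, indent=2))
    # Freeform half: a human-readable line the agent rereads after a context reset.
    with notes_path.open("a", encoding="utf-8") as notes:
        notes.write(f"- [x] {step}: {summary}\n")
    return state
```

Calling `record_step(Path("NOTES.md"), "step-1", "read the failing test")`
before any step-2 reads is the write-first ordering described above.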

**Investigation / exploration files as durable handoff artifacts.** For work
that spans multiple sessions or agents, NOTES.md is too ephemeral. A structured
`docs/explorations/<name>.md` file with a fixed schema (Status / Question / What
We Know / Hypotheses / Investigation Log / Open Questions) is the cross-session
equivalent. Three benefits:

1. **Agent handoff without state loss.** A brainstorm agent producing an
   exploration file can hand off to a research agent (or the default
   implementation agent) by name — the file is the contract, not the chat
   transcript.
2. **Status field as routing signal.**
   `Status: brainstorming | exploring | prototyping | decided | abandoned` lets
   the next agent (or the next user) immediately know whether to diverge
   further, dig deeper, or build.
3. **Compaction-safe.** Even if every conversational turn is summarized away,
   the file is reread at session start by a `SessionStart` hook that surfaces
   active investigations.

NOTES.md and exploration files are complementary: NOTES.md is the agent's
working memory for _this task_; the exploration file is the project's durable
record of _this question_.

**Timing awareness as an agent blind spot.** Agents have no innate sense of how
long a command takes. A casual suggestion to "just run the full test suite"
might be a 2-second hit or a 5-minute one, and the agent has no basis for
telling which. Effective mitigations:

- Prefix unknown commands with `time` until a baseline is observed.
- Capture significant output to `/tmp/<descriptive>.txt` so grep can re-run
  cheaply without re-executing the slow command.
- Stash baselines in repo memory (`/memories/repo/timings.md`) once observed, so
  future sessions don't re-measure.
- Feed timing back into triage: a `<5s` command is nearly free to "just run"; a
  `>30s` command warrants reasoning first.
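
The triage rule in the last bullet can be written down directly. The 5s and 30s
thresholds come from the text; the middle band label and both function names are
assumptions added for the sketch.

```python
import subprocess
import time

FAST_SECONDS = 5.0   # "<5s" threshold from the bullets above
SLOW_SECONDS = 30.0  # ">30s" threshold from the bullets above


def triage(seconds: float) -> str:
    """Map an observed duration to a run policy."""
    if seconds < FAST_SECONDS:
        return "just run"
    if seconds > SLOW_SECONDS:
        return "reason first"
    return "run when needed"  # middle band: an assumption, not from the source


def measure(argv: list[str]) -> float:
    """Run a command once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=False, capture_output=True)
    return time.perf_counter() - start
```

Writing `triage(measure(cmd))` results into the repo-memory timings file is what
lets later sessions skip the measurement.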

### 4.6 Sequential constraint ordering: a stubborn failure

A narrow but instructive case: the user writes "Do X first. Then Y. Then Z." and
the agent immediately reads all files for X, Y, _and_ Z upfront, often blowing
the context budget before step 1 begins.

**Root cause is not a prompt problem; it's a context-engineering problem.** RLHF
training data contains overwhelming examples of "gather context, then act" — the
model has a strong _pre-task exploration bias_ that competes with the user's
ordering constraint and usually wins after a few tool calls. Stronger negative
phrasing ("DO NOT read all files first!") reliably loses to this trained
behavioral prior.

What works, in descending order of effectiveness:

1. **NOTES.md write-first pattern.** Structure as: "Complete step 1. Write what
   you found to NOTES.md. Then read NOTES.md and proceed to step 2." The model
   cannot write a truthful note about step 1 without doing step 1, which
   serializes the work.
2. **Imperative checkpoints.** "Say `STEP 1 DONE` before continuing" — the
   verbalization marker creates a natural serialization point.
3. **Hard step caps in the harness** (e.g., OpenCode's `steps: 20` + `ask`
   gates). Caps in the _prompt_ are interpreted as suggestions.
4. **Sub-agent fan-out for parallel-safe tasks** — one sub-agent per file, each
   with isolated context. Doesn't help strictly sequential tasks.

What does **not** work: negative constraints ("do not read all files"), repeated
reminders (degrade quickly), or soft caps embedded in the prompt.

### 4.6a Conditional vs Imperative Prompt Design

> **Status:** Research synthesis. Captures an empirical finding from agent
> prompt analysis and its implications for prompt design.
>
> **Audience:** Engineers designing agent system prompts, AGENTS.md files,
> hook scripts, and enforcement layers.

---

#### The Problem: Conditional Steps Let Models Skip

A 328-line research agent prompt was analyzed for structural patterns and found
to be **60% conditional** — the majority of its instructions took the form
"when X, do Y." The downstream consequence: the model routinely exercised
discretion to decide X didn't apply, silently skipping entire sections of the
prompt. The agent was not failing to follow instructions; it was following
conditional instructions by choosing the branch that required less work.

This is not a model bug — it is a prompt design failure. Conditional steps hand
the model a discretionary on-ramp to skip compliance. The model's optimization
function is "complete the user's task efficiently," not "follow every step of
the prompt verbatim." When a step says "when X, do Y," the model's first
question is "does X hold?" — and it has strong incentives to answer "no."
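
A crude way to estimate how conditional a prompt is, offered as a sketch only.
The heuristic (an instruction counts as conditional if its line opens with
when/whenever/if/unless) and the name `conditional_share` are assumptions; the
source does not say how the 60% figure was measured.

```python
import re

# Hypothetical heuristic: a line is "conditional" if it opens with a gate word,
# optionally preceded by list or emphasis markers.
CONDITIONAL = re.compile(r"^\W*(when|whenever|if|unless)\b", re.I)


def conditional_share(prompt: str) -> float:
    """Fraction of non-empty, non-heading lines that open with a conditional gate."""
    lines = [ln.strip() for ln in prompt.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if not lines:
        return 0.0
    return sum(1 for ln in lines if CONDITIONAL.match(ln)) / len(lines)
```

A line-based count misses mid-sentence conditions and wrapped paragraphs, so
treat the number as a smell detector rather than a measurement.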

---

#### Conditional vs Imperative: The Contrast

**Conditional pattern (fragile):**

> "When you encounter a test failure, first read the failing test, then check
> the relevant source file."

What happens: the model declares "I already know what's wrong" and skips
straight to editing. X = "encounter a test failure" is interpreted narrowly —
the model has encountered the _error output_, not the _test file_, so the
condition is not met.

**Imperative pattern (robust):**

> "Read the failing test. Then check the relevant source file."

What happens: the model reads the test before any other action. There is no
condition to evaluate, no discretion to exercise.

The difference is structural, not semantic. Both express the same intent; only
the imperative form removes the model's ability to opt out.

---

#### Why Conditionals Fail

Three mechanisms operate simultaneously:

1. **Discretion by design.** A conditional step contains a gate ("when X") that
   the model must evaluate. Evaluation requires judgment, and judgment is
   exercised toward the path of least effort. The model is not being lazy; it is
   optimizing for task completion, not process compliance.

2. **Narrow interpretation of conditions.** The model interprets conditionals
   narrowly to justify skipping them. "When you encounter a test failure" means
   "when you have the test file open," not "when the test output is in context."
   The condition becomes a self-fulfilling prophecy: the step is skipped because
   the condition is defined to require the step's output.

3. **Efficiency optimization over process compliance.** The model's training
   objective is to produce useful outputs, not to follow process. A conditional
   step gives the model a legitimate-sounding rationale for skipping a step it
   judges unnecessary — and the model is usually right that the step is
   unnecessary for that specific case, which reinforces the skipping behavior.

---

#### The Fix

Three complementary strategies, ordered by reliability:

**1. Make instructions imperative.**

Replace every "when X, do Y" with "do Y." The model executes the step regardless
of its judgment about whether it's needed. This is the single highest-leverage
change to an agent prompt — converting conditionals to imperatives reduces
skipped steps dramatically.

Example transformation:

| Before (conditional)                                 | After (imperative)                            |
| ---------------------------------------------------- | --------------------------------------------- |
| "When editing a use case, check for `throw`"         | "Check for `throw` before editing a use case" |
| "If the build fails, read the error first"           | "Read the build error before any edit"        |
| "When you see a TODO, resolve it"                    | "Resolve every TODO you encounter"            |
| "If the test output mentions a file, read that file" | "Read the file mentioned in the test output"  |

**2. Move genuine conditions to PreToolUse hooks.**

Some constraints are genuinely conditional — "block `npx` but allow `npm`" —
and conditional logic in the prompt is the wrong place for them. PreToolUse
hooks are structural enforcement: they fire on every tool call, evaluate the
condition deterministically, and deny before the model can opt out. The
condition is still evaluated, but the evaluation is in code, not in the model's
discretion.

This maps directly to the enforcement hierarchy (§3.6): **must-do constraints
belong in hooks** where they are structural and inescapable; **should-do
process steps belong imperative in the prompt** where the model has no
discretion to skip them.

**3. Add commit phrases ("Say STEP 1 DONE").**

For multi-step processes where the model must acknowledge completion of each
step before proceeding, add explicit acknowledgment phrases. The pattern:

> "Read the failing test. Say TEST READ DONE. Then check the relevant source
> file. Say SOURCE READ DONE."

Why this works: the acknowledgment phrase creates a visible boundary. Skipping
the preceding step now requires the model to emit a false acknowledgment, which
is a much higher bar than silently deciding a condition did not apply, and the
phrase itself costs only a few tokens. This is a lightweight form of
chain-of-thought verification that doesn't rely on self-critique (which Huang et
al. show is unreliable).

---

#### Tie to the Enforcement Hierarchy

The enforcement hierarchy from §3.6 provides the decision rule for where
conditional logic belongs:

```
Permission-layer denial     ← Tool not available. No discretion.
PreToolUse hard block       ← Structural. Condition evaluated in code.
PostToolUse path-check      ← Fires after the action. Context tail.
Nested AGENTS.md at path    ← Always-on for scope. No condition evaluation.
Stop / SessionStart inject  ← Broad reminders. Degrades under context pressure.
Root AGENTS.md sections     ← Context-start only. Degraded by lost-in-the-middle.
```

Conditional instructions in the prompt occupy the weakest position in this
hierarchy: they sit in the root AGENTS.md, fire once at session start, and
require the model to evaluate a condition — exactly the setup for
lost-in-the-middle degradation combined with discretionary skipping.

**The decision rule:**

- If the constraint **must hold** regardless of model judgment (no `npx`, no
  `throw`, no edits to generated files), it belongs in a hook — PreToolUse or
  permission-layer denial. The condition is evaluated in code, not by the model.
- If the constraint is a **process step** that should always execute (read the
  test, check for `throw`, resolve TODOs), it belongs imperative in the prompt —
  no condition, no discretion.
- If the constraint is a **recommendation** that depends on context (use BFF
  pattern for client pages), it belongs in a PostToolUse path-check — fires at
  the right moment, in the high-attention context tail, scoped to the relevant
  path.

Conditionals in prompts are a design smell. They indicate the author is trying
to use the weakest enforcement mechanism for a constraint that should live in a
stronger layer.

### 4.7 Compaction strategy

The Anthropic guidance, replicated independently elsewhere: **first maximize
recall (capture every relevant piece of context), then improve precision
(eliminate superfluous content).** A summary that drops a critical fact is worse
than a summary that is slightly too long. Iterate on the compaction prompt
itself, treating it as a small distinct prompt-engineering task.

The safest first-pass compaction target is **stale tool outputs**: raw file
contents or command outputs whose information has already been acted on. The
assistant's response citing them stays; the 500-token file dump does not.

For harnesses with a `PreCompact` hook: this is the right place to append open
todos, active hypotheses, or in-progress file paths to the input so the summary
preserves them.

**Anchored summary schema.** The most reliable production compaction prompt is
not free-form — it's a fixed Markdown skeleton with the original prompt
preserved verbatim, plus structured sections for clarifications, constraints,
progress, decisions, and next steps. A representative shape:

```markdown
## Original Prompt

- [the user's first prompt, verbatim]

## Clarifications

- [follow-up that refined the original]

## Constraints & Preferences

- [user constraints or "(none)"]

## Progress

### Done / In Progress / Blocked

## Key Decisions

- [decision and why]

## Next Steps

- [ordered actions]

## Critical Context

- [errors, open questions, technical facts]

## Relevant Files

- [path: why it matters]
```

Three properties that make this work:

1. **Verbatim original prompt.** The single most common compaction failure is
   drift away from the user's actual ask. Anchoring the verbatim text resists
   this.
2. **Empty sections kept.** "(none)" beats omission — the agent post-compaction
   can tell whether "no blockers" is a fact or an oversight.
3. **Bullets, not prose.** Compaction prose tends to drop facts under token
   pressure; structured bullets degrade more gracefully.
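
The three properties can be enforced mechanically rather than left to the
summarizing model. The sketch below renders the skeleton from structured input;
the function name `render_summary` and the flat bullet treatment of the Progress
section are simplifying assumptions.

```python
# Section titles follow the skeleton above.
SECTIONS = [
    "Original Prompt", "Clarifications", "Constraints & Preferences", "Progress",
    "Key Decisions", "Next Steps", "Critical Context", "Relevant Files",
]


def render_summary(original_prompt: str, sections: dict[str, list[str]]) -> str:
    """Render the fixed skeleton; empty sections are kept and marked "(none)"."""
    parts = []
    for title in SECTIONS:
        # The original prompt is passed through untouched, never re-summarized.
        items = [original_prompt] if title == "Original Prompt" else sections.get(title, [])
        body = "\n".join(f"- {item}" for item in items) or "- (none)"
        parts.append(f"## {title}\n\n{body}")
    return "\n\n".join(parts) + "\n"
```

One way to use it: have the compaction model return only the per-section
bullets, then let the harness assemble the final summary so the verbatim prompt
and the "(none)" markers cannot be dropped.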

### 4.8 Attention engineering

A subset of context engineering, focused on _where_ in the context tokens land.
Practical heuristics:

- Task-critical content goes at the **tail** of the context (recency bias is
  strong and consistent across models).
- Rules and constraints repeat at both ends — they are forgotten from the
  middle.
- Long tool outputs should be **summarized in place** once stale rather than
  scrolled away. The original is gone from effective attention either way; a
  summary preserves the salient bits.

---

## 5. Tools, Skills, and Specs

### 5.1 The minimalist consensus

The empirically dominant tool set for coding agents has converged to roughly six
primitives:

1. **Read file** (with line ranges)
2. **Edit file** (string-replace or patch)
3. **Search** (grep / regex)
4. **Find files** (glob)
5. **Shell** (bounded, optionally sandboxed)
6. **Todo list** (or equivalent state tracker)

Plus, depending on agent surface:

7. **Subagent / task spawner** (for read-only exploration)
8. **Web fetch** (for docs lookup)
9. **Memory** (read/write the tier hierarchy)

### 5.2 What got absorbed

Tools that were once distinct but are now redundant given a capable shell:

- `create_file`, `delete_file`, `list_dir`, `move_file` — all expressible
  through edit/shell, and modern models reliably emit the shell forms.
- Language-specific linters/formatters — better invoked through shell with the
  project's actual configuration.
- Dedicated test runners — same.

Tools that were _supposed_ to win but didn't:

- Browser-automation tools as a default. Useful for frontend verification,
  rarely critical otherwise.
- "Code interpreter" sandboxes as a separate tool from shell. Now usually
  unified.

### 5.3 What's still genuinely needed beyond shell

- **Structured edits.** `sed -i` and `awk` corrupt files often enough that every
  serious harness ships a dedicated string-replace or patch tool with whitespace
  fidelity. This is the single tool that justifies its existence most clearly.
- **Todo tracking.** Could be a file, but a first-class tool gives the harness a
  UI surface and gives the verify step a checklist.
- **Subagent spawning** with isolated context. Cannot be expressed as shell.

### 5.4 Tool-count thresholds

Empirical finding (replicated across Anthropic, OpenAI, and independent
research): **agent performance degrades sharply once the tool list exceeds
roughly 40–50 tools.** The model spends attention on tool selection rather than
the task. Mitigations:

- **Tool grouping / lazy loading.** Surface only relevant tools per phase.
- **MCP-style tool servers** that present a small façade and route internally.
- **Code-execution-as-tooling** (Anthropic's "code as tools" approach, Cursor's
  similar pattern): expose tools as a small API the model writes code against,
  rather than as dozens of discrete function-call schemas. Drastically reduces
  tool-selection overhead for large tool surfaces.
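
Tool grouping per phase can be as simple as a lookup table with a cap. The phase
names, tool names, and the helper `tools_for` below are hypothetical; the cap
uses the lower end of the threshold cited above.

```python
# Hypothetical phase -> tool-name mapping; names loosely follow the primitives in §5.1.
CORE = ["read_file", "search", "find_files", "todo"]
PHASE_TOOLS = {
    "explore": ["web_fetch", "subagent"],
    "implement": ["edit_file", "shell"],
    "verify": ["shell"],
}
MAX_TOOLS = 40  # lower end of the threshold cited above


def tools_for(phase: str) -> list[str]:
    """Return the tool names to surface for one phase, capped at MAX_TOOLS."""
    names = CORE + PHASE_TOOLS.get(phase, [])
    unique = list(dict.fromkeys(names))  # de-duplicate while keeping order
    if len(unique) > MAX_TOOLS:
        raise ValueError(f"{len(unique)} tools exceeds cap of {MAX_TOOLS}")
    return unique
```

Failing loudly at the cap, rather than truncating, keeps an over-grown tool list
from silently dropping whichever tools happen to sort last.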

### 5.5 Skills and the SKILL.md convention

**Skills** are bounded, on-demand instruction packets — a `SKILL.md` file with a
`description:` frontmatter field that the model reads in the tool/skill list,
plus a body the model loads when it judges the skill relevant. They are the
answer to "how do I avoid loading my entire methodology library upfront?"

The format has stabilized as a community standard, with the **skills.sh**
registry (Vercel Labs, 2025) as a public distribution channel: Anthropic's
`frontend-design` skill (≈367k installs), `skill-creator`, Vercel's
React/composition skills, Supabase's Postgres skills. Install via
`npx skills add <owner>/<repo>`. Treat installed skills like third-party npm
packages: review before using.

Key principles for authoring skills:

- **Progressive disclosure.** A debugging skill loaded into a refactoring
  request is context pollution. Skills load at invocation time, not session
  start.
- **Create reactively.** The right trigger for a new skill is _"the agent failed
  this same task type twice."_ Anticipatory skill creation is premature context
  inflation.
- **Methodologies, not project rules.** Project-specific rules go in nested
  AGENTS.md; reusable methodologies (how to research, how to brainstorm) go in
  skills.

**Skills vs Hooks — diagnostic guide.** The two layers are complementary, not
competing: a skill triggers → the model reads it → the model acts → a hook
validates the action → the model corrects if the hook exits non-zero.

|                      | Skills                                            | Hooks                                       |
| -------------------- | ------------------------------------------------- | ------------------------------------------- |
| **Layer**            | Context Engineering                               | Harness Engineering                         |
| **What it is**       | Progressive disclosure of task-specific knowledge | Deterministic event-triggered execution     |
| **Loaded when**      | Task type activates it (on demand)                | Tool-call boundaries (always)               |
| **Activated by**     | Model routing decision                            | System event (pre/post-tool, session start) |
| **Failure mode**     | Pollutes context if loaded too broadly            | Breaks agent loop if too noisy              |
| **Success behavior** | Silent — enriches context                         | Silent — only speaks on failure             |
| **Create when**      | Agent fails same task type twice                  | Need deterministic enforcement              |

If in doubt: use a hook when the rule _must_ hold regardless of model judgment;
use a skill when the rule only applies to a specific task type that the model
should route into.

### 5.6 Spec-driven development (OpenSpec)

[OpenSpec](https://openspec.dev) (Fission AI, 2025) introduced a workflow where
machine-readable specs (RFC 2119 SHALL/SHOULD/MAY + Gherkin scenarios) live
alongside code, and each PR produces a "spec delta" showing requirement changes
next to the diff. Supported by Claude Code, Cursor, Copilot, Codex, and 16+
tools.

The valid critique — _"isn't this just waterfall?"_ — OpenSpec answers cleanly:
the spec is not meant to be complete before coding starts; it's _co-evolved_
with the code. "Good enough plan + update as you go" is the Agile reading. This
is the same plan-then-iterate pattern from §3.2 applied at the requirement level
rather than the function level.

When it helps: features with complex, multi-stakeholder requirements where code
review benefits from being intent-first rather than diff-first. When it doesn't:
infrastructure work, one-off scripts, or codebases where intent is adequately
captured by tests.

### 5.7 MCP as portable deferred loading

The Model Context Protocol (MCP) has emerged as the cross-tool standard for two
deferred-loading patterns that previously required tool-specific machinery:

- **MCP tools** ↔ **skills.** A tool description is the routing signal; the
  model decides whether to invoke. This is what VS Code Copilot's
  `SkillsContextComputer` does internally with file-based
  `.github/skills/<name>/SKILL.md`, but MCP makes it portable.
- **MCP prompts** ↔ **instructions / slash commands.** Exposed via
  `prompts/list`; bodies load only at invocation. The portable equivalent of
  Copilot's `InstructionsContextComputer` behavior for `description:`-only
  `.instructions.md` files.

Practical implication: **prefer MCP tools/prompts over tool-specific
deferred-loading mechanisms** when targeting multiple harnesses. A
`description:`-only `.instructions.md` file is deferred-loaded in Copilot but
becomes always-on context pollution everywhere else. MCP avoids that asymmetry.

The protocol does not yet have lifecycle hooks (session start, post-tool-use,
session end). Active work — SEP-2624 (Interceptors, formal working group with
Bloomberg + Saxo Bank engineers) and SEP-2282 (server-declared behavioral hooks)
— aims to close this gap in upcoming spec revisions. Until then,
session-lifecycle behavior lives in harness-specific plugin layers (OpenCode
plugins, Copilot hooks).
|
||
|
||
---
|
||
|
||
## 6. Local Agents and Models
|
||
|
||
### 6.1 When local makes sense
|
||
|
||
- **Confidentiality:** code or data that cannot leave the network.
|
||
- **Cost at scale:** sustained heavy agent use (millions of tokens/day per
|
||
developer) eventually beats API pricing on amortized hardware.
|
||
- **Customization:** fine-tuning on house style, internal frameworks, or
|
||
domain-specific patterns.
|
||
- **Offline / air-gapped.**
|
||
|
||
When local does **not** make sense: occasional use, capability-frontier work,
|
||
single developers without dedicated hardware. The opportunity cost of slower,
|
||
weaker output usually exceeds API costs.
|
||
|
||
### 6.2 Hardware reality (mid-2026)

| VRAM     | Practical ceiling for coding-grade quality                             |
| -------- | ---------------------------------------------------------------------- |
| 24 GB    | Q4 of 30–32B dense, or Q4 of 30B-A3B MoE. Usable for narrow subagents. |
| 48 GB    | Q4 70B dense, Q5–Q6 32B dense, MoE up to ~100B total params at Q4.     |
| 80 GB    | Q8 70B dense, Q4–Q5 of 200B+ MoE.                                      |
| 2× 80 GB | Frontier open-weight MoE (DeepSeek-V3, Qwen3-Coder-480B) at Q4–Q5.     |

Apple Silicon with unified memory (128–512 GB) is a credible alternative for MoE
inference, where bandwidth, not raw FLOPs, dominates. NVIDIA still leads on
prompt processing throughput.

### 6.3 Quantization

Updated rules of thumb (the conventional wisdom from 2023 — "Q4 is fine" — has
been refined considerably):

- **FP16 / BF16:** reference quality.
- **Q8 / FP8:** indistinguishable from FP16 in practice for coding tasks.
  Default if memory permits. GGUF Q8_0 loses roughly 0.1–0.3% on most benchmarks
  versus BF16 — not a meaningful degradation vector by itself.
- **Q6_K:** the practical sweet spot. ≤1% quality loss on coding benchmarks for
  ≥30B models.
- **Q5_K_M:** acceptable for ≥30B. Visible degradation below 14B.
- **Q4_K_M:** the lowest viable quant for serious coding agents on ≥30B models.
  Below this, tool-call fidelity collapses faster than raw output quality.
- **AWQ / GPTQ:** for GPU-only inference, often higher quality than equivalent
  GGUF Q4 due to per-channel calibration.
- **KV-cache quantization (Q8 KV) is often higher-leverage than weight
  quantization** for long-context coding tasks. Underused; under-documented in
  2024-era guides. **Critical reality:** with FP16 KV cache, a 9B model at 32k
  context burns ≈4 GB just for KV — the KV cache, not weight precision, is the
  dominant runtime memory constraint at long contexts. Quantize it.

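The KV-cache figure in the last bullet can be sanity-checked with a short calculation. The sketch below assumes a grouped-query-attention layout with 36 layers, 8 KV heads, and a head dimension of 128; these numbers are illustrative assumptions for an 8–9B model, not taken from a specific model card. It also treats Q8 as exactly one byte per element, ignoring the small per-block scale overhead that real Q8 formats carry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem):
    """KV cache size: keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem

# Assumed 8-9B GQA configuration (hypothetical, for illustration only).
cfg = dict(n_layers=36, n_kv_heads=8, head_dim=128, context_tokens=32 * 1024)

fp16_gib = kv_cache_bytes(**cfg, bytes_per_elem=2) / 2**30
q8_gib = kv_cache_bytes(**cfg, bytes_per_elem=1) / 2**30  # ignores block-scale overhead
print(f"FP16 KV: {fp16_gib:.2f} GiB, Q8 KV: {q8_gib:.2f} GiB")
```

Under these assumptions the FP16 cache comes to 4.5 GiB at 32k context and Q8 halves it, which is in line with the ≈4 GB estimate above. Models with more KV heads or layers scale linearly from there.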
### 6.4 Small-model failure modes and harness mitigations

For any agent driving a ≤14B model (quantized or not), the failure surface is
distinct from frontier models. The model's _parameter count_ is the primary
cause; quantization is a minor amplifier. The most important patterns:

**Instruction drift past ~12k tokens.** Rules stated in the system prompt hold
for the first 5–10 tool calls, then erode. Smaller models have fewer attention
heads (Qwen3-8B: 32 heads vs Qwen3-32B's 64), so per-token attention fidelity
degrades faster as context length grows. Mitigations:

- **Tool-response history pruning** (PostToolUse hook). Once a tool result has
  been acted on, clear its raw content; keep the assistant's citation. The
  single highest-leverage harness change for small models.
- **Compaction trigger at 60% fill** (not the default 80–90%). Small models hit
  the quality cliff earlier; aggressive compaction keeps each window shorter and
  fresher.
- **Periodic system-prompt echo.** Every N tool calls, inject the 3 most
  critical rules at the context tail as a `<reminder>` block.

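The first two mitigations can be sketched as pure functions over a message list. This is a minimal illustration, assuming a chat history of `{"role", "content"}` dicts in which tool results carry `role == "tool"`; the placeholder string, the `keep_last` parameter, and both function names are hypothetical and not tied to any particular harness API.

```python
from typing import Dict, List

PLACEHOLDER = "[tool output pruned after use]"  # hypothetical stub text

def prune_tool_results(messages: List[Dict[str, str]], keep_last: int = 1) -> List[Dict[str, str]]:
    """Return a copy in which all but the newest `keep_last` tool results are stubbed out."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        {**m, "content": PLACEHOLDER} if i in stale else dict(m)
        for i, m in enumerate(messages)
    ]

def should_compact(used_tokens: int, window_tokens: int, threshold: float = 0.60) -> bool:
    """Early compaction trigger: fire at 60% fill rather than the usual 80-90%."""
    return used_tokens >= threshold * window_tokens
```

Assistant turns are left untouched, so whatever the model cited from a pruned result stays in context. A real hook would also need some signal that a result has actually been acted on; here "not the most recent" stands in for that.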
**Tool-call JSON malformation.** Smaller models have narrower "format channels"
— less capacity to track content and strict syntax simultaneously, especially in
long contexts. Mitigations:

- **PreToolUse JSON validation with schema-specific errors.** Generic errors
  ("invalid tool call") cause retry loops; schema-specific errors guide
  correction:
  ```
  Tool call JSON was invalid at position 47 (unexpected comma).
  Required schema: {"path": string, "limit": number}
  ```
- **Grammar-constrained decoding.** GBNF (llama.cpp), Outlines, or
  lm-format-enforcer pin generation to a valid schema at the decode step. More
  reliable than re-prompting.
- **Trim tool responses to minimum fields.** For `read_file`, return content and
  line range, not metadata. Fewer tokens per response = less schema to track in
  working memory.

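A schema-specific validator of the kind described in the first bullet might look like the sketch below. It relies only on Python's standard `json` module, whose `JSONDecodeError` exposes the failure offset (`pos`) and reason (`msg`). The `READ_FILE_SCHEMA` mapping and the `validate_tool_call` function are hypothetical names introduced for illustration; a real harness would likely validate against a full JSON Schema instead.

```python
import json

# Hypothetical schema for a `read_file` tool: field name -> expected Python type.
READ_FILE_SCHEMA = {"path": str, "limit": int}

def validate_tool_call(raw, schema):
    """Return None when the call is valid, otherwise a schema-specific error string."""
    shape = "{" + ", ".join(f'"{k}": {t.__name__}' for k, t in schema.items()) + "}"
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset where parsing failed; exc.msg says why.
        return (f"Tool call JSON was invalid at position {exc.pos} ({exc.msg}). "
                f"Required schema: {shape}")
    if not isinstance(args, dict):
        return f"Tool call arguments must be a JSON object. Required schema: {shape}"
    for key, expected in schema.items():
        if key not in args:
            return f'Missing required field "{key}". Required schema: {shape}'
        if type(args[key]) is not expected:  # exact match, so True is not accepted as int
            return f'Field "{key}" must be {expected.__name__}. Required schema: {shape}'
    return None
```

Every error path names the field or offset at fault and repeats the expected shape, so the model gets a correction target rather than a bare rejection.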
**Tool-selection errors past ~15 tools.** Working memory for "which tools exist"
degrades faster than for frontier models. Mitigations: minimum viable tool set;
consistent tool-name prefixes (`file_read`, `file_write`, `file_search`);
PreToolUse name validation that returns the available list on a miss.

**Think-block runaway.** Reasoning-trained small models can emit 2k–5k token
`<think>` blocks for a tool call that needed 50 tokens of reasoning. In a 32k
context, this consumes budget faster than tool outputs. Mitigations:
`num_predict` cap (e.g., 2048) in the modelfile; observability hooks that log
think-block length and flag outliers.

**Context-window cliff at ~20k+.** Output quality drops noticeably (not
catastrophically) past 60–70% fill on a 32k model — the pre-training data was
likely concentrated in shorter sequences. Mitigations: **context-pressure
injection** at ≥70% fill — the harness mechanically prepends:

```
[CONTEXT PRESSURE: ~70% full. Be concise. Prefer targeted tool calls over
broad ones. Write current progress to NOTES.md before proceeding.]
```

plus the early-compaction trigger above.

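The context-pressure injection can be sketched as a function the harness calls before forwarding each message. Token counts are assumed to come from the harness's own tokenizer; `pressure_notice` and `with_pressure` are hypothetical names, and the 70% threshold mirrors the figure in the text.

```python
def pressure_notice(used_tokens, window_tokens, threshold=0.70):
    """Return the pressure banner once fill reaches the threshold, else None."""
    fill = used_tokens / window_tokens
    if fill < threshold:
        return None
    pct = int(fill * 100)
    return (f"[CONTEXT PRESSURE: ~{pct}% full. Be concise. Prefer targeted tool calls over "
            "broad ones. Write current progress to NOTES.md before proceeding.]")

def with_pressure(message, used_tokens, window_tokens):
    """Prepend the banner to the outgoing message when the context is under pressure."""
    notice = pressure_notice(used_tokens, window_tokens)
    return message if notice is None else f"{notice}\n\n{message}"
```

Because the check is mechanical, it keeps working even after the model has stopped attending to head-of-context instructions about brevity.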
**Training-distribution mismatch.** Most open-weight coding models are heavily
Python/JavaScript. TypeScript-specific patterns (generic constraints,
conditional types, module augmentation, `satisfies`, complex inference) are less
reliable than equivalent Python. Mitigation: SYSTEM directives that force
grounding ("read `tsconfig.json` before asserting TypeScript configuration";
"read existing type definitions before suggesting new ones"), plus
explore-subagent delegation for type-heavy work to isolate the exploration to a
fresh context window.

**Prompt ambiguity → wrong directory (parametric knowledge conflict).** Small
models with narrower training distributions resolve ambiguous nouns ("the five
hook files") to the most common referent in their training data (`.husky/` for
"hook files" in a Node.js repo) rather than the project-specific one
(`.agents/hooks/`). The correct files may appear in tool output but not be
selected. This is a specific instance of **parametric knowledge conflict**: the
model's trained association competes with project-specific context and
frequently wins when prior confidence is high.

Prompt engineering is a subpar fix here. Telling the model "hook files means
`.agents/hooks/`" in AGENTS.md loses to a strong trained prior, especially under
context pressure (lost-in-the-middle degrades instruction recall). Two bodies of
research clarify why and what works instead:

- **ClashEval (Wu, Wu, Zou 2024, arXiv:2404.10198)** benchmarks this exact
  tug-of-war across six LLMs. Key finding: the less confident a model is in its
  prior, the more likely it is to defer to retrieved context. Corollary:
  _specific, concrete contextual evidence_ is far more effective at overriding a
  prior than an instruction to prefer context. A file listing showing the actual
  paths removes the model's need to resolve the ambiguous noun at all.

- **Onoe et al. (ACL 2023, arXiv:2305.01651)** study knowledge propagation in
  LLMs. Finding: gradient-based fine-tuning on new facts ("for this project,
  hook files are in `.agents/hooks/`") shows little propagation — the injected
  fact does not generalize to new usage patterns. **Prepending entity
  definitions in context outperforms parameter-level injection across all
  settings.** The practical instruction: inject evidence, don't update weights.

**What works, in order of effectiveness:**

1. **Context grounding via automatic structure injection.** A `UserPromptSubmit`
   hook that appends a `<project-file-map>` block to every build-local prompt —
   listing actual files under `.agents/`, `.opencode/`, and other
   project-specific directories — removes the ambiguity entirely. The model sees
   real paths; the trained prior is not consulted. This is the harness analog of
   Aider's repo-map (Gauthier 2023), which injects a compressed AST-derived
   structure map with every request for the same reason. Implementation: the
   hook runs `find .agents -name "*.sh" -o -name "*.md" | sort` and appends the
   result as a structured block at the prompt tail.

2. **Automatic disambiguation expansion.** When the hook detects category nouns
   ("hook", "config", "agent") without an explicit path in the user's prompt,
   expand the noun inline before the model sees it. Example: "the hook files" →
   "the hook files (`.agents/hooks/pre-tool-use.sh`,
   `.agents/hooks/post-tool-use.sh`, ...)". This converts a high-confidence
   prior lookup into a zero-ambiguity ground truth.

3. **Explicit path in user prompts.** Still useful as a secondary layer, but
   should not be the _only_ mitigation. Include the explicit path when writing
   build-local tasks ("the `.agents/hooks/*.sh` files"). Do not rely on the
   model inferring project conventions from context alone.

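The structure-injection hook in item 1 can be approximated in a few lines. This is a sketch under stated assumptions: the function names `build_file_map` and `on_user_prompt` are hypothetical, the directory and suffix lists simply mirror the `find` command above, and how the result is wired into a `UserPromptSubmit` hook depends on the harness.

```python
from pathlib import Path

def build_file_map(root, dirs=(".agents", ".opencode"), suffixes=(".sh", ".md")):
    """List matching files under the project-specific directories, sorted, as one block."""
    root = Path(root)
    paths = sorted(
        p.relative_to(root).as_posix()
        for d in dirs
        if (root / d).is_dir()          # skip directories this project does not have
        for p in (root / d).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )
    body = "\n".join(paths)
    return f"<project-file-map>\n{body}\n</project-file-map>"

def on_user_prompt(prompt, root="."):
    """Append the file map at the prompt tail, where attention is strongest."""
    return f"{prompt}\n\n{build_file_map(root)}"
```

The map is rebuilt on every prompt rather than cached, so newly created hook files show up without any invalidation logic. For large trees a cap on the number of listed paths would be a sensible addition.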
**What does not work:** repeating the mapping in AGENTS.md or system prompts
("hook files live in `.agents/hooks/`") — this is instructional and degrades
under context pressure. Temperature reduction does not help with noun resolution
and may hurt tool-call schema compliance on Qwen3-class models.

**Other forms of parametric knowledge conflict — and whether structure injection
handles them.**

File paths are a _low-to-medium_ confidence prior. The model knows `.husky/` is
common, but doesn't know your specific project layout, so it defers readily to
injected evidence. Structure injection works because the prior is weak. The
following conflict types have _higher_ confidence priors and require different
harness tools. The pattern from ClashEval holds throughout: **match intervention
strength to prior confidence**.

| Conflict type | Example | Prior confidence | Does structure injection help? | What actually works |
| --- | --- | --- | --- | --- |
| **Structural identity** | `.husky/` vs `.agents/hooks/` | Low–medium | ✅ Yes — file listing resolves ambiguity | `UserPromptSubmit` hook appends file map |
| **Framework semantics** | React patterns in a Solid.js project | High | ⚠️ Partially — seeing Solid.js files in the map signals the framework, but doesn't show the API | Inline code examples at prompt tail (`createSignal`, `createMemo` shown in use); `PostToolUse` pattern check for React imports |
| **Import path conventions** | `../../packages/core` vs `@cantrips/remnant-core` | Medium | ⚠️ Partially — package.json injection exposes aliases | Inject `tsconfig.json` paths section and package.json `imports`/`exports` map at session start |
| **Async convention** | `async/await` vs the callback pattern this project uses | Very high | ❌ No — file listing doesn't convey behavioral convention | Code example injection (show a canonical callback-pattern function from the codebase); `PostToolUse` grep for `async ` in files that should use callbacks |
| **Error handling** | Throwing exceptions vs returning error results | Very high | ❌ No | Same as async: inject a canonical example; `PreToolUse` or `PostToolUse` grep for `throw new` in use-case files |
| **Command invocations** | `npx jest` vs `npm test`, `docker-compose` vs `docker compose` | Medium–high | ❌ No | `PreToolUse` hard block + redirect — the incorrect command is interceptible before execution; this is the cleanest fix because the error is structural |

The general principle: **structure injection handles structural identity
conflicts only.** For semantic, convention, and behavioral conflicts — where the
model has deep training-data confidence in a competing pattern — the effective
interventions are **(a) concrete code examples at the prompt tail** (activates
pattern-matching against actual code rather than fighting a prior with
instructions) and **(b) PostToolUse pattern validation** (catches violations
immediately, in the high-attention context tail). `PreToolUse` blocks are the
right tool only when the incorrect behavior is interceptible as a specific
command or schema.

For the highest-confidence conflicts (async conventions, error handling idioms),
the Onoe et al. finding is most actionable: descriptions in AGENTS.md don't
propagate. A single concrete example from the actual codebase, injected at the
tail, outperforms any amount of prose instruction.

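For the command-invocation row, a `PreToolUse` hard block with redirect could be sketched as follows. The `REDIRECTS` table assumes, purely for illustration, that this project standardizes on `npm test` and `docker compose`; the `check_command` function and its return shape are hypothetical rather than any harness's actual hook contract.

```python
import shlex

# Assumed project conventions: blocked command prefix -> preferred replacement.
REDIRECTS = {
    ("npx", "jest"): "npm test",
    ("docker-compose",): "docker compose",
}

def check_command(command):
    """Block commands that start with a known-wrong prefix and say what to run instead."""
    tokens = shlex.split(command)
    for prefix, replacement in REDIRECTS.items():
        if tuple(tokens[:len(prefix)]) == prefix:
            blocked = " ".join(prefix)
            return {
                "allow": False,
                "message": (f"Blocked `{blocked}`: this project uses `{replacement}`. "
                            f"Re-issue the command with `{replacement}`."),
            }
    return {"allow": True, "message": ""}
```

Matching on tokenized prefixes rather than substrings avoids false positives such as a file path that happens to contain `docker-compose`.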
**Silent catch blocks mask enforcement failures completely.** Any `try/catch`
around a tool-call enforcement path that returns a safe default (e.g., `''`)
will silently disable enforcement when the underlying API changes. This is not a
small-model failure — it affects the harness itself. Mitigation: log all caught
errors to a debug file during development and verify the log is empty before
removing debug code. Never assume a hook or enforcement layer is working;
confirm with a test call.

**Scope-detection via todo-list interception.** When a small model attempts a
broad refactor it should not handle, it will typically call `manage_todo_list`
with many items to plan the work. A `PreToolUse` hook that blocks
`manage_todo_list` calls with ≥4 items and returns a specific error message
("this task is too broad — tell the user and stop") consistently causes the
model to report scope and stop, rather than proceeding. This is more reliable
than relying on the model's own Rule 5 compliance. Anthropic's pattern for this
is "guardrails via parallelization" (a separate model screens requests alongside
the working model); a hook-based deny is a lighter-weight equivalent.

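A minimal sketch of the todo-list interception, assuming the hook receives the tool name and a parsed argument dict. The `items` field name and the decision-dict return shape are assumptions made for illustration; the actual `manage_todo_list` schema and hook contract depend on the harness.

```python
MAX_TODO_ITEMS = 3  # calls with 4 or more items are treated as out of scope

def pre_tool_use(tool_name, args):
    """Return a decision dict; deny oversized todo lists with a specific message."""
    if tool_name == "manage_todo_list":
        items = args.get("items", [])  # "items" is an assumed field name
        if len(items) > MAX_TODO_ITEMS:
            return {
                "decision": "deny",
                "reason": (f"Plan has {len(items)} items; this task is too broad. "
                           "Tell the user and stop."),
            }
    return {"decision": "allow", "reason": ""}
```

The deny reason is phrased as an instruction because, per the text above, a specific message is what gets the model to report scope and stop.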
**Poka-yoke tool design (Anthropic, 2024).** The harness should make incorrect
tool usage structurally harder, not just instructionally forbidden. Examples:
requiring absolute file paths (eliminates cwd-relative errors), enforcing
`limit` on every `read` call via a blocking hook (eliminates accidental
full-file reads), requiring `explanation` and `goal` fields on terminal calls
(forces pre-action reasoning). These structural constraints outperform
equivalent instruction-only approaches because they fire at the API boundary and
are not subject to instruction drift.

**Sampling parameters matter more.** Qwen3's documented thinking-mode defaults
are `temperature=0.6, top_p=0.95, top_k=20`, and these are empirically the right
starting point for agentic use as well — lower temperatures (e.g., 0.2) trade
reasoning quality and frequently _hurt_ tool-call schema compliance rather than
helping, because the model has less headroom to escape a local format error.
Earlier guidance suggesting low-temperature defaults for tool-call reliability
does not survive A/B testing on Qwen3-class models; keep the documented
thinking-mode values unless you measure a specific regression.

**Anti-filler-token system prompts.** Reasoning-trained small models tend to
open `<think>` blocks with filler ("Okay, let me think about this...", "The user
wants...") before any real analysis. Each filler opener wastes 50–150 tokens at
the start of every reasoning block, multiplied across tens of tool calls. A
direct system-prompt rule — _"Open `<think>` blocks with substantive analysis.
Do not begin with filler phrases like 'Okay, let me...' or 'The user
wants...'."_ — measurably trims reasoning length without affecting reasoning
quality. The win compounds on a 32k context.

# 20–30B Model Class: The Practical Sweet Spot

> **Status:** Operational reference, not a survey. Captures what has been
> observed running 20–30B models as local agent drivers through mid-2026.
>
> **Audience:** Engineers deploying local agentic harnesses who need concrete
> failure modes and countermeasures for the 20–30B class — not first-time
> quantization users.
>
> **Self-evaluation:** This document is opinionated and deliberately concrete;
> model-specific claims are date-stamped because they age within months.

---

## 1. The 20–30B Class Defined

Models in the 20–30B parameter range — **Qwen3-32B-dense**, **Qwopus3.6-27B**,
**GLM-4-32B** — occupy a unique position in the local deployment landscape. They
are large enough to hold meaningful instruction context and tool-call fidelity
without collapsing under quantization, yet small enough to run on consumer
hardware (single 24 GB GPU at Q4, or dual-GPU setups with headroom). This class
has failure modes that are **not** shared by frontier models and **not** shared
by sub-14B models — they are uniquely theirs.

| Dimension | Sub-14B class | 20–30B class | Frontier (≥200B) |
| --- | --- | --- | --- |
| **Instruction drift** | Immediate (4–8 turns) | Delayed (10–15 turns) | Resistant |
| **Plan invention** | Poor (hallucinates steps) | Unreliable (skips, invents) | Strong |
| **Tool-call fidelity** | Breaks under load | Degrades gradually | Robust |
| **Context budget** | Collapses early | Degrades gradually | Stretches far |
| **VRAM at Q4** | ≤12 GB | ≤24 GB | Not feasible |

The 20–30B class is **not frontier** and **not small**. It sits between two
established playbooks, and applying either playbook produces suboptimal results.

---

## 2. Failure Modes

### 2.1 Instruction Drift at Tool Call 10–15

The defining characteristic of this class is that it **starts strong and degrades
predictably**. A 27B model loaded with a 2k-token system prompt will follow all
rules faithfully for roughly 10–15 tool calls — then rules begin to drop. Not
catastrophically (as sub-14B models do at turn 4), but enough to produce
drift: the model stops checking lint before committing, stops writing to
NOTES.md, stops using `read` before `edit`.

**Mechanism.** The system prompt sits at the head of the context. By tool call
10–15, the accumulated conversation has pushed it deep into the effective
attention zone where recall is graded, not binary. The model hasn't "forgotten"
the rules — it's attending to them less than to the immediate conversation
tail.

**What works:**

- **Periodic system-prompt echo every 8–10 calls** via `PostToolUse` hook
  injection. A compressed version of the most-critical rules (3–5 bullets)
  reappears at the context tail, restoring attention to constraints before
  drift sets in. This is the single most impactful harness change for this
  class — it reduces drift-related errors by an order of magnitude in
  observed sessions.
- **Tail-positioned critical rules.** Place the few rules that matter most
  (e.g., "read before edit", "run lint before commit") at the _end_ of the
  system prompt, not the beginning. The tail survives longer.

**What does not work:** negative constraints ("DO NOT forget to check lint"),
repeated reminders in the user prompt (they degrade after 2–3 repetitions),
or asking the model to "re-read the instructions" (it won't).

### 2.2 Plan-Invention Failure

When asked to invent a multi-step plan from scratch, 20–30B models frequently
produce plans that are **structurally incomplete** (missing dependency edges),
**overconfident** (assuming APIs exist without checking), or **hallucinatory**
(inventing intermediate steps that serve no purpose). This is the class's
hardest intrinsic limitation — plan generation is the single most demanding
reasoning task an agent must perform.

**What works:**

- **Blueprint injection.** Instead of asking the model to invent a plan, inject
  a structured blueprint at the prompt tail. A blueprint is a task-type-keyed
  skeleton: "debug → read error → locate source → read file → hypothesize →
  verify → fix → test." The model fills in the slots rather than inventing the
  structure. This maps directly to the blueprint-guided execution pattern
  (Han et al., [arXiv:2506.08669](https://arxiv.org/abs/2506.08669)).
- **Exploration subagent with blueprint handoff.** A larger orchestrator model
  (or even the same model in a fresh context with higher `num_predict`) generates
  the blueprint; the 20–30B model executes it. The context firewall between
  subagents means the execution agent never sees the planning mess.

**What does not work:** asking the model to "think step by step" before acting
— this just produces a long chain that still misses the dependency.

### 2.3 Long CoT Degradation

Hassid et al. ([arXiv:2505.17813](https://arxiv.org/abs/2505.17813),
"Don't Overthink it") directly tested chain-of-thought length within a single
question and found that **the shortest chains are up to 34.5% more accurate than
the longest**. This effect is pronounced at the 20–30B scale: extended thinking
tokens do not accumulate reasoning — they accumulate noise. The model begins
repeating itself, inventing irrelevant intermediate steps, or drifting into
explanation mode rather than planning mode.

**What works:**

- **Cap reasoning-trace lengths** at inference time (`num_predict` on `<think>`
  blocks). A practical cap for 20–30B models is 800–1200 thinking tokens per
  call — enough for a plan, not enough for a treatise.
- **Short-m@k with ≤3 chains.** Generate `k` reasoning chains in parallel,
  halt when the first `m` finish, take majority vote. At 20–30B, three chains
  is the practical ceiling — more chains eat VRAM without accuracy gain.
  Short chains with majority voting beat one long chain at equal or better
  accuracy with fewer total thinking tokens.

**What does not work:** budget forcing (extending a single chain to consume a
fixed token budget). Budget forcing is a frontier-model technique; at 20–30B it
produces verbose, less-accurate chains.

### 2.4 The "Not Frontier, Not Small" Gap

The 20–30B class falls between two established deployment playbooks:

- **Frontier playbooks** assume robust tool-call fidelity, strong plan invention,
  and deep context. A 20–30B model cannot sustain these assumptions past turn 10.
- **Small-model playbooks** assume immediate instruction collapse, severe
  hallucination, and subagent-only deployment. A 20–30B model is far more
  capable than these playbooks allow for.

Applying frontier patterns (long sessions, deep reasoning, no scaffolding) to
20–30B models produces gradual failure. Applying small-model patterns (extreme
task slicing, no primary-agent role) wastes the model's actual capability.

---

## 3. Harness Patterns

### 3.1 Periodic System-Prompt Echo (every 8–10 calls)

**Mechanism.** A `PostToolUse` hook counts tool calls and injects a compressed
rules reminder at the context tail every 8–10 calls. The reminder is 3–5
bullets covering the most-critical constraints:

```
[HOOK INJECTION: post-tool-use] System reminder:
- Read a file before editing it
- Run lint before committing
- Write findings to NOTES.md after each step
```

**Why it works.** The tail of the context is the high-attention zone (Liu et al.,
[arXiv:2307.03172](https://arxiv.org/abs/2307.03172)). Re-injecting rules at the
tail restores attention to constraints before drift sets in. The original system
prompt at the head is still there — this is not a replacement, it's a reinforcement.

**Implementation note.** The hook must be terse. A 200-token reminder every 8
calls fires 12 times in a 100-call session and adds about 2,400 tokens, which is
manageable. A 500-token reminder (about 6,000 tokens over the same session) is
not.

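The counting logic behind the echo is small enough to sketch directly. `ReminderEcho` and its method name are hypothetical; how the returned string is attached to the context tail depends on the harness's `PostToolUse` contract.

```python
from typing import List, Optional

class ReminderEcho:
    """Counts tool calls and emits a compressed rules reminder every `every` calls."""

    def __init__(self, rules: List[str], every: int = 8) -> None:
        self.rules = rules
        self.every = every
        self.calls = 0

    def on_post_tool_use(self) -> Optional[str]:
        self.calls += 1
        if self.calls % self.every != 0:
            return None  # nothing to inject on this call
        bullets = "\n".join(f"- {rule}" for rule in self.rules)
        return f"[HOOK INJECTION: post-tool-use] System reminder:\n{bullets}"
```

Keeping the counter in the hook, not in the prompt, means the cadence cannot drift along with the model's instruction recall.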
### 3.2 Blueprint Injection

**Mechanism.** When the orchestrator classifies the task type, inject a
structured blueprint at the prompt tail. The blueprint is a task-type-keyed
skeleton, not a plan for this specific task. The model fills in the slots:

```
## Task Blueprint: Debug

1. Read the error message
2. Locate the source file
3. Read the relevant section
4. Form a hypothesis
5. Verify with a targeted read or test
6. Apply a minimal fix
7. Run the build / test
```

**Why it works.** Plan invention is the 20–30B class's weakest reasoning mode.
Blueprints replace invention with execution — the model's strong suit. Han et
al. ([arXiv:2506.08669](https://arxiv.org/abs/2506.08669)) show this pattern
improves accuracy on GSM8K, MBPP, and BBH with no additional training.

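On the harness side, blueprint injection reduces to a lookup keyed by task type. In the sketch below only the debug steps come from the text; the `BLUEPRINTS` table, the `inject_blueprint` function, and the assumption that the orchestrator hands over a lowercase task-type string are illustrative.

```python
# Task-type-keyed skeletons; only the debug steps come from the text above.
BLUEPRINTS = {
    "debug": [
        "Read the error message",
        "Locate the source file",
        "Read the relevant section",
        "Form a hypothesis",
        "Verify with a targeted read or test",
        "Apply a minimal fix",
        "Run the build / test",
    ],
}

def inject_blueprint(prompt, task_type):
    """Append the blueprint for `task_type` at the prompt tail, if one exists."""
    steps = BLUEPRINTS.get(task_type)
    if steps is None:
        return prompt  # unknown task type: leave the prompt untouched
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{prompt}\n\n## Task Blueprint: {task_type.capitalize()}\n\n{numbered}"
```

Falling through unchanged for unknown task types keeps a misclassification from injecting the wrong skeleton.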
### 3.3 Compaction at 65% Fill

**Mechanism.** Compact the conversation at 65% context-fill rather than the
conventional 80–90%. The 20–30B class degrades gradually — by 80% fill,
effective recall of head-position content is already poor.

**Why 65%, not 80%.** At 20–30B, the effective context is roughly 40–50% of
advertised (consistent with the gradient degradation observed in Liu et al.).
Compacting at 65% of advertised leaves 35% headroom, which maps to roughly
the effective context limit. Compacting at 80% means the model has already
been operating in degraded mode for the last 15% of the session.

**Compaction target.** Stale tool outputs first (raw file contents whose
information has been acted on), then stale conversation turns. The
anchored-summary schema from §4.7 of the best-practices document applies
unchanged.

### 3.4 Short-m@k with ≤3 Chains

**Mechanism.** For tasks requiring reasoning (debug diagnosis, architecture
decisions), generate up to 3 reasoning chains in parallel, take majority
vote when the first 2 agree. This is the short-m@k pattern from Hassid et
al., adapted to 20–30B hardware constraints.

**Why ≤3 chains.** Each chain at 20–30B requires ~8–12 GB VRAM at Q4. Three
chains fit on dual-GPU setups; four push into swap territory with severe
latency penalty. The accuracy gain from chain 3 to chain 4 is marginal
compared to the latency cost.

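The control flow of short-m@k can be sketched with the standard library's thread pool. `sample_chain` is a hypothetical callable that runs one reasoning chain and returns its final answer; in practice it would wrap a request to the local inference server. Breaking ties in favor of the earliest-finishing chain is a choice made in this sketch.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def short_m_at_k(sample_chain, k=3, m=2):
    """Run k chains in parallel, keep the first m to finish, return their majority answer."""
    pool = ThreadPoolExecutor(max_workers=k)
    futures = [pool.submit(sample_chain, i) for i in range(k)]
    answers = []
    for fut in as_completed(futures):
        answers.append(fut.result())
        if len(answers) == m:
            break  # the shortest m chains are in; ignore the rest
    # Do not block on the slower chains; cancel any that have not started yet.
    pool.shutdown(wait=False, cancel_futures=True)
    # On a tie, the earliest-finishing answer wins (first-seen among equal counts).
    return Counter(answers).most_common(1)[0][0]
```

Threads that are already running cannot be interrupted this way, so a real deployment would also want to abort the in-flight generation requests on the server to actually reclaim the thinking tokens.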
### 3.5 Anti-Filler-Token Rules

**Mechanism.** Explicit rules in the system prompt or `AGENTS.md` that ban
filler behavior. The 20–30B class is particularly prone to generating
explanatory filler — long paragraphs explaining what it's about to do before
doing it, or summarizing files it just read.

**Concrete rules that work:**

- "Do not summarize a file you just read — proceed to the next action."
- "Do not explain your plan before executing it — act immediately."
- "When the user asks a yes/no question, answer in one sentence then proceed."

These rules target the specific filler modes observed in 20–30B models.
Generic rules ("be concise") are ignored; specific rules ("do not summarize
a file you just read") are followed because they are concrete and testable.

---

## 4. Prompt Design

### 4.1 Imperative, Not Conditional

**Rule:** Write instructions as commands, not conditions. The 20–30B class
processes imperative instructions more reliably than conditional ones.

| Conditional (weak) | Imperative (strong) |
| --- | --- |
| "If there's a file to edit, read it first" | "Read a file before editing it" |
| "When you encounter an error, check the source" | "On error, locate the source file" |
| "If the build fails, run lint" | "Build fails → run lint" |

Conditional instructions introduce a branch the model must evaluate — at 20–30B,
branch evaluation is unreliable. Imperative instructions are single-path and
easier to follow.

### 4.2 Tail Content

**Rule:** Place the most-critical instructions at the end of the system
prompt and at the end of the user prompt. The tail survives context pressure;
the head does not.

This applies to both the initial system prompt (most important rules last)
and to injected content (hooks inject at the tail). A rule at the head of a
3k-token system prompt is effectively invisible by tool call 12.

### 4.3 Concrete Examples Over Abstract Principles

**Rule:** Show a concrete example of the desired behavior rather than stating
an abstract principle. The 20–30B class has weaker abstraction-to-execution
transfer than frontier models.

| Abstract (weak) | Concrete (strong) |
| --- | --- |
| "Be precise with file paths" | "Use absolute paths: `/home/dev/code/remnant/src/file.ts`, not `src/file.ts`" |
| "Check for errors" | "After every `npm run build`, check the exit code before proceeding" |
| "Keep changes minimal" | "Edit only the lines that need changing; do not reformat adjacent code" |

### 4.4 No Self-Reflect Language

**Rule:** Do not include "reflect on your answer", "double-check", "are you
sure", or "take another look" in prompts targeting 20–30B models. Huang et al.
([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large Language Models
Are Not Reliable Self-Correctors") show that intrinsic self-correction without an
external oracle **consistently degrades** reasoning performance. At 20–30B,
the effect is stronger — the model's self-assessment is poorly calibrated, and
asking it to "reflect" produces longer, less-accurate chains.

Replace self-reflect prompts with external feedback: test runners, lint checks,
hook exit codes. The model does not need to check its own work — the harness
does.

### 4.5 Short CoT

**Rule:** When the prompt asks the model to reason, constrain the reasoning
trace explicitly. "Think step by step" produces verbose, less-accurate chains
at 20–30B. Instead:

| Verbose (weak) | Constrained (strong) |
| --- | --- |
| "Think step by step about this" | "List the 3 most likely causes, then test the first one" |
| "Analyze the problem thoroughly" | "State your hypothesis in one sentence, then verify it" |
| "Consider all possibilities" | "Name 2 candidate fixes, implement the first" |

This aligns with the Hassid et al. finding: shorter chains are more accurate.
The prompt constraint enforces short chains at the point of generation, not
just at the inference-time cap.

### 6.4a Reasoning density: getting more out of small local models

A separate question from "how do I keep a small model from breaking?" (§6.4) is
"how do I get more reasoning capability out of it without enlarging it?". Recent
research converges on four techniques that are particularly suited to local
deployment, where additional inference passes are cheap and the alternative
(swapping to a frontier model) defeats the reason for going local in the first
place.

**1. Prefer shorter reasoning chains, not longer ones.** The intuitive
assumption that more "thinking" helps was directly tested by Hassid et al.
([arXiv:2505.17813](https://arxiv.org/abs/2505.17813), "Don't Overthink it"):
within a single question, **the shortest chains the model produces are up to
34.5% more accurate than the longest**, and SFT on short chains beats SFT on
long ones. Practical translation:

- Cap reasoning-trace lengths at training time (curate short-CoT data) and at
  inference time (`num_predict` on `<think>` blocks, per §6.4).
- For test-time scaling on local hardware, **short-m@k** is the right pattern:
  generate `k` reasoning chains in parallel, halt as soon as the first `m`
  finish, take majority vote among those `m`. Hassid reports up to 40% fewer
  thinking tokens than standard majority voting at equal or better accuracy.
- This contradicts the early-2025 "scale test-time compute by extending one long
  chain" framing (e.g., s1's budget forcing,
  [arXiv:2501.19393](https://arxiv.org/abs/2501.19393)). Budget forcing works on
  32B+ models; on ≤7B models the evidence increasingly favours shorter chains
  and parallel sampling. Treat budget forcing as a frontier-model technique.

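
The short-m@k pattern described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_chain` is a hypothetical callable standing in for one model call, `toy_chain` is a fake sampler so the block runs on its own, and the tie-break (prefer the answer backed by the shortest chain) is an assumption of this sketch. It needs Python 3.9+ for `cancel_futures`.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Tuple

# A chain sampler returns (final_answer, thinking_token_count).
Chain = Tuple[str, int]


def short_m_at_k(sample_chain: Callable[[int], Chain], k: int = 3, m: int = 2) -> str:
    """Start k chains in parallel, keep the first m to finish, majority-vote."""
    if not 1 <= m <= k:
        raise ValueError("need 1 <= m <= k")
    pool = ThreadPoolExecutor(max_workers=k)
    futures = [pool.submit(sample_chain, seed) for seed in range(k)]
    finished: List[Chain] = []
    for future in as_completed(futures):
        finished.append(future.result())
        if len(finished) == m:
            break
    # Threads cannot be interrupted; a real harness would abort the
    # still-running generation requests here to actually save the tokens.
    pool.shutdown(wait=False, cancel_futures=True)

    votes = Counter(answer for answer, _ in finished)
    top = max(votes.values())
    tied = [chain for chain in finished if votes[chain[0]] == top]
    # Assumed tie-break: prefer the answer backed by the shortest chain.
    return min(tied, key=lambda chain: chain[1])[0]


def toy_chain(seed: int) -> Chain:
    """Hypothetical stand-in for a model call: longer chains finish later."""
    answer, tokens = [("41", 300), ("42", 100), ("42", 200)][seed]
    time.sleep(tokens / 5000)
    return answer, tokens


print(short_m_at_k(toy_chain, k=3, m=2))
```

With a real backend, the saving comes from cancelling the slow chains, so the sampler should expose some way to abort an in-flight request.
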
**2. The Small Model Learnability Gap dictates distillation strategy.** Li et
al. ([arXiv:2502.12143](https://arxiv.org/abs/2502.12143)) found that **models
≤3B do not consistently benefit from long-CoT distillation from larger
reasoners** — they perform _worse_ than when fine-tuned on shorter, simpler
chains better matched to their intrinsic learnability. Their proposed **Mix
Distillation** combines long and short CoT examples (and reasoning from both
larger and smaller teachers) and outperforms either alone. The standard "distill
from the strongest reasoner you can afford" instinct is wrong for ≤3B targets.

For local-driver training (anything in the 0.5–3B regime), the operational rule
is:

- Source ~60–70% of CoT data from teachers ≤14B (or from the target model itself
  after a first round). Use larger teachers (≥30B) for the remaining 30–40%,
  primarily on harder problems where the smaller teacher is unreliable.
- Curate or rewrite teacher outputs to **median chain length**, not maximum.
  LIMO ([arXiv:2502.03387](https://arxiv.org/abs/2502.03387)) showed that 817
  strategically-designed "cognitive template" demonstrations beat 100×-larger
  CoT corpora at the 32B scale; the same logic applies more strongly at smaller
  scales. Quality and chain-length appropriateness dominate quantity.
- The LIMO finding has an important boundary condition the paper states
  explicitly: it assumes "domain knowledge has been comprehensively encoded
  during pre-training." A 2B model with weaker domain coverage will not match
  the same data efficiency — but the directional advice (concise high-quality
  chains beat verbose mediocre ones) still holds.

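
The operational rule above can be sketched as a data-composition step. This is an illustrative sketch, not the Mix Distillation recipe from the paper: the `difficulty` field, the 0.65 default split, and the whitespace token count are all assumptions made for the example.

```python
import random
from statistics import median
from typing import Dict, List


def pick_median_chain(candidates: List[str]) -> str:
    """Keep the candidate whose length is closest to the median length."""
    lengths = [len(c.split()) for c in candidates]  # crude whitespace token count
    target = median(lengths)
    return min(candidates, key=lambda c: abs(len(c.split()) - target))


def build_mix(small_pool: List[Dict], large_pool: List[Dict], total: int,
              small_fraction: float = 0.65, seed: int = 0) -> List[Dict]:
    """Draw ~small_fraction of the examples from the small-teacher pool and
    fill the rest with the hardest problems from the large-teacher pool."""
    rng = random.Random(seed)
    n_small = round(total * small_fraction)
    n_large = total - n_small
    if n_small > len(small_pool) or n_large > len(large_pool):
        raise ValueError("pools are too small for the requested mix")
    easy_part = rng.sample(small_pool, n_small)
    # Hypothetical "difficulty" score: higher means harder.
    hard_part = sorted(large_pool, key=lambda ex: ex["difficulty"], reverse=True)[:n_large]
    mix = easy_part + hard_part
    rng.shuffle(mix)
    return mix
```

`pick_median_chain` would run per problem before the pools are built, so that each example already carries a chain of typical rather than maximal length.
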
**3. Blueprint-guided execution as an inference-time density booster.** Han et
al. ([arXiv:2506.08669](https://arxiv.org/abs/2506.08669), ICML 2025 TTODLer-FM)
show that **LLM-generated structured reasoning blueprints** — extracted by a
larger model from solved problems and reused as scaffolds — measurably improve
small-model accuracy on GSM8K, MBPP, and BBH, with no additional training. The
blueprint is a high-level step skeleton ("identify the goal → list known
variables → choose the operator type → ..."); the small model fills it in.

For an agentic harness, this maps onto:

- **A blueprint library** keyed by task type (debug, refactor, write-test,
  search-and-summarize), injected at the prompt tail when the orchestrator
  classifies the request. The small model is no longer asked to invent a plan
  from scratch, which is the single hardest thing for it to do reliably; it
  executes a known-good plan template instead.
- **Pairing with the explore-subagent pattern (§3.4):** the orchestrator can
  generate a blueprint, hand it to the subagent, and recover a 1–2k token
  summary that has been structurally constrained.

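
A blueprint library of the kind described above could look roughly like this. The task types come from the text; the step wording, the `BLUEPRINTS` name, and the `build_prompt` helper are hypothetical and only show the shape of the idea (plan skeleton appended last, so it sits at the prompt tail).

```python
from typing import Dict, List

# Hypothetical blueprint library; in practice a larger model would extract
# these skeletons from solved problems.
BLUEPRINTS: Dict[str, List[str]] = {
    "debug": [
        "Restate the failing behaviour in one sentence",
        "List the 3 most likely causes",
        "Test the first cause with the cheapest available check",
        "Fix the confirmed cause and re-run the failing test",
    ],
    "write-test": [
        "Identify the function under test and its inputs",
        "List the normal case and two edge cases",
        "Write one assertion per case",
    ],
}


def build_prompt(context: str, request: str, task_type: str) -> str:
    """Assemble a prompt with the blueprint placed last (at the tail)."""
    parts = [context, f"Request: {request}"]
    steps = BLUEPRINTS.get(task_type)
    if steps:
        plan = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
        parts.append("Follow this plan one step at a time:\n" + plan)
    return "\n\n".join(parts)


print(build_prompt("<repo context>", "The login test fails on CI", "debug"))
```

An unclassified request simply gets no plan section, which keeps the fallback path identical to the un-blueprinted prompt.
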
**4. Test-time compute scaling is not free, and its effectiveness scales with
model size.** A persistent failure mode in 2025–2026 deployment writeups is
applying frontier test-time-compute patterns (MCTS, Best-of-N with a verifier,
extended budget-forced thinking) to ≤7B models and reporting flat or negative
results. The Kinetics work and follow-ups consistently find that test-time
compute pays off most above ~10–14B parameters, where attention capacity (not
raw parameter count) becomes the bottleneck. For smaller models:

- **Short-m@k with majority voting** remains net-positive on local hardware
  because ternary / small dense inference is cheap. Budget: ≤3 parallel chains.
- **Verifier-guided search (MCTS / Best-of-N + judge)** is rarely worth the cost
  unless the verifier is also small and runs on the same device. A 7B verifier
  rating a 2B generator's outputs eats the compute budget the small model was
  supposed to save.
- **Extended single-chain thinking** is the worst option at this scale — see
  point 1.

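
The verifier-cost point above can be made concrete with back-of-envelope arithmetic. The sketch assumes the common approximation that a dense forward pass costs about 2 × parameters FLOPs per token, and that the verifier reads every token the generator wrote. It ignores attention cost, KV caching, and the fact that verifier prefill parallelizes better than generator decode, so treat the ratio as a rough compute figure, not a latency prediction.

```python
def forward_flops(params: float, tokens: int) -> float:
    """Approximate dense forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * params * tokens


def verifier_overhead(gen_params: float, ver_params: float,
                      chains: int, tokens_per_chain: int) -> float:
    """Ratio of verifier compute to generator compute for Best-of-N scoring."""
    total_tokens = chains * tokens_per_chain
    generator = forward_flops(gen_params, total_tokens)
    verifier = forward_flops(ver_params, total_tokens)
    return verifier / generator


# A 7B judge scoring a 2B generator's chains, 4 chains of 500 tokens each.
ratio = verifier_overhead(gen_params=2e9, ver_params=7e9, chains=4, tokens_per_chain=500)
print(f"verifier compute is {ratio:.1f}x the generator's")
```

Under these assumptions the ratio depends only on the parameter counts, which is why a same-size or smaller verifier is the only configuration that leaves the budget roughly intact.
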
**Synthesis.** For a sub-7B local model: train on shorter chains, run short-m@k
at inference when accuracy matters, inject blueprints when the task type is
known, and do not import frontier test-time-compute patterns wholesale. The
reasoning-density ceiling for a small model is shaped more by data composition
and inference-time structure than by raw model capability.

### 6.5 Local agent harnesses

- **OpenCode:** currently the most flexible model-agnostic harness. Strong for
  routing between local and cloud models in a single workflow. Recommended
  default for users who want control.
- **Aider:** still excellent for diff-based coding, particularly with its
  repo-map. More limited as a general agent loop.
- **Cline / Continue / Roo Code:** good integrations into VS Code; varying
  degrees of model-agnostic configuration.
- **llama.cpp / vLLM / MLX / Ollama:** the inference layer. vLLM dominates for
  GPU throughput; llama.cpp for flexibility and CPU/Apple support; MLX for
  Mac-native efficiency.

### 6.6 Pre-configured cloud agents vs local-DIY

The honest comparison:

- **Pre-configured wins** on out-of-the-box capability. Cursor, Claude Code,
  Windsurf, GitHub Copilot ship with deeply tuned harnesses, hand-curated system
  prompts, and routing logic that took teams of engineers months to build. A
  naive local setup will not match this without significant effort.
- **Local-DIY wins** on customizability, privacy, and cost-at-scale, provided
  you are willing to invest in harness work. The ceiling is higher if you put in
  the engineering hours; the floor is much lower.

A pragmatic middle path: pre-configured cloud agent as daily driver, local agent
for confidential work and bulk tasks. OpenCode is well-suited to this hybrid
pattern.

---

## 7. Prompt Engineering: Is It Still Relevant?

Mostly: no, not in the 2022–2023 sense. The techniques that used to deliver
double-digit accuracy improvements either:

- **Got partially baked into the models** (chain-of-thought via reasoning
  training, instruction-following via RLHF/RLAIF) — but "baked in" is not the
  same as "reliable." Even reasoning-trained CoT inherits and entrenches
  pretraining priors via posterior collapse, especially on subjective tasks
  (emotion, morality, intent inference —
  [arXiv:2409.06173](https://arxiv.org/abs/2409.06173)). Larger
  reasoning-trained models can anchor _harder_ to a wrong prior under CoT, not
  softer. Treat "the model will reason its way out of a misread" as a weak
  intervention, not a built-in safety net.
- **Got moved into the harness** (todo lists, plan/act, structured tool use).

What still matters about prompt construction:

- **Negative constraints.** Frontier labs spend disproportionate effort on "do
  not do X" rules. Third-party harnesses under-invest here. Important caveat
  from §4.6: negative constraints _lose_ to deeply trained behavioral priors.
  They work for novel rules; they fail against "gather context first"-style
  instincts. Match the rule to the mechanism.
- **Output-format guarantees.** Structured output, schema-constrained
  generation, JSON mode — these still pay off, especially for tool calls.
- **Role/boundary definition for subagents.** Subagent system prompts are still
  high-leverage because they shape what compressed report comes back. This is
  about defining the _task contract_ and the _return format_, not about
  injecting an expertise persona (see persona caveat below).
- **Stable identity across turns.** "You are an agent that..." framing has
  little benefit. The folk claim that "consistent voice and persona instructions
  reduce drift in long sessions" is uncited and unverified; given that small
  variations in persona attributes can produce double-digit accuracy drops
  (Principled Personas, EMNLP 2025), treat persona stability as cosmetic, not
  load-bearing.
- **Expertise-ladder prompting for _divergent ideation_ (not accuracy).**
  Community technique, no canonical paper, **and now in tension with the
  persona-prompting empirical literature.** When a brainstorming or design task
  risks collapsing to an "average" LLM answer, enumerating solutions across
  explicit framings (e.g., _"What would a junior engineer propose? What would a
  senior engineer with deep domain knowledge propose differently? What does an
  outsider with zero context propose? What assumptions does the senior answer
  make that the junior doesn't?"_) can broaden the sample of approaches the
  model produces. **Critical scope limit:** recent persona-prompting work
  (Principled Personas, EMNLP 2025; Persona is a Double-Edged Sword, IJCNLP
  2025; [arXiv:2512.05858](https://arxiv.org/abs/2512.05858)) finds that
  low-knowledge personas ("layperson," "outsider," "child") often _reduce_
  accuracy on factual / reasoning benchmarks, sometimes substantially. The
  ladder is therefore safe as a _divergent-thinking sampler_ (where high
  variance is the goal) but **must not** be used as an accuracy improver, an
  expertise injector, or the final answer producer. Use it to broaden the
  candidate set, then evaluate candidates with the un-personified model under an
  external rubric. If you only have budget for one of these two passes, skip the
  ladder.

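
For the output-format point above, schema-constrained decoding prevents malformed tool calls at generation time. Where that is unavailable, a cheaper fallback is to validate each call in the harness and feed the error back to the model. The sketch below shows only that fallback; the tool names, the `{"tool": ..., "args": ...}` envelope, and `TOOL_SCHEMAS` are hypothetical, not any harness's actual wire format.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical tool registry: argument name -> required Python type.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def parse_tool_call(raw: str) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (call, None) when valid, or (None, error message) otherwise."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return None, f"unknown tool: {name!r}"
    args = call.get("args")
    if not isinstance(args, dict):
        return None, "args must be a JSON object"
    schema = TOOL_SCHEMAS[name]
    missing = sorted(set(schema) - set(args))
    if missing:
        return None, f"missing args: {missing}"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return None, f"arg {key!r} must be {expected.__name__}"
    return call, None


print(parse_tool_call('{"tool": "read_file", "args": {"path": "src/app.ts"}}'))
print(parse_tool_call('{"tool": "run_tests", "args": {"target": "unit"}}'))
```

The error string is written to be returned to the model verbatim, so it names the exact field that failed rather than a generic "invalid call".
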
What no longer pays off meaningfully:

- Few-shot examples for capable models on common tasks. They often actively
  harm results via spurious pattern-matching.
- Elaborate "let's think step by step" preambles for reasoning models —
  redundant.
- "You are an expert in X" puffery. No measurable effect on frontier models, and
  on small models can be actively harmful via persona-attribute sensitivity (see
  Principled Personas reference above).
- Asking the model to reflect on or critique its own output without an external
  oracle. Per Huang et al.
  ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798)), intrinsic
  self-correction _degrades_ reasoning performance in the no-oracle setting. The
  intervention feels productive (and reads well in transcripts) but the
  measurable effect on correctness is negative. Use only when paired with an
  external verifier.

---

## 8. Verification, Sandboxing, and Safety

### 8.1 Verification as harness, not prompt

The most reliable indicator of an agent that works is whether **the harness
forces verification** rather than relying on the model to verify itself. Minimal
verification steps:

- Build/compile after edits.
- Test suite execution.
- Lint and format.
- Diff inspection (does the change touch unrelated areas?).
- Git-status awareness before destructive operations.

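
A harness-side version of the first three steps above can be as small as a loop over commands that stops at the first failure and hands the output back to the agent. This is a sketch under assumptions: the commands in `EXAMPLE_CHECKS` are placeholders for whatever the repository actually uses, and the 2000-character tail is an arbitrary cap to keep the feedback small.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# Placeholder commands; substitute the project's real build/test/lint entry points.
EXAMPLE_CHECKS = [
    ("build", ["npm", "run", "build"]),
    ("test", ["npm", "test"]),
    ("lint", ["npm", "run", "lint"]),
]


def run_checks(checks: Sequence[Tuple[str, List[str]]],
               cwd: str = ".") -> List[Tuple[str, bool, str]]:
    """Run checks in order; stop at the first failure.

    Returns (name, passed, output_tail) for every check that actually ran.
    """
    results = []
    for name, command in checks:
        proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
        passed = proc.returncode == 0
        tail = (proc.stdout + proc.stderr)[-2000:]  # keep feedback short
        results.append((name, passed, tail))
        if not passed:
            break  # later checks are meaningless until this one is fixed
    return results


# Runnable demo that only needs the current Python interpreter.
print(run_checks([("python-available", [sys.executable, "--version"])]))
```

The exit code, not the model's reading of the log, decides whether the step passed; the log tail is only there so the agent can act on the failure.
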
Three patterns extend the basics:

- **Block on policy-shaping files.** Some files (`eslint.config.js`,
  `tsconfig.json`, deployment configs) shape the rules every other tool call
  obeys. Edits should require explicit human review even from a trusted agent —
  a PreToolUse hook that denies edits with an explanatory message ("propose the
  change; let the user decide") is more reliable than asking the model to
  remember.
- **Block on generated files.** Files marked `.generated.ts` (or similar) will
  be overwritten on next build; an agent edit silently disappears. A PreToolUse
  hard block with a redirect ("edit the generator script, then run
  `npm run build:core`") closes the loop instead of relying on the agent to
  remember.
- **Block on documented-anti-pattern commands.** `sed -i`, `awk` rewrites of
  code files, `rm -rf .wireit`, `npm install` without confirmation,
  `npm run build` while the dev server runs (port conflict): all are cheaper to
  block at the harness than to instruct against in prose. The block message
  should always include the alternative.

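
The three blocks above share one shape: inspect the pending tool call, and return either nothing (allow) or a message that names the alternative. The sketch below shows that decision function only. The tool names (`edit`, `bash`), the argument keys (`path`, `command`), and the wording of the `sed -i` and `npm install` messages are assumptions; how the result is turned into a deny (exit code, JSON response, thrown error) depends on the harness.

```python
import fnmatch
import os
import re
from typing import Optional

POLICY_FILES = {"eslint.config.js", "tsconfig.json"}
GENERATED_GLOB = "*.generated.ts"
BLOCKED_COMMANDS = [
    (re.compile(r"\bsed\s+-i\b"),
     "Do not rewrite code with `sed -i`; use the edit tool."),
    (re.compile(r"\bnpm\s+install\b"),
     "Ask the user to confirm before running `npm install`."),
]


def pre_tool_use(tool: str, args: dict) -> Optional[str]:
    """Return a block message (always naming the alternative) or None to allow."""
    if tool == "edit":
        name = os.path.basename(args.get("path", ""))
        if name in POLICY_FILES:
            return f"{name} is a policy file: propose the change; let the user decide."
        if fnmatch.fnmatch(name, GENERATED_GLOB):
            return (f"{name} is generated and will be overwritten: edit the "
                    "generator script, then run `npm run build:core`.")
    if tool == "bash":
        command = args.get("command", "")
        for pattern, message in BLOCKED_COMMANDS:
            if pattern.search(command):
                return message
    return None


print(pre_tool_use("edit", {"path": "src/api.generated.ts"}))
print(pre_tool_use("bash", {"command": "npm test"}))
```

Keeping the rules in data (`POLICY_FILES`, `BLOCKED_COMMANDS`) makes the periodic audit of stale rules a diff review rather than a code change.
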
### 8.2 Sandboxing

- **Container-level isolation** for any agent that runs shell commands
  autonomously is now table stakes. Options include Docker, Firecracker
  microVMs, or language-level sandboxes.
- **Network policy.** Egress whitelisting prevents prompt-injection-driven
  exfiltration.
- **Filesystem scope.** Agents confined to a project directory eliminate a large
  class of accidents.

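
The filesystem-scope and egress rules above are best enforced by the container or network layer, but a harness-level pre-check catches mistakes earlier and gives the agent a readable refusal. The sketch below is that pre-check only, not a sandbox; `ALLOWED_HOSTS` is a hypothetical allowlist.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical egress allowlist; real deployments enforce this at the network layer too.
ALLOWED_HOSTS = {"registry.npmjs.org", "pypi.org"}


def inside_project(project_root: str, candidate: str) -> bool:
    """True if candidate (relative or absolute) resolves inside project_root."""
    root = Path(project_root).resolve()
    # resolve() collapses ".." segments and follows symlinks before comparing.
    target = (root / candidate).resolve()
    return target == root or root in target.parents


def egress_allowed(url: str) -> bool:
    """True only for exact host matches, so look-alike domains are rejected."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS


print(inside_project("/tmp/proj", "src/app.py"))
print(inside_project("/tmp/proj", "../secrets.env"))
print(egress_allowed("https://pypi.org.evil.example/pkg"))
```

Exact host matching is deliberate: suffix or substring matching is the usual way an allowlist gets bypassed.
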
### 8.3 Prompt injection

The unsolved problem of the field. Tool outputs (fetched web pages, file
contents from third-party repos, search results) can contain instructions that
hijack the agent. Current mitigations are partial:

- Treat tool output as data, never as instructions. Easier said than enforced —
  models cannot fully separate the two.
- Egress controls and explicit user confirmation for destructive operations.
- Detection layers (a separate classifier model scanning tool output for
  injection patterns) — partial coverage at best.

Assume injection _will_ succeed eventually. Design the blast radius accordingly.

---

## 9. The Self-Improving Harness

A pattern worth its own section because it's underused: **the harness should get
stronger with every difficult session.** The mechanism is a `Stop` hook that, at
session end, prompts the agent itself to reflect on whether the session was
unusually hard and, if so, what knowledge would have prevented most of the work.

A representative prompt:

> If this session required significant effort (many tool calls, multiple
> dead-ends, complex investigation): ask yourself what information, if it had
> existed at the start, would have prevented most of that work. First, determine
> scope — globally applicable, or specific to certain files / patterns? Then
> lean toward hooks as the solution: hard stops via PreToolUse, PostToolUse
> reminders at the relevant boundary, nested AGENTS.md, PreCompact state save,
> or SessionStart broad reminders. These are all more reliable than root
> AGENTS.md sections (lost-in-the-middle). Record the insight in the right hook
> or instructions file, not just in AGENTS.md.

Why this works:

- **The agent has the freshest signal** about what was painful in this session.
  Asking 12 hours later loses fidelity.
- **The reflection is gated on effort**, so trivial sessions don't bloat the
  rule set with low-value lessons.
- **The placement guidance is built into the prompt**, so the recorded lesson
  lands at the right enforcement level (hook ≫ AGENTS.md) instead of defaulting
  to the easiest place.
- **Repeated application compounds.** A harness that captures one lesson per
  hard session per developer reaches its expressive ceiling fast, then stays
  there.

The risk is rule bloat — each session is tempted to record something. Two
guardrails: (a) the prompt explicitly says "only record genuinely new insights";
(b) periodic audits remove rules that no longer fire or whose condition has been
superseded by a better mechanism.

A related pattern is **answer-completeness verification at session end**: the
`Stop` hook re-surfaces the user's last prompt (preserved by the
`UserPromptSubmit` hook) and asks the agent to confirm every distinct question
was addressed, not just the primary task. Cheap to implement; catches the most
common multi-part-prompt failure mode.

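
Both `Stop`-hook behaviours in this section reduce to one small decision: which follow-up prompts, if any, to hand back to the agent at session end. The sketch below assumes the harness already tracks per-session counters and the last user prompt; `SessionStats`, the thresholds (40 tool calls, 5 failures), and the shortened prompt text are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

REFLECTION_PROMPT = (
    "This session required significant effort. What information, had it existed "
    "at the start, would have prevented most of that work? Only record genuinely "
    "new insights, and prefer a hook over a root AGENTS.md section."
)


@dataclass
class SessionStats:
    tool_calls: int
    failed_tool_calls: int
    last_user_prompt: str = ""  # assumed to be saved by a UserPromptSubmit hook


def stop_hook_prompts(stats: SessionStats, min_tool_calls: int = 40,
                      min_failures: int = 5) -> List[str]:
    """Return the follow-up prompts to inject at session end (possibly none)."""
    prompts = []
    if stats.last_user_prompt:
        # Answer-completeness check: re-surface the user's last prompt.
        prompts.append(
            "Before finishing, confirm that every distinct question in the "
            "user's last prompt was addressed:\n" + stats.last_user_prompt
        )
    # Effort gate: only hard sessions trigger the reflection prompt.
    if stats.tool_calls >= min_tool_calls or stats.failed_tool_calls >= min_failures:
        prompts.append(REFLECTION_PROMPT)
    return prompts


print(stop_hook_prompts(SessionStats(tool_calls=62, failed_tool_calls=7)))
```

The effort gate is what keeps trivial sessions from adding rules; tune the thresholds against how often the reflection actually yields a rule worth keeping.
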
---

## 10. Operational Guidance (Synthesis)

A pragmatic playbook condensed from the above:

1. **Pick the harness first, model second.** A good harness with a mid-tier
   model beats a great model with a bad harness.
2. **Default to a single agent loop with plan/act/verify.** Add subagents only
   for read-only exploration or fully isolated tasks.
3. **Treat the context window as a budget.** Retrieve narrowly, summarize
   aggressively, place task-critical content at the tail.
4. **Standardize on ~6 tools.** Resist tool proliferation. Use MCP-style façades
   or code-as-tools above ~40 tools.
5. **Force verification in the harness.** Never rely on the model to grade
   itself.
6. **Write `AGENTS.md` (or equivalent) for your repo.** Anti-patterns matter
   more than positive instructions.
7. **Match model class to task.** Reasoning model for planning and diagnosis,
   non-reasoning for mechanical work, cheap model for grep/summarize subagents.
8. **For local deployment:** Q6_K weights, Q8 KV cache, MoE for memory
   efficiency, grammar-constrained tool calling.
9. **Build a 20-task internal eval suite** specific to your codebase. No public
   benchmark substitutes.
10. **Date-stamp your conclusions.** The field moves fast enough that
    model-specific advice rots in months.

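
For item 9, the internal eval suite does not need a framework to start. A minimal sketch, assuming the agent can be wrapped as a function from prompt to output and each task carries its own pass/fail check (in a real suite the check would usually run the repository's tests against the agent's diff); `EvalTask`, `run_suite`, and `fake_agent` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalTask:
    name: str
    prompt: str
    check: Callable[[str], bool]  # external oracle, never the model itself


def run_suite(agent: Callable[[str], str], tasks: List[EvalTask]) -> Dict[str, object]:
    """Run every task once and report per-task results plus the pass rate."""
    results: Dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.name] = bool(task.check(agent(task.prompt)))
        except Exception:
            # A crash in the agent or the check counts as a failed task.
            results[task.name] = False
    passed = sum(results.values())
    rate = passed / len(tasks) if tasks else 0.0
    return {"pass_rate": rate, "results": results}


def fake_agent(prompt: str) -> str:
    """Stand-in for a real agent call, used only to make the sketch runnable."""
    return prompt.upper()


suite = [
    EvalTask("uppercases", "hello", lambda out: out == "HELLO"),
    EvalTask("adds-suffix", "hello", lambda out: out.endswith("!")),
]
print(run_suite(fake_agent, suite))
```

Re-running the same suite after every harness or model change is what makes the date-stamped conclusions in item 10 checkable.
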
---

## 11. Self-Evaluation

A frank assessment of this document's strengths and weaknesses, as instructed:

**Strengths**

- Categories are organized around the real axes of decision-making (harness vs
  model, local vs cloud, reasoning vs not), not around vendor names, which would
  have dated faster.
- Calls out specific failure modes per model family rather than treating all
  frontier models as interchangeable.
- Acknowledges what _used_ to be true and has been uprooted, per request.
- Quantization and hardware guidance reflects mid-2026 reality (KV-cache quant,
  MoE) rather than the 2023 "Q4 is fine" oversimplification.
- Self-contained: a reader without prior context can use it.

**Weaknesses and risks**

- **Model-specific claims rot fast.** The mid-2026 winners section will likely
  be wrong in 3–6 months. The framing should survive longer than the specifics.
- **Citation density is now medium.** Primary sources have been added where
  verifiable (Sharma 2310.13548 sycophancy, Liu 2307.03172 lost-in-the-middle,
  Pan 2308.03188 self-correction, Zheng 2306.05685 LLM-as-judge, Anthropic Sep
  2025 context engineering article). Several claims remain attributed to
  community sources or unpublished internal evaluations (LangChain harness
  result on Terminal-Bench 2.0, ETH Zurich AGENTS.md cost study, the 40–50 tool
  threshold) — directionally trustworthy but a determined reader should verify
  before quoting.
- **Possible bias toward the Anthropic / Claude ecosystem.** The author of this
  document is a Claude-family model, and the "Claude leaks" framing reflects an
  asymmetric leak landscape (Claude prompts leaked more visibly than
  competitors'). Other labs do similar scaffolding work; the document implicitly
  under-credits this.
- **Local-deployment section is hardware-specific** and will age as consumer
  hardware changes (especially NVIDIA generational shifts and Apple's continued
  unified-memory pushes).
- **Prompt injection section is appropriately pessimistic** but offers limited
  actionable guidance because the field has limited actionable answers. This is
  honest but unsatisfying.
- **Benchmarks section** treats Aider polyglot and SWE-Bench Verified as current
  ground truth. Both will saturate; the criterion ("predicts your repo's
  results") matters more than the named benchmark.

**What I would add with more space**

- A worked example of an `AGENTS.md` derived from the negative-instruction
  principle, contrasted with a typical bloated one.
- Concrete numbers on the cost-at-scale crossover point for local hardware vs
  API usage (these are knowable with reasonable assumptions).
- A section on fine-tuning vs RAG vs prompt-only customization, with the
  cost/benefit thresholds.
- Empirical comparisons of grammar-constrained decoding tools (Outlines / GBNF /
  lm-format-enforcer / function-calling-as-grammar) for tool-call reliability on
  open-weight models.

**Overall confidence**

- **High** on the four opening shifts, the Prompt/Context/Harness diagnostic,
  the cross-model failure modes, and the enforcement hierarchy — all
  well-replicated and stable.
- **Medium** on the family-specific failure patterns, the tool-count threshold,
  and the small-model harness mitigation set — directionally correct, specific
  numbers vary by harness and model.
- **Lower** on specific model winners and exact hardware recommendations —
  fast-moving facts.

**Changelog**

- **Revision 2:** integrated repo-internal research notes
  (Prompt/Context/Harness taxonomy, ETH Zurich AGENTS.md study, LangChain
  Terminal-Bench harness result, LLM-as-judge biases, sub-agent tiering,
  enforcement hierarchy, Plan-and-Solve + Think-Anywhere, just-in-time
  retrieval, NOTES.md pattern, sequential-constraint-ordering failure,
  small-model harness mitigations, skills.sh / SKILL.md, OpenSpec,
  MCP-as-portable-deferred-loading, expertise-ladder prompting). Added
  primary-source citations where available.
- **Revision 3:** added patterns observed in the repository's own agent
  configuration (`.agents/`, hooks, modelfiles): counterbalance agent design
  (§3.1a), circuit breakers as a first-class primitive (§3.2),
  falsification-first investigation and dead-ends file (§3.4a), stateful hooks /
  tool-specific PostToolUse warnings / path-scoped reminders (§3.7),
  trigger-word nudges as positive-recommendation analog (§3.8), exploration
  files as durable handoff artifacts and timing awareness (§4.5), anchored
  compaction schema (§4.7), corrected Qwen3 sampling recommendations and
  anti-filler-token prompts (§6.4), policy- and generated-file harness blocks
  (§8.1), self-improving harness via Stop-hook reflection (new §9), and
  outsider-persona expansion to the expertise-ladder prompt (§7). Old §9–10
  renumbered to §10–11.
- **Revision 4:** elevated **permission-layer denial** above PreToolUse hard
  blocks in the enforcement hierarchy (§3.6). A permission deny on an agent
  definition removes the tool from the agent's available-tool set entirely,
  rather than rejecting a tool call after the agent has chosen to make it.
  Reflects the local-orchestration plan's structural-enforcement primitive
  (OpenCode `permission: { edit: deny }`).
- **Revision 5:** added Skills vs Hooks comparison table to §5.5. Folded unique
  content from `docs/research/agent-infrastructure.md` (which is now deleted);
  everything else in that file was already synthesized in prior revisions.
- **Revision 6:** corrective edits driven by the 2026-05-16
  text-intent-interpretation investigation
  (`docs/explorations/text-intent-interpretation-research.md`). Three claims
  revised against new evidence: (a) §2.1 sycophancy reframed as
  model-family-conditional, not a universal RLHF property, citing nostalgebraist
  (2023) replication on OpenAI base models; (b) §3.5
  intrinsic-self-correction-hurts claim upgraded to cite Huang et al.
  (arXiv:2310.01798) as the strong primary source, with Pan et al. retained as
  the survey reference, and rewritten to explicitly call out "ask the model to
  reflect" as a tempting-but-counterproductive intervention without an external
  oracle; (c) §7 expertise-ladder prompting scoped down to divergent ideation
  only and explicitly flagged as in tension with persona-prompting empirical
  literature (Principled Personas EMNLP 2025; Persona is a Double-Edged Sword
  IJCNLP 2025; arXiv:2512.05858); CoT-baked-in claim softened to acknowledge
  posterior collapse on subjective tasks (arXiv:2409.06173); "ask the model to
  reflect" added to the "no longer pays off" list.