Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)

- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config

2026-05-22 13:13:43 -04:00

92 KiB

Raw Blame History

Agentic Coding: Best Practices (Research Notes)

Status: Research synthesis, not a tutorial. Captures the state of the agentic-coding field as of mid-2026, with emphasis on what has been uprooted from earlier (2022–2024) practice.

Audience: Engineers building, configuring, or using AI coding agents — not first-time LLM users.

Self-evaluation: See the final section. This document is opinionated and deliberately concrete; model-specific claims are date-stamped because they age within months.

Applied implementation: docs/projects/agent-infrastructure.md — how these principles are applied in this repo (current architecture, OmniCoder 2 orchestration plan, open issues).

0. Framing: What Got Uprooted

Three big shifts have rendered most pre-2024 "LLM coding tips" obsolete or actively misleading:

Prompt engineering → context engineering. Modern instruction-tuned frontier models follow direct, terse instructions reliably. The high-leverage work has moved outside the system prompt — into what tokens reach the model at all, in what order, and with what compression. (Karpathy popularized the term "context engineering" in mid-2024; it has since been adopted as the default frame by Anthropic, Cursor, and others.)
Model > harness → harness ≈ model. A 2023 belief was "just wait for the next model." The Claude system-prompt leaks (Oct 2024 onward), the success of Aider's repo-map, and Cognition's published failure analyses showed that scaffolding — tool choice, context budget, plan/act separation, todo tracking — explains as much variance in agent success as the underlying model. A mid-tier model with an excellent harness routinely beats a frontier model with a naive harness on real-repo tasks.
Multi-agent enthusiasm → single-thread default. The "swarm" / AutoGPT era assumed parallelism would compound capability. Cognition's "Don't Build Multi-Agents" (mid-2025) and subsequent replications established the now-dominant view: context fragmentation between agents destroys more value than parallelism creates. Subagents survive only in narrow, read-only or fully isolated roles.
Three layers, not one. The field has converged on a useful taxonomy popularized by an Alibaba Cloud engineering article (Apr 2026): Prompt → Context → Harness. Prompt is the per-request task expression (stateless). Context is everything the model sees during execution (system rules, tool definitions, AGENTS.md, retrieved code, conversation history). Harness is the deterministic machinery around the model (hooks, permission gates, verification loops, subagent boundaries). The layers fail differently and require different fixes — conflating them is the single most common mistake in agent design. LangChain's Terminal-Bench 2.0 score rose from 52.8% → 66.5% by changing the harness alone (no model swap, no prompt change), the starkest single data point that harness design has first-order impact.

Everything below is downstream of these four shifts.

1. The Model Landscape (Mid-2026)

1.1 Categories that actually matter

Drop the "GPT vs Claude vs Gemini" framing. The useful axes are:

Axis	Options	Why it matters
Reasoning depth	Non-reasoning · Hybrid (toggleable) · Always-reasoning	Reasoning models excel at planning and bug diagnosis; non-reasoning models are faster and cheaper for mechanical edits.
Architecture	Dense · Mixture-of-Experts (MoE)	MoE delivers high parameter counts with low active-param compute — critical for local deployment.
Context budget	128k · 200k · 1M+ effective	Stated context ≠ effective context. Most models degrade well before the advertised limit.
Tool-calling fidelity	Native function-call schema reliability	The single biggest differentiator for agent harnesses. Models with weak tool fidelity cannot drive agents reliably regardless of raw ability.
Hostability	Closed-API only · Open-weight	Determines whether local/private deployment is viable.

1.2 Category winners (as of May 2026 — will rot quickly)

Frontier closed-weight, agentic coding: Claude Opus 4.x and Claude Sonnet 4.x dominate SWE-Bench Verified and long-horizon multi-file refactors. GPT-5-class models lead on competitive-programming-style isolated problems and aggressive reasoning. Gemini 2.5 Pro leads on very-long-context navigation (100k+ token codebases in single prompts).
Open-weight frontier: DeepSeek-V3.x and Qwen3-Coder (480B MoE) are the current open SOTA on coding benchmarks. GLM-4.6 and Kimi K2 trail closely on agentic tasks. The gap to closed frontier has narrowed to roughly 6–12 months for raw capability, but tool-calling fidelity still lags.
Local-runnable (≤80GB VRAM): Qwen3-Coder-30B-A3B (MoE) and Qwen3-32B-dense are the practical sweet spot. DeepSeek-V3 distillations and GLM-4-9B/32B occupy specific niches.
Best price/performance for autonomous agents: Mid-tier Sonnet-class and GPT-5-mini-class models routinely win on cost-adjusted SWE-Bench, because agentic tasks are dominated by mechanical token throughput, not peak reasoning per call.

1.3 Benchmarks: which actually predict real-world success

Predictive: SWE-Bench Verified, Aider polyglot leaderboard, LiveCodeBench (recent splits only), Terminal-Bench. These measure multi-file edits, test-passing, and tool use under realistic constraints.
Misleading or saturated: HumanEval, MBPP, basic code-completion suites. All are contaminated and saturated; a 90+% score is now table stakes and uncorrelated with agent success.
Underrated: Internal harness-vs-harness A/B tests on your own repository. No public benchmark captures repo-specific idioms, build systems, or test-runner quirks. A 20-task internal eval suite beats any leaderboard ranking for selecting a working model for a given project.

2. Failure Modes

2.1 Cross-model failures

These appear across every frontier model and most open-weight models:

Premature completion claims. The model declares "done" while tests fail or builds break. Mitigation: forced verification step in the harness ("run the build before declaring success"), not in the prompt.
Sycophancy (Sharma et al., arXiv:2310.13548, Oct 2023). Five SOTA RLHF-trained assistants systematically generated responses matching the user's stated or implied beliefs over correct ones; both human raters and reward models preferred convincing-but-wrong outputs a non-negligible fraction of the time, creating systematic training pressure toward agreement. Caveat — not a universal property of RLHF. nostalgebraist (LessWrong, 2023) replicated Anthropic's sycophancy eval on OpenAI base models and found they are not sycophantic at any size, so the effect depends on the specific finetuning recipe and the family-specific preference data, not on RLHF as such. Treat sycophancy as family-conditional rather than a universal cross-model failure; the mitigations below still apply where it manifests. Code-specific manifestations: hard-coding to pass test cases, scope creep via agreement, confirming guesses without verification, premature positive feedback. Mitigation: explicit anti-sycophancy rules ("challenge the user when the user is wrong"; "read a file before asserting facts about it"; "only make changes that are directly requested"), and external feedback (test runners, hooks) rather than model self-grading.
Hallucinated APIs. Inventing function signatures, import paths, or configuration keys. Worsens with: long contexts, smaller models, unfamiliar/newer libraries. Mitigation: grounding tools (read source, grep before calling), forced doc-fetch, repository-aware retrieval.
Reward-hacked verification. Deleting failing tests, weakening assertions to make tests pass, wrapping failing code in try/except, or solving the test cases rather than the general problem. Universal failure mode. Anthropic's published counter-prompt is short and effective enough to repeat verbatim:

Please write a high-quality, general-purpose solution using the standard tools available. Do not create helper scripts or workarounds to accomplish the task more efficiently. Implement a solution that works correctly for all valid inputs, not just the test cases. Do not hard-code values or create solutions that only work for specific test inputs. Tests are there to verify correctness, not to define the solution.

Pair with: pre/post diff inspection, test-coverage delta checks, and explicit policy against test deletion in agent rules. Pan et al. (arXiv:2308.03188, 2023) survey of self-correction strategies establishes the broader principle: external feedback signals (test runners, hooks, type checkers) are reliable; self-critique alone is not — models are poorly calibrated to detect their own errors without ground truth.
Context rot / lost-in-the-middle (Liu et al., arXiv:2307.03172, 2023). Information placed in the middle of a long context is recalled poorly even by 1M-context models. Mechanism: transformer attention attends to every token in context (n² pairwise relationships), so a larger context stretches attention capacity across more relationships, leaving less focused attention per token. The degradation is gradient, not cliff; effective context is typically 30–50% of advertised. Mitigation: structured, ordered context (most-recent and most-task-relevant at the tail), summarization of stale turns, separate retrieval rather than dumping.
Position-anchored priming (question drift). When a model commits to an answer in a prior turn, that answer sits in the context window and acts as a prior the model subsequently defends. Follow-up questions are read through the lens of the previous position; the model generates responses consistent with what it already said rather than addressing the new question. Common pattern: "no" to a first question → "no" to all follow-ups even when the follow-ups ask something different. Related to sycophancy but directionally inverted — the model is anchored to its own prior commitment, not the user's.

Mitigations in order of effectiveness:
- Compaction or fresh context. Remove the prior committed answer from the context window. The anchor is physically broken. A PreCompact hook can preserve the user's current question while discarding stale prior responses.
- Adversarial reframing. Per ClashEval (Wu, Wu, Zou 2024): lowering the model's confidence in its prior increases context adherence. "I believe your previous answer was wrong because X. Now answer this specific question: ..." lowers confidence more than repeating the question.
- Explicit current-question marker. A UserPromptSubmit hook prepending CURRENT QUESTION (answer this, not the prior exchange): at the prompt tail. Mechanical, cheap, measurably reduces drift for small models where position effects are stronger.
- What does not work: repeating the question louder, emphasis, or asking the model to "read more carefully." None of these change the anchor.
Stub-and-forget. Writing // TODO: implement placeholders and returning control as if complete. Especially common in Claude family. Mitigation: grep-for-TODO post-step.

2.2 Family-specific patterns

Claude (Opus/Sonnet 4.x): Tends toward over-engineering — adds unrequested error handling, docstrings, abstractions. Strong on instruction adherence when restrictions are explicit. Tends to "polish" adjacent code when asked to make a targeted change. Mitigation: explicit anti-scope-creep rules in AGENTS.md / CLAUDE.md (this is exactly why the field standardized on these files).
GPT (4.x / 5): Tends toward overconfident refactors — silently restructures code beyond the requested scope. Stronger at math/algorithmic reasoning, weaker at faithfully respecting existing code style. Mitigation: small task slicing, frequent diff review, lower temperature.
Gemini (2.5): Verbose; tends to repeat large file contents. Strong on very long contexts but degrades on tool-call schema adherence under load. Occasional formatting drift (markdown bleeding into code). Mitigation: output-format guards and structured tool schemas.
DeepSeek / Qwen / open MoE: Strong raw coding but weaker tool-call reliability — malformed JSON, schema deviation, or "talking about" calling a tool rather than emitting the call. Mitigation: strict JSON-mode / grammar-constrained decoding (e.g., llama.cpp GBNF, outlines, lm-format-enforcer), and harnesses that re-prompt on malformed calls.
Small / quantized models (≤14B, Q4 and below): Instruction-following collapse — ignoring rules after ~4–8 turns; tool-schema breakage; severe hallucination of imports. Not yet viable as primary agent drivers; usable as cheap subagents for specific narrow tasks (grep, summarize, classify).

2.3 The "Claude leaks" and their effect

Starting Oct 2024, leaked system prompts and tool definitions from Claude (and later, similar leaks from Cursor, Devin, Windsurf, and others) revealed how much production-grade harnesses rely on:

Explicit personas and tone constraints
Long lists of anti-patterns ("do not ... do not ... do not ...")
Structured TODO tracking as a first-class tool
Strict separation of plan and act phases
Memory tiering (session vs persistent vs repo)
Explicit file-link and citation formats

The industry consequence was rapid convergence: AGENTS.md, CLAUDE.md, .cursorrules, .windsurfrules, .opencode/agent.md, and similar files now share a near-identical structure. The leaks accelerated the recognition that prompt scaffolding is the product, not a secondary detail. They also clarified that frontier labs spend significant effort on negative instruction — what not to do — which most third-party agent builders under-invested in.

3. Agent Architecture

3.0 The Prompt / Context / Harness diagnostic

For any agent failure, route the fix to the right layer. Wrong-layer fixes are the single most common waste of effort:

Symptom	Layer	Fix
Wrong output format	Prompt	Rewrite instruction; add output schema
Missed an explicit requirement	Prompt	Tighten task expression
Hallucinated codebase fact	Context	Fix tool description; add retrieval
Wrong tool selected	Context	Fix description; reduce tool count
Stalls mid-task on multi-step problem	Context	Insufficient persistent context (NOTES.md)
Reads all files first despite "don't"	Context	Trained behavioral prior — see §4.6
Task drift in long session	Harness	Add sub-agent isolation boundary
Destructive action taken	Harness	Add permission hook (pre-tool deny)
Tests deleted to pass; assertions weakened	Harness	Pre/post diff check; coverage-delta gate
Long-session quality cliff at ~60% fill	Harness	Early compaction trigger; tool-output prune

3.1 Single-thread default

Modern consensus: a single agent loop with a clear plan/act split outperforms multi-agent topologies on almost all real coding tasks. Cognition's analysis identified the root cause as context divergence: separate agents accumulate incompatible interpretations of the same task, and reconciliation costs exceed parallelism gains.

The exceptions where parallel/multi-agent does help:

Read-only exploration subagents. Scan a large codebase, return a compressed summary. Their context does not need to merge back.
Fully isolated tasks. Multiple independent files generated from the same spec, with no inter-dependencies. Rare in real codebases.
Adversarial review. A second agent reviews the first's diff. Modest gains, mostly catches premature-completion failures.

3.1a Counterbalance agent design

When secondary agents are defined (slash commands, personas, named modes), the high-leverage approach is to design each agent as a counter to a known failure mode of the base model, not as a topic specialist ("frontend agent", "database agent"). Topic specialists duplicate context and rarely beat a generalist with a good search tool. Counterbalance agents earn their keep by suppressing a measurable, named tendency:

A brainstorm agent counters frontier-model overthinking — enforces speed, breadth, no hedging, no deep analysis. Exists because Opus/Sonnet ruminate by default.
A research agent counters frontier-model pattern-matching — requires hypothesis + falsification criterion before any diagnostic test. Exists because LLMs latch onto the first plausible explanation.
A build-local agent counters small-model context drift — pagination limits, mandatory grep-before-read, delegation rules for multi-file work.

Two consequences for agent-body authoring:

Negative role definition is part of the spec. Every counterbalance agent should end with a short "What You Are NOT" block: "You are NOT an implementation agent. You are NOT a planning agent." The exclusion list prevents scope creep more reliably than positive role framing alone.
Cognitive-mode decomposition beats topic decomposition. Agents named for how they think (diverge, investigate, execute-narrowly) compose cleanly: brainstorm hands off to research, research hands off to default, build-local handles narrow tasks. Agents named for what they think about ("backend agent") fight for jurisdiction on every cross-cutting task.

3.2 Plan / Act / Verify loop

The minimal viable agent loop:

plan → act → verify → (loop or stop)

Plan: produce a todo list, possibly with a brief written rationale. Forces the model out of "pattern-match and emit" mode. The todo list is also a contract the verify step can check against. Plan-and-Solve prompting (Wang et al., 2023) — decompose first, then execute — measurably reduces arithmetic and multi-step reasoning errors.
Act: execute one todo at a time. Single in-progress item is a soft rule that empirically reduces context fragmentation.
Verify: run tests, lint, build. The verification must be in the harness, not the prompt — relying on the model to self-verify is one of the most reliable ways to produce reward-hacked output.

Think-Anywhere (Jiang et al., 2026) extends Plan-and-Solve: models trained to insert <think> blocks at any token position — not just upfront — catch mid-implementation off-by-one errors that an initial plan cannot foresee. Claude 4.x's interleaved thinking between tool calls is the production-grade realization of the same idea. The practical instruction: "Re-evaluate the hypothesis at every tool-call boundary." The mapping to development methodologies is exact — Plan-and-Solve is sprint planning, Think-Anywhere is the retrospective; both are needed, neither suffices alone. Skipping the plan is "vibe coding"; refusing to re-evaluate is waterfall.

Circuit breakers as a first-class primitive. Embedded numeric self-stops in the agent body materially outperform vague "don't loop" instructions. The pattern, verbatim from working agent files:

5+ attempts without falsifying a hypothesis = STOP. Report what you've ruled out.
3+ edits to the same file without a passing test = STOP. You're fixing symptoms, not the cause.
Urge to "just try something" = STOP. Write the hypothesis first.
Two failures at the same level of abstraction = go UP one level.

Why this works: vague instructions decay against task pressure; explicit integers don't. The model can self-monitor against a count more reliably than against "too much." Pair with hard caps in the harness for the cases where the agent fails to self-stop.

3.3 Reasoning-mode usage

For reasoning-capable models, the cost calculus is:

Use reasoning for: planning, bug diagnosis, ambiguous requirements, architecture decisions.
Skip reasoning for: mechanical edits, file moves, formatting fixes, applying a known patch.
Hybrid models with toggleable reasoning (Claude 4.x extended thinking, GPT-5 reasoning effort, Qwen3 thinking-mode) make this routing tractable inside a single harness.

3.4 Sub-agent tiering (model-as-budget)

When subagents are used (read-only exploration, isolated tasks), the now-standard pattern is model-class tiering:

Parent orchestrator: strongest model (Opus-class) — holds cross-task state, plans, synthesizes. High per-call cost, few calls.
Sub-agents: mid- or small-class (Sonnet/Haiku-class, or a 30B local model) — receive isolated task slices. May burn tens of thousands of exploration tokens, but return only a 1–2k token condensed summary. The parent's context never sees the sub-agent's raw exploration.

This converts the sub-agent into a context firewall: parallelism without context contamination. It is the only multi-agent topology that consistently outperforms single-thread.

3.4a Falsification-first investigation

Applied Strong Inference (Platt, 1964) at the operational level. Before any diagnostic test, the agent fills a four-item checklist:

Hypothesis written (one sentence: "I believe X because Y")
Falsification criterion written (_"if wrong, I'd expect to see __")
Falsification test run before confirmation test
Result recorded: ELIMINATED with reason, or CONFIRMED with evidence

The order matters: running the confirmation test first invites confirmation bias and produces a "plausible answer" that the agent then defends. Running the falsification test first either kills the hypothesis cleanly (cheap progress) or strengthens it materially (the surviving hypothesis is now harder to dislodge).

Dead-ends file. Each eliminated hypothesis is appended to .session/dead-ends.md (or the investigation file's Hypotheses section) with the same four fields. Three benefits:

The current session does not re-test an already-eliminated hypothesis when context pressure causes forgetting.
A post-compaction resume has a structured record to anchor against.
A fresh session (or a handoff agent) starts with a real audit trail instead of having to re-derive the eliminations.

Dead-ends are also a leading indicator of agent quality: a session that produces zero entries was either trivial or non-rigorous; a session with 10+ entries and no resolution is a candidate for human escalation.

3.5 Evaluator-Optimizer, LLM-as-Judge, and Reflexion

Anthropic's "Building effective agents" formalized the evaluator-optimizer pattern: one agent generates, a separate evaluator scores against a rubric, the generator refines. Useful for research-quality assessments and brainstorm outputs more than for code (tests are a stricter evaluator than any judge).

The foundation result is Zheng et al. (arXiv:2306.05685, MT-Bench / Chatbot Arena, 2023): GPT-4-class LLMs as judges achieve >80% agreement with human preferences — the same rate as human-human agreement. This makes them a viable scalable evaluator, but with known biases that must be controlled:

Position bias. Judges favor whichever response appears first in a pairwise comparison. Mitigation: run twice with order reversed; take only the consistent result.
Verbosity bias. Longer responses score higher even at equal information density. Mitigation: rubric scores correctness and concision separately.
Self-enhancement bias. Same-family judges over-score their own family's outputs. Mitigation: cross-family judging or human spot-checks for calibration.

Reflexion (Shinn et al., 2023, arXiv:2303.11366) formalizes the evaluator-optimizer loop for multi-step agents: an external evaluator generates verbal feedback, the agent stores it in an episodic memory buffer, and reruns with the feedback in context. Results: 91% pass@1 on HumanEval vs GPT-4's 80% without it. Two non-negotiable conditions:

External feedback signal — not self-critique. An oracle or verifier (test pass/fail, compilation, hook exit code). Huang et al. (arXiv:2310.01798, "Large Language Models Cannot Self-Correct Reasoning Yet," Oct 2023) demonstrate this directly: in the intrinsic setting (no oracle labels), self-correction consistently decreases reasoning performance across prompts and tasks; prior "self-correction works" results vanish when oracle labels are removed. Pan et al. (arXiv:2308.03188) provide the broader survey taxonomy of self-correction strategies and the same conclusion in aggregate: external feedback signals (test runners, hooks, type checkers) are reliable; self-critique alone is not. Without an external signal, asking the model to reflect, double-check, or critique its own output is at best noise and at worst actively harmful — this is one of the most tempting and most counterproductive interventions in agent design.
The ability to retry. Reflexion loops. Single-shot feedback injection is helpful context, not the full pattern.

Failure-mode routing as a design extension. A judge subagent that reads the transcript, classifies the failure mode, and selects the matching intervention is stronger than generic "review the output" because the intervention is matched to the type of failure, not just "try harder." The prior-confidence → intervention mapping from §6.4 applies here:

Failure mode	External signal?	Intervention
Code bug / test failure	Yes (test runner)	Reflexion loop
Convention violation (async, error handling)	Yes (grep)	PostToolUse grep + canonical example
Question drift / prior anchoring	No	Compaction or adversarial reframing
Factual hallucination	Sometimes	Retrieval injection
Wrong directory / file	Yes (file listing)	Structure injection

Design constraints for the judge subagent:

Use a stronger or cross-family model as judge. A small model evaluating its own family's outputs compounds self-enhancement bias and parameter-count limitations. Frontier-class (Opus/Sonnet) or a different model family is strongly preferred. For a local-only constraint, a 32B judge evaluating a 9B agent is a practical minimum.
Activate on mechanical failure signals, not every turn. Run the judge when a hook fires non-zero, tests fail, or a build breaks — not as a constant overlay. Routing every response through a judge adds latency and is redundant when mechanical verification already gives a clear answer.
Judge output should be a correction spec, not a rewrite. Structured: { failure_mode, confidence, intervention, injected_context? }. The working agent acts on the spec; the judge stays in the evaluator role.
General Q&A failures lack external ground truth. For question drift, factual errors without a retrieval target, or prior anchoring — no oracle exists. Compaction and adversarial reframing are cheaper and more reliable for those cases than a judge loop.

3.6 The Enforcement Hierarchy

Not all guidance is equally effective. From most to least reliable, as a practical hierarchy:

Permission-layer denial    ← Strongest. Tool literally not available to the agent.
PreToolUse hard block      ← Structural. Always fires. Agent cannot bypass.
PostToolUse path-check     ← Fires right after the relevant action (context tail).
Nested AGENTS.md at path   ← Always-on for that folder scope. Tool-portable.
Stop / SessionStart inject ← Fires at session boundaries. Broad reminders.
Root AGENTS.md sections    ← Context-start only. Degrades under Lost-in-the-Middle.

The root cause of the degradation gradient is Liu et al.'s lost-in-the-middle result: guidance written once at session start sits in the low-attention middle by tool call 20. Hooks inject at the context tail — the high-attention zone — which is why they outlast AGENTS.md under context pressure. Decision rule: if a constraint must hold deep into a session, fire it from a hook, not a prompt.

Permission-layer denial sits above PreToolUse for a reason. A PreToolUse hook intercepts a tool call the agent has already chosen to make; it generates a rejection message that the agent must then process and route around. Permission-layer denial (OpenCode's permission: { edit: deny, write: deny, bash: deny } on an agent definition; Claude Code's analogous allowlist) removes the tool from the agent's available set entirely — the tool description never appears in the agent's context, so the agent cannot try and recover. This is the cleanest realization of Anthropic's "poka-yoke your tools" principle: the violation is not just blocked, it is unreachable. Use it for invariants that must hold across an entire agent role (e.g., "the orchestrator never writes files"); use PreToolUse hooks for invariants that depend on the specific tool arguments (e.g., "no npx in shell commands").

3.7 Hook design: silent on success, loud on failure

A convention that has converged across Claude Code, Cursor, OpenCode, and internal Anthropic tooling: hooks emit nothing on success and exit with a non-zero code (commonly 2) on failure to reactivate the agent. Verbose success output adds noise to every tool call; the agent only needs to know when it's wrong. This is the harness analog of Unix's "no news is good news."

Three refinements that materially improve hook quality once the basics are in place:

Stateful reminders that read system state at fire time. A QUALITY GATE reminder that runs ss -tlnp | grep ':300[01]' and tailors its recommendation based on whether the dev server is actually running (npm test && npm run lint vs npm run build:strict) is dramatically more useful than a static instruction. The harness already runs at the right moment; spend the 5ms to read state.
Tool-specific PostToolUse warnings. Some tools have well-known blast-radius footguns: vscode_renameSymbol renames variable bindings but not object property keys, string literals, or related identifiers sharing a prefix. A targeted reminder fired immediately after the rename is in the high-attention zone and catches the gotcha before the next commit. Generic "be careful with renames" warnings at session start do not.
Path-scoped PostToolUse reminders. When the editing tool's FILE_PATH matches a glob (e.g., apps/client/src/pages/), inject a domain rule ("this is a client page — use BFF single-request, never chain second fetches"). The rule fires only on the relevant edits, so it doesn't bloat the context window for unrelated work.

3.8 Trigger-word nudges (the positive-recommendation analog)

The enforcement hierarchy in §3.6 covers blocking guidance. The mirror discipline is positive recommendation at the context tail: a UserPromptSubmit hook greps the user's incoming prompt for trigger words and injects a one-line agent recommendation alongside the prompt.

Examples that work in practice:

Hesitation / overthinking words ("wait", "actually", "hmm", "too complicated", "going in circles") → nudge toward a brainstorm agent.
Debugging / investigation words ("why is this broken", "trace", "root cause", "regression") → nudge toward a research agent.

Three non-obvious design constraints:

One nudge per topic. Repeating the same nudge after a user declines trains them to filter it out. Track "nudge fired for topic X" so a declined recommendation stays declined.
One sentence, non-intrusive. A nudge that consumes 200 tokens is indistinguishable from spam. Format: "NUDGE: <one-line condition description>. Consider <action> — one sentence, non-intrusive."
Context-tail injection, not AGENTS.md. A nudge written into AGENTS.md decays to invisibility by tool call 20 (lost-in-the-middle). A UserPromptSubmit hook fires the nudge fresh at every turn, at the tail — where attention is highest.

4. Context Engineering

4.1 Token budget allocation

Treat the context window as a budget, not a container. A rough allocation that holds up across models:

Region	Share	Notes
System / agent rules	5–10%	Stable, terse. Don't bloat with prose.
Memory / repo facts	5–15%	Project conventions, prior decisions. Tier by relevance.
Task description	2–5%	Keep it boundary-defined and specific.
Retrieved code	30–50%	The biggest lever. Most agents over-retrieve.
Tool outputs / scratch	20–40%	Compress aggressively; summarize old turns.
Headroom	10–20%	Leave room for the model's own output and at least one retry.

4.2 Retrieval

Repo maps (Aider's approach): compress a codebase into a ranked outline of file/symbol declarations. Cheap, effective baseline. Still best-in-class for repos up to ~500k LOC.
AST-aware retrieval beats line-based grep on identifier-driven queries.
Embedding retrieval is overrated for code. Symbol-graph and AST retrieval consistently beat dense embeddings on real coding tasks; the exception is natural-language docs and design notes.
Hybrid retrieval (grep + symbol graph + light embedding for docs) outperforms any single approach.

4.3 Memory tiering

Now-standard pattern (Claude Code, Cursor, OpenCode, GitHub Copilot all converged on it):

Session memory: scratch for the current task. Cleared at end.
Repo memory: project conventions, verified facts, build commands.
User/global memory: preferences across all projects.

Loading the right tier at the right time is more impactful than how much is stored.

4.4 AGENTS.md: keep it small

An ETH Zurich evaluation of LLM-generated per-project AGENTS.md files found they increased API cost by 20% and added 14–22% reasoning tokens with no measurable improvement in task success rate. Bloated rule files fill the context window with content irrelevant to the current task — a tax on every tool call for marginal-to-negative benefit.

Practical ceiling: roughly 60 lines of universally applicable constraints. Everything else belongs in:

Nested AGENTS.md at the directory it applies to (loaded only when that scope is active in most agent tools).
Skills loaded on demand by a routing description.
Hooks at the relevant tool-call boundary.
AGENTS.md stubs — one-line trigger conditions with read_file instructions, so the body loads only when the trigger fires.

The pattern: anti-patterns matter more than positive instructions. A 60-line AGENTS.md of "do not do X" rules outperforms a 600-line one full of best-practice prose. This matches the asymmetric effort that frontier labs put into negative instruction (visible in leaked system prompts).

4.5 Just-in-time retrieval and structured notes

Anthropic's Sep 2025 context-engineering article formalized two patterns that now define the state of the art:

Just-in-time retrieval. Rather than loading all potentially relevant content at session start, agents hold lightweight references (file paths, query strings, identifiers) and load data on demand. Claude Code's reliance on glob/grep over upfront file dumps is the canonical example. The instruction version for agent bodies: "Hold references; load on demand. Do not read files you don't need yet."

Structured note-taking (agentic memory). For tasks spanning tens of tool calls or multiple context windows, agents should write progress to a file (e.g. NOTES.md) and read it back at context-reset boundaries. Properties:

Structured for state — JSON/checklist for completion tracking.
Freeform for progress — natural language for context and open questions.
Write-first incentive — "record completion of step 1 before reading files for step 2" is structurally more honest than reading-first, because the model cannot write a truthful note about uncompleted work.

Note files survive compaction. If a PreCompact hook copies the working NOTES.md into session-persistent storage before summarization, a context overflow mid-task becomes a resume, not a restart.

Investigation / exploration files as durable handoff artifacts. For work that spans multiple sessions or agents, NOTES.md is too ephemeral. A structured docs/explorations/<name>.md file with a fixed schema (Status / Question / What We Know / Hypotheses / Investigation Log / Open Questions) is the cross-session equivalent. Three benefits:

Agent handoff without state loss. A brainstorm agent producing an exploration file can hand off to a research agent (or the default implementation agent) by name — the file is the contract, not the chat transcript.
Status field as routing signal. Status: brainstorming | exploring | prototyping | decided | abandoned lets the next agent (or the next user) immediately know whether to diverge further, dig deeper, or build.
Compaction-safe. Even if every conversational turn is summarized away, the file is reread at session start by a SessionStart hook that surfaces active investigations.

NOTES.md and exploration files are complementary: NOTES.md is the agent's working memory for this task; the exploration file is the project's durable record of this question.

Timing awareness as an agent blind spot. Agents have no innate sense of how long a command takes. A casual suggestion to "just run the full test suite" might be a 2-second hit or a 5-minute one, and the agent has no basis for that choice. Effective mitigations:

Prefix unknown commands with time until a baseline is observed.
Capture significant output to /tmp/<descriptive>.txt so grep can re-run cheaply without re-executing the slow command.
Stash baselines in repo memory (/memories/repo/timings.md) once observed, so future sessions don't re-measure.
Feed timing back into triage: a <5s command is nearly free to "just run"; a >30s command should reason first.

4.6 Sequential constraint ordering: a stubborn failure

A narrow but instructive case: the user writes "Do X first. Then Y. Then Z." and the agent immediately reads all files for X, Y, and Z upfront, often blowing the context budget before step 1 begins.

Root cause is not a prompt problem; it's a context-engineering problem. RLHF training data contains overwhelming examples of "gather context, then act" — the model has a strong pre-task exploration bias that competes with the user's ordering constraint and usually wins after a few tool calls. Stronger negative phrasing ("DO NOT read all files first!") loses to this trained behavioral prior reliably.

What works, in descending order of effectiveness:

NOTES.md write-first pattern. Structure as: "Complete step 1. Write what you found to NOTES.md. Then read NOTES.md and proceed to step 2." The model cannot write a truthful note about step 1 without doing step 1, which serializes the work.
Imperative checkpoints. "Say STEP 1 DONE before continuing" — the verbalization marker creates a natural serialization point.
Hard step caps in the harness (e.g., OpenCode's steps: 20 + ask gates). Caps in the prompt are interpreted as suggestions.
Sub-agent fan-out for parallel-safe tasks — one sub-agent per file, each with isolated context. Doesn't help strictly sequential tasks.

What does not work: negative constraints ("do not read all files"), repeated reminders (degrade quickly), or soft caps embedded in the prompt.

4.7 Compaction strategy

The Anthropic guidance, replicated independently elsewhere: first maximize recall (capture every relevant piece of context), then improve precision (eliminate superfluous content). A summary that drops a critical fact is worse than a summary that is slightly too long. Iterate on the compaction prompt itself, treating it as a small distinct prompt-engineering task.

The safest first-pass compaction target is stale tool outputs: raw file contents or command outputs whose information has already been acted on. The assistant's response citing them stays; the 500-token file dump does not.

For harnesses with a PreCompact hook: this is the right place to append open todos, active hypotheses, or in-progress file paths to the input so the summary preserves them.

Anchored summary schema. The most reliable production compaction prompt is not free-form — it's a fixed Markdown skeleton with the original prompt preserved verbatim, plus structured sections for clarifications, constraints, progress, decisions, and next steps. A representative shape:

## Original Prompt

- [the user's first prompt, verbatim]

## Clarifications

- [follow-up that refined the original]

## Constraints & Preferences

- [user constraints or "(none)"]

## Progress

### Done / In Progress / Blocked

## Key Decisions

- [decision and why]

## Next Steps

- [ordered actions]

## Critical Context

- [errors, open questions, technical facts]

## Relevant Files

- [path: why it matters]

Three properties that make this work:

Verbatim original prompt. The single most common compaction failure is drift away from the user's actual ask. Anchoring the verbatim text resists this.
Empty sections kept. "(none)" beats omission — the agent post- compaction can tell whether "no blockers" is a fact or an oversight.
Bullets, not prose. Compaction prose tends to drop facts under token pressure; structured bullets degrade more gracefully.

4.8 Attention engineering

A subset of context engineering, focused on where in the context tokens land. Practical heuristics:

Task-critical content goes at the tail of the context (recency bias is strong and consistent across models).
Rules and constraints repeat at both ends — they are forgotten from the middle.
Long tool outputs should be summarized in place once stale rather than scrolled away. The original is gone from effective attention either way; a summary preserves the salient bits.

5. Tools, Skills, and Specs

5.1 The minimalist consensus

The empirically dominant tool set for coding agents has converged to roughly six primitives:

Read file (with line ranges)
Edit file (string-replace or patch)
Search (grep / regex)
Find files (glob)
Shell (bounded, optionally sandboxed)
Todo list (or equivalent state tracker)

Plus, depending on agent surface:

Subagent / task spawner (for read-only exploration)
Web fetch (for docs lookup)
Memory (read/write the tier hierarchy)

5.2 What got absorbed

Tools that were once distinct but are now redundant given a capable shell:

create_file, delete_file, list_dir, move_file — all expressible through edit/shell, and modern models reliably emit the shell forms.
Language-specific linters/formatters — better invoked through shell with the project's actual configuration.
Dedicated test runners — same.

Tools that were supposed to win but didn't:

Browser-automation tools as a default. Useful for frontend verification, rarely critical otherwise.
"Code interpreter" sandboxes as a separate tool from shell. Now usually unified.

5.3 What's still genuinely needed beyond shell

Structured edits. sed -i and awk corrupt files often enough that every serious harness ships a dedicated string-replace or patch tool with whitespace fidelity. This is the single tool that justifies its existence most clearly.
Todo tracking. Could be a file, but a first-class tool gives the harness a UI surface and gives the verify step a checklist.
Subagent spawning with isolated context. Cannot be expressed as shell.

5.4 Tool-count thresholds

Empirical finding (replicated across Anthropic, OpenAI, and independent research): agent performance degrades non-monotonically once the tool list exceeds roughly 40–50 tools. The model spends attention on tool selection rather than the task. Mitigations:

Tool grouping / lazy loading. Surface only relevant tools per phase.
MCP-style tool servers that present a small façade and route internally.
Code-execution-as-tooling (Anthropic's "code as tools" approach, Cursor's similar pattern): expose tools as a small API the model writes code against, rather than as dozens of discrete function-call schemas. Drastically reduces tool-selection overhead for large tool surfaces.

5.5 Skills and the SKILL.md convention

Skills are bounded, on-demand instruction packets — a SKILL.md file with a description: frontmatter field that the model reads in the tool/skill list, plus a body the model loads when it judges the skill relevant. They are the answer to "how do I avoid loading my entire methodology library upfront?"

The format has stabilized as a community standard, with the skills.sh registry (Vercel Labs, 2025) as a public distribution channel: Anthropic's frontend-design skill (≈367k installs), skill-creator, Vercel's React/composition skills, Supabase's Postgres skills. Install via npx skills add <owner>/<repo>. Treat installed skills like third-party npm packages: review before using.

Key principles for authoring skills:

Progressive disclosure. A debugging skill loaded into a refactoring request is context pollution. Skills load at invocation time, not session start.
Create reactively. The right trigger for a new skill is "the agent failed this same task type twice." Anticipatory skill creation is premature context inflation.
Methodologies, not project rules. Project-specific rules go in nested AGENTS.md; reusable methodologies (how to research, how to brainstorm) go in skills.

Skills vs Hooks — diagnostic guide. The two layers are complementary, not competing: a skill triggers → the model reads it → the model acts → a hook validates the action → the model corrects if the hook exits non-zero.

	Skills	Hooks
Layer	Context Engineering	Harness Engineering
What it is	Progressive disclosure of task-specific knowledge	Deterministic event-triggered execution
Loaded when	Task type activates it (on demand)	Tool-call boundaries (always)
Activated by	Model routing decision	System event (pre/post-tool, session start)
Failure mode	Pollutes context if loaded too broadly	Breaks agent loop if too noisy
Success behavior	Silent — enriches context	Silent — only speaks on failure
Create when	Agent fails same task type twice	Need deterministic enforcement

If in doubt: use a hook when the rule must hold regardless of model judgment; use a skill when the rule only applies to a specific task type that the model should route into.

5.6 Spec-driven development (OpenSpec)

OpenSpec (Fission AI, 2025) introduced a workflow where machine-readable specs (RFC 2119 SHALL/SHOULD/MAY + Gherkin scenarios) live alongside code, and each PR produces a "spec delta" showing requirement changes next to the diff. Supported by Claude Code, Cursor, Copilot, Codex, and 16+ tools.

The valid critique — "isn't this just waterfall?" — OpenSpec answers cleanly: the spec is not meant to be complete before coding starts; it's co-evolved with the code. "Good enough plan + update as you go" is the Agile reading. This is the same plan-then-iterate pattern from §3.2 applied at the requirement level rather than the function level.

When it helps: features with complex, multi-stakeholder requirements where code review benefits from being intent-first rather than diff-first. When it doesn't: infrastructure work, one-off scripts, or codebases where intent is adequately captured by tests.

5.7 MCP as portable deferred loading

The Model Context Protocol (MCP) has emerged as the cross-tool standard for two deferred-loading patterns that previously required tool-specific machinery:

MCP tools ↔ skills. A tool description is the routing signal; the model decides whether to invoke. This is what VS Code Copilot's SkillsContextComputer does internally with file-based .github/skills/<name>/SKILL.md, but MCP makes it portable.
MCP prompts ↔ instructions / slash commands. Exposed via prompts/list; bodies load only at invocation. The portable equivalent of Copilot's InstructionsContextComputer behavior for description:-only .instructions.md files.

Practical implication: prefer MCP tools/prompts over tool-specific deferred-loading mechanisms when targeting multiple harnesses. A description:-only .instructions.md file is deferred-loaded in Copilot but becomes always-on context pollution everywhere else. MCP avoids that asymmetry.

The protocol does not yet have lifecycle hooks (session start, post-tool-use, session end). Active work — SEP-2624 (Interceptors, formal working group with Bloomberg + Saxo Bank engineers) and SEP-2282 (server-declared behavioral hooks) — aims to close this gap in upcoming spec revisions. Until then, session-lifecycle behavior lives in harness-specific plugin layers (OpenCode plugins, Copilot hooks).

6. Local Agents and Models

6.1 When local makes sense

Confidentiality: code or data that cannot leave the network.
Cost at scale: sustained heavy agent use (millions of tokens/day per developer) eventually beats API pricing on amortized hardware.
Customization: fine-tuning on house style, internal frameworks, or domain-specific patterns.
Offline / air-gapped.

When local does not make sense: occasional use, capability-frontier work, single developers without dedicated hardware. The opportunity cost of slower, weaker output usually exceeds API costs.

6.2 Hardware reality (mid-2026)

VRAM	Practical ceiling for coding-grade quality
24 GB	Q4 of 30–32B dense, or Q4 of 30B-A3B MoE. Usable for narrow subagents.
48 GB	Q4 70B dense, Q5–Q6 32B dense, MoE up to ~100B total params at Q4.
80 GB	Q8 70B dense, Q4–Q5 of 200B+ MoE.
2× 80 GB	Frontier open-weight MoE (DeepSeek-V3, Qwen3-Coder-480B) at Q4–Q5.

Apple Silicon with unified memory (128–512 GB) is a credible alternative for MoE inference, where bandwidth, not raw FLOPs, dominates. NVIDIA still leads on prompt processing throughput.

6.3 Quantization

Updated rules of thumb (the conventional wisdom from 2023 — "Q4 is fine" — has been refined considerably):

FP16 / BF16: reference quality.
Q8 / FP8: indistinguishable from FP16 in practice for coding tasks. Default if memory permits. GGUF Q8_0 loses roughly 0.1–0.3% on most benchmarks versus BF16 — not a meaningful degradation vector by itself.
Q6_K: the practical sweet spot. ≤1% quality loss on coding benchmarks for ≥30B models.
Q5_K_M: acceptable for ≥30B. Visible degradation below 14B.
Q4_K_M: the lowest viable quant for serious coding agents on ≥30B models. Below this, tool-call fidelity collapses faster than raw output quality.
AWQ / GPTQ: for GPU-only inference, often higher quality than equivalent GGUF Q4 due to per-channel calibration.
KV-cache quantization (Q8 KV) is often higher-leverage than weight quantization for long-context coding tasks. Underused; under-documented in 2024-era guides. Critical reality: with FP16 KV cache, a 9B model at 32k context burns ≈4 GB just for KV — the KV cache, not weight precision, is the dominant runtime memory constraint at long contexts. Quantize it.

6.4 Small-model failure modes and harness mitigations

For any agent driving a ≤14B model (quantized or not), the failure surface is distinct from frontier models. The model's parameter count is the primary cause; quantization is a minor amplifier. The most important patterns:

Instruction drift past ~12k tokens. Rules stated in the system prompt hold for the first 5–10 tool calls, then erode. Smaller models have fewer attention heads (Qwen3-8B: 32 heads vs Qwen3-32B's 64), so per-token attention fidelity degrades faster as context length grows. Mitigations:

Tool-response history pruning (PostToolUse hook). Once a tool result has been acted on, clear its raw content; keep the assistant's citation. The single highest-leverage harness change for small models.
Compaction trigger at 60% fill (not the default 80–90%). Small models hit the quality cliff earlier; aggressive compaction keeps each window shorter and fresher.
Periodic system-prompt echo. Every N tool calls, inject the 3 most critical rules at the context tail as a <reminder> block.

Tool-call JSON malformation. Smaller models have narrower "format channels" — less capacity to track content and strict syntax simultaneously, especially in long contexts. Mitigations:

PreToolUse JSON validation with schema-specific errors. Generic errors ("invalid tool call") cause retry loops; schema-specific errors guide correction:
```
Tool call JSON was invalid at position 47 (unexpected comma).
Required schema: {"path": string, "limit": number}
```
Grammar-constrained decoding. GBNF (llama.cpp), Outlines, or lm-format-enforcer pin generation to a valid schema at the decode step. More reliable than re-prompting.
Trim tool responses to minimum fields. For read_file, return content and line range, not metadata. Fewer tokens per response = less schema to track in working memory.

Tool-selection errors past ~15 tools. Working memory for "which tools exist" degrades faster than for frontier models. Mitigations: minimum viable tool set; consistent tool-name prefixes (file_read, file_write, file_search); PreToolUse name validation that returns the available list on a miss.

Think-block runaway. Reasoning-trained small models can emit 2k–5k token <think> blocks for a tool call that needed 50 tokens of reasoning. In a 32k context, this consumes budget faster than tool outputs. Mitigations: num_predict cap (e.g., 2048) in the modelfile; observability hooks that log think-block length and flag outliers.

Context-window cliff at ~20k+. Output quality drops noticeably (not catastrophically) past 60–70% fill on a 32k model — the pre-training data was likely concentrated in shorter sequences. Mitigations: context-pressure injection at ≥70% fill — the harness mechanically prepends:

[CONTEXT PRESSURE: ~70% full. Be concise. Prefer targeted tool calls over
broad ones. Write current progress to NOTES.md before proceeding.]

plus the early-compaction trigger above.

Training-distribution mismatch. Most open-weight coding models are heavily Python/JavaScript. TypeScript-specific patterns (generic constraints, conditional types, module augmentation, satisfies, complex inference) are less reliable than equivalent Python. Mitigation: SYSTEM directives that force grounding ("read tsconfig.json before asserting TypeScript configuration"; "read existing type definitions before suggesting new ones"), plus explore-subagent delegation for type-heavy work to isolate the exploration to a fresh context window.

Prompt ambiguity → wrong directory (parametric knowledge conflict). Small models with narrower training distributions resolve ambiguous nouns ("the five hook files") to the most common referent in their training data (.husky/ for "hook files" in a Node.js repo) rather than the project-specific one (.agents/hooks/). The correct files may appear in tool output but not be selected. This is a specific instance of parametric knowledge conflict: the model's trained association competes with project-specific context and frequently wins when prior confidence is high.

Prompt engineering is a subpar fix here. Telling the model "hook files means .agents/hooks/" in AGENTS.md loses to a strong trained prior, especially under context pressure (lost-in-the-middle degrades instruction recall). Two bodies of research clarify why and what works instead:

ClashEval (Wu, Wu, Zou 2024, arXiv:2404.10198) benchmarks this exact tug-of-war across six LLMs. Key finding: the less confident a model is in its prior, the more likely it is to defer to retrieved context. Corollary: specific, concrete contextual evidence is far more effective at overriding a prior than an instruction to prefer context. A file listing showing the actual paths removes the model's need to resolve the ambiguous noun at all.
Onoe et al. (ACL 2023, arXiv:2305.01651) study knowledge propagation in LLMs. Finding: gradient-based fine-tuning on new facts ("for this project, hook files are in .agents/hooks/") shows little propagation — the injected fact does not generalize to new usage patterns. Prepending entity definitions in context outperforms parameter-level injection across all settings. The practical instruction: inject evidence, don't update weights.

What works, in order of effectiveness:

Context grounding via automatic structure injection. A UserPromptSubmit hook that appends a <project-file-map> block to every build-local prompt — listing actual files under .agents/, .opencode/, and other project-specific directories — removes the ambiguity entirely. The model sees real paths; the trained prior is not consulted. This is the harness analog of Aider's repo-map (Gauthier 2023), which injects a compressed AST-derived structure map with every request for the same reason. Implementation: the hook runs find .agents -name "*.sh" -o -name "*.md" | sort and prepends the result as a structured block at the prompt tail.
Automatic disambiguation expansion. When the hook detects category nouns ("hook", "config", "agent") without an explicit path in the user's prompt, expand the noun inline before the model sees it. Example: "the hook files" → "the hook files (.agents/hooks/pre-tool-use.sh, .agents/hooks/post-tool-use.sh, ...)". This converts a high-confidence prior lookup into a zero-ambiguity ground truth.
Explicit path in user prompts. Still useful as a secondary layer, but should not be the only mitigation. Include the explicit path when writing build-local tasks ("the .agents/hooks/*.sh files"). Do not rely on the model inferring project conventions from context alone.

What does not work: repeating the mapping in AGENTS.md or system prompts ("hook files live in .agents/hooks/") — this is instructional and degrades under context pressure. Temperature reduction does not help with noun resolution and may hurt tool-call schema compliance on Qwen3-class models.

Other forms of parametric knowledge conflict — and whether structure injection handles them.

File paths are a low-to-medium confidence prior. The model knows .husky/ is common, but doesn't know your specific project layout, so it defers readily to injected evidence. Structure injection works because the prior is weak. The following conflict types have higher confidence priors and require different harness tools. The pattern from ClashEval holds throughout: match intervention strength to prior confidence.

Conflict type	Example	Prior confidence	Does structure injection help?	What actually works
Structural identity	`.husky/` vs `.agents/hooks/`	Low–medium	✅ Yes — file listing resolves ambiguity	`UserPromptSubmit` hook appends file map
Framework semantics	React patterns in a Solid.js project	High	⚠️ Partially — seeing Solid.js files in the map signals the framework, but doesn't show the API	Inline code examples at prompt tail (`createSignal`, `createMemo` shown in use); `PostToolUse` pattern check for React imports
Import path conventions	`../../packages/core` vs `@cantrips/remnant-core`	Medium	⚠️ Partially — package.json injection exposes aliases	Inject `tsconfig.json` paths section and package.json `imports`/`exports` map at session start
Async convention	`async/await` vs the callback pattern this project uses	Very high	❌ No — file listing doesn't convey behavioral convention	Code example injection (show a canonical callback-pattern function from the codebase); `PostToolUse` grep for `async` in files that should use callbacks
Error handling	Throwing exceptions vs returning error results	Very high	❌ No	Same as async: inject a canonical example; `PreToolUse` or `PostToolUse` grep for `throw new` in use-case files
Command invocations	`npx jest` vs `npm test`, `docker-compose` vs `docker compose`	Medium–high	❌ No	`PreToolUse` hard block + redirect — the incorrect command is interceptible before execution; this is the cleanest fix because the error is structural

The general principle: structure injection handles structural identity conflicts only. For semantic, convention, and behavioral conflicts — where the model has deep training-data confidence in a competing pattern — the effective interventions are (a) concrete code examples at the prompt tail (activates pattern-matching against actual code rather than fighting a prior with instructions) and (b) PostToolUse pattern validation (catches violations immediately, in the high-attention context tail). PreToolUse blocks are the right tool only when the incorrect behavior is interceptible as a specific command or schema.

For the highest-confidence conflicts (async conventions, error handling idioms), the Onoe et al. finding is most actionable: descriptions in AGENTS.md don't propagate. A single concrete example from the actual codebase, injected at the tail, outperforms any amount of prose instruction.

Silent catch blocks mask enforcement failures completely. Any try/catch around a tool-call enforcement path that returns a safe default (e.g., '') will silently disable enforcement when the underlying API changes. This is not a small-model failure — it affects the harness itself. Mitigation: log all caught errors to a debug file during development and verify the log is empty before removing debug code. Never assume a hook or enforcement layer is working; confirm with a test call.

Scope-detection via todo-list interception. When a small model attempts a broad refactor it should not handle, it will typically call manage_todo_list with many items to plan the work. A PreToolUse hook that blocks manage_todo_list calls with ≥4 items and returns a specific error message ("this task is too broad — tell the user and stop") consistently causes the model to report scope and stop, rather than proceeding. This is more reliable than relying on the model's own Rule 5 compliance. Anthropic's pattern for this is "guardrails via parallelization" (a separate model screens requests alongside the working model); a hook-based deny is a lighter-weight equivalent.

Poka-yoke tool design (Anthropic, 2024). The harness should make incorrect tool usage structurally harder, not just instructionally forbidden. Examples: requiring absolute file paths (eliminates cwd-relative errors), enforcing limit on every read call via a blocking hook (eliminates accidental full-file reads), requiring explanation and goal fields on terminal calls (forces pre-action reasoning). These structural constraints outperform equivalent instruction-only approaches because they fire at the API boundary and are not subject to instruction drift.

Sampling parameters matter more. Qwen3's documented thinking-mode defaults are temperature=0.6, top_p=0.95, top_k=20, and these are empirically the right starting point for agentic use as well — lower temperatures (e.g., 0.2) trade reasoning quality and frequently hurt tool-call schema compliance rather than helping, because the model has less headroom to escape a local format error. Earlier guidance suggesting low-temperature defaults for tool-call reliability does not survive A/B testing on Qwen3-class models; keep the documented thinking-mode values unless you measure a specific regression.

Anti-filler-token system prompts. Reasoning-trained small models tend to open <think> blocks with filler ("Okay, let me think about this...", "The user wants...") before any real analysis. Each filler opener wastes 50–150 tokens at the start of every reasoning block, multiplied across tens of tool calls. A direct system-prompt rule — "Open <think> blocks with substantive analysis. Do not begin with filler phrases like 'Okay, let me...' or 'The user wants...'." — measurably trims reasoning length without affecting reasoning quality. The win compounds on a 32k context.

6.4a Reasoning density: getting more out of small local models

A separate question from "how do I keep a small model from breaking?" (§6.4) is "how do I get more reasoning capability out of it without enlarging it?". Recent research converges on four techniques that are particularly suited to local deployment, where additional inference passes are cheap and the alternative (swapping to a frontier model) defeats the reason for going local in the first place.

1. Prefer shorter reasoning chains, not longer ones. The intuitive assumption that more "thinking" helps was directly tested by Hassid et al. (arXiv:2505.17813, "Don't Overthink it"): within a single question, the shortest chains the model produces are up to 34.5% more accurate than the longest, and SFT on short chains beats SFT on long ones. Practical translation:

Cap reasoning-trace lengths at training time (curate short-CoT data) and at inference time (num_predict on <think> blocks, per §6.4).
For test-time scaling on local hardware, short-m@k is the right pattern: generate k reasoning chains in parallel, halt as soon as the first m finish, take majority vote among those m. Hassid reports up to 40% fewer thinking tokens than standard majority voting at equal or better accuracy.
This contradicts the early-2025 "scale test-time compute by extending one long chain" framing (e.g., s1's budget forcing, arXiv:2501.19393). Budget forcing works on 32B+ models; on ≤7B models the evidence increasingly favours shorter chains and parallel sampling. Treat budget forcing as a frontier-model technique.

2. The Small Model Learnability Gap dictates distillation strategy. Li et al. (arXiv:2502.12143) found that models ≤3B do not consistently benefit from long-CoT distillation from larger reasoners — they perform worse than when fine-tuned on shorter, simpler chains better matched to their intrinsic learnability. Their proposed Mix Distillation combines long and short CoT examples (and reasoning from both larger and smaller teachers) and outperforms either alone. The standard "distill from the strongest reasoner you can afford" instinct is wrong for ≤3B targets.

For local-driver training (anything in the 0.5–3B regime), the operational rule is:

Source ~60–70% of CoT data from teachers ≤14B (or from the target model itself after a first round). Use larger teachers (≥30B) for the remaining 30–40%, primarily on harder problems where the smaller teacher is unreliable.
Curate or rewrite teacher outputs to median chain length, not maximum. LIMO (arXiv:2502.03387) showed that 817 strategically-designed "cognitive template" demonstrations beat 100×-larger CoT corpora at the 32B scale; the same logic applies more strongly at smaller scales. Quality and chain-length appropriateness dominate quantity.
The LIMO finding has an important boundary condition the paper states explicitly: it assumes "domain knowledge has been comprehensively encoded during pre-training." A 2B model with weaker domain coverage will not match the same data efficiency — but the directional advice (concise high-quality chains beat verbose mediocre ones) still holds.

3. Blueprint-guided execution as an inference-time density booster. Han et al. (arXiv:2506.08669, ICML 2025 TTODLer-FM) show that LLM-generated structured reasoning blueprints — extracted by a larger model from solved problems and reused as scaffolds — measurably improve small-model accuracy on GSM8K, MBPP, and BBH, with no additional training. The blueprint is a high-level step skeleton ("identify the goal → list known variables → choose the operator type → ..."); the small model fills it in.

For an agentic harness, this maps onto:

A blueprint library keyed by task type (debug, refactor, write-test, search-and-summarize) injected at the prompt tail when the orchestrator classifies the request. The small model is no longer asked to invent a plan from scratch — it executes a known-good plan template, which is the single hardest thing for it to do reliably.
Pairs well with the explore-subagent pattern (§3.4): the orchestrator can generate a blueprint, hand it to the subagent, and recover a 1–2k token summary that's been structurally constrained.

4. Test-time compute scaling is not free, and its effectiveness scales with model size. A persistent failure mode in 2025–2026 deployment writeups is applying frontier test-time-compute patterns (MCTS, Best-of-N with a verifier, extended budget-forced thinking) to ≤7B models and reporting flat or negative results. The Kinetics work and follow-ups consistently find that test-time compute pays off most above ~10–14B parameters, where attention capacity (not raw parameter count) becomes the bottleneck. For smaller models:

Short-m@k with majority voting remains net-positive on local hardware because ternary / small dense inference is cheap. Budget: ≤3 parallel chains.
Verifier-guided search (MCTS / Best-of-N + judge) is rarely worth the cost unless the verifier is also small and runs on the same device. A 7B verifier rating a 2B generator's outputs eats the compute budget the small model was supposed to save.
Extended single-chain thinking is the worst option at this scale — see point 1.

Synthesis. For a sub-7B local model: train on shorter chains, run short-m@k at inference when accuracy matters, inject blueprints when the task type is known, and do not import frontier test-time-compute patterns wholesale. The reasoning-density ceiling for a small model is shaped more by data composition and inference-time structure than by raw model capability.

6.5 Local agent harnesses

OpenCode: the current most-flexible model-agnostic harness. Strong for routing between local and cloud models in a single workflow. Recommended default for users who want control.
Aider: still excellent for diff-based coding, particularly with its repo-map. More limited as a general agent loop.
Cline / Continue / Roo Code: good integrations into VS Code; varying degrees of model-agnostic configuration.
llama.cpp / vLLM / MLX / Ollama: the inference layer. vLLM dominates for GPU throughput; llama.cpp for flexibility and CPU/Apple support; MLX for Mac-native efficiency.

6.6 Pre-configured cloud agents vs local-DIY

The honest comparison:

Pre-configured wins on out-of-the-box capability. Cursor, Claude Code, Windsurf, GitHub Copilot ship with deeply tuned harnesses, hand-curated system prompts, and routing logic that took teams of engineers months to build. A naive local setup will not match this without significant effort.
Local-DIY wins on customizability, privacy, cost-at-scale, and willingness to invest in harness work. The ceiling is higher if you put in the engineering hours; the floor is much lower.

A pragmatic middle path: pre-configured cloud agent as daily driver, local agent for confidential work and bulk tasks. OpenCode is well-suited to this hybrid pattern.

7. Prompt Engineering: Is It Still Relevant?

Mostly: no, not in the 2022–2023 sense. The techniques that used to deliver double-digit accuracy improvements either:

Got partially baked into the models (chain-of-thought via reasoning training, instruction-following via RLHF/RLAIF) — but "baked in" is not the same as "reliable." Even reasoning-trained CoT inherits and entrenches pretraining priors via posterior collapse, especially on subjective tasks (emotion, morality, intent inference — arXiv:2409.06173). Larger reasoning-trained models can anchor harder to a wrong prior under CoT, not softer. Treat "the model will reason its way out of a misread" as a weak intervention, not a built-in safety net.
Got moved into the harness (todo lists, plan/act, structured tool use).

What still matters about prompt construction:

Negative constraints. Frontier labs spend disproportionate effort on "do not do X" rules. Third-party harnesses under-invest here. Important caveat from §4.6: negative constraints lose to deeply trained behavioral priors. They work for novel rules; they fail against "gather context first"-style instincts. Match the rule to the mechanism.
Output-format guarantees. Structured output, schema-constrained generation, JSON mode — these still pay off, especially for tool calls.
Role/boundary definition for subagents. Subagent system prompts are still high-leverage because they shape what compressed report comes back. This is about defining the task contract and the return format, not about injecting an expertise persona (see persona caveat below).
Stable identity across turns. "You are an agent that..." framing has little benefit. The folk claim that "consistent voice and persona instructions reduce drift in long sessions" is uncited and unverified; given that small variations in persona attributes can produce double-digit accuracy drops (Principled Personas, EMNLP 2025), treat persona stability as cosmetic, not load-bearing.
Expertise-ladder prompting for divergent ideation (not accuracy). Community technique, no canonical paper, and now in tension with the persona-prompting empirical literature. When a brainstorming or design task risks collapsing to an "average" LLM answer, enumerating solutions across explicit framings (e.g., "What would a junior engineer propose? What would a senior engineer with deep domain knowledge propose differently? What does an outsider with zero context propose? What assumptions does the senior answer make that the junior doesn't?") can broaden the sample of approaches the model produces. Critical scope limit: recent persona- prompting work (Principled Personas, EMNLP 2025; Persona is a Double-Edged Sword, IJCNLP 2025; arXiv:2512.05858) finds that low-knowledge personas ("layperson," "outsider," "child") often reduce accuracy on factual / reasoning benchmarks, sometimes substantially. The ladder is therefore safe as a divergent-thinking sampler (where high variance is the goal) but must not be used as an accuracy improver, an expertise injector, or the final answer producer. Use it to broaden the candidate set, then evaluate candidates with the un-personified model under an external rubric. If you only have budget for one of these two passes, skip the ladder.

What no longer pays off meaningfully:

Few-shot examples for capable models on common tasks. Often actively harms via spurious pattern-matching.
Elaborate "let's think step by step" preambles for reasoning models — redundant.
"You are an expert in X" puffery. No measurable effect on frontier models, and on small models can be actively harmful via persona-attribute sensitivity (see Principled Personas reference above).
Asking the model to reflect on or critique its own output without an external oracle. Per Huang et al. (arXiv:2310.01798), intrinsic self-correction degrades reasoning performance in the no-oracle setting. The intervention feels productive (and reads well in transcripts) but the measurable effect on correctness is negative. Use only when paired with an external verifier.

8. Verification, Sandboxing, and Safety

8.1 Verification as harness, not prompt

The most reliable indicator of an agent that works is whether the harness forces verification rather than relying on the model to verify itself. Minimal verification steps:

Build/compile after edits.
Test suite execution.
Lint and format.
Diff inspection (does the change touch unrelated areas?).
Git-status awareness before destructive operations.

Three patterns extend the basics:

Block on policy-shaping files. Some files (eslint.config.js, tsconfig.json, deployment configs) shape the rules every other tool call obeys. Edits should require explicit human review even from a trusted agent — a PreToolUse hook that denies edits with an explanatory message ("propose the change; let the user decide") is more reliable than asking the model to remember.
Block on generated files. Files marked .generated.ts (or similar) will be overwritten on next build; an agent edit silently disappears. A PreToolUse hard block with a redirect ("edit the generator script, then run npm run build:core") closes the loop instead of relying on the agent to remember.
Block on documented-anti-pattern commands. sed -i, awk rewrites of code files, rm -rf .wireit, npm install without confirmation, npm run build while the dev server runs (port conflict): all are cheaper to block at the harness than to instruct against in prose. The block message should always include the alternative.

8.2 Sandboxing

Container-level isolation for any agent that runs shell commands autonomously is now table stakes. Docker, Firecracker microVMs, or language-level sandboxes.
Network policy. Egress whitelisting prevents prompt-injection-driven exfiltration.
Filesystem scope. Agents confined to a project directory eliminate a large class of accidents.

8.3 Prompt injection

The unsolved problem of the field. Tool outputs (fetched web pages, file contents from third-party repos, search results) can contain instructions that hijack the agent. Current mitigations are partial:

Treat tool output as data, never as instructions. Easier said than enforced — models cannot fully separate the two.
Egress controls and explicit user confirmation for destructive operations.
Detection layers (a separate classifier model scanning tool output for injection patterns) — partial coverage at best.

Assume injection will succeed eventually. Design the blast radius accordingly.

9. The Self-Improving Harness

A pattern worth its own section because it's underused: the harness should get stronger with every difficult session. The mechanism is a Stop hook that, at session end, prompts the agent itself to reflect on whether the session was unusually hard and, if so, what knowledge would have prevented most of the work.

A representative prompt:

If this session required significant effort (many tool calls, multiple dead-ends, complex investigation): ask yourself what information, if it had existed at the start, would have prevented most of that work. First, determine scope — globally applicable, or specific to certain files / patterns? Then lean toward hooks as the solution: hard stops via PreToolUse, PostToolUse reminders at the relevant boundary, nested AGENTS.md, PreCompact state save, or SessionStart broad reminders. These are all more reliable than root AGENTS.md sections (lost-in-the-middle). Record the insight in the right hook or instructions file, not just in AGENTS.md.

Why this works:

The agent has the freshest signal about what was painful in this session. Asking 12 hours later loses fidelity.
The reflection is gated on effort, so trivial sessions don't bloat the rule set with low-value lessons.
The placement guidance is built into the prompt, so the recorded lesson lands at the right enforcement level (hook ≫ AGENTS.md) instead of defaulting to the easiest place.
Repeated application compounds. A harness that captures one lesson per hard session per developer reaches its expressive ceiling fast, then stays there.

The risk is rule bloat — each session is tempted to record something. Two guardrails: (a) the prompt explicitly says "only record genuinely new insights"; (b) periodic audits remove rules that no longer fire or whose condition has been superseded by a better mechanism.

A related pattern is answer-completeness verification at session end: the Stop hook re-surfaces the user's last prompt (preserved by the UserPromptSubmit hook) and asks the agent to confirm every distinct question was addressed, not just the primary task. Cheap to implement; catches the most common multi-part-prompt failure mode.

10. Operational Guidance (Synthesis)

A pragmatic playbook condensed from the above:

Pick the harness first, model second. A good harness with a mid-tier model beats a great model with a bad harness.
Default to a single agent loop with plan/act/verify. Add subagents only for read-only exploration or fully isolated tasks.
Treat the context window as a budget. Retrieve narrowly, summarize aggressively, place task-critical content at the tail.
Standardize on ~6 tools. Resist tool proliferation. Use MCP-style façades or code-as-tools above ~40 tools.
Force verification in the harness. Never rely on the model to grade itself.
Write AGENTS.md (or equivalent) for your repo. Anti-patterns matter more than positive instructions.
Match model class to task. Reasoning model for planning and diagnosis, non-reasoning for mechanical work, cheap model for grep/summarize subagents.
For local deployment: Q6_K weights, Q8 KV cache, MoE for memory efficiency, grammar-constrained tool calling.
Build a 20-task internal eval suite specific to your codebase. No public benchmark substitutes.
Date-stamp your conclusions. The field moves fast enough that model-specific advice rots in months.

11. Self-Evaluation

A frank assessment of this document's strengths and weaknesses, as instructed:

Strengths

Categories are organized around the real axes of decision-making (harness vs model, local vs cloud, reasoning vs not), not around vendor names, which would have dated faster.
Calls out specific failure modes per model family rather than treating all frontier models as interchangeable.
Acknowledges what used to be true and has been uprooted, per request.
Quantization and hardware guidance reflects mid-2026 reality (KV-cache quant, MoE) rather than the 2023 "Q4 is fine" oversimplification.
Self-contained: a reader without prior context can use it.

Weaknesses and risks

Model-specific claims rot fast. The mid-2026 winners section will likely be wrong in 3–6 months. The framing should survive longer than the specifics.
Citation density is now medium. Primary sources have been added where verifiable (Sharma 2310.13548 sycophancy, Liu 2307.03172 lost-in-the-middle, Pan 2308.03188 self-correction, Zheng 2306.05685 LLM-as-judge, Anthropic Sep 2025 context engineering article). Several claims remain attributed to community sources or unpublished internal evaluations (LangChain harness result on Terminal-Bench 2.0, ETH Zurich AGENTS.md cost study, the 40–50 tool threshold) — directionally trustworthy but a determined reader should verify before quoting.
Possible bias toward the Anthropic / Claude ecosystem. The author of this document is a Claude-family model, and the "Claude leaks" framing reflects an asymmetric leak landscape (Claude prompts leaked more visibly than competitors'). Other labs do similar scaffolding work; the document implicitly under-credits this.
Local-deployment section is hardware-specific and will age as consumer hardware changes (especially NVIDIA generational shifts and Apple's continued unified-memory pushes).
Prompt injection section is appropriately pessimistic but offers limited actionable guidance because the field has limited actionable answers. This is honest but unsatisfying.
Benchmarks section treats Aider polyglot and SWE-Bench Verified as current ground truth. Both will saturate; the criterion ("predicts your repo's results") matters more than the named benchmark.

What I would add with more space

A worked example of a AGENTS.md derived from the negative-instruction principle, contrasted with a typical bloated one.
Concrete numbers on the cost-at-scale crossover point for local hardware vs API usage (these are knowable with reasonable assumptions).
A section on fine-tuning vs RAG vs prompt-only customization, with the cost/benefit thresholds.
Empirical comparisons of grammar-constrained decoding tools (Outlines / GBNF / lm-format-enforcer / function-calling-as-grammar) for tool-call reliability on open-weight models.

Overall confidence

High on the four opening shifts, the Prompt/Context/Harness diagnostic, the cross-model failure modes, and the enforcement hierarchy — all well-replicated and stable.
Medium on the family-specific failure patterns, the tool-count threshold, and the small-model harness mitigation set — directionally correct, specific numbers vary by harness and model.
Lower on specific model winners and exact hardware recommendations — fast-moving facts.

Changelog

Revision 2: integrated repo-internal research notes (Prompt/Context/ Harness taxonomy, ETH Zurich AGENTS.md study, LangChain Terminal-Bench harness result, LLM-as-judge biases, sub-agent tiering, enforcement hierarchy, Plan-and-Solve + Think-Anywhere, just-in-time retrieval, NOTES.md pattern, sequential-constraint-ordering failure, small-model harness mitigations, skills.sh / SKILL.md, OpenSpec, MCP-as-portable-deferred- loading, expertise-ladder prompting). Added primary-source citations where available.
Revision 3: added patterns observed in the repository's own agent configuration (.agents/, hooks, modelfiles): counterbalance agent design (§3.1a), circuit breakers as a first-class primitive (§3.2), falsification-first investigation and dead-ends file (§3.4a), stateful hooks / tool-specific PostToolUse warnings / path-scoped reminders (§3.7), trigger-word nudges as positive-recommendation analog (§3.8), exploration files as durable handoff artifacts and timing awareness (§4.5), anchored compaction schema (§4.7), corrected Qwen3 sampling recommendations and anti-filler-token prompts (§6.4), policy- and generated-file harness blocks (§8.1), self-improving harness via Stop-hook reflection (new §9), and outsider-persona expansion to the expertise-ladder prompt (§7). Old §9–10 renumbered to §10–11.
Revision 4: elevated permission-layer denial above PreToolUse hard blocks in the enforcement hierarchy (§3.6). A permission deny on an agent definition removes the tool from the agent's available-tool set entirely, rather than rejecting a tool call after the agent has chosen to make it. Reflects the local-orchestration plan's structural-enforcement primitive (OpenCode permission: { edit: deny }).
Revision 5: added Skills vs Hooks comparison table to §5.5. Folded unique content from docs/research/agent-infrastructure.md (which is now deleted); everything else in that file was already synthesized in prior revisions.
Revision 6: corrective edits driven by the 2026-05-16 text-intent- interpretation investigation (docs/explorations/text-intent-interpretation- research.md). Three claims revised against new evidence: (a) §2.1 sycophancy reframed as model-family-conditional, not a universal RLHF property, citing nostalgebraist (2023) replication on OpenAI base models; (b) §3.5 intrinsic-self-correction-hurts claim upgraded to cite Huang et al. (arXiv:2310.01798) as the strong primary source, with Pan et al. retained as the survey reference, and rewritten to explicitly call out "ask the model to reflect" as a tempting-but-counterproductive intervention without an external oracle; (c) §7 expertise-ladder prompting scoped down to divergent ideation only and explicitly flagged as in tension with persona-prompting empirical literature (Principled Personas EMNLP 2025; Persona is a Double-Edged Sword IJCNLP 2025; arXiv:2512.05858); CoT-baked-in claim softened to acknowledge posterior collapse on subjective tasks (arXiv:2409.06173); "ask the model to reflect" added to the "no longer pays off" list.

92 KiB Raw Blame History Unescape Escape