- AGENTS.md: design principles, enforcement hierarchy, deferred loading - agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server) - skills/: research methodology (auto-discovered by MCP server) - hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start, stop, pre-compact, user-prompt-submit - frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works as project-local or global plugin), github/hooks.json - mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter (replaces hand-maintained registry); server renamed all-agents - docs/: agent-infrastructure.md (generalized), research docs (7 files), ai_architectures.md, llama-server-cuda-wsl2.md - install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin + AGENTS.md + MCP entry, VS Code global MCP config
27 KiB
How LLMs Interpret Intent in Text Prompts: Evidence-Based Guidance
Status: Research synthesis. Companion to
text-communication-interpretation.md— that doc covers humans reading text; this one covers LLMs.Scope: Why current frontier and local models misinterpret prompts, what the underlying mechanisms are (training, architecture, quantization, position bias), and which counter-measures have empirical or vendor-documented support.
Models in scope (May 2026): Claude Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5; the Qwen2.5, Qwen3, and Qwen3.5 ("qwen35") families including the OmniCoder-9B fine-tune; and the current open-weight engineering tier (DeepSeek V4, Kimi K2.6, GLM-5, Mistral Small 4, Gemma 4).
Audience: Engineers building agents, prompts, and scaffolding — not first-time LLM users.
0. Framing: Why Models Misread Prompts Differently Than Humans Do
Humans misread text mostly because of egocentric anchoring and emotional projection (see the companion doc). LLMs misread for structurally different reasons:
- No persistent self. Every turn re-derives "intent" from the visible token stream. Anything outside the context window doesn't exist.
- Distributional priors dominate. The model's behavior is its training distribution conditioned on your tokens. Ambiguity is resolved toward whatever was most common in pretraining/RLHF, not toward what you meant.
- Style → role. Models infer who is speaking from textual style rather than from cryptographic provenance, which is why prompt injection works at all (see §1.4). [13]
- Quantization, depth, and routing change behavior under load, not cleanly and not always at the points you'd expect (see §3).
The practical consequence: the levers that work on humans (charity, delay, perspective-taking) have direct analogs for LLMs — structured context, explicit scope, separated reasoning — but for very different mechanistic reasons.
1. The Core Problem (Why This Is Hard)
1.1 Models resolve ambiguity toward the training prior
When intent is underspecified, models fall back to whatever the training distribution made most likely. Anthropic explicitly documents that Opus 4.7 is more literal than 4.6: it will not silently generalize an instruction from one item to another, and will not infer requests you didn't make. [1] The upside is precision; the downside is that prompts that worked on 4.6 by relying on "obvious" generalization may stop working. Stating scope explicitly ("apply to every section, not just the first") is now required, not optional.
1.2 Instruction following is not bit-width monotonic
Quantization does not uniformly degrade behavior. The Llama-3.1-8B-Instruct GGUF sweep [3] shows:
- GSM8K (reasoning): F16 baseline 77.6; Q3K_S drops to 68.3 (−9.3); Q4_K_S/M essentially match baseline; Q5/Q6/Q8 sometimes _exceed F16.
- IFEval (instruction following): F16 baseline 78.9; Q3K_S drops to 73.9, but Q4_K_S _improves to 80.3 and Q5_0 to 80.1. Q6_K drops to 77.6 and Q8_0 sits at 78.8 — i.e., higher bit-width does not guarantee better compliance.
Practical floor: for agentic / tool-using workflows, 4–5 bit K-quants (Q4_K_M, Q5_K_M) are the safe band; 3-bit risks reasoning collapse; 8-bit is not automatically "best" for instruction following.
1.3 Long-context attention is U-shaped ("lost in the middle")
Liu et al. (TACL 2024) showed performance is highest when relevant information is at the beginning or end of the context, with a sharp dip in the middle — even for explicitly long-context models. [4] The effect persists across Claude, GPT, and Llama lineages through early 2026. [5] Mechanism: training documents are mostly short, and when long, important content tends to sit at the boundaries; the model never learns strong middle-extraction habits.
Implication: the position of an instruction inside a 200K-token context matters more than its wording. Put critical instructions at the top or just before the user turn, not buried in the middle of system context.
1.4 Role confusion: style determines authority
Models do not robustly track where text came from; they infer the role of each span from stylistic cues. Recent work on "CoT Forgery" [13] demonstrates that injected reasoning traces that look like the model's own scratchpad inherit the trust the model places in its own thoughts — external text, by contrast, is normally scrutinized and rejected. This is the structural reason prompt injection in tool outputs works.
Implication: any content you don't fully trust (tool output, fetched web content, user-pasted text) must be wrapped in unambiguous structural markers, and the model must be told what kind of content it is and how much authority it carries.
1.5 Sycophancy / agreement bias
Some RLHF'd models lean toward agreeing with the user's framing, especially when the user states a belief or pushes back. Sharma et al. (ICLR 2024) [14] found this across five SOTA assistants and traced it to human preference labels favoring agreement. Important caveat: the original Perez et al. (2022) finding that sycophancy appears even at zero RLHF steps did not replicate across model families — nostalgebraist (2023) [15] showed OpenAI base models are not sycophantic at any size. So this is model-family- and training-data-specific, not a universal RLHF property. Mitigations: ask for the model's best answer before revealing your view; explicitly invite disagreement; in agent prompts, instruct "persist through genuine blockers; do not pivot just because the previous attempt failed."
Stronger mitigation — context isolation (S2A): System 2 Attention (Weston & Sukhbaatar, 2023) [20] shows that asking the LLM to first rewrite its input context — extracting only the portions relevant to the current query and discarding irrelevant or opinionated material — measurably reduces sycophancy and improves factuality across QA, math word problems, and longform generation. The mechanism is direct: soft attention in Transformers is susceptible to incorporating irrelevant prior context; explicit isolation severs the anchor before generation. In a harness context, the full two-pass S2A (rewrite then respond) requires a second LLM call; the lightweight equivalent is placing an explicit current-question marker at the context tail (recency- bias zone), which isolates the current query from prior anchor answers without a second inference pass.
2. Highest-Leverage Counter-Practices
Ranked by effect size and breadth of support across vendor docs, peer- reviewed work, and field practice.
2.1 Be literal and explicit; state scope
Anthropic's official guidance for 4.6/4.7: "Claude responds well to clear, explicit instructions. Being specific about your desired output can help enhance results. If you want 'above and beyond' behavior, explicitly request it rather than relying on the model to infer it from vague prompts." [1] This is the single most-cited lever in their docs.
Apply equally to Qwen3-class local models, whose Apache-2.0 instruct tunes are now competitive at instruction-following but show the same literal-by-default behavior as Claude 4.7. [2]
2.2 Use XML (or unambiguous) structural tags around heterogeneous content
Wrapping each kind of input — instructions, examples, retrieved context, user query, tool output — in its own tag reduces misinterpretation because the model can attend to "tag boundaries" rather than guessing where one block ends and another begins. [1] This is the cheapest mitigation for §1.3 (lost-in-the-middle) and §1.4 (role confusion) simultaneously.
2.3 Provide context and motivation, not just the instruction
Vendor-documented (Anthropic) and consistently effective: explaining why improves targeting. [1][6] Mechanism: motivation tokens disambiguate which training prior to condition on. A request to "make this shorter" with context "for a P0 incident page, every line costs attention" lands in a different region of model behavior than the same request without justification.
2.4 Prefer general reasoning instructions over prescriptive steps —
for reasoning-capable models
Anthropic: "A prompt like 'think thoroughly' often produces better reasoning
than a hand-written step-by-step plan. Claude's reasoning frequently exceeds
what a human would prescribe." [1] Qwen3's thinking mode is similarly designed
to be triggered with light cues (/think) rather than micromanaged. [2]
For non-reasoning models (or thinking-off mode), the Prompting Science Report 2 [7] finds chain-of-thought provides only a small average boost and increases variance — sometimes flipping previously-correct answers to wrong. For reasoning models the explicit CoT request is essentially zero-value and just burns tokens.
Additional caveat — subjective tasks: arXiv:2409.06173 (2024) [16] shows CoT suffers from posterior collapse: the format of CoT retrieves reasoning priors that remain relatively unchanged despite the evidence in the prompt. This is especially pronounced on subjective tasks (emotion, morality) and on larger models. So for intent-interpretation tasks — exactly the kind this doc is about — CoT may actively entrench the model's prior reading rather than update it on new evidence. Prefer perspective-taking prompts (see §2.4a) or clarifying-question prompts over generic "think step by step" for ambiguous intent.
2.5 Calibrate reasoning length to task complexity
"When More is Less" (Wang et al., 2025) [8] established an inverted-U: accuracy rises with CoT length, then declines as error accumulation outpaces decomposition benefit. Optimal length increases with task difficulty and decreases with model capability. Practical rules:
- For Claude adaptive thinking (4.6/4.7): set the
effortparameter to match task complexity; do not push it higher than needed. [1] - For Qwen3: use the
thinking_budgetmechanism rather than letting thinking run unbounded. [2] - For small local models (≤9B): prefer many short reasoning steps in multiple turns over one long monolithic chain.
2.6 Default-to-action vs. default-to-clarify is promptable
Anthropic publishes both directions verbatim. For agent work:
By default, implement changes rather than only suggesting them. If the user's intent is unclear, infer the most useful likely action and proceed, using tools to discover any missing details instead of guessing. [1]
For research/exploration work, invert it: instruct the model to clarify or plan before acting. The point is that "agentic-ness" is a prompt-controlled dial, not a model property.
2.7 Place critical instructions at the boundaries of the context
Direct consequence of §1.3. The top of the system prompt and the position immediately preceding the user's most recent turn are the high-attention zones. Anthropic, Cursor, and Aider all converge on this in practice — system prompts grow at the top, repo-map / recent-turn context grows just before the user message.
Stronger form — full context recontextualization (S2A [20]): if the context contains opinionated or anchor-setting material that will skew the answer, the boundary-placement advice is necessary but not sufficient. S2A's two-pass pattern (rewrite context to strip irrelevant content → generate from rewritten context) further reduces the effect of prior anchors. For agent harnesses where a second LLM call is too expensive, the single-pass equivalent is an explicit current-question isolation instruction injected at the context tail — same recency zone, same isolation intent, no extra inference. [20]
2.8 Truncate and structure tool output aggressively
Local-model failure modes documented in this repo's own
agent-infrastructure.md match the
broader pattern: tool-call history is the largest context consumer, and
untruncated outputs both push content into the lost-in-the-middle zone and
widen the prompt-injection attack surface (§1.4). The repo's ~1500-token
post-tool-use truncation is consistent with what the Cursor and Aider teams have
published.
2.9 Lower temperature for tool-calling / structured output
Convergent vendor guidance across Anthropic, Qwen, and Tesslate (OmniCoder): for tool-calling and JSON-emitting paths, temperature 0.2–0.4 substantially reduces schema violations and hallucinated arguments. [10] This effect is amplified in quantized models where sampling noise compounds with quantization noise.
2.10 Role / persona prompting is at best a weak intervention
A 2025 wave of replication-style studies converges on a folklore-busting result: assigning expert personas ("you are a senior software engineer…") does not reliably improve task performance, and in many cases hurts.
- Principled Personas (EMNLP 2025) [17]: across 9 SOTA models × 27 tasks, expert personas usually give "positive or non-significant" effects, and models are highly sensitive to irrelevant persona details, with drops of almost 30 percentage points.
- Persona is a Double-Edged Sword (IJCNLP Findings 2025) [18]: dataset- aligned personas can hurt; only instance-aligned personas selected per- query reliably help.
- Persona-prompt evaluation across QA benchmarks (arXiv:2512.05858) [19]: "persona prompts generally did not improve accuracy" across both benchmarks tested; low-knowledge personas (layperson, child) actively degrade results.
Practical guidance: do not rely on personas as a precision lever for intent interpretation. If a persona is included for stylistic reasons (tone, register), keep it minimal and avoid attributes that are irrelevant to the task. For correctness, prefer the levers in §2.1–§2.9.
3. Architecture, Parameters, and Quantization — What Actually Changes
3.1 Parameter count and "emergence"
The classical scaling-laws picture (Kaplan, Chinchilla) holds for loss, but emergent capabilities are noisier than originally reported. "Distributional Scaling Laws for Emergent Capabilities" (2025) [9] shows that at scales near a capability threshold, performance across random seeds is bimodal — some runs acquire the skill, some don't — so "emergence" at a given scale is partly stochastic. Bigger models collapse the bimodal distribution and acquire skills more reliably.
Practical implication for choosing model size:
- ≤4B: reliable for narrow extraction, classification, short agentic steps; instruction following degrades sharply with prompt length and as context fills.
- 7–14B (incl. OmniCoder-9B): the current sweet spot for local engineering work. Tool-calling and structured output work reliably when the prompt is well-structured; reasoning is acceptable; long- horizon plans drift.
- 30–70B dense / 100–400B MoE: comparable behavior to mid-tier cloud models on most tasks; remaining gaps are agentic (BrowseComp, TerminalBench, OSWorld) where open models still trail. [11]
3.2 Dense vs. Mixture-of-Experts
Shen et al. (ICLR 2024, "FLAN-MoE") [12] established a counter-intuitive result that still holds: MoE models underperform dense models of equivalent FLOPs when only directly fine-tuned, but surpass them dramatically after instruction tuning — and benefit more from instruction tuning than dense models do. FLAN-MoE-32B beat Flan-PaLM-62B on four benchmarks at ⅓ the FLOPs.
Practical implications for prompt design:
- MoE models (DeepSeek V4, Kimi K2.6, GLM-5, Qwen3 235B-A22B) are more sensitive to instruction style matching their tuning distribution. Clean, structured prompts pay off more than on dense models.
- Routing instability shows up as occasional out-of-distribution responses on edge cases. Few-shot examples are an effective stabilizer because they shift activation into well-traveled expert combinations.
- Active-parameter count (e.g., 22B active in Qwen3-235B-A22B) is the better predictor of per-token latency and small-task quality than total parameter count.
3.3 Quantization
Detailed numbers in §1.2. Summary heuristics:
| Bit-width | Reasoning (GSM8K) | Instruction (IFEval) | Recommendation |
|---|---|---|---|
| Q3_K_S/M | Notable drop | Variable, often drop | Avoid for agents |
| Q4_K_S/M | ~Baseline | Often ≥ baseline | Default for local agents |
| Q5_K_M | ≥ Baseline | ≥ Baseline | Best quality/size trade-off [3] |
| Q6_K | ≥ Baseline | Sometimes slight dip | Use if VRAM allows |
| Q8_0 / bf16 | Baseline | Baseline | No guaranteed advantage over Q5 |
Calibration-aware methods (AWQ, GPTQ with good calibration data, EXL2) generally
outperform naive GGUF at the same bit-width; for instruction- heavy work, prefer
K-quants over legacy _0 / _1 quants. [3]
3.4 Architecture variants worth knowing in 2026
- Standard Transformer + GQA: still the default (Llama, Mistral, most Qwen2/2.5).
- Hybrid attention (Qwen3.5 / "qwen35" / OmniCoder backbone): Gated Delta Networks interleaved with standard attention; enables efficient 262K native context with extension to 1M+. [10] In practice this changes the lost-in-the-middle profile somewhat but does not eliminate it — the same boundary-placement advice applies.
- Thinking-mode fusion (Qwen3): a single model trained for both reasoning
and direct response, switched by
/thinkand/no_thinkflags in user/system messages, with an emergent "stop thinking now" capability used by thethinking_budgetcontroller. [2]
4. Model-Specific Notes (May 2026)
Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Haiku 4.5
- Opus 4.7 is more literal than 4.6 at low effort. Prompts tuned for 4.6 may need scope made explicit on 4.7. [1]
- Adaptive thinking is the default; do not hand-write step-by-step plans unless the task is genuinely procedural. [1]
- The "default-to-action" / "default-to-clarify" prompt is the highest- leverage knob for changing agent behavior without changing model. [1]
- Subagent delegation (Opus parent → Sonnet/Haiku children) is cheaper-and-comparable for isolated subtasks; the parent retains reasoning, the children execute.
Qwen3 family (0.6B – 235B, dense + MoE; Qwen3.5 hybrid)
- Two-mode model:
/thinkand/no_thinkflags toggle reasoning;thinking_budgetcaps token spend. [2] - Instruction following on Qwen3 instruct surpasses Qwen2.5 instruct, especially in non-thinking mode. [2]
- Multilingual support jumped from 29 languages (Qwen2.5) to 119 (Qwen3). [2]
- Qwen3.5 (the "qwen35" architecture, base for OmniCoder-9B) introduces hybrid Gated Delta + standard attention, 262K native context. [10]
OmniCoder 2 / OmniCoder-9B (Tesslate, Qwen3.5-9B base)
- Fine-tuned on 425K agentic trajectories distilled from Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, Gemini 3.1 Pro on Claude Code, OpenCode, Codex, and Droid scaffolding. [10]
- Specifically learned read-before-write, LSP-diagnostic response, and minimal-diff edits.
- Tesslate's own guidance: temperature 0.2–0.4 for agentic / tool use.
- Failure modes documented in this repo:
agent-infrastructure.md§ "Smaller-scale local models" — narrower training distribution (Python/JS heavy), JSON-schema compliance drops as context fills, instruction drift faster than larger Qwen3 due to fewer attention heads.
Other engineering-capable local models (2026 tier)
- DeepSeek V4 Pro (Max), Kimi K2.6, GLM-5: current open-weight ceiling; strong on coding/agentic, still trail proprietary models on BrowseComp, TerminalBench, OSWorld. [11]
- Qwen3.5 397B (Reasoning): competitive with the above at reasoning-heavy work.
- Mistral Small 4 (24B, 256K ctx): best quality-to-resource ratio for single-GPU deployments; Apache 2.0.
- Gemma 4 31B (256K ctx): strong LiveCodeBench; single high-end consumer GPU viable.
- Llama 4 (Maverick/Scout): now trails the Chinese open-weight leaders on benchmarks but retains ecosystem advantages. [11]
5. Minimal Operating Checklist
When writing a prompt or system message for any of these models:
- State scope and motivation explicitly. Don't expect generalization.
- Structure heterogeneous content with tags. Especially anything from a tool or external source.
- Put critical instructions at the boundaries (top of system, or immediately before user turn) — not buried.
- Pick reasoning intensity deliberately. Adaptive/
thinking_budgetfor capable models; multi-turn small steps for ≤9B locals; skip forced CoT on reasoning models. - Truncate tool output and never paste untrusted text without a wrapper that names its provenance.
- For tool-calling: lower temperature (0.2–0.4) regardless of model.
- For local deployments: target Q4_K_M or Q5_K_M. Verify on IFEval-style tests, not just perplexity.
- Ask for the answer before stating your own view to avoid sycophantic agreement.
6. What the Evidence Does Not Support
- "Just use a bigger model." Architecture, instruction tuning, and prompt structure account for as much variance as raw parameter count for most engineering tasks. [9][12]
- "Always use chain-of-thought." Outdated. Marginal for non- reasoning models, near-zero for reasoning models, and CoT increases answer variance — flipping some correct answers to wrong. [7][8]
- "Higher quantization is always better." IFEval is not bit-width monotonic; Q4_K_S can beat Q8_0 on compliance. [3]
- "MoE > dense at equivalent total params." Without instruction tuning, MoE underperforms dense at equal FLOPs. [12]
- "Role-play personas reliably steer behavior." Style-based role cues are exactly what prompt-injection attacks exploit; do not rely on persona prompts for security boundaries. [13] Stronger version of this debunk: persona prompts also don't reliably improve task performance — they're often ineffective and frequently harmful when persona attributes are even mildly irrelevant to the task. [17][18][19] See §2.10.
- "Longer reasoning is better reasoning." Inverted-U on accuracy vs. CoT length is well-established. [8]
7. Sources
The foundational survey of prompting techniques used to cross-check claims in this doc is Schulhoff et al. (2024), The Prompt Report: A Systematic Survey of Prompting Techniques (arXiv:2406.06608). PRISMA-based review of 1,565 papers; taxonomy of 58 text prompting techniques. Cited as [PR] where relevant.
- Anthropic. Prompting best practices (covers Opus 4.7, 4.6, Sonnet 4.6, Haiku 4.5). Claude API Docs. https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/be-clear-and-direct
- Yang, A. et al. (2025). Qwen3 Technical Report. arXiv:2505.09388. (Dense + MoE family 0.6B–235B; thinking-mode fusion; thinking budget; 119-language support.)
- Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct. arXiv preprint. (GSM8K, IFEval, MMLU, HellaSwag, TruthfulQA across all GGUF variants.)
- Liu, N. F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 12, 157–173.
- The Neural Base. Lost-in-middle behavior across major models through early 2026. (Replication note; U-shaped curve persists across Claude, GPT, Llama.)
- Anthropic. Prompt engineering for business performance. https://www.anthropic.com/news/prompt-engineering-for-business-performance
- Meincke, L. et al. (2025). Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting. arXiv:2506.07142.
- Wang, Y. et al. (2025). When More is Less: Understanding Chain-of-Thought Length in LLMs. arXiv:2502.07266.
- Distributional Scaling Laws for Emergent Capabilities. (2025) arXiv:2502.17356. (Bimodal performance distributions near capability thresholds; "emergence" as stochastic property at scale.)
- Tesslate. OmniCoder-9B model card. Hugging Face, March 2026. (Qwen3.5-9B base; 425K agentic trajectories from Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, Gemini 3.1 Pro; Gated Delta + attention hybrid; 262K context; recommended temperature 0.2–0.4 for tool use.) https://huggingface.co/Tesslate/OmniCoder-9B
- BenchLM.ai. Best Open Source LLM in 2026: Rankings, Benchmarks, and the Models Worth Running. April 2026. (DeepSeek V4 Pro, Kimi K2.6, GLM-5, Qwen3.5 397B, Mistral Small 4, Gemma 4, Llama 4 comparison.)
- Shen, S. et al. (2024). Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models. ICLR. (FLAN-MoE-32B vs Flan-PaLM-62B; MoE benefits more from instruction tuning than dense.)
- Role Confusion and CoT Forgery: Stylistic Spoofing as a Prompt- Injection Mechanism. arXiv preprint, 2026. (Models infer roles from style; forged reasoning traces inherit self-trust.)
- Sharma, M. et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR 2024. arXiv:2310.13548.
- nostalgebraist (2023). OpenAI API base models are not sycophantic, at any size. LessWrong. https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size (Disconfirms the strongest reading of Perez et al. 2022 for OpenAI base models. Not peer-reviewed but the data and code are public.)
- Chain-of-Thought is not all you need: Posterior collapse of CoT under distributional shift. arXiv:2409.06173 (2024). (Larger models anchor harder to reasoning priors under CoT, especially on subjective tasks.)
- Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1364/
- Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts in Zero-shot Reasoning Tasks. IJCNLP Findings 2025. https://aclanthology.org/2025.findings-ijcnlp.51/
- When personas help and when they don't: A persona-prompt evaluation across QA benchmarks. arXiv:2512.05858 (2025). PR. Schulhoff, S. et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv:2406.06608. PRISMA review of 1,565 papers; taxonomy of 58 prompting techniques.
- Weston, J. & Sukhbaatar, S. (2023). System 2 Attention (is something you might need too). arXiv:2311.11829. (Two-pass technique: LLM first rewrites input context to remove irrelevant/opinionated material, then generates response from cleaned context. Reduces sycophancy and increases factuality on QA, math word problems, and longform generation. The lightweight harness equivalent is a current-question isolation instruction at the context tail.)