- AGENTS.md: design principles, enforcement hierarchy, deferred loading - agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server) - skills/: research methodology (auto-discovered by MCP server) - hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start, stop, pre-compact, user-prompt-submit - frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works as project-local or global plugin), github/hooks.json - mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter (replaces hand-maintained registry); server renamed all-agents - docs/: agent-infrastructure.md (generalized), research docs (7 files), ai_architectures.md, llama-server-cuda-wsl2.md - install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin + AGENTS.md + MCP entry, VS Code global MCP config
515 lines
27 KiB
Markdown
515 lines
27 KiB
Markdown
# How LLMs Interpret Intent in Text Prompts: Evidence-Based Guidance
|
||
|
||
> **Status:** Research synthesis. Companion to
|
||
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
|
||
> — that doc covers humans reading text; this one covers LLMs.
|
||
>
|
||
> **Scope:** Why current frontier and local models misinterpret prompts, what
|
||
> the underlying mechanisms are (training, architecture, quantization, position
|
||
> bias), and which counter-measures have empirical or vendor-documented support.
|
||
>
|
||
> **Models in scope (May 2026):** Claude Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku
|
||
> 4.5; the Qwen2.5, Qwen3, and Qwen3.5 ("qwen35") families including the
|
||
> OmniCoder-9B fine-tune; and the current open-weight engineering tier (DeepSeek
|
||
> V4, Kimi K2.6, GLM-5, Mistral Small 4, Gemma 4).
|
||
>
|
||
> **Audience:** Engineers building agents, prompts, and scaffolding — not
|
||
> first-time LLM users.
|
||
|
||
---
|
||
|
||
## 0. Framing: Why Models Misread Prompts Differently Than Humans Do
|
||
|
||
Humans misread text mostly because of egocentric anchoring and emotional
|
||
projection (see the companion doc). LLMs misread for structurally different
|
||
reasons:
|
||
|
||
- **No persistent self.** Every turn re-derives "intent" from the visible token
|
||
stream. Anything outside the context window doesn't exist.
|
||
- **Distributional priors dominate.** The model's behavior is its training
|
||
distribution conditioned on your tokens. Ambiguity is resolved toward whatever
|
||
was most common in pretraining/RLHF, not toward what you meant.
|
||
- **Style → role.** Models infer _who_ is speaking from textual style rather
|
||
than from cryptographic provenance, which is why prompt injection works at all
|
||
(see §1.4). [13]
|
||
- **Quantization, depth, and routing change behavior under load**, not cleanly
|
||
and not always at the points you'd expect (see §3).
|
||
|
||
The practical consequence: the levers that work on humans (charity, delay,
|
||
perspective-taking) have direct analogs for LLMs — structured context, explicit
|
||
scope, separated reasoning — but for very different mechanistic reasons.
|
||
|
||
---
|
||
|
||
## 1. The Core Problem (Why This Is Hard)
|
||
|
||
### 1.1 Models resolve ambiguity toward the training prior
|
||
|
||
When intent is underspecified, models fall back to whatever the training
|
||
distribution made most likely. Anthropic explicitly documents that **Opus 4.7 is
|
||
more literal than 4.6**: it will not silently generalize an instruction from one
|
||
item to another, and will not infer requests you didn't make. [1] The upside is
|
||
precision; the downside is that prompts that worked on 4.6 by relying on
|
||
"obvious" generalization may stop working. Stating scope explicitly ("apply to
|
||
every section, not just the first") is now required, not optional.
|
||
|
||
### 1.2 Instruction following is not bit-width monotonic
|
||
|
||
Quantization does not uniformly degrade behavior. The Llama-3.1-8B-Instruct GGUF
|
||
sweep [3] shows:
|
||
|
||
- **GSM8K (reasoning):** F16 baseline 77.6; Q3*K_S drops to 68.3 (−9.3);
|
||
Q4_K_S/M essentially match baseline; Q5/Q6/Q8 sometimes \_exceed* F16.
|
||
- **IFEval (instruction following):** F16 baseline 78.9; Q3*K_S drops to 73.9,
|
||
but Q4_K_S \_improves* to 80.3 and Q5_0 to 80.1. Q6_K drops to 77.6 and Q8_0
|
||
sits at 78.8 — i.e., higher bit-width does not guarantee better compliance.
|
||
|
||
**Practical floor:** for agentic / tool-using workflows, **4–5 bit K-quants
|
||
(Q4_K_M, Q5_K_M) are the safe band**; 3-bit risks reasoning collapse; 8-bit is
|
||
not automatically "best" for instruction following.
|
||
|
||
### 1.3 Long-context attention is U-shaped ("lost in the middle")
|
||
|
||
Liu et al. (TACL 2024) showed performance is highest when relevant information
|
||
is at the **beginning** or **end** of the context, with a sharp dip in the
|
||
middle — even for explicitly long-context models. [4] The effect persists across
|
||
Claude, GPT, and Llama lineages through early 2026. [5] Mechanism: training
|
||
documents are mostly short, and when long, important content tends to sit at the
|
||
boundaries; the model never learns strong middle-extraction habits.
|
||
|
||
**Implication:** the position of an instruction inside a 200K-token context
|
||
matters more than its wording. Put critical instructions at the top or just
|
||
before the user turn, not buried in the middle of system context.
|
||
|
||
### 1.4 Role confusion: style determines authority
|
||
|
||
Models do not robustly track _where text came from_; they infer the role of each
|
||
span from stylistic cues. Recent work on "CoT Forgery" [13] demonstrates that
|
||
injected reasoning traces that look like the model's own scratchpad inherit the
|
||
trust the model places in its own thoughts — external text, by contrast, is
|
||
normally scrutinized and rejected. This is the structural reason prompt
|
||
injection in tool outputs works.
|
||
|
||
**Implication:** any content you don't fully trust (tool output, fetched web
|
||
content, user-pasted text) must be wrapped in unambiguous structural markers,
|
||
and the model must be told what kind of content it is and how much authority it
|
||
carries.
|
||
|
||
### 1.5 Sycophancy / agreement bias
|
||
|
||
Some RLHF'd models lean toward agreeing with the user's framing, especially when
|
||
the user states a belief or pushes back. Sharma et al. (ICLR 2024) [14] found
|
||
this across five SOTA assistants and traced it to human preference labels
|
||
favoring agreement. **Important caveat:** the original Perez et al. (2022)
|
||
finding that sycophancy appears even at zero RLHF steps did **not** replicate
|
||
across model families — nostalgebraist (2023) [15] showed OpenAI base models are
|
||
not sycophantic at any size. So this is model-family- and
|
||
training-data-specific, not a universal RLHF property. Mitigations: ask for the
|
||
model's best answer _before_ revealing your view; explicitly invite
|
||
disagreement; in agent prompts, instruct "persist through genuine blockers; do
|
||
not pivot just because the previous attempt failed."
|
||
|
||
**Stronger mitigation — context isolation (S2A):** System 2 Attention (Weston &
|
||
Sukhbaatar, 2023) [20] shows that asking the LLM to first _rewrite_ its input
|
||
context — extracting only the portions relevant to the current query and
|
||
discarding irrelevant or opinionated material — measurably reduces sycophancy
|
||
and improves factuality across QA, math word problems, and longform generation.
|
||
The mechanism is direct: soft attention in Transformers is susceptible to
|
||
incorporating irrelevant prior context; explicit isolation severs the anchor
|
||
before generation. In a harness context, the full two-pass S2A (rewrite then
|
||
respond) requires a second LLM call; the lightweight equivalent is placing an
|
||
explicit current-question marker at the context tail (recency- bias zone), which
|
||
isolates the current query from prior anchor answers without a second inference
|
||
pass.
|
||
|
||
---
|
||
|
||
## 2. Highest-Leverage Counter-Practices
|
||
|
||
Ranked by effect size and breadth of support across vendor docs, peer- reviewed
|
||
work, and field practice.
|
||
|
||
### 2.1 Be literal and explicit; state scope
|
||
|
||
Anthropic's official guidance for 4.6/4.7: "Claude responds well to clear,
|
||
explicit instructions. Being specific about your desired output can help enhance
|
||
results. If you want 'above and beyond' behavior, explicitly request it rather
|
||
than relying on the model to infer it from vague prompts." [1] This is the
|
||
single most-cited lever in their docs.
|
||
|
||
Apply equally to Qwen3-class local models, whose Apache-2.0 instruct tunes are
|
||
now competitive at instruction-following but show the same literal-by-default
|
||
behavior as Claude 4.7. [2]
|
||
|
||
### 2.2 Use XML (or unambiguous) structural tags around heterogeneous content
|
||
|
||
Wrapping each kind of input — instructions, examples, retrieved context, user
|
||
query, tool output — in its own tag reduces misinterpretation because the model
|
||
can attend to "tag boundaries" rather than guessing where one block ends and
|
||
another begins. [1] This is the cheapest mitigation for §1.3
|
||
(lost-in-the-middle) and §1.4 (role confusion) simultaneously.
|
||
|
||
### 2.3 Provide context and motivation, not just the instruction
|
||
|
||
Vendor-documented (Anthropic) and consistently effective: explaining _why_
|
||
improves targeting. [1][6] Mechanism: motivation tokens disambiguate which
|
||
training prior to condition on. A request to "make this shorter" with context
|
||
"for a P0 incident page, every line costs attention" lands in a different region
|
||
of model behavior than the same request without justification.
|
||
|
||
### 2.4 Prefer general reasoning instructions over prescriptive steps —
|
||
|
||
**for reasoning-capable models**
|
||
|
||
Anthropic: "A prompt like 'think thoroughly' often produces better reasoning
|
||
than a hand-written step-by-step plan. Claude's reasoning frequently exceeds
|
||
what a human would prescribe." [1] Qwen3's thinking mode is similarly designed
|
||
to be triggered with light cues (`/think`) rather than micromanaged. [2]
|
||
|
||
For **non-reasoning** models (or thinking-off mode), the Prompting Science
|
||
Report 2 [7] finds chain-of-thought provides only a small average boost and
|
||
**increases variance** — sometimes flipping previously-correct answers to wrong.
|
||
For reasoning models the explicit CoT request is essentially zero-value and just
|
||
burns tokens.
|
||
|
||
**Additional caveat — subjective tasks:** arXiv:2409.06173 (2024) [16] shows CoT
|
||
suffers from _posterior collapse_: the format of CoT retrieves reasoning priors
|
||
that remain relatively unchanged despite the evidence in the prompt. This is
|
||
especially pronounced on subjective tasks (emotion, morality) and on larger
|
||
models. So for intent-interpretation tasks — exactly the kind this doc is about
|
||
— CoT may actively entrench the model's prior reading rather than update it on
|
||
new evidence. Prefer perspective-taking prompts (see §2.4a) or
|
||
clarifying-question prompts over generic "think step by step" for ambiguous
|
||
intent.
|
||
|
||
### 2.5 Calibrate reasoning length to task complexity
|
||
|
||
"When More is Less" (Wang et al., 2025) [8] established an inverted-U: accuracy
|
||
rises with CoT length, then declines as error accumulation outpaces
|
||
decomposition benefit. Optimal length _increases_ with task difficulty and
|
||
_decreases_ with model capability. Practical rules:
|
||
|
||
- For Claude adaptive thinking (4.6/4.7): set the `effort` parameter to match
|
||
task complexity; do not push it higher than needed. [1]
|
||
- For Qwen3: use the `thinking_budget` mechanism rather than letting thinking
|
||
run unbounded. [2]
|
||
- For small local models (≤9B): prefer many short reasoning steps in multiple
|
||
turns over one long monolithic chain.
|
||
|
||
### 2.6 Default-to-action vs. default-to-clarify is promptable
|
||
|
||
Anthropic publishes both directions verbatim. For agent work:
|
||
|
||
> By default, implement changes rather than only suggesting them. If the user's
|
||
> intent is unclear, infer the most useful likely action and proceed, using
|
||
> tools to discover any missing details instead of guessing. [1]
|
||
|
||
For research/exploration work, invert it: instruct the model to clarify or plan
|
||
before acting. The point is that "agentic-ness" is a prompt-controlled dial, not
|
||
a model property.
|
||
|
||
### 2.7 Place critical instructions at the boundaries of the context
|
||
|
||
Direct consequence of §1.3. The top of the system prompt and the position
|
||
immediately preceding the user's most recent turn are the high-attention zones.
|
||
Anthropic, Cursor, and Aider all converge on this in practice — system prompts
|
||
grow at the top, repo-map / recent-turn context grows just before the user
|
||
message.
|
||
|
||
**Stronger form — full context recontextualization (S2A [20]):** if the context
|
||
contains opinionated or anchor-setting material that will skew the answer, the
|
||
boundary-placement advice is necessary but not sufficient. S2A's two-pass
|
||
pattern (rewrite context to strip irrelevant content → generate from rewritten
|
||
context) further reduces the effect of prior anchors. For agent harnesses where
|
||
a second LLM call is too expensive, the single-pass equivalent is an explicit
|
||
current-question isolation instruction injected at the context tail — same
|
||
recency zone, same isolation intent, no extra inference. [20]
|
||
|
||
### 2.8 Truncate and structure tool output aggressively
|
||
|
||
Local-model failure modes documented in this repo's own
|
||
[`agent-infrastructure.md`](../projects/agent-infrastructure.md) match the
|
||
broader pattern: tool-call history is the largest context consumer, and
|
||
untruncated outputs both push content into the lost-in-the-middle zone _and_
|
||
widen the prompt-injection attack surface (§1.4). The repo's ~1500-token
|
||
post-tool-use truncation is consistent with what the Cursor and Aider teams have
|
||
published.
|
||
|
||
### 2.9 Lower temperature for tool-calling / structured output
|
||
|
||
Convergent vendor guidance across Anthropic, Qwen, and Tesslate (OmniCoder): for
|
||
tool-calling and JSON-emitting paths, temperature 0.2–0.4 substantially reduces
|
||
schema violations and hallucinated arguments. [10] This effect is amplified in
|
||
quantized models where sampling noise compounds with quantization noise.
|
||
|
||
### 2.10 Role / persona prompting is at best a weak intervention
|
||
|
||
A 2025 wave of replication-style studies converges on a folklore-busting result:
|
||
assigning expert personas ("you are a senior software engineer…") does not
|
||
reliably improve task performance, and in many cases hurts.
|
||
|
||
- **Principled Personas** (EMNLP 2025) [17]: across 9 SOTA models × 27 tasks,
|
||
expert personas usually give "positive or non-significant" effects, and models
|
||
are **highly sensitive to irrelevant persona details, with drops of almost 30
|
||
percentage points**.
|
||
- **Persona is a Double-Edged Sword** (IJCNLP Findings 2025) [18]: dataset-
|
||
aligned personas can hurt; only _instance_-aligned personas selected per-
|
||
query reliably help.
|
||
- **Persona-prompt evaluation across QA benchmarks** (arXiv:2512.05858) [19]:
|
||
"persona prompts generally did not improve accuracy" across both benchmarks
|
||
tested; low-knowledge personas (layperson, child) actively degrade results.
|
||
|
||
**Practical guidance:** do not rely on personas as a precision lever for intent
|
||
interpretation. If a persona is included for stylistic reasons (tone, register),
|
||
keep it minimal and avoid attributes that are irrelevant to the task. For
|
||
correctness, prefer the levers in §2.1–§2.9.
|
||
|
||
---
|
||
|
||
## 3. Architecture, Parameters, and Quantization — What Actually Changes
|
||
|
||
### 3.1 Parameter count and "emergence"
|
||
|
||
The classical scaling-laws picture (Kaplan, Chinchilla) holds for loss, but
|
||
emergent _capabilities_ are noisier than originally reported. "Distributional
|
||
Scaling Laws for Emergent Capabilities" (2025) [9] shows that at scales near a
|
||
capability threshold, performance across random seeds is **bimodal** — some runs
|
||
acquire the skill, some don't — so "emergence" at a given scale is partly
|
||
stochastic. Bigger models collapse the bimodal distribution and acquire skills
|
||
more reliably.
|
||
|
||
Practical implication for choosing model size:
|
||
|
||
- **≤4B:** reliable for narrow extraction, classification, short agentic steps;
|
||
instruction following degrades sharply with prompt length and as context
|
||
fills.
|
||
- **7–14B (incl. OmniCoder-9B):** the current sweet spot for local engineering
|
||
work. Tool-calling and structured output work reliably when the prompt is
|
||
well-structured; reasoning is acceptable; long- horizon plans drift.
|
||
- **30–70B dense / 100–400B MoE:** comparable behavior to mid-tier cloud models
|
||
on most tasks; remaining gaps are agentic (BrowseComp, TerminalBench, OSWorld)
|
||
where open models still trail. [11]
|
||
|
||
### 3.2 Dense vs. Mixture-of-Experts
|
||
|
||
Shen et al. (ICLR 2024, "FLAN-MoE") [12] established a counter-intuitive result
|
||
that still holds: **MoE models underperform dense models of equivalent FLOPs
|
||
when only directly fine-tuned, but surpass them dramatically after instruction
|
||
tuning** — and benefit _more_ from instruction tuning than dense models do.
|
||
FLAN-MoE-32B beat Flan-PaLM-62B on four benchmarks at ⅓ the FLOPs.
|
||
|
||
Practical implications for prompt design:
|
||
|
||
- MoE models (DeepSeek V4, Kimi K2.6, GLM-5, Qwen3 235B-A22B) are more sensitive
|
||
to instruction _style_ matching their tuning distribution. Clean, structured
|
||
prompts pay off more than on dense models.
|
||
- Routing instability shows up as occasional out-of-distribution responses on
|
||
edge cases. Few-shot examples are an effective stabilizer because they shift
|
||
activation into well-traveled expert combinations.
|
||
- Active-parameter count (e.g., 22B active in Qwen3-235B-A22B) is the better
|
||
predictor of per-token latency and small-task quality than total parameter
|
||
count.
|
||
|
||
### 3.3 Quantization
|
||
|
||
Detailed numbers in §1.2. Summary heuristics:
|
||
|
||
| Bit-width | Reasoning (GSM8K) | Instruction (IFEval) | Recommendation |
|
||
| ----------- | ----------------- | -------------------- | ------------------------------- |
|
||
| Q3_K_S/M | Notable drop | Variable, often drop | Avoid for agents |
|
||
| Q4_K_S/M | ~Baseline | Often ≥ baseline | Default for local agents |
|
||
| Q5_K_M | ≥ Baseline | ≥ Baseline | Best quality/size trade-off [3] |
|
||
| Q6_K | ≥ Baseline | Sometimes slight dip | Use if VRAM allows |
|
||
| Q8_0 / bf16 | Baseline | Baseline | No guaranteed advantage over Q5 |
|
||
|
||
Calibration-aware methods (AWQ, GPTQ with good calibration data, EXL2) generally
|
||
outperform naive GGUF at the same bit-width; for instruction- heavy work, prefer
|
||
K-quants over legacy `_0` / `_1` quants. [3]
|
||
|
||
### 3.4 Architecture variants worth knowing in 2026
|
||
|
||
- **Standard Transformer + GQA:** still the default (Llama, Mistral, most
|
||
Qwen2/2.5).
|
||
- **Hybrid attention (Qwen3.5 / "qwen35" / OmniCoder backbone):** Gated Delta
|
||
Networks interleaved with standard attention; enables efficient 262K native
|
||
context with extension to 1M+. [10] In practice this changes the
|
||
lost-in-the-middle profile somewhat but does not eliminate it — the same
|
||
boundary-placement advice applies.
|
||
- **Thinking-mode fusion (Qwen3):** a single model trained for both reasoning
|
||
and direct response, switched by `/think` and `/no_think` flags in user/system
|
||
messages, with an emergent "stop thinking now" capability used by the
|
||
`thinking_budget` controller. [2]
|
||
|
||
---
|
||
|
||
## 4. Model-Specific Notes (May 2026)
|
||
|
||
### Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Haiku 4.5
|
||
|
||
- **Opus 4.7 is more literal than 4.6 at low effort.** Prompts tuned for 4.6 may
|
||
need scope made explicit on 4.7. [1]
|
||
- Adaptive thinking is the default; do not hand-write step-by-step plans unless
|
||
the task is genuinely procedural. [1]
|
||
- The "default-to-action" / "default-to-clarify" prompt is the highest- leverage
|
||
knob for changing agent behavior without changing model. [1]
|
||
- Subagent delegation (Opus parent → Sonnet/Haiku children) is
|
||
cheaper-and-comparable for isolated subtasks; the parent retains reasoning,
|
||
the children execute.
|
||
|
||
### Qwen3 family (0.6B – 235B, dense + MoE; Qwen3.5 hybrid)
|
||
|
||
- Two-mode model: `/think` and `/no_think` flags toggle reasoning;
|
||
`thinking_budget` caps token spend. [2]
|
||
- Instruction following on Qwen3 instruct surpasses Qwen2.5 instruct, especially
|
||
in non-thinking mode. [2]
|
||
- Multilingual support jumped from 29 languages (Qwen2.5) to 119 (Qwen3). [2]
|
||
- Qwen3.5 (the "qwen35" architecture, base for OmniCoder-9B) introduces hybrid
|
||
Gated Delta + standard attention, 262K native context. [10]
|
||
|
||
### OmniCoder 2 / OmniCoder-9B (Tesslate, Qwen3.5-9B base)
|
||
|
||
- Fine-tuned on 425K agentic trajectories distilled from Claude Opus 4.6,
|
||
GPT-5.3-Codex, GPT-5.4, Gemini 3.1 Pro on Claude Code, OpenCode, Codex, and
|
||
Droid scaffolding. [10]
|
||
- Specifically learned read-before-write, LSP-diagnostic response, and
|
||
minimal-diff edits.
|
||
- Tesslate's own guidance: temperature 0.2–0.4 for agentic / tool use.
|
||
- Failure modes documented in this repo:
|
||
[`agent-infrastructure.md`](../projects/agent-infrastructure.md) §
|
||
"Smaller-scale local models" — narrower training distribution (Python/JS
|
||
heavy), JSON-schema compliance drops as context fills, instruction drift
|
||
faster than larger Qwen3 due to fewer attention heads.
|
||
|
||
### Other engineering-capable local models (2026 tier)
|
||
|
||
- **DeepSeek V4 Pro (Max), Kimi K2.6, GLM-5:** current open-weight ceiling;
|
||
strong on coding/agentic, still trail proprietary models on BrowseComp,
|
||
TerminalBench, OSWorld. [11]
|
||
- **Qwen3.5 397B (Reasoning):** competitive with the above at reasoning-heavy
|
||
work.
|
||
- **Mistral Small 4 (24B, 256K ctx):** best quality-to-resource ratio for
|
||
single-GPU deployments; Apache 2.0.
|
||
- **Gemma 4 31B (256K ctx):** strong LiveCodeBench; single high-end consumer GPU
|
||
viable.
|
||
- **Llama 4 (Maverick/Scout):** now trails the Chinese open-weight leaders on
|
||
benchmarks but retains ecosystem advantages. [11]
|
||
|
||
---
|
||
|
||
## 5. Minimal Operating Checklist
|
||
|
||
When writing a prompt or system message for any of these models:
|
||
|
||
1. **State scope and motivation explicitly.** Don't expect generalization.
|
||
2. **Structure heterogeneous content with tags.** Especially anything from a
|
||
tool or external source.
|
||
3. **Put critical instructions at the boundaries** (top of system, or
|
||
immediately before user turn) — not buried.
|
||
4. **Pick reasoning intensity deliberately.** Adaptive/`thinking_budget` for
|
||
capable models; multi-turn small steps for ≤9B locals; skip forced CoT on
|
||
reasoning models.
|
||
5. **Truncate tool output** and never paste untrusted text without a wrapper
|
||
that names its provenance.
|
||
6. **For tool-calling: lower temperature** (0.2–0.4) regardless of model.
|
||
7. **For local deployments: target Q4_K_M or Q5_K_M.** Verify on IFEval-style
|
||
tests, not just perplexity.
|
||
8. **Ask for the answer before stating your own view** to avoid sycophantic
|
||
agreement.
|
||
|
||
---
|
||
|
||
## 6. What the Evidence Does _Not_ Support
|
||
|
||
- **"Just use a bigger model."** Architecture, instruction tuning, and prompt
|
||
structure account for as much variance as raw parameter count for most
|
||
engineering tasks. [9][12]
|
||
- **"Always use chain-of-thought."** Outdated. Marginal for non- reasoning
|
||
models, near-zero for reasoning models, and CoT _increases answer variance_ —
|
||
flipping some correct answers to wrong. [7][8]
|
||
- **"Higher quantization is always better."** IFEval is not bit-width monotonic;
|
||
Q4_K_S can beat Q8_0 on compliance. [3]
|
||
- **"MoE > dense at equivalent total params."** Without instruction tuning, MoE
|
||
underperforms dense at equal FLOPs. [12]
|
||
- **"Role-play personas reliably steer behavior."** Style-based role cues are
|
||
exactly what prompt-injection attacks exploit; do not rely on persona prompts
|
||
for security boundaries. [13] **Stronger version of this debunk:** persona
|
||
prompts also don't reliably improve _task performance_ — they're often
|
||
ineffective and frequently harmful when persona attributes are even mildly
|
||
irrelevant to the task. [17][18][19] See §2.10.
|
||
- **"Longer reasoning is better reasoning."** Inverted-U on accuracy vs. CoT
|
||
length is well-established. [8]
|
||
|
||
---
|
||
|
||
## 7. Sources
|
||
|
||
The foundational survey of prompting techniques used to cross-check claims in
|
||
this doc is **Schulhoff et al. (2024), _The Prompt Report: A Systematic Survey
|
||
of Prompting Techniques_** (arXiv:2406.06608). PRISMA-based review of 1,565
|
||
papers; taxonomy of 58 text prompting techniques. Cited as [PR] where relevant.
|
||
|
||
1. Anthropic. _Prompting best practices_ (covers Opus 4.7, 4.6, Sonnet 4.6,
|
||
Haiku 4.5). Claude API Docs.
|
||
https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/be-clear-and-direct
|
||
2. Yang, A. et al. (2025). _Qwen3 Technical Report._ arXiv:2505.09388. (Dense +
|
||
MoE family 0.6B–235B; thinking-mode fusion; thinking budget; 119-language
|
||
support.)
|
||
3. _Which Quantization Should I Use? A Unified Evaluation of llama.cpp
|
||
Quantization on Llama-3.1-8B-Instruct._ arXiv preprint. (GSM8K, IFEval, MMLU,
|
||
HellaSwag, TruthfulQA across all GGUF variants.)
|
||
4. Liu, N. F. et al. (2024). _Lost in the Middle: How Language Models Use Long
|
||
Contexts._ TACL 12, 157–173.
|
||
5. The Neural Base. _Lost-in-middle behavior across major models through
|
||
early 2026._ (Replication note; U-shaped curve persists across Claude, GPT,
|
||
Llama.)
|
||
6. Anthropic. _Prompt engineering for business performance._
|
||
https://www.anthropic.com/news/prompt-engineering-for-business-performance
|
||
7. Meincke, L. et al. (2025). _Prompting Science Report 2: The Decreasing Value
|
||
of Chain of Thought in Prompting._ arXiv:2506.07142.
|
||
8. Wang, Y. et al. (2025). _When More is Less: Understanding Chain-of-Thought
|
||
Length in LLMs._ arXiv:2502.07266.
|
||
9. _Distributional Scaling Laws for Emergent Capabilities._ (2025)
|
||
arXiv:2502.17356. (Bimodal performance distributions near capability
|
||
thresholds; "emergence" as stochastic property at scale.)
|
||
10. Tesslate. _OmniCoder-9B model card._ Hugging Face, March 2026. (Qwen3.5-9B
|
||
base; 425K agentic trajectories from Claude Opus 4.6, GPT-5.3-Codex,
|
||
GPT-5.4, Gemini 3.1 Pro; Gated Delta + attention hybrid; 262K context;
|
||
recommended temperature 0.2–0.4 for tool use.)
|
||
https://huggingface.co/Tesslate/OmniCoder-9B
|
||
11. BenchLM.ai. _Best Open Source LLM in 2026: Rankings, Benchmarks, and the
|
||
Models Worth Running._ April 2026. (DeepSeek V4 Pro, Kimi K2.6, GLM-5,
|
||
Qwen3.5 397B, Mistral Small 4, Gemma 4, Llama 4 comparison.)
|
||
12. Shen, S. et al. (2024). _Mixture-of-Experts Meets Instruction Tuning: A
|
||
Winning Combination for Large Language Models._ ICLR. (FLAN-MoE-32B vs
|
||
Flan-PaLM-62B; MoE benefits more from instruction tuning than dense.)
|
||
13. _Role Confusion and CoT Forgery: Stylistic Spoofing as a Prompt- Injection
|
||
Mechanism._ arXiv preprint, 2026. (Models infer roles from style; forged
|
||
reasoning traces inherit self-trust.)
|
||
14. Sharma, M. et al. (2024). _Towards Understanding Sycophancy in Language
|
||
Models._ ICLR 2024. arXiv:2310.13548.
|
||
15. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
|
||
size._ LessWrong.
|
||
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
|
||
(Disconfirms the strongest reading of Perez et al. 2022 for OpenAI base
|
||
models. Not peer-reviewed but the data and code are public.)
|
||
16. _Chain-of-Thought is not all you need: Posterior collapse of CoT under
|
||
distributional shift._ arXiv:2409.06173 (2024). (Larger models anchor harder
|
||
to reasoning priors under CoT, especially on subjective tasks.)
|
||
17. _Principled Personas: Defining and Measuring the Intended Effects of Persona
|
||
Prompting on Task Performance._ EMNLP 2025.
|
||
https://aclanthology.org/2025.emnlp-main.1364/
|
||
18. _Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts
|
||
in Zero-shot Reasoning Tasks._ IJCNLP Findings 2025.
|
||
https://aclanthology.org/2025.findings-ijcnlp.51/
|
||
19. _When personas help and when they don't: A persona-prompt evaluation across
|
||
QA benchmarks._ arXiv:2512.05858 (2025). PR. Schulhoff, S. et al. (2024).
|
||
_The Prompt Report: A Systematic Survey of Prompting Techniques._
|
||
arXiv:2406.06608. PRISMA review of 1,565 papers; taxonomy of 58 prompting
|
||
techniques.
|
||
20. Weston, J. & Sukhbaatar, S. (2023). _System 2 Attention (is something you
|
||
might need too)._ arXiv:2311.11829. (Two-pass technique: LLM first rewrites
|
||
input context to remove irrelevant/opinionated material, then generates
|
||
response from cleaned context. Reduces sycophancy and increases factuality
|
||
on QA, math word problems, and longform generation. The lightweight harness
|
||
equivalent is a current-question isolation instruction at the context tail.)
|