dotfiles/.agents/docs/llm-intent-interpretation.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config
2026-05-22 13:13:43 -04:00

# How LLMs Interpret Intent in Text Prompts: Evidence-Based Guidance
> **Status:** Research synthesis. Companion to
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
> — that doc covers humans reading text; this one covers LLMs.
>
> **Scope:** Why current frontier and local models misinterpret prompts, what
> the underlying mechanisms are (training, architecture, quantization, position
> bias), and which counter-measures have empirical or vendor-documented support.
>
> **Models in scope (May 2026):** Claude Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku
> 4.5; the Qwen2.5, Qwen3, and Qwen3.5 ("qwen35") families including the
> OmniCoder-9B fine-tune; and the current open-weight engineering tier (DeepSeek
> V4, Kimi K2.6, GLM-5, Mistral Small 4, Gemma 4).
>
> **Audience:** Engineers building agents, prompts, and scaffolding — not
> first-time LLM users.
---
## 0. Framing: Why Models Misread Prompts Differently Than Humans Do
Humans misread text mostly because of egocentric anchoring and emotional
projection (see the companion doc). LLMs misread for structurally different
reasons:
- **No persistent self.** Every turn re-derives "intent" from the visible token
stream. Anything outside the context window doesn't exist.
- **Distributional priors dominate.** The model's behavior is its training
distribution conditioned on your tokens. Ambiguity is resolved toward whatever
was most common in pretraining/RLHF, not toward what you meant.
- **Style → role.** Models infer _who_ is speaking from textual style rather
than from cryptographic provenance, which is why prompt injection works at all
(see §1.4). [13]
- **Quantization, depth, and routing change behavior under load**, not cleanly
and not always at the points you'd expect (see §3).
The practical consequence: the levers that work on humans (charity, delay,
perspective-taking) have direct analogs for LLMs — structured context, explicit
scope, separated reasoning — but for very different mechanistic reasons.
---
## 1. The Core Problem (Why This Is Hard)
### 1.1 Models resolve ambiguity toward the training prior
When intent is underspecified, models fall back to whatever the training
distribution made most likely. Anthropic explicitly documents that **Opus 4.7 is
more literal than 4.6**: it will not silently generalize an instruction from one
item to another, and will not infer requests you didn't make. [1] The upside is
precision; the downside is that prompts that worked on 4.6 by relying on
"obvious" generalization may stop working. Stating scope explicitly ("apply to
every section, not just the first") is now required, not optional.
### 1.2 Instruction following is not bit-width monotonic
Quantization does not uniformly degrade behavior. The Llama-3.1-8B-Instruct GGUF
sweep [3] shows:
- **GSM8K (reasoning):** F16 baseline 77.6; Q3_K_S drops to 68.3 (−9.3);
Q4_K_S/M essentially match baseline; Q5/Q6/Q8 sometimes _exceed_ F16.
- **IFEval (instruction following):** F16 baseline 78.9; Q3_K_S drops to 73.9,
but Q4_K_S _improves_ to 80.3 and Q5_0 to 80.1. Q6_K drops to 77.6 and Q8_0
sits at 78.8, so higher bit-width does not guarantee better compliance.

**Practical floor:** for agentic / tool-using workflows, **4–5 bit K-quants
(Q4_K_M, Q5_K_M) are the safe band**; 3-bit risks reasoning collapse; 8-bit is
not automatically "best" for instruction following.
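
The non-monotonic pattern can be checked directly from the IFEval figures quoted above. The following is a minimal sketch: the scores are the ones cited from [3], while the helper names are hypothetical and exist only for this illustration.

```python
# IFEval scores quoted above for Llama-3.1-8B-Instruct GGUF variants [3],
# listed in order of increasing bit-width.
IFEVAL = [
    ("Q3_K_S", 73.9),
    ("Q4_K_S", 80.3),
    ("Q5_0", 80.1),
    ("Q6_K", 77.6),
    ("Q8_0", 78.8),
]
F16_BASELINE = 78.9


def deltas_vs_baseline(scores, baseline):
    """Return (name, score - baseline) pairs, rounded to one decimal."""
    return [(name, round(score - baseline, 1)) for name, score in scores]


def is_monotonic_non_decreasing(scores):
    """True if each score is >= the previous one in bit-width order."""
    values = [score for _, score in scores]
    return all(a <= b for a, b in zip(values, values[1:]))


def best_variant(scores):
    """Return the name of the variant with the highest score."""
    return max(scores, key=lambda item: item[1])[0]
```

On these numbers `is_monotonic_non_decreasing(IFEVAL)` is `False` and `best_variant(IFEVAL)` is `"Q4_K_S"`, which is the point of this section: compliance has to be measured per quant, not inferred from bit-width.
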
### 1.3 Long-context attention is U-shaped ("lost in the middle")
Liu et al. (TACL 2024) showed performance is highest when relevant information
is at the **beginning** or **end** of the context, with a sharp dip in the
middle — even for explicitly long-context models. [4] The effect persists across
Claude, GPT, and Llama lineages through early 2026. [5] Mechanism: training
documents are mostly short, and when long, important content tends to sit at the
boundaries; the model never learns strong middle-extraction habits.
**Implication:** the position of an instruction inside a 200K-token context
matters more than its wording. Put critical instructions at the top or just
before the user turn, not buried in the middle of system context.
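
One way to apply this when assembling a prompt is sketched below. It assumes the context is built from plain strings; `assemble_context` and its parameters are hypothetical names. Restating the critical instruction at both boundaries is a choice made for this sketch, since the guidance above only requires one of the two positions.

```python
def assemble_context(critical, reference_blocks, user_turn):
    """Place critical instructions at both boundaries of the context.

    critical: short instruction text the model must not miss.
    reference_blocks: bulk material (docs, history) that may sit in the middle.
    user_turn: the most recent user message.
    """
    parts = [critical]              # top boundary: start of the context
    parts.extend(reference_blocks)  # middle: the low-recall region
    parts.append(critical)          # restated just before the user turn
    parts.append(user_turn)         # end boundary
    return "\n\n".join(parts)
```

The bulk material still ends up in the middle; the sketch only guarantees that nothing critical depends on the model retrieving it from there.
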
### 1.4 Role confusion: style determines authority
Models do not robustly track _where text came from_; they infer the role of each
span from stylistic cues. Recent work on "CoT Forgery" [13] demonstrates that
injected reasoning traces that look like the model's own scratchpad inherit the
trust the model places in its own thoughts — external text, by contrast, is
normally scrutinized and rejected. This is the structural reason prompt
injection in tool outputs works.
**Implication:** any content you don't fully trust (tool output, fetched web
content, user-pasted text) must be wrapped in unambiguous structural markers,
and the model must be told what kind of content it is and how much authority it
carries.
### 1.5 Sycophancy / agreement bias
Some RLHF'd models lean toward agreeing with the user's framing, especially when
the user states a belief or pushes back. Sharma et al. (ICLR 2024) [14] found
this across five SOTA assistants and traced it to human preference labels
favoring agreement. **Important caveat:** the original Perez et al. (2022)
finding that sycophancy appears even at zero RLHF steps did **not** replicate
across model families — nostalgebraist (2023) [15] showed OpenAI base models are
not sycophantic at any size. So this is model-family- and
training-data-specific, not a universal RLHF property. Mitigations: ask for the
model's best answer _before_ revealing your view; explicitly invite
disagreement; in agent prompts, instruct "persist through genuine blockers; do
not pivot just because the previous attempt failed."
**Stronger mitigation — context isolation (S2A):** System 2 Attention (Weston &
Sukhbaatar, 2023) [20] shows that asking the LLM to first _rewrite_ its input
context — extracting only the portions relevant to the current query and
discarding irrelevant or opinionated material — measurably reduces sycophancy
and improves factuality across QA, math word problems, and longform generation.
The mechanism is direct: soft attention in Transformers is susceptible to
incorporating irrelevant prior context; explicit isolation severs the anchor
before generation. In a harness context, the full two-pass S2A (rewrite then
respond) requires a second LLM call; the lightweight equivalent is placing an
explicit current-question marker at the context tail (recency-bias zone), which
isolates the current query from prior anchor answers without a second inference
pass.
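
Both variants can be sketched in a few lines. This is an illustration under stated assumptions, not the prompt wording from the S2A paper: `llm` stands for any callable that maps a prompt string to a completion string, and the instruction text, tag name, and function names are hypothetical. `s2a_answer` costs two inference calls; `tail_marker` costs none beyond the normal request.

```python
from typing import Callable

# Hypothetical rewrite instruction; the S2A paper uses its own wording.
REWRITE_INSTRUCTION = (
    "Rewrite the context below, keeping only material relevant to the "
    "question and removing opinions or suggested answers.\n\n"
)


def s2a_answer(llm: Callable[[str], str], context: str, question: str) -> str:
    """Two-pass sketch: rewrite the context, then answer from the rewrite."""
    cleaned = llm(
        f"{REWRITE_INSTRUCTION}Context:\n{context}\n\nQuestion:\n{question}"
    )
    # The second call never sees the original, possibly opinionated context.
    return llm(f"Context:\n{cleaned}\n\nQuestion:\n{question}")


def tail_marker(context: str, question: str) -> str:
    """Single-pass variant: isolate the current question at the context tail."""
    return (
        f"{context}\n\n"
        "<current_question>\n"
        f"{question}\n"
        "</current_question>\n"
        "Answer only the current question above; treat earlier answers and "
        "opinions in the context as unverified."
    )
```
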
---
## 2. Highest-Leverage Counter-Practices
Ranked by effect size and breadth of support across vendor docs, peer-reviewed
work, and field practice.
### 2.1 Be literal and explicit; state scope
Anthropic's official guidance for 4.6/4.7: "Claude responds well to clear,
explicit instructions. Being specific about your desired output can help enhance
results. If you want 'above and beyond' behavior, explicitly request it rather
than relying on the model to infer it from vague prompts." [1] This is the
single most-cited lever in their docs.
Apply equally to Qwen3-class local models, whose Apache-2.0 instruct tunes are
now competitive at instruction-following but show the same literal-by-default
behavior as Claude 4.7. [2]
### 2.2 Use XML (or unambiguous) structural tags around heterogeneous content
Wrapping each kind of input — instructions, examples, retrieved context, user
query, tool output — in its own tag reduces misinterpretation because the model
can attend to "tag boundaries" rather than guessing where one block ends and
another begins. [1] This is the cheapest mitigation for §1.3
(lost-in-the-middle) and §1.4 (role confusion) simultaneously.
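
A minimal sketch of such wrapping follows. The tag names, the `source` and `trusted` attributes, and the function names are assumptions made for illustration, not a vendor-specified schema. Only `escape` and `quoteattr` from the Python standard library are real APIs here.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_block(tag: str, content: str, source: str, trusted: bool) -> str:
    """Wrap content in a tag naming its kind, provenance, and trust level."""
    # Escaping &, <, > keeps the content from closing the tag early
    # or introducing look-alike tags of its own.
    body = escape(content)
    trust = "true" if trusted else "false"
    return f'<{tag} source={quoteattr(source)} trusted="{trust}">\n{body}\n</{tag}>'


def build_prompt(instructions: str, tool_output: str, user_query: str) -> str:
    """Assemble a prompt with one tag per kind of content."""
    return "\n".join([
        wrap_block("instructions", instructions, "system", True),
        wrap_block("tool_output", tool_output, "web_fetch", False),
        wrap_block("user_query", user_query, "user", True),
    ])
```

Escaping changes the literal text the model sees, which matters when the wrapped content is code or markup. If that is a problem, a randomly generated delimiter per request is an alternative with the same intent.
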
### 2.3 Provide context and motivation, not just the instruction
Vendor-documented (Anthropic) and consistently effective: explaining _why_
improves targeting. [1][6] Mechanism: motivation tokens disambiguate which
training prior to condition on. A request to "make this shorter" with context
"for a P0 incident page, every line costs attention" lands in a different region
of model behavior than the same request without justification.
### 2.4 Prefer general reasoning instructions over prescriptive steps, **for reasoning-capable models**
Anthropic: "A prompt like 'think thoroughly' often produces better reasoning
than a hand-written step-by-step plan. Claude's reasoning frequently exceeds
what a human would prescribe." [1] Qwen3's thinking mode is similarly designed
to be triggered with light cues (`/think`) rather than micromanaged. [2]
For **non-reasoning** models (or thinking-off mode), the Prompting Science
Report 2 [7] finds chain-of-thought provides only a small average boost and
**increases variance** — sometimes flipping previously-correct answers to wrong.
For reasoning models the explicit CoT request is essentially zero-value and just
burns tokens.
**Additional caveat — subjective tasks:** arXiv:2409.06173 (2024) [16] shows CoT
suffers from _posterior collapse_: the format of CoT retrieves reasoning priors
that remain relatively unchanged despite the evidence in the prompt. This is
especially pronounced on subjective tasks (emotion, morality) and on larger
models. So for intent-interpretation tasks — exactly the kind this doc is about
— CoT may actively entrench the model's prior reading rather than update it on
new evidence. Prefer perspective-taking prompts or
clarifying-question prompts over generic "think step by step" for ambiguous
intent.
### 2.5 Calibrate reasoning length to task complexity
"When More is Less" (Wang et al., 2025) [8] established an inverted-U: accuracy
rises with CoT length, then declines as error accumulation outpaces
decomposition benefit. Optimal length _increases_ with task difficulty and
_decreases_ with model capability. Practical rules:
- For Claude adaptive thinking (4.6/4.7): set the `effort` parameter to match
task complexity; do not push it higher than needed. [1]
- For Qwen3: use the `thinking_budget` mechanism rather than letting thinking
run unbounded. [2]
- For small local models (≤9B): prefer many short reasoning steps in multiple
turns over one long monolithic chain.
### 2.6 Default-to-action vs. default-to-clarify is promptable
Anthropic publishes both directions verbatim. For agent work:
> By default, implement changes rather than only suggesting them. If the user's
> intent is unclear, infer the most useful likely action and proceed, using
> tools to discover any missing details instead of guessing. [1]
For research/exploration work, invert it: instruct the model to clarify or plan
before acting. The point is that "agentic-ness" is a prompt-controlled dial, not
a model property.
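
Treating this as a dial can be as simple as selecting a prompt fragment per mode. In the sketch below, `DEFAULT_TO_ACTION` is the Anthropic wording quoted above; `DEFAULT_TO_CLARIFY` is hypothetical wording written for this example, and the mode names and `system_prompt` helper are likewise assumptions.

```python
# Wording quoted above from Anthropic's guidance [1].
DEFAULT_TO_ACTION = (
    "By default, implement changes rather than only suggesting them. If the "
    "user's intent is unclear, infer the most useful likely action and "
    "proceed, using tools to discover any missing details instead of guessing."
)

# Hypothetical inverse wording for research work; not quoted from any vendor.
DEFAULT_TO_CLARIFY = (
    "Do not make changes until the plan is confirmed. If the user's intent "
    "is unclear, ask a clarifying question or propose a plan first."
)


def system_prompt(base: str, mode: str) -> str:
    """Append the behavior default for the given mode ('agent' or 'research')."""
    defaults = {"agent": DEFAULT_TO_ACTION, "research": DEFAULT_TO_CLARIFY}
    if mode not in defaults:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{base}\n\n{defaults[mode]}"
```

Appending the fragment at the end of the system prompt keeps it near the boundary of the context rather than in the middle.
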
### 2.7 Place critical instructions at the boundaries of the context
Direct consequence of §1.3. The top of the system prompt and the position
immediately preceding the user's most recent turn are the high-attention zones.
Anthropic, Cursor, and Aider all converge on this in practice — system prompts
grow at the top, repo-map / recent-turn context grows just before the user
message.
**Stronger form — full context recontextualization (S2A [20]):** if the context
contains opinionated or anchor-setting material that will skew the answer, the
boundary-placement advice is necessary but not sufficient. S2A's two-pass
pattern (rewrite context to strip irrelevant content → generate from rewritten
context) further reduces the effect of prior anchors. For agent harnesses where
a second LLM call is too expensive, the single-pass equivalent is an explicit
current-question isolation instruction injected at the context tail — same
recency zone, same isolation intent, no extra inference. [20]
### 2.8 Truncate and structure tool output aggressively
Local-model failure modes documented in this repo's own
[`agent-infrastructure.md`](../projects/agent-infrastructure.md) match the
broader pattern: tool-call history is the largest context consumer, and
untruncated outputs both push content into the lost-in-the-middle zone _and_
widen the prompt-injection attack surface (§1.4). The repo's ~1500-token
post-tool-use truncation is consistent with what the Cursor and Aider teams have
published.
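
A head-and-tail truncation along these lines might look like the sketch below. It is not the repo's actual hook: the function name is hypothetical, the 4-characters-per-token estimate is a rough assumption standing in for a real tokenizer, and the two-thirds head split is arbitrary.

```python
def truncate_tool_output(
    text: str, max_tokens: int = 1500, chars_per_token: int = 4
) -> str:
    """Keep the head and tail of long tool output, dropping the middle."""
    if max_tokens <= 0 or chars_per_token <= 0:
        raise ValueError("max_tokens and chars_per_token must be positive")
    budget = max_tokens * chars_per_token  # approximate character budget
    if len(text) <= budget:
        return text
    head = budget * 2 // 3   # most of the budget goes to the start
    tail = budget - head     # the rest keeps the final lines (errors, exit codes)
    omitted = len(text) - budget
    marker = f"\n[... {omitted} characters truncated ...]\n"
    return text[:head] + marker + text[-tail:]
```

Keeping both ends matters because command output tends to put the invocation at the top and the error or summary at the bottom; the explicit marker tells the model that content is missing rather than letting it assume the output was complete.
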
### 2.9 Lower temperature for tool-calling / structured output
Convergent vendor guidance across Anthropic, Qwen, and Tesslate (OmniCoder): for
tool-calling and JSON-emitting paths, temperature 0.2–0.4 substantially reduces
schema violations and hallucinated arguments. [10] This effect is amplified in
quantized models where sampling noise compounds with quantization noise.
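
In a harness this usually reduces to clamping the sampling temperature on schema-bound paths. The sketch below assumes a hypothetical `effective_temperature` helper; raising very low requests up to 0.2 is a choice made here to stay inside the cited band, not something the sources require.

```python
TOOL_TEMP_RANGE = (0.2, 0.4)  # band cited above for tool-calling / JSON paths


def effective_temperature(requested: float, structured_output: bool) -> float:
    """Clamp temperature into the low band when output must match a schema."""
    if not structured_output:
        return requested  # free-form generation keeps the caller's setting
    low, high = TOOL_TEMP_RANGE
    return min(max(requested, low), high)
```
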
### 2.10 Role / persona prompting is at best a weak intervention
A 2025 wave of replication-style studies converges on a folklore-busting result:
assigning expert personas ("you are a senior software engineer…") does not
reliably improve task performance, and in many cases hurts.
- **Principled Personas** (EMNLP 2025) [17]: across 9 SOTA models × 27 tasks,
expert personas usually give "positive or non-significant" effects, and models
are **highly sensitive to irrelevant persona details, with drops of almost 30
percentage points**.
- **Persona is a Double-Edged Sword** (IJCNLP Findings 2025) [18]:
dataset-aligned personas can hurt; only _instance_-aligned personas selected
per-query reliably help.
- **Persona-prompt evaluation across QA benchmarks** (arXiv:2512.05858) [19]:
"persona prompts generally did not improve accuracy" across both benchmarks
tested; low-knowledge personas (layperson, child) actively degrade results.
**Practical guidance:** do not rely on personas as a precision lever for intent
interpretation. If a persona is included for stylistic reasons (tone, register),
keep it minimal and avoid attributes that are irrelevant to the task. For
correctness, prefer the levers in §2.1–§2.9.
---
## 3. Architecture, Parameters, and Quantization — What Actually Changes
### 3.1 Parameter count and "emergence"
The classical scaling-laws picture (Kaplan, Chinchilla) holds for loss, but
emergent _capabilities_ are noisier than originally reported. "Distributional
Scaling Laws for Emergent Capabilities" (2025) [9] shows that at scales near a
capability threshold, performance across random seeds is **bimodal** — some runs
acquire the skill, some don't — so "emergence" at a given scale is partly
stochastic. Bigger models collapse the bimodal distribution and acquire skills
more reliably.
Practical implication for choosing model size:
- **≤4B:** reliable for narrow extraction, classification, short agentic steps;
instruction following degrades sharply with prompt length and as context
fills.
- **7–14B (incl. OmniCoder-9B):** the current sweet spot for local engineering
work. Tool-calling and structured output work reliably when the prompt is
well-structured; reasoning is acceptable; long-horizon plans drift.
- **30–70B dense / 100–400B MoE:** comparable behavior to mid-tier cloud models
on most tasks; remaining gaps are agentic (BrowseComp, TerminalBench, OSWorld)
where open models still trail. [11]
### 3.2 Dense vs. Mixture-of-Experts
Shen et al. (ICLR 2024, "FLAN-MoE") [12] established a counter-intuitive result
that still holds: **MoE models underperform dense models of equivalent FLOPs
when only directly fine-tuned, but surpass them dramatically after instruction
tuning** — and benefit _more_ from instruction tuning than dense models do.
FLAN-MoE-32B beat Flan-PaLM-62B on four benchmarks at ⅓ the FLOPs.
Practical implications for prompt design:
- MoE models (DeepSeek V4, Kimi K2.6, GLM-5, Qwen3 235B-A22B) are more sensitive
to instruction _style_ matching their tuning distribution. Clean, structured
prompts pay off more than on dense models.
- Routing instability shows up as occasional out-of-distribution responses on
edge cases. Few-shot examples are an effective stabilizer because they shift
activation into well-traveled expert combinations.
- Active-parameter count (e.g., 22B active in Qwen3-235B-A22B) is the better
predictor of per-token latency and small-task quality than total parameter
count.
### 3.3 Quantization
Detailed numbers in §1.2. Summary heuristics:
| Bit-width | Reasoning (GSM8K) | Instruction (IFEval) | Recommendation |
| ----------- | ----------------- | -------------------- | ------------------------------- |
| Q3_K_S/M | Notable drop | Variable, often drop | Avoid for agents |
| Q4_K_S/M | ~Baseline | Often ≥ baseline | Default for local agents |
| Q5_K_M | ≥ Baseline | ≥ Baseline | Best quality/size trade-off [3] |
| Q6_K | ≥ Baseline | Sometimes slight dip | Use if VRAM allows |
| Q8_0 / bf16 | Baseline | Baseline | No guaranteed advantage over Q5 |
Calibration-aware methods (AWQ, GPTQ with good calibration data, EXL2) generally
outperform naive GGUF at the same bit-width; for instruction-heavy work, prefer
K-quants over legacy `_0` / `_1` quants. [3]
### 3.4 Architecture variants worth knowing in 2026
- **Standard Transformer + GQA:** still the default (Llama, Mistral, most
Qwen2/2.5).
- **Hybrid attention (Qwen3.5 / "qwen35" / OmniCoder backbone):** Gated Delta
Networks interleaved with standard attention; enables efficient 262K native
context with extension to 1M+. [10] In practice this changes the
lost-in-the-middle profile somewhat but does not eliminate it — the same
boundary-placement advice applies.
- **Thinking-mode fusion (Qwen3):** a single model trained for both reasoning
and direct response, switched by `/think` and `/no_think` flags in user/system
messages, with an emergent "stop thinking now" capability used by the
`thinking_budget` controller. [2]
---
## 4. Model-Specific Notes (May 2026)
### Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Haiku 4.5
- **Opus 4.7 is more literal than 4.6 at low effort.** Prompts tuned for 4.6 may
need scope made explicit on 4.7. [1]
- Adaptive thinking is the default; do not hand-write step-by-step plans unless
the task is genuinely procedural. [1]
- The "default-to-action" / "default-to-clarify" prompt is the highest-leverage
knob for changing agent behavior without changing model. [1]
- Subagent delegation (Opus parent → Sonnet/Haiku children) is
cheaper-and-comparable for isolated subtasks; the parent retains reasoning,
the children execute.
### Qwen3 family (0.6B–235B, dense + MoE; Qwen3.5 hybrid)
- Two-mode model: `/think` and `/no_think` flags toggle reasoning;
`thinking_budget` caps token spend. [2]
- Instruction following on Qwen3 instruct surpasses Qwen2.5 instruct, especially
in non-thinking mode. [2]
- Multilingual support jumped from 29 languages (Qwen2.5) to 119 (Qwen3). [2]
- Qwen3.5 (the "qwen35" architecture, base for OmniCoder-9B) introduces hybrid
Gated Delta + standard attention, 262K native context. [10]
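
The mode switch is just text in the message, so a harness can set it per turn. A minimal sketch, assuming the flag is appended to the end of the user message; the `with_thinking_flag` helper name is hypothetical, and only the `/think` and `/no_think` strings come from the Qwen3 report [2].

```python
def with_thinking_flag(user_message: str, think: bool) -> str:
    """Append the Qwen3 soft switch for the desired mode to a user message."""
    flag = "/think" if think else "/no_think"
    return f"{user_message.rstrip()} {flag}"
```

A router might call `with_thinking_flag(msg, think=True)` for multi-step planning turns and `think=False` for short tool-call turns where extra reasoning tokens add latency without benefit.
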
### OmniCoder 2 / OmniCoder-9B (Tesslate, Qwen3.5-9B base)
- Fine-tuned on 425K agentic trajectories distilled from Claude Opus 4.6,
GPT-5.3-Codex, GPT-5.4, Gemini 3.1 Pro on Claude Code, OpenCode, Codex, and
Droid scaffolding. [10]
- Specifically learned read-before-write, LSP-diagnostic response, and
minimal-diff edits.
- Tesslate's own guidance: temperature 0.2–0.4 for agentic / tool use.
- Failure modes documented in this repo:
[`agent-infrastructure.md`](../projects/agent-infrastructure.md) §
"Smaller-scale local models" — narrower training distribution (Python/JS
heavy), JSON-schema compliance drops as context fills, instruction drift
faster than larger Qwen3 due to fewer attention heads.
### Other engineering-capable local models (2026 tier)
- **DeepSeek V4 Pro (Max), Kimi K2.6, GLM-5:** current open-weight ceiling;
strong on coding/agentic, still trail proprietary models on BrowseComp,
TerminalBench, OSWorld. [11]
- **Qwen3.5 397B (Reasoning):** competitive with the above at reasoning-heavy
work.
- **Mistral Small 4 (24B, 256K ctx):** best quality-to-resource ratio for
single-GPU deployments; Apache 2.0.
- **Gemma 4 31B (256K ctx):** strong LiveCodeBench; single high-end consumer GPU
viable.
- **Llama 4 (Maverick/Scout):** now trails the Chinese open-weight leaders on
benchmarks but retains ecosystem advantages. [11]
---
## 5. Minimal Operating Checklist
When writing a prompt or system message for any of these models:
1. **State scope and motivation explicitly.** Don't expect generalization.
2. **Structure heterogeneous content with tags.** Especially anything from a
tool or external source.
3. **Put critical instructions at the boundaries** (top of system, or
immediately before user turn) — not buried.
4. **Pick reasoning intensity deliberately.** Adaptive/`thinking_budget` for
capable models; multi-turn small steps for ≤9B locals; skip forced CoT on
reasoning models.
5. **Truncate tool output** and never paste untrusted text without a wrapper
that names its provenance.
6. **For tool-calling: lower temperature** (0.2–0.4) regardless of model.
7. **For local deployments: target Q4_K_M or Q5_K_M.** Verify on IFEval-style
tests, not just perplexity.
8. **Ask for the answer before stating your own view** to avoid sycophantic
agreement.
---
## 6. What the Evidence Does _Not_ Support
- **"Just use a bigger model."** Architecture, instruction tuning, and prompt
structure account for as much variance as raw parameter count for most
engineering tasks. [9][12]
- **"Always use chain-of-thought."** Outdated. Marginal for non-reasoning
models, near-zero for reasoning models, and CoT _increases answer variance_,
flipping some correct answers to wrong. [7][8]
- **"Higher bit-width is always better."** IFEval is not bit-width monotonic;
Q4_K_S can beat Q8_0 on compliance. [3]
- **"MoE > dense at equivalent total params."** Without instruction tuning, MoE
underperforms dense at equal FLOPs. [12]
- **"Role-play personas reliably steer behavior."** Style-based role cues are
exactly what prompt-injection attacks exploit; do not rely on persona prompts
for security boundaries. [13] **Stronger version of this debunk:** persona
prompts also don't reliably improve _task performance_ — they're often
ineffective and frequently harmful when persona attributes are even mildly
irrelevant to the task. [17][18][19] See §2.10.
- **"Longer reasoning is better reasoning."** Inverted-U on accuracy vs. CoT
length is well-established. [8]
---
## 7. Sources
The foundational survey of prompting techniques used to cross-check claims in
this doc is **Schulhoff et al. (2024), _The Prompt Report: A Systematic Survey
of Prompting Techniques_** (arXiv:2406.06608). PRISMA-based review of 1,565
papers; taxonomy of 58 text prompting techniques. Cited as [PR] where relevant.
1. Anthropic. _Prompting best practices_ (covers Opus 4.7, 4.6, Sonnet 4.6,
Haiku 4.5). Claude API Docs.
https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/be-clear-and-direct
2. Yang, A. et al. (2025). _Qwen3 Technical Report._ arXiv:2505.09388. (Dense +
MoE family 0.6B–235B; thinking-mode fusion; thinking budget; 119-language
support.)
3. _Which Quantization Should I Use? A Unified Evaluation of llama.cpp
Quantization on Llama-3.1-8B-Instruct._ arXiv preprint. (GSM8K, IFEval, MMLU,
HellaSwag, TruthfulQA across all GGUF variants.)
4. Liu, N. F. et al. (2024). _Lost in the Middle: How Language Models Use Long
Contexts._ TACL 12, 157–173.
5. The Neural Base. _Lost-in-middle behavior across major models through
early 2026._ (Replication note; U-shaped curve persists across Claude, GPT,
Llama.)
6. Anthropic. _Prompt engineering for business performance._
https://www.anthropic.com/news/prompt-engineering-for-business-performance
7. Meincke, L. et al. (2025). _Prompting Science Report 2: The Decreasing Value
of Chain of Thought in Prompting._ arXiv:2506.07142.
8. Wang, Y. et al. (2025). _When More is Less: Understanding Chain-of-Thought
Length in LLMs._ arXiv:2502.07266.
9. _Distributional Scaling Laws for Emergent Capabilities._ (2025)
arXiv:2502.17356. (Bimodal performance distributions near capability
thresholds; "emergence" as stochastic property at scale.)
10. Tesslate. _OmniCoder-9B model card._ Hugging Face, March 2026. (Qwen3.5-9B
base; 425K agentic trajectories from Claude Opus 4.6, GPT-5.3-Codex,
GPT-5.4, Gemini 3.1 Pro; Gated Delta + attention hybrid; 262K context;
recommended temperature 0.2–0.4 for tool use.)
https://huggingface.co/Tesslate/OmniCoder-9B
11. BenchLM.ai. _Best Open Source LLM in 2026: Rankings, Benchmarks, and the
Models Worth Running._ April 2026. (DeepSeek V4 Pro, Kimi K2.6, GLM-5,
Qwen3.5 397B, Mistral Small 4, Gemma 4, Llama 4 comparison.)
12. Shen, S. et al. (2024). _Mixture-of-Experts Meets Instruction Tuning: A
Winning Combination for Large Language Models._ ICLR. (FLAN-MoE-32B vs
Flan-PaLM-62B; MoE benefits more from instruction tuning than dense.)
13. _Role Confusion and CoT Forgery: Stylistic Spoofing as a Prompt-Injection
Mechanism._ arXiv preprint, 2026. (Models infer roles from style; forged
reasoning traces inherit self-trust.)
14. Sharma, M. et al. (2024). _Towards Understanding Sycophancy in Language
Models._ ICLR 2024. arXiv:2310.13548.
15. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
size._ LessWrong.
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
(Disconfirms the strongest reading of Perez et al. 2022 for OpenAI base
models. Not peer-reviewed but the data and code are public.)
16. _Chain-of-Thought is not all you need: Posterior collapse of CoT under
distributional shift._ arXiv:2409.06173 (2024). (Larger models anchor harder
to reasoning priors under CoT, especially on subjective tasks.)
17. _Principled Personas: Defining and Measuring the Intended Effects of Persona
Prompting on Task Performance._ EMNLP 2025.
https://aclanthology.org/2025.emnlp-main.1364/
18. _Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts
in Zero-shot Reasoning Tasks._ IJCNLP Findings 2025.
https://aclanthology.org/2025.findings-ijcnlp.51/
19. _When personas help and when they don't: A persona-prompt evaluation across
QA benchmarks._ arXiv:2512.05858 (2025).

[PR] Schulhoff, S. et al. (2024). _The Prompt Report: A Systematic Survey of
Prompting Techniques._ arXiv:2406.06608. PRISMA review of 1,565 papers;
taxonomy of 58 prompting techniques.

20. Weston, J. & Sukhbaatar, S. (2023). _System 2 Attention (is something you
might need too)._ arXiv:2311.11829. (Two-pass technique: LLM first rewrites
input context to remove irrelevant/opinionated material, then generates
response from cleaned context. Reduces sycophancy and increases factuality
on QA, math word problems, and longform generation. The lightweight harness
equivalent is a current-question isolation instruction at the context tail.)