# Action Plan: Counteracting Model Failures to Interpret Intent **Status:** draft (2026-05-16) **Source investigation:** [docs/explorations/text-intent-interpretation-research.md](../explorations/text-intent-interpretation-research.md) **Source research docs:** - [docs/research/text-communication-interpretation.md](text-communication-interpretation.md) (Phase 1: humans reading text) - [docs/research/llm-intent-interpretation.md](llm-intent-interpretation.md) (Phase 1: LLMs reading prompts) - [docs/research/human-llm-interpretation-overlap.md](human-llm-interpretation-overlap.md) (Phase 2: synthesis) - [docs/research/ai-coding-best-practices.md](ai-coding-best-practices.md) (cross-reference: §2.1, §3.2, §3.4a, §3.5, §3.6, §3.7, §3.8, §7) ## How to read this document Each entry has the same shape: ``` Failure mode → Why it happens → Mitigation that works → Tempting-but-wrong mitigation (anti-pattern) → Where to implement in this repo ``` The "tempting-but-wrong" line is the most important part. Many of the obvious mitigations either (a) have no measurable effect or (b) actively hurt performance — and they sound so reasonable they get added by default. If a mitigation is on the anti-pattern list, _do not_ add it as a workaround when something else fails. Evidence-strength tags follow the synthesis doc's legend: **[multi-replicated]**, **[single-study + partial replication]**, **[single-study]**, **[preprint-only]**. --- ## 1. Failure mode: misreading the user's actual question ### 1.1 Position-anchored priming (model defends a prior answer) **Why it happens.** The model's previous turn sits in the context window and acts as a prior the model subsequently defends. Follow-ups are read through the lens of the prior position, not on their own terms. **[multi-replicated]** — documented across model families; mechanism supported by ClashEval (Wu, Wu, Zou, NeurIPS 2024) showing token-probability/adherence relationship. **What works (in order of effectiveness):** 1. **Compaction or fresh context.** Physically remove the prior committed answer. The anchor is broken. Use `PreCompact` to preserve only the user's current question and the verified-correct state. 2. **Adversarial reframing.** Lower the model's confidence in its prior commitment _before_ asking the next question: _"I believe your previous answer was wrong because X. Now answer this specific question: ..."_ ClashEval's mechanism (lower token-probability prior → higher context adherence) extends to this case in principle. 3. **Explicit current-question marker at the tail.** `UserPromptSubmit` hook prepends `CURRENT QUESTION (answer this, not the prior exchange):` to the prompt. Mechanical, cheap, observable. **Tempting but wrong (do not do):** - Repeating the question louder, adding emphasis, or asking the model to "read more carefully." None of these change the anchor. They feel productive and do nothing. - Asking the model to re-state the question in its own words before answering. In the no-oracle setting this can entrench the misreading rather than reset it. **Where in this repo:** - `UserPromptSubmit` hook (already exists at [.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh)) is the right place for the current-question marker. - Compaction logic in `PreCompact` hook (already exists at [.agents/hooks/pre-compact.sh](../../.agents/hooks/pre-compact.sh)) is the right place for the structured prior-discard. ### 1.2 Sycophancy (model defends the user's wrong claim) **Why it happens.** Family-conditional behavior: some RLHF recipes (Anthropic 2023) systematically push toward agreement with the user. **NOT** a universal RLHF property — nostalgebraist (LessWrong, 2023) showed OpenAI base models are not sycophantic at any size. **[single-study + partial replication]** with the caveat that the effect depends on the model family in use. **What works:** - External feedback signals (test runners, hooks, type checkers, build) that give the model a non-user source of truth. - Explicit anti-sycophancy rules in `AGENTS.md` and agent bodies: _"Challenge the user when the user is wrong,"_ _"Read a file before asserting facts about it,"_ _"Only make changes that are directly requested."_ **Tempting but wrong:** - Telling the model "be more critical" or "push back when needed." On sycophantic families this softens the floor but doesn't move the median; on non-sycophantic families it's noise. - LLM-as-judge of the user's own claim (self-critique loop without an oracle — see §4.1 below). **Where in this repo:** - [AGENTS.md](../../AGENTS.md) root anti-pattern list (already present). - [.agents/AGENTS.md](../../.agents/AGENTS.md) per-agent rule reinforcement. ### 1.3 Persona / "you are an expert" prompting **Why it happens.** Prompt-engineering folklore from 2022–2023 that expertise personas improve accuracy. The 2025 literature falsifies this for accuracy benchmarks. **[multi-replicated]** as a _negative_ result: - Principled Personas (EMNLP 2025) — models are highly sensitive to irrelevant persona details; performance drops of ~30pp from small attribute changes. - Persona is a Double-Edged Sword (IJCNLP 2025) — mixed and unstable effects. - [arXiv:2512.05858](https://arxiv.org/abs/2512.05858) — persona prompts generally did not improve accuracy; low-knowledge personas (layperson, child, outsider) often _reduced_ accuracy. **What works:** - Define _task contracts_ and _return formats_ for subagents (this is not the same as injecting an expertise persona). - Use the existing counterbalance agents ([.agents/agents/](../../.agents/agents/)) which are defined by what they _counter_, not by what they _are an expert in_. **Tempting but wrong:** - Adding `"You are a senior X engineer with 20 years of experience..."` to agent prompts. No measurable effect on frontier models; on small models can hurt via persona-attribute sensitivity. - Expertise-ladder prompting (junior/senior/outsider) as an **accuracy** improver. It is _only_ defensible as a divergent-ideation sampler for brainstorm tasks where high variance is the goal — and even then, the final answer should come from the un-personified model under an external rubric. See revised [docs/research/ai-coding-best-practices.md §7](ai-coding-best-practices.md). **Where in this repo:** - Audit existing agent prompts in [.agents/agents/](../../.agents/agents/) for any "you are an expert X" framing. Replace with negative-role and return- format specs. (Action item, to be done after this plan is approved.) --- ## 2. Failure mode: misreading specific tokens / instructions in long context ### 2.1 Lost-in-the-middle / serial-position effects **Why it happens.** Transformer attention is quadratic in context length; information in the middle of long contexts receives proportionally less attention. **[single-study + partial replication]** — Liu et al. (2023) established the U-shape; Bilan et al. (arXiv:2508.07479, 2025) shows the U-shape only holds up to ~50% of context window; Mak (2025) shows positional- embedding decay produces monotonic drop in very-long contexts; Zhang et al. (2024b) non-replication on some model families. Effect is real but mechanism varies and effective context is typically 30–50% of advertised. **What works:** - Task-critical content at the **tail** of context (recency bias is strong and consistent across the tested models). - Rules repeated at both ends (start AND tail), not just AGENTS.md (start only). - Hooks injecting at the context tail outlast AGENTS.md under context pressure. - Summarization-in-place for stale tool outputs (don't scroll, replace). **Tempting but wrong:** - Putting more rules in AGENTS.md when the existing ones aren't being followed. They are forgotten from the middle by ~5–10k tokens of subsequent context. _Adding more makes it worse._ Move the rule to a hook instead. - Increasing the model's context window. Effective attention does not scale with advertised window; the middle gets _worse_, not better. - "Reminding" the model with bold text or all-caps in AGENTS.md. Token-level emphasis has no measurable effect on the LiM gradient. **Where in this repo:** - Enforcement hierarchy in [.agents/AGENTS.md](../../.agents/AGENTS.md) already encodes the right pattern. - Existing hooks ([.agents/hooks/](../../.agents/hooks/)) already implement the context-tail-injection pattern. New guidance should follow that pattern. ### 2.2 Sequential-constraint ordering failures **Why it happens.** Cross-references documented in [ai-coding-best-practices.md §4.6](ai-coding-best-practices.md). When a list of constraints is given in one order but must be applied in another, models apply in the order they read them, not the order they should be applied in. **What works:** - Re-order constraints in the prompt to match application order. - Use a verifier (a hook, a test, a lint rule) instead of relying on the model to compose constraints in the right order. **Tempting but wrong:** - Numbered lists (1, 2, 3) implying priority order. Models don't reliably honor numeric priority over textual position. --- ## 3. Failure mode: ambiguity in the user's request ### 3.1 Models do not ask clarifying questions by default **Why it happens.** Pretraining favors confident-helpful continuations. Asking for clarification reads as "less helpful" in preference data. **[multi-replicated]** in conversational AI literature. **What works:** - Explicit instruction in the system prompt: _"If the user's intent is unclear, infer the most useful likely action and proceed with using tools to discover missing details instead of guessing"_ — paired with a structured ambiguity-flagging mechanism (e.g., the agent surfaces an explicit "assumption made: X" line before acting). - For high-stakes operations: ask one targeted clarifying question with options (the existing ask-question tool / `vscode_askQuestions` pattern). **Tempting but wrong:** - Telling the model "ask if anything is unclear." Models report nothing as unclear that they could fluently continue past. The instruction has near- zero effect. - Adding many "do you mean X or Y?" examples in the prompt. Few-shot examples for capable models on common tasks often actively harm via spurious pattern-matching ([ai-coding-best-practices.md §7](ai-coding-best-practices.md)). **Where in this repo:** - The default agent's `copilot-instructions` (if used here) or [AGENTS.md](../../AGENTS.md) operational rules section. --- ## 4. Failure mode: trying to fix it by asking the model to fix itself ### 4.1 Intrinsic self-correction without an oracle **Why it happens.** It feels like reflection should help. Empirically it doesn't, and often it hurts. **[multi-replicated]** as a negative result: - Huang et al. ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large Language Models Cannot Self-Correct Reasoning Yet"): in the intrinsic (no-oracle) setting, self-correction **consistently decreases** reasoning performance across multiple prompts and tasks. Prior optimism about self-correction in earlier papers vanishes when oracle labels are removed. - Pan et al. (arXiv:2308.03188): survey reaches the same conclusion in aggregate — external feedback signals are reliable; intrinsic self-critique is not. **What works:** - External feedback signal: test runner, type checker, lint, hook exit code, build success. Reflexion (Shinn et al., arXiv:2303.11366) achieves 91% pass@1 on HumanEval _with_ an external oracle — without one, the loop is noise. - Failure-mode-routed intervention: a small judge subagent that classifies the failure mode and selects the matching intervention (see [ai-coding-best-practices.md §3.5](ai-coding-best-practices.md) table). The judge must be a stronger or cross-family model; same-family same-size judging compounds bias. **Tempting but wrong (this is the single most common anti-pattern):** - _"Take another look,"_ _"are you sure?"_ _"please double-check your work,"_ _"reflect on whether this is correct."_ All of these feel productive in transcripts. Without an external oracle they are at best noise and measurably degrade correctness in the published benchmark. Do not add them. - LLM-as-judge with the same model evaluating itself. Self-enhancement bias (Zheng et al. 2023, MT-Bench) — same-family judges over-score their own family's outputs. **Where in this repo:** - Verification is already correctly in the harness (build, lint, tests, hooks) rather than the prompt — see [ai-coding-best-practices.md §8.1](ai-coding-best-practices.md) and the existing hook set. - The reflection-without-oracle anti-pattern should be added explicitly to [AGENTS.md](../../AGENTS.md) `` so it doesn't creep back in as a "let me check my work" pattern. ### 4.2 Chain-of-thought as a universal fix **Why it happens.** CoT works on some tasks; folklore generalized it to all tasks. **[single-study + partial replication]** as a _negative_ finding for the universalization: - [arXiv:2409.06173](https://arxiv.org/abs/2409.06173) shows CoT suffers from posterior collapse: larger models anchor _harder_ to reasoning priors under CoT on subjective tasks (emotion, morality, intent inference). **What works:** - CoT for objective, verifiable reasoning (math, code logic, step-counted inference). - Think-Anywhere (Jiang et al., arXiv:2603.29957) and interleaved thinking (Claude 4.x extended thinking) — mid-sequence reasoning at high-entropy positions, not just upfront planning. **Tempting but wrong:** - _"Let's think step by step"_ preambles for reasoning-trained models — at best redundant (the model is already trained to reason); at worst it entrenches a wrong prior on subjective tasks. - Long CoT on intent-interpretation tasks. The model can reason itself _further into_ the misread. --- ## 5. Cross-cutting principle: the harness is where intent gets clarified A unifying claim from the synthesis doc that survives both the human and LLM literature: when ambiguity is high, neither a human nor a model resolves it by "reading more carefully." Resolution happens through **external signal** — question, test, lint, hook, oracle. The harness is where the external signal lives. The prompt is where the rule of "use the external signal" lives. Every action in this plan reduces to one of three moves: 1. **Move the rule into the harness.** Hooks, tests, type checkers, lint. These are unambiguous and fire deterministically. 2. **Reduce reliance on context-middle attention.** Context-tail injection, compaction, structured retrieval. 3. **Reduce reliance on self-critique.** External oracles, cross-family judges, structured failure routing. If a proposed mitigation does not fit one of these three, it probably belongs on the tempting-but-wrong list. --- ## 6. Proposed concrete edits (for user approval) This plan does not yet ship code changes. Proposed next steps in dependency order: - [ ] **A.** Audit [.agents/agents/](../../.agents/agents/) bodies for "you are an expert X" framing and replace with negative-role / return- format specs. Likely small edits to 1–4 files. - [ ] **B.** Add an anti-pattern bullet to [.agents/AGENTS.md](../../.agents/AGENTS.md) calling out _"reflect / double-check / are you sure"_ as a non-mitigation without an external oracle. Scoped to `.agents/` (not root `AGENTS.md`) because it is metaknowledge about agent design — only relevant when authoring agent infrastructure, not when writing application code where tests are the oracle anyway. - [x] **C.** Add a `CURRENT QUESTION (answer this, not the prior exchange):` prefix-injection option to [.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh), either always-on or gated on a follow-up trigger phrase. **Shipped always-on** (Revision 7, 2026-05-16). Placed last in `additionalContext` (context tail = highest recency bias). Validated by S2A (Weston & Sukhbaatar, arXiv:2311.11829): explicitly isolating the current query from prior context reduces sycophancy and improves factuality without a second LLM call. Same mechanism as the ClashEval token-probability anchoring research cited in §1.1. - [x] **D.** Add an `ambiguity-flag` convention: when the agent infers user intent past a real ambiguity, surface a one-line `ASSUMPTION:` marker before proceeding. Documented in [AGENTS.md](../../AGENTS.md); enforceable optionally via a `PreToolUse` check on certain destructive tools. **Shipped as documentation** in root `AGENTS.md` "Key Rules" section (Revision 7, 2026-05-16). PreToolUse enforcement deferred — would fire on every destructive call regardless of whether there was genuine ambiguity, producing noise without selectivity. - [ ] **E.** Update [docs/verified/ai-coding-best-practices.md](../verified/ai-coding-best-practices.md) summary to reflect the three corrections from Revision 6 of the research doc (sycophancy family-conditional, intrinsic self-correction is the strongest anti-pattern, persona-ladder scoped to ideation only). Open question for the user: which of A–E should ship in this conversation, which need a separate task, and which should be discarded?