# Action Plan: Counteracting Model Failures to Interpret Intent

**Status:** draft (2026-05-16)  
**Source investigation:**
[docs/explorations/text-intent-interpretation-research.md](../explorations/text-intent-interpretation-research.md)  
**Source
research docs:**

- [docs/research/text-communication-interpretation.md](text-communication-interpretation.md)
  (Phase 1: humans reading text)
- [docs/research/llm-intent-interpretation.md](llm-intent-interpretation.md)
  (Phase 1: LLMs reading prompts)
- [docs/research/human-llm-interpretation-overlap.md](human-llm-interpretation-overlap.md)
  (Phase 2: synthesis)
- [docs/research/ai-coding-best-practices.md](ai-coding-best-practices.md)
  (cross-reference: §2.1, §3.2, §3.4a, §3.5, §3.6, §3.7, §3.8, §7)

## How to read this document

Each entry has the same shape:

```
Failure mode → Why it happens → Mitigation that works → Tempting-but-wrong mitigation (anti-pattern) → Where to implement in this repo
```

The "tempting-but-wrong" line is the most important part. Many of the obvious
mitigations either (a) have no measurable effect or (b) actively hurt
performance — and they sound so reasonable they get added by default. If a
mitigation is on the anti-pattern list, _do not_ add it as a workaround when
something else fails.

Evidence-strength tags follow the synthesis doc's legend:
**[multi-replicated]**, **[single-study + partial replication]**,
**[single-study]**, **[preprint-only]**.

---

## 1. Failure mode: misreading the user's actual question

### 1.1 Position-anchored priming (model defends a prior answer)

**Why it happens.** The model's previous turn sits in the context window and
acts as a prior the model subsequently defends. Follow-ups are read through the
lens of the prior position, not on their own terms. **[multi-replicated]** —
documented across model families; mechanism supported by ClashEval (Wu, Wu, Zou,
NeurIPS 2024) showing token-probability/adherence relationship.

**What works (in order of effectiveness):**

1. **Compaction or fresh context.** Physically remove the prior committed
   answer. The anchor is broken. Use `PreCompact` to preserve only the user's
   current question and the verified-correct state.
2. **Adversarial reframing.** Lower the model's confidence in its prior
   commitment _before_ asking the next question: _"I believe your previous
   answer was wrong because X. Now answer this specific question: ..."_
   ClashEval's mechanism (lower token-probability prior → higher context
   adherence) extends to this case in principle.
3. **Explicit current-question marker at the tail.** `UserPromptSubmit` hook
   prepends `CURRENT QUESTION (answer this, not the prior exchange):` to the
   prompt. Mechanical, cheap, observable.

**Tempting but wrong (do not do):**

- Repeating the question louder, adding emphasis, or asking the model to "read
  more carefully." None of these change the anchor. They feel productive and do
  nothing.
- Asking the model to re-state the question in its own words before answering.
  In the no-oracle setting this can entrench the misreading rather than reset
  it.

**Where in this repo:**

- `UserPromptSubmit` hook (already exists at
  [.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh))
  is the right place for the current-question marker.
- Compaction logic in `PreCompact` hook (already exists at
  [.agents/hooks/pre-compact.sh](../../.agents/hooks/pre-compact.sh)) is the
  right place for the structured prior-discard.

### 1.2 Sycophancy (model defends the user's wrong claim)

**Why it happens.** Family-conditional behavior: some RLHF recipes
(Anthropic 2023) systematically push toward agreement with the user. **NOT** a
universal RLHF property — nostalgebraist (LessWrong, 2023) showed OpenAI base
models are not sycophantic at any size. **[single-study + partial replication]**
with the caveat that the effect depends on the model family in use.

**What works:**

- External feedback signals (test runners, hooks, type checkers, build) that
  give the model a non-user source of truth.
- Explicit anti-sycophancy rules in `AGENTS.md` and agent bodies: _"Challenge
  the user when the user is wrong,"_ _"Read a file before asserting facts about
  it,"_ _"Only make changes that are directly requested."_

**Tempting but wrong:**

- Telling the model "be more critical" or "push back when needed." On
  sycophantic families this softens the floor but doesn't move the median; on
  non-sycophantic families it's noise.
- LLM-as-judge of the user's own claim (self-critique loop without an oracle —
  see §4.1 below).

**Where in this repo:**

- [AGENTS.md](../../AGENTS.md) root anti-pattern list (already present).
- [.agents/AGENTS.md](../../.agents/AGENTS.md) per-agent rule reinforcement.

### 1.3 Persona / "you are an expert" prompting

**Why it happens.** Prompt-engineering folklore from 2022–2023 that expertise
personas improve accuracy. The 2025 literature falsifies this for accuracy
benchmarks. **[multi-replicated]** as a _negative_ result:

- Principled Personas (EMNLP 2025) — models are highly sensitive to irrelevant
  persona details; performance drops of ~30pp from small attribute changes.
- Persona is a Double-Edged Sword (IJCNLP 2025) — mixed and unstable effects.
- [arXiv:2512.05858](https://arxiv.org/abs/2512.05858) — persona prompts
  generally did not improve accuracy; low-knowledge personas (layperson, child,
  outsider) often _reduced_ accuracy.

**What works:**

- Define _task contracts_ and _return formats_ for subagents (this is not the
  same as injecting an expertise persona).
- Use the existing counterbalance agents
  ([.agents/agents/](../../.agents/agents/)) which are defined by what they
  _counter_, not by what they _are an expert in_.

**Tempting but wrong:**

- Adding `"You are a senior X engineer with 20 years of experience..."` to agent
  prompts. No measurable effect on frontier models; on small models can hurt via
  persona-attribute sensitivity.
- Expertise-ladder prompting (junior/senior/outsider) as an **accuracy**
  improver. It is _only_ defensible as a divergent-ideation sampler for
  brainstorm tasks where high variance is the goal — and even then, the final
  answer should come from the un-personified model under an external rubric. See
  revised
  [docs/research/ai-coding-best-practices.md §7](ai-coding-best-practices.md).

**Where in this repo:**

- Audit existing agent prompts in [.agents/agents/](../../.agents/agents/) for
  any "you are an expert X" framing. Replace with negative-role and return-
  format specs. (Action item, to be done after this plan is approved.)

---

## 2. Failure mode: misreading specific tokens / instructions in long context

### 2.1 Lost-in-the-middle / serial-position effects

**Why it happens.** Transformer attention is quadratic in context length;
information in the middle of long contexts receives proportionally less
attention. **[single-study + partial replication]** — Liu et al. (2023)
established the U-shape; Bilan et al. (arXiv:2508.07479, 2025) shows the U-shape
only holds up to ~50% of context window; Mak (2025) shows positional- embedding
decay produces monotonic drop in very-long contexts; Zhang et al. (2024b)
non-replication on some model families. Effect is real but mechanism varies and
effective context is typically 30–50% of advertised.

**What works:**

- Task-critical content at the **tail** of context (recency bias is strong and
  consistent across the tested models).
- Rules repeated at both ends (start AND tail), not just AGENTS.md (start only).
- Hooks injecting at the context tail outlast AGENTS.md under context pressure.
- Summarization-in-place for stale tool outputs (don't scroll, replace).

**Tempting but wrong:**

- Putting more rules in AGENTS.md when the existing ones aren't being followed.
  They are forgotten from the middle by ~5–10k tokens of subsequent context.
  _Adding more makes it worse._ Move the rule to a hook instead.
- Increasing the model's context window. Effective attention does not scale with
  advertised window; the middle gets _worse_, not better.
- "Reminding" the model with bold text or all-caps in AGENTS.md. Token-level
  emphasis has no measurable effect on the LiM gradient.

**Where in this repo:**

- Enforcement hierarchy in [.agents/AGENTS.md](../../.agents/AGENTS.md) already
  encodes the right pattern.
- Existing hooks ([.agents/hooks/](../../.agents/hooks/)) already implement the
  context-tail-injection pattern. New guidance should follow that pattern.

### 2.2 Sequential-constraint ordering failures

**Why it happens.** Cross-references documented in
[ai-coding-best-practices.md §4.6](ai-coding-best-practices.md). When a list of
constraints is given in one order but must be applied in another, models apply
in the order they read them, not the order they should be applied in.

**What works:**

- Re-order constraints in the prompt to match application order.
- Use a verifier (a hook, a test, a lint rule) instead of relying on the model
  to compose constraints in the right order.

**Tempting but wrong:**

- Numbered lists (1, 2, 3) implying priority order. Models don't reliably honor
  numeric priority over textual position.

---

## 3. Failure mode: ambiguity in the user's request

### 3.1 Models do not ask clarifying questions by default

**Why it happens.** Pretraining favors confident-helpful continuations. Asking
for clarification reads as "less helpful" in preference data.
**[multi-replicated]** in conversational AI literature.

**What works:**

- Explicit instruction in the system prompt: _"If the user's intent is unclear,
  infer the most useful likely action and proceed with using tools to discover
  missing details instead of guessing"_ — paired with a structured
  ambiguity-flagging mechanism (e.g., the agent surfaces an explicit "assumption
  made: X" line before acting).
- For high-stakes operations: ask one targeted clarifying question with options
  (the existing ask-question tool / `vscode_askQuestions` pattern).

**Tempting but wrong:**

- Telling the model "ask if anything is unclear." Models report nothing as
  unclear that they could fluently continue past. The instruction has near- zero
  effect.
- Adding many "do you mean X or Y?" examples in the prompt. Few-shot examples
  for capable models on common tasks often actively harm via spurious
  pattern-matching
  ([ai-coding-best-practices.md §7](ai-coding-best-practices.md)).

**Where in this repo:**

- The default agent's `copilot-instructions` (if used here) or
  [AGENTS.md](../../AGENTS.md) operational rules section.

---

## 4. Failure mode: trying to fix it by asking the model to fix itself

### 4.1 Intrinsic self-correction without an oracle

**Why it happens.** It feels like reflection should help. Empirically it
doesn't, and often it hurts. **[multi-replicated]** as a negative result:

- Huang et al. ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large
  Language Models Cannot Self-Correct Reasoning Yet"): in the intrinsic
  (no-oracle) setting, self-correction **consistently decreases** reasoning
  performance across multiple prompts and tasks. Prior optimism about
  self-correction in earlier papers vanishes when oracle labels are removed.
- Pan et al. (arXiv:2308.03188): survey reaches the same conclusion in aggregate
  — external feedback signals are reliable; intrinsic self-critique is not.

**What works:**

- External feedback signal: test runner, type checker, lint, hook exit code,
  build success. Reflexion (Shinn et al., arXiv:2303.11366) achieves 91% pass@1
  on HumanEval _with_ an external oracle — without one, the loop is noise.
- Failure-mode-routed intervention: a small judge subagent that classifies the
  failure mode and selects the matching intervention (see
  [ai-coding-best-practices.md §3.5](ai-coding-best-practices.md) table). The
  judge must be a stronger or cross-family model; same-family same-size judging
  compounds bias.

**Tempting but wrong (this is the single most common anti-pattern):**

- _"Take another look,"_ _"are you sure?"_ _"please double-check your work,"_
  _"reflect on whether this is correct."_ All of these feel productive in
  transcripts. Without an external oracle they are at best noise and measurably
  degrade correctness in the published benchmark. Do not add them.
- LLM-as-judge with the same model evaluating itself. Self-enhancement bias
  (Zheng et al. 2023, MT-Bench) — same-family judges over-score their own
  family's outputs.

**Where in this repo:**

- Verification is already correctly in the harness (build, lint, tests, hooks)
  rather than the prompt — see
  [ai-coding-best-practices.md §8.1](ai-coding-best-practices.md) and the
  existing hook set.
- The reflection-without-oracle anti-pattern should be added explicitly to
  [AGENTS.md](../../AGENTS.md) `<implementationDiscipline>` so it doesn't creep
  back in as a "let me check my work" pattern.

### 4.2 Chain-of-thought as a universal fix

**Why it happens.** CoT works on some tasks; folklore generalized it to all
tasks. **[single-study + partial replication]** as a _negative_ finding for the
universalization:

- [arXiv:2409.06173](https://arxiv.org/abs/2409.06173) shows CoT suffers from
  posterior collapse: larger models anchor _harder_ to reasoning priors under
  CoT on subjective tasks (emotion, morality, intent inference).

**What works:**

- CoT for objective, verifiable reasoning (math, code logic, step-counted
  inference).
- Think-Anywhere (Jiang et al., arXiv:2603.29957) and interleaved thinking
  (Claude 4.x extended thinking) — mid-sequence reasoning at high-entropy
  positions, not just upfront planning.

**Tempting but wrong:**

- _"Let's think step by step"_ preambles for reasoning-trained models — at best
  redundant (the model is already trained to reason); at worst it entrenches a
  wrong prior on subjective tasks.
- Long CoT on intent-interpretation tasks. The model can reason itself _further
  into_ the misread.

---

## 5. Cross-cutting principle: the harness is where intent gets clarified

A unifying claim from the synthesis doc that survives both the human and LLM
literature: when ambiguity is high, neither a human nor a model resolves it by
"reading more carefully." Resolution happens through **external signal** —
question, test, lint, hook, oracle. The harness is where the external signal
lives. The prompt is where the rule of "use the external signal" lives.

Every action in this plan reduces to one of three moves:

1. **Move the rule into the harness.** Hooks, tests, type checkers, lint. These
   are unambiguous and fire deterministically.
2. **Reduce reliance on context-middle attention.** Context-tail injection,
   compaction, structured retrieval.
3. **Reduce reliance on self-critique.** External oracles, cross-family judges,
   structured failure routing.

If a proposed mitigation does not fit one of these three, it probably belongs on
the tempting-but-wrong list.

---

## 6. Proposed concrete edits (for user approval)

This plan does not yet ship code changes. Proposed next steps in dependency
order:

- [ ] **A.** Audit [.agents/agents/](../../.agents/agents/) bodies for "you are
      an expert X" framing and replace with negative-role / return- format
      specs. Likely small edits to 1–4 files.
- [ ] **B.** Add an anti-pattern bullet to
      [.agents/AGENTS.md](../../.agents/AGENTS.md) calling out _"reflect /
      double-check / are you sure"_ as a non-mitigation without an external
      oracle. Scoped to `.agents/` (not root `AGENTS.md`) because it is
      metaknowledge about agent design — only relevant when authoring agent
      infrastructure, not when writing application code where tests are the
      oracle anyway.
- [x] **C.** Add a `CURRENT QUESTION (answer this, not the prior exchange):`
      prefix-injection option to
      [.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh),
      either always-on or gated on a follow-up trigger phrase. **Shipped
      always-on** (Revision 7, 2026-05-16). Placed last in `additionalContext`
      (context tail = highest recency bias). Validated by S2A (Weston &
      Sukhbaatar, arXiv:2311.11829): explicitly isolating the current query from
      prior context reduces sycophancy and improves factuality without a second
      LLM call. Same mechanism as the ClashEval token-probability anchoring
      research cited in §1.1.
- [x] **D.** Add an `ambiguity-flag` convention: when the agent infers user
      intent past a real ambiguity, surface a one-line `ASSUMPTION:` marker
      before proceeding. Documented in [AGENTS.md](../../AGENTS.md); enforceable
      optionally via a `PreToolUse` check on certain destructive tools.
      **Shipped as documentation** in root `AGENTS.md` "Key Rules" section
      (Revision 7, 2026-05-16). PreToolUse enforcement deferred — would fire on
      every destructive call regardless of whether there was genuine ambiguity,
      producing noise without selectivity.
- [ ] **E.** Update
      [docs/verified/ai-coding-best-practices.md](../verified/ai-coding-best-practices.md)
      summary to reflect the three corrections from Revision 6 of the research
      doc (sycophancy family-conditional, intrinsic self-correction is the
      strongest anti-pattern, persona-ladder scoped to ideation only).

Open question for the user: which of A–E should ship in this conversation, which
need a separate task, and which should be discarded?