- AGENTS.md: design principles, enforcement hierarchy, deferred loading - agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server) - skills/: research methodology (auto-discovered by MCP server) - hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start, stop, pre-compact, user-prompt-submit - frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works as project-local or global plugin), github/hooks.json - mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter (replaces hand-maintained registry); server renamed all-agents - docs/: agent-infrastructure.md (generalized), research docs (7 files), ai_architectures.md, llama-server-cuda-wsl2.md - install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin + AGENTS.md + MCP entry, VS Code global MCP config
380 lines
17 KiB
Markdown
380 lines
17 KiB
Markdown
# Action Plan: Counteracting Model Failures to Interpret Intent
|
||
|
||
**Status:** draft (2026-05-16)
|
||
**Source investigation:**
|
||
[docs/explorations/text-intent-interpretation-research.md](../explorations/text-intent-interpretation-research.md)
|
||
**Source
|
||
research docs:**
|
||
|
||
- [docs/research/text-communication-interpretation.md](text-communication-interpretation.md)
|
||
(Phase 1: humans reading text)
|
||
- [docs/research/llm-intent-interpretation.md](llm-intent-interpretation.md)
|
||
(Phase 1: LLMs reading prompts)
|
||
- [docs/research/human-llm-interpretation-overlap.md](human-llm-interpretation-overlap.md)
|
||
(Phase 2: synthesis)
|
||
- [docs/research/ai-coding-best-practices.md](ai-coding-best-practices.md)
|
||
(cross-reference: §2.1, §3.2, §3.4a, §3.5, §3.6, §3.7, §3.8, §7)
|
||
|
||
## How to read this document
|
||
|
||
Each entry has the same shape:
|
||
|
||
```
|
||
Failure mode → Why it happens → Mitigation that works → Tempting-but-wrong mitigation (anti-pattern) → Where to implement in this repo
|
||
```
|
||
|
||
The "tempting-but-wrong" line is the most important part. Many of the obvious
|
||
mitigations either (a) have no measurable effect or (b) actively hurt
|
||
performance — and they sound so reasonable they get added by default. If a
|
||
mitigation is on the anti-pattern list, _do not_ add it as a workaround when
|
||
something else fails.
|
||
|
||
Evidence-strength tags follow the synthesis doc's legend:
|
||
**[multi-replicated]**, **[single-study + partial replication]**,
|
||
**[single-study]**, **[preprint-only]**.
|
||
|
||
---
|
||
|
||
## 1. Failure mode: misreading the user's actual question
|
||
|
||
### 1.1 Position-anchored priming (model defends a prior answer)
|
||
|
||
**Why it happens.** The model's previous turn sits in the context window and
|
||
acts as a prior the model subsequently defends. Follow-ups are read through the
|
||
lens of the prior position, not on their own terms. **[multi-replicated]** —
|
||
documented across model families; mechanism supported by ClashEval (Wu, Wu, Zou,
|
||
NeurIPS 2024) showing token-probability/adherence relationship.
|
||
|
||
**What works (in order of effectiveness):**
|
||
|
||
1. **Compaction or fresh context.** Physically remove the prior committed
|
||
answer. The anchor is broken. Use `PreCompact` to preserve only the user's
|
||
current question and the verified-correct state.
|
||
2. **Adversarial reframing.** Lower the model's confidence in its prior
|
||
commitment _before_ asking the next question: _"I believe your previous
|
||
answer was wrong because X. Now answer this specific question: ..."_
|
||
ClashEval's mechanism (lower token-probability prior → higher context
|
||
adherence) extends to this case in principle.
|
||
3. **Explicit current-question marker at the tail.** `UserPromptSubmit` hook
|
||
prepends `CURRENT QUESTION (answer this, not the prior exchange):` to the
|
||
prompt. Mechanical, cheap, observable.
|
||
|
||
**Tempting but wrong (do not do):**
|
||
|
||
- Repeating the question louder, adding emphasis, or asking the model to "read
|
||
more carefully." None of these change the anchor. They feel productive and do
|
||
nothing.
|
||
- Asking the model to re-state the question in its own words before answering.
|
||
In the no-oracle setting this can entrench the misreading rather than reset
|
||
it.
|
||
|
||
**Where in this repo:**
|
||
|
||
- `UserPromptSubmit` hook (already exists at
|
||
[.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh))
|
||
is the right place for the current-question marker.
|
||
- Compaction logic in `PreCompact` hook (already exists at
|
||
[.agents/hooks/pre-compact.sh](../../.agents/hooks/pre-compact.sh)) is the
|
||
right place for the structured prior-discard.
|
||
|
||
### 1.2 Sycophancy (model defends the user's wrong claim)
|
||
|
||
**Why it happens.** Family-conditional behavior: some RLHF recipes
|
||
(Anthropic 2023) systematically push toward agreement with the user. **NOT** a
|
||
universal RLHF property — nostalgebraist (LessWrong, 2023) showed OpenAI base
|
||
models are not sycophantic at any size. **[single-study + partial replication]**
|
||
with the caveat that the effect depends on the model family in use.
|
||
|
||
**What works:**
|
||
|
||
- External feedback signals (test runners, hooks, type checkers, build) that
|
||
give the model a non-user source of truth.
|
||
- Explicit anti-sycophancy rules in `AGENTS.md` and agent bodies: _"Challenge
|
||
the user when the user is wrong,"_ _"Read a file before asserting facts about
|
||
it,"_ _"Only make changes that are directly requested."_
|
||
|
||
**Tempting but wrong:**
|
||
|
||
- Telling the model "be more critical" or "push back when needed." On
|
||
sycophantic families this softens the floor but doesn't move the median; on
|
||
non-sycophantic families it's noise.
|
||
- LLM-as-judge of the user's own claim (self-critique loop without an oracle —
|
||
see §4.1 below).
|
||
|
||
**Where in this repo:**
|
||
|
||
- [AGENTS.md](../../AGENTS.md) root anti-pattern list (already present).
|
||
- [.agents/AGENTS.md](../../.agents/AGENTS.md) per-agent rule reinforcement.
|
||
|
||
### 1.3 Persona / "you are an expert" prompting
|
||
|
||
**Why it happens.** Prompt-engineering folklore from 2022–2023 that expertise
|
||
personas improve accuracy. The 2025 literature falsifies this for accuracy
|
||
benchmarks. **[multi-replicated]** as a _negative_ result:
|
||
|
||
- Principled Personas (EMNLP 2025) — models are highly sensitive to irrelevant
|
||
persona details; performance drops of ~30pp from small attribute changes.
|
||
- Persona is a Double-Edged Sword (IJCNLP 2025) — mixed and unstable effects.
|
||
- [arXiv:2512.05858](https://arxiv.org/abs/2512.05858) — persona prompts
|
||
generally did not improve accuracy; low-knowledge personas (layperson, child,
|
||
outsider) often _reduced_ accuracy.
|
||
|
||
**What works:**
|
||
|
||
- Define _task contracts_ and _return formats_ for subagents (this is not the
|
||
same as injecting an expertise persona).
|
||
- Use the existing counterbalance agents
|
||
([.agents/agents/](../../.agents/agents/)) which are defined by what they
|
||
_counter_, not by what they _are an expert in_.
|
||
|
||
**Tempting but wrong:**
|
||
|
||
- Adding `"You are a senior X engineer with 20 years of experience..."` to agent
|
||
prompts. No measurable effect on frontier models; on small models can hurt via
|
||
persona-attribute sensitivity.
|
||
- Expertise-ladder prompting (junior/senior/outsider) as an **accuracy**
|
||
improver. It is _only_ defensible as a divergent-ideation sampler for
|
||
brainstorm tasks where high variance is the goal — and even then, the final
|
||
answer should come from the un-personified model under an external rubric. See
|
||
revised
|
||
[docs/research/ai-coding-best-practices.md §7](ai-coding-best-practices.md).
|
||
|
||
**Where in this repo:**
|
||
|
||
- Audit existing agent prompts in [.agents/agents/](../../.agents/agents/) for
|
||
any "you are an expert X" framing. Replace with negative-role and return-
|
||
format specs. (Action item, to be done after this plan is approved.)
|
||
|
||
---
|
||
|
||
## 2. Failure mode: misreading specific tokens / instructions in long context
|
||
|
||
### 2.1 Lost-in-the-middle / serial-position effects
|
||
|
||
**Why it happens.** Transformer attention is quadratic in context length;
|
||
information in the middle of long contexts receives proportionally less
|
||
attention. **[single-study + partial replication]** — Liu et al. (2023)
|
||
established the U-shape; Bilan et al. (arXiv:2508.07479, 2025) shows the U-shape
|
||
only holds up to ~50% of context window; Mak (2025) shows positional- embedding
|
||
decay produces monotonic drop in very-long contexts; Zhang et al. (2024b)
|
||
non-replication on some model families. Effect is real but mechanism varies and
|
||
effective context is typically 30–50% of advertised.
|
||
|
||
**What works:**
|
||
|
||
- Task-critical content at the **tail** of context (recency bias is strong and
|
||
consistent across the tested models).
|
||
- Rules repeated at both ends (start AND tail), not just AGENTS.md (start only).
|
||
- Hooks injecting at the context tail outlast AGENTS.md under context pressure.
|
||
- Summarization-in-place for stale tool outputs (don't scroll, replace).
|
||
|
||
**Tempting but wrong:**
|
||
|
||
- Putting more rules in AGENTS.md when the existing ones aren't being followed.
|
||
They are forgotten from the middle by ~5–10k tokens of subsequent context.
|
||
_Adding more makes it worse._ Move the rule to a hook instead.
|
||
- Increasing the model's context window. Effective attention does not scale with
|
||
advertised window; the middle gets _worse_, not better.
|
||
- "Reminding" the model with bold text or all-caps in AGENTS.md. Token-level
|
||
emphasis has no measurable effect on the LiM gradient.
|
||
|
||
**Where in this repo:**
|
||
|
||
- Enforcement hierarchy in [.agents/AGENTS.md](../../.agents/AGENTS.md) already
|
||
encodes the right pattern.
|
||
- Existing hooks ([.agents/hooks/](../../.agents/hooks/)) already implement the
|
||
context-tail-injection pattern. New guidance should follow that pattern.
|
||
|
||
### 2.2 Sequential-constraint ordering failures
|
||
|
||
**Why it happens.** Cross-references documented in
|
||
[ai-coding-best-practices.md §4.6](ai-coding-best-practices.md). When a list of
|
||
constraints is given in one order but must be applied in another, models apply
|
||
in the order they read them, not the order they should be applied in.
|
||
|
||
**What works:**
|
||
|
||
- Re-order constraints in the prompt to match application order.
|
||
- Use a verifier (a hook, a test, a lint rule) instead of relying on the model
|
||
to compose constraints in the right order.
|
||
|
||
**Tempting but wrong:**
|
||
|
||
- Numbered lists (1, 2, 3) implying priority order. Models don't reliably honor
|
||
numeric priority over textual position.
|
||
|
||
---
|
||
|
||
## 3. Failure mode: ambiguity in the user's request
|
||
|
||
### 3.1 Models do not ask clarifying questions by default
|
||
|
||
**Why it happens.** Pretraining favors confident-helpful continuations. Asking
|
||
for clarification reads as "less helpful" in preference data.
|
||
**[multi-replicated]** in conversational AI literature.
|
||
|
||
**What works:**
|
||
|
||
- Explicit instruction in the system prompt: _"If the user's intent is unclear,
|
||
infer the most useful likely action and proceed with using tools to discover
|
||
missing details instead of guessing"_ — paired with a structured
|
||
ambiguity-flagging mechanism (e.g., the agent surfaces an explicit "assumption
|
||
made: X" line before acting).
|
||
- For high-stakes operations: ask one targeted clarifying question with options
|
||
(the existing ask-question tool / `vscode_askQuestions` pattern).
|
||
|
||
**Tempting but wrong:**
|
||
|
||
- Telling the model "ask if anything is unclear." Models report nothing as
|
||
unclear that they could fluently continue past. The instruction has near- zero
|
||
effect.
|
||
- Adding many "do you mean X or Y?" examples in the prompt. Few-shot examples
|
||
for capable models on common tasks often actively harm via spurious
|
||
pattern-matching
|
||
([ai-coding-best-practices.md §7](ai-coding-best-practices.md)).
|
||
|
||
**Where in this repo:**
|
||
|
||
- The default agent's `copilot-instructions` (if used here) or
|
||
[AGENTS.md](../../AGENTS.md) operational rules section.
|
||
|
||
---
|
||
|
||
## 4. Failure mode: trying to fix it by asking the model to fix itself
|
||
|
||
### 4.1 Intrinsic self-correction without an oracle
|
||
|
||
**Why it happens.** It feels like reflection should help. Empirically it
|
||
doesn't, and often it hurts. **[multi-replicated]** as a negative result:
|
||
|
||
- Huang et al. ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large
|
||
Language Models Cannot Self-Correct Reasoning Yet"): in the intrinsic
|
||
(no-oracle) setting, self-correction **consistently decreases** reasoning
|
||
performance across multiple prompts and tasks. Prior optimism about
|
||
self-correction in earlier papers vanishes when oracle labels are removed.
|
||
- Pan et al. (arXiv:2308.03188): survey reaches the same conclusion in aggregate
|
||
— external feedback signals are reliable; intrinsic self-critique is not.
|
||
|
||
**What works:**
|
||
|
||
- External feedback signal: test runner, type checker, lint, hook exit code,
|
||
build success. Reflexion (Shinn et al., arXiv:2303.11366) achieves 91% pass@1
|
||
on HumanEval _with_ an external oracle — without one, the loop is noise.
|
||
- Failure-mode-routed intervention: a small judge subagent that classifies the
|
||
failure mode and selects the matching intervention (see
|
||
[ai-coding-best-practices.md §3.5](ai-coding-best-practices.md) table). The
|
||
judge must be a stronger or cross-family model; same-family same-size judging
|
||
compounds bias.
|
||
|
||
**Tempting but wrong (this is the single most common anti-pattern):**
|
||
|
||
- _"Take another look,"_ _"are you sure?"_ _"please double-check your work,"_
|
||
_"reflect on whether this is correct."_ All of these feel productive in
|
||
transcripts. Without an external oracle they are at best noise and measurably
|
||
degrade correctness in the published benchmark. Do not add them.
|
||
- LLM-as-judge with the same model evaluating itself. Self-enhancement bias
|
||
(Zheng et al. 2023, MT-Bench) — same-family judges over-score their own
|
||
family's outputs.
|
||
|
||
**Where in this repo:**
|
||
|
||
- Verification is already correctly in the harness (build, lint, tests, hooks)
|
||
rather than the prompt — see
|
||
[ai-coding-best-practices.md §8.1](ai-coding-best-practices.md) and the
|
||
existing hook set.
|
||
- The reflection-without-oracle anti-pattern should be added explicitly to
|
||
[AGENTS.md](../../AGENTS.md) `<implementationDiscipline>` so it doesn't creep
|
||
back in as a "let me check my work" pattern.
|
||
|
||
### 4.2 Chain-of-thought as a universal fix
|
||
|
||
**Why it happens.** CoT works on some tasks; folklore generalized it to all
|
||
tasks. **[single-study + partial replication]** as a _negative_ finding for the
|
||
universalization:
|
||
|
||
- [arXiv:2409.06173](https://arxiv.org/abs/2409.06173) shows CoT suffers from
|
||
posterior collapse: larger models anchor _harder_ to reasoning priors under
|
||
CoT on subjective tasks (emotion, morality, intent inference).
|
||
|
||
**What works:**
|
||
|
||
- CoT for objective, verifiable reasoning (math, code logic, step-counted
|
||
inference).
|
||
- Think-Anywhere (Jiang et al., arXiv:2603.29957) and interleaved thinking
|
||
(Claude 4.x extended thinking) — mid-sequence reasoning at high-entropy
|
||
positions, not just upfront planning.
|
||
|
||
**Tempting but wrong:**
|
||
|
||
- _"Let's think step by step"_ preambles for reasoning-trained models — at best
|
||
redundant (the model is already trained to reason); at worst it entrenches a
|
||
wrong prior on subjective tasks.
|
||
- Long CoT on intent-interpretation tasks. The model can reason itself _further
|
||
into_ the misread.
|
||
|
||
---
|
||
|
||
## 5. Cross-cutting principle: the harness is where intent gets clarified
|
||
|
||
A unifying claim from the synthesis doc that survives both the human and LLM
|
||
literature: when ambiguity is high, neither a human nor a model resolves it by
|
||
"reading more carefully." Resolution happens through **external signal** —
|
||
question, test, lint, hook, oracle. The harness is where the external signal
|
||
lives. The prompt is where the rule of "use the external signal" lives.
|
||
|
||
Every action in this plan reduces to one of three moves:
|
||
|
||
1. **Move the rule into the harness.** Hooks, tests, type checkers, lint. These
|
||
are unambiguous and fire deterministically.
|
||
2. **Reduce reliance on context-middle attention.** Context-tail injection,
|
||
compaction, structured retrieval.
|
||
3. **Reduce reliance on self-critique.** External oracles, cross-family judges,
|
||
structured failure routing.
|
||
|
||
If a proposed mitigation does not fit one of these three, it probably belongs on
|
||
the tempting-but-wrong list.
|
||
|
||
---
|
||
|
||
## 6. Proposed concrete edits (for user approval)
|
||
|
||
This plan does not yet ship code changes. Proposed next steps in dependency
|
||
order:
|
||
|
||
- [ ] **A.** Audit [.agents/agents/](../../.agents/agents/) bodies for "you are
|
||
an expert X" framing and replace with negative-role / return- format
|
||
specs. Likely small edits to 1–4 files.
|
||
- [ ] **B.** Add an anti-pattern bullet to
|
||
[.agents/AGENTS.md](../../.agents/AGENTS.md) calling out _"reflect /
|
||
double-check / are you sure"_ as a non-mitigation without an external
|
||
oracle. Scoped to `.agents/` (not root `AGENTS.md`) because it is
|
||
metaknowledge about agent design — only relevant when authoring agent
|
||
infrastructure, not when writing application code where tests are the
|
||
oracle anyway.
|
||
- [x] **C.** Add a `CURRENT QUESTION (answer this, not the prior exchange):`
|
||
prefix-injection option to
|
||
[.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh),
|
||
either always-on or gated on a follow-up trigger phrase. **Shipped
|
||
always-on** (Revision 7, 2026-05-16). Placed last in `additionalContext`
|
||
(context tail = highest recency bias). Validated by S2A (Weston &
|
||
Sukhbaatar, arXiv:2311.11829): explicitly isolating the current query from
|
||
prior context reduces sycophancy and improves factuality without a second
|
||
LLM call. Same mechanism as the ClashEval token-probability anchoring
|
||
research cited in §1.1.
|
||
- [x] **D.** Add an `ambiguity-flag` convention: when the agent infers user
|
||
intent past a real ambiguity, surface a one-line `ASSUMPTION:` marker
|
||
before proceeding. Documented in [AGENTS.md](../../AGENTS.md); enforceable
|
||
optionally via a `PreToolUse` check on certain destructive tools.
|
||
**Shipped as documentation** in root `AGENTS.md` "Key Rules" section
|
||
(Revision 7, 2026-05-16). PreToolUse enforcement deferred — would fire on
|
||
every destructive call regardless of whether there was genuine ambiguity,
|
||
producing noise without selectivity.
|
||
- [ ] **E.** Update
|
||
[docs/verified/ai-coding-best-practices.md](../verified/ai-coding-best-practices.md)
|
||
summary to reflect the three corrections from Revision 6 of the research
|
||
doc (sycophancy family-conditional, intrinsic self-correction is the
|
||
strongest anti-pattern, persona-ladder scoped to ideation only).
|
||
|
||
Open question for the user: which of A–E should ship in this conversation, which
|
||
need a separate task, and which should be discarded?
|