dotfiles/.agents/docs/intent-interpretation-action-plan.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config
2026-05-22 13:13:43 -04:00


# Action Plan: Counteracting Model Failures to Interpret Intent
**Status:** draft (2026-05-16)
**Source investigation:**
[docs/explorations/text-intent-interpretation-research.md](../explorations/text-intent-interpretation-research.md)
**Source research docs:**
- [docs/research/text-communication-interpretation.md](text-communication-interpretation.md)
(Phase 1: humans reading text)
- [docs/research/llm-intent-interpretation.md](llm-intent-interpretation.md)
(Phase 1: LLMs reading prompts)
- [docs/research/human-llm-interpretation-overlap.md](human-llm-interpretation-overlap.md)
(Phase 2: synthesis)
- [docs/research/ai-coding-best-practices.md](ai-coding-best-practices.md)
(cross-reference: §2.1, §3.2, §3.4a, §3.5, §3.6, §3.7, §3.8, §7)
## How to read this document
Each entry has the same shape:
```
Failure mode → Why it happens → Mitigation that works → Tempting-but-wrong mitigation (anti-pattern) → Where to implement in this repo
```
The "tempting-but-wrong" line is the most important part. Many of the obvious
mitigations either (a) have no measurable effect or (b) actively hurt
performance — and they sound so reasonable they get added by default. If a
mitigation is on the anti-pattern list, _do not_ add it as a workaround when
something else fails.
Evidence-strength tags follow the synthesis doc's legend:
**[multi-replicated]**, **[single-study + partial replication]**,
**[single-study]**, **[preprint-only]**.
---
## 1. Failure mode: misreading the user's actual question
### 1.1 Position-anchored priming (model defends a prior answer)
**Why it happens.** The model's previous turn sits in the context window and
acts as a prior the model subsequently defends. Follow-ups are read through the
lens of the prior position, not on their own terms. **[multi-replicated]**:
documented across model families; the mechanism is supported by ClashEval (Wu,
Wu, Zou, NeurIPS 2024), which shows a relationship between token probability
and context adherence.
**What works (in order of effectiveness):**
1. **Compaction or fresh context.** Physically remove the prior committed
answer. The anchor is broken. Use `PreCompact` to preserve only the user's
current question and the verified-correct state.
2. **Adversarial reframing.** Lower the model's confidence in its prior
commitment _before_ asking the next question: _"I believe your previous
answer was wrong because X. Now answer this specific question: ..."_
ClashEval's mechanism (lower token-probability prior → higher context
adherence) extends to this case in principle.
3. **Explicit current-question marker at the tail.** `UserPromptSubmit` hook
prepends `CURRENT QUESTION (answer this, not the prior exchange):` to the
prompt. Mechanical, cheap, observable.
**Tempting but wrong (do not do):**
- Repeating the question louder, adding emphasis, or asking the model to "read
more carefully." None of these change the anchor. They feel productive and do
nothing.
- Asking the model to re-state the question in its own words before answering.
In the no-oracle setting this can entrench the misreading rather than reset
it.
**Where in this repo:**
- `UserPromptSubmit` hook (already exists at
[.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh))
is the right place for the current-question marker.
- Compaction logic in `PreCompact` hook (already exists at
[.agents/hooks/pre-compact.sh](../../.agents/hooks/pre-compact.sh)) is the
right place for the structured prior-discard.
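A minimal sketch of the current-question marker, written in Python for illustration only (the repo's hook is a shell script, and its actual input/output contract is not shown in this document). It assumes the hook gathers context fragments into a list and emits them joined as `additionalContext`, and that the marker is followed by the prompt text; `build_additional_context` and its parameters are hypothetical names.

```python
MARKER = "CURRENT QUESTION (answer this, not the prior exchange):"


def build_additional_context(fragments: list[str], prompt: str) -> str:
    """Join context fragments and place the current-question marker last.

    Hypothetical helper: `fragments` stands in for whatever other context the
    hook already injects. The marker goes at the tail, where recency bias is
    strongest.
    """
    # Drop empty fragments so the tail is never padded with blank sections.
    parts = [fragment.strip() for fragment in fragments if fragment.strip()]
    parts.append(f"{MARKER} {prompt.strip()}")
    return "\n\n".join(parts)
```

Keeping the marker as the final element, rather than the first, is the design choice that matters here: anything appended after it would push the current question back toward the middle of the context.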
### 1.2 Sycophancy (model defends the user's wrong claim)
**Why it happens.** Family-conditional behavior: some RLHF recipes
(Anthropic 2023) systematically push toward agreement with the user. **NOT** a
universal RLHF property — nostalgebraist (LessWrong, 2023) showed OpenAI base
models are not sycophantic at any size. **[single-study + partial replication]**
with the caveat that the effect depends on the model family in use.
**What works:**
- External feedback signals (test runners, hooks, type checkers, build) that
give the model a non-user source of truth.
- Explicit anti-sycophancy rules in `AGENTS.md` and agent bodies: _"Challenge
the user when the user is wrong,"_ _"Read a file before asserting facts about
it,"_ _"Only make changes that are directly requested."_
**Tempting but wrong:**
- Telling the model "be more critical" or "push back when needed." On
sycophantic families this softens the floor but doesn't move the median; on
non-sycophantic families it's noise.
- LLM-as-judge of the user's own claim (self-critique loop without an oracle —
see §4.1 below).
**Where in this repo:**
- [AGENTS.md](../../AGENTS.md) root anti-pattern list (already present).
- [.agents/AGENTS.md](../../.agents/AGENTS.md) per-agent rule reinforcement.
### 1.3 Persona / "you are an expert" prompting
**Why it happens.** Prompt-engineering folklore from 2022–2023 that expertise
personas improve accuracy. The 2025 literature falsifies this for accuracy
benchmarks. **[multi-replicated]** as a _negative_ result:
- Principled Personas (EMNLP 2025) — models are highly sensitive to irrelevant
persona details; performance drops of ~30pp from small attribute changes.
- Persona is a Double-Edged Sword (IJCNLP 2025) — mixed and unstable effects.
- [arXiv:2512.05858](https://arxiv.org/abs/2512.05858) — persona prompts
generally did not improve accuracy; low-knowledge personas (layperson, child,
outsider) often _reduced_ accuracy.
**What works:**
- Define _task contracts_ and _return formats_ for subagents (this is not the
same as injecting an expertise persona).
- Use the existing counterbalance agents
([.agents/agents/](../../.agents/agents/)) which are defined by what they
_counter_, not by what they _are an expert in_.
**Tempting but wrong:**
- Adding `"You are a senior X engineer with 20 years of experience..."` to agent
prompts. No measurable effect on frontier models; on small models can hurt via
persona-attribute sensitivity.
- Expertise-ladder prompting (junior/senior/outsider) as an **accuracy**
improver. It is _only_ defensible as a divergent-ideation sampler for
brainstorm tasks where high variance is the goal — and even then, the final
answer should come from the un-personified model under an external rubric. See
revised
[docs/research/ai-coding-best-practices.md §7](ai-coding-best-practices.md).
**Where in this repo:**
- Audit existing agent prompts in [.agents/agents/](../../.agents/agents/) for
any "you are an expert X" framing. Replace with negative-role and
return-format specs. (Action item, to be done after this plan is approved.)
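The audit above could start from a simple text scan. The sketch below is one possible approach, not an existing tool in this repo: the regular expression and the `find_persona_framing` helper are hypothetical, and the keyword list is an assumption that would need tuning against the real agent files.

```python
import re

# Hypothetical pattern: matches openers such as "You are a senior X engineer"
# or "You are an expert in Y". The keyword list is an assumption.
PERSONA_RE = re.compile(
    r"\byou\s+are\s+(?:an?|the)\s+[^.\n]*\b(?:expert|senior|specialist|veteran|world-class)\b",
    re.IGNORECASE,
)


def find_persona_framing(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like expertise-persona framing."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if PERSONA_RE.search(line):
            hits.append((number, line.strip()))
    return hits
```

A scan like this only flags candidates. Role statements such as "You are the orchestrator" are deliberately not matched, and each hit still needs a human decision about the replacement task contract.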
---
## 2. Failure mode: misreading specific tokens / instructions in long context
### 2.1 Lost-in-the-middle / serial-position effects
**Why it happens.** Models attend unevenly across positions in a long context;
information in the middle tends to receive less attention than content at the
start or the tail. **[single-study + partial replication]**: Liu et al. (2023)
established the U-shape; Bilan et al. (arXiv:2508.07479, 2025) show the U-shape
only holds up to ~50% of the context window; Mak (2025) shows
positional-embedding decay produces a monotonic drop in very long contexts;
Zhang et al. (2024b) report non-replication on some model families. The effect
is real, but the mechanism varies, and effective context is typically 30–50% of
the advertised window.
**What works:**
- Task-critical content at the **tail** of context (recency bias is strong and
consistent across the tested models).
- Rules repeated at both ends (start AND tail), not just AGENTS.md (start only).
- Hooks injecting at the context tail outlast AGENTS.md under context pressure.
- Summarization-in-place for stale tool outputs (don't scroll, replace).
**Tempting but wrong:**
- Putting more rules in AGENTS.md when the existing ones aren't being followed.
They are forgotten from the middle by ~5–10k tokens of subsequent context.
_Adding more makes it worse._ Move the rule to a hook instead.
- Increasing the model's context window. Effective attention does not scale with
advertised window; the middle gets _worse_, not better.
- "Reminding" the model with bold text or all-caps in AGENTS.md. Token-level
emphasis has no measurable effect on the lost-in-the-middle (LiM) gradient.
**Where in this repo:**
- Enforcement hierarchy in [.agents/AGENTS.md](../../.agents/AGENTS.md) already
encodes the right pattern.
- Existing hooks ([.agents/hooks/](../../.agents/hooks/)) already implement the
context-tail-injection pattern. New guidance should follow that pattern.
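As an illustration of the "both ends" and tail-placement guidance, here is a minimal Python sketch of a context assembler. It is not how the repo's hooks are implemented; `assemble_context`, its parameters, and the section labels are hypothetical.

```python
def assemble_context(rules: list[str], history: list[str], task: str) -> str:
    """Place rules at the start and again near the tail, with the task last.

    Hypothetical helper: `history` stands in for prior turns and tool output
    that would otherwise push the rules into the middle of the context.
    """
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    sections = [
        f"RULES:\n{rule_block}",
        *history,
        f"RULES (repeated):\n{rule_block}",
        f"TASK:\n{task}",  # task-critical content sits at the very tail
    ]
    return "\n\n".join(sections)
```

The repeated rule block costs tokens on every turn, so this only pays off for a short list of rules that are actually being dropped.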
### 2.2 Sequential-constraint ordering failures
**Why it happens.** Documented in
[ai-coding-best-practices.md §4.6](ai-coding-best-practices.md). When a list of
constraints is given in one order but must be applied in another, models apply
them in the order they read them, not the order in which they should be applied.
**What works:**
- Re-order constraints in the prompt to match application order.
- Use a verifier (a hook, a test, a lint rule) instead of relying on the model
to compose constraints in the right order.
**Tempting but wrong:**
- Numbered lists (1, 2, 3) implying priority order. Models don't reliably honor
numeric priority over textual position.
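One way to make the first mitigation mechanical is to sort constraints by application order before rendering them into the prompt. The sketch below assumes each constraint carries an explicit step index; the `Constraint` type, its `apply_step` field, and `render_in_application_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    text: str
    apply_step: int  # hypothetical field: position in the order of application


def render_in_application_order(constraints: list[Constraint]) -> str:
    """Render constraints so textual position matches application order."""
    ordered = sorted(constraints, key=lambda c: c.apply_step)
    # Plain bullets on purpose: position carries the order, not numbering.
    return "\n".join(f"- {c.text}" for c in ordered)
```

For example, constraints authored as "format, filter, sort" but applied as "filter, sort, format" would be rendered in the second order.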
---
## 3. Failure mode: ambiguity in the user's request
### 3.1 Models do not ask clarifying questions by default
**Why it happens.** Training favors confident, helpful-sounding continuations. Asking
for clarification reads as "less helpful" in preference data.
**[multi-replicated]** in conversational AI literature.
**What works:**
- Explicit instruction in the system prompt: _"If the user's intent is unclear,
infer the most useful likely action and proceed with using tools to discover
missing details instead of guessing"_ — paired with a structured
ambiguity-flagging mechanism (e.g., the agent surfaces an explicit "assumption
made: X" line before acting).
- For high-stakes operations: ask one targeted clarifying question with options
(the existing ask-question tool / `vscode_askQuestions` pattern).
**Tempting but wrong:**
- Telling the model "ask if anything is unclear." Models report nothing as
unclear that they could fluently continue past. The instruction has near-zero
effect.
- Adding many "do you mean X or Y?" examples in the prompt. Few-shot examples
for capable models on common tasks often actively harm via spurious
pattern-matching
([ai-coding-best-practices.md §7](ai-coding-best-practices.md)).
**Where in this repo:**
- The default agent's `copilot-instructions` (if used here) or
[AGENTS.md](../../AGENTS.md) operational rules section.
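The "assumption made: X" mechanism can be kept mechanical with a fixed one-line marker. The following Python sketch shows a formatter and a detector for such a line, using the `ASSUMPTION:` prefix; both function names are hypothetical, and nothing here is wired into the repo's hooks.

```python
def format_assumption(assumption: str) -> str:
    """Collapse whitespace so the marker always fits on a single line."""
    return f"ASSUMPTION: {' '.join(assumption.split())}"


def has_assumption_marker(response: str) -> bool:
    """True if any line of the response starts with the ASSUMPTION: marker."""
    return any(
        line.lstrip().startswith("ASSUMPTION:") for line in response.splitlines()
    )
```

A detector like `has_assumption_marker` can tell whether a marker was surfaced, but not whether one was needed, so it suits logging better than hard enforcement.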
---
## 4. Failure mode: trying to fix it by asking the model to fix itself
### 4.1 Intrinsic self-correction without an oracle
**Why it happens.** It feels like reflection should help. Empirically it
doesn't, and often it hurts. **[multi-replicated]** as a negative result:
- Huang et al. ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large
Language Models Cannot Self-Correct Reasoning Yet"): in the intrinsic
(no-oracle) setting, self-correction **consistently decreases** reasoning
performance across multiple prompts and tasks. Prior optimism about
self-correction in earlier papers vanishes when oracle labels are removed.
- Pan et al. (arXiv:2308.03188): survey reaches the same conclusion in aggregate
— external feedback signals are reliable; intrinsic self-critique is not.
**What works:**
- External feedback signal: test runner, type checker, lint, hook exit code,
build success. Reflexion (Shinn et al., arXiv:2303.11366) achieves 91% pass@1
on HumanEval _with_ an external oracle — without one, the loop is noise.
- Failure-mode-routed intervention: a small judge subagent that classifies the
failure mode and selects the matching intervention (see
[ai-coding-best-practices.md §3.5](ai-coding-best-practices.md) table). The
judge must be a stronger or cross-family model; same-family same-size judging
compounds bias.
**Tempting but wrong (this is the single most common anti-pattern):**
- _"Take another look,"_ _"are you sure?"_ _"please double-check your work,"_
_"reflect on whether this is correct."_ All of these feel productive in
transcripts. Without an external oracle they are at best noise, and in the
published benchmarks they measurably degrade correctness. Do not add them.
- LLM-as-judge with the same model evaluating itself. Self-enhancement bias
(Zheng et al. 2023, MT-Bench) — same-family judges over-score their own
family's outputs.
**Where in this repo:**
- Verification is already correctly in the harness (build, lint, tests, hooks)
rather than the prompt — see
[ai-coding-best-practices.md §8.1](ai-coding-best-practices.md) and the
existing hook set.
- The reflection-without-oracle anti-pattern should be added explicitly to
[AGENTS.md](../../AGENTS.md) `<implementationDiscipline>` so it doesn't creep
back in as a "let me check my work" pattern.
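The difference between an oracle-gated loop and intrinsic self-critique can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repo's harness: `run_oracle` and `oracle_gated_loop` are hypothetical names, `attempt` stands in for whatever produces a candidate change, and the oracle is any external command whose exit code signals success.

```python
import subprocess
from typing import Callable


def run_oracle(command: list[str]) -> tuple[bool, str]:
    """Run an external check (tests, lint, build); return (passed, output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def oracle_gated_loop(
    attempt: Callable[[str], None],
    oracle: Callable[[], tuple[bool, str]],
    max_rounds: int = 3,
) -> bool:
    """Retry only when the external oracle fails, feeding back its output.

    The model is never asked "are you sure?"; the only feedback it sees is
    the oracle's own output from the previous failed round.
    """
    feedback = ""
    for _ in range(max_rounds):
        attempt(feedback)
        passed, output = oracle()
        if passed:
            return True
        feedback = output
    return False
```

A caller might pass `lambda: run_oracle(["pytest", "-q"])` as the oracle, assuming the project uses pytest; any command with a meaningful exit code works the same way.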
### 4.2 Chain-of-thought as a universal fix
**Why it happens.** CoT works on some tasks; folklore generalized it to all
tasks. **[single-study + partial replication]** as a _negative_ finding for the
universalization:
- [arXiv:2409.06173](https://arxiv.org/abs/2409.06173) shows CoT suffers from
posterior collapse: larger models anchor _harder_ to reasoning priors under
CoT on subjective tasks (emotion, morality, intent inference).
**What works:**
- CoT for objective, verifiable reasoning (math, code logic, step-counted
inference).
- Think-Anywhere (Jiang et al., arXiv:2603.29957) and interleaved thinking
(Claude 4.x extended thinking) — mid-sequence reasoning at high-entropy
positions, not just upfront planning.
**Tempting but wrong:**
- _"Let's think step by step"_ preambles for reasoning-trained models — at best
redundant (the model is already trained to reason); at worst it entrenches a
wrong prior on subjective tasks.
- Long CoT on intent-interpretation tasks. The model can reason itself _further
into_ the misread.
---
## 5. Cross-cutting principle: the harness is where intent gets clarified
A unifying claim from the synthesis doc that survives both the human and LLM
literature: when ambiguity is high, neither a human nor a model resolves it by
"reading more carefully." Resolution happens through **external signal**:
question, test, lint, hook, oracle. The harness is where the external signal
lives. The prompt is where the rule of "use the external signal" lives.
Every action in this plan reduces to one of three moves:
1. **Move the rule into the harness.** Hooks, tests, type checkers, lint. These
are unambiguous and fire deterministically.
2. **Reduce reliance on context-middle attention.** Context-tail injection,
compaction, structured retrieval.
3. **Reduce reliance on self-critique.** External oracles, cross-family judges,
structured failure routing.
If a proposed mitigation does not fit one of these three, it probably belongs on
the tempting-but-wrong list.
---
## 6. Proposed concrete edits (for user approval)
This plan does not yet ship code changes. Proposed next steps in dependency
order:
- [ ] **A.** Audit [.agents/agents/](../../.agents/agents/) bodies for "you are
an expert X" framing and replace with negative-role / return-format
specs. Likely small edits to 1–4 files.
- [ ] **B.** Add an anti-pattern bullet to
[.agents/AGENTS.md](../../.agents/AGENTS.md) calling out _"reflect /
double-check / are you sure"_ as a non-mitigation without an external
oracle. Scoped to `.agents/` (not root `AGENTS.md`) because it is
metaknowledge about agent design — only relevant when authoring agent
infrastructure, not when writing application code where tests are the
oracle anyway.
- [x] **C.** Add a `CURRENT QUESTION (answer this, not the prior exchange):`
prefix-injection option to
[.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh),
either always-on or gated on a follow-up trigger phrase. **Shipped
always-on** (Revision 7, 2026-05-16). Placed last in `additionalContext`
(context tail = highest recency bias). Validated by S2A (Weston &
Sukhbaatar, arXiv:2311.11829): explicitly isolating the current query from
prior context reduces sycophancy and improves factuality without a second
LLM call. Same mechanism as the ClashEval token-probability anchoring
research cited in §1.1.
- [x] **D.** Add an `ambiguity-flag` convention: when the agent infers user
intent past a real ambiguity, surface a one-line `ASSUMPTION:` marker
before proceeding. Documented in [AGENTS.md](../../AGENTS.md); enforceable
optionally via a `PreToolUse` check on certain destructive tools.
**Shipped as documentation** in root `AGENTS.md` "Key Rules" section
(Revision 7, 2026-05-16). PreToolUse enforcement deferred — would fire on
every destructive call regardless of whether there was genuine ambiguity,
producing noise without selectivity.
- [ ] **E.** Update
[docs/verified/ai-coding-best-practices.md](../verified/ai-coding-best-practices.md)
summary to reflect the three corrections from Revision 6 of the research
doc (sycophancy family-conditional, intrinsic self-correction is the
strongest anti-pattern, persona-ladder scoped to ideation only).
Open question for the user: which of A–E should ship in this conversation, which
need a separate task, and which should be discarded?