dotfiles/.agents/docs/text-intent-interpretation-research.md

# Investigation: Text-Intent Interpretation (Human + LLM)

**Status:** investigating
**Orientation:** understand (mixed with mid-investigation methodology
correction)
**Created:** 2026-05-16
**Last Updated:** 2026-05-16

## Question

How do humans and LLMs (mis)interpret intent in text-only communication, and
what mitigations are supported by the literature? End goal: produce a concrete
action plan to counteract LLM intent-interpretation failures in this codebase.

## What We Know

- Three docs produced:
  [text-communication-interpretation.md](../research/text-communication-interpretation.md),
  [llm-intent-interpretation.md](../research/llm-intent-interpretation.md),
  [human-llm-interpretation-overlap.md](../research/human-llm-interpretation-overlap.md).
- Methodology critique recorded in
  [/memories/session/research-methodology-retrospective.md](/memories/session/research-methodology-retrospective.md).
- Five strongly-cited human↔LLM connections (primacy/recency↔serial position,
  ELIZA/hyperpersonal, sycophancy↔social desirability via RLHF preference data,
  perspective-taking↔SimToM, clarifying question↔CLAM).
- Bias-inheritance chain is two-stage (pretraining corpus vs. RLHF preference
  labels) — Mina et al. 2024, Sharma et al. 2024.

## Hypotheses

- **[2026-05-16] H1:** Lost-in-the-middle is a clean LLM analog of the human
  primacy/recency effect.
  **Falsification:** find a replication where the U-shape doesn't hold or where
  the mechanism is shown to be different.
  **Result:** PARTIALLY ELIMINATED — Bilan et al. (arXiv:2508.07479, 2025) shows
  the U-shape holds only up to ~50% of the context window; Mak (2025) shows that
  positional-embedding decay produces a monotonic drop, not a U-shape, in very
  long contexts. The analogy is real but narrower than I originally claimed.

- **[2026-05-16] H2:** RLHF preference labels cause sycophancy.
  **Falsification:** find evidence that base models (no RLHF) are sycophantic,
  or that some RLHF'd models are not.
  **Result:** PARTIALLY ELIMINATED — nostalgebraist (LessWrong, 2023) replicated
  Anthropic's sycophancy eval on OpenAI base models and found they are NOT
  sycophantic at any size. Sycophancy depends on the specific finetuning data
  and model family. Should be rephrased as "in some model families, RLHF
  preference data amplifies a sycophancy signal that may also have pretraining
  origins."

- **[2026-05-16] H3:** Role/persona prompting reliably improves LLM intent
  interpretation.
  **Falsification:** find published evidence persona prompting fails or is
  irrelevant.
  **Result:** ELIMINATED — three convergent 2025 papers (Persona is a
  Double-Edged Sword IJCNLP 2025; Principled Personas EMNLP 2025;
  arXiv:2512.05858) show persona prompts are mixed-to-ineffective and highly
  sensitive to irrelevant details (up to ~30pp drops). This contradicts
  widespread prompt-engineering folklore.

- **[2026-05-16] H4:** CoT reliably mitigates poor intent interpretation.
  **Falsification:** find cases where CoT actively hurts or fails to help.
  **Result:** PARTIALLY ELIMINATED — arXiv:2409.06173 shows CoT suffers from
  posterior collapse: larger models anchor harder to reasoning priors under CoT,
  particularly on subjective tasks (emotion, morality). Adds to the existing
  inverted-U finding.

- **[2026-05-16] H5:** Pan et al. (arXiv:2308.03188) establishes that intrinsic
  self-correction without external ground truth degrades or fails to improve
  model performance.
  **Falsification:** paper doesn't exist; conclusion is reversed or
  domain-restricted in a way that doesn't support a general "no self-critique"
  claim.
  **Result:** PARTIALLY CONFIRMED with citation correction — Pan et al.
  2308.03188 exists and is a _survey_ by Liangming Pan et al. (UCSB, Aug 2023).
  The _stronger primary_ citation for the "intrinsic self-correction degrades
  performance" claim is Huang et al. arXiv:2310.01798 ("Large Language Models
  Cannot Self-Correct Reasoning Yet," Google DeepMind / UIUC, Oct 2023): "LLMs
  struggle to self-correct their responses without external feedback, and at
  times, their performance even degrades after self-correction." Both citations
  should appear; the strong claim should be attributed to Huang et al.

- **[2026-05-16] H6:** Wu, Wu, Zou (ClashEval, 2024) shows that adversarial
  reframing / lowering model confidence in a prior commitment reduces
  position-anchored question drift.
  **Falsification:** paper doesn't exist; paper is about general
  context-vs-prior conflict and doesn't support the "lower confidence →
  adherence" claim; effect is small or non-replicable.
  **Result:** PARTIALLY CONFIRMED with scope caveat — ClashEval (NeurIPS 2024)
  is real and the token-probability/adherence finding is supported: "the less
  confident a model is in its initial response (via measuring token
  probabilities), the more likely it is to adopt the information in the
  retrieved content." SCOPE: ClashEval tested RAG (retrieved content vs prior
  knowledge), NOT multi-turn anchoring on the model's own prior commitment. The
  mechanism (lower confidence → higher context adherence) is plausibly
  transferable, but the best-practices doc's claim extrapolates beyond the
  paper's actual experiment.

- **[2026-05-16] H7:** Jiang et al. (2026) "Think-Anywhere" is a real published
  paper introducing mid-sequence `<think>` insertion that catches errors a
  pre-commit plan cannot foresee.
  **Falsification:** paper does not exist (hallucinated citation); paper exists
  but does not make the claimed mid-sequence intervention finding.
  **Result:** CONFIRMED with metadata correction — "Think Anywhere in Code
  Generation" (arXiv:2603.29957, Jiang et al., late 2025 / early 2026,
  github.com/jiangxxxue/Think-Anywhere). Mechanism: special `<thinkanywhere>`
  tokens via SFT + RL; key finding "LLMs tend to invoke thinking at positions
  with higher entropy." The best-practices doc's "catches mid-implementation
  off-by-one errors" framing is a mild over-specification of "on-demand
  reasoning at high-entropy positions" but directionally accurate.
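
H1's falsification condition (a U-shape that fails to hold, or a monotonic drop
in its place) can be checked mechanically once accuracy-by-position numbers
exist. The following is a minimal Python sketch under stated assumptions: the
function name `classify_position_curve`, the split into start/middle/end thirds,
and the `tolerance` threshold are all illustrative choices, not taken from Bilan
et al. or Mak, and the sample numbers are made up.

```python
from statistics import mean


def classify_position_curve(accuracies, tolerance=0.02):
    """Label an accuracy-by-position curve.

    accuracies[i] is the measured accuracy when the relevant fact sits in the
    i-th position bucket of the context (0 = start, last index = end).
    Returns "u_shape", "monotonic_drop", or "other".
    The thirds split and the tolerance are assumptions made for illustration.
    """
    if len(accuracies) < 3:
        raise ValueError("need at least three position buckets")

    third = max(1, len(accuracies) // 3)
    start = mean(accuracies[:third])
    middle = mean(accuracies[third:len(accuracies) - third])
    end = mean(accuracies[-third:])

    # U-shape: both ends clearly beat the middle.
    if start - middle > tolerance and end - middle > tolerance:
        return "u_shape"
    # Monotonic drop: start beats middle, and middle beats end.
    if start - middle > tolerance and middle - end > tolerance:
        return "monotonic_drop"
    return "other"


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(classify_position_curve([0.90, 0.70, 0.55, 0.70, 0.88]))
    print(classify_position_curve([0.90, 0.80, 0.70, 0.60, 0.50]))
```

Running a classifier like this per context-length bucket would show whether the
U-shape gives way to a monotonic drop as contexts grow, which is the narrowing
that the H1 result describes.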

## Investigation Log

### 2026-05-16 — Initial three-doc production

- Orientation: understand
- What was examined: human-text-interpretation literature (Kruger, Byron,
  Aderka, Walther, Lieberman), LLM prompting literature (Anthropic 4.7 docs, Liu
  et al., Sharma et al., Wilf et al., Schulhoff Prompting Science Report 2).
- What was found: documented in the three research docs.
- What this means: descriptive synthesis available; no decision rules yet.
- Next step: methodology audit.

### 2026-05-16 — Methodology audit and adversarial second pass

- Orientation: diagnose
- What was examined: my own search behavior; ran the adversarial searches I
  should have run originally.
- What was found: positive bias in the original search framing missed important
  disconfirmations (H2, H3) and required qualifications (H1, H4); it also missed
  the foundational Schulhoff "Prompt Report" survey.
- What this means: prescriptive synthesis needs five concrete edits before it
  can drive an action plan.
- Next step: apply edits, then review ai-coding-best-practices.md with the same
  skepticism.

## Timing Notes

- Each Exa search: ~5–15s, including reading the first 40 lines of the dump.
- Free-tier rate limit means searches must be sequential.
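
The two notes above imply a simple time budget: N sequential searches cost
roughly 5N to 15N seconds. The sketch below shows one way to plan and pace a
search pass under that constraint. The names `estimate_budget_seconds` and
`run_sequentially` are hypothetical, and `search` is a placeholder callable
supplied by the caller; nothing here assumes a particular Exa client API.

```python
import time


def estimate_budget_seconds(n_queries, low=5.0, high=15.0):
    """Return (best_case, worst_case) wall-clock seconds for n sequential searches.

    The 5s/15s defaults come from the timing note; treat them as rough bounds.
    """
    return (n_queries * low, n_queries * high)


def run_sequentially(queries, search, min_interval=0.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Run searches one at a time, keeping at least min_interval seconds
    between the starts of consecutive calls.

    `search` is a caller-supplied function (hypothetical here) that takes a
    query string and returns a result. `sleep` and `clock` are injectable so
    the pacing logic can be tested without real delays.
    """
    results = []
    last_start = None
    for query in queries:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append((query, search(query)))
    return results


if __name__ == "__main__":
    best, worst = estimate_budget_seconds(12)
    print(f"12 searches: between {best:.0f}s and {worst:.0f}s")
```

Budgeting this way makes it easier to decide up front how many adversarial
queries a second pass can afford.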

## Open Questions

- Are the (still uncited) parallels in §4 of the synthesis worth another
  adversarial search pass, or should they be accepted as flagged "use with
  care"?
- Does `docs/research/ai-coding-best-practices.md` contain claims about persona
  prompting or CoT that now need correction?
- What is the right format for the final action plan — checklist,
  copilot-instructions edit, AGENTS.md addition, or a new
  `.agents/instructions/` file?