dotfiles/.agents/docs/text-intent-interpretation-research.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config
2026-05-22 13:13:43 -04:00


Investigation: Text-Intent Interpretation (Human + LLM)

Status: investigating
Orientation: understand (mixed with mid-investigation methodology correction)
Created: 2026-05-16
Last Updated: 2026-05-16

Question

How do humans and LLMs (mis)interpret intent in text-only communication, and what mitigations are supported by the literature? End goal: produce a concrete action plan to counteract LLM intent-interpretation failures in this codebase.

What We Know

Hypotheses

  • [2026-05-16] H1: Lost-in-the-middle is a clean human-primacy/recency analog in LLMs.
    Falsification: find a replication where the U-shape doesn't hold or where the mechanism is shown to be different.
    Result: PARTIALLY ELIMINATED — Bilan et al. (arXiv:2508.07479, 2025) shows U-shape only holds up to ~50% of context window; Mak (2025) shows positional-embedding decay produces monotonic drop, not U-shape, in very-long contexts. The analogy is real but narrower than I originally claimed.

  • [2026-05-16] H2: RLHF preference labels cause sycophancy.
    Falsification: find evidence that base models (no RLHF) are sycophantic, or that some RLHF'd models are not.
    Result: PARTIALLY ELIMINATED — nostalgebraist (LessWrong, 2023) replicated Anthropic's sycophancy eval on OpenAI base models and found they are NOT sycophantic at any size. Sycophancy depends on the specific finetuning data and model family. Should be rephrased as "in some model families, RLHF preference data amplifies a sycophancy signal that may also have pretraining origins."

  • [2026-05-16] H3: Role/persona prompting reliably improves LLM intent interpretation.
    Falsification: find published evidence persona prompting fails or is irrelevant.
    Result: ELIMINATED — three convergent 2025 papers (Persona is a Double-Edged Sword IJCNLP 2025; Principled Personas EMNLP 2025; arXiv:2512.05858) show persona prompts are mixed-to-ineffective and highly sensitive to irrelevant details (up to ~30pp drops). This contradicts widespread prompt-engineering folklore.

  • [2026-05-16] H4: CoT reliably mitigates poor intent interpretation.
    Falsification: find cases where CoT actively hurts or fails to help.
    Result: PARTIALLY ELIMINATED — arXiv:2409.06173 shows CoT suffers from posterior collapse: larger models anchor harder to reasoning priors under CoT, particularly on subjective tasks (emotion, morality). Adds to the existing inverted-U finding.

  • [2026-05-16] H5: Pan et al. (arXiv:2308.03188) establishes that intrinsic self-correction without external ground truth degrades or fails to improve model performance.
    Falsification: paper doesn't exist; conclusion is reversed or domain-restricted in a way that doesn't support a general "no self-critique" claim.
    Result: PARTIALLY CONFIRMED with citation correction — Pan et al. 2308.03188 exists and is a survey by Liangming Pan et al. (UCSB, Aug 2023). The stronger primary citation for the "intrinsic self-correction degrades performance" claim is Huang et al. arXiv:2310.01798 ("Large Language Models Cannot Self-Correct Reasoning Yet," Google DeepMind / UIUC, Oct 2023): "LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction." Both citations should appear; the strong claim should attribute to Huang et al.

  • [2026-05-16] H6: Wu, Wu, Zou (ClashEval, 2024) shows adversarial reframing / lowering model confidence in a prior commitment reduces position-anchored question drift.
    Falsification: paper doesn't exist; paper is about general context-vs-prior conflict and doesn't support the "lower confidence → adherence" claim; effect is small or non-replicable.
    Result: PARTIALLY CONFIRMED with scope caveat — ClashEval (NeurIPS 2024) is real and the token-probability/adherence finding is supported: "the less confident a model is in its initial response (via measuring token probabilities), the more likely it is to adopt the information in the retrieved content." SCOPE: ClashEval tested RAG (retrieved content vs prior knowledge), NOT multi-turn anchoring on the model's own prior commitment. The mechanism (lower confidence → higher context adherence) is plausibly transferable, but the best-practices doc's claim extrapolates beyond the paper's actual experiment.

  • [2026-05-16] H7: Jiang et al. (2026) "Think-Anywhere" is a real published paper introducing mid-sequence <think> insertion that catches errors a pre-commit plan cannot foresee.
    Falsification: paper does not exist (hallucinated citation); paper exists but does not make the claimed mid-sequence intervention finding.
    Result: CONFIRMED with metadata correction — "Think Anywhere in Code Generation" (arXiv:2603.29957, Jiang et al., late 2025 / early 2026, github.com/jiangxxxue/Think-Anywhere). Mechanism: special <thinkanywhere> tokens via SFT + RL; key finding "LLMs tend to invoke thinking at positions with higher entropy." The best-practices doc's "catches mid-implementation off-by-one errors" framing is a mild over-specification of "on-demand reasoning at high-entropy positions" but directionally accurate.
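    The H6 result rests on confidence measured "via measuring token probabilities". A minimal sketch of one such confidence proxy follows, assuming per-token log-probabilities are available from whatever inference API is in use. The function name mean_token_probability and the sample values are hypothetical illustrations, not taken from ClashEval, and the paper's exact metric may differ.

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Average per-token probability of a response.

    Hypothetical confidence proxy: higher means the model assigned more
    probability mass to the tokens it actually produced.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


# Illustrative values only, not measured data.
confident_answer = [math.log(0.90), math.log(0.95), math.log(0.85)]
hesitant_answer = [math.log(0.40), math.log(0.55), math.log(0.30)]

if __name__ == "__main__":
    print(round(mean_token_probability(confident_answer), 3))
    print(round(mean_token_probability(hesitant_answer), 3))
```

    Under the ClashEval finding, the lower-scoring answer would be the one more likely to be overridden by retrieved content. Whether the same holds for a model's own earlier turn is the extrapolation flagged in the scope caveat.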

Investigation Log

2026-05-16 — Initial three-doc production

  • Orientation: understand
  • What was examined: human-text-interpretation literature (Kruger, Byron, Aderka, Walther, Lieberman), LLM prompting literature (Anthropic 4.7 docs, Liu et al., Sharma et al., Wilf et al., Schulhoff Prompting Science Report 2).
  • What was found: documented in the three research docs.
  • What this means: descriptive synthesis available; no decision rules yet.
  • Next step: methodology audit.

2026-05-16 — Methodology audit and adversarial second pass

  • Orientation: diagnose
  • What was examined: my own search behavior; ran the adversarial searches I should have run originally.
  • What was found: positive bias in the original search framing missed important disconfirmations (H2, H3) and required qualifications (H1, H4); it also missed the foundational Schulhoff "Prompt Report" survey.
  • What this means: prescriptive synthesis needs five concrete edits before it can drive an action plan.
  • Next step: apply edits, then review ai-coding-best-practices.md with the same skepticism.

Timing Notes

  • Each Exa search: ~515s including read of first 40 lines of dump.
  • Free-tier rate limit means searches must be sequential.
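
The sequential constraint above can be sketched as a throttled loop. This is an illustration under stated assumptions: run_search stands in for whatever function issues a single search, and min_interval_s is a hypothetical spacing between requests; neither is a real Exa client call or a documented rate limit.

```python
import time
from typing import Callable, Iterable, List, Optional


def run_sequentially(
    queries: Iterable[str],
    run_search: Callable[[str], str],   # hypothetical single-search function
    min_interval_s: float,              # hypothetical spacing between requests
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run searches one at a time, starting each at least min_interval_s after the last."""
    results: List[str] = []
    last_start: Optional[float] = None
    for query in queries:
        if last_start is not None:
            remaining = min_interval_s - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)  # wait out the rest of the interval
        last_start = clock()
        results.append(run_search(query))
    return results
```

Passing sleep and clock as parameters keeps the loop testable without real waiting; in normal use the defaults apply.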

Open Questions

  • Are the (still uncited) parallels in §4 of the synthesis worth another adversarial search pass, or should they be accepted as flagged "use with care"?
  • Does docs/research/ai-coding-best-practices.md contain claims about persona prompting or CoT that now need correction?
  • What is the right format for the final action plan — checklist, copilot-instructions edit, AGENTS.md addition, or a new .agents/instructions/ file?