dotfiles/.agents/docs/text-intent-interpretation-research.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
- AGENTS.md: design principles, enforcement hierarchy, deferred loading
- agents/: brainstorm, build, orchestrator, research (auto-discovered by MCP server)
- skills/: research methodology (auto-discovered by MCP server)
- hooks/: pre-tool-use, post-tool-use (BFF block removed), session-start,
  stop, pre-compact, user-prompt-submit
- frameworks/: opencode/plugin.ts (resolves hooks via import.meta.url — works
  as project-local or global plugin), github/hooks.json
- mcp/index.ts: auto-discovers agents/*.md and skills/*.md from frontmatter
  (replaces hand-maintained registry); server renamed all-agents
- docs/: agent-infrastructure.md (generalized), research docs (7 files),
  ai_architectures.md, llama-server-cuda-wsl2.md
- install.sh: idempotent setup — Copilot global hooks, OpenCode global plugin +
  AGENTS.md + MCP entry, VS Code global MCP config
2026-05-22 13:13:43 -04:00

# Investigation: Text-Intent Interpretation (Human + LLM)
**Status:** investigating
**Orientation:** understand (mixed with mid-investigation methodology
correction)
**Created:** 2026-05-16
**Last Updated:** 2026-05-16
## Question
How do humans and LLMs (mis)interpret intent in text-only communication, and
what mitigations are supported by the literature? End goal: produce a concrete
action plan to counteract LLM intent-interpretation failures in this codebase.
## What We Know
- Three docs produced:
[text-communication-interpretation.md](../research/text-communication-interpretation.md),
[llm-intent-interpretation.md](../research/llm-intent-interpretation.md),
[human-llm-interpretation-overlap.md](../research/human-llm-interpretation-overlap.md).
- Methodology critique recorded in
[/memories/session/research-methodology-retrospective.md](/memories/session/research-methodology-retrospective.md).
- Five strongly-cited human↔LLM connections (primacy/recency↔serial position,
ELIZA/hyperpersonal, sycophancy↔social desirability via RLHF preference data,
perspective-taking↔SimToM, clarifying question↔CLAM).
- Bias-inheritance chain is two-stage (pretraining corpus vs. RLHF preference
labels) — Mina et al. 2024, Sharma et al. 2024.
## Hypotheses
- **[2026-05-16] H1:** Lost-in-the-middle is a clean human-primacy/recency
analog in LLMs.
**Falsification:** find a replication where the U-shape doesn't hold or where
the mechanism is shown to be different.
**Result:** PARTIALLY ELIMINATED — Bilan et al. (arXiv:2508.07479, 2025) shows
that the U-shape holds only up to ~50% of the context window; Mak (2025) shows
that positional-embedding decay produces a monotonic drop, not a U-shape, in
very long contexts. The analogy is real but narrower than I originally claimed.
- **[2026-05-16] H2:** RLHF preference labels cause sycophancy.
**Falsification:** find evidence that base models (no RLHF) are sycophantic,
or that some RLHF'd models are not.
**Result:** PARTIALLY ELIMINATED — nostalgebraist (LessWrong, 2023) replicated
Anthropic's sycophancy eval on OpenAI base models and found they are NOT
sycophantic at any size. Sycophancy depends on the specific finetuning data
and model family. Should be rephrased as "in some model families, RLHF
preference data amplifies a sycophancy signal that may also have pretraining
origins."
- **[2026-05-16] H3:** Role/persona prompting reliably improves LLM intent
interpretation.
**Falsification:** find published evidence persona prompting fails or is
irrelevant.
**Result:** ELIMINATED — three convergent 2025 papers (Persona is a
Double-Edged Sword IJCNLP 2025; Principled Personas EMNLP 2025;
arXiv:2512.05858) show persona prompts are mixed-to-ineffective and highly
sensitive to irrelevant details (up to ~30pp drops). This contradicts
widespread prompt-engineering folklore.
- **[2026-05-16] H4:** CoT reliably mitigates poor intent interpretation.
**Falsification:** find cases where CoT actively hurts or fails to help.
**Result:** PARTIALLY ELIMINATED — arXiv:2409.06173 shows CoT suffers from
posterior collapse: larger models anchor harder to reasoning priors under CoT,
particularly on subjective tasks (emotion, morality). Adds to the existing
inverted-U finding.
- **[2026-05-16] H5:** Pan et al. (arXiv:2308.03188) establishes that intrinsic
self-correction without external ground truth degrades or fails to improve
model performance.
**Falsification:** paper doesn't exist; conclusion is reversed or
domain-restricted in a way that doesn't support a general "no self-critique"
claim.
**Result:** PARTIALLY CONFIRMED with citation correction — Pan et al.
2308.03188 exists and is a _survey_ by Liangming Pan et al. (UCSB, Aug 2023).
The _stronger primary_ citation for the "intrinsic self-correction degrades
performance" claim is Huang et al. arXiv:2310.01798 ("Large Language Models
Cannot Self-Correct Reasoning Yet," Google DeepMind / UIUC, Oct 2023): "LLMs
struggle to self-correct their responses without external feedback, and at
times, their performance even degrades after self-correction." Both citations
should appear; the strong claim should attribute to Huang et al.
- **[2026-05-16] H6:** Wu, Wu, and Zou (ClashEval, 2024) show that adversarial
reframing / lowering model confidence in a prior commitment reduces
position-anchored question drift.
**Falsification:** paper doesn't exist; paper is about general
context-vs-prior conflict and doesn't support the "lower confidence →
adherence" claim; effect is small or non-replicable.
**Result:** PARTIALLY CONFIRMED with scope caveat — ClashEval (NeurIPS 2024)
is real and the token-probability/adherence finding is supported: "the less
confident a model is in its initial response (via measuring token
probabilities), the more likely it is to adopt the information in the
retrieved content." SCOPE: ClashEval tested RAG (retrieved content vs prior
knowledge), NOT multi-turn anchoring on the model's own prior commitment. The
mechanism (lower confidence → higher context adherence) is plausibly
transferable, but the best-practices doc's claim extrapolates beyond the
paper's actual experiment.
- **[2026-05-16] H7:** Jiang et al. (2026) "Think-Anywhere" is a real published
paper introducing mid-sequence `<think>` insertion that catches errors a
pre-commit plan cannot foresee.
**Falsification:** paper does not exist (hallucinated citation); paper exists
but does not make the claimed mid-sequence intervention finding.
**Result:** CONFIRMED with metadata correction — "Think Anywhere in Code
Generation" (arXiv:2603.29957, Jiang et al., late 2025 / early 2026,
github.com/jiangxxxue/Think-Anywhere). Mechanism: special `<thinkanywhere>`
tokens via SFT + RL; key finding "LLMs tend to invoke thinking at positions
with higher entropy." The best-practices doc's "catches mid-implementation
off-by-one errors" framing is a mild over-specification of "on-demand
reasoning at high-entropy positions" but directionally accurate.
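
The H6 result turns on "measuring token probabilities" as a confidence signal. A minimal sketch of one way to collapse per-token log-probabilities into a single score follows. The function names, the arithmetic-mean aggregation, and the 0.5 threshold are all illustrative assumptions; ClashEval's exact aggregation may differ, and the threshold is not taken from the paper.

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Arithmetic mean of per-token probabilities for one generated answer.

    Assumption: the caller already has the natural-log probability of each
    generated token (many inference APIs can return these).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def likely_to_defer_to_context(
    token_logprobs: Sequence[float], threshold: float = 0.5
) -> bool:
    """Flag a low-confidence initial answer.

    The threshold is a hypothetical cut-off for illustration only. Per the
    ClashEval finding quoted in H6, lower confidence goes with a higher chance
    of adopting retrieved content; this does not predict it for a single case.
    """
    return mean_token_probability(token_logprobs) < threshold
```

The SCOPE caveat in H6 still applies: a score like this describes confidence in a prior answer in a RAG setting, not multi-turn anchoring on the model's own earlier commitment.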
## Investigation Log
### 2026-05-16 — Initial three-doc production
- Orientation: understand
- What was examined: human-text-interpretation literature (Kruger, Byron,
Aderka, Walther, Lieberman), LLM prompting literature (Anthropic 4.7 docs, Liu
et al., Sharma et al., Wilf et al., Schulhoff Prompting Science Report 2).
- What was found: documented in the three research docs.
- What this means: descriptive synthesis available; no decision rules yet.
- Next step: methodology audit.
### 2026-05-16 — Methodology audit and adversarial second pass
- Orientation: diagnose
- What was examined: my own search behavior; ran the adversarial searches I
should have run originally.
- What was found: positive bias in the original search framing missed important
disconfirmations (H2, H3) and required qualifications (H1, H4); it also missed
the foundational Schulhoff "Prompt Report" survey.
- What this means: prescriptive synthesis needs five concrete edits before it
can drive an action plan.
- Next step: apply edits, then review ai-coding-best-practices.md with the same
skepticism.
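
The audit above amounts to pairing every hypothesis with a search framed to disconfirm it before any verdict is written. A minimal sketch of a ledger that enforces that ordering follows. The `Hypothesis` class and its field names are hypothetical and not part of this repo; the field layout simply mirrors the Hypotheses entries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    label: str
    claim: str
    falsification: str  # what evidence would disconfirm the claim
    disconfirming_searches: List[str] = field(default_factory=list)
    result: Optional[str] = None

    def log_search(self, query: str) -> None:
        """Record a search that was framed to find disconfirming evidence."""
        self.disconfirming_searches.append(query)

    def record_result(self, verdict: str) -> None:
        """Refuse a verdict until at least one adversarial search is logged."""
        if not self.disconfirming_searches:
            raise ValueError(
                f"{self.label}: run a disconfirming search before recording a result"
            )
        self.result = verdict


h3 = Hypothesis(
    label="H3",
    claim="Role/persona prompting reliably improves LLM intent interpretation.",
    falsification="find published evidence persona prompting fails or is irrelevant",
)
h3.log_search("persona prompting no effect OR degrades accuracy")  # example query
h3.record_result("ELIMINATED")
```

The example query string is invented for illustration; the point is only that `record_result` fails loudly when the adversarial pass was skipped.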
## Timing Notes
- Each Exa search: ~515s including read of first 40 lines of dump.
- Free-tier rate limit means searches must be sequential.
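
A minimal sketch of running searches strictly one at a time while recording per-search timing, as the notes above describe. The `search` callable is a hypothetical stand-in for whatever client is used; no real search API is assumed, and `min_interval_s` is a placeholder because the actual rate limit is not recorded here.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def run_sequentially(
    queries: Iterable[str],
    search: Callable[[str], str],
    min_interval_s: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str, float]]:
    """Run searches one after another.

    Returns one (query, result, elapsed_seconds) tuple per search. Waits so
    that consecutive searches start at least `min_interval_s` apart.
    """
    results: List[Tuple[str, str, float]] = []
    last_start: Optional[float] = None
    for query in queries:
        if last_start is not None:
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        result = search(query)
        results.append((query, result, clock() - last_start))
    return results
```

Passing `clock` and `sleep` as parameters keeps the timing logic testable without real waits.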
## Open Questions
- Are the (still uncited) parallels in §4 of the synthesis worth another
adversarial search pass, or should they be accepted as flagged "use with care"?
- Does `docs/research/ai-coding-best-practices.md` contain claims about persona
prompting or CoT that now need correction?
- What is the right format for the final action plan — checklist,
copilot-instructions edit, AGENTS.md addition, or a new
`.agents/instructions/` file?
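
For the second open question, a first pass could be mechanical: flag lines in the best-practices doc that mention persona prompting or CoT, then review each hit by hand against H3 and H4. A minimal sketch follows. The watch patterns are illustrative assumptions and will both over-match and under-match.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical watch terms, one group per hypothesis weakened above.
WATCH_TERMS: Dict[str, str] = {
    "H3 persona prompting": r"\bpersonas?\b|\brole prompt",
    "H4 chain-of-thought": r"\bCoT\b|chain[- ]of[- ]thought|step[- ]by[- ]step",
}


def flag_lines(
    lines: Iterable[str], watch_terms: Dict[str, str] = WATCH_TERMS
) -> List[Tuple[int, str, str]]:
    """Return (line_number, label, stripped_line) for every watch-term match."""
    compiled = {
        label: re.compile(pattern, re.IGNORECASE)
        for label, pattern in watch_terms.items()
    }
    hits: List[Tuple[int, str, str]] = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in compiled.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits
```

Usage would be along the lines of `flag_lines(Path("docs/research/ai-coding-best-practices.md").read_text().splitlines())`, with `Path` imported from `pathlib`. A hit only marks a line for manual review; it does not show the claim is wrong.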