# Investigation: Text-Intent Interpretation (Human + LLM)

**Status:** investigating
**Orientation:** understand (mixed with mid-investigation methodology
correction)
**Created:** 2026-05-16
**Last Updated:** 2026-05-16

## Question

How do humans and LLMs (mis)interpret intent in text-only communication, and
what mitigations are supported by the literature? End goal: produce a concrete
action plan to counteract LLM intent-interpretation failures in this codebase.

## What We Know

- Three docs produced:
  [text-communication-interpretation.md](../research/text-communication-interpretation.md),
  [llm-intent-interpretation.md](../research/llm-intent-interpretation.md),
  [human-llm-interpretation-overlap.md](../research/human-llm-interpretation-overlap.md).
- Methodology critique recorded in
  [/memories/session/research-methodology-retrospective.md](/memories/session/research-methodology-retrospective.md).
- Five strongly-cited human↔LLM connections (primacy/recency↔serial position,
  ELIZA/hyperpersonal, sycophancy↔social desirability via RLHF preference data,
  perspective-taking↔SimToM, clarifying question↔CLAM).
- Bias-inheritance chain is two-stage (pretraining corpus vs. RLHF preference
  labels); see Mina et al. 2024, Sharma et al. 2024.

## Hypotheses

- **[2026-05-16] H1:** Lost-in-the-middle is a clean human-primacy/recency
  analog in LLMs.
  **Falsification:** find a replication where the U-shape doesn't hold or where
  the mechanism is shown to be different.
  **Result:** PARTIALLY ELIMINATED. Bilan et al. (arXiv:2508.07479, 2025) show
  the U-shape only holds up to ~50% of the context window; Mak (2025) shows
  positional-embedding decay produces a monotonic drop, not a U-shape, in
  very-long contexts. The analogy is real but narrower than I originally claimed.

- **[2026-05-16] H2:** RLHF preference labels cause sycophancy.
  **Falsification:** find evidence that base models (no RLHF) are sycophantic,
  or that some RLHF'd models are not.
  **Result:** PARTIALLY ELIMINATED. nostalgebraist (LessWrong, 2023) replicated
  Anthropic's sycophancy eval on OpenAI base models and found they are NOT
  sycophantic at any size. Sycophancy depends on the specific finetuning data
  and model family. The hypothesis should be rephrased as "in some model
  families, RLHF preference data amplifies a sycophancy signal that may also
  have pretraining origins."

- **[2026-05-16] H3:** Role/persona prompting reliably improves LLM intent
  interpretation.
  **Falsification:** find published evidence persona prompting fails or is
  irrelevant.
  **Result:** ELIMINATED. Three convergent 2025 papers (Persona is a
  Double-Edged Sword, IJCNLP 2025; Principled Personas, EMNLP 2025;
  arXiv:2512.05858) show persona prompts are mixed-to-ineffective and highly
  sensitive to irrelevant details (up to ~30pp drops). This contradicts
  widespread prompt-engineering folklore.

- **[2026-05-16] H4:** CoT reliably mitigates poor intent interpretation.
  **Falsification:** find cases where CoT actively hurts or fails to help.
  **Result:** PARTIALLY ELIMINATED. arXiv:2409.06173 shows CoT suffers from
  posterior collapse: larger models anchor harder to reasoning priors under CoT,
  particularly on subjective tasks (emotion, morality). Adds to the existing
  inverted-U finding.

- **[2026-05-16] H5:** Pan et al. (arXiv:2308.03188) establishes that intrinsic
  self-correction without external ground truth degrades or fails to improve
  model performance.
  **Falsification:** paper doesn't exist; conclusion is reversed or
  domain-restricted in a way that doesn't support a general "no self-critique"
  claim.
  **Result:** PARTIALLY CONFIRMED with citation correction. Pan et al.
  (arXiv:2308.03188) exists and is a _survey_ by Liangming Pan et al. (UCSB,
  Aug 2023). The _stronger primary_ citation for the "intrinsic self-correction
  degrades performance" claim is Huang et al. arXiv:2310.01798 ("Large Language
  Models Cannot Self-Correct Reasoning Yet," Google DeepMind / UIUC, Oct 2023):
  "LLMs struggle to self-correct their responses without external feedback, and
  at times, their performance even degrades after self-correction." Both
  citations should appear; the strong claim should be attributed to Huang et al.

- **[2026-05-16] H6:** Wu, Wu, Zou (ClashEval, 2024) shows adversarial reframing
  / lowering model confidence in a prior commitment reduces position-anchored
  question drift.
  **Falsification:** paper doesn't exist; paper is about general
  context-vs-prior conflict and doesn't support the "lower confidence →
  adherence" claim; effect is small or non-replicable.
  **Result:** PARTIALLY CONFIRMED with scope caveat. ClashEval (NeurIPS 2024)
  is real and the token-probability/adherence finding is supported: "the less
  confident a model is in its initial response (via measuring token
  probabilities), the more likely it is to adopt the information in the
  retrieved content." SCOPE: ClashEval tested RAG (retrieved content vs. prior
  knowledge), NOT multi-turn anchoring on the model's own prior commitment. The
  mechanism (lower confidence → higher context adherence) is plausibly
  transferable, but the best-practices doc's claim extrapolates beyond the
  paper's actual experiment.

- **[2026-05-16] H7:** Jiang et al. (2026) "Think-Anywhere" is a real published
  paper introducing mid-sequence `<think>` insertion that catches errors a
  pre-commit plan cannot foresee.
  **Falsification:** paper does not exist (hallucinated citation); paper exists
  but does not make the claimed mid-sequence intervention finding.
  **Result:** CONFIRMED with metadata correction. "Think Anywhere in Code
  Generation" (arXiv:2603.29957, Jiang et al., late 2025 / early 2026,
  github.com/jiangxxxue/Think-Anywhere). Mechanism: special `<thinkanywhere>`
  tokens via SFT + RL; key finding "LLMs tend to invoke thinking at positions
  with higher entropy." The best-practices doc's "catches mid-implementation
  off-by-one errors" framing is a mild over-specification of "on-demand
  reasoning at high-entropy positions" but directionally accurate.

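
To make "positions with higher entropy" in H7 concrete, here is a minimal sketch of what that phrase means operationally. It is not the paper's method (which trains the model with SFT + RL to emit the tokens itself); it only computes Shannon entropy over toy next-token distributions and flags positions above a threshold. The function names, the toy distributions, and the 1.5-bit threshold are all illustrative assumptions.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def high_entropy_positions(distributions, threshold_bits):
    """Return the indices whose next-token entropy meets the threshold.

    `distributions` is a list of per-position probability lists.
    The threshold is a hypothetical knob, not a value from the paper.
    """
    return [
        i
        for i, probs in enumerate(distributions)
        if shannon_entropy(probs) >= threshold_bits
    ]


# Toy data (made up for illustration): one distribution per token position.
toy_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # model is nearly certain
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four options
    [0.70, 0.10, 0.10, 0.10],  # fairly confident
    [0.40, 0.30, 0.20, 0.10],  # spread out
]

print(high_entropy_positions(toy_distributions, threshold_bits=1.5))  # prints [1, 3]
```

Under this reading, the claim in the best-practices doc is about where uncertainty concentrates, which is broader than any specific error class such as off-by-one bugs.
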
## Investigation Log

### 2026-05-16: Initial three-doc production

- Orientation: understand
- What was examined: human-text-interpretation literature (Kruger, Byron,
  Aderka, Walther, Lieberman), LLM prompting literature (Anthropic 4.7 docs, Liu
  et al., Sharma et al., Wilf et al., Schulhoff Prompting Science Report 2).
- What was found: documented in the three research docs.
- What this means: descriptive synthesis available; no decision rules yet.
- Next step: methodology audit.

### 2026-05-16: Methodology audit and adversarial second pass

- Orientation: diagnose
- What was examined: my own search behavior; ran the adversarial searches I
  should have run originally.
- What was found: positive bias in the original search framing missed important
  disconfirmations (H2, H3) and required qualifications (H1, H4); it also missed
  the foundational Schulhoff "Prompt Report" survey.
- What this means: prescriptive synthesis needs five concrete edits before it
  can drive an action plan.
- Next step: apply edits, then review ai-coding-best-practices.md with the same
  skepticism.

## Timing Notes

- Each Exa search: ~5–15s, including reading the first 40 lines of the dump.
- The free-tier rate limit means searches must run sequentially.

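
A minimal sketch of what the two notes above imply when planning a search pass: because searches run one after another, total wall-clock time scales linearly with the number of searches. The function name is hypothetical, and the default bounds simply reuse the ~5–15s per-search estimate; nothing here reflects an actual Exa API.

```python
def sequential_search_budget(n_searches, min_seconds=5.0, max_seconds=15.0):
    """Estimate (best, worst) wall-clock seconds for n sequential searches.

    Assumes no parallelism, matching the free-tier constraint noted above.
    Defaults come from the rough per-search timing and are assumptions.
    """
    if n_searches < 0:
        raise ValueError("n_searches must be non-negative")
    return (n_searches * min_seconds, n_searches * max_seconds)


best, worst = sequential_search_budget(12)
print(f"12 searches: {best:.0f}s to {worst:.0f}s")  # prints "12 searches: 60s to 180s"
```

An adversarial pass of a dozen queries therefore costs roughly one to three minutes of waiting, which is cheap relative to the disconfirmations the second pass surfaced.
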
## Open Questions

- Are the (still uncited) parallels in §4 of the synthesis worth another
  adversarial search pass, or should they be accepted as flagged "use with
  care"?
- Does `docs/research/ai-coding-best-practices.md` contain claims about persona
  prompting or CoT that now need correction?
- What is the right format for the final action plan: checklist,
  copilot-instructions edit, AGENTS.md addition, or a new
  `.agents/instructions/` file?