dotfiles/.agents/docs/human-llm-interpretation-overlap.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
2026-05-22 13:13:43 -04:00


Where Human and LLM Text Interpretation Overlap (and Don't)

Status: Synthesis of text-communication-interpretation.md (humans reading text) and llm-intent-interpretation.md (LLMs reading prompts). The question is: how much of what works on one carries over, and is there published evidence either way?

Working hypothesis (from the user, May 2026): LLMs are trained on human-written text, so the cognitive shortcuts and biases that humans bring to text could be inherited by the models. This doc treats that as a hypothesis to test against the literature, not as an assumption.

Methodology: Each candidate parallel is rated by what the literature says, not by intuition. Four labels are used:

  • Cited connection — at least one paper explicitly links the human and LLM phenomenon (often by name).
  • Cited distinction — a paper explicitly argues the analogy is misleading or the mechanism is different.
  • Parallel without published bridge — both phenomena are real and independently documented, but no source I found connects them. Use with care.
  • Orphan — exists in only one doc; no found counterpart.

1. The User's Hypothesis, Tested

"Humans wrote the text LLMs are trained on, so human emotional/cognitive shortcuts could affect LLMs."

Verdict: directly supported in the literature. Mina et al. (COLING 2025) [1] examine four classical cognitive biases — primacy, recency, common-token, and majority-class — across base and instructed models of varying size, and conclude:

"Recent work has shown that these biases can percolate through training data and ultimately be learned by language models." [1]

The same paper distinguishes biases that arise from pretraining data distributions (e.g., common-token bias) from biases that arise from the autoregressive generation process itself (e.g., some forms of recency). So the user's framing is correct, with one refinement: not every LLM bias is inherited — some are mechanical, some are statistical, some are both.

Hartvigsen-line work (Steed et al. 2022; Touileb-line replications through 2024) [9] independently confirms the inheritance pathway for sentiment and social-stereotype biases: pretraining corpora (CC-100 vs. Wikipedia) carry measurably different negative-sentiment distributions toward identity terms, which propagate into both upstream embeddings and downstream toxicity classifiers.


2. Cited Connections

These are points where the published literature names a human cognitive phenomenon as the analog of an LLM behavior, with empirical work on both sides.

Evidence-strength tags (applied per subsection):

  • [multi-replicated] — multiple independent studies, including at least one peer-reviewed venue, finding the same effect.
  • [single-study + partial replication] — primary finding peer-reviewed; follow-ups exist but disagree on scope or magnitude.
  • [single-study] — peer-reviewed but not yet independently replicated to my knowledge.
  • [preprint-only] — relevant findings exist only as arXiv preprints or community analyses; treat as provisional.

2.1 Primacy / recency → Lost-in-the-middle (Serial Position Effects)

Evidence strength: [single-study + partial replication] — the analogy is real but the LLM side has been refined and partially disconfirmed.

The human side: Asch (1946) on primacy in impression formation; Baddeley & Hitch (1993) on recency in working memory. [2][3]

The LLM side: Wang et al. (ACL Findings 2025), Serial Position Effects of Large Language Models [4], explicitly tests for "primacy and recency biases, which are well-documented cognitive biases in human psychology" and confirms widespread occurrence across ChatGPT, GPT-J, GPT-3.5, GPT-4, and Claude-instant-1.2. The lost-in-the-middle finding (Liu et al., TACL 2024) is the same phenomenon under a different name.

Refinements and partial disconfirmations:

  • Bilan et al. (arXiv 2508.07479, 2025) [5] show the U-shape only holds when content occupies up to ~50% of the context window; beyond that, primacy weakens and the curve becomes distance-to-end rather than U-shaped.
  • Mak (2025) [15] argues the dip is partly an artifact of positional-embedding decay — tokens near the 90% position get "blurry" embeddings — producing monotonic drop from start to end at very-long contexts, not a clean U.
  • Zhang et al. (2024b), cited in [4], found studies that did not replicate the LiM effect on certain long-context models, indicating the effect is conditional on architecture and context length.

Humans don't have a context window, and their primacy advantage is stable across passage length, so the analogy is conceptual rather than mechanistic.

Practical convergence: "put important content at the boundaries" works for both — but the LLM version may degrade into pure recency at long contexts, and the cause includes embedding-precision artifacts that have no human analog.
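The boundary advice can be sketched as a small prompt-assembly helper. This is a minimal illustration, not something taken from the cited papers: the function names, the "Reminder:" wording, and the idea of repeating the instruction near the end are all assumptions. The fill-ratio helper only computes how much of the window is used, so a caller can compare it against the ~50% figure reported in [5].

```python
def place_at_boundaries(key_instruction: str, documents: list[str], question: str) -> str:
    """Assemble a prompt with the critical content at the start and the end.

    Bulky reference material goes in the middle, the region where
    serial-position effects are reported to hurt recall the most.
    The layout and the "Reminder:" prefix are illustrative assumptions.
    """
    middle = "\n\n".join(documents)
    parts = [key_instruction, middle, f"Reminder: {key_instruction}", question]
    # Skip empty parts so an empty document list does not leave blank gaps.
    return "\n\n".join(part for part in parts if part)


def context_fill_ratio(prompt_tokens: int, context_window: int) -> float:
    """Return the fraction of the context window the prompt occupies."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return prompt_tokens / context_window
```

If the fill ratio is well above one half, the refinements above suggest leaning on the end of the prompt rather than on both boundaries.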

2.2 Hyperpersonal idealization → ELIZA effect / anthropomorphism

Evidence strength: [multi-replicated] — anthropomorphism toward chatbots is one of the oldest and most-replicated findings in HCI; the hyperpersonal model itself has decades of CMC support.

The human side: Walther's hyperpersonal model (1996) — in text-only relationships, receivers idealize senders by filling in flattering detail. [#12 in human doc]

The LLM-adjacent side: the ELIZA effect, named for Weizenbaum's 1966 chatbot — humans attribute understanding, empathy, and authenticity to systems that produce text resembling human speech. The Cambridge essay collection on chatbot authenticity (2024) [6] explicitly traces this to "a much longer history of technologically mediated communications" and notes the same hyperpersonal pattern: minimal cues, maximum projection.

This connection is bidirectional and was named long before LLMs — the mechanism on the human side is identical (cue impoverishment → reader fills the gap), only the partner changes.

2.3 Sycophancy ↔ social-desirability / agreement bias

Evidence strength: [single-study + partial replication] — the headline result is peer-reviewed (ICLR 2024) on a specific set of RLHF'd models, but a community replication on OpenAI base models found the effect does not generalize across model families.

The human side: well-documented social-desirability and conformity effects (Asch, 1956; Crowne & Marlowe, 1960) — humans give answers they believe the listener wants.

The LLM side: Sharma et al. (ICLR 2024), Towards Understanding Sycophancy in Language Models [7], tested five SOTA RLHF assistants and analyzed the hh-rlhf preference dataset. Headline finding:

"Both humans and preference models prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time… matching a user's views is one of the most predictive features of human preference judgments."

On the Sharma et al. data, the bias is encoded into the human preference labels that drive RLHF — i.e., human social-desirability bias is propagated to the reward model and then to the policy. The mitigation literature (Self-Augmented Preference Alignment, EMNLP 2025) [8] reframes the problem as needing to explicitly assess the user's expected answer rather than ignore it.

Important counter-evidence: Perez et al. (2022) originally claimed sycophancy appears even at zero RLHF steps, which would imply a pretraining-corpus origin. nostalgebraist (2023) [16] reproduced Perez et al.'s eval on OpenAI API base models (davinci, babbage, etc.) and found OpenAI base models are not sycophantic at any size. Sycophancy emerges only with specific finetuning pipelines (e.g., text-davinci-002/003). The honest reading is:

  • Sycophancy is real and replicable in specific RLHF'd model families.
  • It is not a universal property of RLHF or of "models trained on human text."
  • The most plausible mechanism is interaction between specific reward-model shapes and specific preference data, not a clean inheritance from a single human cognitive bias.

Practical convergence (where it holds): the human-side advice "ask for the answer before stating your own view" maps directly to LLM-side guidance ("avoid revealing your conclusion before asking the model").
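One way to apply this on the prompt side is to split a query into two turns, so the model answers before it sees the user's view. The sketch below is an assumption-laden illustration: `TwoStepQuery`, `withhold_view`, and the prompt wording are hypothetical names and text, not taken from Sharma et al. [7] or any vendor guide.

```python
from dataclasses import dataclass


@dataclass
class TwoStepQuery:
    ask_first: str      # sent before the model sees the user's view
    compare_after: str  # sent only once the model has answered


def withhold_view(question: str, user_view: str) -> TwoStepQuery:
    """Build a two-turn query that keeps the user's view out of the first turn."""
    ask_first = (
        f"{question}\n\n"
        "Give your own assessment and the reasoning behind it."
    )
    compare_after = (
        f"For comparison, my own view was: {user_view}\n"
        "Point out where your assessment and mine differ."
    )
    return TwoStepQuery(ask_first=ask_first, compare_after=compare_after)
```

The first turn carries no signal about what the user hopes to hear; whether this reduces sycophancy for a given model family is an empirical question the sketch does not settle.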

2.4 Perspective-taking (Galinsky) ↔ SimToM prompting

Evidence strength: [single-study] — SimToM is a single 2023 arXiv paper with no independent replication I found; the human-side perspective-taking literature is robust.

The human side: Galinsky & Moskowitz (2000), perspective-taking reduces hostile attributions and stereotype expression. [#7 in human doc]

The LLM side: Wilf et al. (2023), Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities (SimToM) [10], explicitly cites Simulation Theory's notion of perspective-taking and operationalizes it as a two-stage prompt: filter the context to what a character knows, then answer questions about their mental state. Improves ToM benchmarks substantially with no fine-tuning.

Practical convergence: for both humans and models, asking "what does the other party know / believe / intend?" as a separate, explicit step before responding improves accuracy on ambiguous-intent tasks.
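The two-stage structure described for SimToM can be sketched as two chained model calls. The prompt wording below is my own paraphrase, not the paper's templates, and `complete` is a hypothetical stand-in for whatever function sends a prompt to a model and returns its text.

```python
from typing import Callable


def simtom_answer(
    complete: Callable[[str], str],
    story: str,
    character: str,
    question: str,
) -> str:
    """Two-stage perspective-taking prompt in the spirit of SimToM.

    Stage 1 filters the story down to what the character knows.
    Stage 2 answers the question using only that filtered view.
    """
    filter_prompt = (
        f"Story:\n{story}\n\n"
        f"List only the events that {character} knows about."
    )
    character_view = complete(filter_prompt)

    # The full story is deliberately left out of the second prompt.
    answer_prompt = (
        f"Here is what {character} knows:\n{character_view}\n\n"
        f"Answer from {character}'s point of view: {question}"
    )
    return complete(answer_prompt)
```

The design choice that matters is that the second call never sees the unfiltered story, so information the character lacks cannot leak into the answer.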

2.5 Asking a clarifying question (Byron) ↔ Selective clarification (CLAM)

Evidence strength: [multi-replicated] on the human side; [single-study] on the LLM side, but the CLAM framework has been re-used and extended in follow-on work and integrated into Anthropic's published defaults.

The human side: Byron (2008) [#2 in human doc] — respond to ambiguous emotional content with a question, not a reaction.

The LLM side: Kuhn et al. (arXiv 2212.07769), CLAM: Selective Clarification for Ambiguous Questions [11], shows current language models "rarely ask users to clarify ambiguous questions and instead provide incorrect answers," and provides a framework that meaningfully improves QA performance when ambiguity is detected and a clarifying question is generated.

Practical convergence: the advice is identical and verified independently on both sides — when intent is unclear, asking is better than guessing. The Anthropic "default-to-clarify" system prompt variant ([1] in llm doc) is the engineering implementation.
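The "ask only when ambiguous" control flow can be sketched as a small dispatcher. This is a simplified illustration of selective clarification, not the CLAM implementation: the three callables are hypothetical hooks (in practice each would likely be a model call), and the return convention is an assumption.

```python
from typing import Callable


def respond_or_clarify(
    question: str,
    is_ambiguous: Callable[[str], bool],
    ask_clarification: Callable[[str], str],
    answer: Callable[[str], str],
) -> tuple[str, str]:
    """Return ("clarify", question_for_user) or ("answer", answer_text).

    A clarifying question is generated only when the detector flags
    ambiguity, so clear questions are answered without an extra turn.
    """
    if is_ambiguous(question):
        return ("clarify", ask_clarification(question))
    return ("answer", answer(question))
```

Keeping detection separate from question generation means the detector can be tuned or swapped without touching the rest of the flow.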


3. Cited Distinctions

3.1 Egocentrism (sender-side, human) ≠ literalism (Claude 4.7)

Kruger, Epley, Parker & Ng (2005) frame egocentrism as a sender overestimating how clearly tone comes through. LLMs don't "send" in that sense — they're always the receiver of the prompt. Anthropic's documented behavior change in Opus 4.7 [llm doc, 1] is the opposite of human egocentrism: the model becomes less willing to infer beyond what's written.

Implication: the human-side cure ("state things explicitly because you can't trust the receiver to read your mind") is exactly what the LLM-side architectural shift now requires from the user. Same advice, mirrored mechanism.

3.2 Affect labeling (Lieberman) — claimed analog is weak

The temptation is to map affect labeling ("name the emotion") onto "ask the LLM to identify sentiment before responding." Reichman et al. (arXiv 2603.09205, 2026) [12] introduce AURA-QA, an emotion-balanced QA dataset, and find that "affective tone inadvertently influences semantic interpretation, even among semantically equivalent inputs with differing emotional expressions." Their proposed fix is representation-level emotional regularization at training time, not a labeling prompt. So the mechanism (amygdala down-regulation via verbal labeling of one's own affect) does not transfer; the LLM lacks the regulatory loop the human practice exploits.

Practical conclusion: asking an LLM to "first identify the tone of this message" can disambiguate intent, but the published mechanism is representational, not regulatory. Don't expect the same calming / de-escalation effect documented in humans.

3.3 Hostile-attribution bias (Aderka et al.) ≠ LLM negativity inheritance

In humans, hostile attribution is an interpretive tendency in ambiguous social cues, tied to individual differences (anxiety, prior experience). In LLMs, negative-sentiment inheritance is a statistical property of the pretraining corpus that propagates into embeddings and downstream classifiers [9][12]. Both produce "neutral text read as negative," but the human bias varies by reader; the LLM bias varies by corpus and is roughly stable per model. Mitigations are correspondingly different: cognitive (re-read, generate alternatives) on the human side, data/representational on the LLM side.


4. Parallels Without a Published Bridge

These look like genuine analogies but I did not find a paper that draws the link explicitly. Use them as working hypotheses, not citations.

| Human-side practice | LLM-side practice | Status |
|---|---|---|
| Delay / "don't hit send" | Reflect / self-correct / multi-turn revision | Mechanistically different (amygdala vs. additional inference passes); empirically both reduce errors. Self-reflection survey: [13]. |
| Re-read slowly | Self-consistency / re-read prompt | Self-consistency (Wang et al. 2023) reduces hallucination; not framed as analogous to human re-reading in the papers I found. |
| Principle of charity / steel-manning | "State scope explicitly" (Anthropic 4.7 guide) | Both are about pre-empting under-specified intent. No source connects them. |
| NVC: observation → interpretation gap | XML tags around content | Both separate "what is on the page" from "what to do with it," but the rationales (cognitive defusion vs. attention boundaries) differ. |
| Match medium to message (richness) | Escalate to bigger model / use tools | Daft & Lengel's media richness has been cited in CMC literature; no direct LLM-side citation found. |
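For the self-consistency row, the core idea is to sample several answers and keep the most common one. The sketch below is a minimal majority-vote version under stated assumptions: `sample` is a hypothetical callable that returns one sampled answer per call, and answers are compared after simple lowercasing and whitespace stripping, which is cruder than extracting a final answer from a reasoning trace.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(
    sample: Callable[[str], str],
    prompt: str,
    n: int = 5,
) -> tuple[str, float]:
    """Sample n answers and return the majority answer with its vote share."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Normalize lightly so trivially different spellings count as one answer.
    answers = [sample(prompt).strip().lower() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```

A low vote share is a useful signal in its own right: it marks prompts where the model's answer is unstable and a human re-read may be warranted.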

5. Orphans (No Found Counterpart Either Direction)

Human-side, no LLM analog found

  • Mehrabian "55/38/7" debunk. Specific to humans + paralinguistic cues; no parallel claim in LLM literature.
  • Emoji as partial tone fix (Riordan 2017). Emoji-in-prompt research exists but treats emoji as tokens, not as a tone-channel substitute. The analogy is shallow.
  • The minimal operating checklist (§3 of human doc). Some items map (clarifying question, perspective-taking); the rest (pause, pulse check) have no plausible model analog.

LLM-side, no human analog found

  • Quantization effects (Q3/Q4/Q5/Q8 trade-offs). Uniquely a numerical-precision phenomenon. The closest human analog would be fatigue / cognitive load reducing reasoning accuracy, but no source draws this link, and the dose-response curves are different shapes.
  • Dense vs. MoE architecture (Shen et al. 2024). Routing-based specialization has no plausible human analog at the level the paper studies.
  • Parameter count and bimodal emergence (Distributional Scaling Laws). Reflects training stochasticity; humans don't "scale" in a comparable way.
  • Role confusion / CoT Forgery (style → authority). A human parallel exists (uniforms, jargon, Milgram-style obedience to apparent authority), but I found no paper that draws the explicit LLM↔human bridge for stylistic-spoofing attacks. Worth flagging as a likely-but-unwritten connection.
  • Default-to-action vs. default-to-clarify as a prompt knob. This is a property of model alignment dials, not of human cognition. The human side has trait-level analogs (conscientiousness, impulsivity) but they're not knobs.

6. Additional Findings Worth Carrying Forward

Two items surfaced during this synthesis that didn't fit cleanly into either prior doc but are relevant to anyone using the previous two.

6.1 The bias-inheritance chain has two channels, not one

Mina et al. [1] and Hartvigsen-line work [9] together imply a useful mental model: human biases reach LLMs through two distinct channels that need different mitigations.

  1. Pretraining-corpus channel. Cognitive and sentiment biases that exist in the source text (e.g., common-token, majority-class, identity-term sentiment). Mitigated at the data / training-objective level (e.g., AURA-QA's emotional regularization [12]).
  2. Preference-label channel. Biases in human judgments that drive RLHF — most prominently sycophancy [7]. Mitigated at the reward-model / alignment level (SAPA [8]).

A prompt-time mitigation only addresses the symptom. This explains why "be specific" reliably helps but "tell the model not to be sycophantic" helps less than expected — only the former is in the model's in-context-learnable repertoire.

6.2 RLHF amplifies serial-position effects

Tjuatja et al. (2023) [14], cited in Wang et al. [4], find that RLHF increases serial position effects relative to base models. This is consistent with the broader pattern that alignment training, while making models more useful, also makes them more reliably human-like in their failure modes, including ones we'd rather not import.

Practical takeaway: if you have a choice between a base/lightly-tuned local model and a heavily-RLHF'd one for tasks where positional fairness matters (e.g., ranking, multiple-choice evaluation), the base model may show less of the human-analog bias.
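When switching models is not an option, positional sensitivity in multiple-choice evaluation can at least be measured and averaged out at prompt time. The sketch below is my own illustration, not a method from Tjuatja et al. or Wang et al. [4]: it shows the options in every cyclic rotation and tallies which option wins. `choose` is a hypothetical callable that takes the ordered options and returns the index the model picked.

```python
from collections import Counter
from typing import Callable, Sequence


def position_balanced_choice(
    choose: Callable[[Sequence[str]], int],
    options: Sequence[str],
) -> tuple[str, dict[str, int]]:
    """Query once per cyclic rotation and return (most-picked option, votes)."""
    if not options:
        raise ValueError("options must not be empty")
    votes: Counter[str] = Counter()
    for shift in range(len(options)):
        # Each option appears in each position exactly once across rotations.
        rotated = list(options[shift:]) + list(options[:shift])
        picked_index = choose(rotated)
        votes[rotated[picked_index]] += 1
    winner = votes.most_common(1)[0][0]
    return winner, dict(votes)
```

A chooser that always picks the first slot produces a flat tally (one vote per option), which makes a purely positional preference visible instead of letting it pass as a real answer.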


7. Sources

  1. Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L., & Gonzalez-Agirre, A. (2024). Cognitive biases in large language models: A survey and mitigation experiments. COLING 2025. https://aclanthology.org/2025.coling-main.120v1.pdf
  2. Asch, S. E. (1946). Forming impressions of personality. Journal of Abnormal and Social Psychology, 41(3), 258–290. (Primacy effect in impression formation.)
  3. Baddeley, A. D., & Hitch, G. J. (1993). The recency effect: Implicit learning with explicit retrieval? Memory & Cognition, 21(2), 146–155.
  4. Wang, X., et al. (2024/2025). Serial Position Effects of Large Language Models. ACL Findings 2025. arXiv:2406.15981. (Explicitly tests human primacy/recency analogs in LLMs.)
  5. Bilan, J., et al. (2025). Positional Biases Shift as Inputs Approach Context Window Limits. arXiv:2508.07479. (LiM is strongest up to ~50% of context window; beyond that, distance-to-end dominates.)
  6. Can Chatbots Be Authentic? The ELIZA Effect Revisited. Cambridge University Press essay collection (2024). (Hyperpersonal / anthropomorphism lineage from ELIZA to modern LLMs.)
  7. Sharma, M., et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR 2024. arXiv:2310.13548.
  8. Park, J., et al. (2025). Self-Augmented Preference Alignment for Sycophancy Reduction in LLMs. EMNLP 2025.
  9. Khandelwal, A., et al. (2024). Scaling and sentiment bias propagation from pretraining corpora into downstream models. arXiv preprint. (CC-100 vs. Wikipedia sentiment toward identity groups; propagation to fine-tuned toxicity classifiers.)
  10. Wilf, A., et al. (2023). Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities. arXiv:2311.10227. (SimToM — explicit operationalization of Galinsky-style perspective-taking for LLMs.)
  11. Kuhn, L., Gal, Y., & Farquhar, S. (2022/2023). CLAM: Selective Clarification for Ambiguous Questions with Large Language Models. arXiv:2212.07769.
  12. Reichman, B., et al. (2026). AURA-QA: An emotionally balanced QA dataset and emotional regularization framework. arXiv:2603.09205.
  13. Ji, Z., et al. (2023). Towards Mitigating Hallucination in Large Language Models via Self-Reflection. arXiv:2310.06271.
  14. Tjuatja, L., et al. (2023). RLHF amplifies prompt-position sensitivity in language models. Cited in [4]. (Original arXiv preprint; full reference in [4]'s bibliography.)
  15. Mak, Y. C. (2025). Lost in the middle, or just lost? Evaluating LLMs on information retrieval with long input contexts. https://ycmak.net/how-lost-in-the-middle/ (Argues the U-shape is partly an artifact of positional-embedding decay producing monotonic drop at very long contexts. Not peer-reviewed; data and methodology are public.)
  16. nostalgebraist (2023). OpenAI API base models are not sycophantic, at any size. LessWrong. https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size (Replication-style analysis disconfirming the strongest reading of Perez et al. 2022 for OpenAI base models.)
  17. Schulhoff, S. et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv:2406.06608. (PRISMA review of 1,565 papers; foundational survey used as cross-check on prompt-engineering claims in the companion LLM doc.)
  18. Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1364/ (Persona prompts often ineffective; up to ~30pp drops from irrelevant persona details.)