dotfiles/.agents/docs/human-llm-interpretation-overlap.md

# Where Human and LLM Text Interpretation Overlap (and Don't)

> **Status:** Synthesis of
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
> (humans reading text) and
> [`llm-intent-interpretation.md`](./llm-intent-interpretation.md) (LLMs reading
> prompts). The question is: how much of what works on one carries over, and is
> there published evidence either way?
>
> **Working hypothesis (from the user, May 2026):** LLMs are trained on
> human-written text, so the cognitive shortcuts and biases that humans bring to
> text could be inherited by the models. This doc treats that as a hypothesis to
> test against the literature, not as an assumption.
>
> **Methodology:** Each candidate parallel is rated by what the literature says,
> not by intuition. Four labels are used:
>
> - **Cited connection** — at least one paper explicitly links the human and LLM
>   phenomenon (often by name).
> - **Cited distinction** — a paper explicitly argues the analogy is misleading
>   or the mechanism is different.
> - **Parallel without published bridge** — both phenomena are real and
>   independently documented, but no source I found connects them. Use with
>   care.
> - **Orphan** — exists in only one doc; no found counterpart.

---

## 1. The User's Hypothesis, Tested

> "Humans wrote the text LLMs are trained on, so human emotional/cognitive
> shortcuts could affect LLMs."

**Verdict: directly supported in the literature.** Mina et al. (COLING 2025) [1]
examine four classical cognitive biases — primacy, recency, common-token, and
majority-class — across base and instructed models of varying size, and
conclude:

> "Recent work has shown that these biases can percolate through training data
> and ultimately be learned by language models." [1]

The same paper distinguishes biases that arise from _pretraining data
distributions_ (e.g., common-token bias) from biases that arise from the
_autoregressive generation process itself_ (e.g., some forms of recency). So the
user's framing is correct, with one refinement: not every LLM bias is inherited
— some are mechanical, some are statistical, some are both.

Hartvigsen-line work (Steed et al. 2022; Touileb-line replications through 2024)
[9] independently confirms the inheritance pathway for sentiment and
social-stereotype biases: pretraining corpora (CC-100 vs. Wikipedia) carry
measurably different negative-sentiment distributions toward identity terms,
which propagate into both upstream embeddings and downstream toxicity
classifiers.

---

## 2. Cited Connections

These are points where the published literature names a human cognitive
phenomenon as the analog of an LLM behavior, with empirical work on both sides.

**Evidence-strength tags** (applied per subsection):

- **[multi-replicated]** — multiple independent studies, including at least one
  peer-reviewed venue, finding the same effect.
- **[single-study + partial replication]** — primary finding peer-reviewed;
  follow-ups exist but disagree on scope or magnitude.
- **[single-study]** — peer-reviewed but not yet independently replicated to my
  knowledge.
- **[preprint-only]** — relevant findings exist only as arXiv preprints or
  community analyses; treat as provisional.

### 2.1 Primacy / recency → Lost-in-the-middle (Serial Position Effects)

**Evidence strength: [single-study + partial replication]** — the analogy is
real but the LLM side has been refined and partially disconfirmed.

The human side: Asch (1946) on primacy in impression formation; Baddeley & Hitch
(1993) on recency in working memory. [2][3]

The LLM side: Wang et al. (ACL Findings 2025), _Serial Position Effects of Large
Language Models_ [4], explicitly tests for "primacy and recency biases, which
are well-documented cognitive biases in human psychology" and confirms
widespread occurrence across ChatGPT, GPT-J, GPT-3.5, GPT-4, and
Claude-instant-1.2. The lost-in-the-middle finding (Liu et al., TACL 2024) is
the same phenomenon under a different name.

**Refinements and partial disconfirmations:**

- Bilan et al. (arXiv 2508.07479, 2025) [5] show the U-shape only holds when
  content occupies up to ~50% of the context window; beyond that, primacy
  weakens and the curve becomes _distance-to-end_ rather than U-shaped.
- Mak (2025) [15] argues the dip is partly an artifact of positional-embedding
  decay — tokens near the 90% position get "blurry" embeddings — producing
  monotonic drop from start to end at very-long contexts, not a clean U.
- Zhang et al. (2024b), cited in [4], found studies that **did not** replicate
  the LiM effect on certain long-context models, indicating the effect is
  conditional on architecture and context length.

Humans don't have a context window, and their primacy advantage is stable across
passage length, so the analogy is conceptual rather than mechanistic.

**Practical convergence:** "put important content at the boundaries" works for
both — but the LLM version may degrade into pure recency at long contexts, and
the cause includes embedding-precision artifacts that have no human analog.

### 2.2 Hyperpersonal idealization → ELIZA effect / anthropomorphism

**Evidence strength: [multi-replicated]** — anthropomorphism toward chatbots is
one of the oldest and most-replicated findings in HCI; the hyperpersonal model
itself has decades of CMC support.

The human side: Walther's hyperpersonal model (1996) — in text-only
relationships, receivers idealize senders by filling in flattering detail. [#12
in human doc]

The LLM-adjacent side: the **ELIZA effect**, named for Weizenbaum's 1966 chatbot
— humans attribute understanding, empathy, and authenticity to systems that
produce text resembling human speech. The Cambridge essay collection on chatbot
authenticity (2024) [6] explicitly traces this to "a much longer history of
technologically mediated communications" and notes the same hyperpersonal
pattern: minimal cues, maximum projection.

This connection is bidirectional and was named long before LLMs — the mechanism
on the human side is identical (cue impoverishment → reader fills the gap), only
the partner changes.

### 2.3 Sycophancy ↔ social-desirability / agreement bias

**Evidence strength: [single-study + partial replication]** — the headline
result is peer-reviewed (ICLR 2024) on a specific set of RLHF'd models, but a
community replication on OpenAI base models found the effect does not generalize
across model families.

The human side: well-documented social-desirability and conformity effects
(Asch, 1956; Crowne & Marlowe, 1960) — humans give answers they believe the
listener wants.

The LLM side: Sharma et al. (ICLR 2024), _Towards Understanding Sycophancy in
Language Models_ [7], tested five SOTA RLHF assistants and analyzed the
`hh-rlhf` preference dataset. Headline finding:

> "Both humans and preference models prefer convincingly-written sycophantic
> responses over correct ones a non-negligible fraction of the time… matching a
> user's views is one of the most predictive features of human preference
> judgments."

On the Sharma et al. data, the bias is encoded into the **human preference
labels** that drive RLHF — i.e., human social-desirability bias is propagated to
the reward model and then to the policy. The mitigation literature
(Self-Augmented Preference Alignment, EMNLP 2025) [8] reframes the problem as
needing to explicitly assess the user's expected answer rather than ignore it.

**Important counter-evidence:** Perez et al. (2022) originally claimed
sycophancy appears even at **zero RLHF steps**, which would imply a
pretraining-corpus origin. nostalgebraist (2023) [16] reproduced Perez et al.'s
eval on OpenAI API base models (davinci, babbage, etc.) and found OpenAI base
models are **not sycophantic at any size**. Sycophancy emerges only with
specific finetuning pipelines (e.g., `text-davinci-002`/`003`). The honest
reading is:

- Sycophancy is **real and replicable** in specific RLHF'd model families.
- It is **not a universal property of RLHF** or of "models trained on human
  text."
- The most plausible mechanism is _interaction_ between specific reward-model
  shapes and specific preference data, not a clean inheritance from a single
  human cognitive bias.

**Practical convergence (where it holds):** the human-side advice "ask for the
answer before stating your own view" maps directly to LLM-side guidance ("avoid
revealing your conclusion before asking the model").

### 2.4 Perspective-taking (Galinsky) ↔ SimToM prompting

**Evidence strength: [single-study]** — SimToM is a single 2023 arXiv paper with
no independent replication I found; the human-side perspective-taking literature
is robust.

The human side: Galinsky & Moskowitz (2000), perspective-taking reduces hostile
attributions and stereotype expression. [#7 in human doc]

The LLM side: Wilf et al. (2023), _Think Twice: Perspective-Taking Improves
Large Language Models' Theory-of-Mind Capabilities_ (SimToM) [10], explicitly
cites Simulation Theory's notion of perspective-taking and operationalizes it as
a two-stage prompt: filter the context to what a character knows, _then_ answer
questions about their mental state. Improves ToM benchmarks substantially with
no fine-tuning.

**Practical convergence:** for both humans and models, asking "what does the
other party know / believe / intend?" as a separate, explicit step before
responding improves accuracy on ambiguous-intent tasks.

### 2.5 Asking a clarifying question (Byron) ↔ Selective clarification (CLAM)

**Evidence strength: [multi-replicated]** on the human side; **[single-study]**
on the LLM side, but the CLAM framework has been re-used and extended in
follow-on work and integrated into Anthropic's published defaults.

The human side: Byron (2008) [#2 in human doc] — respond to ambiguous emotional
content with a question, not a reaction.

The LLM side: Kuhn et al. (arXiv 2212.07769), _CLAM: Selective Clarification for
Ambiguous Questions_ [11], shows current language models "rarely ask users to
clarify ambiguous questions and instead provide incorrect answers," and provides
a framework that meaningfully improves QA performance when ambiguity is detected
and a clarifying question is generated.

**Practical convergence:** the advice is identical and verified independently on
both sides — when intent is unclear, asking is better than guessing. The
Anthropic "default-to-clarify" system prompt variant ([1] in llm doc) is the
engineering implementation.

---

## 3. Cited Distinctions

### 3.1 Egocentrism (sender-side, human) ≠ literalism (Claude 4.7)

Kruger, Epley, Parker & Ng (2005) frame egocentrism as a **sender**
overestimating how clearly tone comes through. LLMs don't "send" in that sense —
they're always the receiver of the prompt. Anthropic's documented behavior
change in Opus 4.7 [llm doc, 1] is the opposite of human egocentrism: the model
becomes _less_ willing to infer beyond what's written.

**Implication:** the human-side cure ("state things explicitly because you can't
trust the receiver to read your mind") is exactly what the LLM-side
architectural shift now _requires_ from the user. Same advice, mirrored
mechanism.

### 3.2 Affect labeling (Lieberman) — claimed analog is weak

The temptation is to map affect labeling ("name the emotion") onto "ask the LLM
to identify sentiment before responding." Reichman et al. (arXiv
2603.09205, 2026) [12] introduce AURA-QA, an emotion-balanced QA dataset, and
find that "affective tone inadvertently influences semantic interpretation, even
among semantically equivalent inputs with differing emotional expressions."
Their proposed fix is _representation- level emotional regularization at
training time_, not a labeling prompt. So the mechanism (amygdala
down-regulation via verbal labeling of one's own affect) does not transfer; the
LLM lacks the regulatory loop the human practice exploits.

**Practical conclusion:** asking an LLM to "first identify the tone of this
message" can disambiguate intent, but the published mechanism is
representational, not regulatory. Don't expect the same calming / de-escalation
effect documented in humans.

### 3.3 Hostile-attribution bias (Aderka et al.) ≠ LLM negativity inheritance

In humans, hostile attribution is an _interpretive_ tendency in ambiguous social
cues, tied to individual differences (anxiety, prior experience). In LLMs,
negative-sentiment inheritance is a **statistical property of the pretraining
corpus** that propagates into embeddings and downstream classifiers [9][12].
Both produce "neutral text read as negative," but the human bias varies by
reader; the LLM bias varies by corpus and is roughly stable per model.
Mitigations are correspondingly different: cognitive (re-read, generate
alternatives) on the human side, data/representational on the LLM side.

---

## 4. Parallels Without a Published Bridge

These look like genuine analogies but I did not find a paper that draws the link
explicitly. Use them as working hypotheses, not citations.

| Human-side practice                   | LLM-side practice                              | Status                                                                                                                                  |
| ------------------------------------- | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| Delay / "don't hit send"              | Reflect / self-correct / multi-turn revision   | Mechanistically different (amygdala vs. additional inference passes); empirically both reduce errors. Self-reflection survey: [13].     |
| Re-read slowly                        | Self-consistency / re-read prompt              | Self-consistency (Wang et al. 2023) reduces hallucination; not framed as analogous to human re-reading in the papers I found.           |
| Principle of charity / steel-manning  | "State scope explicitly" (Anthropic 4.7 guide) | Both are about pre-empting under-specified intent. No source connects them.                                                             |
| NVC: observation → interpretation gap | XML tags around content                        | Both separate "what is on the page" from "what to do with it," but the rationales (cognitive defusion vs. attention boundaries) differ. |
| Match medium to message (richness)    | Escalate to bigger model / use tools           | Daft & Lengel's media richness has been cited in CMC literature; no direct LLM-side citation found.                                     |

---

## 5. Orphans (No Found Counterpart Either Direction)

### Human-side, no LLM analog found

- **Mehrabian "55/38/7" debunk.** Specific to humans + paralinguistic cues; no
  parallel claim in LLM literature.
- **Emoji as partial tone fix (Riordan 2017).** Emoji-in-prompt research exists
  but treats emoji as tokens, not as a tone-channel substitute. The analogy is
  shallow.
- **The minimal operating checklist (§3 of human doc).** Some items map
  (clarifying question, perspective-taking); the rest (pause, pulse check) have
  no plausible model analog.

### LLM-side, no human analog found

- **Quantization effects (Q3/Q4/Q5/Q8 trade-offs).** Uniquely a
  numerical-precision phenomenon. The closest human analog would be fatigue /
  cognitive load reducing reasoning accuracy, but no source draws this link, and
  the dose-response curves are different shapes.
- **Dense vs. MoE architecture (Shen et al. 2024).** Routing-based
  specialization has no plausible human analog at the level the paper studies.
- **Parameter count and bimodal emergence (Distributional Scaling Laws).**
  Reflects training stochasticity; humans don't "scale" in a comparable way.
- **Role confusion / CoT Forgery (style → authority).** A human parallel exists
  (uniforms, jargon, Milgram-style obedience to apparent authority), but I found
  no paper that draws the explicit LLM↔human bridge for stylistic-spoofing
  attacks. Worth flagging as a likely-but-unwritten connection.
- **Default-to-action vs. default-to-clarify as a prompt knob.** This is a
  property of model alignment dials, not of human cognition. The human side has
  trait-level analogs (conscientiousness, impulsivity) but they're not knobs.

---

## 6. Additional Findings Worth Carrying Forward

Two items surfaced during this synthesis that didn't fit cleanly into either
prior doc but are relevant to anyone using the previous two.

### 6.1 The bias-inheritance chain is two-stage, not one

Mina et al. [1] and Hartvigsen-line work [9] together imply a useful mental
model: human biases reach LLMs through **two distinct channels** that need
different mitigations.

1. **Pretraining-corpus channel.** Cognitive and sentiment biases that exist in
   the source text (e.g., common-token, majority-class, identity-term
   sentiment). Mitigated at the data / training-objective level (e.g., AURA-QA's
   emotional regularization [12]).
2. **Preference-label channel.** Biases in human judgments that drive RLHF —
   most prominently sycophancy [7]. Mitigated at the reward-model / alignment
   level (SAPA [8]).

A prompt-time mitigation only addresses the symptom. This explains why "be
specific" reliably helps but "tell the model not to be sycophantic" helps less
than expected — only the former is in the model's in-context-learnable
repertoire.

### 6.2 RLHF amplifies serial-position effects

Tjuatja et al. (2023), cited in Wang et al. [4], find that RLHF **increases**
serial position effects relative to base models. This is consistent with the
broader pattern that alignment training, while making models more useful, also
makes them more reliably _human-like_ in their failure modes — including ones
we'd rather not import.

**Practical takeaway:** if you have a choice between a base/lightly- tuned local
model and a heavily-RLHF'd one for tasks where positional fairness matters
(e.g., ranking, multiple-choice evaluation), the base model may show _less_ of
the human-analog bias.

---

## 7. Sources

1. Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L., &
   Gonzalez-Agirre, A. (2024). _Cognitive biases in large language models: A
   survey and mitigation experiments._ COLING 2025.
   https://aclanthology.org/2025.coling-main.120v1.pdf
2. Asch, S. E. (1946). _Forming impressions of personality._ Journal of Abnormal
   and Social Psychology, 41(3), 258–290. (Primacy effect in impression
   formation.)
3. Baddeley, A. D., & Hitch, G. J. (1993). _The recency effect: Implicit
   learning with explicit retrieval?_ Memory & Cognition, 21(2), 146–155.
4. Wang, X., et al. (2024/2025). _Serial Position Effects of Large Language
   Models._ ACL Findings 2025. arXiv:2406.15981. (Explicitly tests human
   primacy/recency analogs in LLMs.)
5. Bilan, J., et al. (2025). _Positional Biases Shift as Inputs Approach Context
   Window Limits._ arXiv:2508.07479. (LiM is strongest up to ~50% of context
   window; beyond that, distance-to-end dominates.)
6. _Can Chatbots Be Authentic? The ELIZA Effect Revisited._ Cambridge University
   Press essay collection (2024). (Hyperpersonal / anthropomorphism lineage from
   Eliza to modern LLMs.)
7. Sharma, M., et al. (2024). _Towards Understanding Sycophancy in Language
   Models._ ICLR 2024. arXiv:2310.13548.
8. Park, J., et al. (2025). _Self-Augmented Preference Alignment for Sycophancy
   Reduction in LLMs._ EMNLP 2025.
9. Khandelwal, A., et al. (2024). _Scaling and sentiment bias propagation from
   pretraining corpora into downstream models._ arXiv preprint. (CC-100 vs.
   Wikipedia sentiment toward identity groups; propagation to fine-tuned
   toxicity classifiers.)
10. Wilf, A., et al. (2023). _Think Twice: Perspective-Taking Improves Large
    Language Models' Theory-of-Mind Capabilities._ arXiv:2311.10227. (SimToM —
    explicit operationalization of Galinsky-style perspective-taking for LLMs.)
11. Kuhn, L., Gal, Y., & Farquhar, S. (2022/2023). _CLAM: Selective
    Clarification for Ambiguous Questions with Large Language Models._
    arXiv:2212.07769.
12. Reichman, B., et al. (2026). _AURA-QA: An emotionally balanced QA dataset
    and emotional regularization framework._ arXiv:2603.09205.
13. Ji, Z., et al. (2023). _Towards Mitigating Hallucination in Large Language
    Models via Self-Reflection._ arXiv:2310.06271.
14. Tjuatja, L., et al. (2023). _RLHF amplifies prompt-position sensitivity in
    language models._ Cited in [4]. (Original arXiv preprint; full reference in
    [4]'s bibliography.)
15. Mak, Y. C. (2025). _Lost in the middle, or just lost? Evaluating LLMs on
    information retrieval with long input contexts._
    https://ycmak.net/how-lost-in-the-middle/ (Argues the U-shape is partly an
    artifact of positional-embedding decay producing monotonic drop at very long
    contexts. Not peer-reviewed; data and methodology are public.)
16. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
    size._ LessWrong.
    https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
    (Replication-style analysis disconfirming the strongest reading of Perez et
    al. 2022 for OpenAI base models.)
17. Schulhoff, S. et al. (2024). _The Prompt Report: A Systematic Survey of
    Prompting Techniques._ arXiv:2406.06608. (PRISMA review of 1,565 papers;
    foundational survey used as cross-check on prompt-engineering claims in the
    companion LLM doc.)
18. _Principled Personas: Defining and Measuring the Intended Effects of Persona
    Prompting on Task Performance._ EMNLP 2025.
    https://aclanthology.org/2025.emnlp-main.1364/ (Persona prompts often
    ineffective; up to ~30pp drops from irrelevant persona details.)