# Where Human and LLM Text Interpretation Overlap (and Don't)

> **Status:** Synthesis of
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
> (humans reading text) and
> [`llm-intent-interpretation.md`](./llm-intent-interpretation.md) (LLMs reading
> prompts). The question is: how much of what works on one carries over, and is
> there published evidence either way?
>
> **Working hypothesis (from the user, May 2026):** LLMs are trained on
> human-written text, so the cognitive shortcuts and biases that humans bring to
> text could be inherited by the models. This doc treats that as a hypothesis to
> test against the literature, not as an assumption.
>
> **Methodology:** Each candidate parallel is rated by what the literature says,
> not by intuition. Four labels are used:
>
> - **Cited connection** — at least one paper explicitly links the human and LLM
>   phenomenon (often by name).
> - **Cited distinction** — a paper explicitly argues the analogy is misleading
>   or the mechanism is different.
> - **Parallel without published bridge** — both phenomena are real and
>   independently documented, but no source I found connects them. Use with
>   care.
> - **Orphan** — exists in only one doc; no found counterpart.

---

## 1. The User's Hypothesis, Tested

> "Humans wrote the text LLMs are trained on, so human emotional/cognitive
> shortcuts could affect LLMs."

**Verdict: directly supported in the literature.** Mina et al. (COLING 2025) [1]
examine four classical cognitive biases — primacy, recency, common-token, and
majority-class — across base and instructed models of varying size, and
conclude:

> "Recent work has shown that these biases can percolate through training data
> and ultimately be learned by language models." [1]

The same paper distinguishes biases that arise from _pretraining data
distributions_ (e.g., common-token bias) from biases that arise from the
_autoregressive generation process itself_ (e.g., some forms of recency). So the
user's framing is correct, with one refinement: not every LLM bias is inherited
— some are mechanical, some are statistical, some are both.

Hartvigsen-line work (Steed et al. 2022; Touileb-line replications through 2024)
[9] independently confirms the inheritance pathway for sentiment and
social-stereotype biases: pretraining corpora (CC-100 vs. Wikipedia) carry
measurably different negative-sentiment distributions toward identity terms,
which propagate into both upstream embeddings and downstream toxicity
classifiers.

---

## 2. Cited Connections

These are points where the published literature names a human cognitive
phenomenon as the analog of an LLM behavior, with empirical work on both sides.

**Evidence-strength tags** (applied per subsection):

- **[multi-replicated]** — multiple independent studies, including at least one
  peer-reviewed venue, finding the same effect.
- **[single-study + partial replication]** — primary finding peer-reviewed;
  follow-ups exist but disagree on scope or magnitude.
- **[single-study]** — peer-reviewed but not yet independently replicated to my
  knowledge.
- **[preprint-only]** — relevant findings exist only as arXiv preprints or
  community analyses; treat as provisional.

### 2.1 Primacy / recency → Lost-in-the-middle (Serial Position Effects)

**Evidence strength: [single-study + partial replication]** — the analogy is
real but the LLM side has been refined and partially disconfirmed.

The human side: Asch (1946) on primacy in impression formation; Baddeley & Hitch
(1993) on recency in working memory. [2][3]

The LLM side: Wang et al. (ACL Findings 2025), _Serial Position Effects of Large
Language Models_ [4], explicitly tests for "primacy and recency biases, which
are well-documented cognitive biases in human psychology" and confirms
widespread occurrence across ChatGPT, GPT-J, GPT-3.5, GPT-4, and
Claude-instant-1.2. The lost-in-the-middle (LiM) finding (Liu et al., TACL 2024)
is the same phenomenon under a different name.

**Refinements and partial disconfirmations:**

- Bilan et al. (arXiv 2508.07479, 2025) [5] show the U-shape only holds when
  content occupies up to ~50% of the context window; beyond that, primacy
  weakens and the curve becomes _distance-to-end_ rather than U-shaped.
- Mak (2025) [15] argues the dip is partly an artifact of positional-embedding
  decay — tokens near the 90% position get "blurry" embeddings — producing a
  monotonic drop from start to end at very long contexts, not a clean U.
- Zhang et al. (2024b), cited in [4], found studies that **did not** replicate
  the LiM effect on certain long-context models, indicating the effect is
  conditional on architecture and context length.

Humans don't have a context window, and their primacy advantage is stable across
passage length, so the analogy is conceptual rather than mechanistic.

**Practical convergence:** "put important content at the boundaries" works for
both — but the LLM version may degrade into pure recency at long contexts, and
the cause includes embedding-precision artifacts that have no human analog.
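
The boundary-placement advice can be sketched in Python, assuming each context chunk already carries a priority score. The function name `order_for_boundaries` and the `(priority, text)` pair format are hypothetical, chosen only for illustration; none of the cited papers prescribes this procedure.

```python
def order_for_boundaries(chunks):
    """Order chunks so importance is highest at both ends of the context.

    `chunks` is a list of (priority, text) pairs; a higher priority means the
    chunk matters more. This is an illustrative heuristic, not a published
    algorithm.
    """
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, (_, text) in enumerate(ranked):
        # Alternate: most important to the front, next to the back, and so on,
        # so the least important chunks end up in the middle.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]


chunks = [
    (3, "task instructions"),
    (1, "background"),
    (2, "key constraint"),
    (0, "boilerplate"),
]
ordered = order_for_boundaries(chunks)
# ordered == ["task instructions", "background", "boilerplate", "key constraint"]
```

Given the long-context caveat above, a caller working near the context limit might prefer to push everything important toward the end instead; that variant is not shown here.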

### 2.2 Hyperpersonal idealization → ELIZA effect / anthropomorphism

**Evidence strength: [multi-replicated]** — anthropomorphism toward chatbots is
one of the oldest and most-replicated findings in HCI; the hyperpersonal model
itself has decades of CMC support.

The human side: Walther's hyperpersonal model (1996) — in text-only
relationships, receivers idealize senders by filling in flattering detail. [#12
in human doc]

The LLM-adjacent side: the **ELIZA effect**, named for Weizenbaum's 1966 chatbot
— humans attribute understanding, empathy, and authenticity to systems that
produce text resembling human speech. The Cambridge essay collection on chatbot
authenticity (2024) [6] explicitly traces this to "a much longer history of
technologically mediated communications" and notes the same hyperpersonal
pattern: minimal cues, maximum projection.

This connection is bidirectional and was named long before LLMs — the mechanism
on the human side is identical (cue impoverishment → reader fills the gap), only
the partner changes.

### 2.3 Sycophancy ↔ social-desirability / agreement bias

**Evidence strength: [single-study + partial replication]** — the headline
result is peer-reviewed (ICLR 2024) on a specific set of RLHF'd models, but a
community replication on OpenAI base models found the effect does not generalize
across model families.

The human side: well-documented social-desirability and conformity effects
(Asch, 1956; Crowne & Marlowe, 1960) — humans give answers they believe the
listener wants.

The LLM side: Sharma et al. (ICLR 2024), _Towards Understanding Sycophancy in
Language Models_ [7], tested five SOTA RLHF assistants and analyzed the
`hh-rlhf` preference dataset. Headline finding:

> "Both humans and preference models prefer convincingly-written sycophantic
> responses over correct ones a non-negligible fraction of the time… matching a
> user's views is one of the most predictive features of human preference
> judgments."

On the Sharma et al. data, the bias is encoded into the **human preference
labels** that drive RLHF — i.e., human social-desirability bias is propagated to
the reward model and then to the policy. The mitigation literature
(Self-Augmented Preference Alignment, EMNLP 2025) [8] reframes the problem as
needing to explicitly assess the user's expected answer rather than ignore it.

**Important counter-evidence:** Perez et al. (2022) originally claimed
sycophancy appears even at **zero RLHF steps**, which would imply a
pretraining-corpus origin. nostalgebraist (2023) [16] reproduced Perez et al.'s
eval on OpenAI API base models (davinci, babbage, etc.) and found OpenAI base
models are **not sycophantic at any size**. Sycophancy emerges only with
specific finetuning pipelines (e.g., `text-davinci-002`/`003`). The honest
reading is:

- Sycophancy is **real and replicable** in specific RLHF'd model families.
- It is **not a universal property of RLHF** or of "models trained on human
  text."
- The most plausible mechanism is _interaction_ between specific reward-model
  shapes and specific preference data, not a clean inheritance from a single
  human cognitive bias.

**Practical convergence (where it holds):** the human-side advice "ask for the
answer before stating your own view" maps directly to LLM-side guidance ("avoid
revealing your conclusion before asking the model").

### 2.4 Perspective-taking (Galinsky) ↔ SimToM prompting

**Evidence strength: [single-study]** — SimToM is a single 2023 arXiv paper with
no independent replication I found; the human-side perspective-taking literature
is robust.

The human side: Galinsky & Moskowitz (2000), perspective-taking reduces hostile
attributions and stereotype expression. [#7 in human doc]

The LLM side: Wilf et al. (2023), _Think Twice: Perspective-Taking Improves
Large Language Models' Theory-of-Mind Capabilities_ (SimToM) [10], explicitly
cites Simulation Theory's notion of perspective-taking and operationalizes it as
a two-stage prompt: filter the context to what a character knows, _then_ answer
questions about their mental state. Improves ToM benchmarks substantially with
no fine-tuning.

**Practical convergence:** for both humans and models, asking "what does the
other party know / believe / intend?" as a separate, explicit step before
responding improves accuracy on ambiguous-intent tasks.
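
The two-stage structure described above can be sketched as follows. This is an illustrative approximation, not the prompts used by Wilf et al. [10]: `complete` is a hypothetical callable that sends a prompt to whatever model is in use and returns its text, and the prompt wording is an assumption.

```python
from typing import Callable


def simtom_answer(
    complete: Callable[[str], str],  # hypothetical model call: prompt -> text
    story: str,
    character: str,
    question: str,
) -> str:
    """Two-stage perspective-taking prompt (sketch, wording is assumed)."""
    # Stage 1: perspective-taking. Reduce the story to what the character knows.
    filter_prompt = (
        f"Story:\n{story}\n\n"
        f"List only the events that {character} knows about."
    )
    known_events = complete(filter_prompt)

    # Stage 2: answer using only the filtered context, not the full story.
    answer_prompt = (
        f"{known_events}\n\n"
        f"Answer from {character}'s point of view: {question}"
    )
    return complete(answer_prompt)
```

The design point is that the second call never sees the unfiltered story, so information the character lacks cannot leak into the answer.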

### 2.5 Asking a clarifying question (Byron) ↔ Selective clarification (CLAM)

**Evidence strength: [multi-replicated]** on the human side; **[single-study]**
on the LLM side, but the CLAM framework has been re-used and extended in
follow-on work and integrated into Anthropic's published defaults.

The human side: Byron (2008) [#2 in human doc] — respond to ambiguous emotional
content with a question, not a reaction.

The LLM side: Kuhn et al. (arXiv 2212.07769), _CLAM: Selective Clarification for
Ambiguous Questions_ [11], shows current language models "rarely ask users to
clarify ambiguous questions and instead provide incorrect answers," and provides
a framework that meaningfully improves QA performance when ambiguity is detected
and a clarifying question is generated.

**Practical convergence:** the advice is identical and verified independently on
both sides — when intent is unclear, asking is better than guessing. The
Anthropic "default-to-clarify" system prompt variant ([1] in llm doc) is the
engineering implementation.
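
A minimal sketch of the "detect ambiguity, then ask" control flow, under stated assumptions: `complete` is a hypothetical prompt-to-text callable, and the YES/NO check and prompt wording are placeholders rather than the classifier or prompts from the CLAM paper [11].

```python
from typing import Callable, Tuple


def answer_or_clarify(
    complete: Callable[[str], str],  # hypothetical model call: prompt -> text
    question: str,
) -> Tuple[str, str]:
    """Return ("clarify", follow_up) if the question looks ambiguous,
    otherwise ("answer", response). Prompt wording is assumed."""
    verdict = complete(
        "Is the following question ambiguous? Reply YES or NO.\n\n" + question
    )
    if verdict.strip().upper().startswith("YES"):
        # Ambiguous: ask the user instead of guessing.
        follow_up = complete(
            "Ask one short clarifying question about:\n\n" + question
        )
        return ("clarify", follow_up)
    # Unambiguous: answer directly, with no extra turn.
    return ("answer", complete(question))
```

The clarification is selective: the extra turn is spent only when the ambiguity check fires, so clear questions are answered as before.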

---

## 3. Cited Distinctions

### 3.1 Egocentrism (sender-side, human) ≠ literalism (Claude 4.7)

Kruger, Epley, Parker & Ng (2005) frame egocentrism as a **sender**
overestimating how clearly tone comes through. LLMs don't "send" in that sense —
they're always the receiver of the prompt. Anthropic's documented behavior
change in Opus 4.7 [llm doc, 1] is the opposite of human egocentrism: the model
becomes _less_ willing to infer beyond what's written.

**Implication:** the human-side cure ("state things explicitly because you can't
trust the receiver to read your mind") is exactly what the LLM-side
architectural shift now _requires_ from the user. Same advice, mirrored
mechanism.

### 3.2 Affect labeling (Lieberman) — claimed analog is weak

The temptation is to map affect labeling ("name the emotion") onto "ask the LLM
to identify sentiment before responding." Reichman et al. (arXiv
2603.09205, 2026) [12] introduce AURA-QA, an emotion-balanced QA dataset, and
find that "affective tone inadvertently influences semantic interpretation, even
among semantically equivalent inputs with differing emotional expressions."
Their proposed fix is _representation-level emotional regularization at
training time_, not a labeling prompt. So the mechanism (amygdala
down-regulation via verbal labeling of one's own affect) does not transfer; the
LLM lacks the regulatory loop the human practice exploits.

**Practical conclusion:** asking an LLM to "first identify the tone of this
message" can disambiguate intent, but the published mechanism is
representational, not regulatory. Don't expect the same calming / de-escalation
effect documented in humans.

### 3.3 Hostile-attribution bias (Aderka et al.) ≠ LLM negativity inheritance

In humans, hostile attribution is an _interpretive_ tendency in ambiguous social
cues, tied to individual differences (anxiety, prior experience). In LLMs,
negative-sentiment inheritance is a **statistical property of the pretraining
corpus** that propagates into embeddings and downstream classifiers [9][12].
Both produce "neutral text read as negative," but the human bias varies by
reader; the LLM bias varies by corpus and is roughly stable per model.
Mitigations are correspondingly different: cognitive (re-read, generate
alternatives) on the human side, data/representational on the LLM side.

---

## 4. Parallels Without a Published Bridge

These look like genuine analogies but I did not find a paper that draws the link
explicitly. Use them as working hypotheses, not citations.

| Human-side practice                   | LLM-side practice                              | Status                                                                                                                                  |
| ------------------------------------- | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| Delay / "don't hit send"              | Reflect / self-correct / multi-turn revision   | Mechanistically different (amygdala vs. additional inference passes); empirically both reduce errors. Self-reflection survey: [13].     |
| Re-read slowly                        | Self-consistency / re-read prompt              | Self-consistency (Wang et al. 2023) reduces hallucination; not framed as analogous to human re-reading in the papers I found.           |
| Principle of charity / steel-manning  | "State scope explicitly" (Anthropic 4.7 guide) | Both are about pre-empting under-specified intent. No source connects them.                                                             |
| NVC: observation → interpretation gap | XML tags around content                        | Both separate "what is on the page" from "what to do with it," but the rationales (cognitive defusion vs. attention boundaries) differ. |
| Match medium to message (richness)    | Escalate to bigger model / use tools           | Daft & Lengel's media richness has been cited in CMC literature; no direct LLM-side citation found.                                     |
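
For readers unfamiliar with the self-consistency row in the table, the core idea is to sample several answers and keep the most common one. The sketch below is a simplified illustration, not the procedure from Wang et al. (2023) verbatim: `sample` is a hypothetical callable that returns one sampled answer per call, and exact string matching stands in for whatever answer-normalization a real setup would need.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[str], str],  # hypothetical: one sampled answer per call
    prompt: str,
    n: int = 5,
) -> Tuple[str, float]:
    """Sample `n` answers and return the majority answer with its vote share."""
    answers = [sample(prompt).strip() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```

The vote share gives a rough agreement signal; a low share suggests the samples disagree and the answer deserves a second look.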

---

## 5. Orphans (No Found Counterpart Either Direction)

### Human-side, no LLM analog found

- **Mehrabian "55/38/7" debunk.** Specific to humans + paralinguistic cues; no
  parallel claim in LLM literature.
- **Emoji as partial tone fix (Riordan 2017).** Emoji-in-prompt research exists
  but treats emoji as tokens, not as a tone-channel substitute. The analogy is
  shallow.
- **The minimal operating checklist (§3 of human doc).** Some items map
  (clarifying question, perspective-taking); the rest (pause, pulse check) have
  no plausible model analog.

### LLM-side, no human analog found

- **Quantization effects (Q3/Q4/Q5/Q8 trade-offs).** Uniquely a
  numerical-precision phenomenon. The closest human analog would be fatigue /
  cognitive load reducing reasoning accuracy, but no source draws this link, and
  the dose-response curves are different shapes.
- **Dense vs. MoE architecture (Shen et al. 2024).** Routing-based
  specialization has no plausible human analog at the level the paper studies.
- **Parameter count and bimodal emergence (Distributional Scaling Laws).**
  Reflects training stochasticity; humans don't "scale" in a comparable way.
- **Role confusion / CoT Forgery (style → authority).** A human parallel exists
  (uniforms, jargon, Milgram-style obedience to apparent authority), but I found
  no paper that draws the explicit LLM↔human bridge for stylistic-spoofing
  attacks. Worth flagging as a likely-but-unwritten connection.
- **Default-to-action vs. default-to-clarify as a prompt knob.** This is a
  property of model alignment dials, not of human cognition. The human side has
  trait-level analogs (conscientiousness, impulsivity) but they're not knobs.

---

## 6. Additional Findings Worth Carrying Forward

Two items surfaced during this synthesis that didn't fit cleanly into either
prior doc but are relevant to anyone using the previous two.

### 6.1 The bias-inheritance chain is two-stage, not one

Mina et al. [1] and Hartvigsen-line work [9] together imply a useful mental
model: human biases reach LLMs through **two distinct channels** that need
different mitigations.

1. **Pretraining-corpus channel.** Cognitive and sentiment biases that exist in
   the source text (e.g., common-token, majority-class, identity-term
   sentiment). Mitigated at the data / training-objective level (e.g., AURA-QA's
   emotional regularization [12]).
2. **Preference-label channel.** Biases in human judgments that drive RLHF —
   most prominently sycophancy [7]. Mitigated at the reward-model / alignment
   level (SAPA [8]).

A prompt-time mitigation only addresses the symptom. This explains why "be
specific" reliably helps but "tell the model not to be sycophantic" helps less
than expected — only the former is in the model's in-context-learnable
repertoire.

### 6.2 RLHF amplifies serial-position effects

Tjuatja et al. (2023) [14], cited in Wang et al. [4], find that RLHF
**increases** serial position effects relative to base models. This is
consistent with the broader pattern that alignment training, while making models
more useful, also makes them more reliably _human-like_ in their failure modes —
including ones we'd rather not import.

**Practical takeaway:** if you have a choice between a base/lightly-tuned local
model and a heavily-RLHF'd one for tasks where positional fairness matters
(e.g., ranking, multiple-choice evaluation), the base model may show _less_ of
the human-analog bias.
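
When switching models is not an option, one common way to reduce the impact of option order in multiple-choice evaluation is to shuffle the options across several trials and vote on the option text rather than its position. The sketch below is that generic idea, not a method taken from the cited papers; `choose` is a hypothetical callable that returns the index of the option the model picked.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple


def position_debiased_choice(
    choose: Callable[[str, List[str]], int],  # hypothetical: returns picked index
    question: str,
    options: List[str],
    trials: int = 6,
    seed: int = 0,
) -> Tuple[str, Dict[str, int]]:
    """Ask the same question with shuffled option orders and majority-vote."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    votes: Counter = Counter()
    for _ in range(trials):
        shuffled = options[:]
        rng.shuffle(shuffled)
        picked = choose(question, shuffled)
        # Count the option's text, so its position in this trial is irrelevant.
        votes[shuffled[picked]] += 1
    return votes.most_common(1)[0][0], dict(votes)
```

A chooser that tracks content wins every trial regardless of order, while a chooser that always picks the first slot has its votes spread across options, which makes the positional bias visible in the vote counts.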

---

## 7. Sources

1. Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L., &
   Gonzalez-Agirre, A. (2024). _Cognitive biases in large language models: A
   survey and mitigation experiments._ COLING 2025.
   https://aclanthology.org/2025.coling-main.120v1.pdf
2. Asch, S. E. (1946). _Forming impressions of personality._ Journal of Abnormal
   and Social Psychology, 41(3), 258–290. (Primacy effect in impression
   formation.)
3. Baddeley, A. D., & Hitch, G. J. (1993). _The recency effect: Implicit
   learning with explicit retrieval?_ Memory & Cognition, 21(2), 146–155.
4. Wang, X., et al. (2024/2025). _Serial Position Effects of Large Language
   Models._ ACL Findings 2025. arXiv:2406.15981. (Explicitly tests human
   primacy/recency analogs in LLMs.)
5. Bilan, J., et al. (2025). _Positional Biases Shift as Inputs Approach Context
   Window Limits._ arXiv:2508.07479. (LiM is strongest up to ~50% of context
   window; beyond that, distance-to-end dominates.)
6. _Can Chatbots Be Authentic? The ELIZA Effect Revisited._ Cambridge University
   Press essay collection (2024). (Hyperpersonal / anthropomorphism lineage from
   Eliza to modern LLMs.)
7. Sharma, M., et al. (2024). _Towards Understanding Sycophancy in Language
   Models._ ICLR 2024. arXiv:2310.13548.
8. Park, J., et al. (2025). _Self-Augmented Preference Alignment for Sycophancy
   Reduction in LLMs._ EMNLP 2025.
9. Khandelwal, A., et al. (2024). _Scaling and sentiment bias propagation from
   pretraining corpora into downstream models._ arXiv preprint. (CC-100 vs.
   Wikipedia sentiment toward identity groups; propagation to fine-tuned
   toxicity classifiers.)
10. Wilf, A., et al. (2023). _Think Twice: Perspective-Taking Improves Large
    Language Models' Theory-of-Mind Capabilities._ arXiv:2311.10227. (SimToM —
    explicit operationalization of Galinsky-style perspective-taking for LLMs.)
11. Kuhn, L., Gal, Y., & Farquhar, S. (2022/2023). _CLAM: Selective
    Clarification for Ambiguous Questions with Large Language Models._
    arXiv:2212.07769.
12. Reichman, B., et al. (2026). _AURA-QA: An emotionally balanced QA dataset
    and emotional regularization framework._ arXiv:2603.09205.
13. Ji, Z., et al. (2023). _Towards Mitigating Hallucination in Large Language
    Models via Self-Reflection._ arXiv:2310.06271.
14. Tjuatja, L., et al. (2023). _RLHF amplifies prompt-position sensitivity in
    language models._ Cited in [4]. (Original arXiv preprint; full reference in
    [4]'s bibliography.)
15. Mak, Y. C. (2025). _Lost in the middle, or just lost? Evaluating LLMs on
    information retrieval with long input contexts._
    https://ycmak.net/how-lost-in-the-middle/ (Argues the U-shape is partly an
    artifact of positional-embedding decay producing monotonic drop at very long
    contexts. Not peer-reviewed; data and methodology are public.)
16. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
    size._ LessWrong.
    https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
    (Replication-style analysis disconfirming the strongest reading of Perez et
    al. 2022 for OpenAI base models.)
17. Schulhoff, S. et al. (2024). _The Prompt Report: A Systematic Survey of
    Prompting Techniques._ arXiv:2406.06608. (PRISMA review of 1,565 papers;
    foundational survey used as cross-check on prompt-engineering claims in the
    companion LLM doc.)
18. _Principled Personas: Defining and Measuring the Intended Effects of Persona
    Prompting on Task Performance._ EMNLP 2025.
    https://aclanthology.org/2025.emnlp-main.1364/ (Persona prompts often
    ineffective; up to ~30pp drops from irrelevant persona details.)