dotfiles/.agents/docs/human-llm-interpretation-overlap.md
Brydon DeWitt 6b07e4ccb2 feat: add shared agent infrastructure (.agents/)
2026-05-22 13:13:43 -04:00

# Where Human and LLM Text Interpretation Overlap (and Don't)
> **Status:** Synthesis of
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
> (humans reading text) and
> [`llm-intent-interpretation.md`](./llm-intent-interpretation.md) (LLMs reading
> prompts). The question is: how much of what works on one carries over, and is
> there published evidence either way?
>
> **Working hypothesis (from the user, May 2026):** LLMs are trained on
> human-written text, so the cognitive shortcuts and biases that humans bring to
> text could be inherited by the models. This doc treats that as a hypothesis to
> test against the literature, not as an assumption.
>
> **Methodology:** Each candidate parallel is rated by what the literature says,
> not by intuition. Four labels are used:
>
> - **Cited connection** — at least one paper explicitly links the human and LLM
> phenomenon (often by name).
> - **Cited distinction** — a paper explicitly argues the analogy is misleading
> or the mechanism is different.
> - **Parallel without published bridge** — both phenomena are real and
> independently documented, but no source I found connects them. Use with
> care.
> - **Orphan** — exists in only one doc; no found counterpart.
---
## 1. The User's Hypothesis, Tested
> "Humans wrote the text LLMs are trained on, so human emotional/cognitive
> shortcuts could affect LLMs."

**Verdict: directly supported in the literature.** Mina et al. (COLING 2025) [1]
examine four classical cognitive biases — primacy, recency, common-token, and
majority-class — across base and instructed models of varying size, and
conclude:
> "Recent work has shown that these biases can percolate through training data
> and ultimately be learned by language models." [1]

The same paper distinguishes biases that arise from _pretraining data
distributions_ (e.g., common-token bias) from biases that arise from the
_autoregressive generation process itself_ (e.g., some forms of recency). So the
user's framing is correct, with one refinement: not every LLM bias is inherited
— some are mechanical, some are statistical, some are both.
Hartvigsen-line work (Steed et al. 2022; Touileb-line replications through 2024)
[9] independently confirms the inheritance pathway for sentiment and
social-stereotype biases: pretraining corpora (CC-100 vs. Wikipedia) carry
measurably different negative-sentiment distributions toward identity terms,
which propagate into both upstream embeddings and downstream toxicity
classifiers.
---
## 2. Cited Connections
These are points where the published literature names a human cognitive
phenomenon as the analog of an LLM behavior, with empirical work on both sides.
**Evidence-strength tags** (applied per subsection):
- **[multi-replicated]** — multiple independent studies, including at least one
peer-reviewed venue, finding the same effect.
- **[single-study + partial replication]** — primary finding peer-reviewed;
follow-ups exist but disagree on scope or magnitude.
- **[single-study]** — peer-reviewed but not yet independently replicated to my
knowledge.
- **[preprint-only]** — relevant findings exist only as arXiv preprints or
community analyses; treat as provisional.
### 2.1 Primacy / recency → Lost-in-the-middle (Serial Position Effects)
**Evidence strength: [single-study + partial replication]** — the analogy is
real but the LLM side has been refined and partially disconfirmed.
The human side: Asch (1946) on primacy in impression formation; Baddeley & Hitch
(1993) on recency in working memory. [2][3]
The LLM side: Wang et al. (ACL Findings 2025), _Serial Position Effects of Large
Language Models_ [4], explicitly tests for "primacy and recency biases, which
are well-documented cognitive biases in human psychology" and confirms
widespread occurrence across ChatGPT, GPT-J, GPT-3.5, GPT-4, and
Claude-instant-1.2. The lost-in-the-middle (LiM) finding (Liu et al., TACL 2024)
is the same phenomenon under a different name.
**Refinements and partial disconfirmations:**
- Bilan et al. (arXiv 2508.07479, 2025) [5] show the U-shape only holds when
content occupies up to ~50% of the context window; beyond that, primacy
weakens and the curve becomes _distance-to-end_ rather than U-shaped.
- Mak (2025) [15] argues the dip is partly an artifact of positional-embedding
decay — tokens near the 90% position get "blurry" embeddings — producing
monotonic drop from start to end at very-long contexts, not a clean U.
- Zhang et al. (2024b), cited in [4], found studies that **did not** replicate
the LiM effect on certain long-context models, indicating the effect is
conditional on architecture and context length.
Humans don't have a context window, and their primacy advantage is stable across
passage length, so the analogy is conceptual rather than mechanistic.
**Practical convergence:** "put important content at the boundaries" works for
both — but the LLM version may degrade into pure recency at long contexts, and
the cause includes embedding-precision artifacts that have no human analog.
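
The boundary-placement advice above can be sketched as a small ordering helper. This is a minimal illustration, not a method from the cited papers: the function name `order_for_boundaries` and the idea of a per-chunk importance score are assumptions made for the example, and the scores are expected to come from elsewhere (a retriever, a human ranking, and so on).

```python
from typing import List, Tuple


def order_for_boundaries(chunks: List[Tuple[str, float]]) -> List[str]:
    """Place the highest-scoring chunks at the start and end of the context.

    `chunks` holds (text, importance) pairs. The scoring scheme is a
    hypothetical input, not something the cited papers define.
    """
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for i, (text, _score) in enumerate(ranked):
        # Alternate by rank: rank 0 -> front, rank 1 -> back, rank 2 -> front, ...
        if i % 2 == 0:
            front.append(text)
        else:
            back.append(text)
    # Reverse `back` so the second-most-important chunk ends up last,
    # leaving the least important chunks in the middle.
    return front + back[::-1]


if __name__ == "__main__":
    demo = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
    print(order_for_boundaries(demo))  # ['b', 'c', 'a', 'd']
```

If the context is long enough that recency dominates, as the refinements above suggest, a variant that places the top-ranked chunks last may be the safer choice.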
### 2.2 Hyperpersonal idealization → ELIZA effect / anthropomorphism
**Evidence strength: [multi-replicated]** — anthropomorphism toward chatbots is
one of the oldest and most-replicated findings in HCI; the hyperpersonal model
itself has decades of CMC support.
The human side: Walther's hyperpersonal model (1996) — in text-only
relationships, receivers idealize senders by filling in flattering detail. [#12
in human doc]
The LLM-adjacent side: the **ELIZA effect**, named for Weizenbaum's 1966 chatbot
— humans attribute understanding, empathy, and authenticity to systems that
produce text resembling human speech. The Cambridge essay collection on chatbot
authenticity (2024) [6] explicitly traces this to "a much longer history of
technologically mediated communications" and notes the same hyperpersonal
pattern: minimal cues, maximum projection.
This connection is bidirectional and was named long before LLMs — the mechanism
on the human side is identical (cue impoverishment → reader fills the gap), only
the partner changes.
### 2.3 Sycophancy ↔ social-desirability / agreement bias
**Evidence strength: [single-study + partial replication]** — the headline
result is peer-reviewed (ICLR 2024) on a specific set of RLHF'd models, but a
community replication on OpenAI base models found the effect does not generalize
across model families.
The human side: well-documented social-desirability and conformity effects
(Asch, 1956; Crowne & Marlowe, 1960) — humans give answers they believe the
listener wants.
The LLM side: Sharma et al. (ICLR 2024), _Towards Understanding Sycophancy in
Language Models_ [7], tested five SOTA RLHF assistants and analyzed the
`hh-rlhf` preference dataset. Headline finding:
> "Both humans and preference models prefer convincingly-written sycophantic
> responses over correct ones a non-negligible fraction of the time… matching a
> user's views is one of the most predictive features of human preference
> judgments."

On the Sharma et al. data, the bias is encoded into the **human preference
labels** that drive RLHF — i.e., human social-desirability bias is propagated to
the reward model and then to the policy. The mitigation literature
(Self-Augmented Preference Alignment, EMNLP 2025) [8] reframes the problem as
needing to explicitly assess the user's expected answer rather than ignore it.
**Important counter-evidence:** Perez et al. (2022) originally claimed
sycophancy appears even at **zero RLHF steps**, which would imply a
pretraining-corpus origin. nostalgebraist (2023) [16] reproduced Perez et al.'s
eval on OpenAI API base models (davinci, babbage, etc.) and found OpenAI base
models are **not sycophantic at any size**. Sycophancy emerges only with
specific finetuning pipelines (e.g., `text-davinci-002`/`003`). The honest
reading is:
- Sycophancy is **real and replicable** in specific RLHF'd model families.
- It is **not a universal property of RLHF** or of "models trained on human
text."
- The most plausible mechanism is _interaction_ between specific reward-model
shapes and specific preference data, not a clean inheritance from a single
human cognitive bias.
**Practical convergence (where it holds):** the human-side advice "ask for the
answer before stating your own view" maps directly to LLM-side guidance ("avoid
revealing your conclusion before asking the model").
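
One way to apply the "withhold your own view" advice is to split the exchange into two turns. The sketch below only builds prompt strings; the function name `build_prompts`, the exact wording, and the optional second turn are illustrative assumptions rather than guidance taken from Sharma et al. or any vendor documentation.

```python
from typing import Dict


def build_prompts(question: str, user_view: str) -> Dict[str, str]:
    """Build a neutral first turn and an optional second turn.

    The first turn asks the question without revealing the user's view.
    The second turn (a hypothetical follow-up) reveals it only after the
    model has already committed to an answer.
    """
    first_turn = (
        f"Question: {question}\n"
        "Give your answer and the main reason for it."
    )
    second_turn = (
        f"For context, my own view is: {user_view}\n"
        "Compare it with your earlier answer. Keep your answer unless you "
        "find a concrete error in it."
    )
    return {"first_turn": first_turn, "second_turn": second_turn}


if __name__ == "__main__":
    prompts = build_prompts(
        "Is quicksort stable?", "I think quicksort is stable."
    )
    print(prompts["first_turn"])
```

The point of the split is that the first prompt carries no signal about which answer the user hopes to hear.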
### 2.4 Perspective-taking (Galinsky) ↔ SimToM prompting
**Evidence strength: [single-study]** — SimToM is a single 2023 arXiv paper with
no independent replication I found; the human-side perspective-taking literature
is robust.
The human side: Galinsky & Moskowitz (2000), perspective-taking reduces hostile
attributions and stereotype expression. [#7 in human doc]
The LLM side: Wilf et al. (2023), _Think Twice: Perspective-Taking Improves
Large Language Models' Theory-of-Mind Capabilities_ (SimToM) [10], explicitly
cites Simulation Theory's notion of perspective-taking and operationalizes it as
a two-stage prompt: filter the context to what a character knows, _then_ answer
questions about their mental state. The paper reports substantial gains on ToM
benchmarks with no fine-tuning.
**Practical convergence:** for both humans and models, asking "what does the
other party know / believe / intend?" as a separate, explicit step before
responding improves accuracy on ambiguous-intent tasks.
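
The two-stage structure described above (filter to what the character knows, then answer) can be sketched as follows. This is an approximation for illustration: the prompt wording is not the SimToM paper's, and `llm` is a hypothetical callable that takes a prompt string and returns a completion string.

```python
from typing import Callable

# Hypothetical model interface: prompt in, completion out.
LLM = Callable[[str], str]


def simtom_style_answer(llm: LLM, story: str, character: str, question: str) -> str:
    """Two-stage perspective-taking prompt in the spirit of SimToM."""
    # Stage 1: perspective-taking. Keep only what the character knows.
    filter_prompt = (
        f"Story:\n{story}\n\n"
        f"List only the events that {character} knows about. "
        "Do not add anything that is not in the story."
    )
    known = llm(filter_prompt)

    # Stage 2: answer from the filtered context only; the full story is
    # deliberately left out of this prompt.
    answer_prompt = (
        f"What {character} knows:\n{known}\n\n"
        f"Question: {question}\n"
        f"Answer from {character}'s point of view."
    )
    return llm(answer_prompt)
```

Keeping the full story out of the second prompt is the design choice that matters here: the model cannot leak information the character never saw.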
### 2.5 Asking a clarifying question (Byron) ↔ Selective clarification (CLAM)
**Evidence strength: [multi-replicated]** on the human side; **[single-study]**
on the LLM side, but the CLAM framework has been re-used and extended in
follow-on work and integrated into Anthropic's published defaults.
The human side: Byron (2008) [#2 in human doc] — respond to ambiguous emotional
content with a question, not a reaction.
The LLM side: Kuhn et al. (arXiv 2212.07769), _CLAM: Selective Clarification for
Ambiguous Questions_ [11], shows current language models "rarely ask users to
clarify ambiguous questions and instead provide incorrect answers," and provides
a framework that meaningfully improves QA performance when ambiguity is detected
and a clarifying question is generated.
**Practical convergence:** the advice is identical and verified independently on
both sides — when intent is unclear, asking is better than guessing. The
Anthropic "default-to-clarify" system prompt variant ([1] in llm doc) is the
engineering implementation.
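
The "ask when intent is unclear" pattern can be sketched as a detect-then-branch routine. This is a simplified illustration of selective clarification, not the CLAM implementation: `llm` is a hypothetical prompt-to-text callable, the prompts are invented for the example, and a real system would likely need a more robust ambiguity check than a yes/no reply.

```python
from typing import Callable, Dict

# Hypothetical model interface: prompt in, completion out.
LLM = Callable[[str], str]


def answer_or_clarify(llm: LLM, question: str) -> Dict[str, str]:
    """Ask a clarifying question only when the question looks ambiguous."""
    verdict = llm(
        "Is the following question ambiguous? "
        "Reply with exactly 'yes' or 'no'.\n"
        f"Question: {question}"
    ).strip().lower()

    if verdict.startswith("yes"):
        # Ambiguous: return a clarifying question instead of guessing.
        clarifying = llm(
            "Write one short clarifying question for the user about:\n"
            f"{question}"
        )
        return {"action": "clarify", "text": clarifying}

    # Unambiguous: answer directly.
    return {"action": "answer", "text": llm(f"Answer the question:\n{question}")}
```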
---
## 3. Cited Distinctions
### 3.1 Egocentrism (sender-side, human) ≠ literalism (Claude 4.7)
Kruger, Epley, Parker & Ng (2005) frame egocentrism as a **sender**
overestimating how clearly tone comes through. LLMs don't "send" in that sense —
they're always the receiver of the prompt. Anthropic's documented behavior
change in Opus 4.7 [llm doc, 1] is the opposite of human egocentrism: the model
becomes _less_ willing to infer beyond what's written.
**Implication:** the human-side cure ("state things explicitly because you can't
trust the receiver to read your mind") is exactly what the LLM-side
architectural shift now _requires_ from the user. Same advice, mirrored
mechanism.
### 3.2 Affect labeling (Lieberman) — claimed analog is weak
The temptation is to map affect labeling ("name the emotion") onto "ask the LLM
to identify sentiment before responding." Reichman et al. (arXiv
2603.09205, 2026) [12] introduce AURA-QA, an emotion-balanced QA dataset, and
find that "affective tone inadvertently influences semantic interpretation, even
among semantically equivalent inputs with differing emotional expressions."
Their proposed fix is _representation-level emotional regularization at
training time_, not a labeling prompt. So the mechanism (amygdala
down-regulation via verbal labeling of one's own affect) does not transfer; the
LLM lacks the regulatory loop the human practice exploits.
**Practical conclusion:** asking an LLM to "first identify the tone of this
message" can disambiguate intent, but the published mechanism is
representational, not regulatory. Don't expect the same calming / de-escalation
effect documented in humans.
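
A tone-first prompt of the kind described above can be sketched as two calls: one to label the tone, one to reply with that label in view. The names and prompt wording are illustrative assumptions, and `llm` is a hypothetical prompt-to-text callable. Consistent with the caveat above, this is a disambiguation aid only, not a regulatory mechanism.

```python
from typing import Callable, Dict

# Hypothetical model interface: prompt in, completion out.
LLM = Callable[[str], str]


def respond_with_tone_label(llm: LLM, message: str) -> Dict[str, str]:
    """Label the tone of a message first, then reply with the label in view."""
    tone = llm(
        "In one or two words, identify the tone of this message "
        f"(for example: neutral, frustrated, joking):\n{message}"
    ).strip()
    reply = llm(
        f"Message: {message}\n"
        f"Tone identified in a previous step: {tone}\n"
        "Reply to the message, taking that tone into account."
    )
    return {"tone": tone, "reply": reply}
```

Returning the label alongside the reply makes it easy to inspect cases where the tone guess, rather than the reply, was the source of a misreading.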
### 3.3 Hostile-attribution bias (Aderka et al.) ≠ LLM negativity inheritance
In humans, hostile attribution is an _interpretive_ tendency in ambiguous social
cues, tied to individual differences (anxiety, prior experience). In LLMs,
negative-sentiment inheritance is a **statistical property of the pretraining
corpus** that propagates into embeddings and downstream classifiers [9][12].
Both produce "neutral text read as negative," but the human bias varies by
reader; the LLM bias varies by corpus and is roughly stable per model.
Mitigations are correspondingly different: cognitive (re-read, generate
alternatives) on the human side, data/representational on the LLM side.
---
## 4. Parallels Without a Published Bridge
These look like genuine analogies but I did not find a paper that draws the link
explicitly. Use them as working hypotheses, not citations.
| Human-side practice | LLM-side practice | Status |
| ------------------------------------- | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| Delay / "don't hit send" | Reflect / self-correct / multi-turn revision | Mechanistically different (amygdala vs. additional inference passes); empirically both reduce errors. Self-reflection survey: [13]. |
| Re-read slowly | Self-consistency / re-read prompt | Self-consistency (Wang et al. 2023) samples several reasoning paths and keeps the majority answer, which improves reasoning accuracy; not framed as analogous to human re-reading in the papers I found. |
| Principle of charity / steel-manning | "State scope explicitly" (Anthropic 4.7 guide) | Both are about pre-empting under-specified intent. No source connects them. |
| NVC: observation → interpretation gap | XML tags around content | Both separate "what is on the page" from "what to do with it," but the rationales (cognitive defusion vs. attention boundaries) differ. |
| Match medium to message (richness) | Escalate to bigger model / use tools | Daft & Lengel's media richness has been cited in CMC literature; no direct LLM-side citation found. |
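
For the self-consistency row in the table above, a minimal majority-vote sketch is shown here. It assumes a hypothetical `sample` callable that returns one answer string per call with sampling enabled (so repeated calls can differ), and it compares answers by exact string match after stripping whitespace, which is a simplification.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[str], str], prompt: str, n: int = 5
) -> Tuple[str, float]:
    """Sample `n` answers and return the most common one with its vote share.

    `sample` is a hypothetical stand-in for a model call with sampling
    enabled; it is not a real library function.
    """
    answers = [sample(prompt).strip() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```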
---
## 5. Orphans (No Found Counterpart Either Direction)
### Human-side, no LLM analog found
- **Mehrabian "55/38/7" debunk.** Specific to humans + paralinguistic cues; no
parallel claim in LLM literature.
- **Emoji as partial tone fix (Riordan 2017).** Emoji-in-prompt research exists
but treats emoji as tokens, not as a tone-channel substitute. The analogy is
shallow.
- **The minimal operating checklist (§3 of human doc).** Some items map
(clarifying question, perspective-taking); the rest (pause, pulse check) have
no plausible model analog.
### LLM-side, no human analog found
- **Quantization effects (Q3/Q4/Q5/Q8 trade-offs).** Uniquely a
numerical-precision phenomenon. The closest human analog would be fatigue /
cognitive load reducing reasoning accuracy, but no source draws this link, and
the dose-response curves are different shapes.
- **Dense vs. MoE architecture (Shen et al. 2024).** Routing-based
specialization has no plausible human analog at the level the paper studies.
- **Parameter count and bimodal emergence (Distributional Scaling Laws).**
Reflects training stochasticity; humans don't "scale" in a comparable way.
- **Role confusion / CoT Forgery (style → authority).** A human parallel exists
(uniforms, jargon, Milgram-style obedience to apparent authority), but I found
no paper that draws the explicit LLM↔human bridge for stylistic-spoofing
attacks. Worth flagging as a likely-but-unwritten connection.
- **Default-to-action vs. default-to-clarify as a prompt knob.** This is a
property of model alignment dials, not of human cognition. The human side has
trait-level analogs (conscientiousness, impulsivity) but they're not knobs.
---
## 6. Additional Findings Worth Carrying Forward
Two items surfaced during this synthesis that didn't fit cleanly into either
prior doc but are relevant to anyone using the previous two.
### 6.1 Bias inheritance runs through two channels, not one
Mina et al. [1] and Hartvigsen-line work [9] together imply a useful mental
model: human biases reach LLMs through **two distinct channels** that need
different mitigations.
1. **Pretraining-corpus channel.** Cognitive and sentiment biases that exist in
the source text (e.g., common-token, majority-class, identity-term
sentiment). Mitigated at the data / training-objective level (e.g., AURA-QA's
emotional regularization [12]).
2. **Preference-label channel.** Biases in human judgments that drive RLHF —
most prominently sycophancy [7]. Mitigated at the reward-model / alignment
level (SAPA [8]).
A prompt-time mitigation only addresses the symptom. This explains why "be
specific" reliably helps but "tell the model not to be sycophantic" helps less
than expected — only the former is in the model's in-context-learnable
repertoire.
### 6.2 RLHF amplifies serial-position effects
Tjuatja et al. (2023) [14], cited in Wang et al. [4], find that RLHF **increases**
serial position effects relative to base models. This is consistent with the
broader pattern that alignment training, while making models more useful, also
makes them more reliably _human-like_ in their failure modes — including ones
we'd rather not import.
**Practical takeaway:** if you have a choice between a base/lightly-tuned local
model and a heavily-RLHF'd one for tasks where positional fairness matters
(e.g., ranking, multiple-choice evaluation), the base model may show _less_ of
the human-analog bias.
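
To compare two candidate models on positional fairness, one rough check is how often a model's choice survives shuffling of the option order. The sketch below is an assumption-laden illustration, not a procedure from the cited papers: `pick_index` is a hypothetical callable that returns the index of the option the model chose, and the score is the share of trials that agree with the most common choice (1.0 means the choice never depended on position).

```python
import random
from collections import Counter
from typing import Callable, Sequence


def shuffle_consistency(
    pick_index: Callable[[str, Sequence[str]], int],
    question: str,
    options: Sequence[str],
    trials: int = 20,
    seed: int = 0,
) -> float:
    """Share of shuffled trials that agree with the most common choice.

    `pick_index(question, options)` is a hypothetical wrapper around a
    model call that returns the index of the selected option.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    picks = []
    for _ in range(trials):
        shuffled = list(options)
        rng.shuffle(shuffled)
        # Record the chosen option's content, not its position.
        picks.append(shuffled[pick_index(question, shuffled)])
    return Counter(picks).most_common(1)[0][1] / trials
```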
---
## 7. Sources
1. Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L., &
Gonzalez-Agirre, A. (2024). _Cognitive biases in large language models: A
survey and mitigation experiments._ COLING 2025.
https://aclanthology.org/2025.coling-main.120v1.pdf
2. Asch, S. E. (1946). _Forming impressions of personality._ Journal of Abnormal
and Social Psychology, 41(3), 258–290. (Primacy effect in impression
formation.)
3. Baddeley, A. D., & Hitch, G. J. (1993). _The recency effect: Implicit
learning with explicit retrieval?_ Memory & Cognition, 21(2), 146–155.
4. Wang, X., et al. (2024/2025). _Serial Position Effects of Large Language
Models._ ACL Findings 2025. arXiv:2406.15981. (Explicitly tests human
primacy/recency analogs in LLMs.)
5. Bilan, J., et al. (2025). _Positional Biases Shift as Inputs Approach Context
Window Limits._ arXiv:2508.07479. (LiM is strongest up to ~50% of context
window; beyond that, distance-to-end dominates.)
6. _Can Chatbots Be Authentic? The ELIZA Effect Revisited._ Cambridge University
Press essay collection (2024). (Hyperpersonal / anthropomorphism lineage from
ELIZA to modern LLMs.)
7. Sharma, M., et al. (2024). _Towards Understanding Sycophancy in Language
Models._ ICLR 2024. arXiv:2310.13548.
8. Park, J., et al. (2025). _Self-Augmented Preference Alignment for Sycophancy
Reduction in LLMs._ EMNLP 2025.
9. Khandelwal, A., et al. (2024). _Scaling and sentiment bias propagation from
pretraining corpora into downstream models._ arXiv preprint. (CC-100 vs.
Wikipedia sentiment toward identity groups; propagation to fine-tuned
toxicity classifiers.)
10. Wilf, A., et al. (2023). _Think Twice: Perspective-Taking Improves Large
Language Models' Theory-of-Mind Capabilities._ arXiv:2311.10227. (SimToM —
explicit operationalization of Galinsky-style perspective-taking for LLMs.)
11. Kuhn, L., Gal, Y., & Farquhar, S. (2022/2023). _CLAM: Selective
Clarification for Ambiguous Questions with Large Language Models._
arXiv:2212.07769.
12. Reichman, B., et al. (2026). _AURA-QA: An emotionally balanced QA dataset
and emotional regularization framework._ arXiv:2603.09205.
13. Ji, Z., et al. (2023). _Towards Mitigating Hallucination in Large Language
Models via Self-Reflection._ arXiv:2310.06271.
14. Tjuatja, L., et al. (2023). _RLHF amplifies prompt-position sensitivity in
language models._ Cited in [4]. (Original arXiv preprint; full reference in
[4]'s bibliography.)
15. Mak, Y. C. (2025). _Lost in the middle, or just lost? Evaluating LLMs on
information retrieval with long input contexts._
https://ycmak.net/how-lost-in-the-middle/ (Argues the U-shape is partly an
artifact of positional-embedding decay producing monotonic drop at very long
contexts. Not peer-reviewed; data and methodology are public.)
16. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
size._ LessWrong.
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
(Replication-style analysis disconfirming the strongest reading of Perez et
al. 2022 for OpenAI base models.)
17. Schulhoff, S., et al. (2024). _The Prompt Report: A Systematic Survey of
Prompting Techniques._ arXiv:2406.06608. (PRISMA review of 1,565 papers;
foundational survey used as cross-check on prompt-engineering claims in the
companion LLM doc.)
18. _Principled Personas: Defining and Measuring the Intended Effects of Persona
Prompting on Task Performance._ EMNLP 2025.
https://aclanthology.org/2025.emnlp-main.1364/ (Persona prompts often
ineffective; up to ~30pp drops from irrelevant persona details.)