Convergent AI-Mediated Personality Assessment: Psychometric Profiling and Narrative Inference from Digital Communication Data

Ashita Orbis | March 4, 2026 | 86 min read | working-paper


Working paper. Not peer-reviewed.

Abstract

This paper examines the convergence between two AI-mediated approaches to personality assessment applied to a single participant: explicit psychometric measurement through seventeen validated instruments triangulated across three inference methods, and implicit personality inference through literary narrative generation from a 267MB text messaging archive. The explicit approach (Psyche) produced a ten-dimension persona model with dimensional scores and confidence intervals. The implicit approach (narrative pipeline) produced approximately 78,000 words of first-person literary memoir grounded in 679 cryptographic source citations with a 99.85% validation rate. Cross-method comparison across ten personality dimensions yielded five consistent, three evolved, two partially consistent, and zero divergent ratings, where "evolved" denotes documented developmental change rather than measurement disagreement. Ten analytical contributions emerge from the data: systematic self-enhancement patterns in per-method Big Five divergence; facet-level reliability differences between 4-item and 10-item NEO measurement; temporal behavioral markers operationalized as personality indicators; citation density gradients as measures of inferential distance from evidence; cross-arc attachment consistency as evidence for trait stability from behavioral rather than self-report data; quantitative signal preservation analysis showing that narrative generation preserves extreme personality traits (Openness, Extraversion) while amplifying Neuroticism through literary genre effects; communication medium effects demonstrating that the same person's inferred personality varies by up to 37 points depending on the text register analyzed; temporal personality trajectory analysis revealing a Neuroticism peak that aligns with the narrative pipeline's independently inferred emotional arc; the operationalization of a negative finding, in which lexical frequency analysis was removed from personality synthesis and repositioned as comparative corpus characterization after demonstrating insensitivity to personality variation; and a differential personality experiment demonstrating that narrative perspective determines the inferred personality profile: interlocutor first-person narratives produce profiles differing by 10.7-12.3 mean absolute points from the subject's baseline, while third-person narratives about the subject stay within 3.7-5.6 points, ruling out model-invariant output as the primary explanation for signal preservation. The findings suggest that digital communication data contains more psychological signal at the corpus level than individual messages suggest, that extraction method determines accessibility of this signal, and that the combination of psychometric snapshots with narrative trajectories captures dimensions neither approach reaches independently.

1. Introduction

1.1 The Personality-in-Digital-Data Problem

The premise that digital behavior contains personality signal has accumulated substantial empirical support across the past decade of computational social science. Kosinski, Stillwell, and Graepel (2013) demonstrated that Facebook Likes predicted Big Five personality traits with moderate accuracy, achieving correlations between model predictions and self-report scores that in some cases approached the test-retest reliability of the instruments themselves. Park et al. (2015) extended this work to open-vocabulary language analysis, showing that the words people use on social media contain sufficient personality variance to predict trait scores from linguistic features alone, with associations between specific language categories and personality dimensions that replicated across large samples. More recently, Peters, Cerf, and Matz (2024) showed that large language models can infer personality traits from brief text samples, achieving convergent validity with self-report measures that suggests the models are extracting genuine personality signal rather than superficial linguistic correlations.

These findings establish that digital data contains personality information. The unresolved question concerns what happens when different extraction methods are applied to the same person's data, because the literature has yet to examine this directly. Existing studies compare model predictions against a single criterion (typically self-report scores) and evaluate accuracy in the classical psychometric sense of convergent validity with established instruments. No published work has compared explicit psychometric measurement against AI literary interpretation of the same individual's digital communication corpus, which means that the question of whether different AI-mediated approaches to personality assessment converge on the same psychological portrait from the same underlying data remains open.

This paper addresses that gap through a case study in which two independent AI-mediated approaches to personality assessment were applied to a single participant, each drawing on overlapping but non-identical corpora and employing fundamentally different extraction methodologies, and in which the resulting personality portraits were compared across ten theoretically motivated dimensions.

1.2 The Simulator Hypothesis and Persona Inference

The theoretical impetus for the narrative inference approach derives from a specific understanding of how large language models process text. The simulator hypothesis, articulated by janus (2022) and developed formally by Andreas (2022) and Shanahan, McDonell, and Reynolds (2023), holds that autoregressive language models trained on next-token prediction are best understood not as agents with goals but as simulators of the processes that generated their training data. Under this framing, a language model that encounters a sequence of text implicitly performs inference over the latent agent (or agents) most likely to have produced that sequence, because accurate next-token prediction requires modeling not just the surface statistics of language but the coherent intentions, dispositions, and knowledge states of the communicator.

Andreas (2022) formalized this intuition in the agent model framework, arguing that language models are, in a narrow but precise sense, models of intentional communicators: when predicting the next token, the model must infer the characteristics of whoever is "speaking" in its context window, which makes persona inference a computational necessity of the training objective rather than an emergent coincidence. Shanahan et al. (2023) developed a compatible "role-play" framing, published in Nature, in which LLMs are characterized as systems that simulate whoever is the implied author of their context, with the specific character being simulated determined by the prompt and context rather than fixed in the weights. Marks, Lindsey, and Olah (2026), in Anthropic's persona selection model, provided behavioral and interpretability evidence that RLHF-based post-training elicits and stabilizes a particular persona from the distribution of characters learned during pretraining, rather than creating a novel agent from scratch.

The relevance to the present study is direct. If language models learn to infer coherent personas from text as a byproduct of training, then the hypothesis that a model could construct a psychologically realistic narrative from messaging data is not a speculative leap but a prediction of the theory. The interesting question is not whether the model can produce a narrative that reads as psychologically coherent (the simulator hypothesis predicts that it can, given sufficient context), but whether the persona inferred by the model from behavioral data (text messages) correlates with independently measured personality traits of the actual individual. The present study tests precisely this: whether the implicit persona constructed by a language model from digital communication data converges with explicit psychometric measurement of the same person.

1.3 Two Approaches to AI-Mediated Personality Assessment

The first approach, designated Psyche, is an explicit measurement framework that administers seventeen validated psychometric instruments (totaling 786 questionnaire items) and triangulates the resulting scores across three inference methods: self-report (weighted 0.47), LLM text analysis (0.27), and semi-structured interview (0.27). The system produces a structured profile with dimensional scores, confidence intervals derived from cross-method divergence, and a ten-dimension persona model that maps instrument scores to behavioral predictions. This approach inherits the strengths of classical psychometric measurement (validated instruments, quantified dimensionality, normative comparisons) while extending it through the multi-method triangulation that Campbell and Fiske (1959) advocated as the standard for construct validation.

The second approach, designated the narrative pipeline, is an implicit inference system that feeds a 267MB archive of text messages (SMS and Facebook Messenger, spanning 2008 to 2026) through Claude Opus as a literary generation engine. The pipeline provides Opus with approximately 2.3KB of behavioral reference data (texting metrics: emoji rates, filler word frequencies, sentence length distributions, punctuation patterns) and instructs it to generate first-person literary memoir from the messaging data. The critical asymmetry is that the pipeline provides no personality scores, no instrument results, no psychological framework. Any psychological depth beyond surface texting metrics was inferred by the model from the messaging data itself. The pipeline produced approximately 78,000 words of narrative across two temporal arcs (the earlier spanning several years, the later spanning roughly eight months), grounded in a cryptographic citation system that links specific narrative claims to specific source messages (679 citations, 99.85% validated against the original archive).

The distinction between these approaches is not merely methodological but epistemological. Psyche asks a person standardized questions about their own behavior and triangulates the answers with computational text analysis. The narrative pipeline asks a model to construct a person's interior life from the evidence of what they said, to whom, and when. The first approach measures personality through instruments designed for the purpose. The second approach infers personality from data that was never intended to reveal it.

1.4 Idiographic Rationale

The design is a single-case study, which raises immediate questions about generalizability that require methodological justification rather than apology. The idiographic tradition in personality psychology (Allport, 1937; Lamiell, 1981; Molenaar, 2004) argues that some psychological questions are properly answered at the level of the individual case, particularly questions about the internal coherence and structure of a personality system that nomothetic methods aggregate away. Molenaar's (2004) manifesto on the non-ergodicity of psychological processes made the stronger claim that interindividual variation and intraindividual variation are mathematically non-equivalent, which means that findings from group-level analyses do not necessarily apply to individual cases and vice versa.

The present study makes no population-level claims. The contribution is methodological: demonstrating whether two AI-mediated personality assessment approaches converge when applied to the same individual, identifying the specific patterns of convergence and divergence, and extracting analytical contributions from the comparison that illuminate the properties of each method. The participant's data serve as the substrate for methodological comparison, not as evidence about a population. The N=1 design is appropriate because the research questions concern within-person cross-method convergence, which is a property of the methods rather than the person, though the findings are necessarily bounded by the characteristics of this particular participant and this particular dataset.

The autoethnographic dimension of the study (the participant is also the analyst) introduces both advantages and limitations that are discussed in the Method section. The advantages include ecological validity (the data are genuine communication, not laboratory artifacts) and interpretive access (the participant can evaluate whether inferred psychological states match experienced reality). The limitations include confirmation bias and the impossibility of blinded evaluation. These are addressed through procedural safeguards (heterogeneous ratings, explicit discussion of alternative interpretations) rather than eliminated, because elimination is not possible within this design.

1.5 Research Questions

Three questions organize the analysis:

RQ1: To what extent do explicit psychometric measurement and implicit AI narrative inference converge when applied to the same individual's digital communication data?

RQ2: What does cross-method divergence reveal about method-specific biases in AI-mediated personality assessment?

RQ3: Can behavioral communication metrics (emoji rates, question frequencies, message length distributions) serve as indicators of psychologically meaningful change that bridges quantitative measurement and qualitative narrative inference?

2. Method

2.1 Participant

The participant is a single adult male, mid-twenties to early thirties during the data collection period, residing in western Canada, with post-secondary education in psychology and substantial engagement with digital communication technologies throughout the study period. The participant occupies a dual role as both the subject of assessment and the analyst conducting the comparison, an autoethnographic configuration that has precedent in qualitative research methodology (Hayano, 1979; Ellis, Adams, and Bochner, 2011) and that introduces specific methodological considerations.

The advantages of the autoethnographic design include ecological validity (the messaging data are genuine personal communications generated over approximately fifteen years without awareness that they would later be analyzed), interpretive access (the participant can evaluate whether computationally inferred psychological states correspond to experienced reality), and motivation (the participant has substantial investment in the accuracy of the assessment, which incentivizes honest engagement with the instruments and critical evaluation of the results). The limitations include confirmation risk (the participant may unconsciously interpret ambiguous results as convergent), the impossibility of blinded evaluation (the analyst cannot be blinded to his own personality), and the potential for self-presentation bias in the self-report instruments (addressed through the multi-method triangulation that constitutes Method A's core design).

2.2 Method A: Psyche (Explicit Psychometric Measurement)

Psyche is an open-source personality profiling framework that triangulates validated psychometric instruments with computational text analysis to produce multi-method personality profiles. The framework administers instruments in two phases, combines instrument scores with three additional inference methods, and synthesizes the results into a structured profile with confidence intervals and behavioral predictions.

2.2.1 Instrument Battery

The battery consists of seventeen instruments administered across two phases, totaling 786 questionnaire items plus a ten-question semi-structured interview.

Table 1. Phase 1 Instruments (7 instruments, 225 items)

Instrument | Items | Scale | Constructs | Citation
IPIP-NEO-120 | 120 | 5-pt Likert | Big Five + 30 facets (4 items/facet) | Goldberg et al. (2006)
CRT-7 | 7 | Open-ended | Analytical vs. intuitive thinking | Frederick (2005); Toplak et al. (2014)
NCS-18 | 18 | 5-pt Likert | Need for Cognition | Cacioppo et al. (1984)
Rosenberg Self-Esteem | 10 | 4-pt Likert | Global self-esteem | Rosenberg (1965)
SD3 (Short Dark Triad) | 27 | 5-pt Likert | Machiavellianism, Narcissism, Psychopathy | Jones and Paulhus (2014)
PHQ-9 + GAD-7 | 16 | 4-pt Likert | Depression and anxiety screening | Kroenke et al. (2001); Spitzer et al. (2006)
Conversational Interview | 10 Qs | Open-ended | Values, self-concept, relationships, coping | Peters and Matz (2024) protocol

Table 2. Phase 2 Instruments (10 instruments, 561 items)

Instrument | Items | Scale | Constructs | Citation
IPIP-NEO-300 | 300 | 5-pt Likert | Big Five + 30 facets (10 items/facet) | Goldberg et al. (2006)
HEXACO-60 | 60 | 5-pt Likert | 6 factors incl. Honesty-Humility | Ashton and Lee (2009)
ECR-R | 36 | 7-pt Likert | Attachment: Anxiety + Avoidance | Fraley et al. (2000)
ERQ-10 | 10 | 7-pt Likert | Emotion regulation: Reappraisal + Suppression | Gross and John (2003)
IRI-28 | 28 | 5-pt Likert | Empathy: 4 subscales | Davis (1983)
Self-Monitoring Scale | 18 | True/False | Social flexibility vs. cross-situational consistency | Snyder (1974)
IE-4 (Locus of Control) | 4 | 5-pt Likert | Internal/External attribution | Kovaleva et al. (2012)
Grit-S | 8 | 5-pt Likert | Perseverance + Interest consistency | Duckworth and Quinn (2009)
RIASEC-48 | 48 | 5-pt Likert | Vocational interests (6 types) | Holland (1997); adapted items
BPNS-9 | 9 | 7-pt Likert | Autonomy, Competence, Relatedness | Chen et al. (2015)

All instruments are drawn from the International Personality Item Pool (IPIP) or from published validated scales with established psychometric properties (internal consistency, test-retest reliability, convergent and discriminant validity documented in the cited sources).

2.2.2 Inference Methods

The Psyche framework triangulates three inference methods, each weighted according to its assessed reliability and validity for personality inference:

Self-report (weight: 0.47): Raw instrument scores from the participant's responses to all 786 items. This method receives the highest weight because it provides the most complete dimensional coverage and benefits from the psychometric validation of the underlying instruments, despite the well-documented limitations of self-report (social desirability bias, limited introspective access, reference group effects; Paulhus, 1984; Dunning, Heath, and Suls, 2004).

LLM text analysis (weight: 0.27): Claude Opus analyzed a 1,473,000-word writing corpus encompassing academic papers (151K words), SMS messages (401K words), Facebook Messenger conversations (413K words), and AI platform conversations (ChatGPT: 333K words; Claude.ai: 167K words). The analysis processed approximately 101,000 words through a chunked sampling methodology: long samples were segmented into ~2,000-word units at word boundaries, then assembled into ~8,000-word chunks through round-robin source interleaving, producing 15 chunks that each contain diverse source material. Each chunk was independently assessed by Claude Opus 4.6, and scores were averaged across all chunks (see Appendix A for full algorithmic specification and reproduction instructions). This method provides an external behavioral observation that is not subject to the same self-presentation biases as self-report, though it introduces the model's own interpretive tendencies as a potential confound. The communication medium systematically shapes which personality facets are visible to the model — messaging captures relational distress (N=69.8) while AI conversations foreground structured goal-pursuit (C=72.0) — a finding explored in Section 3.8.
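The chunked sampling step described above can be sketched as follows. This is a minimal illustration with hypothetical function and parameter names; only the unit and chunk sizes (~2,000 and ~8,000 words) and the round-robin interleaving come from the text.

```python
import itertools

def build_chunks(sources, unit_words=2000, chunk_words=8000):
    """Sketch of the chunked sampling methodology: segment each source
    into ~unit_words-word units at word boundaries, interleave units
    round-robin across sources, then pack them into ~chunk_words-word
    chunks so every chunk mixes diverse source material.

    sources: dict mapping source name -> full text of that source.
    """
    # 1. Segment each source into ~unit_words-word units at word boundaries.
    units_by_source = []
    for text in sources.values():
        words = text.split()
        units = [" ".join(words[i:i + unit_words])
                 for i in range(0, len(words), unit_words)]
        units_by_source.append(units)

    # 2. Round-robin interleave units across sources (drop padding Nones).
    interleaved = [u for group in itertools.zip_longest(*units_by_source)
                   for u in group if u]

    # 3. Pack interleaved units into ~chunk_words-word chunks.
    chunks, current, count = [], [], 0
    for unit in interleaved:
        current.append(unit)
        count += len(unit.split())
        if count >= chunk_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each resulting chunk would then be scored independently and the per-chunk estimates averaged, as the text describes.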

Semi-structured interview (weight: 0.27): A ten-question interview conducted by Claude Opus following the Peters and Matz (2024) assessment protocol, in which the interviewer was blind to the self-report scores. The interview probed values, decision-making, relationships, coping, and self-concept. This method captures nuance and context that fixed-response instruments cannot, while the conversational format elicits behavioral descriptions rather than trait self-ratings.

Lexical analysis (corpus characterization only): Automated lexical analysis of the writing corpus using the Empath framework (Fast, Chen, and Bernstein, 2016) is used for comparative corpus characterization across analysis levels but is excluded from profile synthesis. Initial analysis revealed that Empath measures language register (what the text is about) rather than personality traits (who wrote it), producing compressed estimates between 52 and 55 regardless of dimension extremity (see Section 3.2 for the full analysis). The framework was accordingly repositioned as an ordinal comparative tool using z-score labels (high, above average, average, below average, low) that characterize how lexical category frequencies vary across corpus segments — useful for identifying register effects, not for personality inference.
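The ordinal labeling described above amounts to bucketing z-scores. The five labels come from the text; the cutoff values below are illustrative assumptions, since the paper does not specify the thresholds.

```python
def zscore_label(z):
    """Map a lexical-category z-score to the ordinal comparative labels
    used for corpus characterization. Cutoffs are assumed, not taken
    from the paper."""
    if z >= 1.5:
        return "high"
    if z >= 0.5:
        return "above average"
    if z > -0.5:
        return "average"
    if z > -1.5:
        return "below average"
    return "low"
```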

2.2.3 Synthesis and Persona Model

The three methods' scores are combined through weighted averaging to produce a merged score for each personality dimension, with confidence intervals derived from cross-method divergence (the standard deviation across the three methods' estimates, expressed as a range around the weighted mean). The merged scores are then mapped to a ten-dimension persona model that translates dimensional scores into behavioral predictions: Communication, Audience (self-monitoring), Decision-Making, Risk, Conflict Response, Emotion Regulation, Empathy, Ethics, Motivation, and Stress Response. Each dimension specifies behavioral tendencies, interaction signatures, and predicted responses to specific situational contexts.
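The synthesis step can be sketched as a weighted mean with a divergence-derived interval. This is a minimal illustration assuming the published method weights (normalized by their sum) and the description above of the confidence interval as the cross-method standard deviation around the weighted mean; function and key names are hypothetical.

```python
from statistics import stdev

# Method weights as reported in the paper.
WEIGHTS = {"self_report": 0.47, "llm": 0.27, "interview": 0.27}

def merge_dimension(scores):
    """Combine per-method estimates (0-100 scale) for one dimension.

    scores: dict mapping method name -> dimensional estimate.
    Returns the weighted mean plus an interval of +/- one cross-method
    standard deviation, per the synthesis description in the text.
    """
    total = sum(WEIGHTS[m] for m in scores)          # normalize weights
    merged = sum(WEIGHTS[m] * s for m, s in scores.items()) / total
    divergence = stdev(scores.values())              # cross-method SD
    return {
        "merged": round(merged, 1),
        "ci": (round(merged - divergence, 1), round(merged + divergence, 1)),
    }
```

A degenerate case shows the interval semantics: when all three methods agree exactly, the divergence term is zero and the interval collapses to a point.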

2.3 Method B: Narrative Pipeline (Implicit Personality Inference)

The narrative pipeline is a literary generation system that produces first-person memoir from text messaging archives. The pipeline was developed to test a specific prediction of the simulator hypothesis: that a language model, given sufficient behavioral context in the form of messaging data, would infer a coherent persona whose psychological characteristics could be independently verified. It was not designed as a personality assessment tool, which means that any convergence with psychometric measurement constitutes evidence about the model's inferential capacity rather than an artifact of shared assessment design.

2.3.1 Source Data

The source corpus is a 267MB archive of text messages encompassing SMS messages extracted from Android backups and Facebook Messenger conversations exported from the platform's data download tool. The archive spans approximately 2008 to 2026, covering the participant's communications with family members, friends, romantic partners, and acquaintances. Messages from all interlocutors are included (the archive contains both sent and received messages), providing the model with both sides of each conversation.

2.3.2 Behavioral Reference

The pipeline provides the generation model (Claude Opus) with approximately 2.3KB of behavioral reference data derived from quantitative analysis of the messaging corpus. This reference includes texting metrics (emoji usage rate: 9.1% in the earlier arc, 61.5% in the later arc; question rate: 18.3% declining to 10.9%; message length distributions; filler word frequencies), punctuation patterns, and sentence construction tendencies. The reference describes texting behavior, not personality. No Psyche scores, no instrument results, no psychological constructs, no trait labels are provided to the generation model.
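Metrics of the kind listed above can be sketched as simple per-message rates. The exact metric definitions and the emoji character ranges below are assumptions; the paper reports the resulting rates but not how they were computed.

```python
import re
from statistics import mean

# Rough emoji coverage (core emoji blocks plus misc. symbols); illustrative only.
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def texting_metrics(messages):
    """Sketch of behavioral-reference metrics: fraction of messages
    containing an emoji, fraction containing a question mark, and mean
    message length in words."""
    emoji_rate = mean(1 if EMOJI.search(m) else 0 for m in messages)
    question_rate = mean(1 if "?" in m else 0 for m in messages)
    mean_len = mean(len(m.split()) for m in messages)
    return {"emoji_rate": round(emoji_rate, 3),
            "question_rate": round(question_rate, 3),
            "mean_words": round(mean_len, 1)}
```

Computed separately per temporal arc, rates like these yield the contrasts the text reports (e.g., emoji usage rising between arcs while question rate declines).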

2.3.3 Generation Process

The narrative pipeline operates through a five-step process. First, the messaging archive is segmented into temporal arcs corresponding to distinct relational periods. Second, the behavioral reference and style profile are assembled. Third, Claude Opus generates chapter-by-chapter first-person literary memoir, drawing on the messages as source material and the behavioral reference as a voice constraint. Fourth, a cryptographic citation system links specific narrative claims to specific source messages through {FN:uid} markers, where each UID is a truncated hash of the original message content. Fifth, an automated validation pass verifies that each citation marker resolves to a real message in the source archive.
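Steps four and five can be sketched as follows. The hash algorithm (SHA-256) and truncation length are assumptions, since the text specifies truncated content hashes but not the exact construction; function names are illustrative.

```python
import hashlib
import re

def citation_uid(message_text, length=8):
    """Truncated content hash standing in for the UID in a {FN:uid}
    marker. Algorithm and length are assumed, not from the paper."""
    return hashlib.sha256(message_text.encode("utf-8")).hexdigest()[:length]

def validate_citations(narrative, archive_messages):
    """Sketch of the validation pass: resolve every {FN:uid} marker in
    the narrative against the source archive.

    Returns (resolved_count, total_count); the validation rate is their
    ratio."""
    known = {citation_uid(m) for m in archive_messages}
    uids = re.findall(r"\{FN:([0-9a-f]+)\}", narrative)
    resolved = sum(1 for u in uids if u in known)
    return resolved, len(uids)
```

Under this scheme a marker fails to resolve exactly when no archived message hashes to its UID, which matches the failure modes the text suggests (message deletion, or a collision in the truncated hash space).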

The pipeline produced approximately 78,000 words of first-person narrative across two arcs: the earlier spanning several years, the later spanning roughly eight months. Across all narrative outputs, the citation system generated 679 citations with a validation rate of 99.85% (one citation failed to resolve, likely due to a hash collision or message deletion).

2.3.4 The Key Asymmetry

The critical methodological point is the asymmetry between what the pipeline received and what it produced. The pipeline received texting metrics (surface behavioral data). It produced psychological inferences (attachment architecture, emotion regulation patterns, decision-making mechanisms, interpersonal dynamics). Every psychological construct that appears in the narrative was inferred by Opus from the messaging data itself, not provided as input. This asymmetry is what makes the comparison with Psyche's explicit measurement informative: if the pipeline's implicit inferences converge with Psyche's explicit measurements, the model successfully extracted personality from conversational data, consistent with the simulator hypothesis prediction that persona inference is a natural capacity of autoregressive models given sufficient behavioral context.

2.4 Comparison Framework

The comparison evaluates the narrative pipeline's implicit personality model against Psyche's explicit measurement across ten persona dimensions. The evaluation framework acknowledges structural confounds (shared model, overlapping corpus), incorporates temporal dynamics (the narrative spans 2017 to 2026 while Psyche was administered in February 2026), and employs a four-category rating system that distinguishes genuine developmental change from measurement disagreement.

2.4.1 Three Evidence Tiers

Evidence is organized into three tiers that differ in their epistemic status. Tier 1 (Real) comprises passages in the narrative with citation markers linking to verified source messages, Psyche interview quotes, and quantitative corpus statistics; this tier constitutes ground truth. Tier 2 (Inferred) comprises the unsourced portion of the narrative (approximately 88-92% of total narrative content), which represents Opus's literary construction of inner life, emotional states, and psychological mechanisms from the messaging evidence; this tier is what the evaluation assesses. Tier 3 (Measured) comprises Psyche's instrument scores with confidence intervals and behavioral specifications; this tier serves as the benchmark against which Tier 2 inferences are evaluated.

2.4.2 Rating Categories

Each dimension receives one of four ratings:

CONSISTENT: The narrative pipeline's implicit construction aligns with Psyche's explicit measurement. Both approaches describe the same psychological architecture.

EVOLVED: The earlier and later narrative arcs show real developmental change, with Psyche capturing one endpoint of the trajectory. The temporal gap between the narrative period and measurement point explains the apparent difference. This rating denotes documented change, not measurement disagreement.

PARTIAL: Alignment on some aspects of the dimension, gaps on others. Typically reflects data source limitations (the messaging archive lacks information about constructs that Psyche measured through other channels) rather than method failure.

DIVERGENT: The two approaches describe contradictory psychological architecture that cannot be explained by temporal change or data source limitations.

2.4.3 Entanglement Acknowledgment

Two structural confounds complicate the interpretation of convergence and must be stated explicitly. First, Claude Opus generated both the Psyche synthesis report and the narrative pipeline outputs, which means that convergence may partly reflect the model's rendering tendencies rather than genuine personality signal (if Opus gravitates toward certain personality constructions regardless of input, convergence is partly artifactual). Second, Psyche's LLM analysis drew on a writing corpus that includes content adjacent to SMS, producing uncertain but nonzero overlap with the narrative pipeline's source data.

These entanglements produce an interpretive asymmetry: convergence is expected to some degree and therefore carries less evidential weight, while divergence is more informative because it occurs despite the shared model and overlapping data. The evaluation interprets convergence cautiously and takes divergence seriously, weighting the findings accordingly.

3. Results

3.1 Cross-Method Personality Convergence

The evaluation compared the narrative pipeline's implicit personality model against Psyche's explicit measurement across ten dimensions of the persona model. Table 3 summarizes the results.

Table 3. Cross-Method Comparison Across Ten Persona Dimensions

Dimension | Rating | Summary
Epistemic Style | CONSISTENT | Strongest convergence. Instruments and narrative independently agree on analytical processing, including its specific failure mode under irreducible experience.
Conflict Response | CONSISTENT | Narrative reconstructed the exact regulation sequence (reappraisal first, suppression when that fails) that instrument scores predict, playing out across entire arcs rather than single scenes.
Interpersonal Patterns | CONSISTENT | Same metaphor (glass pane separation) appeared in three independent evidence tiers. Proxy architecture visible across both arcs.
Attachment | CONSISTENT | Anxious-preoccupied pattern (elevated anxiety, near-floor avoidance) appears identically across both arcs with different people eight years apart. Same behavioral architecture, different partner, different era.
Stress Response | CONSISTENT | Dual-deployment regulation (reappraisal then suppression) and momentum-based recovery inferred from messaging patterns match ERQ scores at near-ceiling (94/83) and interview evidence of crisis behavior.
Communication | EVOLVED | Pipeline captured trajectory from indirect (proxied, hedged) to direct engagement. Psyche's low self-monitoring measurement captures the resolved endpoint, not the origin.
Decision-Making | EVOLVED | Same deliberation architecture produces different conversion rates across eras: near-zero action in the earlier arc, rapid conversion in the later. Psyche measures the mature form.
Self-Concept | EVOLVED | Gap between ceiling analytical capacity and low self-esteem persists across eras, but the relationship to the gap changes. Psyche captured a person mid-trajectory.
Motivation | PARTIAL | SMS data captures relational motivation intensity but misses vocational and investigative dimensions Psyche measured at ceiling. Data source limitation, not method failure.
Flow States | PARTIAL | Specific phenomenology of intellectual flow has no narrative counterpart. Some dimensions are structurally unreachable from messaging data.

Five dimensions showed straightforward convergence. Three showed real change across the narrative's temporal span, with Psyche capturing one endpoint of the trajectory. Two were limited by the data source rather than the method. Zero dimensions showed divergence, a result that is potentially suspicious and is addressed in the Discussion.

3.2 Method Divergence as Self-Enhancement Signal

The Psyche framework's multi-method design produces per-method Big Five estimates that reveal systematic patterns of disagreement between inference methods. Table 4 presents these estimates for the five major personality dimensions.

Table 4. Per-Method Big Five Estimates

Dimension | Self-Report | LLM | Interview | Merged | Self-Report Divergence
Openness | 66 | 88.6 | 93 | 79 | -22.6 to -27 (lower)
Conscientiousness | 57 | 61.4 | 35 | 53 | -4.4 to +22 (mixed)
Extraversion | 37 | 28.3 | 15 | 29 | +8.7 to +22 (higher)
Agreeableness | 68 | 36.4 | 30 | 53 | +31.6 to +38 (higher)
Neuroticism | 52 | 51.1 | 68 | 57 | +0.9 to -16 (mixed)

The self-report method diverges from LLM and interview estimates in a pattern that is clearer in some dimensions than others. Three dimensions show a consistent, directionally interpretable self-report bias:

Agreeableness shows the strongest self-enhancement signal: self-report (68) exceeds both LLM (36.4) and interview (30) by 31.6 to 38 points, inflating in the socially desirable direction. This is the most robust finding; it survived the expanded corpus analysis without meaningful change.

Openness shows consistent self-deflation: self-report (66) is 22.6 to 27 points below external methods, a direction consistent with intellectual humility (a form of social desirability for someone whose identity is organized around epistemic caution).

Extraversion shows moderate self-enhancement: self-report (37) exceeds LLM (28.3) and interview (15) by 8.7 to 22 points, though the magnitude weakened after the expanded corpus analysis corrected the LLM estimate upward from 18 to 28.3.

Two dimensions now show split signals rather than clear self-enhancement patterns.

Conscientiousness: the corrected LLM score (61.4) now agrees with self-report (57, delta: -4.4), while the interview estimate remains very low (35). The original LLM underestimate (C=38) was an artifact of a truncation bug that discarded academic writing (the source containing the strongest Conscientiousness signal) before analysis. The interview's low estimate may reflect the interview protocol's emphasis on spontaneous self-description, which captures hedging and uncertainty rather than the structured goal-pursuit visible in writing.

Neuroticism: the corrected LLM score (51.1) now agrees with self-report (52, delta: +0.9), while the interview estimate remains elevated (68). The original LLM-SR divergence was partially driven by a messaging-only corpus that overrepresented relational distress content.

The corrected analysis actually strengthens the self-enhancement finding for Agreeableness and Openness: these patterns survived a methodological correction that substantially changed other dimensions, providing evidence that they reflect genuine self-report bias rather than measurement artifact. The finding that Conscientiousness and Neuroticism divergence was partially methodological artifact is itself informative — it demonstrates the sensitivity of text-based personality inference to corpus composition and the importance of source-stratified sampling (see Section 3.8).

This pattern partially replicates the self-deceptive enhancement effect documented by Paulhus (1984) and Paulhus and Reid (1991), with the novel contribution that AI text inference serves as the comparison method. The model functions as an "observer" that is not subject to the relational dynamics that complicate human informant reports but that introduces its own systematic biases (toward coherent narrative construction, toward psychologically interesting interpretations). The interview method retains its distinctive signal for N and C — it was not affected by the corpus correction — suggesting that conversational assessment captures dimensions of personality that text analysis, even when corrected, does not fully access.

The Empath lexical analysis method reveals a separate finding: compression toward the midpoint. Across all five dimensions, Empath produces estimates between 52 and 55.2 (a 3.2-point range on the expanded 1.47M-word corpus), regardless of the dimension's actual extremity. A quantitative root cause analysis (see Supplementary Table S4) identifies three compounding factors. First, six categories in the Empath-to-Big-Five mapping (travel, adventure, curiosity, helping, dominance, planning) do not exist in Empath's lexicon, reducing effective coverage for Openness, Agreeableness, and Conscientiousness. Second, the calibration formula (50 + raw_score * 2000) cannot amplify the raw weighted score range of 0.003 into a meaningful 0-100 differentiation. Third, and most fundamentally, the top-20 lexical categories differ dramatically between the corpus (dominated by negative_emotion, speaking, communication — reflecting the expanded source mix) and the narrative corpus (dominated by positive_emotion, speaking, love, friends — reflecting literary genre conventions), with 10 of 20 categories shared. This genre sensitivity demonstrates that Empath measures what the text is about (topic and genre) rather than who wrote it (personality), explaining both the compression and its consistency across all analysis levels (see Section 3.7 and Table S7, where the Empath range across all 65 cells is 7.2 points compared to the LLM range of 53.8 points). These findings suggest that lexical frequency counting at the corpus level lacks sensitivity for personality differentiation, a negative finding with implications for the broader LIWC-adjacent personality inference literature.
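
The calibration bottleneck in the second factor can be illustrated directly. A minimal sketch, where the formula (50 + raw_score * 2000) is taken from the text but the raw-score endpoints (0.001 and 0.004) are illustrative values chosen to span the reported 0.003 range:

```python
# Illustrative sketch of the calibration bottleneck described above.
# The linear formula is from the text; the raw-score endpoints are
# assumptions chosen to span the reported ~0.003 raw range.
def calibrate(raw_score: float) -> float:
    return 50 + raw_score * 2000

low_raw, high_raw = 0.001, 0.004   # assumed extremes, 0.003 apart
spread = calibrate(high_raw) - calibrate(low_raw)
print(spread)  # 6.0 -> a 0.003 raw range occupies only ~6 points of 0-100
```

However the endpoints are chosen, a 0.003 raw band can never occupy more than 6 points of the output scale under this formula, which is why no calibration of this form can recover meaningful differentiation.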

This analysis motivated the decision to remove Empath from profile synthesis in v3 of the framework. Rather than attempting to calibrate Empath's compressed outputs into meaningful personality scores, the framework now treats Empath as a comparative corpus characterization tool: lexical category frequencies are z-scored against full-corpus norms and mapped to ordinal labels (high, above average, average, below average, low) that describe how word-usage patterns vary across corpus segments. This reframing acknowledges Empath's actual capability — measuring register and topic variation — while removing it from a role (personality inference) for which it demonstrably lacks sensitivity. The ordinal labels are useful for identifying, for example, that messaging segments score "high" on negative_emotion while AI platform segments score "low," a register effect that is genuinely informative about corpus composition even though it carries no personality inference value.
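
The reframing can be sketched in a few lines. The z thresholds below are illustrative assumptions (the text does not specify the cutoffs), and the example frequencies are invented:

```python
# Map a segment's lexical-category frequency to an ordinal label by
# z-scoring it against full-corpus norms. Thresholds are assumptions.
def ordinal_label(segment_freq: float, corpus_mean: float, corpus_sd: float) -> str:
    z = (segment_freq - corpus_mean) / corpus_sd
    if z >= 1.5:
        return "high"
    if z >= 0.5:
        return "above average"
    if z > -0.5:
        return "average"
    if z > -1.5:
        return "below average"
    return "low"

# e.g. a messaging segment dense in negative_emotion words (invented numbers):
print(ordinal_label(0.031, corpus_mean=0.012, corpus_sd=0.008))  # high
```

The design point is that the output is deliberately ordinal: a label like "high" claims only that the segment's word usage deviates from corpus norms, not that any personality dimension does.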

3.3 NEO-120 vs. NEO-300 Facet-Level Reliability

The participant completed both the 120-item (4 items per facet) and 300-item (10 items per facet) versions of the IPIP-NEO, administered several days apart, providing a within-person comparison of facet-level reliability across item counts. Table 5 presents the 30 facets organized by convergence.

Table 5. NEO Facet Comparison (Selected High-Convergence and High-Divergence Facets)

High convergence (delta < 5 points):

Facet Domain NEO-120 NEO-300 Delta
Trust A 69 72 3
Morality A 88 85 3
Dutifulness C 94 90 4
Intellect O 88 92 4
Liberalism O 50 50 0

High divergence (delta > 15 points):

Facet Domain NEO-120 NEO-300 Delta
Sympathy A 31 65 34
Anger N 50 22 28
Imagination O 44 70 26
Assertiveness E 50 28 22
Achievement-Striving C 38 57 19

The pattern in the high-divergence facets is consistent with a single-item artifact vulnerability at low item counts. Facets that converge well (Trust, Morality, Dutifulness, Intellect, Liberalism) tend to measure conceptually unified constructs where individual items cluster tightly around a single behavioral tendency. Facets that diverge substantially (Sympathy, Anger, Imagination, Assertiveness) tend to contain conceptually ambiguous items where the same facet label encompasses distinct behavioral tendencies (for example, the Sympathy facet includes items about both emotional responsiveness to others' suffering and about endorsing social welfare policies, which are psychologically distinct). When only four items sample this heterogeneous space, a single item can shift the score by 25 points; with ten items, the sampling covers more of the construct's breadth and produces more stable estimates.
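
The sampling argument reduces to simple arithmetic. A sketch of the worst-case single-item effect, assuming a facet score that is the 0-100-scaled mean of k equally weighted items:

```python
# Worst-case shift from a single item on a facet scored as the 0-100-scaled
# mean of k equally weighted items: one item moving across its full range
# moves the facet score by 100/k points.
def max_single_item_shift(k_items: int) -> float:
    return 100 / k_items

print(max_single_item_shift(4))   # 25.0 -> NEO-120, 4 items per facet
print(max_single_item_shift(10))  # 10.0 -> NEO-300, 10 items per facet
```

Under this model, one idiosyncratically interpreted item can move a 4-item facet by up to 25 points but a 10-item facet by only 10, which is the mechanism proposed for the high-divergence facets above.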

The complete 30-facet comparison is provided in Supplementary Table S1.

3.4 Temporal Behavioral Markers

The behavioral reference data compiled for the narrative pipeline includes quantitative metrics that changed across the two temporal arcs. Two metrics are particularly informative as potential indicators of psychologically meaningful change.

Emoji usage rate increased from 9.1% (earlier arc) to 61.5% (later arc). This nearly sevenfold increase is consistent with increased emotional expressiveness in digital communication, a shift from a communication style that relies primarily on verbal content to one that supplements verbal content with paralinguistic emotional markers. The direction of change aligns with the EVOLVED rating on the Communication dimension: the narrative pipeline's qualitative inference (movement from indirect, hedged communication to direct engagement) has a quantitative counterpart in the emoji rate trajectory.

Question rate decreased from 18.3% to 10.9%. This reduction in the proportion of messages containing questions is consistent with a shift from an indirect processing style (in which information is sought through questioning and emotions are processed through proxies) to a more declarative style (in which positions are stated directly). The narrative pipeline inferred this shift qualitatively from the content of messages; the question rate provides independent quantitative evidence that the shift occurred.
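
Both markers are straightforward to compute from a message archive. A hedged sketch (the paper does not specify its emoji-detection method; the Unicode ranges below are an illustrative approximation, not the pipeline's actual detector):

```python
import re

# Share of messages containing an emoji or a question mark. The emoji
# character ranges are an assumption, not the paper's actual detector.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def marker_rates(messages):
    """Return (emoji_rate, question_rate) as percentages of messages."""
    n = len(messages)
    emoji_rate = 100 * sum(bool(EMOJI.search(m)) for m in messages) / n
    question_rate = 100 * sum("?" in m for m in messages) / n
    return emoji_rate, question_rate

rates = marker_rates(["on my way 🚗", "sounds good", "what time?", "ok"])
print(rates)  # (25.0, 25.0): one emoji message and one question in four
```

Counting messages that contain the marker, rather than marker tokens per message, matches how the rates above are stated (a percentage of messages).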

The convergence between quantitative behavioral metrics and qualitative narrative inference provides partial evidence for RQ3: behavioral communication metrics can serve as indicators of psychologically meaningful change, in that they move in directions consistent with the psychological trajectories that the narrative pipeline constructed from the content of the messages themselves.

3.5 Citation Distribution as Inferential Gap

The narrative pipeline's cryptographic citation system provides a novel methodological tool: the ability to measure how far the model's inferences extend beyond its evidence, operationalized as citation density per passage type. A computational analysis of citation distribution across 1,279 paragraphs in four narrative files (two first-person narratives and two dual-perspective narratives spanning both arcs) classified each paragraph by content type and computed citation density per type. Table 6 presents the results.

Table 6. Citation Density by Content Type (Aggregated Across All Narratives)

Content Type Paragraphs % Cited Mean Citations/100 Words
Behavioral 635 37.6% 0.72
Emotional 372 36.0% 0.76
Psychological 168 31.5% 0.38

Behavioral passages (descriptions of events, actions, things said) are cited most frequently: 37.6% of behavioral paragraphs contain at least one citation linking a claim to a source message. Psychological passages (inferences about mechanisms, patterns, dispositions) are cited least frequently: 31.5% citation rate at 0.38 citations per 100 words, roughly half the density of behavioral and emotional passages. Emotional passages (descriptions of feelings, relational states) fall between on citation rate (36.0%), though their per-word density (0.76 citations per 100 words) is comparable to behavioral passages', suggesting an inferential distance closer to behavioral than to psychological content.

This gradient supports what might be termed the "inferential gap" concept: the measurable distance between evidence and claim, operationalized through citation density. Where the model describes what happened (behavioral content), it cites densely because the evidence is available in the message archive. Where the model infers why something happened (psychological content), it cites sparsely because the inference extends beyond what any individual message contains. The inferential gap is not a flaw in the citation system but a feature: it makes visible the degree to which the model is constructing versus reporting, providing a continuous measure of inferential reach that traditional assessment methods lack entirely.
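
The density metric itself is simple to reproduce. A sketch assuming paragraphs arrive as (content_type, word_count, citation_count) records (how paragraphs get classified by type is elided here; the record shape is an assumption):

```python
from collections import defaultdict

# Aggregate citation coverage per content type: the share of paragraphs
# with at least one citation, and citations per 100 words.
def density_by_type(paragraphs):
    words = defaultdict(int)
    cites = defaultdict(int)
    cited = defaultdict(int)
    total = defaultdict(int)
    for ctype, n_words, n_cites in paragraphs:
        words[ctype] += n_words
        cites[ctype] += n_cites
        total[ctype] += 1
        cited[ctype] += int(n_cites > 0)
    return {t: {"pct_cited": 100 * cited[t] / total[t],
                "cites_per_100w": 100 * cites[t] / words[t]}
            for t in total}

demo = density_by_type([("behavioral", 150, 2), ("behavioral", 100, 0),
                        ("psychological", 200, 1)])
print(demo["behavioral"])  # {'pct_cited': 50.0, 'cites_per_100w': 0.8}
```

Reporting both measures matters, as Table 6 shows: a content type can rank differently on paragraph coverage than on per-word density.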

3.6 Cross-Arc Attachment Consistency

The ECR-R (Fraley et al., 2000) produced an attachment profile of Anxiety: 67/100 and Avoidance: 9/100, which maps to the anxious-preoccupied attachment style in Bartholomew and Horowitz's (1991) four-category model: elevated anxiety about abandonment and rejection combined with near-floor avoidance of intimacy, producing a behavioral pattern of hypervigilance, rapid investment, and movement toward rather than away from relational partners.

The narrative pipeline, generating independently from two different message archives spanning two different relational periods approximately eight years apart, constructed the same behavioral architecture in both arcs. In the earlier arc, the narrator monitors evidence, invests rapidly, moves toward the partner, processes through proxies, and builds an elaborate internal model of the relationship's trajectory. In the later arc, separated by eight years and involving a different partner, the identical pattern emerges: rapid investment, evidence monitoring, movement toward, the same reassurance protocol built on behavioral evidence rather than verbal assurance.

This cross-arc consistency constitutes the strongest evidence in the evaluation for two reasons. First, it cannot be explained by the entanglement confounds: the model did not "know" that both arcs should share the same attachment architecture, because each arc was generated from a separate message archive. The shared pattern was constructed independently from different behavioral data. Second, it provides behavioral evidence for the stability of attachment working models, a central prediction of attachment theory (Bowlby, 1969/1982; Fraley, 2002) that has traditionally been supported by self-report continuity data rather than behavioral observation across distinct relational contexts. The narrative pipeline provides a novel form of evidence: behavioral consistency across partners and years, observed in naturally occurring communication data rather than elicited through questionnaires.

The entanglement caveat applies with reduced force here: while the same model generated both arcs, the behavioral evidence (the actual messages) is in the source archive, and the model's task was to construct a narrative faithful to that evidence. The consistency of the attachment pattern across arcs is more parsimoniously explained by consistency in the behavioral data than by the model's rendering tendencies, though the latter contribution cannot be fully excluded.

3.7 Quantitative Signal Preservation in Narrative Generation

The preceding cross-method comparison (Section 3.1) relied on a qualitative evaluation: an LLM reading both the Psyche profile and the narrative outputs, then rating convergence across ten dimensions. This section provides a quantitative complement by running Psyche's own analysis methods (LLM personality inference and Empath lexical characterization) directly on the narrative outputs, producing numeric Big Five scores that can be compared against the original corpus analysis scores. If the narrative faithfully encodes personality signal, LLM inference on the narrative should produce similar scores to LLM inference on the original messages. If it distorts the signal, the distortion pattern reveals what narrative generation amplifies, attenuates, or filters.

Table 7. LLM Big Five Inference: Original Corpus vs. Narrative Levels

Level N E O A C Words Mean Δ
Original (corpus) 51.1 28.3 88.6 36.4 61.4 101,005
v2 78 27 89 52 54 32,927 10.3
v2-early 76 26 88 49 47 24,998 11.0
v2-late 70 28 91 49 63 8,766 7.2
v3 75 25 91 56 69 23,876 11.4
v2+v3 75 26 88 50 58 40,055 8.8
v2-early+v3 78 26 87 55 53 27,002 11.6

v2 and v2-early = earlier-arc narrative (multi-year); v3 = later-arc narrative (~8 months, temporally overlapping with v2-late); v2+v3 = both full narratives; v2-early+v3 = earlier arc + later arc with temporal overlap removed. All LLM inference via Claude Opus 4.6. "Words" reports total words analyzed after chunking, not total narrative length. Original corpus: 15 chunks of ~8K words (101K of 1,473K total corpus, ~6.9% sampling ratio) drawn from 5 sources through round-robin interleaving. Narrative levels: max 5 chunks.
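
The round-robin interleaving mentioned in the note can be sketched as follows; the function name and the flat chunk lists are illustrative, not the framework's actual code:

```python
# Draw chunks from each source in turn until the budget is filled, so every
# source is represented roughly equally regardless of its total size.
def round_robin_sample(chunks_by_source: dict, max_chunks: int) -> list:
    selected, i = [], 0
    while len(selected) < max_chunks:
        progressed = False
        for source, chunks in chunks_by_source.items():
            if i < len(chunks) and len(selected) < max_chunks:
                selected.append((source, chunks[i]))
                progressed = True
        if not progressed:  # every source exhausted before the budget
            break
        i += 1
    return selected

picks = round_robin_sample({"sms": ["s1", "s2"], "academic": ["a1"]}, 3)
print([src for src, _ in picks])  # ['sms', 'academic', 'sms']
```

The equal-representation property is visible in the example: the smaller source contributes a chunk before the larger source contributes its second, which is exactly the behavior that underweights high-volume sources in the aggregate.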

Three patterns emerge from the cross-level comparison with the corrected corpus baseline, though the confounds noted below (genre difference, model self-reading) preclude strong conclusions.

Extreme traits are preserved with high fidelity. Openness (corpus: 88.6) and Extraversion (corpus: 28.3) — the two most extreme scores in the corrected profile — show the smallest deltas across all six narrative levels. Openness ranges from 87 to 91 (mean delta: 1.1), and Extraversion ranges from 25 to 28 (mean delta: 1.6). This is the cleanest finding in the analysis: the two most distinctive personality traits survive narrative transformation nearly intact, with deltas smaller than the typical test-retest measurement error of personality instruments. The corrected corpus baseline strengthens this pattern considerably — the previous analysis showed mean deltas of 3.5 and 8.0 for these dimensions.

Neuroticism is amplified by literary genre. Neuroticism (corpus: 51.1) is elevated in all narrative levels (70-78), a shift of +18.9 to +26.9 points. This pattern is now larger than the previous analysis suggested (which showed +8 to +16 against the pre-correction baseline of 62), strengthening the interpretation that narrative prose foregrounds emotional content. First-person literary memoir structurally requires emotional depth and psychological tension; the narrative pipeline amplifies what is psychologically present in the messaging data (the 2015-2019 corpus era shows N=76, see Section 3.9) but would be muted in aggregate corpus analysis across all eras.

Agreeableness is inflated by literary genre; Conscientiousness is bracketed. Agreeableness (corpus: 36.4) shifts upward in all narrative levels (49-56, delta: +12.6 to +19.6), consistent with literary prose requiring prosocial framing and relational warmth that raw messaging lacks. Conscientiousness (corpus: 61.4) now presents a different picture than the previous analysis. The narrative range (47-69) brackets the corrected corpus value, with no consistent directional shift — some narrative levels fall below (v2-early: 47, delta: -14.4) and others above (v3: 69, delta: +7.6). The old narrative analysis appeared to show uniform C inflation (+9 to +31) because it compared against the truncated baseline (38); the corrected baseline reveals this as a methodological artifact rather than a genre effect.

Cross-level consistency is tighter against the corrected baseline. Mean absolute deltas range from 7.2 to 11.6, a tighter band than the previous 10.2-15.0 range. The v2-late level shows the smallest mean delta (7.2), while v2-early+v3 shows the largest (11.6). The largest dimension-specific variance remains in Agreeableness (narrative range: 49-56 vs. corpus 36.4), consistent with literary genre effects rather than personality signal.

Empath confirms methodological limitation. Across all six narrative levels, Empath produces Big Five estimates within a narrow band (50.8 to 61.1 across all dimensions and levels, a 10.3-point spread), compared to the LLM inference range of 25 to 91. The narrative levels produce a different top-category profile than the corpus (positive_emotion and love replace negative_emotion and business as top categories), confirming that Empath responds to genre and topic rather than personality. Full Empath results are in Supplementary Table S5; the full 13-level Empath comparison is in Table S7. Following this finding, Empath was repositioned from personality inference to comparative corpus characterization using z-score ordinal labels (see Section 3.2 and Table S5).

Methodological caveats. Three limitations constrain interpretation. First, the corpus LLM analysis operated on diverse text types (academic writing, SMS, Messenger, AI conversations) while the narrative analysis operates on literary prose — a genre difference that confounds the personality comparison. The communication medium effects documented in Section 3.8 demonstrate that personality inference is sensitive to text register. Second, the narrative was authored by Opus from messaging data, meaning the model is effectively reading its own output when performing personality inference on the narrative. Both factors bias toward convergence (same model, same data lineage) while also introducing systematic distortions (literary genre effects, authorial persona bleed). Third, the corpus analysis samples approximately 101,000 of 1,473,000 corpus words (~6.9%); the narrative levels analyzed 9K-40K of their respective totals. The corrected corpus analysis removes the truncation artifact that previously affected long-form documents (academic papers were truncated from up to 52K words to 2K) and uses source-stratified round-robin sampling to ensure representation across all five source types. However, the equal-representation sampling means sources with more content (SMS: 401K, Messenger: 413K) receive similar chunk allocation as smaller sources (Claude.ai: 167K). The observed pattern — preservation of extreme traits (O, E), amplification of N, genre-inflation of A, and bracketing of C — is consistent with the hypothesis that narrative generation preserves the strongest personality signals while introducing genre-appropriate distortions. 
The differential personality experiment (Section 3.10) provides the critical disambiguation: interlocutor PoV files produce different Big Five profiles than the subject's PoV files, with interlocutor mean deltas of 10.7-12.3 compared to third-person deltas of 3.7-5.6, ruling out model-invariant output as the primary explanation for the signal preservation pattern.

3.8 Communication Medium Effects on Personality Inference

The expanded corpus analysis permits a decomposition that has no precedent in the computational personality inference literature: what happens to the same person's inferred personality when the communication medium changes? Table 8 presents LLM Big Five scores aggregated by communication medium.

Table 8. LLM Big Five by Communication Medium

Medium N E O A C Words
Formal (academic) 50.6 30.2 82.3 41.8 58.8 151K
Messaging (SMS + Messenger) 69.8 25.4 92.8 36.8 34.8 814K
AI platforms (ChatGPT + Claude) 40.0 33.8 84.9 37.1 72.0 500K

The three media produce strikingly different personality portraits of the same person:

Neuroticism: messaging (69.8) vs. AI platforms (40.0). A 30-point spread on the same individual. Messaging captures relational distress, interpersonal conflict, and emotional processing that constitutes the primary content of personal SMS and Messenger conversations. AI platform conversations are task-oriented, analytical, and emotionally regulated. The person is not more neurotic in one context; the medium makes different facets of emotional life visible. This is the register effect on personality inference: the communication context systematically determines which personality facets are expressed and therefore inferable.

Conscientiousness: AI platforms (72.0) vs. messaging (34.8). A 37-point spread. AI conversations are structured, goal-directed, and purposeful — properties that LLM personality inference reads as Conscientiousness. Messaging is spontaneous, fragmented, and contextual. The same person who meticulously plans analyses in ChatGPT conversations sends "ok sounds good" in SMS. Both are genuine behavioral expressions; neither alone captures the full dimension.

Openness: messaging (92.8) vs. formal (82.3). Surprisingly, casual messaging produces higher Openness estimates than academic writing. This likely reflects the broader intellectual range visible in unfiltered messaging — conversations span philosophy, technical debate, emotional reflection, and cultural commentary — while academic writing is domain-constrained by the conventions of formal prose. The same high Openness is expressed through both channels, but messaging reveals the range while formal writing shows the depth.

Agreeableness is the most stable dimension across media (36.8-41.8, a 5-point range), suggesting that low agreeableness expresses consistently regardless of context — the participant's directness, skepticism of consensus, and preference for honest disagreement over diplomatic smoothing are visible in formal writing, personal messaging, and AI conversations alike.

These findings have direct implications for the broader computational personality inference literature. When a study reports that "LLM personality inference achieves convergent validity of r = 0.X with self-report," the medium of the analyzed text is an unexamined moderating variable. A personality assessment based on professional emails will differ systematically from one based on text messages, not because of measurement error but because different media make different personality facets visible. The aggregate LLM corpus score used throughout this paper (Table 4) is itself a weighted average of these medium-specific signals, modulated by the round-robin chunk sampling that gives each source approximately equal representation regardless of corpus size.
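
The weighting point can be made concrete. A sketch using Table 8's Neuroticism values: word-proportional and equal-representation blends of the same medium-specific scores give different aggregates, and neither is guaranteed to reproduce the reported corpus score (51.1), since the actual round-robin chunk allocation falls between the two schemes:

```python
# Blend medium-specific scores under a given weighting scheme.
def blend(scores: dict, weights: dict) -> float:
    total = sum(weights.values())
    return sum(scores[m] * weights[m] for m in scores) / total

n_by_medium = {"formal": 50.6, "messaging": 69.8, "ai": 40.0}  # Table 8, N
word_counts = {"formal": 151, "messaging": 814, "ai": 500}     # thousands
equal = {"formal": 1, "messaging": 1, "ai": 1}

print(round(blend(n_by_medium, word_counts), 1))  # 57.7  word-proportional
print(round(blend(n_by_medium, equal), 1))        # 53.5  equal representation
```

A 4-point gap between the two schemes on a single dimension illustrates why the sampling design is itself a substantive analytic choice, not an implementation detail.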

3.9 Temporal Personality Trajectory

The corpus spans 2008 to 2026, permitting a longitudinal decomposition of personality inference across four temporal eras. Table 9 presents LLM Big Five scores by era.

Table 9. LLM Big Five by Era

Era N E O A C Primary Sources
2008-2015 42.5 54.6 63.8 33.7 35.1 Messenger
2015-2019 76.0 23.8 92.3 36.5 30.8 SMS + Messenger
2019-2024 37.0 35.3 74.7 53.8 52.8 SMS + Academic
2024-2026 50.2 28.1 88.6 41.9 57.3 All sources

Three trajectories are particularly informative:

Neuroticism: 42.5 → 76.0 → 37.0 → 50.2. A dramatic arc. The peak (76.0) coincides exactly with the 2015-2019 period that the v2 narrative covers — the period identified by the narrative pipeline as the most emotionally turbulent. This provides quantitative corpus-level evidence for what the narrative inferred qualitatively: a period of acute emotional distress followed by recovery. The current-era score (50.2) represents the stabilized aggregate across all source types.

Extraversion: 54.6 → 23.8 → 35.3 → 28.1. A marked shift from the early Messenger era (2008-2015, E=54.6) to sustained introversion across subsequent periods. This is partially confounded by medium shift (Messenger's conversational format may inflate apparent Extraversion), but the magnitude (31-point drop from era 1 to era 2) suggests real developmental change beyond medium effects alone. The medium analysis in Section 3.8 shows messaging (E=25.4) vs. AI platforms (E=33.8), an 8-point medium effect, substantially smaller than the 31-point temporal shift.

Conscientiousness: 35.1 → 30.8 → 52.8 → 57.3. A steady increase after 2019, consistent with developmental maturation and the growing presence of AI platform conversations (which inflate C, as documented in Section 3.8). The era effect and medium effect are confounded here: the post-2019 increase coincides with both developmental maturation and the introduction of AI platforms as a communication channel.

Important caveat. Source composition changes across eras. Messenger dominates early eras (95,601 words in 2008-2015), SMS and Messenger dominate the middle (509,258 words in 2015-2019), and AI conversations dominate recent eras (contributing to the 669,899 words in 2024-2026). Temporal and medium effects are therefore confounded. The per-medium analysis in Section 3.8 provides partial disambiguation: where the temporal shift exceeds the medium effect (as with the N peak and E decline), the temporal component is likely genuine. Where they align (as with the C increase), the contributions cannot be separated with the present data.

Connection to narrative. The Neuroticism trajectory (peak 2015-2019) aligns with the v2 narrative's temporal coverage and emotional intensity. The narrative pipeline independently identified this period as the most emotionally turbulent; the v2-early narrative yields N=76, essentially identical to the corpus era's N=76.0. This provides bidirectional validation: the corpus temporal analysis validates the narrative's emotional trajectory, and the narrative validates that the corpus N spike reflects genuine psychological content rather than medium artifact. The narrative does, however, retain elevated N even in periods where the corpus shows recovery (v2-late: N=70 vs. corpus era-2019-2024: N=37.0), consistent with the literary genre amplification effect documented in Section 3.7 — narrative prose foregrounds emotional content regardless of the temporal period being narrated.

A systematic temporal comparison between corpus eras and narrative levels would require date-annotating individual narrative chapters, which is beyond the scope of the present analysis. The suggestive alignment between the N peak and v2 narrative scores supports the convergent validity of both methods but falls short of rigorous temporal validation.

3.10 Differential Personality Across Narrative Perspectives

The signal preservation analysis in Section 3.7 demonstrated that narrative generation preserves extreme personality traits while introducing genre-appropriate distortions. However, that analysis operated exclusively on narratives written from the subject's first-person perspective, leaving open the possibility that the observed scores reflect a model-invariant authorial fingerprint rather than genuine personality encoding. If Opus produces similar personality profiles regardless of whose perspective the narrative adopts, the "signal preservation" interpretation is substantially weakened.

This section reports a differential experiment that tests the alternative directly. The narrative pipeline produced multiple perspective files for each arc: the subject's first-person account, the interlocutor's first-person account, a third-person omniscient account, and a dual-perspective relationship narrative. If the model encodes personality signal from the narrative content, interlocutor first-person narratives should produce different Big Five profiles than subject first-person narratives (different person, different personality). Third-person narratives about the subject should remain close to the subject's baseline (same person, different literary mode). Dual-perspective narratives should show intermediate differentiation (blended signal).

Table 10. LLM Big Five Inference: Differential Personality by Narrative Perspective

Level Perspective N E O A C Words Mean Δ
Original (corpus) Subject's writing 51.1 28.3 88.6 36.4 61.4 101,005
arc-1-interlocutor Interlocutor 1P 64.7 36.0 79.3 42.0 44.3 22,458 10.7
arc-2-interlocutor Interlocutor 1P 50.7 62.3 80.0 54.0 62.3 20,999 12.3
arc-1-third-person Third-person 63.7 26.7 89.7 35.7 64.0 18,634 3.7
arc-2-third-person Third-person 71.0 24.5 90.0 38.0 60.0 15,645 5.6
arc-1-blended Blended 65.5 24.0 90.8 40.5 55.0 27,508 6.3
arc-2-blended Blended 69.5 24.5 92.5 35.5 55.0 26,234 6.7

All LLM inference via Claude Opus 4.6 using the same personality inference prompt and chunking methodology as Tables 7 and S6. "Interlocutor 1P" = narrative written from the interlocutor's first-person perspective. "Third-person" = omniscient narrative about the subject. "Blended" = alternating perspective narrative. Arc 1 = the earlier, multi-year arc; Arc 2 = the later, eight-month arc. Mean Δ = mean absolute difference from original corpus scores across all five dimensions. Full supplementary data in Table S8.
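
The Mean Δ column is reproducible directly from the table values; a minimal sketch using the arc-1 third-person row:

```python
# Mean absolute difference between a level's Big Five scores and the
# corpus baseline, as in the Mean Δ columns of Tables 7 and 10.
def mean_abs_delta(level: dict, baseline: dict) -> float:
    return round(sum(abs(level[d] - baseline[d]) for d in baseline) / len(baseline), 1)

corpus = {"N": 51.1, "E": 28.3, "O": 88.6, "A": 36.4, "C": 61.4}
arc_1_third_person = {"N": 63.7, "E": 26.7, "O": 89.7, "A": 35.7, "C": 64.0}
print(mean_abs_delta(arc_1_third_person, corpus))  # 3.7, matching Table 10
```

The same function reproduces Table 7's column as well (e.g. the v2-late row yields 7.2), confirming that both tables use the same unweighted five-dimension mean.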

Three patterns emerge that collectively rule out the model-invariant output confound.

Interlocutor profiles diverge from the subject's baseline. The arc 1 interlocutor's first-person narrative produces a profile 10.7 points from the subject's baseline, and the arc 2 interlocutor's produces a profile 12.3 points away. The arc 2 delta exceeds every within-subject delta across the six narrative levels in Table 7 (mean delta range: 7.2-11.6), and the arc 1 delta sits near the top of that range: the interlocutor profiles are at least as far from the subject's baseline as different tellings of the subject's own story are from each other. The strongest single-dimension evidence is the arc 2 interlocutor's Extraversion: 62.3 versus the subject's 28.3, a 34-point shift that places the interlocutor above the population mean on a dimension where the subject scores near floor. This is not noise; it is a fundamentally different personality being captured.

Third-person profiles stay close to the subject's baseline. Third-person narratives about the subject produce profiles only 3.7 and 5.6 points from baseline. These narratives describe the same person as the first-person subject narratives but adopt an omniscient literary mode rather than an interior monologue. The small deltas confirm that the model is reading personality from narrative content rather than stamping a fixed profile: when the text is about the subject, the profile resembles the subject, regardless of narrative voice.

Dual-perspective profiles show intermediate differentiation. Dual-PoV narratives, which alternate between the subject's and interlocutor's perspectives, produce intermediate deltas of 6.3 and 6.7. This is consistent with a blended signal: the personality profile falls between the pure subject profile and the pure interlocutor profile, as expected if the model is averaging across two distinct personality signals within the same text.

Systematic Neuroticism elevation. Five of the six perspective levels show elevated Neuroticism relative to the corpus baseline (deltas of roughly +13 to +20); the arc-2 interlocutor narrative, at -0.4, is the lone exception. This is consistent with the literary genre effect documented in Section 3.7: first-person memoir and relationship narratives structurally foreground emotional content, inflating apparent Neuroticism across characters and perspectives. The N elevation is a property of the genre, not the person. The other four dimensions (E, O, A, C) show perspective-dependent differentiation, confirming that the model responds to character-specific content on those dimensions while the N inflation operates largely as a genre-level constant.

Implications for signal preservation. The differential experiment resolves the ambiguity noted in Section 3.7. The signal preservation pattern — extreme traits preserved, moderate traits distorted by genre effects — reflects genuine personality encoding rather than model-invariant output. If Opus had a fixed authorial fingerprint, all perspective files would produce profiles within approximately 5 points of each other. Instead, the model produces systematically different profiles depending on whose interior life the narrative constructs, with the degree of differentiation scaling with the conceptual distance between the perspective and the subject (interlocutor > dual-PoV > third-person). This is the pattern predicted by the simulator hypothesis: the model infers the latent agent from the text and produces personality-consistent output, with the specific agent determined by the narrative perspective rather than fixed in the model's weights.

3.11 Cross-Model Evaluator Analysis

Added March 2026. Sections 3.1-3.10 used a single evaluator (Opus 4.6) for all personality inference. This section examines what changes when a second evaluator (GPT-5.4) is introduced, and quantifies the contributions of evaluator choice, text register, generating model, and evaluation context to the observed personality scores.

Factorial corpus evaluation. A factorial design (5 source registers × 2 evaluator models × 3 runs each, 30 chunked runs plus 9 Opus full-context runs, 39 total) quantified systematic evaluator biases. GPT-5.4 scores higher than Opus 4.6 on every domain except Openness: Extraversion +15.3, Conscientiousness +19.2, Agreeableness +10.6, Neuroticism +8.3 (mean of five per-register means). Openness is the only domain where both models agree (mean bias -2.8). Variance decomposition across the 30 chunked runs shows evaluator is the dominant variance source for Extraversion (η²=0.77) but register dominates for Neuroticism (η²=0.84) and Openness (η²=0.78). Conscientiousness and Agreeableness are split between evaluator and register.
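The variance decomposition above reports per-factor η² values. A minimal sketch of the computation, using a one-way decomposition with illustrative placeholder scores (the paper's design is factorial, but the per-factor SS_between/SS_total logic is the same):

```python
# Eta-squared (η²) for a single factor: the between-group sum of squares
# divided by the total sum of squares. Scores below are illustrative
# placeholders, not the paper's actual evaluation runs.

def eta_squared(scores, labels):
    """η² for one factor: SS_between / SS_total."""
    grand = sum(scores) / len(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = 0.0
    for level in set(labels):
        group = [s for s, l in zip(scores, labels) if l == level]
        gmean = sum(group) / len(group)
        ss_between += len(group) * (gmean - grand) ** 2
    return ss_between / ss_total

# Toy example: Extraversion scores that split mostly by evaluator.
evaluator = ["opus"] * 3 + ["gpt"] * 3
scores = [30.0, 31.0, 29.0, 46.0, 45.0, 47.0]
print(f"evaluator η² = {eta_squared(scores, evaluator):.2f}")
```

A factor whose levels separate the scores cleanly (as evaluator does here) captures almost all of the variance; a factor whose levels are interleaved captures almost none.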

The evaluator bias is not uniform across registers. Conscientiousness is the only domain with a truly uniform bias (CV=12% across registers). Extraversion (CV=31%), Neuroticism (CV=32%), and Agreeableness (CV=52%) all show register-conditioned bias — the size of the GPT-Opus disagreement depends on what text is being evaluated. Facet-level decomposition reveals that the C bias concentrates in C6 Deliberation (+25.3) and C2 Order (+23.5), while the E bias concentrates in E1 Warmth (+18.6) and E6 Positive Emotions (+16.8). O5 Ideas shows a bias of +0.2 — effectively zero, the strongest cross-model agreement in the entire dataset.
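The uniform-versus-conditioned distinction rests on the coefficient of variation of the per-register bias. A short sketch with illustrative placeholder biases (not the paper's values) showing the contrast:

```python
# Coefficient of variation (CV) of the per-register evaluator bias: a low
# CV means the GPT-Opus disagreement is roughly constant across registers
# (a calibratable offset); a high CV means the bias is register-conditioned.
# The five bias values in each list are illustrative placeholders.

import statistics

def cv(values):
    """CV as a percentage: sample stdev relative to the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

uniform_bias = [18.0, 20.0, 19.0, 21.0, 18.5]    # roughly register-invariant
conditioned_bias = [5.0, 22.0, 14.0, 28.0, 8.0]  # register-dependent

print(f"uniform CV ≈ {cv(uniform_bias):.0f}%")
print(f"conditioned CV ≈ {cv(conditioned_bias):.0f}%")
```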

Generation confound. A 2×2 experiment ({Opus-generated, GPT-generated} × {Opus-evaluated, GPT-evaluated}) tested whether the generating model affects the personality signal. GPT-5.4 generated a parallel 14-chapter narrative from the same pipeline inputs. Under replication (2-3 runs per cell), the evaluator effect (12.4 points mean |Δ| difference) was nearly 7× the generator effect (1.8 points). The two Opus-evaluated cells converged under replication to within 0.1 points of each other (8.9 vs 8.8), confirming that generators produce indistinguishable profiles when read by the same evaluator. A formal equivalence test places the generator effect inside a ±3-point region of practical equivalence. This substantially addresses the shared-model entanglement concern raised in Section 2.4.3: the personality signal is in the text, not in the model that wrote it.
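The paper does not specify which equivalence procedure was used; one common implementation is the confidence-interval-inclusion form of TOST, sketched below with placeholder effect values and a normal approximation for the interval:

```python
# CI-inclusion form of an equivalence test: declare the generator effect
# practically negligible when the 90% confidence interval for the mean
# effect lies entirely inside the ±3-point region of practical equivalence.
# Effect values are illustrative placeholders, not the paper's actual runs.

import statistics

def equivalent(effects, margin=3.0, z90=1.645):
    """True if the 90% CI for the mean effect sits inside ±margin.

    Uses a normal-quantile half-width; with 2-3 runs per cell a t-based
    interval would be wider, so treat this as a sketch of the logic only.
    """
    mean = statistics.mean(effects)
    half = z90 * statistics.stdev(effects) / len(effects) ** 0.5
    return (mean - half > -margin) and (mean + half < margin)

generator_effects = [1.6, 1.9, 1.9]   # per-replicate generator differences
print(equivalent(generator_effects))  # CI well inside ±3
```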

Full-context anchoring ablation. Full-context evaluation (all text in a single prompt) produced worse personality inference than chunked evaluation across all three tested sources. A follow-up ablation tested whether the polarization is caused by temporal anchoring (recency/primacy effects) or text volume. Shuffling the chronological order of messages reproduced the same polarization as ordered full-context, ruling out temporal anchoring. A prompt-level intervention ("weight mundane passages as heavily as emotionally intense passages") eliminated the polarization for personal SMS (returning all domains to within 1 point of chunked baseline) but failed on academic writing and mixed registers, indicating that the SMS fix addresses attentional anchoring on emotional passages while academic full-context pathology involves a deeper failure mode.

Cross-model narrative condition rankings. GPT-5.4 evaluation of all four narrative conditions (old, 1M, filtered, long) confirmed the primary finding: both evaluators agree that the old pipeline (limited planning context) produces the worst personality fidelity. Both evaluators rank all three 1M-planned conditions as substantially better. However, the secondary rankings differ: Opus ranks filtered < long < 1M while GPT ranks long < filtered < 1M. The planning-dominance thesis (Section 3.7) is cross-model robust. The secondary findings (output length and writing context effects) are evaluator-specific.

Neuroticism genre effect is facet-localized. The +22.6-point Neuroticism uplift from corpus to narrative (Section 3.7) concentrates in three facets: N1 Anxiety (+30.8), N6 Vulnerability (+28.8), and N4 Self-consciousness (+17.4). N2 Anger (-0.4) and N5 Impulsiveness (-6.7) show no uplift or negative uplift. The literary genre inflates anxiety-coded and vulnerability-coded content specifically, not the full Neuroticism domain. This facet localization suggests targeted prompt interventions could reduce narrative N inflation without suppressing genuine anger or impulsivity signal.

Openness stability is genuine. Cross-register O domain stability (range 8.8 across 5 registers) initially appeared to be a ceiling artifact driven by O5 Ideas scoring 87-95 everywhere. Facet-level analysis falsifies this: removing O5 Ideas produces the same cross-register range (8.8 points). All O subfacets are elevated in text-based evaluation, and this elevation is stable across registers. The O stability finding from Section 3.7 is confirmed as genuine domain-level robustness rather than single-facet ceiling compression.

Temporal personality trajectory. Chronological quarter-slice analysis of the personal SMS corpus reveals genuine behavioral change: the last 25% of messages shows Extraversion 6.4 points higher and Agreeableness 4.9 points higher than the full corpus, while Neuroticism is 5.5 points lower. Despite this drift, the early-to-late profile correlation is r=0.966, indicating that the personality profile shape is highly preserved even as levels shift. This supports the pipeline's ability to extract a stable personality core from any temporal segment of the archive.
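The "levels shift, shape preserved" distinction is operationalized as the Pearson correlation between two five-dimension profiles. A minimal sketch with illustrative placeholder profiles (dimension order N, E, O, A, C):

```python
# Profile-shape preservation as the Pearson correlation between an early
# and a late five-dimension profile. Level drift moves individual scores;
# a high r means the relative ordering of traits survives the drift.
# Both profiles below are illustrative placeholders.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

early = [55.0, 25.0, 88.0, 35.0, 60.0]   # first-quarter profile
late = [49.5, 31.4, 87.0, 39.9, 61.0]    # last quarter: levels drift
print(f"r = {pearson_r(early, late):.3f}")
```

Despite the per-dimension drift, r stays near 1 because the profile shape (low E, very high O, moderate N/A/C) is unchanged.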

Cross-register transportability. Leave-one-register-out profile correlation averages r=0.861 (HIGH), indicating that the same latent personality profile is detectable across personal SMS, academic writing, casual messaging, and AI conversations. Additive evaluator calibration (fitting per-domain bias from 4 registers, validating on the 5th) achieves held-out RMSE of 3.6 points, confirming that GPT scores can be approximately mapped to the Opus scale. However, bias-corrected GPT rankings still do not match Opus rankings for narrative conditions, with residual interactions concentrated in Agreeableness (up to 12.8 points) and Conscientiousness (up to 8.9 points) — these domains show genuine evaluator × condition interactions beyond additive bias.
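The additive calibration can be sketched as a leave-one-register-out loop: fit a per-domain bias on four registers, subtract it from the held-out register's GPT score, and score the residual. Values below are illustrative placeholders, not the paper's scores:

```python
# Leave-one-register-out additive calibration for a single domain:
# the bias is the mean GPT-minus-Opus difference on the training registers,
# applied to the held-out register. All scores are illustrative placeholders.

def heldout_error(gpt, opus, holdout):
    """Absolute calibration error on the held-out register."""
    train = [r for r in gpt if r != holdout]
    bias = sum(gpt[r] - opus[r] for r in train) / len(train)
    corrected = gpt[holdout] - bias
    return abs(corrected - opus[holdout])

gpt_e =  {"sms": 45.0, "academic": 38.0, "casual": 50.0, "ai": 42.0, "narrative": 55.0}
opus_e = {"sms": 30.0, "academic": 24.0, "casual": 34.0, "ai": 27.0, "narrative": 41.0}

errors = [heldout_error(gpt_e, opus_e, r) for r in gpt_e]
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
print(f"held-out RMSE = {rmse:.1f}")
```

Because the placeholder biases are nearly constant across registers, the held-out RMSE is small; register-conditioned bias (the Agreeableness and Conscientiousness interactions noted above) is exactly what this additive correction cannot remove.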

Differential personality replication. The interlocutor perspective from Section 3.10 was re-evaluated with 3 runs. The key marker — Extraversion — replicated from 62.3 to 64.0 (within 1.7 points), and the 35-point gap between interlocutor E and subject E survived replication decisively. The model genuinely infers a different person's personality from their narrative perspective, strengthening the signal-preservation interpretation from Section 3.10.

4. Discussion

4.1 Summary of Contributions

Eleven analytical contributions emerge from the cross-method comparison. First, per-method Big Five divergence reveals systematic self-enhancement patterns that replicate classical findings (Paulhus, 1984) using AI text inference as the comparison method. The corrected corpus analysis sharpens this finding: Agreeableness and Openness show robust self-enhancement/deflation that survived methodological correction, while Conscientiousness and Neuroticism divergence was partially an artifact of corpus truncation — a distinction that itself demonstrates the sensitivity of AI-mediated assessment to corpus composition. Second, within-person comparison of 4-item and 10-item NEO facet measurement identifies specific facets vulnerable to single-item artifacts, with divergence concentrated in facets whose items span conceptually distinct behavioral tendencies. Third, temporal behavioral metrics (emoji rate, question rate) move in directions consistent with the psychological trajectories that the narrative pipeline inferred from message content, providing quantitative evidence for qualitatively inferred change. Fourth, citation density varies systematically with content type, with psychological inferences citing less densely than behavioral descriptions, operationalizing the concept of an "inferential gap" that makes visible the model's distance from its evidence. Fifth, attachment architecture remains consistent across two arcs separated by eight years and involving different relational partners, providing behavioral evidence for attachment stability from naturally occurring communication data.
Sixth, quantitative analysis of personality signal preservation across six narrative analysis levels demonstrates that narrative generation preserves extreme personality traits (Openness and Extraversion) with high fidelity, amplifies Neuroticism through literary genre effects, inflates Agreeableness through prosocial framing, and brackets Conscientiousness around the corrected corpus value — a cleaner interpretive pattern than the previous "moderate traits regress to mean" framing. Seventh, communication medium effects demonstrate that the same person's inferred personality varies by up to 37 points depending on the text register analyzed (Section 3.8), establishing that the medium of analyzed text is an unexamined moderating variable in the computational personality inference literature. Eighth, temporal personality trajectory from longitudinal corpus analysis reveals a Neuroticism peak (76.0 in 2015-2019) that aligns with the narrative pipeline's independently inferred emotional arc, providing bidirectional validation between corpus-level and narrative-level personality inference (Section 3.9). Ninth, the Empath finding was operationalized: lexical frequency analysis was removed from personality synthesis and repositioned as comparative corpus characterization using ordinal labels, demonstrating how negative findings from multi-method assessment can improve framework validity through method exclusion rather than mere documentation. Tenth, the differential personality experiment (Section 3.10) demonstrates that narrative perspective determines the inferred personality profile: interlocutor first-person narratives diverge from the subject's baseline by 10.7-12.3 mean absolute points while third-person narratives about the subject stay within 3.7-5.6 points, ruling out model-invariant output and confirming that the signal preservation pattern documented in Section 3.7 reflects genuine personality encoding. 
Eleventh, a cross-model evaluator analysis (Section 3.11) using a 39-run factorial design across five source registers and two evaluator models, plus a replicated 2×2 generation confound experiment, demonstrates that the evaluator effect (12.4 points) is nearly 7× the generator effect (1.8 points, formally negligible), that evaluator biases are systematic and partially calibratable (RMSE=3.6), that full-context evaluation polarizes personality inference through text-volume anchoring (addressable by prompt intervention for personal SMS but not for academic text), and that the planning-dominance finding from Section 3.7 holds under cross-model evaluation while the secondary condition rankings remain evaluator-specific.

4.2 Alternative Interpretations

Five alternative explanations for the observed convergence merit consideration, each with counter-evidence and residual uncertainty.

Opus coherence drive. The model may impose coherent personality structure on noisy behavioral data because coherence is what literary narrative demands. If the model gravitates toward well-organized personality portraits regardless of the underlying evidence, convergence with psychometric measurement is partly artifactual. Counter-evidence: the convergence is non-uniform, and the differential personality experiment (Section 3.10) provides direct disconfirmation. If coherence drive were the primary explanation, convergence should be equally strong across all dimensions, and all narrative perspectives should produce similar profiles. Instead, five dimensions are CONSISTENT, three are EVOLVED, and two are PARTIAL, which suggests the model is responding to dimension-specific evidence. More decisively, interlocutor first-person narratives produce profiles 10.7-12.3 points from the subject's baseline while third-person narratives about the subject stay within 3.7-5.6 points — the model produces systematically different personality portraits depending on whose perspective is adopted, which is incompatible with a fixed coherence template. Residual uncertainty: the model may still contribute some coherence within each perspective; the question is degree rather than kind.

Barnum effect. The personality descriptions may be sufficiently vague to apply to anyone, producing apparent convergence through generality rather than precision. Counter-evidence: several scores are at extreme values that eliminate the Barnum interpretation. Interest Consistency at 0/100 (floor), Investigative vocational interest at 100/100 (ceiling), Avoidance at 9/100 (near-floor), Conformity values at 9/100 (near-floor). These are not descriptions that apply to most people. The narrative pipeline's inferences are similarly specific (a particular regulation sequence, a particular attachment architecture) rather than generically applicable.

Observer confirmation bias. The participant, serving as both subject and evaluator, may unconsciously interpret ambiguous narrative passages as confirming psychometric results. Counter-evidence: the evaluation produced heterogeneous ratings including EVOLVED and PARTIAL categories that document genuine complexity a confirmation-seeking evaluation would smooth over. Residual uncertainty: the evaluation was not blinded, and no inter-rater reliability data are available. This is probably the most serious methodological limitation.

Corpus overlap. Psyche's LLM analysis drew on a writing corpus that may overlap with the narrative pipeline's source data, producing convergence through shared input rather than independent inference. Counter-evidence: the behavioral texting metrics (emoji rates, question frequencies) are purely quantitative and are not in Psyche's corpus, yet they move in directions consistent with both the narrative inference and the psychometric measurement. The attachment consistency finding (Section 3.6) derives from two separate message archives that were processed independently. Residual uncertainty: the degree of corpus overlap is uncertain but nonzero.

Talented observer alternative. A human psychologist with access to the same messaging data might also infer accurate personality characteristics, which would mean the finding is about the data rather than the model. This is not a threat to the conclusions: it strengthens the claim that digital communication data contains psychological signal. The contribution is not that AI can do something humans cannot, but that the data contains more signal than was previously assumed and that the extraction method determines accessibility.

4.3 Extraction Method Matters More Than Data Source

The same 267MB messaging corpus that defeated a fine-tuning approach (a statistical pattern-matching method that learned "ok sounds good" rather than personality) yielded narrative convergence with psychometric measurement when processed through literary inference. The fine-tuned model extracted logistics. The literary inference approach extracted psychology: attachment architecture, emotion regulation patterns, decision-making mechanisms. Both approaches operated on the same data. The difference is method, not input.

This finding has implications for the broader computational personality inference literature, which has focused primarily on the relationship between data source and prediction accuracy (what kind of digital trace predicts which traits) rather than on the relationship between extraction method and the type of psychological information recovered. The present study suggests that the extraction method is the more important variable: the same data can yield superficial statistical patterns or deep psychological inference depending on how it is processed. This aligns with the simulator hypothesis prediction: a model that must infer the latent author to predict the next token has access to a richer representation of the communicator than a model trained only to reproduce surface patterns.

4.4 What Narrative Inference Captures That Instruments Cannot

The narrative pipeline's most distinctive contribution is the capture of psychological dimensions that existing instruments are not designed to measure.

Temporal dynamics. Psyche provides a snapshot of personality at a single measurement point. The narrative pipeline shows the same trait architecture producing different behavioral outputs across years. The hedging-to-directness trajectory in communication, the deliberation-to-action compression in decision-making, the deficit-to-integration sequence in self-concept: these are personality in motion, and no standard instrument captures trajectory.

Failure modes under load. When the analytical framework collapses, when suppression fails, when the dual-track regulation system short-circuits: these are personality under maximum stress. Instruments that rely on self-report measure typical behavior. Narratives show what happens when typical behavior is no longer available. The regulatory architecture revealed under distress (interview evidence of "wounded animal" crisis mode, momentum-based recovery without plan, the "inexhaustible drive" that persists when cognitive resources are exhausted) has no instrument counterpart.

Interaction signatures. How personality manifests differently across specific relational contexts: not averages across situations, but behavioral fingerprints specific to each relationship. The messaging archive contains the raw material for these signatures; the narrative pipeline constructed them.

Sublinguistic markers. Emoji rates, ellipsis patterns, message length distributions, timing between messages. These are not personality traits. They are behavioral traces that encode personality in ways self-report instruments cannot access, because the subject is not aware of producing them and the instruments do not ask about them.

4.5 What Instruments Capture That Narrative Cannot

The complementarity is genuine. Psyche captures dimensions that the narrative pipeline structurally cannot reach.

Quantified dimensionality. Moderate attachment anxiety is not the same as extreme attachment anxiety. Psyche can distinguish them; the narrative shows "anxious" without specifying degree.

Dimensions outside relationships. Interest cycling, vocational orientation, cognitive phenomenology, empathy decomposition: constructs that do not surface in text message data but are captured by instruments designed for the purpose. The PARTIAL ratings on Motivation and Flow States reflect this structural limitation.

The things people will not say. Instruments relying on self-report ask directly about tendencies the subject might never articulate in conversation. The narrative pipeline can only work with what was said to someone. The instruments work with what was said to the instrument.

4.6 The Entanglement Problem as Theoretical Challenge

The shared-model confound in this study points to a broader challenge for AI-mediated personality assessment. Campbell and Fiske's (1959) multi-trait multi-method matrix assumes that different methods provide independent information about the same construct. When both methods are mediated by the same AI model, this independence assumption is violated in a way that traditional MTMM analysis cannot accommodate. The model's rendering tendencies constitute a method factor that correlates across both assessments, inflating apparent convergent validity.

A truly independent test would require: a profile generated by Method A from Corpus X, tested against observations from Method B applied to Corpus Y, where X and Y share no common data and Methods A and B share no common model. The closest achievable version: administer psychometric instruments to a participant, then have a different model (not Claude) generate behavioral predictions from a corpus Psyche never saw (for example, work communications rather than personal writing), and compare. Even then, the human subject is the common factor. True independence is methodologically impossible when studying a single person. The present study acknowledges this limitation and interprets convergence accordingly, weighting divergence more heavily than agreement.

4.7 Privacy Implications

If a language model can infer personality from messaging data that converges with psychometric testing (and the evidence suggests it can, at least partially, for dimensions that messaging data captures), then the digital exhaust people generate incidentally contains more psychological signal at the corpus level than individual messages suggest. Every messaging platform holds a latent personality profile of its users. The data people generate by coordinating dinner plans and discussing logistics (the content that statistical analysis dismissed as psychologically shallow) encodes attachment patterns, regulation strategies, and interpersonal architecture that a sufficiently capable model can decode, not at the level of individual messages but at the level of patterns across thousands of messages.

This has practical implications for data governance and privacy regulation, particularly as language models become more capable and more widely deployed. The distinction between "content" (what people say) and "metadata" (how they say it) becomes less meaningful when a model can extract psychological content from metadata patterns. The emoji rate, the question frequency, the timing between messages: these are metadata-level features that encode personality-level information.

5. Limitations

Several limitations constrain the interpretation of these findings.

N=1. The study examines one individual. No claims about population-level generalizability are made or implied. The findings demonstrate that cross-method convergence is possible for this participant and this dataset, not that it is typical or expected across individuals.

Autoethnographic dual role. The participant serves as both subject and evaluator, introducing confirmation risk that procedural safeguards (heterogeneous ratings, explicit alternative explanations) mitigate but cannot eliminate. No inter-rater reliability data are available. An independent evaluator with access to both the Psyche profile and the narrative outputs would provide a stronger test.

No external behavioral validation. Neither the psychometric instruments nor the narrative pipeline are validated against behavioral observation by independent third parties. Both are validated against each other, which is informative about convergence but not about accuracy. A validation design involving behavioral prediction (generating predictions from one method and testing them against observed behavior) would provide stronger evidence. Such a design is documented in the project's planning files but has not been implemented.

Shared model confound. Claude Opus generated both the Psyche synthesis report and the narrative pipeline outputs. The degree to which convergence reflects the model's rendering tendencies rather than genuine personality signal cannot be determined from the present data. Future studies should employ different models for each method.

Temporal gap. The narrative spans 2017 to 2026; Psyche was administered in February 2026. The participant changed during this period, as documented by the EVOLVED ratings. The comparison treats this as a feature (the ability to detect temporal dynamics) rather than a confound, but it complicates the interpretation of any single dimension's rating.

No normative data. The citation density analysis and behavioral metrics are analyzed descriptively without normative comparison data. Whether the observed patterns generalize to other messaging corpora remains unknown.

Communication medium confound. The aggregate LLM corpus score used throughout this paper (Table 4) represents an average across five communication media that produce dramatically different personality portraits of the same person (Section 3.8). The round-robin chunk sampling gives each source approximately equal representation regardless of corpus size, meaning SMS (401K words) and Messenger (413K words) receive similar chunk allocation to Claude.ai (167K words). This equal-representation design is defensible for capturing the full range of personality expression but means the aggregate score is sensitive to which sources are included — the addition or removal of a single source type could shift dimension scores substantially, as demonstrated by the Conscientiousness shift from 38 to 61.4 when academic writing was properly included.
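The round-robin sampler is not shown in the paper; one plausible realization of "each source gets approximately equal chunk allocation regardless of corpus size" is the interleaving sketch below, with illustrative source sizes:

```python
# One way to implement the round-robin chunk sampling described above:
# sources take turns contributing their next chunk until the chunk budget
# is spent, so each source gets roughly equal representation regardless of
# its corpus size. Source names and sizes are illustrative placeholders.

def round_robin_sample(chunks_by_source, budget_chunks):
    """Interleave one chunk per source per pass until the budget is reached."""
    sampled = []
    iters = {src: iter(chunks) for src, chunks in chunks_by_source.items()}
    while iters and len(sampled) < budget_chunks:
        for src in list(iters):
            try:
                sampled.append((src, next(iters[src])))
            except StopIteration:
                del iters[src]   # exhausted source drops out of the rotation
            if len(sampled) >= budget_chunks:
                break
    return sampled

chunks = {"sms": range(200), "messenger": range(206), "claude_ai": range(83)}
sample = round_robin_sample(chunks, 12)
per_source = {s: sum(1 for src, _ in sample if src == s) for s in chunks}
print(per_source)   # equal allocation despite unequal corpus sizes
```

This design makes the aggregate score a register average rather than a word-weighted average, which is precisely why adding or removing a source can shift dimension scores.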

Corpus sampling. The LLM corpus analysis samples approximately 101,000 of 1,473,000 corpus words (~6.9%). The corrected analysis removes a truncation artifact that previously discarded content from long documents (academic papers were truncated from up to 52,744 words to 2,000) and uses source-stratified sampling to ensure representation across all five source types. However, any sampling-based approach introduces variance: the specific segments selected for analysis may not be representative of each source's full personality signal.

6. Conclusion

Two AI-mediated approaches to personality assessment, one explicit and one implicit, converge more than they diverge when applied to the same person's digital communication data. The convergence is partially artifactual (shared model, overlapping corpus). The divergence is informative (scope gaps in messaging data, temporal evolution). Neither approach alone is sufficient. The convergence that occurs is consistent with the simulator hypothesis prediction that language models trained on next-token prediction learn to infer coherent personas from text, and that these inferred personas capture genuine characteristics of the individuals who produced the text.

The most defensible claim is this: digital communication data contains more psychological signal at the corpus level than individual messages suggest, and the extraction method determines whether that signal is accessible. The same messaging archive that yielded "ok sounds good" under statistical pattern matching yielded convergent personality inference under literary generation.

The most speculative claim, offered with appropriate hedging, is that the combination of psychometric snapshots and narrative trajectories may eventually capture dimensions of personality that neither method reaches independently. Instruments measure with precision but miss temporal dynamics, failure modes, and sublinguistic markers. Narratives capture these but lack quantified dimensionality and access to non-relational constructs. A mature assessment methodology might use both: instruments to anchor dimensional scores, narratives to trace trajectories and reveal interaction signatures.

The practical implication is uncomfortable. If tens of thousands of text messages contain enough signal for a language model to infer attachment architecture, emotion regulation strategy, and decision-making patterns that converge with validated psychometric instruments, then the messaging data people generate incidentally is more psychologically revealing than most people assume. The messages do not contain the person. But they contain more of the person than the person thinks.

References

Allport, G. W. (1937). Personality: A psychological interpretation. Henry Holt.

Andreas, J. (2022). Language models as agent models. Findings of the Association for Computational Linguistics: EMNLP 2022, 5769-5779.

Ashton, M. C., and Lee, K. (2009). The HEXACO-60: A short measure of the major dimensions of personality. Journal of Personality Assessment, 91(4), 340-345.

Bartholomew, K., and Horowitz, L. M. (1991). Attachment styles among young adults: A test of a four-category model. Journal of Personality and Social Psychology, 61(2), 226-244.

Bowlby, J. (1969/1982). Attachment and loss: Vol. 1. Attachment (2nd ed.). Basic Books.

Cacioppo, J. T., Petty, R. E., and Kao, C. F. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48(3), 306-307.

Campbell, D. T., and Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105.

Chen, B., Vansteenkiste, M., Beyers, W., Boone, L., Deci, E. L., Van der Kaap-Deeder, J., ... and Verstuyf, J. (2015). Basic psychological need satisfaction, need frustration, and need strength across four cultures. Motivation and Emotion, 39(2), 216-236.

Davis, M. H. (1983). Measuring individual differences in empathy: Evidence for a multidimensional approach. Journal of Personality and Social Psychology, 44(1), 113-126.

Duckworth, A. L., and Quinn, P. D. (2009). Development and validation of the Short Grit Scale (Grit-S). Journal of Personality Assessment, 91(2), 166-174.

Dunning, D., Heath, C., and Suls, J. M. (2004). Flawed self-assessment: Implications for health, education, and the workplace. Psychological Science in the Public Interest, 5(3), 69-106.

Ellis, C., Adams, T. E., and Bochner, A. P. (2011). Autoethnography: An overview. Historical Social Research, 36(4), 273-290.

Fast, E., Chen, B., and Bernstein, M. S. (2016). Empath: Understanding topic signals in large-scale text. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 4647-4657.

Fraley, R. C. (2002). Attachment stability from infancy to adulthood: Meta-analysis and dynamic modeling of developmental mechanisms. Personality and Social Psychology Review, 6(2), 123-151.

Fraley, R. C., Waller, N. G., and Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78(2), 350-365.

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25-42.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., and Gough, H. G. (2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality, 40(1), 84-96.

Gross, J. J., and John, O. P. (2003). Individual differences in two emotion regulation processes: Implications for affect, relationships, and well-being. Journal of Personality and Social Psychology, 85(2), 348-362.

Hayano, D. M. (1979). Auto-ethnography: Paradigms, problems, and prospects. Human Organization, 38(1), 99-104.

Holland, J. L. (1997). Making vocational choices: A theory of vocational personalities and work environments (3rd ed.). Psychological Assessment Resources.

janus. (2022). Simulators. LessWrong. https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators

Jones, D. N., and Paulhus, D. L. (2014). Introducing the Short Dark Triad (SD3): A brief measure of dark personality traits. Assessment, 21(1), 28-41.

Kosinski, M., Stillwell, D., and Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802-5805.

Kovaleva, A., Beierlein, C., Kemper, C. J., and Rammstedt, B. (2012). Eine Kurzskala zur Messung von Kontrollüberzeugung: Die Skala Internale-Externale-Kontrollüberzeugung-4 (IE-4) [A short scale for measuring locus of control: The Internal-External Locus of Control-4 (IE-4) scale]. GESIS Working Papers.

Kroenke, K., Spitzer, R. L., and Williams, J. B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606-613.

Lamiell, J. T. (1981). Toward an idiothetic psychology of personality. American Psychologist, 36(3), 276-289.

Marks, S., Lindsey, J., and Olah, C. (2026). The persona selection model: Why AI assistants might behave like humans. Anthropic Alignment Science Blog. https://alignment.anthropic.com/2026/psm/

Molenaar, P. C. M. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement, 2(4), 201-218.

nostalgebraist. (2025). The void. AI Alignment Forum. https://www.alignmentforum.org/posts/3EzbtNLdcnZe8og8b/the-void-1

Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., ... and Seligman, M. E. P. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934-952.

Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46(3), 598-609.

Paulhus, D. L., and Reid, D. B. (1991). Enhancement and denial in socially desirable responding. Journal of Personality and Social Psychology, 60(2), 307-317.

Peters, H., Cerf, M., and Matz, S. C. (2024). Large language models can infer psychological dispositions of social media users. Proceedings of the National Academy of Sciences, 121(43).

Rosenberg, M. (1965). Society and the adolescent self-image. Princeton University Press.

Shanahan, M., McDonell, K., and Reynolds, L. (2023). Role play with large language models. Nature, 623, 493-498.

Snyder, M. (1974). Self-monitoring of expressive behavior. Journal of Personality and Social Psychology, 30(4), 526-537.

Spitzer, R. L., Kroenke, K., Williams, J. B. W., and Löwe, B. (2006). A brief measure for assessing generalized anxiety disorder: The GAD-7. Archives of Internal Medicine, 166(10), 1092-1097.

Toplak, M. E., West, R. F., and Stanovich, K. E. (2014). Assessing miserly information processing: An expansion of the Cognitive Reflection Test. Thinking and Reasoning, 20(2), 147-168.

Supplementary Materials

Table S1. Complete 30-Facet NEO Comparison

Domain Facet NEO-120 (4 items) NEO-300 (10 items) Average Delta
A Trust 69 72 71 3
A Morality 88 85 86 3
A Altruism 62 57 60 5
A Cooperation 81 85 83 4
A Modesty 56 65 61 9
A Sympathy 31 65 48 34
C Self-Efficacy 56 62 59 6
C Orderliness 25 20 22 5
C Dutifulness 94 90 92 4
C Achievement-Striving 38 57 48 19
C Self-Discipline 50 55 52 5
C Cautiousness 69 72 71 3
E Friendliness 12 18 15 6
E Gregariousness 19 35 27 16
E Assertiveness 50 28 39 22
E Activity Level 50 62 56 12
E Excitement-Seeking 56 52 54 4
E Cheerfulness 25 32 29 7
N Anxiety 62 45 54 17
N Anger 50 22 36 28
N Depression 38 45 41 7
N Self-Consciousness 81 75 78 6
N Immoderation 62 52 58 10
N Vulnerability 50 45 48 5
O Imagination 44 70 57 26
O Artistic Interests 75 82 79 7
O Emotionality 50 57 54 7
O Adventurousness 56 75 66 19
O Intellect 88 92 90 4
O Liberalism 50 50 50 0

Table S2. Per-Method Big Five Estimates (Full Data)

Dimension Self-Report LLM (Claude) Interview Empath Merged (Weighted) Divergence
Openness 66 88.6 93 58 77.6 14.8
Conscientiousness 57 61.4 35 55 52.4 10.1
Extraversion 37 28.3 15 54 31.0 14.2
Agreeableness 68 36.4 30 52 49.0 14.7
Neuroticism 52 51.1 68 52 55.8 7.1

Divergence = population standard deviation across the four per-method estimates (self-report, LLM, interview, and Empath).

Table S3. Citation Density Methodology

Citation density analysis was performed computationally on four narrative files (two first-person narratives and two dual-perspective narratives spanning both temporal arcs), totaling 1,279 paragraphs. Paragraphs were classified by content type using keyword-density heuristics with three categories (behavioral, psychological, emotional) and a residual "other" category. Classification was performed algorithmically; no manual coding was employed. Citations were identified by the {FN:hexid} pattern. Citation density was computed as citations per 100 words per paragraph, then aggregated by content type. The analysis script is available in the project repository at research/voice-clone/scripts/analyze/citation_density.py.
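The computation described above can be sketched in a few lines. The regex follows the {FN:hexid} pattern stated in this table; the function name and example paragraph are illustrative, not the actual API of citation_density.py.

```python
import re

# Illustrative sketch of the citation-density metric: citations per 100 words
# per paragraph, with citations marked by the {FN:hexid} pattern.
FN_PATTERN = re.compile(r"\{FN:[0-9a-f]+\}")

def citation_density(paragraph: str) -> float:
    """Citations per 100 words for a single paragraph."""
    citations = len(FN_PATTERN.findall(paragraph))
    words = len(FN_PATTERN.sub("", paragraph).split())
    return 100.0 * citations / words if words else 0.0

para = ("She texted back at 2am {FN:9f3a21} and then "
        "went dark for three days {FN:b07c44}")
print(round(citation_density(para), 1))
```

Aggregating this per-paragraph figure by content type (behavioral, psychological, emotional) yields the density gradients reported in the main text.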

Table S4. Empath Methodology Root Cause Analysis

The Empath-to-Big-Five mapping defines weighted category associations for each domain. Of the categories specified in the mapping, six do not exist in Empath's lexicon: travel, adventure, and curiosity (mapped to Openness), helping and dominance (mapped to Agreeableness), and planning (mapped to Conscientiousness). This reduces effective category coverage to 7/10 for Openness, 8/10 for Agreeableness, and 7/10 for Conscientiousness, versus 11/11 for Neuroticism and 10/10 for Extraversion.

The calibration formula normalized = 50 + raw_weighted_score * 2000 was designed to center scores at the population midpoint and amplify small raw differences into a 0-100 scale. In practice, the raw weighted score range across all five domains on the expanded corpus is 0.0016 (from A=0.001000 to E/O=0.002600), which the formula maps to a 3.2-point calibrated range (52.0 to 55.2). The expanded corpus actually reduces the Empath differentiation compared to the original 3-source analysis (which produced a 5.8-point range), because the additional messaging sources dilute the academic categories that previously drove the Openness estimate upward. Meaningful personality differentiation would require either larger raw differences or a more aggressive scaling function.
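As a minimal sketch, the quoted calibration formula reproduces the compression on the observed raw range; the function name is illustrative, not the pipeline's actual API.

```python
# Sketch of the Empath calibration formula quoted above:
# normalized = 50 + raw_weighted_score * 2000.
def calibrate(raw_weighted_score: float) -> float:
    """Map a raw Empath weighted score onto the 0-100 personality scale."""
    return 50 + raw_weighted_score * 2000

# Raw range observed on the expanded corpus: A = 0.001000 up to E/O = 0.002600.
low, high = 0.001000, 0.002600
print(calibrate(low))                                # lowest calibrated score
print(round(calibrate(high), 1))                     # highest calibrated score
print(round(calibrate(high) - calibrate(low), 1))    # width of calibrated range
```

The 0.0016 raw spread maps to a 3.2-point calibrated spread, which is the compression-to-midpoint effect discussed above.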

Top-20 Empath Category Comparison: Original Corpus vs. Narrative Corpus

Rank Original Corpus (1,473K words) Score Narrative Corpus (57K words) Score
1 negative_emotion 0.0076 positive_emotion 0.0114
2 speaking 0.0067 speaking 0.0114
3 communication 0.0067 communication 0.0098
4 positive_emotion 0.0051 love 0.0084
5 strength 0.0049 friends 0.0081
6 business 0.0044 optimism 0.0074
7 optimism 0.0041 trust 0.0054
8 friends 0.0040 affection 0.0052
9 violence 0.0040 sadness 0.0051
10 school 0.0039 children 0.0050
11 love 0.0038 negative_emotion 0.0048
12 giving 0.0036 listen 0.0045
13 reading 0.0036 wedding 0.0044
14 technology 0.0036 suffering 0.0044
15 work 0.0034 nervousness 0.0037
16 pain 0.0034 party 0.0036
17 internet 0.0033 emotional 0.0036
18 programming 0.0032 pain 0.0033
19 children 0.0032 shame 0.0033
20 appearance 0.0032 meeting 0.0033

Overlap: 9 of 20 categories shared (speaking, communication, positive_emotion, negative_emotion, optimism, love, friends, pain, children). The expanded corpus shifts the original profile: negative_emotion replaces science at #1, reflecting the addition of SMS and Messenger data (high relational/emotional content). The previous corpus was dominated by intellectual/academic categories (science, school, philosophy); the expanded corpus mixes conversational categories (speaking, communication, friends) with emotional ones (negative_emotion, pain, violence). The narrative corpus remains dominated by interpersonal/emotional categories (love, affection, sadness, suffering). This genre sensitivity persists regardless of corpus expansion.

Table S5. Empath Big Five Estimates Across Narrative Levels

Level N E O A C Words
Original (corpus) 53.6 55.2 55.2 52.0 53.3 1,443,892
v2 54.3 60.2 53.6 53.2 51.2 32,881
v2-early 54.6 59.9 53.9 53.3 51.3 26,683
v2-late 54.0 61.1 53.2 53.1 50.8 8,753
v3 52.7 59.6 54.4 53.1 51.4 23,843
v2+v3 53.6 60.0 54.0 53.2 51.3 56,724
v2-early+v3 53.7 59.7 54.1 53.2 51.3 50,526

All scores computed via Empath lexical category analysis with the standard Big Five mapping and calibration formula (50 + raw * 2000). The total range across all 35 cells (7 levels x 5 dimensions) is 10.3 points (50.8 to 61.1), compared to the LLM inference range of 66 points (25 to 91) over the same dimensions and levels. The expanded corpus (1.47M words across 5 sources) produces slightly higher Empath scores for N and E than the previous 527K-word corpus, reflecting the addition of messaging sources with higher emotional and interpersonal content. The narrative levels show slightly higher Extraversion estimates (59.6-61.1) than the original corpus (55.2), reflecting the higher frequency of interpersonal and emotional categories in literary prose. Openness is now similar between corpus (55.2) and narrative (53.2-54.4) Empath estimates, both compressed near the midpoint. These 0-100 estimates demonstrate the compression-to-midpoint effect that motivated Empath's removal from profile synthesis. The current framework replaces these estimates with z-score ordinal labels for comparative corpus characterization.

Table S6. LLM Big Five Estimates Across Narrative Levels

Level N E O A C Words Mean Δ
Original (corpus) 51.1 28.3 88.6 36.4 61.4 101,005
v2 78 27 89 52 54 32,927 10.3
v2-early 76 26 88 49 47 24,998 11.0
v2-late 70 28 91 49 63 8,766 7.2
v3 75 25 91 56 69 23,876 11.4
v2+v3 75 26 88 50 58 40,055 8.8
v2-early+v3 78 26 87 55 53 27,002 11.6

All LLM inference via Claude Opus 4.6 using the same personality inference prompt and chunking methodology as the original corpus analysis. Scores are 0-100, with 50 as population mean. Mean Δ = mean absolute difference from original corpus scores across all five dimensions. "Words" reports the total analyzed after chunking, not total narrative length. Original corpus: 15 chunks of ~8K words (101K of 1,473K total, ~6.9% sampling ratio) drawn from 5 sources through round-robin interleaving. Narrative levels: max 5 chunks. The corrected pattern — preservation of extreme traits (O, E), amplification of N (literary genre effect), genre-inflation of A, and bracketing of C — is discussed in Section 3.7. The differential personality experiment (Section 3.10, Table S8) confirms that this pattern reflects genuine signal preservation: interlocutor narratives produce systematically different profiles while third-person narratives about the subject stay close to baseline.

Table S7. Full 13-Level LLM and Empath Comparison

Level LLM N LLM E LLM O LLM A LLM C Empath N Empath E Empath O Empath A Empath C Corpus Words LLM Words
full 51.1 28.3 88.6 36.4 61.4 53.6 55.2 55.2 52.0 53.3 1,443,892 101,005
academic 50.6 30.2 82.3 41.8 58.8 52.6 53.4 57.8 51.8 54.4 150,980 80,180
sms 66.6 26.3 92.0 37.7 36.2 54.2 56.5 54.5 52.4 53.0 401,252 81,850
messenger 69.7 25.4 85.7 36.5 35.4 55.0 56.2 54.4 51.7 53.1 413,072 88,087
chatgpt 39.0 41.8 70.0 42.0 70.6 52.0 53.3 55.7 51.8 53.2 333,032 80,190
claude_ai 52.8 32.9 87.9 44.4 66.3 52.8 55.0 56.0 52.2 54.3 166,992 77,990
era-2008-2015 42.5 54.6 63.8 33.7 35.1 54.7 55.5 53.8 50.6 53.5 95,601 104,389
era-2015-2019 76.0 23.8 92.3 36.5 30.8 55.1 56.3 55.0 52.2 53.0 509,258 83,185
era-2019-2024 37.0 35.3 74.7 53.8 52.8 53.1 55.8 52.8 52.1 53.9 44,682 59,601
era-2024-2026 50.2 28.1 88.6 41.9 57.3 52.6 54.7 55.2 52.1 53.3 669,899 67,899
medium-formal 51.4 28.1 83.1 40.5 61.6 52.6 53.4 57.8 51.8 54.4 150,980 80,180
medium-messaging 69.8 25.4 92.8 36.8 34.8 54.6 56.3 54.5 52.1 53.0 814,324 81,460
medium-ai 40.0 33.8 84.9 37.1 72.0 52.3 53.9 55.8 51.9 53.6 500,024 75,086

All 13 analysis levels from the expanded corpus analysis. Per-source levels (academic, sms, messenger, chatgpt, claude_ai) analyze one source in isolation. Per-era levels split the corpus by date range. Per-medium levels aggregate sources by communication type: formal (academic), messaging (sms + messenger), and AI platforms (chatgpt + claude_ai). LLM inference via Claude Opus 4.6 with 10 chunks per source/era/medium level and 15 chunks for the full level. Empath processes all words directly (no sampling). LLM Words = words sent to the model after chunking. The Empath columns demonstrate the compression-to-midpoint effect across all 13 levels: the total LLM range across all 65 cells (13 levels x 5 dimensions) spans 23.8 to 92.8 (69.0 points), while the Empath range spans 50.6 to 57.8 (7.2 points). This 7.2 vs. 69.0 range comparison is the quantitative basis for removing Empath from personality synthesis and repositioning it as ordinal corpus characterization.

Table S8. Differential Personality by Narrative Perspective

Level Perspective N E O A C Words ΔN ΔE ΔO ΔA ΔC Mean Δ
Original (corpus) Subject's writing 51.1 28.3 88.6 36.4 61.4 101,005
arc-1-interlocutor Interlocutor 1P 64.7 36.0 79.3 42.0 44.3 22,458 +13.6 +7.7 -9.3 +5.6 -17.1 10.7
arc-2-interlocutor Interlocutor 1P 50.7 62.3 80.0 54.0 62.3 20,999 -0.4 +34.0 -8.6 +17.6 +0.9 12.3
arc-1-third-person Third-person 63.7 26.7 89.7 35.7 64.0 18,634 +12.6 -1.6 +1.1 -0.7 +2.6 3.7
arc-2-third-person Third-person 71.0 24.5 90.0 38.0 60.0 15,645 +19.9 -3.8 +1.4 +1.6 -1.4 5.6
arc-1-blended Blended 65.5 24.0 90.8 40.5 55.0 27,508 +14.4 -4.3 +2.2 +4.1 -6.4 6.3
arc-2-blended Blended 69.5 24.5 92.5 35.5 55.0 26,234 +18.4 -3.8 +3.9 -0.9 -6.4 6.7

The differential experiment tests whether narrative perspective determines inferred personality or whether the model produces a fixed authorial profile. Source narratives are from the same pipeline that produced the subject first-person narratives analyzed in Table S6. Arc 1 (the earlier, multi-year arc) produced interlocutor first-person, third-person, and blended files; arc 2 (the later, eight-month arc) produced the same. All LLM inference via Claude Opus 4.6. The interlocutor profiles diverge from the subject's baseline by 10.7 and 12.3 mean absolute points respectively, with the arc 2 interlocutor's Extraversion (+34.0) as the strongest single-dimension evidence. Third-person profiles about the subject stay within 3.7-5.6 points. Blended profiles show intermediate values (6.3-6.7). Neuroticism is elevated across all perspectives (+12.6 to +19.9), consistent with the literary genre effect documented in Section 3.7. The other four domains show perspective-dependent differentiation. See Section 3.10 for discussion.

Persona Model Specification

The ten-dimension persona model maps Psyche's instrument scores to behavioral predictions:

  1. Communication: Directness, hedging patterns, channel preferences
  2. Audience (self-monitoring): Cross-situational consistency vs. audience adaptation
  3. Decision-Making: Deliberation architecture, action conversion, external consultation patterns
  4. Risk: Domain-specific risk tolerance (social, financial, physical, intellectual)
  5. Conflict Response: Regulation sequence, processing mode, resolution orientation
  6. Emotion Regulation: Reappraisal and suppression deployment, capacity under load
  7. Empathy: Affective vs. cognitive empathy, perspective-taking effectiveness, fantasy engagement
  8. Ethics: Strategic comprehension vs. deployment, transparency orientation
  9. Motivation: Need hierarchy, vocational orientation, interest cycling patterns
  10. Stress Response: Crisis mode, recovery strategy, coping mechanism hierarchy, resilience source

Version History

Date Version Changes
2026-03-02 v1 Initial working paper
2026-03-04 v2 Expanded corpus analysis: fixed truncation bug in LLM text chunking (5.1% → 6.9% of corpus analyzed), expanded from 3 to 5 sources (311K → 1.47M words), added SMS old-phone and Facebook Messenger data. Corrected LLM Big Five scores (most notably C: 38 → 61.4), recalculated merged scores and self-enhancement analysis. Added Section 3.8 (communication medium effects on personality inference) and Section 3.9 (temporal personality trajectory). Added Appendix A (LLM corpus analysis methodology with reproduction instructions). Updated all dependent tables (4, 7, S2, S4, S5, S6) and deltas. Added supplementary Table S7 (full 13-level comparison).
2026-03-06 v3 Empath reframe: removed from profile synthesis (was weight 0.10); repositioned as ordinal comparative corpus characterization using z-score labels. Updated method weights to three-method synthesis (self-report 0.47, LLM 0.27, interview 0.27). Recalculated merged Big Five scores. Updated Tables 4, S5, S7. Added ninth analytical contribution.
2026-03-19 v4 Differential personality experiment: added Section 3.10 with results from six narrative perspective levels (two interlocutor 1P, two third-person, two dual-PoV). Interlocutor profiles differ from subject baseline by 10.7-12.3 mean Δ; third-person profiles stay within 3.7-5.6. Rules out model-invariant output confound for signal preservation. Added Table 10 (Section 3.10), Table S8. Updated abstract (9 → 10 contributions), Section 3.7 caveats, Section 4.1, Section 4.2 counter-evidence, Table S6 footnote.
2026-03-31 v5 Cross-model evaluator analysis: added Section 3.11 with 86-run dataset. 39-run factorial corpus evaluation (5 registers × 2 models × 3 runs + 9 full-context) quantifying systematic evaluator biases. Replicated 2×2 generation confound experiment (evaluator effect 7× generator effect; generator formally negligible). Full-context anchoring ablation (text volume, not temporal; debiased prompt SMS-specific). Facet-level bias localization (C6 Deliberation +25.3, E1 Warmth +18.6, O5 Ideas +0.2). N genre uplift facet-localized to anxiety/vulnerability/self-consciousness. O stability confirmed genuine (not O5 ceiling artifact). Cross-register transportability r=0.861. Differential personality replication (interlocutor E=64.0 stable). Cross-model narrative rankings confirm planning dominance. Updated abstract (10 → 11 contributions), Section 4.1.

Appendix A: LLM Corpus Analysis Methodology

This appendix provides the algorithmic specification for the LLM corpus personality inference pipeline, sufficient for reproduction on the same or comparable data.

A.1 System Prompt

The following system prompt is provided to Claude Opus for all personality inference calls:

You are a personality assessment psychologist conducting a structured evaluation.

You will analyze text samples written by a person and assess their personality traits.

CRITICAL RULES:

  1. Base ALL assessments ONLY on evidence in the provided text. Quote specific passages.
  2. Do NOT assume traits from occupation, demographics, or stereotypes.
  3. If evidence is insufficient for a trait, say so explicitly and give lower confidence.
  4. Distinguish between what someone TALKS ABOUT vs what they ARE. Discussing anger doesn't mean high Neuroticism.
  5. Look for behavioral patterns, not single instances.
  6. Consider the CONTEXT of writing (academic, casual, technical) when interpreting tone.
  7. Be especially careful with Agreeableness — it's hardest to infer from text (r~.22).

Rules 4 and 7 are anti-sycophancy measures adapted from the assessment-optimized prompting approach of Peters, Cerf, and Matz (2024), which demonstrated that structured prompts achieve r=.443 convergent validity with self-report versus r=.117 for generic prompting.

A.2 Assessment Prompt Template

Each chunk of text is wrapped in the following prompt template:

Analyze the following text samples written by ONE person and assess their Big Five personality traits.

For each domain, provide:

  1. A score from 1-100 (50 = population average)
  2. Confidence level: high (strong evidence), medium, or low (sparse evidence)
  3. 2-3 specific quotes that support your assessment
  4. Brief reasoning

The five domains are:

  - Neuroticism (N): anxiety, anger, depression, self-consciousness, impulsiveness, vulnerability
  - Extraversion (E): warmth, gregariousness, assertiveness, activity, excitement-seeking, positive emotions
  - Openness (O): fantasy, aesthetics, feelings, actions, ideas, values
  - Agreeableness (A): trust, straightforwardness, altruism, compliance, modesty, tender-mindedness
  - Conscientiousness (C): competence, order, dutifulness, achievement striving, self-discipline, deliberation

Also assess the 6 facets within each domain if sufficient evidence exists.

The model is required to respond in structured JSON format specifying domain scores, facet scores, confidence ratings, evidence quotes, reasoning, caveats, and word count analyzed.

A.3 Pipeline Steps

Step 1: Ingestion. Self-authored messages are loaded from JSONL files (data/ingested/{source}.jsonl), filtered to author == "self" only. For AI platform sources (ChatGPT, Claude.ai), samples where more than 50% of content is code are excluded. For temporal analysis levels, samples are filtered by timestamp range.
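The ingestion filters can be sketched as follows. The author filter is as specified above; the >50%-code exclusion heuristic is not specified in this appendix, so the line-based version below is an assumption, and all names are illustrative.

```python
import json

# Minimal sketch of the Step 1 filters: keep only self-authored samples, and
# for AI platform sources drop samples that are mostly code.
def looks_like_code(line: str) -> bool:
    # Assumed heuristic: a line "looks like code" if it starts with a common
    # code token. The pipeline's actual heuristic may differ.
    s = line.strip()
    return s.startswith(("def ", "class ", "import ", "return ", "{", "}", "```"))

def keep_sample(sample: dict, is_ai_source: bool) -> bool:
    if sample.get("author") != "self":
        return False                     # only self-authored messages
    if is_ai_source:
        lines = [l for l in sample["text"].splitlines() if l.strip()]
        code_lines = sum(looks_like_code(l) for l in lines)
        if lines and code_lines / len(lines) > 0.5:
            return False                 # mostly-code sample, excluded
    return True

raw_jsonl = "\n".join([
    '{"author": "self", "text": "honestly I think the deadline is fine"}',
    '{"author": "assistant", "text": "Here is the refactored function."}',
])
kept = [s for s in map(json.loads, raw_jsonl.splitlines())
        if keep_sample(s, is_ai_source=True)]
print(len(kept))
```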

Step 2: Segmentation. Samples exceeding 2,000 words are split at word boundaries into segments of at most 2,000 words each. Segment IDs use the format {original_id}_seg{NN} (01-indexed). Segments inherit source, author, and timestamp metadata from the parent sample. Short samples (at most 2,000 words) pass through unchanged. This step prevents the truncation artifact that affected the v1 analysis, in which academic papers of up to 52,744 words were silently discarded down to 2,000 words.
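The segmentation rule can be sketched as follows (metadata inheritance is omitted and the function name is illustrative):

```python
# Sketch of Step 2: split samples over 2,000 words at word boundaries into
# <=2,000-word segments with 01-indexed IDs of the form {original_id}_seg{NN}.
MAX_WORDS = 2000

def segment(sample_id: str, text: str, max_words: int = MAX_WORDS):
    words = text.split()
    if len(words) <= max_words:
        return [(sample_id, text)]       # short samples pass through unchanged
    return [
        (f"{sample_id}_seg{i:02d}", " ".join(words[start:start + max_words]))
        for i, start in enumerate(range(0, len(words), max_words), start=1)
    ]
```

A 4,500-word sample, for example, yields three segments of 2,000, 2,000, and 500 words, so nothing beyond the 2,000-word boundary is discarded, which is exactly the v1 truncation artifact this step prevents.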

Step 3: Source-Stratified Chunking. Segments are grouped by source label and sorted within each group by word count (descending, preferring substantive segments). Chunks are assembled through round-robin interleaving across source groups: one segment per source per round, cycling across all sources. When a chunk would exceed approximately 8,000 words, the current buffer is flushed as a complete chunk and a new buffer begins. Each segment receives a header: [Source: {source}, Date: {YYYY-MM}]. The chunk list is truncated to the maximum chunk budget: 10 chunks for per-source, per-medium, and per-era analysis levels; 15 chunks for the full corpus level.
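The round-robin assembly can be sketched as below. Headers are simplified (the pipeline's header also carries a Date field), the ~8,000-word threshold counts segment words only, and all names are illustrative.

```python
# Sketch of Step 3: round-robin interleaving of source-grouped segments into
# ~8,000-word chunks, flushed whenever the next segment would overflow.
CHUNK_WORDS = 8000

def build_chunks(groups: dict[str, list[str]], max_chunks: int) -> list[str]:
    """groups: source label -> segments, pre-sorted by word count descending."""
    queues = {src: list(segs) for src, segs in groups.items()}
    chunks, buffer, buffer_words = [], [], 0
    while any(queues.values()):
        for src, queue in queues.items():    # one segment per source per round
            if not queue:
                continue
            seg = queue.pop(0)
            seg_words = len(seg.split())
            if buffer and buffer_words + seg_words > CHUNK_WORDS:
                chunks.append("\n\n".join(buffer))   # flush the full buffer
                buffer, buffer_words = [], 0
            buffer.append(f"[Source: {src}]\n{seg}")
            buffer_words += seg_words
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks[:max_chunks]           # truncate to the level's chunk budget
```

The interleaving ensures every chunk mixes sources, so no single register dominates any one inference call.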

Step 4: Per-Chunk Inference. Each chunk is processed independently through Claude Opus 4.6 via the Claude CLI (claude -p --model opus --output-format text). The model receives the system prompt (Section A.1) and the assessment prompt (Section A.2) with the chunk text. JSON is extracted from the response with fallback handling for markdown code blocks and brace-matching. Timeout: 300 seconds per call.
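The JSON-extraction fallback can be sketched as below. This is an assumed reconstruction of the two fallbacks named above (markdown code blocks, then brace matching); braces inside JSON strings are not handled here.

```python
import json
import re

# Sketch of the Step 4 extraction fallback: try a fenced code block first,
# then brace-match over the raw model response.
def extract_json(response: str) -> dict:
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", response, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    start = response.find("{")
    if start == -1:
        raise ValueError("no JSON object in model response")
    depth = 0
    for i, ch in enumerate(response[start:], start=start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:                       # matching close brace found
                return json.loads(response[start:i + 1])
    raise ValueError("unbalanced braces in model response")
```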

Step 5: Score Merging. For each Big Five domain, the final score is the arithmetic mean of domain scores across all K chunks. Confidence is determined by majority vote across chunks. Evidence quotes are deduplicated (by first 80 characters) and truncated to 5 per domain. Reasoning is concatenated across chunks with pipe separators. Overall confidence is the minimum confidence across all 5 domains. Facet scores are merged identically.
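The merging rules can be sketched as follows; the per-chunk result shape (domain mapped to score, confidence, and quotes) is assumed from the JSON schema described in Section A.2, and reasoning concatenation is omitted.

```python
from collections import Counter
from statistics import mean

# Sketch of Step 5: mean scores, majority-vote confidence, quote dedup by
# first 80 characters (capped at 5), and minimum overall confidence.
CONF_ORDER = {"low": 0, "medium": 1, "high": 2}

def merge_domain(chunk_results: list[dict], domain: str) -> dict:
    entries = [r[domain] for r in chunk_results]
    score = mean(e["score"] for e in entries)          # arithmetic mean over chunks
    confidence = Counter(e["confidence"] for e in entries).most_common(1)[0][0]
    quotes, seen = [], set()
    for e in entries:
        for q in e.get("quotes", []):
            if q[:80] not in seen:                     # dedupe by first 80 chars
                seen.add(q[:80])
                quotes.append(q)
    return {"score": score, "confidence": confidence, "quotes": quotes[:5]}

def overall_confidence(merged_domains: dict) -> str:
    # Overall confidence is the minimum confidence across the merged domains.
    return min((d["confidence"] for d in merged_domains.values()),
               key=CONF_ORDER.__getitem__)
```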

Step 6: Provenance Recording. Each analysis output records: analysis level, sources included, date range (for temporal levels), total corpus samples, total corpus words, words sent to the LLM, sampling ratio, number of segments created, number of chunks, model identifier, and timestamp.

A.4 Reproduction Instructions

# Prerequisites: Claude Max plan, claude CLI installed
# Environment: Python 3.12+, uv package manager

cd ~/claudeworkspace/psyche/analysis

# 1. Ingest all sources
uv run psyche ingest all

# 2. Verify ingestion
wc -l data/ingested/*.jsonl
# Expected: ~62,452 total samples across 5 files

# 3. Run full corpus analysis (all 13 levels, ~2 hours)
uv run python scripts/analyze_corpus.py --all

# 4. Run single level (for verification)
uv run python scripts/analyze_corpus.py --level academic

# 5. View comparison table
uv run python scripts/analyze_corpus.py --compare-only

# Results: profiles/analysis/corpus/{level}-llm-claude.json
#          profiles/analysis/corpus/{level}-empath.json  (ordinal lexical profiles, not personality scores)
#          profiles/analysis/corpus/comparison-summary.json

A.5 Key Source Files

File Purpose
analysis/prompts/personality_inference.py System prompt and assessment prompt templates
analysis/scripts/analyze_corpus.py Multi-level analysis orchestration script
analysis/psyche_analysis/methods/llm_claude.py Segmentation, chunking, and merging logic
analysis/psyche_analysis/corpus/manager.py Source ingestion and corpus loading
profiles/analysis/corpus/ All output JSON files (13 levels x 2 methods + summary)