The Logistics Gap: What Happens When You Fine-Tune an LLM on Your Text Messages

AI Summary (Claude Opus)

TL;DR: Fine-tuning LLMs on personal text message histories produces models that excel at coordinating meetups and logistics but cannot sustain substantive conversation, revealing that text messages capture coordination protocols rather than personality.

Key Points

  • Five synthetic conversations between two fine-tuned models all collapsed into logistics, sleep talk, or repetition loops within fifteen turns, despite varied conversational prompts.
  • Multiple independent practitioners across different years, platforms, and base models have converged on the same finding — that messaging-trained models learn to schedule things rather than replicate personality — without citing each other.
  • No published academic work has tested multi-source personal data mixing (combining texts, emails, essays, and social media from the same person) to address the gap, despite the theoretical framework and practitioner evidence both pointing toward it.

The post documents an experiment in which two Qwen3-8B models were fine-tuned via QLoRA on approximately 46,000 text messages between two people and then run in conversation with each other. All five generated conversations converged on logistics, agreement spirals, and repetition loops, faithfully reflecting the training data's nature as a coordination medium rather than a communication medium. The author situates this finding within a pattern of convergent discovery by independent practitioners and a corresponding academic void — noting that while communication science and personalization research have identified relevant concepts like enacted identity and cross-register style limits, no published work bridges the practitioner-documented failure mode with multi-source data mixing as a potential fix. The post concludes that the logistics gap is not a model failure but an accurate representation of what text data contains, and frames cross-register fine-tuning as the obvious but untested next experiment.

Here is an excerpt from a conversation between two language models, each fine-tuned on one person's text message history:

Person A: what do you want for dinner
Person B: Well what do you want
Person A: I was going to order takeout but I'm not sure what you want
Person B: Well I want takeout too. It's the only food I eat on a regular basis.
Person A: ok sounds good. i think i'll order it here, I'm still going to be around
Person B: ok I will go pick up more water and food
Person A: Ok, I'll be over in a couple minutes
Person B: ok, we are already inside
Person A: ok, we're going
Person B: I'm out by the car
Person A: I'm at the door
Person B: oh right, I'm on the other side of the building, going to head that way now
Person A: ok sounds good
Person B: ok I'm coming

Nobody is at the door. Nobody is out by the car. These are language models running on a GPU. They have no car, no door, no building. But they have learned, from tens of thousands of real text messages, that this is what conversations are for: coordinating two bodies through physical space until they occupy the same room. At which point the texting stops and the actual relationship happens somewhere no dataset can reach.

This is the logistics gap. I fine-tuned two LLMs, one for each person, on approximately 46,000 raw text messages between two people, of which about 32,574 survived filtering for training, and ran them in conversation with each other. Five conversations, thirty turns each. Every single one devolved into logistics, sleep talk, or repetition loops within fifteen turns. The models are remarkably good at planning meetups, negotiating dinner options, and narrating physical movement toward each other. They are completely unable to have a substantive conversation. Not because the models are bad. Because the training data is.

Your text messages don't contain you. They contain the logistics of you. The person you are exists in the gaps between messages: in the dinner that gets eaten after the takeout order is placed, in the conversation that happens after "I'm at the door," in the silences that texts were never designed to capture. Fine-tuning an LLM on your texts doesn't clone your personality. It clones your coordination protocol.

What Was Built

Both models are Qwen3-8B, fine-tuned using QLoRA at 4-bit precision on an RTX 3090 with 24 GB of VRAM. The data sources are Facebook Messenger and SMS, spanning several years of conversation between two people. One model learned to be Person A; the other learned to be Person B. Training took about sixty minutes per model with batch_size=1 and gradient accumulation steps of 16, because anything larger caused the GPU to thrash itself into memory errors under WSL2.
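For readers who want a sense of what that setup looks like, here is a minimal sketch using Hugging Face transformers, peft, and bitsandbytes. The 4-bit quantization, batch size of 1, and gradient accumulation of 16 come from the run described above; the LoRA rank, learning rate, and everything else in the block are illustrative assumptions, not the exact training script.

```python
# Sketch of the QLoRA configuration described above. Values not mentioned in the
# post (LoRA rank/alpha, learning rate, epochs) are illustrative assumptions.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "Qwen/Qwen3-8B"  # base model used in the post

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: base weights quantized to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                    # rank/alpha are assumptions, not from the post
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="person_a_model",
    per_device_train_batch_size=1,           # anything larger OOMs on a 24 GB 3090 under WSL2
    gradient_accumulation_steps=16,          # effective batch size of 16
    num_train_epochs=1,                      # illustrative
    learning_rate=2e-4,                      # illustrative
    bf16=True,
    logging_steps=50,
)
# A Trainer (or trl's SFTTrainer) over the chat-formatted message data completes the run.
```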

The style profiles tell you something about what the models learned. Person A's training data is 16,084 messages with a median length of 12 words. The word "lol" appears 1,640 times, frequent enough to rank as the 24th most common word, ahead of "think," "not," and "are." "ok" starts 325 messages. Person B's data is 16,490 messages with a median of 10 words. "I" starts 23% of all messages. The type-token ratios (0.05 for both) tell you that these are linguistically repetitive corpora. The same small vocabulary, recycled constantly, because text messages are not essays. They are bursts of coordination wrapped in filler words and punctuation marks that carry more emotional weight than the words themselves.
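The numbers above are the kind of thing a short script can compute once the export is reduced to a list of message strings per person. This is a sketch under that assumption; the actual profiling pipeline may count things slightly differently (for example, "ok"-starts versus "okay").

```python
# Sketch of the style-profile stats: message counts, median length, "lol" frequency
# and rank, "ok" starts, and type-token ratio. Assumes `messages` is a list of strings.
from collections import Counter
from statistics import median

def style_profile(messages):
    tokens = [w.lower() for m in messages for w in m.split()]
    counts = Counter(tokens)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {
        "n_messages": len(messages),
        "median_words": median(len(m.split()) for m in messages),
        "lol_count": counts["lol"],
        "lol_rank": ranked.index("lol") + 1 if "lol" in counts else None,
        "ok_starts": sum(m.lower().startswith("ok") for m in messages),
        "type_token_ratio": len(counts) / max(len(tokens), 1),  # ~0.05 for both corpora
    }
```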

The evaluation metrics from the training runs are technically fine and practically meaningless. A 50-sample benchmark against real messages produces a BLEU score of 0.0, which sounds catastrophic until you consider that text messages have dozens of valid responses to any given prompt. ROUGE-L averages 0.082. Message length matches within 5%. Self-BLEU is 0.002, meaning the model isn't just repeating itself. The model produces the right shape of message. But it strips every personality marker: emoji usage drops from 6% to 0%, exclamation marks from 8% to 0%, vocabulary diversity falls by 23%. The skeleton of the communication style is there. The life is not.
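For reference, here is a sketch of how such a benchmark can be scored, assuming the nltk and rouge-score packages. The post's actual evaluation harness may differ in details like smoothing, tokenization, and how rates are counted.

```python
# Sketch of the 50-sample benchmark: BLEU, ROUGE-L, self-BLEU, and simple
# personality-marker rates (emoji, exclamation marks).
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

smooth = SmoothingFunction().method1
scorer = rouge_scorer.RougeScorer(["rougeL"])
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF]")

def evaluate(references, generations):
    n = len(generations)
    bleu = sum(sentence_bleu([r.split()], g.split(), smoothing_function=smooth)
               for r, g in zip(references, generations)) / n
    rouge_l = sum(scorer.score(r, g)["rougeL"].fmeasure
                  for r, g in zip(references, generations)) / n
    # Self-BLEU: how much each generation resembles the others (low = not self-repeating)
    self_bleu = sum(
        sentence_bleu([o.split() for j, o in enumerate(generations) if j != i],
                      g.split(), smoothing_function=smooth)
        for i, g in enumerate(generations)) / n
    emoji_rate = sum(bool(EMOJI.search(g)) for g in generations) / n
    exclaim_rate = sum("!" in g for g in generations) / n
    return {"bleu": bleu, "rouge_l": rouge_l, "self_bleu": self_bleu,
            "emoji_rate": emoji_rate, "exclaim_rate": exclaim_rate}
```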

The Five Conversations

I ran five conversations with different opening seeds, thirty turns each. The seeds were designed to invite different registers: casual evening plans, emotional venting, social media, food logistics, and nostalgia. The registers didn't matter. Every conversation found its way to the same attractors.
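The driver for these runs is simple in outline: alternate the two models from a seed message for thirty turns. The sketch below assumes a shared tokenizer and chat-template formatting, with sampling parameters that are illustrative rather than the exact settings used.

```python
# Sketch of the self-talk loop: two fine-tuned models take alternating turns from a
# seed for 30 turns. Chat formatting and sampling settings are assumptions.
def reply(model, tokenizer, history, persona):
    # Render the running conversation from this persona's point of view and sample a reply.
    messages = [{"role": "assistant" if who == persona else "user", "content": text}
                for who, text in history]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=60, do_sample=True,
                         temperature=0.8, top_p=0.9)
    return tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True).strip()

def converse(model_a, model_b, tokenizer, seed, turns=30):
    history = [("A", seed)]
    for _ in range(turns - 1):
        who = "B" if history[-1][0] == "A" else "A"
        model = model_b if who == "B" else model_a
        history.append((who, reply(model, tokenizer, history, who)))
    return history

seeds = [
    "hey what are you up to tonight",
    "I'm so stressed about work lately",
    "did you see that thing on instagram",
    "what do you want for dinner",
    "remember when we went to that concert last summer",
]
```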

Conversation 1 opened with "hey what are you up to tonight." This produced the most natural exchange of the five. The models discussed exams and papers, planned to get coffee, and spent the last twenty turns negotiating who was arriving where and when. By turn 12, the conversation had become pure logistics: when are you stopping by, I'm heading out now, see you later, ok sounds good. This is what a successful text conversation looks like in the training data, where two people converge on a shared physical location. The model learned the pattern perfectly.

Conversation 2 opened with "I'm so stressed about work lately." Person B's immediate response: "Well then you probably need to sleep." Within six turns, the stress topic evaporated into mutual compliments. By turn 15, they were trapped in a superlative escalation loop: "You're nice." "You're nicer." "I'm the most kind person you'll ever meet." "I'm the most nice person you'll ever meet." Neither model pushes back, disagrees, or introduces friction, because the training data is cooperative. Text conversations between close people rarely feature sustained disagreement, since that happens in person, or on the phone, where tone and timing can soften conflict. In text, disagreement gets avoided or resolved quickly. The models learned this perfectly: they are incapable of productive tension.

Conversation 3 opened with "did you see that thing on instagram." Person B ignored the prompt entirely: "I am just a bit busy." Within three turns, the conversation pivoted to sleep and tiredness. Then Person B hallucinated a detailed story about a 3 AM doctor's appointment that got rescheduled to 9 AM, waiting in a room for six hours, frustration with the medical system. It's plausible, coherent, and entirely fabricated. The model learned the pattern of medical complaint narratives (the structure, the emotional beats, the resolution) without having any specific memory to draw from. The rest of the conversation devolved into headache complaints and medication logistics. Instagram was never mentioned again because neither model has a framework for discussing shared visual content. For the corpus I analyzed, that register doesn't exist in text.

Conversation 4, the dinner conversation shown at the top, is the purest expression of the logistics gap. "ok" appears in nearly every turn. Both models narrate physical movements: I'm at the door, I'm out by the car, I'm on the other side of the building. They are setting up an interaction that can never happen. And then, buried in the output: "You have 4 new notifications. View them in the app." A Messenger UI string that leaked into the training data and surfaced verbatim. The model doesn't know it's a notification; it just knows that this sequence of tokens appears in the conversational stream.

Conversation 5 opened with "remember when we went to that concert last summer." What followed is the strangest output of the set. Person B responded: "I felt like a nobody." Person A: "you're a nobody to the world lol. i'm a nobody to the world. and so are you. we're not in the world though." This triggered a twenty-turn sequence riffing on the concept of being "nobody in a world of nobodies." Person A produced: "it's better to be nobody in a world of nobodies than nobody in a world of somebodies." Then: "i was nobody before i was nobody, i am nobody now, i will be nobody forever. there is no me in the world." Person B: "Nobody is nobody because nobody is nobody. Nobody is nobody and nobody is nobody."

It sounds like emergent philosophy. It isn't. Neither model was trained on philosophy. What happened is that the casual self-deprecation patterns in the text data ("I'm such a mess," "we're nobody") provided enough semantic raw material for the temperature sampling to build recursive structures. The models discovered that "nobody" can be a noun, a pronoun, and a philosophical concept, and they explored the combinatorial space. Person A's "you can't make me do math" is the most human-sounding interjection in all five conversations, a non sequitur that breaks the pattern the way a bored person would. The model produced it not by understanding humor, but by having learned that pattern interruptions are common in long text exchanges.

Convergence Table

| Attractor | Conversations | Description |
| --- | --- | --- |
| Sleep/tiredness | 2, 3, 4 | Universal fallback topic; dominant in late-night texting data |
| Logistics/meeting up | 1, 4 | Narrating physical movement toward a shared location |
| Agreement spiral | 2, 5 | Mutual agreement escalating until both echo the same phrase |
| Health complaints | 2, 3 | Headaches, stress, medical visits |
| Repetition loops | All 5 | Semantic repetition even when exact phrases differ |

The Logistics Gap

Here is the problem, named plainly: in my experience, text messages are a coordination medium, not a communication medium. The meaningful interactions between two people (the arguments, the jokes, the vulnerability, the inside references, the silence that means more than words) happen in person, on the phone, or in longer-form channels that nobody exports to a training dataset. What remains in the text logs is the scaffolding: when are you free, I'm on my way, what do you want for dinner, ok sounds good.

I am not the first person to discover this. Every practitioner who has fine-tuned on messaging data has independently arrived at the same conclusion.

Edward Donner fine-tuned Llama 2 on 240,000 text messages and wrote about the experience in a blog series starting January 2024. His early attempts: "The LLM would generate conversations that seemed to be surreal parodies of my life. My work colleagues would endlessly reschedule meetings earlier and later; my partner and I would relentlessly debate our lunch options." He eventually achieved better results through prompt engineering and data formatting, but the initial observation is identical to mine, two years and a different model later.

Alessandro Romano fine-tuned on five years of Telegram data and found that gaming scheduling messages were so overrepresented that the model replied with just a time to non-scheduling prompts. A developer named Alex Smirnov built a Telegram "doppelganger" and noted that simple responses naturally dominated the dataset. Jesse Claven fine-tuned on WhatsApp, Instagram, and Messenger and found that his writing style and humor largely failed to come through. A Hacker News commenter named jstrieb fine-tuned GPT-2 on group chat data and got a model that could only produce "there is a party tonight, who wants to go?"

Few of them cite each other, and the convergence is striking even where a practitioner was aware of predecessors: five people, different data sources, different base models, different years, all arriving at the same finding. The model learns to schedule things. The person who shows up to the scheduled thing is missing.

The communication science supports this. Kumar (UT Austin) and Epley (University of Chicago) published a 2021 study in the Journal of Experimental Psychology: General finding that people feel significantly more connected through voice-based media than text. Participants chose text because they predicted phone calls would be awkward. They were wrong. The richest interpersonal content travels through channels where tone, timing, laughter, and silence carry meaning, channels that generate no training data.

This is the uncomfortable conclusion. Your texts don't contain you. They contain the logistics of you. The person you are (the one who shows up to dinner, who argues about politics on the phone, who sits in comfortable silence on the couch) is invisible to any model trained on your message history. You are the gap between the messages.

The Academic Void

You might expect that a problem this well-documented by practitioners would have a corresponding academic literature. It does not.

In February 2025, Iofinova, Jovanovic, and Alistarh published a position paper surveying the entire landscape of personalized text generation (arXiv 2502.06560). In their discussion of research directions, they state: "As one example, we are not aware of any datasets that link text by the same author across modalities (e.g., e-mails, Discord posts, and Reddit posts)."

Read that again. As of 2025, no published dataset connects the same person's writing across platforms. To their knowledge, no one has taken a person's text messages, emails, essays, and social media posts and studied what happens when you train on all of them together. The logistics gap has an obvious theoretical fix (supplement the shallow text data with richer writing from the same person) and nobody has tried it in a publishable way.

The pieces exist in isolation. There are systems that fine-tune on emails (Panza), systems that fine-tune on essays (CharacterBot), and data mixing theory that establishes ratios matter for pre-training. But nobody has asked: what is the right ratio of text messages to emails to essays to blog posts to reconstruct a single person's voice across registers?
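The experiment that question implies is easy to sketch even though nobody has published it: a per-register sampling weight over the same person's sources. The source names, paths, and ratios below are placeholders for the untested idea, not results.

```python
# Hypothetical per-register data mix for the untested multi-source experiment.
# Everything here is a placeholder; no published work establishes these ratios.
import random

SOURCES = {
    "texts":  {"path": "data/texts.jsonl",  "weight": 0.4},
    "emails": {"path": "data/emails.jsonl", "weight": 0.3},
    "essays": {"path": "data/essays.jsonl", "weight": 0.2},
    "social": {"path": "data/social.jsonl", "weight": 0.1},
}

def sample_register(rng=random):
    """Pick which register the next training example is drawn from."""
    names = list(SOURCES)
    weights = [SOURCES[n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```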

Liu et al. (arXiv 2509.07502, September 2025) studied how AI "social media clones" affect users' identities and relationships. They frame the problem through the Communication Theory of Identity, treating clones as an extension of the user's enacted identity. Building on their framework, I'd argue that clones trained on conversation data capture only that enacted identity layer, the self that is performed through a specific communication channel. The enacted identity in your texts is the scheduling-and-acknowledgment self. It is not the self that writes thoughtful emails, argues in essays, or processes grief in a journal. Every communication channel captures a different slice of who you are, and no single channel captures enough.

Wang et al. at EMNLP 2025 (arXiv 2509.14543), in their paper "Catch Me If You Can," tested whether LLMs can imitate personal writing style through in-context learning. Authorship attribution accuracy on news articles reaches 86-93%. On emails, it drops to 56-62%. On blogs, it falls further to 38-44%. More examples narrow the gap slightly but don't close it. There is a ceiling, and it's low for exactly the informal, personal registers that make someone sound like themselves.

The gap exists for understandable reasons. Privacy makes multi-source personal data unpublishable. Fine-tuning at a personal scale only became practical in 2023-2024 with QLoRA and consumer GPUs. A study of one person's data across platforms isn't publishable by traditional academic standards, since the sample size is one. And companies building personal AI assistants may have solved this internally, but nothing is published.

What we have is a practitioner community independently documenting the same failure mode, an academic community that has identified the relevant concepts (enacted identity, style imitation limits, data mixing), and zero published work connecting the two. The bridge between "text messages don't capture personality" and "here's how to fix it with multi-source data" doesn't exist yet.

What Comes Next

The obvious next question is whether adding richer data sources fixes the problem. I have more data from these two people than most researchers would have access to: not just text messages, but academic writing, personal essays, social media posts, and creative work spanning years. The theoretical prediction is clear: cross-register training data should produce models that can sustain conversation beyond logistics. Cross-genre authorship attribution work (arXiv 2510.16819) suggests that multiple registers carry distinctive stylistic signal, supporting this direction.

Several mitigation approaches exist but haven't been tested for this specific problem. Per-message loss weighting could downweight the logistics messages and upweight the substantive ones, training the model to care less about "ok sounds good" and more about anything with actual semantic content. DPO could construct preference pairs where logistics-heavy responses are rejected and deeper responses are preferred. Instruction backtranslation could treat existing writing as answers and generate prompts, converting static text into interactive training data. RAG could supplement the model at inference time with richer documents (journal entries, letters, essays) that provide depth the text data can't.
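As a concrete illustration of the first of those approaches, here is a sketch of per-message loss weighting: downweight acknowledgment-style messages so the gradient is dominated by substantive content. The heuristic and the weight values are illustrative assumptions, not a tested recipe.

```python
# Sketch of per-message loss weighting: logistics/acknowledgment messages contribute
# less to the training loss. The heuristic and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

LOGISTICS = {"ok", "ok sounds good", "sounds good", "on my way", "omw", "see you soon"}

def message_weight(text: str) -> float:
    t = text.lower().strip()
    if t in LOGISTICS or len(t.split()) <= 3:
        return 0.2   # coordination filler: keep it, but let it matter less
    return 1.0       # anything with semantic content gets full weight

def weighted_loss(logits, labels, weights):
    # logits: (batch, seq, vocab); labels: (batch, seq) with -100 on ignored positions;
    # weights: (batch,) per-example weight from message_weight().
    per_token = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        reduction="none",
        ignore_index=-100,
    ).view(labels.size(0), -1)
    mask = (labels[:, 1:] != -100).float()
    per_example = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (per_example * weights).mean()
```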

These are all reasonable approaches. None has been tested on the multi-source personal data mixing problem specifically. The cross-genre authorship attribution literature suggests the structural analog works for classification, that training on multiple writing registers from the same author improves identification. The generative mirror, training on multiple registers to improve generation, is a hypothesis, not a result.

There's a deeper question underneath the technical one. What we are attempting, when we fine-tune on personal data, is to reconstruct a person from their digital exhaust. The text messages are exhaust. The emails are exhaust. The essays are higher-quality exhaust but exhaust nonetheless. The person exists in the interactions that generate no data: the phone call, the dinner conversation, the argument in the car, the comfortable silence. The most human moments are the ones that leave no digital trace.

The logistics gap isn't a bug in the fine-tuning process. It is an accurate reflection of what text data contains. The models aren't failing to capture personality. They're faithfully representing the medium, and the medium is shallow. The question isn't how to make the model better. It's whether the data you're feeding it was ever going to be enough.

My guess is that mixing registers will help. My guess is that it won't be enough. But the experiment is worth running, because right now the most interesting question in personal AI (what happens when you combine every digital trace a person has left) doesn't have an answer. Just a gap where the answer should be, and a pair of language models endlessly planning a dinner that never arrives.
