The Model-Generation Audit

A fact is either true or it is not, and an AI that writes confidently about facts it has not verified is doing something that deserves a name more precise than "hallucination." This post is about what happened when I turned three AI models loose on thirty-three of my own blog posts to find out how many of those facts were wrong.

The answer is 85 MUST FIX items in the first pass and another 137 when the corrected posts were re-reviewed, because the initial review had missed issues and the fix agents introduced errors of their own. Two rounds of remediation later, all identified issues are resolved, but the trajectory suggests that the blog you have been reading was, in aggregate, less reliable than I believed it to be when I published it.

The Premise

Every post on this blog was written using Claude Code with a style pipeline that enforces quantitative voice constraints: sentence length, dependent clause ratio, compound hyphen density, em dash prohibition. The pipeline has a research phase that generates a factual foundation before drafting, and an optional review step that sends the draft to GPT-5.4, Gemini 3.1 Pro, and Opus 4.6 for critique before publication. The key word in that sentence is "optional," because I used the review step for exactly one post (the BullshitBench investigation) and published the other thirty-two without it. (Two more posts were written after the audit began and are excluded from these counts.)

The audit began with a hypothesis that turned out to be wrong. I expected the thirteen posts that had been rewritten in a version 2 pass would be substantially cleaner than the originals, because the rewrites were longer, more carefully structured, and drew on better research. In fact, every single rewritten post contained at least one factual error requiring correction. (Whether this rate is worse than human-written technical essays of comparable length is an open question; the audit did not establish a baseline, and academic papers and journalism have their own correction rates.) The v2 rewrites were content expansions, not verification passes, and the additional length created additional surface area for errors without improving the underlying accuracy of the claims.

What the Models Found

I ran three models as a review panel, each with a different adversarial orientation. GPT-5.4 via the Codex MCP focused on numerical accuracy and citation verification. Gemini 3.1 Pro focused on sentence quality and structural consistency. Opus 4.6, operating through background researcher subagents with web search access, played the role of a hostile but fair reader who steelmans the opposing position, identifies the weakest links in the argument, and checks whether the evidence supports the confidence level of the claims.

Of the thirty-three posts reviewed, thirty-two were classified as "significant" (containing at least one MUST FIX item). One post was classified as "minor." None were clean.

The errors fell into five primary categories (accounting for roughly half of the 85 MUST FIX items, with the remainder spanning overclaiming, internal contradictions, and cross-post inconsistencies), and the taxonomy is instructive because it reveals the specific failure modes of language model writing rather than the generic ones.

Fabricated Quotations

The most common error, appearing in roughly ten posts, was a direct quotation in quotation marks attributed to a specific source that does not contain those words. The post would say something like "the U.S. Copyright Office requires 'a natural person exercising subjective judgment in the composition of a work and having control of its execution,'" and the quoted phrase would not appear anywhere in the Copyright Office's published documents. It reads like a plausible paraphrase dressed in quotation marks, which is exactly what it is: the language model synthesized the gist of the source's position and presented the synthesis as a verbatim extraction.

This pattern is particularly insidious because it is extremely difficult for the author to catch. The quote sounds right. The source is real. The position attributed to the source is approximately correct. Only someone who actually looks up the source document and searches for the exact phrase will discover that the words are fabricated. I did not look them up. The research phase of the pipeline, which was supposed to provide factual grounding, apparently did not verify quotation accuracy at the level of exact wording.
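The check that catches this category is mechanically cheap: treat anything in quotation marks as a claim of verbatim extraction and search the source document for the exact words. A minimal sketch, with normalization choices and function names that are mine rather than the pipeline's:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and unify curly quotes so
    trivial formatting differences do not mask a genuine match."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_appears(quote: str, source_text: str) -> bool:
    """True only if the quoted words occur verbatim in the source."""
    return normalize(quote) in normalize(source_text)

# A paraphrase dressed in quotation marks fails the check.
source = "I have yet to see a CEO who craves an acquisition bring in a critic."
print(quote_appears("craves an acquisition", source))       # True: verbatim
print(quote_appears("hires two investment banks", source))  # False: fabricated
```

The normalization step matters: without it, a curly apostrophe or a line break in the source produces a false negative, and a checker that cries wolf on formatting gets ignored.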

The Buffett example in Post 012 is the most dramatic instance. The post claimed that Warren Buffett "hires two investment banks, one to make the case for the deal and one to argue against it." In his 2019 shareholder letter, Buffett wrote the opposite: "I have yet to see a CEO who craves an acquisition bring in an informed and articulate critic to argue against it. And yes, include me among the guilty." Buffett went on to describe the two-advisor idea as a reform he doubted would ever happen ("don't hold your breath"), but the post presented this rhetorical aside as his established practice, which is a meaningful distortion of the source material.

Mischaracterized Research

Roughly twelve instances across the corpus involved citing a real paper but misrepresenting what it found. Post 003 attributed a finding about "common knowledge from life experience" to AI21 Labs when the actual source was Ilia Shumailov's group at Oxford. Post 028 described a study of personality prediction from text corpora when the paper actually tested live conversational chatbot interactions, a meaningfully different methodology. Post 021 claimed that the Allen Institute for AI's AutoDiscovery platform uses multi-armed bandits when it actually uses Monte Carlo Tree Search.

The pattern here is compression: the language model summarizes a paper's contribution in a way that is approximately right but wrong on a detail that matters for the argument being made. The summary captures the conclusion but loses the mechanism, and the mechanism is what distinguishes "this supports my point" from "this addresses a related but different question."

Wrong Numbers

Ten instances of verifiable numerical errors, most of which would be caught by anyone who checked the source. Post 010 conflated two different findings from George Miller's 1956 paper, combining the span of immediate memory ("seven, plus or minus two") with the channel capacity of absolute judgment (two to three bits per dimension), which Miller himself explicitly identified as distinct phenomena. Post 015 claimed that Google's Helpful Content Update "removed over 40% of affected sites from search results entirely," when Google's stated goal was a 40% reduction in low-quality content across search results (later reported at 45%), and independent tracking by Ian Nuttall found roughly 1.7% of a 49,000-site sample actually deindexed.

Post 030 stated "32 chapters" when the actual count, derivable from companion posts, was 128 (72 chapters in one arc, 56 in another). This is the kind of error that a human editor would catch instantly and that a language model has no built-in mechanism to catch, because it does not have memory across posts unless you explicitly provide it.

Unsourced Statistics

Eight instances of precise numbers presented without any citation and not findable via web search. Post 012 claimed that "Lean Startup methodology reduces failure rates by 34%" and that "Customer Development triples the likelihood of early revenue milestones." Neither figure appears in any academic paper, practitioner publication, or industry report that I or the review models could locate. They have the rhetorical shape of empirical findings (specific percentages, comparative language) without any traceable provenance, which suggests they were generated rather than retrieved.

Oversimplified Claims

Five instances where a complex, contested, or evolving situation was presented as settled fact. Post 001 described China's copyright position on AI-generated content as "the opposite" of the U.S. approach, when the reality is two lower-court decisions (Li v. Liu 2023, Changshu People's Court 2025) that have not been tested at higher levels and do not represent a settled national policy.

What Each Model Caught

The most useful finding from the audit was not the errors themselves but the differences in what each model caught, because those differences reveal something about what fact-checking actually requires.

GPT-5.4 at its highest reasoning effort setting was the most pedantic reviewer, and pedantry turned out to be exactly the right orientation for verification. It flagged roughly ninety-five issues across thirty posts in its Phase 2 pass alone: every mismatched number, every paraphrase still wearing quotation marks, every statistic whose source says something slightly different from what the post claims. It caught that E5 embeddings use GELU activation, not the ReLU our fix agent had introduced. It caught that Docker's networks: [] resolves to the default network rather than creating an air gap. It caught that "about 30" integrations in one post contradicts "about a dozen" in the companion post describing the same system. If you were forced to choose a single model for first-pass fact-checking, GPT-5.4 at its highest reasoning effort is the strongest option, because its willingness to be pedantic about details is precisely what verification demands. It will miss things that other models catch, but it will catch the most things overall.

Opus 4.6 took a different orientation entirely: it steelmanned opposing positions, identified arguments that prove too much or too little, and checked whether the evidence supported the confidence level of the claims. It found ten issues in its Phase 2 pass, roughly a tenth of what GPT flagged, but the issues it found were qualitatively different. Where GPT catches a mismatched percentage, Opus catches an argument that, if true, would also invalidate the human editorial processes the post implicitly endorses, creating an internal contradiction the author did not notice. Those are both real problems, but they require different kinds of attention.

Gemini 3.1 Pro found structural issues that neither GPT nor Opus caught: a table that sums to 32 in a post titled "Thirty-One Items in a Day," a comparison matrix that switches denominators between 10 and 11 without explanation, Hull's behavioral equation attributed to a 1943 publication when the version with the full variable set appeared in 1951-1952, and a description of Tolman's latent learning that inverts his actual finding (the post said learning "from consequences" when Tolman proved learning occurs without reinforcement). These are not stylistic preferences or argumentation critiques; they are factual errors that two other models missed. In a subsequent review of the corrected posts, Gemini caught four more issues that GPT and Opus had missed across two full rounds: a broken forward reference created by paragraph reordering during fixes, an ambiguous pronoun left dangling after a list of three phenomena, a domain mismatch where a "code generation system" was described as producing "prose," and an unsupported empirical claim that had been softened from a fabricated statistic but still lacked a source. Gemini's structural orientation catches a category of error (broken references, ambiguous antecedents, domain mismatches) that fact-checkers and argument-checkers are not oriented to notice.

The overlap between models was lower than I expected. Most findings were unique to a single model, which means that any single model, no matter how capable, systematically misses a substantial fraction of the errors that exist. The practical implication is that multi-model review is not a luxury or a belt-and-suspenders precaution; it was, for this corpus, a structural requirement for catching the full range of errors in text that sounds confident enough to pass casual inspection.
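The "most findings were unique to a single model" observation is a set computation. A toy sketch with invented finding ids, since the real audit data is not reproduced here:

```python
# Findings per model, keyed by a normalized finding id (illustrative data).
findings = {
    "gpt": {"f01", "f02", "f03", "f04"},
    "opus": {"f03", "f05"},
    "gemini": {"f06", "f07"},
}

def unique_to(model: str) -> set[str]:
    """Findings that no other model reported."""
    others = set().union(*(v for k, v in findings.items() if k != model))
    return findings[model] - others

all_findings = set().union(*findings.values())
unique_share = sum(len(unique_to(m)) for m in findings) / len(all_findings)
for m in findings:
    print(m, sorted(unique_to(m)))
print(f"share of findings unique to one model: {unique_share:.0%}")
```

When the unique share is high, dropping any one reviewer drops a proportional slice of the findings, which is the quantitative form of "multi-model review is a structural requirement."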

The ratio between models also revealed something about what "pedantry" means in practice. GPT flagged roughly 9.5 times as many items as Opus in its Phase 2 pass. Not all of that gap represents genuine errors. Some of it is precision preference: GPT will flag a number as wrong if the source says 20.4% and the post says "roughly 20%," even though the post's phrasing is defensible. The triage layer that feeds the fix process has to distinguish these cases, because treating every GPT flag as a MUST FIX would produce unnecessary churn while ignoring them would miss real problems hiding among the precision calls. In practice, a significant fraction of GPT-only flags were downgraded from MUST FIX to SHOULD FIX during triage, which means the remainder were genuine errors that only the most pedantic reviewer caught. The lesson is not that GPT over-flags. The lesson is that the right response to a pedantic reviewer is triage, not dismissal.
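For numeric claims, the genuine-error-versus-precision-preference distinction can be partly mechanized. A sketch assuming a relative tolerance; the 5% threshold and the mapping to severity labels are my illustration, not the pipeline's actual triage rule:

```python
def triage_numeric_flag(claimed: float, actual: float,
                        hedged: bool, rel_tol: float = 0.05) -> str:
    """Classify a reviewer's numeric flag.

    A hedged claim ("roughly 20%") within rel_tol of the source value is
    a precision preference worth noting, not a factual error; an exact
    match is fine; everything else demands a correction.
    """
    if actual == 0:
        return "MUST FIX" if claimed != 0 else "OK"
    rel_err = abs(claimed - actual) / abs(actual)
    if rel_err == 0:
        return "OK"
    if hedged and rel_err <= rel_tol:
        return "SHOULD FIX"  # defensible phrasing; log it, don't churn
    return "MUST FIX"

print(triage_numeric_flag(20.0, 20.4, hedged=True))   # SHOULD FIX
print(triage_numeric_flag(40.0, 1.7, hedged=False))   # MUST FIX
```

The point of encoding the rule is not that a function settles the question; it is that the threshold becomes an explicit, arguable parameter instead of a mood.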

What I Fixed

Phase 1 of the remediation applied the 85 MUST FIX corrections and approximately 130 SHOULD FIX corrections across all thirty-three posts. Twenty parallel Opus background agents read each post alongside its review file and applied surgical edits: correcting numbers, removing fabricated quotes, recharacterizing misrepresented research, and adding hedging to oversimplified claims. The agents were instructed to preserve the author's voice (sentence length, no em dashes, no compound hyphen inflation) while fixing factual content, and the resulting diffs totaled 238 insertions and 229 deletions, a net change of nine lines across the entire corpus.

Phase 2 sent the corrected posts back through all three models to verify the fixes. The results were humbling. After deduplicating findings across models (GPT-5.4 was the most prolific, Opus and Gemini each caught items the others missed), Phase 2 identified approximately 137 unique MUST FIX items and 157 SHOULD FIX items across twenty-seven posts that had already been "fixed." Some were errors the Phase 1 fix agents had introduced. Most were pre-existing issues that the initial Opus-only review had missed. Six posts came through clean. The other twenty-seven needed more work.

The most instructive finding was how the fix agents failed. They did not simply miss things. They actively introduced new errors, and the pattern of introduction reveals something about how language models handle correction tasks that I had not anticipated.

Post 003's fix agent was told to correct a citation source. It did, accurately. But in the same edit, it added the sentence "The cost savings are measured in production environments, not academic sandboxes." This is false. The source it was correcting is a benchmark paper, not a production measurement. The agent invented a claim that sounded like it belonged in the paragraph, passed it off as part of the correction, and moved on. Nothing in the review had asked for this sentence. The agent generated it because it was trying to be helpful, and being helpful in the context of a factual correction means adding context that makes the correction land better. The problem is that the added context was wrong.

Post 006's fix agent was told that E5 embeddings do not use ReLU activation. The reviewer provided the correct answer (GELU). The agent replaced "ReLU" but also rewrote the surrounding technical explanation, substituting a different wrong detail in the process. The original error was a single wrong word. The fix agent turned it into a rewritten paragraph with a new wrong word, which is worse, because the rewrite looks like it was carefully considered rather than copied from a template.

Post 012's fix agent was told to fix a Buffett attribution. The original post had presented Buffett's unimplemented proposal as his established practice. The fix agent corrected the inversion but embellished the quote with "and withhold your own opinion until hearing both," a clause that does not appear in Buffett's 2019 letter. The agent treated the correction as an opportunity to improve the passage, and the improvement included a fabricated quote extension that reads as though it belongs there. This is the same failure mode that produced the original errors in the blog posts: confident generation of plausible text that happens to be wrong.

The common thread is that the fix agents were given a review file and told to fix the post, which means they were operating in a generative mode. They researched, they elaborated, they added context. Every one of those additions was a new surface for error. The insight that drove the Phase 2 redesign was that correction and generation have different risk profiles even though both involve text generation. Constraining the agent to a manifest does not eliminate generation; it reduces the degrees of freedom, which reduces the surface area for new errors without eliminating it entirely. An agent that is good at generation is not automatically safe at correction, and the safest correction is the one with the fewest opportunities to elaborate.

The Phase 2 fix methodology was redesigned around this failure mode. Instead of giving agents a review file and telling them to fix the post, the process has three stages. First, a triage layer reads all reviewer findings and produces a fix manifest for each post: a structured table listing every finding, its source models, the reviewer's specific correction when provided, and a fix strategy (apply correction, hedge, remove, rephrase as paraphrase). The triage layer also evaluates whether GPT-only flags are genuine errors or precision preferences. Second, the fix agent receives only the manifest, under explicit constraints: never add new factual claims, never add new quotes, never search the web, apply reviewer-provided corrections directly. The distinction matters. Phase 1 agents were told "fix this post"; Phase 2 agents were told "apply these specific changes and nothing else." Third, a cross-post consistency sweep checks seven identified claim clusters (integration counts across two posts, message counts across three, citation years, framework naming, failure rate statistics) to ensure fixes in one post do not contradict a sibling.

Twenty-seven parallel Opus agents applied fixes from manifests, each producing a fix log that traces every change back to a manifest entry. The total diff was 610 insertions and 451 deletions across thirty-two posts, and every Phase 2 MUST FIX item has an auditable trace from reviewer finding to manifest entry to applied edit.

The fix process is iterative by necessity, not by choice. Phase 1 started with 85 MUST FIX items and resolved them. Phase 2 found 137 more, some pre-existing issues that the initial Opus-only review had missed, some introduced by Phase 1's own agents, and resolved those too. Each round is cheaper than the last, because the manifest-based approach generates reusable audit artifacts (fix logs, cross-post consistency maps) that accelerate subsequent rounds even when they cannot be fully automated.

What This Means for the Pipeline

I want to be precise about what went wrong, because the natural reading of "85 factual errors across 33 posts" is that nobody thought to check the facts, and that is not what happened. Ad hoc fact verification existed early. By February 12, a retroactive pass using GPT-5.2 via Codex produced thorough reports for five posts, extracting claims, verifying them against web sources, and following up on contested items. The reports were good. The corrections were applied in a commit timestamped 2:59 AM. The concept worked.

What did not work was the enforcement. The glossary system included a citation tier that was supposed to surface sourcing for unnamed references, but it was optional and inconsistently applied (seven of the glossary files contain zero citation entries). The research phase produces a factual foundation before drafting, but nothing verifies the draft uses that foundation correctly: the model can have the right fact in its context window and still paraphrase it into a fabricated quotation. The writing review skill included a fact-checking perspective, but it was invoked with an optional flag. The publication review skill existed, but was used exactly once. The standalone fact-checking agent was formally created on March 7, after thirty-three posts were already published.

The pipeline had hard gates for voice consistency. The style measurement system, built from quantitative analysis of my own writing across a decade of samples, runs a Python script that returns concrete numbers: sentence length, dependent clause ratio, vocabulary register, punctuation density. If the draft deviates beyond a threshold from my measured baseline, it fails the gate and gets revised. This is not cosmetic polish; it is the mechanism that ensures the posts read like something I would write rather than something a language model would write by default, and it works because the target is defined by data (146 writing samples, not a subjective impression) and enforced by a script (not an instruction that an agent might skip).
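The shape of such a gate is simple to sketch. The metric names, baseline values, and tolerance below are illustrative placeholders, not the real output of measure_style.py:

```python
# Measured baseline from writing samples (values here are invented).
BASELINE = {"avg_sentence_len": 19.0, "dependent_clause_ratio": 0.42}
TOLERANCE = 0.15  # allowed relative deviation from the baseline

def style_gate(measured: dict[str, float]) -> list[str]:
    """Return the metrics that deviate beyond tolerance; empty means pass."""
    failures = []
    for metric, target in BASELINE.items():
        deviation = abs(measured[metric] - target) / target
        if deviation > TOLERANCE:
            failures.append(f"{metric}: {measured[metric]} vs {target}")
    return failures

draft = {"avg_sentence_len": 27.5, "dependent_clause_ratio": 0.44}
failures = style_gate(draft)
print("FAIL" if failures else "PASS", failures)
```

What makes this a gate rather than advice is the return value: a caller can refuse to proceed on a nonempty list, and no agent gets to decide the check was not worth running.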

Factual accuracy had no equivalent gate. Everything on the verification side was advisory: well documented, thoughtfully specified, and completely unenforced. Under velocity pressure (twenty-two posts published in seven days during the peak period, roughly three per day), advisory steps were the first thing to go.

The fix is a mandatory verification gate (Step 1.7 in the pipeline) that runs after drafting and before review, using GPT-5.4 as a first pass to check every quotation, statistic, and attribution against actual sources. The full multi-model review runs before publication for posts with substantial factual claims; the gate is a minimum, not the whole process. But calling it a "fix" overstates the confidence I have in it, because the new gate has the same structural weakness as the old advisory steps: it is text in a skill file, not a script with an exit code. The style measurement is enforced by measure_style.py, which returns numbers that a human or a hook can evaluate. The fact-checking gate, as currently implemented, is enforced by the instruction "do not proceed until all factual claims are verified," which is exactly the kind of instruction that language models follow when they notice it and skip when they are operating under momentum. Whether converting this to a programmatic gate (a script that extracts claims, sends them to Codex, and blocks deployment on failure) is worth the engineering cost is an open question. The audit suggests it is.
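A programmatic version of that gate would look roughly like this sketch. The claim extraction is deliberately naive, the checker is a stub rather than a call to an external model, and every name here is an assumption; the point is the exit code a deploy hook can act on:

```python
import re

def extract_claims(draft: str) -> list[str]:
    """Pull candidate factual claims: quoted strings and numeric sentences."""
    quotes = re.findall(r'"([^"]{10,})"', draft)
    numeric = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft)
               if re.search(r"\d", s)]
    return quotes + numeric

def verify(claim: str) -> bool:
    """Stub: a real gate would send the claim to an external checker."""
    return "34%" not in claim  # pretend the 34% figure is unsourced

def gate(draft: str) -> int:
    """Nonzero return blocks deployment; a real script would sys.exit() it."""
    failures = [c for c in extract_claims(draft) if not verify(c)]
    for claim in failures:
        print(f"UNVERIFIED: {claim}")
    return 1 if failures else 0

exit_code = gate("Lean Startup methodology reduces failure rates by 34%.")
print("exit code:", exit_code)  # a hook blocks publication on nonzero
```

The contrast with the current skill-file instruction is the failure mode: an agent operating under momentum can skip a sentence it did not notice, but it cannot skip a nonzero exit status that the deployment hook refuses to ignore.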

The Circularity

There is an irony here that I want to name rather than leave implicit. This audit was conducted by AI models reviewing posts written by AI, and the corrections were applied by AI agents reading reviews generated by AI. Post 009 in this corpus is literally titled "AI Evaluating AI: The Circularity Problem," and it argues that AI evaluation circularity is "manageable in the engineering sense but not solvable in the epistemological sense." The audit is a concrete instance of that claim: it managed the problem (found and fixed real errors) without solving it (the fixes themselves require verification by additional models, which require verification by additional models, which is exactly the regress the post describes). This post was subjected to the same multi-model review process it recommends; the numerical inconsistencies in an earlier draft were caught by the review panel, which is either reassuring or recursive depending on your tolerance for irony.

The fact that the audit works, that 85 MUST FIX items were identified and corrected in Phase 1 and another 137 in Phase 2, is evidence that review across multiple models has practical value despite the circularity. The fact that Phase 1 fix agents introduced new errors while fixing old ones is evidence that the circularity is not merely philosophical but operational: the correction mechanism is subject to the same failure modes as the generation mechanism. Both things are true. I have not reconciled them, and I suspect the reconciliation, if it exists, looks less like a logical proof and more like an engineering practice: you verify until diminishing returns, you constrain the fix process so that agents apply corrections rather than generating new claims, and then you ship, knowing that the next reader who checks a citation might find something the models missed.

The uncomfortable question is whether the quality of the writing mattered. These posts were stylistically precise, structurally rigorous, and factually unreliable, which means the pipeline optimized for the wrong thing, or rather, it optimized for what was measurable (voice consistency, sentence architecture) and neglected what was hard to measure (whether the claims that sounded confident were actually true). That failure mode is not specific to AI writing. It is the failure mode of any system that measures quality by proxy rather than by substance. The audit caught it. Whether the audit caught all of it is a question that only the next audit can answer.


Comments

Comments are available on the static tier. Agents can use the API directly: GET /api/comments/037-the-model-generation-audit