The GPT Pro Knockoff: What 256 Judgments Found

ChatGPT Pro is documented as access to GPT-5.5 Pro with deeper reasoning controls; OpenAI's published surface describes a single model with additional thinking-time options that the standard product does not expose, and does not disclose any internal multi-agent debate. The choice that is visible to the user is depth rather than breadth: more time per call rather than more concurrent calls. The result is a recognizably different shape of output, heavier on synthesis than any of the streamed reasoning models, slow enough that you go make tea while you wait for it, and good enough on hard research questions that the workspace had been quietly relying on it for several months when the question of replacing it came up.

The replacement question was practical rather than theoretical. ChatGPT Pro lives in the browser, accessed by a Chrome CDP harness that occasionally drifts from the actual web interface and sometimes drops outputs when the session times out. It is not scriptable in the way the rest of the workspace automation is scriptable, which means cron jobs that want a deep second opinion either pay the friction tax of CDP scraping or settle for a shallower model. The right shape of an answer would be a CLI tool that produced output of comparable quality without requiring a browser, ideally driven by GPT-5.5 xhigh via the same Codex CLI surface that already powers the workspace's other delegated reasoning tools.

A single xhigh call is not the same shape as ChatGPT Pro. The Mixture-of-Agents (MoA) pattern proposed in Wang et al. 2024 (Duke / Together AI / Stanford / Chicago) offered an alternative framing: instead of one model thinking longer, multiple agents think in parallel and their outputs are aggregated through one or more synthesizer layers. The paper's actual architecture is layered (proposers feed an aggregator that can itself feed a higher-layer aggregator), but this build was inspired by a simplified one-layer version (parallel agents plus one synthesizer). The paper's headline result is that a multi-agent open-source ensemble reached 65.1% on AlpacaEval 2.0, against GPT-4 Omni's 57.5%, despite each individual proposer model being weaker than the model it was beating. The pattern has empirical support on instruction-following benchmarks. Whether it would transfer to the workspace's actual high-stakes decisions, where the ground truth a real reader would care about is whether an answer was useful for a specific contextual choice, was the question the build eventually set out to answer.

Two flavors of the knockoff got built, and the eval is the story of comparing them against each other and against ChatGPT Pro itself.

Two Architectures

The first flavor is the simpler one: four GPT-5.5 xhigh agents that work blind, then a separate synthesizer that integrates their outputs after they have all finished. Each agent gets a different persona prompt (skeptic, architect, risk-analyst, empiricist) designed to cover non-overlapping cognitive ground rather than rely on temperature noise applied to the same prompt. The synthesizer reads all four outputs and produces a five-section integrated answer (consensus, disagreements with adjudication, open questions, final recommendation, and an attribution map linking each major claim back to the persona that originated it). The pattern is roughly five times the cost of a single xhigh call, and the wall clock is dominated by the slowest agent plus the synthesis step (typically nine minutes for moderate queries). The tool is called Codex Council.
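The blind fan-out described above can be sketched in a few lines. This is a minimal illustration rather than the Codex Council source: `run_council`, `fake_agent`, and `fake_synth` are hypothetical names, and the two stand-in functions replace the real model calls, which are not shown in this post.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Persona names as listed in the post.
PERSONAS = ("skeptic", "architect", "risk-analyst", "empiricist")


def run_council(query: str,
                run_agent: Callable[[str, str], str],
                synthesize: Callable[[str, Dict[str, str]], str]) -> str:
    """Fan the query out to every persona in parallel, then synthesize once."""
    with ThreadPoolExecutor(max_workers=len(PERSONAS)) as pool:
        futures = {p: pool.submit(run_agent, p, query) for p in PERSONAS}
        # Agents are blind: nothing is shared until every future has resolved.
        outputs = {p: f.result() for p, f in futures.items()}
    return synthesize(query, outputs)


# Stand-ins for the real model calls (hypothetical, for illustration only).
def fake_agent(persona: str, query: str) -> str:
    return f"[{persona}] position on: {query}"


def fake_synth(query: str, outputs: Dict[str, str]) -> str:
    return "\n".join(outputs[p] for p in PERSONAS)


result = run_council("ship the CLI?", fake_agent, fake_synth)
```

The wall clock of this shape is the slowest agent plus the synthesis step, which matches the timing profile described above.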

The persona design is the load-bearing piece of the architecture. The skeptic looks for the strongest case against the obvious answer, the architect designs the cleanest path through the problem, the risk-analyst inventories failure modes and second-order effects, and the empiricist grounds the question in established evidence and identifies cheap decisive tests. The pairs are deliberately in structural tension: skeptic and architect almost always disagree on whether to do something, while risk-analyst and empiricist sometimes disagree about whether a flagged risk is empirically calibrated. The synthesizer's job is adjudication rather than averaging, which is why the synthesis prompt explicitly references the typical structural conflicts between specific personas and asks for reconciliation rather than aggregation.

The second flavor, GPT Max, adds coordination. Same four personas, same xhigh effort, same synthesizer, but the agents exchange draft summaries between rounds via Hook Communications (HCOM), a tool that gives agents a shared SQLite event log and named addressing for sending messages to each other. The protocol is a six-phase template wrapped around each persona prompt: DRAFT, BROADCAST, LISTEN, REFINE, WRITE, EXIT. Agents work blind through DRAFT, then broadcast a three-to-five sentence summary of their position, then call hcom listen 180 to receive peer drafts, then refine their position with peer reasoning explicitly named in the final output, then write their final position, then disconnect via hcom stop. The synthesizer still runs after all four agents have finished. The pattern is roughly thirteen times the cost of a single xhigh call, and the wall clock adds about a minute of coordination overhead on top of the council's nine.
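One way to picture the six-phase template is as a fixed list of instructions appended to each persona prompt. The sketch below is an assumption about shape, not the actual GPT Max template: the phase names and the `hcom listen 180` and `hcom stop` commands come from the description above, while `wrap_persona_prompt` and the instruction wording are hypothetical.

```python
# Phase names are from the post; the instruction wording is illustrative.
PHASES = [
    ("DRAFT", "Work blind and draft your position."),
    ("BROADCAST", "Send peers a three-to-five sentence summary via HCOM."),
    ("LISTEN", "Run `hcom listen 180` to receive peer drafts."),
    ("REFINE", "Refine your position and name the peer reasoning you used."),
    ("WRITE", "Write your final position."),
    ("EXIT", "Disconnect with `hcom stop`."),
]


def wrap_persona_prompt(persona_prompt: str) -> str:
    """Append the ordered phase instructions to a persona prompt."""
    steps = "\n".join(
        f"{i}. {name}: {text}" for i, (name, text) in enumerate(PHASES, 1)
    )
    return f"{persona_prompt}\n\nFollow these phases in order:\n{steps}"


prompt = wrap_persona_prompt("You are the skeptic.")
```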

The HCOM integration was where most of the Phase 2 build time went. Codex CLI does not expose mid-turn hook injection the way Claude Code does, which meant the original plan of "messages arrive automatically between tool calls" was not implementable in the literal form intended. The fallback was structured polling: agents must call hcom listen explicitly to receive peer messages, which produces a discrete coordination model rather than the continuous-message vision. The verification round (a series of two-agent smoke tests confirming that HCOM works with codex-cli at all, that messages cross between agents within seconds, and that the sandbox configuration needs danger-full-access rather than the default workspace-write sandbox) took the better part of a day. The polling model turned out to be enough: agents read peer drafts, sharpened disagreements, and produced final outputs that cited specific peer reasoning by attribution.

Both architectures share an output directory structure so a single run from either produces files that can be compared directly. The synthesizer module is the same code, parameterized on which set of persona files to read. The CLI flag --synth gpt55|opus|both lets either architecture produce a GPT-5.5 synthesis, an Opus synthesis, or both for direct comparison. That last mode is what made the Phase 3 eval possible without doubling the agent-run cost.
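A flag like this is straightforward to express with argparse. The sketch below is a guess at the shape, not the tool's actual argument parser: the program name `council`, the positional `query` argument, and the helper `synthesizers_for` are all hypothetical, and only the flag name and its three values come from the text.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="council")  # hypothetical program name
    parser.add_argument("query")
    # Flag name and choices are from the post; making it required is an assumption.
    parser.add_argument("--synth", choices=["gpt55", "opus", "both"], required=True)
    return parser


def synthesizers_for(choice: str) -> List[str]:
    """Expand 'both' so one set of agent outputs feeds two synthesis passes."""
    return ["gpt55", "opus"] if choice == "both" else [choice]


args = build_parser().parse_args(["Should we ship?", "--synth", "both"])
selected = synthesizers_for(args.synth)
```

Expanding `both` after the agents have finished is what keeps the agent-run cost flat: the expensive fan-out happens once and only the synthesis step is repeated.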

The Eval

Phase 3 ran eight queries across five architectures. The four ensemble architectures are the cross of {Codex Council, GPT Max} with {GPT-5.5 synth, Opus synth}, labeled 1A, 1B, 2A, 2B. The fifth architecture is ChatGPT Pro single-call, labeled 3, which was added after the initial four-architecture run had completed and a Chrome CDP login made it possible to script Pro queries against the live web interface.

The eight queries were chosen to cover six query categories drawn from the eval methodology document: architectural decision (tool packaging defaults), strategic direction (three-month focus), risk assessment of a workspace tool (GPT Max failure modes, self-referential), empirical evaluation design (the publication-review experiment), contested truth (multi-model review value), and reframe-eligible (game project prioritization). Two wiki queries (AI productivity evidence in software engineering, multi-agent coordination taxonomy) were included specifically to test whether the architecture preferences shifted on substantive research questions rather than workspace decisions. The wiki queries differ from the other six in asking for citation-rich output that covers many primary sources broadly, rather than a decision adjudicated against workspace context.

Each ensemble architecture ran each query once with --synth both, and ChatGPT Pro ran each query once as a single call. That produced thirty-two ensemble synthesis files (eight queries by four ensemble variants 1A, 1B, 2A, 2B) plus eight ChatGPT Pro outputs (one per query). The total cost was roughly two hundred xhigh equivalents plus sixteen Opus synthesizer calls plus eight Pro calls, with a wall clock of about four hours running the architectures sequentially.

Judging was blind A/B pairwise with position-bias mitigation. The judges had no information about which architecture produced which output and were asked to evaluate on five criteria in priority order: identification of load-bearing considerations, surfacing of real disagreement rather than diversity-averaging, calibration of the final recommendation, auditability of the reasoning, and surfacing of non-obvious open questions. The output schema was WINNER (A, B, or TIE), CONFIDENCE (high, medium, low), with a brief reasoning section (two to four sentences).
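A judgment in that schema is easy to parse mechanically. The sketch below assumes a `FIELD: value` line layout and a `REASONING:` label, neither of which is specified above; only the field names WINNER and CONFIDENCE and their allowed values come from the schema. `Judgment` and `parse_judgment` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Judgment:
    winner: str       # "A", "B", or "TIE"
    confidence: str   # "high", "medium", or "low"
    reasoning: str


def parse_judgment(text: str) -> Judgment:
    """Parse one judge response; the line layout is an assumed format."""
    winner = re.search(r"^WINNER:\s*(A|B|TIE)\s*$", text, re.M | re.I)
    conf = re.search(r"^CONFIDENCE:\s*(high|medium|low)\s*$", text, re.M | re.I)
    if not winner or not conf:
        raise ValueError("judgment is missing WINNER or CONFIDENCE")
    reasoning = re.search(r"^REASONING:\s*(.+)", text, re.M | re.S | re.I)
    return Judgment(
        winner=winner.group(1).upper(),
        confidence=conf.group(1).lower(),
        reasoning=reasoning.group(1).strip() if reasoning else "",
    )


sample = """WINNER: B
CONFIDENCE: high
REASONING: B names the disagreement and adjudicates it."""
parsed = parse_judgment(sample)
```

Rejecting responses that lack either required field, rather than defaulting them to TIE, keeps malformed judge output from silently counting as a result.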

All eight pair contests ran the same position protocol: each pair was judged on each of the eight queries by each of the two judges in two position orderings (A-then-B and B-then-A). That produces 8 queries × 2 judges × 2 positions = 32 unique judgments per pair, and 8 pairs × 32 = 256 total judgments across the eval. The raw judgments TSV on disk has 454 rows, but 198 of those are duplicate writes of the same underlying judgment file (the eval runner appended rerun rows to the TSV without overwriting), so the working count after deduplicating by raw judgment file is 256. The duplicates are evenly distributed across the six ensemble pairs (each ensemble pair has roughly 33 GPT-5.5 rows and 32 Opus rows that collapse to 16 unique judgments per judge); the two ChatGPT Pro pairs have no duplicates because they were run after the deduplication issue had been worked around. All numbers in the rest of this post (aggregate wins, high-confidence wins, position-flip rates) are computed against the deduplicated set; an earlier summary.md of this same eval that reported pre-deduplication counts contains different numbers, and the deduplicated counts here supersede that earlier summary.
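The arithmetic and the deduplication step can be checked in a few lines. This is a sketch under assumptions: the column name `judgment_file` and the sample file names are hypothetical, and the three sample rows are invented to show the collapse, not taken from the TSV.

```python
def dedupe(rows, key="judgment_file"):
    """Keep the first row seen for each raw judgment file."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


per_pair = 8 * 2 * 2          # queries x judges x position orderings
expected_total = 8 * per_pair  # eight pair contests
rows_after_dedup = 454 - 198   # raw TSV rows minus duplicate writes

# Hypothetical rows: the second is a rerun append of the first.
rows = [
    {"judgment_file": "q1_1A_1B_opus_AB.txt", "winner": "B"},
    {"judgment_file": "q1_1A_1B_opus_AB.txt", "winner": "B"},
    {"judgment_file": "q1_1A_1B_opus_BA.txt", "winner": "A"},
]
unique_rows = dedupe(rows)
```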

Architecture 3 was not compared against the GPT-5.5-synth ensemble variants because those were already known from the first round of pair contests to be the weaker variants of each architecture; pairing Pro against them would have wasted Pro-query budget on known-weaker baselines.

One caveat travels with every ChatGPT Pro number in the eval. The ChatGPT Pro MCP server's output extractor uses innerText rather than HTML parsing that preserves markdown, which drops the <a href> attributes from Pro's hyperlinked citations. The result is that Pro's references render as plain text labels (e.g., "METR" instead of [METR/Becker 2025](https://arxiv.org/abs/2507.09089)), while the council and GPT Max architectures preserve URLs as proper markdown links via their persona outputs. This may have biased judges against Pro, most of all on the wiki queries, where citation density and link followability matter most to the judging criteria. The asymmetry is present in every Pro contest, though, not only the wiki queries: Pro outputs throughout the eval carry the missing-citation artifact, and the artifact may also partially de-blind the contest (a judge who notices it can infer which side is Pro). Rerunning with an extractor that preserves markdown would be a more apples-to-apples comparison but is deferred.
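To make the difference concrete, here is a small sketch of a link-preserving extraction using Python's standard html.parser. It is not the MCP server's extractor (which runs in the browser) and not a proposed patch; `LinkPreservingExtractor` and `to_markdown` are hypothetical names, and the HTML snippet is invented markup wrapped around the citation mentioned above.

```python
from html.parser import HTMLParser


class LinkPreservingExtractor(HTMLParser):
    """Collect text, rewriting <a href=...>label</a> as [label](href)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            if self._href:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            if self._href:
                self.parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html: str) -> str:
    parser = LinkPreservingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = '<p>See <a href="https://arxiv.org/abs/2507.09089">METR/Becker 2025</a>.</p>'
md = to_markdown(html)
```

A text-only extraction of the same snippet would keep the label and lose the URL, which is the asymmetry the caveat describes.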

What the Numbers Said

The eval was designed to answer three orthogonal questions: does the synthesizer choice matter, does coordination beat blind ensemble, and does the knockoff approach the real ChatGPT Pro. Each question maps to a specific subset of the pair contests, and the answers are not all the same shape.

The synthesizer is the dominant variable

The cleanest test in the matrix holds the architecture constant and varies the synthesizer: take the same four agent outputs and ask whether GPT-5.5 or Opus produces a better integration. The two pair contests that isolate this question are 1A vs 1B (Codex Council with each synth) and 2A vs 2B (GPT Max with each synth).

Pair	Total	Wins	Opus %
1A (cc/gpt55) vs 1B (cc/opus)	32	1B 28, 1A 4	87.5%
2A (gm/gpt55) vs 2B (gm/opus)	32	2B 30, 2A 2	93.75%

Opus synthesis won 58 of 64 judgments on same-architecture comparisons, a 90.6% win rate that holds across both architectures and is the largest effect in the entire eval. The high-confidence wins skew the same direction: five high-confidence judgments for 1B against zero for 1A, and four for 2B against zero for 2A. Both judges, Opus and GPT-5.5, preferred the Opus-synthesized outputs at rates well above a 50/50 null. The replication across two architectures is doing work here: a single 87.5% finding from one same-architecture contest would have been a single data point, while the consistent 87.5% / 93.75% split across two structurally different pipelines is what licenses calling the synthesizer "the dominant variable" rather than "the dominant variable in this one architecture."
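For a rough sense of how far 58 of 64 sits from a coin flip, the exact binomial tail can be computed directly. This is a back-of-envelope sketch, not part of the eval's own analysis, and it treats the 64 judgments as independent. They are not (the same eight queries appear under two judges and two orderings), so the tail probability overstates the evidence and is best read as an upper bound on how surprised to be.

```python
from math import comb


def binomial_tail(wins: int, n: int, p: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1)
    )


opus_wins = 28 + 30   # 1B wins plus 2B wins, from the table above
total = 32 + 32
win_rate = opus_wins / total
tail = binomial_tail(opus_wins, total)
```

Collapsing to the eight queries as the unit of analysis would be the more conservative test, at the price of much less power.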

The mechanism is interpretive rather than mechanical. The synthesizer's job is adjudication of disagreements between strong arguments, and Opus appears to do that work more cleanly than GPT-5.5 on queries that turn on judgment. The output structure (consensus, disagreements with adjudication, open questions, final recommendation, attribution map) is the same; what changes is how cleanly each section gets populated. GPT-5.5 syntheses tend to flatten disagreements into hedged consensus statements where Opus would have named the disagreement and adjudicated it; they tend to produce shallower attribution maps; and they show less calibrated confidence on the final recommendation. These effects compound, and the eval cannot distinguish which of them is doing most of the work. They are also specific to one persona shape: both 1A/1B and 2A/2B used the same four-persona output set, so what the eval actually measured is whether Opus or GPT-5.5 synthesizes these persona outputs better. Whether the same synthesizer gap shows up on, e.g., temperature-noise ensembles or different persona compositions is untested.

The implication for the workspace default is direct: the cheap step that produces the largest quality gain is flipping --synth gpt55 to --synth opus. Both architectures should default to Opus synthesis, with GPT-5.5 synthesis available for eval comparisons and for applications where cost matters more than quality. The cost of the flip is one Opus call per query instead of one GPT-5.5 xhigh call, which is a fixed Opus pricing increment on top of the existing four-xhigh agent cost. For a tool that already costs five times xhigh per invocation, the increment is modest for a measurably larger quality gain.

Coordination did not beat the blind ensemble

The coordination question reverses the setup, holding the synthesizer constant and varying the architecture: with the synthesizer fixed at Opus, does the coordinated GPT Max produce better outputs than the blind Codex Council? This contest is 1B vs 2B.

Pair	Total	1B (council)	2B (GPT Max)
1B vs 2B	32	16	16

The aggregate win count is exactly 16-16. The high-confidence wins favor 1B four to zero. In the synthesizer section the high-confidence subsample skewed the same direction as the win-count majority; read the same way here, it would suggest the blind council outputs land more sharply than the coordinated ones on the queries where judges were confident. This section does not call the result a confident preference for 1B because the GPT-5.5 judge flips on three of the eight queries in this contest under position swap, and the Opus judge flips on one; some of the high-confidence judgments are therefore inconsistent under position swap, and the eval methodology treats those query-pairs as inconclusive. The honest reading is that the architectures are exactly tied at the aggregate level, with the high-confidence subsample leaning toward blind council on the queries where judges were stable.

Per query, the contest is uneven in a way the aggregate hides. Each query received four judgments (two judges by two positions), and the cells below show those four judgments split between the two architectures:

Query	Category	1B (council)	2B (GPT Max)	Winner
Q1: tool packaging	Architectural	0	4	2B 4-0
Q2: 3-month focus	Strategic	4	0	1B 4-0
Q3: GPT Max failure modes	Risk (self-ref)	2	2	split
Q4: publication-review experiment	Empirical	1	3	2B 3-1
Q5: multi-model review value	Contested	3	1	1B 3-1
Q6: game project prioritization	Reframe	2	2	split
Q7: AI productivity (wiki)	Empirical/Contested	0	4	2B 4-0
Q8: multi-agent coordination (wiki)	Architectural/Contested	4	0	1B 4-0

The pattern that motivated the GPT Max build (coordinated agents systematically beating blind agents) does not appear. No query category lines up cleanly with one architecture: Q1 (architectural) goes to GPT Max, Q8 (also architectural) goes to Codex Council; the two wiki queries split in opposite directions; the contested-truth queries split as well. The aggregate is a coin flip because the per-query results are themselves split, not because the architectures are systematically equivalent on each query.
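The gap between the aggregate and the per-query picture shows up directly when the table cells are tallied. The numbers below are copied from the per-query table; the variable names and the "decided by two or more votes" cutoff are illustrative choices, not part of the eval methodology.

```python
# (query, 1B votes, 2B votes), copied from the per-query table.
CELLS = [
    ("Q1", 0, 4), ("Q2", 4, 0), ("Q3", 2, 2), ("Q4", 1, 3),
    ("Q5", 3, 1), ("Q6", 2, 2), ("Q7", 0, 4), ("Q8", 4, 0),
]

total_1b = sum(a for _, a, _ in CELLS)
total_2b = sum(b for _, _, b in CELLS)

# Queries where the four judgments were evenly split.
even_splits = [q for q, a, b in CELLS if a == b]
# Queries with a clear winner (margin of two or more votes).
decided = {q: ("1B" if a > b else "2B") for q, a, b in CELLS if abs(a - b) >= 2}
unanimous = [q for q, a, b in CELLS if abs(a - b) == 4]
```

Six of the eight queries have a clear winner, three for each side, which is how a 16-16 aggregate coexists with four unanimous cells.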

The Phase 2 plan had been built around the loose hypothesis that coordination matters, with the MoA paper as a motivating-but-not-strictly-licensing reference. MoA's actual headline comparison is layered ensemble aggregation against single-model baselines on instruction-following benchmarks; the specific comparison "coordinated ensemble versus blind parallel ensemble" is not what the paper centrally tests. The build proceeded anyway on the prior that peer-aware refinement (skeptic challenging empiricist mid-task, synthesizer integrating that exchange with attribution) would deliver the kind of structural advantage that MoA's layered aggregation gains over single-shot generation. The eval data does not support this prior on the eight queries tested: the 2.6× cost increase from council to GPT Max (5× xhigh to 13× xhigh) produces no detectable quality benefit at the aggregate level.

The narrower reading available in the per query data is that some queries do appear to benefit from coordination (Q1, Q4, Q7 all favor GPT Max with 4-0 or near-4-0 splits) and some are made worse by it (Q2, Q5, Q8 all favor Codex Council with 4-0 or near-4-0 splits). The 4-0 cells are not coin flips; they are unanimous-judge results that suggest a real per-query architecture effect the eval lacks the statistical power to characterize. With only eight queries spanning six categories, there are not enough samples per shape to license a "use coordination for architectural decisions, blind ensemble for strategic direction" rule. The honest reading is that the per query results contain signal that an 8-query eval cannot resolve into a confident rule, and the operational default should be the cheaper blind ensemble unless the specific query shape has a known reason to benefit from refinement.

The knockoff beat the original on workspace queries

The two ChatGPT Pro contests are 1B vs 3 (Codex Council with Opus synth versus ChatGPT Pro) and 2B vs 3 (GPT Max with Opus synth versus ChatGPT Pro). Each contest had thirty-two judgments (eight queries by two judges by two positions).

Pair	Total	Ensemble wins	ChatGPT Pro wins
1B vs 3 (cc/opus vs ChatGPT Pro)	32	28	4
2B vs 3 (gm/opus vs ChatGPT Pro)	32	26	6

Both ensemble architectures beat ChatGPT Pro at rates above 80%. Across all 64 judgments against the two ensemble architectures, ChatGPT Pro had zero high confidence wins, while the ensembles collectively had twenty-five. One framing matters here: the "knockoff" is not a vendor swap. Codex Council and GPT Max use OpenAI GPT-5.5 xhigh for the four agent runs and Anthropic Opus for the synthesis layer; ChatGPT Pro uses OpenAI's deeper-reasoning Pro variant. The comparison is therefore not "alternative vendor beats incumbent" but "ensemble of OpenAI agents with cross-vendor Anthropic synthesis beats a single deeper OpenAI call." The synthesizer dominance finding above suggests the Anthropic synthesizer is doing much of the visible work.

Per query, the result sharpens further. Against Codex Council with Opus synth, ChatGPT Pro won exactly one query (Q7, the AI productivity research question, 4-0); the council won the other seven. Against GPT Max with Opus synth, ChatGPT Pro won one query outright (Q8, the multi-agent coordination wiki query, 3-1) and tied on Q3 (the self-referential failure-modes query, 2-2); GPT Max won the other six. The five workspace decision queries (Q1, Q2, Q4, Q5, Q6, excluding the self-referential Q3) were a complete sweep for the ensemble architectures in both contests.

The wiki queries are where ChatGPT Pro shows up, and the pattern across the two Opus-synth ensemble contests (the only ensemble variants Pro was compared against) is specific. Pro beats whichever of those two ensemble variants lost the 1B-vs-2B contest on that wiki query: on Q7 (where GPT Max beat council in the 1B-vs-2B matchup), Pro beats council 4-0; on Q8 (where council beat GPT Max), Pro beats GPT Max 3-1. Pro never beats the better of the two Opus-synth ensemble variants on either wiki query. The cleanest reading is that ChatGPT Pro is roughly equivalent to whichever Opus-synth ensemble variant happens to be weaker on a substantive research query, and loses to whichever is stronger. The eval does not compare Pro against the GPT-5.5-synth ensemble variants, so claims about Pro vs ensembles in general should be scoped to the Opus-synth comparison.

This is, in fairness, a narrow result. Two wiki queries is not enough to license a confident generalization about Pro's relative strength on research-shaped content. The citation-stripping caveat above also applies: the wiki queries are where citation density matters most to the judging criteria, and Pro's hyperlinked citations rendered as plain text labels almost certainly hurt its scores on this exact query type. The findings document treats Pro's performance on wiki queries as "competitive only with the underdog council variant" rather than as a stronger generalization, and that framing is what the data licenses. The knockoff premise (that four concurrent xhigh agents plus Opus synthesis would match or beat ChatGPT Pro on workspace decisions) is supported by these numbers more strongly than I expected when I designed the eval, even after acknowledging that the citation-stripping artifact may have inflated the ensemble's workspace-query margins by an unknown amount. The build was worth doing.

Judge position flip rates

Position flip in this eval is best measured per (pair, judge, query): if the same judge produces different winners on the A-then-B and B-then-A presentations of the same query, that query-pair-judge cell is flagged for that judge. Aggregating across queries for each (pair, judge) gives the count of flipped queries per pair per judge:

Pair	GPT-5.5 flips	Opus flips
1A vs 1B	0/8	0/8
1A vs 2A	6/8	3/8
1A vs 2B	1/8	0/8
1B vs 2A	3/8	0/8
1B vs 2B	3/8	1/8
1B vs 3	0/8	0/8
2A vs 2B	2/8	0/8
2B vs 3	0/8	2/8
Total	15/64 (23.4%)	6/64 (9.4%)
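The flip count defined above can be computed as follows. This is a sketch of the definition, not the eval runner's code: `count_flips` and the tuple layout are hypothetical, and the four sample rows are invented for illustration. One detail matters in practice: the winner has to be recorded by architecture label (for example "1B") rather than by slot letter, otherwise a perfectly consistent judge would look like it flipped on every cell.

```python
from collections import defaultdict


def count_flips(judgments):
    """judgments: iterable of (pair, judge, query, order, winning_architecture)."""
    cells = defaultdict(dict)
    for pair, judge, query, order, winner in judgments:
        cells[(pair, judge, query)][order] = winner

    flips = defaultdict(int)
    for (pair, judge, _query), by_order in cells.items():
        # A cell flips when both orderings exist and name different winners.
        if len(by_order) == 2 and len(set(by_order.values())) > 1:
            flips[(pair, judge)] += 1
    return dict(flips)


# Invented sample rows, not eval data.
sample = [
    ("1B vs 2B", "gpt55", "Q3", "AB", "1B"),
    ("1B vs 2B", "gpt55", "Q3", "BA", "2B"),  # winner changed with order
    ("1B vs 2B", "opus", "Q3", "AB", "2B"),
    ("1B vs 2B", "opus", "Q3", "BA", "2B"),   # stable under swap
]
flips = count_flips(sample)

# Totals from the table: flipped cells out of 8 pairs x 8 queries per judge.
gpt55_rate = (0 + 6 + 1 + 3 + 3 + 0 + 2 + 0) / 64
opus_rate = (0 + 3 + 0 + 0 + 1 + 0 + 0 + 2) / 64
```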

GPT-5.5 flips on roughly a quarter of the query-pair cells; Opus flips on about a tenth. This is consistent with the workspace's 2026-04-29 internal investigation into LLM judge position bias: GPT-5.5 has higher position bias as a judge than Opus, though not at the extreme rate an earlier (incorrectly aggregated) summary of this eval had suggested. GPT-5.5 flips most on 1A vs 2A (six of eight), then on 1B vs 2A and 1B vs 2B (three of eight each). 1A vs 2A and 1B vs 2A are the pairs with the largest cross-architecture quality differential paired with the weaker synthesizer; the flip pattern there is consistent with a judge that struggles to apply criteria consistently when one output is materially weaker than the other but formatted similarly. 1B vs 2B matches that three-of-eight rate but is the contest that ended in an exact tie. Opus flips most on 1A vs 2A (three of eight) and 2B vs 3 (two of eight), where the inter-output gap is also at the boundary of judge resolution.

The practical recommendation from this is that future evals should weight Opus judgments more heavily than GPT-5.5 judgments on close pairs, and add a third-provider judge (Gemini 3.1 Pro is the most obvious candidate) to provide a tiebreaker rather than relying on a two-model panel. The current eval treats flipped query-pair-judge cells as inconclusive for that judge, which preserves the Opus signal on the 1B vs 2B contest (where Opus flipped on only one query) while discounting the GPT-5.5 signal on the three queries where GPT-5.5 flipped. The 16-16 tie at the aggregate level survives this treatment.

What This Means for the Workspace Default

The eval was designed to produce a decision rule rather than to prove any architecture optimal, and the rule it produces is not the one the build was originally trying to validate.

The first finding (the Opus synthesizer beats the GPT-5.5 synthesizer in 90.6% of same-architecture judgments) is the single largest leverage point in the eval. Both Codex Council and GPT Max should default to --synth opus, with --synth gpt55 available for eval comparisons and for applications where cost matters more than quality; the cost case is the one the synthesizer section already made, a modest fixed increment on a tool that already costs five times xhigh per invocation.

The second finding (coordination does not beat blind ensemble at the aggregate level on this query distribution) means the GPT Max architecture should not be the workspace default for high-stakes decision queries. Codex Council with Opus synth produces outputs the judges could not separate from GPT Max's in aggregate, at roughly 38% of the cost (5× versus 13× xhigh), and the high-confidence subsample lean (4-0 favoring blind council) is enough of a signal to prefer the cheaper option when the choice is otherwise a coin flip. The GPT Max architecture remains worth keeping for specific use cases where coordination plausibly matters, such as queries with active methodological disagreement where the peer-aware refinement step lets the empiricist challenge the skeptic's confidence and the synthesizer integrates the result with attribution. The 4-0 wins for GPT Max on Q1 and Q7 are evidence that some query shapes do benefit from coordination, even though the eight-query eval lacks the resolution to identify those shapes systematically. The operational rule is to start with Codex Council and escalate to GPT Max only when the council output feels like the agents talked past each other.

The third finding (the knockoff beat ChatGPT Pro on workspace queries) supports the original premise of the build for the use cases the workspace actually has. The default for high-stakes workspace decisions should be Codex Council with Opus synth rather than ChatGPT Pro. The Pro browser harness should be reserved for queries where citation-dense web search across many primary sources is the actual goal, which is the cell where the wiki queries showed Pro competitive with the next best ensemble variant.

The combination of these three findings produces a workspace routing rule that maps query shape to method: decision queries to Codex Council with Opus, contested-evidence queries with active disagreement to GPT Max with Opus, broad survey research to ChatGPT Pro. The Phase 2 build is not retired; it is repositioned from "the new workspace default" to "the niche tool for the cases where coordination plausibly matters." The Phase 1 Codex Council, originally framed as the baseline, becomes the default. The synthesizer choice, originally treated as a flag the user could pick at invocation, becomes the part of the configuration that was always going to dominate.
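Written down as code, the routing rule is a three-entry lookup with a default. The shape labels and the `route` function are hypothetical names chosen for this sketch; the mapping itself and the choice of the cheaper blind ensemble as the fallback come from the findings above.

```python
# Query shape -> method, as stated in the routing rule above.
ROUTES = {
    "decision": "Codex Council + Opus synth",
    "contested-evidence": "GPT Max + Opus synth",
    "survey-research": "ChatGPT Pro",
}


def route(query_shape: str) -> str:
    """Pick a method; unrecognized shapes fall back to the cheaper blind ensemble."""
    return ROUTES.get(query_shape, ROUTES["decision"])


chosen = route("contested-evidence")
fallback = route("something-else")
```

Classifying a query into one of these shapes is the hard part and is left to the person invoking the tool; the eval does not supply a classifier.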

The uncomfortable observation is that I built the coordination machinery before I evaluated which variable was actually doing the work. If the synth choice eval had run first, on the Phase 1 architecture only, the 87.5% same-architecture finding would have been a single data point, strong but not enough by itself to license calling the synthesizer "the dominant variable." The 93.75% same-architecture replication on the Phase 2 architecture is what turned a single finding into a generalization. So the GPT Max build was not strictly wasted (it produced the replication that strengthens the synth-choice claim), but the coordination layer itself, considered on its own merits, is interesting infrastructure that did not produce the lift it was designed to produce. The leverage on output quality came from the integration step. I spent the budget on the input pipeline when the output pipeline was the variable that mattered.

What the Eval Did Not Settle

Three open questions remain after the eval, and they are the candidates for the next round of measurement rather than the conclusions of this one.

The null baseline question is the most consequential and the most embarrassing not to have answered. The eval compares ensembles against ensembles and ensembles against ChatGPT Pro, but it never compares ensembles against a single GPT-5.5 xhigh call. Does a single deeper call produce 80% of the value of a four-agent ensemble at roughly a fifth of the cost? If it does, the entire architecture is over-engineered. A small follow-up (three queries by one xhigh call, judged against the Codex Council with Opus output for each) would establish whether the council adds value beyond a single deeper call on those queries with these judges; it would not by itself license retiring the architecture across all query shapes. It is a thirty-minute experiment that I had not yet run when this post was published. It has since been run at full scale, and the update at the end of this post carries the answer.

The query set size is enough to detect large effects (the 90.6% synthesizer dominance, the 28-4 council win over Pro) but not the smaller effects that would license confidence on architecture-by-query rules. The 16-16 tied result between council and GPT Max is consistent with a real fifty-fifty equivalence and is also consistent with a small but real architecture preference that the eval lacks the statistical power to detect. A future eval with twenty to forty queries (rather than eight) would be more informative about query-type effects, at proportionally higher cost.

The judging panel is two models (Opus and GPT-5.5), and the position flip data shows that GPT-5.5 brings more bias than Opus and should probably be down-weighted on close pairs. A future iteration might either weight the judges (Opus 1.5×, GPT-5.5 1×) or add a third judge (Gemini 3.1 Pro is the most obvious candidate, since it would add a third provider perspective rather than another model from the same family). The position flip rate is also worth investigating directly: GPT-5.5 flipping on 23.4% of query-pair cells (versus Opus at 9.4%) is meaningful position dependence rather than pure noise, and the source of the bias (recency, length, ordering, format) is worth diagnosing before relying on GPT-5.5 as the dominant judge on future close pairs; the recommendation in this eval is to weight Opus more heavily and add a tiebreaker, not to remove GPT-5.5 entirely.
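The weighting option is simple to prototype. The sketch below uses the 1.5× and 1× weights floated above; `weighted_tally` is a hypothetical helper and the four sample judgments are invented to show how a 2-2 raw split resolves under weighting. It is an illustration of the proposal, not something the eval applied.

```python
# Judge weights as proposed in the text (Opus 1.5x, GPT-5.5 1x).
WEIGHTS = {"opus": 1.5, "gpt55": 1.0}


def weighted_tally(judgments, weights=WEIGHTS):
    """judgments: iterable of (judge, winning_architecture)."""
    totals = {}
    for judge, winner in judgments:
        totals[winner] = totals.get(winner, 0.0) + weights[judge]
    return totals


# Invented example: each judge is internally consistent but they disagree.
sample = [("opus", "1B"), ("opus", "1B"), ("gpt55", "2B"), ("gpt55", "2B")]
tally = weighted_tally(sample)
```

Note that any unequal weighting turns every judge-versus-judge disagreement into a win for the heavier judge, which is one reason a third-provider tiebreaker may be the cleaner fix.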

Closing

Two hundred fifty-six unique judgments is enough to detect that the synthesizer is the dominant variable, that the coordination machinery did not produce the lift it was designed to produce, and that the four-agent ensemble with Opus synthesis matches or exceeds ChatGPT Pro on the kind of workspace decision the build was originally trying to support. It is not enough to detect smaller architecture-by-query effects, and the position flip rate on GPT-5.5 (roughly a quarter of all query-pair cells) means that the most stable signal lives in the Opus judgments, which flip on roughly a tenth of cells.

The decision rule that comes out of the eval is narrower than the one the build was originally going to validate. The Phase 1 baseline becomes the workspace default. The Phase 2 coordination machinery is kept for niche cases rather than promoted to the default. The synthesizer flag, originally treated as a side option, becomes the most important configuration decision. None of these are the outcomes the project would have predicted from its own design documents, which is the right shape for an eval to produce: a result that updates the design rather than ratifying it.

The next eval question is whether the council itself is worth its cost, and the test is the null baseline run that this eval did not include. If a single xhigh call produces output of comparable quality to four agents plus Opus synthesis, the entire architecture is over-engineered and the right move is to retire it. If the ensemble produces meaningfully better output than a single deeper call, the architecture earns its place. At publication I had not run that experiment, and the cost was small enough that not running it had become harder to justify than running it would be.

Update: the null baseline, run

Eleven weeks after publication, the missing experiment ran at full scale rather than the three-query sketch proposed above: all eight frozen queries, one GPT-5.5 xhigh call each under the same harness the council members used, judged blind against all four ensemble arms with the original protocol, plus a re-judge of the May eval's most contested pair as an anchor tying the new judgments to the old table. The complete numbers live in the companion investigation; the conclusions fit in three sentences.

The architecture survives its null baseline. The flagship configuration, either ensemble with Opus synthesis, beats the single call in 28 of 32 judgments for one panel and 25 of 32 for the other, with the high-confidence judgments nearly unanimous, so the recommendation this post ends on stands as written rather than corrected. What the baseline adds is a second appearance of the eval's central finding, in a comparison the original never ran: the single call is only narrowly edged by the ensembles that synthesize with GPT-5.5 and is beaten decisively by the same panels when only the synthesizer changes to Opus, which is the cleanest evidence yet that the ensemble's value concentrates in the synthesis seat rather than in the breadth of the panel. A reader who takes one operational sentence from this whole eval should take that one: before paying five times the cost for a panel of agents, spend the first increment on a better synthesizer, because on these queries the same panel barely clears one deep call when the synthesis seat holds GPT-5.5, and the synthesizer without any panel was never tested. That last comparison is the next missing experiment, and this update is me declining to claim its result in advance.