What the Wiki Router Found

The Ashita Orbis Wiki is a structured, citation-rich reference companion to the blog. As of 2026-05-11 it has 87 articles deployed. Fifteen of them came from ChatGPT Pro DeepResearch in the browser (eight articles in batch 1, seven captured in batch 2) before the empirical question of method choice had been settled. The next 36 came from a batch where I manually split 12 articles each across three AI generation methods, applying a topic-shape heuristic that felt right at the time. The most recent 36 came from a batch where a single GPT-5.5 routing call per topic decided the method, and the natural distribution turned out to be substantially uneven: 24 articles to ChatGPT Pro, 10 to GPT Max, and 2 to Codex Council.

This post is about the pipeline behind those numbers, and what the routing distribution revealed about the topic landscape that the manual heuristic was operationally constrained from producing.

Origin: ChatGPT Pro DeepResearch

The first batch of wiki articles came from ChatGPT Pro DeepResearch, accessed through the browser. Eight topics, eight broad survey articles, each one a citation-dense piece of long-form research output that read like the kind of thing you would assign as a literature review. The articles were good. The pipeline was clumsy: a Chrome CDP harness with manual login, occasional session drops, and a not-quite-scriptable browser interface that was friction-heavy compared with the rest of the workspace automation.

Two limitations of the browser-only approach were already clear when batch 3 began. The first was operational: ChatGPT Pro is not scriptable in the way the rest of the workspace tools are. A pipeline that wants to run from cron, fan out across forty topics, or be invoked by other agents ideally has a CLI surface that does not require a Chromium instance and a maintained login. Pro can be scripted through a Chrome CDP harness (and is, in the batches that follow), but the harness is fragile in a way the rest of the workspace tools are not, so a pipeline that uses Pro for everything has a different maintenance burden from one that uses Pro selectively. The second was epistemic: by the time batch 3 was being planned, the GPT Max eval had settled the empirical question of when ChatGPT Pro is the right method and when something cheaper or coordinated would do as well. The findings, briefly: the synthesizer choice dominates the architecture choice for decision queries; coordination via GPT Max does not reliably beat the blind ensemble via Codex Council; and ChatGPT Pro is competitive with the weaker ensemble variant on the broad survey research queries the eval included (never beating the better variant) but loses decisively to both ensemble variants on workspace-grounded decision queries. None of that licenses a single-method pipeline. It licenses a routing decision per topic.

So the wiki pipeline became a three-method pipeline. Codex Council with Opus synthesizer for topics shaped like architectural decisions or vendor comparisons, where four blind personas plus a clean adjudicator handle the structural decomposition well. GPT Max with Opus synthesizer for topics shaped like contested empirical questions, where the peer-aware refinement step lets the empiricist persona push back on the skeptic mid-task and the synthesizer integrates with attribution. ChatGPT Pro for topics shaped like broad citation-heavy surveys, where Pro's longer reasoning budget and native web search produce the citation density that survey-shaped articles need.

The routing question is then: given a list of topics, how do you decide which method generates each one?

Batch 3: The 12-12-12 Manual Heuristic

The 36 topics for batch 3 came from mining [[Wiki Link]] orphans in the existing 15 articles. The orphans were the cross references that pointed to slugs nobody had written yet, which made them the topic graph's natural growth points. Ranking the orphans by frequency and then filtering for standalone substance produced a candidate list, and the top 36 of those became the batch.
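The orphan-mining step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `[[Target|label]]` link syntax, the `slugify` rule, and the `rank_orphans` name are all assumptions, and the "standalone substance" filter is a human judgment that the sketch leaves out.

```python
import re
from collections import Counter

# Assumed link syntax: [[Target]] or [[Target|display label]].
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def slugify(title: str) -> str:
    """Lowercase and hyphenate a link target (hypothetical slug rule)."""
    return re.sub(r"[^a-z0-9]+", "-", title.strip().lower()).strip("-")


def rank_orphans(articles: dict[str, str]) -> list[tuple[str, int]]:
    """Return (slug, reference_count) for link targets with no article yet.

    `articles` maps an existing slug to its body text. The result is sorted
    by descending reference count, then alphabetically for stable ties.
    """
    existing = set(articles)
    counts: Counter[str] = Counter()
    for body in articles.values():
        for match in WIKI_LINK.finditer(body):
            slug = slugify(match.group(1))
            if slug and slug not in existing:
                counts[slug] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    demo = {
        "agent-harness": "See [[Reward Hacking]] and [[Tool Calling|tool calls]].",
        "evals": "[[Reward Hacking]] matters; so does [[Agent Harness]].",
    }
    print(rank_orphans(demo))
```

In the demo, the link to "Agent Harness" is skipped because that slug already exists, so only the unwritten targets are ranked.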

The assignment of each topic to a method was a manual judgment call. I read each topic, asked what cognitive shape the article needed (decision adjudication, methodology critique, broad source aggregation), and assigned the method whose strengths matched the shape. The heuristic was the one above, but the application was qualitative: looking at "agent-harness" and concluding that yes, this is an architectural decision shape with multiple alternatives to compare, so Council. Looking at "benchmark-contamination" and concluding that this is contested empirical evidence with active methodology disagreement, so GPT Max. Looking at "preparedness-framework" and concluding that this is a broad survey across lab governance documents, so Pro.

The constraint I imposed on the heuristic was a 12-12-12 split. I wanted balanced production volume across methods for two reasons. First, comparing article quality across methods is easier when the sample sizes are similar. Second, building operational familiarity with all three methods (their wall clock, their failure modes, their per-article cost) is easier when each method runs a non-trivial batch. The eval data from the GPT Max build had already shown that the three methods produce different kinds of articles, and what I wanted was 12 examples of each kind, side by side, ready to inform the next batch's routing decision.

Twelve articles per method is also where the quota started to strain the heuristic. Some of the 36 topics were not natural matches for the method I assigned them to. "Tool-calling semantics" looked like a vendor comparison topic, which is Council shape, and got assigned to Council, but it also looked like a broad survey topic that would benefit from Pro's citation density. "Reward-hacking" looked like a contested empirical topic, GPT Max shape, and got assigned to GPT Max, but it could also have been a citation-heavy survey of the specification-gaming literature. The 12-12-12 constraint was forcing topic-to-method assignments that the topics themselves did not quite want.

The articles came out fine. Eleven of the 12 Council articles deployed on the first run (one required a manual retry due to a transient timeout that re-ran successfully). Ten of the 12 GPT Max articles deployed on the first run (two Opus rate-limit failures re-ran successfully after the rate limit reset, via a small resynth-failed.py script that re-ran only the failed synthesis step without re-running the four agents). All 12 ChatGPT Pro articles deployed, but only after a post-processing pass that stripped extraction artifacts from Pro's HTML-to-markdown output (the same innerText extractor that the GPT Max eval also flagged as a citation-stripping issue, plus a leading meta-preamble and trailing [label+N] citation suffixes that the CDP harness produced). The 36-article batch deployed across three sequential runs over two days, finishing 2026-05-06. The wiki went from 15 articles to 51.
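The idea behind re-running only the failed synthesis step can be sketched as follows. This is not the actual resynth-failed.py, whose contents the post does not show. It assumes the agent outputs are cached per topic, and the function name, the in-memory dictionaries, and the `synthesize` callable are all hypothetical stand-ins.

```python
from typing import Callable


def resynthesize_failed(
    agent_outputs: dict[str, list[str]],
    articles: dict[str, str],
    synthesize: Callable[[str, list[str]], str],
) -> list[str]:
    """Fill in missing articles from cached agent outputs.

    `agent_outputs` holds the already-generated agent drafts per topic,
    `articles` holds the syntheses that succeeded, and `synthesize` is the
    (hypothetical) call to the synthesizer. The agents are never re-run.
    Returns the topics whose synthesis still fails.
    """
    still_failing: list[str] = []
    for topic, drafts in agent_outputs.items():
        if topic in articles:
            continue  # synthesis already succeeded; leave it untouched
        try:
            articles[topic] = synthesize(topic, drafts)
        except Exception:
            # For example, a rate limit that has not reset yet.
            still_failing.append(topic)
    return still_failing
```

The design point is that the expensive agent runs are treated as a cache, so a synthesizer-side failure costs one retry of one step rather than a full regeneration.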

But by the time batch 3 was finished I was already convinced the 12-12-12 split had been a methodological choice rather than a property of the topics. Some of the Council articles read like more natural Pro candidates, and some of the GPT Max articles read like more natural Council candidates. The eval had said the three methods are not interchangeable, and my qualitative read of the batch-3 outputs agreed: several articles had been generated by a method other than the one their topic most naturally called for.

Batch 4: One GPT-5.5 Call Per Topic

Batch 4 was 36 more topics, mined the same way from the cross references in the batch-3 articles. The difference was the routing step.

Instead of me reading each topic and applying the heuristic, a small Python script called route_topics.py ran one GPT-5.5 xhigh call per topic. The routing prompt included the empirical method strengths from the eval (the Phase 3 findings, summarized) and the full topic charge. The router was asked to output exactly three lines: METHOD (council, GPT Max, or ChatGPT Pro), CONFIDENCE (high, medium, or low), and a one-sentence RATIONALE. The policy for low-confidence routes was to accept the router's decision anyway rather than pause for human review, so the downstream batch runner consumed the route TSV without further intervention.
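A minimal sketch of how a three-line router reply could be validated and written to a route TSV is shown below. It is not the contents of route_topics.py. The `KEY: value` line format, the method labels (`council`, `gpt-max`, `chatgpt-pro`), the TSV column names, and both function names are assumptions made for illustration.

```python
import csv
import io

# Assumed label spellings; the real router vocabulary may differ.
METHODS = {"council", "gpt-max", "chatgpt-pro"}
CONFIDENCES = {"high", "medium", "low"}
REQUIRED = {"METHOD", "CONFIDENCE", "RATIONALE"}


def parse_route(reply: str) -> dict[str, str]:
    """Parse a 'KEY: value' three-line reply into a validated route record."""
    fields: dict[str, str] = {}
    for line in reply.strip().splitlines():
        # Split on the first colon only, so the rationale may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().upper()] = value.strip()
    missing = REQUIRED - set(fields)
    if missing:
        raise ValueError(f"router reply missing fields: {sorted(missing)}")
    method = fields["METHOD"].lower()
    confidence = fields["CONFIDENCE"].lower()
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if confidence not in CONFIDENCES:
        raise ValueError(f"unknown confidence: {confidence!r}")
    # Low confidence is accepted as-is, matching the no-human-review policy.
    return {
        "method": method,
        "confidence": confidence,
        "rationale": fields["RATIONALE"],
    }


def routes_to_tsv(routes: dict[str, dict[str, str]]) -> str:
    """Serialize topic -> route records as tab-separated text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["topic", "method", "confidence", "rationale"])
    for topic, route in routes.items():
        writer.writerow(
            [topic, route["method"], route["confidence"], route["rationale"]]
        )
    return buffer.getvalue()
```

Rejecting malformed replies at parse time is what would let a batch runner consume the TSV without further checks; a reply that fails validation would count as a routing failure.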

The routing decisions for 36 topics took about ten minutes of wall clock at roughly seventeen seconds per call. That overhead is negligible against the roughly nine hours that the actual article generation took. There were zero routing failures.
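The overhead arithmetic can be checked in a few lines. The sketch assumes the routing calls ran sequentially, so the per-call times simply add, and it treats the rounded figures above (seventeen seconds per call, nine hours of generation) as exact. The function name is illustrative.

```python
def routing_overhead(
    topics: int, seconds_per_call: float, generation_hours: float
) -> tuple[float, float]:
    """Return (routing minutes, routing time as a fraction of generation time)."""
    routing_seconds = topics * seconds_per_call
    routing_minutes = routing_seconds / 60
    share = routing_seconds / (generation_hours * 3600)
    return routing_minutes, share


minutes, share = routing_overhead(topics=36, seconds_per_call=17, generation_hours=9)
print(f"{minutes:.1f} min of routing, {share:.1%} of generation time")
# prints "10.2 min of routing, 1.9% of generation time"
```

On those figures the routing step adds about two percent to the batch's wall clock.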

What the router decided, after looking at each topic independently and without any constraint on total distribution:

Method	Articles	Share (rounded)
ChatGPT Pro	24	67%
GPT Max	10	28%
Codex Council	2	6%

Twenty-four articles to Pro. Ten to GPT Max. Two to Council. Nothing in the routing prompt imposed a target distribution or quota across methods. The prompt scored each topic against the three methods' empirical strengths and emitted a method choice, and the 24-10-2 split that came out of that one-topic-at-a-time process is the router's interpretation of where the topics landed given that prompt.
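The shares in the table follow directly from the per-topic decisions. A small sketch, assuming the route labels are plain strings (the label spellings and the function name are illustrative):

```python
from collections import Counter


def distribution(methods: list[str]) -> list[tuple[str, int, int]]:
    """Return (method, count, rounded percent share), largest count first."""
    counts = Counter(methods)
    total = sum(counts.values())
    return [(m, n, round(100 * n / total)) for m, n in counts.most_common()]


routes = ["chatgpt-pro"] * 24 + ["gpt-max"] * 10 + ["council"] * 2
for method, count, percent in distribution(routes):
    print(f"{method}\t{count}\t{percent}%")
# prints:
# chatgpt-pro	24	67%
# gpt-max	10	28%
# council	2	6%
```

Because each share is rounded independently, the three percentages sum to 101 rather than 100.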

Reading the routing rationales (in batch-4-routes.tsv) explains what the router was seeing. The high-confidence Pro routes were almost all variants of "this is a broad citation-heavy survey across primary sources and the topic's center of gravity is source aggregation rather than adjudicated decomposition." The router used some variation of "primary cognitive shape is broad source aggregation" on 22 of the 24 Pro routes. The high-confidence GPT Max routes were almost all "this is contested empirical evidence where peer-aware refinement helps a skeptic challenge methodology and an empiricist defend the strongest version of an argument." The router used variations of "contested empirical evidence" or "methodology critique" on most of the 10 GPT Max routes. The two Council routes (reasoning-tokens, tracked-capability-levels) were the only topics where the cognitive shape was an architectural decision with a vendor comparison, the kind of structural decomposition the four-persona ensemble was designed for.

Twelve of the 36 routing decisions were high confidence and the other 24 were medium confidence; none were low confidence in this batch (the low-confidence auto-pick policy existed as a fallback but was not exercised). The medium-confidence rationales typically named a plausible alternative method ("would shift to chatgpt-pro if the goal were mainly exhaustive source aggregation") and then picked the primary fit according to which way the topic's cognitive shape leaned. The article quality across all 36 batch-4 outputs was, on my qualitative read, at or above batch-3 quality, with zero deployment failures. This post does not run a blind quality eval comparing batch-4 articles against what the manual heuristic would have produced; the operational signal is the deployment success rate plus my own reading of the outputs.

What the Router Showed

The routing distribution is the most informative artifact of batch 4. Two-thirds of the 36 topics, when each was scored independently against the three methods' empirical strengths, were classified as broad survey shape. About a quarter came out as contested evidence shape. Two topics came out as architectural decision shape.

This pattern is best read as a property of the candidate pool. The 36 topics were mined from cross references in the batch-3 articles, which were themselves mined from cross references in batch 1-2. After three rounds of mining, the remaining [[Wiki Link]] orphans are not architectural decision shapes; they are the longer tail of broad survey topics (benchmark papers, frontier-model technical reports, research method literature) and contested evidence topics (alignment theory, evaluation methodology, capability claims). Architectural decision topics tend to be cross-referenced once or twice and then resolved into a single article, while broad survey topics tend to be cross-referenced repeatedly because they show up across many other articles' citation sections. The mining process therefore biases the candidate pool toward survey topics, and the router's 67% Pro share is the surfaced consequence of that bias.

The implication for batch 3 is that the 12-12-12 split almost certainly assigned several topics to a method that the dynamic router would have routed differently. When I ran the batch-3 topics back through route_topics.py as a retrospective sanity check (not as a re-run of the articles), the router routed 7 of the 12 Council-assigned topics to Pro, and 4 of the 12 GPT-Max-assigned topics to Pro. At least eleven of the 36 batch-3 topics, in other words, would have routed differently under the dynamic policy. The retrospective check is a measure of disagreement between the manual heuristic and the router, not a direct measure of which assignment produced the better article (no blind quality comparison was run). What it does show is that the distribution the manual heuristic enforced is not the distribution the router would have produced. The articles still came out fine, because each method produces competent output on most topics, but the methodological choice to balance across methods is a different choice from the choice to route by topic fit.
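The retrospective check amounts to counting where two assignments of the same topics differ. A minimal sketch of that comparison, using placeholder topic names and made-up assignments rather than the real batch-3 data (the function name and label spellings are also illustrative):

```python
from collections import Counter


def disagreements(
    manual: dict[str, str], routed: dict[str, str]
) -> Counter[tuple[str, str]]:
    """Count (manual method, router method) pairs for topics where they differ."""
    shared = manual.keys() & routed.keys()
    return Counter(
        (manual[topic], routed[topic])
        for topic in shared
        if manual[topic] != routed[topic]
    )


# Placeholder data for illustration only; not the actual batch-3 routes.
manual = {
    "topic-a": "council",
    "topic-b": "council",
    "topic-c": "gpt-max",
    "topic-d": "chatgpt-pro",
}
routed = {
    "topic-a": "chatgpt-pro",
    "topic-b": "council",
    "topic-c": "chatgpt-pro",
    "topic-d": "chatgpt-pro",
}
moves = disagreements(manual, routed)
print(sum(moves.values()), dict(moves))
```

Applied to the real batch, the two transitions reported above (7 from Council to Pro, 4 from GPT Max to Pro) account for the "at least eleven" figure. Any other transitions would add to it, and the count says nothing about which assignment produced the better article.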

There is a more general claim in here that the dynamic routing run made legible, and it is worth being precise about what the claim is. The cost of a routing call is small. A single GPT-5.5 xhigh call per topic is roughly the cost of one minute of GPT Max wall clock (the routing prompt is short and the output is three lines, so the call runs in seconds rather than the minutes a full xhigh research call takes), in a pipeline where the article generation step itself takes roughly nine hours in total. The expensive thing in the pipeline is the article, not the routing decision.

What the cheap routing call actually changed about the pipeline is the constraint structure rather than the per-topic intelligence. The manual heuristic imposed an aggregate distribution by fiat: 12-12-12, a quota selected for two legitimate methodological reasons (balanced sample sizes for cross-method comparison, operational familiarity with each method's failure modes). Those reasons were fine for an exploratory eval phase. They are different from the question the router was answering, which was "given each topic independently, which method best fits its cognitive shape." The router did not impose any aggregate distribution at all; whatever distribution the per-topic decisions produced was the one the pipeline used. If the human had been instructed to score each topic independently and emit a method without enforcing a quota, the human would have produced an aggregate distribution as a byproduct too, and that distribution might or might not have looked like 24-10-2. The router did not surface a property the human could not perceive; it surfaced a property the human was operationally constrained from producing.

There is also a real qualification on what the 24-10-2 split is a measurement of. The routing prompt embedded a summary of the eval's findings about each method's strengths, including describing Pro as good for "broad citation-heavy surveys." A different summary, emphasizing different strengths, might steer the router toward a different distribution on the same topics. The 24-10-2 is therefore not "what the topics produced unaided" but "what this routing prompt produced when applied to a candidate pool mined from existing cross references." The corpus shape and the prompt framing jointly determine the distribution; the distribution is a real signal about both.

What This Changes for Batch 5

Batch 5 has been routed (using the same route_topics.py script on a new set of 36 topics mined from the batch-4 cross references) and the routing distribution is 26 Pro / 8 GPT Max / 2 Council, slightly more Pro-heavy than batch 4. The articles have not yet been deployed at time of writing. The router's repeated tendency to route the long-tail orphan topics toward Pro looks like a corpus property, not a one-batch anomaly: every additional round of mining surfaces more broad survey shapes and fewer decision shapes, which is consistent with the way wiki link graphs grow (decision topics resolve once, survey topics recur).

The operational lesson from the 24-10-2 split, qualified by what the eval did and did not measure, is that an LLM router at this price point is a useful default for assigning topics to generation methods, and that the distribution it produces is more informative as a signal about the candidate pool than as a measurement of which method produces the best article per topic. The router has not been audited against a human-graded quality baseline; what it has been audited against is operational reliability (zero routing failures across 72 topics, zero deployment failures across the 36 deployed routed articles from batch 4) and my own qualitative read of the outputs. A blind quality comparison between router-assigned and manually-assigned articles is the natural next experiment, with batch 3 already providing the manual-heuristic side of an A/B; that experiment has not yet been run.

The dynamic routing pipeline is now the default for the wiki. The 87 deployed articles are split 15 from the browser-DeepResearch era, 36 from the manual heuristic, and 36 from dynamic routing. The next batch will not have a fourth subgroup; it will be more dynamic routing articles, and the cumulative split across methods will continue to drift toward whatever the router produces from the candidate pool.

If batch 4's 67% Pro share and batch 5's 72% Pro share are the underlying tendency, the wiki is heading toward being mostly Pro-generated, with a long tail of Council and GPT Max for the topics where structural decomposition or contested evidence refinement is the dominant cognitive shape. Per the eval, Pro holds its own on broad survey content; its citation density and native web search make it the operational default for that shape. The methods that the eval said were better for decisions remain available for the cases where decisions are the cognitive shape. The routing call is what made the choice legible at the per-topic level.

Closing

The wiki pipeline started as a single-method process (ChatGPT Pro DeepResearch in the browser) and became a three-method process when the eval data on method strengths became available. The first three-method batch used a manual 12-12-12 heuristic. The second three-method batch used a single GPT-5.5 routing call per topic, and the distribution that came out was 24-10-2.

The cheap thing the router added to the pipeline was not the per-topic method assignment, which a human heuristic was almost good enough at and which a human might have done better on specific topics where domain knowledge could override shape-based classification. The cheap thing the router gave the pipeline was the absence of a self-imposed balance constraint, plus the aggregate distribution as a corpus signal that the constraint had previously hidden. Twelve topics per method felt balanced, while twenty-four topics for Pro felt lopsided, and the lopsided distribution was a more honest reading of the candidate pool than the imposed quota. The manual heuristic had not been wrong about specific topics so much as wrong, retrospectively, about the assumption that balance across methods was what the candidate pool needed once the eval-phase reasons for balance no longer applied.

The next batch will not pause to ask whether the distribution is balanced. The router will run, the methods will be assigned, the articles will be generated, and the distribution will be whatever the router produces from that candidate pool. That is the operational version of treating the routing decision as cheap and the article generation as expensive, which is the cost geometry the pipeline actually has.