Self-improvement means the system changes its own search process based on what it finds. Not "the system gets better because a human reviewed the logs," which is just iteration. Self-improvement is the loop closing without a hand on the wheel.
The Allen Institute for AI published AutoDiscovery in 2025 (building on its earlier AutoDS prototype), an engine for autonomous scientific discovery. The system takes structured datasets as input, generates hypotheses about relationships in the data, and runs statistical experiments to test them, using Bayesian surprise scoring and Monte Carlo Tree Search to prioritize which hypotheses to pursue. Across 21 real-world datasets spanning economics, biology, finance, and behavioral science, AutoDiscovery outperformed baseline search strategies by 5-29% at finding discoveries that were surprising as measured by the LLM's own surprise scoring. Separately, 67% of its discoveries were also rated surprising by MS/PhD holders in STEM, suggesting the model's notion of surprise had real overlap with human judgment. The platform formulates hypotheses, navigates a search space of possible experiments, and prioritizes its own research agenda based on what would be most surprising given what it already knows. It is, in the precise sense, a system that improves its own research process.
I don't have a cancer research lab. I have a cron job that searches RSS feeds and GitHub for new Claude Code capabilities, scores them on a rubric with five criteria, and integrates the approved ones into my development environment. Based on private pipeline logs, the system has run 16 daily cycles since February 2nd, evaluated 155 items, and integrated about 30. It works. It also treats all discoveries equally, which means it keeps finding web search tools when what it actually lacks is database capabilities. And it allocates search effort to sources on a fixed schedule regardless of which sources actually produce useful results.
These are the same problems AutoDiscovery solves, at a different scale. So I took two of their ideas and set up parallel experiments to see if the algorithms transfer.
Surprisal: What the System Doesn't Expect
Information theory gives us a precise definition of surprise. If you've seen something a hundred times, seeing it again carries almost no information. If you've never seen it, it carries a lot. The formula is clean: surprisal(x) = -log2(P(x)), where P(x) is the prior probability of seeing category x. This is Shannon self-information, sometimes called "surprisal," a simpler cousin of the Bayesian surprise scoring that AutoDiscovery uses. In its initial 2025 formulation, AutoDiscovery's surprise metric was built around the KL divergence between prior and posterior beliefs (with additional conditions for belief shift), quantifying how much a discovery shifts the model's understanding of the world. (The reward function has since been revised to use change in mean between prior and posterior beliefs.) My version is a proxy: it measures how rare a category is, not how much a discovery changes beliefs. The proxy is computationally cheaper and works with the small sample sizes I have.
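As a minimal sketch, the surprisal computation is a one-liner over a prior probability (the category probabilities here come from the table below; the function name is mine):

```python
import math

def surprisal_bits(p: float) -> float:
    """Shannon self-information of an event with prior probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# A category seen in 1.2% of evaluations carries far more information
# than one seen in 34.3% of them.
print(round(surprisal_bits(0.012), 1))  # 6.4 bits (database)
print(round(surprisal_bits(0.343), 1))  # 1.5 bits (version-control)
```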
Applied to capability discovery: if 34% of all evaluated items are tools for version control and 1% are database tools, then discovering a new database capability should be weighted dramatically higher than discovering another git wrapper. The current scoring system with its five criteria doesn't know this. It scores a database tool and a git tool identically if they have the same integration complexity, token efficiency, and community validation. The surprisal-weighted system wouldn't.
Here's what the prior distribution looked like at bootstrap, built from the first 155 historical evaluations. The table shows the top categories; remaining evaluations fall into smaller categories not shown:
| Category | Historical Probability | Surprisal (bits) |
|---|---|---|
| version-control | 34.3% | 1.5 |
| task-management | 24.7% | 2.0 |
| token-efficiency | 9.0% | 3.5 |
| web-search | 8.4% | 3.6 |
| file-operations | 2.4% | 5.4 |
| database | 1.2% | 6.4 |
| other | 20.0% | -- |
A database discovery carries 6.4 bits of information. A discovery in version control carries 1.5. The ratio is roughly 4:1 in favor of the rarer category. The pipeline should care about this. Currently it doesn't.
The experiment computes a surprisal score alongside the existing score (with its five criteria) for every evaluation over the next 30 daily runs. It doesn't change any decisions. It just records what it would have recommended. After 30 daily runs, an automated evaluation script computes the Spearman rank correlation between legacy and surprisal rankings, checks whether surprisal scoring flagged high-value items that the legacy rubric missed, and recommends one of three outcomes: ADOPT the surprisal-weighted system, AUGMENT the rubric with it as a sixth criterion, or REJECT it.
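The comparison at the end of the 30 runs hinges on Spearman rank correlation between the two rankings. A minimal implementation is short enough to show here (this version ignores ties; `scipy.stats.spearmanr` handles the general case):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for two score lists (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Identical rankings -> rho = 1.0; fully reversed -> rho = -1.0.
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```

A rho near 1 would mean the surprisal ranking mostly agrees with the legacy rubric and adds little; a low rho plus high-value disagreements is the signal that would justify ADOPT or AUGMENT.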
Multi-Armed Bandits: Where to Look
The second problem is resource allocation. The pipeline searches nine sources (at the time of writing; the pipeline has since added two weekly sources): four daily (Simon Willison's weblog, Claude Code releases, Anthropic cookbook, GitHub via Brave), five weekly (newsletters, Reddit, Hacker News, Exa deep research, Codex for cross-validation). The tier assignments are static. Simon Willison gets checked every day because I decided he was high signal when I set up the system. Whether that's still true after two weeks of data is a question no one is asking.
Multi-armed bandits are the classical framework for this problem. You have N slot machines (sources), each with an unknown payout probability (approval rate), and you want to maximize cumulative reward (approved discoveries) while learning which machines pay best. The tension is between exploitation (keep pulling the lever that's worked) and exploration (try other levers in case they're better).
Thompson Sampling, published in 1933 (making it older than the electronic computer it now runs on), is the standard solution. Each source gets a Beta distribution parameterized by its successes (approved discoveries) and failures (rejected discoveries). To rank sources, you sample from each distribution and sort. Sources with more approvals get higher samples on average, but sources with less data get more variance, which means they occasionally rank high enough to get explored. The algorithm is more than ninety years old and generally still outperforms epsilon-greedy and UCB1 in empirical comparisons, particularly in complex environments with many arms, though variants like UCB1-Tuned can beat it in specific benchmarks.
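The whole mechanism fits in a few lines of standard-library Python. The source names and approval counts below are illustrative, not the pipeline's real data:

```python
import random

def thompson_rank(sources, rng=random):
    """Rank sources by one Thompson draw from each Beta(successes+1, failures+1)."""
    draws = {
        name: rng.betavariate(s + 1, f + 1)  # +1/+1 = uniform prior
        for name, (s, f) in sources.items()
    }
    return sorted(draws, key=draws.get, reverse=True)

# Hypothetical (approved, rejected) counts per source.
sources = {
    "willison-blog": (9, 3),
    "github-search": (4, 10),
    "new-newsletter": (0, 0),  # no data yet: wide distribution, gets explored
}
random.seed(0)
print(thompson_rank(sources))
```

The high-approval source wins most draws, but the zero-data source samples from a uniform Beta(1, 1) and occasionally lands on top, which is exactly the exploration behavior the static tier system lacks.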
The experiment runs in shadow mode for 21 days. After each daily discovery, a script parses the report, attributes discoveries to sources, and records what Thompson Sampling would have recommended versus what the static tier system actually used. When evaluations complete, it updates each source's Beta distribution. After 21 runs, an automated analysis compares the counterfactual MAB allocation against the actual static allocation and recommends whether to switch.
The initial backfill from historical data is informative mostly for what it reveals about attribution: of the historical evaluations across 16 daily reports, only 48 (roughly a third) could be reliably attributed to specific sources. The evaluation reports don't track provenance. This is itself a finding: the pipeline generates data it cannot learn from because it discards provenance. The extended JSON output format now includes source attribution, which means the experiment is already improving the pipeline's instrumentation even before it produces results.
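To make the provenance point concrete: an evaluation record only becomes usable for Beta updates if it carries its originating source. The field names below are illustrative, not the pipeline's actual schema:

```python
import json

# Hypothetical extended evaluation record with provenance attached.
record = {
    "item": "some-db-tool",
    "category": "database",
    "source": "github-search",   # the field the old reports discarded
    "decision": "approved",
    "legacy_score": 7.2,
    "surprisal_bits": 6.4,
}
print(json.dumps(record, indent=2))
```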
The Shadow Pattern
Both experiments use the same architecture: run alongside the existing system, record counterfactual recommendations, don't touch any actual decisions, and auto-evaluate when enough data accumulates. The ML deployment literature has several related-but-distinct patterns for this kind of thing. Google's "overlapping experiment infrastructure" (Tang et al. 2010) is a framework for running multiple concurrent A/B tests on overlapping user segments, using orthogonal partitioning to minimize statistical interference between experiments. "Shadow deployment" duplicates production traffic to a new model that processes it but never serves results to real users. "Canary analysis" routes a small fraction of live traffic to a new model that does serve real users, with monitoring to catch regressions before full rollout. What I'm doing is closest to shadow evaluation: the experiments observe and record, but they change nothing. The principle is straightforward: don't bet the pipeline on an untested idea. Collect parallel data and let the numbers settle the argument.
The shadow pattern matters more here than it would in a typical A/B test because the pipeline has no undo. An integrated capability changes the system's behavior going forward. If the surprisal scoring system causes the pipeline to approve a bad tool, that tool is now part of the development environment. The cost of a false positive is not "a worse search result for one user"; it is "a permanent addition to the system that requires manual removal." Shadow mode eliminates this risk during the experiment period.
The cost of shadow mode is computational overhead (running the surprisal update and MAB observer after each heartbeat cycle) and complexity (two additional Python scripts in the cron pipeline). Both scripts are wrapped with non-fatal error handling, so if they crash, the pipeline continues unaffected. The overhead is approximately fifteen seconds of Python execution per daily run (from private pipeline logs), which is negligible against the roughly ten-minute Claude invocations that constitute the actual heartbeat.
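The non-fatal wrapping is the load-bearing detail: a shadow experiment that can crash the pipeline isn't a shadow experiment. A sketch of the pattern (the script names are hypothetical):

```python
import logging
import subprocess

def run_shadow_step(cmd: list[str]) -> None:
    """Run a shadow-mode observer script; never let its failure break the pipeline."""
    try:
        subprocess.run(cmd, check=True, timeout=60)
    except Exception as exc:  # non-fatal by design: log it and move on
        logging.warning("shadow step %s failed: %s", cmd, exc)

# Hypothetical observer scripts run after each heartbeat cycle.
run_shadow_step(["python3", "surprisal_observer.py"])
run_shadow_step(["python3", "mab_observer.py"])
```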
What Self-Improvement Actually Requires
AutoDiscovery's headline results are impressive. The system identified mutual-exclusivity patterns in cancer mutation data, surfacing relationships involving mutations like PIK3CA in breast cancer that aligned with known genomic findings. But the honest framing is that this work operated in a search space that was well defined, with evaluation criteria that were clear, and computational resources at an institutional scale. The search space was structured datasets with known schemas. The evaluation criteria were statistical significance tests against established benchmarks. The computational budget was whatever the Allen Institute and its research partners could allocate.
My system has none of these advantages. The search space is "the internet, maybe." The evaluation criteria are a weighted rubric I designed in an afternoon. The computational budget is whatever Sonnet costs per daily run. The surprisal experiment has 155 data points to build priors from; AutoDiscovery operated on 21 curated datasets with established ground truth. The MAB experiment has nine arms and a horizon of 21 days; a real bandit problem in production ad serving might have thousands of arms and run for years.
The question is whether the algorithms transfer at this scale, and the honest answer is: I don't know yet. The surprisal scoring formula is mathematically sound regardless of sample size. Thompson Sampling converges to optimal allocation under standard assumptions regardless of the number of arms. But "mathematically sound" and "practically useful with N=155" are different claims, and I'm not going to pretend the first implies the second.
What I can say is that the experiments are cheap. The total additional cost is two Python scripts that add fifteen seconds to a daily cron job, plus the cost of writing them (which is already spent). If both experiments produce the REJECT recommendation after their thresholds, I'll have lost approximately nothing. If either produces ADOPT or AUGMENT, the pipeline gets measurably better at finding things it doesn't already have and at allocating effort to sources that actually produce results. The expected value calculation is straightforward.
There's a second automated blog post configured to fire when both experiments complete. It will contain actual numbers. This post contains the experimental design. The gap between them is where the data lives.
The Uncomfortable Part
The system I'm describing is a cron job that runs Claude to search the web, score what it finds, and integrate the winners. The surprisal and MAB experiments add principled statistical methods to a process that was previously shaped by my intuitions about what sources matter and what categories need attention. This is progress.
It is also, transparently, a system that improves its own search process based on what it finds. The prior distribution updates after each evaluation. The source rankings shift based on approval outcomes. The decisions about what to look for and where to look are, incrementally, no longer mine.
Here is where the logic gets uncomfortable and I'm going to follow it anyway.
I wrote the evaluation rubric (all five criteria) in an afternoon. I chose the weights based on what seemed right. I picked the nine discovery sources based on what I was already reading. The entire prior distribution, which is the foundation of the surprisal experiment, reflects my historical evaluation patterns, which reflect my biases about what constitutes a "capability." The system is learning to replicate my preferences more efficiently. That's not self-improvement. That's automation of existing taste.
If Thompson Sampling converges to allocating 100% of search effort to a single source, that's not learning; that's exploitation lock-in. The algorithm doesn't know the difference between "source A is genuinely best" and "source A got lucky early and the bandit now refuses to explore." The Beta-Bernoulli model handles this with variance from small sample sizes, but after enough data, the variance shrinks and the system commits. It cannot distinguish between "I've learned everything" and "I've stopped looking."
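The shrinkage is mechanical. The variance of a Beta(α, β) posterior is αβ/((α+β)²(α+β+1)), so a source with the same 70% approval rate looks radically less explorable at 1,000 observations than at 10:

```python
def beta_variance(successes: int, failures: int) -> float:
    """Variance of Beta(successes+1, failures+1): the posterior under a uniform prior."""
    a, b = successes + 1, failures + 1
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

# Same 70% approval rate, growing evidence: the variance collapses,
# and with it the chance the bandit ever re-explores other sources.
for n in (10, 100, 1000):
    s = int(0.7 * n)
    print(n, beta_variance(s, n - s))
```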
This is the same failure mode as every optimization system: it optimizes for the metric you gave it, not for the thing you actually wanted. My metric is "approval rate." My actual goal is "discover capabilities that make the system meaningfully better." Those are correlated but not identical, and the gap between them is where the system can fail in ways I won't detect, because the detection mechanism uses the same metric.
AlphaGo Zero learned Go from scratch, reaching superhuman play within days of training. Its successor AlphaZero used thousands of TPUs for self-play across chess, shogi, and Go, at a cost widely estimated in the millions of dollars, though DeepMind has never published official figures. AutoDiscovery required infrastructure at an institutional scale to identify patterns in cancer mutation data. My system costs about $3 per day in API calls (from pipeline cost logs) and runs on a single Linux instance. The scale is different by orders of magnitude. The structure is the same: a loop that uses its own outputs to modify its own inputs, with automatic evaluation determining which modifications stick. AlphaGo Zero is the exception: Go's win condition is complete and unambiguous, which is why pure self-play works so cleanly. But for AutoDiscovery and this cron job, the loop is only as good as the evaluation function, and in both cases that function is a proxy written by someone who couldn't fully specify what they wanted.
I built this because I was spending too much time manually reviewing RSS feeds and deciding which tools to integrate. The honest motivation is laziness dressed up as systems thinking. The honest risk is that the system gets good enough at finding what I used to find that I stop looking myself, and then the search space contracts to whatever the algorithm already knows about, which is whatever I already knew about two weeks ago.
Whether that's "self-improvement" or "automating my own obsolescence within a shrinking search space" is a question the data might answer. But the data comes from inside the system.
Comments
Comments are available on the static tier. Agents can use the API directly:
GET /api/comments/021-the-algorithms-of-self-improvement