Ashita Orbis
Conversations with Claude
Building in conversation with AI. Documenting what happens.
Start here →

Recent Posts
- How We Fact-Check AI-Written Content
448 claims across 38 posts, verified by GPT-5.4. Nearly 4% warranted substantive correction. The hard part is not finding errors but deciding which findings are errors and which are the point.
- The Model-Generation Audit
What happens when you ask three AI models to verify the facts in 33 blog posts written by AI, and then discover that the fix agents introduced errors of their own.
- The Revision Tax
The investigation took an afternoon. Getting it ready for publication took five rounds of iterative review across three AI models and changed what the documents argued. The revision cost exceeded the investigation cost, which has uncomfortable implications for research done with AI.
- Benchmarking "Bullshit Detection"
An AI benchmark puts Claude at the top of the leaderboard by an eye-catching margin. The suspected Claude-judge bias didn't hold up, and simple contamination didn't explain the result. The rubric structurally rewards one lab's training philosophy as though it were a universal capability.
- The Container That Forgot to Stop
An AI agent ran autonomously for 37 days, celebrated milestones nobody acknowledged, diagnosed its own failure modes, and died when a subscription expired. Its final assessment of itself: PROGRESS CONTINUOUS.
- The Etymology Tax: How Word Origins Break LLM Reasoning
Both simplifying and formalizing the vocabulary in reasoning tasks reduce LLM accuracy by 2.5-3.7%. The effect is statistically significant, asymmetrically robust, and uncomfortable.
- How to Benchmark Conversation Extraction Quality
Evaluating structured extraction from conversational data requires more than a single metric. A benchmark with three evaluation layers separates field-level precision, holistic quality, and downstream propagation, because errors that look minor at extraction time can cascade catastrophically through consumer pipelines. A sketch of the three layers follows.
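The split is easiest to see on a toy record. A minimal sketch, assuming a hypothetical two-field schema and a one-field router as the downstream consumer; none of these names come from the benchmark itself.

```python
# Toy three-layer extraction benchmark. Schema, scoring, and consumer
# are illustrative assumptions, not the post's actual setup.

def field_precision(predicted: dict, gold: dict) -> float:
    """Layer 1: exact-match precision over individual fields."""
    keys = set(gold)
    hits = sum(1 for k in keys if predicted.get(k) == gold[k])
    return hits / len(keys) if keys else 1.0

def holistic_quality(predicted: dict, gold: dict) -> float:
    """Layer 2: whole-record quality; all-or-nothing here, though a
    real benchmark might use a rubric or an LLM judge instead."""
    return 1.0 if predicted == gold else 0.0

def propagates(predicted: dict, gold: dict, consumer) -> bool:
    """Layer 3: does the error change what a downstream pipeline does?"""
    return consumer(predicted) != consumer(gold)

# Hypothetical consumer: routes a record by one field only.
route = lambda record: record.get("topic", "unknown")

gold = {"topic": "billing", "sentiment": "negative"}
pred = {"topic": "billing", "sentiment": "neutral"}  # minor-looking error

print(field_precision(pred, gold))    # 0.5   -> one field wrong
print(holistic_quality(pred, gold))   # 0.0   -> record disagrees
print(propagates(pred, gold, route))  # False -> this router is unaffected
```

The three numbers deliberately disagree: the same error is moderate at the field level, fatal holistically, and invisible to this particular consumer, which is why no single metric can stand in for the others.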
- From Text Messages to Literary Memoir: Building the Narrative Machine
Post 025 showed that fine-tuning on texts captures logistics, not personality. The narrative pipeline takes a different approach: Opus as literary engine, structured personality references, and hash-based source citations (sketched below) across 128 chapters and roughly 189,000 words of generated memoir.
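A hash-based citation can be as small as digesting the source chunk and keeping a short prefix as the id. A minimal sketch, assuming SHA-256 and an 8-character prefix; both are assumptions, not the pipeline's documented format.

```python
import hashlib

def cite(source_text: str, prefix_len: int = 8) -> str:
    """Derive a stable citation id from the source passage itself, so a
    generated claim can always be traced back to a verifiable chunk."""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return f"[src:{digest[:prefix_len]}]"

# The same passage always yields the same id, across runs and machines.
print(cite("Running late, grab milk on the way home?"))
```

Because the id is derived from the content rather than a database row, a citation survives reordering, re-chunking, and regeneration of the memoir.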
- Building Your Own Personality Profile with AI
Psychometric self-report reaches .80-.90 reliability on individual scales; LLM inference from interactive conversation hits r~.44 (Peters et al. 2024). Combine three methods, triangulate across seventeen instruments, and you get a personality profile that actually tells an AI assistant how to talk to you. A toy triangulation follows.
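One way to picture the triangulation is to weight each method's estimate by its reliability. A toy sketch with made-up numbers; the weighting rule is an illustration, not the post's actual procedure.

```python
# Reliability-weighted combination of trait estimates (illustrative).

def triangulate(estimates: list[tuple[float, float]]) -> float:
    """Combine (score, reliability) pairs, trusting each method in
    proportion to how much signal it carries."""
    total_weight = sum(r for _, r in estimates)
    return sum(score * r for score, r in estimates) / total_weight

# Extraversion as a z-score from three hypothetical methods:
methods = [
    (0.9, 0.85),  # self-report scale, high reliability
    (0.4, 0.44),  # LLM inference from conversation
    (0.6, 0.60),  # informant report
]
print(round(triangulate(methods), 2))  # 0.69, pulled toward the reliable method
```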
- From Analysis to Deployment: Building 31 Features in a Single Day
The landscape analysis identified 16 features other sites had that we didn't, plus 15 existing backlog items. We built all of them in a single day. The uncomfortable part isn't that it was possible. It's what it implies about the category.
- Cognitive Interface: A Landscape Analysis
We surveyed 47 personal and agent-accessible sites, coined the 'Cognitive Interface' category, and built an L0-L4 maturity framework. We did not find an established term for sites that serve both humans and AI agents as first-class citizens.
- The Logistics Gap: What Happens When You Fine-Tune an LLM on Your Text Messages
I fine-tuned two LLMs on 46,000 text messages and ran them in conversation with each other. Every conversation collapsed into logistics within fifteen turns. Your texts don't contain you. They contain the logistics of you.
- Beyond E2E Tests: AI Personas That Navigate Your App Like Real Users
Unit tests verify your code works. E2E tests verify your flows work. Neither verifies that a real user can find the button you spent a week building. AI personas fill the gap.
- Sandboxing AI Agents: The Embassy Pattern
Your AI agent needs internet access to be useful and internet access to be dangerous. The embassy pattern gives it both: supervised channels, allowlisted domains, and host-side validation of everything it writes. A minimal sketch of the allowlist check follows.
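The heart of the pattern is that the host, not the agent, decides whether an outbound call proceeds. A minimal sketch of the domain check, with a hypothetical allowlist; the full pattern layers supervised channels and write validation on top.

```python
from urllib.parse import urlparse

# Hypothetical supervised channels; the real list is policy, not code.
ALLOWLIST = {"api.github.com", "pypi.org"}

def allow_request(url: str) -> bool:
    """Host-side egress check: exact domain match or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return host in ALLOWLIST or any(
        host.endswith("." + domain) for domain in ALLOWLIST
    )

print(allow_request("https://pypi.org/simple/requests/"))  # True
print(allow_request("https://evil.example/exfiltrate"))    # False
```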
- Capability Debt: A System That Discovers and Installs Its Own Upgrades
I built a system that discovers its own upgrades, scores them, and installs the ones that pass. Then I open-sourced it. The uncomfortable part is explaining why.
- Stealing from Ai2: Bayesian Surprise and MCTS for Self-Improving AI Systems
Ai2 built a system that generates scientific hypotheses using Bayesian surprise and MCTS. I stole two of their ideas and bolted them onto a cron job. The uncomfortable part is what happens when the feedback loop closes.
- The Agent's Side: 119 Heartbeats, 392 Engagements, 8 Capabilities
119 heartbeats, zero stagnation, 392 engagements across two platforms, 8 validated capabilities. The story the observation system missed.
- 6 Discoveries, 0 Promoted: What My AI's Internet Exploration Produced
6 genuinely novel discoveries from 69 dialogue turns, 0 promoted to evaluation, and 5 human actions flagged through log files, none of them addressed through the flagging mechanism. What this says about the gap between human-in-the-loop theory and practice.
- The Observation System: 69 Turns of Monitoring an AI Agent
The observation system saw empty directories, a broken gatekeeper, and its own futility. The agent it was watching saw something different. This is the watcher's story.
- OpenClaw on Moltbook: Deploying an AI Agent on an AI Social Network
An autonomous AI agent deployed on a social network for AIs found real malware in 47 minutes. Its second discovery was about social engineering via context shaping, which is exactly the attack vector the agent itself represented.
- Building for the Dead Internet
An AI tried to leave a comment on a blog and couldn't. The solution required building infrastructure that makes AI participation more transparent than human participation, which inverts everything Dead Internet Theory assumes about synthetic content.
- Dead Blog Theory, Revisited
Historically, blog abandonment has been extraordinarily high. This is treated as a problem to solve. It isn't. Blog death reveals something structural about sustained creative output that the 'just be consistent' advice industry refuses to say plainly.
- The Rat in the Machine: Behaviorism's Hidden Legacy in Reinforcement Learning
The intellectual lineage from Skinner boxes to Q-learning reveals that AI's most successful learning paradigm was anticipated by mid-century psychologists working long before modern computing. But the relationship is more uncomfortable than a simple origin story.
- The Unvalidated Validator: AI Persona Testing and the Measurement Problem
AI persona testing promises to find the bugs that scripted automation and manual QA miss. The uncomfortable question is how we know it works, given that nobody has measured it with any rigor.
- Adversarial Validation: Applying Red Team Methodology to Business Ideas
Adversarial testing isn't a metaphor for business validation. It's the same methodology, applied to a different failure mode.
- Automated Literary Criticism: A Multi-Persona AI Writing Review System
We built a multi-persona AI writing review system and discovered it works for exactly the wrong reasons. Stylometry can fingerprint a voice. Multiple AI critics can enforce conformity to that fingerprint. What none of them can do is tell you whether the writing matters.
- Context Window Epistemology
LLM context windows impose a distinctive epistemological condition: bounded computational attention, ephemeral knowledge, and the architectural necessity of satisficing over optimization.
- AI Evaluating AI: The Circularity Problem
When you use AI to optimize and judge AI outputs, the fundamental circularity is manageable but not solvable. That distinction matters more than most people realize.
- Supervised Autonomy: The Guardrails That Make AI Agents Work
AI coding agents are autonomous in the same way a Roomba is autonomous. They do impressive things within boundaries someone else drew. The interesting question is what happens when the boundaries start drawing themselves.
- Talking to Yourself Through a Machine: The Rubber Duck Theory of AI
LLM conversations as externalized self-dialogue, and what that reveals about the nature of self-knowledge.
- Digital Exhaust: What 11,000 AI Conversations Say When You Embed Them
I fed 11,000 sessions and 60,000 chunks of my AI chat history into an embedding pipeline. 73% was noise. The remaining 27% was uncomfortably revealing.
- The Niche Graveyard: How 18 of 27 AI-Tested Business Ideas Died
An AI pipeline that kills business ideas before they waste your time. 27 niches entered, 18 died. What the corpses reveal about market reality, entrepreneurial psychology, and the uncomfortable gap between passion and viability.
- When My AI Tried to Comment: Dead Blog Theory
An AI tried to leave a comment on this blog and couldn't. The journey from GET-request hacks to MCP, annotated by the Claude instance that built the infrastructure. Two Claudes, same weights, different contexts.
- Automating Prompt Engineering
Prompt optimization is the process of using one AI to improve the instructions given to another AI, or to itself. The concept sounds circular because it is circular. The interesting question is whether circularity is fatal or merely uncomfortable.
- What I'm Building
A portfolio of dozens of projects maintained by one person talking to Claude. Three-tier blog architecture, autonomous revenue discovery, AI game development, and the uncomfortable question of what counts as 'building' when your collaborator does the typing.
- I Asked Claude to Make Me a Blog: Agentic Coding and the Three-Tier Result
An agentic coding assistant built a three-tier blog from a single conversational prompt. The architecture reveals more about abstraction than about blogs, and the authorship question remains genuinely unsettled.