Beyond E2E Tests: AI Personas That Navigate Your App Like Real Users

AI Summary (Claude Opus)

TL;DR: The post describes the development of an AI-driven persona testing framework that navigates applications as fictional but representative users, classifying findings into fixable issues, tradeoffs, and false positives rather than producing binary pass/fail verdicts.

Key Points

  • Three personas with different roles, technical comfort levels, and goals produced three different priority lists for the same application, revealing UX problems that conventional test suites missed because they tested features rather than user intent.
  • A three-category classification system (fixable, tradeoff, false positive) replaced binary pass/fail verdicts, allowing the iterative loop to optimize for eliminating actionable issues while deferring legitimate product decisions.
  • Persona definitions include a voice field that shapes navigation behavior—an impatient, time-pressured persona clicks faster and skips tooltips—making the testing behavior itself sensitive to the temperament of the simulated user.

The post traces the evolution of a persona testing approach from a single Playwright script named after a fictional water utility director to an eight-phase automated improvement loop. The author argues that conventional testing (unit, E2E, and manual usability) fails to catch the gap between functional correctness and practical usability for specific audiences, and that AI-driven personas filling defined roles can systematically expose these gaps. The framework classifies every finding as fixable, a tradeoff, or a false positive, then iterates until fixable items reach zero and all personas exceed a readiness threshold. The system was extracted from a specific SaaS project and open-sourced as persona-probe, positioned as filling an unoccupied niche where no existing open-source tool combines YAML persona definitions, AI-driven navigation, structured classification, and iterative feedback loops.

The persona tests had already been running for over a week when, on January 18, 2026, I extracted the workflow into a standalone Playwright script and named it sarah-test.mjs. The earlier rounds used Claude Code subagent definitions directly; the script was the point where the approach became separable from the IDE. It navigated to a staging deployment of a financial SaaS application I was building for municipal water utilities. It logged in, found the dashboard, created a scenario, and took screenshots at each step. Structurally, it was an E2E test. But the file was not named after a feature or a flow. It was named after a person.

Consider a fictional persona: Sarah Martinez, Director of Riverside Water District. She manages 12,000 service connections, an $8 million annual budget, and a $10 million treatment plant upgrade mandated by the EPA that she has 36 months to deliver. She needs to present 4-5 rate increase scenarios to her city council in 60 days. Her council members are not financial experts; they are local business owners, teachers, and retirees who need simple visuals they can explain to angry constituents at a town hall. Sarah is competent with Excel and QuickBooks but is not a software developer. She expects things to just work.

Sarah is a persona definition, not a real person. But the application had to work for her or it had to work for nobody, because she represented every actual customer I intended to serve. The first sarah-test.mjs was crude: a hardcoded Playwright script that navigated specific URLs and asserted that elements existed. It found real problems. The "Create Scenario" button was three clicks deep in a navigation menu that Sarah would never explore. The financial terminology assumed a CPA's vocabulary that a utility director might not have. The export function produced data that was correct and completely unpresentable to a city council.

These are not bugs. Every test passed. Every endpoint returned 200. Every component rendered. The application was functionally correct and practically unusable for its intended audience. The gap between those two conditions is what persona testing fills.

From Script to Framework

Within weeks, one script became three. Sarah was joined by Tom Kowalski, a veteran GM managing a $45 million capital improvement program across 28,000 connections, and Maria Chen, a newly promoted director with no formal training on the predecessor's systems, serving 7,500 connections. I expected the three personas to converge on the same priorities. They did not. Each persona had different goals, different levels of technical comfort, and different definitions of success.

The personas were implemented as Claude Code subagent definitions (markdown files that gave the AI model an identity, a backstory, a set of goals, and instructions to navigate the application using browser automation). The key constraint: the persona knows nothing about the codebase. It knows its job title, its problems, its success criteria, and a list of known routes. It navigates the application the way a real user would, by looking at labels, clicking things that seem relevant, and getting frustrated when they're not.

The results were immediate and specific. Sarah couldn't find the DSCR calculator because it was behind a dropdown she had no reason to open. Tom expected entries for capital projects spanning multiple years but could only enter line items with single amounts. Maria wanted a wizard for getting started and got a dashboard that assumed she already knew what she was looking at. Each persona produced findings that the others missed because each persona brought different assumptions about what the application should do.

The scoring matrix crystallized this. Each persona rated feature requests on a 1-5 scale (the following examples are from a water utility cash flow application). "Bill Impact Calculator" scored 5 from Sarah (she needed it for council presentations), 3 from Tom (useful but not critical for his capital planning), and 2 from Maria (she didn't know what it was yet). "Bond Sizing Calculator" reversed: 5 from Tom, 2 from Sarah, 1 from Maria. The same application seen through three different lenses produced three different priority lists. All three were correct.
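
The scoring matrix above can be sketched in a few lines. This is an illustrative reconstruction, not the framework's actual code: the scores come from the post, but the aggregation logic and function names are assumptions.

```python
# Hypothetical sketch of the persona scoring matrix: each persona rates
# each feature request 1-5, and each persona's priority list is just the
# features sorted by that persona's own scores.
scores = {
    "Bill Impact Calculator": {"sarah": 5, "tom": 3, "maria": 2},
    "Bond Sizing Calculator": {"sarah": 2, "tom": 5, "maria": 1},
}

def priority_list(persona: str) -> list[str]:
    """One persona's view of the backlog, highest score first."""
    return sorted(scores, key=lambda feature: scores[feature][persona], reverse=True)

print(priority_list("sarah"))  # Bill Impact Calculator first
print(priority_list("tom"))    # Bond Sizing Calculator first
```

Same data, three sort orders: the point is that no single ranking is "the" correct one.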

The Classification Problem

The first iteration of persona testing produced binary verdicts: PASS or FAIL. This was useless. A persona that fails because a page didn't render and a persona that fails because a label was confusing are not the same kind of failure. The first is a bug. The second is a design decision. Treating them identically means you either fix everything (expensive) or ignore everything (negligent).

The system evolved to classify every finding into one of three categories:

Fixable: CSS change, label update, navigation link, component tweak. Things that can be done in a pull request without architectural debate. The DSCR calculator being undiscoverable is fixable, because you add a navigation link. Toast notifications disappearing too fast is fixable, because you increase the duration.

Tradeoff: New feature, architectural change, significant design decision. Things that require deciding whether the effort is worth the return. Entries for capital projects spanning multiple years require a new data model. A presentation mode for council meetings requires a new view. These are legitimate product decisions, not bugs.

False positive: Misunderstanding of an existing feature. The persona navigated to /debt-service and reported a 404, but the correct route was /scenarios/{id}/debt. The feature exists; the persona didn't find it. The fix is not to the application; it is to the persona's list of known routes, injected as a clarification in subsequent runs.

The three-way classification changed what the system optimized for. Instead of minimizing failures, it minimized fixable failures. Tradeoffs were documented but deferred. False positives were fed back as context. The loop could exit cleanly only when zero fixable items remained and all personas scored above 60% readiness, a threshold that meant the application was usable, not perfect.
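
The exit condition can be sketched directly: zero unresolved fixable findings and every persona at or above the 60% readiness threshold. The categories and threshold are from the post; the field names (`category`, `resolved`) are assumptions, not the framework's actual schema.

```python
# Illustrative sketch of the three-category classification and the
# loop's exit condition. Tradeoffs and false positives never block exit.
FIXABLE, TRADEOFF, FALSE_POSITIVE = "fixable", "tradeoff", "false_positive"

def should_exit(findings: list[dict], readiness: dict[str, float]) -> bool:
    """Exit only when no fixable finding remains open and all personas are ready."""
    open_fixables = [
        f for f in findings
        if f["category"] == FIXABLE and not f.get("resolved", False)
    ]
    all_ready = all(score >= 0.60 for score in readiness.values())
    return not open_fixables and all_ready

findings = [
    {"category": FIXABLE, "summary": "DSCR calculator undiscoverable", "resolved": True},
    {"category": TRADEOFF, "summary": "Multi-year capital project entries"},  # deferred
    {"category": FALSE_POSITIVE, "summary": "/debt-service 404"},  # fed back as context
]
readiness = {"sarah": 0.72, "tom": 0.65, "maria": 0.61}
print(should_exit(findings, readiness))  # True
```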

The Iterative Loop

The persona tests became one phase of an improvement cycle with eight automated phases. The full sequence: plan the fix, review the plan with a second AI model, implement, review the code, deploy to staging, run a visual fidelity inspection, run persona tests, triage the results. Phases six through eight (visual inspection, persona testing, and triage) were mandatory and could never be skipped, because they produced the quality scores that determined whether the loop continued or exited.
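
As a sketch, the eight phases and the "never skip the quality gates" rule look like this. The phase names are paraphrased from the sequence above; the enforcement logic is my own illustrative assumption.

```python
# Sketch of the eight-phase improvement cycle. Phases six through eight
# are quality gates: they produce the scores that decide whether the
# loop continues, so skipping them is treated as an error.
PHASES = [
    ("plan_fix", False),
    ("review_plan", False),
    ("implement", False),
    ("review_code", False),
    ("deploy_staging", False),
    ("visual_inspection", True),   # mandatory
    ("persona_tests", True),       # mandatory
    ("triage", True),              # mandatory
]

def runnable_plan(skip: set[str]) -> list[str]:
    """Return the phases to run; refuse to skip a mandatory gate."""
    for name, mandatory in PHASES:
        if mandatory and name in skip:
            raise ValueError(f"phase {name!r} is a quality gate and cannot be skipped")
    return [name for name, _ in PHASES if name not in skip]

print(runnable_plan({"review_plan"}))  # seven phases, gates intact
```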

The visual fidelity inspection was a separate concern from persona testing. Personas judge the application from a user's perspective: can I accomplish my goal? Visual inspection judges the application from a design perspective: are the colors accessible, is the text readable, are the elements aligned? A persona might pass because the feature works while the visual inspector flags that the button contrast ratio fails WCAG guidelines. Both assessments are necessary. Neither subsumes the other.

The protocol for breaking on critical bugs was a practical concession to efficiency. Run the personas sequentially, not in parallel. If Maria Chen hits a critical bug (page fails to render, data doesn't load, authentication breaks), stop immediately. Fix it. Redeploy. Resume testing from where you stopped. Do not let Sarah and Tom discover the same critical bug independently. They will each report it differently, and you will triage three versions of the same problem.
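
The sequential run with the critical-bug break can be sketched as follows; `run_persona` is a stand-in for the real browser-driving test runner, and the critical bug names mirror the protocol above.

```python
# Sketch of the sequential-run protocol: stop at the first persona that
# hits a critical bug so later personas never re-report the same failure.
CRITICAL = {"page_fails_to_render", "data_not_loading", "authentication_breaks"}

def run_all(personas, run_persona):
    """Run personas in order; halt on the first critical finding."""
    completed, halted_on = [], None
    for persona in personas:
        findings = run_persona(persona)
        completed.append((persona, findings))
        if any(f in CRITICAL for f in findings):
            halted_on = persona  # fix, redeploy, then resume from here
            break
    return completed, halted_on

# Simulated results: Maria hits a critical bug, so Sarah and Tom never run.
fake_results = {"maria": ["page_fails_to_render"], "sarah": [], "tom": []}
done, halted = run_all(["maria", "sarah", "tom"], lambda p: fake_results[p])
print(halted)  # maria
```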

What Persona Testing Is Not

It is not unit testing. Unit tests verify that your functions produce correct output for given input. Persona tests do not care about your functions. They care about whether a water utility director can find the rate comparison feature and produce something her council will approve.

It is not E2E testing. E2E tests verify that a defined user flow (login, navigate, create, save) works end to end. Persona tests do not follow defined flows. They follow the persona's intuition about where things should be. Sarah's flow is different from Tom's flow is different from Maria's flow, and none of them are the flow you designed.

It is not usability testing. Usability testing requires humans, takes weeks to schedule, and produces qualitative feedback that resists automation. Persona testing runs on every deploy, takes minutes, and produces structured JSON with readiness percentages and actionable items. It is worse than human usability testing at capturing nuance and better at catching regressions, which makes it complementary rather than competitive.
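
The post says each run emits structured JSON with readiness percentages and actionable items. A plausible shape, with field names that are my assumption rather than the framework's documented schema:

```python
import json

# Hypothetical persona-run report: one readiness score per persona plus
# classified findings. Only the overall shape is implied by the post.
report = {
    "persona": "Sarah Martinez",
    "readiness": 0.72,
    "findings": [
        {"category": "fixable", "summary": "Toast notifications disappear too fast"},
        {"category": "tradeoff", "summary": "Presentation mode for council meetings"},
    ],
}
print(json.dumps(report, indent=2))
```

Because the output is machine-readable, the triage phase can diff it against the previous run instead of re-reading prose.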

It is not a replacement for talking to actual users. The personas are approximations. They encode my assumptions about what a water utility director needs, filtered through an AI model's interpretation of those assumptions, expressed through browser automation's limited interaction vocabulary. Every layer adds distortion. But the distortion is consistent and automated, which means it catches the things that would otherwise slip through the gap between "works on my machine" and "works for Sarah."

The Persona Definition

In the open-source framework, a persona is a YAML file. The originals were Claude Code markdown agent definitions; YAML was chosen for the extraction because it is portable across AI providers and easier to edit without understanding the agent framework.

name: Sarah Martinez
title: Water Utility Director
experience: 8 years in utility management, CPA background
technical_comfort: moderate

context: |
  Sarah needs to prepare a rate increase proposal for her city council.
  She has 60 days to build a convincing financial case.

goals:
  - Create a financial scenario for next fiscal year
  - Generate charts showing revenue projections
  - Export a presentation for council members

scenarios:
  - name: Rate Proposal
    description: Build a rate increase case for council presentation
    success: Exported a presentation with clear revenue projections

  - name: Scenario Comparison
    description: Compare multiple rate increase options side by side
    success: Can articulate tradeoffs between scenarios to council

evaluation_criteria:
  - name: Onboarding Clarity
    weight: critical
    question: Did the app guide me without training?

  - name: Terminology
    weight: high
    question: Did the words make sense to me?

voice: |
  Speak as Sarah would: practical, time-pressured, not technical.
  "I don't have time to figure this out - show me the numbers."

known_routes:
  Main Dashboard: "/dashboard"
  Scenario List: "/scenarios"
  DSCR Calculator: "/calculators?active=dscr"

critical_bug_protocol:
  stop_on: page_fails_to_render, data_not_loading, authentication_breaks
  continue_on: minor_ui_issues, confusing_labels

The voice field is the one that surprised me. When the AI model adopts Sarah's voice, it doesn't just evaluate differently; it navigates differently. A persona with Sarah's impatient, time-pressured voice clicks faster, skips tooltips, and reports frustration where a patient persona would report a suggestion. The voice shapes the testing behavior, not just the report prose. This is a feature, not a bug: real users have temperaments, and those temperaments affect how they experience your software.
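
One way the voice can shape behavior is simply by being injected into the model's system prompt alongside the rest of the persona. The template below is an assumption for illustration, not the framework's actual prompt:

```python
# Hypothetical assembly of a system prompt from persona fields. The key
# point: `voice` is instruction, so it changes how the agent navigates,
# not merely how it writes up its findings.
def build_prompt(persona: dict) -> str:
    return (
        f"You are {persona['name']}, {persona['title']}.\n"
        f"Context: {persona['context']}\n"
        f"Voice: {persona['voice']}\n"
        "Navigate the application in this voice: an impatient persona "
        "clicks quickly and skips tooltips; a patient one reads everything."
    )

sarah = {
    "name": "Sarah Martinez",
    "title": "Water Utility Director",
    "context": "Prepare a rate increase proposal for city council in 60 days.",
    "voice": "Practical, time-pressured, not technical.",
}
print(build_prompt(sarah))
```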

The known_routes field prevents false positives. The critical_bug_protocol field defines what stops the test versus what gets noted and continued past. Both evolved from operational problems: personas reporting features as missing when they existed at different URLs, and personas spending thirty minutes documenting a blank page when they should have stopped immediately.
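
The false-positive feedback loop reduces to a small operation: when a persona reports a feature as missing that actually lives at another URL, the correct route is added to known_routes for the next run instead of filing a bug. A minimal sketch, with illustrative names:

```python
# Sketch of feeding false positives back as context: each entry maps the
# label the persona used to the route that actually exists.
def feed_back(known_routes: dict, false_positives: list[tuple[str, str]]) -> dict:
    """Return an updated route map; the clarification ships with the next run."""
    updated = dict(known_routes)
    for label, actual_route in false_positives:
        updated[label] = actual_route
    return updated

routes = {"Main Dashboard": "/dashboard", "Scenario List": "/scenarios"}
routes = feed_back(routes, [("Debt Service", "/scenarios/{id}/debt")])
print(routes["Debt Service"])  # the persona now knows where to look
```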

Why Open Source

When I researched the landscape during the pivot from a revenue pipeline to open-source projects, I found a few projects in the space. UXAgent (repo published November 2024, from NEU HAI) uses LLM agents to simulate usability testing with persona definitions and Playwright-driven navigation. But I found no open-source framework that combined YAML persona definitions with a three-category finding classification (fixable, tradeoff, false positive), a false-positive feedback loop, and a quality-gated iterative exit condition. Plenty of tools automate browser interactions. Plenty of frameworks define test scenarios. A few projects use AI models to drive exploratory testing. None matched the specific workflow I needed: define a persona in YAML, point it at your app, classify findings into those three categories, feed false positives back as context, and iterate until quality thresholds are met. That gap is what persona-probe fills. (Originally released as persona-probe, the project has since been renamed to persona-testing.)

The framework is extracted and generalized from the three water director personas that tested a specific financial SaaS application. The details specific to that application (particular routes, particular features, particular financial terminology) are stripped. What remains is the schema, the report format, the classification system, and the iterative loop that ties them together.

Repository: github.com/AshitaOrbis/persona-testing

In my experience, the three water director personas surfaced UX issues I had missed during months of using the application while building it. Not because the issues were hidden; they were obvious the moment someone with Sarah's priorities tried to use the software. The test suite didn't see them because the test suite wasn't Sarah. The framework makes it possible to be Sarah, automatically, on every deploy. That is what was missing.
