User Testing with Agents — Ryan Bollenbach

I figured the answer was an easy no. A week of actually trying it left me less certain, and a bit more impressed than I'd planned to be.

The whole idea sounds like a shortcut, and shortcuts in research usually mean you've quietly stopped paying attention to real people. But I'd been asked to refine a few features in a mobile app, and to try some user testing with Gemini and Claude along the way.

What surprised me

I had the agent run through some tasks and analyze a list of results, and let it make judgement calls based on its own preferences (ad hoc, in this case). Then I ran it again as different people, including:

Someone in a hurry
Someone who will take some time if it saves them money
The expert shopper who will deeply review all options

Same flow, different heads. The interesting part was never a single run; it was where each one got stuck. The agent clearly saw the screen differently than I did, as a designer who'd been staring at it for the past few hours.

A few of the things it flagged stuck with me:

The flow asks for notification permission on the second screen, before giving any reason to say yes. The "skeptical" run declined, and then couldn't find anywhere to turn notifications back on. I'd never have thought to tell a real participant to refuse that prompt.
Our primary button said "Continue," but the agent expected it to save progress, not move on to the next step. Tiny wording, real ambiguity.
One step assumed you already knew what a term in our own product meant. The "second-language" run just stopped there.

None of it was genius. It was the boring, fixable stuff, and that's sort of the point. I'd much rather learn a button is confusing from a free thirty-second run than from a session I paid for.

The types of insights AI agent testing can provide

A week in, the useful stuff sorted itself into a few buckets:

Cold first impressions. The agent has no history with the screen, so it reacts the way a brand-new user would. That's exactly the perspective I'd lost a few hours into designing it.
Divergence between personas. Running the same flow as the hurried shopper, the bargain-hunter, and the careful reviewer showed where the design quietly assumes one kind of person. Each one snagged somewhere different, and those gaps were the most telling part.
Copy and labelling ambiguity. Words that mean one thing to us and something else to someone reading them fresh. The "Continue" button was the obvious case.
Dead ends and unstated assumptions. Steps you can get stuck in with no clear way back, or that expect knowledge the user doesn't have yet.
A rough sense of priority. Because it's cheap to run again and again, you start to see which problems show up across every persona versus which are one-offs. The ones everyone hits are usually worth fixing first.

Running both Gemini and Claude helped here too, since they didn't always get stuck in the same place, and the disagreement was its own kind of signal.
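
The rough sense of priority can be made concrete with a small tally. This is a minimal sketch in Python, assuming you've jotted down each run's flagged issues by hand as short labels. The persona keys, the issue labels, and which persona hit which issue are hypothetical placeholders for illustration, not results from my runs.

```python
from collections import Counter

def rank_issues(runs):
    """Return (issue, run_count, hit_by_every_run) tuples, most common first."""
    counts = Counter()
    for issues in runs.values():
        # set() so an issue mentioned twice in one run still counts once
        counts.update(set(issues))
    total = len(runs)
    # Sort by count (descending), then by label so ties come out in a stable order
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(issue, n, n == total) for issue, n in ordered]

# Hypothetical notes: one list of issue labels per persona run
runs = {
    "hurried": ["continue-label", "early-notification-prompt"],
    "bargain-hunter": ["continue-label"],
    "careful-reviewer": ["continue-label", "unexplained-term"],
}

for issue, n, everyone in rank_issues(runs):
    suffix = " (every persona)" if everyone else ""
    print(f"{issue}: {n}/{len(runs)}{suffix}")
```

The same function works if the keys are model names instead of personas, which is one way to see where Gemini and Claude disagreed.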

What it can't give you is the part that actually matters most: how someone felt, where they hesitated, why they'd quietly give up and never come back. That still takes a real person. The agent points at things. It doesn't tell you how they landed.

Run your test with these steps

Pick one real flow and have the agent move through it cold, no briefing, narrating what it's doing and why.
Run it again as a few distinct personas. The differences between them are where the interesting stuff lives.
Treat everything it flags as a defect list, not findings. It's pointing at things to check, not handing you the answer.
Fix the cheap, obvious problems before a real person ever sees the screen.
Then run proper testing on the cleaned-up version, where actual people can tell you the things an agent can't.
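
The first two steps can be sketched as a small script. This is a rough sketch rather than the setup I actually used: ask_agent stands in for whatever function sends a prompt to Gemini or Claude and returns its reply, the persona wording is my paraphrase, and the "checkout" flow name is made up. The fake_agent stub is only there so the example runs without any API.

```python
from typing import Callable, Dict

# Hypothetical persona briefs; the empty one is the cold, no-briefing run
PERSONAS: Dict[str, str] = {
    "cold": "",
    "hurried": "You are in a hurry.",
    "bargain-hunter": "You will take extra time if it saves you money.",
    "expert-shopper": "You are an expert shopper who reviews every option in depth.",
}

def build_prompt(flow: str, persona: str) -> str:
    """Combine an optional persona brief with the same task instructions."""
    parts = []
    if persona:
        parts.append(persona)
    parts.append(f"Complete this flow: {flow}.")
    parts.append("Narrate what you are doing and why, and say where you get stuck.")
    return " ".join(parts)

def run_personas(flow: str,
                 ask_agent: Callable[[str], str],
                 personas: Dict[str, str]) -> Dict[str, str]:
    """Run the same flow once per persona and keep each transcript by name."""
    return {name: ask_agent(build_prompt(flow, brief))
            for name, brief in personas.items()}

def fake_agent(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client)."""
    return f"[transcript for a prompt of {len(prompt)} characters]"

notes = run_personas("checkout", fake_agent, PERSONAS)
for name, transcript in notes.items():
    print(name, "->", transcript)
```

Keeping the task text identical and changing only the persona brief is the point: any difference between transcripts then comes from the persona, not from a reworded task.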

So, should an agent take the first pass? For me, yes. Not because it replaces anyone, but because it clears the obvious junk out of the way so the real testing can be about people instead of typos. It's a first pass, not a verdict, and treating it as more than that is how you'd get burned. As a cheap way to catch the boring stuff early, though, I'm sold.

Curious to hear what you think.