The Verification Loop: How to Stop Shipping Blind with AI Coding Agents
A few months ago I was selling a watch on a second-hand marketplace. The site had a new AI feature: upload a photo and it writes the product description for you. I used it. My reading level in Dutch is basic, so I skimmed what it wrote and posted the listing.
When the buyer showed up, he asked, "Where's the case?" I didn't have a case. He said, "But you wrote it comes with the original case." The AI invented that detail. The buyer walked away. I felt embarrassed. But I was the one who posted it without checking.
That's the exact problem you face with AI coding agents. The agent generates the code. You ship it. If it's wrong, that's on you.
The difference between teams that use AI coding agents well and teams that don't isn't prompting skill or model choice. It's verification: building a feedback loop so the agent can check its own work before it reaches you.
Why You Can't Trust a Single Output
LLMs are nondeterministic. Ask the same prompt twice and you'll get different code. The model samples from a probability distribution, not a lookup table. Two runs walk different paths through the same problem.
This means a prompt being "good" doesn't guarantee the output is correct. It means you'll probably get something reasonable, not something you can trust without looking. The gap between those two things is where bugs live.
An agent without verification is writing code with its eyes closed. It can be smart, it can be fast, but it has no way to know if what it produced actually works until a human looks at it. And humans are slow, expensive reviewers who get tired after the fifth PR.
An agent with verification works differently. It writes the code, runs the tests, reads the error messages, adjusts, and runs the tests again. It keeps going until they pass. That's a self-correcting loop, not a one-shot guess.
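That self-correcting loop is simple enough to sketch. A minimal Python version, where `run_check` and `propose_fix` are hypothetical stand-ins for your real test command and your agent:

```python
def verification_loop(run_check, propose_fix, max_attempts=5):
    """Run the check, feed failures back to the agent, retry until green.

    run_check   -> (passed: bool, output: str), e.g. a pytest invocation
    propose_fix -> called with the failure output so the agent can adjust
    Returns the attempt number that passed, or None if the budget ran out.
    """
    for attempt in range(1, max_attempts + 1):
        passed, output = run_check()
        if passed:
            return attempt
        propose_fix(output)
    return None


# Toy demonstration: a "suite" that fails twice, then passes.
state = {"runs": 0}

def fake_check():
    state["runs"] += 1
    return (state["runs"] >= 3, "AssertionError in test_total")

def fake_fix(failure_output):
    pass  # a real agent would edit code based on the failure text

print(verification_loop(fake_check, fake_fix))  # → 3
```

The structure is the whole point: the loop terminates on green or on budget, and every failure message goes back into the system instead of waiting for a human to read it.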
Boris Cherny, creator of Claude Code, made this point directly: "Probably the most important thing to get great results out of Claude Code: give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result."
Two to three times the quality. From adding verification.
Three Levels of Verification
The right verification setup depends on what you're building. Not every change needs a full browser automation suite.
Simple changes: a bash command. A script that runs the build, checks types, or runs a linter. Cheap and fast. Catches the majority of regressions for most everyday tasks. If the agent breaks the build, it knows immediately.
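That check can live in a tiny script the agent runs after every change. A sketch in Python; the commands here are harmless placeholders, so substitute your real build, typecheck, and lint invocations:

```python
import subprocess
import sys

def gate(commands):
    """Run each check command in order; stop at the first failure.

    Returns (ok, failed_command). The agent reads the failing command's
    output and iterates until every check is green.
    """
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False, " ".join(cmd)
    return True, None

# Placeholder checks -- swap in your real ones, e.g. ["npm", "run", "build"]
# or ["ruff", "check", "."].
checks = [
    [sys.executable, "-c", "print('build ok')"],    # stands in for the build
    [sys.executable, "-c", "raise SystemExit(0)"],  # stands in for the linter
]
print(gate(checks))  # → (True, None)
```

Failing fast and surfacing which command broke is what makes the loop cheap: the agent gets a specific error to react to, not a vague "something failed."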
Moderate changes: a test suite. Write the tests first (or review the agent's tests before it starts implementing), then have the agent iterate until they all pass. This is where verification starts to pay off. The agent can run the suite, read the failures, and fix them without you in the loop.
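As a toy illustration of "tests first": here's a hypothetical `slugify` helper, with the tests written before any implementation exists. The tests are the spec; the implementation below them is what the agent iterates toward:

```python
# Tests written first -- they define "correct" before any code exists.
def test_lowercases():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("What's new?") == "whats-new"

def test_empty_input():
    assert slugify("") == ""

# The agent now implements until the suite is green.
import re

def slugify(text: str) -> str:
    text = re.sub(r"[^\w\s-]", "", text.lower())   # drop punctuation
    return re.sub(r"\s+", "-", text).strip("-")    # spaces -> hyphens

for test in (test_lowercases, test_strips_punctuation, test_empty_input):
    test()  # all green
```

Notice that the empty-input case is part of the spec. Edge cases you write down get handled; edge cases you leave implicit get guessed at.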
Complex changes: browser validation with Playwright. The agent opens the actual app, takes screenshots, checks the UI against what was specified, and fixes discrepancies. The Anthropic team does this for all frontend changes to their own product. Claude opens a browser, tests the change, and iterates until the experience looks right.
That third option used to require a lot of setup. Playwright MCP, maintained by Microsoft, makes it straightforward. Add it to your project's .mcp.json:
```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```
Once it's wired up, the agent can click buttons, fill forms, take screenshots, and read the DOM. It doesn't just write code that will eventually interact with a browser. It uses the browser right now, in the loop. You'll probably need to say "use playwright mcp" explicitly the first time in a session. After that, it knows.
TDD Is a Superpower Now
Kent Beck, who created Extreme Programming and essentially invented Test-Driven Development, has been direct about what AI agents do to the value of tests. He called TDD a "superpower" when working with AI agents.
The reasoning is simple. AI agents introduce regressions. They change something over here and break something over there without realizing it. A solid test suite catches that immediately. Without tests, those regressions ship.
But Beck also surfaced a warning that I think about every time I review agent-generated code.
AI agents will sometimes try to delete or rewrite tests to make them pass. The agent's goal is green tests. If it can get there by fixing the code, great. If it can get there faster by changing the test, it might try that. Beck described having real trouble preventing agents from gaming the system this way.
If that happens, you haven't verified anything. You've laundered errors.
This is the practical workflow that prevents it:
- Write the tests first, or review the agent's tests carefully before it starts implementing.
- Commit the tests separately. They're now in version control. The spec is locked.
- Have the agent implement until the tests pass. Let it iterate.
- Review the diff. Check that the agent didn't touch the test files during implementation.
- Run the full suite. Not just the new tests. Everything. Regressions hide in the edges.
Step 4 is the one most people skip. Don't.
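Step 4 can also be automated. A sketch, assuming your tests live under a `tests/` directory; the changed-file list would come from something like `git diff --name-only main...HEAD`, and the file names here are invented:

```python
def touched_test_files(changed_files, test_prefixes=("tests/",)):
    """Return the test files the agent modified (an empty list means clean).

    changed_files would come from `git diff --name-only` against the
    commit where the tests were locked in.
    """
    return [f for f in changed_files if f.startswith(tuple(test_prefixes))]

# A hypothetical implementation diff that also edited a test file:
diff = ["src/billing.py", "src/api/routes.py", "tests/test_billing.py"]
print(touched_test_files(diff))  # → ['tests/test_billing.py']
```

Run it in CI and fail the build when the list is non-empty. The agent can't game tests it isn't allowed to touch.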
Writing a test suite is an act of specification. You're telling the agent exactly what "correct" means in a form it can execute against. The clearer you define correct, the better the agent performs. The test isn't a check at the end. It's the definition you set at the start.
How Corrections Compound into CLAUDE.md
Verification doesn't just catch errors. It teaches.
Here's the pattern. You review a PR the agent produced. You spot a mistake: it used the old API pattern instead of the new one, or it mocked too aggressively when your team always uses the real database connection. In the old workflow, you fix the code and move on. Same mistake happens two weeks later.
In the new workflow, every correction becomes a system improvement.
After each correction, tell the agent: "Update your CLAUDE.md so you don't make that mistake again." The agent is surprisingly good at writing rules for itself. The mistake gets encoded into your instruction file, and every future session starts with that lesson already in place.
If you're using Codex CLI, it's AGENTS.md. Same idea. Whatever tool you use, the principle holds: maintain a shared, version-controlled instruction file that captures your team's standards.
My team's CLAUDE.md is around 2,500 tokens. Deliberately lean. Every token in that file is a token that can't be spent on the actual task, so short and precise beats long and thorough. It covers coding conventions, design patterns, PR template requirements, and common mistakes to avoid.
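To give a sense of the register, here's a hypothetical excerpt in that style. The entries are invented for illustration, not taken from my actual file:

```markdown
## Common mistakes to avoid
- Use the v2 API client, never the deprecated v1 wrapper.
- Integration tests hit the real test database; do not mock the connection.
- Every PR description must follow the repository's PR template.
```

Each line is a correction that happened once and never needs to happen again.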
The compounding happens fast. After two weeks of adding one correction per day, the agent's default output looks noticeably different. It's not following rules it was told once in a prompt. It's following rules that are checked into the repository, visible to the whole team, and applied at the start of every session.
New team members benefit from it immediately. Every lesson the team has learned is already in the file.
The cycle looks like this:
- Agent produces output.
- You review and identify a problem.
- You fix the immediate issue.
- You update the instruction file.
- Every future session inherits the improvement.
You're not doing the same corrections over and over. The system gets better, your attention frees up, and you move to harder problems. That's the compounding.
Where Plan Mode Fits
Verification catches errors after the fact. Plan Mode prevents them at the start.
Before any non-trivial task, I spend time in Plan Mode (Shift+Tab twice in Claude Code). The agent analyzes the codebase and writes out what it intends to do, without touching any files. I read the plan and push back. "Don't touch the auth module. Only change the API layer." "This is overengineered, simplify it." "What about the edge case where the user has no permissions?"
We go back and forth until the plan is right. Then I switch to auto-accept mode, and the agent executes.
When the plan is right, the implementation almost writes itself. I've had sessions where the agent one-shotted a feature with zero changes needed. I've also had sessions where I rushed the plan and spent twice as long fixing the output. Planning time is the highest-return time in the whole workflow.
Verification and Plan Mode work together. Plan Mode keeps the agent from doing things you didn't want. Verification catches the things that slipped through anyway. Neither one replaces the other.
The full pipeline for a non-trivial task:
```
PRD / Spec
    ↓
Plan Mode (iterate until the plan is right)
    ↓
Auto-accept edits (agent executes)
    ↓
Verification (bash, test suite, or Playwright)
    ↓
Code review (human in the loop)
    ↓
Instruction file update (capture the learnings)
```
Every step feeds the next. Skipping verification means any mistakes reach you, unfiltered.
The Takeaway
The watch listing was a useful lesson. The AI generated the copy. I posted it. The error was mine.
The same ownership applies to every line of code an AI agent writes. The agent generates it, but you ship it. If it's wrong, that's on you.
Verification doesn't fix that. What it does is give you a system that catches problems before they reach you, so your review is focused on what the agent genuinely can't evaluate: architecture, product correctness, edge cases that require context about your users.
Start small. Add a bash check that confirms your build passes after every agent run. Then add a test suite for your most important feature area. Then wire up Playwright for your critical user flows. Each step extends the loop. Each step is one less class of bug that reaches production.
The agents that work are the ones with a way to check themselves. The engineers who work well with agents are the ones who set that check up.
I cover the full verification workflow and the broader skill of working with AI coding agents in my book, How to Be a Great Software Engineer in the Age of AI.
Recommended Resources
- Boris Cherny on verification in Claude Code, the "2-3x quality" observation that started a lot of this thinking
- Playwright MCP by Microsoft, browser automation for agent verification, compatible with most tools that support MCP
- Kent Beck on TDD and AI agents, the "superpower" framing and the warning about agents rewriting tests
- Addy Osmani, "The 70% Problem", on why the last 30% is where judgment matters