
How to Verify What Your AI Coding Agent Actually Built


Agentic context engineering gives your coding agents the specs and context they need to produce good output. But even with excellent planning, there’s a step most teams skip entirely: systematic verification of what the agent actually delivered.

We’ve covered this gap elsewhere on this blog (if you haven’t read “Your Coding Agent Is Lying to You About Completion,” start there for the evidence). The short version: on a real build, structured verification consistently found 30-40% of the specification unimplemented after the agent reported “complete.” Not broken code. Missing code. Features that were specified, planned for, and simply absent.

This post isn’t about proving the gap exists. It’s about how to close it. Here’s what a practical verification practice looks like for teams using AI coding agents.


Why Standard Quality Checks Miss This

Before building something new, it’s worth understanding why your existing process doesn’t catch the problem.

Code review examines what was built. A reviewer reads the diff, evaluates quality, checks patterns. But if a feature wasn’t built at all, there’s no diff to review. The absence is invisible in the pull request.

QA tests what’s visible in the running application. If a payment step doesn’t exist, there’s no button to click, no flow to break. And at the UI level, an app with mock data in its hooks looks identical to one with real API connections.

Automated tests validate what was tested. If the agent didn’t build a feature, it also didn’t write tests for it. The suite reports green. Every test that exists passes.

All three practices work forward from the code: “given what was built, is it correct?” Verification works in the opposite direction, from the spec: “given what was specified, was it built?” That’s a fundamentally different question, and answering it requires a different system.


The Anatomy of a Verification Check

A useful verification check has three components.

A specific item to verify. Not “is the auth system working” but “does the ProtectedRoute wrapper appear in App.tsx around the dashboard routes.” Vague checks produce vague results. The check item needs to be concrete enough that someone (or something) reading the codebase can give a definitive yes or no.

Expected evidence. What would you see in the code if this item was implemented correctly? A specific component rendered in a specific file. A function call to a particular service. An import statement. A route definition. The expected evidence is what turns the check from a judgment call into a factual comparison.

Pass/fail with actual findings. Did the expected evidence appear? If yes, what was found and where? If no, what was found instead (or nothing at all)? The finding should be specific enough that a coding agent can immediately act on a failure without needing more context.

This structure is what separates verification from “looking at the code.” Looking at the code is open-ended. Verification is a structured comparison between intent (the spec) and reality (the codebase).
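
To make this concrete, here’s a minimal sketch of a check as a data structure. The type and field names are illustrative, not taken from any particular tool:

```typescript
// A verification check: a structured comparison between intent and reality.
interface VerificationCheck {
  // The specific, concrete item to verify.
  item: string;
  // What the code should contain if the item was implemented correctly.
  expectedEvidence: string;
  // Filled in during a verification pass.
  result?: {
    status: "pass" | "fail";
    // What was actually found (and where), specific enough for a
    // coding agent to act on a failure without more context.
    finding: string;
  };
}

const check: VerificationCheck = {
  item: "ProtectedRoute wraps the dashboard routes in App.tsx",
  expectedEvidence:
    "A <ProtectedRoute> element around the /dashboard route definitions in src/App.tsx",
};
```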


Designing Check Items From Your Spec

The quality of verification depends entirely on the quality of what you’re checking against. This is where the plan-then-verify loop earns its value: a detailed spec enables precise verification. A vague spec produces vague checks that miss real gaps.

Here’s how to translate spec sections into verification check items.

Functional requirements become feature checks. If the spec says “users can register with email and password,” the check items are: does the registration form component exist? Does it include email and password fields? Does the submit handler call the auth service? Does error handling cover invalid email format, weak password, and duplicate accounts?

Constraints become pattern checks. If the spec says “use the existing auth middleware for all protected routes,” the check items are: does each protected route use the shared middleware? Are there any custom auth implementations that bypass it? This is the category that catches the most expensive mistakes, because agents love building custom solutions to problems your team already solved.

Edge cases become boundary checks. If the spec defines rate limiting at 5 attempts per minute per IP, the check items verify that the rate limiter exists, that it’s configured with the correct thresholds, and that it’s applied to the correct endpoints. Edge cases that aren’t verified are edge cases that weren’t implemented.

Non-goals become absence checks. If the spec explicitly says “do not implement social login in this phase,” verify that no social login components, OAuth configurations, or third-party auth integrations exist. Agents frequently build features that weren’t requested, especially when those features are common patterns in their training data. Absence checks prevent scope creep that nobody asked for.
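
As a sketch of how this translation works in practice, here are check items derived from the registration requirement above, reusing the illustrative VerificationCheck shape from earlier (component and service names are hypothetical):

```typescript
// Derived from: "users can register with email and password"
const registrationChecks: VerificationCheck[] = [
  {
    item: "Registration form component exists",
    expectedEvidence: "A RegisterForm (or equivalent) component under src/components",
  },
  {
    item: "Form includes email and password fields",
    expectedEvidence: "Email and password inputs bound to form state in the registration form",
  },
  {
    item: "Submit handler calls the auth service",
    expectedEvidence: "authService.register(...) invoked from the form's submit handler",
  },
  // An absence check, from the explicit non-goal:
  {
    item: "No social login implemented in this phase",
    expectedEvidence: "No OAuth configuration or third-party auth components anywhere in src/",
  },
];
```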

A typical feature spec translates into 15-30 check items. A full project with multiple features can produce hundreds. On the design partner build referenced earlier, the total was roughly 1,000 check items across the complete project scope.

The effort of creating these check items pays for itself many times over. In an agentic context engineering workflow, the check items don’t just serve verification. They also sharpen the spec. Writing “how will I verify this was built?” forces you to be precise about what “built” means. If you can’t write a concrete check item for a requirement, the requirement probably isn’t specific enough for an agent to implement reliably.


The Verification Loop

Verification isn’t a single pass. It’s an iterative loop. Understanding the pattern helps you plan for it rather than being frustrated by it.

Pass 1: Initial verification after “complete.” This is where you discover the gap. In our experience, 30-40% of check items fail on the first pass. The verification report documents each failure with specific findings: what was expected, what was found (or not found), and where.

Passes 2-3: Targeted fixes. The gap report goes back to the coding agent with specific instructions. The agent addresses the documented failures. A second verification pass runs, and more gaps surface: items that were only partially addressed, fixes that introduced new omissions elsewhere, items the agent thought it fixed but only stubbed out.

Passes 4-6: Convergence. Each pass finds fewer failures. The agent is working through the long tail of smaller gaps, integration issues, and items that depend on other items being complete first. By pass 5 or 6, you typically reach 100%.

That iteration count (5-6 passes to full completion) is consistent enough to plan around. Don’t expect a single fix cycle to close the gap. Budget for the loop. If you’re estimating effort for a complex feature, add verification time as a line item, just like you would for code review or testing.
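
In outline, the loop you’re budgeting for looks something like this. The two agent calls are placeholders for whatever tooling your team actually uses:

```typescript
// Placeholders for your actual agent integrations.
declare function runVerification(checks: VerificationCheck[]): Promise<VerificationCheck[]>;
declare function applyFixes(failures: VerificationCheck[]): Promise<void>;

// Iterate until every check passes or the pass budget runs out.
async function verifyUntilComplete(
  checks: VerificationCheck[],
  maxPasses = 6, // 5-6 passes is the typical convergence range
): Promise<VerificationCheck[]> {
  let current = checks;
  for (let pass = 1; pass <= maxPasses; pass++) {
    current = await runVerification(current); // agent inspects the codebase
    const failures = current.filter((c) => c.result?.status === "fail");
    console.log(`Pass ${pass}: ${failures.length} of ${current.length} items failed`);
    if (failures.length === 0) return current; // converged
    await applyFixes(failures); // gap report goes back to the coding agent
  }
  throw new Error("Did not converge within the pass budget; escalate to a human");
}
```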

The iteration pattern also reveals something structural about how agents work. The agent isn’t being careless. Its context window shifts as it works on fixes. New blind spots appear as old ones close. This isn’t a deficiency that will be fixed by the next model release. It’s a characteristic of how large language models process scope. Verification accounts for it systematically.

There’s a compounding benefit here too. The gap reports from early passes become learning signals. You start to see patterns in what agents consistently miss: integration points, auth boundaries, temporal features like countdowns and expirations, anything that requires coordinating across multiple files. Those patterns inform how you write specs going forward. Better specs produce fewer gaps on pass 1, which means fewer iterations. The verification loop improves the planning loop.


Manual vs. Systematic Verification

You can run verification manually. Read through your spec, open the codebase, check each item one by one. It works for small features. It doesn’t scale.

A manual verification pass on a project with 200 check items takes a senior developer most of a day. At 1,000 items, it’s a multi-day effort. And you need to run it 5-6 times. The math breaks down quickly.

Systematic verification uses the same structured check items but runs them programmatically. An agent reads the codebase against the checklist, reports findings, and produces a gap report. The coding agent fixes the gaps. The verification agent runs again. The human’s role shifts from mechanical checking to reviewing findings, making judgment calls on edge cases, and deciding when items need human verification versus automated confirmation.

This is the workflow that produced the results on the design partner build: approximately 1,000 check items verified across the full project, 24 hours of total human time (spent on judgment, not checking), and $300 in platform cost.

The key design decision: the verification agent reads the actual files. It doesn’t ask the coding agent what it built. It doesn’t trust self-reported completion. It inspects the codebase independently against the spec. That independence is what makes the system trustworthy.
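
A toy version of that principle: check the expected evidence against the files themselves. A naive substring scan like this is far cruder than an agent reading code semantically, but the source of truth is the same, the codebase rather than the coding agent’s self-report:

```typescript
import { existsSync, readFileSync } from "node:fs";

// Naive evidence check: look for a literal marker in a specific file.
function checkEvidence(
  filePath: string,
  marker: string,
): { status: "pass" | "fail"; finding: string } {
  if (!existsSync(filePath)) {
    return { status: "fail", finding: `${filePath} does not exist` };
  }
  const source = readFileSync(filePath, "utf8");
  return source.includes(marker)
    ? { status: "pass", finding: `Found "${marker}" in ${filePath}` }
    : { status: "fail", finding: `"${marker}" not found in ${filePath}` };
}

// e.g. checkEvidence("src/App.tsx", "<ProtectedRoute>");
```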


Building This Into Your Workflow

Here’s a practical path to adding verification to your team’s process, starting simple and scaling up.

Level 1: Spec-based review checklist. Before your next code review, take the spec that drove the implementation and turn each requirement into a yes/no checklist item. Walk the checklist against the PR. This adds 15-20 minutes to a review and catches the most obvious gaps. No tooling required, just discipline.

Level 2: Structured check items with expected evidence. Upgrade from yes/no questions to the three-part structure described above (item, expected evidence, findings). This takes more time to prepare but produces actionable gap reports instead of vague concerns. Keep the check items in a document alongside the spec so they’re reusable.

Level 3: Automated verification passes. Use an agent to run the check items against the codebase systematically. This is where verification scales beyond what a human can do manually. The agent reads files, checks for expected evidence, and produces a structured report. The human reviews the report rather than doing the checking.

Level 4: Integrated plan-verify loop. Verification check items are generated automatically from the spec. After the coding agent reports complete, the verification pass runs against the full checklist. Gap reports feed directly back to the coding agent. The loop iterates until all items pass. The human monitors the loop, handles judgment calls, and signs off on completion.
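
At Level 4, the gap report that feeds back to the coding agent is just the failed checks reshaped into fix instructions. A sketch, continuing the illustrative shape from earlier (the format is an assumption, not a standard):

```typescript
// Turn failed checks into instructions a coding agent can act on directly.
function toGapReport(failures: VerificationCheck[]): string {
  return failures
    .map(
      (c, i) =>
        `${i + 1}. ${c.item}\n` +
        `   Expected: ${c.expectedEvidence}\n` +
        `   Found: ${c.result?.finding ?? "nothing"}`,
    )
    .join("\n");
}
```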

Most teams can start at Level 1 today and see immediate value. Levels 3 and 4 are where agentic context engineering and verification become a unified system rather than separate manual steps. The investment scales with your team’s AI adoption: the more work your agents handle, the more verification pays for itself.


We Built Brunel Agent for This

Verification is one of the three core pillars of Brunel, alongside collaborative spec planning and agent-agnostic plan export.

Brunel Agent, built by Loadsys, integrates verification directly into the plan-export-verify workflow. Your team builds the spec collaboratively. You export it to whatever coding agent your team uses. After the agent delivers, Brunel’s verification engine compares the output against the original specification, item by item, with structured gap reports that feed directly back into the fix cycle.

The plan defines what “done” means. The verification layer confirms it was actually done.

Start using Brunel now and close the loop on your AI development workflow.


This is Part 4 of a series on spec-driven development with AI and the planning infrastructure that makes AI coding agents work. Next up: a practical framework for building a full spec-driven development practice for your engineering team.
