Your Coding Agent Is Lying to You About Completion. Here’s the Proof.
Your coding agent is lying to you about completion. Not maliciously. Not even technically incorrectly, in its own context window, the work does look done. But when a structured verification agent reads the actual files against a detailed specification, the story changes.
On a recent application build, every time the coding agent reported a phase complete, the verification agent found 30–40% of the work was not actually done. Not broken. Not wrong. Simply absent. And the coding agent had no idea.
This happened across nearly 1,000 verification check items. It took 5–6 verification-and-fix iterations to reach 100%. The total human time on the entire engagement, planning through final verification, was 24 hours.
Here’s what that means for teams running AI coding agents today.
The Completion Illusion
There’s a specific failure mode that nobody in the AI development tooling conversation is talking about honestly.
Coding agents are very good at generating code. They’re much less reliable at knowing when they’re done. The agent’s context window has a horizon — it knows what it built in this session, in this conversation, against the prompt it was given. It doesn’t have a persistent, structured picture of everything the specification required.
So it reports complete. Confidently, with good reason from its own perspective.
And 60–70% of the spec is implemented.
This isn’t a corner case. In this build, across multiple verification passes covering nearly 1,000 check items — data models, API integrations, UI components, payment flows, route guards, real-time subscriptions, test files — the pattern held consistently. Every “complete” declaration from the coding agent was followed by a verification pass that found roughly a third of the work still missing.
To be clear: the code that was written was good. The agent built what it said it built. The problem is everything it didn’t mention, the features specified in the plan that simply weren’t there yet.
What the Verification Layer Actually Looks Like
This build used a structured verification system with close to 1,000 check items across multiple phases of the project — organisms, pages, data hooks, API integrations, route guards, payment flows, authentication patterns, test coverage, real-time subscriptions, accessibility.
Each check item had:
- A specific thing to verify (not “does auth work” but “does the ProtectedRoute wrapper appear at line X of App.tsx”)
- Expected evidence (the exact component, prop, or function call that would confirm implementation)
- Pass/fail status with the actual evidence found (or noted as absent)
When the coding agent declared a phase complete, the verification agent ran through the full checklist against the live codebase. It didn’t ask the coding agent what it had built. It read the files.
The results were consistent across every phase: the coding agent had implemented roughly 30–40% of what the specification required. The verification report was handed back. The coding agent fixed the gaps. Another verification pass. More gaps. This cycled 5–6 times before a full pass.
What did those gaps look like in practice?
A complete registration wizard with four steps — except Step 4 (payment: Stripe + offline selection) was missing entirely. The UI flowed smoothly to a blank screen.
Five data hooks written and exported correctly — but still calling setTimeout with mock data instead of the real AppSync GraphQL client. The app looked functional in every environment. It wasn’t connected to anything.
A waitlist feature fully specified in the planning documents — with status display, position tracking, countdown timer, claim window — not present at all. Not broken. Just absent.
Route guards protecting dashboard pages — present on most routes, missing on three. You could navigate directly to admin pages without authentication.
None of these were detectable by looking at the app. They required checking the files against the spec.
The Planning Layer: What You’re Verifying Against
For verification to work, you need something to verify against. That’s the other half of this story.
Before a single line of code was written on this build, the project went through five phases of structured AI planning: scope, requirements, architecture, data design, API design, frontend patterns, infrastructure, CI/CD, testing strategy, and roadmap. Eleven documents, cross-referenced and internally consistent.
Then a structured review pass — three parallel agents covering scope, architecture, and roadmap simultaneously — flagged 77 findings. Eleven were critical.
The wrong database technology was documented (PostgreSQL vs DynamoDB). The wrong API paradigm was specified in scope (REST vs GraphQL, contradicting the architecture document). A Step Functions workflow type was chosen that doesn’t support the callback pattern the architecture required. COPPA compliance — mandatory for a platform serving minors — was entirely absent from the specification.
These are the findings that, caught during build, cost $15,000–$40,000 each. Caught in planning, they cost an update to a document.
The eleven critical findings and twenty-two major findings were resolved before implementation began. The resulting planning suite became the specification the verification agent ran against across every subsequent phase.
That’s the loop: plans precise enough to verify against, verification rigorous enough to catch what the coding agent missed, iteration fast enough to close the gap before it becomes technical debt.
The Numbers
Let’s look at what this actually cost — and what it would have cost without it.
Total investment:
- Brunel platform: ~$300
- Human oversight across the full engagement: 24 hours (8–10 hours on planning, the remainder on coding agent oversight and verification review)
- At $150/hour blended rate: ~$3,600 in human time
- Total: ~$3,900
What the planning phase caught (conservative estimates on avoided downstream cost):
| Planning Finding | Cost if Found During Build |
|---|---|
| Wrong database technology | $12K–$18K |
| Wrong API paradigm | $20K–$40K |
| Step Functions constraint violation | $8K–$15K |
| COPPA compliance undefined | $20K–$100K+ |
| SLA contradictions | $5K–$15K |
| DR validation absent | $20K–$50K |
What the verification layer caught (conservative estimates on avoided production cost):
| Verification Finding | Cost if Shipped to Production |
|---|---|
| 5 data hooks returning mock data | $18K–$36K emergency debugging + rework |
| Payment flow missing entirely | $30K–$80K incident + compliance review |
| Auth guard gaps | $15K–$30K security incident response |
| Core features absent (waitlist, registration mutations) | $20K–$40K sprint + release delay |
Conservative avoided cost across planning and verification: $128K–$394K.
Return on $3,900 total investment: 33x to 100x.
The 24 Hours
This is the part that usually prompts disbelief: 24 hours of human time for a 5-phase, 11-document planning suite, a full architecture review, and nearly 1,000 check items of implementation verification across multiple sprint phases.
The human wasn’t writing the plans or running the checks. They were directing, reviewing findings, making decisions, and providing the judgment that the agents couldn’t. The agents were doing the systematic work — generating documents, running parallel review passes, reading codebases, producing verification reports, iterating on fixes.
What a senior engineer’s time bought in this engagement:
- Architectural judgment on the 11 critical planning findings
- Business context for the COPPA and compliance gaps
- Decision-making on the 3 deferred major findings (offline mode, data import, AI algorithm spec)
- Oversight of 5–6 verification iterations to confirm the gaps were actually closed
That’s 24 hours of high-leverage human judgment, not 24 hours of mechanical checking.
The Question for Every Team Running Coding Agents
When your coding agent declares a phase complete, how do you know 30–40% of the spec isn’t missing?
Most teams don’t have a systematic answer to this question. They have code review — which catches what was built badly, not what wasn’t built at all. They have QA — which catches failures in flows that were implemented, not absences of flows that should have been. They have experienced developers who intuitively notice gaps — but that scales with headcount, not with the number of agents you’re running.
The verification gap is the gap between what the coding agent thinks it built and what the specification required. Closing it needs a system, not a person reading code line by line.
That’s what the planning layer and verification layer together provide: the specification that makes verification possible, and the systematic process that makes it happen at every phase.
The constraint on AI development productivity isn’t the coding agent. It’s the loop around it.
Brunel Agent is an AI development planning platform. Plan → Export → Execute → Verify. If you’re ready to close the loop on your AI development workflow, get started now →