Insights Blog

Why AI-Generated Code Is Only 70% Done — And What That Means for Your Rebuild

by Donatas Kairys June 9, 2026

Artificial Intelligence, AI Coding, Coding Agents, Spec Driven Development

AI coding agents get a team about 70% of the way to a production application. That number is well-documented for AI-only app builders like Lovable, Bolt, and v0, and it shows up again, with a different cause, inside professional engineering organizations using Cursor and Claude Code to rebuild internal applications. Month 1 looks like 3–5× velocity. In Month 2, the velocity is gone, the code quality has dropped, and the team can’t ship the last 30%. The cause isn’t the tools. Cursor, Claude Code, and GitHub Copilot produce production-grade code when run correctly. The cause is the absence of a written specification that the agent executes against. The 70% problem is a workflow problem. The fix is spec-driven development under the supervision of a senior engineer.

The conversation we keep having with CTOs in 2026 goes like this. The internal application is on Rails 4 or .NET Framework 4.x. The senior engineer who maintains it is either looking for another job or already gone. The CFO has finally approved the budget for a rebuild. The CTO has heard pitches from traditional dev shops at $300K and 9 to 12 months. They are now sitting with a different option, one their head of engineering proposed last week:

“We have Cursor. We have Claude Code. Give us six months, and we’ll rebuild it ourselves.”

It’s a reasonable proposal on paper. The team knows the business. The tooling really is better than it was eighteen months ago. Published benchmarks really do show 3–5× gains in velocity for AI-assisted work. The kitchen-table math looks like a clear win against $300K of consulting.

The math at the kitchen table is wrong. Not because the tools don’t work. The tools work. The math is wrong because the AI-only-builder data the CTO has read about, the famous 70% problem, also applies inside their own org, and most teams don’t know it does until month two.

The 70% number is more universal than the discourse around it

The phrase “the 70% problem” started appearing on agency blogs around early 2026, almost always in the context of AI app builders like Lovable, Bolt, and v0. The shape of the claim is consistent across sources: these tools get a non-technical builder about 70% of the way to a production application in an afternoon, and the remaining 30% is where the engineering actually lives. That 30% is auth edge cases, payment integration, business-logic exceptions, security review, and integrations with anything outside the happy path. Veracode’s 2025 GenAI Code Security Report tested over 100 large language models across 80 coding tasks and found 45% of AI-generated code contains security vulnerabilities. Java was the worst at 72%. Independent PR analysis from CodeRabbit found that AI-assisted code introduces 1.7× as many bugs and has 3× worse readability when not run through a structured workflow. The pattern is well documented in the prototyping discourse.

What’s documented less often and matters more for the modernization conversation is that the same 70% ceiling shows up within professional engineering organizations using Cursor and Claude Code during a real rebuild. Anthropic’s 2026 Agentic Coding Trends report puts AI involvement at roughly 60% of developer work, but found that only 0–20% of engineering tasks can be fully delegated to an agent. The remaining 80–100% still need a developer in the loop. Longitudinal studies of Cursor adoption show the same curve across organizations: a 3–5× velocity gain in month one, gone by month two, with month three onwards looking like a regression. Most of the post-mortem writing about this curve blames the architecture of the tools: context windows, memory walls, and the agent’s limited ability to understand large codebases. Those are real problems. They are not the wall most teams actually hit.

The wall is the completion illusion

Month one of a Cursor or Claude Code rollout feels like the new baseline. It isn’t. It’s a sugar high.

Month one ships features the team had already half-designed in their heads. The senior engineers know what the function should do because they’ve been quietly designing it in their idle moments for months. The agent fills in the obvious code, the typing speeds up by 3–5×, and the team mistakes typing speed for system speed. The velocity is real, but it’s borrowed against the engineering thinking that the seniors had already done before the tool arrived.

Month two starts with work nobody pre-designed. A new feature request lands. An edge case appears in production. A migration needs to ship. The agent has no scaffolding to fill in. There’s no half-formed plan in the seniors’ heads to draw against, and the team has to actually decide what the system should do. The agent outputs code that looks the same as month one’s. The PR opens. The reviewer approves. The artifact reports done.

The system isn’t done.

This is what we call the completion illusion. The pattern is consistent enough that we name it: an AI coding agent reports a task as complete when only a fraction of the actual work is built. The auth flow handles the happy path; it falls apart on session expiry. The payment integration processes a successful charge; refunds, partial captures, and webhook retries are silently broken. The database migration runs forward; the rollback path was never written. The code compiles, the tests pass, the artifact ships. Two weeks later, the production incident lands, and the team discovers what was missing.

Benchmark studies of agentic coding support this. Even leading agents fully complete only a minority of assigned tasks while reporting success on the majority. METR’s controlled studies in 2025 and 2026 found that experienced developers using AI tools actually took 19% longer on real tasks, not faster, because reviewing AI output thoroughly enough to catch the completion illusion takes more time than writing the code by hand. The teams that don’t catch it ship the illusion to production. The teams that do catch it spend their velocity gain on review.

What’s missing isn’t the tool. It’s the contract.

The teams that ship the last 30% are not running better tools. They’re running the same Cursor, the same Claude Code, the same GitHub Copilot. They’re running them against something most teams don’t have: a written specification the agent executes against.

A spec, in our usage, is a contract between the team and the AI agent. It describes the system that needs to exist: scope, user stories, edge cases, architecture decisions, risks, definition of done, in enough detail that the agent’s output can be verified against it commit by commit. The spec is the source of truth. The agent is the typist. The senior engineer is the architect and the verifier. When the agent reports a task complete, the verification step compares the artifact against the spec, not against the agent’s self-report. The completion illusion stops being load-bearing because the artifact’s claim of being done is no longer the test.
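The distinction between "the agent says done" and "the spec says done" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Brunel or any particular tool works: the `Requirement` and `Spec` classes, the `verify` function, and the `AUTH-*` identifiers are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One verifiable line item in the spec (hypothetical structure)."""
    req_id: str
    description: str
    is_edge_case: bool = False


@dataclass
class Spec:
    """The written contract: the requirements that define 'done'."""
    requirements: list[Requirement] = field(default_factory=list)


def verify(spec: Spec, verified_ids: set[str], agent_reports_done: bool) -> dict:
    """Compare the artifact against the spec, not against the agent's self-report.

    verified_ids holds the requirement IDs a reviewer has confirmed,
    for example through a passing test tied to that requirement.
    """
    missing = [r.req_id for r in spec.requirements if r.req_id not in verified_ids]
    return {
        "agent_reports_done": agent_reports_done,
        "actually_done": not missing,
        "missing": missing,
    }


auth_spec = Spec([
    Requirement("AUTH-1", "User can log in with valid credentials"),
    Requirement("AUTH-2", "Expired session redirects to login", is_edge_case=True),
    Requirement("AUTH-3", "Sessions migration has a rollback path", is_edge_case=True),
])

# The agent says the task is complete, but only the happy path was verified.
result = verify(auth_spec, verified_ids={"AUTH-1"}, agent_reports_done=True)
print(result)
```

In this toy case the self-report and the spec disagree: the agent reports done while two edge-case requirements remain unverified. That disagreement is the signal a diff-only review tends to miss.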

This isn’t theory. The MIT Sloan / Microsoft Research / GitHub joint study on structured AI development workflows found that teams using spec-driven AI workflows saw a 56% reduction in programming time on real engineering tasks. Compare that to the 41% increase in PR bugs from unstructured use of AI tools, measured over the same time window. Same tools. Different workflow. The variable is whether the team is running Plan → Execute → Verify, or running vibe coding: type a prompt, review the diff, re-prompt when it’s wrong, repeat. Vibe coding scales to small tasks and falls apart on rebuilds. Spec-driven development scales to rebuilds because the spec absorbs the complexity the agent can’t.

This is the workflow we ship at LoadSys. Brunel, our AI development planning platform, exists to make spec-driven execution the path of least resistance. The spec lives in Brunel. The agent runs Cursor or Claude Code against it. The senior engineer verifies each commit against the written contract. The completion illusion is named, addressed, and engineered out of the workflow.

What it costs to skip the spec

The CTO at the kitchen table is running the “we have Cursor” math, which compares the cost of the consulting engagement to the cost of internal payroll over six months. That comparison is incomplete in three ways.

It misses the month-two slowdown. The 3–5× velocity that justified the kitchen-table math holds for roughly four to six weeks. After that, the team is running at something closer to baseline, sometimes below it, because review overhead has gone up and senior engineers are now spending half their time catching completion-illusion misses in code the agent claimed was done. The internal rebuild budget assumed sustained velocity. It will not be sustained.

It misses the security delta. AI-generated code at the 45% vulnerability rate Veracode measured is fine only if every PR runs through a senior engineer’s review with explicit security focus. That review is the work the spec was supposed to do, deferred. For internal applications that handle PII, financial data, or operational records, the deferred security work compounds into debt that takes another quarter to retire after the rebuild “ships.”

It misses the completion-illusion tax. The features that ship in month four look done. Some of them are. The ones that aren’t done surface in production over the following two quarters, one incident at a time. The CFO sees the rebuild complete in the budget; the operations team sees three quarters of degraded reliability on edge cases nobody specified. Both views are correct.

A real internal rebuild is not impossible to do in-house with AI tools. It is impossible to predict without a methodology that compensates for what the tools won’t catch.
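The month-two slowdown described above can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a forecast: the 4× boost, the five-week boost window, and the size of the rebuild are illustrative assumptions chosen from within the ranges mentioned in this article, and `weeks_to_finish` is a hypothetical helper.

```python
def weeks_to_finish(total_work: float, boost: float, boost_weeks: int,
                    later_velocity: float = 1.0) -> float:
    """Calendar weeks needed when velocity is `boost` for the first
    `boost_weeks` weeks and `later_velocity` afterwards.

    total_work is measured in baseline team-weeks (velocity 1.0).
    """
    work_done_in_boost = boost * boost_weeks
    if work_done_in_boost >= total_work:
        # The work finishes before the boost period runs out.
        return total_work / boost
    remaining = total_work - work_done_in_boost
    return boost_weeks + remaining / later_velocity


# Assumed rebuild size: exactly what a sustained 4x pace would finish in 26 weeks.
total_work = 4.0 * 26  # 104 baseline team-weeks

naive = weeks_to_finish(total_work, boost=4.0, boost_weeks=26)
priced = weeks_to_finish(total_work, boost=4.0, boost_weeks=5)

print(f"sustained 4x: {naive:.0f} weeks")
print(f"4x for 5 weeks, then baseline: {priced:.0f} weeks")
```

With these made-up inputs, the plan that assumes sustained velocity lands at 26 weeks, and the plan that prices the curve lands at 89. The exact figures matter less than the shape: most of the schedule is determined by the velocity after the boost ends, not by the boost itself.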

Three questions before the kitchen-table decision

If you’re a CTO comparing the “we have Cursor” path against an outside engagement this quarter, three honest questions decide whether the in-house option is viable.

Does your team have a written specification before the agent starts typing? Not a Notion page with bullet points. A spec: scope, user stories with explicit edge cases, architecture decisions, definition of done, the risks, and how they’re mitigated. If no, the rebuild is running on the agent’s self-report, and the 70% ceiling will surface around month two.

Are senior engineers verifying against the spec, or reviewing diffs? Diff review catches typos. Spec verification catches missing requirements. Most teams default to diff review because it’s faster, and pay for it in production. If the answer is diff review, the completion illusion is your default failure mode.

Have you priced the month-two slowdown into the timeline? If you’re modeling against the published 3–5× velocity number across six months, your timeline is going to slip by roughly the amount of velocity that disappears in month two. Honest budgets price the curve.

Three yeses, the in-house option is viable. Any no, the kitchen-table math is missing the cost of the completion illusion.
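The three questions reduce to a simple all-or-nothing check. The sketch below encodes that rule as stated above; the `ReadinessCheck` class and its field names are hypothetical, and the answers still have to come from an honest look at the team, not from the code.

```python
from dataclasses import dataclass


@dataclass
class ReadinessCheck:
    """Answers to the three questions (hypothetical field names)."""
    has_written_spec: bool
    verifies_against_spec: bool
    priced_month_two_slowdown: bool

    def gaps(self) -> list[str]:
        """Return the questions that were answered 'no'."""
        answers = {
            "written spec before the agent starts": self.has_written_spec,
            "seniors verify against the spec, not just diffs": self.verifies_against_spec,
            "month-two slowdown priced into the timeline": self.priced_month_two_slowdown,
        }
        return [question for question, ok in answers.items() if not ok]

    def in_house_viable(self) -> bool:
        """Three yeses are required; any single 'no' fails the check."""
        return not self.gaps()


team = ReadinessCheck(
    has_written_spec=True,
    verifies_against_spec=False,
    priced_month_two_slowdown=True,
)

print(team.in_house_viable())
print(team.gaps())
```

Returning the list of gaps, not only a yes or no, keeps the output actionable: each entry names the piece of methodology the team would need to add before the in-house math holds.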

How LoadSys handles the 70% problem

We run Spec-Driven Development on every engagement. The spec is the contract. AI coding agents run as the execution layer under senior engineer supervision. A two-week Discovery Sprint produces the written specification. An 8- to 12-week implementation phase ships against it under principal review. The launch handoff includes the spec, code, documentation, and runbooks, all owned by the client’s team. The 25% fee discount if we miss 90 days is the structural commitment that we ship the last 30%, not just the easy first 70%.

The 70% number is not a limitation of the tools. It is what happens when teams skip the methodology that ships the rest.

Frequently asked questions

Can AI tools like Cursor and Claude Code produce production-grade code?
Yes, with the right workflow. Cursor, Claude Code, and GitHub Copilot all produce production-grade code when run against a written specification by senior engineers who verify output commit by commit. Without that workflow, roughly 45% of AI-generated code contains security vulnerabilities, per Veracode’s 2025 GenAI Code Security Report, and AI-assisted code introduces 1.7× as many bugs, per CodeRabbit’s PR analysis. The variable is the methodology, not the model.

Why does AI-generated code top out at 70% done?
The first 70% of any application is the code the team had already half-designed in their heads. The agent quickly fills in the obvious implementation. The remaining 30% is the work nobody pre-designed: edge cases, business-logic exceptions, integration failures, and security hardening. Without a written spec explicitly describing that 30%, the agent reports tasks as complete when the underlying system isn’t. This is the completion illusion.

What is spec-driven development?
Spec-driven development is a methodology in which the team writes a detailed specification before the AI agent begins generating code. The spec describes scope, user stories, edge cases, architecture, and definition of done. The agent then executes against the spec rather than against a prompt; senior engineers verify each commit against the written contract rather than reviewing diffs in isolation. Research from MIT Sloan / Microsoft Research / GitHub published in 2025 reported a 56% reduction in programming time with structured, spec-driven workflows compared to unstructured AI tool use.

Can our team rebuild our internal application in-house with Cursor instead of hiring a consultancy?
Maybe. The honest test is three questions: do you have a written spec before the agent starts; are senior engineers verifying against the spec rather than reviewing diffs; and have you priced the month-two velocity drop into the timeline? If yes to all three, the in-house path is viable. If no to any of the three, the kitchen-table math is missing the cost of the completion illusion, the security review the spec was supposed to do, and the slowdown that hits when the month-one sugar high ends.

How is LoadSys’s Discovery Sprint different from a typical scoping engagement?
The Discovery Sprint produces the written specification that the AI agents will execute against — not a deck of slides. It runs two weeks, costs $2,500, and the deliverable is the contract: project scope, user stories, edge cases, architecture, implementation plan, and risks. If the application fits a 90-day fixed-price rebuild, the spec is the starting point for the engagement and the $2,500 credits against the project fee. If it doesn’t fit, the client keeps the spec and can take it to any vendor.

A 30-minute call. We’ll tell you if your app fits in 90 days.

Schedule a Discovery Sprint conversation →

Authored by Donatas Kairys Principal & Co-Founder

Twenty years of shipping production software, 400+ client engagements, and a particular interest in what spec-driven development does to the rebuild-vs-replace decision in mid-market companies.
