When AI-generated code goes wrong, you’re likely grappling with verification debt.
Coding with AI is the future, but verification methods are lagging behind.
A survey of more than 1,100 developers by code verification firm Sonar found that AI tools now account for 42% of all committed code – a figure developers expect to rise to 65% by 2027. But Sonar also identified what it calls a “verification gap”. While 96% of developers don’t fully trust AI-generated code to be functionally correct, only 48% say they always check it before committing.
“AI-assisted coding has expanded the volume of code being developed exponentially,” says Tom Finch, engineering leader at Chainguard, a container security software firm. “At the same time, it has expanded the surface area of the codebase, subjecting it to more risk of error, abuse, and even malicious injection.”
“Code is a liability,” says Christian Kästner, professor of software engineering at Carnegie Mellon University. “Producing more code is not necessarily a good thing.”
He argues that much of the extra output AI enables – internal dashboards, one-off scripts, or the niche tools he builds for his own university classes – is disposable software that nobody would have bothered writing before. The real question is whether the same casual attitude is being applied to production-critical systems where quality and security really matter.
Who’s watching the watchers?
To mitigate the biggest potential harms, it’s important to keep a watchful eye over what’s being pushed into production, reckons Finch. “Without constant guardrails and oversight, we risk introducing vulnerabilities by omission,” he says. “This creates a massive verification debt, as humans are now being asked to manually verify more code than is realistically possible.”
Chainguard’s own research shows that while 93% of engineers find building features rewarding, they now spend only 16% of their week doing it, and 86% of UK-based developers report spending more time maintaining and patching code than creating it.
The problem is that patching that code when it goes wrong becomes far harder, because it was generated by an opaque AI. When humans write code, they carry an inherent knowledge of the decisions they made and why they made them – much as people remember parts of the physical books they’ve read.
Werner Vogels, Amazon’s CTO, described the tension last year. When humans write code, their comprehension comes during the act of creation; when machines write it, comprehension has to be rebuilt during review. He called it “verification debt.”
“It’s been a well-worn cliché in programming circles that taking over someone else’s code is challenging at best,” argues Jamie Boote, associate principal security consultant at application security firm Black Duck. “When supervising agents who produce code, developers whose job goes from solving problems to grading code will eventually be faced with maintaining a code base made up entirely of logic that they weren’t involved in the original creation of.”
If something goes wrong with AI-generated code, there’s no muscle memory of what was written, or why.
Losing the plot
“Now, developer teams are reviewing code they no longer have clear provenance on – how specific components were built, where they came from, or what changed when something breaks,” warns Finch. “Without provenance, we no longer know where the bones are buried when an outage occurs, which makes recovery slow and painful.”
“This is the same problem we had before AI, but now it’s just happening faster,” says Justin Megawarne, managing partner and cofounder of software development consultancy Megaslice. “Most people didn’t understand their codebases even before AI. AI just makes it worse because it can produce so much more code so much more quickly. The fundamental problem hasn’t changed.”
What that lack of understanding looks like in practice became public earlier this month. Amazon’s AI coding tool Kiro reportedly decided the fastest fix to an infrastructure problem was to delete and rebuild an entire production environment, triggering a 13-hour outage of AWS Cost Explorer.
Amazon blamed misconfigured access controls, but insiders said the AI agent made the decision autonomously. It was the kind of decision a human engineer almost certainly wouldn’t have made – and the kind of incident that’s nearly impossible to diagnose quickly when no one on the team fully understands the codebase the agent was modifying.
Cognitive debt rises
Margaret-Anne Storey, professor of computer science at the University of Victoria, recently coined a term for this: “cognitive debt”. In a blog post that went viral among developers, she argued that while technical debt lives in the code, cognitive debt lives in the minds of the developers – and it’s a bigger threat in the age of AI coding. “Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot,” she wrote.
Kästner has seen the consequences of that complacency first-hand. In a recent class assignment, he tasked students with fixing a security vulnerability in an AI agent.
The coding assistant generated what sounded like a robust solution – a two-phase confirmation protocol – but the fix didn’t work. “Some 80% of students lost points in that assignment because they just believed what the agent did,” he says. “The agent was very confident, it sounded very good, but it was completely broken.”
New problems, old solutions
How to prevent AI coding agents from doing damage is another question. But it’s one that Megawarne argues we already have the tools to answer. “We actually already know how to do this as an industry, but for some unfathomable reason, we’re behaving like we’ve forgotten,” he says.
The first is a strong set of specs – both for the AI agent to follow, and for people to check the code against when something goes wrong.
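None of the sources here prescribe a particular format for such specs, but one lightweight way to make one checkable – by a human reviewer or an agent – is to encode its rules as executable tests. The `parse_amount` function and its rules below are purely hypothetical, for illustration: the point is that the contract exists independently of whoever, or whatever, wrote the implementation.

```python
# A hypothetical spec for a money-parsing helper, written as executable
# checks. Whether a human or an AI agent produced parse_amount, it must
# satisfy the same contract before it is committed.

def parse_amount(text: str) -> int:
    """Parse a string like '£1,250' into whole pence (pounds * 100)."""
    cleaned = text.strip().lstrip("£").replace(",", "")
    return round(float(cleaned) * 100)

def check_spec() -> None:
    # Rule 1: currency symbol and thousands separators are ignored.
    assert parse_amount("£1,250") == 125000
    # Rule 2: surrounding whitespace is tolerated.
    assert parse_amount("  42 ") == 4200
    # Rule 3: pence are preserved exactly, with no float drift.
    assert parse_amount("0.99") == 99

check_spec()
print("spec satisfied")
```

Run in CI, checks like these give reviewers something concrete to grade agent output against, rather than relying on line-by-line reading of code they didn’t write.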
The next thing is to ask either human or AI coders to focus on architecture and design. “As a senior engineer, you’d pay a lot of attention to picking the right tech stack and breaking the system into manageable components,” says Megawarne. “Architecture is often dismissed as ‘boxes and lines’, but it’s actually about intellectual control. That speaks directly to understandability.”
Boote suggests that “code with AI contributors needs to have checks at every level – from the line-level or commit-level vulnerability scan, to looking for problems that only emerge as all these interactions combine at the codebase and application level with full automated security scans and application penetration tests.”
It’s only by instituting these kinds of measures that you can guard against the debt that comes with AI-generated code – whether you label that debt cognitive or verification.