Key takeaways:
- With AI, shipping more code is easy. Trusting it is the hard part.
- “Prompt and review” is quietly burning developers out.
- You can’t verify what you can’t observe.
Shipping large amounts of code should feel like a win. But when skilled engineers spend their days auditing machine output instead of building software, the hidden costs to motivation, retention, and code quality start to add up.
In ‘Nobody knows what programming will look like in two years,’ I referred to an issue with code verification. Essentially, when AI is writing all, or the majority, of the code, how do you ensure that code is correct?
Since writing it, I reached out to Liz Fong-Jones, technical fellow at observability vendor Honeycomb, who told me: “There are vigorous debates happening at Honeycomb about this topic.”
One obvious answer is peer code review. This manual inspection of source code by developers other than the author is recognized as a valuable tool for improving the quality of software projects.
In 2008, research by Capers Jones found that formal inspections had a latent defect discovery rate of 60–65%, compared with around 30% for most forms of testing.
Modern code review vs human code review
Modern code review, often featuring the use of tools developed to support it, has been widely adopted by companies like Google and Microsoft. In addition to catching defects, it provides additional benefits such as knowledge transfer, increased team awareness, and creation of alternative solutions to problems.
Human code review also has genuine technical merit as a review and verification layer for AI-generated code. However, it carries a hidden human cost that’s easy to overlook when optimizing for velocity. The issue isn’t that developers object to code review per se. Most accept it as a component of collaborative software development. The problem is one of role identity and professional purpose.
Developers have numerous motivations for entering the field, but a common one is that they want to build things by exercising creativity, problem-solving, and craft. Code review, in its traditional form, sits alongside that creative work.
However, if the primary job becomes ‘write prompts, inspect output,’ the balance is inverted, with the creative core hollowed out and replaced by something analogous to Quality Assurance (QA) on an assembly line.
There is also a difference between reviewing a colleague’s code, where you’re engaging with another human’s reasoning (potentially learning something and contributing to shared ownership) versus auditing machine output for subtle errors. The latter is cognitively demanding yet unrewarding: it requires sustained, high-stakes attention to catch the kinds of plausible errors that AI systems are prone to, without the satisfaction of having created anything yourself.
Two modes of working with LLMs
While experienced developers can report productive, even fulfilling, ways of working with Large Language Models (LLMs), these tend to look quite different from the organizational model described here.
Sam Aaron, the developer of Sonic Pi, is now working on its successor, Tau5 – a free, open-source live coding environment and software synthesizer designed to teach programming through music and act as a professional performance tool. It is, by any measure, challenging software to build.
As part of Tau5, Aaron is working on SuperSonic, a port of the 30-year-old SuperCollider synthesis engine to modern browser audio technology. For this project, there was no documentation to follow, no prior art to copy, and no community to consult. Aaron also had no JavaScript background and knew nothing about SuperCollider’s internals.
“LLMs have allowed me to navigate the space, learn about the technologies, and understand whether or not there’s an effective solution,” he told me.
After 10 months of working with LLMs, he now uses them to rapidly generate prototypes, then discard those prototypes, synthesize what he’s learned, design his own architecture, and either code the final solution himself or use the LLM with “super slow” verification at every step. Crucially, he never lets the LLM commit to Git.
His approach involves two distinct modes of working. The first is exploratory vibe coding – generating rough prototypes freely, not caring about the quality of the code, just trying to establish what’s possible. The main benefit of rapidly generating so many prototypes is that it lets him ‘live in,’ or experience, each one, which gives him richer feedback than imagining it would. It also lets him quickly test seemingly weird ideas – ideas he would probably have just disregarded before – in case there is something valuable there.
Once he has found a solution, he switches gears entirely: working slowly, incrementally, and keeping the engineering decisions firmly in his own hands.
“Lots of people using LLMs describe their approach as ‘I’m trying to get the right spec so I can one-shot the solution.’ That’s the opposite of my approach. I’m happy to one-shot the experimental prototypes, but I don’t see engineering development as a one-shot approach. It’s a painstaking, slow, incremental journey where I’m driving the system.”
Assume the LLM is gaslighting you at all times
Aaron’s first guiding principle is to assume the LLM is gaslighting you at all times. “If you assume the LLM is always lying to you, even if it’s not, you can ask enough questions to overlay it, so you can figure out the truth.”
Early in the SuperSonic project, Aaron asked an LLM to run SuperCollider inside a browser audio worklet. Half an hour later it returned a working demo, but the LLM, unable to do what he’d asked, had quietly substituted its own synthesizer.
“It used the traditional web audio stack, made its own synthesizer, and played that. It tried to do what I had asked, but couldn’t because it’s extremely hard.”
The system did what the user wanted – it made a sound – but it hadn’t done what Aaron needed; hence Kent Beck’s analogy of the LLM as a genie, granting the wish you asked for rather than the one you meant. It’s a great illustration of the verification problem: the output looked correct, and even felt correct, until you examined it closely.
What’s striking about Aaron’s approach is that the creative and intellectual core of the work remains his. The LLM accelerates certain kinds of exploratory work, but the craft, judgment, and engineering decisions belong to him.
He’s using AI the way a sculptor might use a rough cutting tool: it removes material fast, but the sculptor’s hand and eye are always in charge. The approach works because it requires and rewards deep expertise, and because Aaron has retained full agency over the process. As he puts it: “I believe that my many years of development experience have been essential.”
The de-skilling of developers
However, his process likely doesn’t scale well to larger teams. The ‘prompt and review’ model being rolled out across enterprise development teams appears to scale better, but it also inverts that relationship, positioning the LLM as the primary producer and the human as the checker.
Along with feeling less fulfilling, it risks undermining the very expertise that makes meaningful verification possible. Developers who have stopped building things are slowly losing the depth of judgment needed to catch what the AI gets wrong.
‘Prompt and review’ creates a specific kind of professional dissatisfaction that’s worth taking seriously. Autonomy, mastery, and purpose, as outlined in Dan Pink’s book Drive, are clear motivators for employees. Roles stripped of them are potential drivers of burnout.
Asking skilled engineers to spend their days as a verification layer for generative AI may produce short-term output gains while quietly degrading the motivation, retention, and mental health of the very people the system depends on.
The risk is that you end up with developers who are disengaged, teams that hemorrhage talent, and a drop in verification quality as people who find their work meaningless become less effective at it.
“You cannot sit there and review code all day; you will become tired and start stamping [approving] everything,” Fong-Jones told me.
Formal verification
If human code review is a fragile solution, what about more rigorous alternatives? Formal verification, which uses mathematical proof to establish that software behaves correctly, is the most technically robust approach.
However, formal methods have largely been confined to aerospace, nuclear, cryptographic, and hardware design. These are environments where the cost of failure ranges from high to catastrophic, and where there are sufficient resources to employ specialists. Most developers have defaulted to testing and human review, not out of indifference to correctness, but because formal verification was simply beyond practical reach.
AI is beginning to change this calculus. LLMs can now assist with proof generation by predicting proof steps, fixing buggy proofs, and translating informal code into formal representations.
In 2024, research introducing DafnyBench demonstrated that top LLMs, particularly Claude 3 Opus, achieved a ~68% success rate in automatically generating verification hints for the Dafny formal verification language. This large-scale benchmark (over 750 programs) highlighted the potential of LLMs to reduce the burden of writing manual assertions, with performance increasing when models were allowed to retry.
There’s also a structural advantage that makes AI particularly well-suited to this domain. A symbolic proof checker, rather than a probabilistic AI, validates any solution, so the usual unreliability of LLM output becomes less of a liability. You’re trusting the checker, not the generator.
Recent releases like Mistral’s Leanstral (2026), the first open-source AI agent built specifically for the Lean 4 proof assistant, suggest the field is moving toward purpose-built, cost-efficient formal verification tooling, rather than applying general-purpose models to the problem.
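The ‘trust the checker, not the generator’ point can be seen in miniature with Lean 4. Regardless of whether a human or an LLM wrote the proofs below, the kernel mechanically accepts or rejects them – a toy sketch to illustrate the principle, not output from any particular tool:

```lean
-- The proof author is irrelevant: the Lean kernel either
-- accepts this term or rejects it. Nothing probabilistic survives.
theorem add_comm' (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- A small lemma proved by induction, checked step by step:
theorem append_nil' (xs : List Nat) : xs ++ [] = xs := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [ih]
```

An LLM that hallucinates a bogus proof here simply fails to compile; the failure mode is loud, not silent.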
Fong-Jones recently spoke with Geoff Huntley, who believes that formal verification is the long-term future, particularly with LLMs eventually generating their own proofs. However, he accepts that it’s the future, not the present.
Fong-Jones agrees: “With the exception of a few, like the Amazon Aurora team, formal methods verification is presently out of the reach of most teams because it’s a highly specialist skill set to develop formal proofs.”
The challenge is that, before any proof can be written or checked, someone has to specify what ‘correct’ means. For most real-world software, especially business logic, that specification work requires interpreting ambiguous human requirements, making judgment calls about edge cases, and understanding the domain deeply enough to articulate constraints that capture the intent.

You can’t verify what you can’t observe
AI cannot currently do this work, and the developers best equipped to write those specifications are the ones at risk of having their expertise hollowed out by years of prompt and review workflows. Even as formal verification may be becoming more accessible, it still demands the kind of deep engineering judgment that, as an industry, we are in danger of systematically failing to cultivate.
Aaron’s solution, in the absence of formal verification tooling, has been to build deep telemetry directly into his system and lean heavily on Rust’s type system as a form of internal validation. “With LLMs and this lack of trust, I am leaning on the types way more,” he said. Observability became a critical part of his development process, not just an afterthought. The lesson generalizes: you can’t verify what you can’t observe.
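Aaron doesn’t publish the details of how he leans on Rust’s types, but the general technique can be sketched with the newtype idiom. The names here (`SampleRate`, `BufferLen`) are illustrative, not taken from Tau5: the point is that an LLM-generated caller cannot silently confuse units, because the compiler rejects it.

```rust
// Newtype wrappers: the types carry the units, so generated code
// cannot swap a sample rate for a buffer length and still compile.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SampleRate(u32); // samples per second

#[derive(Debug, Clone, Copy, PartialEq)]
struct BufferLen(usize); // samples per buffer

// The signature alone documents the units; the compiler enforces
// that arguments arrive in the right slots.
fn buffer_millis(rate: SampleRate, len: BufferLen) -> f64 {
    (len.0 as f64 / rate.0 as f64) * 1000.0
}

fn main() {
    let ms = buffer_millis(SampleRate(48_000), BufferLen(480));
    assert!((ms - 10.0).abs() < 1e-9); // 480 samples at 48 kHz = 10 ms
    println!("{ms} ms per buffer");
    // buffer_millis(BufferLen(480), SampleRate(48_000)) // <- compile error
}
```

The verification burden shifts from a tired human reviewer to the type checker, which never stamps anything through.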
Building from this, can we use the tooling we’re already familiar with – observability and feature flagging – to act as a verifier? Feature flagging can, to a degree, reduce the blast radius if things go wrong, which allows us more leeway to experiment. Furthermore, observability allows us to develop more confidence with the code over time.
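As a sketch of how a flag limits blast radius, consider a deterministic percentage rollout. The flag name, hashing scheme, and percentages below are invented for illustration, not any vendor’s API (and a real system would use a hash that is stable across releases, unlike `DefaultHasher`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Buckets a user into a percentage rollout deterministically,
/// so an AI-generated code path is exposed to a small, stable
/// slice of traffic while observability builds confidence in it.
fn flag_enabled(flag: &str, user_id: &str, rollout_pct: u64) -> bool {
    let mut h = DefaultHasher::new();
    (flag, user_id).hash(&mut h);
    h.finish() % 100 < rollout_pct
}

fn main() {
    // Simulate 10,000 users against a 5% rollout; each user's
    // assignment is stable across calls, so failures stay contained.
    let n = (0..10_000)
        .filter(|i| flag_enabled("new-ai-path", &i.to_string(), 5))
        .count();
    println!("{n} of 10000 users see the new path");
    assert!(flag_enabled("new-ai-path", "user-42", 100)); // 100% = always on
    assert!(!flag_enabled("new-ai-path", "user-42", 0));  // 0% = always off
}
```

If telemetry from the 5% slice looks healthy, the percentage ramps up; if not, the flag flips off without a deploy.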
Honeycomb’s experience is interesting. The firm has seen a roughly 2x increase in pull requests (PRs) with only ~40% more engineers, compared with this time a year ago. Their SOC 2 controls require human review of every change, but human review is now a serious bottleneck.
“There’s a massive backlog of PRs growing at Honeycomb, and we are in tumult about what to do about it,” Fong-Jones told me. “Is it OK for a human to stamp a PR based on a separate instance of Claude Opus having done a review of the code? If part of the point of PRs is not just compliance control, but also knowledge sharing, then what if code is landing in production that has zero people understanding it? Not the person who wrote it, not the reviewer – zero people understand it.”
As a business with enterprise customers, a related question they are wrestling with is how much churn those customers will tolerate. Fong-Jones said: “Does this new era of AI, where everyone expects features to be delivered overnight, mean we are relaxing our error budget and Service-Level Objectives (SLOs) because we’ve taken two steps back in our software maturity?”
Mob programming with an AI
One emerging pattern at Honeycomb is teams mob programming with AI together in real time. This preserves consensus and satisfies the two-person review rule.
“You have multiple engineers sitting in a room guiding the AI together live. That way there is shared understanding, some degree of multiple people reviewing, and it satisfies the SOC 2 requirement,” said Fong-Jones. It shifts the role of the reviewer, “from looking at it afterwards to co-creating at the same time.”
Honeycomb has found, however, that the mob programming approach works best for greenfield projects. “Greenfield vs brownfield is one axis,” Fong-Jones said. “Stateful vs stateless is another, and a third is the complexity of the problem to be solved.”
Honeycomb’s platform teams are experimenting more with AI than the product teams, and their co-founder and CTO, Charity Majors, wondered why.
Fong-Jones explained: “It turns out that that correlation is not by position in the org, but by what people are working on and the corresponding difficulty of validation. Validating a UI change is a lot harder than validating a query like, ‘Did this Terraform PR succeed?’ Our platform teams are under immense pressure to dogfood AI and agentic workflows, to make sure that we’re staying ahead of the problem. Whereas our product teams are under pressure to deliver features, and the fastest way to do this is to do the work yourself.”
Fong-Jones is confident that observability provides the right foundation for catching problems in production. However, she was candid about a trickier class of failure – slow quality degradation that doesn’t show up in conventional metrics.
She referenced Anthropic’s own postmortem about accidentally serving degraded model outputs: “That didn’t manifest to the user as ‘the system is down.’ It manifested as the system is more stupid, in a way that I can’t fully articulate. That doesn’t show up in latency metrics or error rates – it’s revealed in the number of user turns taken, and in Customer Satisfaction Score (CSAT).”
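A degradation signal like ‘user turns taken’ can be made concrete with a toy detector. The baseline, tolerance, and field names below are invented for illustration, not Honeycomb’s or Anthropic’s actual metrics:

```rust
/// Flags slow quality degradation: the system still returns 200s,
/// but users need noticeably more turns per task than the
/// historical baseline suggests they should.
fn degraded(baseline_turns: f64, recent_turns: &[u32], tolerance: f64) -> bool {
    if recent_turns.is_empty() {
        return false; // no data, no alarm
    }
    let mean = recent_turns.iter().sum::<u32>() as f64 / recent_turns.len() as f64;
    mean > baseline_turns * (1.0 + tolerance)
}

fn main() {
    // Baseline: 3 turns per task, with a 25% tolerance band (alarm above 3.75).
    assert!(degraded(3.0, &[5, 6, 4, 5], 0.25));  // mean 5.0: users struggling
    assert!(!degraded(3.0, &[3, 2, 4, 3], 0.25)); // mean 3.0: within band
    println!("degradation check ok");
}
```

Nothing here would trip a latency or error-rate alert; the signal only exists if you instrument user behavior and compare it against a baseline.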
Beware the AI code verification trap
According to Fong-Jones, working with agentic systems means that: “You have a system that is inherently unpredictable. But verifying software relies on a level of predictability that an agent doesn’t have, so we’re going to embrace the chaos and build an adaptable, resilient system, instead of trying to prevent failures. The question then becomes, ‘What are the necessary feedback loops to make that happen?’”
She also noted that AI is accelerating a pre-existing need rather than creating an entirely new problem. “While this might have happened without AI, AI is forcing organizations like mine to confront this issue, and it can potentially be part of the solution.”
Software is in a transitional phase that doesn’t yet have a clean resolution. The tools are outpacing the processes, and the processes are outpacing the culture. We’ve optimized for generation while treating verification as a problem that will sort itself out. So far it hasn’t, and the humans who understand the systems deeply enough to verify them are a resource we cannot afford to squander.
There’s a parallel to be drawn from Charity Majors’ ‘test in production’ provocation from around 15 years ago. Majors’ observation that User Acceptance Testing (UAT) and QA environments only got you so far, and that a new approach was needed, was controversial at the time. It seems obvious now.
Something similar is true here, although we don’t yet know the answer. Automated verification feels like the only destination that makes sense at agentic scale, but, “We’re all figuring this out together. People are learning in the open. It is still early days,” Fong-Jones said.