Estimated reading time: 6 minutes
Key takeaways:
- AI didn’t create the PR review problem – it made it unsustainable.
- The real issue is data: agents working on incomplete, poorly correlated data produce PRs that fix symptoms, not root causes.
- The fix is layers, not a single solution.
Pull request (PR) reviews were already a known weak point in software development before AI-coding agents arrived. While AI tools didn’t create the problem, they made it impossible to ignore.
Even when humans wrote code at human speed, the model struggled. PRs sat unreviewed for days. Reviewers skimmed 500-line diffs before returning to their own work. Changes nobody fully understood were approved with a rubber stamp. Every engineering org had some version of this, and most quietly accepted it as the cost of moving fast.
Then AI-coding tools showed up and poured fuel on the fire.
Reviewing code has always been harder than writing it
The root problem with PR reviews is a context asymmetry that has nothing to do with AI.
When you write code, you carry everything with you: the tradeoffs you considered, the approaches you rejected, and the reason a particular solution made sense for a codebase. When you review someone else’s code, you’re reconstructing all of that from the diff alone.
That’s hard even when the author is a colleague you work with every day, who can answer questions in Slack and explain their reasoning in the PR description. It’s harder still when the author is an agent that made hundreds of non-deterministic decisions that you have no visibility into.
The context gap that made human PR reviews difficult is far wider with AI-generated code, because the author's reasoning is not available to ask about. What’s more, the volume problem compounds it.
The PR slop problem
Teams using AI-coding agents are now contending with a larger volume of PRs and a lower average quality, a combination that breaks review workflows that were already under strain.
GitHub’s Octoverse 2025 report documents what open source maintainers are calling “AI slop.” These are high-volume, low-quality, and often inaccurate contributions that consume reviewer attention without adding proportionate value. This phenomenon is showing up across engineering organizations of every size.
The underlying mechanism matters. Most AI bug-reporting and data-gathering tools weren’t designed for agents from the ground up. They were built to surface problems for humans to investigate. When an AI layer was bolted on top, the underlying data infrastructure didn’t change. Agents are making decisions on incomplete, poorly correlated, ungrouped data and producing fixes that are plausible-looking but miss the actual root cause.
The result is a 400-line PR that passes continuous integration (CI), looks syntactically correct, and addresses the symptom rather than the failure. A human reviewer catching this has to reconstruct the causal chain the agent never had access to in the first place.
The CEO of a well-known error-monitoring company acknowledged the pattern directly: agents working with low-quality data inputs produce low-quality PRs that take more work to fix.
In short, the problem is that AI agents are being asked to make decisions about systems using data that was never designed for machine reasoning.
A framework for thinking about the fix
Complex systems rarely fail because of a single thing, and they rarely get fixed by one either.
The Swiss cheese model is the most honest framework for thinking about software quality verification. Imagine several slices stacked on top of each other, each representing a defensive layer: a process, a check, a safeguard. Each slice has holes, but as long as the holes don’t align across all slices simultaneously, failures don’t get through. Stack enough imperfect layers and you get a system more reliable than any individual component.
The PR review crisis is a Swiss cheese problem. No single intervention fixes it. What’s emerging is a set of partial answers that each address a specific failure mode and work better together than any of them do alone.
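The stacking argument can be made concrete with some rough arithmetic. The sketch below assumes each layer misses a given defect independently of the others, and the miss rates are invented for illustration rather than measured. Real layers are rarely fully independent, so treat the result as an optimistic bound.

```python
from math import prod

def escape_probability(miss_rates):
    """Chance a defect slips past every layer, assuming the layers fail independently."""
    return prod(miss_rates)

# Hypothetical miss rates per layer (illustrative numbers, not measurements)
layers = {
    "spec check": 0.5,
    "automated verification": 0.4,
    "human review": 0.3,
}

combined = escape_probability(layers.values())
print(f"Best single layer misses {min(layers.values()):.0%} of defects")
print(f"All layers together miss about {combined:.0%}")
```

With these made-up numbers, the best individual layer still misses 30% of defects, while the stack misses about 6%. No slice had to be good on its own for the combination to be useful.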
The emerging layers of PR reviews
Spec-driven development
This addresses the failure mode of misaligned intent. If an agent works from a detailed, well-reasoned specification before writing a single line of code, the output is more predictable and the reviewer has something concrete to check the code against.
Multi-agent competition
This addresses the failure mode of a single agent’s blind spots. Assign the same task to multiple agents simultaneously, let them produce diverging solutions, and select based on which passes the most verification steps, introduces the smallest diff, or avoids new dependencies. Competition creates a signal you wouldn’t get from a single attempt.
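One way to picture the selection step is as a ranking over candidate solutions. The sketch below is a minimal illustration, not a reference implementation: the `Candidate` fields, the agent names, and the priority order (checks passed first, then new dependencies, then diff size) are all assumptions chosen for the example. A real team would weigh these criteria to suit its own codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """One agent's attempt at the task (hypothetical fields for illustration)."""
    agent: str
    checks_passed: int
    diff_lines: int
    new_dependencies: int

def rank_key(c: Candidate):
    # Assumed priority: most checks passed, then fewest new dependencies,
    # then smallest diff. Lower tuples sort first, hence the negation.
    return (-c.checks_passed, c.new_dependencies, c.diff_lines)

def select_candidate(candidates):
    if not candidates:
        raise ValueError("no candidates to choose from")
    return min(candidates, key=rank_key)

candidates = [
    Candidate("agent-a", checks_passed=12, diff_lines=400, new_dependencies=1),
    Candidate("agent-b", checks_passed=12, diff_lines=85, new_dependencies=0),
    Candidate("agent-c", checks_passed=9, diff_lines=40, new_dependencies=0),
]

best = select_candidate(candidates)
print(best.agent)
```

Here `agent-c` has the smallest diff but passes fewer checks, so it loses to `agent-b`, which ties `agent-a` on checks and wins on dependencies and diff size.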
Better data infrastructure for agents
This addresses the failure mode at the root of PR slop. When agents reason with high-quality inputs (unsampled, full-stack, pre-correlated, deduplicated context), the quality of what comes out changes significantly. An agent working from a properly structured causal trace of a production failure behaves differently than one working from a raw log stream.
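To show what "deduplicated" and "pre-correlated" might mean in practice, here is a small sketch that groups raw error events by a fingerprint and then links groups that share a trace ID. The event fields (`service`, `error`, `top_frame`, `trace_id`) and the sample data are hypothetical, and real monitoring tools use far richer grouping logic than this.

```python
from collections import defaultdict

def group_events(events):
    """Deduplicate raw events into groups keyed by (service, error, top frame)."""
    groups = defaultdict(lambda: {"count": 0, "trace_ids": set()})
    for e in events:
        key = (e["service"], e["error"], e["top_frame"])
        groups[key]["count"] += 1
        groups[key]["trace_ids"].add(e["trace_id"])
    return dict(groups)

def correlated_groups(groups):
    """Return trace IDs that appear in more than one group, hinting at a shared cause."""
    by_trace = defaultdict(list)
    for key, info in groups.items():
        for trace_id in info["trace_ids"]:
            by_trace[trace_id].append(key)
    return {t: sorted(keys) for t, keys in by_trace.items() if len(keys) > 1}

# Hypothetical raw event stream
events = [
    {"service": "checkout", "error": "TimeoutError", "top_frame": "payments.client.charge", "trace_id": "t1"},
    {"service": "checkout", "error": "TimeoutError", "top_frame": "payments.client.charge", "trace_id": "t2"},
    {"service": "payments", "error": "ConnectionResetError", "top_frame": "db.pool.acquire", "trace_id": "t1"},
    {"service": "checkout", "error": "TimeoutError", "top_frame": "payments.client.charge", "trace_id": "t1"},
]

groups = group_events(events)
links = correlated_groups(groups)
```

An agent handed only the raw stream sees four events and might patch the checkout timeout. Handed `groups` and `links`, it sees two distinct problems, with trace `t1` connecting the checkout timeout to a connection failure in the payments service, which is a better starting point for finding the root cause.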
Automated verification layers
Layers such as expanded test coverage, CI checks, contract verification, and static analysis catch what reviews used to catch, earlier and faster, without requiring a human in the loop for every change.
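As a rough sketch of how such a gate could sit in front of reviewers, the example below runs a set of mechanical checks and only passes a change on for human judgment when all of them succeed. The `Change` fields, the check names, and the two outcome labels are assumptions made up for this illustration. In a real pipeline each check would call out to a test runner, linter, or contract-testing tool.

```python
from typing import Callable, NamedTuple

class Change(NamedTuple):
    """Summary of a proposed change (hypothetical fields for illustration)."""
    tests_pass: bool
    lint_errors: int
    breaks_api_contract: bool

Check = Callable[[Change], bool]

# Each check returns True when the change is acceptable for that layer.
CHECKS: dict[str, Check] = {
    "tests": lambda c: c.tests_pass,
    "static analysis": lambda c: c.lint_errors == 0,
    "contract": lambda c: not c.breaks_api_contract,
}

def gate(change: Change, checks: dict[str, Check] = CHECKS):
    """Run every mechanical check; only clean changes reach a human."""
    failed = [name for name, check in checks.items() if not check(change)]
    if failed:
        return ("returned to author", failed)
    return ("ready for human judgment", [])

print(gate(Change(tests_pass=True, lint_errors=0, breaks_api_contract=False)))
print(gate(Change(tests_pass=True, lint_errors=3, breaks_api_contract=True)))
```

The point of the design is that the reviewer never sees the second change at all. It goes back to its author, human or agent, with the list of failed checks attached.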
What’s striking about these approaches is that none of them are mutually exclusive. A team doing spec-driven development still benefits from better agent data. A team with strong automated verification could still run multi-agent competition on complex problems. These are layers, not alternatives.
Each layer in the Swiss cheese model takes something off the reviewer’s plate:
- Spec-driven development validates intent before a line of code exists.
- Better agent data means fewer low-quality PRs making it to review.
- Automated verification catches mechanical errors before a human sees them.
What remains are PRs requiring the judgment calls on intent and architecture that reviewers are actually qualified to make (not diffs that should never have reached them in the first place).

What actually comes next?
So, will PR reviews become extinct? The PR review as most teams practice it today (i.e. a human reading a diff line by line and approving a merge) almost certainly will. The volume of AI-generated code makes that model economically and cognitively unsustainable. No amount of tooling bolted onto the existing process will change that fundamental arithmetic.
What replaces it is a layered verification system where human judgment is reserved for intent and architecture rather than implementation.
Humans define what success looks like, set the constraints, and make the calls that require genuine understanding of business context and second-order consequences. Machines handle the verification that those constraints were met.
This significantly changes how teams need to think about bringing AI tools into their workflows. It’s not sufficient to “adopt this new tool and keep everything else the same.” What’s needed is a genuine reorganization of where human attention goes and what it’s for.
The goal of self-healing software has been talked about for years. What’s different now is that the pieces are finally being assembled in the right order: better data for agents to make decisions on, verification layers that don’t require human eyes on every line, and a growing recognition that the review process itself needs to be redesigned.
The way we verify software quality needs to evolve at the same pace as the way we generate it. Keeping a broken review process and hoping better tools paper over it isn’t a strategy.