A new Queen’s University study throws up warning signs for those relying on AI output without careful checking.
A new analysis of 82,845 chat logs shows that popular coding assistants often reply at great length – typically 14 times longer than the human prompt – and much of the code they produce contains basic errors.
Researchers from Queen’s University in Canada examined real-world developer conversations with ChatGPT from the WildChat corpus, containing 368,506 code snippets in more than 20 programming languages. Nearly seven in ten conversations involved some form of back and forth, often because users shifted goals mid-thread or had to clarify missing details. Those chats got deep into the weeds, too: the average model response ran around 2,000 characters, compared with 836 for a typical Stack Overflow answer.
But beyond burning tokens, the quality of the code the AI assistants generated was cause for concern. The issues identified included undefined variables in 75% of JavaScript snippets, invalid naming in 83% of Python snippets (with undefined variables in 31%), missing headers in 41% of C++ code, missing required comments in 76% of Java snippets, and unresolved namespaces in 49% of C# outputs. Those syntactic mistakes weren’t the only problem: maintainability and style issues were common, too.
Check, check and check again
“I think that is a big issue, that it has a lot of defects,” said lead author Suzhen Zhong, a researcher at Queen’s University in Kingston, Canada. She’s particularly worried about the risk of defect-ridden AI-generated code being deployed in a large-scale, real-world project.
The idea of spotting errors in conversation with a chatbot and then fixing them seems sensible, but even there, problems arise. Zhong and her colleagues found that errors persist, and can even worsen, over successive turns.
In Python, the share of conversations with undefined-variable issues rose from 24% to 33% by turn five of a chat. That wasn’t the case across all languages, though: Java’s documentation violations improved with iteration, dropping from 78% to 63%, suggesting some problems are fixable when users explicitly point them out.
Zhong was “really surprised” by how often Python import errors cropped up, and by how uneven assistants were across languages. “It means that an LLM has different capability levels in different programming languages,” she concludes.
How to solve the problem
All those issues don’t mean AI assistants are unusable. Indeed, Zhong’s a fan of the tools in her own work. “I’m using LLMs to generate code a lot,” she says.
But her practical advice for harnessing AI’s efficiencies while ironing out the wrinkles is simple: run the bot’s output through static analysis and feed the diagnostics back into the next prompt. She also says part of the issue stems from humans’ non-specific instructions. “Developers should be very clear about their prompt engineering,” she says.
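Zhong doesn’t prescribe a specific toolchain, but a minimal sketch of that loop in Python might look like the following. It assumes pyflakes as the analyzer (one of several that catch the undefined-variable defects the study counts); the helper names and prompt wording are illustrative, not from the paper.

```python
import os
import subprocess
import sys
import tempfile

def lint_snippet(code: str) -> str:
    """Run a static analyzer over an AI-generated snippet and return
    its diagnostics. pyflakes is one illustrative choice; any linter
    with machine-readable output would do."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # pyflakes flags defects like the study's undefined variables;
        # assumes it is installed (pip install pyflakes)
        result = subprocess.run(
            [sys.executable, "-m", "pyflakes", path],
            capture_output=True,
            text=True,
        )
        return result.stdout
    finally:
        os.unlink(path)

def build_followup_prompt(code: str, diagnostics: str) -> str:
    """Fold the analyzer's findings back into the next prompt,
    per Zhong's suggestion. The wording here is illustrative."""
    return (
        "The code below fails static analysis. Fix every listed issue "
        "and return the complete corrected code.\n\n"
        f"Diagnostics:\n{diagnostics}\nCode:\n{code}"
    )

# A snippet with the kind of undefined-variable defect the study counts
snippet = "def total(prices):\n    return sum(prices) * tax_rate\n"
report = lint_snippet(snippet)
if report:
    print(build_followup_prompt(snippet, report))
```

Swap in eslint for JavaScript or Checkstyle for Java and the same loop covers the other defect classes the study reports.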
Combine those and you get closer to a reliable colleague rather than an unreliable intern wrecking your codebase. That, and bear in mind that your purported productivity gains might not be as significant as you think, according to other research.
