A study by AI researchers at Princeton and the University of Chicago suggests that LLMs are a long way from being able to solve common software engineering problems.
The rise of generative AI tools has already changed the way software developers tackle common day-to-day programming tasks, prompting concerns that the machines will soon either take away these jobs, or reshape them beyond recognition.
This perceived threat from generative AI is already changing how computer science is taught, and nearly half of developers “showed evidence of AI skill threat” – meaning they worried about AI threatening their jobs – in an October 2023 survey by Pluralsight.
But is this fear justified?
An October 2023 study, recently submitted to the International Conference on Learning Representations (ICLR), which takes place in May 2024, suggests that AI is unlikely to replace humans in the software development loop any time soon.
Carlos Jimenez, a Ph.D. student studying AI at Princeton University, worked with a group of his colleagues and peers from the University of Chicago to build an evaluation framework, SWE-bench, for testing the performance of various large language models (LLMs). The benchmark drew nearly 2,300 common software engineering problems from real GitHub issues – typically a bug report or feature request – and their corresponding pull requests across 12 popular Python repositories.
The researchers provided the LLMs with both the issue and the repo code, and tasked each model with producing a workable fix, which was then tested to ensure it was correct. Even so, the models generated a working solution only around 4% of the time.
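The evaluation works roughly like a test harness: give the model the issue text and a snapshot of the repository, ask it for a patch, apply the patch, and run the project's test suite. The sketch below is a simplified illustration of that loop, not the actual SWE-bench harness; generate_patch is a hypothetical stand-in for whichever model is being evaluated.

```python
import subprocess


def generate_patch(issue_text: str, repo_path: str) -> str:
    """Hypothetical stand-in for the model under test: given an issue
    description and a repository checkout, return a unified diff."""
    raise NotImplementedError


def patch_resolves_issue(issue_text: str, repo_path: str) -> bool:
    """Apply the model's proposed patch and run the project's tests.
    Count the attempt as a success only if the patch applies cleanly
    and the test suite passes."""
    patch = generate_patch(issue_text, repo_path)

    # Try to apply the candidate patch to the repository checkout.
    applied = subprocess.run(
        ["git", "apply", "-"],
        input=patch, text=True, cwd=repo_path, capture_output=True,
    )
    if applied.returncode != 0:
        return False  # The patch did not even apply.

    # Run the tests; a non-zero exit code means at least one failure.
    tests = subprocess.run(["pytest"], cwd=repo_path, capture_output=True)
    return tests.returncode == 0
```

In the benchmark itself, success is judged against the tests that accompanied the pull request which originally resolved the issue.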
The researchers' own specially trained model, SWE-Llama, could only solve the simplest engineering issues presented on GitHub, while mainstream LLMs like Anthropic’s Claude 2 and OpenAI’s GPT-4 solved just 4.8% and 1.7% of problems, respectively.
“It’s just a very different type of problem” from the tasks LLMs are traditionally trained on, Jimenez explained in an interview to discuss the findings. “If you’re not going to help the model a lot, they’re not very good at isolating the problem from a large code file, for instance.”
The SWE-bench evaluation framework tested the models’ ability to understand and coordinate changes across multiple functions, classes, and files simultaneously. It required the models to interact with various execution environments, process large amounts of context, and perform complex reasoning. These tasks go far beyond the simple prompts engineers have found success with to date, such as translating a line of code from one language to another. In short: the benchmark more accurately represents the kind of complex work that engineers have to do in their day-to-day jobs.
“In the real world, software engineering is not as simple. Fixing a bug might involve navigating a large repository, understanding the interplay between functions in different files, or spotting a small error in convoluted code. Inspired by this, we introduce SWE-bench, a benchmark that evaluates LMs in a realistic software engineering setting,” the researchers wrote.
What devs make of the results
“We’re not there yet that an LLM really can substitute a software developer,” says Sophie Berger, cofounder and chief technology officer at mobile keyboard company Slate. “There’s a lot of work that has to be done there.”
The low scores that LLMs got on the benchmark were surprising to many. “Five percent is pretty much zero, to be honest,” says Daisuke Shimamoto, engineering manager at PROGRIT, a Japanese company, who has around 20 years of experience in development. “They seem to get distracted by all the noise around the essential bits.”
Maksym Fedoriaka, a senior iOS developer at Applifting, a software company based in Czechia, wasn’t as surprised by the findings of the research. “The main job of a developer is to solve problems, and LLMs aren’t that great at it in a software context,” he says. “Whenever I struggle with something, it’s something cutting edge or obscure and complicated, which means there are not that many resources and discussions about it online. ChatGPT just exits the chat at that point.”
Berger pinpointed a potential computational limit when it comes to context. “The issue is that if you feed a model a bunch of context, even if it can hold enough tokens to fit the entire codebase, that actually makes it harder for the model to pinpoint the exact areas that are relevant to solving an issue,” she says. It results in context overload, which can lead to hallucinations or errors.
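One common workaround, and a setting the SWE-bench authors themselves evaluate, is to retrieve only the handful of files most relevant to the issue before prompting the model, rather than handing it the whole codebase. Below is a minimal sketch of that idea using BM25 lexical ranking via the open-source rank_bm25 package; the function name and the choice of retriever are illustrative assumptions, not the paper's implementation.

```python
from pathlib import Path

from rank_bm25 import BM25Okapi  # pip install rank-bm25


def top_relevant_files(issue_text: str, repo_path: str, n: int = 5) -> list[str]:
    """Rank the repository's Python files by lexical similarity to the
    issue text and return the n best matches to include in the prompt."""
    paths = list(Path(repo_path).rglob("*.py"))
    corpus = [p.read_text(errors="ignore").split() for p in paths]

    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(issue_text.split())

    ranked = sorted(zip(scores, paths), key=lambda pair: pair[0], reverse=True)
    return [str(path) for _, path in ranked[:n]]
```

Retrieval narrows what the model has to read, but as Berger notes, it only helps if the right files surface in the first place.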
A matter of time?
Of course, this limitation could only be temporary at this early stage in the development of LLMs. “Over time, these tools will improve at writing code, and engineers will rely on them more and more,” says Joe Reeve, an engineering manager at San Francisco software company Amplitude. “This will likely evolve the software engineering role, which will become more about reviewing and verifying AI-generated code than writing it.”
“I think we all need to get better at reading code, and finding new bugs that might have crept in,” Shimamoto says.
Put another way: while LLMs are already useful copilots, they aren’t yet capable of being set to autopilot. Shimamoto regularly uses GitHub Copilot, as well as ChatGPT on occasion, “but as the paper says, it’s quite far from things being done automatically by AI,” he says. “Looking at the speed of improvement of all these AIs, I think ultimately, we software engineers will become more like reviewers of the code.”
That’s a future that Jimenez also believes is likely. “I don’t think that we’re going to have fully automated developers very soon,” he says. “But I do think that we will have tools that will make developers’ jobs or lives a little bit easier.”