How much of it counts as “the present” is mostly a matter of what we use it for and how it’s applied. There are big claims about the tasks AI can perform (with or without human guidance). While most people are still experimenting with AI in their personal lives, businesses have begun making serious investments in using it to augment or replace human workers.
One area where this is definitely true is software development. Google estimates that over 25% of its new code is generated by AI, and I suspect that percentage is already outdated. Microsoft estimates that 30% of its code is AI-generated. For these companies, and many others, those percentages are going to increase dramatically over the next several months and years.
With all of this AI-generated code, you might think it’s only a matter of time before the role of software developer is obsolete. I don’t think that’s going to happen, and here’s why. Most developers spend more time debugging than writing code, and since neither humans nor AI write perfect code, debugging will always be a critical task. In fact, it is probably more important that AI tools can help with debugging code than with writing it.
Debug-gym is an environment designed for training and evaluating AI coding tools, primarily those based on large language models (LLMs). Its purpose is to teach agents to debug code interactively, the way human programmers do. Current AI coding tools do well at suggesting fixes by analyzing code and error messages, but they can’t seek out additional information when their proposed solutions don’t work.
Human programmers debug iteratively. First, they analyze the code (and sometimes the requirements) to form a hypothesis about what might be wrong. Then they gather evidence by stepping through the code with a debugger, using what they find to locate and repair the problem. They repeat these steps until the code works.
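To make that loop concrete, here is a minimal sketch of a debugging session using Python’s built-in pdb. The buggy function and variable names are hypothetical, purely for illustration; the pdb commands shown in the comments are the standard ones.

```python
# Step 1: observe a failure in a (hypothetical) buggy function.
def average(values):
    total = 0
    for i in range(len(values) - 1):   # off-by-one: skips the last element
        total += values[i]
    return total / len(values)

# average([2, 4, 6]) returns 2.0, but 4.0 was expected.
#
# Step 2: form a hypothesis, then gather evidence interactively:
#   >>> import pdb; pdb.run("average([2, 4, 6])")
#   (Pdb) break average      # stop inside the function
#   (Pdb) continue
#   (Pdb) next               # step line by line through the loop
#   (Pdb) p total, i         # print state: total never includes values[2]
#
# Step 3: apply the fix and re-run the failing case.
def average_fixed(values):
    total = 0
    for i in range(len(values)):       # iterate over every element
        total += values[i]
    return total / len(values)

print(average([2, 4, 6]))        # buggy result: 2.0
print(average_fixed([2, 4, 6]))  # fixed result: 4.0
```

The point is the cycle itself: hypothesize, inspect live state, fix, and re-run until the failing case passes.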
Debug-gym provides AI agents with access to a toolbox of interactive debugging tools, expanding their capabilities beyond simply rewriting code. These include an interface to Python’s pdb debugger, which lets an agent set breakpoints, step through execution, and inspect variable values, alongside tools for viewing source files and re-running code.
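As a rough illustration, here is a hypothetical sketch of how such an agent toolbox might be wired up. The tool names, the `dispatch` function, and the helper functions are my own illustrative assumptions, not debug-gym’s actual API.

```python
import subprocess

def run_tests(path="."):
    """Hypothetical 'eval' tool: run the test suite, return its output."""
    result = subprocess.run(
        ["python", "-m", "pytest", path],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr

def view_file(path):
    """Hypothetical 'view' tool: return the contents of a source file."""
    with open(path) as f:
        return f.read()

# The agent picks a tool by name each turn, observes the result,
# and folds that observation into its next decision.
TOOLS = {
    "eval": run_tests,
    "view": view_file,
    # a real toolbox would also expose an interactive debugger (pdb)
}

def dispatch(tool_name, *args):
    """Route an agent's tool call to the matching function."""
    return TOOLS[tool_name](*args)

# Example: the agent re-reads a file before proposing a fix.
# content = dispatch("view", "buggy_module.py")
```

The design point is that each tool returns text the model can condition on, so the agent can gather new information between attempts rather than guessing from the original prompt alone.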
Early research used a simple prompt-based agent in debug-gym to explore how well LLMs can use interactive debugging tools. The experiments compared agent performance with and without access to the tools. Even with them, the simple agent rarely solved more than half of the SWE-bench Lite issues.
This tells me we are not there yet when it comes to using AI for debugging. The agent with debugging tools did perform significantly better than the agent without them, which validates interactive, tool-assisted debugging as a promising research direction. But AI debugging still isn’t as good as having a developer do it.
The observed tool usage patterns showed that stronger models used a wider variety of tools and even explored the project structure out of apparent curiosity. Future work in this area will involve training or fine-tuning LLMs specifically for interactive debugging using specialized data, such as trajectory data that records interactions with debuggers. There’s also interest in having AI generate its own tests during debugging, improving trustworthiness (ensuring fixes address root causes), and expanding beyond Python and pdb.
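To give a sense of what that trajectory data might look like, here is a hypothetical sketch: a log of each debugger interaction an agent makes during a session, serialized so it could later be collected into a fine-tuning corpus. The record structure and field names are illustrative assumptions, not any published schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DebugStep:
    step: int          # position in the debugging session
    tool: str          # which tool the agent invoked (e.g. "pdb")
    command: str       # the exact command sent to the tool
    observation: str   # what the tool returned to the agent

# A (made-up) two-step fragment of a session.
trajectory = [
    DebugStep(0, "pdb", "b average", "Breakpoint 1 at example.py:3"),
    DebugStep(1, "pdb", "p total", "6"),
]

# Serialize the session for collection into a training corpus.
record = json.dumps([asdict(s) for s in trajectory])
print(record)
```

Training on many such records is one plausible way to teach a model the command-observation rhythm of interactive debugging, rather than having it learn only from static code and error messages.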
The bottom line is that debugging is a task best left to human developers right now. AI agents with debugging tools can’t match, let alone outperform, their human counterparts. They lack the ability to seek out the information needed to diagnose a problem, and they struggle with the iterative process that successful debugging requires.
In the future, as these tools and the models behind them improve, AI may take on more debugging work, freeing human developers for other tasks. Even then, a human developer will still do the best job at debugging, because the hardest problems rely less on pattern recognition and more on human judgment. Human developers understand the entire system, including configurations, databases, and third-party services. They also understand the business logic and the edge cases that AI tools struggle to recognize.
There is another side to AI’s inability to handle the scenarios mentioned above: humans will need to get better at communicating requirements, so that no ambiguity or implicit domain knowledge is required to develop the solution. If that happens, AI may add more value than it does today. Even so, it still won’t be as valuable as a human team member.
Check out our Agile 101 workshop, suitable for anyone looking to learn the basics of Scrum or Kanban.