From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Abstract
While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked by subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
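To make the bottom-up order concrete, here is a minimal sketch of a post-order pass over a tree of subfunctions. It is not the authors' implementation: the `SubFunction` class, `debug_bottom_up`, and the `stub_repair` function (a stand-in for the LLM-based test-and-fix step) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubFunction:
    """One node of the decomposition tree (hypothetical structure)."""
    name: str
    code: str
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(node: SubFunction,
                    repair: Callable[[SubFunction], str]) -> List[str]:
    """Repair every child before its parent; return the visiting order."""
    order: List[str] = []
    for child in node.children:
        order.extend(debug_bottom_up(child, repair))
    node.code = repair(node)  # children are already repaired at this point
    order.append(node.name)
    return order


def stub_repair(node: SubFunction) -> str:
    """Assumed stand-in for the LLM fix step: swaps one wrong operator."""
    return node.code.replace("a - b", "a + b")


leaf_add = SubFunction("add", "def add(a, b): return a - b")
leaf_fmt = SubFunction("fmt", "def fmt(x): return str(x)")
root = SubFunction("main", "def main(a, b): return fmt(add(a, b))",
                   children=[leaf_add, leaf_fmt])

visit_order = debug_bottom_up(root, stub_repair)
print(visit_order)    # leaves first, root last
print(leaf_add.code)  # the buggy leaf after the stub repair
```

Visiting leaves first means that when a parent is examined, the helpers it calls have already been checked, so any remaining failure is more likely to sit in the parent's own logic.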
Community
MGDebugger, a hierarchical bottom-up LLM code debugger 🔥 that can fix bugs from low-level syntax errors to high-level algorithmic flaws.
It achieves an ⭐️ 18.9% improvement in accuracy over seed generations in HumanEval and a ⭐️ 97.6% repair success rate in HumanEvalFix.
Code and demo available at https://github.com/YerbaPage/MGDebugger.
Brilliant
Approximately, what is the overhead? That is, what is the ratio between the subtotal tokens (finished code) and the total tokens (debugging steps + finished code)?
Great question
Most debugging methods like Self-Debugging, LDB, Reflexion, etc., tend to have a high ratio of debugging tokens to finished code tokens (often > 5), as they perform extensive analyses to identify and resolve bugs. Despite this, they sometimes struggle to detect and fix subtle issues.
In our approach, MGDebugger might incur slightly higher token costs due to the hierarchical decomposition process, where we isolate and debug subfunctions separately. However, the method's effectiveness justifies this overhead since it addresses errors at multiple levels of granularity, allowing it to debug issues that other methods might overlook.
Hope that clarifies things!
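For readers who want to compute the two ratios discussed above, a small sketch follows. The function name `debugging_overhead` and the token counts are assumptions made for illustration; they are not measurements from the paper.

```python
def debugging_overhead(debug_tokens: int, code_tokens: int) -> dict:
    """Return both ratios mentioned in the thread.

    debug_to_code: debugging tokens per finished-code token.
    code_share:    finished code as a fraction of all tokens.
    """
    if debug_tokens < 0 or code_tokens <= 0:
        raise ValueError("debug_tokens must be >= 0 and code_tokens > 0")
    total = debug_tokens + code_tokens
    return {
        "debug_to_code": debug_tokens / code_tokens,
        "code_share": code_tokens / total,
    }


# Illustrative numbers only: 1500 debugging tokens, 300 finished-code tokens.
overhead = debugging_overhead(debug_tokens=1500, code_tokens=300)
print(overhead)
```

With these made-up counts the debugging-to-code ratio is 5, which corresponds to finished code making up one sixth of all tokens.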
We have released our demo on Hugging Face here: https://huggingface.co/spaces/learnmlf/MGDebugger
Thanks to LDB for the inspiration!