AI Proactively Finds Software Bugs before Failures in Realistic Codebases

Researchers are increasingly focused on utilising Large Language Models (LLMs) to automate software development, yet comprehensively evaluating their ability to proactively discover bugs remains a significant challenge. Steven Liu, Jane Luo, and Xin Zhang, working with colleagues from Microsoft, the School of ECE at Peking University, and the Chinese Academy of Sciences, introduce TestExplora, a new benchmark designed to assess LLMs as proactive testers within realistic software repositories. This work addresses a critical gap in current evaluations, which largely focus on regression prevention or reactive bug reproduction rather than on identifying defects before failures occur. TestExplora comprises 2,389 tasks across 482 repositories, deliberately concealing defect-related signals and requiring models to compare implementation against documentation to establish intent. The findings reveal a substantial performance gap: current state-of-the-art models achieve a maximum Fail-to-Pass rate of only 16.06%. At the same time, advanced agentic exploration, exemplified by SWEAgent instantiated with GPT-5-mini, holds considerable promise for autonomous software quality assurance.

Current software assurance techniques primarily focus on preventing regressions, ensuring existing functionality remains intact, or reproducing known issues after they have been reported. This work moves beyond simply verifying code against existing implementations or known bug reports, instead challenging models to compare code behaviour against its intended design as documented in project documentation.

TestExplora comprises 2,389 tasks sourced from 482 distinct software repositories, deliberately concealing any signals related to existing defects. Models are tasked with identifying discrepancies between the code’s implementation and the documented intent, effectively using documentation as the definitive reference point.

This approach necessitates a shift from reactive bug fixing to proactive defect discovery, mirroring the demands of real-world software quality assurance. A key innovation of this work lies in its deliberate obscuring of all signals related to existing defects. Unlike conventional evaluations that rely on known bugs or post-failure data, TestExplora compels models to proactively identify issues by comparing code implementations against documentation-derived intent.

Documentation serves as the definitive oracle, establishing the expected behaviour against which the LLM’s generated tests are assessed. To facilitate this proactive assessment, the research team implemented a continuous, time-aware data collection strategy. This approach ensures the sustainability of the benchmark and minimises the potential for information leakage, a common concern in LLM evaluations.

Models are tasked with autonomously generating tests that reveal discrepancies between the code and its documented purpose, effectively simulating a preventative quality assurance process. The methodology prioritises realistic repository environments, mirroring the complexities of real-world software development.

The experimental setup involved rigorous control over the information available to the LLMs. By withholding defect-related signals, the study isolates the model’s ability to reason about code and documentation independently. Furthermore, the team explored the impact of different context methods, including plain, deep, and dependency-aware configurations, to understand how models leverage available information.

Agentic exploration was also investigated, with agent frameworks such as SWEAgent and TraeAgent instantiated with underlying models to freely explore repositories and generate tests, moving beyond simple input-output interactions. This approach aimed to determine whether autonomous exploration could improve bug discovery rates.

Large language model performance on proactive defect identification in software repositories

Initial evaluations using TestExplora reveal a significant capability gap in current large language models, with the strongest models achieving a maximum Fail-to-Pass rate of only 16.06%. This indicates that even state-of-the-art LLMs struggle to proactively identify defects in realistic software repositories.
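A Fail-to-Pass test is one that fails on the buggy revision of the code and passes once the defect is fixed, which is exactly the signature of a test that has caught the bug. The sketch below shows one plausible way such a rate could be computed; the function name and per-test result format are assumptions for illustration, not the benchmark's actual harness.

```python
# Minimal sketch of a Fail-to-Pass (F2P) rate computation.
# Each result is a pair of booleans: (failed on the buggy revision,
# passed on the fixed revision). A test counts toward F2P only when
# both hold. The benchmark may aggregate per task rather than per test.

def fail_to_pass_rate(results):
    """Return the fraction of tests that fail pre-fix and pass post-fix."""
    if not results:
        return 0.0
    f2p = sum(1 for failed_buggy, passed_fixed in results
              if failed_buggy and passed_fixed)
    return f2p / len(results)

# Example: 2 of 5 generated tests expose the defect and still pass after the fix.
print(fail_to_pass_rate([(True, True), (True, True), (False, True),
                         (True, False), (False, False)]))  # → 0.4
```

Note that a test which fails on both revisions (e.g. a flaky or miswritten test) contributes nothing, which is why raw test volume alone does not drive the score up.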

All 2,389 tasks across the 482 repositories are meticulously curated to exclude any explicit signals of existing defects. Models must surface discrepancies between the code's implementation and its documented intent, with documentation serving as the sole oracle for correctness.

The research highlights that simply generating a larger number of tests does not necessarily correlate with improved performance, indicating the need for more sophisticated testing strategies. The study also reveals model-specific preferences regarding dependencies, suggesting that the way models navigate and utilise external code components significantly impacts their ability to find defects.

The filtering process retained 12,227 preliminary pull requests, subsequently narrowed to a focused benchmark through reachability analysis and intent consistency checks. This ensures that tests meaningfully exercise modified code and accurately reflect potential defects caused by the code patch, differing from prior benchmarks that relied solely on execution-based filtering.
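The reachability check asks a simple question: does a candidate test actually execute the code the patch modified? Below is an illustrative, much-simplified sketch of that idea using Python's built-in `sys.settrace` tracer; the function names and the single-function granularity are assumptions, not the paper's actual tooling.

```python
# Illustrative reachability check: run a test under a tracer and
# record which patch-modified functions were actually executed.
import sys

def reaches_modified_code(test_fn, modified_funcs):
    """Return the subset of modified_funcs that test_fn executes."""
    hit = set()
    def tracer(frame, event, arg):
        # A "call" event fires for every new frame; record matches.
        if event == "call" and frame.f_code.co_name in modified_funcs:
            hit.add(frame.f_code.co_name)
        return tracer
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return hit

def patched_parse(s):
    # Stand-in for a function touched by the pull request's patch.
    return s.strip()

def test_example():
    assert patched_parse("  ok ") == "ok"

print(reaches_modified_code(test_example, {"patched_parse"}))  # → {'patched_parse'}
```

A test that returns an empty set here never exercises the patched code, so any pass or fail it produces says nothing about the defect; filtering such tests out is what makes the retained tasks meaningful.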

The Bigger Picture

The relentless pursuit of software quality has long been hampered by a fundamental asymmetry. We are remarkably good at confirming that software doesn’t break, through regression testing and bug reproduction, but appallingly bad at proactively discovering flaws before users encounter them. For years, the field has relied on reactive measures, patching vulnerabilities after they’ve manifested as problems.

The difficulty lies in defining ‘correctness’ without relying on known failures, a circular problem that this research elegantly sidesteps by using documentation as a source of truth. While current LLMs demonstrate limited ability in this proactive role, with the strongest model reaching only a 16.06% Fail-to-Pass rate, the potential is clear.

The improvement seen with agentic exploration, exemplified by SWEAgent, suggests a promising path forward. However, the benchmark itself, while comprehensive in scope, remains a simplification of real-world complexity. The challenge of navigating intricate cross-module dependencies and the limitations of current LLMs in truly understanding intent are significant hurdles.

Future work must focus on more sophisticated agent designs, incorporating techniques for long-term planning and knowledge retention. Beyond this specific implementation, the broader implication is a shift in focus: from merely automating testing to building AI systems capable of autonomously improving software quality, a goal that demands a more proactive and exploratory approach.

👉 More information
🗞 TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
🧠 ArXiv: https://arxiv.org/abs/2602.10471

Rohail T.

As a quantum scientist, I explore the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
