ChaCo Achieves Full Patch Coverage on 30% of Pull Requests with LLM-Based Test Augmentation

Software projects continually evolve through pull requests, yet ensuring that these changes are thoroughly tested remains a significant challenge. Researchers Zitong Zhou (UCLA), Matteo Paltenghi (University of Stuttgart), Miryung Kim (UCLA), and Michael Pradel (CISPA Helmholtz Center for Information Security) have identified a “last-mile” regression testing gap: lines of code modified in pull requests often go untested despite existing test suites. Their new technique, Change And Cover (ChaCo), addresses this issue by leveraging large language models to generate tests specifically for uncovered code within each pull request. This targeted approach not only improves patch coverage, achieving full coverage for 30% of the evaluated PRs at low cost, but also produces tests that developers find relevant, well integrated, and worth including, as evidenced by positive human reviews, the merging of 8 of 12 submitted tests, and the discovery and fixing of previously unknown bugs.

The team achieved significant improvements in patch coverage by leveraging large language models (LLMs) to augment existing test suites, focusing specifically on lines of code modified within each PR that remained untested. ChaCo doesn’t aim for broad coverage increases, but rather concentrates on ensuring every line of changed code is rigorously tested before integration, thereby enhancing software quality and reducing the risk of regressions.

The study unveils a three-pronged approach to tackle the challenges of LLM-based test generation. Researchers identified that providing relevant test context is crucial for producing useful tests, and developed two techniques to extract this context from existing test functions, fixtures, and data generators. This ensures the LLM isn’t operating in a vacuum, but instead builds upon the established testing infrastructure of the project. Furthermore, the team carefully integrates the generated tests into the existing suite, matching structure and style to maintain consistency and readability, and provides a summary for developer review, promoting acceptance and collaboration.
Experiments show that ChaCo achieved full patch coverage for 30% of the 145 pull requests examined from SciPy, Qiskit, and Pandas, at a cost of only $0.11 per PR, demonstrating both effectiveness and practicality. Human reviewers consistently rated the generated tests highly, awarding scores of 4.53/5.0 for being worth adding, 4.2/5.0 for integration quality, and 4.7/5.0 for relevance to the pull request. Ablation studies confirmed the importance of test context, revealing a 2x increase in coverage when context-aware test generation was employed. The research also demonstrates tangible impact on real-world projects: of 12 submitted tests, 8 were merged, and two previously unknown bugs were discovered and subsequently fixed. The researchers envision ChaCo being integrated into continuous integration (CI) workflows, automating the final stage of regression test augmentation and supporting a more robust and reliable software development process.

To decide when augmentation is needed, the researchers partitioned the executable lines modified in a PR into covered (C) and uncovered (U) sets and defined patch coverage as |C|/|E|, where E = C ∪ U is the set of all executable changed lines. If a PR already achieved 100% patch coverage, it was considered fully covered and ChaCo terminated; otherwise, the system proceeded to generate tests targeting lines in U.
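To make the patch-coverage criterion concrete, here is a minimal Python sketch, under the assumption that the changed, executable, and covered line numbers are already available from a diff parser and a coverage tool; the helper name and data layout are illustrative, not ChaCo’s code.

```python
# Illustrative sketch: partition the executable lines touched by a PR into
# covered (C) and uncovered (U) sets, then compute patch coverage |C| / |E|.
# The inputs are assumed to come from a diff parser and a coverage tool
# such as coverage.py.

def patch_coverage(changed_lines, executable_lines, covered_lines):
    """Return (coverage_ratio, uncovered) for the lines modified in a PR.

    changed_lines:    {file: set(line_numbers)} touched by the PR
    executable_lines: {file: set(line_numbers)} that are executable
    covered_lines:    {file: set(line_numbers)} hit by the existing test suite
    """
    E, C = set(), set()
    for path, lines in changed_lines.items():
        exe = lines & executable_lines.get(path, set())
        E |= {(path, n) for n in exe}
        C |= {(path, n) for n in exe & covered_lines.get(path, set())}
    U = E - C
    ratio = len(C) / len(E) if E else 1.0  # no executable changes => fully covered
    return ratio, U


# Example: if the PR only touches fully covered lines, augmentation stops;
# otherwise the uncovered set U becomes the target for test generation.
ratio, uncovered = patch_coverage(
    {"scipy/stats/_foo.py": {10, 11, 12}},
    {"scipy/stats/_foo.py": {10, 11, 12}},
    {"scipy/stats/_foo.py": {10, 11}},
)
print(f"patch coverage = {ratio:.0%}, uncovered = {sorted(uncovered)}")
```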

The team engineered a two-stage process, PR-based test generation followed by test integration, with the goal of creating relevant tests that reuse existing project utilities such as fixtures, markers, and data generators. To address the crucial challenge of providing suitable test context for the LLM, the researchers developed two techniques to extract relevant content from existing test functions and data generators, enabling context-aware test generation; experiments showed that this test context led to a 2x increase in coverage, underscoring its importance. The approach also leverages the PR title, description, and discussion comments to align generated tests with the PR’s intent, ensuring the tests are relevant to the changes introduced.
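As an illustration of what such context extraction could look like, the following Python sketch collects pytest fixtures and related test functions from an existing test file so they can be placed in the LLM prompt alongside the PR title, description, and diff; the heuristics and the `extract_test_context` helper are assumptions, not ChaCo’s published implementation.

```python
# Illustrative sketch (not ChaCo's code): collect candidate test context from
# an existing test file by extracting pytest fixtures and test functions whose
# source mentions a symbol changed in the PR.

import ast

def extract_test_context(test_source: str, changed_symbols: set) -> list:
    tree = ast.parse(test_source)
    snippets = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        src = ast.get_source_segment(test_source, node) or ""
        is_fixture = any("fixture" in ast.dump(dec) for dec in node.decorator_list)
        mentions_change = any(sym in src for sym in changed_symbols)
        # Keep fixtures and data generators unconditionally, plus test functions
        # that already exercise code near the change.
        if is_fixture or (node.name.startswith("test_") and mentions_change):
            snippets.append(src)
    return snippets
```

The extracted snippets would then serve as in-prompt examples of the project’s testing conventions, which is the role the paper attributes to test context.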

Scientists harnessed an LLM to generate tests, but recognised that simply increasing patch coverage is not enough for acceptance. ChaCo therefore carefully integrates augmented tests into the existing test suite by matching test structure and style, increasing the likelihood of maintainer acceptance. The system identifies the appropriate test file, class, and method for test placement, ensuring the new test sits logically within the codebase, and it prioritises reusing existing test utilities, further enhancing consistency and maintainability. The work also includes a rigorous evaluation on real-world PRs, assessing test acceptance and gathering feedback from developers.
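ChaCo’s placement heuristics are not reproduced in this summary, but the idea of selecting a target test file and class can be sketched as follows; the naming convention and the `find_test_target` helper are assumptions for illustration, not the tool’s actual logic.

```python
# Illustrative placement heuristic, assuming a common pytest layout convention:
# map a changed source file to its conventional test file and, if one exists,
# to an existing test class so a generated test can be appended next to
# related tests.

from __future__ import annotations

import ast
from pathlib import Path

def find_test_target(changed_file: str, repo_root: str) -> tuple[Path | None, str | None]:
    """Return (test_file, test_class) for a changed module, if they exist."""
    module = Path(changed_file)
    # e.g. scipy/stats/_foo.py -> scipy/stats/tests/test_foo.py
    candidate = module.parent / "tests" / f"test_{module.stem.lstrip('_')}.py"
    test_file = Path(repo_root) / candidate
    if not test_file.exists():
        return None, None  # fall back to creating a new test file
    tree = ast.parse(test_file.read_text())
    classes = [node.name for node in tree.body
               if isinstance(node, ast.ClassDef) and node.name.startswith("Test")]
    return test_file, (classes[0] if classes else None)
```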

Human reviewers rated the generated tests highly, awarding scores of 4.53/5.0 for being worth adding, 4.2/5.0 for integration quality, and 4.7/5.0 for relevance to the PR. Unlike existing methods that aim for overall coverage improvement, ChaCo delivers targeted augmentation focused specifically on the uncovered lines within a PR. Experiments revealed an average cost of only $0.11 per PR, highlighting its practicality for integration into existing workflows.

The team measured patch coverage, the fraction of modified executable lines covered by tests, and found that ChaCo substantially increased this metric in real-world scenarios. ChaCo’s approach leverages PR-specific context, offering developers augmented tests precisely when they are reviewing code changes. The researchers identified suitable test context as crucial for LLM-based test generation and implemented two techniques to extract relevant content, including existing test functions, fixtures, and data generators; experiments show this context-aware approach yields a 2x increase in coverage compared to variants without it.

Scientists recorded overwhelmingly positive feedback from human reviewers, who rated the generated tests as worth adding with a score of 4.53 out of 5.0. Further assessments confirmed the tests were well integrated into existing suites, achieving a score of 4.2 out of 5.0, and highly relevant to the PR changes, scoring 4.7 out of 5.0. A contribution study saw 12 of ChaCo’s generated tests submitted to open-source projects, with 8 already merged and 4 currently under review, demonstrating real-world utility. Notably, ChaCo’s added tests exposed two previously unknown bugs in SciPy, both of which were confirmed and subsequently fixed, underlining the system’s ability to enhance software reliability. Measurements confirm that, compared to approaches lacking test context or runtime feedback, ChaCo achieves a 2x and 5.6x higher total coverage increment, respectively. ChaCo distinguishes itself by considering PR-specific patch coverage, providing developers with augmented tests precisely when they are reviewing code changes. Researchers addressed the challenge of providing sufficient test context for effective LLM-based test generation by extracting relevant information from existing test functions, fixtures, and data generators.

Furthermore, ChaCo prioritises developer acceptance by integrating new tests seamlessly into existing suites, matching their structure and style, and providing a summary of the additions for review. Evaluation across 145 PRs from SciPy, Qiskit, and Pandas showed that ChaCo achieved full patch coverage in 30% of cases, exposed previously unknown bugs, and received positive feedback from developers on test relevance and integration. The findings establish a practical and cost-effective method for automating fine-grained test augmentation within continuous integration workflows, at a per-PR cost of just $0.11. The authors acknowledge that manual edits are sometimes still needed, for example to align assertion styles or relocate tests, though ChaCo’s use of test context reduces how often this is necessary. They also note that the approach depends on existing test suites providing sufficient context for the LLM to build upon. Future directions include deeper integration into CI workflows and broader application across diverse software projects.
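As a rough illustration of how a CI step might collect the coverage data this kind of tool needs, the sketch below runs the existing suite under coverage.py and reads its JSON report; the wiring is an assumption for illustration, not ChaCo’s actual pipeline.

```python
# Sketch of a CI step (assumed wiring): run the existing suite under
# coverage.py and read the JSON report to obtain the covered-lines map
# consumed by a patch-coverage check like the one sketched earlier. Only if
# some changed lines remain uncovered would LLM-based augmentation run.

import json
import subprocess

def collect_covered_lines(report_path: str = "coverage.json") -> dict:
    # Run the test suite under coverage and export a JSON report.
    subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=True)
    subprocess.run(["coverage", "json", "-o", report_path], check=True)
    with open(report_path) as fh:
        report = json.load(fh)
    # coverage.py's JSON report lists executed line numbers per file.
    return {path: set(data["executed_lines"])
            for path, data in report["files"].items()}
```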

👉 More information
🗞 Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation
🧠 ArXiv: https://arxiv.org/abs/2601.10942

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
