LLM Test Generation Achieves 20.92% Coverage with Newer Large Language Models

The increasing prevalence of Large Language Models (LLMs) in software development has spurred considerable research into automated unit test generation, yet current techniques often struggle to produce consistently reliable and comprehensive tests. Michael Konstantinou, Renzo Degiovanni, and Mike Papadakis, from the University of Luxembourg and the Luxembourg Institute of Science and Technology, investigated whether recently developed test generation tools still offer a significant advantage given the rapid improvements in LLM capabilities. Their work addresses a critical question: do post-processing techniques designed to enhance test quality remain valuable when applied to more powerful LLMs, or are their benefits diminished by the improved baseline performance? The researchers replicated four state-of-the-art tools (HITS, SymPrompt, TestSpark, and CoverUp) and compared their effectiveness against a straightforward approach utilising current LLM versions across a substantial dataset of code. Results demonstrate that a plain LLM approach now surpasses previous state-of-the-art methods in key test effectiveness metrics, suggesting that the gains achieved by complex techniques may be offset by the inherent capabilities of newer models.

Utilising LLMs in isolation frequently yields tests that either fail to compile or do not achieve adequate code coverage, prompting the development of numerous techniques and tools to mitigate these shortcomings. The authors explore a hybrid approach combining LLMs with program analysis techniques to improve test generation, particularly for hard-to-cover branches, and examine different levels of granularity in test generation, including method-level and class-level tests. Evaluation was conducted on the GitBug-Java benchmark, and comparisons were made against state-of-the-art LLM-based test generation tools such as ChatUnitTest, TestArt, Aster, and LLMTest, alongside traditional techniques. The authors also explore using mutation testing to guide the LLM test generation process, acknowledging the issue of LLM “hallucinations” and employing consensus-building with multiple LLM agents to mitigate it.
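To make the plain-prompting baseline concrete, here is a minimal sketch of a method-level test-generation request. The prompt wording and the `query_llm` callable are illustrative assumptions, not the authors' actual template or client.

```python
def build_method_prompt(class_source: str, method_name: str) -> str:
    # Illustrative prompt: ask for a JUnit 5 test targeting one method of the class.
    return (
        "You are a Java testing assistant. "
        f"Write a JUnit 5 test class that exercises the method `{method_name}` "
        "of the Java class below. Return only compilable Java code.\n\n"
        + class_source
    )


def generate_method_tests(class_source, method_names, query_llm):
    """One LLM request per method; returns {method name: raw generated test source}."""
    return {m: query_llm(build_method_prompt(class_source, m)) for m in method_names}
```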

The research touches upon the critical challenge of test oracle generation, investigating LLMs’ ability to generate accurate expected outputs. Findings demonstrate that hybrid approaches consistently outperform pure LLM approaches in terms of branch coverage and mutation score, highlighting mutation score as a reliable metric for evaluating test effectiveness. The paper provides a comprehensive overview of existing research in LLM-based test generation, covering tools like ChatUnitTest, TestArt, Aster, PIT, TestSpark, PRIMG, and Casmodatest. Experiments conducted on 393 Java classes and 3,657 methods revealed that the LLM approach delivered 49.95% line coverage, 35.33% branch coverage, and a 33.82% mutation score, demonstrably outperforming the compared tools. The research team measured test effectiveness using standard metrics, alongside the practical cost of LLM queries, and found that the LLM approach achieved improved coverage scores with a comparable number of queries to the more complex tools.
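For readers unfamiliar with the metrics, the toy snippet below illustrates how line coverage, branch coverage, and mutation score reduce to covered-over-total (or killed-over-total) ratios aggregated across classes. The per-class numbers are invented; in the study these would come from coverage and mutation tooling, not from hand-entered values.

```python
def pct(covered: int, total: int) -> float:
    """Covered (or killed) items as a percentage of the total."""
    return 100.0 * covered / total if total else 0.0


# (covered lines, total lines, covered branches, total branches,
#  killed mutants, generated mutants) -- invented values for two classes
per_class = [
    (120, 200, 30, 80, 45, 130),
    (60, 180, 12, 50, 20, 90),
]

line_cov = pct(sum(c[0] for c in per_class), sum(c[1] for c in per_class))
branch_cov = pct(sum(c[2] for c in per_class), sum(c[3] for c in per_class))
mutation = pct(sum(c[4] for c in per_class), sum(c[5] for c in per_class))

print(f"line coverage   {line_cov:.2f}%")    # 47.37% on this toy data
print(f"branch coverage {branch_cov:.2f}%")  # 32.31% on this toy data
print(f"mutation score  {mutation:.2f}%")    # 29.55% on this toy data
```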

Further investigation revealed that applying the LLM at the class level significantly reduces costs, requiring only 1,562 LLM requests, although initial coverage was lower. Combining test suites generated at both class and method levels yielded a total of 53.67% line coverage, 38.74% branch coverage, and a 36.55% mutation score. The team refined this approach with a hybrid strategy of first generating class-level tests, then focusing on uncovered methods, achieving 52.30% line coverage, 38.84% branch coverage, and a 36.76% mutation score while reducing LLM requests by approximately 20%. Measurements confirm that nearly 20% of generated tests failed to compile and almost 43% failed execution, highlighting a remaining challenge for future development. Together, these results deliver a compelling demonstration of the power of current LLMs to generate effective test suites, potentially simplifying software testing processes.
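A rough sketch of that hybrid strategy is shown below. The `generate_tests`, `compiles`, and `uncovered_methods` callables stand in for the LLM request, a compilation check, and a coverage report; none of this is the authors' implementation.

```python
def hybrid_generate(class_source, generate_tests, compiles, uncovered_methods):
    """Class-level first, then method-level prompts only where coverage gaps remain."""
    suite = []

    # Step 1: a single class-level request (cheap, but may leave gaps).
    class_level = generate_tests(class_source, target=None)
    if compiles(class_level):
        suite.append(class_level)

    # Step 2: targeted method-level requests only for methods the class-level
    # suite left uncovered, trading a few extra queries for extra coverage.
    for method in uncovered_methods(suite):
        test = generate_tests(class_source, target=method)
        if compiles(test):  # drop tests that fail to compile
            suite.append(test)

    return suite
```

The ordering is the point of the strategy: the single class-level request captures the easy coverage cheaply, and per-method requests are spent only where gaps remain, which is consistent with the roughly 20% reduction in LLM requests reported above.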

Plain LLM Surpasses Existing Test Generation Tools

This research investigated the efficacy of current Large Language Model (LLM)-based automated unit test generation techniques, utilising more recent LLM versions than those employed in initial evaluations. Findings demonstrate that this simple prompting approach can surpass the performance of existing tools in key test effectiveness metrics, including line coverage, branch coverage, and mutation score, while maintaining comparable efficiency in terms of LLM queries. Further investigation revealed that the granularity of prompting significantly impacts test generation, with method-level prompting leading to more extensive test suites.
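As a counterpart to the method-level prompt sketched earlier, a class-level prompt asks for one test class covering the whole class under test, so it costs one request per class rather than one per method. The wording below is again an illustrative assumption.

```python
def build_class_prompt(class_source: str) -> str:
    # Illustrative class-level prompt: one request covering every public method.
    return (
        "You are a Java testing assistant. "
        "Write one JUnit 5 test class that exercises every public method of the "
        "Java class below. Return only compilable Java code.\n\n"
        + class_source
    )
```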

Combining class-level and method-level prompting proved complementary, resulting in a hybrid approach that enhances cost-effectiveness. The authors acknowledge a potential threat to validity stemming from data leakage and mitigate it through careful dataset selection and repository age analysis. Future work could explore strategies to address syntactic errors in generated tests, as a substantial proportion failed to compile.
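One plausible direction for the compilation problem is to compile each generated test and feed the compiler errors back to the model for a single repair attempt. The sketch below assumes javac and a classpath containing the project and JUnit are available; the retry prompt and the `query_llm` callable are, as before, assumptions rather than the authors' tooling.

```python
import pathlib
import re
import subprocess
import tempfile


def compiles(java_source: str, classpath: str = ".") -> tuple[bool, str]:
    """Write the test to a temp dir (file named after its public class) and run javac."""
    match = re.search(r"\bclass\s+(\w+)", java_source)
    name = match.group(1) if match else "GeneratedTest"
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / f"{name}.java"
        path.write_text(java_source)
        result = subprocess.run(
            ["javac", "-cp", classpath, "-d", tmp, str(path)],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stderr


def repair_once(java_source: str, query_llm, classpath: str = ".") -> str | None:
    """Return a compilable version of the test, or None if one repair attempt fails."""
    ok, errors = compiles(java_source, classpath)
    if ok:
        return java_source
    fixed = query_llm(
        "The following JUnit test does not compile. Fix it and return only Java code.\n\n"
        f"Compiler errors:\n{errors}\n\nTest:\n{java_source}"
    )
    ok, _ = compiles(fixed, classpath)
    return fixed if ok else None
```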

👉 More information
🗞 How well LLM-based test generation techniques perform with newer LLM versions?
🧠 ArXiv: https://arxiv.org/abs/2601.09695

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
