Framework Improves Code Testing with Scenario Planning

Researchers are tackling the persistent difficulty of automatically generating unit tests for C programs, a challenge complicated by the disconnect between intended program behaviour and the intricacies of pointer arithmetic and memory management. Jaid Monwar Chowdhury from Bangladesh University of Engineering and Technology, Chi-An Fu from National Taiwan University, and Reyhaneh Jabbarvand from the University of Illinois at Urbana-Champaign present SPARC, a neuro-symbolic framework designed to address this issue. Their work overcomes the limitations of direct code synthesis by large language models, which often produce invalid or ineffective tests, through a four-stage process incorporating control flow analysis, a grounded operation map, targeted test synthesis, and iterative validation. Evaluation on a diverse set of 59 subjects demonstrates that SPARC significantly improves test coverage and mutation scores, matching or surpassing symbolic execution tools such as KLEE, whilst generating more readable and maintainable tests, and offering a scalable solution for testing established C codebases.

Within the silicon heart of countless devices, software errors lurk undetected. Current methods for finding these faults struggle with complex, established codebases. This new framework offers a way to automatically build tests for such systems, improving reliability and reducing the risk of hidden flaws. Scientists continue to confront substantial difficulties in automating unit test generation for C programs, stemming from the considerable disparity between intended program behaviour and the strict rules governing pointer manipulation and memory allocation.

While Large Language Models (LLMs) demonstrate considerable promise in generating code, a common failing, dubbed the ‘leap-to-code’ problem, sees these models prematurely produce code lacking grounding in the program’s underlying structure, constraints, and meaning. This often results in tests that fail to compile, function signatures that are defined incorrectly, limited coverage of program branches, and assertions that fail to identify errors accurately.

Researchers have developed SPARC, a neuro-symbolic, scenario-based framework designed to overcome this challenge through a four-stage process. Initially, SPARC performs Control Flow Graph (CFG) analysis to map the program’s execution paths. This is followed by the creation of an Operation Map, which anchors LLM reasoning within validated utility functions.

Path-targeted test synthesis then generates tests focused on specific execution paths, and finally, an iterative validation loop, utilising compiler and runtime feedback, refines the tests. Evaluation across 59 real-world and algorithmic subjects reveals SPARC’s effectiveness. Compared to simple prompt generation, SPARC achieves improvements in key metrics, exceeding baseline performance by 31.36% for line coverage, 26.01% for branch coverage, and 20.78% for mutation score.

Notably, SPARC’s performance matches or surpasses that of KLEE, a well-established symbolic execution tool, particularly on complex projects. Beyond simply generating more tests, SPARC exhibits a high rate of test retention, 94.3%, through its self-correction mechanisms. Yet, the benefits extend beyond purely quantitative measures. Developer assessments indicate that SPARC-generated code is considerably more readable and maintainable, suggesting a practical advantage for software engineers.

By aligning the reasoning of LLMs with the inherent structure of a program, SPARC offers a scalable solution for testing established C codebases, addressing a long-standing need for automated, reliable testing tools. Since manual test creation is time-consuming and prone to human error, a system like SPARC could markedly reduce development costs and improve software quality.

At the heart of SPARC lies a decomposition of the test generation problem into two distinct phases. First, static analysis is employed to derive high-level testing scenarios, and second, these scenarios serve as a blueprint for context-aware test synthesis. Unlike approaches that treat test generation as a single completion task, SPARC’s scenario-based method ensures the creation of tests with semantic meaning, featuring precise inputs and assertions that enhance both coverage and mutation scores.

For instance, the framework addresses issues like undefined path coverage, where tests focus only on typical scenarios, and ungrounded dependencies, where tests reference missing utilities. Also, SPARC tackles the problem of traceability, producing tests that are not merely ‘black boxes’ but provide clear diagnostic information. Instead of generic test names and superficial assertions, SPARC generates tests that explain the logic path taken, making them valuable as program comprehension tools and executable documentation. Beyond its immediate performance gains, analysis reveals that SPARC’s architecture allows for the use of cost-effective LLMs without sacrificing test quality, suggesting a pathway towards wider adoption in industrial settings.

SPARC markedly improves code coverage and fault detection across diverse C projects

Across 59 real-world and algorithmic C projects, SPARC achieved 31.36% greater line coverage than a baseline relying on simple prompt generation. Branch coverage also improved by 26.01% using SPARC, indicating a more thorough exploration of conditional logic within the tested code. Mutation score, a measure of a test suite’s ability to detect injected faults, rose by 20.78% with SPARC, demonstrating enhanced fault-finding capabilities.

These gains suggest SPARC generates tests that not only execute more lines of code but also more effectively identify potential errors. Performance comparisons extend beyond the baseline; on complex subjects, SPARC matched or exceeded the performance of KLEE, a symbolic execution tool. Specifically, SPARC demonstrated comparable coverage and fault detection rates, despite KLEE’s established position in the field of automated testing.

This parity is notable given KLEE’s reliance on compiling code before analysis, a prerequisite SPARC bypasses through its neuro-symbolic approach. The framework’s ability to refine generated tests is also apparent in its retention rate. Iterative repair processes allowed SPARC to successfully maintain 94.3% of initially generated tests, addressing compilation errors or logical flaws.

In a developer study, ten participants rated SPARC-generated code as more readable and maintainable. These subjective assessments, alongside the objective metrics, highlight the practical benefits of SPARC’s structured test generation. Analysis of model scalability revealed that cost-efficient language models achieved performance matching that of larger, more computationally expensive models.

This finding suggests the architecture of the SPARC pipeline, rather than sheer model size, is the primary driver of test quality. By aligning LLM reasoning with program structure, the work provides a scalable path for testing legacy C codebases, achieving 100% code coverage in some instances.

Control flow and operational mapping for path-specific test generation

A Control Flow Graph (CFG) analysis initiates the SPARC framework, dissecting the C code to map all possible execution paths. This static analysis technique identifies the branching logic and potential routes through the function, forming the basis for targeted test generation. By representing the code as a graph, SPARC gains a structural understanding, moving beyond simple textual analysis.

Then, an Operation Map is constructed, grounding LLM reasoning in validated utility functions. This map catalogues available functions and their intended behaviours, preventing the LLM from ‘hallucinating’ non-existent code or dependencies. Once the CFG and Operation Map are established, path-targeted test synthesis begins. Rather than attempting to generate complete tests directly, SPARC focuses on creating individual tests for each path identified in the CFG.

This decomposition allows for precise control over input values and assertion logic, ensuring each test exercises a specific code segment. The LLM receives a detailed scenario describing the path, including input constraints and expected outcomes, derived from the CFG and Operation Map. Yet, the process does not end with initial test generation. SPARC incorporates an iterative, self-correction validation loop, utilising compiler and runtime feedback to refine the tests.

Inside this framework, the LLM is guided to produce tests that are not merely compilable, but also semantically meaningful and traceable. For decades, vulnerabilities in C code have been a major source of security breaches, and thorough testing is a key defence.

Mapping program logic enhances automated C code testing with large language models

Scientists have long struggled to automate the testing of C code, a language still underpinning much of our digital infrastructure. Unlike more modern languages, C’s reliance on manual memory management and pointer arithmetic creates a significant hurdle for automated tools. Existing approaches often fail to produce tests that actually work, generating code riddled with errors or simply irrelevant to the program’s function.

This new work presents a system, SPARC, that demonstrably improves upon previous attempts at using large language models for this task. Instead of directly asking an LLM to write tests, SPARC first analyses the code’s structure and creates a ‘map’ of valid operations, grounding the LLM’s reasoning in the program’s logic. Achieving higher test coverage is not merely a technical win, but a step towards securing and maintaining critical systems.

The sheer volume of legacy C codebases means manual testing is impractical, and automated tools have historically fallen short. SPARC’s ability to generate more effective tests, and then iteratively repair those that fail, offers a path towards scalable industrial application. The reliance on LLMs introduces its own set of uncertainties, as these models can still produce unexpected or illogical outputs.

Once considered a distant prospect, the combination of neuro-symbolic techniques, blending the power of LLMs with traditional program analysis, appears to be bearing fruit. By validating LLM suggestions against the program’s structure, SPARC avoids many of the pitfalls that plague simpler approaches. The system’s performance on extremely complex or unusual code remains an open question.

At present, the evaluation focuses on line and branch coverage, alongside mutation scores, but real-world impact will depend on its ability to uncover subtle, security-critical bugs. Researchers will likely explore different ways to represent program structure and refine the ‘operation maps’ used to guide LLM reasoning. Also, the integration of more sophisticated feedback mechanisms, perhaps incorporating runtime analysis tools, could further improve test quality and repair rates. Rather than replacing human testers entirely, these tools promise to augment their capabilities, allowing them to focus on the most challenging aspects of software verification.

👉 More information
🗞 SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
🧠 ArXiv: https://arxiv.org/abs/2602.16671

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
