AI Builds Analysed: 364 Maintainability Issues Found in Generated Build Code

Researchers are increasingly investigating the quality of code produced by artificial intelligence, but a crucial aspect has remained largely unexamined: the impact on build systems. Anwar Ghammam from University of Michigan-Dearborn and Mohamed Almukhtar from University of Michigan-Flint, together with their colleagues, present the first large-scale empirical study of AI-generated build code quality, utilising the novel AIDev dataset of agent-authored pull requests and identifying 364 maintainability and security-related build smells. Their findings reveal that AI can both introduce quality issues, such as hardcoded paths, and, surprisingly, eliminate existing problems through refactoring, with over 61% of AI-authored pull requests being readily accepted by developers. This dual impact highlights a critical need for new tools and techniques to assess and govern AI’s contribution to build systems code, ensuring robust and reliable software development in the future.

AI Build Code Quality from GitHub Pull Requests

Scientists have demonstrated a significant advancement in understanding the impact of AI coding agents on software development, specifically concerning build systems, a crucial yet previously understudied component of the software lifecycle. Researchers leveraged AIDev, the first large-scale, openly available dataset capturing agent-authored pull requests from real-world GitHub repositories, to conduct a comprehensive empirical study of AI-generated build code quality. This work addresses a critical gap in knowledge, as prior research has largely focused on AI-generated source code while neglecting the quality and maintainability of the build code these agents produce. The study meticulously investigated three key research questions: whether AI coding agents generate build code containing quality issues, commonly known as code smells; to what extent these agents can actually eliminate existing code smells from build code; and the rate at which AI-authored pull requests are accepted by developers.
Through analysis of 387 pull requests and 945 build files, the team identified 364 maintainability and security-related build smells, ranging in severity, indicating that AI-generated build code can indeed introduce quality issues such as a lack of error handling and the inclusion of hardcoded paths or URLs. However, the research also revealed a surprising capability: AI agents can, in certain instances, remove existing smells through refactoring techniques like Pull Up Module and Externalize Properties. Notably, the findings demonstrate a high level of developer acceptance, with over 61% of AI-generated pull requests being approved and merged with minimal human intervention. This suggests a growing trust in, and adoption of, AI-assisted build code generation within the software development community.
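The article does not reproduce the offending snippets, but the smells it names are easy to picture in a Gradle script. The fragment below is an illustrative sketch only (the repository URL, task name, and download endpoint are invented, not taken from the studied pull requests); it shows a hardcoded URL alongside a download step whose failure would go unnoticed.

```groovy
// Hypothetical build.gradle fragment (Groovy DSL) illustrating two of the smell
// categories named above; it is not taken from the analysed pull requests.

repositories {
    maven {
        // Hardcoded URL smell: the repository location is baked into the script
        // instead of being supplied through a property or environment variable.
        url "https://repo.example.internal/releases"
    }
}

task fetchSchema {
    doLast {
        // Lack-of-error-handling smell: the exit code of the external download
        // is discarded, so a failed fetch silently leaves a stale or empty file.
        "curl -s -o schema.json https://api.example.internal/schema".execute().waitFor()
    }
}
```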

The research team identified specific refactoring actions performed by AI agents that contribute to the removal of code smells in build scripts, providing valuable insights into the agent’s ability to improve code quality. This dual impact, the introduction of some code smells alongside the removal of others, underscores the need for future research focused on AI-aware build code quality assessment. Such assessment would systematically evaluate, guide, and govern AI-generated build systems code, ensuring its reliability and maintainability. The researchers have made their dataset openly available, along with all manual annotations and code, to facilitate further investigation and replication of their findings, paving the way for more robust and trustworthy AI-driven software development practices.

AI Agent Build System Pull Request Analysis

Scientists initiated a comprehensive data mining study leveraging the AIDev dataset, a repository of approximately 933,000 agentic pull requests (PRs) originating from real-world GitHub repositories. The research focused on five AI agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) to investigate the quality of build code generated by these systems. To isolate build system modifications, the team meticulously filtered the AIDev dataset, initially retaining 632 PRs containing changes to Gradle, Maven, CMake, and Make build files. A subsequent manual inspection refined this selection, removing noise from hidden files and misclassified PRs, ultimately yielding a final dataset of 387 authentic build-system changes authored by AI agents.
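The filtering itself is described only at a high level; the sketch below shows one way such a filter could look, assuming each PR record exposes its changed file paths as plain strings (the field names, patterns, and sample data are assumptions for illustration, not the authors' code).

```groovy
// Illustrative sketch: narrowing agentic PRs down to build-system changes by
// matching changed file paths against common build file names.

// Tiny sample standing in for AIDev records (the real dataset holds ~933,000 PRs);
// the 'changedFiles' field name is an assumption.
def agenticPrs = [
    [id: 1, agent: 'GitHub Copilot', changedFiles: ['README.md', 'app/build.gradle']],
    [id: 2, agent: 'Devin',          changedFiles: ['src/main.py']],
    [id: 3, agent: 'OpenAI Codex',   changedFiles: ['CMakeLists.txt', 'src/util.cpp']],
]

def buildFilePatterns = [
    ~/(^|.*\/)build\.gradle(\.kts)?$/,   // Gradle
    ~/(^|.*\/)pom\.xml$/,                // Maven
    ~/(^|.*\/)CMakeLists\.txt$/,         // CMake
    ~/(^|.*\/)[Mm]akefile$/              // Make
]

def touchesBuildSystem = { List<String> paths ->
    paths.any { path -> buildFilePatterns.any { path ==~ it } }
}

def buildPrs = agenticPrs.findAll { pr -> touchesBuildSystem(pr.changedFiles) }
println "Candidate build-system PRs: ${buildPrs*.id}"  // -> [1, 3]
```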

The study pioneered the use of Sniffer, a static analyzer, to assess build code quality and identify 364 maintainability and security-related build smells across varying severity levels. Researchers extracted all changed or newly created build files from the studied PRs, amassing a total of 945 files, and then retrieved content snapshots before and after AI modifications. Each paired snapshot underwent Sniffer analysis to pinpoint the presence and severity of code smells, enabling a precise comparison to quantify the impact of AI-generated changes on build quality. This innovative approach allowed the team to determine whether changes introduced new smells, eliminated existing ones, or had no observable effect.
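The before/after comparison reduces to a set difference over each file's reported smells. The snippet below sketches that classification step under the assumption that the analyzer's report can be read as a list of smell names per file; Sniffer's actual output format is not described in the article.

```groovy
// Illustrative classification of an agentic change, given the smells reported
// for a build file before and after the edit (report format is an assumption).
def classifyChange = { Collection<String> before, Collection<String> after ->
    def introduced = (after as Set) - (before as Set)
    def removed    = (before as Set) - (after as Set)
    if (introduced.isEmpty() && removed.isEmpty()) return 'no observable effect'
    if (removed.isEmpty())                         return 'introduces smells'
    if (introduced.isEmpty())                      return 'eliminates smells'
    return 'mixed'
}

assert classifyChange([], [])                                == 'no observable effect'
assert classifyChange(['Hardcoded URL'], [])                 == 'eliminates smells'
assert classifyChange(['Hardcoded URL'], ['Wildcard Usage']) == 'mixed'
```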

To delve deeper into instances where code smells were eliminated, two authors conducted a qualitative analysis, manually labeling the AI-modified build files. They examined commit histories and developer discussions to understand how smells were mitigated and whether quality-enhancing changes were incorporated into the PRs. Inter-rater agreement was measured using Fleiss’ Kappa, achieving a score of 0.76, indicating substantial consistency between the annotators. A consensus meeting resolved remaining disagreements, ensuring robust categorization of quality improvements. Furthermore, the team investigated the acceptance rates of all PRs, broken down into those introducing smells, those eliminating them, and those with no impact, examining commit histories and developer feedback to understand how AI-generated build code is evaluated and integrated. This rigorous methodology revealed that 824 out of 945 build files exhibited no smells either before or after agentic changes, suggesting that many AI modifications were simple updates with minimal impact on code quality. However, the study also identified 66 files where new smells were introduced and 31 files where existing smells were successfully removed, demonstrating a dual impact of AI-generated build code on software maintainability.

AI Code Impact on Build System Quality

Scientists achieved a significant breakthrough in understanding the impact of AI-generated code on software build systems, leveraging a novel dataset called AIDev. This work presents the first large-scale investigation into agent-authored pull requests (PRs), 387 in total, from real-world GitHub repositories, revealing both the potential benefits and challenges of integrating AI into the software development lifecycle. Researchers meticulously analysed these PRs to determine whether AI coding tools introduce quality issues into build code, to what extent they can eliminate existing code smells, and the rate at which these AI-driven changes are accepted by developers. Experiments revealed that, of the 945 build files modified by AI agents, 824 contained no code smells either before or after the changes.

A detailed manual inspection of 100 of these files indicated that the AI primarily implemented simple updates, such as version upgrades or minor structural edits, minimising the risk of introducing quality issues. However, in 66 build files the AI-generated code did introduce new smells, while in another 31 files it removed previously existing issues. A further 24 files remained neutral, with existing smells persisting after the AI modifications. Data shows a total of 364 maintainability and security-related build smells were identified across the 66 affected files, varying in severity.

Maintainability issues were dominant, with Wildcard Usage accounting for 97 occurrences of medium severity, potentially leading to unpredictable build outcomes. Lack of Error Handling was also prevalent, appearing 63 times with medium severity, highlighting challenges in ensuring build reliability. Security concerns were also present, with Hardcoded Paths/URLs observed 25 times, ranging from minor to severe depending on the context. The analysis shows that Copilot introduced the most smells overall, 226 across 266 build files, giving it the highest smell-introduction rate. In contrast, OpenAI Codex, despite modifying the largest number of files (326), introduced far fewer smells (53), suggesting distinct behavioural patterns among different AI assistants.
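The article gives only the smell names and counts; in a Gradle build, Wildcard Usage typically takes a form like the hypothetical fragment below (the coordinates are invented), which is why it is associated with unpredictable build outcomes.

```groovy
// Hypothetical build.gradle fragment illustrating the Wildcard Usage smell.
dependencies {
    // Dynamic version wildcard: the resolved version can drift between builds,
    // so two runs of the same build may compile against different libraries.
    implementation 'com.example:client:2.+'

    // File-tree wildcard: whichever jars happen to sit in libs/ get pulled in,
    // making the dependency set depend on the state of the working directory.
    implementation fileTree(dir: 'libs', include: ['*.jar'])
}
```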

Notably, the research team discovered that 54 code smells were successfully removed across 31 improved build files. Approximately 65% of these improvements stemmed from intentional, quality-oriented changes, including refactoring techniques like Reformat Code, Remove Unused Code, and Externalize Properties. For example, Copilot successfully applied the Externalize Properties refactoring to a build.gradle file, replacing hard-coded URLs with dynamic values and moving credentials to project-level properties, enhancing security and maintainability. Remarkably, over 61% of the AI-authored PRs were approved and merged with minimal human intervention, demonstrating a high level of acceptance for AI-generated build code.
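The PR itself is not reproduced in the article, so the before/after sketch below only shows the general shape of an Externalize Properties refactoring on a Gradle publishing block; the URL, credentials, and property names are placeholders rather than the actual change.

```groovy
// Before: repository URL and credentials hardcoded in build.gradle
// (assumes the maven-publish plugin is applied).
publishing {
    repositories {
        maven {
            url "https://repo.example.internal/releases"
            credentials {
                username "ci-bot"
                password "s3cr3t"
            }
        }
    }
}

// After: the same values read from project-level properties, e.g. supplied via
// gradle.properties or -P flags, keeping environment-specific data out of the script.
//
//   # gradle.properties (ideally kept out of version control for the credentials)
//   repoUrl=https://repo.example.internal/releases
//   repoUser=ci-bot
//   repoPassword=s3cr3t
publishing {
    repositories {
        maven {
            url findProperty('repoUrl')
            credentials {
                username findProperty('repoUser')
                password findProperty('repoPassword')
            }
        }
    }
}
```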

👉 More information
🗞 AI builds, We Analyze: An Empirical Study of AI-Generated Build Code Quality
🧠 ArXiv: https://arxiv.org/abs/2601.16839

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
