Researchers are increasingly observing artificial intelligence coding agents submitting pull requests to open-source projects, transitioning from simple assistance to autonomous contribution. Ramtin Ehsani, Sakshi Pathak, and Shriya Rawal from Drexel University, together with Abdullah Al Mujahid and Mia Mohammad Imran from Missouri University of Science and Technology, and Preetha Chatterjee, present a large-scale empirical study of over 33,000 AI-authored pull requests on GitHub to understand why so many fail to be merged. The work is significant because it identifies critical socio-technical and human-collaboration factors shaping AI’s success in software development, moving beyond simple measures of code changes and CI results to reveal nuanced reasons for rejection, including a lack of reviewer engagement and misalignment with project goals, and paving the way for more effective AI-human workflows.
AI Pull Requests, Acceptance and Rejection Rates
These agents are no longer simply assistants but are actively contributing code changes, responding to feedback, and participating in the software lifecycle as independent entities. As the volume of agentic contributions rapidly increases, a critical gap in understanding remains: how do these contributions behave in practice, and why are so many ultimately rejected? The study establishes that agentic PR failures are often rooted in socio-technical and human-AI collaboration challenges. These findings are crucial for improving the success of future agentic workflows and unlocking the full potential of AI in collaborative software development.
This work opens avenues for designing more effective agents that integrate seamlessly into human-led development processes, ultimately accelerating innovation and improving software quality. Examining merge outcomes across 11 task categories, from new features to bug fixes, the team assessed whether certain types of agent-generated contributions were more likely to be accepted. Analysis of code changes involved measuring both the total number of added and removed lines of code (#LOC Changes) and the number of files modified, providing insight into the scope and complexity of agentic contributions. The study also tracked CI build results to determine the extent to which agent-authored PRs adhered to project quality standards and automated testing requirements. The investigation of review dynamics focused on how human reviewers interacted with agent-authored PRs, including the number of revisions requested, the time taken to respond, and the overall level of engagement. Contributions targeting performance improvements or bug fixes exhibited the lowest acceptance rates, suggesting that these tasks demand a deeper understanding of complex codebases and more nuanced problem-solving skills.
Pull Request Analysis of Coding Agent Contributions
The research team collected data from the AIDev-pop dataset to characterise agent contributions and identify factors influencing PR acceptance rates. This involved extracting PR labels indicating task type, calculating the size of code changes measured in lines added and deleted, and recording the CI build status, pass or fail, for each PR. The analysis also covered review interactions, including the number of reviewers, the comments posted, and the revision rounds requested before a PR was either merged or closed. The study categorised PRs into task types such as documentation, CI and build updates, performance improvements, and bug fixes, to assess the correlation between task type and merge success.
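To make the per-PR measurements concrete, here is a minimal sketch of how size, CI, and review metrics of this kind could be derived from a table of PR records. The column names (additions, deletions, changed_files, ci_status, review_comments, merged) are hypothetical and do not reflect the actual AIDev-pop schema.

```python
# Hypothetical per-PR metric extraction; column names are illustrative,
# not the real AIDev-pop schema.
import pandas as pd

def summarise_prs(prs: pd.DataFrame) -> pd.DataFrame:
    """Derive size, CI, and review-engagement metrics for each PR."""
    out = prs.copy()
    out["loc_changes"] = out["additions"] + out["deletions"]  # total LOC touched
    out["ci_passed"] = out["ci_status"].eq("success")         # pass/fail flag
    out["reviewed"] = out["review_comments"] > 0              # any human engagement
    return out[["task_type", "loc_changes", "changed_files",
                "ci_passed", "reviewed", "merged"]]

# Toy example record:
df = pd.DataFrame([{"task_type": "bug_fix", "additions": 120, "deletions": 40,
                    "changed_files": 5, "ci_status": "failure",
                    "review_comments": 0, "merged": False}])
print(summarise_prs(df))
```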
A complementary qualitative analysis involved manually reviewing PR descriptions, commit messages, and review comments to identify recurring themes and reasons for non-merging. This work adds nuance to the quantitative findings, offering insight into the socio-technical factors influencing PR acceptance. The approach enables a deeper understanding of how agents integrate into real-world development workflows, revealing that failures often stem from issues related to CI/CD integration, developer expectations, and project coordination. Together, the analyses deliver a comprehensive assessment of agent performance, paving the way for more effective integration of AI coding agents into open-source development.
Agent PR success linked to task type
Data shows a clear correlation between task type and PR acceptance, suggesting agents excel in certain areas of software maintenance. The team found that PRs with extensive code modifications were less likely to be accepted, highlighting the value of incremental contributions. Results demonstrate that adherence to existing CI/CD processes is crucial for agentic PR success, as failures in this area often led to rejection. Further analysis revealed that the average number of files touched in rejected PRs was 2.3 times higher than in merged PRs, indicating a substantial difference in scope. The qualitative analysis complemented these quantitative findings by uncovering reasons not captured by numerical metrics, such as a lack of meaningful reviewer engagement.
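A sketch of the merged-versus-unmerged comparison described above, assuming a metrics frame with the hypothetical columns from the earlier example; the 2.3x files-touched figure is the paper's reported result, which this simply recomputes as a ratio on whatever data is supplied.

```python
# Contrast scope, CI compliance, and review engagement by merge outcome.
# Assumes the hypothetical columns from the earlier summarise_prs sketch.
import pandas as pd

def compare_outcomes(metrics: pd.DataFrame) -> pd.DataFrame:
    return metrics.groupby("merged").agg(
        mean_files=("changed_files", "mean"),
        mean_loc=("loc_changes", "mean"),
        ci_pass_rate=("ci_passed", "mean"),
        reviewed_rate=("reviewed", "mean"),
    )

# On the study's data, the ratio of files touched in rejected vs merged PRs
# (roughly 2.3x, per the paper) would be recovered as:
# summary = compare_outcomes(metrics)
# summary.loc[False, "mean_files"] / summary.loc[True, "mean_files"]
```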
The researchers found that reviewer abandonment, where PRs received minimal human interaction before closure, was a prevalent rejection pattern. Duplicate PRs, unwanted feature implementations, and agent misalignment also contributed significantly to rejection rates. The study delivers insights into socio-technical and human-AI collaboration factors critical for improving future agentic workflows. The measurements confirm that successful integration of AI coding agents requires careful consideration of both technical aspects, like CI/CD compliance, and social aspects, like reviewer engagement. The results indicate that aligning agent contributions with developer expectations and project needs is paramount for achieving higher merge rates and fostering effective collaboration. This work establishes a foundation for developing more robust and collaborative AI-powered software development tools.
AI PR Success, Rejection and Task Type
The investigation quantitatively characterised merged and unmerged PRs, considering task type, code alterations, CI build outcomes, and review interactions. Further qualitative analysis of 600 PRs revealed a taxonomy of rejection patterns, complementing the quantitative data. These patterns included a lack of reviewer engagement, duplicate submissions, unwanted feature implementations, and misalignment with project goals. The findings demonstrate that unmerged PRs frequently involve larger, more extensive code changes and a higher incidence of CI/test failures. Rejections often stem from issues like reviewer abandonment, duplicate submissions, or the implementation of features not aligned with project needs.
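The paper's taxonomy of rejection patterns came from manual qualitative coding of 600 PRs; purely as an illustration, a first automated pass over review threads might flag candidate patterns with simple heuristics like the sketch below. The keyword lists and labels are invented for this example and are not the authors' coding scheme.

```python
# Illustrative only: a keyword heuristic for flagging candidate rejection
# patterns; the actual taxonomy was built by manual qualitative coding.
from typing import List

PATTERN_KEYWORDS = {
    "duplicate_submission": ["duplicate", "already fixed", "same as #"],
    "unwanted_feature": ["not needed", "out of scope", "won't add"],
    "misalignment": ["doesn't match", "against our conventions"],
}

def flag_rejection_patterns(comments: List[str]) -> List[str]:
    """Return candidate patterns; an empty review thread suggests abandonment."""
    if not comments:
        return ["reviewer_abandonment"]
    text = " ".join(comments).lower()
    return [label for label, keys in PATTERN_KEYWORDS.items()
            if any(k in text for k in keys)] or ["unlabelled"]

print(flag_rejection_patterns([]))                               # ['reviewer_abandonment']
print(flag_rejection_patterns(["This is a duplicate of #123."]))  # ['duplicate_submission']
```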
The authors acknowledge limitations related to the specific repositories and agents studied, suggesting that generalisability may require further investigation. Future work should focus on enhancing agents’ ability to identify existing work, adhere to project norms, decompose tasks into smaller changes, and validate submissions against CI pipelines before submission. By characterising these failure patterns, this research provides a foundation for designing more context-aware and collaborative AI coding agents, ultimately informing the integration of such agents into real-world software development workflows.
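As a rough illustration of the pre-submission checks these recommendations imply, an agent could search for open PRs that already cover the same change and run the project's test suite before opening its own. The repository name and query below are placeholders; the check uses GitHub's public issue/PR search endpoint and a local pytest run.

```python
# Sketch of pre-submission checks an agent could run before opening a PR.
# Repository name and search keywords are placeholders.
import subprocess
import requests

def find_similar_prs(repo: str, keywords: str) -> list:
    """Look for open PRs that may already address the same change."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"repo:{repo} is:pr is:open {keywords}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["html_url"] for item in resp.json().get("items", [])]

def tests_pass() -> bool:
    """Run the project's test suite locally before submitting."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

if not find_similar_prs("octocat/hello-world", "null pointer fix") and tests_pass():
    print("No duplicate PRs found and tests pass: safe to open the PR.")
```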
👉 More information
🗞 Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub
🧠 ArXiv: https://arxiv.org/abs/2601.15195
