Review of CodeLLMs and Agents Benchmarks in Software Engineering

Code large language models (CodeLLMs) and agents excel in software engineering tasks. A comprehensive review of 181 benchmarks from 461 papers across the software development life cycle (SDLC) reveals an imbalance: 60% focus on development, while requirements engineering (5%) and design (3%) are underrepresented. Python dominates as the primary language. The paper identifies current challenges and proposes future directions to bridge the gap between theoretical capabilities and real-world applications.

Code large language models (CodeLLMs) and agents are revolutionizing software engineering, yet comprehensive reviews of their evaluation benchmarks have been lacking. A team led by Kaixin Wang of Xi’an Jiaotong University, together with Tianlin Li, Xiao Yu Zhang, Chong Wang, Weisong Sun, Yang Liu, and Bin Shi of Nanyang Technological University, has addressed this gap in the paper ‘Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents’. The study reviews 181 benchmarks drawn from 461 papers and reveals a pronounced imbalance: 60% target software development, while requirements engineering (5%) and design (3%) receive minimal attention. Python is the dominant benchmark language. The authors also discuss current challenges and propose future directions for bringing these systems closer to practical use.

The work sits within a broader effort to develop methods that evaluate, detect, and enhance AI capabilities across diverse applications.

Recent advancements in computational methodologies have introduced novel approaches to complex challenges in artificial intelligence and software engineering. Terry Yue Zhuo et al. (2024) developed BigCodeBench, a code-generation benchmark designed to handle diverse function calls and intricate instructions, thereby expanding the scope of evaluation for AI coding systems. This allows researchers to assess models more comprehensively by incorporating varied scenarios that reflect real-world programming complexity.

In another significant contribution, Andy Zou et al. (2023) proposed universal and transferable adversarial attacks targeting aligned language models. By identifying vulnerabilities across different domains, this method provides insights into the robustness of AI systems, enabling developers to enhance security measures. The approach leverages adversarial examples that exploit model weaknesses, offering a practical framework for testing and improving the reliability of large language models.
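At a high level, such suffix-style attacks search for an appended string that pushes a model's output in a chosen direction. The toy sketch below mimics only the greedy coordinate-search aspect; `refusal_score` is a deterministic stub standing in for a real model's refusal signal, whereas the actual method queries an aligned LLM and guides the search with token-level gradient information.

```python
import string

# Toy sketch of greedy suffix search, loosely echoing adversarial-suffix
# attacks. refusal_score is a stub, NOT a real model: actual attacks query
# an aligned LLM and use token gradients to steer the search.

def refusal_score(prompt: str) -> int:
    """Stand-in for a model's refusal signal (lower = less likely to refuse)."""
    return sum(ord(c) for c in prompt) % 7  # arbitrary deterministic proxy

def greedy_suffix_search(base: str, length: int = 4) -> str:
    """Pick suffix characters one at a time, each minimizing the score."""
    suffix = ""
    for _ in range(length):
        best = min(string.ascii_lowercase,
                   key=lambda c: refusal_score(base + suffix + c))
        suffix += best
    return suffix

adv = greedy_suffix_search("example request")
print(refusal_score("example request" + adv))  # the greedy search drives the stub score to 0
```

The same loop structure, with the stub replaced by model queries, is what makes such probes a practical robustness-testing framework: any measurable output signal can be optimized against.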

👉 More information
🗞 Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents
🧠 DOI: https://doi.org/10.48550/arXiv.2505.05283

Quantum News

There is so much happening right now in technology, whether AI or the march of the robots. Adrian is an expert on how technology can be transformative, especially frontier technologies. But quantum occupies a special space. Quite literally a special space: a Hilbert space, in fact, haha! Here I round up some of the breaking news in the quantum computing and quantum tech space.

Latest Posts by Quantum News:

NASA Increases Artemis Program Missions, Aims for Annual Lunar Landings (February 28, 2026)

QED-C Announces Research Advances in Quantum Control Electronics (February 27, 2026)

Sophus Technology to Showcase Quantum Solver Delivering Faster Optimization (February 27, 2026)