Code large language models (CodeLLMs) and agents excel at software engineering tasks. A comprehensive review of 181 benchmarks from 461 papers across the software development life cycle (SDLC) reveals an imbalance: 60% focus on development, while requirements engineering (5%) and design (3%) are underrepresented. Python is the dominant benchmark language. The paper identifies current challenges and proposes future directions to bridge the gap between theoretical capabilities and real-world applications.
Code large language models (CodeLLMs) and agents are revolutionizing software engineering, yet comprehensive reviews of their evaluation benchmarks have been lacking. A team led by Kaixin Wang of Xi’an Jiaotong University, with colleagues Tianlin Li, Xiao Yu Zhang, Chong Wang, Weisong Sun, Yang Liu, and Bin Shi of Nanyang Technological University, addresses this gap in the paper ‘Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents’. The study reviews 181 benchmarks across 461 papers and finds a pronounced imbalance: 60% target software development, while requirements engineering (5%) and design (3%) receive little attention. Python is the dominant language across benchmarks. The authors also discuss open challenges and propose future directions for bringing these benchmarks closer to practical use.
Among the works surveyed, Terry Yue Zhuo et al. (2024) developed BigCodeBench, a code-generation benchmark whose tasks require composing diverse function calls from many libraries under complex instructions, with correctness checked by unit tests. By moving beyond short, self-contained problems, it lets researchers assess models against scenarios that better reflect real-world programming complexity.
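In spirit, a BigCodeBench-style check reduces to executing a model's completion against hidden unit tests and recording pass or fail. The sketch below is a minimal illustration of that loop under stated assumptions: the task dict, the `task_func` entry point, and the `TestCases` class are illustrative stand-ins, not actual benchmark data.

```python
# Minimal sketch of a BigCodeBench-style pass/fail check: a task pairs a
# natural-language instruction with hidden unit tests, and a candidate
# completion passes only if every test passes. The task and solution below
# are toy stand-ins; the real benchmark ships hundreds of tasks exercising
# diverse library calls.
import unittest

task = {
    "instruction": "Write task_func(nums) returning the sorted unique values.",
    "test": """
import unittest
class TestCases(unittest.TestCase):
    def test_dedup_and_sort(self):
        self.assertEqual(task_func([3, 1, 3, 2]), [1, 2, 3])
    def test_empty(self):
        self.assertEqual(task_func([]), [])
""",
}

candidate_solution = """
def task_func(nums):
    return sorted(set(nums))
"""

def evaluate(solution: str, test_code: str) -> bool:
    """Exec the candidate and its tests in one namespace; True iff all pass."""
    namespace: dict = {}
    exec(solution, namespace)   # defines task_func
    exec(test_code, namespace)  # defines TestCases against it
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(namespace["TestCases"])
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()

print(evaluate(candidate_solution, task["test"]))  # True for this toy task
```

A production harness would additionally sandbox the `exec` calls (separate process, resource limits), since benchmark completions are untrusted model output.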
In another significant contribution, Andy Zou et al. (2023) proposed universal and transferable adversarial attacks on aligned language models. Their method uses a greedy, gradient-guided search to find an adversarial suffix that, when appended to a prompt, induces aligned models to comply with requests they would otherwise refuse; the same suffix transfers across models, including black-box ones. The attack exposes concrete weaknesses in alignment and gives developers a practical framework for testing and hardening large language models.
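To make that search concrete, here is a toy rendition of the greedy coordinate descent at the heart of such attacks. Everything model-specific is mocked: the loss is a stand-in objective (distance to a hidden string) so the example runs self-contained, whereas the real attack scores candidate token swaps by gradients through the target model's embeddings and minimizes the loss of an affirmative response.

```python
# Illustrative greedy coordinate search, reduced to a toy setting: optimize a
# short "suffix" over a small vocabulary to minimize a stand-in loss. The
# vocabulary, suffix length, and hidden target below are arbitrary choices
# for demonstration only.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz !")
SUFFIX_LEN = 12

def toy_loss(suffix: str) -> float:
    # Stand-in for the target model's loss on the attacker's desired output:
    # here, simply the Hamming distance to a hidden string, so the search
    # has a well-defined signal to follow.
    hidden = "open sesame "[:SUFFIX_LEN]
    return sum(a != b for a, b in zip(suffix, hidden))

def greedy_coordinate_search(steps: int = 200, seed: int = 0) -> str:
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(SUFFIX_LEN)]
    best = toy_loss("".join(suffix))
    for _ in range(steps):
        pos = rng.randrange(SUFFIX_LEN)      # coordinate (position) to mutate
        for cand in rng.sample(VOCAB, k=8):  # a few candidate token swaps
            trial = suffix.copy()
            trial[pos] = cand
            loss = toy_loss("".join(trial))
            if loss < best:                  # keep only improving swaps
                best, suffix = loss, trial
        if best == 0:
            break
    return "".join(suffix)

print(greedy_coordinate_search())  # converges toward the hidden string
```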
👉 More information
🗞 Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents
🧠 DOI: https://doi.org/10.48550/arXiv.2505.05283
