As artificial intelligence increasingly assists with software development, current evaluation methods struggle to keep pace with the evolving role of these systems. Tao Dong, Harini Sampath, Ja Young Lee, and colleagues at Google LLC address this challenge by shifting the focus from simply assessing code accuracy to understanding how AI agents behave in collaborative settings. Their work establishes a foundational taxonomy of desirable agent behaviors, identifying four key expectations (adherence to standards, code quality, effective problem-solving, and user collaboration) derived from detailed analysis of real-world software development rules. Crucially, the team also introduces the Context-Adaptive Behavior (CAB) Framework, which demonstrates how expectations for AI agent behavior change depending on the specific task and time horizon, ultimately offering a more human-centered approach to designing and evaluating the next generation of collaborative AI tools.
AI Evaluation Beyond Functional Correctness
The broader research literature on evaluating AI, particularly Large Language Models (LLMs), in software engineering is shifting from simply verifying whether AI-generated code works to assessing broader dimensions such as readability, maintainability, security, and collaboration potential, which necessitates multi-dimensional benchmarks and real-world relevance. Evaluating AI agents presents challenges, including a lack of standardized metrics for subjective qualities like readability and the “last mile problem,” where LLMs generate nearly functional code that still requires significant human refinement. Assessing agents that engage in iterative dialogue, and addressing concerns about responsible AI development, particularly biases and potential harms, also require new evaluation frameworks.
Researchers are actively testing LLMs on tasks such as bug reproduction and fixing, assessing code readability, and exploring how AI agents can autonomously complete complex software engineering tasks. Understanding how developers interact with AI tools, studying how developers learn from them in real time, and even using LLMs as testers are key areas of investigation. Expert evaluation, adversarial benchmarking, multi-agent dialogue frameworks, and real-world data from platforms like GitHub are all employed. Emerging trends include a focus on agentic AI capable of proactive task planning, seamless integration with existing tools, and an emphasis on explainability and long-term impact assessment. The research highlights a growing recognition that evaluating AI in software engineering demands a holistic approach that goes beyond functional correctness.
Human-Centered Taxonomy of AI Agent Behaviors
Researchers pioneered a new framework for evaluating AI agents in software engineering, moving beyond code correctness to assess collaborative behaviors. Analyzing 91 sets of user-defined agent rules, the team extracted key expectations for effective performance, resulting in a taxonomy of four crucial behaviors: adherence to standards, code quality, effective problem-solving, and user collaboration. This provides a human-centered lens for evaluation, focusing on teamwork dynamics. Recognizing that expectations for agent behavior are not fixed, the researchers developed the Context-Adaptive Behavior (CAB) Framework.
The framework captures how expectations change with the situation, considering both a “Time Horizon” axis, ranging from immediate needs to long-term goals, and a “Type of Work” axis, differentiating between enterprise production and rapid prototyping. These axes were empirically derived through expert interviews and analysis of a prototyping agent. The result is a nuanced evaluation of AI agents in diverse contexts, acknowledging that a successful partner adapts to the task and the collaborator’s preferences. The work moves beyond identifying errors to framing evaluation around human-centric behaviors and establishing a systematic understanding of a “good” collaborative process.
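Although the paper does not ship an implementation, the minimal Python sketch below shows one way the four taxonomy categories and the two CAB axes could be represented when judging an agent in a given context. Only the category names and the two axes come from the study; the class names (`EvaluationContext`, `expected_emphasis`) and the emphasis strings are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class BehaviorCategory(Enum):
    """The four behavioral expectations from the taxonomy."""
    ADHERE_TO_STANDARDS = auto()
    ENSURE_CODE_QUALITY = auto()
    SOLVE_PROBLEMS_EFFECTIVELY = auto()
    COLLABORATE_WITH_USER = auto()


class TimeHorizon(Enum):
    """CAB axis 1: from immediate needs to long-term goals."""
    IMMEDIATE = auto()
    LONG_TERM = auto()


class WorkType(Enum):
    """CAB axis 2: enterprise production versus rapid prototyping."""
    ENTERPRISE_PRODUCTION = auto()
    RAPID_PROTOTYPING = auto()


@dataclass(frozen=True)
class EvaluationContext:
    """A point in the CAB space against which agent behavior is judged."""
    time_horizon: TimeHorizon
    work_type: WorkType


def expected_emphasis(ctx: EvaluationContext) -> dict[BehaviorCategory, str]:
    """Placeholder mapping from context to how each expectation is weighted.

    The paper defines the axes and categories; these particular weightings
    are invented here purely to show the shape of a context-adaptive rubric.
    """
    prototyping = ctx.work_type is WorkType.RAPID_PROTOTYPING
    long_term = ctx.time_horizon is TimeHorizon.LONG_TERM
    return {
        BehaviorCategory.ADHERE_TO_STANDARDS:
            "relaxed conventions" if prototyping else "strict team standards",
        BehaviorCategory.ENSURE_CODE_QUALITY:
            "maintainability and security" if long_term else "a working result first",
        BehaviorCategory.SOLVE_PROBLEMS_EFFECTIVELY:
            "fast iteration" if prototyping else "use project context and documentation",
        BehaviorCategory.COLLABORATE_WITH_USER:
            "explain plans and decisions" if prototyping else "plan before large changes",
    }


if __name__ == "__main__":
    # Example: an immediate-horizon prototyping session.
    ctx = EvaluationContext(TimeHorizon.IMMEDIATE, WorkType.RAPID_PROTOTYPING)
    for category, emphasis in expected_emphasis(ctx).items():
        print(f"{category.name}: {emphasis}")
```

Under this sketch, changing only the context object changes which behaviors are weighted most heavily, which is the core idea the CAB Framework formalizes.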
AI Agent Behaviors for Software Development
Scientists established a foundational taxonomy of desirable behaviors for AI agents collaborating on enterprise software development, derived from analysis of 91 sets of user-defined agent rules. The four key expectations identified are adhering to standards, ensuring code quality, solving problems effectively, and collaborating with the user, providing a human-centered framework for evaluation. An LLM-based classification system validated this taxonomy, achieving an F1-score of 83% (precision: 81%, recall: 85%). Experiments revealed significant similarities in behavioral expectations between enterprise software development and rapid prototyping, despite differences in expression.
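As a quick sanity check on those figures, the short sketch below applies the standard F1 definition (the harmonic mean of precision and recall) to the published precision and recall; this is only arithmetic on the reported numbers, not the paper’s evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported validation figures for the LLM-based rule classifier.
reported_precision, reported_recall = 0.81, 0.85
print(f"F1 = {f1_score(reported_precision, reported_recall):.1%}")  # 83.0%, matching the reported F1 of 83%
```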
Users consistently expect agents to follow best practices and engage in collaborative planning. They also expect agents to solve problems using contextual knowledge, such as reviewing conversation history or project documentation, and to proactively validate and learn from feedback. However, the research highlights distinct expectations specific to rapid prototyping, with a greater emphasis on expert roles and UI/UX quality. Users frequently prompted the prototyping agent to assume the persona of an expert and focused on visual design, requesting a “modern and minimalist” aesthetic. They also requested explanations of the agent’s plans and decisions, indicating varying technical expertise and a desire for pedagogical insight.
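To make the input data more concrete: the 91 rule sets themselves are not reproduced in this summary, so the list below invents a handful of hypothetical rules of the kind described above and tags each with the taxonomy category it most plausibly expresses. Both the rule texts and the `EXAMPLE_RULES` name are illustrative assumptions, echoing the findings on best practices, contextual knowledge, collaborative planning, expert personas, and a “modern and minimalist” aesthetic.

```python
# Hypothetical user-defined agent rules, tagged with the taxonomy category
# each most plausibly expresses; the real rule sets analyzed in the paper
# are not reproduced here.
EXAMPLE_RULES: list[tuple[str, str]] = [
    ("Follow our internal style guide and linting configuration.",
     "adhering to standards"),
    ("Write unit tests for any code you change.",
     "ensuring code quality"),
    ("Review the conversation history and project documentation before proposing a fix.",
     "solving problems effectively"),
    ("Explain your plan and wait for approval before making large changes.",
     "collaborating with the user"),
    # Prototyping-flavored rule spanning expert roles and UI/UX quality.
    ("Act as a senior UX designer and keep the interface modern and minimalist.",
     "ensuring code quality"),
]

if __name__ == "__main__":
    for rule, category in EXAMPLE_RULES:
        print(f"[{category}] {rule}")
```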
Context-Adaptive AI Behaviors for Software Teams
This research addresses a significant gap in evaluating AI agents designed to collaborate on software engineering tasks. The team established a foundational taxonomy of desirable agent behaviors, identifying four key expectations: adherence to standards, code quality, effective problem-solving, and user collaboration. This provides a clear framework for understanding successful human-AI partnership. Building upon this, the researchers introduced the Context-Adaptive Behavior (CAB) Framework, which demonstrates how expectations for agent behavior change depending on the type of work and the time horizon of the project. The framework spans time horizons from immediate needs to long-term goals and types of work from enterprise production to rapid prototyping, revealing a nuanced understanding of human-AI interaction. While the framework was demonstrated in enterprise software engineering, further research is needed to explore its use in other areas, such as embedded systems or data science, to strengthen its general applicability.
👉 More information
🗞 From Correctness to Collaboration: Toward a Human-Centered Framework for Evaluating AI Agent Behavior in Software Engineering
🧠 ArXiv: https://arxiv.org/abs/2512.23844
