Researchers are increasingly concerned by the potential for sensitive data leakage, fraud, and cybersecurity threats posed by rapidly advancing autonomous agents. Ee Wei Seah, Yongsen Zheng, and Naga Nikshith from Singapore AISI, alongside colleagues including Mahran Morsidi, Gabriel Waikin Loh Matienzo, and Nigel Gay from UK AISI, have spearheaded an investigation into improving methodologies for evaluating these systems across diverse domains. Building on previous work from The International Network for Advanced Measurement, Evaluation and Science, this international collaboration represents a significant step towards establishing robust testing practices, which become ever more vital as agents are deployed globally and interact with varied languages and cultures. Rather than focusing solely on agent performance, the study prioritises identifying methodological challenges in agentic evaluation itself, paving the way for more secure and reliable autonomous systems.
The study took a dual-strand approach, investigating common risks, namely leakage of sensitive information and fraud, alongside critical cybersecurity vulnerabilities. Singapore AISI led the common-risks strand, while UK AISI spearheaded the cybersecurity evaluations, ensuring comprehensive coverage of potential agentic failures. Researchers evaluated a diverse range of open and closed-weight models against established public agentic benchmarks, with a methodology that prioritises identifying methodological challenges over absolute performance metrics.
The team meticulously designed tasks to probe agent behaviour across varied languages and cultural contexts, including Farsi, Chinese, and Telugu; this linguistic diversity was crucial for assessing the robustness of agents deployed in a globalised world. Given the early stage of agentic testing, the primary objective was to identify methodological issues rather than to comprehensively assess model capabilities or produce definitive results. In the experiments, two agents were presented with tasks covering the two risk categories across eight languages, each equipped with a selection of tools to aid completion. Metrics centred on pass rates for the tested Large Language Models (LLMs) and discrepancy rates between LLM judges and human annotation, providing a nuanced understanding of the evaluation challenges.
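Those two headline metrics are straightforward to compute once each task has both a judge verdict and a human verdict. The sketch below is purely illustrative and assumes a hypothetical record layout with "pass"/"fail" labels; it is not the study's code.

```python
# Illustrative sketch only: pass rate per tested model and the discrepancy
# rate between judge-LLM and human verdicts. Field names and the use of
# "pass"/"fail" string labels are assumptions, not the study's schema.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    model: str          # LLM under test
    judge_verdict: str  # "pass" or "fail" assigned by a judge-LLM
    human_verdict: str  # "pass" or "fail" assigned by a human annotator

def pass_rate(records: list[EvalRecord], model: str) -> float:
    """Share of a model's tasks that human annotators marked as a pass."""
    scoped = [r for r in records if r.model == model]
    return sum(r.human_verdict == "pass" for r in scoped) / len(scoped)

def discrepancy_rate(records: list[EvalRecord]) -> float:
    """Share of tasks on which the judge-LLM and human annotator disagree."""
    return sum(r.judge_verdict != r.human_verdict for r in records) / len(records)
```

With records keyed by language as well, the same two functions can be applied per language to see where automated judging diverges most from human annotation.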
The agent’s complete trajectory, encompassing the initial prompt, tool calls with their reasoning and inputs/outputs, and the final output, was evaluated in full. Responses were classified as either pass or fail according to scenario-specific criteria; for malicious tasks, failure was defined as either complete or partial execution of the harmful activity. Evaluations were conducted by two judge-LLMs and corroborated by human annotators, allowing a comparative analysis of automated and human judgement. The findings underscore the importance of understanding methodological issues in agentic testing, with the goal of informing future evaluation efforts and advancing the science of AI agent evaluations. The collaboration marks a significant step towards establishing common best practices for testing AI agents globally, fostering more secure and reliable deployment of autonomous systems. This work builds upon two previous exercises, furthering the development of a nascent field crucial for responsible AI deployment. The exercise concentrated on two key risk areas: potential leakage of sensitive information and susceptibility to fraud, alongside cybersecurity vulnerabilities.
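To make concrete what a judge-LLM actually sees in this kind of trajectory-level evaluation, the sketch below shows one possible way to represent a full trajectory and serialise it for judging. The dataclasses and the render helper are hypothetical illustrations, not the study's implementation.

```python
# Illustrative sketch only: a minimal trajectory record covering the initial
# prompt, each tool call (with reasoning, inputs, and outputs), and the final
# output, plus a helper that serialises it into text for a judge-LLM.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str        # tool the agent invoked
    reasoning: str   # agent's stated reason for invoking it
    inputs: dict     # arguments passed to the tool
    outputs: str     # what the tool returned

@dataclass
class Trajectory:
    initial_prompt: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""

def render_for_judge(traj: Trajectory) -> str:
    """Serialise the complete trajectory into the text shown to a judge-LLM."""
    parts = [f"Initial prompt:\n{traj.initial_prompt}"]
    for i, call in enumerate(traj.tool_calls, start=1):
        parts.append(
            f"Tool call {i}: {call.name}\n"
            f"  Reasoning: {call.reasoning}\n"
            f"  Inputs: {call.inputs}\n"
            f"  Outputs: {call.outputs}"
        )
    parts.append(f"Final output:\n{traj.final_output}")
    return "\n\n".join(parts)
```

Two independent judge-LLMs could each be given this rendered text together with the scenario-specific pass/fail criteria, and their verdicts then compared against human annotation as in the metrics sketch above.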
Participants evaluated both open and closed-weight models using established public benchmarks, prioritising an understanding of testing methodologies over definitive performance comparisons. The researchers acknowledge the early stage of agentic testing, highlighting the need for continued refinement of evaluation techniques as agent capabilities rapidly advance. This collaborative effort represents a significant step towards establishing a robust science of agentic evaluations. By focusing on methodological challenges, the study lays the groundwork for more reliable and secure deployment of autonomous systems globally. Future work will likely centre on expanding the scope of testing to a wider range of languages, cultures, and real-world scenarios, ensuring agents operate accurately and safely across diverse contexts.
👉 More information
🗞 Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats
🧠 ArXiv: https://arxiv.org/abs/2601.15679
