Anthropic has released version 3.0 of Petri, its open-source AI alignment toolbox, with significant architectural changes that separate the “auditor” and “target” models into customizable components. This allows for more adaptable and nuanced testing of large language models for potentially harmful behaviors like deception and sycophancy; Petri has been integral to Anthropic’s internal evaluations for every Claude model released since Claude Sonnet 4.5. The UK’s AI Security Institute (AISI) has also utilized Petri as a major part of their own model evaluations, demonstrating its growing impact beyond research. A new add-on called “Dish” further enhances realism by simulating genuine model deployment conditions, addressing the issue of models recognizing they are being tested.
Petri 3.0 Enables Adaptable, Realistic AI Alignment Testing
Originally launched in October as an open-source toolbox, Petri allows for rapid assessment of large language models for problematic behaviors including deception and harmful cooperation, and has now undergone substantial architectural changes designed to broaden its applicability. These modifications decouple the “auditor” and “target” models, enabling users to customize testing scenarios with greater precision and flexibility, shifting from static evaluations to a more nuanced approach. The latest iteration addresses a critical challenge in AI safety: the ability of models to recognize they are in a test environment, thereby altering their behavior and masking potential misalignments. Anthropic researchers recognized that if a model is aware it’s being evaluated, researchers cannot accurately observe its general behavior, prompting the development of “Dish,” an add-on designed to simulate genuine model deployments.
Dish enhances realism by utilizing the model’s actual system prompt and surrounding software, known as the “scaffold,” to create a more authentic testing environment. This focus on realistic simulation is crucial for uncovering subtle but potentially dangerous tendencies that might otherwise remain hidden. Expanding its capabilities, Petri 3.0 now integrates with Bloom, another open-source alignment tool from Anthropic, allowing for both broad-ranging assessments, a strength of Petri, and in-depth analysis of specific behaviors.
To ensure continued independence and credibility, Anthropic has transferred the development of Petri to Meridian Labs, an AI evaluation nonprofit, mirroring a previous donation of the Model Context Protocol to the Linux Foundation. According to Anthropic, this move will help ensure that Petri remains independent of any AI lab, so its results will be seen as neutral and credible by those across the industry and beyond, positioning Petri alongside other tools like Inspect and Scout as part of a growing open technology stack for reliable AI model evaluation.
Bloom Integration Deepens Behavioral Assessment with Open-Source Tools
AI safety evaluation currently relies on a combination of internal testing and emerging open-source tools, though a consistent, universally adopted standard remains elusive. Organizations are increasingly recognizing the need for transparent and adaptable methods to assess potentially harmful behaviors in large language models. Anthropic’s recent advancements build upon this foundation, specifically through the integration of Bloom with their established alignment testing framework, Petri. This combination allows for a tiered approach to behavioral assessment, moving beyond broad evaluations to focus on in-depth analysis of specific, targeted behaviors. While Petri excels at identifying a wide range of misalignments, Bloom enables researchers to scrutinize particular actions with greater precision, offering a more nuanced understanding of model tendencies.
This deepened integration is exemplified by the latest iteration of Petri, version 3.0, which incorporates significant architectural changes designed to enhance its adaptability. A key improvement involves separating the “auditor” and “target” models, allowing for independent customization of each component. This flexibility is crucial for tailoring tests to specific model architectures and deployment scenarios, moving beyond static evaluation frameworks. The company notes that despite researchers’ efforts to make tests appear realistic, a model can often deduce from artificialities in the setup that it’s part of a test, highlighting the importance of creating believable scenarios to elicit genuine responses. Petri now joins other Meridian Labs tools like Inspect and Scout, forming a comprehensive technology stack available to labs, researchers, and governments.
You can read more about Petri 3.0 on the Meridian Labs blog.
Anthropic
