Researchers are increasingly focused on the responsible development of large language models, and a new study rigorously assesses the safety of Amazon’s Nova 2.0 Lite under the company’s Frontier Model Safety Framework (FMSF). Satyapriya Krishna, Matteo Memelli, and Tong Wang, all from Amazon, alongside colleagues including Abhinav Mohanty, Claire O’Brien Rajkumar, and Payal Motwani, present a comprehensive evaluation of this powerful model, capable of processing up to one million tokens of text, images, and video, across three high-risk domains: Chemical, Biological, Radiological and Nuclear (CBRN) threats, offensive cyber operations, and automated research and development. This work is significant because it details a robust methodology combining automated testing, expert analysis, and ‘uplift’ studies to determine if Nova 2.0 Lite meets stringent safety thresholds, contributing vital insights to the ongoing effort of aligning frontier model capabilities with responsible AI principles.
Nova 2.0 Lite, CBRN Risk Profile Evaluation
Scientists at Amazon have unveiled a comprehensive safety evaluation of Nova 2.0 Lite, a powerful new multimodal foundation model capable of processing text, images, and video with an impressive context length of up to 1 million tokens. This breakthrough enables the model to analyse extensive codebases, lengthy documents, and complex videos within a single prompt, representing a significant leap forward in artificial intelligence capabilities. Nova 2.0 Lite demonstrates measurable capability gains over its predecessor, Nova 1.0, across all three high-risk domains, yet remains within established safety parameters. Researchers integrated reproducible automated benchmarks to quantify the model’s knowledge, alongside human-centric risk evaluations like expert red-teaming and uplift studies to identify potential real-world failure modes. Independent auditors from Nemesys Insights and Model Evaluation and Threat Research (METR) verified the internal findings, confirming the robustness of the evaluation process. The team achieved higher factual accuracy on both CBRN and cybersecurity knowledge tests, while maintaining safety through policy-tuned refusal behaviour, dynamic content filters, and continuous safeguard monitoring.
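The uplift studies mentioned here compare how well participants complete a hazardous task with model assistance versus without it. As a rough, hypothetical illustration of how such a comparison might be scored (the data, group sizes, and test choice below are assumptions, not the authors’ actual protocol), one can contrast the two groups’ success rates with a simple two-proportion z-test:

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (uplift, z, two-sided p-value) for success rates of group A vs group B."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a - p_b, z, p_value

# Hypothetical outcomes: 1 = participant completed the task, 0 = did not.
model_assisted = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # group with model access
control        = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]   # group with internet only

uplift, z, p = two_proportion_z_test(sum(model_assisted), len(model_assisted),
                                     sum(control), len(control))
print(f"uplift = {uplift:+.2f}, z = {z:.2f}, p = {p:.3f}")
```

A material uplift would appear as a difference in success rates that is both practically large and statistically significant; the paper’s conclusion that Nova 2.0 Lite poses no material uplift corresponds to the absence of such a gap.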
Experiments show Nova 2.0 Lite excels in complex procedural understanding, scoring 0.71 on the WMDP-CHEM benchmark and 0.82 on WMDP-BIO, surpassing Nova 1.0 Pro’s performance and approaching that of Nova Premier. On procedure-heavy benchmarks like PROTOCOLQA, the model attained a score of 0.49, significantly higher than Nova 1.0 Pro’s 0.34 and comparable to Nova Premier’s 0.48. Furthermore, on the challenging Virology Capabilities Test (VCT), where even expert virologists average only 22.1% with internet access, Nova 2.0 Lite achieved a score of 0.29, exceeding Nova 1.0 Pro and nearing Nova Premier’s performance. This work establishes a transparent template for future cross-organisational safety audits of frontier models, paving the way for responsible development and deployment of increasingly powerful AI systems.
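Benchmarks such as WMDP-CHEM and WMDP-BIO are multiple-choice tests, so the reported figures are mean accuracies over the item set. The sketch below illustrates that scoring under stated assumptions: `ask_model` is a hypothetical stand-in for a call to the model under evaluation, and the items are placeholders rather than real benchmark questions.

```python
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]
    answer: str          # correct choice label, e.g. "B"

def ask_model(item: Item) -> str:
    """Hypothetical stand-in for a call to the model under evaluation.
    A real harness would format the prompt and parse the returned label."""
    return "A"           # placeholder prediction

def mean_accuracy(items: list[Item]) -> float:
    """Fraction of items where the model's chosen label matches the answer key."""
    correct = sum(ask_model(item) == item.answer for item in items)
    return correct / len(items)

# Illustrative items only; real WMDP questions probe hazardous knowledge.
items = [
    Item("Placeholder question 1", ["A", "B", "C", "D"], "A"),
    Item("Placeholder question 2", ["A", "B", "C", "D"], "C"),
]
print(f"mean accuracy: {mean_accuracy(items):.2f}")
```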
Nova 2.0 Lite, CBRN, Cyber & AI
Scientists at Amazon meticulously evaluated Nova 2.0 Lite, focusing on critical risk profiles across three high-consequence domains: Chemical, Biological, Radiological and Nuclear (CBRN) threats, offensive cyber operations, and automated research and development. Nova 2.0 Lite demonstrates measurable capability gains over Nova 1.0, exhibiting higher factual accuracy on CBRN and cybersecurity knowledge tests, yet it remains within established safety thresholds when performing complex tasks. This detailed methodology facilitated a comprehensive safety assessment, ultimately supporting the model’s safe public release under the current mitigation stack, which includes policy-tuned refusal behaviour, dynamic content filters, and continuous safeguard monitoring.
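The mitigation stack referenced here layers policy-tuned refusals, dynamic content filters, and safeguard monitoring around the raw model. A minimal sketch of how such layering could be wired is shown below; the regular-expression rules, function names, and logging are illustrative assumptions, not Amazon’s actual safeguards, which would rely on trained classifiers rather than keyword patterns.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("safeguards")

# Illustrative patterns only; a production filter would use trained classifiers.
BLOCKED_PATTERNS = [r"\bsynthesis route\b", r"\bexploit payload\b"]

def content_filter(text: str) -> bool:
    """Return True if the text should be blocked by the content filter."""
    return any(re.search(p, text, re.IGNORECASE) for p in BLOCKED_PATTERNS)

def model_generate(prompt: str) -> str:
    """Hypothetical stand-in for the underlying model call."""
    return f"[model response to: {prompt!r}]"

def guarded_generate(prompt: str) -> str:
    """Run the request through layered safeguards before and after generation."""
    if content_filter(prompt):
        log.info("prompt blocked by content filter")      # monitoring hook
        return "I can't help with that request."          # policy-tuned refusal
    response = model_generate(prompt)
    if content_filter(response):                           # output-side check
        log.info("response blocked by content filter")
        return "I can't help with that request."
    return response

print(guarded_generate("Summarise this research paper."))
```

The design point is that filtering happens on both the input and the output path, and every blocked request leaves a monitoring trace.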
Nova 2.0 Lite excels in high-risk domains
Scientists report significant findings from their evaluation of Amazon’s Nova 2.0 Lite. Experiments revealed that Nova 2.0 Lite outperformed its predecessors in factual accuracy on CBRN knowledge tests, scoring 0.71 on WMDP-CHEM compared to 0.66 for Nova Premier and 0.63 for Nova 1.0 Pro. The model also demonstrated stronger procedural understanding, achieving a mean solve rate of 0.49 on PROTOCOLQA, surpassing Nova 1.0 Pro (0.34) and narrowly edging out Nova Premier (0.48).
On the complex Virology Capabilities Test (VCT), Nova 2.0 Lite scored 0.29, slightly below Nova Premier’s 0.30 but significantly higher than Nova 1.0 Pro’s 0.15. These results confirm that Nova 2.0 Lite is a step forward in handling hazardous knowledge and complex procedures. In multimodal risk evaluation, the model showed balanced performance on chemical image understanding tasks, achieving accuracy levels between those of its predecessors. Independent auditors from Nemesys Insights verified these findings, confirming that the model remains within safety thresholds on real-world evaluations, such as attempts to execute end-to-end weaponisation workflows or to conduct machine-learning research autonomously. The data show that while Nova 2.0 Lite exhibits measurable capability gains over its predecessors, it remains safe for public release under Amazon’s current mitigation stack, which includes policy-tuned refusal behaviour and continuous safeguard monitoring. These measurements provide a transparent template for future cross-organisational safety audits of frontier models, ensuring robust evaluation protocols are in place to protect against misuse.
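Frameworks such as the FMSF tie release decisions to capability thresholds per risk domain: a model is considered releasable only if its measured capability stays below the critical level in every domain. The toy decision logic below illustrates that structure; the threshold values, domain keys, and scores are invented for illustration and are not the actual FMSF criteria.

```python
# Hypothetical thresholds per risk domain; these numbers are purely illustrative
# and do not reflect the real FMSF criteria.
CRITICAL_THRESHOLDS = {"cbrn": 0.90, "cyber": 0.90, "automated_ai_rnd": 0.90}

# Aggregated capability scores for the model under evaluation (illustrative values).
measured_scores = {"cbrn": 0.49, "cyber": 0.41, "automated_ai_rnd": 0.22}

def within_safety_thresholds(scores: dict, thresholds: dict) -> bool:
    """Release is supported only if every domain stays below its critical threshold."""
    return all(scores[d] < thresholds[d] for d in thresholds)

for domain, score in measured_scores.items():
    status = "ok" if score < CRITICAL_THRESHOLDS[domain] else "CRITICAL"
    print(f"{domain:>18}: {score:.2f} ({status})")

print("release supported:", within_safety_thresholds(measured_scores, CRITICAL_THRESHOLDS))
```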
The research demonstrates Nova 2.0 Lite’s capacity to process text, images, and video with a substantial 1M token context length, facilitating the analysis of complex data like codebases and lengthy documents. The findings establish that Nova 2.0 Lite is safe for public release according to Amazon’s FMSF, despite exhibiting increased knowledge and proficiency, particularly in CBRN and cybersecurity, compared to previous iterations.
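Fitting an entire codebase or a lengthy document into a single prompt requires budgeting against the 1M-token context window when the context is assembled. The sketch below is a simplified illustration using a crude four-characters-per-token heuristic; a real pipeline would use the model’s actual tokenizer, and the file selection and budget here are assumptions.

```python
from pathlib import Path

TOKEN_BUDGET = 1_000_000          # Nova 2.0 Lite's reported context length
CHARS_PER_TOKEN = 4               # crude heuristic, not a real tokenizer

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def assemble_context(root: str, budget: int = TOKEN_BUDGET) -> str:
    """Concatenate source files under `root` until the estimated budget is reached."""
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):     # illustrative: Python files only
        text = path.read_text(errors="ignore")
        cost = estimate_tokens(text)
        if used + cost > budget:
            break                                      # stop before exceeding the window
        parts.append(f"# file: {path}\n{text}")
        used += cost
    return "\n\n".join(parts)

context = assemble_context(".")                        # current directory as an example
print(f"assembled ~{estimate_tokens(context)} tokens of context")
```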
Independent evaluation by the Model Evaluation and Threat Research (METR) group corroborated Amazon’s internal results, confirming the model does not meet the threshold for critical risk in Automated AI R&D. While the model’s capabilities have advanced, rigorous testing validated it remains within safe operational boundaries, posing no material uplift for weaponisation or offensive cyber activities. Researchers acknowledge limitations inherent in evaluating frontier models, and plan to refine evaluation and mitigation pipelines as new risks emerge. The authors highlight the importance of dynamic filters and continuous monitoring as key components of their safety approach. Future work will likely focus on expanding the scope of evaluation to encompass emerging threats and capabilities, ensuring ongoing safety and responsible deployment of advanced AI systems.
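Continuous safeguard monitoring in this setting typically means tracking signals such as refusal or filter-hit rates over time and alerting when they drift. The rolling-window monitor below is one simple, hypothetical way to express that idea; the window size, alert threshold, and simulated traffic are all assumptions.

```python
from collections import deque

WINDOW = 100            # number of recent requests to track (assumed)
ALERT_RATE = 0.20       # alert if more than 20% of recent requests are flagged (assumed)

class SafeguardMonitor:
    """Rolling-window monitor over per-request safeguard outcomes."""

    def __init__(self, window: int = WINDOW):
        self.events = deque(maxlen=window)   # 1 = flagged/refused, 0 = served normally

    def record(self, flagged: bool) -> None:
        self.events.append(1 if flagged else 0)

    def flag_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self) -> bool:
        return len(self.events) == self.events.maxlen and self.flag_rate() > ALERT_RATE

# Simulated traffic: mostly benign requests with an occasional filtered one.
monitor = SafeguardMonitor()
for i in range(250):
    monitor.record(flagged=(i % 7 == 0))     # synthetic pattern, roughly 14% flag rate
print(f"recent flag rate: {monitor.flag_rate():.2f}, alert: {monitor.should_alert()}")
```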
👉 More information
🗞 Evaluating Nova 2.0 Lite model under Amazon’s Frontier Model Safety Framework
🧠 ArXiv: https://arxiv.org/abs/2601.19134
