Multilingual LLM Evaluations of 6,000 Prompts Advance Global Model Safety

Researchers are increasingly focused on ensuring the safety and reliability of large language models (LLMs) as they become globally deployed. Akriti Vij, Benjamin Chua, and Darshini Ramiah, all from Singapore AISI, alongside En Qi Ng, Mahran Morsidi, Naga Nikshith Gangarapu et al, have spearheaded a crucial investigation into how LLM safeguards perform across a diverse range of languages. Their work details a joint multilingual evaluation exercise , involving international collaboration from ten regions , testing two open-weight models in ten languages, from Cantonese to Telugu, and assessing over 6,000 prompts for potential harms. This research is significant because it reveals substantial variations in LLM behaviour and evaluator reliability across languages, highlighting the urgent need for culturally contextualised evaluation methodologies and a shared framework for robust multilingual testing.

Singapore AISI spearheaded testing across all ten languages, while Australia and Japan independently ran tests in Mandarin Chinese and Japanese to assess the influence of the inference environment on the results obtained. Evaluations incorporated both LLM-as-a-judge methodologies and detailed human annotation, allowing for a comparative analysis of automated and human-led assessment techniques.

This approach enabled the team to explore the performance of LLM-as-a-judge against human evaluation in nuanced settings, identifying areas where automated evaluation excels and where human oversight remains essential. Experiments show that safeguard robustness varies significantly across languages and harm types, with jailbreak protections consistently proving the weakest, while intellectual property safeguards demonstrated greater strength. Non-English safeguards generally lagged slightly behind their English counterparts, though this disparity fluctuated depending on the specific harm category being assessed. Furthermore, the research unveiled valuable methodological insights for enhancing multilingual evaluations, including the necessity for culturally contextualised translations, rigorously stress-tested evaluator prompts, and clearly defined human annotation guidelines.
This comprehensive evaluation also examined the quality of model refusals, determining whether responses were overly cautious or provided helpful ethical alternatives, findings indicated that refusals in most languages included reasoning or suggested ethical options. The study’s findings are indicative and should be further explored, acknowledging limitations such as varying translation quality and the scope of the human annotator pool, but represent a significant step towards building more reliable and culturally sensitive AI systems for a global audience. This breakthrough opens exciting possibilities for developing AI that truly understands and respects the nuances of diverse cultures and languages.

Multilingual LLM Safety Evaluation Using Translated Prompts

Researchers employed a dual-evaluation approach, harnessing both LLM-as-a-judge automated assessment and detailed human annotation to validate model responses. This innovative methodology allowed for comparison of evaluator reliability between the automated and human systems, revealing crucial insights into the strengths and limitations of each approach. The team engineered a robust experimental setup, meticulously translating prompts and establishing clear annotation guidelines, detailed in Annex B and D, to maintain consistency across languages and evaluators. This process ensured culturally contextualised translations, addressing a key challenge in multilingual LLM evaluation.

The evaluation process involved systematically submitting each prompt to the two open-weight models and recording the responses. Human annotators then assessed these responses against pre-defined criteria, detailed in Annex C, categorising them as acceptable or unacceptable based on the presence of harmful content. LLM-as-a-judge evaluations were conducted concurrently, utilising a separate LLM to automatically assess the responses and provide a comparative score. This parallel assessment enabled the researchers to quantify discrepancies between automated and human judgements, identifying areas where automated evaluation requires refinement. Furthermore, the study stress-tested evaluator prompts to identify potential biases or ambiguities, improving the clarity and effectiveness of the evaluation process. The resulting data revealed significant variations in safeguard robustness across languages and harm types, highlighting the need for tailored safety measures for different linguistic and cultural contexts.

Multilingual Model Safety Varied by Language and Harm

Results demonstrate that non-English safeguards tended to lag slightly behind English, although this varied by harm type, with jailbreak protections proving the weakest and intellectual property safeguards the strongest. The research meticulously measured safeguard effectiveness, recording acceptability and refusal rates across all ten languages and five harm categories. Data shows that while models generally offered reasoning or ethical alternatives when refusing requests, languages such as French, Korean, Japanese, and Farsi exhibited a cultural tendency to avoid direct rejections, prioritising politeness. Tests prove that LLM-as-a-judge shows promise as a baseline evaluator, but human oversight remains crucial, particularly for nuanced or ambiguous harms.

Discrepancy rates between LLM-as-a-judge and human annotation were carefully documented to refine evaluation methodologies. Scientists recorded instances of mixed-language outputs in all languages except English and French, with Malay and Cantonese models occasionally confusing languages with similar counterparts. Further analysis revealed that in languages like Cantonese, Malay, Mandarin Chinese, Telugu, and Kiswahili, models sometimes provided initial warnings alongside partial or suggestive harmful instructions. Hallucinations and gibberish were more prevalent in lower-resourced languages, including Farsi, Telugu, and Kiswahili, indicating a need for improved data quality and model training in these contexts.

Measurements confirm that the exercise yielded key methodological learnings, including the necessity for culturally contextualised translations and stress-tested evaluator prompts. The team identified that literal translations are insufficient, and prompts must be adapted to each language and culture for accurate evaluation. Enhancing human annotations with clearer guidelines and multi-label evaluation schemes is also critical for capturing model limitations.,.

AI Safeguard Performance Varies by Language and Harm

Scientists have demonstrated the variable performance of current AI safeguards across multiple languages and cultural contexts. The findings reveal that safeguard robustness differs significantly not only between languages but also across the types of harm assessed. Variations were also observed in the reliability of evaluations conducted using large language models as judges compared to those performed by human annotators, highlighting a crucial methodological consideration. This research generated valuable insights for enhancing multilingual evaluations, including the importance of culturally appropriate translations, rigorously tested prompts for evaluators, and more precise guidelines for human annotation. Future research should continue to build upon these findings, fostering cooperation between the academic community and industry partners to ensure the safe and reliable deployment of these technologies globally. The study’s implications are significant, suggesting that a ‘one-size-fits-all’ approach to AI safety is insufficient and that localised, culturally sensitive evaluation is essential for responsible AI development.

👉 More information
🗞 Improving Methodologies for LLM Evaluations Across Global Languages
🧠 ArXiv: https://arxiv.org/abs/2601.15706

Rohail T.

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Deepseek-R1-Distill-Llama-70b Achieves 12-Dataset Benchmark for Causal Discovery

Deepseek-R1-Distill-Llama-70b Achieves 12-Dataset Benchmark for Causal Discovery

January 27, 2026
Large Language Models Achieve Fine-Grained Opinion Analysis with Reduced Human Effort

Large Language Models Achieve Fine-Grained Opinion Analysis with Reduced Human Effort

January 27, 2026
Qsmri Achieves Noninvasive Detection of Neuronal Activity in Human Brains

Qsmri Achieves Noninvasive Detection of Neuronal Activity in Human Brains

January 27, 2026