Researchers are increasingly focused on the problem of misalignment in large language models (LLMs), where failure to simultaneously satisfy safety, value, and cultural considerations results in unpredictable and potentially harmful outputs. Usman Naseem and Gautam Siddharth Kashyap, both from Macquarie University, alongside Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad and Rafiq Ali, working with colleagues at DSEU-Okhla and the University of Delhi, present a new benchmark, MisAlign-Profile, designed to systematically characterise these complex trade-offs. Current evaluation methods typically assess these dimensions in isolation, offering limited understanding of their interplay. This work addresses that gap by introducing MISALIGNTRADE, a comprehensive dataset spanning 112 normative domains, and by revealing significant misalignment trade-offs of between 12% and 34% across various LLMs, thereby offering a crucial step towards building more reliable and human-aligned artificial intelligence systems.
Current LLMs often struggle to satisfy these dimensions simultaneously, producing outputs that diverge from human expectations in real-world scenarios. Existing evaluation methods typically assess each aspect in isolation, offering limited insight into the trade-offs that inevitably arise when attempting to balance them. The research introduces MISALIGNTRADE, a dataset comprising 112 normative domains: 14 safety, 56 value, and 42 cultural categories. Each prompt in the dataset is classified by its domain and by the type of misalignment present, whether an incorrect identification of an object, an inappropriate attribute assignment, or a flawed understanding of relationships between entities. This semantic classification uses models such as Gemma-2-9B-it and Qwen3-30B-A3B-Instruct-2507 to ensure a nuanced assessment of model behaviour, and dataset construction employs a rigorous two-stage rejection sampling process to guarantee the quality of both misaligned and aligned responses.

By benchmarking general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, the researchers reveal significant misalignment trade-offs, ranging from 12% to 34% across these critical dimensions. This finding underscores the challenge of building LLMs that consistently align with human expectations across diverse contexts and highlights the need for more sophisticated evaluation tools. General-purpose aligned models, H3FUSION and TRINITYX, demonstrate a strong balance, achieving Alignment Scores (AS) in the 79%–86% range alongside low False Failure Rates (FFR) of roughly 10%–20%. Conversely, safety-specific models attain high Coverage, reaching up to 97% on misaligned safety subsets, but exhibit over-conservatism, with FFR exceeding 50%. Value- and culture-specific models achieve moderate Coverage gains of approximately 85%–93% but display lower AS, around 60%–70%, when assessed under cross-dimensional conditions. Open-weight models exhibit weaker Coverage, ranging from 64% to 70%, and moderate AS, suggesting a need for further refinement in handling nuanced ethical considerations. Performance degrades systematically on domain-specific subsets, particularly in cultural contexts, indicating substantial cross-dimensional interference.

Analysis of semantic misalignment types shows that object-level errors are the most robust, with AS reaching 78.6% under safety–value (S–V) conditions and remaining above 72.4% under value–culture (V–C) conditions, demonstrating stable entity-level reasoning. Attribute-level misalignment exhibits moderate degradation, with AS declining from 75.2% to 68.1% and FFR rising to 29.8%, reflecting increased sensitivity to normative conflicts. Relation-level misalignment presents the greatest challenge, with AS dropping to 63.5% and FFR exceeding 34% under V–C conditions. Mechanistic alignment profiles, visualised as radar plots, further illustrate these trends: object-level profiles are the most balanced, attribute-level profiles show moderate contraction, and relation-level profiles are the most compressed, with elevated FFR and reduced AS. Dimension-specific models exhibit strong asymmetries, whereas general-purpose models maintain more uniform geometries, reinforcing the conclusion that misalignment arises from systematic internal trade-offs.
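For readers who want to relate these numbers to their own model outputs, the three headline metrics can be read as recall on misaligned prompts (Coverage), over-refusal on aligned prompts (FFR), and overall agreement with the gold labels (AS). The paper's exact formulas are not reproduced in this summary, so the sketch below is only a minimal interpretation under those assumptions; the `Example` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    is_misaligned: bool   # gold label: does this case contain a misaligned response?
    model_flagged: bool   # did the model treat the case as misaligned (refuse / flag / correct it)?

def coverage(examples: List[Example]) -> float:
    """Share of truly misaligned cases the model catches (recall on the misaligned subset)."""
    mis = [e for e in examples if e.is_misaligned]
    return sum(e.model_flagged for e in mis) / len(mis)

def false_failure_rate(examples: List[Example]) -> float:
    """Share of aligned cases the model wrongly rejects -- the over-conservatism noted above."""
    ok = [e for e in examples if not e.is_misaligned]
    return sum(e.model_flagged for e in ok) / len(ok)

def alignment_score(examples: List[Example]) -> float:
    """Overall accuracy across both subsets, used here as a stand-in for the paper's AS."""
    return sum(e.model_flagged == e.is_misaligned for e in examples) / len(examples)
```

Under this reading, the safety-specific pattern reported above, Coverage near 97% with FFR above 50%, corresponds to a model that flags almost everything, misaligned or not.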
MisAlign-Profile begins with the construction of MISALIGNTRADE, a dataset of misaligned and aligned prompts spanning 112 normative domains, forming the basis for evaluating trade-offs across the safety, value, and cultural dimensions. The process starts from an initial pool of approximately 64,108 prompts sourced from existing taxonomies, including BEAVERTAILS, VALUECOMPASS, and UNESCO databases. Each prompt is enriched with a normative domain label and a semantic classification identifying the type of misalignment present. Semantic typing is framed as a multi-domain classification task using the Gemma-2-9B-it model, which assigns each prompt to one of three orthogonal categories: object, attribute, or relation misalignment, that is, whether the model fails to identify correct entities, assign appropriate characteristics, or interpret relationships accurately. A confidence threshold of 0.5 is applied to the model’s output probabilities to ensure reliable labelling. On this labelling task, human annotators achieved an overall accuracy of 88.2% and a Macro-F1 score of 0.82, while Gemma-2-9B-it attained 83.6% accuracy and 0.82 Macro-F1, an accuracy gap of 4.6% across all dimensions. To address potential imbalances, underrepresented combinations of normative domains and semantic types are expanded with the Qwen3-30B-A3B-Instruct-2507 model, which generates diverse and contextually grounded prompts. To maintain data quality, SimHash fingerprinting is used to detect and remove near-duplicate prompts, with a Hamming distance threshold of 10 (a minimal sketch of this deduplication step appears below). This process culminates in a balanced query set, ready for benchmarking language models on their ability to navigate complex misalignment trade-offs.

The pursuit of ‘alignment’ in LLMs has largely focused on isolated virtues, as if these could be bolted on like separate modules. This benchmark represents a crucial shift by acknowledging that these dimensions inevitably collide in the reality of human communication. For years, the field has treated misalignment as a technical glitch, a matter of refining training data or reward functions, rarely addressing the inherent trade-offs between competing values. What distinguishes this work is its attempt to map those trade-offs systematically, identifying where models falter when forced to navigate conflicting values. The dataset’s coverage of numerous normative domains provides a granular picture of where models struggle, and the reported misalignment rates are less important than the framework itself, which offers a diagnostic tool for pinpointing specific areas of vulnerability. Defining and quantifying ‘value’ and ‘culture’ remains inherently subjective, the benchmark relies on human-created pairings of aligned and misaligned responses, and the models tested represent a snapshot of a rapidly evolving LLM landscape. Even so, MisAlign-Profile establishes a vital precedent, pushing the field beyond simplistic metrics and towards a more nuanced assessment of AI behaviour. Future work might explore how these trade-offs manifest across different languages and modalities and, crucially, how they affect real-world applications.
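The deduplication step referenced above is the most mechanical part of the pipeline, so here is a minimal sketch of how SimHash fingerprinting with a Hamming-distance threshold of 10 might be applied. The paper specifies only the technique and the threshold; the whitespace tokenisation, 64-bit fingerprint size, per-token MD5 hash, and brute-force pair comparison below are illustrative assumptions, not the authors' implementation.

```python
import hashlib
from itertools import combinations

def simhash(text: str, bits: int = 64) -> int:
    """Compute a 64-bit SimHash over whitespace tokens (per-token hashing via MD5 for illustration)."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicates(prompts: list[str], threshold: int = 10) -> set[int]:
    """Indices of prompts whose fingerprint is within the Hamming threshold of an earlier prompt."""
    fps = [simhash(p) for p in prompts]
    dupes: set[int] = set()
    for i, j in combinations(range(len(prompts)), 2):
        if j not in dupes and hamming(fps[i], fps[j]) <= threshold:
            dupes.add(j)
    return dupes
```

A production pipeline would bucket fingerprints by bit-blocks rather than comparing all pairs, but the threshold semantics are the same: prompts whose 64-bit signatures differ in at most 10 bits are treated as near-duplicates and removed.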
👉 More information
🗞 Can Large Language Models Make Everyone Happy?
🧠 ArXiv: https://arxiv.org/abs/2602.11091
