Instruction Following Paradoxically Reduces LLM Task Performance

Researchers have discovered a surprising drawback to instructing Large Language Models (LLMs), revealing that attempts to better align these systems with human intent can paradoxically hinder their ability to solve tasks. Yunjia Qi, Hao Peng, and Xintong Shi, all from the Department of Computer Science and Technology at BNRist, alongside Xiaozhi Wang and Bin Xu, demonstrate this interference and introduce SUSTAINSCORE, a new metric for quantifying the extent to which instruction following diminishes task performance. Their experiments across mathematics, multi-hop question answering, and code generation, even with advanced models like Claude-Sonnet-4.5, show substantial performance drops when seemingly obvious constraints are added. This work is significant because it highlights a fundamental tension in LLM alignment and offers a valuable tool for evaluating and improving current post-training strategies, with code and data to be released for wider investigation.

The research team proposes a novel metric, SUSTAINSCORE, designed to quantify this interference between instruction adherence and task performance. This metric assesses the degree to which a model’s performance diminishes when presented with a self-evident constraint, one that is inherently satisfied by its original, successful output and directly extracted from it. Experiments conducted across mathematics, multi-hop question answering, and code generation reveal substantial performance drops when these self-evident constraints are introduced, even in advanced models like Claude-Sonnet-4.5.
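
The paper's exact formulation is not reproduced in this summary, so the following is a minimal Python sketch under one plausible reading: SUSTAINSCORE as the fraction of originally solved instances that remain solved once the self-evident constraint is inserted. The function name and data layout here are illustrative, not the authors' implementation.

```python
def sustain_score(solved_without: dict[str, bool],
                  solved_with: dict[str, bool]) -> float:
    """Hypothetical SUSTAINSCORE-style retention metric.

    solved_without maps instance id -> whether the model solved the task
    with the plain instruction; solved_with is the same check after the
    self-evident constraint is inserted. Only instances the model
    originally solved are scored, mirroring the paper's isolation of
    constraint interference from raw capability.
    """
    base_ids = [i for i, ok in solved_without.items() if ok]
    if not base_ids:
        raise ValueError("no originally solved instances to score")
    return sum(solved_with[i] for i in base_ids) / len(base_ids)
```

On this reading, the retention figures reported below would correspond to SUSTAINSCORE values of roughly 0.65 to 0.85.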

The study establishes the generality of this interference across constraint types and model scales, validating the robustness and broad applicability of the finding. Researchers also identified common failure patterns, observing that instances leading to failure allocate significantly more attention to the imposed constraints than successful attempts do. This suggests an overemphasis on adhering to the constraints, potentially at the expense of accurate task completion. Furthermore, the team used SUSTAINSCORE to conduct an initial investigation into the impact of different post-training paradigms on this interference, providing empirical observations on current alignment strategies and their effectiveness.

This work opens new avenues for understanding the limitations of instruction following and its potential drawbacks for LLM performance. The researchers meticulously isolated the interference by utilising only instances where the model initially solved the task successfully, ensuring that the capability to solve the task already existed within the model. They then constructed self-evident constraints directly from these successful outputs, eliminating confounding factors related to inherent task difficulty. Through extensive experimentation, the team demonstrated that even seemingly trivial constraints can lead to measurable performance losses, with models ranging from 30 billion to 70 billion parameters retaining only 65% to 85% of their original performance.

The findings highlight a general lack of robustness in current LLMs when operating under constraints, a gap not fully captured by existing instruction-following benchmarks. Analysis of failure modes revealed two primary error types: reasoning errors, where constraints disrupt the correct solution path, and output specification errors, where the core solution is correct but fails to meet specific output requirements.

These drops held across all five constraint types the researchers constructed from the models' original successful responses, including Structure constraints, and across LLMs of varying sizes and training paradigms, underscoring how fragile current models are under realistic, constrained conditions; one way such a constraint can be derived is sketched below.
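
Only the Structure type is named in this summary, and the authors' actual constraint templates may differ, but a self-evident structural constraint could in principle be read off a successful answer like this (purely illustrative):

```python
def derive_structure_constraint(successful_answer: str) -> str:
    """Illustrative derivation of a self-evident Structure constraint.

    The paragraph count is read off the model's own successful answer,
    so that answer satisfies the resulting constraint by construction;
    any performance drop under it therefore reflects interference, not
    missing capability.
    """
    paragraphs = [p for p in successful_answer.split("\n\n") if p.strip()]
    return f"Your response must contain exactly {len(paragraphs)} paragraphs."
```

Appending such a constraint to the original instruction should, ideally, leave the model's behaviour unchanged; SUSTAINSCORE measures how far reality falls short of that ideal.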

The work validated the reliability of SUSTAINSCORE through three comprehensive analyses, confirming that the performance decline was not an artifact of instruction design, such as increased context length, and that degradation was consistent across constraint categories. Adding constraints produced a sharp initial decline in performance over the first few constraints, after which the degradation plateaued.
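
A sketch of how such a curve could be traced, assuming constraints are appended one at a time and retention is recomputed at each step; the evaluate callback, which would run the model and check the answer, is task-specific and deliberately left abstract:

```python
def retention_curve(evaluate, instance_ids, constraints):
    """Hypothetical sweep reproducing the drop-then-plateau measurement.

    evaluate(instance_id, active_constraints) -> bool is assumed to run
    the model with the given constraints inserted into the instruction
    and check whether the task is still solved.
    """
    base = {i: evaluate(i, []) for i in instance_ids}
    solved = [i for i in instance_ids if base[i]]
    if not solved:
        raise ValueError("no originally solved instances")
    curve = []
    for k in range(1, len(constraints) + 1):
        kept = sum(evaluate(i, constraints[:k]) for i in solved)
        curve.append(kept / len(solved))
    return curve  # per the paper: sharp drop over the first few k, then flat
```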

To probe the mechanism behind these failures, the researchers introduced a Constraint Attention Score, which measures the proportion of attention allocated to the constraint tokens during generation. Comparing successful and failed cases revealed that failed attempts exhibited significantly higher constraint attention scores, suggesting that excessive focus on the constraints contributes to the performance degradation.
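
The paper's exact aggregation is not reproduced here, but assuming the score is the share of attention mass falling on the constraint's token span, a rough measurement with Hugging Face transformers might look like the following; the model, the prompt, and the last-token, mean-over-layers-and-heads aggregation are all illustrative choices, not the authors' setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative constraint-attention measurement; not the paper's code.
model_name = "gpt2"  # small stand-in; the paper studies far larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

task = "Solve the problem and explain your steps: what is 17 * 24?"
constraint = " Your answer must be exactly one sentence long."
inputs = tok(task + constraint, return_tensors="pt")

# Token positions occupied by the constraint (everything after the task).
# Note: BPE alignment at the boundary is approximate in general.
task_len = len(tok(task)["input_ids"])
seq_len = inputs["input_ids"].shape[1]
constraint_positions = list(range(task_len, seq_len))

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
# Average over layers and heads, then take the final token's attention
# row as a proxy for where the model "looks" as generation begins.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq,)
constraint_share = attn[constraint_positions].sum().item()
print(f"constraint attention share: {constraint_share:.3f}")
```

In the paper's comparison, failed generations showed a significantly higher share than successful ones, which motivates the overemphasis interpretation.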

Finally, the study examined the impact of different post-training paradigms, observing that Reinforcement Learning (RL) may enhance both task performance and robustness under constraints, whereas supervised fine-tuning on long chain-of-thought data, while boosting general performance, appears more susceptible to this degradation. The work formalises task robustness under constraints and provides practical insights for model alignment, encouraging the community to emphasise robustness and to exercise caution in instruction design so as not to inadvertently degrade core task performance.

Instruction Following Impairs Robust LLM Performance

Scientists have demonstrated that instruction following, intended to align large language models (LLMs) with human intent, can paradoxically hinder their ability to solve tasks. Researchers introduced a new metric, SUSTAINSCORE, to quantify this interference, measuring the decline in task performance when a self-evident constraint is added to an instruction. Experiments across mathematics, multi-hop question answering, and code generation revealed substantial performance drops even in advanced models like Claude-Sonnet-4.5, indicating a critical robustness gap. The findings suggest that LLMs struggle to maintain performance when faced with functional constraints that preserve the task goal but alter its execution.

Attention-based analyses indicate that the performance reduction may be linked to an excessive focus on these constraints rather than on the task itself. Initial investigations also explored how different post-training paradigms influence this interference, offering preliminary insights into alignment strategies. The authors acknowledge limitations in the scope of validation, currently confined to specific domains and the English language, and note that future research should explore the generalisability of SUSTAINSCORE across diverse tasks and languages.

Researchers caution that overly detailed instructions may inadvertently degrade model performance. They encourage the adoption of SUSTAINSCORE as a tool for developing more capable and reliable LLMs, with a focus on sustaining performance under realistic constraints. The metric represents a significant contribution towards evaluating LLM robustness and guiding the development of more effective alignment strategies for real-world applications.

👉 More information
🗞 On the Paradoxical Interference between Instruction-Following and Task Solving
🧠 ArXiv: https://arxiv.org/abs/2601.22047

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
