Language Models Achieve 14% Performance Boost with Prompt Order Optimisation

Scientists have long observed that large language models are surprisingly sensitive to how prompts are structured, yet the reasons behind this sensitivity have remained elusive. Hyunjong Ok from POSTECH and Jaeho Lee, alongside their colleagues, now reveal a critical limitation in how these models process information, demonstrating that presenting context before questions and options consistently improves multiple-choice question answering accuracy by over 14 percentage points. Their research pinpoints causal attention as the culprit: the standard causal mask prevents options from accessing vital contextual information when questions are presented first, effectively creating an information bottleneck. This discovery is significant because it highlights a fundamental architectural constraint on language model performance and suggests new avenues for improving their reasoning capabilities.

Prompt Ordering Significantly Impacts LLM Question Answering

Scientists have demonstrated a surprising sensitivity in large language models (LLMs) to the structure of prompts, revealing a performance gap of over 14 percentage points between placing context before questions and options (CQO) and the reverse order (QOC) in multiple-choice question answering. This consistent outperformance across diverse models and datasets prompted an in-depth investigation into the underlying mechanism. The research team systematically analysed architectural factors and pinpointed causal attention as the core cause: in QOC prompts, the causal masking inherent in decoder models prevents option tokens from accessing crucial context information, creating an information bottleneck. The study shows that this limitation hinders the model's ability to leverage contextual evidence when selecting answers, leading to a significant drop in accuracy. Researchers tested three competing hypotheses to explain the observed sensitivity, ultimately validating the causal attention mechanism and disproving alternative explanations related to biased training data or difficulties in recalling options within the prompt. Through carefully controlled experiments on 21 decoder-only LLMs, ranging from 0.5 billion to 9 billion parameters, the team consistently found that the QOC structure restricts information flow, forcing models to rely on prior assumptions rather than evidence from the provided context.

Attention Pruning Simulates Question-Option-First Prompting

Scientists investigated a performance disparity in large language models (LLMs) when answering multiple-choice questions, revealing a consistent 14-percentage-point advantage for prompts presenting context before questions and options (CQO) compared to those with questions and options first (QOC). The research team analysed 21 decoder-only models across four datasets (LogiQA, SciQ, RACE-M, and RACE-H) to pinpoint the cause of this sensitivity. To simulate the constraints of QOC prompts, the researchers implemented attention pruning, blocking option tokens from attending to context within the CQO framework; this was achieved by setting mask[i, j] = −∞ for all pairs where i ∈ Options and j ∈ Context, leaving all other attention unchanged. This intervention reduced average CQO accuracy from 69.26% to 42.46%, with consistent performance drops observed across model families, including Qwen (28.1 points), LLaMA (25.3 points), and Gemma (26.9 points).
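The pruning rule mask[i, j] = −∞ for i ∈ Options, j ∈ Context can be illustrated as an additive attention mask. The following is a minimal NumPy sketch, not the authors' implementation; the segment lengths and the helper name are invented for the example:

```python
import numpy as np

def build_pruned_mask(n_ctx, n_q, n_opt):
    """Additive attention mask for a CQO prompt (context, question,
    options) with option->context attention pruned, simulating the
    information bottleneck that QOC ordering creates."""
    n = n_ctx + n_q + n_opt
    # Standard causal mask: token i may attend only to positions j <= i.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    # Pruning: option rows may not attend to context columns, even
    # though causality alone would allow it in CQO order.
    opt_start = n_ctx + n_q
    mask[opt_start:, :n_ctx] = -np.inf
    return mask

mask = build_pruned_mask(n_ctx=4, n_q=2, n_opt=3)
assert mask[7, 4] == 0.0          # an option token still sees the question...
assert mask[7, 0] == -np.inf      # ...but no longer sees the context
```

Adding such a mask to the pre-softmax attention scores drives the pruned entries to zero weight, which is the mechanism the study uses to collapse CQO accuracy toward QOC levels.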

Further experimentation involved activation patching, a technique designed to restore context-awareness in QOC prompts. The study replaced option hidden states in QOC with the corresponding states computed under the CQO template, targeting layers 12 and 23 in 24-layer models, with layer indices normalised by network depth for other architectures. This patching, applied exclusively to option tokens and verified through exact string matching, increased QOC accuracy by an average of 6.0 points, with greater improvements seen in models that initially exhibited larger performance gaps. The team also explored a simpler approach, option repetition (QOCO), in which the options are repeated after the context, allowing the second copy to attend to the context under the causal mask without modifying the model internally.
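The patching step can be sketched on toy arrays. Everything below (shapes, token positions, layer choices) is illustrative; the real intervention operates on a transformer's hidden states, not random tensors:

```python
import numpy as np

def patch_option_states(h_qoc, h_cqo, opt_idx_qoc, opt_idx_cqo, layers):
    """Replace option-token hidden states from the QOC run with the
    corresponding context-aware states from the CQO run at the chosen
    layers. h_* have shape [layer, position, dim]."""
    h = h_qoc.copy()
    for layer in layers:
        h[layer, opt_idx_qoc, :] = h_cqo[layer, opt_idx_cqo, :]
    return h

rng = np.random.default_rng(0)
h_qoc = rng.normal(size=(4, 6, 8))   # toy: 4 layers, 6 tokens, dim 8
h_cqo = rng.normal(size=(4, 6, 8))
# Toy positions: options sit at 1-2 in QOC order but at 4-5 in CQO order.
patched = patch_option_states(h_qoc, h_cqo, [1, 2], [4, 5], layers=[2, 3])
assert np.allclose(patched[2, 1], h_cqo[2, 4])   # patched layer and token
assert np.allclose(patched[0, 1], h_qoc[0, 1])   # earlier layer untouched
```

The key design point, as in the study, is that only option positions are overwritten: every other token keeps its original QOC computation, isolating the effect of context-aware option representations.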

This technique yielded an 8.2-point improvement in QOC accuracy, partially bridging the performance gap. The work also employed a rigorous comparative analysis, demonstrating that encoder models, unlike their decoder counterparts, do not exhibit this prompt-ordering sensitivity. Attention pathway analysis, detailed in Table 2, further supported the hypothesis that causal masking is the core mechanism driving the observed difference. By systematically intervening on model architecture and attention flows, the study established a clear link between causal attention and the information bottleneck in QOC prompts, where contextual information becomes inaccessible to the options. This methodology provides mechanistic insight into prompt sensitivity and offers practical guidance for optimising LLM performance in question answering tasks.
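The QOCO variant is purely a prompt-side fix. A minimal sketch of the template assembly (the field labels and helper name are my own, not taken from the paper):

```python
def build_prompt(order, context, question, options):
    """Assemble a multiple-choice prompt in the given field order.
    'QOCO' repeats the options after the context so that, under a
    causal mask, the second copy of the options can attend to the
    context tokens."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    blocks = {"C": f"Context: {context}",
              "Q": f"Question: {question}",
              "O": f"Options:\n{opts}"}
    return "\n\n".join(blocks[k] for k in order)

qoco = build_prompt("QOCO",
                    "Ada Lovelace wrote the first program.",
                    "Who wrote the first program?",
                    ["Ada Lovelace", "Alan Turing"])
assert qoco.count("Options:") == 2                        # options appear twice
assert qoco.index("Context:") < qoco.rindex("Options:")   # second copy follows context
```

Because the repetition happens in token space, it works with any decoder-only model as-is, at the cost of a slightly longer prompt.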

Context-First Prompts Significantly Boost LLM Question Answering Performance

Scientists have discovered a significant sensitivity in large language models (LLMs) to prompt structure, specifically the order of context, question, and options in multiple-choice question answering. The research team investigated this phenomenon, revealing that presenting context before the question and options (CQO) consistently outperforms the reverse order (QOC) by over 14 percentage points across various models and datasets. The gap was measured in accuracy, quantifying the difference between the two orderings as Δ = Acc_CQO − Acc_QOC. Experiments revealed that the core cause of this sensitivity lies in the causal masking inherent in many LLM architectures.
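The gap metric Δ = Acc_CQO − Acc_QOC is straightforward to compute from per-example predictions; a small sketch with invented toy predictions:

```python
def ordering_gap(preds_cqo, preds_qoc, golds):
    """Delta = Acc_CQO - Acc_QOC, in percentage points."""
    n_cqo = sum(p == g for p, g in zip(preds_cqo, golds))
    n_qoc = sum(p == g for p, g in zip(preds_qoc, golds))
    return 100.0 * (n_cqo - n_qoc) / len(golds)

golds = ["A", "B", "C", "D", "A"]
gap = ordering_gap(["A", "B", "C", "D", "B"],   # 4/5 correct with CQO
                   ["A", "D", "B", "D", "B"],   # 2/5 correct with QOC
                   golds)
assert abs(gap - 40.0) < 1e-9   # 80% - 40% = 40 percentage points
```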

In QOC prompts, the causal mask prevents option tokens from attending to context tokens, effectively creating an information bottleneck where crucial contextual information becomes inaccessible to the options. The team observed a 14.72-point performance gap between CQO and QOC for decoder-only models, while encoder-decoder models exhibited a much smaller 2.30-point gap, and encoder-only models showed a near-zero 0.02-point difference. These results strongly implicate causal masking as the primary driver of the disparity. Further analysis demonstrated that this is not simply a matter of models failing to recall options: QOC achieved similar, and sometimes even higher, option recall accuracy compared to CQO, with recall reaching as high as 96.9% for CQO and 95.2% for QOC.

Tests showed that removing the context entirely (yielding a QO prompt) produced performance nearly identical to QOC, with accuracy at 52.8%, confirming that the context is effectively ignored in the QOC format. The team measured attention weights across layers, finding that option tokens place zero attention on context tokens in QOC because of the causal mask. Detailed attention analysis showed that in CQO, attention to the options declines as context is integrated, whereas in QOC it rises. Gradient-based attribution further confirmed this, revealing a context attribution ratio of 0.797 for CQO versus only 0.335 for QOC, demonstrating that context tokens contribute substantially more to predictions when presented before the question and options. This work delivers a deeper understanding of LLM behaviour and has practical implications for prompt engineering.
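An attribution ratio of the kind described can be sketched with a gradient × input scheme in NumPy: each token's contribution is the dot product of its input with the gradient of the answer score, and the ratio is the share credited to context tokens. The arrays and indices below are toy values, not the paper's measurements:

```python
import numpy as np

def context_attribution_ratio(grads, inputs, context_idx):
    """Gradient-x-input attribution: each token's contribution is the
    absolute dot product of its embedding with the gradient of the
    answer score; the ratio is the share credited to context tokens.
    grads and inputs have shape [position, dim]."""
    contrib = np.abs((grads * inputs).sum(axis=-1))
    return contrib[context_idx].sum() / contrib.sum()

# Toy values: 4 tokens, dim 2; tokens 0-1 play the role of "context".
inputs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
grads  = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.5, 0.0]])
ratio = context_attribution_ratio(grads, inputs, context_idx=[0, 1])
assert abs(ratio - 0.8) < 1e-9   # context tokens carry 80% of attribution
```

Under this kind of measure, a high ratio (as in CQO) means the prediction genuinely draws on the context, while a low ratio (as in QOC) indicates the context is being bypassed.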

👉 More information
🗞 Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.14152

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
