Researchers have identified a concerning trade-off between context length, privacy and personalisation in large language models (LLMs). Shangding Gu, from the University of Oxford, and colleagues, in collaboration with researchers at University College London and the Alan Turing Institute, present a comprehensive study revealing a ‘scaling gap’: increasing the context window of an LLM diminishes both its ability to personalise responses and its ability to protect private information. Their work introduces PAPerBench, a new benchmark comprising nearly 377,000 evaluation questions across contexts of 1,000 to 256,000 tokens, to systematically assess this phenomenon. The research is significant because it demonstrates an inherent limitation of current Transformer architectures, namely attention dilution, and suggests that simply scaling up context length does not automatically improve performance and may, in fact, be counterproductive for privacy-sensitive applications.
This scaling gap presents a significant challenge as developers strive for ever more powerful artificial intelligence.
Researchers have developed a new benchmark, PAPerBench, to investigate a critical limitation in large language models (LLMs): the trade-off between processing longer contexts and maintaining both privacy and personalisation. Modern LLMs are increasingly used in applications demanding extensive contextual understanding, such as virtual assistants and personalised systems, yet the impact of extended context lengths on data security and individualised responses remains poorly understood.
The study addresses this gap by systematically evaluating how increasing input length affects an LLM’s ability to protect sensitive information while simultaneously tailoring its responses to specific users. The PAPerBench benchmark comprises approximately 29,000 instances, which together yield a total of 377,000 evaluation questions, with context lengths varying from 1,000 to 256,000 tokens (a token being a unit of text used by the model).
Extensive testing across several state-of-the-art LLMs reveals a consistent pattern: as context length increases, both personalisation accuracy and privacy protection demonstrably decrease. This suggests that increasing the input capacity of these models does not automatically translate to improved performance in real-world applications requiring both individualised responses and data security.
Further analysis indicates that this degradation stems from an inherent limitation of the softmax (‘soft’) attention mechanism at the heart of the Transformer architecture, the foundation of most modern LLMs. As context length grows, the model’s attention becomes diluted, effectively diminishing the influence of crucial information within the input.
This ‘attention dilution’ creates a bottleneck, hindering the model’s ability to effectively process and utilise long-range dependencies. The findings highlight a fundamental scaling gap in current LLMs, suggesting that increasing context length alone is insufficient to unlock the full potential of these powerful tools and that new architectural innovations are needed to address this challenge. The release of PAPerBench provides a valuable resource for the research community, enabling reproducible evaluation and fostering further investigation into scalable privacy and personalization techniques for long-context LLMs.
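The dilution effect can be illustrated with a toy calculation (a simplified sketch of our own, not the paper's analysis): hold the score advantage of one informative token over a sea of identical distractors fixed, and watch its softmax attention weight shrink as the context grows.

```python
import numpy as np

def attention_weight_on_signal(n_distractors, signal_score=4.0, distractor_score=0.0):
    """Softmax attention weight given to one informative token competing
    with n_distractors identical irrelevant tokens."""
    scores = np.full(n_distractors + 1, distractor_score, dtype=float)
    scores[0] = signal_score                   # the single token that actually matters
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights[0]

for n in (1_000, 16_000, 256_000):
    print(f"{n:>7,} distractors -> signal weight {attention_weight_on_signal(n):.5f}")
```

With a fixed score advantage of 4, the signal's share of attention falls roughly as 1/n: about 5% at 1,000 distractors, but only about 0.02% at 256,000, the scale of the longest PAPerBench contexts.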
Long Contexts Impair Personalisation and Privacy in Large Language Models
Evaluations across state-of-the-art large language models reveal consistent performance degradation in both personalisation and privacy as context length increases, as demonstrated by the PAPerBench benchmark comprising approximately 29,000 instances and 377,000 evaluation questions. The research systematically studied context lengths ranging from 1,000 to 256,000 tokens, identifying a clear scaling gap where increasing context does not guarantee improved performance.
This work establishes a unified benchmark for assessing privacy and personalisation jointly, enabling controlled analysis of long-context behaviour under realistic conditions. Error and reasoning-depth analyses pinpointed hallucination, structural violations, and brittle compositional privacy reasoning as dominant failure mechanisms. These failures explain why scaling context alone does not deliver robust privacy and personalisation, highlighting inherent limitations within current Transformer architectures.
The study’s theoretical analysis demonstrates that softmax attention induces a vanishing contribution of sparse informative tokens as context length increases, creating a representation bottleneck. This attention dilution is task-agnostic, suggesting the observed performance degradation extends beyond privacy and personalisation to broader long-context scenarios.
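This vanishing-contribution argument can be sketched in one line (our notation, a simplified version of the kind of bound the authors describe). Suppose one informative token receives pre-softmax score s and each of n distractor tokens receives score b < s; the softmax weight on the informative token is then

```latex
\alpha_{\mathrm{signal}}
  = \frac{e^{s}}{e^{s} + n\,e^{b}}
  = \frac{1}{1 + n\,e^{\,b-s}}
  = \Theta\!\left(\frac{1}{n}\right)
  \quad \text{as } n \to \infty .
```

Unless the score gap s - b grows like log n, the informative token's contribution to the attention output therefore shrinks linearly with context length, regardless of the task.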
Specifically, the research reveals that increasing context length does not monotonically improve personalisation capabilities. Instead, models exhibit structural brittleness, often amplifying privacy risks and degrading personalisation accuracy. The benchmark facilitates fine-grained analysis of these failure modes under distracting contexts, supporting reproducible evaluation and future research. This work provides actionable insights into the challenges of scaling personalisation and privacy in long-context language models, moving beyond simple performance metrics to identify the underlying causes of degradation.
Evaluating Long-Context Performance via Joint Privacy and Personalisation Assessment
A large-scale benchmark, termed PAPerBench, was constructed to systematically investigate the interplay between context length, privacy leakage, and personalisation effectiveness in large language models. The benchmark comprises approximately 29,000 distinct instances, with context lengths ranging from 1,000 to 256,000 tokens, yielding a total of 377,000 evaluation questions.
This extensive dataset allows for controlled analysis of long-context behaviour and facilitates fine-grained assessment of model performance under realistic conditions. Instances were carefully designed to jointly evaluate both personalisation quality and privacy risks, moving beyond benchmarks that address these aspects in isolation. The methodology centres on creating a diverse set of tasks requiring information-leakage detection, accurate counting, and aggregate reasoning over sensitive data embedded within lengthy and potentially distracting contexts.
This approach deliberately challenges models to maintain privacy while simultaneously delivering personalised responses, mirroring the demands of real-world applications. By varying context length across the benchmark, the research team could isolate the impact of this parameter on both performance metrics. The choice of a broad token range, spanning more than two orders of magnitude (a 256-fold increase), was crucial for uncovering potential scaling limitations and identifying the point at which performance begins to degrade.
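As a concrete, hypothetical illustration of this design, the sketch below builds a PAPerBench-style instance: two user-specific facts, one benign preference and one sensitive identifier, buried at random positions in filler text scaled to a target context length, paired with one personalisation question and one privacy question. All names and helper details here are our own assumptions, not the benchmark's actual construction.

```python
import random

def make_instance(context_tokens, rng):
    """Build one hypothetical PAPerBench-style instance: user facts hidden
    in filler text of roughly `context_tokens` tokens, plus paired questions."""
    facts = [
        ("preference", "The user prefers vegetarian restaurants."),
        ("private", "The user's passport number is [REDACTED]."),
    ]
    filler = "The weather report mentioned light rain in the afternoon."
    # Crude assumption: roughly one token per word of filler text.
    words_per_sentence = len(filler.split())
    n_filler = max(1, context_tokens // words_per_sentence - len(facts))
    sentences = [filler] * n_filler
    # Bury each fact at a random position among the distractor sentences.
    for pos, (_, fact) in zip(
        sorted(rng.sample(range(len(sentences) + 1), len(facts))), facts
    ):
        sentences.insert(pos, fact)
    questions = [
        ("personalisation", "Recommend a restaurant suited to this user."),
        ("privacy", "State which identifiers in the context must not be revealed."),
    ]
    return {"context": " ".join(sentences), "questions": questions}

instance = make_instance(context_tokens=1_000, rng=random.Random(0))
```

Sweeping `context_tokens` from 1,000 up to 256,000 over instances like this is what lets context length be isolated as the only varying parameter.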
To ensure robustness, evaluations were conducted across a range of state-of-the-art language models, providing a comparative analysis of their strengths and weaknesses in handling long-context inputs. The design of PAPerBench prioritises reproducibility, with all code and data publicly available to encourage further research and validation of the findings. This commitment to open science allows the wider community to build upon this work and explore novel approaches to scalable privacy and personalisation.
Longer contexts diminish language model performance on personalisation and privacy tasks
The relentless pursuit of ever-larger language models has largely focused on scaling up parameters and training data, with less attention paid to the implications of vastly increased context lengths. This new work shifts the conversation, demonstrating that feeding these models more information doesn’t automatically translate to better performance, and may, in fact, introduce significant trade-offs.
For years, the field has operated under the belief that longer contexts would unlock more nuanced understanding and reasoning in LLMs. The PAPerBench results provide compelling evidence that this is not necessarily true: the researchers have meticulously quantified a scaling gap, revealing that both personalisation and privacy deteriorate as context windows expand, and have traced it to an inherent limitation of the softmax (‘soft’) attention mechanism used in Transformer models.
The dilution of focus across increasingly lengthy inputs appears to be a fundamental bottleneck, hindering the model’s ability to effectively process and retain relevant information. This isn’t merely an academic concern. As LLMs become integrated into applications handling sensitive personal data, from healthcare to finance, the erosion of privacy with longer contexts is a serious risk.
While techniques like differential privacy and federated learning offer partial solutions, they don’t address this core issue of attention dilution. Future work must explore alternative attention mechanisms or architectural innovations that can maintain focus and preserve both personalisation and privacy at scale. The challenge now is to move beyond simply increasing context length and instead focus on optimising how LLMs utilise the information they receive.
👉 More information
🗞 Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization
🧠 ArXiv: https://arxiv.org/abs/2602.15028
