Researchers are increasingly questioning the assumption that simply scaling up the parameters of large language models (LLMs) guarantees improved performance. Ruishan Guo, Yibing Liu, and Guoxin Ma, from Baidu Inc. and Tsinghua University, alongside Yan Wang, Yueyang Zhang, Long Xia et al., demonstrate a counterintuitive ‘Size-Fidelity Paradox’ in which increasing the size of the compression component in a compressor-decoder setup actually reduces the faithfulness of reconstructed contexts, even as training loss decreases. This work, conducted across LLMs ranging from 0.6 billion to 90 billion parameters, reveals that larger models are prone to overwriting factual information with their own prior knowledge and to drifting towards paraphrase rather than accurate reproduction. By isolating the effects of model size, the study shows that it is not parameter count alone but the amplified semantic capacity and generative uncertainty that accompany scaling which underpin this breakdown of the expected scaling laws for faithful open-ended generation.
Larger language models exhibit reduced fidelity during lossy context compression
Scientists have uncovered a counterintuitive paradox in the scaling of large language models, demonstrating that increasing model size can actually diminish the faithfulness of reconstructed contexts during lossy context compression. This research, spanning models from 0.6 billion to 90 billion parameters, reveals a “Size-Fidelity Paradox” where larger compressors, despite achieving lower training loss, exhibit reduced accuracy in preserving original information.
The study identifies two primary factors driving this phenomenon: knowledge overwriting, where models substitute source facts with their own pre-existing beliefs (for example, replacing “the white strawberry” with “the red strawberry”), and semantic drift, characterised by paraphrasing or restructuring content instead of reproducing it verbatim. By controlling for model size and focusing on the properties of the compressed context representations, the researchers pinpointed that the issue is not simply parameter count.
Instead, excessive semantic capacity and amplified generative uncertainty accompanying scaling are the core culprits. The increased rank of context embeddings facilitates the intrusion of prior knowledge, while higher entropy in token prediction distributions promotes content rewriting. This work complements existing evaluations of context compression, highlighting a breakdown in the expected scaling laws for faithful preservation during open-ended generation.
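For readers who want these quantities pinned down, the two measures invoked here have standard formulations; a plausible reading, which may differ in detail from the paper’s exact definitions, is sketched below, where Z is the matrix of compressed context embeddings with singular values σᵢ and z is the compressed context conditioning the decoder.

```latex
% Effective rank of the context-embedding matrix Z with singular values \sigma_i:
\operatorname{erank}(Z) \;=\; \exp\!\Big(-\sum_i p_i \log p_i\Big),
\qquad p_i \;=\; \frac{\sigma_i}{\sum_j \sigma_j}

% Entropy of the decoder's next-token distribution given the compressed context z:
H_t \;=\; -\sum_{v \in \mathcal{V}} p_\theta(v \mid y_{<t},\, z)\,\log p_\theta(v \mid y_{<t},\, z)
```

A higher effective rank means the compressed representation is spread across more directions of the embedding space, while H_t measures how uncertain the model is about each token it emits during reconstruction.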
Detailed analysis involved the design of two diagnostic question-answering tasks specifically isolating knowledge overwriting and semantic drift. Extensive experiments across Qwen and Llama model families systematically confirmed both the existence and generality of the Size-Fidelity Paradox. Surface-level reconstruction scores remained high as model size increased, yet question-answering accuracy, a direct measure of fidelity, significantly degraded, revealing a divergence not captured by conventional metrics.
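To make the two diagnostic tasks concrete, the sketch below shows what such probe records might look like. The field names and wording are illustrative rather than taken from the paper’s dataset; the examples are adapted from the cases quoted in this article (the white strawberry, and the bee-and-flower pollination roles).

```python
# Hypothetical probe records for the two diagnostic QA tasks described above.
# Field names and wording are illustrative, not the paper's data format.
probes = [
    {   # Knowledge overwriting: the source contradicts a common prior belief.
        "type": "knowledge_overwriting",
        "source": "The white strawberry is prized for its mild, pineapple-like flavour.",
        "question": "What colour is the strawberry described in the passage?",
        "answer_from_source": "white",
        "likely_prior_answer": "red",
    },
    {   # Semantic drift: paraphrase can silently flip who does what to whom.
        "type": "semantic_drift",
        "source": "The bee shakes the flower, and the flower releases pollen onto the bee.",
        "question": "Which participant releases the pollen, the bee or the flower?",
        "answer_from_source": "the flower",
    },
]

# A reconstruction counts as faithful only if answering the question against
# the reconstructed passage still yields answer_from_source.
```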
Further mechanistic investigation revealed that semantic capacity and generative uncertainty are key drivers of this paradox. Larger models exhibit higher effective rank, dispersing representations across broader semantic subspaces, increasing the potential for interference from parametric knowledge. Simultaneously, lower conditional entropy reflects confident generative priors that readily override ambiguous encodings, creating a fundamental trade-off between complex reasoning and rigid fidelity. This research establishes a principled evaluation framework and underscores the need to reconsider scaling strategies for applications demanding faithful information preservation.
Evaluating Faithful Reconstruction via Compressor-Decoder Experiments and Error Analysis
Researchers investigated a counterintuitive phenomenon termed the Size-Fidelity Paradox within the context of language model compression. The study employed a compressor-decoder setup to assess how model size impacts faithful context reconstruction, utilising models ranging from 0.6 billion to 90 billion parameters.
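As an illustration only, the following sketch shows one way such a compressor-decoder pair could be wired together from off-the-shelf checkpoints. The model names, the mean-pooling compression into a handful of slot vectors, the linear adapter, and the prompt are all assumptions made for the example, not the authors’ implementation; the adapter and compression scheme would need to be trained with a reconstruction objective before the output became meaningful.

```python
# Minimal sketch of a compressor-decoder setup (illustrative assumptions
# throughout; not the paper's code). A compressor LLM encodes the context,
# its hidden states are pooled into k "slot" vectors, and a decoder LLM
# reconstructs the passage conditioned only on those slots.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoints; the paper scales compressors from 0.6B to 90B
# parameters and also pairs Llama compressors with Qwen decoders.
COMPRESSOR_NAME = "Qwen/Qwen2.5-0.5B"
DECODER_NAME = "Qwen/Qwen2.5-0.5B"

tok = AutoTokenizer.from_pretrained(DECODER_NAME)
compressor = AutoModelForCausalLM.from_pretrained(COMPRESSOR_NAME)
decoder = AutoModelForCausalLM.from_pretrained(DECODER_NAME)

context = "The blue-banded bee shakes the flower so that the flower releases its pollen."
ids = tok(context, return_tensors="pt").input_ids

with torch.no_grad():
    # Last-layer hidden states of the compressor for the context tokens.
    hidden = compressor(ids, output_hidden_states=True).hidden_states[-1]  # (1, seq, d)

# Lossy compression: pool the token sequence into k slot vectors (here k = 4).
k = 4
slots = torch.stack([chunk.mean(dim=1) for chunk in hidden.chunk(k, dim=1)], dim=1)

# Hypothetical adapter mapping compressor width to decoder width; in practice
# it (and the compression step) would be trained on a reconstruction objective.
adapter = torch.nn.Linear(slots.size(-1), decoder.config.hidden_size)
memory = adapter(slots)  # (1, k, d_decoder)

# Reconstruct conditioned only on the k compressed slots plus a short prompt.
prompt_ids = tok("Reconstruct the passage: ", return_tensors="pt").input_ids
prompt_emb = decoder.get_input_embeddings()(prompt_ids)
inputs_embeds = torch.cat([memory, prompt_emb], dim=1)

with torch.no_grad():
    out = decoder.generate(inputs_embeds=inputs_embeds, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```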
Experiments involved compressing and reconstructing factual passages, then evaluating both surface-level reconstruction accuracy and question answering performance to detect errors. Qualitative analysis focused on instances where larger compressors exhibited failures in two modes: knowledge overwriting and semantic drift.
Knowledge overwriting manifested as the replacement of specific details with the model’s prior beliefs, such as identifying a “blue-banded bee” as a “honey bee”. Semantic drift involved distortions of causal relationships, exemplified by reversing the roles of bee and flower in pollination descriptions.
Quantitative analysis revealed that while reconstruction scores remained high as model size increased, question answering accuracy, a measure of factual preservation, significantly decreased. This divergence indicated that larger compressors prioritised their own semantic capacity over accurately preserving the original context.
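The divergence is easy to see with a toy example. The snippet below uses token-level F1 as a stand-in for the surface reconstruction score (the paper’s actual metrics may differ) and a single QA probe built from the bee example above: one overwritten fact barely moves the surface score but flips the QA outcome.

```python
# Illustrative only: why a surface reconstruction score can stay high while
# QA accuracy collapses. Token-level F1 is used as a stand-in surface metric.
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

source = "The blue-banded bee shakes the flower to release its pollen."
reconstruction = "The honey bee shakes the flower to release its pollen."  # knowledge overwriting

qa_probe = {"question": "Which bee shakes the flower?", "answer": "blue-banded bee"}

print(f"surface F1: {token_f1(source, reconstruction):.2f}")        # 0.90 -- looks faithful
print("QA correct:", qa_probe["answer"] in reconstruction.lower())  # False -- the fact is gone
```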
To understand the underlying causes, researchers probed the internal properties of the compressor models. They discovered that increased model size correlated with higher effective rank in context embeddings, facilitating the intrusion of prior knowledge. Simultaneously, larger models demonstrated lower conditional entropy, indicating stronger generative priors that promoted rewriting rather than verbatim reproduction.
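A sketch of the two probes is given below, under assumed but standard definitions (the paper’s exact formulations may differ): the effective rank of a context-embedding matrix, computed as the exponential of the entropy of its normalised singular values, and the mean entropy of the decoder’s next-token distributions. The random arrays stand in for real compressor embeddings and decoder logits.

```python
# Probes for semantic capacity and generative uncertainty, under assumed
# (standard) definitions; random data stands in for real model outputs.
import numpy as np

def effective_rank(embeddings: np.ndarray) -> float:
    """embeddings: (num_tokens, hidden_dim) compressed context representations."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def mean_prediction_entropy(probs: np.ndarray) -> float:
    """probs: (num_steps, vocab_size) next-token prediction distributions."""
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
ctx = rng.normal(size=(128, 1024))      # stand-in for context embeddings
logits = rng.normal(size=(64, 32000))   # stand-in for decoder logits over a 32k vocab
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(f"effective rank: {effective_rank(ctx):.1f}")
print(f"mean next-token entropy: {mean_prediction_entropy(probs):.2f} nats")
```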
This work highlights a fundamental trade-off between the complex reasoning abilities enabled by scale and the rigid fidelity required for faithful reconstruction, challenging conventional scaling laws in open-ended generation. The study introduces diagnostic tasks to specifically assess knowledge overwriting and semantic drift, offering a more robust evaluation framework for context compression techniques.
Larger language models exhibit reduced reconstruction fidelity and increased knowledge alteration
Across models ranging from 0.6B to 90B parameters, research demonstrates a Size-Fidelity Paradox in context compression where increasing compressor size can reduce the faithfulness of reconstructed contexts despite decreasing training loss. Diagnostic question answering tasks were designed to isolate knowledge overwriting and semantic drift, revealing that larger compressors exhibit these failure modes more frequently than smaller models.
Experiments across Qwen and Llama language model families systematically confirmed the existence and generality of this paradox. Specifically, the study observed that larger compressors replaced source facts with their own prior beliefs, exemplified by substituting “blue-banded bee” with “honey bee” in reconstruction tasks.
Furthermore, semantic drift occurred as larger models paraphrased or restructured content instead of reproducing it verbatim. Quantitative analysis revealed a divergence between reconstruction scores and question answering accuracy as model size increased, indicating a prioritisation of semantic capacity over faithful preservation of the source context.
The research established that the increased rank of context embeddings facilitates prior knowledge intrusion, while higher entropy over token prediction distributions promotes rewriting of the original content. These findings underpin a breakdown in scaling laws for faithful preservation during open-ended generation, suggesting that parameter count is not the sole determinant of performance. The work highlights that excessive semantic capacity and amplified generative uncertainty accompanying scaling are key factors contributing to the observed paradox.
Knowledge overwriting and semantic drift explain diminished fidelity in large language models
The observation of a Size-Fidelity Paradox challenges the assumption that simply increasing the number of parameters in a language model will automatically improve its performance in lossy context compression. Experiments across models ranging from 0.6 billion to 90 billion parameters reveal that larger compressors, while achieving lower training loss, actually produce less faithful compressed representations of the input context.
This counterintuitive finding indicates that scaling parameters does not consistently lead to better results when dealing with compressed information. Further investigation identifies two primary factors driving this paradox: knowledge overwriting and semantic drift. Larger models demonstrate a tendency to replace factual details with their own pre-existing knowledge, and to paraphrase or restructure content rather than reproduce it verbatim.
These issues are not inherent to specific model architectures, as the paradox persists even when pairing compressors from the Llama family with decoders from the Qwen family, suggesting the problem lies within the scaling of the compressor’s representation space itself. The research introduces diagnostic question answering tasks to isolate and measure these fidelity degradations, offering a more nuanced assessment of context compression effectiveness than standard reconstruction evaluations.
The authors acknowledge that the observed fidelity loss is intrinsic to the scaled compressor’s representation space. Future work could focus on developing design principles that address these limitations, potentially exploring methods to constrain semantic capacity or reduce generative uncertainty in larger models. These findings highlight that scaling laws are not universally applicable and that alternative approaches may be necessary to achieve desired behaviours in specific domains like lossy context compression.
👉 More information
🗞 When Less is More: The LLM Scaling Paradox in Context Compression
🧠 arXiv: https://arxiv.org/abs/2602.09789
