Compression Advances Intelligence in Large Language Models and Multimedia Systems

Researchers are increasingly recognising a powerful link between data compression and artificial intelligence, particularly within the rapidly evolving field of large language and multi-modal models. Xin Jin, Jinming Liu, and Yuntao Wei, all from the Eastern Institute of Technology, Ningbo, alongside Junyan Lin, Zhicheng Wang, Jianguo Huang, and colleagues, demonstrate how efficient compression techniques are not merely about reducing file sizes but fundamentally underpin model intelligence. This paper offers a comprehensive unification of established visual coding, the basis of technologies like H.264 and H.265, with the emerging field of visual token technology used in generative AI, revealing shared principles governing the trade-off between information fidelity and computational cost. By bridging these two areas, the authors not only synthesise insights from both but also forecast next-generation codecs and tokens, potentially paving the way for a standardised, highly efficient token technology applicable across a broad spectrum of intelligent tasks, from AIGC to embodied AI.

Compression as an intelligence benchmark for models

Scientists increasingly recognise a compelling principle: “Compression Tells Intelligence”. This perspective suggests that intelligence fundamentally relies on forming compact and effective representations of the world by identifying and exploiting patterns within data. The recent success of Large Language Models (LLMs) [2, 3] strongly validates this concept, with their capabilities stemming from their ability to compress linguistic data into powerful internal representations. Consequently, compression efficiency has evolved from a simple engineering metric to a benchmark for a model’s depth of understanding and intelligence.
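To make the compression-as-intelligence link concrete, here is a minimal sketch (not from the paper) of how a predictive model's log-likelihood translates into code length; the `predict_proba` callable is a hypothetical stand-in for any language model's next-token probability.

```python
import math

def bits_to_encode(tokens, predict_proba):
    """Total bits an arithmetic coder would need, given a predictive model.

    `predict_proba(context, token)` is a hypothetical callable returning the
    model's probability of `token` given the preceding `context`; better
    predictions mean fewer bits, so average bits-per-token doubles as a
    crude compression (and, on this view, intelligence) score.
    """
    total_bits = 0.0
    for i, tok in enumerate(tokens):
        p = predict_proba(tokens[:i], tok)   # model's belief in the true next token
        total_bits += -math.log2(p)          # Shannon code length for that token
    return total_bits

# Toy usage: a uniform model over a 256-symbol alphabet needs 8 bits/symbol,
# while any model that predicts better than chance compresses below that.
uniform = lambda ctx, tok: 1.0 / 256
data = list(b"compression tells intelligence")
print(bits_to_encode(data, uniform) / len(data))  # -> 8.0 bits per byte
```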

Classical visual coding, grounded in information theory, has a long history of success, producing international standards from JPEG to H.265/HEVC. These technologies excel at minimising statistical redundancy to achieve high pixel-level fidelity, forming the foundation of our modern multimedia ecosystem. Visual tokens, emerging alongside generative AI and Multimodal Large Language Models (MLLMs), prioritise extracting crucial semantic information for downstream tasks such as visual question answering or image generation. Both classical coding and visual token technology share the same objective: to find an optimal balance between information fidelity and computational cost.

These two technical families have evolved almost independently, pursued by different academic communities (Signal Processing vs. Machine Learning), based on different theoretical principles (Information Theory vs. Representation Learning), and evaluated by different criteria (e.g., visual quality vs. downstream task accuracy). Classical coding aims to reduce data size for efficient storage and transmission, saving bandwidth. In contrast, visual token technology seeks to create compact sequences of representations to reduce the computational cost of learning and processing by large-scale models like Transformers.

Classical codecs, optimised to minimise bit-rate against signal fidelity, offer unparalleled compression efficiency, but their representations are not inherently designed for direct use in AI model architectures. Conversely, visual tokens are designed to produce compact feature sets that reduce computational load and improve model performance, yet they currently lack the theoretical rigour and compression rates of traditional methods. Bridging this gap is essential for a deeper understanding of the trade-off between compression efficiency and model performance, paving the way for the next generation of visual intelligence. The authors also demonstrate the significant potential of compression technology, particularly the rapidly developing field of visual token technology, for system-level real-world applications, including MLLMs, AIGC, and embodied AI.

Transform, Quantisation and Entropy Coding Techniques are fundamental

Classical visual coding seeks to create compact representations of visual data, minimising the required bits while preserving essential information, thereby facilitating efficient storage and transmission across digital platforms. All realisations share three core primitives: transformation for decorrelation and energy compaction, quantisation for discretisation and rate control, and entropy coding for lossless compression of syntax symbols. These core techniques underpin nearly all visual coding systems, both traditional and learned. Transformation is employed to decorrelate the visual data and compact its energy into a smaller set of coefficients.
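As a hedged illustration of these three primitives, the following sketch runs a toy 1-D signal through SciPy's DCT, a uniform quantiser, and an empirical-entropy estimate of the coding cost; it does not follow any particular standard, and the step size is arbitrary.

```python
import numpy as np
from scipy.fft import dct, idct

def transform_quantise_entropy(signal, step=8.0):
    """Illustrative transform coding of a 1-D signal (not a real codec).

    1. Transform: the DCT decorrelates the samples and compacts energy
       into a few low-frequency coefficients.
    2. Quantisation: dividing by `step` and rounding discards precision;
       this is the lossy stage and controls the bitrate.
    3. Entropy estimate: the empirical entropy of the quantised symbols
       approximates the bits an entropy coder (Huffman/arithmetic) needs.
    """
    coeffs = dct(signal, norm="ortho")                    # transform
    symbols = np.round(coeffs / step).astype(int)         # quantisation
    _, counts = np.unique(symbols, return_counts=True)
    probs = counts / counts.sum()
    bits_per_symbol = -(probs * np.log2(probs)).sum()     # entropy-coding estimate
    recon = idct(symbols * step, norm="ortho")            # decoder side
    return bits_per_symbol, float(np.mean((signal - recon) ** 2))

sig = np.cos(np.linspace(0, 4 * np.pi, 64)) * 100
rate, distortion = transform_quantise_entropy(sig)
print(f"{rate:.2f} bits/symbol, MSE {distortion:.2f}")
```

Increasing `step` lowers the entropy (fewer bits) and raises the reconstruction error, which is exactly the rate-control role quantisation plays in real codecs.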

Common transforms include the Discrete Cosine Transform (DCT) in JPEG and many video codecs [19, 20] and, more recently, learned non-linear transforms using autoencoders in neural codecs. Quantisation reduces the precision of the transformed coefficients; it is the primary source of lossy compression [21, 22] and is crucial for controlling the bitrate. Entropy coding [23, 24, 25], such as Huffman or arithmetic coding, is the final stage, in which the quantised symbols are losslessly compressed by assigning shorter codes to more probable symbols. Traditional codecs, such as JPEG, JPEG 2000, HEVC, and VVC, are built from individually optimised hand-crafted modules, typically following a block-based hybrid coding framework, especially for video.

This architecture involves prediction (spatial for intra-frames, temporal for inter-frames), transformation of the residual, quantisation, and entropy coding. Learned codecs [16, 18, 25], also referred to as neural codecs, replace the hand-crafted components of traditional codecs with deep neural networks. These architectures are trained end-to-end, typically using an autoencoder framework for the transform and learned priors for entropy modelling. This data-driven approach allows for more powerful and adaptive modelling of complex visual data. JPEG is the canonical lossy image standard: images are divided into 8×8 blocks, transformed by a DCT, quantised, zig-zag scanned, and then entropy coded. The authors further examine specific codec instances, including JPEG, JPEG2000, Balle2017, and Cheng2020 for images, and the HEVC, VVC, DVC, and DCVC series for video.
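The JPEG-style block pipeline just described can be sketched as follows; this is an illustrative approximation that uses a single uniform quantisation step in place of JPEG's per-frequency quantisation tables and omits the run-length and Huffman coding stages.

```python
import numpy as np
from scipy.fft import dctn

def encode_block(block, q=16):
    """Process one 8x8 block in a JPEG-like way (illustrative, not standard-exact)."""
    coeffs = dctn(block - 128, norm="ortho")       # level shift + 2-D DCT
    quantised = np.round(coeffs / q).astype(int)   # uniform quantisation (JPEG uses per-frequency tables)
    # Zig-zag scan orders coefficients from low to high frequency so that
    # trailing zeros cluster together for run-length/entropy coding.
    order = sorted(((i, j) for i in range(8) for j in range(8)),
                   key=lambda ij: (ij[0] + ij[1],
                                   ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))
    return [quantised[i, j] for i, j in order]

# Toy usage: a smooth ramp block; low-frequency coefficients dominate and
# most later symbols in the scan are zero, which is what entropy coding exploits.
block = np.tile(np.arange(8) * 16, (8, 1)).astype(float)
print(encode_block(block)[:10])
```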

Coding and tokens share the same efficiency-fidelity trade-off

Scientists have demonstrated a unifying framework connecting classical visual coding and emerging visual token technology, both striving for efficient representation learning while minimising computational cost. This research establishes that both fields fundamentally address an efficiency-fidelity trade-off, utilising distinct information measures, functional roles, and optimisation criteria, yet sharing a common objective of balancing redundancy reduction with context modelling. The authors synthesise bidirectional insights, suggesting that coding principles can enhance token systems and that semantic modelling from tokens can inspire next-generation codecs tailored for machine tasks. The significance of these findings lies in the potential to connect established image and video compression techniques with the rapidly evolving landscape of large language models and artificial intelligence. By recognising the underlying similarities, researchers can leverage decades of progress in visual coding to improve the performance and efficiency of token-based systems used in areas like AIGC and embodied AI. Future work will focus on developing unified tokenisers, enabling token communication across different platforms, and extending this framework to encompass emerging data types such as 3D and 4D information.

Compression Boosts LLM Performance via Token Budgets

Researchers have demonstrated a groundbreaking link between compression techniques and the intelligence of large language models (LLMs) and multi-modal LLMs (MLLMs). Research reveals that compression efficiency directly correlates with improved model performance, prompting a unified examination of visual coding and vision token technology. Experiments utilising the LLaVA-v1.5 (7B) model under varying visual-token budgets yielded remarkable results. At a 25% token retention rate (144 tokens), the QPID method achieved an average accuracy of 92.55% compared to FastV at 92.55% and PruMerge at 96.13%. Further compression to 12.5% (72 tokens) saw QPID reach 94.17% accuracy, surpassing both FastV (85.51%) and PruMerge (91.72%) across all tasks.

Remarkably, even at an extreme 6.25% retention (36 tokens), QPID maintained 90.22% accuracy, significantly outperforming prior methods as the token budget tightened. Data shows that QPID’s success stems from a combination of entropy-driven selection and adaptive quadtree allocation. This innovative approach identifies and retains a compact, non-redundant subset of visual tokens, distributing them strategically across the scene rather than concentrating them in high-attention areas. Quantitative analysis confirms that QPID consistently delivers the best overall accuracy at all three tested budgets.
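The paper's QPID procedure is not reproduced here, but a hypothetical sketch of budgeted, entropy-driven token selection of the kind described above might look as follows; the scoring heuristic, tensor shapes, and function names are invented for illustration, and the quadtree allocation step is omitted.

```python
import torch

def select_tokens_by_entropy(tokens, attn_probs, budget):
    """Hypothetical budgeted token pruning (illustration only, not QPID itself).

    tokens:     (N, D) visual token embeddings
    attn_probs: (N, K) each token's attention distribution over K reference positions
    budget:     number of tokens to keep (e.g. 144, 72, or 36 out of 576)
    """
    # Invented information-density score: the entropy of each token's
    # attention distribution, used here as a stand-in redundancy measure.
    entropy = -(attn_probs * attn_probs.clamp_min(1e-9).log()).sum(dim=-1)
    keep = entropy.topk(budget).indices.sort().values  # preserve original ordering
    return tokens[keep], keep

# Toy usage with random tensors standing in for LLaVA-style visual tokens.
tokens = torch.randn(576, 1024)
attn = torch.softmax(torch.randn(576, 64), dim=-1)
kept, idx = select_tokens_by_entropy(tokens, attn, budget=72)  # 12.5% retention
print(kept.shape)  # torch.Size([72, 1024])
```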

Ablation studies isolate the contributions of each component, revealing that removing information-density scoring resulted in the largest performance drop, while omitting quadtree partitioning also degraded results. The team measured MME-Perception and ScienceQA accuracy against actual TFLOPs, demonstrating that QPID consistently leads across all token budgets. This breakthrough delivers a framework, CoTAM, that tailors codecs specifically for MLLMs, analysing internal information flow to optimise compression and preserve semantic integrity. Researchers have pioneered a unified framework bridging visual coding and visual token technology, analysing them through information theory, functionality, optimisation, and objective perspectives. Scientists assessed Shannon Entropy versus Semantic Entropy, redundancy reduction versus context modelling, and the trade-off between Rate-Distortion and the Information Bottleneck. The study’s innovative approach enables a deeper understanding of the relationship between compression efficiency and model performance, potentially paving the way for next-generation visual intelligence and standardisation of token technology akin to existing codecs like H.264/265.
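For readers unfamiliar with the two optimisation criteria contrasted above, they are shown below in their standard textbook forms (not necessarily the paper's notation): the rate-distortion Lagrangian of classical coding and the information-bottleneck objective associated with token technology.

```latex
% Classical coding: rate-distortion Lagrangian, trading rate R against distortion D
\min_{\text{codec}} \; R + \lambda D

% Visual tokens: information-bottleneck objective, compressing input X into
% tokens Z while retaining what is relevant for the downstream task Y
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta \, I(Z;Y)
```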


👉 More information
🗞 Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification
🧠 ArXiv: https://arxiv.org/abs/2601.20742
