AI Learns to Compress Data Using Language Models for Perfect Reconstruction

Efficient lossless compression is a critical challenge in managing ever-increasing data volumes and the storage demands that come with them. Mahdi Khodabandeh, Ghazal Shabani, and Arash Yousefi Jordehi, all from the Department of Computer Engineering at the University of Guilan, together with Seyed Abolghasem Mirroshandel and colleagues, present a novel approach to this problem utilising discrete latent transformers and reinforcement learning. Their research details a method for compressing data into sequences of tokens, avoiding the information loss often associated with the dense vector representations common in existing techniques. This preservation of token structure, achieved through a reinforcement learning framework applied to a T5 language model, improves compression ratios while maintaining data integrity, offering a significant advance over traditional dictionary-based and statistical methods and promising more robust compression solutions for diverse applications.

Scientists have demonstrated that efficient lossless compression is essential for minimising storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression, but many existing approaches depend on dense vector representations that obscure the underlying token structure. The researchers propose a novel lossless compression method that applies Reinforcement Learning (RL) to a T5 language model architecture. This approach compresses data into sequences of tokens rather than dense vector representations, preserving the token-based structure and aligning more closely with the original data format. By training the model with an off-policy RL algorithm, they optimise sequence length to minimise redundancy and improve compression efficiency. Preserving token structure allows for higher compression ratios while maintaining semantic integrity, and the method functions independently of external grammatical or world knowledge: the system compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across a range of applications.

Data compression transforms data into a format that occupies less storage space while maintaining acceptable accuracy; it is recognised as a key approach to information encoding and is termed bit-rate reduction in networking. Morse code, introduced in 1838, is regarded as the earliest form of modern data compression, and most lossless compression techniques rely on statistical models and mathematical methods. Multimedia techniques exploit specific features of the original file to eliminate redundant data, and more generally, patterns in data can be exploited to achieve compression.

The Text-to-Text Transfer Transformer (T5) language model is a top-performing architecture on Natural Language Processing (NLP) benchmarks, featuring an encoder-decoder transformer design. It is pre-trained on extensive datasets, including the C4 corpus, as well as task-specific datasets such as bug fixing and code summarisation, providing a strong foundation for understanding and generating data. The model handles inputs of varying lengths and scales readily by adjusting the number of parameters, with configurations ranging from 60 million to 11 billion parameters.

This work develops and evaluates a novel, lightweight lossless data compression framework that combines the pattern-recognition capabilities of Large Language Models (LLMs) with the adaptive decision-making of RL, while remaining computationally tractable on typical personal computers. Unlike resource-intensive neural compression approaches, the presented framework is based on the T5 architecture and operates entirely in the discrete token space, enabling effective compression with low memory and processing requirements. By leveraging an off-policy RL algorithm, the approach optimises sequence length to minimise redundancy while preserving semantic integrity, prioritising practicality and accessibility over achieving the highest possible compression ratios.
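In outline, the pipeline can be pictured as a sequence-to-sequence compressor whose output is itself a token sequence. The sketch below is illustrative only: it assumes two off-the-shelf Hugging Face T5 checkpoints ("t5-small") acting as compressor and decompressor, greedy decoding, and a toy reward that favours shorter latents only when reconstruction is exact; the paper's actual reward shaping, sampling strategy, and off-policy training loop are not reproduced here.

```python
# Illustrative skeleton of token-space compression with T5 (untrained for this
# task: the RL loop would tune both models so the reward below is maximised).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
compressor = T5ForConditionalGeneration.from_pretrained("t5-small")
decompressor = T5ForConditionalGeneration.from_pretrained("t5-small")

def compress(text, max_latent_len=32):
    """Map input text to a short sequence of discrete tokens (the latent IR)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return compressor.generate(ids, max_length=max_latent_len)

def decompress(latent):
    """Map the latent token sequence back to text."""
    out = decompressor.generate(latent, max_length=512)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def reward(original, latent):
    """Toy reward: shorter latents score higher, but only if nothing is lost."""
    if decompress(latent) == original:
        return -float(latent.shape[1])  # fewer latent tokens -> higher reward
    return -1000.0                      # hard penalty for reconstruction error
```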
Compression strategies typically require customisation based on the specific characteristics of tasks or datasets, and generalising RL policies across tasks or compression rates without retraining remains a challenge. This approach, however, demonstrates strong generalisation across different datasets and compression scenarios without task-specific fine-tuning. Another key consideration in RL-based compression algorithms is the trade-off between multiple objectives, such as reducing model size, maintaining accuracy, and meeting latency or energy constraints; this multi-objective optimisation requires carefully designed reward functions (a toy reward sketch appears below).

The research departs from conventional auto-encoder-based techniques by maintaining a token-based structure that better aligns with the original data format, enabling higher compression ratios without relying on predefined grammatical rules or external world knowledge. The achieved compression ratio may not surpass state-of-the-art deep-learning-based compression models, but the ability to run efficiently on personal computers makes this approach a viable and scalable solution for real-world applications where resource constraints are a concern.

This research offers a novel perspective by combining the strengths of LLMs and deep reinforcement learning. The key contributions include a compression technique that dynamically optimises compression strategies, the integration of the T5 model with RL, a model that compresses data through a discrete Intermediate Representation (IR), a scalable framework comparable to state-of-the-art machine-learning-based compression techniques, the use of LLMs to leverage contextual features for compression, and a framework that ensures efficient execution without requiring specialised computational resources. To promote reproducibility and facilitate further research, the source code will be released publicly upon acceptance of the paper, alongside a user-friendly implementation such as a well-documented API.
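To make that trade-off concrete, a scalar reward might blend the competing objectives named above. The function and weights below are hypothetical illustrations, not the reward used in the paper.

```python
# Hypothetical multi-objective reward: trade compression savings against a hard
# fidelity constraint and a soft latency penalty. All weights are illustrative.
def multi_objective_reward(orig_len, latent_len, lossless, latency_ms,
                           w_ratio=1.0, w_loss=10.0, w_latency=0.01):
    ratio_term = (orig_len - latent_len) / max(orig_len, 1)  # fraction saved
    loss_term = 0.0 if lossless else -1.0                    # losslessness gate
    latency_term = -latency_ms                               # slower is worse
    return w_ratio * ratio_term + w_loss * loss_term + w_latency * latency_term
```

Tuning the weights shifts the learned policy along the Pareto front between compression ratio, accuracy, and speed.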
LZ77 is a dictionary-based lossless compression algorithm that uses a dynamically sliding window consisting of a search buffer and a look-ahead buffer. The search buffer stores previously seen sequences of symbols, while the look-ahead buffer contains upcoming symbols; when the longest match between the two is found, the algorithm replaces the repeated sequence with a short back-reference to the earlier occurrence, classically emitted as an (offset, length, next-symbol) triple (a toy version of this scheme is sketched at the end of this section). GZIP is a widely used derivative of LZ77 that additionally encodes the identified repeated patterns with Huffman coding.

Another fundamental method in entropy coding is arithmetic coding, which differs from prefix-based coding schemes like Huffman coding by representing the entire message as an interval within [0, 1). The algorithm builds a probability model from the frequency of each symbol and progressively subdivides the interval, achieving near-optimal compression efficiency. Range encoding is a variant of arithmetic coding that eliminates both contextual and alphabetic redundancies in digital messages while maintaining similar compression effectiveness. Like arithmetic coding, range coding represents a message as a progressively refined interval, but it uses fixed-precision integer arithmetic, which simplifies computation and makes it practical for software implementations.

Neural Network Compression (NNCP) is a lossless data compression algorithm that leverages sequence modelling to estimate the probability distribution of the input data. The first version of NNCP employs a Long Short-Term Memory (LSTM) network to detect sequential dependencies, allowing it to predict the likelihood of upcoming symbols with high accuracy, and encodes the resulting probability distribution with arithmetic coding. This approach achieved compression rates competitive with traditional statistical and dictionary-based algorithms. The second version replaces the LSTM backbone with a Transformer-based model, improving encoding and decoding efficiency by capturing long-range dependencies and exploiting parallelisation, with modifications including a change of activation function from the Rectified Linear Unit (ReLU) to the Gaussian Error Linear Unit (GELU).

The relentless pursuit of better data compression addresses the escalating cost of storing and transmitting the world’s exponentially growing digital footprint. This new work offers a compelling departure by applying reinforcement learning to the architecture of large language models, preserving the fundamental structure of data by treating it as a sequence of tokens rather than collapsing it into abstract vector representations. This is a crucial shift: it sidesteps the information loss inherent in many conventional methods and aligns more closely with how data is originally organised. Reinforcement learning allows the system to adaptively optimise compression, learning to identify and eliminate redundancy without prior knowledge of the data’s content or format. However, the reliance on pre-trained transformer models introduces significant computational overhead, and the system’s performance on truly novel data types remains an open question. Future work will likely focus on reducing model size without sacrificing compression performance and on transfer learning techniques to broaden the range of applicable data formats, ultimately aiming to unlock a new generation of compression technologies capable of handling the data deluge of the 21st century.
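To make the sliding-window mechanics above concrete, here is a deliberately naive LZ77-style coder. It is a minimal sketch under simplifying assumptions: brute-force matching, arbitrary window sizes, and classic (offset, length, next-symbol) triples; practical implementations such as DEFLATE add hash-chain matching and entropy coding on top.

```python
# Toy LZ77: slide a search buffer over already-seen symbols and greedily
# match the look-ahead buffer against it, emitting (offset, length, next) triples.
def lz77_compress(data, search_size=255, lookahead_size=15):
    i, out = 0, []
    while i < len(data):
        start = max(0, i - search_size)
        best_off, best_len = 0, 0
        for j in range(start, i):  # brute-force scan of the search buffer
            length = 0
            while (length < lookahead_size
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1  # consume the match plus the literal symbol
    return out

def lz77_decompress(triples):
    out = []
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])  # copy from the already-decoded window
        out.append(nxt)
    return "".join(out)

# Round trip: the repeated "abc" run collapses into a single back-reference.
triples = lz77_compress("abcabcabcd")
assert lz77_decompress(triples) == "abcabcabcd"
```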

👉 More information
🗞 Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2602.12146

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

New Plasma Solver Accurately Models Complex Behaviour at Sonic Speeds
February 16, 2026

AI’s Hidden Flaws Revealed by New Tool Detecting Critical Reasoning Errors
February 16, 2026

Machine Learning Struggles to Grasp Complex States of Matter, Research Confirms
February 16, 2026