Global Context Compression Achieves 77% Reduction with Interleaved Vision-Text Transformation

The escalating computational demands of large language models are driving research into efficient context compression techniques, and a new study proposes a method for globally compressing contextual information. Dian Jiao, Jiaxin Duan, and Shuai Zhao, all from China Electronics Cloud Technology Co., Ltd., together with Jiabing Leng, Yiran Zhang, Feng Huang, and colleagues, introduce VIST2, a novel approach that interleaves text and visual encodings to reduce the number of tokens required during both prefilling and inference. This research is significant because, unlike previous partial compression methods, it cuts computational costs during token-by-token inference as well as prefilling. Through a multi-stage training process involving optical language modelling and instruction tuning, the team demonstrates substantial improvements in speed, memory usage, and computational efficiency on long-form writing tasks, achieving up to a threefold increase in first-token generation speed at a fourfold compression ratio. The researchers are making their code and datasets publicly available to facilitate further investigation in this area.

Global Context Compression with VIST2 Transformers

Recent advances in vision-language models applied to end-to-end optical character recognition (OCR) have revealed a promising new approach to compressing textual information with minimal loss. This discovery motivated earlier research on rendering input text into images for prefilling, effectively reducing the number of tokens and alleviating the quadratic growth in computational cost associated with the attention mechanism. However, this partial compression saves no computational or memory cost during token-by-token inference. Consequently, scientists have investigated global context compression, a technique designed to save tokens during both the prefilling and inference stages.

This work introduces VIST2, a novel Transformer architecture that interleaves chunks of input text with their corresponding visual encodings, relying exclusively on visual tokens within the pre-context to predict the subsequent text token distribution. The team renders text chunks into sketch images and trains VIST2 through a multi-stage process, beginning with curriculum-scheduled pretraining for optical language modelling, followed by modal-interleaved instruction tuning. Extensive experiments were conducted using VIST2 models scaled from 0.6 billion to 8 billion parameters to optimise the training process and hyperparameters. The resulting models, achieving a 4:1 compression ratio, demonstrate significant improvements over baseline models in long writing tasks.
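
For intuition, here is a minimal sketch of the rendering step, using Pillow to rasterise a text chunk into a grayscale sketch image. The canvas size, line wrapping, and default font are illustrative assumptions, not the paper’s actual renderer settings.

```python
from PIL import Image, ImageDraw

def render_chunk(text: str, width: int = 448, height: int = 448) -> Image.Image:
    """Rasterise a text chunk into a grayscale image (mode "L").

    Uses naive fixed-width wrapping and Pillow's default bitmap font;
    the paper's actual renderer (fonts, wrapping, resolution) may differ.
    """
    img = Image.new("L", (width, height), color=255)  # white canvas
    draw = ImageDraw.Draw(img)
    chars_per_line = 64
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    y = 4
    for line in lines:
        draw.text((4, y), line, fill=0)  # black text
        y += 12
        if y > height - 12:
            break  # chunk overflows the canvas; truncate
    return img

chunk = "Recent advances in vision-language models applied to OCR ... " * 4
render_chunk(chunk).save("chunk_000.png")
```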

Specifically, the research team observed an average 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in floating-point operations (FLOPs). This innovative approach effectively bridges the gap between partial and global compression techniques, offering a substantial advancement in efficient language modelling. VIST2’s architecture features a “sandwich” design, common in vision-language models, comprising a visual encoder and a large language model backbone connected by a modal aligner. The visual encoder processes images rendered from unstructured text, compressing information into a dense visual latent space. This iterative visual-text transformation, termed Optical Language Modelling, enables both long-text understanding and long-text generation, addressing limitations found in existing partial compression methods. The researchers have made their code and datasets publicly available to facilitate further research in this area.
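
To make the optical language modelling loop concrete, the following sketch shows one plausible control flow for generation under global compression. The `generate`, `render`, and `encode_image` callables are hypothetical stand-ins for the LM decoding step, the text renderer, and the visual encoder plus aligner; only the 4:1 compression of each finished chunk reflects the reported setting.

```python
def optical_lm_generate(prompt_tokens, generate, render, encode_image,
                        chunk_size=256, max_chunks=4):
    """Sketch of global context compression during generation.

    After each generated chunk, the raw text tokens are replaced in the
    running context by roughly chunk_size / 4 visual tokens, so the KV
    cache grows at a quarter of the usual rate. All three callables are
    hypothetical stand-ins, not the authors' API.
    """
    context = list(prompt_tokens)            # interleaved visual/text context
    output = []
    for _ in range(max_chunks):
        chunk = generate(context, chunk_size)   # decode the next text chunk
        output.extend(chunk)
        visual = encode_image(render(chunk))    # text -> image -> ~4:1 fewer tokens
        context.extend(visual)                  # only visual tokens persist
    return output

# Toy stubs to exercise the control flow (not real models).
demo = optical_lm_generate(
    prompt_tokens=["<prompt>"],
    generate=lambda ctx, n: [f"tok{len(ctx) + i}" for i in range(4)],
    render=lambda chunk: "image-of:" + "".join(chunk),
    encode_image=lambda img: ["<vis>"],          # 4 text tokens -> 1 visual token
)
print(demo)
```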

Visual Interleaving for Long-Text Language Modelling

The study introduces VIST2, a novel large language model architecture designed for global context compression, and details the innovative methodology employed to achieve both long-text understanding and generation. Researchers engineered a system that interleaves text chunks with their corresponding visual encodings, relying exclusively on visual tokens within the pre-context to predict subsequent text token distributions. This approach addresses limitations in existing partial context compression methods, which only support prefilling and not inference. To implement VIST2, the team developed a ‘sandwich’ architecture comprising a pretrained Vision Transformer (ViT-L/16) as a visual encoder and a language model backbone, connected by a modal aligner.
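
A minimal PyTorch sketch of that sandwich wiring might look as follows; the toy encoder, backbone, and dimensions are placeholders standing in for the pretrained ViT-L/16 and language model rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class VIST2Sandwich(nn.Module):
    """Illustrative 'sandwich': visual encoder -> MLP modal aligner -> LM.

    `visual_encoder` and `lm_backbone` stand in for a pretrained ViT-L/16
    and a decoder-only language model; only the tensor shapes matter here.
    """
    def __init__(self, visual_encoder, lm_backbone, d_v=1024, d_lm=2048):
        super().__init__()
        self.visual_encoder = visual_encoder          # images -> (B, m, d_v)
        self.aligner = nn.Sequential(                 # MLP projecting d_v -> d_lm
            nn.Linear(d_v, d_lm), nn.GELU(), nn.Linear(d_lm, d_lm))
        self.lm_backbone = lm_backbone                # token sequence -> logits

    def forward(self, images, text_embeds):
        vis = self.aligner(self.visual_encoder(images))   # (B, m, d_lm)
        seq = torch.cat([vis, text_embeds], dim=1)        # one interleaving step
        return self.lm_backbone(seq)

# Shape check with toy stand-ins for the encoder and backbone.
toy_vit = lambda imgs: torch.randn(imgs.shape[0], 784, 1024)  # 28*28 patch tokens
toy_lm = nn.Linear(2048, 32000)                               # fake vocab head
model = VIST2Sandwich(toy_vit, toy_lm)
logits = model(torch.randn(2, 1, 448, 448), torch.randn(2, 16, 2048))
print(logits.shape)  # torch.Size([2, 800, 32000])
```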

Text chunks are first rendered into grayscale images, and the ViT-L/16 processes these images, yielding m = ⌊H/16⌋ × ⌊W/16⌋ visual tokens per image, each of dimension d_v. The modal aligner, a multilayer perceptron, then projects these visual tokens into the language model’s embedding space, producing tokens of dimension d_lm. A sparse causal attention mechanism is central to the design, restricting token visibility so that visual tokens function as contextual memory while preventing information from leaking into future textual content. Positional encoding is unified across modalities, so a token’s position index counts every preceding visual and text token in the interleaved sequence. Finally, the study pioneered a multi-stage training recipe to handle the asynchronous optimisation of parameters and the modifications to the attention layers.
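
The token budget and one plausible reading of the visibility rule can be written down directly, as in the sketch below; the paper’s exact masking scheme may differ in detail.

```python
import torch

def visual_token_count(h: int, w: int, patch: int = 16) -> int:
    """m = floor(H/16) * floor(W/16) for a ViT-L/16 encoder."""
    return (h // patch) * (w // patch)

def sparse_causal_mask(chunk_id, is_visual):
    """Boolean mask (True = query i may attend to key j).

    One assumed reading of the sparse causal rule:
      - attention is causal (j <= i);
      - visual tokens stay visible to all later positions (contextual memory);
      - text tokens are visible only within their own chunk, so raw text
        never leaks into future chunks once its visual encoding exists.
    """
    L = chunk_id.shape[0]
    i = torch.arange(L).unsqueeze(1)                 # query positions
    j = torch.arange(L).unsqueeze(0)                 # key positions
    causal = j <= i
    same_chunk = chunk_id.unsqueeze(1) == chunk_id.unsqueeze(0)
    key_is_visual = is_visual.unsqueeze(0).expand(L, L)
    return causal & (key_is_visual | same_chunk)

print(visual_token_count(448, 448))                  # 28 * 28 = 784
# Toy layout: chunk 0 holds 4 visual tokens, chunk 1 holds 3 text tokens.
chunk_id = torch.tensor([0, 0, 0, 0, 1, 1, 1])
is_visual = torch.tensor([True, True, True, True, False, False, False])
print(sparse_causal_mask(chunk_id, is_visual).int())
```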

Initial pre-training focused on image captioning with the visual encoder and language model frozen, followed by a multi-turn OCR (MT-OCR) task that updated only the visual encoder and modal aligner. The MT-OCR task followed a curriculum of increasing difficulty (a single image, then 2-4 images, then more than 4 images per sample), designed to optimise the model’s ability to compress text into images and recover the essential textual information. Experiments scaled from 0.6B to 8B parameters demonstrated a 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPs at a 4:1 compression ratio, highlighting the efficacy of the methodology.
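
As a rough illustration of that curriculum, the schedule below steps the number of rendered images per MT-OCR sample across stages; the stage boundaries and step counts here are invented for illustration.

```python
import random

# Hypothetical curriculum for multi-turn OCR (MT-OCR) pretraining:
# each stage draws samples with more rendered images per conversation.
CURRICULUM = [
    {"name": "stage-1", "steps": 10_000, "images_per_sample": (1, 1)},
    {"name": "stage-2", "steps": 20_000, "images_per_sample": (2, 4)},
    {"name": "stage-3", "steps": 30_000, "images_per_sample": (5, 8)},
]

def curriculum_batches():
    """Yield (stage_name, n_images) for every training step, in order."""
    for stage in CURRICULUM:
        lo, hi = stage["images_per_sample"]
        for _ in range(stage["steps"]):
            yield stage["name"], random.randint(lo, hi)

# Peek at the first few scheduled steps.
gen = curriculum_batches()
for _ in range(3):
    print(next(gen))
```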

👉 More information
🗞 Global Context Compression with Interleaved Vision-Text Transformation
🧠 arXiv: https://arxiv.org/abs/2601.10378

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Scalable NDN Solution Achieves Low-Latency Connectivity for Massive LEO Constellations
January 20, 2026

Pulsar-driven Supernovae Achieve Blowouts with Neutron Stars at 10^14 Gauss Magnetic Fields
January 20, 2026

Neural Network Achieves High-Speed Dynamic Optical Coherence Tomography with 4 OCT Volumes
January 20, 2026