Large language models currently generate text one token at a time, a sequential process that limits their efficiency, and researchers are now exploring ways to speed it up dramatically. Chenze Shao, Darren Li, and Fandong Meng, from WeChat AI at Tencent Inc., along with Jie Zhou, present a new approach called Continuous Autoregressive Language Models (CALM) that predicts sequences of meaning rather than individual tokens. The method compresses chunks of text into continuous vectors, allowing the model to generate text far more quickly, and reconstructs the original text with remarkable accuracy. The team also developed a new framework for training and evaluating these models, demonstrating significant improvements in performance and computational efficiency and establishing a pathway toward ultra-efficient language generation.
Asymptotic Unbiasedness of Weighted Batch Sampling
This research presents a novel sampling algorithm for drawing samples from a target probability distribution. The core idea is to use an initial sampler to generate a batch of samples and then weight those samples by how frequently they appear within the batch. The team rigorously proves that the algorithm is asymptotically unbiased: as the batch size N approaches infinity, the probability of sampling a particular value converges to its true probability under the target distribution, which is essential for ensuring that the algorithm produces representative samples. The analysis also shows that the expected number of calls to the initial sampler needed to generate a single sample remains bounded, suggesting the procedure is computationally practical.

This sampling result underpins CALM, which addresses a core limitation of existing large language models: their sequential, token-by-token generation. A high-fidelity autoencoder compresses a chunk of K tokens into a single continuous vector, in this case a 128-dimensional latent vector for a chunk of four tokens, and reconstructs the original tokens with over 99.9% accuracy. To build a robust latent space, the researchers applied a clipping technique and dropout during autoencoder training, forcing the autoencoder to learn redundant representations and to infer masked tokens from context.

Because the model now operates in a continuous domain where explicit likelihoods are unavailable, the team developed a likelihood-free framework for training, evaluation, and controllable sampling, using a lightweight generative head to efficiently predict the next vector in the sequence. Within this framework, they mathematically proved that the approximate sampling algorithm described above, which uses a batch of N samples, is asymptotically unbiased.

Experiments show that representing sequences as continuous vectors rather than discrete tokens significantly improves computational efficiency without sacrificing performance. By compressing chunks of tokens into single vectors and reconstructing them with high accuracy, CALM reduces the number of generative steps required, offering a pathway to more efficient large language models, and achieves a superior performance-compute trade-off, establishing a new scaling axis for language modeling focused on increasing the semantic bandwidth of each generative step. The researchers also developed a comprehensive toolkit to support this approach, including the high-fidelity autoencoder, a novel training objective, and a metric for evaluating language model performance in the continuous domain. While further architectural and algorithmic optimizations are possible, particularly regarding the autoencoder's focus on pure reconstruction and the potential for a more tightly integrated generative model, this work demonstrates promising results and opens new avenues for efficient language generation.
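To make the batch-sampling idea concrete, here is a minimal sketch in Python. The text only states that samples from an initial sampler are reweighted by their frequency within a batch and that the scheme is asymptotically unbiased; the `weight_fn` parameter, the temperature-style example target, and the function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_batch_sample(base_sampler, batch_size, weight_fn, rng=None):
    """Draw one sample approximately from a reweighted version of the base
    sampler's distribution (sketch).

    base_sampler(n) must return n i.i.d. draws from the initial sampler.
    weight_fn maps an empirical frequency in [0, 1] to a non-negative weight;
    as batch_size grows, the empirical frequencies converge to the base
    probabilities, so the output distribution converges to the target
    distribution induced by weight_fn (the asymptotic-unbiasedness property
    described in the text).
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = base_sampler(batch_size)                        # one batch of draws
    values, counts = np.unique(batch, return_counts=True)   # empirical support
    freqs = counts / batch_size                             # empirical frequencies
    weights = np.array([weight_fn(f) for f in freqs])       # frequency-based weights
    if weights.sum() == 0:                                  # degenerate batch: fall back
        return values[rng.integers(len(values))]
    probs = weights / weights.sum()
    return values[rng.choice(len(values), p=probs)]

# Hypothetical usage: temperature-style sharpening as an assumed target.
# With weight_fn(f) = f ** (1 / T), the reweighted distribution approaches
# p(x)^(1/T) / Z as the batch size grows.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = lambda n: rng.choice([0, 1, 2], size=n, p=[0.6, 0.3, 0.1])
    draws = [weighted_batch_sample(base, batch_size=256,
                                   weight_fn=lambda f: f ** (1 / 0.5), rng=rng)
             for _ in range(2000)]
    print(np.bincount(draws, minlength=3) / len(draws))
```

The single call to `base_sampler` per generated sample mirrors the bounded expected number of initial-sampler calls noted above.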
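The chunk autoencoder can likewise be pictured with a short PyTorch sketch. The chunk size of four tokens, the 128-dimensional latent, the latent clipping, and the token dropout follow the description above; the `ChunkAutoencoder` class name, the layer shapes, and the simple linear encoder and decoder are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    """Sketch: K tokens -> one continuous latent vector -> K reconstructed tokens."""
    def __init__(self, vocab_size, chunk_size=4, d_model=256, latent_dim=128,
                 token_dropout=0.15, latent_clip=3.0):
        super().__init__()
        self.chunk_size = chunk_size
        self.token_dropout = token_dropout    # assumed masking rate
        self.latent_clip = latent_clip        # assumed clipping bound
        self.mask_id = vocab_size             # extra id used for masked tokens
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        self.to_latent = nn.Linear(chunk_size * d_model, latent_dim)
        self.from_latent = nn.Linear(latent_dim, chunk_size * d_model)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):                 # tokens: (batch, K) integer ids
        if self.training and self.token_dropout > 0:
            # Randomly mask tokens so the decoder must infer them from context,
            # encouraging a redundant, robust latent representation.
            mask = torch.rand_like(tokens, dtype=torch.float) < self.token_dropout
            tokens = tokens.masked_fill(mask, self.mask_id)
        h = self.embed(tokens).flatten(1)                    # (batch, K * d_model)
        z = self.to_latent(h)                                # (batch, latent_dim)
        return torch.clamp(z, -self.latent_clip, self.latent_clip)  # bound the latent space

    def decode(self, z):                       # z: (batch, latent_dim)
        h = self.from_latent(z).view(-1, self.chunk_size, self.embed.embedding_dim)
        return self.to_logits(h)               # (batch, K, vocab_size) token logits

    def forward(self, tokens):
        return self.decode(self.encode(tokens))
```

During training one would minimize cross-entropy between these logits and the original, unmasked chunk; at inference the decoder maps a predicted latent vector back to its four tokens.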
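Finally, for likelihood-free training of the lightweight generative head, the text does not spell out the objective, so the sketch below assumes a sample-based proper scoring rule (the energy score), which can be optimized using only samples from the head and no explicit likelihood. The `GenerativeHead` module, its noise-conditioned design, and the `energy_score_loss` function are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GenerativeHead(nn.Module):
    """Lightweight head mapping a backbone hidden state plus a noise vector
    to a predicted next latent vector (illustrative design)."""
    def __init__(self, hidden_dim, latent_dim=128, noise_dim=32):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + noise_dim, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def sample(self, h, num_samples):
        # h: (batch, hidden_dim) -> (batch, num_samples, latent_dim)
        b = h.size(0)
        h_rep = h.unsqueeze(1).expand(b, num_samples, h.size(-1))
        noise = torch.randn(b, num_samples, self.noise_dim, device=h.device)
        return self.net(torch.cat([h_rep, noise], dim=-1))

def energy_score_loss(samples, target):
    """Sample-based energy score: E||X - y|| - 0.5 * E||X - X'||.
    samples: (batch, m, latent_dim), target: (batch, latent_dim).
    Minimized in expectation when the head's distribution matches the target
    distribution, with no likelihood required (assumed objective)."""
    m = samples.size(1)
    attraction = (samples - target.unsqueeze(1)).norm(dim=-1).mean(dim=1)
    pairwise = torch.cdist(samples, samples)                 # (batch, m, m)
    repulsion = pairwise.sum(dim=(1, 2)) / (m * (m - 1))     # zero diagonal excluded
    return (attraction - 0.5 * repulsion).mean()

# Hypothetical usage with a 128-dimensional latent target:
head = GenerativeHead(hidden_dim=768)
h = torch.randn(8, 768)            # backbone hidden states for 8 positions
z_next = torch.randn(8, 128)       # ground-truth next latent vectors
loss = energy_score_loss(head.sample(h, num_samples=4), z_next)
loss.backward()
```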
👉 More information
🗞 Continuous Autoregressive Language Models
🧠 ArXiv: https://arxiv.org/abs/2510.27688
