Researchers have long assumed that parameters within language models are largely interchangeable, successfully predicting performance from model size and compute alone. Reza T Batley and Sourav Saha, both from Virginia Polytechnic Institute and State University, challenge this notion, demonstrating that parameter allocation in smaller language models is surprisingly inefficient. Their work introduces Leviathan, a novel architecture that replaces the traditional discrete lookup table with a continuous embedding generator. Evaluating Leviathan on the Pile dataset, the team reports consistent performance gains over standard LLaMA-style models and, crucially, a markedly superior effective parameter capacity: the model behaves as if it possessed significantly more parameters than it actually does. This research offers a pathway to building more powerful and efficient small language models, potentially reshaping the landscape of natural language processing.
Despite incurring a moderate throughput overhead of 23–51%, which decreases with scale, the gains in sample efficiency demonstrably outweigh this cost. Notably, Leviathan exhibits an effective capacity 1.5–2.1× greater than its actual parameter count. At the 421M scale, Leviathan matched the validation loss of a roughly 725M-parameter dense model, performing competitively with significantly larger models. Depth, denoted L, was either held fixed (iso-body runs) or increased (isoparametric runs) to restore near-isoparametricity, with Leviathan's generator module substituting for the input embedding matrix. All models were implemented in JAX/Flax and trained from scratch using AdamW with gradient clipping, with a sequence length of 512 and a batch size of 512.
Data was sourced from the Pile dataset, streamed through a dataloader with a 10,000-sequence shuffle buffer to randomize distribution. Input text was tokenized with the o200k_base tokenizer from tiktoken, whose vocabulary of 200,018 entries was padded to 200,376; this size motivates a decomposition of each token index into three coordinates via base-59 decomposition, reducing the number of indexing parameters from 200,376 to 177. Crucially, matched Dense–Leviathan pairs processed identical token streams, ensuring consistent data exposure during training. Power laws of the form $L(N) = AN^{-\alpha} + b$ and $L(D) = BD^{-\beta} + b$ were fitted to the dense models, using an irreducible loss term $b$ of 1.69, in line with established scaling practices. Leviathan's fitted parameter-scaling exponent was 0.47 compared to the baseline's 0.38, suggesting a widening advantage with increasing parameter count.
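The index decomposition described above can be sketched in a few lines. The digit ordering and coordinate encoding below are assumptions, since the text specifies only the base and digit count:

```python
def decompose(token_id: int, base: int = 59, ndigits: int = 3) -> tuple:
    """Split a token index into base-59 digit coordinates (least significant first)."""
    coords = []
    for _ in range(ndigits):
        coords.append(token_id % base)
        token_id //= base
    return tuple(coords)

def recompose(coords, base: int = 59) -> int:
    """Invert the decomposition back to a flat token index."""
    return sum(c * base**k for k, c in enumerate(coords))

# 59**3 = 205,379 covers the padded vocabulary of 200,376, and indexing
# now needs only 3 * 59 = 177 coordinate values instead of 200,376.
assert 59**3 >= 200_376
```

Each token is thus addressed by three small integers in [0, 59), which is what shrinks the indexing parameter count from 200,376 to 177.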
Leviathan consistently achieves superior parameter efficiency, outperforming its dense baselines across every scale tested. At the 109M scale, Leviathan matched the validation loss of a 230M-parameter dense model, a 2.11× effective size multiplier. Even at the 421M scale, where the relative embedding tax is smaller, Leviathan maintained a substantial 1.72× effective size advantage, corresponding to a 724M-parameter dense model. The advantage of Leviathan relative to its dense baseline grows approximately monotonically with tokens seen during training: Leviathan-60M exhibited continued growth in advantage even after processing 100×N tokens. The research also indicates improved parameter and data scaling exponents compared to dense baselines, suggesting Leviathan extracts more benefit from both increased model size and additional training data. While the analysis is limited to the parameter range studied, the consistent outperformance indicates a systematic improvement in parameter efficiency.
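As a quick sanity check, the effective-size multipliers quoted above follow directly from the reported matched dense sizes:

```python
# Reported pairs: Leviathan size (M params) -> matched dense size (M params).
pairs = {109: 230, 421: 724}

# Effective-size multiplier = matched dense size / actual size.
multipliers = {n: round(m / n, 2) for n, m in pairs.items()}
print(multipliers)  # {109: 2.11, 421: 1.72}
```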
🗞 A Separable Architecture for Continuous Token Representation in Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.22040
The shift from traditional discrete lookup tables to a continuous embedding generator fundamentally alters the parameter space optimization problem. Discrete embeddings mandate that the model learn an isolated, high-dimensional vector for every unique token index, leading to redundancy and an inherent quantization bottleneck. Leviathan’s continuous approach maps the vocabulary into a smooth, mathematically governed latent space, allowing the model to interpolate meaningful representations for tokens that were not explicitly seen during training. This functional continuity dramatically improves gradient flow during backpropagation, enabling the model to generalize beyond simple rote memorization of index-value pairs.
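The paper's exact generator architecture is not detailed here, so the following NumPy sketch is purely illustrative: it replaces a full (V, d) lookup table with three 59-row digit tables combined through a small shared MLP, showing how a continuous map over index coordinates can stand in for discrete per-token vectors. All weight shapes and the tanh nonlinearity are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, base, ndigits = 200_376, 64, 59, 3

# Hypothetical generator weights: tiny per-digit tables plus a shared MLP.
digit_tables = rng.normal(0.0, 0.02, (ndigits, base, d))
W1 = rng.normal(0.0, 0.02, (d, 4 * d))
W2 = rng.normal(0.0, 0.02, (4 * d, d))

def generate_embedding(token_id: int) -> np.ndarray:
    """Map a token index to a continuous embedding via its base-59 digits."""
    coords = [(token_id // base**k) % base for k in range(ndigits)]
    h = sum(digit_tables[k, c] for k, c in enumerate(coords))  # (d,) digit sum
    return np.tanh(h @ W1) @ W2                                # shared MLP

lookup_params = V * d                                      # full dense table
generator_params = digit_tables.size + W1.size + W2.size   # far smaller
```

Because nearby indices share digit coordinates, the generator's output varies smoothly over the index space rather than assigning each token an isolated vector.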
The observed power-law relationship for both Leviathan and the dense baselines ($L(N) = AN^{-\alpha} + b$) provides crucial insight into model scaling mechanics. Specifically, comparing the scaling exponents ($\alpha$ and $\beta$) quantifies how rapidly each architecture's loss decays toward the irreducible term $b$ as parameters and data increase. The divergence between these exponents suggests that while both architectures adhere to general scaling principles, the efficiency gained by Leviathan is not merely additive but represents a structural improvement in how model capacity translates into predictive performance.
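Effective capacity can be read off the fitted dense law by inversion: given Leviathan's loss, solve $L(N_{\text{eff}}) = AN_{\text{eff}}^{-\alpha} + b$ for $N_{\text{eff}}$. The coefficient values below are placeholders for illustration, not the paper's fits:

```python
def dense_loss(n_params: float, A: float, alpha: float, b: float = 1.69) -> float:
    """Fitted dense scaling law L(N) = A * N**(-alpha) + b."""
    return A * n_params ** (-alpha) + b

def effective_params(loss: float, A: float, alpha: float, b: float = 1.69) -> float:
    """Invert the dense law: the dense size whose predicted loss equals `loss`."""
    return ((loss - b) / A) ** (-1.0 / alpha)

# Round trip with placeholder coefficients (A and alpha are illustrative).
A, alpha = 12.0, 0.38
loss_at_500m = dense_loss(500e6, A, alpha)
assert abs(effective_params(loss_at_500m, A, alpha) - 500e6) / 500e6 < 1e-9
```

The effective-size multiplier reported in the paper is then `effective_params(leviathan_loss, ...)` divided by Leviathan's actual parameter count.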
Architecturally, the base-59 decomposition is a clever form of index compression. By mapping the large integer vocabulary onto a small set of coordinates, the authors circumvent the need for a prohibitively large input embedding matrix. This achieves parameter reduction at the input layer without sacrificing representational power, maintaining the necessary output dimensionality while leaving the subsequent generator module a far smaller, more manageable set of weights to parametrize.
From an industrial deployment standpoint, the elevated effective capacity is paramount for resource-constrained environments. A model demonstrating a $2.1\times$ multiplier means that a limited computational budget can access a functional performance level previously requiring a physical model size exceeding available hardware memory. This suggests that the primary bottleneck in state-of-the-art NLP might not be compute power per se, but the inefficiency inherent in the physical parameterization of deep network architectures.
