Researchers have long assumed that parameters within language models are largely interchangeable, successfully predicting performance from model size and compute alone. However, Reza T. Batley and Sourav Saha, both of Virginia Polytechnic Institute and State University, challenge this notion, demonstrating that parameter allocation in smaller language models is surprisingly inefficient. Their work introduces Leviathan, a novel architecture that replaces the traditional discrete embedding lookup table with a continuous embedding generator. Evaluating Leviathan on the Pile dataset, the team reports consistent performance gains over standard LLaMA-style models and, crucially, a markedly superior effective parameter capacity: the model behaves as if it possessed significantly more parameters than it actually does. This research offers a pathway to building more powerful and efficient small language models, potentially reshaping the landscape of natural language processing.
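To see why parameter allocation matters so much at small scale, consider the input embedding table alone. The back-of-the-envelope sketch below uses the padded ~200k-entry vocabulary mentioned later in this article; the model widths are illustrative assumptions, not configurations reported by the authors.

```python
# Back-of-the-envelope: a conventional input embedding is a (vocab x width)
# lookup table, and with a ~200k-entry vocabulary that table alone is huge
# at small-model widths.  The widths below are assumed for illustration.
VOCAB = 200_376                      # padded o200k_base vocabulary size

for width in (512, 768, 1024):       # hypothetical model widths
    table_params = VOCAB * width
    print(f"width {width}: embedding table = {table_params / 1e6:.1f}M parameters")
# width 512:  embedding table = 102.6M parameters
# width 768:  embedding table = 153.9M parameters
# width 1024: embedding table = 205.2M parameters
```

At these widths the lookup table can rival or exceed the rest of a small Transformer, which is exactly the "embedding tax" discussed further down.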
Despite incurring a moderate throughput overhead of 23-51%, which decreases with scale, the gains in sample efficiency demonstrably outweigh this cost. Notably, Leviathan exhibits an effective capacity 1.5 to 2.1 times greater than its actual parameter count. At the 421M scale, Leviathan matched the validation loss of a roughly 725M-parameter dense model, showing it can compete with significantly larger models. Depth, denoted L, was either held fixed (iso-body runs) or increased to restore near-isoparametricity (isoparametric runs), with Leviathan's generator module substituting for the input embedding matrix. All models were implemented in JAX/Flax and trained from scratch using AdamW with gradient clipping, with a sequence length of 512 and a batch size of 512.
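For concreteness, here is a minimal optax sketch of that optimizer setup; the clip threshold, learning rate, and weight decay are placeholders, since the article reports only the sequence length and batch size.

```python
# Minimal sketch of the reported training setup: AdamW with gradient clipping
# in optax, sequence length 512, batch size 512.  The clip norm, learning
# rate, and weight decay are assumed placeholders, not reported values.
import optax

SEQ_LEN, BATCH_SIZE = 512, 512   # as reported

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),                      # assumed clip threshold
    optax.adamw(learning_rate=3e-4, weight_decay=0.1),   # assumed hyperparameters
)
# opt_state = optimizer.init(params)   # params from the Flax model
```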
Data was sourced from the Pile, streamed through a dataloader with a 10,000-sequence shuffle buffer to randomize ordering. Input text was tokenized with the o200k_base tokenizer from tiktoken, a vocabulary of cardinality 200,018 padded to 200,376; this size motivates a base-59 decomposition of each token id into three coordinates, reducing the indexing parameter count from 200,376 to 177. Crucially, matched Dense-Leviathan pairs processed identical token streams, ensuring consistent data exposure during training. Power laws of the form L(N) = A·N^(−α) + b and L(D) = B·D^(−β) + b were fitted to the dense models with the irreducible loss term b fixed at 1.69, in line with established scaling practice. Leviathan's fitted parameter-scaling exponent was 0.47, compared to the baseline's 0.38, suggesting a widening advantage with increasing parameter count.
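Putting those pieces together, the Flax-style sketch below shows how a generator built on the base-59 coordinates could stand in for the embedding matrix. Only the tokenizer, the base-59 split, and the 3 × 59 = 177 indexed rows come from the article; the widths, the MLP shape, and the use of three separate tables are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: split each o200k_base token id into three base-59
# digits, look each digit up in a 59-row table (3 * 59 = 177 rows in total),
# and let a small MLP generate the token embedding in place of a
# 200,376-row lookup matrix.
import jax
import jax.numpy as jnp
import flax.linen as nn
import tiktoken

BASE = 59                                   # 59**3 = 205,379 >= 200,376

def to_coords(ids):
    """Map integer token ids to (..., 3) base-59 digit triples."""
    return jnp.stack([ids % BASE,
                      (ids // BASE) % BASE,
                      (ids // BASE**2) % BASE], axis=-1)

class EmbeddingGenerator(nn.Module):
    d_model: int = 512                      # hypothetical model width
    coord_dim: int = 64                     # hypothetical per-digit width

    @nn.compact
    def __call__(self, ids):
        coords = to_coords(ids)             # (..., 3) digits in [0, 59)
        parts = [nn.Embed(BASE, self.coord_dim, name=f"digit_{i}")(coords[..., i])
                 for i in range(3)]         # three small tables, 177 rows total
        h = nn.gelu(nn.Dense(4 * self.d_model)(jnp.concatenate(parts, axis=-1)))
        return nn.Dense(self.d_model)(h)    # continuous token representation

enc = tiktoken.get_encoding("o200k_base")
ids = jnp.array([enc.encode("Leviathan reads the Pile.")])
model = EmbeddingGenerator()
params = model.init(jax.random.PRNGKey(0), ids)
print(model.apply(params, ids).shape)       # (1, num_tokens, d_model)
```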
Leviathan achieves superior parameter efficiency consistently, outperforming its matched dense baselines at every scale evaluated.
At the 109M scale, Leviathan matched the validation loss of a 230M-parameter dense model, a 2.11x effective-size multiplier. Even at the 421M scale, where the embedding tax is smaller, Leviathan maintained a substantial 1.72x effective-size advantage, corresponding to a 724M-parameter dense model. The results show that Leviathan's advantage over its dense baseline grows approximately monotonically with the number of tokens seen during training; Leviathan-60M, for example, continued to widen its advantage even after processing 100×N tokens. The research also indicates improved parameter- and data-scaling exponents relative to the dense baselines, suggesting Leviathan extracts more benefit from both increased model size and additional training data. While the analysis is limited to the parameter range studied, the consistent outperformance indicates a systematic improvement in parameter efficiency.
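The effective-size multipliers quoted above follow from inverting the fitted dense power law L(N) = A·N^(−α) + b at Leviathan's observed loss. The sketch below reuses the article's fixed b = 1.69 but plugs in placeholder losses (chosen only so the output lands near the reported figures), so the resulting numbers are purely illustrative.

```python
# Sketch of the effective-size calculation: fit L(N) = A * N**(-alpha) + b to
# dense baselines with b fixed at 1.69, then invert the fit at Leviathan's
# observed loss.  All loss values here are placeholders, not the paper's data.
import jax.numpy as jnp

b = 1.69                                    # irreducible loss term from the article

# Hypothetical dense-baseline points (parameter count, validation loss).
N_dense = jnp.array([60e6, 109e6, 421e6])
L_dense = jnp.array([3.09, 2.81, 2.36])

# Fit a line in log-log space: log(L - b) = log(A) - alpha * log(N).
slope, intercept = jnp.polyfit(jnp.log(N_dense), jnp.log(L_dense - b), 1)
alpha, A = -float(slope), float(jnp.exp(intercept))

def effective_params(observed_loss):
    """Dense parameter count the fit predicts would reach this loss."""
    return ((observed_loss - b) / A) ** (-1.0 / alpha)

leviathan_loss, leviathan_N = 2.53, 109e6   # placeholder Leviathan-109M numbers
multiplier = effective_params(leviathan_loss) / leviathan_N
print(f"alpha ~ {alpha:.2f}, effective-size multiplier ~ {multiplier:.2f}x")
```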
👉 More information
🗞 A Separable Architecture for Continuous Token Representation in Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.22040
