Leviathan Achieves Superior Language Model Capacity with Sub-Billion Parameters

Researchers have long assumed that parameters within language models are largely interchangeable, successfully predicting performance from model size and compute alone. However, Reza T. Batley and Sourav Saha, both from Virginia Polytechnic Institute and State University, challenge this notion, demonstrating that parameter allocation in smaller language models is surprisingly inefficient. Their work introduces Leviathan, a novel architecture that replaces the traditional discrete embedding lookup table with a continuous embedding generator. Evaluating Leviathan on the Pile dataset, the team reports consistent performance gains over standard LLaMA-style models and, crucially, a markedly higher effective parameter capacity: the model behaves as if it possesses significantly more parameters than it actually does. This research offers a pathway to building more powerful and efficient small language models, potentially reshaping the landscape of natural language processing.

Despite incurring a moderate throughput overhead of 23-51%, which decreases with scale, the gains in sample efficiency demonstrably outweigh this cost. Notably, Leviathan exhibits an effective capacity 1.5 to 2.1 times greater than its actual parameter count. At the 421M scale, Leviathan matched the validation loss of a roughly 725M-parameter dense model, performing competitively with significantly larger models. Depth, denoted L, was either held fixed (iso-body runs) or increased (isoparametric runs) to restore near-isoparametricity, with Leviathan's generator module substituting for the input embedding matrix. All models were implemented in JAX/Flax and trained from scratch using AdamW with gradient clipping, with a sequence length of 512 and a batch size of 512.
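For readers curious what such a setup looks like in practice, here is a minimal sketch in JAX with optax. Only the optimizer choice (AdamW), gradient clipping, sequence length (512), and batch size (512) come from the article; the learning rate, clip threshold, and weight decay values are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the described training setup (JAX + optax).
import jax
import optax

SEQ_LEN = 512     # from the article
BATCH_SIZE = 512  # from the article

# AdamW with global-norm gradient clipping, chained in the usual optax style.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),   # clip threshold assumed
    optax.adamw(learning_rate=3e-4,   # learning rate assumed
                weight_decay=0.1),    # weight decay assumed
)

def train_step(params, opt_state, batch, loss_fn):
    """One AdamW update; loss_fn(params, batch) returns a scalar cross-entropy loss."""
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss
```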

Data was sourced from the Pile dataset, streamed via a dataloader with a 10,000-sequence shuffle buffer to randomize ordering. Input text was tokenized with the o200k_base tokenizer from tiktoken, whose vocabulary of 200,018 entries was padded to 200,376; this motivated decomposing each token id into three-dimensional coordinates via base-59 decomposition, reducing the number of indexing parameters from 200,376 to 177. Crucially, matched Dense-Leviathan pairs processed identical token streams, ensuring consistent data exposure during training. Power laws of the form L(N) = A·N^(−α) + b and L(D) = B·D^(−β) + b were fitted to the dense models using an irreducible loss term b of 1.69, in line with established scaling practices. Leviathan's fitted parameter-scaling exponent of 0.47, compared to the baseline's 0.38, suggests a widening advantage with increasing parameter count.
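The base-59 decomposition is easy to sketch. Assuming the canonical positional-digit mapping (the authors' exact mapping may differ), each padded token id maps to three coordinates in [0, 59), so three small tables totalling 3 × 59 = 177 index rows replace one 200,376-row lookup:

```python
# Hypothetical sketch of a base-59 token-id decomposition; function names are illustrative.
import jax.numpy as jnp

BASE = 59              # 59**3 = 205,379 >= 200,376, so three digits suffice
PADDED_VOCAB = 200_376

def token_to_coords(token_ids: jnp.ndarray) -> jnp.ndarray:
    """Map integer token ids to (..., 3) base-59 coordinates."""
    d0 = token_ids // (BASE * BASE)
    d1 = (token_ids // BASE) % BASE
    d2 = token_ids % BASE
    return jnp.stack([d0, d1, d2], axis=-1)

def coords_to_token(coords: jnp.ndarray) -> jnp.ndarray:
    """Inverse mapping, useful as a sanity check."""
    return coords[..., 0] * BASE * BASE + coords[..., 1] * BASE + coords[..., 2]

# The largest padded id round-trips correctly.
assert int(coords_to_token(token_to_coords(jnp.array(PADDED_VOCAB - 1)))) == PADDED_VOCAB - 1
```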

Leviathan achieves consistently superior parameter efficiency, outperforming much larger dense models

At the 109M scale, Leviathan demonstrated a validation loss equivalent to that of a 230M-parameter dense model, a 2.11x effective size multiplier. Even at the 421M scale, where the embedding tax is smaller, Leviathan maintained a substantial 1.72x effective size advantage, corresponding to a 724M-parameter dense model. Results show that Leviathan's advantage over its dense baseline grows approximately monotonically with tokens seen during training; Leviathan-60M, for example, continued to gain even after processing 100×N tokens. The research also indicates improved parameter and data scaling exponents relative to dense baselines, suggesting Leviathan extracts more benefit from both increased model size and additional training data. While the analysis is limited to the parameter range studied, Leviathan's consistent outperformance indicates a systematic improvement in parameter efficiency.
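To illustrate how such an effective size multiplier follows from the dense scaling fit, the sketch below inverts L(N) = A·N^(−α) + b with b = 1.69; the constants A and α are placeholders rather than the paper's fitted values.

```python
# Sketch: derive an "effective parameter count" by inverting the dense power-law fit.
def effective_params(observed_loss: float, A: float, alpha: float, b: float = 1.69) -> float:
    """Dense model size whose predicted loss L(N) = A * N**(-alpha) + b equals observed_loss."""
    return (A / (observed_loss - b)) ** (1.0 / alpha)

def effective_multiplier(observed_loss: float, actual_params: float,
                         A: float, alpha: float, b: float = 1.69) -> float:
    """E.g. a 109M-parameter model whose loss maps to ~230M dense gives a ~2.11x multiplier."""
    return effective_params(observed_loss, A, alpha, b) / actual_params
```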

👉 More information
🗞 A Separable Architecture for Continuous Token Representation in Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.22040

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
