AI Model Predicts Protein Fitness Using Just Sequence, Rivaling Complex Methods

Scientists are tackling the longstanding challenge of building protein language models capable of both accurate fitness prediction and efficient protein generation. Furkan Eris, an independent researcher, together with collaborators, presents a causal protein language model named Proust that demonstrates a compelling ability to bridge this divide. The 309 million-parameter model achieves competitive fitness estimation on ProteinGym substitutions, rivalling masked language models trained with significantly greater computational resources. Proust also establishes a new state of the art on indel benchmarks and approaches the accuracy of more complex, structure-aware methods on viral fitness, all while retaining the generative capabilities that traditional masked language models lack. These advances position Proust as a powerful tool for protein engineering and design, offering insights into representation learning and potential for further scaling of capabilities.

This innovation addresses a longstanding trade-off in the field, where masked language models excel at evaluating protein fitness but lack generative capabilities, while causal models can generate sequences but have historically underperformed in fitness prediction.

Proust bridges this gap through a series of architectural improvements borrowed from large language model research, offering a single model for both tasks. Trained on 33 billion tokens using 40 B200 GPUs over 40 hours, the model represents a significant leap in efficiency and performance. The study reveals that Proust attains a Spearman correlation coefficient of 0.390 on ProteinGym substitutions, matching the performance of much larger masked language models such as ESM-2-650M while using 62 times fewer training FLOPs.
Furthermore, on indel prediction (insertions and deletions), Proust establishes a new state of the art, surpassing models up to 20 times its size. On the EVEREST viral fitness benchmarks, Proust approaches the accuracy of structure-aware methods while using only sequence information.
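
As a rough illustration of the compute scale behind the FLOPs comparison above, here is a minimal sketch using the common ~6 × parameters × tokens approximation for dense transformer training; this is an assumption about FLOP accounting, not the paper’s own calculation.

```python
# Rough training-compute estimate with the ~6 * params * tokens rule of thumb
# for dense transformers (an assumption; the paper may count FLOPs differently).
params = 309e6   # Proust parameter count
tokens = 33e9    # training tokens reported above

proust_flops = 6 * params * tokens
print(f"Proust training compute ~ {proust_flops:.1e} FLOPs")        # ~6.1e19

# The reported 62x gap to ESM-2-650M would then imply roughly:
print(f"Implied ESM-2-650M cost ~ {62 * proust_flops:.1e} FLOPs")   # ~3.8e21
```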

These results position Proust as a powerful tool for protein engineering and design, offering a balance between predictive power and generative capacity. Key to Proust’s success is the implementation of grouped-query attention with shared key/value projections, cross-layer value residuals, depthwise causal convolutions, and the Muon optimizer with Newton-Schulz orthogonalization.
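
A minimal PyTorch sketch of the first of these components, grouped-query attention with shared key/value projections, in which several query heads attend through a smaller shared set of key/value heads; the dimensions and head counts below are illustrative assumptions, not Proust’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: many query heads share a smaller set of key/value
    heads, shrinking the KV projections and cache. Sizes here are illustrative."""
    def __init__(self, d_model=768, n_q_heads=12, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.h_q, self.h_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        # Shared, smaller key/value projections: the memory saving of GQA.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.h_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.h_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.h_kv, self.d_head).transpose(1, 2)
        # Repeat each shared k/v head across its group of query heads.
        rep = self.h_q // self.h_kv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```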

These architectural choices enable the model to achieve 19% model FLOPs utilisation (MFU) on B200 GPUs with 131,000-token batches, significantly improving training efficiency. Beyond performance, the research provides insights into the model’s internal workings, revealing that the spread (standard deviation) of per-position entropy can predict the effectiveness of retrieval augmentation, a technique used to enhance predictions with evolutionary information.

Analysis using a ‘logit lens’ demonstrates that early layers of Proust abstract the input embeddings, middle layers integrate contextual information, and later layers converge on the final predictions. The model actively suppresses certain amino acids, notably tryptophan and cysteine, reflecting their rarity and specialised roles in protein structure.
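
A minimal sketch of how a logit-lens analysis of this kind is usually implemented, projecting intermediate residual-stream states through the model’s final norm and output (unembedding) head; the hook points and tensor shapes are assumptions, since Proust’s internal module names are not given in the article.

```python
import torch

@torch.no_grad()
def logit_lens(hidden_states, final_norm, unembed):
    """hidden_states: list of [batch, seq_len, d_model] residual-stream tensors,
    one per layer, collected with forward hooks (a hypothetical interface).
    Reuses the model's own final norm and unembedding head to read out a token
    distribution after every layer, showing how predictions sharpen with depth."""
    per_layer_probs = []
    for h in hidden_states:
        logits = unembed(final_norm(h))            # project into vocabulary space
        per_layer_probs.append(torch.softmax(logits, dim=-1))
    return per_layer_probs
```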

Importantly, the standard deviation of per-position entropy correlates with the benefit of retrieval augmentation, suggesting a potential heuristic for optimising test-time compute allocation and determining when external data is most valuable. This work introduces a causal PLM with competitive performance, an efficient architecture, and a novel method for guiding test-time scaling.
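
A minimal sketch of that entropy statistic, assuming access to the model’s per-position next-token logits; the threshold-gated usage in the trailing comment is a hypothetical application, not a procedure prescribed by the authors.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_spread(logits):
    """logits: [seq_len, vocab] next-token logits for one protein sequence.
    Returns the standard deviation of per-position Shannon entropy, the statistic
    reported to correlate with how much retrieval augmentation helps."""
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)   # entropy at each position
    return entropy.std().item()

# Hypothetical gating: only pay for retrieval (e.g. MSA search) when the spread is large.
# if entropy_spread(model_logits) > threshold:   # `threshold` would need calibration
#     score = retrieval_augmented_score(sequence)
```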

Proust model architecture and training methodology leverage transformer innovations from large language model research

A 309 million-parameter causal protein language model, Proust, was developed to bridge the performance gap between masked language models and causal models. The research involved training Proust on a corpus of 33 billion tokens using 40 B200 GPUs over 40 hours, leveraging architectural innovations borrowed from large language model research.

These innovations included grouped-query attention with shared key/value projections, cross-layer value residuals, and depthwise causal convolutions, all designed to enhance performance and efficiency. Proust’s performance was rigorously evaluated on the ProteinGym substitutions benchmark, achieving a Spearman correlation coefficient of 0.390, a result competitive with masked language models that required 50 to 200 times more computational resources.

Further assessment on indels demonstrated Proust establishing a new state-of-the-art performance, surpassing models up to 20 times larger in parameter count. On the EVEREST viral fitness benchmarks, the model approached the accuracy of structure-aware methods while relying solely on sequence information.

The training process employed the Muon optimizer with Newton-Schulz orthogonalization, enabling the use of higher learning rates and maintaining training stability. Logit lens analysis was then performed to examine the model’s internal representations, revealing that early layers abstract input embeddings, middle layers integrate contextual information, and late layers converge towards final predictions. This analysis also showed that the standard deviation of per-position entropy correlates with the effectiveness of retrieval augmentation, suggesting a potential heuristic for optimizing test-time computation.
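
Returning to the optimizer, here is a minimal sketch of the Newton-Schulz orthogonalization step at the core of Muon, which maps each weight update toward the nearest (semi-)orthogonal matrix; the quintic coefficients follow the publicly released Muon implementation and may differ from the exact settings used for Proust.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately replace a 2-D update matrix G with the nearest (semi-)orthogonal
    matrix, as done inside Muon. Coefficients are the quintic iteration used in the
    public Muon implementation; Proust's exact settings are not stated in the article."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)             # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```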

Proust demonstrates competitive protein function prediction with substantially reduced computational cost through a novel approach

Proust, a 309 million-parameter causal protein language model, attained a Spearman correlation coefficient of 0.390 on ProteinGym substitutions, matching ProGen2-6.4B and ProGen3-3B while requiring 41 to 213 times less training compute. ESM-2-650M achieved a correlation of 0.414 at 62 times the compute, and E1-600M reached 0.420 at 229 times the compute, underscoring Proust’s computational efficiency.
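
A minimal sketch of the standard zero-shot recipe for causal models on ProteinGym-style assays: score each variant by its sequence log-likelihood relative to the wild type and rank-correlate the scores with measured fitness using Spearman’s ρ. The model interface and the wild-type baseline are assumptions, not necessarily the paper’s exact scoring rule.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

@torch.no_grad()
def sequence_log_likelihood(model, token_ids):
    """Summed causal next-token log-probability of one tokenized protein sequence.
    `model` is assumed to return [1, seq_len, vocab] logits (hypothetical interface)."""
    logits = model(token_ids.unsqueeze(0)).squeeze(0)
    log_p = F.log_softmax(logits[:-1], dim=-1)      # predict token t+1 from its prefix
    return log_p.gather(-1, token_ids[1:].unsqueeze(-1)).sum().item()

def zero_shot_spearman(model, variant_ids, wildtype_ids, assay_values):
    """Score variants by log-likelihood ratio versus wild type, then rank-correlate
    the scores with experimentally measured fitness values."""
    wt = sequence_log_likelihood(model, wildtype_ids)
    scores = [sequence_log_likelihood(model, v) - wt for v in variant_ids]
    rho, _ = spearmanr(scores, assay_values)
    return rho
```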

Across the tested models, Proust matched or exceeded the performance of other causal, generative language models; higher correlations were generally achieved only by masked language models using substantially more computational resources. Breaking down performance by functional category, Proust excelled on activity assays, achieving a Spearman correlation of 0.42, and on binding assays, with a correlation of 0.41, settings where sequence patterns strongly correlate with function.

Performance decreased on stability assays, reaching 0.34, as these assays rely more heavily on three-dimensional structural context, which sequence models capture only indirectly. This pattern aligns with observations across protein language models, indicating that stability prediction remains a challenge without explicit structural information.

On ProteinGym indels, Proust achieved a Spearman correlation of 0.521, surpassing all compared models, including ProGen2-6.4B (ρ = 0.432) and RITA-1.2B (ρ = 0.450), despite using 10 to 200 times less training compute. The gap between Proust and larger models was more pronounced for indels than for substitutions, suggesting that architectural efficiency is particularly important for this task.

Final perplexity reached 10.85 on the validation set and 11.04 on the training set, indicating a lack of overfitting during training. Evaluating on the EVEREST viral fitness benchmarks, Proust achieved a mean Spearman correlation of 0.40, approaching SaProt at 0.44, which incorporates structure tokens from Foldseek.
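
For context, the perplexity figures above are simply the exponential of the mean next-token cross-entropy, so they can be read back as an average loss; a small sketch, assuming an amino-acid vocabulary on the order of 25 tokens, which is not stated in the article.

```python
import math

def perplexity(mean_nll_nats):
    """Perplexity from the mean next-token cross-entropy measured in nats."""
    return math.exp(mean_nll_nats)

# exp(2.384) ~ 10.85, the reported validation perplexity; a uniform guess over a
# ~25-token amino-acid vocabulary (an assumption about the tokenizer) would score ~25.
print(perplexity(2.384))
```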

Proust exhibited lower cross-assay variance, with a standard deviation of 0.10 compared with 0.14 to 0.22 for baseline models, suggesting more consistent performance across different assays. Retrieval augmentation, using ColabFold, added approximately 4 hours of wall-clock time when run sequentially on the 217 ProteinGym substitution assays.

Efficient protein modelling via architectural innovation and sequence entropy analysis enables accurate fitness prediction

Proust, a 309 million-parameter causal protein language model, achieves competitive performance with masked language models while retaining generative capabilities. Trained on 33 billion tokens, it demonstrates strong results on protein substitution and indel benchmarks, surpassing models up to twenty times larger in some cases.

Specifically, Proust matches the performance of significantly larger models on ProteinGym substitutions and establishes a new state-of-the-art on indel prediction. This model’s efficiency stems from architectural innovations borrowed from large language models, including grouped-query attention, cross-layer value residuals, and depthwise causal convolutions.

Interpretability analyses reveal that Proust effectively distinguishes between constrained and variable positions within protein sequences, aligning with established patterns of evolutionary conservation and structural roles. Furthermore, the standard deviation of per-position entropy correlates with the effectiveness of retrieval augmentation, suggesting a potential heuristic for optimising computational resources during testing.

Limitations acknowledged by the developers include a slight lag in stability prediction compared with structure-aware models, and the authors note that larger model sizes may further improve performance. Future research will explore pre-training on concatenated homologous sequences to potentially enhance retrieval-augmented performance, leveraging the model’s existing long-context pretraining capabilities.

👉 More information
🗞 No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation
🧠 ArXiv: https://arxiv.org/abs/2602.01845

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
