AI Steers Protein Design Away from Errors That Ruin Function and Stability

Scientists are increasingly focused on mitigating pathological repetition within protein language models, a phenomenon that hinders both structural integrity and functional potential. Jiahao Zhang, Zeqing Zhang, and Di Wang, all from Mohamed bin Zayed University of Artificial Intelligence, alongside Lijie Hu, present the first systematic investigation into this issue. Their research establishes quantitative metrics for assessing repetition at both motif and homopolymer levels, demonstrating a clear correlation between repetitive sequences and reduced folding reliability. To combat this, they introduce UCCS (Utility-Controlled Contrastive Steering), a novel method that guides protein generation using carefully curated datasets to specifically target and reduce repetition without compromising the protein’s ability to fold correctly. This work represents a significant step towards reliable protein generation via language models, positioning repetition control as a crucial area for future development.

Repetitive sequence generation compromises protein structure and function by limiting conformational diversity

Scientists have identified a critical flaw in protein language models: a tendency towards pathological repetition during protein sequence generation. Unlike text generation, where repetition merely reduces readability, in proteins it severely undermines structural integrity and functional viability. This work presents the first systematic study of the phenomenon, proposing quantitative metrics that characterise both motif-level and homopolymer repetition and demonstrating its detrimental impact on folding reliability.
To address this challenge, researchers developed Utility-Controlled Contrastive Steering, or UCCS, a method that steers protein generation using a carefully constructed dataset. Instead of attempting to reduce repetition through simple comparisons of high and low-repetition sequences, UCCS constructs contrastive sets that maximise differences in repetition while maintaining tight control over structural utility.

This disentanglement process yields steering vectors that specifically target repetition without compromising the protein’s ability to fold correctly. These vectors, injected during the generation process, consistently reduce repetition without requiring model retraining or the use of heuristic decoding methods.

Experiments utilising both ESM-3 and ProtGPT2 models, evaluated on the CATH, UniRef50, and SCOP datasets, demonstrate that UCCS outperforms existing decoding penalties and baseline methods. The results show a substantial reduction in repetition alongside preserved AlphaFold confidence scores, indicating that the structural reliability of the generated sequences is maintained.

This research establishes repetition control as a central challenge for protein language models and highlights dataset-guided steering as a principled approach for generating reliable protein sequences. By formally defining this failure mode and establishing robust evaluation metrics, this study provides a foundation for future advancements in de novo protein design and structural prediction. The UCCS method offers a significant step towards generating functional and stable proteins through artificial intelligence.

Quantifying protein sequence degeneracy and structural plausibility through composite scoring functions enables improved protein design and engineering

A unified repetition score, R(x), and a utility score, U(x), were initially established to formally define and evaluate pathological repetition in protein language models. The repetition score integrates token-level entropy, motif-level n-gram diversity, and homopolymer penalties into a single degeneracy measure.
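The paper does not spell out its exact weighting, so the following is a minimal illustrative sketch of such a composite R(x), assuming equal weights and simple stdlib implementations of the three named components (token entropy, n-gram diversity, homopolymer penalties):

```python
import math
from collections import Counter

def token_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the residue distribution; low entropy = degenerate."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def ngram_diversity(seq: str, n: int = 3) -> float:
    """Fraction of distinct n-grams among all n-grams; 1.0 means no motif repeats."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def homopolymer_penalty(seq: str, max_run: int = 3) -> float:
    """Penalise runs of identical residues longer than max_run, normalised by length."""
    penalty, run = 0.0, 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:
            penalty += 1.0
    return penalty / len(seq)

def repetition_score(seq: str, w=(1.0, 1.0, 1.0)) -> float:
    """Illustrative composite R(x): higher = more degenerate."""
    max_ent = math.log2(20)  # 20 standard amino acids
    return (w[0] * (1 - token_entropy(seq) / max_ent)
            + w[1] * (1 - ngram_diversity(seq))
            + w[2] * homopolymer_penalty(seq))
```

A poly-alanine run scores far higher than a diverse stretch of residues, which is the behaviour any such degeneracy measure needs.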

Simultaneously, the utility score leverages structural proxies from AlphaFold, specifically pLDDT and pTM scores, to quantify how plausibly a generated sequence will fold. Together, these metrics allow repetition control to be formulated as a constrained optimisation problem: reduce degeneracy without compromising structural plausibility.
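In code, that constrained formulation amounts to a filter-and-rank step. The sketch below is an assumption about how one might apply it: `R` and `U` stand in for the paper's scorers (in practice U(x) would come from AlphaFold's pLDDT/pTM), and the thresholds `tau_r` and `tau_u` are hypothetical:

```python
def select_feasible(candidates, R, U, tau_r, tau_u):
    """Keep sequences that satisfy both constraints -- low repetition (R <= tau_r)
    and adequate foldability (U >= tau_u) -- then rank survivors by utility."""
    feasible = [s for s in candidates if R(s) <= tau_r and U(s) >= tau_u]
    return sorted(feasible, key=U, reverse=True)
```

With toy scorers, a homopolymer candidate is rejected while a diverse sequence passes, mirroring the intended trade-off.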

Subsequently, Utility-Controlled Contrastive Steering (UCCS) was developed as a representation-level intervention designed to disentangle repetition from structural characteristics. Instead of contrasting high- versus low-repetition sequences directly, contrastive sets were constructed to maximise differences in repetition while maintaining tight control over structural utility.
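The key idea is pairing: contrast sequences whose utility is nearly identical but whose repetition differs as much as possible. A minimal sketch of such a construction, with a hypothetical utility tolerance `u_tol` (the paper's exact matching procedure is not reproduced here):

```python
def build_contrastive_sets(pool, R, U, u_tol=0.05, k=100):
    """Pair sequences with near-identical utility but maximally different repetition,
    so the contrast reflects repetition rather than foldability.
    R and U are the repetition and utility scorers."""
    scored = [(s, R(s), U(s)) for s in pool]
    pairs = []
    for i, (s1, r1, u1) in enumerate(scored):
        for s2, r2, u2 in scored[i + 1:]:
            if abs(u1 - u2) <= u_tol:  # tight utility control
                pairs.append((abs(r1 - r2),
                              s1 if r1 > r2 else s2,   # high-repetition side
                              s2 if r1 > r2 else s1))  # low-repetition side
    pairs.sort(key=lambda p: p[0], reverse=True)  # maximise repetition gap
    return [p[1] for p in pairs[:k]], [p[2] for p in pairs[:k]]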

This careful construction yielded steering vectors specifically targeting repetition without negatively impacting foldability. These vectors were then injected at inference to consistently reduce repetition without requiring model retraining or heuristic decoding adjustments. Experiments were conducted utilising both ESM-3 and ProtGPT2 models, evaluating performance across the CATH, UniRef50, and SCOP datasets.
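A common way to obtain such a vector is the mean difference of activations between the two contrastive sets, scaled and subtracted at inference; treating it that way here is an assumption, sketched with plain Python lists standing in for a chosen layer's hidden states:

```python
def steering_vector(acts_high, acts_low):
    """Mean-difference direction: high-repetition minus low-repetition activations.
    acts_* are lists of activation vectors, one per sequence."""
    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]
    mh, ml = mean(acts_high), mean(acts_low)
    return [h - l for h, l in zip(mh, ml)]

def inject(hidden, v, alpha=1.0):
    """Steer away from repetition by subtracting alpha * v from a hidden state
    during generation -- no retraining, no decoding heuristics."""
    return [h - alpha * vi for h, vi in zip(hidden, v)]
```

In a real model this subtraction would be applied inside the forward pass at the selected layer (e.g. via a forward hook), for every generated token.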

The research demonstrated that UCCS consistently outperformed baseline methods, including decoding penalties, in substantially lowering repetition rates. Importantly, this reduction in repetition was achieved while preserving AlphaFold confidence scores, indicating maintained structural reliability and biological viability of the generated proteins. This work establishes repetition control as a central challenge for protein language models and highlights dataset-guided steering as a principled approach for reliable protein generation.

Quantifying and mitigating pathological repetition via utility-controlled contrastive steering offers improved generative model control

Researchers introduced quantitative metrics to characterise motif-level and homopolymer repetition, establishing a unified repetition score R(x) integrating token entropy, n-gram diversity, and homopolymer penalties. This score, alongside a utility score U(x) derived from AlphaFold structural proxies, allowed formulation of repetition control as a constrained optimisation problem balancing degeneracy reduction with structural plausibility.

The utility score U(x) quantifies the foldability of generated sequences using metrics such as pLDDT and pTM. To address pathological repetition in protein language models, the authors proposed Utility-Controlled Contrastive Steering (UCCS), a method designed to disentangle repetition from structural plausibility.

UCCS constructs contrastive datasets that maximise differences in repetition while tightly controlling for structural utility, yielding steering vectors that specifically target repetition without degrading foldability. Experiments utilising ESM-3 and ProtGPT2 across CATH, UniRef50, and SCOP datasets demonstrate that UCCS consistently reduces repetition during inference without requiring retraining or heuristic decoding.

Evaluations across the three datasets, CATH, UniRef50, and SCOP, show that UCCS outperforms decoding penalties and other baseline methods in lowering repetition while preserving AlphaFold confidence scores. The research establishes repetition control as a central challenge for protein language models and highlights dataset-guided steering as a principled approach for reliable protein generation. Representative cases of natural proteins versus PLM generations, arranged by sequence length, visually illustrate the repetition artifacts the method reduces, with UCCS-steered sequences exhibiting more diverse and structurally coherent patterns.

Mitigating pathological repetition enhances protein structure prediction accuracy

Scientists have established repetition control as a central challenge in protein language models (PLMs) and demonstrated a principled approach to reliable protein generation. Pathological repetition frequently occurs during protein sequence generation with PLMs, which, unlike simple redundancy in text, compromises the structural integrity and potential functionality of the designed proteins.

This work presents the first systematic investigation into this repetition, introducing quantitative metrics to characterise it at both the motif and homopolymer levels and demonstrating its detrimental impact on folding reliability. To address this issue, researchers developed Utility-Controlled Contrastive Steering (UCCS), a method that disentangles repetition from structural utility through dataset-guided activation edits.

UCCS employs contrastive sets designed to maximise differences in repetition while maintaining structural integrity, resulting in steering vectors that specifically target repetition without negatively affecting foldability. Experiments utilising both ESM-3 and ProtGPT2 models across diverse datasets, CATH, UniRef50, and SCOP, show that UCCS consistently outperforms existing methods like decoding penalties in reducing repetition and preserving confidence scores.

The authors acknowledge that the effectiveness of UCCS varies with the parameter α, with performance improvements plateauing beyond a certain value, and that the optimal steering layer differs between model architectures. Future research could explore strategies to optimise α and layer selection for various PLMs. This work establishes a foundation for more reliable protein generation by addressing a critical limitation of current PLMs and paving the way for improved protein design and engineering.
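Since gains plateau beyond some α, tuning reduces to a simple sweep: pick the α that minimises repetition while keeping utility above a floor. The sketch below is a hypothetical tuning loop; `generate(alpha)` is an assumed sampler returning a batch of sequences, and `R`/`U` are the paper's scorers:

```python
def sweep_alpha(alphas, generate, R, U, u_floor):
    """Evaluate each steering strength alpha and keep the one that minimises
    mean repetition subject to mean utility staying above u_floor."""
    best = None
    for a in alphas:
        seqs = generate(a)
        mean_r = sum(R(s) for s in seqs) / len(seqs)
        mean_u = sum(U(s) for s in seqs) / len(seqs)
        if mean_u >= u_floor and (best is None or mean_r < best[1]):
            best = (a, mean_r)
    return best  # (alpha, mean repetition) or None if no alpha is feasible
```

The same loop could be repeated per candidate layer to handle the architecture-dependent layer choice the authors describe.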

👉 More information
🗞 Controlling Repetition in Protein Language Models
🧠 ArXiv: https://arxiv.org/abs/2602.00782

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
