Single-cell RNA sequencing data consistently reveal robust statistical structures, prompting the creation of complex foundation models like TranscriptFormer to generate gene expression embeddings. Huan Souza and Pankaj Mehta, both from the Department of Physics and the Faculty of Computing and Data Science at Boston University, demonstrate that comparable performance can be achieved using significantly less computationally intensive methods. Their research establishes state-of-the-art or near state-of-the-art results on established benchmarks for single-cell analysis, even surpassing foundation models when analysing novel cell types and organisms. This work underscores the importance of thorough benchmarking procedures and suggests that the fundamental biological characteristics defining cell identity may be effectively captured through simple, linear representations of single-cell gene expression data.
Understanding the complex language of cells has long relied on computationally expensive methods. Now, a surprisingly simple approach, using basic mathematical representations of gene activity, matches and even surpasses the performance of advanced artificial intelligence models. This discovery suggests we may have been overlooking the inherent clarity within biological data itself.
Scientists are increasingly reliant on single-cell RNA sequencing (scRNA-seq) to map cellular diversity across tissues and organisms. Advances in this technology have yielded expansive “cell atlases”, datasets containing gene expression profiles from hundreds of millions of individual cells, with volumes expected to grow rapidly. This surge in data has prompted development of large-scale foundation models, such as TranscriptFormer, which employ transformer-based architectures to create generative models of gene expression by embedding genes into a latent vector space.
These embeddings have demonstrated state-of-the-art performance on tasks including cell-type classification and predicting disease states. However, a fundamental question remains: can comparable performance be achieved without the computational demands of these deep learning-based representations? Recent work suggests that the essential features of cellular identity may be encoded within the statistical properties of gene expression, offering a pathway to understanding cells across different types, tissues, and species.
This perspective has driven the creation of various statistical approaches for analysing scRNA-seq data, alongside foundation models that aim to mirror the success of protein language models in predicting protein structure from sequence. Yet researchers have demonstrated that simple, interpretable pipelines, relying on careful data normalization and linear methods, can attain state-of-the-art or near state-of-the-art results on the benchmarks used to evaluate single-cell foundation models.
In particular, these pipelines even outperformed foundation models on out-of-distribution tasks, successfully analysing novel cell types and organisms not present in the original training data. Establishing whether these models truly learn biological structure, or reflect pre-existing patterns in the data, has proven difficult. Observations that simpler methods can perform well across diverse single-cell analyses have prompted investigation into the complexity of scRNA-seq data itself and the level of representational sophistication needed to capture biologically relevant variation.
Unlike protein sequences, which are discrete and tightly constrained by biophysics, scRNA-seq data are sparse, noisy, and subject to technical variations like dropout effects and batch-specific artifacts. Inspired by these considerations, a systematic comparison was undertaken between simple pipelines and large-scale foundation models on common downstream tasks.
Results indicate that carefully chosen pre-processing and normalization procedures allow for state-of-the-art performance using low-complexity linear representations of gene expression. This performance often surpasses that of foundation models, despite requiring far fewer computational resources and possessing minimal free parameters. These findings suggest that much of the biologically relevant structure within current scRNA-seq benchmarks is already accessible through these simpler representations, implying that evaluation tasks primarily reflect the intrinsic properties of the data rather than the discovery of genuinely new biological insights.
Normalisation, dimensionality reduction and graph construction for single-cell transcriptomic analysis
Single-cell RNA sequencing (scRNA-seq) data was analysed using carefully designed normalization and linear methods to assess performance on established benchmarks. Initially, raw count matrices from scRNA-seq experiments underwent size factor normalization, a process correcting for differing library sizes between cells to ensure fair comparison of gene expression levels.
Following normalization, the data underwent log transformation and scaling to unit variance in preparation for subsequent linear analyses. This pipeline prioritizes interpretability and computational efficiency, contrasting with the deep learning approaches commonly used for generating latent representations of gene expression. A key methodological element involved constructing a nearest neighbour graph for each benchmark dataset.
Specifically, cells were embedded into a principal component analysis (PCA) space, reducing dimensionality while retaining major sources of variation. Then, k-nearest neighbours were identified based on Euclidean distance within this PCA space, establishing relationships between cells and forming the basis for graph-based analyses. By focusing on linear methods and PCA, the work avoids the complexities of non-linear dimensionality reduction techniques, maintaining transparency and facilitating biological interpretation.
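To make the pipeline concrete, a minimal sketch of this kind of workflow is shown below using Scanpy; the parameter values (target library size, number of principal components, number of neighbours) are illustrative assumptions rather than the settings reported by the authors, and the input file name is a placeholder.

```python
import scanpy as sc

# Load a raw count matrix (cells x genes); the file name is a placeholder.
adata = sc.read_h5ad("raw_counts.h5ad")

# Size factor normalization: rescale each cell to a common total count
# so that library-size differences do not distort expression comparisons.
sc.pp.normalize_total(adata, target_sum=1e4)

# Log transformation followed by scaling each gene to unit variance.
sc.pp.log1p(adata)
sc.pp.scale(adata)

# Linear dimensionality reduction with PCA, retaining major sources of variation.
sc.pp.pca(adata, n_comps=50)

# k-nearest-neighbour graph built from Euclidean distances in PCA space.
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X_pca", metric="euclidean")
```

The resulting graph, which Scanpy stores in `adata.obsp`, is what the downstream graph-based analyses operate on.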
Performance was evaluated across four distinct downstream tasks: cross-species cell annotation, discrimination between healthy and infected cells, cell type classification, and extraction of gene-transcription factor (TF) interactions. For cross-species annotation, a transfer learning approach was implemented, utilising labels from a source species to predict cell types in a target species.
Cell type classification employed a simple k-nearest neighbours classifier, leveraging the previously constructed nearest neighbour graph to assign cells to known categories. A linear support vector machine (SVM) was trained to distinguish between healthy and infected cells based on their gene expression profiles. The extraction of gene-TF interactions involved identifying genes whose expression correlated with the activity of known transcription factors, providing insights into regulatory networks.
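A schematic sketch of how two of these evaluations might be set up with scikit-learn is given below; it assumes a PCA embedding `X_pca` together with per-cell cell-type labels and healthy/infected annotations, all generated here as random placeholders, and it is not the authors' evaluation code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Placeholder data standing in for a real PCA embedding and annotations.
rng = np.random.default_rng(0)
X_pca = rng.normal(size=(1000, 50))                       # cells x principal components
cell_type = rng.choice(["T cell", "B cell", "monocyte"], size=1000)
condition = rng.choice(["healthy", "infected"], size=1000)

# Cell type classification with a simple k-nearest-neighbours classifier.
X_tr, X_te, y_tr, y_te = train_test_split(X_pca, cell_type, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
print("cell-type macro F1:", f1_score(y_te, knn.predict(X_te), average="macro"))

# Healthy versus infected discrimination with a linear support vector machine.
X_tr, X_te, c_tr, c_te = train_test_split(X_pca, condition, test_size=0.2, random_state=0)
svm = LinearSVC().fit(X_tr, c_tr)
print("infection macro F1:", f1_score(c_te, svm.predict(X_te), average="macro"))
```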
scTOP facilitates accurate cross-species spermatogenesis cell-type annotation rivalling complex foundation models
The research demonstrates that a simple linear method, scTOP, achieves cross-species cell-type annotation with macro F1 scores comparable to those obtained by the TranscriptFormer foundation models. Specifically, scTOP attained F1 scores exceeding 0.6 on several species pairings within the spermatogenesis dataset, a performance level previously achieved only by complex, parameter-rich models.
Detailed analysis of the transfer matrix reveals scTOP’s ability to accurately classify testis cell types across mammals, mirroring the performance of TF-Exemplar and TF-Metazoa. Yet scTOP went further: it consistently outperformed the foundation models when transferring knowledge between humans and other organisms, achieving F1 scores above 0.5 where the foundation models struggled.
For instance, scTOP successfully classified chimpanzee, rhesus, and marmoset cell types with high accuracy, despite evolutionary distances of up to 40 million years. A closer look at the data shows that the number of genes retained after restricting datasets to orthologous genes varied markedly: the human dataset, which initially contained 34,168 genes, was reduced to approximately 14,000 shared orthologs across species.
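In practice, restricting two species' datasets to shared orthologs is a gene-matching step; a minimal pandas sketch is shown below, using a hypothetical one-to-one ortholog table whose file name and column names are placeholders rather than the authors' resources.

```python
import pandas as pd

# Hypothetical one-to-one ortholog table with "human" and "chimp" gene columns
# (for example, exported from a resource such as Ensembl BioMart).
orthologs = pd.read_csv("human_chimp_orthologs.csv")

def restrict_to_orthologs(adata_human, adata_chimp, orthologs):
    """Subset both AnnData objects to shared orthologous genes, in matched order."""
    table = orthologs[
        orthologs["human"].isin(adata_human.var_names)
        & orthologs["chimp"].isin(adata_chimp.var_names)
    ]
    return adata_human[:, table["human"].values], adata_chimp[:, table["chimp"].values]
```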
Despite the information lost in this restriction, scTOP’s performance remained competitive, suggesting that the key biological signals are preserved within the reduced gene set. The method constructs a reference basis from normalized pseudo-bulk expression profiles, averaging the expression of cells that share a source label to mitigate the noise in scRNA-seq data.
The normalization procedure converts mRNA counts to z-scores reflecting gene rank ordering within each cell; because every cell is normalized independently, the procedure effectively eliminates batch effects. Classification then relies on projecting target cell expression profiles onto this established basis and assigning labels based on the largest linear projection, a process requiring no free parameters. Beyond cross-species transfer, much of the biologically relevant structure in scRNA-seq benchmarks is accessible through these low-complexity linear representations, challenging the necessity of complex foundation models for capturing cell identity.
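The rank-based normalization and projection rule described above can be illustrated with a small NumPy/SciPy sketch; the rank-to-z-score transform, the pseudo-bulk reference basis, and the largest-projection assignment follow the description in the text, but details such as tie handling and gene filtering are simplifications, and this is not the published scTOP implementation.

```python
import numpy as np
from scipy.stats import norm, rankdata

def rank_zscore(counts):
    """Convert each cell's counts to z-scores reflecting gene rank ordering.

    counts: (cells x genes) array. Each cell is transformed independently,
    so no information is shared across cells or batches.
    """
    n_genes = counts.shape[1]
    ranks = rankdata(counts, axis=1)             # ranks 1..n_genes, ties averaged
    quantiles = (ranks - 0.5) / n_genes          # map ranks into (0, 1)
    return norm.ppf(quantiles)                   # rank-based z-scores

def build_basis(counts, labels):
    """Reference basis: one normalized pseudo-bulk profile per source label."""
    z = rank_zscore(counts)
    labels = np.asarray(labels)
    basis = {}
    for lab in np.unique(labels):
        profile = z[labels == lab].mean(axis=0)  # average cells sharing a label
        basis[lab] = profile / np.linalg.norm(profile)
    return basis

def annotate(counts, basis):
    """Assign each target cell the label with the largest linear projection."""
    z = rank_zscore(counts)
    names = list(basis)
    B = np.stack([basis[n] for n in names])      # (labels x genes)
    scores = z @ B.T                             # (cells x labels) projections
    return [names[i] for i in scores.argmax(axis=1)]
```

In a cross-species setting, the basis would be built from the labelled source-species cells and `annotate` applied to target-species cells, after both datasets are restricted to shared orthologs.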
Linear models rival complex architectures in single-cell transcriptomic analysis
Scientists are beginning to realise that bigger isn’t always better when it comes to analysing single-cell data. For years, the field has raced towards ever more complex foundation models, employing techniques borrowed from natural language processing to decode the hidden structure of gene expression. These models, often reliant on computationally expensive transformer architectures, promised to unlock a deeper understanding of cellular identity and function.
However, recent work demonstrates that surprisingly simple linear methods can achieve comparable, and in some cases superior, performance on standard benchmarks. This is not simply a story of parsimony triumphing over complexity; it also exposes a critical need for more rigorous evaluation of these foundation models. Once hailed as the future of single-cell analysis, the models now appear to owe their advantage more to their capacity to consume data than to their ability to extract genuinely new biological insight.
By focusing on careful data normalisation and utilising established linear techniques like Principal Component Analysis and Linear Discriminant Analysis, researchers have shown that the core information defining cell types is often readily accessible without resorting to elaborate computational machinery. The implications extend beyond a mere methodological preference.
The ability to accurately classify cells, predict disease states, and even extrapolate findings across species doesn’t necessarily require massive computational resources. This opens the door to wider accessibility, allowing researchers with limited computing power to participate in this rapidly evolving field. The simple pipelines also excel at handling data from novel cell types and organisms, a common stumbling block for many foundation models trained on restricted datasets.
Several avenues warrant exploration. Beyond refining these linear pipelines, attention should turn to understanding why these simpler methods perform so well. The underlying reasons remain unclear, and further investigation into the inherent structure of single-cell gene expression data is needed. Instead of solely pursuing ever-larger models, the field might benefit from a renewed focus on data quality, careful experimental design, and a deeper appreciation for the biological signals already present within the data itself.
👉 More information
🗞 Parameter-free representations outperform single-cell foundation models on downstream benchmarks
🧠 ArXiv: https://arxiv.org/abs/2602.16696
