Researchers have developed a novel bioinformatics tool to accurately estimate recent effective population size, a crucial parameter in understanding evolutionary processes. Chris C R Smith from Indiana University, leading the work, presents LinkedNN, a neural network model that automatically calculates linkage disequilibrium-related features based on the genomic distance between polymorphisms. This innovative approach surpasses the performance of current methods and those relying on summary statistics, achieving greater accuracy with fewer sequenced individuals and variant sites. The significance of LinkedNN lies in its potential to advance molecular ecology applications, particularly when dealing with sparse and unphased data, offering a powerful resource for population genetic studies. The program is freely available as a Python package and open-source code for wider use and further development.
LinkedNN estimates recent effective population size from patterns of linkage disequilibrium (LD), the non-random association of alleles at different locations in the genome. It uses a neural network to automatically compute features related to LD decay as a function of the genomic distance between polymorphisms, which the authors present as an advance over existing techniques.
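To make the idea of linkage disequilibrium concrete, the sketch below computes the squared correlation between genotype counts at two SNPs, a common pairwise summary of LD for unphased data. It is an illustration of the concept only: LinkedNN learns its features rather than computing this statistic, and the function name and toy genotypes are hypothetical.

```python
import math

def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts).

    Illustrative only; not part of LinkedNN, which learns LD features itself.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("genotype vectors must be non-empty and equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return float("nan")  # a monomorphic site carries no LD information
    return cov * cov / (var_x * var_y)

# Toy data (made up): minor-allele counts for six individuals at three SNPs.
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [0, 1, 2, 1, 0, 2]  # identical to snp_a: complete association
snp_c = [2, 0, 1, 1, 2, 0]  # partially associated with snp_a

r2_ab = genotype_r2(snp_a, snp_b)
r2_ac = genotype_r2(snp_a, snp_c)
```

In real data this correlation tends to fall as the physical distance between the two SNPs grows, and that decay is the signal the article describes LinkedNN as modelling.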
The research addresses a critical need for tools capable of extracting these features directly from polymorphism data, a process historically laborious and requiring manual parameterization. LinkedNN performs well even with limited data, making it valuable for molecular ecology applications where genomic information is often sparse and unphased. A core innovation lies in a novel neural network architecture that learns linkage disequilibrium-related features directly from single nucleotide polymorphisms (SNPs) as a function of genomic distance.
Unlike convolutional neural networks, which struggle to capture long-range correlations in genomic data, LinkedNN employs a specialised “LD layer” that efficiently processes SNP pairs across a continuum of genomic distances. This is achieved through a unique sampling strategy and the use of radial basis functions to transform inter-SNP distances, allowing the network to represent genomic features with greater precision.
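A radial basis expansion of inter-SNP distance can be sketched as follows. The article does not give the functional form, so the Gaussian basis, the log10 distance scale, the evenly spaced centres, and the width value are all assumptions made for illustration.

```python
import math

def make_centers(min_bp, max_bp, n_centers):
    """Evenly spaced centres on a log10 scale (assumed layout)."""
    lo, hi = math.log10(min_bp), math.log10(max_bp)
    step = (hi - lo) / (n_centers - 1)
    return [lo + k * step for k in range(n_centers)]

def rbf_features(distance_bp, centers, width):
    """Gaussian radial basis expansion of log10(distance).

    Each output is close to 1 when the distance is near that centre and
    decays smoothly towards 0 further away, so no hard distance bins are needed.
    """
    if distance_bp <= 0:
        raise ValueError("distance must be positive")
    x = math.log10(distance_bp)
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]

# Hypothetical settings: 7 centres spanning 100 bp to 100 Mb, width 0.5 log10 units.
centers = make_centers(1e2, 1e8, 7)
features = rbf_features(1e5, centers, width=0.5)
peak_index = features.index(max(features))
```

Because neighbouring basis functions overlap, a small change in distance produces a small change in the feature vector, in contrast to assigning each SNP pair to a single fixed bin.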
By automating feature extraction, the tool bypasses the need for arbitrary binning of SNP pairs, a common limitation of traditional approaches. Evaluations on simulated data demonstrate that LinkedNN consistently outperforms both current convolutional neural networks and summary statistic-based regression tools in estimating recent effective population size.
The open-source program, readily installable as a Python package, and its associated code are publicly available, facilitating wider adoption and further development within the scientific community.
Sixteen of the 64 coefficients exhibited maximum values between 5 × 10⁵ and 5 × 10⁶ base pairs, highlighting this range as particularly important for the analysis. These distances exceed the mean spacing between consecutive SNPs, approximately 2 × 10⁴ base pairs, suggesting the model effectively utilizes information from more distant SNP pairs than traditional convolutional neural networks with smaller kernel sizes.
Applying the LD layer to publicly available harbor porpoise data, the study estimated a recent effective population size (Ne) of 1,411, with a range of 1,119 to 1,659 across 100 repetitions.
The population size change was inferred to have occurred 42.1 generations ago (with a range of 28.8 to 53.2), corresponding to approximately 501.0 years, assuming a generation time of 11.9 years. The older Ne was estimated at 5,921 (ranging from 5,273 to 6,867). This inferred Ne is smaller than previous estimates obtained using sequentially Markovian coalescent-based methods, which may struggle with very recent demographic history.
The method begins by strategically sub-sampling pairs of single nucleotide polymorphisms (SNPs), acknowledging that the total number of possible pairs increases quadratically with the number of SNPs. To efficiently survey genomic distances, log-uniform index jumps, denoted as ∆i, are drawn between ordered polymorphisms, creating pairs with roughly log-uniform physical distances, d. Genotypes are then encoded as the count of the minor allele for each individual, providing input to a shared-weights, position-wise layer.
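The sampling and encoding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the way the first SNP of each pair is chosen, the rounding of the jump, and the diploid default are all hypothetical.

```python
import math
import random

def sample_snp_pairs(positions, n_pairs, rng):
    """Sample (i, j, distance) triples with roughly log-uniform index jumps.

    `positions` must be sorted SNP coordinates in base pairs.
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two SNPs")
    max_jump = n - 1
    pairs = []
    for _ in range(n_pairs):
        # Uniform draw on the log scale, floored to an integer jump in [1, max_jump].
        jump = int(math.exp(rng.uniform(0.0, math.log(max_jump + 1))))
        jump = min(max(jump, 1), max_jump)
        i = rng.randrange(0, n - jump)  # assumed: start SNP chosen uniformly
        j = i + jump
        pairs.append((i, j, positions[j] - positions[i]))
    return pairs

def minor_allele_counts(alt_counts, ploidy=2):
    """Recode per-individual alternate-allele counts as minor-allele counts."""
    total_alleles = ploidy * len(alt_counts)
    if 2 * sum(alt_counts) > total_alleles:
        # The alternate allele is the major allele here, so flip the coding.
        return [ploidy - g for g in alt_counts]
    return list(alt_counts)

# Toy example: 500 SNPs at random positions on a hypothetical 10 Mb chromosome.
rng = random.Random(42)
positions = sorted(rng.sample(range(1, 10_000_000), 500))
pairs = sample_snp_pairs(positions, 1000, rng)
recoded = minor_allele_counts([2, 2, 1, 2])
```

Drawing the jump on a log scale means short and long separations are sampled with similar frequency per order of magnitude, instead of the pair set being dominated by one distance scale.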
This layer extracts initial features from the genotype data, irrespective of genomic position, utilising 64 output features and rectified linear unit activation for all trainable layers except the final one. Subsequently, features from each SNP pair are combined and processed through two additional layers to compute preliminary genetic features, g. This approach circumvents limitations of convolutional neural networks (CNNs), which struggle with the non-uniform distribution of SNPs along chromosomes and can erode genotype information in deeper layers.
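The shared-weights, position-wise idea can be illustrated with a small pure-Python sketch. The article states that the real implementation uses PyTorch, with 64 output features and ReLU activations; everything else here (the weight initialisation, the helper names, and combining the two SNPs by concatenation) is an assumption for illustration.

```python
import random

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """Affine map y = W x + b, with W stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (hypothetical initialisation)."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

def pair_features(geno_i, geno_j, layer):
    """Apply the SAME layer to both SNPs, then combine the results.

    Concatenation is an assumption; the article only says the features
    from each SNP pair are combined.
    """
    weights, bias = layer
    feat_i = relu(dense(geno_i, weights, bias))
    feat_j = relu(dense(geno_j, weights, bias))
    return feat_i + feat_j

# Toy example: 10 individuals, 64 output features per SNP as in the article.
rng = random.Random(0)
n_individuals = 10
layer = init_layer(n_individuals, 64, rng)
geno_i = [rng.randint(0, 2) for _ in range(n_individuals)]
geno_j = [rng.randint(0, 2) for _ in range(n_individuals)]
combined = pair_features(geno_i, geno_j, layer)
```

Because the weights are shared, the same genotype vector yields the same features wherever the SNP sits on the chromosome, which is what "irrespective of genomic position" means here.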
By directly modelling LD decay as a function of genomic distance, the network captures correlations between SNPs across a continuum of genomic distances, offering a more nuanced representation of LD patterns than grid-based methods. The entire system is implemented using the PyTorch framework, facilitating flexible model development and training.
Researchers have long struggled to accurately estimate the effective population size of species, a crucial metric for conservation efforts and understanding evolutionary processes. Traditional methods rely on extensive genomic data and complex statistical modelling, often proving impractical when dealing with sparse or incomplete datasets common in molecular ecology. LinkedNN represents a step forward by leveraging neural networks to infer population size from linkage disequilibrium with remarkable efficiency. What distinguishes LinkedNN is its ability to deliver robust estimates even with limited data, outperforming existing approaches that demand larger sample sizes and fully phased genomic information.
This expands the scope of population genetic studies to taxa and situations previously considered intractable. However, the reliance on linkage disequilibrium introduces inherent limitations. The accuracy of the method is still tied to the underlying assumptions about the genome and the evolutionary history of the species being studied. Furthermore, while the neural network architecture demonstrates impressive performance on simulated data, validation with real-world datasets is essential to confirm its generalizability. Future work might explore incorporating additional genomic features or developing adaptive training strategies to account for the unique characteristics of different populations and species.
👉 More information
🗞 LinkedNN: a neural model of linkage disequilibrium decay for recent effective population size inference
🧠 ArXiv: https://arxiv.org/abs/2602.13121
