PixCell, a diffusion-based generative model trained on a large histopathology dataset, synthesises realistic images applicable to cancer research. These synthetic images facilitate data augmentation, privacy-preserving data sharing, and virtual staining, even inferring molecular marker results from standard H&E staining. The trained models are publicly available.
The increasing availability of digitised histological slides presents both an opportunity and a challenge for cancer research. While these datasets facilitate detailed analysis, annotated data remains limited and data sharing is often constrained by regulatory hurdles. Researchers are now exploring generative artificial intelligence models to address these issues, synthesising realistic images to augment existing datasets and potentially infer information not directly observable. A collaborative team, comprising Srikar Yellapragada, Alexandros Graikos, Zilinghan Li, Kostas Triaridis, Varun Belagali, Saarthak Kapse, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Tahsin Kurc, Rajarsi R Gupta, Joel Saltz, and Dimitris Samaras, details its development of PixCell, a diffusion-based generative foundation model for digital histopathology, in a paper of the same name.
Diffusion Model Generates Synthetic Histology Slides, Addressing Data Challenges in Cancer Research
The increasing digitisation of pathology slides is generating substantial datasets with potential for advancing cancer diagnosis and research. However, limitations in annotated data, alongside data privacy concerns, present significant obstacles. Researchers have developed PixCell, a diffusion-based generative foundation model, to address these challenges. Trained on the PanCan-30M dataset – comprising 69,184 haematoxylin and eosin (H&E)-stained whole slide images (WSIs) representing multiple cancer types – PixCell offers a novel approach to data augmentation and analysis.
Diffusion models function by learning to reverse a process that gradually adds noise to data, ultimately enabling the generation of new, realistic samples. PixCell employs a progressive training strategy and self-supervision – a technique where the model learns from the inherent structure of the data without requiring manual labels – to facilitate large-scale training. This eliminates the need for extensive, costly annotation. The model generates diverse, high-quality images across various cancer types and can serve as a substitute for real data when training other machine learning models.
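The noising process that a diffusion model learns to reverse can be illustrated with a minimal sketch. This is a generic DDPM-style example, not PixCell's actual configuration: the linear noise schedule, the step count, the function names, and the random 8×8 array standing in for an image patch are all assumptions made for illustration. In a real model, a trained network would supply the noise estimate; here the true noise is passed back in to show that the step is invertible once the noise is known.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample a noised version x_t of x0 at step t in closed form; returns (x_t, noise)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def recover_x0(x_t, t, alpha_bar, predicted_noise):
    """Invert the noising step given a noise estimate (a trained network's job in practice)."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))        # hypothetical stand-in for an image patch
x_t, noise = add_noise(x0, 500, alpha_bar, rng)  # halfway through the schedule
x0_hat = recover_x0(x_t, 500, alpha_bar, noise)  # uses the true noise as an oracle
```

Training amounts to teaching a network to predict `noise` from `x_t` and `t`; generation then starts from pure noise and applies the learned estimate step by step.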
Synthetic images generated by PixCell also offer advantages for data sharing. Compared with clinical images, they present fewer regulatory hurdles, which could foster collaboration and accelerate research. The researchers validated PixCell’s capabilities through several applications. Mask-guided image generation, in which specific areas of the image are targeted for modification, facilitates data augmentation. This improves performance on tasks such as cell segmentation (the automated identification of cells within an image) and can enhance the accuracy of diagnostic algorithms.
Furthermore, PixCell infers immunohistochemistry (IHC) staining from H&E images. IHC uses antibodies to detect specific proteins in tissue samples, providing crucial molecular information. PixCell leverages the structural information within the routinely used H&E staining to predict molecular marker expression, potentially reducing the need for costly and time-consuming IHC testing.
The PanCan-30M dataset underpinning PixCell comprises WSIs from a variety of sources, encompassing a broad range of organs and cancer types, including lung, kidney, colon, breast, liver, pancreas, prostate, skin, thyroid, and uterus. The data originates from established resources such as The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project, which supports broad coverage and generalisability.
The publicly released trained models accelerate research in computational pathology, providing a valuable resource for the scientific community. By addressing the challenges of data scarcity and privacy, PixCell empowers researchers to unlock the full potential of digital pathology and advance cancer research.
👉 More information
🗞 PixCell: A generative foundation model for digital histopathology images
🧠 DOI: https://doi.org/10.48550/arXiv.2506.05127
