Single-cell Bio Foundation Models: A beginner’s overview

Mariana Quiroga Londoño, Ph.D.
Helical
Published in
4 min read · Aug 5, 2024

Imagine being able to decode the unique molecular blueprint of every single cell in the human body, unveiling the mysteries of our biology at a remarkable level of detail. This exciting advancement is becoming possible through the integration of AI across various domains, including molecular biology.

In this short post, we will give you an overview of the most promising open-source single-cell foundation models that you should test and integrate into your research!

DALL·E 2’s vision of single-cell bio foundation models

Challenges in single-cell RNA-seq analyses

If the Human Genome Project provided us with the book of life, single-cell analyses show us how each cell reads this book. These analyses shed light on the roles of individual cells in development, disease progression, and response to treatments. However, the high-dimensional and large-scale nature of single-cell data presents significant analytical challenges. Researchers face hurdles in integrating and interpreting vast datasets, extracting meaningful features, and dealing with technical differences between experiments (batch effects) that can obscure true biological signals. Overcorrecting for batch effects can be equally problematic, as it may eliminate genuine biological variation.
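To make the overcorrection risk concrete, here is a minimal toy sketch in plain Python. All numbers are invented for illustration, and per-batch mean centering is used only as a deliberately naive stand-in for a correction method; it is not how any particular tool works. The setup assumes one gene, two cell types whose true expression differs by 4, and a second batch that adds a constant shift of +3 and happens to contain mostly type B cells.

```python
from statistics import mean

# Toy data (all values made up): one gene, two cell types, two batches.
# Assumed biology: type A expresses 2, type B expresses 6 (true gap = 4).
# Assumed technical effect: batch 2 adds +3 to every measurement.
cells = [
    # (batch, cell_type, measured_expression)
    (1, "A", 2.0), (1, "A", 2.0), (1, "A", 2.0), (1, "B", 6.0),
    (2, "A", 5.0), (2, "B", 9.0), (2, "B", 9.0), (2, "B", 9.0),
]


def type_gap(rows):
    """Mean expression of type B minus mean expression of type A."""
    a = mean(x for _, t, x in rows if t == "A")
    b = mean(x for _, t, x in rows if t == "B")
    return b - a


def center_per_batch(rows):
    """Naive correction: subtract each batch's mean from its own cells."""
    batches = {b for b, _, _ in rows}
    batch_means = {b: mean(x for bb, _, x in rows if bb == b) for b in batches}
    return [(b, t, x - batch_means[b]) for b, t, x in rows]


raw_gap = type_gap(cells)                          # inflated by the batch shift
corrected_gap = type_gap(center_per_batch(cells))  # shrunk below the true gap

print(f"true gap: 4.0, raw gap: {raw_gap}, after centering: {corrected_gap}")
```

In this toy setting the uncorrected gap between cell types is too large (5.5), because the batch shift is mixed in, while the naively corrected gap is too small (3.0), because batch composition and cell type are confounded and centering removes part of the real difference along with the technical one.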

Single-cell foundation models are well positioned to address those challenges.

What are single-cell Foundation Models?

Foundation models are large-scale AI models pre-trained on vast amounts of data, which can then be fine-tuned for a variety of specific tasks. Unlike traditional supervised learning models that rely on labeled datasets, foundation models use self-supervised learning, where the model learns to predict parts of the data from other parts without the need for manual labels. This allows them to capture complex patterns and features in sequencing data, making them adaptable to numerous applications. They are particularly useful where annotated datasets are scarce, for example for early developmental stages, rare species, and diseases. As the name suggests, single-cell foundation models are pre-trained on unlabeled single-cell data.
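The idea of "predicting parts of the data from other parts" can be sketched in a few lines of plain Python. This is only an illustration of how a training example can be built from an unlabeled expression profile by hiding some values; the function name `make_masked_example`, the `MASK` placeholder, and the counts are hypothetical, and the actual pre-training objectives of the models discussed in this post differ in their details.

```python
import random

MASK = None  # hypothetical placeholder marking a hidden expression value


def make_masked_example(profile, mask_fraction=0.15, seed=0):
    """Turn one unlabeled profile into an (inputs, targets) training pair.

    profile: list of (gene, value) tuples for a single cell.
    Returns the profile with some values hidden, plus the hidden values
    that a model would be trained to predict. No manual labels are needed.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(profile) * mask_fraction))
    hidden = set(rng.sample(range(len(profile)), n_mask))
    inputs = [
        (gene, MASK if i in hidden else value)
        for i, (gene, value) in enumerate(profile)
    ]
    targets = {profile[i][0]: profile[i][1] for i in hidden}
    return inputs, targets


# One made-up cell: gene symbols are real, counts are invented.
profile = [("CD3E", 12), ("CD4", 7), ("CD8A", 0), ("MS4A1", 0),
           ("CD19", 0), ("NKG7", 3), ("LYZ", 1), ("GAPDH", 40)]

inputs, targets = make_masked_example(profile, mask_fraction=0.25, seed=0)
print("model sees:  ", inputs)
print("must predict:", targets)
```

The supervision signal comes entirely from the data itself: the hidden values act as the "labels", which is why such models can be pre-trained on very large unannotated collections.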

Overview of some of the most promising open-source models

  • Geneformer: a foundation transformer model pre-trained on Genecorpus-30M, a corpus comprising approximately 30 million single-cell transcriptomes from a broad range of human tissues. This open-source model is released under the Apache-2.0 license and seems to perform best on most benchmarks.
  • UCE (Universal Cell Embeddings): this model creates a universal biological representation space for cells, using a self-supervised learning approach on cell atlas data from diverse species. UCE creates an atlas of over 36 million cells, with more than 1,000 uniquely named cell types, from hundreds of experiments, dozens of tissues, and eight species. As the only open-source single-cell foundation model (MIT license) trained on multiple species, it is particularly interesting for tasks such as cross-species integration.
  • scGPT: pre-trained on data from over 33 million human cells under non-disease conditions, encompassing a wide range of cell types from 51 organs or tissues and 441 studies. This open-source model is available under the MIT license.

Addressing current limitations

While these models show great promise, they are often scattered across separate GitHub repositories, and users need to delve deeply into the accompanying literature to use them effectively. Additionally, integrating these models into existing workflows, adapting them to specific applications, and ensuring compatibility with various data formats can be challenging.

Helical’s open-source package aims to simplify this by providing standardized tools and resources.

What’s next?

As new models emerge, the need for standardization and benchmarking increases. The future potential of these models is vast, including a multimodal understanding of complex biological systems and insights into molecular mechanisms across species. Beyond cell type classification, applications such as identifying biomarkers and finding novel drug targets are just the beginning of what could be achieved.

How to get started

To get started quickly, it is easiest to look at existing example notebooks for specific use cases and go from there. In this free Google Colab notebook and quick-start Python tutorial, for example, we compare two leading RNA Foundation Models for cell type classification: Geneformer and UCE.
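For readers who want a feel for what such a comparison involves before opening the notebook, the sketch below shows the general pattern in plain Python: take cell embeddings from two models, fit the same simple classifier on each, and compare accuracy on held-out cells. It does not use the Helical package or the real models. The embeddings are tiny invented 2-D vectors, the names `embeddings_a` and `embeddings_b` are hypothetical stand-ins, and a nearest-centroid classifier is assumed purely for simplicity.

```python
from math import dist
from statistics import mean


def fit_centroids(vectors, labels):
    """Average the training embeddings of each cell type."""
    by_label = {}
    for vec, lab in zip(vectors, labels):
        by_label.setdefault(lab, []).append(vec)
    return {lab: [mean(dim) for dim in zip(*vecs)]
            for lab, vecs in by_label.items()}


def predict(centroids, vec):
    """Assign the cell type whose centroid is closest to the embedding."""
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))


def accuracy(train_x, train_y, test_x, test_y):
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


train_labels = ["T cell", "T cell", "B cell", "B cell"]
test_labels = ["T cell", "B cell"]

# Invented embeddings standing in for the output of two different models.
embeddings_a = {  # cell types well separated
    "train": [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]],
    "test": [[0.5, 0.5], [5.5, 5.5]],
}
embeddings_b = {  # cell types overlapping
    "train": [[0.0, 0.0], [2.0, 2.0], [1.0, 0.0], [3.0, 2.0]],
    "test": [[2.0, 1.5], [2.5, 1.0]],
}

acc_a = accuracy(embeddings_a["train"], train_labels,
                 embeddings_a["test"], test_labels)
acc_b = accuracy(embeddings_b["train"], train_labels,
                 embeddings_b["test"], test_labels)
print(f"model A embeddings: {acc_a:.2f}, model B embeddings: {acc_b:.2f}")
```

Keeping the classifier and the train/test split identical across models means any difference in accuracy reflects the embeddings rather than the downstream setup. A real comparison would use far more cells, real embeddings, and more than one metric.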

About Helical

Helical is an open-core platform for computational biologists and data scientists to effortlessly integrate single-cell & genomics AI Bio Foundation Models in early-stage drug discovery.

Follow or subscribe to stay up-to-date with the latest developments in Bio Foundation Models.

https://www.helical-ai.com/

Mariana Quiroga Londoño, Ph.D.
Computational Biologist @ Helical. Bridging Biology and AI!