gLM2 and Multimodal Foundational Models for Genomics
The Open MetaGenomic (OMG) corpus is a large new genomic dataset built for training genomic language models. Here I explain what makes OMG special and how it’s being used to train gLM2, a new genomic language model that could be a step toward holistic computational modeling of life.
The OMG Corpus
The OMG corpus is a collection of metagenomic sequences: DNA segments extracted directly from environmental samples such as soil or the human gut. Unlike curated databases such as UniProt, metagenomic data captures the immense diversity of microbial life, including many organisms that have never been cultivated in a lab. This makes it an ideal source for training a foundational model that can “understand” biology in a more comprehensive way.
Assembling the OMG corpus was a complex undertaking due to the nature of metagenomic data. The researchers had to overcome several challenges:
- Accessibility: Metagenomic data is spread across different repositories, making it difficult to download and compile.
- Preprocessing: Raw metagenomic sequences need extensive cleanup, including gene calling (identifying protein-coding segments) and quality filtering to remove errors.
- Deduplication: Metagenomic datasets often contain duplicates and…
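To make the quality-filtering and deduplication steps listed above more concrete, here is a minimal Python sketch. It is not the OMG pipeline: the function names, the length and ambiguous-base thresholds, and the use of exact-match hashing for deduplication are all assumptions chosen for illustration. Gene calling is left out because it depends on external tools.

```python
from hashlib import sha256


def passes_quality(seq, min_len=100, max_n_frac=0.01):
    """Keep a sequence only if it is long enough and has few ambiguous bases.

    min_len and max_n_frac are hypothetical thresholds, not values from OMG.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    return seq.count("N") / len(seq) <= max_n_frac


def dedup_exact(seqs):
    """Drop exact duplicates (case-insensitive), keeping the first occurrence.

    Exact matching is a simple stand-in; it will not catch near-duplicates.
    """
    seen = set()
    kept = []
    for s in seqs:
        key = sha256(s.upper().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def preprocess(seqs, min_len=100, max_n_frac=0.01):
    """Quality-filter first, then deduplicate what remains."""
    filtered = [s for s in seqs if passes_quality(s, min_len, max_n_frac)]
    return dedup_exact(filtered)


if __name__ == "__main__":
    contigs = ["ACGT" * 30, "acgt" * 30, "ACGT" * 10, "N" * 120, "GGCC" * 30]
    print(len(preprocess(contigs)))
```

Filtering before deduplicating keeps the set of hashes small, which matters when the input is many times larger than what survives the quality checks.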