gLM2 and Multimodal Foundational Models for Genomics

The Open MetaGenomic (OMG) corpus is a new, massive genomic dataset with the potential to transform bioinformatics. Here I explain what makes OMG special and how it’s being used to train gLM2, a powerful new genomic language model that could be a step toward holistic computational modeling of life.

Picture generated by the author via ChatGPT, with custom hand-made edits.

The OMG Corpus

The OMG corpus is a collection of metagenomic sequences, that is, DNA segments sequenced directly from environmental samples such as soil or the human gut. Unlike curated databases such as UniProt, metagenomic data captures the immense diversity of microbial life, including many organisms that have never been cultivated in a lab. This makes it an ideal source for training a foundational model that can “understand” biology in a more comprehensive way.

Assembling the OMG corpus was a complex undertaking due to the nature of metagenomic data. The researchers had to overcome several challenges:

  • Accessibility: Metagenomic data is spread across different repositories, making it difficult to download and compile.
  • Preprocessing: Raw metagenomic sequences need extensive cleanup, including gene calling (identifying protein-coding segments) and quality filtering to remove errors.
  • Deduplication: Metagenomic datasets often contain duplicates and…
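To make the preprocessing and deduplication steps more concrete, here is a minimal sketch in Python. It is not the pipeline used to build OMG: the function names, the thresholds, and the choice of exact-match deduplication (treating a sequence and its reverse complement as the same) are my own assumptions for illustration, and gene calling is left out because it requires a dedicated tool.

```python
import hashlib

VALID_BASES = set("ACGT")
# Translation table for complementing DNA; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def passes_quality(seq, min_length=50, max_ambiguous_fraction=0.05):
    """Hypothetical filter: keep sequences that are long enough and mostly unambiguous."""
    if len(seq) < min_length:
        return False
    ambiguous = sum(1 for base in seq if base not in VALID_BASES)
    return ambiguous / len(seq) <= max_ambiguous_fraction


def canonical_form(seq):
    """Return the lexicographically smaller of a sequence and its reverse complement."""
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    return min(seq, reverse_complement)


def clean_and_deduplicate(records, min_length=50, max_ambiguous_fraction=0.05):
    """Filter (name, sequence) pairs by quality, then drop exact duplicates.

    Illustrative thresholds only; real pipelines tune these per dataset.
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if not passes_quality(seq, min_length, max_ambiguous_fraction):
            continue
        digest = hashlib.sha256(canonical_form(seq).encode()).hexdigest()
        if digest in seen:
            continue  # same sequence (or its reverse complement) already kept
        seen.add(digest)
        kept.append((name, seq))
    return kept


if __name__ == "__main__":
    toy_records = [
        ("a", "ACGTACGTAC"),
        ("b", "acgtacgtac"),   # same as "a" once upper-cased
        ("c", "GTACGTACGT"),   # reverse complement of "a"
        ("d", "ACGNNNNNNN"),   # too many ambiguous bases
        ("e", "ACG"),          # too short
    ]
    for name, seq in clean_and_deduplicate(toy_records, min_length=5):
        print(name, seq)
```

Exact matching like this only catches identical sequences. Near-duplicates, which are common in metagenomic data, would need a similarity-based approach such as clustering.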


Published in Advances in biological science



Written by LucianoSphere (Luciano Abriata, PhD)

