Genetic and genomic information overload and databases

Published in Discovery Matters
3 min read · Oct 28, 2022

In today’s biopharma industry, we have a lot of information at our fingertips, so how do we make sense of it all? The answer: genetic and genomic databases.

In this week’s episode of Discovery Matters, Conor and Dodi spoke to two experts who are making sense of this information overload by creating genetic and genomic databases.

One of those experts was Dr Artem Babaian, a computational biologist and now Assistant Professor leading The Laboratory for RNA-Based Lifeforms at the University of Toronto. Artem explained how he and his team uncovered 100 000 novel viruses in old genetic data that could help us predict future pandemics.

Let’s dive into the data

During the COVID-19 pandemic, Artem, his colleague Jeff Taylor, and a dream team of bioinformaticians set out to mine old genetic data, uncovering 100 000 novel viruses that could help scientists predict future pandemics.

The team, formed purely of volunteers, decided that it was their job as scientists to give back and fight the pandemic in any way they could. At the beginning, Artem reached out to these individuals and, in the spirit of collaboration, they jumped on board.

In Artem’s words, the team’s objective was not “making a bunch of money, or anything like that, or commercializing it. We were trying to help like it was a war effort, right. A lot of scientists switched their labs to COVID-19. It was like this is a huge societal problem and this was our part.”

The team did 2 000 years’ worth of computing in around 11 days. It centralized $3.9 to $14.9 billion worth of sequencing data, generated over the previous 13 years from laboratory samples across the planet, environmental samples from forests, and other strange places. All of this data was already freely available for any scientist to use and reuse, but the team has now brought it together in one place. Within this database you can find all sorts of oddities, including anal swabs of penguins in Antarctica.
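To get a feel for what squeezing 2 000 years of computing into 11 days implies, here is a quick back-of-the-envelope sketch. It assumes that "2 000 years' worth of computing" means 2 000 years on a single processor and that the work was spread evenly across the 11 days. Both are assumptions made for illustration, and the function name is hypothetical rather than anything from the team's own tooling.

```python
DAYS_PER_YEAR = 365.25  # average calendar year, including leap days


def implied_parallelism(compute_years: float, wall_clock_days: float) -> float:
    """Average number of processors that would need to run at the same time.

    Assumes the workload is perfectly divisible and evenly spread,
    which real pipelines only approximate.
    """
    return compute_years * DAYS_PER_YEAR / wall_clock_days


# Figures quoted in the article: 2 000 compute-years in around 11 days.
average_processors = implied_parallelism(2_000, 11)
print(f"about {average_processors:,.0f} processors running at once")
```

Under these assumptions the answer lands in the tens of thousands of processors working side by side, which is why a job like this is only practical on large, rented computing capacity rather than a single lab machine.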

“At the moment, we’re both sensitivity scarce and data scarce. But, we’re on track to characterize over 100 million RNA viruses by 2030. This is what hyper exponential growth looks like. It was 15 000 in 2020, 145 000 in 2021, and then 2030 I want to hit 100 million. And that might actually make a real dent in our virome.” – Artem
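The numbers in that quote can be turned into a rough growth rate. The sketch below compares the jump from 2020 to 2021 with the average yearly multiplier needed to get from 145 000 viruses in 2021 to 100 million in 2030. It assumes a constant year-on-year multiplier, which is a simplification (Artem describes the growth as hyper exponential), and the function name is hypothetical.

```python
def annual_growth_factor(start_count: float, end_count: float, years: int) -> float:
    """Constant yearly multiplier that turns start_count into end_count."""
    return (end_count / start_count) ** (1 / years)


# Counts quoted in the article.
viruses_2020 = 15_000
viruses_2021 = 145_000
target_2030 = 100_000_000

# Growth actually seen between 2020 and 2021.
observed = viruses_2021 / viruses_2020

# Average yearly growth still needed from 2021 to reach the 2030 target.
required = annual_growth_factor(viruses_2021, target_2030, 2030 - 2021)

print(f"2020 to 2021: about {observed:.1f}x in one year")
print(f"2021 to 2030: about {required:.2f}x needed per year")
```

On these figures the catalogue grew nearly tenfold in a single year, while roughly doubling every year from 2021 would be enough to reach the 2030 goal.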

A vast trove of genomic data

The long-term goal for this database is a search engine that makes the data usable quickly and efficiently. The team now has a proof of concept: take an RNA virus, for example one that shows up in the serum of a child in a Cambridge hospital, and in two minutes, for less than two cents, connect it to a virus that was sampled from a camel in sub-Saharan Africa in 2007.

Clearly, there is a lot more data to come, but don’t let that overwhelm you. Let’s start small by listening to the full conversation ‘Genetic and genomic data’ on Discovery Matters.

For more information:

· Listen: ‘Genetic and genomic data’ on Discovery Matters.

· Read: Edgar, R.C., Taylor, J., Lin, V. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022). https://doi.org/10.1038/s41586-021-04332-2

· Read: Sun, S., Miller, M., Wang, Y. et al. Predicting embryonic aneuploidy rate in IVF patients using whole-exome sequencing. Hum Genet 141, 1615–1627 (2022). https://doi.org/10.1007/s00439-022-02450-z

Discovery Matters

Insights on matters of discovery that advance life sciences. Brought to you by creatives, scientists, and leaders at Cytiva.