Introduction to Genomics and GCP

What is genomics?

You, me, your uncle, your neighbor, my grandmother: every one of us is made of trillions of cells. Almost every cell contains DNA (genetic material), which influences how you look, how athletic you are, which diseases you are susceptible to, and so on. The sum total of your genetic material is called your genome, and the study of genomes is called genomics.

Why should you care about genomics?

Imagine a world where your doctor collects your DNA with a cotton swab, runs a test on it in a couple of minutes, identifies the diseases you’ll be susceptible to 20–30 years down the line, and recommends medicines or changes to your diet and lifestyle. Super helpful, isn’t it?

Yeah. It’s definitely helpful. Why doesn’t everyone get the tests done?

Your doctor basically did two things: sequencing and analysis. Sequencing means reading out the genome. A human genome consists of about 3 billion base pairs (its building blocks), and at the end of sequencing you know the order of those base pairs. Tremendous progress has been made on this front. Consider this: the first human genome sequence was completed in 2003, after about 13 years of work and roughly $3 billion. Today, sequencing a genome takes about a day and costs less than $1,000. This trend is expected to continue.

The second step is analyzing the sequence. Progress here has been slower, and for a reason: it takes about 100 GB to store the genome data of one person, and researchers need access to thousands of such samples to identify anomalies and risk factors and to recommend drugs. That calls for a lot of storage, a lot of computing capacity and easy access to genome repositories, all of which either did not exist in the past or were available only to the top research institutes.
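To get a feel for the scale, here is a back-of-the-envelope sketch in Python. It uses only the 100 GB per genome figure quoted above; the cohort sizes and the function name `cohort_storage_tb` are made up for illustration, and decimal units (1 TB = 1,000 GB) are assumed.

```python
# Rough storage estimate for a research cohort.
# GB_PER_GENOME comes from the text above; the cohort sizes are illustrative.
GB_PER_GENOME = 100


def cohort_storage_tb(num_genomes, gb_per_genome=GB_PER_GENOME):
    """Return the total storage needed, in terabytes (1 TB = 1,000 GB)."""
    return num_genomes * gb_per_genome / 1000


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} genomes -> {cohort_storage_tb(n):,.0f} TB")
```

At these numbers, even a modest study of 1,000 genomes needs about 100 TB, and 10,000 genomes need about 1,000 TB (a petabyte), which is far beyond what a single workstation can hold.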

Enter GCP

Today Google is at the forefront of innovation in storage and computing, which makes it well suited to provide solutions in this space. Google Genomics, part of the Google Cloud Platform (GCP) suite, addresses the needs of researchers.

Interoperability: Google Genomics implements the open standard from the Global Alliance for Genomics and Health, which is interoperable across multiple genome repositories. This means your code works on data from multiple sources, e.g. data from the National Center for Biotechnology Information (NCBI, US) and the European Bioinformatics Institute (EBI).

Security and Collaboration: GCP meets or exceeds ISO 27001, HIPAA and PCI requirements and has the same security that protects Gmail and Google Apps. This makes it easier to share data securely and collaborate.

Storage: Storage keeps getting cheaper, and this trend is expected to continue. It costs about $25 to store one genome on GCP, and just 25 cents to store a compressed version of the genome.

Computing: One of the underlying Google technologies powering Google Genomics is BigQuery. BigQuery is known for returning results from massive datasets at interactive speeds, which lets researchers do exploratory data analysis in a matter of seconds.
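As a rough illustration of what such exploratory analysis might look like, here is a minimal Python sketch that counts variants per chromosome. It assumes the `google-cloud-bigquery` client library is installed and that GCP credentials are already configured. The table name `my-project.genomics.variants` and the column `reference_name` are hypothetical placeholders, not the schema of any particular dataset.

```python
def build_variant_count_query(table):
    """Build a SQL query that counts rows per chromosome.

    `table` is a fully qualified BigQuery table name. The column
    `reference_name` is an assumed, hypothetical column name.
    """
    return (
        "SELECT reference_name, COUNT(*) AS variant_count\n"
        f"FROM `{table}`\n"
        "GROUP BY reference_name\n"
        "ORDER BY variant_count DESC"
    )


def run_query(sql):
    """Run the query on BigQuery and return the rows as a list of dicts."""
    # Imported here so the query builder works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    return [dict(row.items()) for row in client.query(sql).result()]


# Hypothetical usage (needs a real table and credentials):
# rows = run_query(build_variant_count_query("my-project.genomics.variants"))
# for row in rows:
#     print(row["reference_name"], row["variant_count"])
```

The query itself is ordinary SQL. BigQuery runs it on Google's infrastructure, so the laptop only has to send the query and read back a small result.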

A few years ago, only researchers at some of the best institutes and organizations had access to the storage and computing power required to analyze genome data, and even then only one genome at a time. Today, with GCP, even a high schooler with a laptop can analyze millions of genomes at a time. No wonder GCP is set to supercharge genomics research.