Google Cloud for Genomics

Preethi Gowrishankar
Slalom Data & AI
Aug 24, 2020 · 7 min read

Building a scalable, reproducible, and secure data processing pipeline on the cloud.

By Preethi Gowrishankar, Ryan Gumpper, and Doug White

Introduction

With the proliferation of genetic testing making DNA sequencing much more accessible, the next challenge is finding a way to utilize all that data meaningfully. A single human genome, when its ~3 billion base pairs are sequenced, creates raw files that can measure ~200 GB. Processing and condensing this amount of data and pulling out meaningful insights about the genetic makeup of an individual from it has been a limiting factor in the bioinformatics pipeline.

We can leverage Google Cloud Platform tools in addition to open source technologies to take what has historically been a piecemeal and time-intensive data processing pipeline and make it faster, more scalable, cost-effective, and reproducible while improving its accuracy. Here, we created a pipeline that takes in raw sequencing data, analyzes it to identify meaningful variants, and outputs the results as an interactive visualization in a web app for the user.

Cloud Architecture Overview

The GCP services we utilized for this post are:

· Cloud Storage — GCP’s object storage service

· Cloud Life Sciences — A suite of GCP services and tools for managing, processing, and transforming life sciences data including DeepVariant and Variant Transforms

· DeepVariant — An analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data, running on Compute Engine

· Variant Transforms — An open source tool for transforming raw genomic sequencing data using Cloud Dataflow and loading it into BigQuery for use in advanced analytics

· Compute Engine — Infrastructure as a Service offering that allows users to launch virtual machines on Google’s infrastructure

· Cloud Dataflow — Streaming and batch data processing pipeline service

· BigQuery — A serverless cloud data warehouse that also hosts a number of publicly available datasets

· App Engine — Platform as a Service offering for developing and running applications

The advantage of utilizing GCP tools for this solution is that Google Cloud has taken its regular offerings and created customized life sciences solutions from them, such as the Life Sciences API and DeepVariant from the Pipelines API, or Variant Transforms from Cloud Dataflow. This significantly lowers the time and effort needed to get a genomics pipeline built on the cloud. These services can be further customized to meet individual needs, but the pipeline outlined in this blog post used the out-of-the-box genomics tools prebuilt by GCP.

Process Flow

There are three key processes in this pipeline: alignment, variant calling, and data transformation.

Alignment — First, we needed to run gene alignment, the process by which raw reads of sequenced DNA are pieced together to form the full genome sequence. The GCP Cloud Life Sciences service has optimized the industry-standard Genome Analysis Toolkit (GATK) pipeline by the Broad Institute for use on Google Cloud. This optimized pipeline runs on Compute Engine and Cloud Storage, and it costs approximately $5 to process an entire genome from start to finish.

We pulled raw sequencing data stored in Cloud Storage into the GATK pipeline for alignment. The aligned output is returned to Cloud Storage to be read into the next portion of the pipeline, which pulls out the meaningful variants.
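Because each stage reads from and writes to Cloud Storage, staging the inputs is just an object upload. Below is a minimal sketch in Python using the google-cloud-storage client; the project, bucket, and object names are hypothetical and only meant to illustrate the pattern.

```python
# Sketch: staging raw reads in Cloud Storage for the alignment pipeline.
# Project, bucket, and object names are illustrative, not the actual pipeline's.
from google.cloud import storage

client = storage.Client(project="my-genomics-project")  # hypothetical project ID
bucket = client.bucket("my-genomics-raw-data")          # hypothetical bucket

# Upload a raw FASTQ file so the GATK pipeline can read it from Cloud Storage.
blob = bucket.blob("samples/sample01/sample01_R1.fastq.gz")
blob.upload_from_filename("sample01_R1.fastq.gz")

# After the alignment run finishes, list the aligned outputs that the
# next stage of the pipeline will consume.
for out in client.list_blobs("my-genomics-aligned-data", prefix="samples/sample01/"):
    print(out.name)
```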

Variant Calling — Variant calling distills a full sequence down to its meaningful variants. The GATK pipeline allows us to run this process if we are starting with unaligned reads. However, Cloud Life Sciences has developed DeepVariant, which uses a deep neural network to identify genetic variants from a set of aligned reads by treating variant calling as an image classification problem.

After the alignment process, our pipeline pulls the aligned sequence data from Cloud Storage and runs variant calling via DeepVariant using Docker on Compute Engine. It costs about $2 and takes approximately five hours to call variants on a whole genome.
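As a rough illustration of this step, the DeepVariant release image can be invoked directly with Docker on a Compute Engine VM. The sketch below assumes a disk mounted at /data holding the reference and the aligned reads; the image tag and file names are placeholders, and the exact flags should be checked against the DeepVariant documentation for your data type.

```python
# Sketch: invoking DeepVariant's Docker image on a Compute Engine VM.
# Image tag, file paths, and shard count are illustrative placeholders.
import subprocess

cmd = [
    "docker", "run",
    "-v", "/data:/data",                    # mount the VM disk holding inputs/outputs
    "google/deepvariant:1.0.0",             # assumed image tag
    "/opt/deepvariant/bin/run_deepvariant",
    "--model_type=WGS",                     # whole-genome sequencing model
    "--ref=/data/reference.fasta",
    "--reads=/data/sample01.aligned.bam",   # aligned reads from the GATK stage
    "--output_vcf=/data/sample01.vcf.gz",
    "--num_shards=16",                      # parallelism across VM cores
]
subprocess.run(cmd, check=True)
```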

Data Transformation — The final outputs from variant calling are then run through the Variant Transforms pipeline, a Cloud Dataflow pipeline optimized for converting Variant Call Format (VCF) files into queryable tables in BigQuery. This data transformation and loading process takes less than 10 minutes, whereas standardization of this data was previously a manual process spanning several hours.
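Once the variants are loaded, they can be explored with standard SQL. The sketch below assumes a hypothetical project, dataset, and table name, along with the default column names produced by Variant Transforms (reference_name, start_position, reference_bases, alternate_bases); the coordinates in the filter are illustrative.

```python
# Sketch: querying variants loaded into BigQuery by Variant Transforms.
# Table name and coordinates are hypothetical; columns follow the default schema.
from google.cloud import bigquery

client = bigquery.Client(project="my-genomics-project")  # hypothetical project ID

sql = """
    SELECT reference_name, start_position, reference_bases,
           ARRAY_LENGTH(alternate_bases) AS num_alts
    FROM `my-genomics-project.genomics.sample01_variants`
    WHERE reference_name = 'chr11'                 -- illustrative region filter
      AND start_position BETWEEN 5225000 AND 5230000
    ORDER BY start_position
"""
for row in client.query(sql).result():
    print(row.reference_name, row.start_position, row.reference_bases, row.num_alts)
```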

User Interface Overview

Our final application is built on a Flask backend, and it pulls together processed genetic data from BigQuery. It then integrates that data with other freely available data sources such as the Worldwide Protein Data Bank (wwPDB) and NCBI. This additional data allows us to pull in population abundance and the 3-D structure of the protein the gene encodes.
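The backend pattern is straightforward: one route accepts a PDB accession code and a genomic region, fetches the structure file from the public PDB archive, and pulls the matching variant rows from the table loaded by Variant Transforms. The sketch below is illustrative only; the route, table name, and download URL pattern are assumptions rather than the application's actual code.

```python
# Sketch of the backend pattern: a Flask route that pulls variant rows from
# BigQuery and the 3D structure from the PDB. Names are hypothetical.
import requests
from flask import Flask, jsonify, request
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

@app.route("/protein/<pdb_id>")
def protein_view(pdb_id):
    # Fetch the structure file from the public PDB archive (URL pattern assumed).
    structure = requests.get(f"https://files.rcsb.org/download/{pdb_id}.pdb").text

    # Pull variants for the requested region from the Variant Transforms table.
    start = request.args.get("start", type=int)
    end = request.args.get("end", type=int)
    sql = """
        SELECT start_position, reference_bases
        FROM `my-genomics-project.genomics.sample01_variants`
        WHERE start_position BETWEEN @start AND @end
    """
    job = bq.query(sql, job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start", "INT64", start),
            bigquery.ScalarQueryParameter("end", "INT64", end),
        ]))
    variants = [dict(row) for row in job.result()]
    return jsonify({"pdb_id": pdb_id, "structure": structure, "variants": variants})
```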

The final web application provides an interactive interface to view the results of the variant analysis and contextualize them. An end user inputs a PDB accession code, and three visuals are returned:

· The 3D structure of the protein, showing where the variants are located. Segments of the gene that may have clinical significance are highlighted in red.

· A sequence-specific alignment to the most similar protein. This allows us to determine whether the sequence is conserved at each position (a minimal alignment sketch follows this list).

· The total number of known variants by chromosome position and clinical significance.
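As a rough illustration of the second visual, the pairwise alignment could be computed with Biopython; the sequences below are placeholder fragments and the scoring settings are left at their defaults.

```python
# Rough illustration of the pairwise alignment behind the conservation visual.
# Sequences are placeholders; in the app the query comes from the protein of
# interest and the subject from its closest match.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"

query   = "MVHLTPEEKSAVTALWGKV"   # placeholder fragment
subject = "MVHLTAEEKAAVTALWGKV"   # placeholder fragment

alignment = aligner.align(query, subject)[0]
print(alignment)          # aligned sequences with gaps
print(alignment.score)    # score used to pick the most similar protein
```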

The demo below shows this process:

The protein structure displayed on the landing page is the SARS-CoV-2 prefusion spike protein, a key target for COVID-19 vaccine and therapy development. Then, the PDB code for human hemoglobin is entered along with the beginning and ending positions of interest. The results page pulls up the structure of hemoglobin with variants of interest highlighted in red, along with visuals depicting sequence-specific alignment and clinical significance.

Security

While the application currently doesn't contain any sensitive medical data, we wanted to secure the system in a way that ensures HIPAA compliance. The Health Insurance Portability and Accountability Act (HIPAA) governs the privacy and security of personal health information and regulates the handling of electronic protected health information (ePHI). While there are several components needed to make a system HIPAA compliant, for this application we focused on access controls and transmission security.

In order to make the application HIPAA compliant, we implemented a two-factor authentication scheme using Twilio for SMS-based authentication. We chose Twilio because it supports multi-factor authentication and integrates with RSA security options, all of which are managed seamlessly via a REST API. Once the user enters their login information, the system sends a verification code over SMS, which the user then enters. This issues a token that grants the user access to the application until it expires. In addition to leveraging token-based authentication with expiration, the application ensures data is encrypted at rest. Our current application uses a simple SQLite database to store all of the account and security information, and all data going into the database is encrypted with RS256 (an RSA-based algorithm) using a custom accessor function that wraps the database in our Flask application.
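A condensed sketch of this flow is shown below, using Twilio Verify to send and check the SMS code and PyJWT to issue an RS256-signed token with an expiration time. The SIDs, credentials, phone numbers, and key handling are placeholders rather than the application's actual configuration.

```python
# Condensed sketch of the two-factor flow: send an SMS code via Twilio Verify,
# check it, then issue a short-lived RS256-signed token. All identifiers below
# are placeholders.
import datetime
import jwt                               # PyJWT
from twilio.rest import Client

twilio = Client("ACXXXXXXXX", "auth_token")   # placeholder credentials
VERIFY_SID = "VAXXXXXXXX"                     # placeholder Verify service SID

def send_code(phone_number):
    # Twilio sends the one-time verification code over SMS.
    twilio.verify.services(VERIFY_SID).verifications.create(
        to=phone_number, channel="sms")

def check_code_and_issue_token(phone_number, code, private_key_pem, user_id):
    result = twilio.verify.services(VERIFY_SID).verification_checks.create(
        to=phone_number, code=code)
    if result.status != "approved":
        raise PermissionError("Invalid verification code")
    # Token expires after one hour, so access is automatically revoked.
    claims = {"sub": user_id,
              "exp": datetime.datetime.utcnow() + datetime.timedelta(hours=1)}
    return jwt.encode(claims, private_key_pem, algorithm="RS256")
```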

Clinical Significance Determination

The clinical significance of a mutation refers to the degree to which that mutation causes disease or other observable traits that would impact clinical care. The National Center for Biotechnology Information (NCBI) maintains data on the clinical significance of known mutations. We set up an ETL process utilizing the NCBI APIs to extract this data. A random forest classifier is then used to determine the clinical significance of a variant of interest from the genome we fed into the earlier pipeline. Currently this model is specific to the gene it is trained on, and the limited availability of clinical significance data remains a challenge.
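A minimal sketch of the classification step is shown below using scikit-learn's RandomForestClassifier; the input file, feature columns, and label values are hypothetical stand-ins for the attributes extracted from NCBI during the ETL step.

```python
# Minimal sketch of the clinical-significance classifier. Features and labels
# are illustrative stand-ins for the attributes pulled from NCBI.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical training frame: one row per known variant for a single gene.
variants = pd.read_csv("ncbi_gene_variants.csv")             # placeholder file
features = variants[["position", "ref_base", "alt_base", "allele_frequency"]]
features = pd.get_dummies(features, columns=["ref_base", "alt_base"])
labels = variants["clinical_significance"]                   # e.g. benign / pathogenic

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))

# A variant of interest from the pipeline output would then be scored with
# model.predict(...), provided its columns match the training features.
```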

Conclusion

In building a cloud-native pipeline for genomic analysis, we consolidated a process that has historically been expensive and time-intensive into a single pipeline that is replicable and cost-effective. We had the flexibility to mix and match pieces of the pre-built, optimized Cloud Life Sciences tools with more general services like Compute Engine and App Engine while incorporating external data sources and code. This modularity allows scientists to use the pieces of the pipeline that best suit their purposes without having to give up their own processes in other sections. Additionally, this approach encourages standardization, which makes results easier to replicate.

This same flexibility makes it possible to build and customize solutions for a variety of use cases: the basic services used in this demo can be configured to implement anything from real-time data analytics to intelligent IoT platforms.
