Interpreting 23andMe Raw Genome Data with Google Genomics and BigQuery

Mike Kahn
Google Cloud - Community
May 7, 2017

This article assumes you have already used a service like 23andMe to obtain your raw genome data, or that you are interested in learning how Google Genomics and BigQuery can help process and draw insights from your genome. If you haven't used 23andMe, I highly recommend it if you want to learn more about your family history and predisposition to illnesses, or if you have any level of curiosity about your genetic makeup. It's important personal information to have today if you like to be well informed. You can also connect with genetic relatives you have never met who are also using the service. Some of my third cousins contacted me recently, which was very cool.

23andMe lets you browse and download your raw genotype data, which can give you additional insight into your DNA beyond the data used in the main 23andMe service. The 23andMe browse tool is nice: it reads your raw data and links to the dbSNP page (example) for each DNA marker in your genome, which offers a lot of technical detail. In this article we'll see how Google Genomics and BigQuery can act as a similar engine for digesting and querying the large dataset that is your genome. The wonderful thing about BigQuery is that it takes querying your genome to the next level, making it very easy to query this large dataset from applications and to integrate with existing DNA data providers such as dbSNP. Services such as Promethease and dna.land already do this work for you, and they do an amazing job, but maybe you want to do it yourself. Read on.

To begin your quest of learning more about your genome on Google Cloud Platform, you take your raw data (after converting it into an acceptable format) and load it into Google Genomics, a pipeline that creates a dataset from common genome data types (VCF, FASTQ, or BAM). Then you export it to BigQuery as a dataset and begin querying with native SQL statements. In this post I'll explain the working parts, mostly following the Google Genomics load variants how-to guide and taking it a step further.

If you are just getting started and this seems interesting, I suggest doing this when you have a few hours to half a day. Genome exploration is still somewhat new, but there are a lot of resources and tons of depth to this work. It's really cool and interesting what insights we can pull from our genomes. Just look at the popular markers on SNPedia:

SNPedia popular SNPs

Here's how the whole process will look:

Variant Analysis on Google Cloud Platform

Convert 23andMe raw genome data to .vcf

In your 23andMe account, grab your raw data zip:

https://you.23andme.com/tools/data/

Get PLINK to convert your 23andMe .txt file to .vcf, a format Google Genomics accepts.

Convert the 23andMe .txt to .vcf via the CLI (see this Biostars thread):

https://www.biostars.org/p/102109/

./plink --23file genome_Michael_Kahn_v4_Full_20170417050941.txt --snps-only no-DI --recode vcf

Follow the Google Genomics load variants guide

https://cloud.google.com/genomics/v1/load-variants

Upload the VCF to a storage bucket. Make sure you have the Cloud SDK installed and configured so you can use gsutil. You can also upload it via the web UI if you would like.

$ gsutil cp plink.vcf gs://kahn-personal/

Create a genomics dataset

$ gcloud alpha genomics datasets create --name my-genome
Created dataset [my-genome, id: 13234483586757595452].

Find your dataset id

$ gcloud alpha genomics datasets list
ID                    NAME
13234483586757595452  my-genome

Create a variantset, note the IDs

$ gcloud alpha genomics variantsets create --dataset-id 13234483586757595452 --name my-variantset
Created variant set [my-variantset, id: 11311180976008760690] belonging to dataset [id: 13234483586757595452].

Import the .vcf produced by PLINK from your Google Cloud Storage bucket

$ gcloud alpha genomics variants import --variantset-id 11311180976008760690 --source-uris gs://kahn-personal/plink.vcf
done: false
name: operations/CJ2ctcO1DhDv6q3IBRjs-syA1I__1H4

Check the import to make sure it went okay:

$ gcloud alpha genomics operations describe operations/CJ2ctcO1DhDv6q3IBRjs-syA1I__1H4
done: false
metadata:
  '@type': type.googleapis.com/google.genomics.v1.OperationMetadata
  clientId: ''
  createTime: '2017-05-04T18:39:43Z'
  events: []
  labels: {}
  projectId: mike-kahn-personal
  request:
    '@type': type.googleapis.com/google.genomics.v1.ImportVariantsRequest
    format: FORMAT_VCF
    infoMergeConfig: {}
    normalizeReferenceNames: false
    sourceUris:
    - gs://kahn-personal/plink.vcf
    variantSetId: '11311180976008760690'
name: operations/CJ2ctcO1DhDv6q3IBRjs-syA1I__1H4

$ gcloud --format='default(error,done)' alpha genomics operations describe operations/CJ2ctcO1DhDv6q3IBRjs-syA1I__1H4
done: false

It takes a little while so be patient here. You can check the status of the import in the GCP portal.

After about 10 minutes or so, my import was done: true!

$ gcloud --format='default(error,done)' alpha genomics operations describe operations/CJ2ctcO1DhDv6q3IBRjs-syA1I__1H4
done: true

Export Variants to BigQuery

In the BigQuery web UI, create a new dataset named with the Genomics dataset ID you used before.

$ gcloud alpha genomics variantsets export 11311180976008760690 genome_variants --bigquery-dataset 13234483586757595452
done: false
name: operations/CJ2ctcO1DhDs863IBRja_uecj7b9xgs

Check the status of the export operation

$ gcloud --format='default(error,done)' alpha genomics operations describe operations/CJ2ctcO1DhDs863IBRja_uecj7b9xgs
done: false

This took about 10-15 minutes for me.

$ gcloud --format='default(error,done)' alpha genomics operations describe operations/CJ2ctcO1DhDs863IBRja_uecj7b9xgs
done: true

After the export finishes, spend a bit of time with the BigQuery variants schema so you understand how to read your genome. To start interpreting it, look at names, reference_bases, and alternate_bases (your genotype) at a minimum.

https://cloud.google.com/genomics/v1/bigquery-variants-schema

Compare the table details preview with the 23andMe text file to acclimate yourself a bit. Now your genome is loaded into BigQuery and ready to be interpreted.
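To sanity-check the load, here is a minimal standard SQL sketch that previews a few rows. The dataset name (13234483586757595452) and table name (genome_variants) are the ones created in the export step above; swap in your own if they differ.

-- Preview a few variants with their marker IDs and bases.
SELECT
  reference_name,    -- chromosome
  start,             -- 0-based position
  names,             -- marker IDs, e.g. rs2032604
  reference_bases,
  alternate_bases
FROM `13234483586757595452.genome_variants`
LIMIT 10;

The names column holds the rsIDs you can cross-reference against dbSNP or SNPedia.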

Do Some Basic Exploring!

When I first started exploring my genome I was interested in my ancestry. I wanted to find out how close I was genetically to the Kohanim (Cohen, HaCohen, Kahn) priestly line, and more specifically to Aaron, the brother of Moses. I found that the haplogroups shared by people today related to this ancestry were J1c3 and also J2a (J-M410), which are related to my 23andMe-reported paternal haplogroup of J-M172.

Per the Wikipedia article, J-M172 is further divided into the contemporary clades J-M410 and J-M12, so my reported haplogroup sits one branch above these two, on the same line associated with the Kohanim.

Looking further, I found a research group focused specifically on the J2-M172 haplogroup that 23andMe reported for me.

This group conveniently lists the SNP markers found in raw genome data for this haplogroup:

J2-M172: rs2032604 (M172) T G

  • J2a-M410: N/A
  • J2a1-L26: N/A
  • J2a1a-M47: i3000069 (M47) G A, rs13447376 (M322) C A
  • J2a1b-M67: rs2032628 (M67) A T, i3000079 (M67) A T

To further validate this grouping, I can query my genome in BigQuery for these markers; a sketch follows.
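Here is a minimal standard SQL sketch of that lookup, using the same assumed dataset and table names as above. Note that the i-numbered IDs are 23andMe-internal identifiers and may or may not survive the PLINK conversion:

-- Find the haplogroup-defining markers listed above.
SELECT
  names,
  reference_name,
  start,
  reference_bases,
  alternate_bases
FROM `13234483586757595452.genome_variants` AS v
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(v.names) AS marker
  WHERE marker IN ('rs2032604', 'rs13447376', 'rs2032628',
                   'i3000069', 'i3000079')
);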

Using BigQuery to browse my 23andMe raw data

The sky is the limit from this point on. Now I can write SQL queries to look for all of the identified SNP markers related to my genetic ancestors, easily giving me a reading of how close my ancestry is to the Kohanim tribe (see the aggregate sketch below). This is amazing, and it is such a small sample of the work being done in genomics to find out more about our ancestors and decode our genome.
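As a rough summary, a hedged sketch that simply counts how many of the listed rsIDs appear in the table at all (same assumed dataset and table names):

-- Count matches per haplogroup marker.
SELECT
  marker,
  COUNT(*) AS matches
FROM `13234483586757595452.genome_variants` AS v,
  UNNEST(v.names) AS marker
WHERE marker IN ('rs2032604', 'rs13447376', 'rs2032628')
GROUP BY marker
ORDER BY marker;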

What's next?

So now that your 23andMe raw data is loaded into Google BigQuery, you can do a number of things with it. Resources like SNPedia let you query their database for genotypes associated with conditions such as Parkinson's or Alzheimer's.

Now you can quickly query your genome's SNPs ("snips"), which are common genetic variations among people.

SNPs (single nucleotide polymorphisms) are biological markers in your DNA that can help locate genes associated with disease and lifestyle. They are the most common type of genetic variation between people. SNPs can help track the inheritance of disease genes, and future studies will identify SNPs associated with diseases such as diabetes and cancer. Pretty cool.
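To read your own genotype at a single SNP, here is one more hedged sketch. rs53576 is used purely as an example rsID (one of SNPedia's popular SNPs), and the dataset and table names are the same assumptions as above:

-- Read the genotype call at one SNP of interest.
-- genotype values index into [reference_bases] + alternate_bases:
-- 0 = reference allele, 1 = first alternate allele, and so on.
SELECT
  v.names,
  v.reference_bases,
  v.alternate_bases,
  c.genotype
FROM `13234483586757595452.genome_variants` AS v,
  UNNEST(v.call) AS c
WHERE 'rs53576' IN UNNEST(v.names);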

Check out SNPedia (https://www.snpedia.com/index.php/SNPedia) for ideas for queries to build in BigQuery.

Also check out the Google Genomics Cookbook: http://googlegenomics.readthedocs.io/en/latest/

Going Further

Use Google Datalab notebooks for your BigQuery work and to put together visualizations based on your genome data.

Integrate with third-party services such as SNPedia and launch applications on Google App Engine to allow for easy exploration of genome data.

Google Cloud Datalab notebook with BigQuery

Pricing

Genomics: storage $0.022/GB per month

BigQuery: storage $0.02/GB per month ($0.01/GB per month for long-term storage); queries $5 per TB, with the first 1 TB per month free

Datalab: free

My genome variant table in BigQuery was about 80 MB, so at $0.02/GB that works out to well under a cent per month in storage. Variant analysis on GCP is ridiculously cheap! Awesome. Happy exploring.

Check my blog for more updates.
