Building a Personal Genome Data Warehouse with Google Cloud, 23andMe and Family Tree DNA
This is a followup to my post from May 2017 Interpreting 23andMe Raw Genome data with Google Genomics and BigQuery.
TLDR: In this post I will use Cloud Dataprep to clean up my Family Tree DNA and 23andMe raw data, import that raw data into BigQuery for extended genome SNP identification and Cloud Datalab for working with my genomic data. Ideally I will gain a better understanding of my genotyping raw data and I will have a stronger dataset for learning more about my DNA in the future.
Overview / Why I’m doing this
Our genomic data can be used to predict disease risk, side effects to pharmaceuticals and tell us more about our who we are and what our future may hold for us. After using 23andMe to explore my raw data with Google BigQuery last year I wanted more data and to be able to do more with it. I imagined a data warehouse of my DNA with multiple raw data sources. This way in the future when we know more about DNA markers (SNPs) I can easily check my genetic data taken from multiple services to verify any suspicions, risks, or concerns I may have about my health.
In my previous post interpreting 23andMe raw Genome data with Google Genomics and BigQuery I took the 23andMe raw data txt file, converted to vcf, and used the Google Genomics load variants pipeline to load into BigQuery for analysis. In this post I will prepare 2 different raw data genomes (23andMe and FTDNA) and copy directly to BigQuery for analysis.
There is a service called Promethease that will take your raw genome data and match it with SNPs found on SNPedia for $12. If you do not want to pay the $12 and the idea of having a personal data warehouse of your genome data appeals to you, you can set one up with Google Cloud quite easily.
If was going to build a personal genome data warehouse I would need more data on my DNA. This time I tried Family Tree DNA’s Family Finder service. After having a second set of raw data I would be able to cross reference any data I found in my 23andMe raw data with another data source, hopefully giving me increased confidence in anything that I found. In the future I will add Ancestry raw data, covering all 3 consumer priced genetic testing services. Ideally this data would allow me to possibly better anticipate or plan for any health issues that may occur in my future. Would this second source give me a more accurate dataset? I think so. Both sources use different sequencing:
23andMe — Uses a customized Illumina Omniexpress+ array that includes about 600,000 SNPs.
Family Tree DNA Family finder — Uses Illumina OmniExpress microarray chip. It includes 711,424 total SNPs. Only about 13,913 of these have annotations in SNPedia.
A nice write up and charge to the many options for DNA analysis Neo.life Your Guide to Getting Sequenced
Family Tree DNA Raw Data Download
Family Tree DNA gives you your raw data in a few forms. It can be a bit confusing.
FTDNA 3 types of raw data types
- Autosomal Raw Data
- X Chromosome Raw data
- Raw Data Concatenated
Every cell in the human body has DNA that is tightly packed into structures called chromosomes. There are 23 pairs of chromosomes of which 22 pairs are called autosomes and the 23rd pair is called allosome or sex chromosomes. The X chromosome spans about 155 million DNA base pairs and represents approximately 5 percent of total DNA in cells (source).
The concatenated file from FamilyTree DNA combines the autosomal and X raw data file so if you are looking to gain insight into your raw data you’ll want to use the concatenated file.
Two different builds in FTDNA (build 36 and build 37)
A build is a reference system used to describe a kind of typical human genome fully sequenced. Build 37 is supposed to be more accurate in terms of where SNPs actually are located than build 36 (source). A build is a Genome assembly, as more is learned about the human genome, new Genome assemblies are released.
- Each build is a version of human genome reference
- Newer builds fine tune SNP positions
- If manually comparing 2 raw data files, you would need to make sure they are using the same build number
- 23andMe raw data is build 37, so to manually compare your FTDNA raw data file to it you should use your Build 37 raw data
For comparing FTDNA with another raw data source like 23andMe, Build 37 Raw Data Concatenated is the one you want to use.
More on reference genome and build 37 here.
Raw data formats and size
Family Tree DNA gives you a gzip with a CSV so it’s easy to load into BigQuery for analysis. My Family Tree DNA raw data was about 6.5MB compressed and about 22MB uncompressed (CSV). My 23andMe raw data was 5MB compressed and about 15MB uncompressed (txt).
Download your raw data from 23andMe and Family Tree DNA here:
Using this data in Google Cloud BigQuery
Once you have the CSV of your raw data you could really import right into BigQuery. I’ll start with importing the Family Tree DNA raw data.
You can do this either via the BQ UI or via the bq CLI tool
mikekahn@cloudshell:~ (mikekahn-sandbox)$ bq --location=US load --source_format=CSV personalgenome.ftdna gs://pers-pixelbook/genome-ftdna.csv RSID:string,CHROMOSOME:string,POSITION:integer,RESULT:string
Waiting on bqjob_r53ef916ab2e7d062_000001642f40dac6_1 … (0s) Current status: DONE
BigQuery error in load operation: Error processing job ‘mikekahn-sandbox:bqjob_r53ef916ab2e7d062_000001642f40dac6_1’: Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1;errors: 1.
Please look into the error stream for more details.Failure details:- gs://pers-pixelbook/genome-ftdna.csv: Error while reading data,error message: Could not parse ‘POSITION’ as int for field POSITION(position 2) starting at location 0
Result: Unable to import the FTDNA CSV dataset with the position field as an integer. So this led me to start thinking that something was wrong with my data, the POSITION rows should be all numbers.
I went ahead and imported with all fields as STRING data type to move forward and see whats going on here.
mikekahn@cloudshell:~ (mikekahn-sandbox)$ bq --location=US load --source_format=CSV personalgenome.ftdna gs://pers-pixelbook/genome-ftdna.csv RSID:string,CHROMOSOME:string,POSITION:string,RESULT:string
Waiting on bqjob_r7ffb8f22a9f140ad_000001642f461ba5_1 … (6s) Current status: DONE
Success, but ultimately I do not want all columns to be STRING data type. So let’s explore the data a bit in BigQuery.
I found some issues with the CSV that was provided from Family Tree DNA making it not fully cleansed and in need of some changes.
- The RESULT column has “ — ” for rows without a value
2. Last 2 rows in the dataset had the column names
To clean this dataset and standardize fields and columns across my 2 different genome datasets I’ll use Cloud Dataprep.
Enter Google Cloud Dataprep by TRIFACTA
Since some duplicate rows needed to be removed as well as some other changes in this dataset I thought this would be a great time to use Google Cloud Dataprep. Since I am using two different raw data sources I should make sure they have the same columns and column types and formatting so querying both will be easier. I can easily take care of this work in Dataprep.
Create a flow and add a dataset
The first step in in Dataprep is creating a flow and adding a dataset. This is pretty simple, give it a name, then either upload a dataset or import one in GCS or an existing table in BigQuery. I will add datasets for both my FTDNA and 23andMe raw data in the same flow.
And now my second dataset (23andMe)
Create a recipe
After the flow is created and datasets are imported add a new recipe to one of your datasets.
Now you can edit your recipe
Great. It will take a moment to load your dataset. Now that your dataset is imported and your recipe is started, you can explore your dataset a bit and add steps or transformations to the recipe.
For this first dataset, I need to make sure the position columns are INT data type. I can also explore my dataset by mousing over the categories displayed in Dataprep.
Here is the work that I will do on the first FTDNA dataset.
- Remove symbols in all columns
- Rename columns
- Remove column name row at top and bottom of dataset
- For null values in RESULT, change to N/A
Here’s how the recipe looks:
Browsing the dataset with the recipe it looks clean:
Download the FTDNA wrangle file here.
Now it’s time to Run the job and apply these transformations.
Running the job
You’ll want to modify some of the default configuration here if you plan to take this cleansed dataset right over to BigQuery.
Lets see what needs to be changed. Click the pencil and lets edit this job. Lets change the following:
- Replace the file every run
And under more options underneath replace the file every run
Single file so we can easily import the new .csv into BQ.
Save settings and you are taken back to the run job dialog box. Lets run job.
This transform should take a few minutes to spin up a dataflow job and output our new CSV.
The nice thing here is that we did not have to write any apache beam code at all to clean up this dataset. Dataprep is basically a layer ontop of Cloud Dataflow. While the job is running you can take a look at it in Dataflow if you would like.
A few minutes after running the job (about 7 min for this one) the transformations should be complete. View results to make sure your data is clean.
Its okay to have a 1% mismatch on the CHROMOSOME field in this table as the raw data we are using here has our x chromosome (X) and autosomal values (1–22).
Looks good to me.
Now export your results.
Validate the locations you set before, and click Create
Check your cloud storage bucket and the new .csv should be there.
Now you can either download the .csv or just import to BQ right from the GCS bucket.
Import cleansed dataset to BQ
First I’ll remove the old dataset with bad datatypes:
mikekahn@cloudshell:~ (mikekahn-sandbox)$ bq rm personalgenome.ftdna
rm: remove table ‘mikekahn-sandbox:personalgenome.ftdna’? (y/N) y
Now import the dataprep cleaned dataset
mikekahn@cloudshell:~ (mikekahn-sandbox)$ bq --location=US load --source_format=CSV personalgenome.ftdna gs://mikekahn-sandbox.appspot.com/jobrun/genome-ftdna-0623.csv RSID:string,CHROMOSOME:string,POSITION:integer,RESULT:string
Waiting on bqjob_r5b939bcc2a5a3afa_000001639a154b19_1 … (6s) Current status: DONE
Success! Now my FTDNA dataset is imported into BigQuery. Next, lets get the 23andMe dataset prepared and ready for BQ.
23andMe text file data transformation with Google Cloud Dataprep
Since the 23andMe raw data columns do not match FTDNA columns, you will want to clean the 23andMe raw data genome TXT file in Dataprep as well. Import it as a data source and make the following transformations in your recipe:
Download the 23andMe Cloud Dataprep wrangle file here.
Run the job like we did on the FTDNA dataset, no compression and single file.
Let’s view the results of our completed job to see how it came out:
This dataset should now look similar to your Dataprep cleaned FTDNA dataset and is ready to be exported to CSV then loaded into BigQuery alongside the FTDNA table. Export results the same way that I did for the FTDNA earlier.
Super easy to import to BQ after you run the job and export the results.
mikekahn@cloudshell:~ (mikekahn-sandbox)$ bq --location=US load --source_format=CSV personalgenome.23andme gs://mikekahn-sandbox.appspot.com/jobrun/genome-23andMe.csv RSID:string,CHROMOSOME:string,POSITION:integer,RESULT:string
Waiting on bqjob_r56a55846613e01c_000001639f213e24_1 … (6s) Current status: DONE
Using BQ to explore your genome dataset
My BQ personalgenome dataset includes 2 tables and 2 different genomes. I have a test with Ancestry awaiting results so eventually I’ll have 3 different copies of my genome from unique personal genetic providers. This is my personal DNA dataset:
Find matching SNPs between the 2 genomes
In BQ to find the total number of rows and matching rows use the following query:
SELECT COUNT(*) as totalrows, count(DISTINCT RSID) as rsidrowmatches
Both genomes together have 1,322,236 rows, 1,009,997 of which are unique. So that gives me 312,259 matching rows. That’s a 23.6% match between my 23andMe and FTDNA genomes.
Also, try a nested wildcard query to find the total number of matching rows between the tables in my dataset
SELECT COUNT(*) as totalcount
SELECT RSID from `mikekahn-sandbox.personalgenome.*`
GROUP BY RSID HAVING ( COUNT(RSID) > 1 )
) AS total
As of June 24, 2018, SNPedia has detail on only 108,485 SNPs. So do not expect to have some strong answers to your DNA, but instead some possible insights.
SNP checking across Counsyl, 23andMe, FTDNA data
I had a carrier test done before I got married from a company called Counsyl. Unfortunatly Counsyl does not provide comparable data to 23andMe and FTDNA as they primarily provide a report matching your DNA with your spouse. I attempted to obtain my raw data from Counsyl and they did send me a vcf but it was considerably smaller and did not include reference genome variants. More on Counsyl vs 23andMe:
The Counsyl Foresight screen performs full next generation sequencing across hundreds of genes that can cause inherited genetic disease. This constitutes approximately a million base pair positions which are inspected for novel variation. The VCF primarily indicates positions with rare non-reference variation which gives a VCF of approximately 1,200 entries.
23andme by comparison performs genotyping with a microarray at approximately 600,000 positions with known common ancestral variation. Their VCF files likely contain the genotypes at every position so it’s expected that the file would be 500x larger.
My Counsyl test came back that I was a carrier for something called congenital amegakaryocytic thrombocytopenia. I wanted to verify the SNP for this disease found in my genome by Counsyl was also in my genome taken by 23andMe and FTDNA.
So lets query those tables in BigQuery.
Step one, find the SNP.
Google search for congenital amegakaryocytic thrombocytopenia snp returns an SNPedia result:
So now that we have both 23andMe and ftdna datasets in BigQuery, lets check for this SNP in my personal genomes
Since I have made sure both tables have the same column names using Dataprep, I can use a wildcard table in my query to query all tables in the dataset for the SNP. I’ll use _TABLE_SUFFIX to let me know which raw data source is displaying the SNP.
That SNP is matched on all 3 raw data sources I’ve had, Counsyl (via report), 23andMe and ftdna. No worries, my wife had the Counsyl test done as well and she is not a carrier for this disease so our future children will be fine 🙏🏼.
What I just did wasvalidate a genetic marker across 3 different DNA testing services. This gives me a good level of confidence on this genetic marker on my DNA. Pretty cool.
More exploring — Prostate Cancer
Randomly checking the SNPs for cancer on SNPedia
Prostate Cancer has an identified SNP with aggressive prostate cancer here:
One SNP has been found to be associated not only with prostate cancer in general, but also specifically with aggressive prostate cancer [PMID 18073375]:
Checking for this SNP across my FTDNA and 23andMe genome
SELECT rsid, chromosome, position, result, _TABLE_SUFFIX as table_id
WHERE rsid = 'rs1571801'
Shows up on both of my raw data sources:
Some research has been done to back this up:
We performed an exploratory genome-wide association scan in 498 men with aggressive prostate cancer and 494 control subjects selected from a population-based case-control study in Sweden. We combined the results of this scan with those for aggressive prostate cancer from the publicly available Cancer Genetic Markers of Susceptibility (CGEMS) Study. Single-nucleotide polymorphisms (SNPs) that showed statistically significant associations with the risk of aggressive prostate cancer based on two-sided allele tests were tested for their association with aggressive prostate cancer in two independent study populations composed of individuals of European or African American descent using one-sided tests and the genetic model (dominant or additive) associated with the lowest value in the exploratory study.
Before you get nervous..
Fortunately the allele (variant of the gene) for this marker on SNPedia is A;A with the highest risk and my allele seems to be G;T, so hopefully I may be in the clear here. But this shows you how much we know about DNA. The research that was done was only on about 500 men and SNPedia says the A;A variant has >1.36x risk for prostate cancer. So its not very strong or conclusive but it may mean something in another case to someone.
Next I’ll use Google Cloud Datalab to create notebooks for my genome data warehouse exploration.
Using Cloud Datalab I can run BigQuery queries inside a notebook with %%bq query
In Datalab I can write python code to do analysis. For example in my notebook I am scraping snpedia.com for the popular SNPs using BeautifulSoup.
Once I add more raw data sources (Ancestry), I’ll write something to give me a score for matches found across multiple tables.
Download this Cloud Datalab ipython notebook here.
Today we can have genotyping done from multiple sources at a relatively reasonable cost. Services such as 23andMe, Family Tree DNA, and Ancestry the most popular today for people to have genotyping done for ancestry and health data exploration. This data can help us better understand our genetic makeup, understand disease risk, and possibly allow people to better plan their lifestyle. SNP markers are still being identified and may not be accurate for certain cases. With your genotyping raw data and Google Cloud Platform you can easily build a personal data warehouse and gain a better understanding of large datasets such as your own DNA.