Genetics Resources for the Data Nerd
My first piece on data and genetics was focused on helping data nerds understand the meaning of the data they might get from a genotyping service like 23AndMe or ancestry.com, both of which let you download your “raw data”. This one is more of an overview of the many places to get additional data, to help you get started on your own data journey.
As I go along, I’m happy to admit that I don’t know everything right now. Some things that may seem very simple to a bioinformatics expert are still somewhat opaque to me. I’ll try to highlight those gaps as I get to them and update if I find the answer. Feel free to start a conversation with me on Twitter at @matthiasshap. I’m happy to help and thirsty to learn more.
SNP Research Resources
There are a couple of places to go to research information related to SNPs. Most of the information I’ve gathered is from OpenSNP.
OpenSNP is a great source for exploring single SNPs and, more importantly, downloading genotypes to play with. Individuals can upload their genotyped data to the site and anyone can download it.
Additionally, OpenSNP ranks SNPs by the amount of research done on them. Rankings are based on research done through PLoS, Genome.gov, the Personal Genome Project, and Mendeley. I’ve scraped the site to collect the 3000 most researched SNPs. I even made an infographic about them.
SNPedia is the wiki for SNPs. It is filled with information on individual SNPs, including which alleles correspond to phenotypes or disease associations. There is a service called Promethease that runs off SNPedia. For $5 you can upload your genotype data (the raw data you got back from 23AndMe or ancestry) and it will deliver a health report based on the publicly available information.
UCSC Genome Browser
The UCSC Genome Browser is for serious researchers and I quite frankly haven’t figured the whole thing out yet. But it lets you see genes in the chromosome sequence, gives a multi-species comparative genetics view, and is a good portal if you want to look into the research associated with specific areas of the genome.
Individual Genotype Data
If you have done the 23AndMe or ancestry.com genotyping, you should have access to your raw data. This data comes as a list of SNPs, each with four pieces of information: the SNP ID, which chromosome it is on, its position on that chromosome, and your genotype for that SNP.
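To make those four fields concrete, here is a minimal sketch of reading them in Python. It assumes a tab-separated text file where comment lines start with `#`, which is how my download looked; other services may lay their files out differently. The `Snp` type, the `parse_raw_data` function, and the sample rsIDs are all made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Snp(NamedTuple):
    snp_id: str      # the SNP identifier, usually an rsID
    chromosome: str  # kept as text because of values like X, Y, MT
    position: int    # location on the chromosome
    genotype: str    # e.g. "AA" or "CT"


def parse_raw_data(lines: Iterable[str]) -> Iterator[Snp]:
    """Yield one Snp per data row, skipping comments and malformed rows.

    Assumes four tab-separated columns; this layout is an assumption.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # not the four-column layout assumed here
        snp_id, chromosome, position, genotype = fields
        if not position.isdigit():
            continue  # probably a header row
        yield Snp(snp_id, chromosome, int(position), genotype)


# Made-up sample rows, not real genotype data.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t12345\tAA",
    "rs0000002\tX\t67890\tCT",
]
records = list(parse_raw_data(sample))
```

For a real file you would pass an open file handle instead of the `sample` list, for example `parse_raw_data(open(path))`.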
23AndMe has a list of the 1.5 million SNPs they genotype for users. If you sign up for the 23AndMe API you can download this file, but unfortunately Excel will not open all of it because a worksheet tops out at 1,048,576 rows. To handle this, I’ve split the 23AndMe SNPs into a set of files by chromosome. You can get all this data at my github.
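If you would rather do the split yourself, something like the following should work. This is a sketch rather than the exact script I used: the `split_by_chromosome` function, the `chr<name>.csv` file naming, and the column names are all assumptions, and it holds every row in memory, which is fine for a million or so rows.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_chromosome(rows, out_dir):
    """Write one CSV per chromosome and return {chromosome: path}.

    rows: iterable of (snp_id, chromosome, position) tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group rows by their chromosome value (second field).
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[1]].append(row)

    written = {}
    for chromosome, chrom_rows in grouped.items():
        path = out_dir / f"chr{chromosome}.csv"  # hypothetical naming scheme
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["snp_id", "chromosome", "position"])
            writer.writerows(chrom_rows)
        written[chromosome] = path
    return written


# Made-up rows for a quick check, written to a throwaway directory.
sample_rows = [
    ("rs0000001", "1", 100),
    ("rs0000002", "1", 200),
    ("rs0000003", "X", 300),
]
with tempfile.TemporaryDirectory() as tmp:
    files = split_by_chromosome(sample_rows, tmp)
    line_counts = {
        chrom: len(path.read_text().splitlines())
        for chrom, path in files.items()
    }
```

Each output file has a header row plus that chromosome’s SNPs, so every file stays well under the Excel row limit.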
It’s hard to tell exactly how the list of SNPs 23AndMe examines matches up with the raw data they return. When I sifted through my raw data, I found only 600,000 SNP data points, compared to the 700,000 SNPs that I found in the data sets from ancestry.com.
Stranger still was the fact that it seems only about 300,000 of those SNPs are common between the two data sets. I don’t know what to make of this, but I thought it was valuable information.
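The overlap check itself is just set arithmetic on the SNP IDs. Here is a small sketch of how you might reproduce it; `compare_snp_sets` is a made-up helper and the IDs are placeholders. One caveat: it matches on ID alone, so a SNP that appears under different IDs in the two files would be counted as unshared.

```python
def compare_snp_sets(ids_a, ids_b):
    """Count SNP IDs unique to each data set and shared by both."""
    a, b = set(ids_a), set(ids_b)
    return {
        "only_a": len(a - b),  # in the first file only
        "only_b": len(b - a),  # in the second file only
        "shared": len(a & b),  # present in both files
    }


# Placeholder IDs standing in for two services' raw data.
service_a_ids = ["rs1", "rs2", "rs3"]
service_b_ids = ["rs2", "rs3", "rs4", "rs5"]
overlap = compare_snp_sets(service_a_ids, service_b_ids)
```

With real files, you would feed in the first column of each raw data download.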
Whole Sequence Genomes
SNP data is cool, but maybe you’re the kind of person who isn’t happy unless you have ALL THE BASE PAIRS! Then you, my friend, need to make your way over to GenBank.
GenBank has full genome sequences for dozens of species (including several full genome sequences for humans). You can download them and run… whatever you want… off them until your processor melts down. I haven’t spent a lot of time working with whole-sequence genome data because I’ve been focused on analyzing my personal data, but it’s there if you want.
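As a starting point for “whatever you want”, here is a sketch that tallies the bases in a sequence. It assumes you downloaded the sequence in FASTA format, where header lines start with `>` and every other line is sequence. The `count_bases` function and the sample sequence are made up, and it reads line by line so a large file doesn’t have to fit in memory.

```python
from collections import Counter


def count_bases(lines):
    """Tally each base letter across all sequence lines of a FASTA input."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        counts.update(line.upper())  # treat lowercase bases as uppercase
    return counts


# Made-up sequence, not a real GenBank record.
sample_fasta = [">example made-up sequence", "ACGTAC", "ggtn"]
base_counts = count_bases(sample_fasta)
```

For a real download you would pass an open file handle in place of `sample_fasta`.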
That’s all I have at the moment. I’m mostly writing this to externalize the things I’ve learned so I can come back to them if I ever forget (I do that sort of thing). Hopefully this is also helpful as a jumping-off point for anyone who has been trying to do what I’m doing.