Come take a peek inside the fascinating (and sometimes frustrating) world of genetic research!
by Daniel Speyer (with Richard Aufrichtig)
The simplest, and most common, way to look at a large set of DNA files is by conducting a Genome-Wide Association Study, or GWAS. In this kind of study, scientists look for genetic features that are associated with traits they are interested in learning more about.
The genetic features that the Erlich Lab looks at are called Single Nucleotide Polymorphisms, or SNPs. If you think of a person’s genome as a very long word spelled with the letters A, C, G and T, then a SNP is a single letter in that word that differs from one person’s spelling to another’s. SNPs have standard names, which are “rs” followed by a number. While not all genetic differences can be found in SNPs, we find that line of research to be the most useful given the data set we currently have. Interesting traits that have already been linked to SNPs range from divorce rate to susceptibility to Crohn’s disease. To predict these traits from the files our users upload, we can use SNP-to-trait links from already existing research. It was via this method that we created our Trait Reports.
The search for a “Firefox Gene” began as a result of a technical problem. While DNA.Land is interested in discovering new SNP-to-trait links, in order to conduct that kind of research we will first need to collect corresponding data from our users. At the moment, we only have a very small amount of this kind of data from the survey on our “Me And My Family” page. While we will soon be launching a survey for breast cancer research, we started to think about what we could study in the meantime. It was then that we remembered our log of which web browsers our users visit us on.
Why do we have a log of our users’ web browsers of choice?
In order to simplify our process for testing new features on our website, we started keeping track of what browser each of our users used. Every time a web browser requests a web page from a server, it includes a “User-Agent” header identifying itself. (If you wish to learn more about User-Agent headers, there is an informative and melodramatic history of them here.)
How did we conceptualize our browser study?
We chose to restrict our study to Windows desktop users, because using Internet Explorer on Windows (where it comes pre-installed) says different things about someone’s personality than using it on a Mac or on desktop Linux. Including mobile browsers would have added whole other layers of complication as well. When we had evidence of users visiting us on more than one browser, we identified them via the one they used most often.
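To make the bookkeeping concrete, here is a minimal Python sketch of how a log of User-Agent headers could be turned into one browser per user. It is not the code DNA.Land actually ran. The function names, the token list and the sample log are all assumptions made for illustration, and the substring matching is deliberately crude (for example, it ignores browsers not mentioned in this post).

```python
from collections import Counter

# Substrings to look for, in priority order. Order matters because Opera's
# User-Agent also contains "Chrome/", and Chrome's also contains "Safari/".
BROWSER_TOKENS = [
    ("OPR/", "Opera"),
    ("Chrome/", "Chrome"),
    ("Firefox/", "Firefox"),
    ("MSIE ", "Internet Explorer"),
    ("Trident/", "Internet Explorer"),
    ("Safari/", "Safari"),
]


def classify(user_agent):
    """Return a browser name for a Windows desktop User-Agent, else None."""
    if "Windows NT" not in user_agent:
        return None  # not Windows desktop, so outside the study
    for token, name in BROWSER_TOKENS:
        if token in user_agent:
            return name
    return None


def primary_browser(request_log):
    """Map each user to the browser they were seen on most often."""
    per_user = {}
    for user_id, user_agent in request_log:
        browser = classify(user_agent)
        if browser is not None:
            per_user.setdefault(user_id, Counter())[browser] += 1
    # most_common(1) returns [(browser, count)]; ties go to the browser seen first.
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


# Hypothetical log entries: (user_id, User-Agent header) pairs.
CHROME = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36")
FIREFOX = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"
IE11 = "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
MAC_SAFARI = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/601.5.17 "
              "(KHTML, like Gecko) Version/9.1 Safari/601.5.17")

request_log = [("u1", CHROME), ("u1", CHROME), ("u1", FIREFOX),
               ("u2", IE11), ("u3", MAC_SAFARI)]
assignments = primary_browser(request_log)
print(assignments)
```

In this made-up log, user u1 is counted as a Chrome user because Chrome appears most often, and u3 drops out entirely because the only visit came from a Mac.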
After setting these parameters, this is what we found:
Of the users who signed on after we started keeping track,
- roughly 70% use Chrome
- 20% use Firefox
- and 10% use Internet Explorer.
We found a handful of users who use Opera and one who uses Safari (yes, on Windows), but there were not enough of those to study.
What are the next steps?
Our first step was to count how many of our users had each value of each SNP in the human genome. Then, we were able to look for any potential correlation between SNP variations and browser of choice, which is more complicated than it sounds. At the time of writing, we have almost 40,000 users, and each of these users has about a million SNPs for us to study. To get the counts done quickly, we had to go through each file only once, do many things at the same time, and not trip over our own feet in the process.
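The counting step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each user is represented as an already-parsed dictionary rather than a real genome file, the SNP names and genotypes are made up, and everything runs in a single process (a parallel version could build partial counts like these and add them together).

```python
from collections import defaultdict


def count_genotypes(users):
    """Tally users by SNP, genotype and browser, reading each user's data once.

    users: iterable of (browser, {snp_name: genotype}) pairs.
    Returns counts[snp_name][genotype][browser] -> number of users.
    """
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for browser, genotypes in users:
        for snp_name, genotype in genotypes.items():
            counts[snp_name][genotype][browser] += 1
    return counts


# Hypothetical users; "rsA" and "rsB" are placeholder names, not real SNPs.
users = [
    ("Chrome", {"rsA": "C;C", "rsB": "G;T"}),
    ("Chrome", {"rsA": "C;C", "rsB": "T;T"}),
    ("Firefox", {"rsA": "C;T", "rsB": "G;T"}),
]
counts = count_genotypes(users)
print(counts["rsA"]["C;C"]["Chrome"])
```

Each SNP ends up with its own small grid of genotype-by-browser counts, which is the kind of table the rest of the analysis works from.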
(You may wonder at our having only “about a million” SNPs when we’ve talked elsewhere about having tens of millions. The files our users upload only have about a million SNPs in them. While we can deduce the rest of the SNPs with pretty high confidence by a process called imputation, SNPs deduced via imputation aren’t very interesting for our purposes here.)
Telling if we’ve found anything
If you take a bunch of people and divide them into completely random groups, you will still see some differences between them. If the division into groups signifies something, these differences will likely be bigger. To judge whether the browser differences between our genome-based groups were bigger than chance alone would produce, we used a method called the chi-squared independence test. This test involves taking the table of numbers, applying a formula, and getting back a probability called a “p-value.”
The p-value is defined as the probability of obtaining a result equal to or “more extreme” than what was actually observed, when the null hypothesis is true. (Here, the null hypothesis is that genotype and browser are unrelated.) A high p-value means that what we saw is about what chance alone would produce, and we shouldn’t learn anything from it. A low p-value means that we saw something that would be unlikely to arise by chance alone.
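For readers who want to see the formula, here is a small Python sketch of the standard Pearson chi-squared independence test. It is an illustration, not DNA.Land's code, and the counts are invented. It assumes no row or column total is zero. With three browser columns the number of degrees of freedom is always even, so the sketch uses the closed-form tail probability that exists for that case and refuses anything else.

```python
import math


def chi_squared_test(table):
    """Pearson chi-squared independence test on a grid of counts (list of rows).

    Returns (statistic, degrees_of_freedom, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    if dof < 2 or dof % 2:
        raise ValueError("this sketch only handles an even, nonzero number of degrees of freedom")

    # For dof = 2m the upper tail has a closed form:
    # P(X >= x) = exp(-x/2) * sum over k < m of (x/2)^k / k!
    half = statistic / 2
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return statistic, dof, p_value


# Hypothetical counts: rows are genotypes, columns are three browsers.
same_pattern = [[70, 20, 10], [140, 40, 20]]      # identical proportions in both rows
different_pattern = [[10, 20, 30], [30, 20, 10]]  # proportions reversed between rows

print(chi_squared_test(same_pattern))
print(chi_squared_test(different_pattern))
```

The first table has exactly the same browser split in both rows, so the statistic is zero and the p-value is 1. The second table has a reversed split, which gives a large statistic and a small p-value.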
There is something of a tradition in science of trusting p-values less than 5% or less than 1%. This works until you start looking at lots of possibilities. If you look at 100 possibilities, none of which are actually relevant, you should expect to find about 5 that look significant at p<5% and about 1 at p<1%. If you look at one and a half million possibilities (as we do), you’ll be deluged with false positives. (This has been famously illustrated by xkcd.)
The best solution is the simple one: instead of looking for chances that are one in a hundred, we look for chances that are 1 in 150 million.
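The arithmetic behind these numbers is short enough to check directly. The Python sketch below uses exact fractions to avoid rounding noise; the function names are ours, and dividing the threshold by the number of tests is what statisticians usually call a Bonferroni correction.

```python
from fractions import Fraction


def expected_false_positives(n_tests, alpha):
    """Expected number of chance hits when none of the n_tests is real."""
    return n_tests * alpha


def corrected_threshold(n_tests, alpha):
    """Per-test threshold that keeps the chance of any false positive at or below alpha."""
    return alpha / n_tests


five_percent = Fraction(5, 100)
one_percent = Fraction(1, 100)

print(expected_false_positives(100, five_percent))        # 5
print(expected_false_positives(100, one_percent))         # 1
print(expected_false_positives(1_500_000, one_percent))   # 15000
print(corrected_threshold(1_500_000, one_percent))        # 1/150000000
```

At the traditional 1% level, one and a half million tests would be expected to produce about 15,000 chance hits, which is why the threshold has to shrink to 1 in 150 million.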
The rare genotype problem
Unfortunately, this still wasn’t specific enough. The first batch found a lot of results like:
In this table, we see roughly the usual 70/10/20 pattern for “C;C” and “C;T,” but the “T;T” case is way off. While the chi-squared test flagged this SNP as interesting, the result was not conclusive. What we had actually found was a weakness of the chi-squared test. It’s an approximate formula that works well as long as all of the numbers are large. But, in our case, there aren’t enough T;T users to conclude much from. While the usual guideline is not to use the test if any number in the grid is below five, we chose to chop things at ten just to be safe.
When we chopped the “T;T” line off the grid, we got:
In short, there’s nothing here.
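The chopping rule itself is simple to express. Here is a minimal Python sketch, assuming each SNP's table is stored as a dictionary from genotype to a list of browser counts. The counts below are invented for illustration and are not the numbers from our data.

```python
MIN_CELL_COUNT = 10  # the cutoff chosen in this post; the usual guideline is 5


def drop_sparse_rows(table, min_count=MIN_CELL_COUNT):
    """Keep only genotype rows in which every browser count reaches min_count."""
    return {
        genotype: counts
        for genotype, counts in table.items()
        if all(count >= min_count for count in counts)
    }


# Hypothetical counts for one SNP, one list of three browser counts per genotype.
table = {
    "C;C": [700, 200, 100],
    "C;T": [140, 40, 20],
    "T;T": [1, 0, 4],  # too few users to trust the chi-squared approximation
}
trimmed = drop_sparse_rows(table)
print(sorted(trimmed))
```

A SNP is only worth testing afterwards if at least two genotype rows survive, since a single row leaves nothing to compare.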
Testing the whole system
With the chopping fix in place, we ran two end-to-end tests.
First, we looked for links between our users’ DNA and the ancestries we deduced from that DNA. We found plenty that were very strong. Second, as a blind check, we assigned our users random words in a 70/20/10 distribution (like the browsers) and looked for links between those words and the users’ genomes. As hoped, not a single one was found.
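The blind check relies on labels that cannot possibly be related to anyone's DNA. A minimal Python sketch of generating such labels follows. The three words and the fixed seed are arbitrary choices for illustration, not the ones we used.

```python
import random

LABELS = ["apple", "banana", "cherry"]  # arbitrary placeholder words
WEIGHTS = [70, 20, 10]                  # mimics the browser split


def random_labels(n_users, seed=0):
    """Assign each user a random word, independent of anything about them."""
    rng = random.Random(seed)  # fixed seed so the check can be repeated
    return rng.choices(LABELS, weights=WEIGHTS, k=n_users)


labels = random_labels(40_000)
shares = {word: labels.count(word) / len(labels) for word in LABELS}
print(shares)
```

Feeding these labels through the same pipeline in place of the browsers should turn up no significant SNPs. If it did, the pipeline itself would be manufacturing false positives.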
So, what did we find?
After all that, we found seven SNPs that were significantly linked to web browser. The strongest was:
The fact that the heterozygous case had ratios in-between the two homozygous cases was encouraging, as there was nothing in our tests to cause that. What’s less encouraging is what the SNP actually does.
While not every SNP is inside a gene, and not every gene is well studied, this SNP is inside a gene and that gene is well studied. rs6482158 is inside a gene called Nebulette. The protein this gene encodes holds together muscle filaments and is mainly active in the heart. While it may be poetic or funny to conclude that the choice of web browser comes from the heart, our results were not definitive enough to prove it.
We also found two intriguing SNPs in more encouraging parts of the genome:
The first of these was in a gene called SEMA5B and the second was close to a gene called LSAMP. Both genes are involved in helping neurons find each other in the brain and connect. LSAMP plays an especially strong role in the limbic region of the brain, which is responsible for emotions and liking or disliking things.
Is rs1398641 being close to LSAMP interesting? While the DNA near a gene can control when the gene is expressed, or can correlate to unmeasured variations inside the gene, these are ultimately weak connections. It would be far more impressive to find a SNP inside an interesting gene.
Correlation or Causation?
While we had established that these genes had some connection to browser use that was not just found by chance, this didn’t quite show that the genes affect what browser you use. Even though we have some plausible theories about how they might, which helps, this isn’t the same as showing that they do.
We also need to worry that both the genes and the browser are affected by a third factor: ancestry. In particular, we need to worry that ancestry affects browser use in ways that are not mediated by genes.
We chose to look at two different ancestries, African and European, in both cases using the files of individuals with at least 75% of their DNA traced back to that continent. We made this choice because, broken down in this way, these two data sets represented our largest applicable groups.
Does ancestry affect browser use on its own? Yes. The users of European ancestry have roughly the same pattern as the overall userbase, but those of African ancestry are significantly less likely to use Firefox. Is the ancestry effect driving the gene/browser effects we saw earlier? Let’s take a look at rs9864897, which had the most plausible explanation:
When we were comparing G;G and T;T users earlier, we were really looking at ancestry. We just didn’t know it. There’s still some sign of a G;T vs T;T difference among users of African ancestry, but it’s only p<0.0015, which is nowhere near significant when a million SNPs are being looked at.
As the other SNPs show similar patterns, we determined there was no connection to be found.
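The ancestry check amounts to splitting the users into groups first and building a separate genotype-by-browser table inside each group. Here is a minimal Python sketch under stated assumptions: each user comes with a dictionary of ancestry fractions, the users and the SNP name "rsA" are made up, and the 75% cutoff is the one described above.

```python
from collections import defaultdict

THRESHOLD = 0.75                  # at least 75% of DNA traced to one continent
GROUPS = ("African", "European")  # the two groups examined in this post


def ancestry_group(fractions):
    """Return the group whose ancestry fraction reaches the threshold, else None."""
    for group in GROUPS:
        if fractions.get(group, 0.0) >= THRESHOLD:
            return group
    return None


def tables_by_ancestry(users, snp_name):
    """Build tables[group][genotype][browser] -> count for a single SNP."""
    tables = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for fractions, browser, genotypes in users:
        group = ancestry_group(fractions)
        genotype = genotypes.get(snp_name)
        if group is not None and genotype is not None:
            tables[group][genotype][browser] += 1
    return tables


# Hypothetical users: (ancestry fractions, browser, genotypes).
users = [
    ({"European": 0.90, "African": 0.02}, "Firefox", {"rsA": "G;G"}),
    ({"European": 0.80}, "Chrome", {"rsA": "G;T"}),
    ({"African": 0.85}, "Chrome", {"rsA": "T;T"}),
    ({"African": 0.50, "European": 0.50}, "Chrome", {"rsA": "G;T"}),  # in neither group
]
tables = tables_by_ancestry(users, "rsA")
print({group: {g: dict(b) for g, b in rows.items()} for group, rows in tables.items()})
```

If a SNP still shows a browser difference within a single group, ancestry alone cannot explain it. If the difference vanishes within each group, the original signal was most likely ancestry in disguise.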
It’s always possible that there is a gene/browser link that was either too subtle or too complicated for us to detect, but it seems unlikely. If there were going to be a connection, we’d expect it to be a simple one.
This is what science is like. You see an opportunity to investigate something, so you take it. You see some encouraging numbers, go down some dead ends, use fancy statistical tools that only protect you against certain types of errors, and poke at what you’ve found from every angle you can. Every once in a while you find something incredible. And, even when it falls apart, you’re still left with a fun story.