(This blog post was written by Daniel Speyer, a Computer Science Master's student at Columbia University, and Yaniv Erlich.)
DNA.Land has a new ancestry report. We have upgraded both the algorithm that infers your ancestry, so that it gives more accurate and precise information, and the user interface, so that it displays the results more clearly.
Before explaining what has changed, we will need to go over a little background on the reference panel we use to infer ancestry.
The reference panel
To find your ancestry, we start with a reference panel of 6238 monoethnic individuals from 151 places organized into 40 populations. This gives us pretty good, but not complete, coverage of the world’s ethnicities. Our algorithm then looks for similarity between your DNA composition and that of individuals in the reference panel.
Suppose the best match we can find for some part of your ancestry is to a group of ethnic Khmer living in Phnom Penh, Cambodia. Does this mean you have Khmer ancestry? Not necessarily. This is the best match among those we tried, including the Kinh in Ho Chi Minh City, Vietnam, the Dai and Han in various parts of China and the Telugu in India. We did not, for example, compare you to the Thai, because they are not part of the reference data. So what happens if someone of Thai ancestry uses our service? They get matched to the closest population we have data for, which we guess would be the Khmer. Therefore we summarize “Closer to Khmer than to Kinh, Dai, Han or Telugu” as “Cambodian/Thai”.
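The "best match among those we tried" idea can be sketched in a few lines of Python. This is only an illustration, not the DNA.Land algorithm: the two-dimensional coordinates, the `REFERENCE` and `LABELS` tables, and both function names are made up for the example, and a real panel compares far more than two numbers per sample.

```python
import math

# Hypothetical 2-D "genetic coordinates" for a few reference populations.
# The numbers are invented purely for illustration.
REFERENCE = {
    "Khmer": (1.0, 1.0),
    "Kinh": (2.0, 1.5),
    "Dai": (2.5, 2.5),
    "Han": (3.5, 3.0),
    "Telugu": (-3.0, 0.5),
}

# Report labels can name the nearest reference plus likely unsampled neighbours.
LABELS = {"Khmer": "Cambodian/Thai"}


def closest_reference(sample):
    """Return the name of the reference population nearest to the sample."""
    return min(REFERENCE, key=lambda name: math.dist(sample, REFERENCE[name]))


def report_label(sample):
    """Translate the nearest reference population into a report label."""
    best = closest_reference(sample)
    return LABELS.get(best, best)


# A made-up sample that is in nobody's reference set but sits nearest to Khmer.
thai_like = (0.6, 0.8)
print(report_label(thai_like))  # prints "Cambodian/Thai"
```

The point of the sketch is that the answer is always one of the populations in the table: a sample from an unsampled group still gets the nearest available name, which is why the label is broadened.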
In some cases it gets a little more complicated. For example, we have fifty or so samples each from the English, Scottish, Orcadian, Icelandic and Norwegian peoples. The differences between those groups are small compared to the variation within them, so we cannot reliably tell them apart. We therefore group them into a category called “Northern European”.
To cluster the reference populations into categories, we used a machine learning clustering algorithm. That algorithm spits out lists of reference samples that are in each population and (implicitly) lists of samples that are not. Naming these populations is not always straightforward. We used a mixture of ethnic, geographic and linguistic terms, trying to get the right level of breadth while remaining concise and readable. As none of those are quite the same as genetic categories, a lot of the names are not perfect fits. They are the best fits we could find. We hope you find them helpful, but we encourage you to look at the lists of samples that underlie each definition.
The new map
Now that we understand what the populations mean, the challenge is to convey it. No one is going to read the label “Cambodian/Thai” and immediately understand that it means “Closer to Khmer than to Kinh, Dai, Han or Telugu”. This is why we have been reluctant to display such a label in the past. We need to make clear what our categories mean without swamping you in verbosity.
The first solution is the new map (above). For each reference population, we have a pretty good idea of where it was collected. We mark it for you on the map with a small symbol. The symbol will be “✓” if you match this reference population or “∅” if you do not (if it is hard for you to see the symbols, zooming in may help: you can use the scrollwheel, double-click, pinch on a touchscreen or use the buttons in the bottom left).
Next, we will color regions based on which populations they are closest to. The concept of “closest” is geometric, with a handful of corrections to compensate for oceans and mountains. We hope that by keeping things simple, we make intuitively apparent the underlying truth: the points are what we actually know, everything else is deduced from them.
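As a rough sketch of what "closest" can mean geometrically, the Python below assigns a map point to the nearest collection site by great-circle distance. The site coordinates are approximate, and the `penalties` argument is our own stand-in for the kind of ocean and mountain corrections mentioned above; it is an assumption for illustration, not how the site actually applies them.

```python
import math

# Approximate collection sites as (latitude, longitude); illustrative only.
SITES = {
    "Khmer": (11.56, 104.92),  # Phnom Penh
    "Kinh": (10.82, 106.63),   # Ho Chi Minh City
}

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_site(point, penalties=None):
    """Return the site nearest to `point`.

    `penalties` is a hypothetical dict of extra kilometres per site,
    a crude way to model a barrier between the point and that site.
    """
    penalties = penalties or {}
    return min(
        SITES,
        key=lambda name: great_circle_km(point, SITES[name]) + penalties.get(name, 0.0),
    )


bangkok = (13.75, 100.50)
print(closest_site(bangkok))                     # nearest by plain distance
print(closest_site(bangkok, {"Khmer": 1000.0}))  # same point with a barrier added
```

Coloring every point on the map this way produces regions around each sampling point, which is why the points are the only places where the data was actually collected.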
For coloring, we offer you several options to better understand your ancestry:
The “Colors: Same as Above” option simply paints each region based on the colors in your ring.
The “Greyscale: Indicating Percentage” option indicates the strength of signal that the algorithm observed from each region. Note that this is per-region, and small adjacent regions (each of which alone is a small fraction of your ancestry) can add up. When in doubt, check the pie chart.
The “Both at once” option combines the colors in your ring with the strength of the signal.
Regardless of view, you can click any symbol or colored region for a detailed explanation of why it’s there.
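To make the three coloring options concrete, here is a small Python sketch of how a region's fill could be computed in each mode. The exact mapping from percentage to shade is our assumption for the example (darker meaning stronger, blending from white), and the `shade` function and mode names are hypothetical rather than taken from the site's code.

```python
def shade(ring_color, fraction, mode):
    """Return an (r, g, b) fill for one region.

    ring_color: the population's color in the ring chart, 0-255 per channel.
    fraction:   share of ancestry attributed to this region, 0.0 to 1.0.
    mode:       "colors", "greyscale" or "both" (names are illustrative).
    """
    if mode == "colors":
        # Ignore signal strength and reuse the ring color as-is.
        return ring_color
    if mode == "greyscale":
        # Assumed convention: stronger signal gives a darker grey.
        level = round(255 * (1 - fraction))
        return (level, level, level)
    if mode == "both":
        # Blend from white toward the ring color as the signal strengthens.
        return tuple(round(255 + (c - 255) * fraction) for c in ring_color)
    raise ValueError(f"unknown mode: {mode}")


# Three small adjacent regions, each faint on its own, can add up.
regions = {"region A": 0.04, "region B": 0.05, "region C": 0.03}
for name, fraction in regions.items():
    print(name, shade((200, 30, 30), fraction, "greyscale"))
print("combined share:", round(sum(regions.values()), 2))
```

Each region in the loop is shaded from its own small fraction, so all three look faint even though together they account for a noticeable share, which is the reason to check the pie chart when in doubt.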
A Hierarchical View
In addition to details and maps, we have introduced the concept of taxonomic hierarchy. It looks like this:
This has three big advantages.
First, if our user hasn’t heard of (for example) the Mende people, she can still see at a glance that they are African. Then she can look at the map for more details.
Second, if she is not fond of mental arithmetic, she can still see how her ancestry breaks down into European, African and Native American.
The third advantage is that it allows us to express limited knowledge. In our hypothetical case here, we are unable to precisely match 2.1% of her ancestry, but we are able to say what continent it is from. Rather than guess or give up, we place it in the most specific category we can, with the label “Ambiguous”. In much the same way, we can also display ambiguous ancestry one step down the tree, when we can tell the category but not the exact population. Roughly 90% of users will see ambiguity somewhere in their ancestry.
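The arithmetic behind the hierarchy can be sketched as a simple roll-up: every result is a path down the tree, and each level's total is the sum of everything beneath it. The percentages, the "West African" category and the `roll_up` function below are hypothetical, chosen only so that the 2.1% "Ambiguous" entry from the example has somewhere to sit.

```python
from collections import defaultdict

# Hypothetical report: each entry is a path down the hierarchy and a percentage.
# "Ambiguous" marks ancestry we can place at one level but not the next.
RESULTS = [
    (("European", "Northern European"), 40.0),
    (("African", "West African", "Mende"), 30.0),
    (("African", "Ambiguous"), 2.1),
    (("Native American",), 27.9),
]


def roll_up(results):
    """Sum percentages at every level of the hierarchy."""
    totals = defaultdict(float)
    for path, percent in results:
        # Credit the percentage to the leaf and to each of its ancestors.
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += percent
    return dict(totals)


totals = roll_up(RESULTS)
for path in sorted(totals):
    indent = "  " * (len(path) - 1)
    print(f"{indent}{path[-1]}: {totals[path]:.1f}%")
```

In this made-up report the African total includes the ambiguous 2.1%, so the continent-level breakdown still accounts for all of the DNA even though part of it could not be matched to a specific population.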
What Comes Next
In the near future, we hope to get your feedback on our ancestry report. While Facebook is great for comparing results with other people, we would like to encourage you to email us at firstname.lastname@example.org, because then we have your username and can take a deeper look at your results. We are aware that the reference panel is not complete. We need more reference populations, such as the Cree, Ojibwa and Algonquin in North America, the Afar and Oromo in East Africa, and the Czech in Europe.
For everything we do, we need more participants. Please encourage your colleagues, family, and friends to join DNA.Land. Our website is free, not-for-profit, and run by academics at Columbia University and the New York Genome Center.
 For a more general discussion see our previous post: “What is ancestry”.
 Nobody is really monoethnic, but we generally require, at a minimum, that all of the person’s grandparents self-identify as belonging to the same population.