A New Probability Calculator for Genetic Genealogy

Relationship predictions are now updated to include differences in maternal and paternal recombination rates as well as validation of ranges by peer-reviewed standard deviations

Brit Nicholson

Published in

Alexandria Science

11 min read · Apr 6, 2021

The calculator can be found here.

I’ve previously published exact averages and very accurate ranges of shared DNA for any genealogical relationship that can be imagined. The model that produces these results is validated by the standard deviations of Veller et al. (2019 & 2020). Since the data that come out of this model are so accurate, and since they can be calculated for sex-specific genealogical relationships, which had never been done before, it seemed only natural to use it for a relationship probability calculator.

Probability curves for different relationship types

The most striking thing about the figures shown here is the curve for grandparent/grandchild relationships, which features two distinct peaks. Who would’ve thought that those relationships are so different than avuncular and half-sibling relationships? Genetic genealogists have been treating them all the same. We now see that treating them as a homogenous group is a gross oversimplification.

**Figure 1**. Probability curves for relationship types 5C1R to full-siblings at AncestryDNA. The y-axis shows the probability of each relationship type relative to all others included. All types here are sex-averaged, although the calculator gives sex-specific probabilities for half-avuncular, 1C, avuncular, half-sibling, and grandparent/grandchild relationships. 1C1R = 1st cousin, once removed; cM = centiMorgan, HIR = half-identical regions. The second cousin (2C) curve is higher because it’s the first curve to be the only one from its group (it has little competition near its center).

The first thing that came to mind when I saw the probability curves in Figure 1, other than surprise, was a discovery that I had made and written about just one week earlier. At that time, I had found that a person is actually more likely to share 22% or 28% DNA with a grandparent than 25%, despite 25% being the expected value. But it turns out that that rule isn’t the reason for the two peaks on the grandparent/grandchild curve, at least not directly. In fact, the two peaks are actually much farther apart than 22% and 28%. And the histogram for grandparent/grandchild relationships only has one peak, as shown in Figure 2.

**Figure 2**. Normalized histogram for 500,000 grandparent/grandchild pairs. These are the same data points that went into the probability calculator. The individuals were simulated as 250,000 paternal grandparent/grandchild pairs and 250,000 maternal grandparent/grandchild pairs, but the fractions of shared DNA for each were not differentiated when creating the histogram. For that reason, despite not being labeled as paternal or maternal, values near 0.25 on the x-axis are more likely to come from maternal grandparent/grandchild pairs and values at the far ends of the histogram are much more likely to be from paternal grandparent/grandchild pairs.

The reason for the two peaks in Figure 1 is that grandparent/grandchild relationships have far more variance than all other relationships (Veller et al., 2019 & 2020). But I think that the reason that those relationships have such high variability is the 22%/28% rule. One relationship is dependent on the other, since the percentages for a grandparent pair have to add up to 50%. It’s the only relationship type in which a value far away from the mean has to be accompanied by another value far away from the mean. For this reason, the 22%/28% rule is an indirect cause of the two peaks. Since this subject of relationship probabilities concerns the relative probabilities of relationship types, a gap between two curves has to be filled by one or more other relationship curves. And the largest gaps occur between the group that includes grandparents and the two groups on either side of it. The difference is even more striking when looking at IBD data such as in Figure 3. (IBD stands for identical by descent. It’s the total amount of DNA that two people are reported to share. It can be contrasted with half-identical region (HIR) sharing, which counts fully-identical regions (FIR, or IBD2) as if they are HIR.) Reporting the total amount of DNA that full-siblings share moves the curve for that relationship even farther to the right of grandparent/grandchild relationships.

**Figure 3**. Probability curves for relationship types 5C1R to full-siblings at 23andMe. IBD = identical by descent, which includes both HIR and FIR shared DNA. All other parameters and abbreviations are the same as in Figure 1.

Figure 3 shows a drastic increase in the height of the right-most peak for grandparent/grandchild relationships when compared to Figure 1. The probability of this relationship type peaks at 78.7% around 2,510 cM as would be reported by 23andMe. This is due to moving the full-sibling curve far to the right, from the 37.5%, on average, that would be reported by AncestryDNA to the 50%, on average, that full-siblings actually share. In contrast, half-siblings are only 12.1% likely and avuncular relationships only 3.2% likely at 2,510 cM. An added benefit of IBD sharing platforms is that half-siblings are more easily distinguished from avuncular relationships, which is very apparent from about 2,200 cM to 2,500 cM.
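
The 37.5% and 50% averages for full-siblings follow from how fully-identical regions are counted. Here is a minimal sketch of that arithmetic, assuming the standard expected fractions for full-siblings (25% FIR, 50% half-identical only, 25% not shared). The function names are illustrative and are not part of the calculator.

```python
# Expected genome fractions for full-siblings, on average:
# 25% fully identical (FIR), 50% half-identical only, 25% not shared.
FIR = 0.25
HIR_ONLY = 0.50


def hir_style_share(fir, hir_only):
    """Fraction shared when FIR is counted as if it were HIR."""
    # Every matching region counts once; dividing by 2 expresses the
    # result relative to both copies of the genome.
    return (fir + hir_only) / 2


def ibd_style_share(fir, hir_only):
    """Fraction shared when FIR counts on both copies (total IBD)."""
    return (2 * fir + hir_only) / 2


print(hir_style_share(FIR, HIR_ONLY))  # 0.375
print(ibd_style_share(FIR, HIR_ONLY))  # 0.5
```

For a relationship with no FIR, such as grandparent/grandchild, both functions return the same value, which is why only the full-sibling curve moves between Figure 1 and Figure 3.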

Is it really possible for the likelihood that you’ve found a grandparent at 2,510 cM to be that much greater than a half-sibling, aunt, or uncle? Because of how unlikely it is for half-siblings or avuncular pairs to share 2,510 cM, the answer is yes. The caveat is that a grandparent/grandchild match might be less likely because of age or representation in the population. But, as time progresses and DNA kits remain in the database, the likelihood of finding grandparents will likely increase. You would have to weigh the probabilities against those other factors. And, of course, there are other relationship types that are possible at this number of cM. It could be 3/4 siblings, for example, and the amount of FIR sharing should be analyzed separately in cases such as this.

Comparison to a previously used probability curve

I calculated these probabilities presumably the same way that it was done in the AncestryDNA white paper. Their probability curves from that paper have been the most widely used method of determining relationship probabilities. However, in their methodology, relationship types are lumped into groups, and sex-specific probabilities aren’t calculated.

I wasn’t sure what to expect once I developed a way to compare my model results to AncestryDNA’s model results. Very few details are given about their methods or data, including anything that could be used to validate their methods or probability results. I find that the white paper probability curves look very similar to the curves that I plotted. Since the simulation I use is validated by standard deviations from Veller et al. (2019 & 2020), this means that the AncestryDNA numbers are probably fairly good. That’s because they used a simulation. Despite my love for data, in genetic genealogy bad data is the name of the game.

**Figure 4**. Relationship probabilities from my simulations on the left compared to those from AncestryDNA on the right. Units are the same for both graphs. The y-axes for both graphs are on a logarithmic scale. This was done at AncestryDNA in order to show the differences in more distant relationships, which were otherwise bunched up.

The differences for distant cousins can be accounted for by the fact that the probabilities in my dataset were calculated against other, more distant relationships that are not shown here in order to correspond to the AncestryDNA chart. The 3C1R, 4C, etc. probabilities on my graph now don’t add up to 1. They did when 4C1R, 5C, and 5C1R were included, but those are now left out. For relationship types such as the half-sibling/grandparent group, I was able to add up all of the probabilities to make one curve. I could go back and re-calculate the probabilities for 3C1R, 4C, etc. without including more distant relationships, but I think the comparison of graphs is clear as-is.
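
The two operations described above, adding up the curves within a group and recalculating after leaving out distant relationships, can be sketched as follows. All numbers are hypothetical values at a single cM point, not values from my dataset, and the helper names are only for illustration.

```python
def combine_group(probs, members):
    """Add the probabilities of the relationship types that form one group."""
    return sum(probs[rel] for rel in members)


def renormalize(probs):
    """Rescale the kept probabilities so they add up to 1 again."""
    total = sum(probs.values())
    return {rel: p / total for rel, p in probs.items()}


# Hypothetical probabilities at one cM value (not from the article's data).
all_types = {"3C1R": 0.20, "4C": 0.30, "4C1R": 0.25, "5C": 0.15, "5C1R": 0.10}

# Leaving out the three most distant types gives a total below 1.
shown = {rel: p for rel, p in all_types.items() if rel in ("3C1R", "4C")}
print(sum(shown.values()))

# Recalculating without the dropped types restores a total of 1.
print(renormalize(shown))

# Types within one group are merged by simple addition.
close = {"grandparent": 0.5, "half-sibling": 0.3, "avuncular": 0.2}
print(combine_group(close, ["grandparent", "half-sibling", "avuncular"]))
```

Renormalizing changes the height of each remaining curve but not the ratios between them.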

Methodology

To calculate probabilities for the new tool, 500,000 individual pairs were compared from each relationship type. Each pair shares a certain number of cM. Bins 1 cM wide were created, centered on integer values, and the number of pairs for each relationship type was counted for each bin. Those counts were then used to determine the probability of each relationship type at a given cM value. For 500,000 half-siblings, 250,000 paternal and 250,000 maternal half-sibling pairs were included. That allows half-siblings to be equally weighted against grandparent/grandchild relationships, which share the same mean. First cousins include four different sex-specific paths, so each type consisted of 125,000 pairs. Sex-specific probabilities were calculated for relationships including 1st cousins and closer. Sex-specific probabilities are not as different for more distant relatives, and the number of sex-specific paths increases exponentially (16 types of 2nd cousins), so those differences weren’t included.
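
The binning and counting steps above can be sketched in a few lines. This is a simplified illustration and not the code behind the calculator: the relationship labels and cM values are made-up toy data, and the function names are hypothetical.

```python
from collections import Counter


def bin_index(cm):
    """Index of the 1 cM bin centered on an integer: [n - 0.5, n + 0.5)."""
    # cM values are non-negative, so int() acts as a floor here.
    return int(cm + 0.5)


def build_histograms(samples_by_type):
    """Count simulated pairs per bin for every relationship type."""
    return {
        rel: Counter(bin_index(cm) for cm in values)
        for rel, values in samples_by_type.items()
    }


def relative_probabilities(histograms, cm):
    """Probability of each type at a cM value, relative to the types included."""
    target = bin_index(cm)
    counts = {rel: hist[target] for rel, hist in histograms.items()}
    total = sum(counts.values())
    if total == 0:
        return {rel: 0.0 for rel in counts}
    return {rel: n / total for rel, n in counts.items()}


# Toy data: the same number of pairs per type, mirroring the equal weighting.
toy_samples = {
    "type_A": [99.6, 100.2, 100.4, 101.7],
    "type_B": [100.1, 250.0, 251.3, 99.4],
}
hists = build_histograms(toy_samples)
print(relative_probabilities(hists, 100))  # {'type_A': 0.75, 'type_B': 0.25}
```

Because every type contributes the same number of simulated pairs, dividing a bin's count by the bin total gives the relative probability directly, with no extra weighting step.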

The amount of shared DNA between individuals is highly variable. Smoothing of the data was very much necessary, and it was by far the hardest step of the process. Figure 5 shows how un-smooth the curves are for raw data. These curves are actually less realistic than the smoothed curves. For a given set of assumptions and parameters, even in real life, there is some definite probability for each relationship type at each cM value. It is not a fuzzy probability. If I increased the number of individual pairs for each relationship type, perhaps to one million or several million, then the probability curves wouldn’t require smoothing. Imagine trying to get an empirical database that large, which would then contain a lot of erroneous data and/or be missing a lot of data erroneously labeled as “outliers.”

**Figure 5**. Un-smoothed probability curves for relationship types 5C1R to full-siblings at AncestryDNA. The y-axis shows the probability of each relationship type relative to all others included. All types here are sex-averaged, although the calculator gives sex-specific probabilities for half-avuncular, 1C, avuncular, half-sibling, and grandparent/grandchild relationships.

I ensured that the smoothing didn’t flatten the curves. I only applied as much smoothing as was necessary to get the curves monotonic over the applicable ranges and then ensured that the probability values were unchanged from what would be expected if you were to draw a curved line along the center of the above probability curves. It’s easy to see in the un-smoothed graph: Grandparent/grandchild relationships are quite different than avuncular and half-sibling relationships.
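
To illustrate the idea of applying only as much smoothing as necessary, here is a rough sketch. It assumes a centered moving average and a curve with a single peak. Both are stand-ins chosen for the example: the smoothing method isn't specified above, and the grandparent/grandchild curve would have to be handled in separate ranges.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def is_unimodal(values):
    """True if the values never rise again once they have started falling."""
    falling = False
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            falling = True
        elif cur > prev and falling:
            return False
    return True


def minimal_smoothing(values, max_window=51):
    """Return the smallest odd window that makes the curve rise, then fall."""
    for window in range(1, max_window + 1, 2):
        smoothed = moving_average(values, window)
        if is_unimodal(smoothed):
            return window, smoothed
    raise ValueError("no window up to max_window gives a unimodal curve")


# Made-up noisy counts with one underlying peak.
noisy = [0, 2, 1, 3, 5, 4, 6, 5, 3, 4, 2, 1, 0]
window, smooth = minimal_smoothing(noisy)
print(window)  # 3
```

Stopping at the smallest window that removes the spurious dips keeps the peak close to its raw height, which is the point of not flattening the curves.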

Advantages of this probability calculator

Some relationship types within a group are too different to be treated the same: Grandparents are far different than half-siblings and avuncular relationships. This calculator treats them differently. This and the next point make this calculator especially accurate for close relatives.
There are significant differences between paternal and maternal recombination rates. This results in much wider ranges of shared DNA between paternal relatives than for maternal relatives. The probability calculator used here allows for those differences.
The data for IBD probability curves, such as those for 23andMe, come from IBD data. This is an exceedingly important point. It is not a good idea to use an AncestryDNA graph to try to distinguish between relationships at 23andMe.
The data used to calculate the probabilities are from the same model and version that made the most accurate tables of shared DNA currently published.
The probabilities used in this calculator can’t be influenced by erroneous data, whether mislabeled, affected by endogamy, or potentially including multiple unknown relationships.

There are important differences that can be seen with this tool.

For AncestryDNA data, 1,272 cM is the value at which grandparents and great-grandparents are equally likely, at about 25.6% probability each. Half-avuncular relationships are 18.6% likely, half-siblings are 11.9% likely, and avuncular relationships are 7.8% likely. This makes a total of 46.3% for the group that includes grandparents, half-siblings, and avuncular relationships and leaves 53.7% for the next group. This is similar to the 50/50 split that AncestryDNA reports, except the former values are broken down by multiple relationship types (including paternal and maternal, which aren’t shown in this example but are included in the calculator), and are validated by peer-reviewed statistics. AncestryDNA hasn’t released any kind of statistics to validate their data.

Other important notes

All probabilities are for autosomal DNA only. Please subtract any X-DNA before using the calculator. Also, I recommend subtracting any shared DNA from segments less than 7 cM that may have found their way into your total. Family Tree DNA includes very small segments in their total cM calculations.

The above probabilities assume no endogamy or other pedigree collapse. Those cases should be treated separately.

Multiple cousin relationships are not included here, but averages and ranges can be found here.

Parent/child relationships are not included here. They are easy to distinguish from other relationships, including full-siblings. Parent/child relationships consist of a half-identical match across the whole length of the genome. Full-siblings share 25% fully-identical regions, on average. Genotyping sites will take this into account in their relationship prediction. If a relationship is predicted to be parent/child, full-sibling is not a possible relationship and there is no need to analyze the shared DNA amount here.

Relationships more distant than 1C1R and half-1C are grouped together by those with the same average shared DNA. Also, half-avuncular relationships are treated the same as siblings of grandparents, which are called great- or grand-avuncular relationships. They are treated the same because the curves are the same, as are any other relationship types that share the same curve. For each curve shown in the figure at the bottom of the page, 500,000 pairs were simulated. Therefore, relative probabilities of each relationship type are based on the assumption that an equal number of each are possible in the population. While this assumption isn’t true, it’s the best way to generate probabilities. Age and other factors, such as the likelihood that your unknown great-grandparent or great-grandchild is the DNA match you’ve found, should be taken into consideration. It’s probably more likely that a 1,200 cM match is a half-avuncular relationship than a great-grandparent, despite the fact that, if they were equally likely relatives to find as DNA matches, the cM value alone suggests great-grandparent is more likely.
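
One way to take those other factors into account is to reweight the equal-population probabilities with your own estimate of how likely each relative is to appear as a match, and then renormalize. The sketch below is not a feature of the calculator; the function name, the probabilities, and the weights are all hypothetical.

```python
def apply_priors(equal_prior_probs, prior_weights):
    """Reweight probabilities computed under equal priors, then renormalize."""
    weighted = {
        rel: p * prior_weights.get(rel, 1.0)  # unlisted types keep weight 1
        for rel, p in equal_prior_probs.items()
    }
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("all weighted probabilities are zero")
    return {rel: w / total for rel, w in weighted.items()}


# Hypothetical output for a match, assuming equal numbers of each type.
from_calculator = {"great-grandparent": 0.6, "half-avuncular": 0.4}

# Hypothetical judgment: a great-grandparent is four times less likely
# to be in the database than a half-aunt or half-uncle.
weights = {"great-grandparent": 0.25, "half-avuncular": 1.0}

adjusted = apply_priors(from_calculator, weights)
print(adjusted)
```

With these made-up weights the half-avuncular relationship becomes the more likely one, even though the cM value alone favored great-grandparent.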

These probabilities are only calculated as far back as 5C1R. The huge advantage of this tool, other than the accuracy of the data, is that it treats close relatives as not being in the same group because the curves are significantly different. For distant relatives, there’s much less certainty about the genealogical relationship for your DNA matches. Matches as low as 8 cM are allowed here, although the relationship may be farther back than 5C1R. Still, the relative probabilities may be accurate at those low values. Indeed, any of the probabilities shown above are only relative to the other relationships listed, so they’re only meaningful in comparison to the other relationships. And there’s no cM value at 8 cM or above at which even a 4C1R is the most probable relationship. So, while the probability of an 8 cM match may be higher for “4C1R or more distant,” listing each relationship type separately would not result in more useful information. Not only are very low cM values difficult to assign to a recent ancestor, but segments of 20 cM or 30 cM may be on pile-up regions and therefore come from very distant ancestors.

Totals will not always add up to 100%. When multiple relationship types are present, the chance of rounding errors increases. I don’t believe that the totals are ever off by more than 0.2 percentage points.
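
A small example shows how this happens. It assumes that each probability is rounded to one decimal place before display, which matches the precision of the percentages quoted in this article; the three equal probabilities are made up.

```python
# Three relationship types that are exactly equally likely.
probabilities = [1 / 3, 1 / 3, 1 / 3]

# Round each one to a tenth of a percentage point, as it would be displayed.
displayed = [round(100 * p, 1) for p in probabilities]
print(displayed)  # [33.3, 33.3, 33.3]

# The displayed values no longer add up to 100.
total = round(sum(displayed), 1)
print(total)  # 99.9
```

Each rounded value can be off by up to 0.05 percentage points, so the more relationship types there are at a given cM value, the farther the displayed total can drift.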

This is not the first tool to show relationship probabilities based on a user input of shared DNA. Genetic Affairs had the first automated tool. And Jonny Perl has done amazing work at DNA Painter, including probability calculations that can be built into your family tree.

Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my other articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.

Originally published at https://dna-sci.com.
