New and Improved Autosomal Genetic Model
A comparison of three models, including updated amounts of shared DNA between various relatives and ancestors
Genetic Modeling Background
I recently published a comparison of autosomal genetic models. The first model, which I made about two years ago, is as simple as can be, but I think it captures the important insights. The newer model adds a feature you’d find if you peered into the genome of a real human: two homologues per chromosome. This allows the simulation of ‘genes’ or ‘segments’ switching places from one homologue to the other, potentially multiple times. A constraint on both models is that siblings have to share 50% of their DNA on average, with a standard deviation of 3.6%. A constraint on the newer, two-homologue model is that there has to be an average of 55 recombinations from a parent’s genome to a child’s genome (the number is actually higher in mothers and lower in fathers, but the specific numbers are unknown), with the number of recombinations in each transmission approximated by a Poisson distribution.
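To make the two-homologue idea concrete, here is a minimal sketch of a single parent-to-child transmission. It is not the code behind the model: the function names, the single linear genome, the uniformly placed crossover points, and the clamp on the crossover count are all assumptions made for illustration.

```python
import math
import random

def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (stdlib only)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def make_gamete(hom_a, hom_b, lam, rng):
    """Return one recombined homologue built from a parent's two homologues.

    Assumption: crossovers fall uniformly on the boundaries between segments.
    """
    n = len(hom_a)
    # There are only n - 1 boundaries, so the crossover count is clamped.
    n_cross = min(poisson(lam, rng), n - 1)
    cuts = set(rng.sample(range(1, n), n_cross))
    # Start on a randomly chosen homologue.
    source = (hom_a, hom_b) if rng.random() < 0.5 else (hom_b, hom_a)
    current = 0
    gamete = []
    for i in range(n):
        if i in cuts:
            current = 1 - current  # switch homologues at each crossover
        gamete.append(source[current][i])
    return gamete

# Usage: label the parent's homologues "P" and "M" and count the switches.
rng = random.Random(42)
child_hom = make_gamete(["P"] * 99, ["M"] * 99, 55, rng)
switches = sum(a != b for a, b in zip(child_hom, child_hom[1:]))
print(switches)
```

The printed number of switches equals the number of crossovers drawn for that transmission, which averages about 55 over many runs with these inputs.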
The only thing left to decide was how many ‘genes’ or ‘segments’ would have to be available in order to recombine them an average of 55 times. A lower bound on that number would be one more than the maximum value you would get from the Poisson distribution. It would be hard to find an upper bound, but the standard deviation between siblings would still have to be 3.6% no matter the number of segments, and a larger number of available segments would very quickly become computationally burdensome for the simulation.
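A Poisson distribution has no true maximum, so one way to put a number on that lower bound is to use a value that is almost never exceeded. The sketch below does this with a tail probability of one in a billion, which is an arbitrary choice of mine and not a figure from the model; the function name is also hypothetical.

```python
import math

def poisson_upper_quantile(lam, tail=1e-9):
    """Smallest k such that P(X > k) < tail for X ~ Poisson(lam)."""
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = pmf
    k = 0
    while 1.0 - cdf >= tail:
        k += 1
        pmf *= lam / k    # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k

# One more segment than the largest crossover count we expect to see.
min_segments = poisson_upper_quantile(55) + 1
print(min_segments)
```

A stricter tail gives a larger bound, but the bound grows slowly because the Poisson tail falls off quickly.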
A Better Model
I knew at the time that the complicated model wouldn’t necessarily be more accurate. In fact, I noted that the simpler one is likely the more accurate of the two. But there could be additional constraints that, if known, could change that. For example, what’s the standard deviation between a grandparent and grandchild? Or what percentage of third cousins share no DNA with each other? Or my favorite constraint of all …
No sooner had I published that comparison of two models than I found something I had been waiting for since making my first genetic model. Late last year a preprint was released for a study of recombination and variance in genetic relatedness. This paper estimated the standard deviation of shared DNA between grandchildren and a paternal grandparent to be 4%. Surely this value would be easy to come by empirically for someone with a decent dataset, and I had been asking the major direct-to-consumer testing companies for statistics on grandparent-grandchild relationships for two years, but so far none have been interested in helping. I don’t know how accurate the value from the above paper is. In fact, I thought it sounded quite high, but it’s at least a starting point for something I’ve been wanting to model for a couple of years. If the value turns out to be a bit different, I can simply substitute an updated value in the future.
It’s been known for years that genomes recombine differently in men and women. Women have more recombinations, resulting in less variation between maternal relatives, and men have fewer, resulting in more variation between paternal relatives. My earlier models showed that the standard deviation between a grandchild and a grandparent might be about 2.55%. Therefore, if the standard deviation between grandchildren and paternal grandparents is 4%, the standard deviation between grandchildren and maternal grandparents is likely much less than that (in order to average something like 2.55%). This is in line with more recombinations in women and less variation between maternal relatives.
I was ready to create a model that treats recombination rates of mothers differently than those of fathers. Once the model could find the recombination rate for fathers that produces the correct standard deviations, the recombination rate for mothers and the standard deviation for maternal grandparents would be known. Finally, I could compare results of the updated autosomal model with the two previous models.
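The search for a paternal rate can be sketched roughly as follows. This is a simplified stand-in, not the training procedure used for the model: it fits only the paternal grandparent target (not the sibling constraint), treats the genome as one linear run of segments with uniformly placed crossovers, and uses a plain bisection on a noisy simulated standard deviation. All function names and default parameters are hypothetical.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def grandparent_share(n_segments, lam, rng):
    """Fraction of a grandchild's DNA that comes from one grandparent."""
    n_cross = min(poisson(lam, rng), n_segments - 1)
    cuts = sorted(rng.sample(range(1, n_segments), n_cross))
    bounds = [0] + cuts + [n_segments]
    start = rng.randrange(2)  # homologue on which the parent's gamete starts
    # Every other interval comes from the homologue inherited from this grandparent.
    from_gp = sum(bounds[i + 1] - bounds[i]
                  for i in range(len(bounds) - 1) if i % 2 == start)
    return 0.5 * from_gp / n_segments  # the gamete is half of the grandchild's genome

def share_sd(n_segments, lam, trials, rng):
    """Simulated standard deviation of the grandparent share."""
    return statistics.pstdev(
        [grandparent_share(n_segments, lam, rng) for _ in range(trials)])

def fit_paternal_rate(target_sd, n_segments, lo=10.0, hi=80.0,
                      steps=10, trials=2000, seed=1):
    """Bisect on the rate: more recombination means less variation."""
    rng = random.Random(seed)
    for _ in range(steps):
        mid = (lo + hi) / 2
        if share_sd(n_segments, mid, trials, rng) > target_sd:
            lo = mid  # too much variation, so the rate must be higher
        else:
            hi = mid
    return (lo + hi) / 2

print(fit_paternal_rate(0.04, 1700))
```

Because each standard deviation is estimated from a finite number of trials, the fitted rate is only approximate; more trials per step would tighten it at the cost of run time.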
Training the Model
Initial tests made it clear that, as the number of segments is increased, the input recombination rate for fathers has to increase as well, approaching an asymptote (see Figures 1 and 2). That paternal rate appears to level off around 35 average recombinations per autosomal genome (corresponding to a Poisson lambda input of 13). In order for the average recombination rate to be 55, the rate for mothers would have to be about 75.
In order to run simulations that don’t take too long, it would have been preferable to use a low number of segments such as 99, which was used in the first two-homologue model. Values close to the targeted standard deviations could be achieved with almost any number of segments, which might suggest that a low number could be used. However, given the asymptotic relationship between the number of segments and the recombination rate, it seems as though it would be best to pick a high number of segments, one that gets the recombination rate close to the asymptote.
When the number of available segments used in the simulation is high enough, those segments are essentially simulating individual genes. There are approximately 20,000 coding genes in the human genome. This would probably be a good number to use in the simulation. There are actually many more genes that don’t code for proteins, suggesting that the simulation should maybe use a higher number. However, recombinations often occur at the same spots. And that’s obviously true for the beginning and end of each chromosome, which this model doesn’t attempt to simulate. Recombination occurring in the same spots suggests using a lower number of available segments for the model. Without more information, 20,000 genes is probably a good mid-point. Unfortunately, regular computers won’t be able to simulate many trials of this model if 20,000 available segments are used. It will have to be a much lower number, but one that gets the recombination rate close to the asymptote.
If more statistics become available to use as constraints, such as the percentage of 3rd cousins who share no DNA, I believe they would pinpoint the best spot on the asymptotic curve to use. If the assumptions of the model are at all correct, and given its constraints I think they are, that would probably give a recombination rate that’s very close to the real-life value.
I eventually decided to run simulations with 1,700 segments. I wasn’t very happy about how far away that was from the asymptote, but any higher number of segments required simulations that took way too long to run. Perhaps someday better processors or distribution over processor cores will allow a higher number of segments. Or perhaps I should be running these simulations in C++. The best paternal recombination rate to use with 1,700 segments was 33.84, which is probably about one recombination lower than the asymptote. That results in a maternal recombination rate of 76.16. Based on other values I’ve tested, some more accurate combinations to use would probably be 1,800 segments and 33.86 recombinations or 2,000 segments and 33.9 recombinations. However, since the target statistics aren’t known to very precise values (the sibling standard deviation could be anywhere from 3.55% to 3.65% and the paternal grandparent standard deviation could be anywhere from 3.5% to 4.5%), there’s no need to try to predict them more accurately.
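The maternal rates quoted above follow from the paternal ones, assuming the overall figure of 55 is a simple unweighted mean of the two. A short check of that arithmetic (the function name is just for illustration):

```python
def maternal_rate(paternal_rate, target_mean=55.0):
    """Maternal rate that makes the mean of the two rates equal target_mean."""
    return 2 * target_mean - paternal_rate

for paternal in (35.0, 33.84, 33.86, 33.9):
    print(f"paternal {paternal:.2f} -> maternal {maternal_rate(paternal):.2f}")
# paternal 35.00 -> maternal 75.00
# paternal 33.84 -> maternal 76.16
# paternal 33.86 -> maternal 76.14
# paternal 33.90 -> maternal 76.10
```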
Recommended model input based on training model results:
I’ve noted already that the single homologue model is likely more accurate than the first two-homologue model that I made since the latter didn’t have additional constraints on it to justify its additional complexity. However, the new model presented here has the additional constraint of relatedness to paternal grandparents. I believe that this model is now the most accurate of the three. The tables below will present the statistics for shared DNA between various relatives alongside the results of previous models.
Figure 12 shows the first relatives (tested here so far) with which you might not share any DNA. I believe that these figures are fairly accurate. I mentioned in a previous article that ‘0% shared DNA’ is the statistic that the simple, single homologue model probably gets wrong. The reason is that the model only has an average of 97.5 segments available to pass from a parent to a child. Other than zero, the smallest amount of DNA that two people can share in that model is 1 of 98 segments (~1%). In many cases, values that might otherwise have fallen between 0 and 1/98 will end up as 0 in the simple model. In the first two-homologue model, more than twice as many segments are available, allowing for better resolution when very little DNA is shared. In the newest two-homologue model, 3,400 segments are available (after adding the two homologues together). Using something like 20,000 segments would result in even better resolution, but probably only slightly.
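The resolution argument comes down to one segment out of the total available. A quick sketch of that arithmetic follows; the 198 figure for the first two-homologue model is inferred here as 2 × 99 and is not stated directly above.

```python
def smallest_nonzero_share(n_segments):
    """Smallest nonzero fraction of DNA the model can represent."""
    return 1.0 / n_segments

models = [
    ("single homologue", 98),
    ("first two-homologue", 198),  # assumption: 2 x 99 segments
    ("newest two-homologue", 3400),
    ("hypothetical gene-level", 20000),
]
for label, n in models:
    print(f"{label}: {100 * smallest_nonzero_share(n):.3f}%")
```

Going from 98 to 3,400 segments shrinks the smallest representable share by a factor of about 35, while going from 3,400 to 20,000 only shrinks it by a further factor of about 6, which is one way to see why the extra segments would help only slightly.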
A follow-up post shows all of the results for the new and improved model without the previous model results.
Cover photo by Robin Kumar. Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. That model was also used to make a very accurate relationship prediction tool. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits.
Originally published at http://www.dna-sci.com.