Modeling the Inheritance of X Chromosome DNA
A discrete, stochastic model written in Python
First X Model Attempt
In June 2019 I started working on a model of X Chromosome inheritance. The model ran into some problems and I seem to have given up on it for some time. Seven months later I decided that this was something I still really wanted to do.
I set up the model much in the same way as this autosomal model. The only differences were that males would now have half of the number of segments as females and that X-DNA can’t be passed through a line of two consecutive males.
These were the rules for the X Chromosome model:
- A mother passes half of her available segments to her children.
- Those segments passed from the mother are done so at random, so that they can come from her father or mother. On average, half will be from her father and half from her mother, but the amount from one instance to another will be highly variable.
- A father passes all of his segments to his daughter, but none to his son.
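The three rules above can be sketched in a few lines of Python. This is only an illustration of the unordered first attempt: the function names and the list-of-labels representation are made up here and aren't taken from the model's actual code.

```python
import random


def mother_passes(from_her_father, from_her_mother, rng=random):
    """A mother passes half of her available segments, chosen at random."""
    pool = list(from_her_father) + list(from_her_mother)
    return rng.sample(pool, len(pool) // 2)


def father_passes(his_segments, child_is_female):
    """A father passes every segment to a daughter and nothing to a son."""
    return list(his_segments) if child_is_female else []


rng = random.Random(1)  # seeded only so the illustration is repeatable
print(mother_passes(["F1", "F2", "F3"], ["M1", "M2", "M3"], rng))
print(father_passes(["G1", "G2", "G3"], child_is_female=True))
print(father_passes(["G1", "G2", "G3"], child_is_female=False))
```

Because the mother's draw is random, any one child can end up with anywhere from none to all of the segments that came from her father.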
One consequence of these rules is that a mother will pass a full X Chromosome to her son, but it could likely be recombined from both of her parents. Another is that a father will pass a full X Chromosome from his mother to his daughter. Although it may have been recombined from the paternal grandmother’s two homologues, all of the X DNA that the granddaughter has on one of her X Chromosomes came from that grandmother, preserved over two generations. Indeed, it’s well known that alternating male-female lines preserve the most X-DNA over the generations; although I discovered some small caveats to that, which I’ll discuss later.
I discovered something that astounded me pretty early on in my development of the model. Using the premise that X-DNA can’t be passed through a line of two consecutive males, the number of ancestors in each generation who can contribute X Chromosome DNA to a female descendant follows the famous Fibonacci sequence, so the ratio of that number in one generation to the number in the next approaches 0.618…, or the inverse of the golden ratio, as generations increase farther back in time. I first noticed when I was trying to develop a formula for which ancestor would be the first (from the left), in a given generation, to be able to pass X-DNA to a female descendant. Then I noticed that the number of ancestors, as shown by the boxes with color in Figure 1, followed the pattern 2, 3, 5, 8, etc. I decided to investigate a little further and found that I wasn’t the first person to discover this phenomenon. Here’s how it works: both parents, or 100%, can contribute X-DNA. However, a paternal grandfather cannot make a contribution, leaving 3/4 (or 0.75) of grandparents. Only 5/8 (or 0.625) of great-grandparents can contribute X-DNA, and then 8/16 (or 0.5) of 2g-grandparents, so the fraction of all ancestors in a generation keeps shrinking. It’s the ratio of consecutive counts (5/8 = 0.625, 8/13 ≈ 0.615, 13/21 ≈ 0.619) that settles toward 0.618.
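The 2, 3, 5, 8 pattern is easy to check by brute force. The sketch below lists every ancestor path for a given generation using the 'M'/'P' path letters described later in the article, and throws out any path that contains two consecutive males. The function name is hypothetical and this is not the code from the model.

```python
from itertools import product


def x_ancestor_paths(generations, descendant_is_female=True):
    """Return the paths (e.g. 'MPM') to ancestors who can contribute X-DNA."""
    paths = []
    for letters in product("MP", repeat=generations):
        path = "".join(letters)
        if "PP" in path:
            continue  # two consecutive males: no X-DNA gets through
        if not descendant_is_female and path[0] == "P":
            continue  # a son receives no X-DNA from his father
        paths.append(path)
    return paths


counts = [len(x_ancestor_paths(g)) for g in range(1, 13)]
print(counts[:5])               # [2, 3, 5, 8, 13]
print(counts[-2] / counts[-1])  # ratio of consecutive counts, close to 0.618
```

Passing `descendant_is_female=False` gives the same sequence shifted by one generation (1, 2, 3, 5, ...), since a son's X-DNA ancestors are simply his mother's.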
Building the X Chromosome model based on the autosomal model was fairly easy. The only major hurdle this time was that I had to create dictionary keys on-the-fly as well as strings of letters that give the path to one’s ancestor, to be used as labels in the output, for example ‘MMM’ for maternal maternal great-grandmother. I was eventually able to find the information I needed.
The notation used for the model made it easy for the simulation to create a variable number of ancestors on-the-fly (variable based on the number of generations input by the user). It also made it easy for me to keep track of the generation, left-right position in the tree, and sex of an ancestor. The first character is a letter corresponding to male ancestors (‘P,’ for paternal) or female ancestors (‘M,’ for maternal). Next is a number, taking up only two characters in this figure, but more characters if going seven or more generations back. The number corresponds to the order of the ancestor from left to right, only within a particular generation. The next character is an underscore and that’s followed by the generation of a particular row. Parents are called generation 1, grandparents are generation 2, etc.
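A minimal sketch of building those keys on the fly might look like the following. It assumes that numbering starts at 01 on the left, that the father sits to the left of the mother in each couple (so odd positions are male), and that the number is zero-padded to the width of the largest position. Those details are guesses based on the description above, and the function name is hypothetical.

```python
def ancestor_keys(generations):
    """Build a dict with one key per ancestor, e.g. 'P01_1' or 'M04_2'."""
    # Two digits are enough through six generations (64 ancestors);
    # seven generations (128 ancestors) need three.
    width = max(2, len(str(2 ** generations)))
    tree = {}
    for gen in range(1, generations + 1):
        for pos in range(1, 2 ** gen + 1):
            sex = "P" if pos % 2 == 1 else "M"  # assumed: father on the left
            tree[f"{sex}{pos:0{width}d}_{gen}"] = []  # segments go here later
    return tree


print(list(ancestor_keys(2)))
# ['P01_1', 'M02_1', 'P01_2', 'M02_2', 'P03_2', 'M04_2']
```

With keys like these, the sex, the left-to-right position, and the generation can all be read back out of the key itself.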
As in the autosomal model, the order of ‘genes’ or ‘segments’ didn’t matter. This made for a simpler model at the expense of being a bit less like reality. One result of the previous, autosomal, model was that increasing the number of segments available to pass from parent to child decreased the variability of shared DNA between relatives. This was expected because, for example, it would be nearly impossible for two siblings to inherit all or none of the same DNA if there were hundreds of segments available; however, if one of only two segments were passed from a parent to a child, siblings would always share either all or none of their DNA. So decreasing the number of segments available increases the variability of shared DNA between relatives. The number of segments needed in the autosomal model ended up being about 97.5, on average. The reason it wasn’t around 55, as in real life, was that not requiring the ‘genes’ or ‘segments’ to stay in the same order decreased the variance, requiring more segments to be used in order to increase the variance to match real data.
Finding data for the autosomal model was a dismal affair, but the availability of X Chromosome data is even worse. Something would be needed to tune the simulation, otherwise only the average percentage of shared DNA would be valid. Much more interesting would be to know how much the percentage can vary. The only data I could find was an average number of recombinations and an average percentage of maternal meiosis events in which no recombination took place on the X Chromosome. Since my simulation didn’t preserve gene order, the average number of recombinations wouldn’t help. The only data point available to tune the simulation was that about 14% of the time all of the X-DNA that a mother passes to a child is from one of her parents, and none from the other. This was from a study of 250 grandparent-grandchild relationships, so the results should be statistically significant, although they probably could be improved upon quite a bit.
Since the X Chromosome is obviously way smaller than the combined length of all of the other chromosomes, far fewer (than 97.5) segments would have to be used for an X Chromosome model. Getting the simulation to result in no recombinations about 14% of the time required an input number of segments to be about 4.5, on average. This caused a problem. Using that few segments, the percentages of X-DNA that a granddaughter could share with a maternal grandparent were as follows: 0%, 16.67%, 25%, 33.3%, and 50%, with no values in between. With a maternal grandfather, a grandson could share 0%, 33.3%, 50%, 66.7%, or 100%. I wasn’t satisfied with that level of resolution and I thought that it might affect the variance. Something would have to be done to increase the number of segments, which would result in better resolution, but without decreasing the percentage of times that no recombination takes place.
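To see why the resolution is so coarse, it helps to list every share that's possible when only a handful of segments are passed. The sketch below assumes that a child receives either 2 or 3 segments from the mother in a given run. That assumption is inferred from the listed percentages themselves, not from the model's code, and the function name is hypothetical.

```python
from fractions import Fraction


def possible_shares(segment_counts, homologues=1):
    """All shares k/n that can occur when n segments come from the mother.

    homologues=2 expresses the share as a fraction of a female's
    two X homologues rather than one.
    """
    shares = set()
    for n in segment_counts:
        for k in range(n + 1):
            shares.add(Fraction(k, n * homologues))
    return sorted(shares)


grandson = possible_shares([2, 3])
granddaughter = possible_shares([2, 3], homologues=2)
print([f"{float(s):.2%}" for s in grandson])
print([f"{float(s):.2%}" for s in granddaughter])
```

Under that assumption the grandson's list is 0, 1/3, 1/2, 2/3, and 1, and the granddaughter's is exactly half of each, which lines up with the values above.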
I decided to do something I’ve long intended to try with my autosomal model: have two ordered sets of genes, allowing crossover, but keep the original order throughout the simulation, i.e. simulated segments or genes would stay at the same loci.
A good approximation for the number of recombinations a chromosome undergoes is the Poisson distribution, so I used that in this model. Real chromosomes are more likely to recombine in certain places; however, that wasn’t taken into account here. I only needed a model input for the average value of the Poisson distribution. In the same study mentioned above, in which 250 grandparent-grandchild relationships were compared, the average number of recombinations on the X Chromosome was 1.655. I don’t know if that study took into account double or higher-order crossover events. If not, the true average number could potentially be larger than 1.655. I also don’t know if many of those grandparent-grandchild relationships were from the same parents and/or same grandparents. I wouldn’t be surprised at all if certain people tend to have more or fewer recombination events. I do know, for example, that for autosomal DNA my mother passed more of her father’s DNA than her mother’s DNA six out of six times. A future study with more grandparent-grandchild relationships or one that doesn’t include the same parents multiple times might find that the average number of recombination events is slightly different than 1.655.
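Here is a minimal sketch of one maternal meiosis under that scheme: draw a number of crossovers from a Poisson distribution, place them at random between ordered segments, and switch between the mother's two homologues at each one. The function names, the 'F'/'M' labels for the mother's father and mother, the uniform placement of cut points, and the hand-written Poisson sampler (Knuth's method, used so that only the standard library is needed) are all assumptions for illustration, not the model's actual code.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def maternal_gamete(n_segments, lam, rng):
    """Return the grandparent of origin ('F' or 'M') for each ordered locus."""
    # No more crossovers than there are gaps between segments.
    n_cuts = min(poisson(lam, rng), n_segments - 1)
    cuts = set(rng.sample(range(1, n_segments), n_cuts))
    source = rng.choice("FM")  # which homologue the gamete starts on
    gamete = []
    for locus in range(n_segments):
        if locus in cuts:  # a crossover sits just before this locus
            source = "M" if source == "F" else "F"
        gamete.append(source)
    return gamete


rng = random.Random(2020)  # seeded only for repeatability
print("".join(maternal_gamete(14, 1.655, rng)))
```

Since the segments never change position, the output reads like a chromosome painted in blocks from each grandparent.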
There are two effects that occur when the Poisson average, λ, is increased:
- Increasing λ decreases the standard deviation of shared DNA between relatives.
- A larger number of available segments should be used, which causes a different set of effects.
Here are the effects of changing the number of available segments:
- Increasing the number of segments available increases the standard deviation of DNA shared between relatives.
- Increasing the number of segments provides for better resolution in the shared DNA between relatives, i.e. fixing the problem above in which a grandchild might only receive 0%, 33.3%, 50%, 66.7%, or 100%, but no values in between, of their grandparent’s DNA.
- Decreasing the number of segments available slightly narrowed the range of expected percentages for a 90% confidence interval. But this is only a good result if accurate, so it isn’t a valid reason to change the input.
- Decreasing the number of segments available risks cutting off some of the higher values of the Poisson distribution, resulting in a lower average, λ, than was inputted.
Combining those effects, if the standard deviation of the simulation is known to be higher than in real-world data, the Poisson average, λ, could be increased, the number of segments could be decreased, or both. However, increasing λ will often mean that the number of segments needs to be increased to make sure that the high end of the distribution isn’t being cut off. And increasing the number of segments will potentially cancel out the effects of increasing λ.
Note that the goal isn’t simply to find a combination of λ and number of segments that results in the right simulation. It would be preferable to use the value of λ that best approximates the average number of recombinations on the X Chromosome in real life, and then to use the number of segments that best approximates the standard deviation of shared DNA between relatives in real life. Hopefully, then, the percentage of events in which no recombination occurred in the simulation would also match the data in real life.
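That last hope can be roughed out before running anything. Under a Poisson model, the chance of zero crossovers in one meiosis is e^(−λ), whatever the number of segments. The sketch below assumes that a meiosis with zero crossovers is the only way for all of the X-DNA passed to come from one grandparent, which is how the ordered model is described here but hasn't been checked against its code. The function name is hypothetical.

```python
import math


def prob_no_recombination(lam):
    """P(X = 0) for a Poisson distribution with mean lam."""
    return math.exp(-lam)


for lam in (1.655, 1.7):
    print(f"lambda = {lam}: {prob_no_recombination(lam):.1%}")
# lambda = 1.655: 19.1%
# lambda = 1.7: 18.3%
```

Those figures can then be set beside the roughly 14% reported in the study.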
To train the model, I started by comparing only the shared X-DNA between two siblings in each trial run. The sex of the siblings wouldn’t matter, since I would only compare the X-DNA that came from maternal grandparents, i.e. that which has potentially been recombined. I have a small dataset of shared percentages of X-DNA between siblings. I admit that it isn’t a good dataset because it contains only 21 sibling pairs. (I found out recently that there doesn’t appear to be a way to see how much maternal X-DNA two full sisters share on GEDmatch because it only shows one homologue of the X Chromosome. It will show a 100% match for the paternal homologue, but won’t list the segments of the maternal homologue. Otherwise, I could add a few more known sibling pairs to my dataset.) Of course I found that the average shared percentage between siblings is about 50%. What’s of interest is that the standard deviation appears to be about 22.25%. I trained the simulation with the goal of reaching a standard deviation between siblings close to that value.
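The training loop can be sketched as follows: give two siblings one maternal gamete each, count the loci at which both came from the same grandparent, and collect the mean and standard deviation over many trials. Everything here is a simplified stand-in (hypothetical function names, crossovers placed uniformly at random, a hand-written Poisson sampler), so its output shouldn't be expected to reproduce the model's numbers.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def maternal_gamete(n_segments, lam, rng):
    """Grandparent of origin ('F' or 'M') at each ordered locus."""
    n_cuts = min(poisson(lam, rng), n_segments - 1)
    cuts = set(rng.sample(range(1, n_segments), n_cuts))
    source = rng.choice("FM")
    gamete = []
    for locus in range(n_segments):
        if locus in cuts:
            source = "M" if source == "F" else "F"
        gamete.append(source)
    return gamete


def sibling_share(n_segments, lam, rng):
    """Fraction of loci at which two siblings got the same grandparent's DNA."""
    a = maternal_gamete(n_segments, lam, rng)
    b = maternal_gamete(n_segments, lam, rng)
    return sum(x == y for x, y in zip(a, b)) / n_segments


def simulate_siblings(n_trials, n_segments, lam, seed=0):
    """Return (mean, standard deviation) of the maternal X-DNA shared."""
    rng = random.Random(seed)
    shares = [sibling_share(n_segments, lam, rng) for _ in range(n_trials)]
    return statistics.mean(shares), statistics.pstdev(shares)


mean, sd = simulate_siblings(20_000, n_segments=12, lam=1.655)
print(f"mean = {mean:.1%}, standard deviation = {sd:.1%}")
```

Changing `n_segments` and `lam` in a loop is then enough to explore how each input moves the standard deviation.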
Using λ = 1.655 and 12 available segments, the standard deviation came out to about 23.6%. It would be preferable, given the data now available, to get the standard deviation a little lower. Trying λ = 1.7 brought the standard deviation down slightly. However, I then had to increase the number of segments, which would, in turn, increase the standard deviation.
In a test of 500 million Poisson distributions with λ = 1.7, the largest random value returned for any distribution was 13. That number represents the places in between segments that could be cut, so I used 14 segments in the simulation. In my simulations of 500 thousand trial runs, if any of the random Poisson values were greater than 13, it would be extremely rare. The number of segments used is like a maximum value allowed during the simulation. If the Poisson distribution returns 13 or fewer, that’s the number of places in which the 14 segments will be cut. I’ve found that no error occurs in the simulation if the Poisson distribution returns a number higher than the number of places available at which to cut (the number of segments minus one). Instead, the simulation simply defaults to the maximum number available rather than the abnormally high Poisson value. (This wasn’t intentional, but it was convenient.) Figure 2 shows how few Poisson values are generated at the high end of the range.
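The rarity of those high values can also be computed directly from the Poisson formula rather than by drawing hundreds of millions of random numbers. The sketch below sums the tail of the distribution and also shows what happens to the average when values above the cap are replaced by the cap. The function names are hypothetical, and the 60-term cutoff for the sums is an arbitrary choice that is far more than needed for λ near 1.7.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_exceeds(cap, lam, terms=60):
    """P(X > cap), summed term by term to avoid subtracting from 1."""
    return sum(poisson_pmf(k, lam) for k in range(cap + 1, cap + 1 + terms))


def capped_mean(cap, lam, terms=60):
    """Average of min(X, cap): the effective lambda once high draws are cut."""
    return sum(min(k, cap) * poisson_pmf(k, lam) for k in range(cap + terms + 1))


lam = 1.7
print(f"P(X > 13) = {prob_exceeds(13, lam):.2e}")
print(f"effective average with 14 segments: {capped_mean(13, lam):.6f}")
print(f"effective average with 3 segments:  {capped_mean(2, lam):.6f}")
```

With 14 segments the effective average is essentially unchanged from the input, while a very small number of segments pulls it well below λ.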
I wasn’t comfortable raising the value of λ any higher than 1.7, so I stuck with that value, which is only rounded up to one decimal place from the source of that statistic (1.655). I used 14 available segments, which likely never resulted in cutting off any abnormally high Poisson values. As mentioned above, my very small dataset of shared DNA between siblings has a standard deviation of 22.25%. The standard deviation for this training model was 23.5%. If a better dataset is produced someday that shows a slightly higher standard deviation, it will be a vindication for this model. If a Poisson distribution with an average greater than 1.7 is someday found to better represent the number of recombinations on the X Chromosome, this model could be slightly adjusted for more accurate results. If another distribution is someday found to better represent X Chromosome recombinations, the simulation could use that instead of the Poisson distribution. Still, I think the results are worth looking at for a model with a standard deviation of 23.5%.
In order to display the results of the simulation, I developed yet another convention for naming ancestors (in addition to the one with underscores in Figure 1). An X Chromosome tree showing ancestors with the new naming convention is shown in Figure 3. ‘M’ stands for maternal and ‘P’ for paternal. Any ancestor with a ‘P’ at the end of his name is a male and any ancestor with an ‘M’ at the end is a female. MPMPM is your mother’s father’s mother’s father’s mother.
When calculating the percentage of shared DNA between an ancestor and a female descendant, I had the choice to divide the amount shared by twice the number of segments, since she has two X homologues, which would show the shared X-DNA as a percentage of all of her X-DNA, or I could divide the shared percentage by the same number of segments as for a male descendant. I chose the latter method. There are a lot of advantages for doing it this way and very few for the former. The biggest advantage is that amount of shared DNA is now comparable to that for male descendants. This may cause it to appear that a woman shares 100% of her X-DNA with her mother, as shown below in Figure 4, but, what it really means is that she shares 100% of the X-DNA that she can share with her mother. She shares all of her X-DNA with her mother on the only homologue on which she can share X-DNA with her mother.
Early runs of the simulation showed common patterns that make it easy to predict how much X-DNA one will share with a particular ancestor. These are the two principles:
- If an ancestor is male, all of his X-DNA came from his mother. Simply keep the same percentage for her that’s already displayed for him.
- If an ancestor is female, take the percentage that’s displayed for her and halve it for both her mother and father.
Those two rules were applied in order to fill in the values in Figure 4, below.
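Those two rules translate directly into a small function that walks an ancestor's path, such as 'MPMPM', one letter at a time. The function name is hypothetical, and the percentages follow the per-homologue convention described above, so a daughter shows 100% with each parent.

```python
def expected_share(path, descendant_is_female=True):
    """Expected fraction of X-DNA shared with the ancestor at `path`.

    Each letter is 'M' (mother) or 'P' (father) of the person before it,
    so 'MP' is the mother's father.
    """
    if path[0] == "P" and not descendant_is_female:
        return 0.0  # a son gets no X-DNA from his father
    share = 1.0  # shared with a parent, per homologue
    for child, parent in zip(path, path[1:]):
        if child == "P":
            if parent == "P":
                return 0.0  # a male's X-DNA never comes from his father
            # a male's X-DNA all came from his mother: keep the percentage
        else:
            share /= 2  # a female's share is halved for each of her parents
    return share


for path in ("MP", "MM", "PM", "MPMP", "MPMM", "MPMPM", "PMPMP", "PP"):
    print(path, f"{expected_share(path):.1%}")
```

Running it over every path in a generation fills in a tree of expected values, with 0% for any line that passes through two consecutive males.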
An interesting result of Figure 4 is that the sex of the farthest-back ancestor doesn’t matter, nor does the sex of the descendant being compared to a particular ancestor. For example, a grandson will inherit 50%, on average, of his X-chromosome from any maternal grandparent. It doesn’t matter if it’s a maternal grandfather or a maternal grandmother. And it doesn’t matter if it’s a grandson or granddaughter, which allowed for the convenience of only putting one percentage in each box. As mentioned earlier, it’s well known that the greatest amount of X Chromosome DNA is passed down from alternating male-female lines. But one result of this model is that the gender in the first and last generations doesn’t matter, which is something I had never heard before. So, while one might expect that the greatest shared percentage of X-DNA over five generations might be between a female descendant and her paternal 3g-grandfather, PMPMP, it turns out that she could expect to share the same amount of X-DNA with her maternal 3g-grandmother, MPMPM. And that’s the same amount that her brother could expect to share with that same ancestor. Still, excluding the first and last generation, the conventional wisdom holds true. While a person shares, on average, 25% of X-DNA with their ancestors MPMP and MPMM (it’s the ‘MPM’ that matters), following the alternating line MPMP farther back in time will find ancestors with greater shared X-DNA than on the MPMM line. Figure 4 showed the percentages I expected to find in the results of the simulation. Let’s see if those are accurate.
Figure 5 shows model results with the same percentages that are found in the tree in Figure 4.
With the exception of one value of 24.9% that’s awfully close to 25.0%, and which would likely be 25.0% in a simulation with more trials, Figure 6 shows model results that are identical to those predicted in Figure 4 based on patterns observed from early trial runs.
Again, these values are nearly identical to those predicted in Figure 4.
I will likely use this model to make a calculator that predicts the amount of DNA a person can share with various relatives. Additionally, I could make one for combinations of various relatives just like in the autosomal calculator. The X Chromosome model could even be coupled with the autosomal model at some point. But first, I would rather modify the autosomal model to take into account differences in recombination frequencies between mothers and fathers. If only people with DNA databases would share simple statistics such as standard deviation, including some of the direct-to-consumer companies with tens of millions of kits in their databases.
Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. That model was also used to make a very accurate relationship prediction tool. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. The cover photograph for this article shows the greatest likely line of X Chromosome inheritance for my family.
Originally published at http://www.dna-sci.com.