Testing for Linkage Disequilibrium

You have more options than you think.

Will Matlock
The Startup
11 min read · Jun 13, 2020


Linkage disequilibrium (LD) plot of the CD44 gene in the GIH population in HapMap. Image courtesy of Tulsyan et al. (2013) under CC BY 4.0.

Early beginnings

The term linkage disequilibrium (LD) was first coined way back in 1960 by Lewontin and Kojima, but really picked up steam in the 1980s. Not only had large-scale computations become more feasible, but people had begun to realise its importance in human genetics.

As far as genetic parameters go, LD is quite simple. We take two alleles, one at each of two loci, and measure their association: how likely is it that alleles a and b appear together? We say the two alleles are in linkage equilibrium if the gametic frequency (a and b appearing together) equals the product of their individual frequencies. Otherwise, the two alleles are in linkage disequilibrium. Note that this is analogous to the notion of statistical independence. Nice.

At the end of this long and winding article, we will obtain a fully-fledged statistical test for multilocus LD (gasp). But we have a way to go first…

Some mathematical foundations

Notation for LD varies greatly, but I’ve settled for the following convention. We will denote the loci by uppercase letters and the possible alleles at each locus by lowercase letters with subscripts. For example, suppose we have two dimorphic genes at loci A and B with corresponding alleles a₁ and a₂, b₁ and b₂, respectively. This admits four possible combinations: a₁b₁, a₁b₂, a₂b₁, and a₂b₂.

Now, we can get started with some probabilities. We let

P(aᵢ) = pᵢ (1)
P(bⱼ) = qⱼ (2)
P(aᵢbⱼ) = gᵢⱼ (3)

This is flexible, as we can extend our notation to alleles c, d, e… at further loci. It’s also worth remembering that the loci need not be on the same chromosome. Now we can define the so-called LD coefficient:

Dᵢⱼ = P(aᵢbⱼ) − P(aᵢ)P(bⱼ) (4)

Short and sweet. So, when D=0, we are in equilibrium. Additionally, since the marginal allele probabilities sum to 1, we can obtain lovely relations such as these:

D₁₁ = −D₁₂ = −D₂₁ = D₂₂ (5)

In other words, for a biallelic system, if a given allele at locus A couples with one allele at locus B, it must repel the other.
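To make this concrete, here is a short Python sketch (the gametic frequencies are invented for illustration) that computes D for every allele pair at two dimorphic loci and checks the coupling/repulsion pattern:

```python
# Gametic frequencies g[i][j] = P(a_i b_j) for two dimorphic loci.
# The numbers are invented for illustration.
g = [[0.45, 0.15],
     [0.05, 0.35]]

# Marginal allele frequencies.
p = [g[0][0] + g[0][1], g[1][0] + g[1][1]]  # P(a1), P(a2)
q = [g[0][0] + g[1][0], g[0][1] + g[1][1]]  # P(b1), P(b2)

# LD coefficient D_ij = P(a_i b_j) - P(a_i) P(b_j), per (4).
D = [[g[i][j] - p[i] * q[j] for j in range(2)] for i in range(2)]

print(D)  # D[0][0] = -D[0][1] = -D[1][0] = D[1][1]
```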

From (4), it is clear that D is constrained in magnitude by the underlying allele frequencies. This means a change in D reflects not only a change in linkage but also a change in the underlying allele frequencies, which disqualifies it as a comparative measure between different pairs of alleles. People have suggested alternative measures to alleviate this: the D′ coefficient, the correlation coefficient, and so on. But they have their own problems too, and I won’t say any more about them.

Hardy and Weinberg

We now rather aggressively change tack. This will, however, serve us in the long run. We start by considering definitions of heterozygosity and homozygosity (hereafter collectively ‘zygosity’) based upon the Hardy-Weinberg principle. A primary use of zygosity is to measure allelic diversity in a population: higher heterozygosity indicates higher genetic variability.

The HW principle states that the marginal allele frequencies will remain constant over generations provided there are no evolutionary influences, such as genetic drift and mutation. We can explore this with a simple example. Consider again our dimorphic gene at locus A with alleles a₁ and a₂. Suppose that in generation k,

P(a₁) = p, P(a₂) = 1 − p (6)

Then in generation k+1,

P(a₁a₁) = p², P(a₁a₂) = 2p(1 − p), P(a₂a₂) = (1 − p)² (7)

which is no more than a binomial expansion. What does this all mean for us? Well, a population obeying the HW principle admits independence between the two haplotypes an individual inherits, which allows us to freely define zygosity through products of marginal allele frequencies.
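A one-generation check of this in Python (the allele frequency is invented):

```python
# Hardy-Weinberg: allele a1 at frequency p yields genotype frequencies
# p^2, 2p(1-p), (1-p)^2 in the next generation (a binomial expansion).
p = 0.3  # illustrative frequency of a1 in generation k

genotypes = {"a1a1": p ** 2,
             "a1a2": 2 * p * (1 - p),
             "a2a2": (1 - p) ** 2}

# Allele frequency of a1 in generation k + 1: homozygotes contribute two
# copies out of two, heterozygotes one out of two.
p_next = genotypes["a1a1"] + 0.5 * genotypes["a1a2"]
print(p_next)  # equals p: the marginal frequency is stable
```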

So let’s do it, and while we’re at it, make it as general as possible. We start with a contingency table of gametic frequencies for polymorphic genes at loci A and B:

       b₁    b₂    …    bₘ    Σ
a₁    g₁₁   g₁₂   …   g₁ₘ   p₁
a₂    g₂₁   g₂₂   …   g₂ₘ   p₂
⋮
aₙ    gₙ₁   gₙ₂   …   gₙₘ   pₙ
Σ     q₁    q₂    …    qₘ    1
(8)

Using this (slightly unwieldy but convenient) notation, we can define homozygosity by

F(A) = Σᵢ pᵢ² (9)

and heterozygosity as

h(A) = 1 − Σᵢ pᵢ² = 1 − F(A) (10)

Lastly, we define the haplotype homozygosity by

F(A, B) = Σᵢ Σⱼ gᵢⱼ² (11)

This is the probability of selecting two identical haplotypes in a random sample from the population. Again, note that the use of ‘haplotype’ in this definition is for brevity: the two loci need not lie on the same chromosome.
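These zygosity definitions translate directly into code; a minimal sketch with invented frequencies:

```python
# Zygosity measures as sums of squared frequencies.
def homozygosity(freqs):
    """Probability that two random draws carry the same allele."""
    return sum(f * f for f in freqs)

def heterozygosity(freqs):
    return 1.0 - homozygosity(freqs)

def haplotype_homozygosity(g):
    """g[i][j] = P(a_i b_j): probability of two identical haplotypes."""
    return sum(gij * gij for row in g for gij in row)

p = [0.6, 0.4]                    # allele frequencies at locus A
g = [[0.45, 0.15], [0.05, 0.35]]  # gametic frequencies over loci A and B
print(homozygosity(p), heterozygosity(p), haplotype_homozygosity(g))
```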

Chromosome-crossed lovers

Now we have some mathematical underpinnings, we can explore analytical relationships between LD and zygosity. This shouldn’t be too tricky, as both are defined by the same parameters.

We start with some wonderful reasoning from Sabatti and Risch (2002). They note that homozygosity at a given locus is higher when fewer alleles dominate, that is, when the contingency table of gametic frequencies contains few nonzero entries. This occurs at high magnitudes of linkage disequilibrium, and has been shown experimentally in humans. We shall stick with this reasoning and get some more maths in place.

Let’s assume we have linkage equilibrium, that is, D=0. Then

F(A, B) = Σᵢ Σⱼ (pᵢqⱼ)² = (Σᵢ pᵢ²)(Σⱼ qⱼ²) = F(A)F(B) (12)

So it seems linkage equilibrium induces a multiplicative property of homozygosity. This result may serve to reduce computation time in modelling, as we can now treat loci independently. However, dynamics such as these are rarely so well behaved. In fact, the multiplicative property holds for more than just D=0.
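Numerically, with an equilibrium table built as gᵢⱼ = pᵢqⱼ (frequencies invented):

```python
# With D = 0 the gametic table factorises, g_ij = p_i * q_j, so the
# haplotype homozygosity equals the product of single-locus homozygosities.
p = [0.6, 0.4]
q = [0.7, 0.3]

g = [[pi * qj for qj in q] for pi in p]  # equilibrium table

hap_hom = sum(gij ** 2 for row in g for gij in row)
prod = sum(pi ** 2 for pi in p) * sum(qj ** 2 for qj in q)
print(hap_hom, prod)  # equal
```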

Let’s take our good old dimorphic genes at loci A and B again. Without loss of generality (I refuse to ever use ‘WLOG’, it looks horrible), let

P(a₁) = P(b₁) = q (13)

Accordingly, a 2x2 contingency table can be written as

       b₁              b₂
a₁    q² + D          q(1 − q) − D
a₂    q(1 − q) − D    (1 − q)² + D
(14)

for which we calculate homozygosity as

F(A, B) = (q² + D)² + 2(q(1 − q) − D)² + ((1 − q)² + D)² (15)

This sneaky parametrisation in terms of D reveals that both

D = 0 and D = −(1 − 2q)² / 2 (16)

give us the multiplicative property for homozygosity. For the quadratic solution, these two roots are clearly distinct provided q≠0.5. This hints at a more complex relationship between homozygosity and LD.
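Assuming the equal-marginal parametrisation above, with table entries q² + D, q(1 − q) − D (twice), and (1 − q)² + D, a quick numerical check that both roots recover the multiplicative property:

```python
# Equal marginals P(a1) = P(b1) = q; the 2x2 gametic table is
# parametrised by the LD coefficient D.
q = 0.3

def hap_hom(q, D):
    cells = [q * q + D, q * (1 - q) - D, q * (1 - q) - D, (1 - q) ** 2 + D]
    return sum(c * c for c in cells)

# Product of the two (equal) single-locus homozygosities.
target = (q ** 2 + (1 - q) ** 2) ** 2

D_root = -(1 - 2 * q) ** 2 / 2  # the nonzero root of the quadratic

print(hap_hom(q, 0.0) - target, hap_hom(q, D_root) - target)  # both ~0
```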

A further tangent

The following, again, may seem unrelated. But we are still building to a final, elegant result. Now we are going to find an expression for the covariance of heterozygosity for an arbitrary number of polymorphic loci. Why? Well it’s actually a useful expression to have: it provides a single measure of association which isn’t sensitive to sample size. We will also be using it for our statistical test. The following derivations are pretty old too, but I think they’re really beautiful applied statistics.

But let’s start small and build up from two polymorphic loci A and B. Let the random variable H (imagine it’s script) denote the number of heterozygous loci, out of A and B, in a randomly drawn individual. We will initially work under the assumption that the loci are in linkage equilibrium and hence assort independently. We then observe the distribution

H = I_A + I_B, with I_A ~ Bernoulli(h(A)) and I_B ~ Bernoulli(h(B)) (17)

the sum of two independent Bernoulli random variables. Hence H takes values in {0, 1, 2}. We now want to calculate the expectation and variance of H. To do this we exploit the moment generating function (MGF). I will leave its derivation to you, but it is simply the product of two Bernoulli MGFs:

M_H(t) = (1 − h(A) + h(A)eᵗ)(1 − h(B) + h(B)eᵗ) (18)

The first derivative at zero yields the expectation

E[H] = h(A) + h(B) (19)

This along with the second derivative at zero yields the variance

Var[H] = h(A)(1 − h(A)) + h(B)(1 − h(B)) (20)

Pretty simple, so let’s generalise. Consider the random variable G (again, script) over an arbitrary set of loci L = {l₁, l₂, …, lN} (subscript N) under the same independence conditions as H. It takes values in {0, 1, …, N} and has expectation

E[G] = Σᵢ h(lᵢ) (21)

and variance

Var[G] = Σᵢ h(lᵢ)(1 − h(lᵢ)) (22)
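These closed forms are easy to verify by brute force; a sketch with invented heterozygosities:

```python
import itertools

# G is a sum of independent Bernoulli(h_l) indicators over N loci, so
# E[G] should be the sum of the h_l and Var[G] the sum of h_l(1 - h_l).
h = [0.48, 0.42, 0.30]  # illustrative per-locus heterozygosities

# Exact distribution of G by enumerating all het/hom patterns.
mean = 0.0
second = 0.0
for pattern in itertools.product([0, 1], repeat=len(h)):
    prob = 1.0
    for hl, x in zip(h, pattern):
        prob *= hl if x else 1 - hl
    k = sum(pattern)  # het count for this pattern
    mean += k * prob
    second += k * k * prob
var = second - mean ** 2

print(mean, var)  # matches sum(h) and sum(h_l * (1 - h_l))
```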

Great. Now we must relax our earlier assumption: we no longer assume linkage equilibrium. For clarity, I will calculate the expectation and variance from first principles rather than via MGFs. This yields messier algebra, so it’s best to start with H before moving on to G.

Onwards. We begin with the probability distribution for H:

P(H = 0) = P(homozygous at both A and B) (23)
P(H = 1) = P(heterozygous at exactly one of A and B) (24)
P(H = 2) = P(heterozygous at both A and B) (25)

From here, it’s merely an algebraic slog to obtain

E[H] = P(H = 1) + 2P(H = 2) (26)
E[H] = h(A) + h(B) (27)

as before. But now, more interestingly,

(28)

Yuck! But look: since we aren’t assuming linkage equilibrium, a cross-term has appeared in the form of a covariance. Let’s compare terms and find a nicer expression for our desired quantity:

(29)

What an elegant relationship between LD and heterozygosity. Probably worth the effort.
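We can see this covariance appear numerically. In the dimorphic case, under random union of gametes, an individual is two haplotypes drawn independently from the gametic table; exact enumeration over all pairs then gives the covariance of the two heterozygosity indicators (the table below is invented, with D = 0.15):

```python
# Two dimorphic loci. Heterozygosity indicators at A and B pick up a
# covariance whenever D != 0. Frequencies invented: marginals are
# p = 0.6, q = 0.5, so D = 0.45 - 0.6 * 0.5 = 0.15.
g = {('a1', 'b1'): 0.45, ('a1', 'b2'): 0.15,
     ('a2', 'b1'): 0.05, ('a2', 'b2'): 0.35}

e_a = e_b = e_ab = 0.0
for (x1, y1), f1 in g.items():
    for (x2, y2), f2 in g.items():
        w = f1 * f2       # probability of this pair of haplotypes
        het_a = x1 != x2  # heterozygous at locus A?
        het_b = y1 != y2  # heterozygous at locus B?
        e_a += w * het_a
        e_b += w * het_b
        e_ab += w * (het_a and het_b)

cov = e_ab - e_a * e_b
print(e_a, e_b, cov)  # the covariance is nonzero under LD
```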

We’re very close to forming a statistical test for LD now; we just need one final expression. I’ll skip the proof on this one, but here is the variance of G:

(30)

The first term is the sum of single-locus variances, and the second term’s nested summations collect the covariances over all unique pairs of loci and all alleles. And that’s it, no more algebra. All that remains is devising our test.

Testing for multilocus LD

Assessing the background LD of a given genomic region is crucial for establishing the significance of other LD tests. We will construct two methods for determining multilocus LD: a statistic and a confidence limit.

Let us assume we have collected allele frequency data from a population. We can then calculate the sample variance without the assumption of locus independence as in (30). Additionally, we calculate the sample variance with this assumption as in (22). Using the following quantities:

(31)
(32)
(33)

we define our test statistic by

(34)

Taking the quotient reduces the influence of the marginal allele frequencies. If D=0, we are in equilibrium and X(G)=0; thus we have a measure of deviation from linkage equilibrium, with a greater X(G) indicating a greater magnitude of LD in the sample. Cool.
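The two variances behind the statistic are easy to watch diverge. The sketch below (all frequencies and the simulation itself are illustrative, not real data) draws individuals from a gametic table with D ≠ 0 and compares the sample variance of heterozygous-locus counts with the value expected under independence, as in (22):

```python
import random

random.seed(1)

# Two dimorphic loci in linkage disequilibrium: marginals p = 0.6 and
# q = 0.5, with D = 0.15. All numbers invented.
haplos = [('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')]
freqs = [0.45, 0.15, 0.05, 0.35]

def het_count():
    """Het count of one individual formed by random union of gametes."""
    h1, h2 = random.choices(haplos, weights=freqs, k=2)
    return (h1[0] != h2[0]) + (h1[1] != h2[1])

counts = [het_count() for _ in range(5000)]
n = len(counts)
mean = sum(counts) / n
s2_obs = sum((c - mean) ** 2 for c in counts) / (n - 1)

# Variance expected under locus independence, from the marginal
# heterozygosities 2p(1-p) and 2q(1-q).
h_a = 2 * 0.6 * 0.4
h_b = 2 * 0.5 * 0.5
s2_exp = h_a * (1 - h_a) + h_b * (1 - h_b)

print(s2_obs, s2_exp)  # observed variance exceeds the equilibrium value
```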

But we can take this further! Consider the following hypotheses

H₀ : D = 0 (35)
H₁ : D ≠ 0 (36)

So under the null hypothesis, we are assuming D=0. The upper 95% confidence limit for our sample variance (31) is then given by

L = s² + 2√(Var(s²)) (37)

where

Var(s²) = (1/n)(μ₄ − ((n − 3)/(n − 1))μ₂²) (38)

and

μᵢ = E[(X − E[X])ⁱ] (39)

notates the i-th central moment. This is the variance of the sample variance. Pretty niche. A full derivation of this result is available in Cho and Cho (2006), but I’ll add some intuition: we have assumed that the sampling distribution is approximately Gaussian in order to apply the so-called 68–95–99.7 rule, under which we expect 95% of random draws to fall within 2 standard deviations of the mean. Note that we can increase our confidence to 99.7% by replacing the coefficient of the square root in (37) with a 3.
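A standard central-moment expression for the variance of a sample variance is Var(s²) = (μ₄ − ((n − 3)/(n − 1))μ₂²)/n; assuming that form, here’s a sketch of the limit on an invented toy sample:

```python
# Variance of the sample variance from central moments, then an upper
# limit two standard deviations above, per the 68-95-99.7 rule.
# The data are an invented toy sample of het counts.
data = [0, 1, 1, 2, 0, 2, 1, 1, 0, 2]
n = len(data)
mean = sum(data) / n

mu2 = sum((x - mean) ** 2 for x in data) / n  # 2nd central moment
mu4 = sum((x - mean) ** 4 for x in data) / n  # 4th central moment

var_s2 = (mu4 - (n - 3) / (n - 1) * mu2 ** 2) / n

upper = mu2 + 2 * var_s2 ** 0.5  # upper 95% limit in the spirit of (37)
print(mu2, var_s2, upper)
```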

An example

Let’s exemplify our test with everyone’s favourite prokaryote: E. coli. The data used for this example can be found in Whittam et al. (1982). Specifically, we will examine four dimorphic genes at loci A, B, C and D in a particular strain from a sample of size 1,705. We want to test whether the loci are in four-way linkage disequilibrium. The (3-dimensional) contingency table is given by

(40)

Feel free to mash these numbers through the meat-grinder, but I’ve done the work for you:

(41)

Remember how D was symmetric? Well now it means these are the only values we need to calculate for our test. We see that

(42)

meaning we reject H₀ and accept H₁ at 95% confidence. Hence, we believe that our loci assort dependently. Additionally, X(G) = 0.82 is quite close to 1, giving further evidence for four-way linkage disequilibrium. Indeed, this agrees with Whittam et al.’s findings, but they relied on six pairwise chi-squared tests, which brings us to…

Why not chi?

There already exist tried-and-tested methods for examining independence in data: chi-squared, Fisher’s exact, and so on. So why not use one of these instead of our (more complicated) method? Well, firstly, our test works for an arbitrary number of loci, meaning only a single test need be carried out per sample. Conversely, the chi-squared test must be computed pairwise over all loci.

Additionally, we are employing the second and fourth central moments in our confidence limit L. This higher-order behaviour may capture more complex associations in LD than the classical approaches.

Finally, statistics such as X(G) are cheaper to compute than their classical counterparts, so they can serve as a ‘litmus test’ before a more intensive statistical analysis.

Notes

As is well known, incorporating LaTeX into Medium articles is a pain, and this was my first attempt. I’m still unsure how I feel about using captions to number equations, but it works. I have to thank Joe (I’m guessing this is your name from your GitHub) for their breezy LaTeX to PNG rendering website.
