Orthogonal Clustering for Genotype Scoring (Part II)

Krishna Yerramsetty

5 min read · May 23, 2016

This is part II of the post titled Orthogonal Clustering for Genotype Scoring. Part I is here.

In this post, I will talk about orthogonal clustering and how it can be used to automatically score fluorescence values into genotypes.

[Figure: fluorescence values for the two dyes, with clusters stretching along lines through the origin]

One thing that is evident from the plot of fluorescence values above is that k-means clustering would fail here, because k-means expects spherical clusters. An example data set is shown below: the plot on the left shows the correct clustering, and the plot on the right shows the incorrect clustering produced by k-means.

[Figure: example data set. Left: the correct clustering; right: the incorrect clustering produced by k-means]
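To see this failure mode concretely, here is a small synthetic sketch (my own illustration with made-up slopes and noise levels, not the post's data, and assuming scikit-learn is available). Two clusters lie along lines through the origin; because k-means partitions by distance to centroids, it groups points by distance from the origin rather than by line.

```python
# Hedged illustration: k-means splits elongated, line-shaped clusters
# radially instead of following the lines. Synthetic data, not the post's.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, 150)
arm1 = np.column_stack([x, 0.4 * x + rng.normal(0, 0.05, 150)])  # slope 0.4
arm2 = np.column_stack([x, 0.6 * x + rng.normal(0, 0.05, 150)])  # slope 0.6
data = np.vstack([arm1, arm2])
true = np.repeat([0, 1], 150)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
# Each k-means cluster mixes points from both lines, since cutting the data
# radially yields a lower sum of squared distances than separating the arms.
for g in (0, 1):
    print(f"k-means cluster {g}: arm counts =",
          np.bincount(true[pred == g], minlength=2))
```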

The above result is a convincing argument against using k-means clustering, where the idea is to find the cluster centers that minimize the sum of squared distances between the data points in a cluster and that cluster's center. Instead, we can exploit the fact that the clusters align along straight lines that pass through the origin. A natural distance measure for clustering is therefore the orthogonal distance between the data points and these cluster "lines". We can then formulate the problem as one of finding the straight lines that minimize the orthogonal distances between the data points in a cluster and that cluster's line. In essence, we have replaced finding cluster centers, as in k-means, with finding cluster "lines". Also, note that these cluster lines are nothing but the first principal components of each cluster.
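As a concrete illustration of this distance measure, here is a minimal sketch (mine, not from the post; the `orthogonal_distances` helper is a hypothetical name) that computes the orthogonal distance from 2-D points to a line through the origin.

```python
# Illustrative sketch, not the post's code: orthogonal distance from each
# point to a cluster "line" through the origin.
import numpy as np

def orthogonal_distances(points, direction):
    """Distance from each 2-D point to the line through the origin
    spanned by `direction`."""
    u = direction / np.linalg.norm(direction)  # unit vector along the line
    proj = points @ u                          # signed projection length
    fitted = np.outer(proj, u)                 # foot of the perpendicular
    return np.linalg.norm(points - fitted, axis=1)

pts = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2]])
# Small residuals, since these points lie close to the line y = 2x:
print(orthogonal_distances(pts, np.array([1.0, 2.0])))
```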

You could ask: why not use simple linear regression instead of the more complex orthogonal regression? The answer is that for simple linear regression (which gives you the ordinary least squares solution), we have to choose which of the two dyes is the dependent variable and which is the independent variable. Moreover, the implicit assumption of simple linear regression is that the independent variable has no uncertainty or error associated with it, which is not true here because both dye measurements have errors associated with them (Equation 1 below). This leads to something called regression dilution, where the calculated slopes of the regression lines are smaller than the true slopes. Total least squares solutions, of which orthogonal regression is a special case, instead account for the errors in both the dependent and independent variables (Equation 2 below), and give us an unbiased estimate of the slopes of the cluster lines.

$latex Y + \epsilon_Y = \beta X$ (1)

$latex Y + \epsilon_Y = \beta (X + \epsilon_X)$ (2)

Equation 2 is generally solved using singular value decomposition (SVD). SVD is nothing but eigenvalue decomposition extended to any (not necessarily square) matrix. For a good explanation of eigenvectors and eigenvalues, refer to this video by Khan Academy. For a more detailed discussion of SVD, check this article.
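To make the regression-dilution point concrete, here is a hedged sketch (synthetic data and made-up noise levels, my own illustration) comparing the OLS slope through the origin with the total-least-squares slope obtained from the SVD. With noise in both variables, OLS is attenuated while the SVD-based estimate stays close to the true slope.

```python
# Illustration of regression dilution vs. SVD-based total least squares,
# on synthetic data (not the post's). Both variables carry noise.
import numpy as np

rng = np.random.default_rng(42)
n, true_slope = 2000, 2.0
x_true = rng.uniform(0, 4, n)
x = x_true + rng.normal(0, 1.0, n)               # error in "independent" dye
y = true_slope * x_true + rng.normal(0, 1.0, n)  # error in "dependent" dye

# OLS through the origin: slope = sum(x*y) / sum(x^2); biased toward zero.
ols_slope = (x @ y) / (x @ x)

# TLS through the origin: the first right-singular vector of [x y] gives
# the direction of most variance; its slope is the orthogonal-regression fit.
vh = np.linalg.svd(np.column_stack([x, y]), full_matrices=False)[2]
tls_slope = vh[0, 1] / vh[0, 0]

print(f"true: {true_slope}, OLS: {ols_slope:.2f}, TLS: {tls_slope:.2f}")
# Expect OLS noticeably below 2 (regression dilution); TLS close to 2.
```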

Another key concept involved in solving this problem is the mixture model. Since we have multiple clusters, a mixture model framework assigns to each data point a probability of belonging to each cluster:

$latex z_{i,g} = \frac{\pi_g f_g}{\sum_{h=1}^{G} \pi_h f_h}&s=3$ (3)

where $latex z_{i,g}$ represents the probability of sample i belonging to cluster g, $latex \pi_g$ represents the proportion of samples belonging to cluster g, and $latex f_g$ represents the likelihood of sample i under cluster g. The likelihood is calculated from a bi-variate (X and Y components) normal distribution: the means of this bi-variate distribution are the projections of each data point in that cluster onto the orthogonal cluster line, and the variance is the variance of the orthogonal distances for that cluster.
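A small sketch of Equation 3 (my own illustration; the `responsibilities` helper is a hypothetical name): given per-cluster likelihoods and mixing proportions, it returns the probability that each sample belongs to each cluster.

```python
# Illustration of Equation 3: z[i, g] is the posterior probability that
# sample i belongs to cluster g. Not the post's code.
import numpy as np

def responsibilities(likelihoods, proportions):
    """likelihoods: (n, G) array of f_g evaluated at each sample;
    proportions: (G,) array of pi_g. Returns the (n, G) array z."""
    weighted = likelihoods * proportions              # pi_g * f_g per sample
    return weighted / weighted.sum(axis=1, keepdims=True)

f = np.array([[0.40, 0.10], [0.05, 0.30]])  # two samples, two clusters
pi = np.array([0.6, 0.4])
print(responsibilities(f, pi))  # each row sums to 1
```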

Now that we know the rationale for using orthogonal regression, let me talk about the algorithm for performing this. Orthogonal regression has been used for clustering SNP data by at least two other groups: Guohua Yan, and Bargary et al. The current work builds on these two works with several modifications. In this blog, however, I will only talk about the algorithm based on the work done by Bargary et al. The basic steps of this algorithm are based on the popular Expectation-Maximization (E-M) algorithm and mixture models, and are given below (a code sketch implementing these steps follows the list):

  1. Let the number of samples be n.
  2. Choose the number of clusters G.
  3. Initialize the probabilities $latex z_{i,g}$ of each data point belonging to each cluster (these are also the quantities the algorithm ultimately outputs).
  4. Assign each data point to one and only one cluster.
  5. M-Step: For each cluster g,
  • Use SVD to find the direction βg of most variance in that cluster. This will be the orthogonal line for cluster g.
  • Calculate the projection of each data point (x_i, y_i) onto orthogonal line g. Call this the fitted value of that data point, (x_{i,g}, y_{i,g}).
  • Calculate the variance σg of the orthogonal distances between each data point and the orthogonal line g:
  • $latex \sigma_g^2 = \frac{\sum_{i=1}^n z_{i,g}(x_i - x_{i,g})^2 + \sum_{i=1}^n z_{i,g}(y_i - y_{i,g})^2}{2 \sum_{i=1}^n z_{i,g}}&s=2$
  • Also, calculate the proportion of data points in each cluster:
  • $latex \pi_g = \frac{\sum_{i=1}^n z_{i,g}}{n}&s=2$
  6. E-Step: For each cluster g,
  • Calculate the bi-variate normal likelihood of each data point, using the fitted values as means and $latex \sigma_g^2$ as the variance of each component:
  • $latex f_g(x_i,y_i\mid \mu_g,\sigma_g^2) = f(x_i\mid x_{i,g},\sigma_g^2) \, f(y_i\mid y_{i,g},\sigma_g^2)&s=2$
  • Update the probabilities of each data point belonging to each cluster:
  • $latex z_{i,g} = \frac{\pi_g f_g(x_i,y_i\mid \mu_g,\sigma_g^2)}{\sum_{h=1}^G \pi_h f_h(x_i,y_i\mid \mu_h,\sigma_h^2)}&s=2$

  7. Repeat steps 5 and 6 until a convergence criterion is met.
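Here is a condensed, runnable sketch of the steps above. It is my own illustration under the post's assumptions (2-D data, cluster lines through the origin, and the variance and likelihood formulas quoted above); `orthogonal_em` is a hypothetical name, not the author's published code.

```python
# Hedged sketch of the orthogonal-regression E-M loop described in the post.
import numpy as np

def norm_pdf(v, mean, var):
    # Univariate normal density, used for the X and Y components separately.
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def orthogonal_em(points, n_clusters, n_iter=100, tol=1e-8, seed=0):
    n = len(points)
    rng = np.random.default_rng(seed)
    # Steps 3-4: start from a random hard assignment (one cluster per point).
    z = np.eye(n_clusters)[rng.integers(n_clusters, size=n)]
    prev_loglik = None
    for _ in range(n_iter):
        weighted_f = np.empty((n, n_clusters))
        directions = np.empty((n_clusters, 2))
        for g in range(n_clusters):
            w = z[:, g]
            # M-step: the top eigenvector of the z-weighted second-moment
            # matrix is the direction of most variance (the cluster line),
            # equivalent to the SVD fit for a line through the origin.
            M = (points * w[:, None]).T @ points
            directions[g] = np.linalg.eigh(M)[1][:, -1]
            proj = points @ directions[g]
            fitted = np.outer(proj, directions[g])       # (x_{i,g}, y_{i,g})
            resid_sq = ((points - fitted) ** 2).sum(axis=1)
            var_g = (w @ resid_sq) / (2 * w.sum())       # sigma_g^2
            pi_g = w.sum() / n                           # pi_g
            # E-step likelihood: product of the two univariate components.
            weighted_f[:, g] = pi_g * (
                norm_pdf(points[:, 0], fitted[:, 0], var_g)
                * norm_pdf(points[:, 1], fitted[:, 1], var_g))
        z = weighted_f / weighted_f.sum(axis=1, keepdims=True)  # z_{i,g}
        loglik = np.log(weighted_f.sum(axis=1)).sum()
        # Step 7: stop when the log-likelihood stops improving.
        if prev_loglik is not None and abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return z, directions

# Toy usage: two noisy lines through the origin with slopes 0.3 and 3.
rng = np.random.default_rng(1)
r = rng.uniform(0.5, 3.0, 300)
pts = np.vstack([np.column_stack([r[:150], 0.3 * r[:150]]),
                 np.column_stack([r[150:], 3.0 * r[150:]])])
pts += rng.normal(0, 0.05, pts.shape)
z, dirs = orthogonal_em(pts, n_clusters=2)
print(dirs[:, 1] / dirs[:, 0])  # recovered slopes, near 0.3 and 3
```

For this well-separated toy data the random initialization suffices; in practice a careful initialization (or several restarts) guards against poor local optima of the E-M objective.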

An example of the algorithm working on sample data is shown below:

[Figure: clusters identified by the orthogonal clustering algorithm on sample data]

The full code for the algorithm can be found here.

Additional modifications were made to the basic code described above to automatically select the best number of clusters and to detect outliers.

Originally published at Krishna Yerramsetty.
