Orthogonal Clustering for Genotype Scoring (Part I)

Krishna Yerramsetty
May 23, 2016

One of the tools that has accelerated plant and animal breeding over the last decade is Marker Assisted Selection (MAS). The idea behind MAS is to quantify the relationship between biological markers (genes, proteins, metabolites) and a phenotype (trait) of interest, such as plant height, and then use that relationship to predict and indirectly select for the phenotype. Since genes are considered the unit of inheritance, they make good biomarkers for MAS. Genes generally exist in several forms, called alleles. For example, in a simplified picture, the gene for human eye color exists in four allelic states: blue, brown, green, and grey. For more background, watch this explanation from Khan Academy: https://www.khanacademy.org/science/biology/classical-genetics/molecular-basis-of-genetics-tutorial/v/alleles-and-genes

Each copy of a chromosome in an organism carries one allele of each gene on it, and the alleles may or may not be the same across copies. The particular combination of alleles for each gene in an organism is called its genotype. This genotype information can then be used to predict the phenotype with statistical methods such as least squares, or with Bayesian methods such as BayesC and BayesD. The first step in MAS is therefore to identify the genotypes of a given set of plants or animals. One way to identify the genotypes is to measure the single nucleotide polymorphisms (SNPs) across the genome of each individual. Read this review article to learn more about SNPs and molecular markers: http://www.ncbi.nlm.nih.gov/pubmed/12081799
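
To make the genotype-to-phenotype step concrete, here is a minimal sketch of a least squares fit for a single SNP. It assumes an additive coding in which each genotype is replaced by the number of copies of one allele (0, 1, or 2). The function names, the choice of counted allele, and the data values are all made up for illustration; a real MAS model would use many markers at once.

```python
from statistics import mean

def allele_dosage(genotype, counted_allele="A"):
    """Count copies of `counted_allele` in a diploid genotype string such as 'AG'."""
    return genotype.count(counted_allele)

def fit_least_squares(x, y):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data: genotypes at one SNP and plant heights in cm.
genotypes = ["GG", "AG", "AA", "GG", "AG", "AA"]
heights = [100.0, 110.0, 120.0, 102.0, 108.0, 122.0]

dosages = [allele_dosage(g) for g in genotypes]
intercept, slope = fit_least_squares(dosages, heights)

def predict_height(genotype):
    """Predict the phenotype of a new individual from its genotype at this SNP."""
    return intercept + slope * allele_dosage(genotype)

print(f"each extra copy of A adds about {slope:.1f} cm")
```

The slope is the estimated effect of one extra copy of the counted allele, which is the quantity a breeder would use to rank individuals by genotype alone.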

In short, a SNP is a variation at a single position in a DNA region. DNA is made up of four bases (A, C, G, and T), and if more than 1% of the individuals in a population carry a different base at a particular position than the rest of the population, the population is said to have a SNP at that position. A SNP can lie inside or outside a genic region, and if it's inside a gene, it could potentially change the protein encoded by that gene. The allelic state of that gene can then be described by the SNP(s) present in the genic region. The genotype of an individual can therefore be summarized by all, or a subset, of the SNPs across the genome.
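
The 1% rule can be sketched in a few lines. This is a simplified illustration that assumes one observed base per individual at the position of interest; the function name `is_snp` and the `threshold` parameter are hypothetical, not part of any library.

```python
from collections import Counter

def is_snp(bases, threshold=0.01):
    """Return True if more than `threshold` of the observed bases differ from the majority base."""
    counts = Counter(bases)
    if len(counts) < 2:
        return False  # every individual carries the same base
    total = len(bases)
    majority_count = counts.most_common(1)[0][1]
    minor_fraction = (total - majority_count) / total
    return minor_fraction > threshold

# 2 of 100 individuals carry G instead of A: 2% is above the 1% threshold.
print(is_snp(["A"] * 98 + ["G"] * 2))
# 5 of 1000 individuals carry G: 0.5% is below the threshold.
print(is_snp(["A"] * 995 + ["G"] * 5))
```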

A general strategy for measuring SNPs is to hybridize different DNA probes to the SNP region; a dye is then released depending on which version of the probe hybridizes successfully at that region. The dye intensities produced at one SNP region for a sample of diploid individuals can be plotted on a two-dimensional plot as shown:

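[Figure: two-dimensional plot of the two dye intensities at one SNP region, one point per sample, colored by allelic state]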

The colors in the plot indicate the allelic state of each sample in the population at the specific SNP region we are interested in. Throughout this post, I will use red to indicate allelic state 1 (or interchangeably “X”), blue to indicate allelic state 2 (or interchangeably “Y”), and black to indicate allelic state 1/2 (or interchangeably “HET”). Allelic states X and Y are homozygous, in the sense that both chromosomes (remember these are diploid samples) have the same SNP state at that particular genomic region. Allelic state “HET”, on the other hand, indicates that the sample has different SNP states on its two chromosomes. In a perfect world, the X homozygous samples would not show any Y fluorescence and would therefore lie on the X-axis, and similarly the Y homozygous samples would not show any X fluorescence and would therefore lie on the Y-axis. The HET samples would then lie on the 45° line. However, because the probes are not perfect, and for various other reasons, all samples show both dyes, and what we typically observe is a plot like the one above. In addition, depending on the samples, instruments, and other lab effects, the three clusters might move around while still retaining their positions relative to each other.
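
The idealized geometry described above (X samples near the X-axis, HET samples near the 45° line, Y samples near the Y-axis) suggests looking at the angle of each point. The sketch below converts a pair of dye intensities to an angle and applies fixed cutoffs. The function names and the cutoffs of 22.5° and 67.5° are assumptions chosen for illustration only; since the clusters move around in practice, fixed cutoffs like these would not be reliable on real data.

```python
import math

def signal_angle(x_intensity, y_intensity):
    """Angle in degrees between the X-axis and the point (x_intensity, y_intensity)."""
    return math.degrees(math.atan2(y_intensity, x_intensity))

def naive_call(x_intensity, y_intensity, low=22.5, high=67.5):
    """Assign an allelic state from the angle using fixed, hypothetical cutoffs."""
    angle = signal_angle(x_intensity, y_intensity)
    if angle < low:
        return "X"    # close to the X-axis
    if angle > high:
        return "Y"    # close to the Y-axis
    return "HET"      # close to the 45-degree line

# Made-up intensity pairs, one per sample.
samples = [(1.0, 0.1), (0.8, 0.9), (0.1, 1.0)]
for x, y in samples:
    print((x, y), round(signal_angle(x, y), 1), naive_call(x, y))
```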

The problem with the plot above is that the lab gives us the fluorescence values, but we typically do not know the allelic states of the samples that generated them. So we need a person, or preferably an algorithm, that can detect the clusters in the plot and assign them colors indicating the SNP or allelic states of the samples in each cluster. Part II of this post describes a finite mixture modeling approach, along with orthogonal regression techniques, for identifying these clusters.
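
As a rough illustration of what "detecting clusters" means here, the sketch below groups samples without any known labels by running a one-dimensional k-means on the angle of each point. This is a deliberately simple stand-in and not the finite mixture model or orthogonal regression of Part II; the function name, the starting centers of 0°, 45°, and 90°, and the data are assumptions made for the example.

```python
import math

def kmeans_angles(points, centers=(0.0, 45.0, 90.0), iterations=20):
    """Cluster (x, y) intensity pairs by angle using 1-D k-means with three centers."""
    angles = [math.degrees(math.atan2(y, x)) for x, y in points]
    centers = list(centers)
    labels = [0] * len(angles)
    for _ in range(iterations):
        # Assignment step: each sample goes to the nearest center.
        labels = [min(range(len(centers)), key=lambda k: abs(a - centers[k]))
                  for a in angles]
        # Update step: each center moves to the mean angle of its members.
        for k in range(len(centers)):
            members = [a for a, lab in zip(angles, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

# Made-up intensities: two samples per cluster, none lying exactly on an axis.
points = [(1.0, 0.2), (0.9, 0.25), (0.6, 0.55), (0.5, 0.6), (0.2, 0.9), (0.25, 1.0)]
labels, centers = kmeans_angles(points)
state_names = ["X", "HET", "Y"]  # centers are ordered by their starting angle
print([state_names[lab] for lab in labels])
```

Because the centers are re-estimated from the data, this kind of approach can follow clusters that have shifted away from the ideal axes, which fixed cutoffs cannot do.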

Originally published at Krishna Yerramsetty.
