Identification of Significant Genes in Microarrays using Lassoed PCA

Kayli Leung
Published in Analytics Vidhya
Jun 10, 2019

The Use of Lassoed Principal Component Analysis

In biology, it is understood that a single gene is usually not the sole reason for a specific phenotype. Rather, it is a collection of genes working together that creates that phenotype. It is potentially more valuable to discover this collection of genes than to find one gene that is correlated with an outcome. With this collection of genes identified, more research can be done to uncover how these genes interact to create that specific outcome.

Should a single gene be determined to be significant, it is much harder to determine which other genes interact with it. In addition, from a statistical standpoint, it is unlikely that a number of genes will be correlated both with an outcome AND with each other by random chance, whereas a single gene could more easily be correlated with an outcome by chance alone.

This is where Daniela Witten and Robert Tibshirani’s paper “Testing Significance of Features by Lassoed Principal Components” comes in. Witten and Tibshirani set out to develop a new tool for finding clusters of genes that are significant to a particular outcome such as diagnosis or survival time. They used microarrays to measure gene expression and applied their technique, Lassoed Principal Components (LPC), to find significant sets of genes. In this blog post, I hope to describe how this technique works to discover these sets.

Before we talk about LPC, let’s first discuss microarrays. A microarray uses fluorescent markers to measure the expression of many genes at once. Should a gene behave similarly in both normal and suspect cells, the gene will show yellow. If the gene is expressed in normal but not suspect cells, it will be green, and if it is expressed in suspect but not normal cells, then it is red. It is therefore of interest to find genes that are expressed differently under different conditions.

Microarrays show the relative expression of genes in control and diseased cells.

During analysis of microarray data, n microarrays are generated for p genes. In general, there are A LOT more genes analyzed than microarrays completed (p is much larger than n). Each gene is scored by its association with the outcome; if that score exceeds a threshold (set by the statistical method in use), the gene is considered differentially expressed and therefore important. Several methods are able to use information across genes, such as the Limma procedure, the Optimal Discovery Procedure (ODP), and the Cox score. LPC builds upon these methods.

The steps to LPC are three-fold:

1. Compute the scores for the genes using an existing method

Create an n x p matrix, X, of the log-transformed gene expression levels. Each row represents a single microarray and each column represents a gene.

From this matrix, create a vector, T, of length p that represents the gene scores for each gene. Some simple scoring methods include the two-sample t-statistic, F-statistic for one-way ANOVA, and Cox proportional hazards.

2. Regress the scores onto the eigenarrays of the data

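The eigenarrays of X come from its singular value decomposition, X = UDV^T. They are the columns of the matrix V, and these columns are orthogonal to one another. Each eigenarray has length p, with one entry per gene, just like T, so T can be regressed onto the columns of V.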

3. Apply the L1 constraint to the eigenarrays

Regress T onto the columns of V under a Lasso (L1) constraint. The Lasso constraint works upon the eigenarrays: it shrinks the coefficients of many eigenarrays to exactly zero, so only a few expression patterns contribute to the fitted scores. The effect is to penalize genes that do NOT show a similar expression pattern to other genes, shrinking their scores toward zero. This means that genes identified as significant for the outcome should be the ones left with scores that stand out from zero.
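The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions, not the authors' implementation: the function names, the toy data, the choice of a Welch t-statistic as the base score, and the fixed penalty `lam` are all hypothetical. One simplification is worth noting: because the columns of V are orthonormal, the Lasso solution reduces to soft-thresholding the plain regression coefficients, so no iterative solver is needed here.

```python
import numpy as np

def two_sample_t(X, y):
    """Step 1: Welch two-sample t-statistic for each gene (column of X)."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se

def lpc_scores(X, T, lam):
    """Steps 2-3: lasso regression of the scores T onto the eigenarrays of X."""
    Xc = X - X.mean(axis=0)                       # center each gene
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T                                      # p x n, orthonormal columns
    z = V.T @ T                                   # unpenalized coefficients
    # With orthonormal columns, the lasso solution is a soft threshold of z.
    beta = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    return V @ beta, beta                         # fitted scores, coefficients

# Toy data (assumed, for illustration only): 40 arrays, 200 genes,
# with the first 20 genes shifted together in the second class.
rng = np.random.default_rng(0)
n, p = 40, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :20] += 2.0

T = two_sample_t(X, y)
scores, beta = lpc_scores(X, T, lam=3.0)
print("eigenarrays kept:", np.count_nonzero(beta), "of", beta.size)
```

Raising `lam` keeps fewer eigenarrays, and a large enough value sets every fitted score to zero, so in practice the penalty has to be chosen with some care.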

LPC builds upon other models, so the next thing to explore is how it compares to these models. Three simulations were completed using 1000 genes and 40 observations for a two-class system (yes or no for cancer). 50 genes in each simulation were significant. ODP scores, Limma scores, and the LPC scores built from each of these two were calculated. The LPC-derived gene scores showed a decrease in false discovery rates over ODP and Limma alone.
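To make the comparison concrete, here is a rough sketch of how a simulation of this shape could be set up and scored. Only the dimensions (1000 genes, 40 observations, 50 significant genes, two classes) come from the description above. The effect size, the random seed, the simple t-statistic used as the score, and all function names are my own assumptions, and the sketch does not reproduce the paper's results. It reports the false discovery proportion among the top k genes; the false discovery rate is the expected value of that proportion.

```python
import numpy as np

def simulate_two_class(n=40, p=1000, n_signal=50, shift=1.0, seed=0):
    """Simulate expression data where the first n_signal genes differ by class."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, p))
    X[y == 1, :n_signal] += shift          # assumed effect size
    truth = np.zeros(p, dtype=bool)
    truth[:n_signal] = True
    return X, y, truth

def t_scores(X, y):
    """Welch two-sample t-statistic for each gene (column of X)."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se

def false_discovery_proportion(scores, truth, k):
    """Fraction of the k genes with the largest |score| that are truly null."""
    top = np.argsort(-np.abs(scores))[:k]
    return float(np.mean(~truth[top]))

X, y, truth = simulate_two_class()
T = t_scores(X, y)
for k in (10, 50, 100):
    print(k, false_discovery_proportion(T, truth, k))
```

Any other gene score, including LPC-derived scores, can be passed to `false_discovery_proportion` in place of `T` to compare methods on the same simulated data.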

Using significant genes discovered by LPC decreases the false discovery rate in cancer diagnosis

For survival predictions, LPC is built on top of Cox scores. In this situation, LPC does perform better than Cox, but only up to a certain number of genes. As more and more genes are used to make the prediction, Cox outperforms LPC. This likely occurs because LPC gives high scores to correlated predictors, so each additional gene carries largely redundant information and does not lead to much improvement. Cox instead adds genes that may not be correlated with each other, and so a larger number of genes can lead to greater improvement in the model.

LPC (red) has a better log rank test statistic score than Cox (black) for predicting survival up to a certain number of included genes.

It is important to remember what the goal of LPC is. LPC is not necessarily well suited for making predictions; it is more useful for identifying significant genes. It is also important to remember our biological assumption that multiple genes are important to an outcome. If this assumption is incorrect for any particular test, LPC will not be helpful. Should a single gene, or a small cluster of genes, be the cause of an outcome, LPC would shrink the scores of these genes toward zero because they would be more correlated with unimportant genes than they are with each other.

Thank you for reading my blog. For a more technical look at LPC, read the paper by Witten and Tibshirani.
