TBP #9: Finding Important Features in Gene Expression Data

Published in

The Bioinformatics Press

2 min read · Oct 23, 2019

The curse of dimensionality is ever-present in the big data community. In essence, when a dataset has far more features than samples, we cannot reliably say which features matter for the task at hand.

This is why a whole body of work has grown up around tools that reduce the number of dimensions to something our algorithms can manage and humans can visualize. Common techniques such as independent component analysis (ICA), principal component analysis (PCA), and t-SNE allow researchers to extract the important features from a high-dimensional dataset.
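To make the idea of dimensionality reduction concrete, here is a minimal PCA sketch using NumPy's SVD. The data are synthetic (30 samples, 500 features, with a made-up shift in the first five features for half of the samples), and the helper name `pca_project` is hypothetical, not something from the paper discussed in this post.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    # PCA works on mean-centered features.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured along each kept direction.
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    return X_centered @ components.T, explained_variance

# Synthetic "few samples, many features" data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 500))
X[:15, :5] += 4.0  # half of the samples differ in only five features

projected, variance = pca_project(X)
print(projected.shape)
```

Each sample is now described by two coordinates instead of 500, which is what makes plotting and downstream clustering tractable.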

Researchers recently reported an interesting method that improves upon K-means clustering to ensure that the extracted features can discriminate between classes.

The following figure illustrates this:

As the figure shows, there are 3 distinct clusters that they are trying to separate. The first two methods do an okay job of finding features that separate them. However, with their improved method, called “Cluster-Specific” Sparse K-means, they are able to computationally extract features that separate the 3 groups well along the Feature 1 and Feature 2 axes.
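For intuition about how a sparse clustering method can pick out discriminating features, here is a rough sketch of the feature-weighting step from standard sparse K-means. It is not the cluster-specific variant described in the paper. Each feature is scored by its between-cluster sum of squares, then the scores are soft-thresholded so that uninformative features get a weight of exactly zero. The toy data, the fixed threshold `delta`, and the function names are all assumptions made for illustration. In the full algorithm the threshold is tuned to meet a sparsity constraint and the weighting alternates with re-clustering.

```python
from math import sqrt

def between_cluster_ss(X, labels):
    """Per-feature between-cluster sum of squares for a given clustering."""
    clusters = {}
    for row, label in zip(X, labels):
        clusters.setdefault(label, []).append(row)
    scores = []
    for j in range(len(X[0])):
        overall = sum(row[j] for row in X) / len(X)
        score = 0.0
        for rows in clusters.values():
            mean = sum(r[j] for r in rows) / len(rows)
            score += len(rows) * (mean - overall) ** 2
        scores.append(score)
    return scores

def sparse_weights(scores, delta):
    """Soft-threshold the scores by delta, then scale to unit L2 norm."""
    shrunk = [max(s - delta, 0.0) for s in scores]
    norm = sqrt(sum(v * v for v in shrunk))
    if norm == 0.0:
        return shrunk
    return [v / norm for v in shrunk]

# Toy data: feature 0 separates the three groups, features 1 and 2 do not.
X = [
    [0.0, 1.0, 5.0], [0.2, 1.1, 5.0],
    [5.0, 0.9, 5.1], [5.2, 1.0, 4.9],
    [10.0, 1.1, 5.0], [10.2, 0.9, 5.0],
]
labels = [0, 0, 1, 1, 2, 2]

weights = sparse_weights(between_cluster_ss(X, labels), delta=1.0)
print(weights)
```

Features whose weight comes out as zero are effectively dropped, which is how this family of methods doubles as a feature selector.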

They apply this method to a leukemia gene expression dataset and compare their algorithm against other clustering algorithms. Out of 5,135 genes, their method selected 99 genes that gave the best results for a 3-way clustering task. 7 of these 99 genes were among the 11 signature genes used as the ground truth, resulting in a p-value of 2.5e-10.

Methods such as these can help researchers pinpoint which genes are important nodes in a network that affects a certain disease phenotype, and target them for future study.
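This post does not say which test produced that p-value. One common way to score an overlap like this is a hypergeometric test: if 99 genes were drawn at random from 5,135, how likely is it that at least 7 of the 11 signature genes would be among them? The sketch below computes that tail probability with the standard library only. Treating the reported value as a hypergeometric tail is an assumption, and `hypergeom_tail` is a hypothetical helper name.

```python
from math import comb

def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are sampled without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total

# Numbers from the post: 5,135 genes, 11 signature genes, 99 selected, 7 shared.
p_value = hypergeom_tail(k=7, population=5135, successes=11, draws=99)
print(f"{p_value:.2e}")
```

Under these assumptions the result lands in the same order of magnitude as the reported value, which suggests the overlap is very unlikely to be due to chance.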

Thanks for reading.

Written by stay trying.