Dimensionality Reduction Methods For Microarray Cancer Data Using Prior Knowledge

Tell us about your research


By: Dr. Zena Maria Hira*

During my PhD I developed algorithms for large genetic datasets that could distinguish between different types of cancer and predict whether a cancer patient would respond to treatment. The aim of my research was to introduce new methods for dimensionality reduction: algorithms that transform big datasets into ones small enough for machine learning techniques to be applied, while preserving the important aspects of the data. To do this I leveraged prior knowledge — other databases and results that could prime the algorithms and guide the reduction process.

The biological datasets I was working with, known as microarray datasets, contain genetic information about how genes are expressed (that is, to what degree each gene is active) in different cancer patients. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and in the last ten years machine learning techniques have been applied to microarray data analysis. Several approaches have been tried in order to distinguish between cancerous and non-cancerous samples, classify different types of cancer, and identify subtypes of cancer that may progress aggressively. All these investigations seek to generate biologically meaningful interpretations of complex datasets that are sufficiently interesting to drive follow-up experimentation.

Analysing biological datasets is not a straightforward process. Most of these datasets are so big that conventional algorithms cannot produce meaningful results, a problem known as the curse of dimensionality. There are several reasons for this: the datasets require more computational power than may be available; there are complicated interactions among the genes that are not trivial to capture; as the dimensionality of a dataset grows, proving any result significant becomes more and more difficult; and biological datasets contain a lot of “noise”, that is, errors in the measured gene expressions, since damaged equipment can produce the wrong results.

To overcome these difficulties, machine learning is often applied. Machine learning is concerned with programming systems that can learn from the data provided and improve their performance with experience. Learning involves generalising a behaviour so that when a new situation arises it can be identified and dealt with. To do this, the system is designed to detect patterns or rules in the data by building a model and adjusting its performance accordingly. Machine learning can be particularly useful when dealing with large amounts of data and complex problems that are not easy for the human brain to solve.

There are many types of machine learning algorithms. The types I mostly worked with were dimensionality reduction and classification algorithms. Dimensionality reduction algorithms take a dataset as input and, using mathematics and statistics, produce a smaller version of that dataset.

Classification algorithms proceed in two steps:

1. Training: Giving the system enough data to learn from.

2. Classification: Running the learned model on a new dataset to classify the data.
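As a rough sketch of these two steps, the snippet below trains a simple classifier on synthetic data and then classifies unseen samples with it (scikit-learn, a linear support vector machine and made-up data are my own illustrative choices here, not the setup used in this research):

```python
# Minimal sketch of the training and classification steps, using scikit-learn
# on synthetic data (not the actual microarray datasets from this work).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic "gene expression" matrix: 200 samples x 500 features.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# 1. Training: the model learns from the labelled samples.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# 2. Classification: the learned model labels previously unseen samples.
predictions = model.predict(X_test)
print("Accuracy on unseen samples:", model.score(X_test, y_test))
```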

Machine learning by itself, however, does not solve the problems described above. Models trained on such large datasets tend to be prone to overfitting. An overfitted model can mistake small fluctuations for important variance in the data, which leads to classification errors: during the training step the algorithm fails to generalise well, due to limitations of the training dataset, and therefore in the classification step it cannot produce accurate results for new data. A common reason for this is the large dimensionality of the training dataset, which makes it hard for the algorithm to identify genuine trends.
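The effect is easy to demonstrate on synthetic data: when features vastly outnumber samples, a classifier can fit the training data perfectly even if the labels are pure noise (again an illustrative sketch, not the actual datasets from this work):

```python
# Sketch of overfitting when features greatly outnumber samples.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10000))     # 60 "patients", 10,000 "genes"
y = rng.integers(0, 2, size=60)      # random labels: nothing real to learn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)
model = SVC(kernel="linear").fit(X_train, y_train)

# The model memorises noise: near-perfect on the training data,
# close to chance level (~0.5) on unseen data.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```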

Generally, dimensionality reduction algorithms are used to overcome this limitation, but in biological datasets the dimensionality is so high that even dimensionality reduction may not work properly. Moreover, the relationships among genes and biological products are so complicated that they cannot be captured accurately, which can lead to a wrong result after the reduction. One way to improve this is to use prior knowledge. Prior knowledge is all the information that is available to the system in addition to the training data. This information is integrated with the dataset to obtain a more accurate, reduced version of it.
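To make the idea of dimensionality reduction concrete, the sketch below compresses a made-up expression matrix with principal component analysis, a standard technique; the algorithms developed in this research extend such methods rather than simply applying them:

```python
# Sketch of a standard dimensionality reduction step (PCA) that compresses
# a high-dimensional expression matrix into a handful of components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expression = rng.normal(size=(100, 5000))   # 100 patients x 5000 genes

pca = PCA(n_components=10)                  # keep only 10 components
reduced = pca.fit_transform(expression)

print(reduced.shape)                        # (100, 10): far fewer dimensions
print(pca.explained_variance_ratio_.sum())  # variance retained by the components
```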

For my research the prior knowledge came from biological databases, which contain information on how genes interact with each other, also known as gene pathways. This information was added to existing dimensionality reduction algorithms to make computation more efficient and more accurate. In this way the behaviour of the dimensionality reduction algorithm was influenced: genes belonging to the same pathway were considered to have a higher probability of being related, biasing the output of the algorithm accordingly.
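One simple way to picture how pathway membership can steer the reduction (a hypothetical sketch with invented pathway names and groupings, not the specific algorithms from the thesis) is to summarise each pathway's genes separately, so the reduced features respect the pathway structure:

```python
# Hypothetical illustration of pathway membership as prior knowledge:
# instead of reducing all genes together, reduce each pathway's genes
# separately and keep one summary component per pathway.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expression = rng.normal(size=(100, 5000))   # 100 patients x 5000 genes

# Prior knowledge: which gene (column) indices belong to which pathway.
# These groupings are invented for the example.
pathways = {
    "pathway_A": [0, 1, 2, 3, 4],
    "pathway_B": [5, 6, 7, 8],
    "pathway_C": [9, 10, 11, 12, 13, 14],
}

# One component per pathway: genes in the same pathway are summarised together.
pathway_features = []
for name, gene_idx in pathways.items():
    component = PCA(n_components=1).fit_transform(expression[:, gene_idx])
    pathway_features.append(component)

reduced = np.hstack(pathway_features)       # 100 patients x 3 pathway features
print(reduced.shape)
```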

Biological pathways can affect the performance of traditional machine learning algorithms and help answer questions about biology and cancer. Gene expressions from specific pathways that have previously been associated with particular forms of cancer can be strong indicators of someone’s response to cancer treatment. In addition, gene expressions can indicate which form of cancer a person has. More biological experimentation is needed to prove these hypotheses correct, but this is certainly a step closer to curing cancer, and with this information more targeted clinical trials can take place. Below is a figure of one of the algorithms distinguishing between different types of leukaemia and normal cells. Each dot represents a sample (patient), while the x and y axes represent their genetic information after dimensionality reduction. This visualisation would have been infeasible without the dimensionality reduction step.

Figure: distinguishing between different types of leukaemia and normal cells. Dots represent different patients, while the x and y axes represent their genetic information after dimensionality reduction.

More information on the algorithms and background of this research can be found:

*Current Position: Analytics Engineer at McKinsey — QuantumBlack

Research Group: Machine Learning and Intelligent Data Group — Imperial College London

Email: zena.hira@gmail.com
