Bioinformatics + machine learning (I)

Introduction

Flu Genome Data Visualizer by Jer Thorpe (Flickr)

Machine learning has been a trend ever since people first saw its potential. And that potential is huge, both now and in the future, especially with deep learning on the rise. But first things first.

What is machine learning?

You can find several definitions around the internet and, of course, a suitable one on Wikipedia. However, I'll define it (or try to) from my own knowledge and with a simple example.

Machine learning is itself a field of computer science and, more precisely, of artificial intelligence. However, when we say we're applying machine learning, we are referring to the set of techniques (or algorithms, to be more technical) that you can use in a given field. Now, getting to the point: the example. This is what I always say when I try to explain the term:

Baby fat cheeks by Aikawa Ke
Machine learning is like our brain: it learns. Both go through two phases. In the first, the baby doesn't have enough samples to say (or do) anything, so he (or it, the model) is only learning and fitting its model. In this stage, called training, the baby learns from his whole environment, mimicking everything his parents do and say. The model can't be tested on anything yet because it isn't fitted well enough: it lacks samples. That's why the baby can't say anything during his first year of life.
Sandhill Crane Baby by Matthew Paulson

Afterwards comes the testing period, which starts after the first year and never ends. Although the model is still not well fitted, the baby will begin trying to do almost everything. Eventually, he will attempt to walk. Throughout this period, his parents will correct him and teach him how to do it properly, giving weights to the model in order to boost the positive attempts (encouraging its true positives) and discourage the negative ones (damping the false positives). Likewise, we have to test our model and its classification to measure its efficiency and precision when an example it has never seen comes along for the first time. We also want to measure how well it predicts, usually with the ROC curve and its AUC (area under the curve).

During the first years of life, the model and the baby fail more often than they succeed. That's why they still need more samples.
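To make the idea of measuring a model on unseen examples a bit more concrete, here is a minimal sketch of how the AUC can be computed by hand. It uses the fact that the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The function name `roc_auc` and the labels and scores below are made up for illustration; in practice you would probably rely on a library instead.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive example against every negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out examples: true labels and the model's scores.
y_test = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

auc = roc_auc(y_test, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 would mean every positive is ranked above every negative, while 0.5 is what random guessing gives.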

Learning takes time. The brain is usually compared to an algorithm used in this field: neural networks. Without going deeper into the definition than necessary, we can say that, depending on the number of layers and neurons defined in the algorithm, it will have more or less (time and space) complexity. Our brain is very complex and, thus, its learning process takes time; in return, it can learn a huge variety of tasks.
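As a rough illustration of how layers and neurons drive complexity, the sketch below counts the trainable parameters (weights plus biases) of a plain fully connected network. This is an assumption on my side: other architectures count differently, and the helper name `count_parameters` and the layer sizes are just examples.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of neurons per layer, input first.
    Each pair of consecutive layers contributes n_in * n_out weights
    and n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


small = count_parameters([4, 8, 2])        # one hidden layer of 8 neurons
large = count_parameters([4, 64, 64, 2])   # two hidden layers of 64 neurons

print(small, large)
```

Same input and output size, but the larger network has far more values to store and to adjust during training, which is where the extra time and space go.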

So, as the term itself suggests, machine learning is learning by a machine. Normally, a model is fitted to a single task.

In the field of bioinformatics, machine learning is used very frequently. It ranges from cancer biology, in which you could classify cancer sequences versus non-cancer sequences (a very naive example), through clustering and its several approaches in phylogeny (e.g. 1 and 2), to more complex examples such as learning whether a sequence would be translated into a transmembrane β-barrel protein.
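Before any of these algorithms can learn from a sequence, the sequence has to be turned into numbers. One common way (not the only one, and not necessarily what the works mentioned above use) is to count k-mers, that is, every substring of length k. Here is a minimal sketch; the function name `kmer_counts` and the toy sequence are made up for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=2, alphabet="ACGT"):
    """Turn a DNA sequence into a fixed-length vector of k-mer counts.

    The vector has one position per possible k-mer, in the order
    produced by itertools.product (AA, AC, AG, AT, CA, ...).
    """
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


# Toy sequence, not real data.
features = kmer_counts("ACGTAC", k=2)
print(features)
```

Every sequence, whatever its length, ends up as a vector of the same size (16 values for k=2), which is the kind of input a classifier or a clustering algorithm expects.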

Cheers,
Pablo

P.S.: If you find any mistake or anything wrong, please say so. I'll be glad to put it right.