Machine Learning in Bioinformatics

Published in

Axioma AI Journal

5 min read · Jun 16, 2022

Machine learning is now ubiquitous. Bioinformatics, too, has several problems that seem tailor-made for machine learning, and this article walks through one practical example: a small project that uses ML to predict the binding activity of ligands.

All of the code can be found in the Jupyter notebook, so feel free to copy it :)

Problem definition

Any applied computer science is first of all about the application, not the technique, so we need to start by defining the problem clearly. So what is binding activity?

Every protein or other macromolecule is made up of smaller molecules that are in constant motion under the influence of Van der Waals forces. As a result, particles continually approach or repel one another. If two particles stay close together in a relatively stable way, this indicates a strong attraction between them. The strength of this attraction reflects how likely the two molecules are to bind, which is an important characteristic of a substance.
In medicine this matters because small ligands that bind to large macromolecules can change their functional activity, and so they are used to prevent or treat diseases.

In molecular physics, the Van der Waals force is a distance-dependent interaction between atoms or molecules.

In my case, I focused on a well-known drug target in Alzheimer’s disease: the beta-amyloid protein. Researchers suggest that this protein accumulates in the brain and causes cell degeneration. My idea is to find molecules that could bind to this macromolecule and presumably prevent cell degeneration.

What is the plan then?

For many chemicals, the details of how they interact with particular substances are already known, which provides good data for training a machine learning model. An important database of medicinal chemistry information is ChEMBL (Chemical EMBL), the database of the European Molecular Biology Laboratory, which contains the data needed to analyze beta-amyloid.

The idea of the project is to use data from ChEMBL to build a model that predicts the binding activity of substances.

Problem Solving

1. Collecting data

To collect data from any source, you have to communicate with it somehow. Relational databases have SQL, web applications accept ordinary queries via the search bar, and ChEMBL has a dedicated API for retrieving data programmatically and automating workflows.

First we need to create a ChEMBL client instance and then query the database for information about beta-amyloid.
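
As a rough illustration of this step, here is a minimal sketch that assumes the `chembl_webresource_client` package. The helper names `load_chembl_client` and `fetch_bioactivities`, the search string, and the choice of IC50 as the activity type are my assumptions for the example, not necessarily what the notebook does.

```python
from typing import Any, Dict, List


def load_chembl_client() -> Any:
    """Import the ChEMBL web client lazily (pip install chembl_webresource_client)."""
    from chembl_webresource_client.new_client import new_client
    return new_client


def fetch_bioactivities(client: Any, query: str = "beta amyloid",
                        standard_type: str = "IC50") -> List[Dict[str, Any]]:
    """Search ChEMBL targets by name and return activity records for the first hit."""
    targets = list(client.target.search(query))
    if not targets:
        return []
    # Assumption: the first search hit is the target we want.
    # In practice, inspect the hits by hand before picking one.
    target_id = targets[0]["target_chembl_id"]
    records = client.activity.filter(target_chembl_id=target_id,
                                     standard_type=standard_type)
    return [dict(record) for record in records]


# Typical call (needs network access):
# activities = fetch_bioactivities(load_chembl_client())
```

Passing the client in as an argument keeps the query logic separate from the network connection, which makes it easy to try the function on a stub first.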

2. Cleaning

Now we have information about the compounds associated with beta-amyloid. The next step is data preparation, manipulation, and validation. Our goal is clean data suitable for the steps that follow.
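
A minimal cleaning sketch could look like the following. It assumes each record is a dict with the ChEMBL fields `molecule_chembl_id`, `canonical_smiles`, and `standard_value`. The function names are hypothetical, and the pIC50 conversion is a common convention that I am assuming here (with values in nM), not something stated above.

```python
import math
from typing import Any, Dict, List


def clean_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one usable row per compound: SMILES present, positive numeric value."""
    seen = set()
    cleaned = []
    for rec in records:
        smiles = rec.get("canonical_smiles")
        raw_value = rec.get("standard_value")
        if not smiles or raw_value in (None, ""):
            continue  # missing structure or missing measurement
        try:
            value = float(raw_value)
        except (TypeError, ValueError):
            continue  # value is not numeric
        if not math.isfinite(value) or value <= 0 or smiles in seen:
            continue  # non-physical value or duplicate compound
        seen.add(smiles)
        cleaned.append({
            "molecule_chembl_id": rec.get("molecule_chembl_id"),
            "canonical_smiles": smiles,
            "standard_value": value,
        })
    return cleaned


def to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)
```

Duplicates are resolved here by keeping the first record for each SMILES string. Averaging repeated measurements would be an equally reasonable choice.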

3. EDA & Descriptor calculation

Logically, the next step should be basic EDA and statistical analysis of the data to catch anomalies, outliers, and similar problems. But the techniques used in this biological project are too involved to describe in a couple of paragraphs. Calculating Lipinski descriptors, analysing hydrogen bonds, using Conda and RDKit in code, converting canonical SMILES: all of these deserve a separate article, so let me know if you are interested.

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, as well as preparing these data in order to improve quality, often using statistical graphics and other data visualization methods.
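
To give a flavour of the descriptor calculation mentioned above, here is a small sketch. It assumes RDKit is installed (for example via Conda). The function names are hypothetical, and the rule-of-five thresholds are the commonly quoted ones rather than values taken from the notebook.

```python
from typing import Dict


def lipinski_descriptors(smiles: str) -> Dict[str, float]:
    """Compute the four Lipinski descriptors with RDKit (imported lazily)."""
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return {
        "MW": Descriptors.MolWt(mol),                     # molecular weight
        "LogP": Descriptors.MolLogP(mol),                 # octanol-water partition estimate
        "NumHDonors": Lipinski.NumHDonors(mol),           # hydrogen-bond donors
        "NumHAcceptors": Lipinski.NumHAcceptors(mol),     # hydrogen-bond acceptors
    }


def rule_of_five_violations(desc: Dict[str, float]) -> int:
    """Count how many of Lipinski's four criteria a compound violates."""
    checks = [
        desc["MW"] > 500,
        desc["LogP"] > 5,
        desc["NumHDonors"] > 5,
        desc["NumHAcceptors"] > 10,
    ]
    return sum(checks)


# Typical call (needs RDKit):
# violations = rule_of_five_violations(lipinski_descriptors("CCO"))
```

The descriptor values can also serve directly as input features for the model in the next step.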

4. Teaching the Machine

Finally, we have suitable data for our analysis, and we can move on to choosing a machine learning model. A handy tool for this is Lazypredict: it automatically checks the performance of various ML methods and reports the best ones. This tool deserves a separate description, so stay tuned for another article about it.
Otherwise, there are plenty of tips and articles on how to choose a model, so pick whatever you like and move on to training.
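
To make concrete what such a tool automates, here is a hand-rolled sketch that uses scikit-learn directly rather than Lazypredict itself. The three candidate models, the synthetic stand-in data, and the function name `compare_regressors` are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor


def compare_regressors(X, y, cv=5):
    """Return {model name: mean cross-validated MAE}; lower is better."""
    candidates = {
        "LinearRegression": LinearRegression(),
        "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
        "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
    }
    results = {}
    for name, model in candidates.items():
        # scikit-learn reports MAE as a negative score, so flip the sign.
        scores = cross_val_score(model, X, y, cv=cv,
                                 scoring="neg_mean_absolute_error")
        results[name] = float(-scores.mean())
    return results


# Synthetic stand-in for a descriptor matrix and activity values.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=200)

comparison = compare_regressors(X_demo, y_demo)
best_model = min(comparison, key=comparison.get)
```

Which model comes out on top depends entirely on the data, so a ranking on synthetic numbers says nothing about the real ligand set.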

In the end, I chose a Random Forest Regressor and trained it in the standard way.
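
A typical training run might look roughly like this, assuming scikit-learn. The data below is synthetic, and the 80/20 split and hyperparameters are illustrative choices of mine, not the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 "compounds" with 4 descriptor columns each.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 6.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions on unseen rows, used in the evaluation step.
y_pred = model.predict(X_test)
```

Fixing `random_state` in both the split and the model makes the run reproducible, which helps when comparing results between experiments.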

5. Evaluation

The evaluation showed a mean absolute error of 0.59 and a model accuracy of 89.97%, which is a very good result.
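
For reference, here is how such numbers can be computed. The mean absolute error is standard. The accuracy definition, 100% minus the mean absolute percentage error, is my assumption about how an accuracy figure is usually derived for a regression model, and the sample values are made up.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_from_mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Accuracy in percent, defined here (an assumption) as 100 - MAPE.

    Requires every true value to be non-zero.
    """
    mape = 100.0 * sum(
        abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)
    ) / len(y_true)
    return 100.0 - mape


# Made-up measured and predicted activity values.
measured = [5.0, 6.0, 8.0]
predicted = [5.5, 6.0, 7.0]

mae = mean_absolute_error(measured, predicted)        # 0.5
accuracy = accuracy_from_mape(measured, predicted)    # 92.5
```

Note that a percentage-based accuracy depends on the scale of the target values, so it is best read alongside the absolute error rather than on its own.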

Results

Finally, we can say that the task of predicting the binding activity of chemicals has been completed quite successfully. The model turned out to be quite accurate, as the evaluation showed, so machine learning clearly has a place here.

However, it is worth noting that in reality the results can vary significantly, so always remember that the first three letters play the greatest role in bioinformatics. Follow me and expand your knowledge with my Medium.

Be human, do science 🕊
