Cancer detection isn’t as straightforward as it looks. Today, the interpretation of genetic mutations is still done manually: a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature, which is extremely time-consuming. This project aims to develop a machine learning model that uses this knowledge base to classify genetic variations automatically. I will briefly walk through the process in this post; the code can be found on my GitHub profile. Since this is a medical application, accuracy matters far more than latency: we are fine with a prediction taking some time as long as it is accurate.
So let’s begin!
Explaining the dataset
The dataset was acquired from Kaggle. It has two files, training_variants and training_text, with a total of 3,321 training records.
training_variants contains an ID, the gene code, the variation in that gene, and the final class label, which ranges from 1 to 9 (making this a multi-class classification problem).
Everything is well described in the notebook.
training_text contains the corresponding medical literature for each ID in training_variants.
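Loading the two files can be sketched as below. The exact layout of training_text is an assumption based on the Kaggle release, where the text field is separated from the ID by "||"; the tiny files written at the top are synthetic stand-ins purely so the snippet is self-contained.

```python
import pandas as pd

# Tiny synthetic stand-ins for the two Kaggle files (illustration only)
with open("training_variants", "w") as f:
    f.write("ID,Gene,Variation,Class\n0,FAM58A,Truncating Mutations,1\n")
with open("training_text", "w") as f:
    f.write("ID,Text\n0||Cyclin-dependent kinases regulate the cell cycle\n")

# training_variants is a regular CSV
variants = pd.read_csv("training_variants")

# training_text uses "||" as the field separator, so it needs a regex
# separator, the python engine, and explicit column names
text = pd.read_csv("training_text", sep=r"\|\|", engine="python",
                   skiprows=1, names=["ID", "TEXT"])
```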
Performance Metric(s) used:
- Multi-class log-loss
- Confusion matrix
First, I did some standard text preprocessing: lowercasing everything, removing all special characters, collapsing multiple spaces/tabs into a single space, and removing stop words.
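A minimal version of that cleaning step might look like this. The stop-word list here is a tiny illustrative set; the notebook presumably uses a fuller list such as NLTK's English stop words.

```python
import re

# Small illustrative stop-word set (the real project would use a full list)
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}

def clean_text(text: str) -> str:
    text = text.lower()                       # lowercase everything
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse spaces/tabs
    return " ".join(w for w in text.split() if w not in STOP_WORDS)
```

For example, `clean_text("The EGFR gene, in exon-19!")` yields `"egfr gene exon 19"`.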
Next, I merged both files into a single Pandas DataFrame. There were five missing values in the TEXT column, which I imputed by concatenating Gene and Variation. I could have dropped those rows, but I did not have much training data to begin with.
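The merge-and-impute step can be sketched as follows, using toy frames in place of the real files:

```python
import pandas as pd

# Toy stand-ins for the two loaded files (illustration only)
variants = pd.DataFrame({"ID": [0, 1],
                         "Gene": ["BRCA1", "TP53"],
                         "Variation": ["Truncating Mutations", "R175H"],
                         "Class": [1, 4]})
text = pd.DataFrame({"ID": [0, 1],
                     "TEXT": ["some clinical literature ...", None]})

# Merge on ID, then fill any missing TEXT with "Gene Variation"
data = pd.merge(variants, text, on="ID", how="left")
missing = data["TEXT"].isnull()
data.loc[missing, "TEXT"] = (data.loc[missing, "Gene"] + " "
                             + data.loc[missing, "Variation"])
```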
The data is split 80/20 into training and test sets. The 80% training portion is then split again into train and cross-validation sets for hyperparameter tuning.
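With scikit-learn this two-level split looks roughly like the sketch below. The stratify argument (an assumption about the notebook, but standard practice here) keeps the class proportions similar across splits, which matters with nine imbalanced classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 90 documents, 9 classes, 10 samples each
X = np.arange(90).reshape(-1, 1)
y = np.repeat(np.arange(1, 10), 10)

# 80/20 train/test, stratified to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Carve a cross-validation set out of the 80% training portion
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
```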
The evaluation metric is multi-class log loss, since the models output probability scores rather than hard labels.
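As a quick illustration of the metric, log loss rewards well-calibrated probabilities and heavily penalizes confident wrong predictions (toy numbers below):

```python
from sklearn.metrics import log_loss

# Three samples, true classes 1..3, one probability row per sample
y_true = [1, 2, 3]
y_prob = [[0.8, 0.1, 0.1],
          [0.2, 0.7, 0.1],
          [0.1, 0.2, 0.7]]

# Mean negative log probability assigned to the true class
loss = log_loss(y_true, y_prob, labels=[1, 2, 3])
```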
Since this is a healthcare application, model interpretability is extremely important. I therefore used models that are easy to interpret and gave decent accuracy for this use case: Naive Bayes, logistic regression, k-NN, linear SVM, and random forest classifiers. At the end I also tried a stacking ensemble and a max-voting classifier.
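As one example from that list, a logistic regression on TF-IDF features produces the per-class probabilities that the log loss needs, and its coefficients are directly inspectable. This is only a sketch on a toy corpus, not the notebook's exact pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned clinical literature
texts = ["kinase mutation activates pathway",
         "truncating loss of function",
         "missense variant unknown effect",
         "kinase inhibitor resistance pathway",
         "loss of tumor suppressor function",
         "benign polymorphism no effect"]
labels = [1, 2, 3, 1, 2, 3]

model = make_pipeline(
    TfidfVectorizer(),
    # balanced class weights help with a skewed class distribution
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(texts, labels)
probs = model.predict_proba(texts)  # probability scores feed the log loss
```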
This project shows how much machine learning can help doctors make important decisions and speed up the detection process by automating the manual review they currently have to perform. Everything is explained in detail in the notebook.
Thanks for reading!