Redefining Cancer Treatment with Machine Learning
Introduction
Over the past decades, cancer treatment has evolved continuously. Scientists have applied different techniques to detect types of cancer before they cause symptoms. Recent years have seen many breakthroughs in medicine, along with a large increase in the amount of data available to medical researchers. With more data available, researchers have used machine learning to identify hidden patterns in complex data and to try to predict outcomes for a given cancer type.
Given the significance of personalized medicine and the growing use of ML techniques, we will try to solve one such problem: distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).
Currently, this interpretation of genetic mutations is done manually. It is a very time-consuming task in which a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature. We need to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.
Data Overview
Given Gene, Variation and Text as features, we need to predict the value of the Class variable (the target variable). It’s a multi-class classification problem, and we will measure the performance of our model with the multi-class log-loss metric.
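To make the metric concrete, here is a minimal sketch of multi-class log-loss written by hand with numpy. The function name multiclass_log_loss and the toy probabilities are assumptions made for illustration, and the labels are 0-based indices here (the Class values in the dataset start at 1). In the rest of the article the metric comes from sklearn's log_loss.

```python
import numpy as np


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true: 0-based class indices, one per point (illustrative convention).
    probs:  array of shape (n_points, n_classes); each row sums to 1.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true)
    # Pick, for every row, the probability given to its true class
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_probs)))


# Toy example with 3 points and 3 classes (made-up numbers)
y_true = [0, 2, 1]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3]]
example_loss = multiclass_log_loss(y_true, probs)
print("Toy log-loss:", example_loss)

# A model that always predicts 1/9 for each of 9 classes scores ln(9)
uniform = np.full((3, 9), 1 / 9)
uniform_loss = multiclass_log_loss([0, 4, 8], uniform)
print("Uniform 9-class log-loss:", uniform_loss)
```

Lower is better: a perfect model scores 0, and the loss grows without bound as the model puts less probability on the true class.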
Roadmap Ahead
We will read the data, preprocess the text, split the data into train, test and cross-validation sets, train a random model, train different ML models, compute the log-loss and the percentage of misclassified points for each, and then compare the models to find the best one.
Let’s Start Coding
import pandas as pd
import matplotlib.pyplot as plt
import re
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier,LogisticRegression
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")
from mlxtend.classifier import StackingClassifier
We start by importing the libraries: pandas and numpy for data manipulation, matplotlib and seaborn for plotting, and sklearn for model building.
Reading the data
Our data is stored in two files with different separators, so we will read each file separately and then merge the two on the “ID” column.
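A minimal sketch of this step, assuming one file is comma-separated and the other uses "||" between the ID and the text. The in-memory strings stand in for the real files, and the toy rows and the column name TEXT are invented for illustration.

```python
import io

import pandas as pd

# Stand-ins for the two files on disk (toy rows, not real data)
variants_file = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENEA,V1,1\n"
    "1,GENEB,V2,2\n"
)
text_file = io.StringIO(
    "ID,Text\n"
    "0||first toy abstract, with a comma inside\n"
    "1||second toy abstract\n"
)

# Comma-separated file: the default separator works
variants = pd.read_csv(variants_file)

# "||"-separated file: a multi-character separator is treated as a regex,
# so the pipes are escaped and the python engine is requested explicitly.
# The header line is skipped and the column names are supplied by hand.
text = pd.read_csv(
    text_file,
    sep=r"\|\|",
    engine="python",
    names=["ID", "TEXT"],
    skiprows=1,
)

# Combine both tables on the shared ID column
data = pd.merge(variants, text, on="ID", how="left")
print(data)
```

With the real files, the io.StringIO objects would be replaced by the file paths.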
Text-Preprocessing and Feature Engineering
After reading the data, we preprocess the text: removing stopwords, removing any special characters, normalizing whitespace and converting all words to lowercase. During this process we found that some rows have no text, so we replace those NaN values with the Gene and Variation values joined together.
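The cleaning steps can be sketched roughly as follows. The helper name clean_text, the tiny STOP_WORDS set (a stand-in for the nltk stopword list imported earlier) and the toy rows are all assumptions for illustration.

```python
import re

import pandas as pd

# Small stand-in for nltk's English stopword list (illustrative only)
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def clean_text(text):
    """Drop special characters, collapse whitespace, lowercase, remove stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", str(text))
    text = re.sub(r"\s+", " ", text).lower()
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


# Toy rows; the second one has no text
df = pd.DataFrame({
    "Gene": ["GENEA", "GENEB"],
    "Variation": ["V1", "V2"],
    "TEXT": ["The Mutation (V1) is found in   Tumors!", None],
})

# Fill missing text with "Gene Variation" before cleaning
missing = df["TEXT"].isnull()
df.loc[missing, "TEXT"] = df.loc[missing, "Gene"] + " " + df.loc[missing, "Variation"]

df["TEXT"] = df["TEXT"].apply(clean_text)
print(df["TEXT"].tolist())
```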
Splitting the Data into Train, Test and Cross-Validation
We will now split our data into train, test and cross-validation sets, and check whether the distribution of the target values is the same in all three.
Why does the distribution need to be the same? If the class proportions match across the splits, the model encounters every class during training in the same proportion as in the full dataset, and the cross-validation and test scores are measured on the same mix of classes.
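One common way to keep the class proportions equal across splits is the stratify argument of train_test_split. The sketch below uses a toy, imbalanced Class column, and the 20% split sizes are an assumption rather than the values used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with an imbalanced 3-class target
df = pd.DataFrame({
    "ID": range(100),
    "Class": [1] * 60 + [2] * 30 + [3] * 10,
})
y = df["Class"]

# First carve out the test set, stratified on the target
train_cv_df, test_df, y_train_cv, y_test = train_test_split(
    df, y, stratify=y, test_size=0.2, random_state=42
)
# Then split the remainder into train and cross-validation sets
train_df, cv_df, y_train, y_cv = train_test_split(
    train_cv_df, y_train_cv, stratify=y_train_cv, test_size=0.2, random_state=42
)

# Compare the class proportions in the three splits
for name, labels in [("train", y_train), ("cv", y_cv), ("test", y_test)]:
    proportions = labels.value_counts(normalize=True).sort_index().round(2)
    print(name, proportions.to_dict())
```

The printed proportions should be close to 0.6 / 0.3 / 0.1 in every split, up to rounding on the small sets.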
Training
We will first train a random model so that we have a baseline against which to compare the performance of our other models.
How do we compute log-loss for a random model in a multi-class setting? For every point in our test and cross-validation data, we randomly generate as many numbers as there are classes (9 in our problem) and then normalize them so that they sum to one.
test_data_len = test_df.shape[0]
cv_data_len = cv_df.shape[0]

# Cross-validation error.
# Output array with one row per CV point and one column per class
cv_predicted_y = np.zeros((cv_data_len, 9))
for i in range(cv_data_len):
    # Draw 9 random numbers in [0, 1), one per class
    rand_probs = np.random.rand(1, 9)
    # Normalize so that the row sums to 1
    cv_predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Cross Validation Data using Random Model", log_loss(y_cv, cv_predicted_y))

# Test-set error.
# Output array with one row per test point and one column per class
test_predicted_y = np.zeros((test_data_len, 9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1, 9)
    test_predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Test Data using Random Model", log_loss(y_test, test_predicted_y))

# Predicted class is the column with the highest probability;
# add 1 because the class labels run from 1 to 9
predicted_y = np.argmax(test_predicted_y, axis=1)
# plot_confusion_matrix is a helper defined in the notebook
plot_confusion_matrix(y_test, predicted_y + 1)
In the code above, we first created an array of zeros with one row per data point and 9 columns (one per class label), filled each row with randomly generated probabilities, computed the log-loss, and plotted the confusion matrix.
We can see that our random model has a log-loss of about 2.4 on both the cross-validation and test data, so our models need to do better than this. Let’s check the precision and recall for this model.
How do we interpret the precision and recall matrices above?
Precision
1. Take cell (1x1) as an example. It has a value of 0.127, which means that of all the points predicted to be class 1, only 12.7% actually belong to class 1.
2. For original class 4 and predicted class 2, we can say that of all the points our model predicted to be class 2, 23.6% actually belong to class 4.
Recall
1. Cell (1x1) has a value of 0.079, which means that of all the points that actually belong to class 1, our model predicted only 7.9% to be class 1.
2. For original class 8 and predicted class 5, the value is 0.250, which means that of all the points that are actually class 8, our model predicted 25% to be class 5.
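Both matrices can be derived from the confusion matrix: dividing each column by its sum gives the precision matrix, and dividing each row by its sum gives the recall matrix. The sketch below uses made-up labels for three classes. It shows the idea only and is not the notebook's plot_confusion_matrix helper.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a 3-class problem
y_true = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 2])
y_pred = np.array([1, 2, 1, 2, 2, 3, 1, 3, 3, 3])

# Rows are original (true) classes, columns are predicted classes
C = confusion_matrix(y_true, y_pred)

# Precision matrix: each column sums to 1.
# Cell (i, j) = fraction of points predicted as class j that truly are class i.
# (A class that is never predicted would give a zero column sum; not handled here.)
precision_matrix = C / C.sum(axis=0, keepdims=True)

# Recall matrix: each row sums to 1.
# Cell (i, j) = fraction of points of true class i that were predicted as class j.
recall_matrix = C / C.sum(axis=1, keepdims=True)

print(C)
print(precision_matrix.round(3))
print(recall_matrix.round(3))
```

The diagonal of the precision matrix holds the per-class precision, and the diagonal of the recall matrix holds the per-class recall.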
We will now train our models after some exploratory data analysis and feature encoding, both of which you can check in my notebook. We trained multiple models, and Logistic Regression and Support Vector Machine stand out from the rest.
Logistic Regression
Support Vector Machine
Comparison of all the models
We can see that Logistic Regression and Support Vector Machine perform better than the others in terms of both log-loss and the percentage of misclassified points.
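For readers who want to try this kind of comparison themselves, here is a minimal sketch that scores two models on both metrics. It runs on synthetic data from make_classification, not the cancer dataset, and the hyperparameters are assumptions, so the numbers it prints say nothing about the results above. The linear SVM is wrapped in CalibratedClassifierCV so that it can output the probabilities that log-loss needs.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the encoded features
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # Hinge loss gives a linear SVM; calibration adds predict_proba
    "Linear SVM (calibrated)": CalibratedClassifierCV(
        SGDClassifier(loss="hinge", random_state=0), method="sigmoid", cv=3
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)
    loss = log_loss(y_test, probs, labels=model.classes_)
    # Percentage of test points whose predicted class is wrong
    misclassified = float(np.mean(model.predict(X_test) != y_test) * 100)
    results[name] = (loss, misclassified)
    print(f"{name}: log-loss={loss:.3f}, misclassified={misclassified:.1f}%")
```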
You can check my GitHub for complete code and data.