Redefining Cancer Treatment with Machine Learning

Akshay Patel
Published in Analytics Vidhya
6 min read · Dec 14, 2020

Photo by - National Cancer Institute

Introduction

Over the past few decades, cancer treatment has evolved continuously. Scientists have applied different techniques to detect types of cancer before they cause symptoms. Recent years have seen many breakthroughs in medicine, and a large amount of data has become available to medical researchers. With more data available, researchers have used machine learning to identify hidden patterns in complex data and to try to predict future outcomes for a given cancer type.

Given the significance of personalized medicine and the growing use of ML techniques, we will try to solve one such problem: distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).

Currently, this interpretation of genetic mutations is done manually. It is a very time-consuming task in which a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature. We need to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

Data Overview

Given Gene, Variation and Text as features, we need to predict the Class variable (the target variable). It’s a multi-class classification problem, and we will measure the performance of our models with the multi-class log-loss metric.
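To make the metric concrete, here is a minimal sketch of multi-class log-loss written with plain numpy. The function name, the 1-based label convention and the toy inputs are assumptions made for this illustration; later in the article we simply call sklearn's `log_loss`.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class labels in 1..K (convention assumed for this sketch)
    probs:  array of shape (N, K); row i holds the predicted class probabilities
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    probs = probs / probs.sum(axis=1, keepdims=True)           # renormalize after clipping
    rows = np.arange(len(y_true))
    true_class_probs = probs[rows, np.asarray(y_true) - 1]     # shift labels to 0-based columns
    return float(-np.mean(np.log(true_class_probs)))

# A model that always predicts the uniform distribution over 9 classes
uniform = np.full((4, 9), 1 / 9)
labels = [1, 5, 9, 3]
uniform_loss = multiclass_log_loss(labels, uniform)  # equals ln(9), roughly 2.197

# A model that puts all of its probability on the correct class
perfect_loss = multiclass_log_loss([1, 2], np.eye(9)[:2])
```

Lower is better: a perfect model scores close to 0, while a model that spreads its probability evenly over 9 classes scores ln(9).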

Roadmap Ahead

We will read the data, perform text preprocessing, split the data into train, test and cross-validation sets, train a random model, train different ML models, compute the log-loss and the percentage of misclassified points for each, and then compare them to find the best model.

Chaliye Shuru Karte(Let’s start coding!!)

import pandas as pd
import matplotlib.pyplot as plt
import re
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier,LogisticRegression
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")
from mlxtend.classifier import StackingClassifier

We start by importing the libraries we need: pandas and numpy for data manipulation, matplotlib and seaborn for plotting, and sklearn for model building.

Reading the data

Our data is stored in two files with different separators, so we will read each file separately and then merge the two on the “ID” column.
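A minimal sketch of this step is shown below. The separators (a comma for the variants file, "||" for the text file), the column names and the helper name `load_dataset` are assumptions for illustration, and small in-memory strings stand in for the real files.

```python
import io
import pandas as pd

def load_dataset(variants_file, text_file):
    # Variants file: assumed comma-separated with columns ID, Gene, Variation, Class
    variants = pd.read_csv(variants_file)
    # Text file: assumed to use "||" between ID and Text, with one header line to skip
    text = pd.read_csv(text_file, sep=r"\|\|", engine="python",
                       names=["ID", "Text"], skiprows=1)
    # Left merge keeps every variant even if its text is missing
    return pd.merge(variants, text, on="ID", how="left")

# Toy stand-ins for the two files (made-up rows, not the real data)
variants_csv = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENEA,Truncating Mutations,1\n"
    "1,GENEB,W802*,2\n"
)
text_csv = io.StringIO(
    "ID,Text\n"
    "0||First toy abstract about a truncating mutation.\n"
    "1||Second toy abstract about a point mutation.\n"
)
data = load_dataset(variants_csv, text_csv)
print(data.head())
```

A multi-character separator such as "||" is treated as a regular expression by pandas, which is why the pipes are escaped and the Python parsing engine is requested.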

Text-Preprocessing and Feature Engineering

After reading the data, we preprocess the text: removing stop words and any special characters, normalizing the text, and converting all words to lowercase. During this process we found that some rows have no text, so we replace those NaN values with the Gene + Variation values.
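The cleaning steps could look roughly like the sketch below. The tiny stop-word set, the regular expressions and the toy rows are assumptions for illustration; the article imports nltk's English stop-word list, which is swapped out here so the example runs without a download.

```python
import re
import pandas as pd

# Tiny illustrative stop-word set (stand-in for nltk's English list)
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to", "was"}

def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", str(text))  # replace special characters with spaces
    text = re.sub(r"\s+", " ", text).lower().strip()   # collapse whitespace and lowercase
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

df = pd.DataFrame({
    "Gene": ["BRAF", "BRCA1"],
    "Variation": ["V600E", "Deletion"],
    "Text": ["The mutation IS found in 30% of tumours!", None],
})

# Rows without text fall back to "Gene Variation"
missing = df["Text"].isna()
df.loc[missing, "Text"] = df.loc[missing, "Gene"] + " " + df.loc[missing, "Variation"]
df["Text"] = df["Text"].apply(clean_text)
print(df["Text"].tolist())
```

Filling the missing text before cleaning means the fallback values go through exactly the same lowercase and stop-word steps as every other row.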

Splitting the Data into Train, Test and Cross-Validation

We will now split our data into train, test and cross-validation sets and check whether the distribution of the target values is the same in all three.

Why does the distribution need to be the same? The distribution of our target value should match across the splits so that during training our model encounters all the class values in the proportions present in our dataset.
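One way to get matching distributions is a stratified split. The sketch below assumes an 80/20 split applied twice and uses toy labels; the split ratios used in the notebook are not stated here, so treat these numbers as placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice([1, 2, 3], size=300, p=[0.6, 0.3, 0.1])  # toy imbalanced labels
X = np.arange(len(y)).reshape(-1, 1)                     # placeholder features

# stratify=... keeps the class proportions the same in every split
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_full, y_train_full, test_size=0.2,
    stratify=y_train_full, random_state=0)

def class_shares(labels):
    """Fraction of points in each class, rounded for display."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).round(2).tolist()))

for name, labels in [("train", y_train), ("cv", y_cv), ("test", y_test)]:
    print(name, class_shares(labels))
```

The printed shares should be nearly identical across the three sets, which is the same check the distribution plot performs visually.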

Distribution of target variable

Training

We will first train a random model as a baseline against which we can compare the performance of our other models.

How do we compute the log-loss for a random model in a multi-class setting? For every point in our test and cross-validation data, we randomly generate as many numbers as there are classes (9 in our problem) and then normalize them so that they sum to one.

test_data_len = test_df.shape[0]
cv_data_len = cv_df.shape[0]

# Output array with one row per CV point and one column per class (9 classes)
cv_predicted_y = np.zeros((cv_data_len, 9))
for i in range(cv_data_len):  # iterate over each row of the CV data
    rand_probs = np.random.rand(1, 9)  # 9 random numbers drawn uniformly from [0, 1)
    cv_predicted_y[i] = (rand_probs / rand_probs.sum())[0]  # normalize so the row sums to 1
print("Log loss on Cross Validation Data using Random Model", log_loss(y_cv, cv_predicted_y))

# Test-set error: output array with the same number of rows as the test data
test_predicted_y = np.zeros((test_data_len, 9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1, 9)
    test_predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Test Data using Random Model", log_loss(y_test, test_predicted_y))

# argmax returns a 0-based column index, so add 1 to get class labels 1..9
predicted_y = np.argmax(test_predicted_y, axis=1)
# plot_confusion_matrix is not defined in this excerpt (presumably a helper from the notebook)
plot_confusion_matrix(y_test, predicted_y + 1)

In the code above, we first created a zero-filled array with 9 columns (one per class label) for each data point, then randomly generated probabilities for each class label, computed the log-loss, and plotted the confusion matrix.

Confusion Matrix with log-loss of 2.4

We can see that our random model has a log-loss of 2.4 on both the cross-validation and test data, so our models need to do better than this. Let’s check the precision and recall for this model.

Precision and Recall for random model

How do we interpret the precision and recall matrices above?

Precision
1. Take cell (1, 1) as an example: it has a value of 0.127, which says that of all the points predicted to be class 1, only 12.7% are actually class 1.

2. For original class 4 and predicted class 2, of all the points that our model predicted to be class 2, 23.6% actually belong to class 4.

Recall

1. Cell (1, 1) has a value of 0.079, which means that of all the points that actually belong to class 1, our model predicted only 7.9% to be class 1.

2. For original class 8 and predicted class 5, the value is 0.250, which means that of all the points that are actually class 8, our model predicted 25% to be class 5.
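Both matrices come from the same confusion matrix, normalized in two different ways. The sketch below uses a made-up 3-class confusion matrix (rows are the original class, columns the predicted class) purely to show the arithmetic; the variable names are ours.

```python
import numpy as np

# Toy 3-class confusion matrix: rows = original class, columns = predicted class
C = np.array([[5, 3, 2],
              [2, 6, 2],
              [1, 1, 8]])

# Precision matrix: divide each column by its sum (all points predicted as that class)
precision_matrix = C / C.sum(axis=0, keepdims=True)
# Recall matrix: divide each row by its sum (all points that truly belong to that class)
recall_matrix = C / C.sum(axis=1, keepdims=True)

# Cell (1, 1) in the 1-based notation used above is index [0, 0] here
precision_class_1 = precision_matrix[0, 0]  # 5 / (5 + 2 + 1) = 0.625
recall_class_1 = recall_matrix[0, 0]        # 5 / (5 + 3 + 2) = 0.5
```

So every column of the precision matrix sums to one, and every row of the recall matrix sums to one, which is a quick sanity check when reading the heatmaps.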

Before training our models, we did some exploratory data analysis and feature encoding, which you can check in my notebook. We trained multiple models, and Logistic Regression and Support Vector Machine stand out from the rest.

Logistic Regression

Performance of Logistic Regression on Cross-Validation
Confusion Matrix of Logistic Regression Model
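As a rough illustration of how such a model is scored, the sketch below trains a `LogisticRegression` on TF-IDF features of a toy two-class corpus and computes the two numbers we compare models on. The corpus, the features and the hyperparameters are assumptions for this example and are not the configuration used in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Toy stand-in corpus (not the real data): two classes, a few short texts
train_text = ["kinase activating mutation", "activating mutation tumor growth",
              "silent neutral variant", "neutral variant no effect"]
y_train = [1, 1, 2, 2]
cv_text = ["activating kinase mutation", "neutral silent variant"]
y_cv = [1, 2]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)  # fit the vocabulary on training text only
X_cv = vectorizer.transform(cv_text)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

cv_probs = model.predict_proba(X_cv)  # one probability per class for each CV point
cv_log_loss = log_loss(y_cv, cv_probs, labels=model.classes_)
misclassified_pct = 100 * np.mean(model.predict(X_cv) != np.array(y_cv))
print("CV log-loss:", cv_log_loss, "| misclassified %:", misclassified_pct)
```

Log-loss uses the predicted probabilities, while the misclassified percentage only looks at the most likely class, so the two numbers can rank models differently.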

Support Vector Machine

Performance of SVM on Cross-Validation
Confusion Matrix of SVM Model

Comparison of all the models

We can see that Logistic Regression and Support Vector Machine perform better than the others in terms of both log-loss and percentage of misclassified points.

A life long learner and gamer, Akshay spends most of his time learning new skills and enhancing existing skills. A self-taught data scientist.