Personalized Cancer Diagnosis using Machine Learning

Exploratory Data Analysis and diverse machine learning models applied to diagnose cancerous tumors.

TULASI RAM LAGHUMAVARAPU

Published in

Analytics Vidhya

11 min read · Apr 3, 2018

If you are more interested in the code, you can jump directly to this repository.

What is cancer, basically?

Our body is made up of trillions of cells, which are constantly dying and regenerating. Normally, a cell divides and makes a perfect copy of itself using a genetic blueprint called DNA. Once in a while, the DNA blueprint gets damaged, so that the cell stops listening to the body’s signals and keeps on dividing, forming a tumor.

What is the business problem we need to solve?

Let us briefly discuss the data and the business problem, since it is important to understand the problem we are solving.

When a patient seems to have cancer, we take a tumor sample from the patient and sequence its DNA. Once sequenced, a tumor can have thousands of genetic mutations. Briefly, a ‘mutation’ is a small change in a gene, and some of these changes drive cancer. Every gene also has a variation associated with it. Given the gene and its variation, we have to predict which class it belongs to (there are 9 classes in total). Only some of the classes correspond to cancer.

Let us go through the workflow more clearly.

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

1. A molecular pathologist selects a list of genetic variations of interest that he/she wants to analyze.
2. The molecular pathologist searches the medical literature for evidence that is relevant to the genetic variations of interest.
3. Finally, the molecular pathologist spends a huge amount of time analyzing the evidence related to each of the variations in order to classify them.

Steps 1 and 2 can be done easily and quickly, but Step 3 is very time-consuming. Our goal is to replace Step 3 with a machine learning model.

So, the problem statement is to classify genetic variations based on evidence from the text-based clinical literature or research papers.

As we said, we have to classify genetic variations, so this is a classification problem. As there are 9 classes, it is a multi-class classification problem.

Knowing the business constraints of the problem is essential. If we don’t know the business constraints, we may train models that cannot be put into production.

BUSINESS CONSTRAINTS OF THIS PROBLEM:

Interpretability of the algorithm is a must, because a cancer specialist should understand why the model assigned a particular class so that he/she can explain it to the patient.
There is no low-latency requirement, which means the patient can wait for the results. Because of this, we can apply complex machine learning models.
Errors are very costly.
We need the probability of a point belonging to each class, rather than just the predicted class.

MAPPING REAL WORLD/BUSINESS PROBLEM TO MACHINE LEARNING PROBLEM:

As already mentioned, there are 9 classes, so this is a multi-class classification problem.
Performance metrics: multi-class log loss and the confusion matrix (log loss is chosen because it works directly on predicted probabilities, which is one of our business constraints).

Machine Learning Objective: Predict the probability of each data point belonging to each of the 9 classes.
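To make the metric concrete, here is a minimal sketch of multi-class log loss in plain Python. The function name and the clipping constant `eps` are illustrative choices, not taken from the original code; labels are assumed to run from 1 to 9 as in this problem.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class labels in 1..n_classes
    probs:  list of rows; probs[i][k] is the predicted probability of class k + 1
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip the probability so that log(0) can never occur.
        p = min(max(row[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# A model that always predicts 1/9 for every class scores log(9), about 2.197.
uniform = [[1 / 9] * 9 for _ in range(3)]
print(multiclass_log_loss([1, 4, 7], uniform))
```

A confident wrong prediction is penalised heavily, which fits the constraint that errors are very costly.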

Machine Learning Constraints: These are the same as the business constraints mentioned above.

After reading the data, we preprocessed the text by removing stop words, converting to lower case, removing punctuation, and so on. Preprocessing is a very important stage.
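The preprocessing steps above could look roughly like the following sketch. The stop-word list here is a tiny hypothetical one for illustration; a real run would use a fuller list from a text-processing library, and the exact cleaning rules in the original code may differ.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the article).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "was"}

def preprocess(text):
    """Lower-case, replace non-alphanumeric characters with spaces, drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(preprocess("The BRCA1 mutation is found in breast-cancer cells."))
# brca1 mutation found breast cancer cells
```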

Splitting the data: As the data is not temporal in nature (it does not change with time), we can split it randomly into training, cross-validation and test sets. After splitting, we found that the training and test sets have very similar class distributions, and the distributions make it clear that the data is imbalanced.
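One way to get a random split whose parts keep similar class distributions is to split each class separately. The sketch below assumes this stratified approach and hypothetical 64/16/20 proportions; the article only states that the split was random.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.64, 0.16, 0.20), seed=42):
    """Return (train, cv, test) index lists with similar class proportions.

    The third fraction is implicit: whatever is left after train and cv goes to test.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, cv, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        n_train = int(round(fractions[0] * len(indices)))
        n_cv = int(round(fractions[1] * len(indices)))
        train += indices[:n_train]
        cv += indices[n_train:n_train + n_cv]
        test += indices[n_train + n_cv:]
    return train, cv, test

labels = [7] * 50 + [1] * 25  # toy, imbalanced labels
train, cv, test = stratified_split(labels)
print(len(train), len(cv), len(test))
```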

Log loss ranges from 0 to infinity, so a value on its own is hard to judge. We therefore first define a random model as a reference: if our ML model has a lower log loss than the random model, it is at least learning something useful. On our data the random model gives a log loss of roughly 2.5.
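A random baseline of this kind can be sketched as follows, assuming the random model draws 9 uniform random numbers per point and normalises them into probabilities. This construction is an assumption; the article does not spell out how its random model was built.

```python
import math
import random

def random_model_log_loss(y_true, n_classes=9, seed=0):
    """Log loss of a model that outputs random, normalised class probabilities."""
    rng = random.Random(seed)
    total = 0.0
    for label in y_true:
        raw = [rng.random() for _ in range(n_classes)]
        # Normalise so the row sums to 1, then read off the true class.
        p_true = max(raw[label - 1] / sum(raw), 1e-15)
        total += -math.log(p_true)
    return total / len(y_true)

label_rng = random.Random(1)
toy_labels = [label_rng.randint(1, 9) for _ in range(5000)]  # toy labels, not the real data
print(random_model_log_loss(toy_labels))
```

Any real model should come in well below this reference value on the same data.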

We also checked the precision and recall matrices, whose diagonal elements (the precision and recall of each class) are very low, as expected for a random model. Precision and recall matrices are attached below. From the above distributions it is also clear that classes 1, 2, 4 and 7 are the majority classes.
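As a sketch of how such matrices can be derived, the code below builds a confusion matrix and normalises it by rows (recall) and by columns (precision). The function names are illustrative, and labels are assumed to start at 1.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """c[i][j] = number of points of true class i + 1 predicted as class j + 1."""
    c = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        c[t - 1][p - 1] += 1
    return c

def recall_matrix(c):
    """Divide each row by its sum; the diagonal holds per-class recall."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in c]

def precision_matrix(c):
    """Divide each column by its sum; the diagonal holds per-class precision."""
    col_sums = [sum(col) for col in zip(*c)]
    return [[v / s if s else 0.0 for v, s in zip(row, col_sums)] for row in c]

c = confusion_matrix([1, 1, 2, 2, 3], [1, 2, 2, 2, 1], n_classes=3)
print(recall_matrix(c))
print(precision_matrix(c))
```

Off-diagonal entries show which classes get confused with each other, which is how observations such as "class 2 is often predicted as class 7" can be read off.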

UNIVARIATE ANALYSIS:

We take each feature in turn and check in various ways whether it is useful for predicting the class label. If it is useful we keep it; if not, we can simply remove it.

1. Gene Feature

Gene is a categorical feature. There are 235 unique genes, and the 50 most frequent genes account for nearly 75 percent of the data.
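A coverage figure like this can be computed with a simple frequency count. The sketch below uses a toy list of gene names rather than the real data, and the function name is hypothetical.

```python
from collections import Counter

def top_k_coverage(values, k):
    """Fraction of all rows covered by the k most frequent categories."""
    counts = Counter(values)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(values)

toy_genes = ["BRCA1"] * 6 + ["TP53"] * 3 + ["EGFR"]  # toy data only
print(top_k_coverage(toy_genes, 1))  # 0.6
print(top_k_coverage(toy_genes, 2))  # 0.9
```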

Now we featurize the gene by one-hot encoding and by response coding. Then we build a simple Logistic Regression model with a Calibrated Classifier, using only the gene feature and the class labels. The train, CV and test log-loss values are roughly the same, and all of them are below 2.5 (the random model’s value). Hence we can say that gene is an important feature for our classification. We can also conclude that gene is a stable feature, because the CV and test errors are roughly equal to the train error.
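Response coding replaces a category with a 9-dimensional vector of estimated class probabilities for that category. Here is a minimal sketch, assuming additive (Laplace) smoothing with a hypothetical parameter `alpha` and a uniform vector for categories never seen in training; the smoothing used in the original code may differ.

```python
from collections import Counter, defaultdict

def fit_response_coding(values, labels, n_classes=9, alpha=1.0):
    """Map each category to smoothed estimates of P(class = k | category)."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    table = {}
    for value, class_counts in counts.items():
        total = sum(class_counts.values())
        table[value] = [
            (class_counts[k] + alpha) / (total + alpha * n_classes)
            for k in range(1, n_classes + 1)
        ]
    return table

def transform_response_coding(values, table, n_classes=9):
    """Encode values; categories unseen during fitting get a uniform vector."""
    uniform = [1.0 / n_classes] * n_classes
    return [table.get(value, uniform) for value in values]

# Fit on training data only, then reuse the table for CV and test data.
table = fit_response_coding(["BRCA1", "BRCA1", "TP53"], [1, 1, 4])
print(transform_response_coding(["BRCA1", "KRAS"], table))
```

Fitting the table on training data only matters: building it from CV or test labels would leak the target into the features.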

2. Variation Feature

Variation is also a categorical feature. We observed 1927 unique variations among the 2124 points in the training data, which means most variations occur only once or twice.

CDF of variations looks as follows:

The cumulative distribution is almost a straight line, which means most variations occur once or twice in the training data. We featurize the variation by one-hot encoding and by response coding. As we did for the gene feature, we build a simple LR model on it and find that the train, CV and test log-loss values are lower than the random model’s.

But the gap between the train log loss and the CV and test log loss is considerably larger than for the gene feature, which means the variation feature is less stable. Since its log loss is still below the random model’s, we keep the variation feature but use it with care.

3. Text Feature

The text data contains a total of 53,000 unique words in the training set. Most words occur only a few times, which is common in text data. We convert the text into vectors using bag of words (BoW, treated here as one-hot encoding) and response coding.

As in the previous cases, we fit a simple LR model, and the train, CV and test log-loss values are lower than the random model’s. From the distributions of the CV and test data, we find that the text feature is a stable feature.

Now we combine all the features in two ways:

One-hot encoding: the dimensionality is 55,517, most of which comes from the text data.
Response coding: the dimensionality is 27 (each of the 3 features contributes 9 dimensions).

BASELINE MODEL

APPROACH 1:

1. NAIVE BAYES:

For text data, Naive Bayes (NB) is a standard baseline model. We fit the model on the training data and use the CV data to find the best hyper-parameter (alpha).

With the best alpha we fit the model and evaluate it on the test data. The log-loss value is 1.27, well below the random model’s. The percentage of misclassified points is 39.8. We also looked at the predicted class probabilities for individual points and interpreted each prediction, to understand why the model picks a particular class. We conclude that for misclassified points, the probability assigned to the predicted class is very low. The precision and recall matrices show that most points from class 2 are predicted as class 7. Similarly, most points from class 1 are predicted as class 4.
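The alpha search described above can be sketched with a small hand-written multinomial Naive Bayes. This is an illustration under stated assumptions, not the original code: the documents are toy token lists, the function names are hypothetical, and alpha must be greater than 0.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha):
    """Multinomial Naive Bayes with additive smoothing alpha (alpha > 0)."""
    vocab = {w for d in docs for w in d}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {"vocab": vocab, "classes": sorted(class_docs), "prior": {}, "lik": {}}
    for y in model["classes"]:
        model["prior"][y] = math.log(class_docs[y] / len(docs))
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        model["lik"][y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return model

def predict_proba(model, doc):
    """Class probabilities via a numerically stable softmax over log scores."""
    scores = {}
    for y in model["classes"]:
        scores[y] = model["prior"][y] + sum(
            model["lik"][y][w] for w in doc if w in model["vocab"]
        )
    top = max(scores.values())
    exp = {y: math.exp(s - top) for y, s in scores.items()}
    norm = sum(exp.values())
    return {y: e / norm for y, e in exp.items()}

def nb_log_loss(model, docs, labels):
    losses = [-math.log(max(predict_proba(model, d).get(y, 0.0), 1e-15))
              for d, y in zip(docs, labels)]
    return sum(losses) / len(losses)

def best_alpha(train_docs, train_y, cv_docs, cv_y, alphas):
    """Pick the alpha with the lowest log loss on the CV set."""
    return min(alphas, key=lambda a: nb_log_loss(train_nb(train_docs, train_y, a), cv_docs, cv_y))

train_docs = [["loss", "function", "truncation"], ["truncation", "loss"],
              ["gain", "activation", "kinase"], ["kinase", "activation"]]
train_y = [1, 1, 7, 7]
cv_docs, cv_y = [["loss", "truncation"], ["activation", "gain"]], [1, 7]
print(best_alpha(train_docs, train_y, cv_docs, cv_y, [0.01, 0.1, 1, 10]))
```

The same pattern (fit on train, score on CV, keep the best setting, report on test) applies to the other hyper-parameters tuned in this post.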

APPROACH 2:

2. K Nearest Neighbors:

The k-NN model is not interpretable (interpretability is one of our business constraints), but we still try it just to see what log loss it reaches. Since k-NN suffers from the curse of dimensionality, we use response coding instead of one-hot encoding. We use the CV data to find the best hyper-parameter (k).

With the best k we fit the model and evaluate it on the test data. The log-loss value is 1.002, which is lower than the NB model’s. But the percentage of misclassified points is 39.47 (almost equal to the NB model). With k-NN, too, most points from class 2 are predicted as class 7, and most points from class 1 are predicted as class 4.

APPROACH 3:

3. LOGISTIC REGRESSION:

As we have already seen, the LR model worked very well in the univariate analysis. So we analyzed LR thoroughly, both with and without class balancing.

With Class Balancing: LR works well with high-dimensional data and is also interpretable. We oversampled the points of the minority classes, fit the model on the training data, and used the CV data to find the best hyper-parameter (lambda).

With the best lambda we fit the model and evaluate it on the test data. The log-loss value is 1.048 (close to k-NN), but the percentage of misclassified points is 34.77, lower than for NB and k-NN. As LR is interpretable and misclassifies fewer points than k-NN and NB, it is the better choice.

Without class balancing, both the log loss and the percentage of misclassified points increase. Therefore, we use class balancing.
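One common way to balance classes is to weight each class inversely to its frequency, which has a similar effect to oversampling the minority classes. The sketch below assumes the heuristic n_samples / (n_classes × class_count), the one scikit-learn uses for its "balanced" class weights; whether the original code used weights or literal oversampling is not clear from the text.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

toy_labels = [7] * 6 + [1] * 3 + [8] * 1  # toy, imbalanced labels
print(balanced_class_weights(toy_labels))
```

Rare classes receive larger weights, so a misclassified minority point contributes more to the training loss.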

APPROACH 4:

4. SVM:

We use a linear SVM (with class balancing) because it is interpretable and works very well with high-dimensional data. An RBF-kernel SVM is not interpretable, so we cannot use it. We fit the model on the training data and use the CV data to find the best hyper-parameter (C).

With the best C we fit the model and evaluate it on the test data. The log-loss value is 1.06 (close to LR), well below the random model’s. The percentage of misclassified points is 36.47 (more than LR). Since we used class balancing, we got good performance on the minority classes.

APPROACH 5:

5. RANDOM FOREST:

5.1) One-hot encoding: Normally a Decision Tree works well with low-dimensional data. It is also interpretable. By varying the number of base learners and the max depth of the Random Forest Classifier, we get the best values of base learners = 2000 and max depth = 10. We then fit the model with these hyper-parameters and evaluate it on the test data. The resulting log-loss value is 1.097 (close to LR) and the percentage of misclassified points is 36.84 (more than LR).

5.2) Response Coding: By varying the number of base learners and the max depth of the Random Forest Classifier, we find the best values of base learners = 100 and max depth = 5. We then fit the model with these hyper-parameters and find that the train log loss is 0.052 while the CV log loss is 1.325, which shows that the model overfits even with the best hyper-parameters. That is why we don’t use RF + Response Coding.

APPROACH 6:

6. STACKING CLASSIFIER:

We stacked three classifiers (LR, SVM and NB) and kept LR as the meta-classifier. We fit the model on the training data and used the CV data to find the best hyper-parameter. With the best hyper-parameter we fit the model and evaluate it on the test data. The log-loss value is 1.08, well below the random model’s. The percentage of misclassified points is 36.2. Even though we used a more complex model, the results are nearly the same as LR. Additionally, the stacking classifier is not interpretable.

From the above table it is clear that for RF (Response Coding) there is a drastic difference between the train and CV log loss (nearly 20 times). This means that the model is overfitted, so we discard it. For the Stacking Classifier (an ensemble), the log-loss values are nearly the same as for LR + Balancing. From the above table, LR + Balancing best suits our business or real-world constraints: it is interpretable and has a better log loss than the other models.

APPROACH 7:

After this, I also tried the TF-IDF Vectorizer (for one-hot encoding of the features) instead of CountVectorizer, and the results are pretty good.

We can see that the results with the TF-IDF Vectorizer are slightly better than with CountVectorizer. Here, too, Logistic Regression works well compared to the other algorithms, and Random Forest with response coding is overfitted, as there is a huge difference between the train and CV loss.

Let’s look at one more interesting analysis:

APPROACH 8:

I selected the top 1000 words from the text data (excluding gene and variation) based on the TF-IDF score of each word, and removed from the text all words that are not among those top 1000. Interestingly, I got better results than in the analysis above. Let’s look at the results.
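A possible way to do this kind of vocabulary pruning is sketched below. It assumes each word is scored by its TF-IDF summed over all documents, with a smoothed IDF of ln((1 + n) / (1 + df)) + 1; the exact scoring used for the original results is not stated, so treat both choices as assumptions. The documents are toy token lists.

```python
import math
from collections import Counter

def top_k_tfidf_words(docs, k):
    """Return the k words with the highest TF-IDF summed over all documents."""
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    scores = Counter()
    for d in docs:
        for word, count in Counter(d).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1  # smoothed idf
            scores[word] += (count / len(d)) * idf
    return {word for word, _ in scores.most_common(k)}

def keep_only(docs, vocab):
    """Drop every token that is not in the selected vocabulary."""
    return [[w for w in d if w in vocab] for d in docs]

toy_docs = [["kinase", "mutation", "mutation"],
            ["mutation", "deletion"],
            ["mutation", "kinase", "kinase"]]
vocab = top_k_tfidf_words(toy_docs, k=2)
print(keep_only(toy_docs, vocab))
```

The vocabulary should be chosen on the training set only and then applied unchanged to the CV and test sets.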

We can see that the results are much improved. As before, Logistic Regression performs well compared to the other algorithms, and Random Forest with response coding is overfitted as usual.

As we can see, in all of the above analyses Logistic Regression outperforms the other algorithms. So I tried applying Logistic Regression to unigrams and bigrams instead of unigrams only. However, this did not improve the results.

APPROACH 9:

LOGISTIC REGRESSION (WITH UNIGRAMS AND BIGRAMS)

TRAIN LOG LOSS: 0.83

CV LOG LOSS: 1.17

TEST LOG LOSS: 1.19

PERCENTAGE OF MISCLASSIFIED POINTS: 40.7

Finally, after trying different approaches, the log loss dropped below 1. This was achieved by extracting the top 2000 TF-IDF features from 1–4 grams and applying Logistic Regression with class = balanced.
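For readers unfamiliar with n-grams, the sketch below shows how 1–4 grams can be generated from a token list. It is a plain-Python illustration with a hypothetical function name; a vectorizer library would normally do this step internally.

```python
def ngrams(tokens, n_min=1, n_max=4):
    """All contiguous word sequences of length n_min..n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams(["loss", "of", "function", "mutation"]))
```

A 4-token text yields 4 + 3 + 2 + 1 = 10 n-grams, so the feature space grows quickly, which is one reason to keep only the top-scoring features.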

APPROACH 10:

LOGISTIC REGRESSION (1–4 GRAMS), TOP 2000 TF-IDF FEATURES:

TRAIN LOG LOSS: 0.439

CV LOG LOSS: 0.957

TEST LOG LOSS: 0.982

I will keep updating this blog with new analysis.

For detailed code analysis, check this repo.

If you have any queries, contact tulasiram11729@gmail.com

Thank you for reading. Please share your comments and feedback below.