Cancer Diagnosis Using Machine Learning

Mohd Saquib
Published in The Wisdom
9 min read · May 11, 2020

What Is Cancer, and How Is It Caused?

Cancer can start anywhere in the body. It starts when cells grow out of control and crowd out normal cells. This makes it hard for the body to work the way it should.

Cancer is caused by changes or mutations to the DNA within cells. The DNA inside a cell is packaged into a large number of individual genes, each of which contains a set of instructions telling the cell what functions to perform, as well as how to grow and divide.

What is the Business Problem?

FIG 1 — PROCESS OF CANCER DIAGNOSIS IN LABS

Over the past several years, a lot has been said about how precision medicine, and genetic testing in particular, is going to disrupt the way diseases like cancer are treated.

But this is only partially happening, because of the huge amount of manual work still required.

As shown above (Fig 1), the manual diagnosis of cancer involves the following steps:

  1. First, the medical pathologist takes the sample of the tumor.
  2. Then the pathologist identifies the type of mutation/change shown by the genes present in the tumor.
  3. Then the mutation is matched with the text literature i.e. the previous research in Cancer.
  4. On the basis of the text, the pathologist classifies them. (In our data set the cancer is classified into 9 different classes).

This is a very time-consuming task.

So far we have seen how cancer forms, how it is diagnosed, and what problem we need to solve.

Let us dive into the real work, i.e. using Machine Learning to predict the class of cancer.

Business Constraints:

  1. The interpretability of the algorithm is a must because a cancer specialist should understand why the model has classified the sample into that particular class.
  2. There is no low-latency requirement: the patient can wait some time for the result, so we can apply more complex machine learning algorithms.
  3. Errors can be very costly.
  4. The probability of a sample belonging to each class is needed.

Dataset:

To train any machine learning model, we need reliable data. The data we are going to use for this case study is provided by Memorial Sloan Kettering Cancer Center (MSKCC) on kaggle.com.

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

Download training_variants.zip and training_text.zip from Kaggle.

Machine Learning Problem Mapping:

  1. We have to classify the sample into 9 classes, therefore it is a multi-class classification problem
  2. Performance Metric: Multi-Class Log-Loss, Confusion Matrix.

If you are not familiar with these performance metrics, go through these links:

Log-Loss: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html

Confusion Matrix: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

Objective: Predict the probability of each sample/data point belonging to each of 9 classes.
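
To make the metric concrete, here is a minimal sketch of multi-class log-loss in plain Python. The function name and the convention that labels run from 1 to 9 (so class k sits at index k - 1) are assumptions for illustration, not taken from the repository.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Average negative log of the probability given to the true class.

    y_true: class labels in 1..n_classes
    probs:  one list of class probabilities per sample
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(row[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# A model that always predicts 1/9 for every class scores ln(9).
uniform = [[1 / 9] * 9 for _ in range(3)]
loss_uniform = multiclass_log_loss([1, 4, 7], uniform)
```

A perfect model scores 0, and the penalty grows without bound as the probability assigned to the true class approaches 0, which is why confident mistakes are so costly under this metric.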

Constraints: Same as the business constraints.

Now that we understand the machine learning problem and objective, open the code in my GitHub repository to follow the steps below more easily:

Step 1: Reading the training variants and training text data that we downloaded earlier, using pandas.
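
A minimal sketch of this step is given below. It assumes that the variants file is an ordinary CSV with ID, Gene, Variation and Class columns, and that the text file separates ID and text with `||` after a one-line header; check both against the files you downloaded. The function name and the tiny in-memory files are made up for illustration.

```python
import io
import pandas as pd

def load_training_data(variants_file, text_file):
    """Read both files and join them on ID (file layout is an assumption)."""
    variants = pd.read_csv(variants_file)
    # "||" is a multi-character separator, so the python engine is needed.
    text = pd.read_csv(text_file, sep=r"\|\|", engine="python",
                       names=["ID", "TEXT"], skiprows=1)
    return pd.merge(variants, text, on="ID", how="left")

# Made-up stand-ins for the real files, only to show the expected shape.
variants_csv = io.StringIO(
    "ID,Gene,Variation,Class\n0,GENE_A,Truncating Mutations,1\n1,GENE_B,V100A,2\n"
)
text_csv = io.StringIO(
    "ID,Text\n0||first abstract, with a comma\n1||second abstract\n"
)
data = load_training_data(variants_csv, text_csv)
```

With the real data you would pass the paths of the unzipped files instead of the `io.StringIO` objects.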

Step 2: Preprocessing the data, i.e. removing stopwords, converting text to lower case, removing punctuation, etc. Preprocessing is a very important stage.
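
The cleaning described above could look roughly like this. The stopword set here is a tiny hand-written placeholder (in practice you would use a full list such as the one shipped with NLTK), and `clean_text` is a hypothetical helper name.

```python
import re

# Tiny illustrative stopword list; a real run would use a complete one.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "with"}

def clean_text(text):
    """Lower-case, strip punctuation, collapse spaces, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", str(text))  # punctuation -> space
    text = re.sub(r"\s+", " ", text).lower()          # collapse whitespace
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

cleaned = clean_text("The BRCA1 mutation, found in Exon-11, is pathogenic.")
# cleaned == "brca1 mutation found exon 11 pathogenic"
```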

Step 3: Splitting the data. We randomly split the data into training, cross-validation, and test sets.
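
One way to produce the three sets is to call scikit-learn's `train_test_split` twice. The 20% sizes, the seed, and the use of `stratify` (which keeps class proportions similar across the splits) are assumptions of this sketch, not values confirmed by the post.

```python
from sklearn.model_selection import train_test_split

def three_way_split(X, y, test_size=0.2, cv_size=0.2, seed=42):
    """Split into train / cross-validation / test, stratified by class."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    X_train, X_cv, y_train, y_cv = train_test_split(
        X_rest, y_rest, test_size=cv_size, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_cv, y_cv), (X_test, y_test)

# Toy data: 100 samples in three imbalanced classes.
X = list(range(100))
y = [1] * 50 + [2] * 30 + [7] * 20
train, cv, test = three_way_split(X, y)
```

With these settings the toy data ends up as 64 training, 16 cross-validation, and 20 test samples.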

Step 4: Plotting the distribution of y_i’s (Classes) in Train, Test, and Cross-Validation datasets.

FIG 2: DISTRIBUTION OF CLASSES

From the above plot, we can see that after splitting, the train, cross-validation, and test sets have very similar class distributions. The distribution also makes it clear that the data is imbalanced: classes 7, 4, 1, and 2 hold the majority of the samples.

Step 5: Random Model. Log-loss ranges from 0 to infinity, so a raw value is hard to judge without a reference point. We therefore define a random model as a baseline: an ML model is only useful if its log-loss is lower than the random model's. On our data the random model gives a log-loss of 2.5, so we need to find a model with a log-loss below 2.5.

We also checked the confusion, precision, and recall matrices. Their diagonal elements are very low, as expected for a random model, as shown below.
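
A random baseline of this kind can be sketched with NumPy as follows: draw nine random numbers per sample, normalise them so they sum to 1, and score them with log-loss. The function name, seed, and toy labels are assumptions; the exact value you get depends on the random draws, but it lands near the 2.5 quoted above.

```python
import numpy as np

def random_model_log_loss(y_true, n_classes=9, seed=0):
    """Log-loss of a model that outputs random, normalised probabilities."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    raw = rng.random((len(y_true), n_classes))
    probs = raw / raw.sum(axis=1, keepdims=True)   # each row sums to 1
    # Probability assigned to the true class (labels are 1..n_classes).
    picked = probs[np.arange(len(y_true)), y_true - 1]
    return float(-np.mean(np.log(picked)))

# Toy labels in 1..9, standing in for the real class column.
labels = np.random.default_rng(1).integers(1, 10, size=2000)
baseline = random_model_log_loss(labels)
```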

FIG 3: CONFUSION PRECISION & RECALL MATRIX

Step 6: Univariate Analysis: We take each feature in turn and check whether it is useful for predicting the class labels. If it is not useful, we can remove that feature.

  1. Gene Feature: Gene is a categorical feature. There are 235 unique genes, of which the 50 most frequent account for 75% of the data, as shown in the figure below (FIG 4).

FIG 4: CUMULATIVE DISTRIBUTION OF GENES

We converted the gene feature into a vector using both one-hot encoding and response coding. Then we trained a simple Logistic Regression model with a Calibrated Classifier on the gene feature and the class labels.
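
Response coding replaces each category with a vector of class probabilities estimated from the training data. The sketch below shows one common way to define it, with Laplace smoothing and a uniform fallback for unseen categories; the smoothing constants and the function names are assumptions and may differ from the code in the repository.

```python
from collections import Counter, defaultdict

def fit_response_coding(values, labels, n_classes=9, alpha=1):
    """Map each category to smoothed P(class | category), classes 1..n."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    table = {}
    for value, per_class in counts.items():
        total = sum(per_class.values())
        table[value] = [
            (per_class[k] + alpha) / (total + alpha * n_classes)
            for k in range(1, n_classes + 1)
        ]
    return table

def transform_response_coding(values, table, n_classes=9):
    """Look up each value; unseen categories get a uniform vector."""
    uniform = [1 / n_classes] * n_classes
    return [table.get(value, uniform) for value in values]

# Fit on training data only, then reuse the table for CV and test data.
table = fit_response_coding(["BRCA1", "BRCA1", "TP53"], [1, 1, 4])
encoded = transform_response_coding(["BRCA1", "EGFR"], table)
```

Each gene becomes a 9-dimensional vector regardless of how many unique genes exist, whereas one-hot encoding produces one column per unique gene.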

We found that the Train, Test, and CV log-loss values are close to each other and are all below 2.5, the log-loss of the random model.

Hence we can conclude that Gene is an important feature for our classification.
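
The check described above (a single one-hot encoded feature, Logistic Regression, calibration, then comparing log-loss to the baseline) can be sketched with scikit-learn as below. The data is synthetic with three classes and three made-up gene names, so the numbers say nothing about the real dataset; the point is only the shape of the pipeline.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: the gene agrees with the class most of the time.
rng = np.random.default_rng(0)
genes = np.array(["GENE_A", "GENE_B", "GENE_C"])
y = rng.integers(0, 3, size=300)
noisy = rng.integers(0, 3, size=300)
gene_idx = np.where(rng.random(300) < 0.8, y, noisy)
X_raw = genes[gene_idx].reshape(-1, 1)

X_train_raw, X_test_raw = X_raw[:200], X_raw[200:]
y_train, y_test = y[:200], y[200:]

# Fit the encoder on training data only; ignore genes unseen in training.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(X_train_raw)
X_test = encoder.transform(X_test_raw)

# Calibration turns the classifier scores into better probability estimates.
model = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                               method="sigmoid", cv=3)
model.fit(X_train, y_train)
test_loss = log_loss(y_test, model.predict_proba(X_test), labels=[0, 1, 2])
```

If `test_loss` is clearly below the baseline for the same number of classes, the feature carries signal and is worth keeping.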

2. Variation Feature

Variation is also a categorical feature. We observed 1,927 unique variations out of 2,124 in the training data.

The CDF of the variation feature is shown below (FIG 5):

FIG 5: CUMULATIVE DISTRIBUTION OF VARIATION

The cumulative distribution is almost a straight line, which means most variations occur only once or twice in the training data. As before, we converted the Variation feature into a vector using one-hot encoding and response coding, and then trained the LR model on the Variation feature and the class labels.

The Train, Test, and CV log-loss values are lower than that of the Random Model, so we keep the Variation feature as well.

3. Text Feature

The training text data contains a total of 53,000 unique words. We converted the text feature into a vector using Bag of Words (one-hot encoding) and response coding.

As before, we applied the simple LR model and calculated the Train, Test, and CV log-loss values. They are lower than that of the Random Model, so the text feature is also an important feature.

Step 7: Now that we know all our features are important, we combined them, once using one-hot encoding and once using response coding.
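
For the one-hot variant, combining the features amounts to stacking the three sparse matrices side by side. Below is a minimal sketch with scikit-learn and SciPy; the three rows are made up for illustration and are not taken from the dataset.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: one gene, one variation, and one text per sample.
genes = [["BRCA1"], ["TP53"], ["BRCA1"]]
variations = [["V1"], ["V2"], ["V3"]]
texts = ["brca1 truncating mutation", "tp53 missense mutation", "brca1 deletion"]

gene_ohe = OneHotEncoder(handle_unknown="ignore").fit_transform(genes)
var_ohe = OneHotEncoder(handle_unknown="ignore").fit_transform(variations)
text_bow = CountVectorizer().fit_transform(texts)

# Column-wise stack: each row keeps all three feature groups.
combined = hstack([gene_ohe, var_ohe, text_bow]).tocsr()
```

The encoders and the vectorizer should be fitted on the training split only and then applied with `transform` to the CV and test splits, so that all three sets share the same columns.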

Step 8: Applying Machine Learning Models

  1. Naive Bayes —

Naive Bayes is a standard baseline model for text data. We trained the model on the training data, used the CV data to find the best hyper-parameter (alpha), and fit the model with the best alpha. On the test data the log-loss is 1.20, which is well below that of the Random Model, and 40.2% of points are misclassified. We also looked at the predicted class probabilities and interpreted individual points to check why the model predicts a particular class.
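
The tuning loop described above can be sketched as follows: fit one model per candidate alpha on the training data and keep the alpha with the lowest CV log-loss. The helper name `pick_alpha`, the candidate values, and the synthetic count data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

def pick_alpha(X_train, y_train, X_cv, y_cv, alphas):
    """Return the alpha with the lowest CV log-loss, plus all losses."""
    labels = np.unique(y_train)
    losses = {}
    for alpha in alphas:
        clf = MultinomialNB(alpha=alpha).fit(X_train, y_train)
        losses[alpha] = log_loss(y_cv, clf.predict_proba(X_cv), labels=labels)
    best = min(losses, key=losses.get)
    return best, losses

# Synthetic word counts: each class favours a different column.
rng = np.random.default_rng(0)
rates = np.array([[6, 1, 1, 2], [1, 6, 1, 2], [1, 1, 6, 2]])
y_train = np.repeat([0, 1, 2], 40)
y_cv = np.repeat([0, 1, 2], 15)
X_train = rng.poisson(rates[y_train])
X_cv = rng.poisson(rates[y_cv])

best_alpha, cv_losses = pick_alpha(X_train, y_train, X_cv, y_cv,
                                   [1e-3, 1e-2, 0.1, 1, 10])
```

The same pattern applies to the other models in this step, with k, lambda, or the relevant hyperparameter in place of alpha.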

2. K Nearest Neighbors —

The KNN model is not interpretable (so it does not meet our business constraint), but we still use it to see what log-loss it achieves. Since k-NN suffers from the curse of dimensionality, we used response coding only. We trained the model on the training data, found the best hyperparameter (k) using cross-validation, fit the model with the best k, and applied it to the test data. The log-loss obtained is 1.05, which is lower than that of both the NB model and the Random Model. 36.3% of points are misclassified.

3. Logistic Regression —

We tried Logistic Regression both with and without class balancing:

a) With Class Balancing —

We oversampled the minority classes, i.e. those with fewer data points than the other classes, trained the model on the training data, and used the CV data to find the best hyper-parameter (lambda). After that, we fit the model with the best lambda and applied it to the test data. The log-loss is 1.09, i.e. close to the KNN model, and 35.3% of points are misclassified. Logistic Regression is also interpretable, and it misclassifies fewer points than the two models above.
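
Random oversampling can be sketched in a few lines of NumPy: every class is topped up with randomly repeated samples until it matches the largest class. This is one simple way to oversample, shown under the assumption that plain duplication is acceptable; the function name and toy data are made up.

```python
import numpy as np

def oversample_to_majority(X, y, seed=0):
    """Repeat random samples of each class until all classes are equal."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing samples with replacement from this class.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    order = np.concatenate(keep)
    return X[order], y[order]

# Toy imbalanced data: 6, 3 and 1 samples for classes 7, 4 and 9.
X = np.arange(10).reshape(-1, 1)
y = np.array([7] * 6 + [4] * 3 + [9])
X_bal, y_bal = oversample_to_majority(X, y)
```

Oversampling should be applied to the training split only; the CV and test splits keep their original class distribution so that the reported log-loss stays honest.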

b) Without Class Balancing —

Without class balancing, both the log-loss and the misclassification rate increased, so we kept only the class-balanced version.

4. Support Vector Machine (SVM) —

A linear SVM with class balancing is interpretable and works well with high-dimensional data. As before, we used the training, CV, and test data to calculate the log-loss and the misclassification rate. The log-loss is 1.10, i.e. close to Logistic Regression and well below the Random Model. 38.15% of points are misclassified, which is more than with LR.

5. Random Forest —

We vectorized our features using both one-hot encoding and response coding and applied the data to a Random Forest classifier, with 2000 base learners and max depth = 10 for one-hot encoding, and 100 base learners and max depth = 5 for response coding. With one-hot encoding, the test log-loss is 1.15 and 37.59% of points are misclassified.

With response coding, there is a huge difference between the training log-loss and the CV log-loss, which tells us that the model overfits even with the best hyperparameter. That is why we don’t use RF + response coding.

6. Stacking Classifiers —

We stacked three classifiers: Logistic Regression, SVM, and NB. We trained the stacked model on the training data and used the CV data to find the best hyperparameter. With the best hyperparameter, we fit the model and found a test log-loss of 1.16, with 36.6% of points misclassified. Even though this is a more complex model, the result is about the same as for Logistic Regression. The stacking model is also not interpretable.
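
As a rough illustration of stacking, the sketch below uses scikit-learn's `StackingClassifier` with the same three base models and a Logistic Regression meta-classifier. This is a stand-in chosen for the example: the repository may use a different stacking implementation, and the meta-classifier, the `cv=5` setting, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic non-negative counts so that MultinomialNB can be used.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 40)
rates = np.array([[5, 1, 1], [1, 5, 1], [1, 1, 5]])
X = rng.poisson(rates[y])

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC(dual=False)),
        ("nb", MultinomialNB()),
    ],
    # The meta-classifier learns from the base models' out-of-fold outputs.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
stack_probs = stack.predict_proba(X[:5])
```

Because the final prediction passes through several models, it is hard to trace back to individual words or genes, which is the interpretability problem mentioned above.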

7. Logistic Regression (with Unigrams and Bigrams) —

We also tried encoding the text with both unigrams and bigrams. We trained the model on the training data and used the CV data to find the best hyperparameter. After fitting the model, the test log-loss is 1.19 and 39.66% of points are misclassified.
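
In scikit-learn, unigrams and bigrams can be produced together by setting `ngram_range=(1, 2)` on the vectorizer. The two short documents below are made up to show what the resulting features look like.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["missense mutation detected", "truncating mutation detected"]

# (1, 2) keeps single words and adds every pair of adjacent words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_counts = vectorizer.fit_transform(docs)
features = sorted(vectorizer.vocabulary_)
```

Here `features` contains the four single words plus the three bigrams "missense mutation", "mutation detected" and "truncating mutation". On real text the number of bigrams grows very quickly, so the feature space becomes much larger than with unigrams alone.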

8. Feature Engineering —

We merged the gene and variation data into one list and applied the TF-IDF vectorizer on top of it. After fitting the model with the best hyperparameter, the log-loss is 1.06 and 36.4% of points are misclassified.

Step 9: Results

We made a comparison table using PrettyTable, as shown below:
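
One plausible reading of this step is to join each sample's gene and variation into a single string and vectorize that with TF-IDF. The sketch below follows that reading; how the repository actually merges the two columns is an assumption here, and the rows are illustrative rather than taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative rows, one gene and one variation per sample.
genes = ["BRCA1", "TP53", "BRCA1"]
variations = ["Truncating Mutations", "R175H", "Deletion"]

# Treat "gene variation" as a tiny document per sample.
gene_var = [f"{g} {v}" for g, v in zip(genes, variations)]

tfidf = TfidfVectorizer()
gene_var_features = tfidf.fit_transform(gene_var)
```

TF-IDF gives a lower weight to tokens that appear in many samples (here "brca1") than to tokens that appear in only one, so rare variations stand out more than frequent genes.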

TABLE 1: WITHOUT FEATURE ENGINEERING
TABLE 2: WITH FEATURE ENGINEERING

Finally, after trying different approaches, we brought the log-loss below 1: LR + FE + response encoding gives a log-loss of 0.95.

If you face any problem with the GitHub repository, you can access the code from the link below:

https://drive.google.com/open?id=1favs7i6Enb6jAMhkMo3oEGZkM24xq4kT

If you have any queries, mail me at mohdsaquib0998@gmail.com

If you find this blog helpful, please give a clap. :)
