Building an ML Model to Predict Whether Cancer Is Benign or Malignant on the Breast Cancer Wisconsin Data Set !! Part 3

Shahid
Sep 15, 2020 · 7 min read

In Part 1 of this series, we took an in-depth look at the data health check-up and exploratory data analysis using visualization techniques. In Part 2, we covered different kinds of feature selection techniques, including the voted method.

Now, in Part 3, we will use the Wisconsin Cancer dataset to build seven different models for each feature selection method, and we will analyze and compare the performance of the models. Feature selection itself was already completed in Part 2.

Steps involved in building a machine learning model to predict whether the cancer is benign or malignant:

Step 1: Define Problem Statement
Step 2: Data Source
Step 3: Cleaning the Data
Step 4: Data Analysis and Exploration
Step 5: Feature Selection
Step 6: Data Modeling
Step 7: Model Validation
Step 8: Hyperparameter Tuning
Step 9: Deployment

In this Part 3, we will cover steps 6 and 7. We are using the Breast Cancer Wisconsin dataset, available on the UCI Machine Learning Repository.

Problem Statement

Our objective is to identify which features are most helpful in predicting malignant or benign cancer, and to classify whether a given breast cancer is benign or malignant.

Data Source

The first thing we need to do is understand the structure of the data. We already discussed the data health check-up and exploratory data analysis in Part 1 and Part 2 of this series. Here, we give only a brief description of the data.

  • There are 569 rows in total, each with 31 columns.
  • The first column is an ID that identifies each patient.
  • The second column is the diagnosis of the tumor, a categorical variable with two possible values: B means the tumor was found to be benign; M means it was found to be malignant.
  • There are no missing attribute values.
  • Out of the 569 patients in the dataset, the class distribution is benign: 357 (63%) and malignant: 212 (37%).
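
A minimal pandas sketch can confirm these numbers; the file name data.csv and the column name diagnosis are assumptions based on the common CSV export of this dataset:

```python
import pandas as pd

# Load the dataset; "data.csv" is a hypothetical local path to a CSV
# copy of the Wisconsin breast cancer data
df = pd.read_csv("data.csv")

print(df.shape)                         # expect (569, 31) per the list above
print(df["diagnosis"].value_counts())   # expect B: 357, M: 212
```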

This is useful information that allows us to draw some conclusions.

  • Our objective will be to train our models to predict whether a tumor is benign or malignant, based on the features selected in Part 2.
  • We will not make use of the first column, which holds the ID of the patient.
  • In binary classification scenarios, it's good to have a reasonable share of data from both classes. Our 63%-37% distribution is good enough.
  • The benign and malignant classes are identified with the characters B and M. We will change the values of the class column to hold a 0 instead of a B for benign cases and a 1 instead of an M for malignant cases, as sketched below.
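
A minimal encoding sketch, assuming the column names diagnosis and id from the CSV export above:

```python
# Map the class labels to integers: 0 for benign (B), 1 for malignant (M)
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})

# Drop the patient ID column, which carries no predictive information
df = df.drop(columns=["id"])
```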

Steps 6–7: Model Building and Validation

Derivation of the prediction model

In this work, the cohort of 569 patients is randomly divided into two parts: 70% (the training set) for developing a prediction model and the remaining 30% (the validation set) for validating the developed model. We will build seven different models (1. Logistic Regression, 2. Random Forest Classifier, 3. Gradient Boosting Classifier, 4. Extra Trees Classifier, 5. XGB Classifier, 6. KNeighbors Classifier, and 7. SVM Classifier) for each of the eight feature selection methods (1. Correlation, 2. Chi-square, 3. Recursive Feature Elimination (RFE), 4. Recursive Feature Elimination with Cross-validation (RFECV), 5. Random Forest, 6. Extra Trees, 7. L1-based, 8. Voted). Across all feature selections, a total of 56 models will be built and the best one selected; a sketch of the setup follows.
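
The sketch below sets up the 70/30 split and the seven classifiers. The hyperparameters are illustrative defaults rather than the article's exact settings, and for brevity X holds all features; in practice it would be restricted to each method's columns selected in Part 2:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              ExtraTreesClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

X = df.drop(columns=["diagnosis"])   # restrict to a Part 2 feature set in practice
y = df["diagnosis"]

# 70/30 stratified split keeps the 63%-37% class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# The seven classifiers compared in this part
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
    "KNeighbors": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=42),
}
```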

Model classifiers and selected features

[Figure: Model classifiers]
[Figure: Selected features]

Model evaluation

We use 70% of the data for training and the remaining 30% for testing. We trained our models using 5-fold cross-validation: the training data was first divided into 5 folds; four folds were used to train the model, and the remaining fold was used to assess model performance and generalizability.
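
A sketch of this evaluation loop, reusing models and the split from the snippet above:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data: four folds train the
# model and the held-out fold scores it, rotating over all five folds
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```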

Performance measures

We assess a set of performance measures, including recall, precision, and AUC, for each model. We use traditional classification performance measures based on the four values of the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). From these values we compute the positive predictive value (PPV), or precision, and the sensitivity, or recall, as in Equation (1) and Equation (2).

Precision (PPV) = TP / (TP + FP)        (1)

Recall (Sensitivity) = TP / (TP + FN)   (2)

In addition, the receiver operating characteristic (ROC) curve is graphed and the area under the ROC curve (AUC) is analyzed.
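
A sketch of these measures with scikit-learn, using one classifier from the earlier snippet as an example (the same code applies to all seven):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)

clf = models["Logistic Regression"].fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # P(malignant), used for ROC/AUC

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("precision (PPV):     ", precision_score(y_test, y_pred))  # TP / (TP + FP)
print("recall (sensitivity):", recall_score(y_test, y_pred))     # TP / (TP + FN)
print("AUC:                 ", roc_auc_score(y_test, y_prob))
fpr, tpr, _ = roc_curve(y_test, y_prob)    # points for plotting the ROC curve
```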

Training and testing performance for each feature selection method.

1. Feature selection: Correlation

[Figures: training and testing performance of the seven models on the correlation feature set]

2. Feature selection: Chi-square

[Figures: training and testing performance of the seven models on the chi-square feature set]

3. Feature selection: RFE

[Figures: training and testing performance of the seven models on the RFE feature set]

4. Feature selection: RFECV

[Figures: training and testing performance of the seven models on the RFECV feature set]

5. Feature selection: Random forest

[Figures: training and testing performance of the seven models on the random forest feature set]

6. Feature selection: Extra trees

[Figures: training and testing performance of the seven models on the extra trees feature set]

7. Feature selection: L1-based

[Figures: training and testing performance of the seven models on the L1-based feature set]

8. Feature selection: Voted

[Figures: training and testing performance of the seven models on the voted feature set]

Let’s compare the best selected models:

Training and testing performance for the best classifier models.

Here, we selected the best model from each feature selection method. From the table below, it is clear that the best classifier is logistic regression with the random forest feature selection method.

Logistic regression

Training performance : Accuracy = 0.977, AUC = 0.995

Testing performance : Accuracy = 0.977, AUC = 0.971

[Figures: performance of the best classifier models]
[Figure: Confusion matrix]
[Figure: Model: logistic regression, feature selection: random forest]

Conclusions

Thus, the most significant features for predicting whether a cancer patient's tumor is malignant or benign, as identified by the algorithms, are texture_mean, area_mean, concavity_mean, area_se, concavity_se, fractal_dimension_se, smoothness_worst, concavity_worst, symmetry_worst, and fractal_dimension_worst.
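
As a closing sketch, the winning combination can be reproduced roughly as follows, assuming the split and data from the earlier snippets; the feature names match the list above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# The ten features chosen by the random forest selection method in Part 2
rf_features = ["texture_mean", "area_mean", "concavity_mean", "area_se",
               "concavity_se", "fractal_dimension_se", "smoothness_worst",
               "concavity_worst", "symmetry_worst",
               "fractal_dimension_worst"]

final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train[rf_features], y_train)

y_pred = final_model.predict(X_test[rf_features])
y_prob = final_model.predict_proba(X_test[rf_features])[:, 1]
print("test accuracy:", accuracy_score(y_test, y_pred))   # article reports 0.977
print("test AUC:", roc_auc_score(y_test, y_prob))         # article reports 0.971
```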

To obtain the best results from the predictive model, many different models are trained, optimized, and evaluated using the eight sets of features from Part 2. During this process the feature set itself is culled using model-specific methods. Each model and subset of features is evaluated on accuracy, AUC, and sensitivity using 5-fold cross-validation. The best results are obtained with logistic regression on the random forest feature set, culled to 10 features. The table above shows the performance measures of the classification techniques. Logistic regression achieved an accuracy of 0.977 and an AUC of 0.971 on the test data.

End Notes

This ends our Part 3 on model building and evaluation. The aim of this part was to provide an in-depth, step-by-step guide to using different kinds of machine learning techniques.

Personally, I enjoyed writing this and would love to learn from your feedback. Did you find this Part 3 useful? I would appreciate your suggestions and feedback. Please feel free to ask questions in the comments below.

We will explore steps 8 and 9, Hyperparameter Tuning and Deployment, in Part 4.

Stay tuned!

All the code and datasets used in this article can be accessed from my GitHub.

The code is also available as a Jupyter notebook.
