Building an ML Model to Predict Whether a Cancer Is Benign or Malignant on the Breast Cancer Wisconsin Data Set (Part 3)
In Part 1 of this series, we performed a data health check-up and an in-depth exploratory data analysis using visualization techniques. In Part 2, we covered different feature selection techniques, including the voted method.
Now, in Part 3, we will use the Breast Cancer Wisconsin dataset to build seven different models for each feature selection method, and we will analyze and compare their performance. The feature selection itself was completed in Part 2.
Steps involved in building a machine learning model to predict whether a cancer is benign or malignant:
Step 1: Define Problem Statement
Step 2: Data Source
Step 3: Cleaning the Data
Step 4: Data Analysis and Exploration
Step 5: Feature Selection
Step 6: Data Modeling
Step 7: Model Validation
Step 8: Hyperparameter Tuning
Step 9: Deployment
In this Part 3, we will cover steps 6 and 7.
We are using the Breast Cancer Wisconsin dataset, available on the UCI Machine Learning Repository.
Our objective is to identify which features are most helpful in predicting malignant or benign cancer and to classify whether the breast cancer is benign or malignant.
- We used the publicly available Breast Cancer Wisconsin dataset, downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.
- The data, in CSV format, can be downloaded from this link
- At this GitHub link, you can access all the code and data of the project.
The first thing we need to do is understand the structure of the data. We have already discussed the data health check-up and exploratory data analysis in Part 1 and Part 2; here, we give only a brief description.
- There are 569 rows in total, each with 31 columns.
- The first column is an ID that identifies each patient.
- Diagnosis is a categorical variable.
- Missing attribute values: none
- The second column is the diagnosis of the tumor, with two possible values: B means the tumor was found to be benign; M means it was found to be malignant.
- Out of the 569 patients in the dataset, the class distribution is: Benign: 357 (63%) and Malignant: 212 (37%)
This information lets us draw some conclusions.
- Our objective will be to train our models to predict if a tumor is benign or malignant, based on the features selected in part 2.
- We will not make use of the first column that holds the ID of the patient.
- In binary classification scenarios, it is good to have a reasonable share of data from both classes. Our 63%-37% distribution is balanced enough.
- The benign and malignant classes are identified by the characters B and M. We will change the values of the class column to hold 0 instead of B for benign cases and 1 instead of M for malignant cases.
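The preprocessing above can be sketched in a few lines of pandas. This is a minimal illustration, not the article's actual script: the tiny inline frame stands in for the downloaded CSV (assumed to have columns named `id` and `diagnosis`), so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for pd.read_csv("data.csv") on the UCI/Kaggle file,
# which is assumed to carry an "id" column and a "diagnosis" column.
df = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 19.69],
})

df = df.drop(columns=["id"])                          # the patient ID carries no predictive signal
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})  # B -> 0 (benign), M -> 1 (malignant)
print(df["diagnosis"].tolist())  # -> [1, 1, 0]
```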
Steps 6 & 7: Model Building and Validation
Derivation of the prediction model
In this work, the cohort of 569 patients is randomly divided into two parts: 70% (the learning set) for developing a prediction model and the remaining 30% (the validation set) for validating the developed model. We will build seven different models (1. Logistic Regression, 2. Random Forest Classifier, 3. Gradient Boosting Classifier, 4. Extra Trees Classifier, 5. XGB Classifier, 6. KNeighbors Classifier, and 7. SVM Classifier) for each of the eight feature selection methods (1. Correlation, 2. Chi-square, 3. Recursive Feature Elimination (RFE), 4. Recursive Feature Elimination with Cross-Validation (RFECV), 5. Random Forest, 6. Extra Trees, 7. L1-based, 8. Voted). In total, 56 models will be built, and the best one will be selected.
Model classifiers and selected features
We use 70% of the data for training and the remaining 30% for testing. We trained our models using 5-fold cross-validation: the training data was divided into 5 folds, four folds were used to train the model, and the remaining fold was used to assess model performance and generalizability.
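The split-then-cross-validate procedure looks like this in scikit-learn. A minimal sketch using logistic regression and sklearn's copy of the dataset; the other six classifiers would be scored the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# 5-fold CV on the 70% training split: each fold is held out once
# while the model is trained on the other four.
scores = cross_val_score(LogisticRegression(max_iter=5000),
                         X_train, y_train, cv=5)
print(scores.round(3), scores.mean().round(3))
```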
We assess a set of performance measures, including recall, precision, and AUC, for each model. We use traditional classification performance measures based on the four values of the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). From these values we compute the Positive Predictive Value (PPV), or precision, and sensitivity, or recall, as in Equation (1) and Equation (2).
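Concretely, Equation (1) is PPV = TP / (TP + FP) and Equation (2) is sensitivity = TP / (TP + FN). The toy labels below are illustrative, not taken from the dataset; the hand computation is checked against scikit-learn's helpers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground truth: 1 = malignant, 0 = benign
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy predictions: one FN, one FP

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)          # Equation (1): PPV = TP / (TP + FP)
recall    = tp / (tp + fn)          # Equation (2): sensitivity = TP / (TP + FN)

# sklearn's metrics agree with the hand computation
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```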
In addition, the Receiver Operating Characteristic (ROC) curve is graphed and the area under the ROC curve (AUC) is analyzed.
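Computing the ROC points and the AUC can be sketched as follows; again this uses logistic regression and sklearn's copy of the dataset as a stand-in, with `roc_curve` supplying the (FPR, TPR) points one would plot.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]          # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # points of the ROC curve
auc = roc_auc_score(y_test, probs)               # area under that curve
print(round(auc, 3))
```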
Training and testing performance for each feature selection method:
1. Feature selection: Correlation
2. Feature selection: Chi-square
3. Feature selection: RFE
4. Feature selection: RFECV
5. Feature selection: Random forest
6. Feature selection: Extra trees
7. Feature selection: L1-based
8. Feature selection: Voted
Let’s compare the best selected models:
Training and testing performance for best classifier model.
Here, we selected the best model from each feature selection method. The table below makes it clear that the best classifier is logistic regression with the random forest feature selection method.
Training performance: Accuracy = 0.977, AUC = 0.995
Testing performance: Accuracy = 0.977, AUC = 0.971
Thus, the most significant features for predicting whether a cancer is malignant or benign, as obtained by the algorithms, are: texture_mean, area_mean, concavity_mean, area_se, concavity_se, fractal_dimension_se, smoothness_worst, concavity_worst, symmetry_worst, and fractal_dimension_worst.
To obtain the best results from the predictive model, many different models are trained, optimized, and evaluated using the eight feature sets. During this process the feature set itself is culled using model-specific methods. Each model and subset of features is evaluated on accuracy, AUC, and sensitivity using 5-fold cross-validation. The best results are obtained with logistic regression, with the random forest feature set culled to 10 features. The table above shows the performance measures of the classification techniques. Logistic regression achieved an accuracy of 0.977 and an AUC of 0.971 on the test data.
This ends Part 3 on model building and evaluation. The aim of this part was to provide an in-depth, step-by-step guide to using different kinds of machine learning techniques.
Personally, I enjoyed writing this and would love to learn from your feedback. Did you find this Part 3 useful? I would appreciate your suggestions/feedback. Please feel free to ask your questions through comments below.
We will explore steps 8 & 9, Hyperparameter Tuning and Deployment, in Part 4.
All the code and datasets used in this article can be accessed from my GitHub.
The code is also available as a Jupyter notebook.