Breast Cancer Classification Using Python

A guide to EDA and classification

Mugdha Paithankar
The Startup
13 min read · Nov 8, 2020


Photo by Peter Boccia on Unsplash

Breast cancer (BC) is one of the most common cancers among women in the world today.

Currently, the average risk of a woman in the United States developing breast cancer sometime in her life is about 13%, which means there is a 1 in 8 chance she will develop breast cancer!

An early diagnosis of BC can greatly improve the prognosis and chance of survival for patients. Thus an accurate identification of malignant tumors is of paramount importance.

In this article I will go over all the steps needed to make a Data Science project complete in itself and, with the use of machine learning algorithms, ultimately build a model which accurately classifies tumors as benign or malignant based on their shape and geometry.

Step 1: Get the data!

I got the dataset from Kaggle. It contains 569 rows and 33 columns of tumor shape and specifications. The tumor is classified as benign or malignant based on its geometry and shape. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, which is a type of biopsy procedure. They describe characteristics of the cell nuclei present in the image.

The features of the dataset include:

  1. tumor radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter² / area — 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

Step 2: Exploratory Data Analysis (EDA)

The dataset has 569 rows and 33 columns. There are two extra columns, "id" and "Unnamed: 32". We drop "Unnamed: 32", which contains only NaN values.
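As a minimal sketch of the loading step, the snippet below uses sklearn's built-in copy of the same Wisconsin breast cancer dataset as a stand-in for the Kaggle CSV (the Kaggle file additionally carries the "id" and all-NaN "Unnamed: 32" columns, which would be dropped):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# sklearn ships the same Wisconsin diagnostic dataset the Kaggle CSV is based on
data = load_breast_cancer(as_frame=True)
X = data.data                               # the 30 numeric features
y = data.target_names[data.target]          # 'malignant' / 'benign' labels

print(X.shape)                              # (569, 30)
print(pd.Series(y).value_counts())          # benign 357, malignant 212
```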

212 Malignant and 357 Benign tumors

There are now 30 features we can visualize. I decided to plot 10 features at a time, which led to 3 plots containing 10 features each. The means of all the features were plotted together, as were the standard errors and worst dimensions.

Violin plots are like density plots: unlike bar graphs with means and error bars, they show every part of the data's distribution, which makes them an excellent tool for visualizing small samples.

I made violin plots and noted, based on each feature's distribution, whether it would be good for classification. To make violin plots for this dataset, first separate the data labels 'M' or 'B' (into y) and the features (into X). Then visualize 10 features at a time.
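The steps above can be sketched as follows, again using sklearn's built-in copy of the dataset. The features are z-scored so they share one axis, and matplotlib's `violinplot` is used as a stand-in (the article's figures were likely drawn with seaborn):

```python
import matplotlib
matplotlib.use("Agg")                       # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target               # target: 0 = malignant, 1 = benign
X_std = (X - X.mean()) / X.std()            # standardize so features share one scale

fig, ax = plt.subplots(figsize=(12, 5))
first_ten = X_std.columns[:10]              # the 10 "mean" features
positions = range(len(first_ten))
# draw malignant violins slightly left, benign slightly right of each tick
for label, offset in [(0, -0.15), (1, 0.15)]:
    ax.violinplot([X_std.loc[y == label, c].to_numpy() for c in first_ten],
                  positions=[p + offset for p in positions],
                  widths=0.3, showmedians=True)
ax.set_xticks(list(positions))
ax.set_xticklabels(first_ten, rotation=45, ha="right")
fig.savefig("violin_means.png")
```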

Violin plot displaying all the mean features

The median of texture_mean for Malignant and Benign looks separated, so it might be a good feature for classification. For fractal_dimension_mean, the medians of the Malignant and Benign groups are very close to each other.

Violin plot displaying all the standard error features

For most of the standard error features above, the malignant and benign medians do not differ much, except for concave points_se and concavity_se. smoothness_se and symmetry_se have very similar distributions for the two groups, which could make classification using these features difficult. The violin plot for area_se looks warped, and the distributions of data points for benign and malignant are very different!

Violin plot displaying all the worst dimension features

area_worst looks well separated, so it might be easier to use this feature for classification! Variance seems highest for fractal_dimension_worst. concavity_worst and concave_points_worst seem to have a similar data distribution.

In order to check the correlation between the features, I plotted a correlation matrix. It is effective in summarizing a large amount of data where the goal is to see patterns.
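A minimal version of that correlation matrix, rendered with matplotlib's `imshow` (the article's heatmap was likely drawn with seaborn):

```python
import matplotlib
matplotlib.use("Agg")                       # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data
corr = X.corr()                             # 30 x 30 Pearson correlation matrix

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns, fontsize=6)
fig.colorbar(im)
fig.savefig("corr_heatmap.png")
```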

Correlation heatmap of all the features

The means, std errors and worst dimension lengths of compactness, concavity and concave points of tumors are highly correlated amongst each other (correlation > 0.8). The mean, std errors and worst dimensions of radius, perimeter and area of tumors have correlations of nearly 1! texture_mean and texture_worst have a correlation of 0.9. area_worst and area_mean have a correlation of nearly 1.

By now we have a rough idea that many of the features are highly correlated amongst each other. But what about correlation between the benign and malignant groups for each feature? In order to understand if there is a difference between the data distribution for malignant and benign groups, I visualized some features via box plots and performed a t test to detect statistical significance.

Box plots succinctly compare multiple distributions and are a great way to visualize the IQR.

Comparing mean features for M and B groups

Texture means for malignant and benign tumors vary by about 3 units. The distribution looks similar for both groups. Malignant tumors tend to have a higher texture mean compared to benign.

Fractal dimension means are almost the same for malignant and benign tumors. The IQR is wider for malignant tumors.

Comparing se features for M and B groups

Malignant groups have a distinctly wider range of values for area se. The distribution range is very narrow for benign groups. This might be a good feature for classification.

Standard error (se) of concave points has a higher mean and IQR for malignant tumors. The distribution looks somewhat similar for both tumor types.

Comparing worst dimension features for M and B groups

Malignant groups have a wider range of values for radius worst compared to benign groups, and a wider IQR. Malignant tumors have a higher radius worst than benign tumors.

Similar to area_se, area_worst has a very different data distribution for malignant and benign tumors. Malignant tumors tend to have a higher value of mean and wider IQR range. Because of noticeable differences between B and M tumors, this could be a good feature for classification.

Box plots indicated a difference in means for most of the features visualized above. But are these differences statistically significant? One way to check for this is by a t test.

The t test tells you how significant the differences between group means are; in other words, it lets you know whether those differences (measured in means) could have happened by chance.

t test results for some features from the dataset

Except for fractal dimension mean, the p values are statistically significant for all the features in the table above. For fractal dimension mean the null hypothesis cannot be rejected, meaning there is no evidence of a difference in means for the fractal dimension mean of M and B tumors.
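A minimal version of this test, using Welch's two-sample t test from scipy on two of the features discussed above (sklearn's built-in copy of the dataset stands in for the Kaggle CSV):

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target               # 0 = malignant, 1 = benign

for feature in ["mean texture", "mean fractal dimension"]:
    m = X.loc[y == 0, feature]              # malignant group
    b = X.loc[y == 1, feature]              # benign group
    t, p = stats.ttest_ind(m, b, equal_var=False)   # Welch's t test
    print(f"{feature}: t = {t:.2f}, p = {p:.2e}")
```

Texture mean comes out highly significant, while fractal dimension mean does not, matching the table.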

From the correlation matrix we saw earlier, it was clear that there are quite a few features with very high correlations. So I dropped one of the features, from each of the feature pairs which had a correlation greater than 0.95. ‘perimeter_mean’, ‘area_mean’, ‘perimeter_se’, ‘area_se’, ‘radius_worst’, ‘perimeter_worst’, ‘area_worst’ were amongst the features that were dropped.
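A sketch of that pruning step: keep only the upper triangle of the absolute correlation matrix so each pair is considered once, then drop any column with a correlation above 0.95 to an earlier column (feature names follow sklearn's copy of the dataset, e.g. "mean perimeter" rather than "perimeter_mean"):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data
corr = X.corr().abs()

# upper triangle only (k=1 excludes the diagonal), so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)

print(sorted(to_drop))
```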

Step 3: Machine Learning

We want to build a model which classifies tumors as benign or malignant. I used sklearn’s Logistic Regression, Support Vector Classifier, Decision Tree and Random Forest for this purpose.

But first, transform the categorical variable column (diagnosis) to a numeric type. I used sklearn’s LabelEncoder for this purpose. The M and B variables were changed to 1 and 0 by the label encoder.
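The encoding step in isolation; LabelEncoder assigns codes in alphabetical order, which is why 'B' happens to map to 0 and 'M' to 1:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["M", "B", "B", "M"])

print(list(le.classes_))    # ['B', 'M']
print(list(encoded))        # [1, 0, 0, 1]
```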

Transform categorical variables

Train Test Split the data

40% of the data was reserved for testing purposes. The split was stratified so that the train and test sets preserve the same proportion of target classes as the original dataset.
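A sketch of that split (the `random_state` is an arbitrary choice for reproducibility, not the post's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# 40% held out; stratify=y keeps the malignant/benign ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

print(len(X_train), len(X_test))    # 341 228
```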

Scale the features

sklearn’s Robust Scaler was used to scale the features of the dataset. The centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a few number of very large marginal outliers.

Train the data

Confusion matrix

Classification Report
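A minimal train-and-evaluate sketch covering the three steps above, using the logistic regression model the post ultimately settles on. For brevity it keeps all 30 features (the post drops the highly correlated ones first), and the split parameters are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target               # 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# scale inside the pipeline so the scaler is fit on training data only
model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```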

Hyper parameter tuning

Hyperparameters are crucial as they control the overall behavior of a machine learning model.

In the context of cancer classification, my goal was to minimize the misclassifications for the positive class (i.e. when the tumor is malignant, 'M'). Misclassifications include False Positives (FP) and False Negatives (FN). I focused more on reducing the FNs because tumors which are malignant should never be classified as benign, even if this means the model might classify a few benign tumors as malignant! Therefore I used sklearn's fbeta_score as the scoring function with GridSearchCV. A beta > 1 makes fbeta_score favor recall over precision.
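A sketch of that grid search; `beta=2` and the `C` grid are illustrative assumptions (the post only says beta > 1), and the data is relabeled so malignant is the positive class, matching the 'M' → 1 framing:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer(as_frame=True)
# relabel so malignant is the positive class (1)
X, y = data.data, (data.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

pipe = Pipeline([("scale", RobustScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe,
                    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
                    scoring=make_scorer(fbeta_score, beta=2),  # recall-leaning
                    cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
```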

After grid searching, the accuracy improved a little, but there were still 2 false negatives.

Grid searching was done on the SVC and Random Forest models too, but recall was best for logistic regression, which is why I am discussing logistic regression in this post.

Custom Threshold to increase recall

The default threshold for interpreting probabilities to class labels is 0.5, and tuning this hyperparameter is called threshold moving.

Finally, the FNs were reduced to 1 after manually setting a decision threshold of 0.42!
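Threshold moving boils down to taking `predict_proba` and applying a cut-off other than 0.5. The sketch below compares false negatives at the default threshold and at 0.42 (the split, preprocessing and exact counts here are illustrative, so they will differ from the post's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)    # 1 = malignant
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]           # P(malignant)
y_true = y_test.to_numpy()

fn = {}
for threshold in (0.5, 0.42):
    pred = (proba >= threshold).astype(int)         # lower cut-off flags more tumors
    fn[threshold] = int(((pred == 0) & (y_true == 1)).sum())
    print(f"threshold {threshold}: false negatives = {fn[threshold]}")
```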

Graph of recall and precision vs threshold

Graph of recall and precision scores vs thresholds

The line for optimal decision threshold indicates the point of maximum recall which could be achieved without compromising a lot on precision. After that point the precision starts to drop more.
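Such a graph can be produced with sklearn's `precision_recall_curve`, as sketched below (the model setup is the same illustrative pipeline as before, not the post's exact tuned model):

```python
import matplotlib
matplotlib.use("Agg")                       # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)    # 1 = malignant
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# precision/recall at every candidate threshold; both arrays have one extra
# endpoint, so drop it when plotting against thresholds
precision, recall, thresholds = precision_recall_curve(y_test, proba)

fig, ax = plt.subplots()
ax.plot(thresholds, precision[:-1], label="precision")
ax.plot(thresholds, recall[:-1], label="recall")
ax.set_xlabel("decision threshold")
ax.legend()
fig.savefig("pr_vs_threshold.png")
```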

ROC Curve for Logistic Regression model

The AUC score for this model is 0.9979.

AUC score tells us how good our model is at distinguishing between classes, in this case, predicting benign tumors as benign and malignant tumors as malignant.

The ROC curve is plotted with the TPR on the y-axis against the FPR on the x-axis. This ROC curve looks almost ideal.

When the predicted score distributions for the two classes don't overlap at all, the model has an ideal measure of separability, i.e. it is able to correctly classify positives as positives and negatives as negatives.
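Computing the ROC curve and AUC takes two sklearn calls; the sketch below reuses the same illustrative pipeline, so the AUC it prints will differ somewhat from the post's 0.9979:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)    # 1 = malignant
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba)      # points of the ROC curve
auc = roc_auc_score(y_test, proba)          # area under that curve
print(f"AUC = {auc:.4f}")
```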

To conclude this post, I have discussed a few EDA, statistical analysis and machine learning techniques as applied to the breast cancer classification dataset. The complete code for this project can be found on GitHub.

The breast cancer classification dataset is good to get started with making a complete Data Science project before you move on to more advanced datasets and techniques.

Hope you guys found this post helpful and learnt something new too! Follow Mugdha Paithankar for more stories. Please clap this article if you like it!

