A Binary Classification Problem: Breast Cancer Tumours

Indrani Banerjee · Published in CodeX · Nov 13, 2022 · 15 min read

For my first capstone project as part of my bootcamp with Springboard, I decided to have a go at developing a machine learning model with the well-known Wisconsin Breast Cancer dataset. The goal of this project was to illustrate the Data Science Method (DSM) taught at Springboard, and for me personally, I wanted to focus on the machine learning algorithms I'd been reading about. Of course, there are quite a few data science methodologies in the industry, such as CRISP-DM, OSEMN, and KDD, to name a few. At Springboard, and as a beginner, the DSM I followed is this:

1. Problem Identification
2. Data Wrangling
3. Exploratory Data Analysis
4. Pre-processing and training data
5. Modelling
6. Documentation

As I mentioned in this post, a bootcamp alongside a full-time job was exhausting. I was starting to suffer from imposter syndrome, and I was questioning whether any of the lectures I was listening to, the mini assignments I was submitting, and the maths I was revisiting was actually sinking in. I figured it was best to divide and conquer: I would use the first of the two capstone projects to focus on machine learning algorithms, whilst the second project would have a strong data wrangling element. This decision was at the core of my thinking when I picked the Wisconsin dataset: it is clean, there is no missing data, it's labelled, making it ideal for supervised learning models, and the features are all numerical continuous data, making it perfect for most machine learning algorithms. Basically, the dataset was quite easy to work with!

The Data and Problem Identification

The Wisconsin dataset is great for supervised learning models as it consists of 569 samples, each labelled as benign or malignant, with 33 features for each sample. In real-life projects, this is the moment when you talk to experts in the industry to determine the scope and possible metrics for your project. As this was not possible, I set out three ‘common sense’ aims for this project — the success criteria:

  1. An accuracy score of 90% or higher in the classification of malignant tumours.
  2. A higher false positive rate (benign tumours misclassified as malignant) than false negative rate, because medical diagnosis rarely depends on one test: a false positive simply results in more testing, whereas a false negative is rarely followed up.
  3. Correctly classify tumours with a reduced number of features, if possible.

Data Wrangling

This is a crucial step in data science, and from what I understand it is where data scientists spend most of their time. The goal of this step in the DSM is to clean up the data, convert it to numerical data where necessary (as most machine learning algorithms require numerical data), decide how to treat missing values, and identify and decide what to do with outliers. The Wisconsin dataset is a clean dataset, with no missing fields for any of the 569 samples. There is one noteworthy limitation: the dataset is imbalanced, meaning there are more samples of benign tumours than malignant tumours. There are methods of balancing this out, but as I wanted the focus of this project to be on applying machine learning algorithms, I decided to leave the dataset unmodified.

Exploratory Data Analysis

They say a picture is worth a thousand words, and I feel that's best illustrated in this stage of the DSM. Whilst the .describe() method does give us a dataframe of summary statistics for each feature, it's hard to get a deeper insight into the data from the numbers alone.

I usually run the .describe() method pretty early on in the DSM. Here you can see that even the ID column is included in the summary statistics dataframe. We don't really need it, so I went ahead and dropped this column early on.
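For reference, here is a minimal sketch of this step, assuming the commonly used Kaggle export of the dataset (the file name and the 'id' column name are assumptions, not necessarily what my notebook uses):

```python
import pandas as pd

# Load the data; the file name and 'id' column are assumptions based on the
# commonly used Kaggle export of the Wisconsin dataset.
df = pd.read_csv("data.csv")

# Summary statistics for every numerical column, including the 'id' column.
print(df.describe())

# The 'id' column carries no diagnostic information, so drop it early on.
df = df.drop(columns=["id"])
```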

To get a pictorial view of the summary statistics, particularly the spread of the data points for each feature, I created box and whisker plots using Plotly. I'm a big fan of Plotly as it's very easy to use and creates interactive plots. I split up the features based on the range of their data points: the mean areas of the tumours have values ranging from 143.5 to 2501.0, whilst the concavity values range from 0.0888 to 0.427. As the ranges vary so greatly, it's best to group attributes with similar ranges to increase the interpretability of our plots. I then plotted histograms to get an idea of the skewness of the data points for each feature.
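As a rough sketch of how one group of these box plots can be built (the exact grouping of features here is illustrative):

```python
import plotly.express as px

# Features with similar value ranges are plotted together so that one feature's
# axis doesn't dwarf the others; this particular grouping is illustrative.
large_scale = ["radius_mean", "perimeter_mean", "area_mean"]

melted = df.melt(id_vars="diagnosis", value_vars=large_scale,
                 var_name="feature", value_name="value")
fig = px.box(melted, x="feature", y="value", color="diagnosis")
fig.show()
```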

What’s the point of all these graphs? We use them to quickly identify outliers and get an idea of the spread of the data points for each feature. This also tells us whether we need to scale our dataset. For example, there is a benign sample with a worst concavity value of 1.252. It was also interesting to see that the malignant tumours had a much smaller range of values for each feature than the benign tumours. This may be a consequence of our dataset being imbalanced in favour of the benign tumours, or it may be an important characteristic of malignant tumours.

As the features are numerical and continuous data, I also took a look at the correlation between the different features, and visualised this using a heatmap.
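The heatmap itself only takes a couple of lines; here is a sketch using Seaborn (which library produced my final figure doesn't matter, the idea is the same):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation of the numerical features, shown as a heatmap.
corr = df.select_dtypes("number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```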

Why did I do this? Well, one of the aims I outlined was to try and develop a model based on reduced features. But why would we want to do that? The more features we have, the better, as far as machine learning algorithms go. However, I also questioned the practical use of having more features. As the Covid-19 pandemic nears an end, all over the world we are seeing the effects of our healthcare systems being stretched to the brink for two years. If we can effectively figure out which features can be dropped without compromising the accuracy of our model, then healthcare professionals can collect less information on tumours, speeding up the diagnosis process.

I am looking for features that are highly correlated with each other, in the hope of being able to drop one of them. The logic is that if two features are highly correlated, we only need to use one of them in the model, as the other will follow suit. We can see from the heatmap that the mean radius and mean perimeter are highly correlated. I plotted these two features against each other to get a better idea of their association.

I first created a scatterplot (left) and then used regplot from Seaborn to play around with the regression line (right). We can see that the orange regression line, a second-order (quadratic) fit, captures the trend much better. I go through this example with a lot of my maths students: just because two variables have a high correlation coefficient doesn't mean we should blindly assume the relationship is linear. We should always take a deeper look at the data! I went ahead and did the same analysis in my notebook with the other feature pairs that showed high correlations: mean area and mean perimeter, and mean area and mean concavity.
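Here is a sketch of the two plots for the radius/perimeter pair; order=2 is what produces the quadratic fit:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left: plain scatterplot of the two highly correlated features.
sns.scatterplot(data=df, x="radius_mean", y="perimeter_mean", ax=axes[0])

# Right: regplot with order=2 fits a second-order (quadratic) regression line.
sns.regplot(data=df, x="radius_mean", y="perimeter_mean", order=2,
            line_kws={"color": "orange"}, ax=axes[1])
plt.show()
```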

I realised I'd like an even closer look at mean area and mean concavity: they had a correlation coefficient of 0.69, but the plot didn't indicate a linear association. So, I used Statsmodels to create and evaluate a simple linear regression model for these two features, and a multiple linear regression model for the mean area and the other features. The multiple linear regression model yielded a much higher coefficient of determination (an R-squared value of 0.799) than the simple linear model (an R-squared value of 0.439). However, the standard errors were relatively high, so I concluded that I couldn't confidently drop any of the 33 features.
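A sketch of the two Statsmodels fits (treating mean area as the response variable is my assumption about how the models were set up):

```python
import statsmodels.api as sm

# Keep only the numerical columns and drop any empty padding columns.
numeric = df.select_dtypes("number").dropna(axis=1, how="all")

# Simple linear regression: mean area explained by mean concavity alone.
simple = sm.OLS(numeric["area_mean"],
                sm.add_constant(numeric[["concavity_mean"]])).fit()
print(simple.rsquared)

# Multiple linear regression: mean area explained by all the other features.
multi = sm.OLS(numeric["area_mean"],
               sm.add_constant(numeric.drop(columns=["area_mean"]))).fit()
print(multi.summary())  # R-squared plus the standard error of every coefficient
```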

Pre-processing and Training Data

Now that I'd ruled out dropping features, I prepared the dataset for machine learning.

Firstly, the diagnosis column had B and M to denote the benign and malignant tumours. I had to convert this categorical data to indicator variables: this basically means I replaced the diagnosis column with a column called B, with values of 0 for samples which are malignant and 1 for those which are benign. This is known as feature encoding, and there are quite a few different ways of doing it. I went with pandas' .get_dummies() method because it's easy to interpret: you get the clean column labelled B, as I'd mentioned.
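In code, that encoding step looks roughly like this:

```python
import pandas as pd

# One-hot encode the diagnosis labels; this produces indicator columns 'B' and 'M'.
dummies = pd.get_dummies(df["diagnosis"])

# Keep the 'B' column (1 = benign, 0 = malignant) and drop the original label.
df = pd.concat([df.drop(columns=["diagnosis"]), dummies["B"].astype(int)], axis=1)
```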

Secondly, I used train_test_split from Scikit-learn to split the data into a 75:25 ratio of training and testing data respectively. Then I scaled the data using StandardScaler from Scikit-learn. Check out this article on the various scaling techniques if you're interested, and have a look at my notebook to see how the values in the arrays X_train_scaled and X_test_scaled differ from the original dataframe.
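A sketch of the split and scaling (random_state and stratify are my additions for reproducibility, not necessarily what the notebook uses):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["B"])
y = df["B"]

# 75:25 split; random_state and stratify are assumptions added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```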

The dataset is now ready for us to train our models. This is a binary classification project: we want our models to look at 75% of the dataset and learn which combinations of feature values help distinguish malignant tumours from benign ones. So, I had to decide which algorithms to use.

I first used a Dummy Classifier from Scikit Learn. This creates a classification model by classifying the tumours as malignant or benign without trying to find any patterns in the data. This is to serve as the base model, a model to compare our other classification models to.

The conclusion? The dummy model is accurate 64% of the time. This is the minimum accuracy we want to see from our other classification models.
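A minimal sketch of the baseline (the strategy argument is my assumption; 'most_frequent' simply predicts the majority class every time):

```python
from sklearn.dummy import DummyClassifier

# Baseline that ignores the features entirely and predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_scaled, y_train)
print(dummy.score(X_test_scaled, y_test))  # accuracy of the naive baseline
```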

Before I go into detail about the various different models I created, I want to run through a set of steps I followed for each type of algorithm:

I first created a model and looked at its evaluation metrics.

Then, I used GridSearchCV with 10-fold cross-validation for hyperparameter tuning. The first model for each machine learning algorithm is relatively simple. However, many of these algorithms can be further 'tuned' to improve their performance. With 10-fold cross-validation, GridSearchCV divides our training set into 10 groups; for every combination of parameter values it trains a model on 9 of the groups and tests it on the remaining group, rotating which group is held out.

The output? We get to extract the best parameters: the values for each parameter that gave us the best-performing model.
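Here is a sketch of that pattern, using logistic regression (the first model below) as the example; the grid values searched are illustrative rather than a record of my exact grid:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the same fit / best_params_ pattern was reused
# for every algorithm that follows.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=10, scoring="accuracy")
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)  # hyperparameter values of the best-performing model
print(grid.best_score_)   # its mean 10-fold cross-validated accuracy
```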

Model 1: Logistic Regression

Firstly, it's important to realise that logistic regression is a regression method that we can use for classification problems. There are a few different types of logistic regression: binomial, multinomial, and ordinal. As our target variable, the diagnosis of the tumour, has two possible outcomes, malignant or benign, binomial logistic regression is the most appropriate. I won't delve too deeply into the theory of logistic regression in this article, but check out this post if you're interested.

As I mentioned earlier, I used GridSearchCV to determine the best parameters: C = 0.1, penalty = 'l2', and solver = 'liblinear'.

C is the regularisation parameter and one of the most important parameters to tune. The C value essentially controls for implausibly large regression coefficients in our model. This means we can vary the values of C using GridSearchCV and pick a value that helps us avoid overfitting. A value of 0.1 is relatively low, which means we are telling the model that if some datapoints suggest the coefficients should be very large, it shouldn't pay too much attention to them, as the dataset may not be fully representative of all possible breast cancer tumours.

The second hyperparameter is the penalty, for which l2 is determined to be optimal. The penalty hyperparameter is a form of regularisation and again helps to avoid overfitting: 'l2' means we have picked a ridge-style penalty. The consequence? Mathematically speaking, it shrinks the weights and allows for better generalisation, so the model is better at making predictions on unseen data. Finally, 'liblinear' is used for the solver hyperparameter. This stands for Library for Large Linear Classification and uses what's called a coordinate descent algorithm. Honestly, the maths gets quite tricky quite quickly here, but check this out if you want to dig a little deeper!

The best logistic regression model had an accuracy of 98%, but with a higher false negative rate than false positive rate: it fulfils our first goal but not our second.
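This is where the confusion matrix on the held-out test set comes in; a sketch, continuing from the grid search above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate the tuned model on the 25% of the data it has never seen.
y_pred = grid.best_estimator_.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))

# With the 'B' encoding (1 = benign, 0 = malignant), the first row holds the
# true malignant samples; its second entry is a malignant tumour predicted as
# benign, i.e. the clinically dangerous 'false negative' from the success criteria.
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
```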

Model 2: K Nearest Neighbours

The k-nearest neighbours algorithm uses the logic that points belonging to the same group tend to appear near one another: the smaller the distance between two points, the higher the chance they belong to the same class. We set k, the number of neighbouring points the model considers when making a prediction.

From GridSearchCV, the best parameters were determined to be leaf_size = 1, n_neighbors = 3 and p = 2. Again, I'll refrain from digging too deeply into the maths here, but leaf_size controls how the tree structure used for the neighbour search is built, and mainly affects speed and memory rather than the predictions themselves. n_neighbors is the number of neighbouring points to consider: here we have a value of 3, so our model will consider the 3 nearest neighbours. Finally, what do we mean by the 'distance between the points'? This is where the final parameter, p, comes in. The power parameter can take values of 1 or 2, indicating whether to use the Manhattan or the Euclidean distance formula respectively. Our best scores were achieved using Euclidean distances.
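A sketch of the tuned model:

```python
from sklearn.neighbors import KNeighborsClassifier

# KNN with the tuned hyperparameters: 3 nearest neighbours, Euclidean distance
# (p=2), and leaf_size=1 for the underlying search tree.
knn = KNeighborsClassifier(n_neighbors=3, leaf_size=1, p=2)
knn.fit(X_train_scaled, y_train)
print(knn.score(X_test_scaled, y_test))  # accuracy on the held-out test set
```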

The accuracy score for this model is 99%, and it was the first model to achieve a higher false positive rate than false negative rate. We are finally getting closer!

Model 3: Support Vector Classifier

A support vector classifier can be hard to understand, so I find it best to think about it visually. Imagine we plot each of our datapoints and look for a 'line' that best separates them into their target groups. In this case, we want a boundary with the majority of the benign tumours on one side and the malignant ones on the other. In the figure below, you can think of the dotted lines as the support vectors and the solid red line as our 'boundary', sometimes called the optimal hyperplane.

By Larhmam — Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=73710028

The best parameters are determined to be {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}. In this case, C controls how much we penalise misclassified points. If C is low, the penalty for misclassified points is low, and the model may draw the boundary in a way that doesn't separate the two classes very accurately.

Now, let's take a moment to understand the gamma and kernel parameters. 'rbf' stands for Radial Basis Function kernel and is one of the most commonly used kernels; it is based on the distance between two points. Check out this post, which goes into quite a lot of detail about the RBF kernel. A low gamma value means points that are relatively far apart can still influence one another, giving a smoother decision boundary that can underfit. In contrast, a high gamma value means only points very close together influence one another, which can result in overfitting.
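Putting those hyperparameters together gives something like:

```python
from sklearn.svm import SVC

# SVC with the tuned hyperparameters: a moderate misclassification penalty (C=1)
# and a small gamma, i.e. a smooth RBF decision boundary.
svc = SVC(C=1, gamma=0.0001, kernel="rbf")
svc.fit(X_train_scaled, y_train)
print(svc.score(X_test_scaled, y_test))  # accuracy on the held-out test set
```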

The best SVC model had an accuracy of 93%, a significant improvement on the 64% accuracy score before I tuned the hyperparameters. However, it doesn't achieve a higher false positive rate than false negative rate, which is one of our criteria of success.

Model 4: Random Forest Classifier

The random forest classifier consists of many individual decision trees that each produce their own prediction; whichever label is predicted most frequently across the trees is selected as the model's prediction. If you're not too sure how decision trees work, check out this page. The best parameters were again determined using 10-fold cross-validation: n_estimators = 23 and max_depth = 4. The max_depth parameter limits how deep each tree is allowed to grow, and n_estimators is the number of decision trees in the forest. The accuracy of this model is 97%, with equal false positive and false negative rates.
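A sketch of the tuned forest (random_state is my addition for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest with the tuned hyperparameters: 23 trees, each at most 4 levels deep.
rf = RandomForestClassifier(n_estimators=23, max_depth=4, random_state=42)
rf.fit(X_train_scaled, y_train)
print(rf.score(X_test_scaled, y_test))  # accuracy on the held-out test set
```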

Another advantage of using a random forest classifier is that it's easy to extract feature importances. I worked these out using two different methods: the .feature_importances_ attribute of the fitted forest, and permutation-based calculations. Look at the documentation here if you want to learn more about these features.
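A sketch of both methods, continuing from the forest above:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Impurity-based importances come for free with the fitted forest.
impurity = pd.Series(rf.feature_importances_, index=X.columns)
print(impurity.sort_values(ascending=False).head())

# Permutation importance shuffles one feature at a time on the test set and
# measures how much the accuracy drops as a result.
perm = permutation_importance(rf, X_test_scaled, y_test,
                              n_repeats=10, random_state=42)
print(pd.Series(perm.importances_mean, index=X.columns)
        .sort_values(ascending=False).head())
```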

Have a look at my notebook as I try to further tweak the model, but I essentially concluded that the three most important features are the 'area_worst', 'concave points_worst', and 'concave points_mean' fields, as they carried the highest importance scores.

Model 5: Gradient Boosting Classifier

Gradient boosting classifiers are essentially a group of machine learning models, usually quite weak learners, that are combined to create a cumulative model of higher accuracy. Imagine fitting a weak model, then fitting another weak model that concentrates on the samples the previous one classified poorly, and repeating: each new learner corrects some of the mistakes of the ensemble built so far. The final result predicts the correct classes with a higher degree of accuracy than any of the individual weak learners.
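A sketch with scikit-learn's GradientBoostingClassifier (default hyperparameters here, since the tuned values aren't listed in this post):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Each new shallow tree is fitted to correct the errors of the ensemble so far.
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train_scaled, y_train)
print(gbc.score(X_test_scaled, y_test))  # accuracy on the held-out test set

# Feature importances, just like with the random forest.
print(sorted(zip(gbc.feature_importances_, X.columns), reverse=True)[:3])
```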

The accuracy of the optimal model is 98% and the three most important features are determined to be the perimeter worst, concave points worst, and radius worst, with area worst trailing very closely behind. We are starting to see similar features being identified as important features — we are getting somewhere!

Model 6: XGBoost Classifier

XGBoost stands for Extreme Gradient Boosting, which is like the random forest classifier in the sense that it is also an ensemble learning algorithm. Unlike random forest classifiers, where the classification agreed on by most of the trees is selected as the prediction, XGBoost is similar to gradient boosting classifiers and uses a series of 'weaker' learning models to build a strong model. The 'boosting' essentially reduces bias and helps avoid underfitting.
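A sketch using the scikit-learn-style wrapper from the xgboost library (default hyperparameters; eval_metric is set only to silence the library's warning about its default):

```python
from xgboost import XGBClassifier

# XGBoost's scikit-learn wrapper fits into the same workflow as the models above.
xgb = XGBClassifier(eval_metric="logloss")
xgb.fit(X_train_scaled, y_train)
print(xgb.score(X_test_scaled, y_test))  # accuracy on the held-out test set
```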

The overall accuracy is 95%, and the most important features identified are perimeter worst, radius worst, and concave points worst.

Modelling

Decision time! Now we have to pick the 'best' model that meets the success criteria. The best model is therefore the KNN model with leaf_size = 1, n_neighbors = 2, and p = 2.

It’s important to highlight that a random forest classifier with n_estimators = 23 and max_depth = 4 is also quite useful, as this highlighted important features such as area worst, concave points worst, and concave points mean.

When feature importance was identified using permutations, these three features were again identified as the most important. Dropping features with low feature importance values did show a slight improvement in the model metrics. However, the KNN model is slightly superior in the sense that it helps us achieve our aim of selecting a model with a higher false positive rate than false negative rate.

Documentation

As part of the DSM, I presented my project through a report and a presentation. I've also uploaded all the notebooks to GitHub.

Honestly, by the time I’d finished this project I was pretty happy with what I produced. There’s still a long journey ahead, so I’ll follow this post up with a few articles going into a little more detail about how I decided on which model metrics to use and why. Stay tuned!
