Classifying Breast Cancer Using KNNs

Nicholas Singh · Published in The Startup · Jan 16, 2021

While I was diving deeper into machine learning, I came across a dataset for breast cancer and I thought it would be interesting if I could use machine learning to predict whether a given cell nucleus is malignant or benign (cancerous or non-cancerous).

Photo by National Cancer Institute on Unsplash

Analyzing the dataset

The dataset comes from the University of California Irvine ML repository; you can access it here. It contains 30 numeric features describing each cell nucleus, and you can find a description of them at the link under the dataset description section. Before we do anything with the data, we need to do some data exploration. We can do this using pandas. First, we import pandas and load the dataset.

import pandas as pd

# use read_csv to load the dataset
df = pd.read_csv("data.csv")

Now that we have our dataset loaded, we’re going to use three simple methods to help us understand the content of the data. First, .head() shows us a spreadsheet-style preview of the first few rows. Second, .describe() shows summary statistics such as the max, mean, and standard deviation of each column. These are important because they can be used for normalization and standardization; we’ll dive further into standardization later in this article. Last of the three, .info() shows information such as each column’s data type and whether it has any null fields.

df.head()
df.describe()
df.info()

After running the code above, you should notice that the dataset has two columns that are not useful for us. When performing machine learning, it’s important to remove the id field because the ids are random and have no correlation to the data. The Unnamed: 32 column should also be dropped because it contains no values at all.

# drop the "Unnamed: 32" and "id" columns
# axis=1 means drop columns (by name) rather than rows
df = df.drop(["Unnamed: 32", "id"], axis=1)

If you look at the diagnosis field, it holds M or B to indicate malignant or benign rather than a binary value. Let’s write a function that converts that column to a binary value, then apply it to the column in the dataset. We need to do this because the model we’re going to use doesn’t accept strings in the data.

def type_to_binary(element):
    # malignant -> 1, benign -> 0
    if element == "M":
        return 1
    return 0

df["diagnosis"] = df["diagnosis"].apply(type_to_binary)

Getting into the machine learning

In machine learning, there are a few broad categories that our model could fall into.

What type of model should we use?

A regression model predicts continuous values, such as probabilities or the price of a house given its features. A classification model predicts which of a set of categories something belongs to: cats vs. dogs, Democrat vs. Republican, or even multiple categories, like whether an image shows a t-shirt, socks, or pants. Both classification and regression fall under a category called supervised learning, where you know what you’re trying to predict; in unsupervised learning, by contrast, you group data points into clusters based on patterns in the data. Our model predicts whether a diagnosis is benign or malignant, so it’s going to be a classification model.

The classification model that we’re going to use in this article is the K-Nearest Neighbours model.

Understanding KNNs


KNNs are what they sound like. Imagine having your data points plotted on a graph, and a new point appears on that graph. What a KNN does is look at the k nearest neighbours of the new point. If k is three, we find the 3 points closest to the new point, look at which class the majority of those points belong to, and classify the new point as that class.
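To make the mechanics concrete, here’s a minimal sketch of the nearest-neighbour vote in plain NumPy. The function and variable names are just for illustration; we’ll use scikit-learn’s implementation for the real model:

import numpy as np

def knn_predict(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels (assumes 0/1 labels)
    return int(round(y_train[nearest].mean()))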

Preparing training and testing sets

When doing machine learning, we have two steps: training and testing. Training is important because this is where the model ‘learns’. In testing, we evaluate how well the model works on new data.

Let’s import some functions that we’re going to need. I’ll cover what each one of the functions does as we use them.

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_roc_curve
from sklearn.pipeline import Pipeline

We split our data into training and testing sets because if we test the model on the same data we used to train it, the model will score a high accuracy, but we won’t know how well it does on new data. Before we can split the dataset, we need to separate our ‘features’ from our ‘label’. Our label is the column in the dataset that we want to predict. Our features are all the other columns that the model is going to use to predict the values in the label. In train_test_split we need to specify the percentage of data we want for testing (30% in our case) and the seed it uses to randomize the data.

# split the data
y = df["diagnosis"].values
X = df.drop(["diagnosis"],axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
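As a quick sanity check (this print isn’t part of the original code, it’s just to confirm the split), roughly 70% of the rows should land in the training set:

# confirm the 70/30 split
print(X_train.shape, X_test.shape)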

Standardization


The features in our data are on different scales. This is a problem because our model looks for similarity between the data points around the point we want to predict; if the features aren’t on the same scale, they don’t contribute equally to determining the nearest data points.

To solve this, we can standardize our data (put the data on the same scale): for every feature, subtract the mean and divide by the standard deviation, so each column ends up with mean 0 and standard deviation 1. Luckily, sklearn’s StandardScaler can do this for us, and since we always want to standardize the data going into the model, we’re going to create a pipeline. All a pipeline does is execute the steps we put in it, in order; in our case, it’s going to scale the data and then run the model. To initialize the model, all we have to do is call KNeighborsClassifier(). We can specify the k in KNN by passing a parameter called n_neighbors.

knn_steps = [("scaler", StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=4))]
knn_pipeline = Pipeline(knn_steps)
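If you want to see what the scaler actually does, you can fit a standalone StandardScaler on the training features and check that every scaled column has mean ≈ 0 and standard deviation ≈ 1. This step is for illustration only; the pipeline applies it internally:

# illustration only: the pipeline already applies this step for us
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
print(X_train_scaled.mean(axis=0).round(2))  # ~0 for each feature
print(X_train_scaled.std(axis=0).round(2))   # ~1 for each feature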

Training our KNN

All we need to do to train our model is call .fit on the pipeline we created and pass in X_train and y_train.

knn = knn_pipeline.fit(X_train,y_train)
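The choice of n_neighbors=4 above is somewhat arbitrary. One way to pick k more systematically (not part of the original code, just a common scikit-learn pattern) is a cross-validated grid search over the pipeline; the knn__ prefix targets the 'knn' step we named when building it:

from sklearn.model_selection import GridSearchCV

# try k from 1 to 20 with 5-fold cross-validation
param_grid = {"knn__n_neighbors": range(1, 21)}
search = GridSearchCV(knn_pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)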

Evaluating our model

Perfect, we have our model trained; it’s time to put that test set to use. Our model has a built-in .score method that tells us its accuracy, defined as the fraction of predictions the model gets right. Accuracy alone might not be the best way to judge our model, though, because we also want to see how often our model predicts malignant when the tumour is actually benign, and vice versa.

Imagine someone uses our model, and it tells them they have a cancerous tumour when they don’t. Because of this, we use a ROC curve: a graph that plots the true positive rate against the false positive rate. The area under this curve is called the AUC; the closer it is to 1, the better the model, and the closer to 0.5, the worse. We can use sklearn’s function to plot the ROC curve and tell us the AUC.

print('Accuracy: {}'.format(knn.score(X_test,y_test)))
plot_roc_curve(knn,X_test,y_test)
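One caveat: plot_roc_curve was removed in scikit-learn 1.2. If you’re on a newer version, the equivalent call is:

# scikit-learn >= 1.2 replacement for plot_roc_curve
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(knn, X_test, y_test)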

If we run this, we get an accuracy of about 0.96 and an AUC of 0.98! That means our model is doing excellent! Now, if you want to make predictions with this model, you can run knn.predict(). Note that the data you pass into predict has to have the same features as X_train.
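For example, here’s a small sketch of making a prediction. I’m reusing a row from the test set, since any real input would need the same 30 features in the same order:

# predict the class of the first row in the test set
sample = X_test[:1]           # shape (1, 30): same features as X_train
print(knn.predict(sample))    # 1 = malignant, 0 = benign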

I made a GitHub repo for my original code here if you need a reference.

Thanks for reading, if you liked this article, follow me and check out my other articles.
