KNN Classification on HAM10000

Nabib Ahmed · Dec 12, 2022


This article will look at K-Nearest Neighbors (KNN) Classification as a tool for building a skin cancer detection algorithm with the HAM10000 dataset.

Photo by DeepMind on Unsplash

This article is part of a larger series on the HAM10000 dataset — please refer to the introduction article. It’s also recommended to read the previous article in the series, Logistic Regression on HAM10000, as the discussion here assumes that context.

Motivation

K-Nearest Neighbors (KNN) classification is a statistical modeling technique that determines the label for an input based on its similarity to other examples. It calculates the “distance” between the input and the examples in the training set, finds the k nearest neighbors, and assigns the majority label among those k neighbors. A more detailed explanation and walkthrough of the algorithm can be found here: Needing Distance from Your Neighbors

In the context of HAM10000, the input will be an image of a skin lesion and it will be compared against the images in the training set. The model will calculate a distance between the input image and the training set images, rank the training set images in order of least to greatest “distance”, and select the label with majority representation in the first k training set images.

With KNN classification, two crucial components are how “distance” is calculated and the value of k. Since we’re working with image data, which consists of numeric pixel values, we can leverage numeric distance formulas such as Euclidean and Minkowski distance. We could also create our own custom distance metric based on the pixel values, though there’s no guarantee it would be useful — we’d need empirical studies to determine that. For this article, we’ll stick with the default distance metric used by the scikit-learn library, Euclidean distance, because it has a very intuitive geometric meaning: the distance between two points is the length of the straight line segment that connects them (hence Euclidean distance is also called straight-line distance). Below is a visualization of the Euclidean distance in two dimensions.

Created by Nabib Ahmed

The generalized mathematical formulation, given two n-dimensional points p and q, is:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

Created by Nabib Ahmed with CodeCogs

Notice that the formula assumes that p and q are 1-dimensional vectors. In the case of images, which are 2-dimensional, we can flatten them to one dimension and treat them as high-dimensional coordinate points — the figure below shows the flattening process.

Created by Nabib Ahmed

In this flattened representation, it’s straightforward to plug the images into the Euclidean distance formula.
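To make this concrete, here’s a minimal sketch (with illustrative image data and variable names, not the original code) of flattening two grayscale images and computing the Euclidean distance between them with NumPy:

```python
import numpy as np

# Two illustrative 150 x 200 grayscale images (e.g., downscaled HAM10000 lesions)
rng = np.random.default_rng(0)
image_a = rng.integers(0, 256, size=(150, 200)).astype(float)
image_b = rng.integers(0, 256, size=(150, 200)).astype(float)

# Flatten each 2-dimensional image row-wise into a 1-dimensional vector of length 150 * 200 = 30,000
vector_a = image_a.flatten()
vector_b = image_b.flatten()

# Euclidean distance: square root of the sum of squared per-pixel differences
distance = np.sqrt(np.sum((vector_a - vector_b) ** 2))

# Equivalent, using NumPy's built-in norm
assert np.isclose(distance, np.linalg.norm(vector_a - vector_b))
print(distance)
```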

As for the value of k, it’s typically determined via experimentation (in machine learning lingo, this is called hyper-parameter tuning). We try different values of k, see which one provides the highest test accuracy, and deem that the best value of k. Later in the article, we’ll experiment with values of k when selecting our KNN classifier.

Similar to logistic regression, KNN is a useful first step and a great baseline model — it’s intuitive to understand, easy / fast to implement (especially with the scikit-learn library), and there’s plenty of customizability / hyper-parameters to tune (e.g., the value of k, the distance metric, etc.). As we iterate on our models, we will revisit this KNN classification model and see whether the additional complexity helps improve our results.

Data Preparation

As discussed in Downscaling Images in HAM10000, we will be using a downscaled form of the original image data to reduce memory usage and speed up runtime. The downscaled images will be grayscale, 150 pixels in height by 200 pixels in width.

Furthermore, many of the data preparation steps discussed in Logistic Regression on HAM10000 will be applicable (see that article for more context).

  • The image data is 2-dimensional spatial data — scikit-learn expects 1-dimensional vectors, so we’ll flatten the images row-wise (as shown below). Furthermore, the flattened vector representation is needed for calculating the Euclidean distance metric.
Created by Nabib Ahmed
  • The scikit-learn library expects the labels to be numeric, so we’ll convert each diagnosis to a number. The scheme will be 0 for akiec, 1 for bcc, 2 for bkl, 3 for df, 4 for mel, 5 for nv, and 6 for vasc.
  • For evaluation, we need some test data that the model hasn’t seen — thus we’ll use an 80 / 20 train-test split. (A sketch of these preparation steps follows below.)
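Putting these steps together, here is a minimal sketch of the preparation pipeline. The variable names and the placeholder data are illustrative — in the actual pipeline, the arrays would hold the downscaled HAM10000 images and their diagnoses:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration -- the real arrays hold the downscaled
# grayscale HAM10000 images and their diagnosis strings
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 150, 200))
diagnoses = rng.choice(["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"], size=100)

# 1. Flatten each 150 x 200 image row-wise into a 30,000-dimensional vector
X = images.reshape(len(images), -1)

# 2. Encode each diagnosis string as a number
label_map = {"akiec": 0, "bcc": 1, "bkl": 2, "df": 3, "mel": 4, "nv": 5, "vasc": 6}
y = np.array([label_map[d] for d in diagnoses])

# 3. Hold out 20% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```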

Model Results & Evaluation

Using scikit-learn’s default value of k = 5, we achieve an accuracy of ~68% — this is very similar to the accuracy we achieved from our logistic regression model. As we uncovered from our logistic regression analysis, this ~68% overall accuracy can be very deceptive since we have such an imbalanced dataset. Let’s investigate the confusion matrix for our KNN classifier with k = 5:

Created by Nabib Ahmed

We see a very similar story to our logistic regression model. The nv column has the highest number of both correct and incorrect classifications, which shows that our model is comparable to a nonsense classifier that always diagnoses nv (which would achieve ~67% accuracy).
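For reference, here’s a minimal sketch of how this baseline could be fit and evaluated with scikit-learn, reusing the X_train / X_test / y_train / y_test arrays from the preparation sketch above (an illustrative reconstruction, not the original notebook code):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Fit a KNN classifier with scikit-learn's defaults (k = 5, Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Overall accuracy on the held-out test set
y_pred = knn.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

# Confusion matrix: rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_pred))
```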

Let’s experiment and see if there’s a value of k that makes the KNN classifier perform much better than the nonsense, always-nv classifier. The range of values k can take is 1 up to roughly ~8,000 (our training set has ~8,000 images, and a k larger than the training set doesn’t change performance). Trying every value of k is redundant and a waste of time — typically, we don’t expect a major change in performance between k and k + 1. Thus, to save computation time, we’ll work in intervals.

Furthermore, it doesn’t make sense to try very large values of k because, after a certain point, every input starts to get classified with the same label. Because nv is so over-represented, a very large k will sweep in many nv neighbors. Take the extreme k value of 8,000 — at that point, every image is labeled with the dominant category in the training dataset, which is nv. Thus, we’ll experiment more with lower values of k and space out our interval towards higher values. If a particular range of k demonstrates an interesting pattern, we can zoom in and try more values within that range. Given these decisions, we’ll use a custom set of k values with smaller gaps between small values and larger gaps between large values — a sketch of this sweep is shown below.
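Here is a sketch of such a sweep, again reusing the train/test split from earlier. The specific candidate values of k below are illustrative, not the exact grid used in the experiment (note that scikit-learn requires k to be no larger than the number of training samples):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid: dense for small k, increasingly sparse for large k
candidate_ks = [1, 3, 5, 10, 15, 25, 50, 100, 250, 500, 1000, 2500, 5000]

accuracies = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))

# Plot overall test accuracy against k
plt.plot(candidate_ks, accuracies, marker="o")
plt.xscale("log")
plt.xlabel("k (number of neighbors)")
plt.ylabel("Test accuracy")
plt.show()
```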

All of this planning is to save time — with modeling, there are many rabbit holes and experiments one can pursue, but our time is finite, so it’s an important metric to consider when designing experiments and weighing trade-offs. Furthermore, time can literally cost money — in this context, where we’re using a local machine or Google Colab, compute time is essentially free (if we discard the cost of power, Internet, cloud storage, etc.). However, that’s not always the case — we could be using a cluster or cloud solution that charges based on compute time (e.g., AWS). In those instances, decisions like the one above are crucial for saving costs.

Experimenting with various values of k and tracking the overall test accuracy, we get the following plot:

Created by Nabib Ahmed

The plot shows an essentially flat line at around ~68% accuracy across the k values we experimented with (there’s a minuscule increase to ~69% between k = 10 and k = 15, but it doesn’t appear significant given how close it is to the accuracy at other k values). This indicates that our KNN classifier is largely unaffected by the value of k and that it isn’t much better than the nonsense, always-nv classifier. Since KNN works by comparing nearby neighbors and the nv class is so over-represented, the nv class always seems to dominate regardless of the k value used. Thus, a potential next step would be to address the imbalance and see how the KNN model performs.

The results for this KNN model can be found on this Google Colab. The full code is also provided below (note: it’s intended for a Google Colab environment and might not work on a local machine).

Created by Nabib Ahmed

Conclusion

In this article, we implemented a simple baseline model with K-Nearest Neighbors (KNN) classification. We discussed the mechanism and motivation for the model (it assigns labels based on the majority label of the k nearest images using Euclidean distance), the necessary data preparation (flattening our images, encoding our labels as numbers, and performing a train-test split), and the evaluation of our results. The model achieved an overall accuracy of ~68%, but upon investigating the confusion matrix and label distribution, its performance is not much better than a nonsense classifier that always selects the dominant label; furthermore, hyper-parameter tuning over potential values of k produced no change in performance, indicating the model’s inability to overcome the dataset’s imbalance.
