Beginner’s Guide to K-Nearest Neighbors & Pipelines in Classification
K-nearest neighbors (KNN) is a simple machine learning algorithm used for both classification and regression problems. KNN belongs to the supervised learning domain of machine learning, which strives to find patterns that accurately map inputs to outputs based on known ground truths. KNN is a non-parametric algorithm, meaning we do not make assumptions about the distribution of our data. An example of such an assumption, often seen in linear regression, is assuming the errors (residuals) are approximately normally distributed. Moving on, as I have hit my quota for saying the word ‘assumption’ today.
The K-nearest neighbors algorithm uses distance metrics to find the k observations in the training data that are closest to a new observation, then classifies that observation by majority vote among those neighbors. Observations that sit near each other in feature space therefore tend to be assigned to the same group.
KNN typically uses either Euclidean or Manhattan distance as the distance measure, but other distance metrics (e.g. Minkowski, which generalizes both) are also used. If you remember the Pythagorean Theorem, the theorem used to calculate the distance between two points, this is the equation actually used in calculating Euclidean distance! Ahhh, high school geometry comes full circle.
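To make the two metrics concrete, here is a quick sketch (the two points are made up for illustration):

```python
import numpy as np

# two made-up observations with two features each
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance, i.e. the Pythagorean theorem
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(3^2 + 4^2) = 5.0

# Manhattan distance: sum of the absolute differences along each axis
manhattan = np.sum(np.abs(a - b))  # 3 + 4 = 7.0

print(euclidean, manhattan)
```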
Let’s apply all of these concepts to an actual data set, possum.csv. I obtained this data set from openintro.org at the following link: https://www.openintro.org/data/index.php?data=possum.
In this data set, we are given 7 features about two different populations of mountain brushtail possum (Trichosurus cunninghami) in Australia. My question to you is this: can we accurately classify if a mountain brushtail possum is from the Victoria population or the New South Wales/Queensland population using physical characteristics and the K-nearest neighbors algorithm? Let’s find out!
After reading the .csv file into our Jupyter notebook, we are ready to start coding our KNN model! Here is a quick look at the data we are working with.
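If you want to follow along, that first look boils down to something like this. The rows below are made up purely for illustration; in the notebook you would read the real file with pd.read_csv('possum.csv'):

```python
import io

import pandas as pd

# made-up rows in the same shape as possum.csv, for illustration only;
# in the notebook this would be: possum = pd.read_csv('possum.csv')
csv_text = """site,pop,sex,age,head_l,skull_w,total_l,tail_l
1,Vic,m,8.0,94.1,60.4,89.0,36.0
1,Vic,f,6.0,92.5,57.6,91.5,36.5
7,other,f,,94.0,60.0,85.0,36.5
"""
possum = pd.read_csv(io.StringIO(csv_text))

print(possum.head())        # a quick look at the first rows
print(possum.isna().sum())  # age is the only column with nulls
```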
As you can see we have several columns/features:
- site — The site number where the possum was trapped.
- pop — Population, either Vic (Victoria) or other (New South Wales or Queensland). The target feature!
- sex — Sex, either m (male) or f (female).
- age — Age. Contains 2 null values.
- head_l — Head length, in mm.
- skull_w — Skull width, in mm.
- total_l — Total length, in cm.
- tail_l — Tail length, in cm.
We want to classify our possums as being from either the Victoria population or the ‘other’ (New South Wales or Queensland) population. So, the pop column is going to be our target variable while the other features, except for site, are going to be our predictor variables. In this case, as we are trying to classify the two populations of possums based on morphological differences alone, we don’t want the location the possums were captured to be used in the model. Therefore, we are going to drop the site and pop columns when we set up the X (predictor variable) object. The y object is our target variable, which consists of the two classes of the pop column (Victoria and ‘other’).
#set up the X and y variables for modeling
X = possum.drop(columns=['pop','site'])
y = possum['pop']
After setting up our X and y variables, we are going to perform a train/test split on the data. A train/test split will give us a training set to use for model creation and a holdout or testing set to score our model on.
from sklearn.model_selection import train_test_split

#by default 75% of the observations will be in the training set
#by default 25% of the observations will be in the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y)
Next, we want to transform our training data to get it ready for modeling. The transformers we want to use are OneHotEncoder for the categorical variables, SimpleImputer for any null values, and StandardScaler. When using the K-nearest neighbors algorithm it is important to standard scale the data, because the model is built on distance metrics. If a feature has a larger scale than the others, it will dominate the distance calculations when modeling. Therefore, you want all predictor features to be on the same numerical scale.
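To see what StandardScaler actually does, here is a toy example (with made-up measurements) of columns on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up measurements: first column in mm, second in cm, so raw
# differences in the first column would swamp any distance metric
X_toy = np.array([[850.0, 36.0],
                  [905.0, 36.5],
                  [915.0, 39.0]])

scaled = StandardScaler().fit_transform(X_toy)

# after scaling, every column has mean 0 and standard deviation 1,
# so each feature contributes comparably to the distances
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```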
As we want to apply transformations to specific subsets of columns from the data set, we are going to pass lists of column names into the make_column_transformer function. Below, we grab the columns to pass to the different transformers. The null_col variable holds the only column in the possum data that contains any null values, the age column; we will pass null_col to SimpleImputer. The cat_cols variable holds the only categorical variable in our data, the sex of the possum; we will pass cat_cols to OneHotEncoder, as we only want numerical values going into the KNN algorithm.
#grab columns of different dtypes for our different transformers
null_col = ['age']
cat_cols = ['sex']
Next, we will actually pass these variables to make_column_transformer. If we don't specify the argument remainder='passthrough', then all columns we aren't transforming will be dropped from the output. Clearly, that is not ideal; we want all of our predictor features to reach the KNN algorithm at the end of the pipeline.
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

#column transformer with null_col and cat_cols passed as arguments
possum_col_transformer = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    (SimpleImputer(strategy='mean'), null_col),
    remainder='passthrough')
Here, we will create our pipeline. A pipeline is a way to automate the machine learning workflow by allowing preprocessing of the data and instantiation of the estimator to occur in a single piece of code. We can easily create a pipeline in Python using sklearn’s make_pipeline function.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#pipeline with preprocessing transformers first & estimator last
possum_pipeline = make_pipeline(possum_col_transformer,
                                StandardScaler(),
                                KNeighborsClassifier())
Once the pipeline has been constructed, we will fit it to our training data in order to train our machine learning algorithm! In our possum_pipeline we will be training sklearn's KNeighborsClassifier (the KNN algorithm for classification problems) on our training data, X_train and y_train. To train the KNN algorithm, we call the fit method on the pipeline.
#fit the pipeline to the training data
possum_pipeline.fit(X_train, y_train)
After the pipeline is fit to the training data, we will get a machine learning model as the output! You guys! We did a cool thing! We have in our hands our very own machine learning model!
But the real question is: how well will our KNN classifier score? Is our KNN model able to accurately classify between the Victoria population of possums and the New South Wales/Queensland population of possums? Let’s find out!
In order to see the predictive power and accuracy of our model, we will test it on data it has never seen before: the testing set, X_test and y_test. If you remember, we created X_test and y_test when we called the train_test_split function earlier.
#score the knn model on the testing data
possum_pipeline.score(X_test, y_test)
Check that score out! Our default KNN model was able to predict, using morphological characteristics only, which population each possum came from with 73% accuracy! Look at us. Learning stuff and accurately classifying Australian populations of possums. Not bad for a first model, huh? Do you think we could make our model even better?
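One common next step is tuning the number of neighbors (n_neighbors, which defaults to 5 in sklearn). Here is a sketch using GridSearchCV on toy data standing in for the possum features; with the real data you would pass possum_pipeline and a matching parameter grid instead:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data standing in for the possum features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name,
# so the grid key is 'kneighborsclassifier__n_neighbors'
param_grid = {'kneighborsclassifier__n_neighbors': [3, 5, 7, 9, 11]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)  # the k that cross-validated best
print(search.best_score_)   # its mean cross-validated accuracy
```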