Beginner’s Guide to K-Nearest Neighbors & Pipelines in Classification

Jennifer Boyles

K-nearest neighbors (KNN) is a basic machine learning algorithm that is used in both classification and regression problems. KNN belongs to the supervised learning domain of machine learning, which strives to find patterns that accurately map inputs to outputs based on known ground truths. KNN is a non-parametric algorithm, meaning we do not make assumptions about the distribution of our data. An example of such an assumption, often seen in linear regression, is assuming the model's errors are approximately normally distributed. Moving on, as I have hit my quota for saying the word ‘assumption’ today.

The K-nearest neighbors algorithm uses distance metrics to compare observations in feature space: a new observation is assigned to whichever class the majority of its k nearest neighbors belong to.

KNN typically uses either Euclidean or Manhattan distance as the distance measure, though other metrics (e.g. Minkowski, which generalizes both) can also be used. If you remember the Pythagorean Theorem, the theorem used to calculate the straight-line distance between two points, that is exactly the equation behind Euclidean distance! Ahhh, high school geometry comes full circle.
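To make the metrics concrete, here is a quick sketch (using NumPy and two made-up points, not the possum data) of how each distance is computed:

import numpy as np

# two observations, each with two features
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance, i.e. the Pythagorean Theorem
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(3**2 + 4**2) = 5.0

# Manhattan distance: sum of absolute differences along each axis
manhattan = np.sum(np.abs(a - b))          # 3 + 4 = 7.0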

Let’s apply all of these concepts to an actual data set, possum.csv. I obtained this data set from openintro.org at the following link: https://www.openintro.org/data/index.php?data=possum.

In this data set, we are given 7 features about two different populations of mountain brushtail possum (Trichosurus cunninghami) in Australia. My question to you is this: can we accurately classify if a mountain brushtail possum is from the Victoria population or the New South Wales/Queensland population using physical characteristics and the K-nearest neighbors algorithm? Let’s find out!

After reading the .csv file into our Jupyter notebook, we are ready to start coding our KNN model! Here is a quick look at the data we are working with.
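In case you're following along, the read-in step might look something like this (assuming the file is saved locally as possum.csv):

import pandas as pd

# read the possum data set into a DataFrame
possum = pd.read_csv('possum.csv')

# peek at the first 2 rows
possum.head(2)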

The first 2 rows of the possum.csv DataFrame

As you can see, we have several columns/features:

  • site — The site number where the possum was trapped.
  • pop — Population, either Vic (Victoria) or other (New South Wales or Queensland). The target feature!
  • sex — Gender, either m (male) or f (female).
  • age — Age. Contains 2 null values.
  • head_l — Head length, in mm.
  • skull_w — Skull width, in mm.
  • total_l — Total length, in cm.
  • tail_l — Tail length, in cm.

We want to classify our possums as being from either the Victoria population or the ‘other’ (New South Wales or Queensland) population. So, the pop column is going to be our target variable while the other features, except for site, are going to be our predictor variables. In this case, as we are trying to classify the two populations of possums based on morphological differences alone, we don’t want the location the possums were captured to be used in the model. Therefore, we are going to drop the site and pop columns when we set up the X (predictor variable) object. The y object is our target variable, which consists of the two classes of the pop column (Victoria and ‘other’).

#set up the X and y variables for modeling
X = possum.drop(columns=['pop','site'])
y = possum['pop']

After setting up our X and y variables, we are going to perform a train/test split on the data. A train/test split will give us a training set to use for model creation and a holdout or testing set to score our model on.

from sklearn.model_selection import train_test_split

# by default, 75% of the observations will be in the training set
# and 25% of the observations will be in the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y)

Next, we want to transform our training data to get it ready for modeling. The transformers we want to use are OneHotEncoder for the categorical variables, SimpleImputer for any null values, and StandardScaler. When using the K-nearest neighbors algorithm it is important to standard scale the data, as you are modeling with distance metrics. If a feature has a larger scale than the others, it will dominate the distance calculations when modeling. Therefore, you want all predictor features to be on the same numerical scale.
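To see why this matters, here is a toy illustration (made-up numbers, not the possum data) of how an unscaled feature swamps the distance calculation:

import numpy as np
from sklearn.preprocessing import StandardScaler

# feature 1 lives in the thousands, feature 2 in single digits
X_toy = np.array([[1000.0, 1.0],
                  [2000.0, 2.0],
                  [3000.0, 3.0]])

# unscaled: the distance between rows is driven almost entirely by feature 1
print(np.linalg.norm(X_toy[0] - X_toy[1]))  # ~1000.0

# scaled: both features now contribute equally to the distance
X_scaled = StandardScaler().fit_transform(X_toy)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ~1.73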

As we want to apply different transformations to different subsets of columns, we are going to pass lists of column names into the make_column_transformer function. Below, we grab the columns to pass to the different transformers. The null_col variable grabs the only column in the possum data that contains any null values, the age column. We will pass null_col to SimpleImputer. The cat_cols variable contains the only categorical variable in our data, the sex of the possum. We will pass cat_cols to OneHotEncoder, as we only want numerical values to be passed into the KNN algorithm.

#grab columns of different dtypes for our different transformers
null_col = ['age']
cat_cols = ['sex']

Next, we will actually pass these variables to make_column_transformer. When using make_column_transformer, if we don't specify the argument remainder='passthrough', then all columns we aren't transforming will be dropped from the output. Clearly, that is not ideal. We want all of our predictor features to be used in the KNN algorithm at the end of the pipeline.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# column transformer with cat_cols and null_col passed as arguments
possum_col_transformer = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    (SimpleImputer(strategy='mean'), null_col),
    remainder='passthrough')

Here, we will create our pipeline. A pipeline is a way to automate the machine learning workflow by allowing preprocessing of the data and instantiation of the estimator to occur in a single piece of code. We can easily create a pipeline in Python using sklearn’s make_pipeline function.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# pipeline with preprocessing transformers first & estimator last
possum_pipeline = make_pipeline(possum_col_transformer,
                                StandardScaler(),
                                KNeighborsClassifier())

Once the pipeline has been constructed, we will fit it to our training data in order to train our machine learning algorithm! In our possum_pipeline we will be training sklearn's KNeighborsClassifier (a KNN implementation for classification problems) using our training data, X_train and y_train. To train the KNN algorithm, we call the fit method on the pipeline.

#fit the pipeline to the training data
possum_pipeline.fit(X_train,y_train)

After the pipeline is fit to the training data, we get a trained machine learning model as the output! You guys! We did a cool thing! We have in our hands our very own machine learning model!

But the real question is: how well will our KNN classifier score? Is our KNN model able to accurately classify between the Victoria population of possums and the New South Wales/Queensland population of possums? Let’s find out!

In order to see the predictive power and accuracy of our model, we will test it on data it has never seen before: the testing set, X_test and y_test. If you remember, we created X_test and y_test when we called the train_test_split function a few code blocks earlier.

#score the knn model on the testing data
possum_pipeline.score(X_test,y_test)
0.7307692307692307

Check that score out! Our default KNN model was able to predict which population possums were from, using morphological characteristics alone, with 73% accuracy! Look at us. Learning stuff and being able to accurately classify Australian populations of possums. Not bad for a first model, huh? Do you think we could make our model even better?
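One natural next step is tuning the number of neighbors, which KNeighborsClassifier sets to 5 by default. Here is a sketch (not run against the possum data, so your results may vary) that wraps our pipeline in sklearn's GridSearchCV. Note that kneighborsclassifier__n_neighbors is the parameter name make_pipeline generates for the classifier's n_neighbors argument:

from sklearn.model_selection import GridSearchCV

# try odd values of k from 1 to 19 with 5-fold cross-validation
params = {'kneighborsclassifier__n_neighbors': range(1, 20, 2)}
grid = GridSearchCV(possum_pipeline, param_grid=params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # the best value of k found
print(grid.score(X_test, y_test))  # accuracy of the tuned model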
