Machine learning for the layperson: pt3

K-nearest neighbors, the easy one

Science-y

In the last installment, ML for the layperson: pt2, we covered linear regression as well as the core concepts behind most machine learning models. While linear regression isn’t exactly the most “inspired” or sophisticated of machine learning methods, there were quite a few concepts to cover and a decent amount of math. We’re going to need all of those concepts later, but for now, we’re taking a break. With that in mind, we’re covering k-nearest neighbors for this installment.

the og

Just as we did in part two, the first thing we do is choose our features and treat those like they are values for position in space.

For a quick refresher, features are just the characteristics we decide are important. I might say a house’s features are the number of bathrooms and its size in square feet. Then, if I had a house with 3 bathrooms and 10,000 square feet, I might say it sits three units to the right (x = 3) and 10,000 units up (y = 10,000). If we had four features (or 5, 6, 7…), that’s okay too. We don’t need to be able to picture them in four-dimensional space. The math will take care of itself.
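To make that concrete, here’s a tiny sketch in Python. The numbers come from the house example above; the extra features on the second house are made up purely for illustration.

```python
# A house described by two features: (bathrooms, square feet).
# To the computer, this is just a point in space: x = 3, y = 10000.
house = (3, 10000)

# More features just mean more coordinates: a point in four-dimensional space.
# We can't picture it, but the math works exactly the same way.
bigger_house = (3, 10000, 2, 1998)  # bathrooms, square feet, floors, year built
```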

Let’s say we are trying to teach a computer to tell the difference between a cat and a dog using weight and height as our features. Dogs are generally bigger, so you can imagine that they will tend to be toward the upper right, while cats will tend toward the lower left.

2 features because 2d is easy to visualize. Could be 30. Computer doesn’t care

And then we introduce a mystery animal with a weight of 31 lbs and a height of 32 inches.

We can use k-nearest neighbors to try to decide what this thing is.
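If it helps to see that setup spelled out, here’s roughly how it might look in Python. The cat and dog measurements below are invented for illustration, just like the animals in the example; only the mystery animal’s numbers come from above.

```python
# Each known animal is a point (weight in lbs, height in inches) plus a label.
# All of these measurements are made up for illustration.
animals = [
    ((8, 9), "cat"),
    ((10, 10), "cat"),
    ((12, 11), "cat"),
    ((35, 30), "dog"),
    ((50, 26), "dog"),
    ((70, 28), "dog"),
]

# The mystery animal: weight 31 lbs, height 32 inches.
mystery = (31, 32)
```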

All we need to do is have the computer “ask” the k nearest things what they are and use that to make a judgment call. But what does that “k” mean? k is just an arbitrary number for us to choose. It could be just the one nearest neighbor:

Of the 1-nearest neighbors, 1 is a dog and 0 are cats. Therefore, I think this is a dog

the two nearest neighbors:

Of the 2-nearest neighbors, 1 is a dog and 1 is a cat. Therefore, it’s a tie and I don’t know what this is

the four nearest neighbors:

Okay, that other dog is probably closer, but you get the idea

Of the 4-nearest neighbors, 3 are dogs and 1 is a cat. Therefore, I think this is a dog

or the ten nearest neighbors:

What a mess

Of the 10-nearest neighbors, 5 are dogs and 5 are cats. Therefore, it’s a tie and I don’t know what this is

Our computer just reaches out to however many neighbors we tell it to, from the closest neighbor up to the furthest, and asks them “what are you?” Then it makes a choice: “Well, most of the neighbors are (blank); therefore, this must be (blank) as well.” As you may have guessed, the number we choose for k matters. If k is too small, we may be too confident in our first inclinations. Too large, and we may consider too many things and become wishy-washy. This is actually a topic in its own right (bias vs. variance), to be covered soon.
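If you’d like to see the whole procedure in code, here’s a minimal Python sketch: measure the distance from the mystery animal to every known animal, keep the k closest, and take a majority vote. The training data is the same made-up set from the sketch above, repeated here so the snippet runs on its own. One simplification: a tie just goes to whichever label the counter happens to list first, whereas the examples above treated a tie as “I don’t know.”

```python
import math
from collections import Counter

# Made-up training examples: ((weight in lbs, height in inches), label).
animals = [
    ((8, 9), "cat"),
    ((10, 10), "cat"),
    ((12, 11), "cat"),
    ((35, 30), "dog"),
    ((50, 26), "dog"),
    ((70, 28), "dog"),
]

def distance(a, b):
    # Straight-line distance between two points, in any number of dimensions.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(point, examples, k):
    # Sort the known examples by how close they are to our mystery point...
    nearest = sorted(examples, key=lambda ex: distance(point, ex[0]))[:k]
    # ...then let the k closest ones vote on a label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

mystery = (31, 32)  # weight: 31 lbs, height: 32 inches
print(knn_classify(mystery, animals, k=3))  # "dog" for this made-up data
```

With this toy data, k = 3 votes “dog” while k = 5 votes “cat”, which is exactly the “k matters” point from above.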

Even though this seems super simple, it actually qualifies as machine learning. Our program was never told anything about what makes a dog a dog or what makes a cat a cat. It just compared a new animal to some examples and used that comparison to make a guess. It is completely dependent on the examples it’s given. Our example had cats and dogs nicely separated because of the features we chose (and the animals I conveniently made up), but it wouldn’t do so well if the animals were scattered randomly in space according to less useful features.

In terms of the math we need behind the scenes to pull this off, all we need to know is the distance formula. In high school, you probably saw it written like this:

distance = √((x₂ − x₁)² + (y₂ − y₁)²)

HS algebra called, it wants its formula back

This gives you the distance in two dimensions. It basically just says “take the difference of the X values and square it, do the same for the Y values, add them together, and take the square root.” This is the two-dimensional version, but it can be generalized to any number of dimensions. Here’s what it looks like in 3D:

distance = √((x₂ − x₁)² + (y₂ − y₁)² + (z₂ − z₁)²)

pretty familiar

As you can see, it’s pretty much the same formula. We’re just doing the same thing with the Z dimension. We can get even more general with this version:

distance(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)

more general

This just says “for every dimension we have: square the difference and add it to the running total. When you’re done, take the square root.”
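In code, that generalized formula is only a couple of lines. A rough Python version might look like this (the function name is just something I picked):

```python
import math

def distance(a, b):
    # For every dimension: square the difference and add it to the running total,
    # then take the square root when you're done.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance((0, 0), (3, 4)))        # 5.0, the classic 3-4-5 right triangle
print(distance((1, 2, 3), (4, 6, 3)))  # 5.0, same formula in three dimensions
```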

That’s how we knew which neighbors to choose in our example. For 2-nearest neighbors, we considered the two neighbors shown with the red lines because those were the two shortest lines we could have drawn from our mystery animal to any other animal.

And that’s all there is to it. As long as we can take the distance between two points, we can represent just about anything as points in space and use those distances to classify things, like we did with the cats and dogs.

If you enjoyed this post or have ideas to improve it, let me know! I like to write about things I think are relevant to my path as a growing developer so expect general ramblings on ~self improvement~ and ~Computer Science~ wooOOoo

info/contact at camwhite.io
