Machine learning for the layperson: pt3
K-nearest neighbors, the easy one
In the last installment, ML for the layperson: pt2, we covered linear regression as well as the core concepts behind most machine learning models. While linear regression isn’t exactly the most “inspired” or sophisticated of machine learning methods, there were quite a few concepts to cover, with a decent amount of math. We’re going to need all those concepts later, but for now we’re taking a break. With that in mind, this installment covers k-nearest neighbors.
Just as we did in part two, the first thing we do is choose our features and treat those like they are values for position in space.
For a quick refresher, features are just the characteristics we decide are important. I might say a house’s features are the number of bathrooms and its size in square feet. Then, if I had a house with 3 bathrooms and 10,000 square feet, I might say it is three units to the right (x=3) and 10,000 units up (y=10,000). If we had four features (or 5, 6, 7...), that’s okay too. We don’t need to be able to picture them in four-dimensional space. The math will take care of itself.
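If it helps to see that in code, here is a minimal sketch of turning a thing into a point. The `to_point` helper and the field names are made up for this example; the only idea being shown is that each feature becomes one coordinate.

```python
def to_point(record, feature_names):
    """Turn a record into a point, one coordinate per chosen feature."""
    return tuple(record[name] for name in feature_names)

# Hypothetical house, using the numbers from the paragraph above.
house = {"bathrooms": 3, "square_feet": 10_000, "color": "blue"}

# Two features give us a point in 2D: (x, y).
print(to_point(house, ["bathrooms", "square_feet"]))  # (3, 10000)
```

Adding a third or fourth feature name just makes the tuple longer. Nothing else has to change.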
Let’s say we are trying to teach a computer to tell the difference between a cat and a dog using weight and height as our features. Dogs are generally bigger, so you can imagine that they will tend to be towards the upper right, while cats will tend to be towards the lower left.
Then we introduce a mystery animal that weighs 31 lbs and stands 32 inches tall.
We can use k-nearest neighbors to try to decide what this thing is.
All we need to do is have the computer “ask” the k nearest things what they are and use that to make a judgment call. But what does that “k” mean? k is just a number we get to choose. It could be the single nearest neighbor:
Of the 1-nearest neighbors, 1 is a dog and 0 are cats. Therefore, I think this is a dog
the nearest two neighbors:
Of the 2-nearest neighbors, 1 is a dog and 1 is a cat. Therefore, it’s a tie and I don’t know what this is
the four nearest neighbors:
Of the 4-nearest neighbors, 3 are dogs and 1 is a cat. Therefore, I think this is a dog
or the ten nearest neighbors:
Of the 10-nearest neighbors, 5 are dogs and 5 are cats. Therefore, it’s a tie and I don’t know what this is
Our computer just reaches out to however many neighbors we tell it to, from the closest neighbor up to the furthest, and asks them “what are you?” Then it makes a choice. “Well, most of the neighbors are (blank), therefore, this must be (blank) as well.” As you may have guessed, the number we choose for k matters. If k is too small, we may be too confident in our first inclinations. Too large and we may consider too many things and become wishy-washy. This is actually a topic in its own right (bias vs. variance) to be covered soon.
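The voting step can be sketched in a few lines of Python. Treat this as a rough sketch: the `vote` helper is my own name for it, and the list of neighbors is made up, ordered from closest to furthest so that the counts match the examples above.

```python
from collections import Counter

def vote(labels_by_distance, k):
    """Majority vote among the k closest labels. Returns None on a tie."""
    counts = Counter(labels_by_distance[:k]).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two labels have the same count: "I don't know"
    return counts[0][0]

# Hypothetical neighbors of the mystery animal, closest first.
neighbors = ["dog", "cat", "dog", "dog", "cat",
             "cat", "dog", "cat", "cat", "dog"]

for k in (1, 2, 4, 10):
    print(k, vote(neighbors, k))
```

With this ordering, k=1 and k=4 both say “dog”, while k=2 and k=10 come back as ties. One common way to dodge ties when there are only two kinds of animal is to pick an odd k.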
Even though this seems super simple, it actually qualifies as machine learning. Our program was never told anything about what makes a dog a dog or what makes a cat a cat. It just compared a new animal to some examples and used that to make a guess. It is completely dependent on the examples it’s given. Our example had cats and dogs nicely separated because of the features we chose (and the animals I conveniently made up), but it wouldn’t do so well if the animals were scattered randomly in space according to less useful features.
In terms of the math we need behind the scenes to pull this off, all we need to know is the distance formula. In high school, you probably saw it written like this:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
This gives you the distance in two dimensions. It basically just says “take the difference of the X values and square it, do the same to the Y values, add them together and take the square root.” This is the two-dimensional version, but it can be generalized to any number of dimensions. Here’s what it looks like in 3D:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
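As you can see, it’s pretty much the same formula. We’re just doing the same thing with the Z dimension. We can get even more general with this function, where p and q are the two points and n is the number of dimensions:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$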
This just says “For every dimension we have: square the difference and add it to the others. When you’re done, take the square root.”
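That sentence translates almost word for word into Python. Here is a minimal sketch; the function name `distance` is just my choice, and points are assumed to be plain tuples of numbers.

```python
import math

def distance(p, q):
    """Straight-line distance between two points with any number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # For every dimension: square the difference and add it to the others.
    total = sum((a - b) ** 2 for a, b in zip(p, q))
    # When we're done, take the square root.
    return math.sqrt(total)

print(distance((0, 0), (3, 4)))        # 2D: 5.0
print(distance((0, 0, 0), (2, 3, 6)))  # 3D: 7.0
```

The same function handles 2, 3, or 50 dimensions, which is why we never need to picture them.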
That’s how we knew which neighbors to choose in our example. For 2-nearest neighbors, we considered the two neighbors shown with the red lines because those were the two shortest lines we could have drawn from our mystery animal to any other animal.
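Putting the pieces together, here is a rough sketch of picking the nearest neighbors and letting them vote. The animals below are numbers I made up for this snippet (they are not the ones in the pictures), and `k_nearest` and `classify` are hypothetical helper names. `math.dist` is Python’s built-in version of the distance formula (Python 3.8 and later).

```python
import math
from collections import Counter

# Made-up examples: (weight in lbs, height in inches, label).
ANIMALS = [
    (8, 9, "cat"), (10, 10, "cat"), (12, 11, "cat"), (9, 8, "cat"),
    (30, 24, "dog"), (50, 26, "dog"), (65, 28, "dog"), (40, 22, "dog"),
]

def k_nearest(point, examples, k):
    """Return the k examples closest to point, nearest first."""
    by_distance = sorted(examples, key=lambda ex: math.dist(point, ex[:2]))
    return by_distance[:k]

def classify(point, examples, k):
    """Guess a label from the most common label among the k nearest examples."""
    labels = [label for _, _, label in k_nearest(point, examples, k)]
    return Counter(labels).most_common(1)[0][0]

mystery = (31, 32)  # weight: 31 lbs, height: 32 inches
print(k_nearest(mystery, ANIMALS, 2))
print(classify(mystery, ANIMALS, k=3))  # dog
```

Sorting every example by distance is the simplest approach and is fine for a handful of animals. This sketch does not treat ties specially, so an odd k is the safer choice here.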
And that’s all there is. Once we can take the distance between two points, we can represent just about anything as points in space and use those distances to classify things like we did with the cats and dogs.
If you enjoyed this post or have ideas to improve it, let me know! I like to write about things I think are relevant to my path as a growing developer so expect general ramblings on ~self improvement~ and ~Computer Science~ wooOOoo
info/contact at camwhite.io