Machine learning for the layperson: pt2
Foundations And Linear Regression
As I outlined in part one, anyone can (and should) have a general understanding of what machine learning is. In this article, we’re going to define machine learning in general as well as explore a type of machine learning called supervised learning, using Linear Regression as an example.
Machine Learning is really just three things:
- Something we’d like to do: T (for Task)
- A way to measure how we did: P (for Performance)
- And instances of something to learn from: E (for Experience)
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” — Tom M. Mitchell
Many machine learning problems can be described as Supervised (there are other kinds but we’ll stick to this for now). In supervised learning, we give our program a bunch of questions along with the correct answers. The idea is that, after training our program on those example questions and their right answers, we will be able to give it questions on their own and it will be able to produce good answers for us.
Admittedly, Linear Regression isn’t the most interesting method of machine learning (the graphing calculator you had in high school could probably do it) but it’s pretty easy to reason about and the concepts will be good for understanding the cool stuff later. First, we will represent all of our data as points in space. Then we model our data with a line, saying “All these data points? There’s some underlying line to them and, if we can find it, we can use it to make predictions.” The line is our Model. What we’re trying to do is find the version of the line that’s closest to all of the points in our training set.
If you click ‘result’ below, you can see how we can fit a line to a data set. Move the sliders around to try to get the line close to all of the red dots.
This is a bummer on mobile, sorry. I’d skip it if you’re on your phone
As you can see from playing with the sliders, those ‘m’ and ‘b’ values are going to make or break how close the line is to the data points and, therefore, our predictions. ‘m’ and ‘b’ are called the parameters of our model and they are what we are going to be learning.
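If it helps to see the model as code, here is a minimal sketch. The function name `predict` and the numbers are my own placeholders for illustration; the only thing taken from the post is the line ‘y = mx + b’.

```python
def predict(x, m, b):
    """Return the line's y value at x, for slope m and intercept b."""
    return m * x + b


# Two different parameter choices give two different lines,
# and therefore two different predictions for the same input.
print(predict(3.0, m=2.0, b=1.0))  # 7.0
print(predict(3.0, m=0.5, b=4.0))  # 5.5
```

Moving the sliders is the same as changing the `m` and `b` arguments by hand.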
Because we’re using a line, this type of machine learning is best at answering ‘how much’ questions, where the answer comes from a continuous range of numbers, rather than ‘which one’ questions (we’ll discuss why when we cover other techniques that are better at ‘which one’ style questions).
Back to the three parts of machine learning (T,E,P)
First we are going to need our task: T. For this example, we’re going to be predicting the price of a car.
Second we’re going to have to provide some experience: E. This means we’ll need a data-set of cars and some stats.
An important step here is deciding on a Representation for our cars. How are we going to tell our algorithm about our cars? How is a ‘car’ defined/what makes one up? The things we decide make up a car are called Features. For example, I could reasonably decide that there are three features that make up a car: the retail price, its age in years, and the number of times it’s been to an auto shop. On the other hand, I could say that cars have one feature: how red they are on the color spectrum. It seems ridiculous but it’s a valid representation. It might even have some success if you had a set of convicted gang-members’ cars and you’d like to label them ‘blood’ or ‘crip’ or something. You get the idea.
For our representation, I’m just going to say that a car has one feature: speed. This is because I wanted to be able to draw speed vs price in 2d. In reality, choosing more features would likely be better but I’m going to keep it simple for illustration. If you go back to the demo above and play with the line again, you may notice the red dots are defined by the values I gave the cars.
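Here is one rough way that representation could look in code. The cars and their numbers are completely made up (arbitrary units, not the values from the demo), and the dictionary keys are just names I picked for this sketch.

```python
# Made-up cars in arbitrary units, purely for illustration.
# "speed" is the one feature; "price" is the correct answer we want to predict.
cars = [
    {"speed": 1.0, "price": 3.0},
    {"speed": 2.0, "price": 5.0},
    {"speed": 3.0, "price": 7.0},
]

# Each car becomes one point in 2d space: (speed, price).
points = [(car["speed"], car["price"]) for car in cars]
print(points)
```

A richer representation would simply add more keys per car, at the cost of no longer fitting on a 2d graph.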
Now that we have defined our Task and have some data for Experience, all we need is a way to measure performance: P. Then we have all three of the parts we mentioned earlier. We’re going to do this by defining a Cost Function to essentially tell us how badly we predict things for a given set of parameters. I’m gonna call it a badness function. That function is just this:
(‘what we predicted’ – ‘the correct answer’)² , added up for every example
We square it to make sure that all of the values are positive (or zero). We don’t want to subtract when we guess too low: missing by 5 too high (‘+5’) should be just as bad as missing by 5 too low (‘-5’). It also further punishes the larger errors.
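As a sketch, the badness function above could be written like this. The name `badness` and the three example cars are my own inventions for illustration (made-up numbers in arbitrary units); the formula itself is the squared error, added up for every example.

```python
def badness(m, b, speeds, prices):
    """Sum of squared prediction errors for the line price = m * speed + b."""
    total = 0.0
    for speed, price in zip(speeds, prices):
        predicted = m * speed + b
        total += (predicted - price) ** 2
    return total


# Made-up example cars (arbitrary units), not real data.
speeds = [1.0, 2.0, 3.0]
prices = [3.0, 5.0, 7.0]

print(badness(2.0, 1.0, speeds, prices))  # this line hits every point: 0.0
print(badness(0.0, 0.0, speeds, prices))  # always guessing 0: 9 + 25 + 49 = 83.0
```

The lower the number, the closer the line is to the points.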
Now that we have a function that maps from the parameters we choose for our model to how badly we did, we can find the lowest point of that function. We are ready to learn. If we find the lowest point on that graph we will have the parameters of the line that best predicts our training set on average. That means we can define a line that will (hopefully) give us good predictions on other data. That’s all learning is, finding the lowest our badness function can be.
As it turns out, this ‘badness function’ will always be convex. That will make it easier to find its lowest point. Being convex means that it will look like this:
note: I’m pretending ‘y=mx+b’ is just ‘y=mx’ for a second. Only having one parameter lets us graph the ‘badness function’ of our current parameter(s) in two dimensions rather than three
Notice how we could drop a ball anywhere on the graph and it would roll to the lowest point? That’s because it’s convex
Here’s what convex won’t look like:
We could get stuck at that red arrow if we dropped a ball there and never reach the lowest point.
In other words, it’s a bowl and it will always be a bowl. Why that’s the case is beyond the ‘layperson’ theme but I promise some really smart people proved it.
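You can get a feel for the bowl without any graph by trying a few values of ‘m’ and watching the badness go down and then back up. This sketch uses the simplified ‘y=mx’ model from the note above; the function name and the data are made up for illustration.

```python
def badness_one_param(m, speeds, prices):
    """Sum of squared errors for the simplified model price = m * speed."""
    return sum((m * s - p) ** 2 for s, p in zip(speeds, prices))


# Made-up cars that sit exactly on the line price = 2 * speed.
speeds = [1.0, 2.0, 3.0]
prices = [2.0, 4.0, 6.0]

for m in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(m, badness_one_param(m, speeds, prices))
```

For this data the badness values come out as 56, 14, 0, 14, 56: high on both sides and lowest in the middle, which is the bowl shape.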
Here’s what it will look like in 3d
This means that we can use Gradient Descent. This is just fancy for “Rolling down a hill”.
Here’s how that works:
- Pick a random spot in the bowl. (aka random parameters for our model)
- Figure out what direction points down
- Step that way
- Carry on figuring out directions and stepping over and over until you can’t go any further ‘downhill’
To decide what direction to step, we just take the derivative of our ‘badness function’ at the point we are at. This just tells us the slope at that point, which we can use to pick our new point in the bowl (aka our new parameters).
The new parameters are just:
‘current values’ –( ‘slope at current values in badness function’)*(step size)
Step size is just something we choose. Big numbers will mean a bigger change. Small numbers will mean a smaller one. We can talk about when this matters in a later post but, for now, ignore it.
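Putting the steps and the update rule together, a bare-bones version might look like the sketch below. Treat it as an illustration under some assumptions of mine: the function names, the made-up data, the step size of 0.01, and the 5000 steps are all my choices, and I start from m = 0, b = 0 instead of a random spot so the run is repeatable. The slope formulas are the standard derivatives of the squared-error badness function with respect to ‘m’ and ‘b’.

```python
def gradient(m, b, speeds, prices):
    """Slope of the badness function in the m direction and the b direction."""
    slope_m = 0.0
    slope_b = 0.0
    for s, p in zip(speeds, prices):
        error = (m * s + b) - p      # what we predicted minus the correct answer
        slope_m += 2 * error * s     # derivative of error**2 with respect to m
        slope_b += 2 * error         # derivative of error**2 with respect to b
    return slope_m, slope_b


def gradient_descent(speeds, prices, step_size=0.01, steps=5000):
    """Repeatedly step downhill and return the final (m, b)."""
    m, b = 0.0, 0.0  # fixed starting spot (an assumption; the post picks a random one)
    for _ in range(steps):
        slope_m, slope_b = gradient(m, b, speeds, prices)
        # new value = current value - (slope at current value) * (step size)
        m = m - slope_m * step_size
        b = b - slope_b * step_size
    return m, b


# Made-up cars (arbitrary units) that sit on the line price = 2 * speed + 1.
speeds = [1.0, 2.0, 3.0]
prices = [3.0, 5.0, 7.0]

m, b = gradient_descent(speeds, prices)
print(round(m, 3), round(b, 3))
```

With this toy data the loop should settle very close to m = 2 and b = 1, the line the points were built from. A step size that is too large for your data can make the values blow up instead of settling, which is one reason it is a knob you choose.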
Why does this work?
Let’s say our current ‘m’ is too big:
As you can see, the slope is going to be positive because the red slope line points up, so we subtract some positive number from our current parameters. That will make us ‘step backwards’.
If it’s too small:
In that case the slope line points down, so we subtract a negative number. That’s the same as adding some positive number, so we step forward.
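Here is a small sketch of both cases using the one-parameter ‘y=mx’ model. The data is made up so that the best ‘m’ is 2, and the function name and step size are my own choices for illustration.

```python
def slope_at(m, speeds, prices):
    """Derivative of sum((m * s - p) ** 2) with respect to m."""
    return sum(2 * (m * s - p) * s for s, p in zip(speeds, prices))


# Made-up cars on the line price = 2 * speed, so the best m is 2.
speeds = [1.0, 2.0, 3.0]
prices = [2.0, 4.0, 6.0]
step_size = 0.01

too_big = 3.0
too_small = 1.0

# Too big: the slope is positive, so subtracting it moves m down toward 2.
new_from_big = too_big - slope_at(too_big, speeds, prices) * step_size

# Too small: the slope is negative, so subtracting it moves m up toward 2.
new_from_small = too_small - slope_at(too_small, speeds, prices) * step_size

print(slope_at(too_big, speeds, prices), new_from_big)
print(slope_at(too_small, speeds, prices), new_from_small)
```

Either way, one step leaves us closer to the bottom of the bowl than where we started.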
If we do this over and over again, we march toward the bottom of the bowl, where we find the parameters with the lowest possible prediction error on our training set.
Once we find these coveted ‘m’ and ‘b’ values, we can make predictions with the equation:
car price = m(car speed) + b
We did it. We machine learned.
In summary, we:
- Gathered some data and represented it as points in space
- Defined a function to tell us how close a line is to those points
- And used that function to find the best line we could
If you enjoyed this post or have ideas to improve it, let me know! I like to write about things I think are relevant to my path as a growing developer so expect general ramblings on ~self improvement~ and ~Computer Science~ wooOOoo
Here’s some other stuff I wrote:
- Machine learning for the layperson: pt1
- Thoughts on the cold shower method
- Not Quitting is Actually the Easy Part
info/contact at camwhite.io