Super Simple Machine Learning — Simple Linear Regression Part 3 [Validation]
This is the last part for Simple Linear Regression. Read Part 1 and Part 2 first.
This took a long time to write because this part bores me immensely, so I kept getting up to open and close the fridge and do other asinine things. Also I have an undying hatred for p-value and hypothesis testing that stems from not being able to fully understand it. Our relationship has improved slightly now.
Once again, let me know if you spot any errors, whether grammatical or mathematical.
What do your Results Mean?
Previously, in Part 2, we derived our regression line using Python. The graph looks like this
and the equation is:
y = 1.37x + 4.27
So.. what do the values 1.37 and 4.27 mean?
The Only Constant in life is the Constant
4.27 would be the Constant or the Y-Intercept in the equation.
What did we learn in Math class?
y = 1.37x + 4.27
so if
x = 0
then
y = 4.27
Since your x is the Poverty Rate, we assume that with ZERO poverty, there will be a 4.27% birth rate.
..
..
OR AT LEAST, THAT’S WHAT I THOUGHT.
On further reading, I realised that as much as this was sort of correct to say, there are two things working against this:
- Can poverty rate ever be 0? One can only imagine. Also, what if my constant was a negative number (let’s say -4.27 instead of 4.27), could my birthrate be negative (babies returning to the womb?!?!)
- My dataset does not have any records of poverty rate being at 0%. Which means the regression equation I have would not be relevant for a 0% poverty rate, as it was never trained on such data.
What then, is the 4.27 in my equation? Well, it explains the “etc” that my x doesn’t. It shows you that life isn’t perfect, not everything can be explained by what you know, and you just deal with it.
This really good article describes The Constant as the “garbage bin”, which is how I imagine my parents describe me as well.
Coefficient. Yes, it has the word “efficient” in it. No, it’s not a buzzword to throw around on LinkedIn to make you seem hip.
The 1.37 in my equation is the slope, or the coefficient of x.
For every 1 unit increase of x (for every 1 percentage-point increase in the poverty rate), y changes by 1.37 units. Since this is a Simple Linear Regression with only 1 variable, you can predict that for every 1 unit increase in poverty, the birth rate will increase by 1.37 on average. Note that the effect is added, not multiplied: at x = 10 the predicted y is 1.37(10) + 4.27 = 17.97, while at x = 11 it is 19.34, a difference of exactly 1.37.
It also tells you whether my x variable has a positive or negative impact on my y. So let’s say I have an equation that looks like this:
y = -2x + 4
My y is the number of clicks on an ad, and my x is the height of my ad. Since the coefficient is “-2”, it indicates that the taller my advertisements are, the fewer clicks I get, and if this relationship proves to be true, then shorter ads are the way to go if you want more clicks.
Prediction Time
The predictive part comes in when you insert an x to find out a y. This is why y is called the dependent variable: it depends on the value of x.
Let’s say you’re in a city with a 15% poverty rate, and you’re trying to get the government to notice that it’s important to provide free education and to keep the poverty rate down.
Thus, you want to know what the estimated birth rate for 15–17 year olds will be if the poverty rate reaches 20%, so that you can warn them.
Your equation would be:
y = 1.37(20) + 4.27
y = 31.67
From the data set, you can see that this result is pretty close to the actual data. Yay!
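If you’d rather not do the arithmetic by hand, here’s a minimal sketch of the same prediction in Python. The slope and intercept are the ones from Part 2; the helper function name is just something I made up for illustration.

```python
# A minimal sketch: plugging values into the fitted equation y = 1.37x + 4.27.
slope, intercept = 1.37, 4.27

def predict_birth_rate(poverty_rate):
    """Predicted birth rate for a given poverty rate (in %)."""
    return slope * poverty_rate + intercept

print(predict_birth_rate(20))                            # ~31.67
# The coefficient is additive: one extra unit of x adds ~1.37 to the prediction.
print(predict_birth_rate(21) - predict_birth_rate(20))   # ~1.37
```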
Is Your Model For Real? Introducing: P-value AKA “UGHHHHH”
Is the x variable really a good variable to go with? Can I say that the poverty rate truly has an impact on the birth rate, or is it just a line I drew through the points that happens to work well?
The p-Value is the answer to those questions.
Okay so, here’s the deal. There’s this guy called the Null Hypothesis:
Null Hypothesis, or Null Armstrong as I will call him *chuckles to self*, is basically the annoying dude who is telling you that you suck and that what you believe in is wrong and that in actual fact your x has no effect on your y.
To him, your equation will look like this:
y = 0x + b
Your slope is 0 because your x has no effect on y. Hence, your Null hypothesis states that
NULL HYPOTHESIS
H0: β1 = 0 (beta 1 is your slope/coefficient, the a in y = ax + b)
However, because you do not stand down easily, you tell Null Armstrong to shut that dirty mouth, and prove him wrong by showing him that there is an alternative hypothesis to it, which says that
ALTERNATIVE HYPOTHESIS
Ha: β1 ≠ 0
there is a significant linear relationship between x and y, and that poverty does indeed affect birthrate so the slope could not possibly be 0, and does anyone even like hanging out with you?
One way to prove Null Armstrong is wrong is through the p-value.
To start, we must always, always assume that the null hypothesis is true.
From there, a really small p-value says that if the null hypothesis really were true, results like ours would be very improbable, THUS making the x variable statistically significant and giving more reason to believe in the alternative hypothesis.
Small P-value = not likely Null Armstrong is right = by elimination, you are right
What would we consider as “really small”?
Without referencing the size of what’s in your pants, people usually use 0.05 as the cut-off point, a convention popularised by English statistician Ronald Fisher, who wrote a book about the p-value, and who btw also came up with the sexy son hypothesis, which is totally irrelevant to this topic but SEXY SON HYPOTHESIS are my favorite words to say now.
Note that your p-value DOES NOT prove you right, it only proves Null Armstrong wrong.
Basically:
High P values: You win this time, Null Armstrong!
Low P values: Suck ittttt, Null Armstrong!
But why should I torture myself this way?
Finding out the significance of your variable is especially useful when you are doing multiple linear regression, as you’ll need to decide which of your x variables are actually good for your model (Feature selection). I’ll elaborate more on this in the next post.
Scikit-learn does not have a summary function that shows p-values, so I will be using statsmodels instead to find the p-value of my x variable (poverty rate).
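Here is a minimal sketch of how that might look, assuming the data from Part 2 is loaded into a dataframe. The file name and column names below are placeholders (yours may be labelled differently), and I’m calling the fitted model reg2 to match the code that follows.

```python
# A minimal sketch: fitting the same regression with statsmodels to get p-values.
# The file name and column names are placeholders, not the actual ones from Part 2.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("poverty.csv")            # hypothetical file name
X = sm.add_constant(df["poverty_rate"])    # statsmodels needs the constant added explicitly
y = df["birth_rate"]

reg2 = sm.OLS(y, X).fit()
print(reg2.summary())                      # coefficients, standard errors, p-values, R-squared
```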
In the summary, the p-value for our x variable shows as 0.000, which just means it is too small to be displayed at that precision. Instead, I run this code:
print(reg2.pvalues)
and get
[9.79903930e-02 1.18781873e-09]
So the p-value of my x variable is actually 1.19e-09 which is really really small, obviously much less than 0.05, and hence, my poverty rate is a statistically significant factor when predicting birth rate.
What about the p-value of our constant, which is 0.098 and therefore higher than 0.05?
Well don’t worry, because our constant is here to stay.
For the constant, the null hypothesis assumes that it has no effect, i.e. that it is 0.
The high p-value merely signifies that we cannot rule that out: either the constant genuinely isn’t far from 0, or the dataset simply has no data points near x = 0 to prove otherwise.
Is the p-value sufficient evidence?
Like you and me, p-value isn’t perfect and has haters. Many researchers have argued about the p-value’s true significance and accuracy.
After all, it is not proof that your hypothesis is definitely correct. Yes, a result can be statistically significant, but only because you decided what counts as significant when you set the threshold.
- is 0.05 really enough? By setting a higher threshold level, you’re just giving yourself more room to say that you’re right.
- p-values may change A LOT when you are using different samples from the same population, since a particular sample could happen to give you a small (or large) p-value that doesn’t reflect the population as a whole. This can skew your conclusions. WHAT IS THE TRUTH!
- Is the null hypothesis even sensible? Is having a measurement to tell you that a hypothesis that is not possible isn’t possible a meaningful measurement????? Did you get that?
A research paper written about how p-value is bad and should feel bad can be found here.
People who are against p-value hypothesis testing might prefer using Bayesian methods instead.
However, there are also many supporters of the p-value. It seems that the key to getting closer to an accurate result is to set a much lower threshold (0.001 or 0.005).
P-value has been used widely to evaluate models in Machine Learning, so I’d continue to use it. Just treat it as a way of scoring to compare models and/or variables.
For example, some schools may say that a pass is 50% and some may say that it’s 60%, but is either cut-off an accurate measure of whether the student knew the topic well enough? Perhaps not, but it’s a good benchmark to judge against.
There are also other methods of evaluating how well my model is doing, which I will discuss below:
Train/Test Split
If we have enough data to play with, we usually do a Train/Test split, where we split the data set into a Training and a Testing set (usually 80% / 20%):
- The Training set is used to create the model, in this case, the regression line.
- The Testing set is used to test the model, by running the regression line against it to see how well your model performs on data it was not trained on.
What’s being done here is to ensure that your model is not Overfitted.
Overfitting occurs when you train a model on a dataset and it becomes really good at predicting that particular dataset, at the expense of generalising to data it hasn’t seen.
If you were to check the accuracy of your model on that same data, it may be a good 98%, but that’s kind of cheating, since you are testing it against the very data it was modelled on.
Since you may not necessarily have other datasets outside of what you’ve used to create your model, we are going to create our own “outside” dataset by splitting the one we already have.
By validating it against the Test data, you get an accuracy score, which tells you how well your model performs based on data that was not involved in the model-creation, so it can be kind of considered as “outside” data.
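In scikit-learn, the split itself is one function call. Here’s a minimal sketch, assuming X (poverty rates) and y (birth rates) are the arrays from Part 2; the random_state is arbitrary.

```python
# A minimal sketch of a train/test split, assuming X and y are already loaded as in Part 2.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)        # 80% train / 20% test

reg = LinearRegression().fit(X_train, y_train)   # the model only ever sees the training set

print(reg.score(X_train, y_train))   # R-squared on data it has seen
print(reg.score(X_test, y_test))     # R-squared on the "outside" data
```

The second number is the honest one: if it is much lower than the first, that’s a hint the model is overfitted.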
Check your residual plot
As I previously mentioned in Part 1, if your regression model is doing well, your residual plot should show no discernible pattern.
Here’s what mine looks like:
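If you want to draw your own, here’s a minimal sketch with matplotlib, assuming reg, X and y are the fitted scikit-learn model and data from Part 2.

```python
# A minimal sketch of a residual plot, assuming reg, X and y come from Part 2.
import matplotlib.pyplot as plt

predicted = reg.predict(X)
residuals = y - predicted                    # actual minus predicted

plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")               # reference line at zero
plt.xlabel("Predicted birth rate")
plt.ylabel("Residual (actual - predicted)")
plt.show()                                   # hope for a shapeless cloud around zero
```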
R squared
R squared is ALWAYS between 0 and 1, and the higher your R squared, the better.
R squared is the proportion of the variation in y that is explained by your linear model.
R-squared = Explained variation / Total variation
Scikit learn’s function is:
print(reg.score(X,Y))
However, putting all your eggs into an R-squared basket is a mistake. Sometimes your predictions are biased, and R-Squared cannot see this.
In some cases, R-squared values will always be low no matter what. For instance, when modelling for datasets that attempt to predict human behaviour, it may be low because we are mostly unpredictable and devious bastards.
In other cases, R-squared values will be high, which is good because the model follows closely to the actual data. HOWEVER! This may be caused by Overfitting, which we all know is totally not cool.
RMSE (ye olde friend, standard deviation)
RMSE — Root Mean Square Error
RMSD — Root Mean Square Deviation
Hello darkness my old friend, I come to talk with you again.
By darkness, I mean Variance, and Standard Deviation. Oh and also SSR/SSE from part 2.
RMSE measures the differences between the sample’s predicted values and actual values. It is basically the square root of the average of the squared residuals.
Sounds familiar? The Standard Deviation is also the square root of a variance.
However, in the case of regression analysis, the deviation isn’t Y − Ȳ (distance from the mean) but Y − Ŷ (distance from the prediction), which makes it slightly different, but it follows the same principle.
How does the Root Mean Squared Error (RMSE) relate to the Residual Sum of Squares (RSS: the squared differences between REAL Y and PREDICTED Y, AKA regression’s form of a variance)?
The RSS is the sum of the squared errors, the MSE is the mean of those squared errors (the RSS divided by n), and the RMSE is the square root of the MSE.
SIGH.
Residual Sum of Squares (RSS) = Σ(Ŷᵢ − Yᵢ)²
Mean Squared Error (MSE) = (1/n) Σ(Ŷᵢ − Yᵢ)²
Root Mean Squared Error (RMSE) = √MSE
For RMSE, the smaller the better, as it means the typical prediction error is smaller.
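Here’s a minimal sketch of computing it, reusing the test split from earlier (scikit-learn gives you the MSE; you take the square root yourself).

```python
# A minimal sketch of RMSE on the test set, assuming reg, X_test and y_test
# come from the train/test split sketch above.
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # square root of the MSE
print(rmse)   # same units as y, so roughly the typical size of a prediction error
```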
Quick Recap
- p-value = Smaller the better. Better to be <0.05
- Don’t ownself check ownself. Split data into Test and Training Sets
- R squared = between 0 and 1. The bigger the better
- RMSE = Smaller the better
And That’s It for this post!
Yes, I have a headache too. My eyes have gone blur and my soul has left my body.
We’re done with Simple Linear Regression!
whoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
If there ever was an awesome guide to a comprehensive code for Simple Linear Regression in Python, it’s here.
The next post will be about Multiple linear regression which is like Simple Linear Regression but.. less… simple…
Thanks to Michael and Rumen for proofreading.