ai like im 5: an intro to regression and simple linear regression: our first machine learning model (article 10)

ai like im 5
19 min read · Jan 18, 2024


before i start, a couple things:

  1. out of the loop? https://medium.com/@ailikeim5/list/ai-like-im-5-in-order-87ef4064afe8
  2. this content is not aimed at 5 year olds, it is just kept simple
  3. although linear regression is machine learning, i would not consider this artificial intelligence.
  4. however, this model lays a foundation for understanding artificial intelligence and deep learning models!

prerequisite knowledge:

a. basic understanding of data, data features, and datasets

b. good understanding of ai and machine learning

c. good understanding of some data assumptions especially linearity and non-linearity

d. good understanding of training and validating!

e. good understanding of models and parameters

f. good understanding of parameter optimization and loss

ai is like a secret handshake, if you do not know the handshake, you will feel left out in this article! read the articles, they are super simple!

we have finally made it to our first machine learning model!

today we are going to talk about linear regression, the simplest and most widely used machine learning model. we will talk about when and why to use it, go deep into how it works, and talk about why it's really amazing but also has its drawbacks. this topic is rooted in years of statistics, but this is going to be a simpler and more machine learning focused explanation of things. if you are interested in a great source for a more statistical approach, here is a link.

anyways,

linear relationships

recall that: linear relationships are predictable and straightforward:

this relationship can be represented by a straight line

  • for machine learning, we are learning the relationships in data
  • this learning is a mathematical and logical process
  • we love a linear relationship, representing a straight line is about as easy as it gets!

so these can be visualized like this:

  • as temperature increases, there is a direct and simple increase in ice cream sales (in a reasonable range)
    i.e -> people eat more ice cream in the summer

regression

i introduced regression as

  • like predicting how much or how many when given our input data.

and that regression is a supervised learning task, so it follows the supervised learning equation
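
as a reminder, that equation looks something like y = f(x) + error: we learn a function f that maps our inputs x to our outputs y!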

so what really is regression?

regression focuses on the relationship between the independent variable and dependent variable and making predictions based on that relationship… but what does this mean?

in regression:

the independent variable is the predictor and input of a model, its value does not change… therefore it stays independent

  • i want you to conceptualize a singular independent variable as one specific data feature from a dataset
    -> we are using this to learn the pattern and make predictions
  • therefore, multiple independent variables are just a collection of dataset features and we can work with as many independent variables as we want.

our dependent variable is the output and response, its value changes and is dependent on the interaction of our model and independent variables

  • i want you to conceptualize this as one specific data feature from a dataset that we are looking to predict and learn/make corrections from
  • we usually only work with 1 dependent variable, but there are some exceptions.

the goal of regression is to understand and quantify how the dependent variable (what is being predicted) changes when the independent variables are directly modified and varied.

in plain english: the goal of regression is to know how much our prediction changes when we modify the contents of our inputs.

let’s look at a dataset to get a better visualization of this:

let’s say we are interested in how the price of a lego set changes with all the different factors

what would be our independent variables?

  • these are going to help us predict, they do not change!!!

and then what would be our dependent variables?

  • our price is dependent on all the other factors!!!
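
to make this concrete, here is a sketch of how that split might look in code (the numbers and column names here are completely made up for illustration):

```python
import pandas as pd

# a tiny made-up lego dataset: every value here is invented for illustration
data = pd.DataFrame({
    "pieces":      [500, 1200, 2500, 4000],   # independent variable
    "minifigures": [2,   4,    6,    0],      # independent variable
    "price":       [50,  120,  250,  500],    # dependent variable (what we predict)
})

# the independent variables (predictors / inputs / the x)
x = data[["pieces", "minifigures"]]

# the dependent variable (response / output / the y)
y = data["price"]
```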

regression predicts a continuous outcome or a continuous number

a continuous number can take any value in some range, it is like an amount or a measurement!!!

this prediction could be:

  • what temperature tomorrow will be
  • what the stock price of tesla will be in 5 years
  • how much my lego ferrari daytona will cost in the future!

these predictions are super important to our lives!

regression does not predict a label or category, that is classification and we will talk about it in the future.

terminology is important, and you will often see words used interchangeably!

  • i will often refer to the dependent variable as the response, output and the y
  • and the independent variable or variables as the predictors, input and the x

but don’t worry much about this terminology, things will start to click as this article goes on.

regression can take many forms, the most important and common being:

linear regression

linear regression attempts to model the relationships in our data and learn by fitting a straight line.

  • representing a straight line is a simple mathematical idea, you have probably been doing it since you were 12 years old.
  • so this is the simplest, easiest, and quickest approach to machine learning and it is widely used in business and real life for these reasons.

i have some good and bad news about linear regression though:

the good news: linear regression is great if there is some type of linear relationship in your data and it can be found

the bad news: linear regression is useless if there is no linear relationship in your data or it’s hard to find.

here are some great points about this!

we will often start out with linear regression as the first model we learn because it provides a foundation and a connection to other models we will talk about in the future.

before we take a dive into more complex applications of this, let’s start at the most basic approach:

simple linear regression

  • simple linear regression focuses on the relationship between a single dependent variable (a single response) and a single independent variable (a single predictor)
    in plain english-> simple linear regression makes predictions with a single data feature and corrects itself/learns with a single data feature and these predictions look like a straight line!!!

here is a representation of this and what we want it to look like.

  • our actual values are not going to all sit on the line perfectly, we do not want to overfit our data… linear regression is great at generalization!!!

a lego example

i love legos and there are many different types of lego sets

  • complex ones like supercars, star wars ships, and more
  • simple ones like houses, trees, and more

legos are made of a lot of different pieces.

complex ones have a lot of pieces and are more complicated to build, and therefore they cost a lot more money $$$

  • the ferrari daytona is 4000 pieces and has a full recreation of a running engine, it is $500 as a result!!!

simple ones are really simple, they don’t have a lot of pieces, and are pretty cheap!!!

as the number of pieces increases, so does the price!!!

based on this visual, i suspect there is a direct linear relationship between the number of pieces and the price! (it looks like a straight line)

this means that with enough data and the simple linear regression model, we can predict the price of a new lego set, just based on the number of pieces!!!

  • remember there are other factors that go into the price of legos, but if we just need to make a prediction of price, this relationship is simple and great!!
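
here is a minimal sketch of what this could look like in code, using scikit-learn (the lego numbers are made up for illustration):

```python
from sklearn.linear_model import LinearRegression

# hypothetical data: number of pieces (single independent variable)
# and price in dollars (single dependent variable)
pieces = [[500], [1200], [2500], [4000]]   # sklearn expects a 2d input
price = [50, 120, 250, 500]

# fit the straight line to the data
model = LinearRegression()
model.fit(pieces, price)

# predict the price of a new 3000-piece set
print(model.predict([[3000]]))
```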

remember, simple linear regression is only interested in the interaction of two things… it is called simple for a reason!

the math of simple linear regression

mathematically, simple linear regression can be represented by the following equation
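
y = b0 + b1·x + ε

(b0 is the y-intercept, b1 is the slope, and ε is the error term… we will break each of these down below!)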

remember

  • input -> independent variable
  • output -> dependent variable

do not worry if you do not get the math, our machine is going to do all the math for us, but understanding the math terminology and relationships allows us to make connections in the future. understand the plain english of this math if anything!!!

this is not rocket science either… we call this the slope-intercept form in mathematics

y = mx + b

look at how similar they are, they both fit a straight line, almost the exact same way, the names of the terms are just different!

here is a great video for those who do not remember (we all come from different math backgrounds, do not feel discouraged)

and so we can explain the coefficient like this:

linear regression and the slope-intercept form differ in usage and methodology. simple linear regression uses statistical methods and data to find the optimal parameters and fit the line, while in math class we solve a system of equations for the slope to fit the line.

and because of that and many reasons illustrated in my past articles, this statistical and data-driven approach to solving the equation is going to have an error term

  • this error term captures the inherent randomness that cannot be explained by the relationships in the model and that arises from modeling real world data.
    i.e -> it is unobservable and not purely quantifiable
  • error also tells us how impractical our model is, meaning that if we try to fit a straight line to a non-linear relationship, our error is going to be big!

this error term can be attributed to 4 main factors

  1. measurement errors -> our data is not recorded properly
  2. omitted variables -> a data feature or features that influence our dependent variable are left out
  3. inherent randomness -> modeling real world situations is hard!
  4. model misspecification -> like modeling a non-linear relationship with a linear model

the parameters of linear regression

recall, models are like a cake

a cake takes:

  1. great data
  2. great recipe
  3. and great steps for that recipe!

a great machine learning/ai model takes

  1. great data
  2. great model
  3. and great parameters for that model

so what are the parameters and how do we get great ones!

in this picture, i illustrate the parameters of simple linear regression

  • it is our b0 (bias/y-intercept)
  • and our b1 (the regression coefficient/weight/slope).

how do we find the optimal parameters in linear regression?

we measure the performance of our predictions

i want you to think about something: if our line is the basis of our predictions, and we are predicting along this line, how could we measure the performance of a prediction?

this is not rocket science, the simplest way to do things is to look at the distance between our prediction and the actual value it should be:

or in lego terms

  • remember our predicted price is done only with the number of pieces!!!

we are going to call this distance the residual, residuals are very important to us.

  • they tell us how inaccurate our predictions are
  • you can think of this as a way of quantifying or putting a number to our error term!

a residual can be represented and calculated by the following mathematical equation, we do not need to graph!!!
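
residual = actual value − predicted value

in symbols: eᵢ = yᵢ − ŷᵢ (where ŷᵢ is our prediction)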

a small or minimal residual indicates a great prediction

a large or maximal residual indicates a terrible prediction

measuring overall performance

our model is going to make lots of predictions, so if we’re looking for an indicator of this model’s overall performance

we could add up all the residuals, but there’s a problem:

  • sometimes this value is going to be negative, sometimes it’s positive, and these positive and negative values can cancel each other out, even if they are incredibly high! a positive and a negative value mean the same thing though… the sign is mainly irrelevant!

so the most common method is to square the residuals, making all values positive, and adding them all up!

this has a very fitting name: the sum of squared residuals
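
ssr = e₁² + e₂² + … = Σ(yᵢ − ŷᵢ)²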

you will often see it called ssr or rss for short!

ssr is a function, and more importantly, because it is a function that measures error, we can use it as our loss function!

and if you recall from my last article (read it!!!), loss functions are used to find the optimal parameters of a model and are a vessel for mathematical optimization!

  • remember loss functions are not one size fits all, there are many different ways to measure the error of our model
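
here is a tiny sketch of computing residuals and ssr with numpy (the prices are made up):

```python
import numpy as np

# hypothetical actual and predicted prices (made-up numbers)
actual    = np.array([50, 120, 250, 500])
predicted = np.array([60, 110, 260, 480])

residuals = actual - predicted        # distance between actual and predicted
ssr = np.sum(residuals ** 2)          # square (all positive) and add up

print(residuals)  # [-10  10 -10  20]
print(ssr)        # 700
```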

let’s take a look at this!

the optimization of simple linear regression and ssr!!! (skip if you don’t care for math)

  • the best part about simple linear regression and the sum of squared residuals is there exists a closed form solution
    i.e -> there is a simple and quick way of solving for our optimal parameters
  • i’m not going to go too deep into the math behind this optimization and how we derive the optimal parameters, but there are two key parts and two formulas we have discovered thanks to ssr!

this solution is called ordinary least squares or ols

recall our coefficients look like this (the things we are solving for)

  1. we can solve for our optimal b1: the regression coefficient and slope

the formula looks like:
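
b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

(x̄ and ȳ are the averages of our x and y values)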

2. and with that optimal slope/coefficient, we can solve the formula for optimal y intercept/bias!

that formula looks like!
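
b0 = ȳ − b1·x̄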

so we have the two parameters of our model and we are good to go!!!

note: this is not the only method for getting these optimal parameters, but instead the most common, and as you can see it is relatively easy!
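
to show just how easy it is, here is a sketch of the closed form solution in plain numpy (the lego numbers are made up):

```python
import numpy as np

# made-up lego data: pieces (x) and price (y)
x = np.array([500, 1200, 2500, 4000])
y = np.array([50, 120, 250, 500])

# optimal slope: sum of (x - mean of x) * (y - mean of y),
# divided by the sum of (x - mean of x) squared
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# optimal intercept: plug the optimal slope back in at the averages
b0 = y.mean() - b1 * x.mean()

print(b0, b1)   # the two parameters of our fitted line
```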

the math is important, yes, but the most important thing is that you understand

  • all that it takes to calculate the optimal parameters of this model is the averages and actual values and a little bit of mathematics (no predictions necessary)!
  • note: these formulas are still derived from minimizing the sum of squared residuals, which measures the error of the predictions!

you might have a hard time wrapping your head around this, i do too!!!

it’s inception in a way: the optimal parameters in a simple linear regression model are found by minimizing the error in predictions we haven’t made yet! mathematics is really cool!!!

so our simple linear regression models do not start out with bad parameters, we will solve for the optimal parameters to fit an amazing line!

two things about this optimization

  1. the parameters obtained through this optimization are unbiased
  • linear regression is one of the simplest and most general models.
  • recall, the more simple and general we get, the more bias we introduce (bias/variance tradeoff)
  • as a result, it is considered a high bias/low variance model when compared to other models we will highlight in the future.

ordinary least squares (this optimization technique) is called the best linear unbiased estimator

  • obtaining unbiased parameters is incredibly important and although there are other solutions to obtaining parameters, we will often work with this one due to its unbiased and simple nature
    i.e -> if our model already has a tendency to be high bias, work with the parameters that minimize bias

2. in order for the parameters to be unbiased, we have to obey the assumptions of linear regression and more

  • meaning that it can only be unbiased under special conditions
  • i am not going to go into all these conditions, but instead talk about the assumptions linear regression makes because they are much more important!!!
  • all simple linear regression models have to obey these assumptions

linear regression assumptions (skip if you want)

  • remember assumptions are a key to machine learning, they are like a starting point in our model process.
  • these are like arbitrary rules we believe are true before we start analyzing
  • assumptions are super important to linear regression and have roots in years of statistical theory!

we will mainly talk about them in the context of simple linear regression for now, but these also apply to the more complex applications!

why you need to understand the assumptions:

  • interpretation -> knowing the assumptions allows you to better understand the results of your model
  • reliability -> knowing the assumptions allows you to make a more reliable model
  • accuracy -> knowing the assumptions can improve the accuracy of the model
  • misuse -> knowing the assumptions prevents misuse of linear regression
  • and more!!!

linear regression makes a lot of assumptions, and although understanding them is important, understanding when you have violated them is more important!!!

here is a great but long statistical video of this!!!!

1. the first assumption is very simple: linearity

ok, this one is common sense…

  • we are not interested in using a linear regression model if the relationship is not linear, that is a clear violation of our first assumption: linearity
  • in plain english: if our model does not fit like a straight line, we have violated our first assumption and it will not perform well

this assumption also assumes the relationship between the independent and dependent variables is constant, and cannot change

  • for simple linear regression -> i.e the slope or regression coefficient does not change, it will always be a constant number!

the linearity assumption illustrates an important limitation

  • because a linear regression model is static, it can only capture static relationships
    i.e -> it is really hard to capture relationships between variables that change over time or across conditions (dynamic relationships)
  • many real world relationships and problems are dynamic

important idea: sometimes, we can transform a non-linear relationship into a linear one with the help of mathematics!

  • the relationship between our dependent variable and independent variable does not have to be linear in its raw data form.
  • it can take this form through some mathematical and statistical transformation!

we will talk more about these transformations in the next article, but for now comprehend that linearity is not necessarily violated in the initial raw form of data, but instead violated when we prove there is no transformation that can be done to linearize our data

there are multiple ways of checking this assumption but there are two key ones we will highlight

a. creating a scatter plot -> graphing our independent variables and dependent variables (before we fit the model)

it is as simple as looking at a graph and seeing if it looks like a straight line!

  • again, we can only say this violation is true after we perform all possible transformations (this is a lot of information for right now, i will highlight these in the next article)
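
a sketch of such a scatter plot with matplotlib (the data is made up):

```python
import matplotlib.pyplot as plt

# made-up lego data
pieces = [500, 1200, 2500, 4000]
price = [50, 120, 250, 500]

# if the points roughly trace a straight line, linearity is plausible
plt.scatter(pieces, price)
plt.xlabel("pieces")
plt.ylabel("price")
plt.show()
```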

b. creating a residual plot (after we fit our model)

remember, a residual is a measure of our predictions

so this is a plot of prediction error!

if we plot these residuals against the actual value of our independent variable, and our relationship is linear, we will get a plot that looks like:

  • residuals can be negative and positive values (the meaning does not change regardless of the sign)
  • we are mainly interested in residuals being very small and zero or as close to zero as possible
  • if we were to put a line through this graph, it will tell us if we obey our linearity assumptions:

a violation would look like:

  • no longer a straight line or even close to a straight line!

so you can understand the importance of residuals, just graphing them tells us so much about our data!
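
here is a sketch of how you might graph them yourself (the residuals are made up):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([500, 1200, 2500, 4000])       # independent variable (made up)
residuals = np.array([-10, 10, -10, 20])    # residuals from a fitted model (made up)

# residuals should hover randomly around the zero line with no pattern
plt.scatter(x, residuals)
plt.axhline(0)                              # the zero line
plt.xlabel("pieces")
plt.ylabel("residual")
plt.show()
```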

there are more ways to tell if we violate our linearity assumption and other ways to fix it, but this is sufficient for now!!!

2. our second assumption: homoscedasticity

this one is not as simple as the first, but

in plain english this means:

  • the spread of our predictions remains constant or somewhat constant throughout this model
  • if our model starts out with good predictions (small residuals), our residuals should be small throughout!
  • if our model starts out with bad predictions (big residuals), our residuals should be big throughout!

this consistency is important because it ensures the predictions a model makes are consistent and reliable

this can be visualized perfectly with a residual plot

  • hopefully you can visualize this change in variance!
  • when our spread and variance is not constant, we call this heteroscedasticity
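
if you want a crude numeric sanity check on top of the plot, you could compare the residual spread across the data (this is a simplified cousin of formal tests like goldfeld-quandt, which are beyond this series):

```python
import numpy as np

# made-up residuals, ordered by the independent variable
residuals = np.array([-10, 10, -10, 20, -30, 40, -50, 60])

# crude check: compare residual spread in the first half vs the second half
first, second = np.split(residuals, 2)
print(first.var(), second.var())  # wildly different variances hint at heteroscedasticity
```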

this assumption is rooted in years of statistics and regression theory and is crucial to estimating the optimal parameters of our model, and giving us better interpretation into our model!

so violating homoscedasticity can indicate that:

  • the parameters are no longer optimal and there is a better way to estimate them!
  • our model might make inaccurate and inconsistent predictions
  • and our model is misspecified for its purpose, again like modeling a non-linear relationship with a linear model!

3. our third assumption: normality

  • distributions measure the count of things and tell us a lot of information
  • recall that a normal distribution means most data points are close to the average (the middle), and the further away we get from the average, the less points there are!

most things in the real world follow a normal distribution like height!

so this means ideally we want our residual or error distribution to look like

where most of our residuals sit around the middle and are hopefully pretty small!

you will often see the distribution plotted in a special plot called a q-q plot

a q-q plot compares the distribution of our residuals and the theoretical (what it should be) normal distribution

  • if the residuals are normally distributed, the points on the graph will lie on a straight line
  • strong deviations from this line tell us we have violated the distribution assumption!
  • our distributions are divided into quantiles and this visualization represents them as points in a graph!

the ideal qq plot should look like!

and a violation should look like!
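
if you want to draw a q-q plot yourself, here is a minimal sketch with scipy (the residuals here are randomly generated stand-ins):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# stand-in residuals: drawn from a normal distribution for illustration
residuals = np.random.normal(0, 1, 100)

# probplot compares our residuals' quantiles to a normal distribution's quantiles
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```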

do not worry too much about this assumption, this is rooted in many deep statistic concepts that transcend these articles.

  • this does not affect the performance of our models like the other assumptions
  • and the others are much more crucial!

just understand that residuals should follow a normal distribution

4. our final simple linear regression assumption: independence

  • this says that the residual for one observation is not influenced by any others
  • when residuals do influence each other, we call this autocorrelation!!!
  • ideally, there should be no correlation or trend in our residuals

this one is more interesting, because it is mainly violated when time is involved

  • when we work with something like stock data
    i.e -> the stock price of today is very influenced by stock price of yesterday, and the cycle continues
  • many things involving time have trends, and this leads to a correlation in the error of our predictions too

so linear regression is not the best for anything involving time!!!

how do we check for this?

looking at a residual plot

  • there is a clear trend in our second image, this should not exist and violates the independence assumption!!!!
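
a sketch of this check in code: plot the residuals in observation order and look for a trend (these residuals are made up to show a wave-like pattern):

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up residuals in time order; a wave-like pattern suggests autocorrelation
residuals = np.array([5, 8, 10, 7, 2, -4, -9, -11, -7, -1, 6, 9])

plt.plot(range(len(residuals)), residuals, marker="o")
plt.axhline(0)
plt.xlabel("observation order")
plt.ylabel("residual")
plt.show()
```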

we have techniques to address and prevent this, but they will not be illustrated in this series!!!

recap

  • linear relationships are very simple
  • regression is focused on the interaction of independent and dependent variables
  • regression makes a prediction of a continuous number
  • the most common form of regression is linear regression
  • linear regression fits a straight line and this is a lot like the slope from grade school
  • the most simple form of that is simple linear regression; fitting a straight line with a single independent and single dependent variable!
  • simple linear regression has a closed form solution, a simple and easy way to fit the line and find the optimal parameters
  • no predictions are necessary for this closed form solution, meaning our model in the right condition starts out smart.
  • linear regression makes a lot of assumptions, and as a result, it is highly limited!
  • meaning that even if we have a perfect linear relationship in our data, there are a lot of other little rules we have to obey!
  • but if we can understand these rules, we can better build simple linear regression models!!!

so anyways

this is a beefy article and i had a hard time making it fun and funny, but that’s okay! in the next article i will talk about multiple linear regression, transformations of linearity, and better measuring these models. and in the future, i plan on showing how to code this up in my series, code like im 5!!!

human moment of the day: i love legos and my goal in life is to collect and build every lego supercar! i finished up the ferrari daytona and it took about 2 months, but it was so worth it. shoutout to this absolute unit

have an awesome day and

godspeed!
