Introduction to ML — Linear Regression From Scratch
You have probably heard people talking about Machine Learning models and their various algorithms, but perhaps you avoid the topic because to you it sounds like nothing more than jargon. Well, this article is here to completely change that!
What is Machine Learning?
The basic definition that you will find everywhere is that “it is a field of study where we feed data to an algorithm, and on the basis of that data it gains experience and improves itself without being explicitly programmed.”
“Machine Learning” is an umbrella term that covers a large number of algorithms.
The main thing is that the algorithm builds its own logic by processing the given data; we don’t have to program it for every step.
For example, there is a popular and easy algorithm known as Logistic Regression. It predicts the probability of a binary event occurring, using a function called the “sigmoid” function. The same algorithm is used, for example, in cancer detection problems.
Kinds of Machine Learning Algorithms
There are two main kinds of Machine Learning algorithms — supervised learning and unsupervised learning. The difference between them is not complicated.
To understand supervised learning, let’s take the most popular example: predicting house prices.
Suppose that you are a property dealer and you can tell the worth of a house at a glance from some of its parameters, such as its location, number of bedrooms, area, etc. You have a brother who is new to the real estate business and lacks the experience that you have.
Now, to help your brother, you make an application that will predict a house’s price on the basis of its area, number of bedrooms, location, and the prices at which similar houses have sold in the past.
First of all, you have to collect data for your model. For the next 6 months you write down the details of all the houses sold in your city: their area, location, number of bedrooms and, most importantly, the sale price.
The data that we have collected is called the “training data”. This is the data we feed to the model, and the model will then predict the price of houses in that particular city.
The model will try to find a relationship between the parameters in the training set and the sale price, so that the math between them works out. So, in supervised learning the main purpose of the model is to learn the relationship between the input parameters and the sale price, so that further predictions can be made on its basis. Once the model has worked out the math, you can use it quite easily to predict a house’s price.
The second kind of learning works differently: here we have to sort data into categories that are not known to us beforehand. We feed the data to the algorithm, and in turn it finds patterns in that data on its own. This is unsupervised learning.
It is like someone gives you various kinds of fruits and puts them on a table. You have to see patterns among the fruits and then categorize the fruits on the basis of appearance or any other parameter.
The same thing can be done by a Machine Learning model by picking up patterns in the given data.
You can also use unsupervised learning in the house problem, to find out which type of customer likes which type of house. For example, maybe college students like smaller houses with many rooms, while a middle-aged person may prefer a large house with many luxuries.
So, we can see that unsupervised learning is just as important as supervised learning, and no less interesting.
Can we really consider it “learning”?
We human beings work on the basis of our experience; we have impulses and intuitions.
For example, if you are a property dealer, then you will just have a “feeling” for what the right price is for a particular house, which type of client would prefer to buy that type of house, and so on. People are trying to achieve this feat using advanced Artificial Intelligence, but so far in vain. There has not been a single system that can properly replicate what we humans can do, though some have come quite close.
The algorithms that have been made for Machine Learning are not yet that advanced; their use is quite limited, and they can only solve some specific problems.
Maybe 20–30 years from now, we will have figured out algorithms that can make a strong, advanced Artificial Intelligence.
So, basically, till now we have only made machines that can predict the result for a specific problem based on some training data. This could not be called proper learning, because it does not involve the “impulse” or the “inner feeling” that we humans have, but we don’t have any better choice than to call it “Machine Learning”, as it somewhat resembles what we humans can do.
Let’s try to think about the algorithms
Suppose you are the property dealer. Now, think about what algorithm you would come up with to estimate the price of a house whose details are given.
If you don’t know much about Machine Learning then you will probably think of some basic algorithms to estimate the price of the house.
- First of all, you will find the average price per square foot of the houses sold.
- You know that some areas cost more than the average, so you will add a little extra for them, and some areas cost less than the average, so you will subtract a little for them.
- Now you will calculate the price based on the square feet of the house.
- Finally, you will account for the number of bedrooms: if a house has few bedrooms, its value will be lower, and if it has many, its value will be higher.
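The hand-rolled recipe above can be sketched in code. This is only an illustration: the function name, the base rate, the neighborhood adjustments, and the per-bedroom bonus are all made-up numbers, not values taken from any real data.

```python
# A hand-tuned heuristic along the lines described above.
def estimate_price(area_sqft, bedrooms, neighborhood):
    base_rate = 200.0                      # assumed average price per square foot
    neighborhood_adjust = {                # hand-picked premiums and discounts
        "downtown": 50.0,
        "suburb": 0.0,
        "outskirts": -30.0,
    }
    rate = base_rate + neighborhood_adjust.get(neighborhood, 0.0)
    price = rate * area_sqft               # price from the square footage
    price += 10000.0 * (bedrooms - 2)      # rough bonus/penalty per bedroom
    return price

print(estimate_price(1500, 3, "suburb"))   # 310000.0
```

Notice that every number here was chosen by hand — which is exactly the problem discussed next.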
You could spend many hours finding the numbers to add or subtract, but you would never find the perfect ones. It also becomes hard when the actual prices fluctuate.
It would be much better if the computer could somehow do all the above calculations for you. As long as it estimates the right price, no one has a problem.
It is just like cooking: the parameters we have are like ingredients. We just have to figure out how much of each ingredient to put in — that is, by what amount each parameter affects the final price.
If we somehow knew these amounts, or “weights”, our work would be really easy, and we would be able to estimate house prices with little error.
A simple and tiresome way to figure out these weights would be:
- Set the weight of all parameters as 1.
- Calculate the estimated price for each house and verify it against the actual price from the database. Check how far away you are from the real answer.
- Suppose you have 500 records; you then have to proceed by hit and trial, trying to find the weights that best fit your database.
- If you somehow manage to find those amounts, you have successfully found an algorithm to estimate house prices.
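The hit-and-trial procedure can be sketched as follows. The two-house “database”, the helper name `total_error`, and the candidate weights are all invented for illustration:

```python
# Score one candidate set of weights against the recorded sales.
def total_error(weights, houses):
    err = 0.0
    for house in houses:
        predicted = (weights["per_sqft"] * house["area"]
                     + weights["per_bedroom"] * house["bedrooms"])
        err += abs(predicted - house["price"])  # how far off we are
    return err

houses = [
    {"area": 1000, "bedrooms": 2, "price": 230000},
    {"area": 1500, "bedrooms": 3, "price": 330000},
]

# Step one of the recipe: start every weight at 1 and see how far off we are.
print(total_error({"per_sqft": 1.0, "per_bedroom": 1.0}, houses))        # 557495.0
# A better guess found by hand:
print(total_error({"per_sqft": 200.0, "per_bedroom": 10000.0}, houses))  # 10000.0
```

Repeating the second step for thousands of candidate weights is exactly the tedium the next section gets rid of.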
The above algorithm is really dumb as it would take forever to find out the weights.
Fortunately, mathematicians have figured out really great ways to find the weight of each parameter quickly — that is, without trying many numbers by hit and trial. One such algorithm is Linear Regression.
Linear Regression is a supervised learning algorithm where the predicted output is continuous and has a constant slope.
Simple linear regression uses the slope-intercept form

y = mx + b

where m and b are the variables that our algorithm is trying to “learn” in order to produce the most accurate result. In the expression, x represents the input data and y represents the predicted result.
Let’s say we have a data set of companies, where for each company we have its total radio advertising budget and the total sales of its product. We are trying to figure out an equation that will predict a company’s sales on the basis of its radio advertising budget.
How to make predictions?
The prediction function estimates the sales of a company, given its radio advertising budget and our current values of the weight and the bias:

Sales = Weight * Radio + Bias

- Weight is the coefficient of the Radio variable.
- Radio is the independent variable, also called a feature.
- Bias is the intercept, the point where the line crosses the y-axis.
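As a minimal sketch, the prediction function might look like this in Python (the function name and the parameter values passed in are arbitrary, since nothing has been learned yet):

```python
# Minimal prediction function for the model: Sales = Weight * Radio + Bias.
def predict_sales(radio, weight, bias):
    return weight * radio + bias

# With arbitrary, not-yet-learned parameters:
print(predict_sales(37.8, weight=0.5, bias=1.0))
```

Everything that follows is about choosing `weight` and `bias` well instead of guessing them.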
The figure above shows the final regression line that comes out of a data set after applying the algorithm. The scattered points are the companies’ actual sales at their radio advertising budgets, and the points that lie on the line are the predicted sales.
We start the weight at some random number, but eventually we have to optimize it to obtain a slope and an intercept that fit the data set well. This is where the cost function comes into play, to solve the optimization problem.
The cost is the average squared difference between an observation’s actual and predicted values.
We want the regression line to be as close as possible to all the points in the data set, so we take the squared difference between each actual value and its prediction, add these up, and divide by the total number of observations in our database.
The main objective is to minimize the cost as much as possible so that the accuracy of the model can be increased.
The cost is the expression given below:

MSE = (1/N) * Σ (yi - (mxi + b))²

- N is the total number of observations
- yi is the actual value of the i-th observation
- mxi + b is our prediction
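A direct translation of this cost into Python might look like the following (the function name is my own, and the three data points are made up so that they lie exactly on the line y = 0.5x + 1, giving a cost of zero):

```python
# Mean squared error cost for the line y = m*x + b, as in the formula above.
def cost_function(radio, sales, m, b):
    n = len(radio)
    total = 0.0
    for x, y in zip(radio, sales):
        total += (y - (m * x + b)) ** 2   # squared difference for one point
    return total / n                      # average over all observations

# Points that lie exactly on y = 0.5x + 1, so the cost is zero:
radio = [10.0, 20.0, 30.0]
sales = [6.0, 11.0, 16.0]
print(cost_function(radio, sales, 0.5, 1.0))  # 0.0
```

A worse line gives a larger cost, which is exactly the signal the optimizer will use.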
In order to minimize our cost function, we use gradient descent. There are two parameters in our cost function that we can alter: the weight m and the bias b. Since we have to figure out the impact of each of them on the cost, we take the partial derivative of the cost with respect to each. The point is that to find good values of m and b we have to make the cost as small as possible, and gradient descent is a very effective algorithm for doing exactly that.
Our cost function is

f(m, b) = (1/N) * Σ (yi - (mxi + b))²

and the formulas obtained by partial differentiation are:

df/dm = (1/N) * Σ -2xi * (yi - (mxi + b))
df/db = (1/N) * Σ -2 * (yi - (mxi + b))
The purpose of gradient descent is to converge to a local minimum. Now we have to update the values of m and b using these partial derivatives.
m is updated as follows:

m = m - α * (df/dm)

b is updated as follows:

b = b - α * (df/db)
Here α (alpha) is the learning rate, the step size with which we move toward the minimum.
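One update step, using the two partial derivatives and the rule above, could be sketched like this (the function name `step` is mine, and the toy data lies exactly on the line y = 0.5x + 1, so the gradients are zero and the parameters stay put):

```python
# One gradient-descent update for m and b, following the formulas above.
def step(radio, sales, m, b, alpha):
    n = len(radio)
    dm = 0.0  # running value of df/dm
    db = 0.0  # running value of df/db
    for x, y in zip(radio, sales):
        error = y - (m * x + b)          # yi - (m*xi + b)
        dm += -2.0 * x * error / n       # -2xi * error, averaged over N
        db += -2.0 * error / n           # -2 * error, averaged over N
    # Move against the gradient, scaled by the learning rate alpha.
    return m - alpha * dm, b - alpha * db

# Data that already lies on y = 0.5x + 1, so the gradients are zero:
m, b = step([10.0, 20.0, 30.0], [6.0, 11.0, 16.0], 0.5, 1.0, 0.001)
print(m, b)  # 0.5 1.0
```

Repeating this step many times, starting from arbitrary m and b, is the whole training procedure.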
If the learning rate is too small, gradient descent will be very slow; if it is too large, gradient descent can overshoot the minimum and may fail to converge, or even diverge. That is why we have to choose the value of alpha carefully.
As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to reduce the learning rate over time.
The best way to check whether the algorithm is working is to look at the value of the cost after every iteration: it should go down with every iteration.
Let me show you the cost history:
As we can see in the figure above, the cost decreases as the number of iterations grows.
By learning the best values of the weight and the bias, we now have an expression that predicts a company’s future sales based on its radio advertising budget.
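Putting the pieces together, a rough end-to-end training loop that records the cost history might look like this. The function name, the four data points, and the hyperparameter values are assumptions chosen only for illustration:

```python
# End-to-end training sketch: gradient descent on (m, b) with a cost history.
def train(radio, sales, alpha=0.0005, iterations=2000):
    m, b = 0.0, 0.0
    n = len(radio)
    history = []
    for _ in range(iterations):
        dm = db = 0.0
        for x, y in zip(radio, sales):
            error = y - (m * x + b)
            dm += -2.0 * x * error / n
            db += -2.0 * error / n
        m -= alpha * dm          # m = m - alpha * df/dm
        b -= alpha * db          # b = b - alpha * df/db
        # Record the mean squared error after this update.
        cost = sum((y - (m * x + b)) ** 2 for x, y in zip(radio, sales)) / n
        history.append(cost)
    return m, b, history

# Made-up radio budgets and sales that lie roughly on a line:
radio = [10.0, 20.0, 30.0, 40.0]
sales = [7.0, 11.0, 17.0, 21.0]
m, b, history = train(radio, sales)
print(history[0] > history[-1])  # prints True: the cost falls over the iterations
```

With real data you would also normalize the feature and tune alpha, but the loop itself would stay the same.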
I wonder how well the model would work in real life. I’ll let you think about that!
Well, this is the most basic algorithm that is there for supervised learning.
If there were more parameters, we would switch to multiple regression, which is a bit more advanced than linear regression but still simple.
I will save multiple regression for another time.
How to learn more?
If the curiosity bug has bitten you and you want to learn about Machine Learning in depth, I highly recommend Andrew Ng’s Machine Learning course, available on Coursera. It is a great next step, and anyone with a little mathematical knowledge can take it.