
[Machine Learning] Linear Regression Part-1 from Scratch and with SciKit Learn Implementation
Thought for the post: “Tell me and I forget. Teach me and I remember. Involve me and I learn.”
Let’s continue our journey into Machine Learning with Linear Regression.
Prerequisites: Artificial Intelligence and Machine Learning for Newbies and Types of Machine Learning
Basic knowledge of Python. Don’t worry if you don’t know Python yet; I will simplify as much as possible to help you understand, and I will also write some articles introducing you to Python.
Why is a programming language needed for ML? Well, ML is mostly Mathematics: Probability & Statistics, Calculus, Algebra and Optimisation. So, in order to implement ML algorithms, you need a programming language. It can be R, Python, Octave, Julia, Java, JavaScript or anything of your choice. The most preferred today are Python and R, because of their strong, growing communities, open source implementations and ease of understanding.
I preferred Python as it can be used for many other purposes as well, like Automation, AI, Web development, Web Scraping, Games, GUI Applications, Networks etc.
So, once we learn Python, we can use it for many purposes.
And R is mostly used for scientific and statistical computations and ML. Nowadays, the JavaScript community is also growing towards ML open source implementations.
That’s a brief outline of languages for ML; you can choose any of your choice. Just because I prefer Python here shouldn’t stop you from exploring other languages for ML.
Before we get started with Linear Regression, let’s recall: Regression is for predicting a continuous-valued variable given the input. Linear Regression is one technique (algorithm) used for ML regression problems.
I love learning by doing. So, before we actually get into the theory of Linear Regression, let’s see with an example how it works.
Let’s create Input and Output data to train the ML algorithm.
First, we will start with small and simple data for training and testing.
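The original code listing is not reproduced in this text, so here is a minimal sketch of how this data might be generated (the numpy and time imports are included because they are used in the performance comparison that follows):

```python
import numpy as np  # used later for the list vs. array speed comparison
import time         # used later to time both approaches

# Input data: the numbers 1 to 14, stored in a plain Python list
X = list(range(1, 15))
print(X)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```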
We will consider our input data as X and our output (target) data as Y. Above we have generated sample data of the numbers 1 to 14.
What is the purpose of those numpy and time modules which I have imported? Explanation below.
The above data is in a list (in Python there is no dedicated array type like in other languages; arrays are generally handled as lists, which can hold mixed data types). When working with ML algorithms, performance is a key measure to consider when processing data, as real-world ML systems have much more data to work on.
By the way, Python is a dynamically typed language; it is intelligent enough to infer the data type by itself (you don’t have to declare the type while initialising like we do in other programming languages). I will cover more about this in the Python articles which I will start in some time. For now, a sample example is below.
Operations on data stored in lists are much slower than on data stored in NumPy arrays. Below is sample code to understand this better.
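The timing code itself is not reproduced in this text; a sketch along these lines shows the gap (the squaring workload is my own choice, and absolute timings vary by machine):

```python
import time
import numpy as np

n = 1_000_000

# Build a list of n squared numbers with a plain Python loop
start = time.time()
squares_list = [i * i for i in range(n)]
list_time = time.time() - start

# The same computation as a single vectorised NumPy operation
start = time.time()
arr = np.arange(n)
squares_array = arr * arr
array_time = time.time() - start

print(f"list: {list_time:.4f}s, numpy: {array_time:.4f}s")
```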
Observed the difference in time taken between NumPy arrays and lists to generate 1 million data points? Even saving 2–3 seconds matters a lot in ML, as you will be processing many more features and much more data in real systems (some training runs take months, or even years).
My intention above was to show you the difference between normal lists and NumPy arrays. In ML we use NumPy arrays a lot (some of the main frameworks are built on top of NumPy).
As our data is small, we will continue with Lists now.
Let’s generate the Y (output) data now as below. We will multiply the X data by 11, then train and test our ML algorithm to see how it performs on that data.
As we have generated some sample data, let’s create a dictionary and pass it to DataFrame.
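A sketch of that step, assuming the column names X and y used in the discussion below:

```python
import pandas as pd

X = list(range(1, 15))       # input data: 1 to 14
Y = [x * 11 for x in X]      # output (target): each input multiplied by 11

# Dictionary of column name -> values, passed to a DataFrame
data = pd.DataFrame({"X": X, "y": Y})
print(data.head())
```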
You might be wondering what is DataFrame?
Well, it is simple. It is a relational (tabular) structure, just like how we store values in Excel in rows and columns.
And Pandas library provide the dataframes to hold and manipulate data as and when required.
Pandas is built on top of Numpy, Scipy, Matplotlib libraries. And is open source too.
Pandas is mainly used for data manipulation, exploration, statistical computations, wrangling and data analysis. It is one of the most powerful tools used for data manipulation in ML. Please refer to the documentation; I am also providing a link to the most popular methods and techniques used in pandas, which will be helpful for quick learning.
Below is how our data looks in DataFrame variable.
As you can clearly see now, X is our input data and y will be our target variable data. The 0–13 on the left are indices, which we will ignore when passing the data to the ML algorithm.
We can clearly see from the above that y is X multiplied by 11.
Let’s visualise the data on how it looks on a plot.
The Matplotlib library is one of the most powerful tools for data visualisation. Why do you need to visualise the data? Well, it is very important to know and understand your data. Based on data visualisation, we find relationships between our input features and target variables. We can derive insights into outliers, blank data, feature importance and the relationships between input columns. Your data speaks to you, and you can discover patterns in it through visualisation. It not only conveys the message; it is also pretty useful for making important decisions before passing data to an ML algorithm.
There is much more to say about the importance of data visualisation than can be covered here. In short, understanding your data is the most important and critical aspect of the ML world.
And other popular data visualisation libraries in Python are Seaborn, Bokeh and many more. Please google and play with those tools.
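The plotting code is not reproduced in this text; a minimal sketch with Matplotlib might look like this (the labels and output file name are my own choices, and the Agg backend is used so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works without a display
import matplotlib.pyplot as plt

X = list(range(1, 15))
y = [x * 11 for x in X]

plt.scatter(X, y, color="red")
plt.xlabel("X (input)")
plt.ylabel("y (target)")
plt.title("X vs y")
plt.savefig("scatter.png")
```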
Output of our above code is below picture:
We can see there is a perfectly linear relationship between our input and output data, which is awesome.
Note: It is rare to find a perfectly linear relationship between input and output in real-world application data.
If you are not yet sure what this linear relationship is, don’t worry and keep going, as I cover it below.
So, based on the above plot, our Linear Regression algorithm (the name itself says it) should perform well. Let’s check below whether our intuition is correct.
First we will cover the example with Framework tools. Then we will actually go with Maths behind the hood and implement it.
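The framework code is not reproduced in this text; a sketch of the steps discussed below might look like this (the random_state value of 3 is an arbitrary choice, so the exact rows in each split may differ from the ones shown in this article):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = [[i] for i in range(1, 15)]        # 2-D input: scikit-learn expects columns
y = [i * 11 for i in range(1, 15)]

# 70% train / 30% test; random_state fixes the shuffle so runs are repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)

lr = LinearRegression()   # load the algorithm into a variable
lr.fit(X_train, y_train)  # learn the relation between X_train and y_train

y_prediction = lr.predict(X_test)
print(y_prediction)
```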
Let’s understand the above steps one by one.
We are importing the Linear Regression algorithm which is already implemented in a library, the open source framework SciKit-Learn. It is one of the most popular open source frameworks used for ML. There are many other popular frameworks, such as StatsAPI, H2O, Spark MLlib, PySpark, Tensorflow etc. There are many popular frameworks in Deep Learning as well; a few of them are Tensorflow, Keras, Caffe, CNTK, OpenCV, Torch, Theano etc.
You can play with those tools as per your choice and data.
So, coming back to our code above: why did I split the data into training and testing sets? Well, how else do you know or evaluate whether your ML algorithm is working fine on your data?
For example: imagine you had trained the ML algorithm with all the data you have right now and simply deployed it to a production/real-world environment. If it does not perform well, or users are not happy with its prediction accuracy, that is a bad experience for you and your business, isn’t it? You would probably need to revoke the changes then.
To avoid that, we split the whole data we currently have into training and test sets. After the split, we train with our training data and test the trained ML algorithm on the test data (data it has never seen) before going live, and evaluate the performance/accuracy of our ML algorithm. If it performs well, we go to the next steps, like deploying our trained model to production. If it does not, we tune our hyperparameters or data and iterate the process until we find a good model (algorithm) that performs well on our data (we will discuss all of this in detail in further articles).
So above, we randomly split our available data of 14 rows into 70 percent training data (9 rows) and 30 percent test data (5 rows). Keenly observe the output indices of the train and test data displayed above.
What does that random_state='some number' mean above? An explanation with an example is below.
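A sketch of the kind of seed experiment described here (the seed values 10 and 42 are my own choices):

```python
import numpy as np

# Same seed -> the same "random" numbers on every run
np.random.seed(10)
first = np.random.randint(0, 100, 5)
np.random.seed(10)
second = np.random.randint(0, 100, 5)
print(first, second)   # the two arrays are identical

# A different seed -> a different sequence
np.random.seed(42)
third = np.random.randint(0, 100, 5)
print(third)
```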
Output of above code example:
So, I hope you caught the difference between giving the same seed number twice, giving different seed numbers, and giving no seed number at all.
In simple terms, if you don’t give a seed number, your data changes every time you run.
Similarly, for our train and test split: if we don’t give random_state='some number', our split of the data changes on the next run, as the split shuffles randomly. With the same random_state number, we retain the same random split with the same indices until we change random_state to a new seed number.
Why is it needed? Because we may end up running our ML algorithm multiple times while tuning our data, hyperparameters etc. if our model is not performing well. By fixing the seed, on the next run we retain the same train and test data, so we can judge whether our model improved after tuning. If we don’t give random_state, we really can’t judge whether our model improved, because the random split changes on every run. I hope that gives you some context on its usage.
Let’s continue. We have loaded a LinearRegression object into the lr variable and fit our training data by calling the fit() method.
So, it basically means we have passed our training data to LinearRegression algorithm and then asked it to fit X and y data so that it finds logic (relation) between X (inputs) and y (outputs).
Now our algorithm is trained with train data. Let’s test our trained model on our test data (30%- 5 rows) which we had split above.
So, we have passed our whole test data to the predict method of our trained model. Below are the predictions made by our model.
Output: [154. 33. 110. 55. 22.]
Let’s see our test data again, which is represented above.
X_test: [14, 3, 10, 5, 2]
Awesome, our model has predicted correctly and understood the logic perfectly. We will analyse how it understood the logic with the Math behind it.
Let’s go further with evaluation metric of comparing our model accuracy. How do we determine accuracy in simple terms?
Example: Imagine you have been trained by an instructor on something before an exam and you were given some questions to test your understanding of the knowledge shared to you by instructor. Instructor does know the actual answers to the questions he/she prepared to evaluate your performance, isn’t it? Once you complete your exam and submit it to instructor, how does he evaluate you? He compares your submission with actual answers of the questions he had given to you in exam, isn’t it?
In the same way, we compare the predictions with the actual values stored in y_test by applying the R2 (called R-Square) score metric for regression problems. I will explain what R-Square is, and other error metrics, in later articles.
For now, understand that the R2 score is at most 1.
A score of 1 means an excellent model that can’t be beaten, 0 means the model does no better than always predicting the mean, and very bad models can even score below 0. I have multiplied by 100 to give a percentage.
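That comparison can be sketched with scikit-learn's r2_score, using the actual and predicted test values quoted above:

```python
from sklearn.metrics import r2_score

# Actual test targets and the model's predictions, as shown above
y_test = [154, 33, 110, 55, 22]
y_prediction = [154.0, 33.0, 110.0, 55.0, 22.0]

accuracy = r2_score(y_test, y_prediction) * 100   # scale to a percentage
print(accuracy)  # 100.0
```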
Our model accuracy is 100% (we already saw this by inspecting the y_prediction outputs; R2_score is simply an automated comparison between the y_test actuals and y_prediction), which means our model has captured the exact relationship in this data. We can deploy the model to start predicting on incoming data.
But, let’s double check to evaluate it more by predicting on new data.
Hope, you have seen the results and our model is accurately predicting our expected outcomes.
Hence, we can deploy the model in Production system without any doubts.
But what is the logic our model has understood? Let’s see that below.
The model coefficient is the logic the algorithm understood when we trained it with our input and output variables. The same logic (coefficient) is applied to all future incoming input data. The constant is the y-intercept, where the line meets the y-axis. In our data’s case, it is 0.
Confused on understanding these terms? Don’t worry, let’s revisit to our schooling math once.
Remember below Linear Equation in Geometry?
y = mx + c
where m is called the coefficient of x (the slope of the line) and c is the constant term (the y-intercept), which is where the line meets the y-axis.
We were given two (x, y) points and asked to find m and c, isn’t it? How did we find them?
m = (y2 - y1)/(x2 - x1)
where two points (x1, y1) and (x2, y2) are given. In case only one point is given, we assume the line passes through the (0, 0) origin coordinates. In that case, m becomes
m = y/x
Let’s take an example: given (2, 10), with the other point assumed to be the origin (0, 0),
m = 10/2 = 5
So in that case, the logic is m = 5. For any given x, you apply mx to find y. Got it? And c = 0, because the y-intercept meets the line at the origin (0, 0).
y = 5x, c = 0
Interesting, isn’t it? Revisiting our school math. Let’s draw a line passing through the above logic: when x=1, y=5; x=2, y=10; x=3, y=15.
In a similar way, our Linear Regression finds the logic between the inputs (X) and outputs (y). It tries to find the best-fit line passing through the coordinate points.
This algorithm works best only when there is a Linear relationship between X and y.
Now go and see our above scatter plot image of our input and output variables. Understood, what I was referring as a Linear Relationship there?
Providing one more example diagram to understand better.
Assume your X and Y data, viewed on a 2D graph, looks like the above. All those red dots are (X, Y) data points. We can clearly see there is a linear relationship between them; the data points lie close to a line. Can we find a straight line passing through all the data points? The answer is no. If we could find the best fit, then that would be the logic of our data.
This Logic can be found by our Algorithm which would be best fit.
Of course, our best fit (green line) does not pass through all the data points. There are errors, which we determine by calculating the deviations between actual and predicted values. Hence, accuracy will not be 100% in such cases (it may be 85–95%). But that is the best fit we as humans can come up with as well (apologies if the diagram drawn in Paint is a bad representation).
Our algorithm’s optimisation aims to find that best fit; that is the maximum accuracy it can reach (accuracy may stand between 85–95% in such cases).
What if the algorithm predicts the average fit (brown straight line)? Well, in such cases the algorithm is not performing well on our data (accuracy can be 50–75%); we need to tune hyperparameters and data, and apply other optimisation techniques, to come closer to the best fit line.
Similarly, if it is the worst fit (blue straight line, accuracy can be 30–50%), we have to apply the same techniques to reduce errors and come closer to the best fit.
Both the blue and brown lines are cases of Model Underfitting (the algorithm is performing poorly).
What if the algorithm performs like the yellow curve (a non-linear curve)? Well, that is among the worst things an algorithm can do. Wondering why, even though it passes through all the data points? Imagine a new data point comes in. Where should the curve go? The curve has merely memorised the training points; it cannot extend the logic and will perform poorly on any unseen data.
In ML, we call such cases Model Overfitting (the algorithm is mugging up the data).
A simple example: imagine a child is given addition problems with the answers beside them in the textbook. What if the child simply mugs up the pairs of numbers and their sums? Now give the child two new numbers and ask him to add them. Since he hasn’t learnt the logic of addition and only mugged up the data from the textbook, he will just output some number he remembers from the textbook, correct? The child hasn’t learned addition; he just memorised the data. Hence, he cannot perform on any new data he hasn’t seen in his textbook.
Now you get what overfitting means? It is the same as above: our yellow curve is doing the same thing as the child.
That’s some overview of few topics.
Let’s go ahead and plot our regression line (y = mx+c) for above trained and tested data.
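The plotting code is not shown in this text; a sketch of drawing the fitted line y = 11x over the data might look like this (the file name and Agg backend are my own choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works without a display
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(1, 15)
y = 11 * X                      # the learned logic: y = 11x, intercept 0

plt.scatter(X, y, color="red", label="data points")
plt.plot(X, 11 * X, color="green", label="regression line y = 11x")
plt.legend()
plt.savefig("regression_line.png")
```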
It is the best fit, isn’t it? Of course, it is. As our accuracy is 100% (R2-score =1).
By now, you may have got the context with simple example, how ML can identify the logic between X and y during Training when there is Linear relationship between Input and Output and same can be applied on Test data to evaluate model performance.
Let’s take a step forward by developing the algorithm from scratch without using Open source implementation.
In order to find the best fit, we use a metric called ‘Residual Sum of Squares’ (also called the sum of squared residuals (SSR) or sum of squared errors (SSE)) and compare candidate lines to find the best fit by reducing the errors. Each error is the difference between an actual value and a predicted value, and we sum the squares of these differences.
To find the initial errors for each dependent value, we will use the formula below, taking the mean as our baseline prediction:
SSE = Σ (yi − ȳ)²
where yi are the dependent variable values and ȳ is the average of the dependent variable values.
So what is a Residual?
How do we reduce errors and find the best fit? Follow below diagram.
In the above diagram, assume the algorithm (we) starts with the red line; we then calculate the sum of squared distances between the predicted line and all the data points. In that case, the errors will be large. It continues proposing lines and calculating distances until it finds the minimum error, which gives our best fit line. I hope you got the explanation (apologies if the diagram is not good).
Let’s see example data of X and Y and calculate its SSE.
The sum of squared errors (SSE) output is 562.1. To find the best-fit line, we need to apply a linear regression model to reduce the SSE value to the minimum possible. To identify the slope and intercept, we use the equation
y = mx + b
So how do we reduce the SSE errors above to find the best fit? We apply the Ordinary Least Squares method to find the best line intercept (b; it is the same as c, don’t get confused) and slope (m).
Ordinary Least Squares Method:
To use OLS method, we apply the below formula to find the equation.
We will calculate slope ‘m’ and line intercept ‘b’ for above data.
Let’s calculate ‘m’ by using OLS formula.
m = 194.5/82.5 = 2.357
b = 22.7 - (2.357*6.5) = 7.3795
Our calculated logic with best fit is y = 2.357x + 7.3795 using OLS method. Let’s take x =12 and apply on above equation, then we get y = 2.357*12+7.3795 = 35.6635. See it is clearly closer to our data points.
Let us calculate the SSE again by using our output equation of best fit and see errors.
See, our RSS errors got reduced from 562.1 to 103.551.
That is our algorithm’s intention: optimising to find the best fit using the OLS method.
As we have calculated it manually, let’s develop the same algorithm in Python code from scratch.
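A from-scratch sketch of the OLS fit. The article’s 10-row example table is not reproduced in this text, so this illustration reuses the simple y = 11x data from earlier; the same functions apply to any (x, y) pairs:

```python
def ols_fit(x, y):
    """Ordinary Least Squares for one input: returns (slope m, intercept b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # m = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = sum((xi - x_mean) ** 2 for xi in x)
    m = numerator / denominator
    b = y_mean - m * x_mean     # intercept: b = y_mean - m * x_mean
    return m, b


def sse(y_actual, y_predicted):
    """Sum of squared errors between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))


x = list(range(1, 15))
y = [11 * xi for xi in x]

m, b = ols_fit(x, y)
print(m, b)                               # 11.0 0.0
print(sse(y, [m * xi + b for xi in x]))   # 0.0
```

Converting the formulas into two small functions like this makes it easy to reuse them on any input and output data.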
Above Line fit is the best fit we can find, isn’t it? Or can you visualise any other best fit for above data? I don’t think we can find better than that where our SSE (RSS) errors are minimum.
As we have developed the OLS method from scratch, we can convert it into a function and then use it on any input and output variable data.
Let’s see the same in SciKit Learn framework on same data below.
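The scikit-learn code is not reproduced in this text; a sketch of the comparison might look like this (again using the y = 11x toy data in place of the article’s table, so the fitted numbers here are 11 and 0 rather than 2.357 and 7.3795):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(1, 15).reshape(-1, 1)   # column vector, as scikit-learn expects
y = 11 * np.arange(1, 15)

lr = LinearRegression()
lr.fit(X, y)
print(lr.coef_[0], lr.intercept_)     # slope ~11, intercept ~0: same as OLS by hand
```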
Observe that the output of Scikit-Learn and ours is the same, as Scikit-Learn uses the OLS method for its Linear Regression algorithm.
When we use any of the many open source ML frameworks out there, we can finish in a few lines of code compared to developing the algorithm from scratch. But it is very important to understand the Math working under the hood of any algorithm.
Let’s again check with Scikit learn with Train and Test split approach, code below.
So, with the train and test split approach, our model has identified the slope m = 2.27600849 and the intercept b = 7.6284501061571159.
Why is there a difference between passing the whole data and passing 80% of it? As we have a small number of training examples, the algorithm finds the best fit for whatever training data it is given. If we get more data and fit again, our slope and intercept values change again.
Based on the above split approach, when we tested on the 20% test set, we got an R2 score of 90.48%, which is pretty good.
For now, understand R2 Score formulas as below. Will explain regression metrics and other errors in separate article.
SST here is the same as the SSE we first calculated about the mean line.
So, what is Linear Regression?
Our objective is to find the linear relationship between the input (X) and output (Y) variables (provided X and Y are linearly related), i.e. the best-fit linear equation, by using the OLS method and reducing the residual sum of squares (RSS/SSE) to a minimum.
This Linear Regression implementation with the OLS method is called the Closed Form solution.
The one we discussed is called Simple Linear Regression which is a Univariate Model.
We call Univariate when we have one independent variable (X) and one target variable(Y).
When we have multiple independent variables and one target variable, we call it a Multivariate Model.
In such cases, our Linear regression equation becomes as below:
y = m1x1 + m2x2 + m3x3 + ..... + c
where we will have multiple slopes (coefficients) and an intercept term to find.
I will cover Multivariate Linear Regression and Linear Regression with Gradient Descent in next article.
GitHub link for complete code at one place: Simple Linear Regression
Important Note: I have heard of and seen some people who brag that they have implemented ML/AI in their projects when in reality they haven’t. Just because ML/AI are fancy, hyped terms these days, people fall into the trap of claiming they have used ML/AI to grab attention. In general, when asked about it, they don’t even know what their Supervised Learning target variable and input variables are. The worst part is they don’t even understand train-test evaluation. To the people reading this article: please don’t fall into that pitfall. Use the terms only when you at least understand the basics.
Another myth: people who don’t know ML/AI think it can solve all our complex problems. Nope, it can’t. But it is going to solve a huge set of our complex and repetitive problems in day-to-day and real life, and it is going to expand human innovation further. It has exceeded human performance in some cases with Deep Learning and Reinforcement Learning. So, you need to identify the right set of use cases and problems where ML/AI can be applied, not all of your problems.
ML/AI/DS is really a vast, really interesting and awesome subject/field to learn, pursue and grow. You will always have something new to learn as data changes frequently. So, please learn basic concepts, methods, techniques at least even though you don’t understand the Math behind it. If you can’t understand the Math also, please don’t stop learning as you can learn while doing (You can implement using many open source libraries for faster implementation). But knowing the technicalities and Math of ML is very important.
I hope you have understood the topics explained in this article. Please share your feedback in the comments if any topic wasn’t clear, or if you still need a more simplified explanation, so that I can rectify and improve explanations in further articles.
Please suggest any painting tool where Math graphs and diagrams can be represented in good way.
Catch you in next article.
Happy Machine Learning!
Connect with me at my personal blog for more.
