Regression Talks-Part I

Learn And Code Method to understand the math of regression

Bilwa Gaonker
TheLeanProgrammer
5 min read · Jun 20, 2021


In the last article, I spoke about my anniversary. This time, let us talk about relationships, shall we? “Bilwa, since when did you start coaching people on their love lives?” Oh no, I am not touching upon those relationships (it’s complicated xD). Let us talk about relationships between the variables in our dataset.

YAY YAY!

I am pretty sure you have heard the term ‘regression’ quite often as one of the top 10 machine learning algorithms ever! Regression is defined as a measure of the relationship between one dependent variable and one or more independent variables, and regression analysis is the set of statistical processes used to find this relationship. So we are shipping variables according to their compatibility, right?

Yes, definitely, and this compatibility is modeled by linear regression, logistic regression, and many other such methods. In this article, we’ll look at uni-variate linear regression using the scikit-learn Python library. I am pretty sure you must have guessed what that means by now.

Uni-variate regression tells us about a relationship between one independent variable (explanatory variable) and one dependent variable (output variable). Regression is usually used when the relationship can’t be figured out just by looking at the data points. Since this is the first time we are diving into the topic of regression, let us keep it simple this time to get a fair idea about how it works.

Let us talk about the math of the uni-variate regression…

What is the first thing that comes to your mind, when I say the output variable is dependent on the input variable?

y=m*x, right? To make it more generic, let us say y=m*x+c

Yeah, the equation of the straight line we learned in high school, where c is our intercept and m is the slope. The whole motive of this algorithm is to find the m and c for which the overall distance between this line and all the points is the smallest. Quite cool, right?
Usually, the equation used to measure this distance is called the cost function. A common choice is to take the difference between each actual value and the value predicted by your equation, square it, sum up these squares, and divide by the total number of data points. Sounds familiar? It is the mean squared error, and it has the same shape as our good ol’ friend variance, with the predicted value standing in for the mean.
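
The cost described above can be sketched in a few lines of plain Python. This is only an illustration: the function name mean_squared_error and the toy numbers are made up for this example and are not part of the dataset used later.

```python
def mean_squared_error(x_values, y_values, slope, intercept):
    """Average squared gap between each actual y and the line's prediction."""
    n = len(x_values)
    total = 0.0
    for x, y in zip(x_values, y_values):
        predicted = slope * x + intercept  # y = m*x + c
        total += (y - predicted) ** 2      # squared difference for this point
    return total / n                       # divide by the number of data points


# Made-up toy data that lies exactly on the line y = 2x + 1
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]

# Cost for the line that matches the data, and for a line with the wrong slope
perfect_cost = mean_squared_error(toy_x, toy_y, slope=2.0, intercept=1.0)
worse_cost = mean_squared_error(toy_x, toy_y, slope=1.0, intercept=1.0)
print(perfect_cost, worse_cost)
```

The line that matches the data has a lower cost than the one with the wrong slope, and that is exactly the signal the algorithm uses to pick m and c.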

This squared-error criterion is what sklearn.linear_model.LinearRegression() minimizes (it is known as ordinary least squares). So now let us jump to the coding part to implement the basics that we have learned.
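
For a single input variable, the m and c that minimize the squared-error cost have a textbook closed form: slope = covariance(x, y) / variance(x) and intercept = mean(y) - slope * mean(x). Here is a minimal sketch of that formula. It is not a copy of scikit-learn's internals, and fit_line and the toy data are hypothetical names and values used only for illustration.

```python
def fit_line(x_values, y_values):
    """Least-squares slope and intercept for one input variable."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data that lies exactly on the line y = 2x + 1
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(toy_x, toy_y)
print("slope =", slope, "intercept =", intercept)
```

On real, noisy data the points will not sit exactly on the line, but the same two formulas still give the line with the smallest squared-error cost.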

Let’s code…

I usually prefer to import all the libraries in one cell (yeah, I use Jupyter Notebook)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

For this code, we need NumPy, pandas, matplotlib, and sklearn libraries.

data=pd.read_csv('LRdata1.txt', header=None, names=['Population', 'Profit'])

Using the pandas library, we read the CSV file in which our data is stored. header=None tells pandas that the file has no header row, and names sets the column names for our dataset.

data.head()
Our dataset :))

No other preprocessing is required for this dataset.

# X = values from the first column (Population), y = values from the second column (Profit)
X=data.values[:,0]
y=data.values[:,1]
# m = number of training examples (rows in the dataset)
m=len(X)
print('The total number of training examples ',m)

So we set our y (actual output) to be the ‘Profit’ column and X (input variable) to be the ‘Population’ column. The variable ‘m’ in the code is the number of training examples, which is the length of the ‘Population’ column. Don’t confuse it with the slope m from the line equation above.

# set the figure size before plotting so that it applies to this figure
plt.rcParams["figure.figsize"]=(10,6)
plt.scatter(X,y, color='green', marker='*')
plt.grid()
plt.xlabel('Population of City in 10000s')
plt.ylabel('Profit in $10000s')
plt.title('Scatter Plot of Training Data')

Now we plot the data to see how it is distributed. A scatter plot is preferred as it shows where each data point lies.

Looking at the plot below, we can imagine what a line passing through the data would look like. This dataset was chosen to give an idea of what linear regression is and how it works. We can’t always look at the data and guess the line/curve passing through it!

Output- Scatter Plot

model=linear_model.LinearRegression()
model.fit(X.reshape(m,1),y)

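Now we use the linear_model module that was imported from sklearn. Calling LinearRegression() creates the model, and the next line of code trains it with the training dataset. The reshape(m,1) is needed because fit() expects the inputs as a 2-D array with one row per training example and one column per feature.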
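
If the reshape step looks mysterious, here is a small sketch of what it does to the array's shape. The numbers are made up; only the shapes matter. The reshape(-1, 1) variant shown at the end is a common alternative that lets NumPy work out the number of rows on its own.

```python
import numpy as np

# Made-up population values, stored as a flat 1-D array like data.values[:,0]
X = np.array([6.1, 5.5, 8.5, 7.0])
m = len(X)

# One row per training example, one column for the single feature
X_2d = X.reshape(m, 1)

print(X.shape)     # shape of the flat array
print(X_2d.shape)  # shape after reshaping into a column

# Same result without passing m explicitly: -1 means "infer this dimension"
X_auto = X.reshape(-1, 1)
```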

coefficient=model.coef_
intercept=model.intercept_
print('coeff/slope= ', coefficient)
print('intercept= ', intercept)

We get the slope (m) and the intercept (c) from the first two lines of code given above. Note that coef_ is an array with one entry per input feature, so here it holds a single value.

# set the figure size before plotting so that it applies to this figure
plt.rcParams["figure.figsize"]=(10,6)
plt.scatter(X,y,color='green', marker='*', label='Training Data')
# draw the fitted line using the model's predictions for the training inputs
plt.plot(X, model.predict(X.reshape(m,1)), color='red', label='Linear Regression')
plt.grid()
plt.xlabel("Population of City in 10000's")
plt.ylabel("Profit in $10000's")
plt.title("Linear Regression Fit")
plt.legend()

Now that we have trained the model, let us predict the output using the model.predict() method and plot it alongside the training data.

Isn’t this the line we kinda imagined going through the data points?

Let us make a prediction now by entering a population value that doesn’t lie in the training data.

# 3.5 means a population of 35,000, since Population is in units of 10,000
predict1=model.predict([[3.5]])
print("For population = 35,000, our prediction of profit (in $10000s) using the model is", predict1[0])

Woohoo! Our model is working
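
Under the hood, model.predict() for a single input is just our line equation y=m*x+c evaluated at that input. Here is a small sketch of doing it by hand and converting the result into dollars. The slope and intercept below are hypothetical placeholders, not the values fitted on this dataset; swap in model.coef_[0] and model.intercept_ from your own run. The helper predict_profit is also a made-up name.

```python
def predict_profit(population_in_10k, slope, intercept):
    """Evaluate the fitted line y = slope * x + intercept for one input."""
    return slope * population_in_10k + intercept


# Hypothetical values for illustration only (NOT the fitted ones);
# replace them with model.coef_[0] and model.intercept_
example_slope = 1.2
example_intercept = -4.0

# 3.5 means a population of 35,000 because the column is in units of 10,000
profit_in_10k = predict_profit(3.5, example_slope, example_intercept)

# The Profit column is in units of $10,000, so scale up to get dollars
profit_in_dollars = profit_in_10k * 10_000
print(f"Predicted profit: {profit_in_10k:.2f} (in $10,000s) = ${profit_in_dollars:,.0f}")
```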

Well, that’s it for this article I guess? The whole point of this article was to introduce Linear Regression! In the next article, we’ll try writing code from scratch for Regression without using the inbuilt sklearn library. Stay tuned for more regression talks like this! You can connect with me on LinkedIn if you have any queries related to my articles.

Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer
