Simple Linear Regression with an example using NumPy

Published in

Analytics Vidhya

7 min readNov 12, 2019

In this article I am going to explain one of the basic algorithm in machine learning with an example problem of relationship between alcohol and happiness from all over the countries .

Let’s get started .

What is Linear Regression ?

Linear regression is the mathematical technique to guess the future outputs based on the past data .

For example, let’s say you are watching your favourite player playing football in today’s match , he is having very good track record against this opponent team with an average of 2 goals in every match , based on this simple calculation in your mind you may expect him to score at least 2 score or more than that , so what your brain did was calculating the simple average or mean.

average = total score against opponent team / number of match against opponent

Linear regression also similar to that but instead of taking an average, we are doing much better statistical guess using linear relationship between the input variable (x) and target variable (y) .

Note : Linear Regression can be applied only for continuous variable like rain vs humidity , heart rate vs running speed etc .

let’ see how to it works by implementing it in popular numerical computing python package NumPy .

Linear Regression using NumPy

Step 1: Import all the necessary package will be used for computation .

import pandas as pd
import numpy as np

Step 2 : Read the input file using pandas library .

data = pd.read_csv('/Users/arunramji/Downloads/Sourcefiles/Alchol_vs_Happiness.csv',',',
                  usecols=['Country','Alcohol','Happiness Score'])

Now let’s see the glimpse of how the imported data looks like,

data.head()

Step 3: Filter only the required variables

A = data[['Alcohol','Happiness Score']]
A.tail()

Step 4: Convert the pandas data frame in to numpy array .

matrix = np.array(A.values,'float')
matrix[0:5,:]    #first 5 rows of dataarray([[194.33333333,   7.526     ],
       [188.33333333,   7.509     ],
       [124.        ,   7.501     ],
       [123.        ,   7.498     ],
       [164.33333333,   7.413     ]])

Step 5: Let’s assign input and target variable , x and y for further computation.

#Assign input and target variableX = matrix[:,0]
y = matrix[:,1]

Step 6 : Feature Normalisation -It is one of the important step for many ML models , what we actually do is compressing all our input variable in to smaller and similar magnitude so that later computation will be faster and efficient . Below we have one of the feature normalisation technique to make the input variable x in similar magnitude .

#feature normalization# input variable divided by maximum value among input values in XX = X/(np.max(X))

Step 7 : Since it is one input variable and one output variable , we can plot the 2d plot and see how it is distributed . It will help us to understand the data and problem in better way .

import matplotlib.pyplot as pltplt.plot(X,y,'bo')
plt.ylabel('Happiness Score')
plt.xlabel('Alcohol consumption')
plt.legend(['Happiness Score'])
plt.title('Alcohol_Vs_Happiness')
plt.grid()
plt.show()

Now it is clear that there are some correlation between alcohol consumption and happiness score , which means we can see that country which consumes more alcohol tend to be more happy !!

now let’s begin computing the hypothesis .

hypothesis is the term used to define the approximate target value(y) for the given training sample and it will be computed by our ML models .

Hypothesis

We need to compute the hypothesis by evaluating the linear relationship between X and y , here alcohol consumption vs happiness_score .

so how we are computing hypothesis or approximate output value (y) for given input (x) ?

An idea is , if we plot the simple line on data which has less deviation or error from the actual values, then it can be used to predict the future value with very minimal error .

So our goal is to find that optimal line , below is the line equation we will need to compute .

here we know the value for x , y from sample data, using that we have to compute optimal theta0 and theta1 which has minimal error cost to plot the linear fit .

Cost or SSE (sum of squared error) is the difference between our hypothesis and actual data points

Step 8: let’s define the function to calculate the cost or SSE .

def computecost(x,y,theta):
    
    a = 1/(2*m)
    b = np.sum(((x@theta)-y)**2)
    j = (a)*(b)
    return j

Step 9 : Appending a term x0 in our existing matrix X for mathematical convenience ,x0 should be having values as ‘1’ .

and assign some initial theta as 0 .

#initialising parameterm = np.size(y)
X = X.reshape([122,1])
x = np.hstack([np.ones_like(X),X])theta = np.zeros([2,1])print(theta,'\n',m)[[0.]
 [0.]] 
 122

Let’s compute what would be the cost if theta is zero ,

print(computecost(x,y,theta))1941.7825705000002

Our aim is to reduce this cost J(theta) value further , so that we can achieve the optimal linear fit for our data .

Gradient Descend

Gradient descend is a one such algorithm used to find the optimal parameter ‘theta’ using the given parameters ,

x — Input values

y — output values

Initial_theta — in most cases NULL theta

alpha — rate at which gradient pointer descending to optimal value

iteration — setting how many iteration it should take

understanding “Gradinet Desecnd” may require bit of calculus , but it is not necessary to implement and using it for ML problems . Knowing the role of the above mentioned parameters is often enough for implementation .

Step 10 : Defining function for gradient descent algorithm .

def gradient(x,y,theta):
    
    alpha = 0.00001
    iteration = 2000#gradient descend algorithm
    J_history = np.zeros([iteration, 1]);for iter in range(0,2000):
        
        error = (x @ theta) -y
        temp0 = theta[0] - ((alpha/m) * np.sum(error*x[:,0]))
        temp1 = theta[1] - ((alpha/m) * np.sum(error*x[:,1]))
        theta = np.array([temp0,temp1]).reshape(2,1)
        J_history[iter] = (1 / (2*m) ) * (np.sum(((x @ theta)-y)**2))   #compute J value for each iteration 
    return theta, J_history

Now let’s use the gradient function for our data ,

theta , J = gradient(x,y,theta)
print(theta)[[4.22499706]
 [2.38031097]]

Now we have got the optimal theta computed by gradient descend , but how can we be sure that this the optimal one , using computecost function we can see it .

theta , J = gradient(x,y,theta)
print(J)
array([[1936.24274283],
       [1930.71941566],
       [1925.21254022],
       ...,
       [ 115.49668262],
       [ 115.49460323],
       [ 115.49255932]])

cost or SSE value is 115.42 which is much better than 1941.78 was calculated when theta = 0

Step 11: Now let’s plot our line on data to see how well it fits the data .

#plot linear fit for our theta
plt.plot(X,y,'bo')
plt.plot(X,x@theta,'-')
plt.axis([0,1,3,7])
plt.ylabel('Happiness Score')
plt.xlabel('Alcohol consumption')
plt.legend(['HAPPY','LinearFit'])
plt.title('Alcohol_Vs_Happiness')
plt.grid()
plt.show()

It seems’s to be reasonable for given data sample , let’s use this linear fit to compute new and unknown input value x .

Step 12: Let’s predict for new input value 😃

predict1 = [1,(164/np.max(matrix[:,0]))] @ theta #normalising the input value, 1 is for intercept term so not need to normaliseprint(predict1)[5.98606924]

Bonus Notes :

There are few other ways we can determine whether gradient descent works fine or not, one of them is plotting J(theta) for each iteration and see how the value changes , it is good if J value getting reduced in each iteration but if it’s increasing then there must be some problem with our algorithm or data .

let ‘s see how it worked for model .

#visualising J (theta0 , theta1)theta0_vals = np.linspace(-5,10,100).reshape(1,100)
theta1_vals = np.linspace(-5,10,100).reshape(1,100)#initialise J value to matrix of 0
J_vals = np.zeros([np.size(theta0_vals),np.size(theta1_vals)])#fill J_vals
for i in range(0,np.size(theta0_vals)):
    for j in range(0,np.size(theta1_vals)):
        t = np.array([theta0_vals[:,i],theta1_vals[:,j]])
        J_vals[i,j] = computecost(x,y,t)# Because of the way meshgrids work in the surf command, we need to
# transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals.T

Now let’s plot .

#surface plot for covergence
from mpl_toolkits.mplot3d import Axes3D 
fig = plt.figure(figsize=[12.0,8.0])
ax = fig.add_subplot(111,projection ='3d')
ax.plot_surface(theta0_vals,theta1_vals,J_vals)
ax.set_xlabel('theta0')
ax.set_ylabel('theta1')
ax.set_zlabel('J_vals')plt.show()

In the above figure , we could see that surface converging near theta0 → 4 and theta1 → 2 so that we can say that calculated parameter is a reasonable one .

Conclusion

The explained linear regression technique is a commonly used modelling technique for predicting continuous variable , so will it work all the time for all kinds of data ? that we cannot tell for sure but as long as we understand the data and problem , linear regression will definitely give us a good statistical guess for unknown and new input values .

keep hustle for the better future !!