Regression Series-02: How Parameters Are Estimated in Linear Regression

Yashwanth S
5 min read · Jul 25, 2024


This blog is a continuation of the Regression Series; we are now on the second post. Please check out Part 01 if you missed it.

In this blog we will understand how the parameters are estimated by gradient descent, and we will also walk through a practical code demo that performs Linear Regression on a dataset.

Parameter Estimation in Linear Regression

Let's take experience X as the independent variable and the corresponding salary Y as the dependent variable. Assuming there is a linear relationship between X and Y, the salary can be predicted using:

ŷᵢ = θ1 + θ2·xᵢ

or,

Ŷ = θ1 + θ2·X

Here, also have a look at the diagram, where:

  • yᵢ ∈ Y (i = 1, 2, …, n) are the labels (the actual values in the data).
  • xᵢ ∈ X (i = 1, 2, …, n) are the input values.
  • ŷᵢ ∈ Ŷ (i = 1, 2, …, n) are the predicted values.

The model obtains the best-fit regression line by finding the best θ1 and θ2 values.

  • θ1: intercept
  • θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are finally using our model for prediction, it will predict the value of y for the input value of x.
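For intuition, here is a minimal NumPy sketch of this prediction step; the θ1, θ2, and experience values below are made up purely for illustration.

import numpy as np

# Hypothetical parameters: intercept (theta1) and slope (theta2)
theta1, theta2 = 25000.0, 9500.0

# Illustrative years of experience
x = np.array([1.0, 3.0, 5.0, 10.0])

# Predicted salary for each experience value: y_hat = theta1 + theta2 * x
y_hat = theta1 + theta2 * x
print(y_hat)  # [ 34500.  53500.  72500. 120000.]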

How are θ1 and θ2 updated to get the best-fit line?

To achieve the best-fit regression line, the model aims to predict the target value Ŷ such that the error between the predicted value Ŷ and the true value Y is minimal. The Gradient Descent algorithm is used to estimate these best parameters:

Cost function and Gradient for Linear Regression

The cost function, or loss function, is nothing but the error, i.e., the difference between the predicted value Ŷ and the true value Y.

In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values ŷᵢ and the actual values yᵢ. The purpose is to determine the optimal values for the intercept θ1 and the coefficient of the input feature θ2, providing the best-fit line for the given data points. The linear equation expressing this relationship is ŷᵢ = θ1 + θ2·xᵢ.

The MSE cost function can be calculated as:

J(θ1, θ2) = (1/n) Σᵢ (ŷᵢ − yᵢ)²
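As a quick sanity check, here is a minimal NumPy sketch of this cost function; the array values are made up for illustration.

import numpy as np

def mse(y, y_predicted):
    # average of the squared differences between actual and predicted values
    return np.mean((y - y_predicted) ** 2)

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])
print(mse(y, y_hat))  # 0.5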

A linear regression model can be trained using the optimization algorithm Gradient Descent, which iteratively modifies the model's parameters to reduce its MSE on a training dataset. The model uses Gradient Descent to update the θ1 and θ2 values so as to reduce the cost function (minimizing the MSE value) and achieve the best-fit line. The idea is to start with random θ1 and θ2 values and then iteratively update them until the minimum cost is reached.

A gradient is nothing but a derivative that describes how the output of a function changes when its inputs are varied slightly.

Let's differentiate the cost function (J) with respect to θ1:

∂J/∂θ1 = (2/n) Σᵢ (ŷᵢ − yᵢ)

Let's differentiate the cost function (J) with respect to θ2:

∂J/∂θ2 = (2/n) Σᵢ (ŷᵢ − yᵢ)·xᵢ

Finding the coefficients of a linear equation that best fit the training data is the objective of linear regression. The coefficients are adjusted by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients. If α is the learning rate, the intercept and the coefficient of X are updated as:

θ1 := θ1 − α · ∂J/∂θ1
θ2 := θ2 − α · ∂J/∂θ2

With this formula, the weight update happens for each and every parameter 😊😊.
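Putting the pieces together, here is a minimal NumPy sketch of a single update step for θ1 and θ2; the data, starting values, and learning rate are illustrative assumptions.

import numpy as np

# Illustrative data and starting parameters
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
theta1, theta2 = 0.0, 0.0
alpha = 0.01  # learning rate

y_hat = theta1 + theta2 * x_demo                           # current predictions
n = len(y_demo)
grad_theta1 = (2 / n) * np.sum(y_hat - y_demo)             # ∂J/∂θ1
grad_theta2 = (2 / n) * np.sum((y_hat - y_demo) * x_demo)  # ∂J/∂θ2

theta1 -= alpha * grad_theta1  # step against the gradient
theta2 -= alpha * grad_theta2
print(theta1, theta2)  # 0.12 0.35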

Enough of theory, let's get hands-on with Linear Regression 👨‍💻👨‍💻

Code Implementation

About the Dataset: it has 506 records, 13 features, and 1 target variable.
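The post does not show the data-loading step, so here is a minimal sketch of it, assuming the dataset lives in a CSV file; the file name and the target column name are hypothetical placeholders.

import pandas as pd

# Hypothetical file and column names; adjust them to your own dataset
df = pd.read_csv("housing.csv")

X = df.drop(columns=["target"]).values  # 13 feature columns
y = df["target"].values                 # 1 target variable

print(X.shape, y.shape)  # expected: (506, 13) (506,)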

Steps for Gradient Descent:

  1. Standardize the data:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_transform = sc.fit_transform(X)  # scale each feature to zero mean and unit variance

2. Initialize the parameters and hyperparameters

import numpy as np

x = X_transform  # use the standardized features from step 1
weight_vector = np.random.randn(x.shape[1])  # one random initial weight per feature
intercept = 0
learning_rate = 0.001

3. Find derivatives of loss w.r.t weight and bias.

def loss(y, y_predicted):
    # mean squared error between actual and predicted values
    n = len(y)
    s = 0
    for i in range(n):
        s += (y[i] - y_predicted[i]) ** 2
    return (1 / n) * s
  • Derivative of the loss w.r.t. the “weight”

# derivative of loss w.r.t. weight
def dldw(x, y, y_predicted):
    s = 0
    n = len(y)
    for i in range(n):
        s += -x[i] * (y[i] - y_predicted[i])
    return (2 / n) * s
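The update loop in step 4 also calls dldb (the derivative of the loss w.r.t. the bias) and predicted_y, which are not shown in the post. Here is a minimal sketch of both, written to be consistent with loss and dldw above; the exact original implementations are an assumption.

import numpy as np

# derivative of loss w.r.t. bias (assumed implementation, mirroring dldw)
def dldb(y, y_predicted):
    s = 0
    n = len(y)
    for i in range(n):
        s += -(y[i] - y_predicted[i])
    return (2 / n) * s

# linear prediction y_hat = x·w + b (assumed implementation)
def predicted_y(weight_vector, x, intercept):
    return np.dot(x, weight_vector) + intercept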

4. Update the weight and bias until we reach the global minimum.

epoch = 2000  # number of iterations (the post reports 2000 epochs)
losses = []   # record the loss at each epoch for the plot below
for i in range(epoch):
    y_predicted = predicted_y(weight_vector, x, intercept)
    weight_vector = weight_vector - learning_rate * dldw(x, y, y_predicted)  # update weight
    intercept = intercept - learning_rate * dldb(y, y_predicted)             # update bias
    losses.append(loss(y, y_predicted))
  • The above figure is the plot of the loss against the number of epochs.
  • After each epoch, the loss is reduced.
  • Initially, the loss decreases drastically, up to about the 1000th epoch.
  • After the 1000th epoch, the decrease in the loss is minimal.
  • This shows that we have reached the global minimum.
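For completeness, here is a minimal matplotlib sketch of how such a loss-versus-epoch plot can be produced from the losses list recorded in the loop above; the plotting details are an assumption, not the post's original code.

import matplotlib.pyplot as plt

plt.plot(range(len(losses)), losses)
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.title("Loss vs. number of epochs")
plt.show()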

5. Using the final weight vector and bias, we can predict the output

This is the final weight vector and bias after 2000 epochs.

  • The above table compares the actual target values with the targets predicted by the model.
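Here is a minimal sketch of this final prediction and comparison step, assuming the trained weight_vector and intercept from step 4 and using pandas for the side-by-side table.

import pandas as pd

# predict with the trained parameters
y_predicted = predicted_y(weight_vector, x, intercept)

# side-by-side comparison of actual vs. predicted targets
comparison = pd.DataFrame({"actual": y, "predicted": y_predicted})
print(comparison.head())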
