Regression Series-02: How Parameters Are Estimated in Linear Regression

Yashwanth S
5 min read · Jul 25, 2024


This blog is a continuation of the Regression Series; we are now on the second post. Please check out Part 01 if you missed it.

In this blog we will understand how the parameters are estimated by gradient descent, and we will also walk through a practical code demo that performs Linear Regression on a dataset.

Parameter Estimation in Linear Regression

Let's take experience X as the independent variable and the corresponding salary Y as the dependent variable. Assuming there is a linear relationship between X and Y, the salary can be predicted using:

ŷᵢ = θ1 + θ2·xᵢ

or,

Ŷ = θ1 + θ2·X

Here, also have a look at the diagram, where:

  • yᵢ ∈ Y (i = 1, 2, …, n) are the labels (the actual values in the data).
  • xᵢ ∈ X (i = 1, 2, …, n) are the input values.
  • ŷᵢ ∈ Ŷ (i = 1, 2, …, n) are the predicted values.

The model obtains the best-fit regression line by finding the best θ1 and θ2 values.

  • θ1: intercept
  • θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are finally using our model for prediction, it will predict the value of y for the input value of x.
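For intuition, here is a minimal NumPy sketch of this prediction step; the θ1, θ2, and experience values below are made up purely for illustration.

import numpy as np

# Hypothetical parameters: intercept (theta1) and slope (theta2)
theta1, theta2 = 25000.0, 9500.0

# Illustrative years of experience
x = np.array([1.0, 3.0, 5.0, 10.0])

# Predicted salary for each experience value: y_hat = theta1 + theta2 * x
y_hat = theta1 + theta2 * x
print(y_hat)  # [ 34500.  53500.  72500. 120000.]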

How are θ1 and θ2 updated to get the best-fit line?

To achieve the best-fit regression line, the model aims to predict the target value Ŷ such that the error between the predicted value Ŷ and the true value Y is minimal. The Gradient Descent algorithm is used to estimate these best parameters:

Cost function and Gradient for Linear Regression

The cost function, or loss function, is nothing but the error, i.e., the difference between the predicted value Ŷ and the true value Y.

In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values ŷᵢ and the actual values yᵢ. The purpose is to determine the optimal values for the intercept θ1 and the coefficient of the input feature θ2, providing the best-fit line for the given data points. The linear equation expressing this relationship is ŷᵢ = θ1 + θ2·xᵢ.

The MSE cost function can be calculated as:

J(θ1, θ2) = (1/n) Σᵢ (ŷᵢ − yᵢ)²
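As a quick sanity check, here is a minimal NumPy sketch of this cost function; the array values are made up for illustration.

import numpy as np

def mse(y, y_predicted):
    # average of the squared differences between actual and predicted values
    return np.mean((y - y_predicted) ** 2)

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])
print(mse(y, y_hat))  # 0.5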

A linear regression model can be trained using the optimization algorithm Gradient Descent, which iteratively modifies the model's parameters to reduce its MSE on a training dataset. The model uses Gradient Descent to update the θ1 and θ2 values so as to reduce the cost function (minimizing the MSE value) and achieve the best-fit line. The idea is to start with random θ1 and θ2 values and then iteratively update them until the minimum cost is reached.

A gradient is nothing but a derivative that describes how the output of a function changes when its inputs are varied slightly.

Let's differentiate the cost function (J) with respect to θ1:

∂J/∂θ1 = (2/n) Σᵢ (ŷᵢ − yᵢ)

Let's differentiate the cost function (J) with respect to θ2:

∂J/∂θ2 = (2/n) Σᵢ (ŷᵢ − yᵢ)·xᵢ

Finding the coefficients of a linear equation that best fit the training data is the objective of linear regression. The coefficients are adjusted by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients. If α is the learning rate, the intercept and the coefficient of X are updated as:

θ1 := θ1 − α · ∂J/∂θ1
θ2 := θ2 − α · ∂J/∂θ2

With this formula, the weight update happens for each and every parameter 😊😊.
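Putting the pieces together, here is a minimal NumPy sketch of a single update step for θ1 and θ2; the data, starting values, and learning rate are illustrative assumptions.

import numpy as np

# Illustrative data and starting parameters
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
theta1, theta2 = 0.0, 0.0
alpha = 0.01  # learning rate

y_hat = theta1 + theta2 * x_demo                           # current predictions
n = len(y_demo)
grad_theta1 = (2 / n) * np.sum(y_hat - y_demo)             # ∂J/∂θ1
grad_theta2 = (2 / n) * np.sum((y_hat - y_demo) * x_demo)  # ∂J/∂θ2

theta1 -= alpha * grad_theta1  # step against the gradient
theta2 -= alpha * grad_theta2
print(theta1, theta2)  # 0.12 0.35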

Enough of theory, let's get hands-on with Linear Regression 👨‍💻👨‍💻

Code Implementation

About the Dataset: it has 506 records, 13 features, and 1 target variable.
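The post does not show the data-loading step, so here is a minimal sketch of it, assuming the dataset lives in a CSV file; the file name and the target column name are hypothetical placeholders.

import pandas as pd

# Hypothetical file and column names; adjust them to your own dataset
df = pd.read_csv("housing.csv")

X = df.drop(columns=["target"]).values  # 13 feature columns
y = df["target"].values                 # 1 target variable

print(X.shape, y.shape)  # expected: (506, 13) (506,)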

Steps for Gradient Descent:

  1. Standardize the data:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_transform = sc.fit_transform(X)  # scale each feature to zero mean and unit variance

2. Initialize the parameters and hyperparameters

import numpy as np

x = X_transform  # use the standardized features from step 1
weight_vector = np.random.randn(x.shape[1])  # one random initial weight per feature
intercept = 0
learning_rate = 0.001

3. Find derivatives of loss w.r.t weight and bias.

def loss(y, y_predicted):
    # mean squared error between actual and predicted values
    n = len(y)
    s = 0
    for i in range(n):
        s += (y[i] - y_predicted[i]) ** 2
    return (1 / n) * s
  • Derivative of the loss w.r.t. the “weight”

# derivative of loss w.r.t. weight
def dldw(x, y, y_predicted):
    s = 0
    n = len(y)
    for i in range(n):
        s += -x[i] * (y[i] - y_predicted[i])
    return (2 / n) * s
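The update loop in step 4 also calls dldb (the derivative of the loss w.r.t. the bias) and predicted_y, which are not shown in the post. Here is a minimal sketch of both, written to be consistent with loss and dldw above; the exact original implementations are an assumption.

import numpy as np

# derivative of loss w.r.t. bias (assumed implementation, mirroring dldw)
def dldb(y, y_predicted):
    s = 0
    n = len(y)
    for i in range(n):
        s += -(y[i] - y_predicted[i])
    return (2 / n) * s

# linear prediction y_hat = x·w + b (assumed implementation)
def predicted_y(weight_vector, x, intercept):
    return np.dot(x, weight_vector) + intercept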

4. Update the weight and bias until we reach the global minimum.

epoch = 2000  # number of iterations (the post reports 2000 epochs)
losses = []   # record the loss at each epoch for the plot below
for i in range(epoch):
    y_predicted = predicted_y(weight_vector, x, intercept)
    weight_vector = weight_vector - learning_rate * dldw(x, y, y_predicted)  # update weight
    intercept = intercept - learning_rate * dldb(y, y_predicted)             # update bias
    losses.append(loss(y, y_predicted))
  • The above figure is the plot of the loss against the number of epochs.
  • After each epoch, the loss is reduced.
  • Initially, the loss decreases drastically, up to about the 1000th epoch.
  • After the 1000th epoch, the decrease in the loss is minimal.
  • This shows that we have reached the global minimum.
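For completeness, here is a minimal matplotlib sketch of how such a loss-versus-epoch plot can be produced from the losses list recorded in the loop above; the plotting details are an assumption, not the post's original code.

import matplotlib.pyplot as plt

plt.plot(range(len(losses)), losses)
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.title("Loss vs. number of epochs")
plt.show()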

5. Using the final weight vector and bias, we can predict the output

This is the final weight vector and bias after 2000 epochs.

  • The above table compares the actual target values with the targets predicted by the model.
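Here is a minimal sketch of this final prediction and comparison step, assuming the trained weight_vector and intercept from step 4 and using pandas for the side-by-side table.

import pandas as pd

# predict with the trained parameters
y_predicted = predicted_y(weight_vector, x, intercept)

# side-by-side comparison of actual vs. predicted targets
comparison = pd.DataFrame({"actual": y, "predicted": y_predicted})
print(comparison.head())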
