Gradient Descent for Multivariable Regression in Python

Hoang Phong
8 min read · Jul 28, 2021

--

We often encounter problems that require us to find the relationship between a dependent variable and one or more independent variables, whether in visualization tasks, machine learning models, or real-world applications, such as investors who want to estimate a housing price from its area and location. Using the foundation of Multivariable Regression, this correlation is learned from numerous training instances drawn from the investors’ previous housing transactions. So, how exactly does Multivariable Regression utilize Gradient Descent to find the best “hyperplane” of fit for these variables?

Multivariate Regression Intuition

To gain a better understanding of the role of Gradient Descent in optimizing the coefficients of Regression, we first look at the formula of a Multivariable Regression:

The Multivariable Function
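Written out with the symbols defined below, the hypothesis function takes the familiar linear form:

hθ(x) = θ0 + θ1·x1 + θ2·x2 + … + θn·xn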
  • y or hθ(x) = Hypothesis function (the dependent variable), taking the model parameters theta as inputs
  • θ0, θ1,…, θn = Weights or model parameters
  • x1, x2,…, xn = Predictors (the independent variables)
  • n = number of features

There are several assumptions that the data sets must hold in order to construct a multivariable regression:

  • Linearity: the model should be linear overall, i.e., there exists a linear relationship between the outcome and the independent variables.
  • Multivariate Normality: the residuals, formed by the difference between the actual values and the values predicted by the model, are assumed to be normally distributed and centered at 0.
  • No Multicollinearity: the independent variables are assumed not to be highly correlated with each other.

Choosing any set of theta (θ) values, even randomly, will generate a multivariable linear regression. However, if we pick these theta values without any order or rule, the hypothesis function will very likely be a poor one, offering no real ability to predict outcomes from the input variables.

An inferior linear regression

Let’s take the case of predicting employees’ salaries from their years of experience. Common sense says that the more experienced a worker is, the higher the salary he or she should earn. But our model has forecasted the complete opposite: workers would earn less as they gain more experience. Therefore, to avoid such an underfitting model, we measure its mean squared error, widely known as the cost function, to evaluate the model’s performance.

The Cost Function

The cost function for Multivariable Linear Regression
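For m training examples, the cost is the halved mean squared error between the predictions and the targets, the same scaling that the cost computation in the code later in this post uses:

J(θ) = (1/2m) · Σ (hθ(x^(i)) − y^(i))², where the sum runs over i = 1, …, m and y^(i) is the actual target value of the i-th training example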
  • J(θ) = The cost function which takes the theta as inputs
  • m = number of instances
  • x(i) = input (features) of i-th training example

As we can see from the formula, the regression generated for the salary-versus-experience plot will produce a very high cost, since its residuals are far too large, confirming that the line in the previous plot performs poorly. Our goal is to find the theta values that minimize the cost function. But what if our initial guess of theta creates a bad model like the line in the salary-and-experience plot? That is when Gradient Descent comes in as the optimization strategy for our multivariable linear regression.

How does Gradient Descent work in Multivariable Linear Regression?

Gradient Descent is a first-order optimization algorithm for finding a local minimum of a differentiable function.

At the theoretical level, the idea is that repeatedly taking steps in the opposite direction of the gradient will gradually lead us to a minimum of that function.
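As a minimal illustration of that idea (the function, names, and step sizes below are purely illustrative and not yet part of the regression model), gradient descent on a single variable looks like this:

# Minimal sketch of gradient descent on a one-variable function.
# The derivative, starting point, and names here are illustrative only.
def gradient_descent(derivative, start, learning_rate=0.1, iterations=100):
    x = start
    for _ in range(iterations):
        x = x - learning_rate * derivative(x)  # step opposite the gradient
    return x

# Example: minimize f(x) = (x - 3)**2, whose derivative is 2*(x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # converges towards 3.0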

Using the Gradient Descent foundation, we can implement our own algorithm for the Multivariate Regression cost function by continuously updating our theta values after each iteration, as follows:
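In its general form, each parameter is nudged against its own partial derivative of the cost, with all parameters updated simultaneously:

θj := θj − α · ∂J(θ)/∂θj, for j = 0, 1, …, n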

The 𝛼 symbol represents the learning rate of the algorithm, which controls how fast the model learns. Choosing an appropriate learning rate is essential, as it ensures the cost function converges in a reasonable time. If the model fails to converge, or takes too long to reach its minimum, our learning rate is probably a poor choice. Within the Gradient Descent algorithm, however, the learning rate is just a constant; hence, after taking the partial derivatives of the cost function, the algorithm becomes:
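With the (1/2m)-scaled cost function above, differentiating with respect to θj gives the per-parameter update (an unhalved MSE cost would simply double the constant, which only rescales α):

θj := θj − (α/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i)_j, with the sum over i = 1, …, m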

  • x^(i)_j = value of feature j in i-th training example

We have now finalized the Gradient Descent algorithm for Multivariate Regression. However, if we have a much larger number of features, say thousands of independent variables, the training process becomes far more difficult and complex. One approach to tackle this problem is to turn our formula and algorithm into vectorized form by building the model’s parameter vector, the instance’s feature vector, and the target values vector.

Target values vector
Model parameter vector
Instance’s feature vector
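Concretely, these three quantities are column vectors (x0 is explained just below):

y = (y^(1), y^(2), …, y^(m))ᵀ
θ = (θ0, θ1, …, θn)ᵀ
x = (x0, x1, …, xn)ᵀ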

The model’s parameter vector contains the bias term θ0 and the other model parameters from θ1 to θn. The instance’s feature vector, on the other hand, holds the features from x0 to xn. Note that our previous formula for the hypothesis function did not contain x0: this is because x0 is attached to the bias term of the model (θ0) and therefore always equals 1. We can interpret this as defining an additional 0-th feature that always takes the value 1. Both the model’s parameter vector and the instance’s feature vector are therefore (n+1)-dimensional, and we can transform the hypothesis function into vectorized form by taking the dot product of the transpose of the model’s parameter vector with the instance’s feature vector.

The Multivariate Regression in vectorized form
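Taking that dot product, the hypothesis function becomes:

hθ(x) = θᵀ · x = θ0·x0 + θ1·x1 + … + θn·xn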

Applying these new vectors to our Gradient Descent algorithm, we get a much shorter and prettier version:
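In other words, the per-parameter update now reads as follows, a sketch of the shortened form with the dot product in place of the written-out hypothesis:

θj := θj − (α/m) · Σ (θᵀ·x^(i) − y^(i)) · x^(i)_j, with the sum over i = 1, …, m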

Instead of calculating each theta individually, we can use the gradient vector to compute them all in one step. Before coming to the gradient vector, we need to define the last component of our multivariate regression:

X is the matrix that contains all the feature values in the dataset, not including the outcome values.
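Concretely, X is an m × (n+1) matrix whose i-th row is the transposed feature vector (x^(i))ᵀ of the i-th training instance, so its first column is all 1s (the x0 entries):

X = (x^(1), x^(2), …, x^(m))ᵀ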

We then write the formula of the gradient vector for the cost function:

Gradient vector for the cost function
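One common vectorized form, and the one the implementation below uses, is the gradient of the plain mean squared error (keeping the ½ factor in the cost changes the leading constant to 1/m; the difference only rescales α):

∇θ J(θ) = (2/m) · Xᵀ · (X·θ − y)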

Utilizing the gradient vector, we can shorten our theta-update formula to:
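The whole Gradient Descent step then collapses to a single line:

θ(next step) = θ − α · ∇θ J(θ)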

With carefully constructed formulas and algorithms, implementing them in Python from scratch will not be too difficult!

Implementing Gradient Descent in Python

In most multivariable linear regression problems, it is not complicated to split the independent variables from the target values. However, the independent variables matrix does not usually contain the column that corresponds to x_0. Therefore, our first function should add these values to the original matrix.

import numpy as np

def generateXvector(X):
    """ Take the original independent variables matrix and add a column of 1s, which corresponds to x_0
    Parameters:
      X: independent variables matrix
    Return value: the matrix that contains all the values in the dataset, not including the outcome variables.
    """
    # np.c_ prepends a column of ones so that theta_0 acts as the bias term
    vectorX = np.c_[np.ones((len(X), 1)), X]
    return vectorX

We also need to obtain a vector that contains the initial guess of theta. This model parameter vector has dimensions (n+1) × 1:

def theta_init(X):
    """ Generate an initial value of vector θ from the original independent variables matrix
    Parameters:
      X: independent variables matrix
    Return value: a vector of theta filled with an initial random guess
    """
    # n features plus the bias term give an (n+1) x 1 vector
    theta = np.random.randn(len(X[0]) + 1, 1)
    return theta

Finally, with enough preparation, we can build the main Multivariable Linear Regression function. It takes the two training matrices, a learning rate, and a number of iterations. It first reshapes the matrix y to match the dimensions of the target values vector in the gradient vector formula. The function then computes the updated gradient at each iteration, producing a new model parameter vector with better performance. It also calculates the cost value at each iteration and stores it so the cost function can be plotted later.

import matplotlib.pyplot as plt

def Multivariable_Linear_Regression(X, y, learningrate, iterations):
    """ Find the multivariate regression model for the data set
    Parameters:
      X: independent variables matrix
      y: dependent variables matrix
      learningrate: learning rate of Gradient Descent
      iterations: the number of iterations
    Return value: the final theta vector and the plot of the cost function
    """
    y_new = np.reshape(y, (len(y), 1))
    cost_lst = []
    vectorX = generateXvector(X)
    theta = theta_init(X)
    m = len(X)
    for i in range(iterations):
        # Vectorized gradient of the MSE: (2/m) * X^T (X.theta - y)
        gradients = 2/m * vectorX.T.dot(vectorX.dot(theta) - y_new)
        theta = theta - learningrate * gradients
        y_pred = vectorX.dot(theta)
        # Use y_new (shape (m, 1)) so the subtraction stays element-wise
        cost_value = 1/(2*len(y)) * ((y_pred - y_new)**2)
        # Calculate the loss for each training instance
        total = 0
        for j in range(len(y)):
            total += cost_value[j][0]
        # Calculate the cost function for each iteration
        cost_lst.append(total)
    plt.plot(np.arange(1, iterations), cost_lst[1:], color='red')
    plt.title('Cost function Graph')
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    return theta

Examining our Multivariate Regression Code

Great! Now that we have built our own Gradient Descent code, it is time to check whether it runs properly by comparing its optimized parameter values with those of the LinearRegression class from SkLearn. We will use the Diabetes dataset from SkLearn.

from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

Before we feed the data into our function, we need to apply Feature Scaling to it. Feature Scaling is essential in Gradient Descent, as the algorithm converges much faster with properly scaled features than without them.

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_transform=sc.fit_transform(X)

Fitting the data with SkLearn.

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_transform, y)
lin_reg.intercept_, lin_reg.coef_
>>> (152.13348416289594,
array([ -0.47623169, -11.40703082, 24.72625713, 15.42967916,
-37.68035801, 22.67648701, 4.80620008, 8.422084 ,
35.73471316, 3.21661161]))

Fitting the data with our Gradient Descent.

Multivariable_Linear_Regression(X_transform,y, 0.03, 30000)
>>> array([[152.13348416],
[ -0.47623165],
[-11.40703078],
[ 24.72625722],
[ 15.42967913],
[-37.68035063],
[ 22.67648115],
[ 4.80619678],
[ 8.42208306],
[ 35.73471041],
[ 3.21661164]])

Wow! Our Gradient Descent produces parameter values very similar to those of the SkLearn function. We have successfully implemented Gradient Descent for Multivariate Regression in Python from scratch.

Final Words

The code for this post can be found here:

I would love to hear your thoughts and discussion about this post. It would be great if you could share any unique experiences related to Multivariate Regression or Gradient Descent. Looking forward to connecting with future data scientists on LinkedIn.
