Vectorizing Gradient Descent — Multivariate Linear Regression and Python implementation

Rohan Paul
Oct 7, 2020

In this article, I shall derive the vectorized gradient-descent formula for the cost function when the training data is expressed in matrix form, along with the fundamentals of calculus (especially partial derivatives) and matrix derivatives necessary to understand the process.

The goal of this article is to understand the full mathematics and the flow behind arriving at the formula below, which is the vectorized gradient for the training-data matrix.

[Image: the vectorized gradient formula for the matrix form of the training data, derived in full below]

First a Refresher on basic Matrix Algebra

A matrix A over a field K or, simply, a matrix A (when K is implicit) is a rectangular array of scalars usually presented in the following form:

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$

The rows of such a matrix A are the m horizontal lists of scalars:

$(a_{i1}, a_{i2}, \ldots, a_{in}), \quad i = 1, \ldots, m$

and the columns of A are the n vertical lists of scalars:

$\begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{bmatrix}, \quad j = 1, \ldots, n$

A matrix with m rows and n columns is called an m by n matrix, written m × n. The pair of numbers m and n is called the size of the matrix. Two matrices A and B are equal, written A = B, if they have the same size and corresponding elements are equal. Thus, the equality of two m × n matrices is equivalent to a system of mn equalities, one for each corresponding pair of elements.
A matrix with only one row is called a row matrix or row vector, and a matrix with only one column is called a column matrix or column vector. A matrix whose entries are all zero is called a zero matrix and will usually be denoted by 0.

Matrix Multiplication

The below image is taken from Khan Academy’s excellent linear algebra course.

[Image: matrix multiplication illustration from Khan Academy's linear algebra course]

In the above, each entry in the product matrix is the dot product of a row in the first matrix and a column in the second matrix.

More explanation for the higher-dimension case: if the product AB = C is defined, where C is denoted by [cij], then the element cij is obtained by multiplying the elements in the ith row of A by the corresponding elements in the jth column of B and adding. Thus, if A has order k × n and B has order n × p, then

[Image: the matrices A (k × n) and B (n × p) written out]

then c11 is obtained by multiplying the elements in the first row of A by the corresponding elements in the first column of B and adding; hence,

$c_{11} = a_{11}b_{11} + a_{12}b_{21} + \cdots + a_{1n}b_{n1}$

The element c12 is found by multiplying the elements in the first row of A by the corresponding elements in the second column of B and adding; hence,

$c_{12} = a_{11}b_{12} + a_{12}b_{22} + \cdots + a_{1n}b_{n2}$

The element ckp below is obtained by multiplying the elements in the kth row of A by the corresponding elements in the pth column of B and adding; hence,

$c_{kp} = a_{k1}b_{1p} + a_{k2}b_{2p} + \cdots + a_{kn}b_{np}$

There are four simple rules that will help us in multiplying matrices, listed here

1. Firstly, we can only multiply two matrices when the number of columns in
matrix A is equal to the number of rows in matrix B.

2. Secondly, the first row of matrix A multiplied by the first column of matrix B
gives us the first element in the matrix AB, and so on.

3. Thirdly, when multiplying, order matters: in general, AB ≠ BA.

4. Lastly, the element at row i, column j of AB is the dot product of the ith row of matrix A and the jth column of matrix B.

The order in which we multiply matters. We must keep the matrices
in order, but we do have some flexibility. As we can see in the following equation, the parentheses can be moved:

$(AB)C = A(BC)$

The following three rules also apply to matrix operations.

[Image: additional rules for matrix operations]

Let's see an example of matrix multiplication.

[Image: a worked example of multiplying two matrices]

Implementing a dot product with NumPy

import numpy as np

# this will create integer array-elements
A = np.random.randint(10, size=(2, 3))
B = np.random.randint(10, size=(3, 2))

# This will create 2 matrices, e.g.
# A = [[9 0 8]
#      [3 5 5]]
# B = [[5 2]
#      [0 7]
#      [0 7]]

print(np.dot(A, B))
# Output
# [[45 74]
#  [15 76]]

Hadamard multiplication is defined for matrices of the same shape as the multiplication of each element of one matrix by the corresponding element of the other matrix. Hadamard multiplication is often denoted by ∘ (or ⊙); for two matrices A (n×m) and B (n×m) we have

$(A \circ B)_{ij} = a_{ij}\, b_{ij}$
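A quick NumPy illustration (the values here are just made up): the Hadamard product is the plain * operator on same-shaped arrays, in contrast to the dot product.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[10, 20, 30],
              [40, 50, 60]])

# Hadamard (element-wise) product: shapes must match exactly
print(A * B)
# [[ 10  40  90]
#  [160 250 360]]

# contrast with the dot product, which needs conformable shapes (A is 2x3, B.T is 3x2)
print(np.dot(A, B.T))
# [[140 320]
#  [320 770]]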

Refresher on Multivariate Linear Regression

Suppose we have 2 equations as below

10 = 2x + 2y
18 = 4x + y

So the Matrix form

$\begin{bmatrix} 10 \\ 18 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$

So, in general mathematical form for the single-independent-variable case,

$y_i = \beta_0 + \beta_1 x_i + e_i$

So the set of equations for all the observations will be as below

$y_1 = \beta_0 + \beta_1 x_1 + e_1, \quad y_2 = \beta_0 + \beta_1 x_2 + e_2, \quad \ldots, \quad y_n = \beta_0 + \beta_1 x_n + e_n$

And the Matrix form of which is below

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$

So Y is an n × 1 matrix, X is an n × 2 matrix, and β is a 2 × 1 matrix.

[Image: matrix form of SLR]

Suppose that the response variable Y and at least one predictor variable xi are quantitative. Then the equation for a specific Y value under the MLR model is

$Y_i = x_i^T \beta + e_i$

for i = 1, . . . , n. Here n is the sample size and the random variable ei is the
ith error. So

[Image: the n equations of the MLR model written out, one per observation]

Which is in compact notation below

[Image: the MLR model in compact notation]

And now in matrix notation, these n sets of equations become

$Y = X\beta + e$

where Y is the vector of the response variable and is an n × 1 vector of dependent variables, X is the matrix of the k independent/explanatory variables (usually the first column is a column of ones for the constant term) and is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors. Equivalently,

[Image: the vectors Y, β, e and the matrix X written out in full]

Where
Y: the output vector for the n training examples.
X: the matrix of size n × p, where the ith row is the ith training example.
β: the weight vector of size p, one weight per feature.
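As a small NumPy illustration of this matrix form (the numbers are made up): build the design matrix X with a leading column of ones, pick a β, and Y = Xβ + e follows from a single matrix multiplication.

import numpy as np

x_feature = np.array([1.0, 2.0, 3.0, 4.0])                  # one feature, 4 observations
X = np.column_stack([np.ones_like(x_feature), x_feature])   # n x p design matrix, p = 2

beta = np.array([0.5, 2.0])    # [intercept, slope] -- assumed values
e = np.zeros(4)                # error terms, set to zero here for simplicity

Y = X @ beta + e               # the matrix form Y = X beta + e
print(Y)                       # [2.5 4.5 6.5 8.5]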

Different types of Matrix Differentiation

Differentiation of a structure (vector or matrix, for example) with respect to a scalar is quite simple; it just yields the ordinary derivative of each element of the structure in the same structure. Thus, the derivative of a vector or a matrix with respect to a scalar variable is a vector or a matrix, respectively, of the derivatives of the individual elements.

1.A-Derivatives of Vectors with Respect to Scalars

The derivative of the vector y(x) = (y1 , . . . , yn ) with respect to the scalar x
is the vector

$\frac{\partial y}{\partial x} = \left( \frac{\partial y_1}{\partial x}, \ldots, \frac{\partial y_n}{\partial x} \right)$

The second or higher derivative of a vector with respect to a scalar is likewise a vector of the derivatives of the individual elements; that is, it is an array of higher rank.

1.B-Derivatives of Matrices with Respect to Scalars

The derivative of the matrix Y(x) defined as below

[Image: the matrix Y(x) with elements $y_{ij}(x)$]

with respect to the scalar x is the matrix

$\frac{\partial Y}{\partial x} = \left[ \frac{\partial y_{ij}}{\partial x} \right]$

The second or higher derivative of a matrix with respect to a scalar is
likewise a matrix of the derivatives of the individual elements.

Differentiation of a function of a vector or matrix that is linear in the elements of the vector or matrix involves just the differentiation of the elements, followed by application of the function. For example, the derivative of the trace of a matrix is just the trace of the derivative of the matrix. On the other hand, the derivative of the determinant of a matrix is not the determinant of the derivative of the matrix.

Because differentiation with respect to a scalar does not change the rank of the object (“rank” here means rank of an array or “shape”), higher-order derivatives

[Image: higher-order derivatives with respect to a scalar]

with respect to scalars are merely objects of the same rank
whose elements are the higher-order derivatives of the individual elements.

Differentiation of a given object with respect to an n-vector yields a vector for each element of the given object. The basic expression for the derivative, from formula

[Image: the basic expression for the derivative with respect to an n-vector]

for an arbitrary conformable vector y. The arbitrary y indicates that the derivative is omnidirectional; it is the rate of change of a function of the vector in any direction.

The derivative of a scalar-valued function with respect to a vector is a vector
of the partial derivatives of the function with respect to the elements of the
vector. If f (x) is a scalar function of the vector x = (x1 , . . . , xn ),

$\frac{\partial f}{\partial x} = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)$

if those derivatives exist. This vector is called the gradient of the scalar-valued
function, and is sometimes denoted by ∇f (x)
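To make this concrete, here is a small NumPy check (the function f and the matrix A are made-up examples): the gradient of the scalar function f(x) = xᵀAx is (A + Aᵀ)x, and each partial derivative can be verified numerically.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def f(x):
    # scalar-valued function of a vector: f(x) = x^T A x
    return x @ A @ x

x = np.array([1.0, -2.0])

# analytic gradient of x^T A x is (A + A^T) x
grad_analytic = (A + A.T) @ x

# numerical check: central finite difference for each partial derivative
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(2)])

print(grad_analytic)   # [  0. -10.]
print(grad_numeric)    # approximately the same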

The derivative of an m-vector-valued function of an n-vector argument consists of nm scalar derivatives. These derivatives could be put into various structures. Two obvious structures are an n × m matrix and an m × n matrix.

For a function,

[Image: a vector-valued function f of an n-vector argument]

we define

[Image: the definition of $\partial f / \partial x$]

to be the n × m matrix, which is the natural extension of ∂/∂x applied to a scalar function. The above notation is more precise because it indicates that the elements of f correspond to the columns of the result.

[Image: the n × m matrix of partial derivatives]

if those derivatives exist. This derivative is called the matrix gradient and
is denoted by ∇f for the vector-valued function f . (Note that the ∇
symbol can denote either a vector or a matrix, depending on whether the
function being differentiated is scalar-valued or vector-valued.)

The m × n matrix is called the Jacobian of f and is denoted by Jf as below

$J_f = \left[ \frac{\partial f_i}{\partial x_j} \right]_{m \times n}$

General Form of the Jacobian

To arrive at the general mathematical form of the Jacobian, I will follow a well-recognized paper in this field.

Let y = f(x) be a vector of m scalar-valued functions that each take a vector x of length n = |x| where |x| is the cardinality (count) of elements in x. Each fi function within f returns a scalar just as in the previous section:

[Image: $y = f(x)$, with each component $y_i = f_i(x)$ scalar-valued]

For instance,

[Image: an example pair of functions $f_1(x)$ and $f_2(x)$]

So the set of equations are as below

[Image: the example set of equations written out]

So we have m = n functions and parameters, in this case. Generally speaking, though, the Jacobian matrix is the collection of all m × n possible partial derivatives (m rows and n columns), which is the stack of m gradients with respect to x:

$J = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x} \\ \vdots \\ \frac{\partial f_m(x)}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$

Each of these

$\frac{\partial f_i(x)}{\partial x}$

is a horizontal n-vector because the partial derivative is with respect to a vector, x, whose length is n = |x|. The width of the Jacobian is n if we’re taking the partial derivative with respect to x because there are n parameters we can wiggle, each potentially changing the function’s value. Therefore, the Jacobian is always m rows for m equations.
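To see the m × n shape concretely, here is a small numerical Jacobian in NumPy (the function f below is a made-up example with m = 3 outputs and n = 3 inputs):

import numpy as np

def f(x):
    # a made-up vector-valued function: R^3 -> R^3
    return np.array([x[0] * x[1],
                     x[0] ** 2 + x[2],
                     np.sin(x[1])])

def numerical_jacobian(f, x, eps=1e-6):
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))          # m rows (one per output), n columns (one per input)
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)   # central difference
    return J

x = np.array([1.0, 2.0, 3.0])
print(numerical_jacobian(f, x))
# analytically: [[x[1], x[0], 0], [2*x[0], 0, 1], [0, cos(x[1]), 0]]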

Now the Cost Function under Multivariate Linear Regression

The equation for the hypothesis function is as follows

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

The general notations that I will use for extending the above function

[Image: the notation for the training examples, the features, and their counts]

So if we are predicting house-price with the above MLR equation, then θ0 will be the basic/base price of a house, then θ1 as the price per room, θ2 as the price per KM-distance from the nearest Airport.

Using the definition of matrix multiplication, our multivariate hypothesis function can be concisely represented as:

$h_\theta(x) = \theta^T x$

This is a vectorization of our hypothesis function for one training example;

Now, using the fact that for a vector z, we have that

$z^T z = \sum_i z_i^2$

Applying the above identity to the right-hand-side of the Cost function (below)

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

So now the Cost function takes the following form

$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$

And our target is to

$\min_\theta J(\theta)$

Where the thetas θ are the weights, and the above partial derivative for any weight θj will be as below

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

So the Gradient-Descent process for Multivariate case becomes

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(simultaneously for all } j)$

Where θ and x are column vectors given by

[Image: θ and x written as column vectors]

And that's why we take the transpose of θ to multiply with column-vector x to get the hypothesis (as earlier mentioned in this article)

$h_\theta(x) = \theta^T x$
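As a small NumPy illustration of the vectorized hypothesis and cost (the numbers are made up, and the 1/(2m) scaling follows the convention used above):

import numpy as np

# each row of X is one training example: [1, rooms, distance]
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 4.0, 0.5]])
y = np.array([3.0, 4.0, 5.0])
theta = np.array([0.5, 1.0, -0.2])

h_single = theta @ X[0]          # hypothesis for one example: theta^T x
h_all = X @ theta                # hypothesis for all examples at once: X theta

m = len(y)
J = (h_all - y) @ (h_all - y) / (2 * m)   # vectorized cost
print(h_single, h_all, J)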

Refresher — Matrix-Derivative Identities required for the Mathematical Derivation of the Gradient of a Matrix w.r.t. Vectors

The derivative of a matrix is usually referred to as the gradient and denoted as ∇. Consider a function

[Image: a function f mapping an m × n matrix to a real number]

That is, the matrix will be as below

[Image: the matrix A written out in full]

Then,

$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$

Thus, the gradient ∇Af(A) is itself an m-by-n matrix, whose (i, j)-element is

$\left( \nabla_A f(A) \right)_{ij} = \frac{\partial f}{\partial A_{ij}}$

For example, let's take a look at a very simple case. Suppose

[Image: a 2 × 2 matrix A]

is a 2-by-2 matrix, and the function

[Image: a function f mapping 2 × 2 matrices to real numbers]

is given by

[Image: the definition of f(A) in terms of the entries $A_{ij}$]

Here, Aij denotes the (i, j) entry of the matrix A. We then have

[Image: the resulting 2 × 2 gradient matrix $\nabla_A f(A)$]

Matrix Transpose Identities

Each of the identities below can be proved mathematically.

[Image: transpose identities, e.g. $(A^T)^T = A$ and $(AB)^T = B^T A^T$]

Another related one, If 𝐴 and 𝐵 are two matrices of the same order, then

$(A + B)^T = A^T + B^T$

The sum of the diagonal elements of a square matrix is called the trace of the
matrix. We use the notation “tr(A)” to denote the trace of the matrix A:

$\operatorname{tr}(A) = \sum_i a_{ii}, \qquad \operatorname{tr}(AB) = \operatorname{tr}(BA)$

Because of the associativity of matrix multiplication, this relation can be
extended as

$\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)$

Now we proceed to find the gradient of the cost function.

To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θj. In other words, you need to calculate how much the cost function will change if you change θj just a little bit. This is called a partial derivative. It is like asking “What is the slope of the mountain under my feet if I face east?” and then asking the same question facing north.

Now you would recognize the very well-known cost function

$\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$

Following the Jacobian identity discussed above, the gradient vector of the cost function will be

$\nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m} X^T (X\theta - y)$

Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step.

$\theta^{(\text{next step})} = \theta - \eta \, \nabla_\theta \mathrm{MSE}(\theta)$
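Here is a minimal batch gradient-descent sketch in NumPy following the two formulas above (the data, learning rate, and iteration count are arbitrary choices for illustration):

import numpy as np

np.random.seed(0)
m = 100
X = np.c_[np.ones(m), np.random.rand(m, 1)]          # column of ones + one feature
y = 4 + 3 * X[:, 1] + 0.1 * np.random.randn(m)       # y = 4 + 3x + noise

eta = 0.1            # learning rate
n_iterations = 5000
theta = np.random.randn(2)

for _ in range(n_iterations):
    gradients = (2 / m) * X.T @ (X @ theta - y)      # gradient of the MSE
    theta = theta - eta * gradients                  # step in the downhill direction

print(theta)   # should end up close to [4, 3]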

Now, repeating the matrix form of the training dataset from an earlier part of this article:

The general form of multiple linear regression (MLR) model is

$Y_i = x_i^T \beta + e_i$

for i = 1, . . . , n. Here n is the sample size and the random variable ei is the
ith error. In matrix notation, these n sets of equations become

$Y = X\beta + e$

Here X is a matrix, and Y, β, and e are vectors.

The Y vector is the response variable and is an n × 1 vector of dependent variables, X is the matrix of the k independent/explanatory variables (usually the first column is a column of ones for the constant term) and is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1 vector of unknown errors. Equivalently,

[Image: the vectors Y, β, e and the matrix X written out in full]

Where
Y: the output vector for the n training examples.
X: the matrix of size n × p, where the ith row is the ith training example.
β: the weight vector of size p, one weight per feature.

Note that β in the above is not a scalar, but a vector.

Now we have the RSS defined as

$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$

Alternative-1 — Vectorized Calculation of Gradient of our Matrix form training data

Note that the above is directly derived using the identity that, for a vector z, we have

$z^T z = \sum_i z_i^2$

Now let,

[Image: a substitution used in the derivation]

Then for the following assumption

[Image: the assumption and the intermediate expressions]

Therefore,

[Image: the resulting expression]

We have, for each k = 1, …, p,

$\frac{\partial \, \mathrm{RSS}}{\partial \beta_k} = -2 \sum_{i=1}^{n} x_{ik} \left( y_i - x_i^T \beta \right)$

Then for the whole matrix (i.e. the whole set of training data set or the whole set of Hypothesis Equation ), we will get

[Image: the p partial derivatives stacked into a single vector]

Which in the final Vector form

$\frac{\partial \, \mathrm{RSS}}{\partial \beta} = -2 \, X^T (y - X\beta)$
Vectorized Gradient

Note that in the last equality I had to take the transpose of X, because matrix multiplication is a dot product of the rows of the first matrix with the columns of the second matrix, and the number of columns of the first matrix must equal the number of rows of the second matrix.

So, by transposing, the p-th column of X ends up being the p-th row of Xᵀ. Thus, when taking the dot product of the p-th row of Xᵀ with (y − Xβ), the shapes match perfectly, since all the entries of the p-th column of X are used.
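We can sanity-check this vectorized gradient numerically in NumPy (random made-up data; the factor of −2 comes from the RSS definition above):

import numpy as np

np.random.seed(1)
X = np.random.rand(6, 3)
y = np.random.rand(6)
beta = np.random.rand(3)

def rss(b):
    r = y - X @ b
    return r @ r                         # (y - Xb)^T (y - Xb)

grad_vectorized = -2 * X.T @ (y - X @ beta)

eps = 1e-6
grad_numeric = np.array([(rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
                         for e in np.eye(3)])

print(np.allclose(grad_vectorized, grad_numeric))   # True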

Alternative-2 — Below is an alternative calculation for arriving at the same vectorized gradient formula for training data in matrix form.

Here, I am denoting the coefficients with θ or Theta (instead of the β used above in the Alternative-1 gradient calculation, only to keep the two presentations distinguishable).

Again assume we have our Training Set of data as below

[Image: the training set written as the matrix X, one training example per row]

Also, let y be the m-dimensional vector containing all the target values from the training set:

[Image: the m-dimensional target vector y]

And we have the Predicted Value or the Hypothesized value as below

[Image: the vector of predictions $X\theta$, whose ith entry is $\theta^T x^{(i)}$]

So we can say that $X\theta - y$ is the vector of the differences between the predicted and the actual values.

And now again, we need to use the same vector identity mentioned above, that for a vector z, we have

$z^T z = \sum_i z_i^2$

Using the above we have the below relation for the Cost function

$J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

We have already introduced the trace operator of a matrix, written “tr”. Now we need a couple more matrix derivative identities (I am just stating them below; they all have robust mathematical proofs, the details of which I am not including here).

So the two matrix derivative identities below hold true, and we need to use them to arrive at the gradient calculation.

[Image: matrix derivative identities involving the trace]

Combining the above two Equations or Identities we derive

[Image: the identity obtained by combining the two above]

So now Final Gradient Calculation will be as below

[Image: the step-by-step gradient calculation, ending with $\nabla_\theta J(\theta) = X^T X \theta - X^T y$]

In the third step above, we used the fact that the trace of a real number is just the real number; the fourth step used the fact that

$\operatorname{tr} A = \operatorname{tr} A^T$

And the fifth step used below equation that we already mentioned

[Image: the trace-derivative identity mentioned above]

And also the below Matrix Identity

[Image: the additional matrix identity used]

With

[Image: the substitutions used when applying the identity]

Take a note of the final result of the Gradient, which is the same form that we arrived at earlier under the Alternative-1 calculation of Gradient

$\nabla_\theta J(\theta) = X^T X \theta - X^T y = X^T (X\theta - y)$
Vectorized Gradient of the Cost function of Matrix form training data-set

And the above is the exact formula that we will implement in Python/NumPy very soon below.

So now let's go back to the original Cost Function

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Numerical Cost Function for Linear Regression

Which in Vectorized Form for the Mean Squared Error is defined as below

$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$
Vectorized MSE

And after calculating the gradient of this MSE in vectorized form, which we did above, the gradient-descent algorithm will update the weights (the θ values) as below

$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$
Vectorized form of Gradient-Descent for updating θ values

Compare the above with the gradient-descent formula for the element-wise (numerical) case

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
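The two forms give exactly the same update; here is a quick NumPy check with made-up values:

import numpy as np

np.random.seed(2)
X = np.c_[np.ones(5), np.random.rand(5, 2)]
y = np.random.rand(5)
theta = np.random.rand(3)
alpha, m = 0.1, len(y)

errors = X @ theta - y

# element-wise update: theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij
theta_loop = theta.copy()
for j in range(len(theta)):
    theta_loop[j] -= alpha / m * np.sum(errors * X[:, j])

# vectorized update: theta := theta - (alpha/m) * X^T (X theta - y)
theta_vec = theta - alpha / m * X.T @ errors

print(np.allclose(theta_loop, theta_vec))   # True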

A manual example of the Gradient-Descent implementation

Let's say for a simple single-variable training dataset we have the following values

x,y
1,1
2,2
3,3
4,4

Further, assume,

α (learning rate) = 1

m (number of training examples) =4

Setting θ0 to 0 and θ1 to 1

So we have the linear equation

$h_\theta(x) = \theta_0 + \theta_1 x = x$

Now by convention, $x^{(i)}$ and $y^{(i)}$ denote the values of x and y at the i-th row.

Further I denote,

[Image: the notation for the θ values after k repetitions]

(i.e. after k repetitions of the GD algorithm).

So after a single update, with GD algorithm, i.e. applying the below

[Image: applying the update once; since $h_\theta(x^{(i)}) = y^{(i)}$ for every row, both partial derivatives are zero and θ0 and θ1 remain unchanged]

So, regardless of how many times I apply the GD algorithm, the value of θ1 will be constantly equal to 1, since at every iteration we have θ0=0 and θ1=1
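The same conclusion can be checked in a few lines of NumPy, using the dataset and settings above:

import numpy as np

X = np.c_[np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])]   # column of ones + x
y = np.array([1.0, 2.0, 3.0, 4.0])
theta = np.array([0.0, 1.0])                             # theta0 = 0, theta1 = 1
alpha, m = 1.0, 4

for _ in range(3):                                       # a few GD iterations
    gradient = X.T @ (X @ theta - y) / m                 # zero, since X @ theta == y exactly
    theta = theta - alpha * gradient

print(theta)   # stays [0. 1.]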

Assume theta values have been picked at random as below

[Image: the randomly picked initial θ values]

And the training-dataset is

[Image: the training dataset]

So here, first, to calculate the hypothesis equation, I need to transpose θ so that it can multiply the column vector x.

[Image: the transposed vector θᵀ]

And the GD-Algorithm is,

[Image: the vectorized GD update applied to these θ values (Vectorized GD)]

And for applying the GD algorithm again, I need to evaluate

[Image: the quantity to evaluate for the next GD iteration]

Python/Numpy Implementation

Please refer to the jupyter notebook

First, generate a training dataset in Matrix form

With the NumPy zeros() function you can create an array that contains only zeros, with a specific shape. The shape is given in (rows, columns) format. Its syntax is as below

numpy.zeros(shape, dtype=float)

For example, the code to generate a Matrix of 2 by 3 (2 rows and 3 columns)

import numpy as np

A = np.zeros(shape=(2, 3))
print(A)
# Output below
# [[0. 0. 0.]
#  [0. 0. 0.]]

This produces a 2 × 3 array of zeros, as shown in the output above.
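The notebook defines a gen_data() helper to build the training matrix; it is not reproduced in the article, but a minimal sketch that produces output of the shape shown below might look like this (the exact noise model is an assumption):

import numpy as np

def gen_data(num_points, bias, variance):
    # hypothetical sketch -- the notebook's implementation may differ
    x = np.zeros(shape=(num_points, 2))          # design matrix with a bias column
    y = np.zeros(shape=num_points)
    for i in range(num_points):
        x[i][0] = 1                              # intercept / bias column
        x[i][1] = i                              # the single feature
        y[i] = (i + bias) + np.random.uniform(0, 1) * variance
    return x, y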

If I run the gen_data() function above for a set of 5 training examples, with bias and variance of 20 and 10 respectively,

gen_data(5, 20, 10)

I will have the following form of output

(array([[1., 0.],
[1., 1.],
[1., 2.],
[1., 3.],
[1., 4.]]),
array([22.38023816, 24.18406356, 28.01360908, 26.80051617, 29.30101971])
)

And now the function for gradient descent, implementing the gradient formula for a matrix that we derived above

[Image: the notebook's vectorized gradient-descent function (Vectorized GD)]
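The code itself lives in the notebook; a minimal sketch of such a vectorized gradient-descent function (the function name, signature, and the 1/m scaling are assumptions) could be:

import numpy as np

def gradient_descent(x, y, theta, alpha, m, num_iterations):
    # hypothetical sketch -- the notebook's implementation may differ
    x_trans = x.transpose()
    for i in range(num_iterations):
        hypothesis = np.dot(x, theta)            # X @ theta
        loss = hypothesis - y                    # X @ theta - y
        # the cost could be logged here to monitor convergence:
        # cost = np.sum(loss ** 2) / (2 * m)
        gradient = np.dot(x_trans, loss) / m     # (1/m) * X^T (X theta - y)
        theta = theta - alpha * gradient         # the update step
    return theta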

And now, finally, invoke the above two functions to create some linear data, run gradient descent on it, and plot the result.
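A sketch of how the two functions above might be invoked and plotted (the parameter values are assumptions; the notebook may use different ones):

import numpy as np
import matplotlib.pyplot as plt

x, y = gen_data(100, 25, 10)                  # 100 points, bias 25, variance 10
m, n = x.shape
theta = np.ones(n)

theta = gradient_descent(x, y, theta, alpha=0.0005, m=m, num_iterations=100000)
print(theta)

plt.scatter(x[:, 1], y)                       # the generated data points
plt.plot(x[:, 1], x @ theta, color='r')       # the fitted regression line
plt.show()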

And the resultant graph of the Linear line will be this

[Image: the plot of the generated data points with the fitted regression line]

The full notebook is here

Matrix Multiplication - https://en.wikipedia.org/wiki/Matrix_multiplication

Matrix Calculus - https://en.wikipedia.org/wiki/Matrix_calculus

Vector Field - https://en.wikipedia.org/wiki/Vector_field

Matrix Transpose Properties - https://en.wikipedia.org/wiki/Transpose#Properties

Matrix Cookbook - https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Online Calculation of Matrix Derivatives - http://www.matrixcalculus.org/
