Understanding the Gradient in Higher Dimensions

Sagnik Taraphdar
Dec 21, 2023


In this article, we will explore the mathematical concept of the gradient, which is a fundamental part of multivariate calculus and appears in almost every advanced STEM field, from electrodynamics to machine learning. My hope is both to build intuition for students new to multivariate calculus and to provide enough rigour for clarity.

Some prerequisites:

  • Univariate calculus
  • Linear algebra
  • Basics of multivariate calculus (just enough to know how to pronounce nabla)

What is a gradient?

A gradient is just a fancier way to differentiate multivariable functions, i.e., functions with multiple input variables. In fact, when we say “differentiating a multivariable function”, we don’t mean taking a single partial derivative but rather computing its gradient.

In particular, we will be looking at gradients of scalar functions, that is, functions that can have multiple input values but only one real output value. More formally:
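
f : V → R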

where V is an arbitrary vector space.

But what is it, exactly?

A gradient is essentially a way of packing together all the partial derivatives of a function in a single structure. It's like finding how much each input variable contributes to the change in the output of the function.

Consider the function:
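
f : Rⁿ → R,  f(x) = f(x₁, x₂, …, xₙ)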

Its gradient is given as:
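
∇f(x) = ( ∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ )ᵀ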

So, to calculate the gradient, we have partially differentiated f(x) with respect to each element xᵢ, and the resulting vector contains all of these partial derivatives. This allows us to encapsulate all of the function’s partial derivatives in a single mathematical structure.

Gradients are not just for vectors!

The concept of the gradient is not limited to functions whose domain is Rⁿ, i.e., the ordinary multivariable functions we are all used to.

For instance, the function
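
f : M(R) → R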

where M(R) is the set of all real m × n matrices

will have the following gradient:
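
∇f(A) = [ ∂f/∂x_ij ],  i = 1, …, m,  j = 1, …, n

that is, an m × n matrix whose (i, j)-th entry is ∂f/∂x_ij.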

Here, we have partially differentiated f(A) with respect to each element in A: x_11, x_12, …, x_mn.

Do note that even though the function takes a matrix as an input, it still returns a real number and is, in fact, a scalar function. This allows us to differentiate it seamlessly.

There’s this thing about vectors…

Vectors can actually be represented in two ways:

  • The first is the more conventional way: as a row or column of components.
  • However, we can also represent a vector as a weighted sum of the standard basis vectors e₁, e₂, …, eₙ of Rⁿ, as shown below.
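
For a vector x in Rⁿ, the two representations look like this:

x = ( x₁, x₂, …, xₙ )ᵀ

x = x₁e₁ + x₂e₂ + … + xₙeₙ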

So, we can rewrite our first formula as:
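
∇f(x) = (∂f/∂x₁) e₁ + (∂f/∂x₂) e₂ + … + (∂f/∂xₙ) eₙ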

Let’s try it in 3 dimensions:

Armed with our knowledge so far, let’s try to apply the gradient to a three-dimensional function f(x, y, z):
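
∇f(x, y, z) = (∂f/∂x) e₁ + (∂f/∂y) e₂ + (∂f/∂z) e₃ = ( ∂f/∂x, ∂f/∂y, ∂f/∂z )ᵀ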

An example:
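
For instance, take f(x, y, z) = x²y + sin(z). Then:

∇f(x, y, z) = ( ∂f/∂x, ∂f/∂y, ∂f/∂z )ᵀ = ( 2xy, x², cos(z) )ᵀ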

Geometrical meaning behind the gradient:

You may or may not have heard the phrase “the gradient points in the direction of the steepest slope.” This statement holds a twofold meaning:

  1. The direction of the gradient (when expressed in terms of the basis vectors) indicates the direction of the steepest slope of the function, that is, the direction in which the function’s output changes most rapidly with respect to its input(s); a short justification follows this list.
  2. The magnitude of the gradient, calculated using the L2 norm (||.||), provides the overall magnitude of this slope in Euclidean space.
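
A quick way to see why: for any unit vector u, the rate of change of f in the direction of u (the directional derivative) is

D_u f(x) = ∇f(x) · u = ||∇f(x)|| cos θ

where θ is the angle between ∇f(x) and u. This quantity is largest when θ = 0, that is, when u points along the gradient, and its maximum value is exactly ||∇f(x)||.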

Understanding this geometric interpretation is fundamental in grasping how the gradient guides us through the terrain of a function, pointing toward the direction of the most significant change and specifying the magnitude of that change.

[Figure: a visualization of the gradient in 3D space]

Comparing the gradient with the derivative:

By now, most of you will have noticed that the gradient of a multivariable function is very much analogous to the derivative of a univariate function.

Some standard gradients of common functions are as follows:
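
A few such identities, taking a to be a constant vector in Rⁿ, A a constant n × n matrix and x ∈ Rⁿ:

  • ∇(aᵀx) = a (compare d/dx (ax) = a)
  • ∇(xᵀx) = 2x (compare d/dx (x²) = 2x)
  • ∇(xᵀAx) = (A + Aᵀ)x
  • ∇||x|| = x / ||x||, for x ≠ 0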

If you look closely, most of these are pretty similar to their counterparts for the regular derivative d/dx. Even rules such as the product and quotient rules apply to the gradient in the same way they apply to the derivative.

This analogy arises because the gradient is nothing more than a collection of partial derivatives, and partial derivatives are subject to the same rules as regular derivatives.

Chain rule for differentiation:

Most of us remember dreading the chain rule in high school without actually remembering the rule itself. Well, allow me to refresh your memory:
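
d/dx [ f(g(x)) ] = f′(g(x)) · g′(x)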

However, the chain rule for the gradient is slightly more involved, and to state it we must first look at the next concept.

Generalizing the gradient:

Up until now, we were only looking at gradients with respect to the input of the function itself; that is, all the partial derivatives were taken with respect to elements of the function’s input, such as the elements xᵢ of the vector x in f(x). But this need not be the case: we can take the gradient of f(x) with respect to any arbitrary vector v.

More explicitly:
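
∇_v f(x) = ( ∂f/∂v₁, ∂f/∂v₂, …, ∂f/∂vₙ )ᵀ

where v = ( v₁, v₂, …, vₙ )ᵀ.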

Note that we have partially differentiated f(x) with respect to each element of the vector v, and not with respect to each element of the vector x as we were doing previously. In fact, our previous formula is just a special case of the generalized gradient:
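
∇_x f(x) = ( ∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ )ᵀ

that is, the case where v is the input vector x itself.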

This type of gradient, where the partial derivatives are taken with respect to the inputs of the function itself, is called the standard gradient and can be denoted simply as ∇f, without the subscript.

Similarly, for a matrix-input function:
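
∇_B f(A) = [ ∂f/∂v_ij ]

a matrix of partial derivatives with the same shape as B = [v_ij].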

Here, v_ij is the (i, j)-th element of B, and we are partially differentiating f(A) with respect to every element of B.

Chain rule for the gradient:

The chain rule for gradients is given as follows.
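
In the simplest case, where the outer function takes a single real input and the inner function g : Rⁿ → R is itself scalar-valued, it reads:

∇_v f(g(x)) = f′(g(x)) · ∇_v g(x)

In other words, we differentiate the outer function at the value of the inner function and multiply by the gradient of the inner function, just as in the univariate rule above.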

The Ultimate Gradient!

So far, we have seen both the standard and generalized gradients, but what if we were to take it one step further? How about taking the gradient of a scalar function with matrix inputs with respect to a vector?

Intuitively, this seems completely nonsensical. How does one differentiate a function that takes matrices as inputs with respect to a vector?

Surprisingly enough, it turns out that this operation is indeed well defined. The reason is that even though the function takes a matrix as an input, its output is still a real scalar value, which is exactly what a gradient requires. The result of such a gradient follows our previously stated rules, and its shape is the same as that of the object with respect to which we are taking the gradient, in this case a vector.

Therefore, we have:
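
For a vector v = ( v₁, v₂, …, vₙ )ᵀ:

∇_v f(A) = ( ∂f/∂v₁, ∂f/∂v₂, …, ∂f/∂vₙ )ᵀ

which, as stated above, has the same shape as v even though the input of f is a matrix.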
