Calculus
Yeah you gotta do calculus like that to solve problems.
Just kidding!! Okay, so in my previous posts I shared with you all the concepts of Vector Algebra and Probability and how we apply them in Machine Learning and Deep Learning. In this post I will talk about the major role Calculus plays in Machine Learning and Deep Learning. So let's dive into it.
1. Differentiation
- The differentiation of a function means the rate of change of the quantity represented by the function with respect to another quantity on which the function depends.
- The derivative of a function f(t) is written d(f(t))/dt and is defined as the limit
d(f(t))/dt = lim (h → 0) [f(t + h) − f(t)] / h
- When we deal with a function that depends on multiple variables, the derivative of the function with respect to each variable, keeping the others fixed, is called a partial derivative, and the vector of the partial derivatives is called the gradient of the function.
To understand this, let us take a short example. Consider that the housing price (z) depends on the number of rooms (x) and the area of the house (y). Then we can express z as a function of x and y: z = f(x, y).
The partial derivative of z with respect to x is written ∂z/∂x, and that with respect to y is written ∂z/∂y.
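To make this concrete, here is a minimal sketch of partial derivatives approximated by central differences. The price function below is a made-up stand-in (the coefficients 50 and 2 are arbitrary), not a real pricing model.

```python
# Hypothetical housing-price function: x = number of rooms, y = area.
# The coefficients are arbitrary, chosen only for illustration.
def f(x, y):
    return 50.0 * x + 2.0 * y

def partial_x(f, x, y, h=1e-6):
    """Approximate ∂f/∂x by a central difference, holding y fixed."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    """Approximate ∂f/∂y by a central difference, holding x fixed."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

print(partial_x(f, 3, 100))  # ≈ 50.0: price change per extra room
print(partial_y(f, 3, 100))  # ≈ 2.0: price change per extra unit of area
```

For this linear toy function the partials are just the coefficients, but the same two helpers work for any smooth f(x, y).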
2. Gradient of a Function
- For a function with two variables z=f(x,y), the vector of the partial derivatives is called the gradient of the function and is denoted by ∇ z.
- A generalised form: for a function f(x₁, x₂, …, xₙ), the variables can be collected into a vector x = [x₁, x₂, …, xₙ]ᵗ ∈ ℝⁿ. The gradient vector of the function is then ∇f = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]ᵗ.
For example, if f(x, y, z) = x + y³ + z², then ∇f = [1, 3y², 2z]ᵗ.
At maxima or minima of a function, the gradient vector is zero.
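The example above can be checked numerically. Here is a sketch that approximates the gradient of f(x, y, z) = x + y³ + z² by central differences; it should agree with the analytic gradient [1, 3y², 2z]ᵗ.

```python
def f(v):
    x, y, z = v
    return x + y**3 + z**2

def gradient(f, v, h=1e-6):
    """Central-difference approximation of the gradient vector of f at v."""
    g = []
    for i in range(len(v)):
        vp = list(v); vp[i] += h
        vm = list(v); vm[i] -= h
        g.append((f(vp) - f(vm)) / (2 * h))
    return g

# At (x, y, z) = (1, 2, 3) the analytic gradient is [1, 3*2**2, 2*3] = [1, 12, 6].
print(gradient(f, [1.0, 2.0, 3.0]))
```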
3. Successive Partial Derivatives
- We can take successive partial derivatives of a function with respect to multiple variables. For example, if z = f(x, y), then ∂²z/∂y∂x is the partial derivative of z with respect to x first and then with respect to y. Similarly, ∂²z/∂x∂y is the successive partial derivative of z with respect to y first and then with respect to x.
If the second derivatives are continuous, the order of differentiation doesn't matter, i.e. ∂²z/∂y∂x = ∂²z/∂x∂y.
4. Hessian Matrix of a Function
To be honest, this plays a vital role when working with multivariate cost functions.
- The Hessian of a multi-variate function is a matrix of second-order partial derivatives.
- For a function f(x, y, z), the Hessian is the 3 × 3 matrix of second-order partial derivatives:
H = [ ∂²f/∂x²   ∂²f/∂x∂y  ∂²f/∂x∂z
      ∂²f/∂y∂x  ∂²f/∂y²   ∂²f/∂y∂z
      ∂²f/∂z∂x  ∂²f/∂z∂y  ∂²f/∂z² ]
- The Hessian is useful in the optimisation problems that we come across so frequently in the machine-learning domain.
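As a sketch of what the Hessian contains, the matrix of second partials can also be approximated with nested central differences. The quadratic f below is my own example, chosen so the exact Hessian is [[2, 1], [1, 6]].

```python
def f(v):
    x, y = v
    return x**2 + x * y + 3 * y**2

def hessian(f, v, h=1e-4):
    """Approximate the matrix of second-order partial derivatives of f at v."""
    n = len(v)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vpp = list(v); vpp[i] += h; vpp[j] += h
            vpm = list(v); vpm[i] += h; vpm[j] -= h
            vmp = list(v); vmp[i] -= h; vmp[j] += h
            vmm = list(v); vmm[i] -= h; vmm[j] -= h
            H[i][j] = (f(vpp) - f(vpm) - f(vmp) + f(vmm)) / (4 * h * h)
    return H

print(hessian(f, [1.0, 1.0]))  # ≈ [[2, 1], [1, 6]]
```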
5. Maxima and Minima of a Function
Rules for maxima and minima for a univariate function:
- The derivative of the function with respect to x is zero at a maximum or minimum.
- The second derivative of f(x), which is nothing but the derivative of the first derivative and is written d²(f(x))/dx², needs to be investigated at the point where the first derivative is zero.
→ If d²(f(x))/dx² < 0, we have a maximum.
→ If d²(f(x))/dx² > 0, we have a minimum.
→ If d²(f(x))/dx² = 0, the test is inconclusive; the point may be a point of inflection, which needs further investigation.
- At a point of inflection, the sign of the curvature changes.
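Here's a tiny worked example of these rules (my own choice of function, not from the discussion above): f(x) = x³ − 3x has f′(x) = 3x² − 3, so the stationary points are x = ±1, and the sign of f″(x) = 6x decides what each one is.

```python
def f_second(x):
    """f''(x) = 6x for f(x) = x**3 - 3x."""
    return 6 * x

def classify_1d(x0):
    s = f_second(x0)
    if s < 0:
        return "maximum"
    if s > 0:
        return "minimum"
    return "inconclusive"  # needs further investigation

print(classify_1d(-1.0))  # maximum: f''(-1) = -6 < 0
print(classify_1d(1.0))   # minimum: f''(1) = 6 > 0
```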
The points at which the derivative of a univariate function (or the gradient of a multivariate function) is zero are called stationary points.
Rules for maxima and minima for a multivariate function:
Let us consider the function f(x, y) = x²y³ + 3y + x + 5. To determine the stationary points, the gradient has to be zero, i.e., setting ∂f/∂x and ∂f/∂y to zero we get:
∂f/∂x = 2xy³ + 1 = 0 and ∂f/∂y = 3x²y² + 3 = 0
(As an aside, this particular pair of equations has no real solution, since 3x²y² + 3 ≥ 3 > 0 everywhere, so this function has no stationary points; it's the procedure itself that matters here.)
We need to compute the Hessian matrix as well, and for that we need the second-order partial derivatives:
→ ∂²f/∂x² = 2y³
→ ∂²f/∂x∂y = 6xy²
→ ∂²f/∂y² = 6x²y
→ ∂²f/∂y∂x = 6xy²
We can see from the successive differentiation above that this function has continuous second derivatives, i.e. ∂²f/∂x∂y = ∂²f/∂y∂x.
Let’s say that the gradient is zero at (x = a, y = b):
- If (∂²f/∂x² · ∂²f/∂y²) − (∂²f/∂x∂y)² < 0 at (x = a, y = b), then (a, b) is a saddle point.
- If (∂²f/∂x² · ∂²f/∂y²) − (∂²f/∂x∂y)² > 0 at (x = a, y = b), then (a, b) is an extremum, i.e. a minimum or maximum exists:
  * If ∂²f/∂x² < 0 and ∂²f/∂y² < 0 at (x = a, y = b), then f(x, y) has a maximum at (a, b).
  * If ∂²f/∂x² > 0 and ∂²f/∂y² > 0 at (x = a, y = b), then f(x, y) has a minimum at (a, b).
- If (∂²f/∂x² · ∂²f/∂y²) − (∂²f/∂x∂y)² = 0 at (x = a, y = b), then more advanced methods are required to classify the stationary point correctly.
For a function of n variables, the following are the guidelines for checking for the maxima, minima, and saddle points of a function:
→ Computing the gradient and setting it to the zero vector gives us the list of stationary points.
→ For a stationary point x₀ ∈ ℝⁿ, if the Hessian matrix of the function at x₀ has both positive and negative eigenvalues, then x₀ is a saddle point. If all the eigenvalues of the Hessian matrix are positive, the stationary point is a local minimum, whereas if all the eigenvalues are negative, the stationary point is a local maximum.
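For the 2 × 2 case, the eigenvalue rule above can be sketched using the closed-form eigenvalues of a symmetric matrix [[a, b], [b, c]], so no linear-algebra library is needed:

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b * b)
    return mean - radius, mean + radius

def classify_by_eigenvalues(a, b, c):
    lo, hi = eigenvalues_2x2(a, b, c)
    if lo > 0:
        return "local minimum"   # all eigenvalues positive
    if hi < 0:
        return "local maximum"   # all eigenvalues negative
    if lo < 0 < hi:
        return "saddle point"    # mixed signs
    return "inconclusive"        # a zero eigenvalue

print(classify_by_eigenvalues(2, 0, 6))   # local minimum
print(classify_by_eigenvalues(1, 0, -1))  # saddle point
```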
6. Local Minima and Global Minima
- A function can have multiple minima at which the gradient is zero; each such point is called a local minimum.
- The local minimum at which the function takes its smallest value is called the global minimum. The same applies to maxima.
- The maxima and minima of a function are found by optimisation methods, most often iterative approaches such as gradient descent, gradient ascent, and so forth.
In the iterative way of finding minima and maxima, the optimisation method may get stuck in a local minimum or maximum and be unable to reach the global one. In iterative methods, the algorithm uses the gradient of the function at a point to move to a more optimal point. When traversing a series of points in this fashion, once a point with zero gradient is encountered, the algorithm stops, assuming the desired minimum or maximum has been reached. This works well when the function has a single global minimum or maximum. The optimisation can also get stuck at a saddle point. In all such cases, we end up with a sub-optimal model.
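A toy illustration of getting stuck (my own example function, not one from the post): f(x) = x⁴ + x³ − 2x² has a shallow local minimum near x ≈ 0.69 and a deeper global minimum near x ≈ −1.44, and plain gradient descent lands in one or the other depending only on where it starts.

```python
def grad(x):
    """f'(x) = 4x**3 + 3x**2 - 4x for f(x) = x**4 + x**3 - 2*x**2."""
    return 4 * x**3 + 3 * x**2 - 4 * x

def descend(x, lr=0.01, steps=5000):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(1.0))   # ends near 0.69: the shallow LOCAL minimum
print(descend(-2.0))  # ends near -1.44: the GLOBAL minimum
```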
7. Positive Semi-Definite and Positive Definite Matrices
- A square n × n matrix A is positive semi-definite if for any non-zero vector x ∈ ℝⁿ the expression xᵗAx ≥ 0. The matrix A is positive definite if xᵗAx > 0 for every non-zero x.
- All the eigenvalues of a positive semi-definite matrix are non-negative, i.e. ≥ 0, whereas for a positive definite matrix all the eigenvalues are positive, i.e. > 0.
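For a symmetric 2 × 2 matrix [[a, b], [b, c]] these conditions can be checked without computing eigenvalues at all, using Sylvester's criterion (a technique not mentioned in the post: definiteness holds iff all leading principal minors are positive; for the 2 × 2 semi-definite case, a, c, and the determinant must all be non-negative). A sketch:

```python
def is_positive_definite(a, b, c):
    """Sylvester's criterion for the symmetric matrix [[a, b], [b, c]]."""
    return a > 0 and a * c - b * b > 0

def is_positive_semidefinite(a, b, c):
    """2x2 semi-definite test: both diagonal entries and the determinant >= 0."""
    return a >= 0 and c >= 0 and a * c - b * b >= 0

print(is_positive_definite(2, 1, 2))      # True: eigenvalues are 1 and 3
print(is_positive_semidefinite(1, 1, 1))  # True: eigenvalues are 0 and 2
print(is_positive_definite(1, 2, 1))      # False: determinant is -3
```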
8. Convex Set
- A set of points is called convex if, given any two points x and y in the set, all the points on the straight line segment joining x and y also belong to the set.
9. Convex and Non-Convex Functions
Convex Function
- A function f(x) defined on a convex domain D, where x ∈ ℝⁿ, is said to be convex if the straight line segment joining any two points on its graph lies above or on the graph of the function.
Mathematically, this can be expressed as :
f(tx +(1-t)y) ≤ tf(x) +(1-t)f(y) ∀ x,y ∈ D, ∀ t ∈ [0,1]
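The inequality can be spot-checked numerically by sampling t along a segment. This only tests one segment, so it is evidence rather than a proof, but it makes the definition tangible:

```python
def convex_on_segment(f, x, y, samples=101):
    """Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) at sampled t in [0, 1]."""
    for i in range(samples):
        t = i / (samples - 1)
        lhs = f(t * x + (1 - t) * y)
        rhs = t * f(x) + (1 - t) * f(y)
        if lhs > rhs + 1e-12:  # small tolerance for floating-point noise
            return False
    return True

print(convex_on_segment(lambda x: x * x, -3.0, 5.0))   # True: x**2 is convex
print(convex_on_segment(lambda x: -x * x, -3.0, 5.0))  # False: -x**2 is concave
```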
- For a convex function that is twice continuously differentiable, the Hessian matrix H of the function evaluated at each point in the domain D should be positive semi-definite, i.e. for any x ∈ ℝⁿ, xᵗHx ≥ 0.
A convex function has its local minima as its global minima. Thus there can be more than one global minimum, but the value of the function at each global minimum is the same.
Non-Convex Function
A non-convex function can have many local minima, not all of which are global minima.
- In any machine-learning model-building process where we learn the model parameters by minimising a cost function, we prefer the cost function to be convex, since with proper optimisation we would attain the global minimum for sure.
For a non-convex cost function, there is a high chance that the optimisation technique gets stuck at a local minimum or a saddle point, and hence it might not attain its global minimum.
10. Multivariate Convex and Non-Convex Functions
Multivariate Convex Function
Let us consider a function:
f(x, y) = 2x² + 3y² − 5
→ ∂f/∂x = 4x, ∂f/∂y = 6y
→ ∂²f/∂x² = 4
→ ∂²f/∂x∂y = 0
→ ∂²f/∂y² = 6
→ ∂²f/∂y∂x = 0
Setting the gradient to the zero vector to obtain the stationary points, ∂f/∂x = 0 and ∂f/∂y = 0 give a single stationary point at (x = 0, y = 0).
Checking the condition, (∂²f/∂x² · ∂²f/∂y²) − (∂²f/∂x∂y)² = 24 − 0 = 24 > 0, and ∂²f/∂x² > 0, ∂²f/∂y² > 0.
∴ The function has a minimum at (x = 0, y = 0), and the minimum value of the function there is −5.
Multivariate Non-Convex Function
Let us take a function given by:
f(x, y) = log(x/y) = log x − log y
→ ∂²f/∂x² = −1/x²
→ ∂²f/∂x∂y = 0
→ ∂²f/∂y² = 1/y²
→ ∂²f/∂y∂x = 0
The preceding is a non-convex function, because the Hessian matrix is the diagonal matrix
H = [ −1/x²  0
      0      1/y² ]
whose eigenvalues are −1/x² and 1/y². One eigenvalue is negative and the other is positive: −1/x² is negative for all values of x, while 1/y² is always positive.
∴ The Hessian is not positive semi-definite, making the function non-convex.
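Because the Hessian here is diagonal, its eigenvalues are simply the diagonal entries, so the indefiniteness can be verified at a few sample points:

```python
def hessian_eigenvalues(x, y):
    """Diagonal Hessian of f(x, y) = log(x) - log(y): entries are its eigenvalues."""
    return -1.0 / x**2, 1.0 / y**2

# At every point in the domain, one eigenvalue is negative and one positive.
for x, y in [(1.0, 1.0), (2.0, 0.5), (10.0, 3.0)]:
    lam1, lam2 = hessian_eigenvalues(x, y)
    print(lam1 < 0 < lam2)  # True each time: the Hessian is indefinite
```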
11. Taylor Series
The Taylor Series comes into play in the machine learning domain when we deal with multivariate cost functions; in that case we use the Taylor expansion of the cost function to find the optimal parameter vector.
- The Taylor Series expansion of a univariate function around a point x can be expressed as follows:
f(x + h) = f(x) + h f′(x) + h²/2! f″(x) + … + hⁿ/n! f⁽ⁿ⁾(x) + …
The term h has the same dimension as x, and both h and x are scalars.
- If f(x) is a constant function then all the derivatives are zero and hence f(x+h) = f(x).
- If the function is linear around the neighbourhood of x , then for any point (x+h) that lies in the region of linearity , f(x+h) = f(x) + hf’(x).
- If the function is quadratic around the neighbourhood of x, then for any point (x + h) that lies in that neighbourhood,
f(x + h) = f(x) + h f′(x) + h²/2! f″(x).
- The Taylor Series expansion becomes very important in iterative methods such as gradient descent and Newton's method for optimisation, as well as in numerical methods for integration and differentiation.
- The Taylor Series for a multivariate function around a point x ∈ ℝⁿ can be expressed as:
f(x+Δx) = f(x) + Δxᵗ ∇f(x) + 1/2 Δxᵗ ∇²f(x)Δx + higher order terms
where ∇f(x) is the gradient vector and ∇²f(x) is the Hessian matrix for the function f(x).
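As a small numeric check (using eˣ around x = 0 as my own example), the second-order Taylor expansion f(h) ≈ 1 + h + h²/2 tracks the true function, and the error shrinks roughly like h³ as h gets smaller:

```python
import math

def taylor2(h):
    """Second-order Taylor expansion of e**h around 0: 1 + h + h**2/2."""
    return 1.0 + h + h * h / 2.0

for h in (0.5, 0.1, 0.01):
    error = math.exp(h) - taylor2(h)
    print(h, error)  # error drops by roughly 1000x for each 10x smaller h
```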
Okay, that's a lot of concepts to grasp in one go, but on these very concepts lies the foundation of your beloved domain, Machine Learning/Deep Learning. So my suggestion is to take your time and go through it once more if required.
Thanks for reading.