Mathematics for Machine Learning: Multivariate Calculus

28 min readSep 5, 2018

It has been quite a while since I posted. I have decided to finished the last two parts of the Coursera Specialisation in Mathematics for Machine Learning prior to doing any posting on Medium. Coursera courses cost a monthly subscription fee so finishing it quicker means you pay less. Another reason why this post was delayed was that I took some time off to train for a sports competition.

All that aside, as I am typing this post right now, I have completed the Specialisation so that is one milestone down. So here is all I have learnt from my studies.

Multivariate Calculus

What is a function?

A function is a relationship between input and output. Where f(x) is a function of the variable x. This relationship can be seen as the rise over run of how one variable change in relation to another.

A good example we can look at is the speed over time and acceleration over time graphs superimposed into one graph.

The acceleration is the local gradient at that certain point in time in the speed graph. The acceleration graph is also known as the derivative of the speed function graph. Further derivation, from the acceleration graph, gives us the jerk function. This is the first derivative of the acceleration function and also the second derivative of the speed function.

Going the other way, we can take the total area between the function and the x-axis to find the anti-derivative, also known as the integral of a function. The integral of the speed function is the displacement function.

In certain cases, there is no derivative.

Finding the gradient

The gradient of a variable in a function is the rise over run against the other variables. In the classic x, y coordinate plot, the rise over run is computed using two points plotted on the graph.

The notation for the derivative, or the gradient can also be shown as such:

For a non-linear function, the gradient changes as the value of the variable change. Taking any two points to compute the gradient may result in an inaccurate gradient.

As dx tend to 0, the gradient becomes more accurate

Let us take the graph above as an example. Taking two points x and x + dx, we can get an estimate of the gradient at x. However, if we were to change the second point, where dx tends to 0, the gradient we compute will be more accurate. Hence, we should aim to get a dx where it is infinitely small yet not 0.

Using the notation given above, let’s try an example where f(x) = 3x + 2.

Sum Rule

The derivatives of the sum of two functions is the sum of the derivative of each function on its own. This rule extends indefinitely.

Power Rule

The derivative of a function with respect to a variable with a exponent is given by taking the value of the exponent and multiplying the function with it. Then deduct one from the value of the exponent.

Let’s try an example with the notation and see if it fits with the Power Rule.

Taking f(x) = 5x²,

Using the Power Rule,

Special Cases

The f(x) = 1/x function is a special one which includes a discontinuity.

This function also has a negative gradient for all values of x, excluding x = 0 which is undefined.

Using the limit notation,

Discontinuity in the f’(x) = -1/x² graph

Another special case:

This has the special property in which f(x) = f’(x), the function is equal to its derivative. Where f(x) = f’(x) holds, its values are either always positive, negative, or 0. Taking the negative of the above equation gives the negative example while taking f(x) = 0 gives the 0 example.

Trigonometric Functions

The two trigonometric functions that we are focusing on will be the sine and cosine functions.

The derivative of the sine function is the cosine function.

The derivative of the cosine function is the negative sine function.

These two function form a derivative loop that returns to the start every 4th derivation.

Product Rule

dA = sum(1 + 2 + 3)

The third part shrinks the fastest, hence can be left out of the equation altogether.

This Product Rule can be extended by following the same formula.

Chain Rule

To understand the chain rule, let us consider an example.

Let happiness be a function of pizzas eaten. Also let the number of pizzas eaten be a function of money earned.

Hence we get two graphs

Having these two functions, we combine them by substituting p(m) into the p of h(p), giving us:

Getting the derivative, h’(m):

Next we use the Chain Rule to check whether it produces the same derivative.

Quotient Rule

This rule can be derived from the Product Rule.

Multiple Variables

In a function, there are independent variables, dependent variables and possibly constants as well.

Given the example of plotting a car’s speed over time, time is the independent variable while speed is the dependent variable. The relationship between these two variables is such that at any given speed, there can be time periods matching that particular speed, while at each time period, there is only one speed.

As such, the speed of the car is dependent on the time period referred to.

The parameters in a function may differ in their variable types depending on the context in which the function is used.

Taking the example F = ma + dv².

Two contexts for this function are Driving and Manufacturing.

In driving, the two variables that can be controlled would be acceleration and indirectly velocity.

In manufacturing, the manufacturer has control of the mass and drag variables, having the ability to tune these variables by using different materials and builds. In this context, the acceleration and velocity variables are taken as constant as they are in the driver’s control.

In any given function, the parameters can be varied to find the best fit for the situation.

Another example would be the surface area of a cylinder, each parameter may be a constant or a independent variable. These of course depend on which variables are controllable.

Partial Differentiation

Partial Differentiation is differentiating with respect to each variable in turn. For each partial differentiation, regard the other variables as constants in the differentiation context.

For the cylinder surface area formula, the partial derivatives are as such:

multivariate vs single variate function when differentiating

Multivariate vs single variate function when differentiating

Total Derivative

The total derivative is the derivative with respect to a specific variable where the function is dependent on the variable not only directly but also on other variables that are dependent on that specific variable.

Substituting t into f(x, y, z), we get:

Using the chain rule, we will get similar results.

Jacobian

The Jacobian is a row vector comprising of the derivatives of the variables in a single function.

For a general case, the following Jacobian, J, is obtained.

Let us look at a specific example:

A special point to note for the Jacobian is that J(x, y, z) returns the vector that points up the direction of the steepest slope of the Jacobian function.

The steeper the slope, the greater the vector magnitude. The Jacobian vector points to the direction of steepest gradient around it, towards a maximum. This may be a local or global maximum depending on the point in the function.

A Jacobian output of 0 signifies that it is at a maximum, minimum, or a saddle point.

Jacobian in matrix form

The Jacobian can be a matrix if it represents multiple functions containing similar variables.

The derivative of linear functions can be computed directly. For non-linear functions, zoom in on the function and consider each region approximately linear.

How scaling of the space spanned can be found via deriving the determinant.

Add all the determinants to find the change in the size of the space.

This figure helps with the night-time hill-walking analogy. The goal is to find the tallest peak. Using the Jacobian, we can find the direction of steepest ascent from any given location. So starting from a random point, using the Jacobian, we will slowly move towards the peak.

The alternative would be taking random points, by moving around randomly, and calculating the height of each point.

Hessian

The Hessian is a matrix of second order derivatives. The row determines the first derivative variable while the column determines the second derivative variable. It explains the curvature of each point in the function.

The Hessian is symmetrical along the leading diagonal as the variables mirror along that axis.

The Hessian is applicable for continuous functions with no step changes.

Take a 2D example of the Hessian.

x² + y² graph illustrated with the origin marked

Where the origin X is 0 and the darker the area, the higher the corresponding value.

When the determinant of the Hessian is positive, the function has a maximum or minimum point.

Looking at the first term, at the top left of the Hessian matrix, if it is positive, we are dealing with a minimum. Likewise, if it is negative, we are dealing with a maximum.

Another case would be when the determinant of the Hessian is negative.

When the Hessian is negative, there is neither a maximum nor minimum but a saddle point. A saddle point is a point in the function where its derivatives are zero.

When dealing with real world problems, the solution may not be as simple to derive. The real world is n-dimensional, where n borders on infinity, hence, not all variables can be accounted for.

Also, there may be no nice analytical function to fit the model of variables. These functions may hence be computationally expensive and each iteration is time-consuming.

The variable may also constitute sharp features in which there is a discontinuity in the function. Or perhaps some function may be noisy due to random distribution.

To deal with these problems, we will need to use Principal Component Analysis which will be covered in the next course. These help greatly reduce the number of variables used. For noisy function, taking different step sizes and then averaging them would work.

If there is no function to optimise in the first place, calculate the partial derivatives in turn using computed points in the function.

Ensure that the step size used is appropriate. Too big and the derivative will not be accurate, too small and there may be numerical issues with computing it where the computer have limited significant figures.

Multivariate Chain Rule

The chain rule can be applied to a function with multiple variables. The difference from the uni-variate chain rule is that vector multiplication is used. Specifically, the dot product is used to get the scalar product.

Each x is a function of the variable t, i.e. x(t). We will use the multivariate chain rule to find df/dt.

This equation can be rearranged as:

Let us look at an example with multiple links with one variable each.

Using the chain rule,

Using substitution, we get the same results.

Now, lets look at the same example with multiple variables at each link.

Dimensions: (1x2)(2x2)(2x1) = (1x1)

Simple Neural Networks

Y = f(X)

This model has multiple inputs and multiple outputs.

Single node connection

Looking at a node connection example, we use the following equation:

The activation function introduces non-linearity which allows for increased complexity in the model. Simply leaving out the activation function would result in a Linear Regression Model, which does not have the complexity required to learn more complex data models.

The role of sigma can be taken on by different functions. An example would be the tanh function.

The slanted S shape is the sigmoid shape. The set of functions that model these particularly shaped graphs are also known as the sigmoid functions.

Another activation function that is more widely used nowaday is the ReLU function or the Rectified Linear Unit. It is essentially max(0,x) where all negative values get rounded up to 0.

Hidden Layers

A neural network must contain hidden layers to be effectively non-linear. The number of hidden layers in a neural network is the number of layers in between the input and output layers. The neural network model shown previously has one hidden layer.

In general, we use the following equation for neural networks:

Training the network requires that we use a cost function.

This cost function is in the form:

The aim would be to minimise this cost function through gradient descent, that means using the Jacobian.

We find the Jacobian via the multivariate chain rule with respect to the weights and bias variables.

Focusing on the first two layers, we have the following equations:

From these equations, we compute the partial derivatives for the Jacobian.

Likewise, when extending the following equations to multiple layers, we get the following:

Ck is the cost for training example k. Compute the average cost for n examples using:

This is essentially the same equation for a single layer except that there are the derivatives of each layer with respect to the previous layer for all the layers.

Each of these derivatives compute as follows:

Backpropagation

Backpropagation is the process used to calculate the gradient required to adjust the weights in a neural network. Using the multivariate chain rule, the cost function is accordingly attributed to each weight/bias variable with respect to the magnitude of their gradient.

Backpropagating this cost or error, we multiple the gradient by a learning rate value and then subtract the result from the weight/bias variable, updating it. This descents the gradient, adjusting the variables to minimise the cost of the training set of examples.

The learning rate is a percentage ratio which determines how fast the gradient descent is achieved. The higher the learning rate, the faster the model trains, however the trade-off is the accuracy of the training.

Code

Packages:
import numpy as py
import matplotlib.pyplot as plt
Functions:
sigma = lambda z: 1/(1 + np.exp(-z)) # Logistic Function
d_sigma = lambda z: np.cosh(z/2)**(-2)/4 # Derivative of the Logistics Function
Initiation:
global W and b variables
apply np.random.rand()
Feedforward:
ff()
z1 = W1 @ a0 + b1
a1 = sigma(z1)
…
a3 = sigma(z3)
return [a0, a1, a2, a3]
Cost:
cost(x, y)
(np.linalg.norm(ff(x)[-1]-y)**2)/x.size

Compute derivatives starting from the last layer:

Last Layer:

Partial derivatives:

Second Last Layer:

Example dimensions for the variables in the above neural network:

Taylor Series for Approximations

The Taylor series is the sum of a infinite series of a function’s derivatives in increasing order at a single point, which is used to approximate the function.

Taylor series are also power series. The Taylor series of a function is the power series associated to the approximation of that function.

Power series come in the general form as such:

The power series goes on infinitely, computing an approximation can be done via using the truncated series, which calculates based on a number of terms depending on the order of approximation.

Maclaurin Series

This series is a special case of the Taylor series where the point of estimation is at 0.

Equations for the different orders of the Taylor series

Hence in the general form, we get:

Let’s take e^x as an example. All its derivatives are also e^x. When substituting f(x) = e^x into the Maclaurin series, we get the follow series:

Taylor Series

When not approximating around 0, we use the general Taylor series instead.

m is the gradient at point p

Generalising this equation to the other order derivatives, we get the following:

Thus the following Taylor series obtained is one that approximates around p.

Special cases

There are some series where derivatives are skipped as their coefficients equate to 0. An example would be the cos(x) function.

Let’s explore the Maclaurin series for f(x) = cos(x).

Here we can spot the trend where every even-order derivative has a coefficient of 0. Hence the odd number powers of x are all absent.

This results in the following equation:

The cos(x) function is a well behaved function. However, not all functions behave well for the Maclaurin series. The 1/x function is an example of a badly behaved function.

The problem with the function lies with the vertical asymptote at x = 0.

Let f(x) = 1/x. f(0) = 1/0 which is undefined, however, f(1) = 1/1 = 1.

Hence a solution would be to use the Taylor series where the point used for approximation is not at the asymptote.

The approximation will cover estimations of the function for x > 0. We will need a similar Taylor series at a point where x < 0 as well to perform estimates for the function on the other side of the asymptote.

This gives us the following Taylor series:

Linearisation

Linearisation is used to find the best approximations for the function or data set given. The linear approximation for Taylor series is first order accurate.

Let us look at a 2D function, y = e^x.

Run over Rise as the first order approximation

To get a linear approximation of a function, we find the gradient at a chosen point in the function. The gradient, also known as the rise over run, is obtain via taking another point, x+cx, and putting it into the gradient equation.

We can also obtain this equation from the first order of the Taylor series.

f’(x) is the first derivation of the equation, which is the gradient

Earlier, we had the following equation:

This is the first order approximation for the Taylor series. Letting g(x) be the linear approximation, we get the following:

p is the chosen point for approximation, and x is the variable using this approximation.

The O(dx)² error term comprises of the rest of the Taylor series. We can tell from this approximation error that its magnitude increase with dx.

Another form for the Taylor series in term of dx would be:

Odd and Even Functions

Odd functions are those with the attribute where -f(x) = f(-x).

Even functions are those with the attribute where f(x) = f(-x).

If the function have neither attributes, then it is not an even or odd function.

Even, Odd, and functions that are neither

Multivariate Power Series

The power series can extend to include multivariate functions. Let us take for example the Gaussian Function:

The Gaussian Function has a maximum point at (0, 0).

Here we will examine the power series for three orders of approximation:

Zeroth order => Plane

First order => Plane at an angle

Second order => Parabola

For the power series, we recall that:

When applying this to a multivariate function, we have to make use of partial differentiation.

Let look at an example.

The second and third terms in the equation is required for the first order derivative.

The use of the Jacobian helps with the equation formation.

Putting the variables in vector form and using the Jacobian in the equation, we get:

Likewise, when computing the second order derivative term, we will use the Hessian which comprises of the second order partial derivatives.

We have to manipulate the terms around when using matrices to obtain a scalar result, getting the following:

Hence the full equation for the second order multivariate taylor series expansion is obtained as such:

To put this into practice, we will look at an example.

Putting the whole equation together, we get back the equation:

Newton-Raphson Method

The Newton-Raphson Method is a method to find the roots of a function. The function in use must be real valued, having all its values being real numbers.

We can obtain the Newton-Raphson Method from the first order approximation equation in the Linearisation section:

dx is the difference between a variable point and the current point.

When near a root, we take y = 0.

Solving for x(n+1), we get the following iterative equation:

Starting at the mark, the first three iterations are the green, red, and blue ones.

The method uses approximation to close in on the roots of a function. In the graph above, we see that each step in the Newton-Raphson method brings us closer to the actual root. We can stop iterating when the root gets close enough, which is when each subsequent iteration does not change much.

Looking at an example:

At the third iteration, we see that the y value is close to 0, hence the x value at that iteration would be our approximation of the root.

One issue that arise from using this method is that it does not work for closed loops.

Observing the pattern above, we see that the x and y values will oscillate between (0, 2) and (1, 1) indefinitely, hence we will not arrive at a good approximation of the root.

Another problem with the method arises when the gradient of the slope is too small. The gradient is the denominator of the equation, hence if its value is too small, the value of x will diverge relatively further away in the next iteration.

There are methods to deal with these issues such as successive over-relaxation, which helps converge the approximation.

Gradient Descent

Gradient Descent is the popular method for finding the minima of a function. It uses the property of the gradient which always points up the direction of steepest ascent. Taking the negative of the gradient allows us to descent and find the minima.

When using the gradient descent method, we take the grad of the function, denoted by gradf. This is a vector of the partial derivatives of the function with respect to each of its variables. gradf is also known as the Jacobian of the function.

A method of gradient descent would be to find the directional gradient of the function. The directional gradient, rhat, is maximised when it is parallel to gradf, when cos(theta) = 1. Simply put, the directional gradient is maximised when they both point in the same direction.

Essentially, rhat is the unit vector of the gradient. When we take the dot product of gradf with rhat, we get the magnitude of gradf.

Hence, the normalised gradient can be obtained by dividing gradf by its own magnitude as such:

The normalised gradient is used when control of the step size of each iteration is prioritised. Else, just using the gradient alone will allow the step size to vary with the magnitude of the gradient.

Putting this into the iterative equation:

Gradient Descent Methods

Steepest Descent using the Jacobian:

Hessian to control step size:

Using the Hessian, we get use the curvature attribute of the function to determine the step size of each iteration. However, the Hessian reacts to any extrema, that is both the minima and the maxima. Due to this attribute, it may lead to a step in the wrong direction. Also, there is a possibility that the Hessian may increase the step size too much.

Using a hybrid of the two methods solves these issues:

Lagrange Multipliers

This method is used to find extrema points for functions with equality constraints.

This method uses a unique property where the gradient of the primary function points in the same direction as the gradient of the constraint which intersect with the primary function at an extrema.

The two gradients are parallel to each other, hence we get the Lagrange Multiplier equation:

Solving for solutions:

Here we obtain four solutions:

Another way to denote the inequality constraints for the Lagrange Multiplier:

The number of inequalities to solve is equal to the number of variables plus 1, which is the lambda variable.

Simple Linear Regression

Simple linear regression is a method studying the relationship between two variables, the independent and the dependent variable. When making use of data to perform statistical analysis, be sure to perform a proper cleanup of the data.

Cleaning up refers to handling of partial data, zeroes, missing, duplicate, or erroneous data. Also, it would be preferable if the data is sorted and outliers are identified.

Standard deviation, s.d. is the measure of the variance of a set of data from its mean.

The parameters to note for simple linear regression are as follows:

The general equation:

Gradient/constant plot with concentric ellipses

Each concentric ellipse shows points at which there is a constant X² value. The center point of these concentric ellipses is where X² is at a minimum.

The downside of this plot is that it is very computationally demanding.

The minimum is at the point where the gradient of X² is 0.

Solving the partial derivative with respect to c,

Uncertainty in fit for m & c,

Anscombe’s quartet

This quartet of datasets share the same X², mu, best fit line, and uncertainty values. However, each of these datasets contain different datapoints.

The first dataset is suited for linear regression. The second dataset is more suited to quadratic regression. The third and fourth datasets both have an outlier which when removed will drastically change its linear regression results.

Deviation from center of mass

Shifting the line equation away from the center of mass of the data points lets us remove the sigma-m term from the constant uncertainty equation.

Next we remove the gradient uncertainty term from the constant uncertainty equation. The resultant m, c plot is one with concentric circles.

The gradient does not affect the intercept constant in the new equation.

Gradient/constant plot with concentric circles

General Non-Linear Least Squares

The general non-linear least squares method models a function around multiple parameters. This involves using a linear approximation at first and then refining the parameters through multiple iterations.

A common notation used is:

An example of a function would be:

Adding to the notation, we account for each observation and also its associated uncertainty.

The measure used to determine how well the function fits is known as the Goodness of fit parameter:

X² measures the difference between the actual target dependent value, yi, and the predicted value through the model with xi and parameters ak. The differences are squared so that they do not cancel each other out.

The difference obtained is then divided by the uncertainty, which reduces the weightage placed on data points that have greater uncertainties.

Fitting the parameters

Let the parameters fit in a vector of size k, [a1, …, ak]. The parameters are initialised and then iteratively improved.

grad(X²) is given via partial differentiation with respect to each variable.

This gives us the equation,

Using the example given previously,

Different Solvers for Non-Linear Least Squares problems

Jacobian & Hessian (Taylor Series) controls the direction to that of the steepest descent as well as the step size taken.

Levenberg-Marquardt uses a damping factor to reduce the step size toward the minimum.

Gauss-Newton uses the Hessian directly to find the minimum.

BFGS builds information about the Hessian over successive iterations. It uses an approximation of the Hessian which is less computationally intensive and hence quicker to compute.

An important measure to evaluate the method is the robustness of the fit. A robust fitting method would not be bothered by outlier data, given a better function approximation.

Instead of using the least squares method, take the absolute of the squared deviations. This reduces the weightage of the points that are far from the fitted function.

Practical Application to Solving Non-Linear Least Squares Problems

Using Matlab:

Import Data
Apps… Curve Fitting Toolbox
Set own function or pick one (Optional)

Using Python:

scipy.optimize.curve_fit

Numpy and Scipy Documentation - Numpy and Scipy documentation

Welcome! This is the documentation for Numpy and Scipy.

docs.scipy.org

A good estimate for the model may be required. In some cases without a good estimate, the model will not overlap with the data. This fitting difficulty is due to the Jacobian/gradient not being informative on the direction.

Mathematics for Machine Learning: Multivariate Calculus

Numpy and Scipy Documentation - Numpy and Scipy documentation

Welcome! This is the documentation for Numpy and Scipy.

Written by Isaac Ng