Numerical Method Considerations for Machine Learning
From Overflow & Underflow to Gradient-Based Optimization and Directional Derivatives
Motivation
Machine learning algorithms often require a lot of computation, and much of it is iterative: the same operations repeated over and over. Instead of solving for a variable analytically, we compute with floats and doubles of finite precision.
Another reason to understand Calculus is that there’s often a need to optimize (maximize or minimize) a certain function.
Overflow and Underflow
One of the difficulties in going from math to computers is that it is impossible to represent an arbitrary real number exactly in finite memory. This means we incur some approximation error in every calculation.
Rounding error is bad enough on its own; when we iterate over many operations, it starts to snowball.
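A tiny sketch of that snowballing: 0.1 has no exact binary representation, so repeatedly adding it drifts away from the exact answer. (The specific loop count here is arbitrary, just for illustration.)

```python
# Summing 0.1 ten thousand times does not give exactly 1000.0
# in float64 arithmetic; each addition rounds, and the error accumulates.
total = 0.0
for _ in range(10_000):
    total += 0.1
print(total)            # close to, but not exactly, 1000.0
print(total == 1000.0)  # False
```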
Underflow
This occurs when numbers close to 0 are rounded to 0.
Many functions behave qualitatively differently when their argument is exactly 0 versus a tiny epsilon. Think of taking the log of 0, or dividing by 0, et cetera.
Overflow
This occurs when very large numbers get approximated as infinity or negative infinity.
In both cases, further arithmetic on the result often turns the number we’re interested in into NaN.
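A quick demonstration of both failure modes using NumPy (the specific values are arbitrary, chosen only to trigger the rounding):

```python
import numpy as np

x = np.array([1000.0, -1000.0])
with np.errstate(over='ignore', under='ignore'):
    e = np.exp(x)

print(e)           # [inf  0.] : overflow and underflow
print(e[0] - e[0])  # inf - inf = nan
```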
Example
The SoftMax Function is often used to predict the probabilities associated with a Multinoulli Distribution:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)

If any xᵢ is really big, exp(xᵢ) will overflow and the whole expression becomes NaN.
Solution
Evaluate softmax(z) instead, where z = x − maxᵢ xᵢ. Subtracting the maximum leaves the value of softmax unchanged, but guarantees the largest argument to exp is 0, ruling out overflow.
If all the xᵢ are very negative, every exp(xᵢ) will underflow to 0, the denominator becomes 0, and the whole expression again becomes NaN.
The same trick solves this case too. Low-level library authors should be aware of these issues, but most of us can rely on library implementations of these functions to just work.
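A minimal sketch of the max-subtraction trick, comparing a naive softmax against the stabilized version on large inputs:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift by the max: softmax(x) == softmax(x - c) for any constant c,
    # and with c = max(x) the largest argument to exp is 0.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(x)
stable = softmax_stable(x)

print(naive)   # [nan nan nan] : exp overflowed to inf, inf/inf = nan
print(stable)  # [0.09003057 0.24472847 0.66524096]
```

Note that the stable version gives the same answer as softmax([1, 2, 3]), since shifting all inputs by a constant does not change the result.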
Theano
A software package that automatically detects and stabilizes many common numerically unstable expressions.
Poor Conditioning
Conditioning refers to how rapidly a function changes with respect to small changes in its inputs.
Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.
Consider the function f(x) = A⁻¹x on a vector x. Assuming A ∈ ℝⁿˣⁿ has an eigenvalue decomposition, its condition number is:

maxᵢⱼ |λᵢ / λⱼ|

This ratio compares the magnitudes of the largest and smallest eigenvalues. When this number is big, matrix inversion is particularly sensitive to error in the input.
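The eigenvalue ratio can be checked numerically. A sketch with an assumed diagonal matrix whose eigenvalues differ by six orders of magnitude:

```python
import numpy as np

# Condition number as the ratio of the largest to smallest
# eigenvalue magnitude. This A is diagonal, so its eigenvalues
# are just the diagonal entries 1 and 1e-6.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])

eig = np.linalg.eigvals(A)
cond = np.abs(eig).max() / np.abs(eig).min()
print(cond)  # 1e6 : A is poorly conditioned
```

With a condition number of 10⁶, a relative error of 10⁻⁶ in the input to A⁻¹x can become a relative error of order 1 in the output.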
Gradient Based Optimization
Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x.
The function we want to optimize is the Objective Function, sometimes called the Criterion.
When we are minimizing it, we call it the Cost Function, Loss Function or Error Function.
We often denote the value that minimizes or maximizes a function with a superscript *.
For a single variable function, the derivative f′(x) gives the slope, and gradient descent repeatedly takes a small step against it: x ← x − ε f′(x).
For a multi variable function, we use the gradient ∇f(x), the vector of partial derivatives, and the update becomes x ← x − ε ∇f(x).
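The two update rules can be sketched side by side. The example functions (f(x) = x² and f(x) = ‖x‖²) and the step size ε are assumptions chosen so the minimizer is x* = 0:

```python
import numpy as np

eps = 0.1  # step size (assumed)

# Single variable: x <- x - eps * f'(x), with f(x) = x**2, f'(x) = 2x
x = 5.0
for _ in range(100):
    x = x - eps * 2 * x
print(x)  # very close to the minimizer x* = 0

# Multi variable: x <- x - eps * grad f(x), with f(x) = ||x||^2, grad = 2x
v = np.array([3.0, -4.0])
for _ in range(100):
    v = v - eps * 2 * v
print(v)  # very close to [0, 0]
```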
The Directional Derivative
The directional derivative in direction u, a unit vector, is the slope of the function f in direction u. It’s the derivative of f(x + αu) with respect to α, evaluated at α = 0.
To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

min over unit vectors u of uᵀ∇f(x) = min over u of ‖u‖₂ ‖∇f(x)‖₂ cos θ

where θ is the angle between u and the gradient. Substituting ‖u‖₂ = 1 and ignoring everything that does not depend on u, this reduces to minimizing cos θ, which happens when u points in the opposite direction of the gradient. The negative gradient is therefore the direction of steepest descent.
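This can be checked numerically: among unit vectors u, the directional derivative is most negative when u points opposite the gradient. The example function f(x) = ‖x‖² and the finite-difference step are assumptions for the sketch:

```python
import numpy as np

def f(x):
    return np.dot(x, x)  # f(x) = ||x||^2, so grad f(x) = 2x

def directional_derivative(x, u, h=1e-6):
    # Finite-difference approximation of d/da f(x + a*u) at a = 0
    return (f(x + h * u) - f(x)) / h

x = np.array([1.0, 2.0])
grad = 2 * x
u_star = -grad / np.linalg.norm(grad)  # negative gradient direction

# Compare against many random unit directions: none should be steeper
rng = np.random.default_rng(0)
best = min(
    directional_derivative(x, u / np.linalg.norm(u))
    for u in rng.normal(size=(1000, 2))
)

print(directional_derivative(x, u_star))  # about -||grad|| = -2*sqrt(5)
print(best >= directional_derivative(x, u_star) - 1e-3)  # True
```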
Up Next…
In the next article, I’ll talk about Jacobians and Hessians. If you would like me to write another article explaining a topic in depth, please leave a comment.