Numerical Method Considerations for Machine Learning
From Overflow & Underflow to Gradient-Based Optimization and Directional Derivatives
Motivation
Machine learning algorithms often require a lot of computation, and much of it is iterative: the same operations repeated over and over. Instead of solving for a variable analytically, we compute with floats and doubles of finite precision.
Another reason to understand Calculus is that there’s often a need to optimize (maximize or minimize) a certain function.
Overflow and Underflow
One of the difficulties in going from math to computers is that it is impossible to represent an arbitrary real number exactly in finite memory. This means we incur some approximation error in every calculation.
Rounding error is bad enough on its own; when we iterate over many operations, it starts to snowball.
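A tiny sketch of that snowballing: 0.1 has no exact binary representation, so repeatedly adding it drifts away from the exact answer. (The specific loop count here is arbitrary, just for illustration.)

```python
# Summing 0.1 ten thousand times does not give exactly 1000.0
# in float64 arithmetic; each addition rounds, and the error accumulates.
total = 0.0
for _ in range(10_000):
    total += 0.1
print(total)            # close to, but not exactly, 1000.0
print(total == 1000.0)  # False
```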
Underflow
This occurs when numbers close to 0 are rounded to 0.
Many functions behave qualitatively differently when their argument is exactly 0 versus a tiny epsilon. Think of taking the log of 0, or dividing by 0, et cetera.
Overflow
This occurs when very large numbers get approximated as infinity or negative infinity.
In both cases, further arithmetic on the result often turns the number we’re interested in into NaN.
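A quick demonstration of both failure modes using NumPy (the specific values are arbitrary, chosen only to trigger the rounding):

```python
import numpy as np

x = np.array([1000.0, -1000.0])
with np.errstate(over='ignore', under='ignore'):
    e = np.exp(x)

print(e)           # [inf  0.] : overflow and underflow
print(e[0] - e[0])  # inf - inf = nan
```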
Example
The SoftMax Function is often used to predict the probabilities associated with a Multinoulli Distribution:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)

If any xᵢ is really big, exp(xᵢ) will overflow and the whole expression becomes NaN.
Solution
Evaluate softmax(z) instead, where z = x − maxᵢ xᵢ. Subtracting the maximum leaves the value of softmax unchanged, but guarantees the largest argument to exp is 0, ruling out overflow.
If all the xᵢ are very negative, every exp(xᵢ) will underflow to 0, the denominator becomes 0, and the whole expression again becomes NaN.
The same trick solves this case too. Low-level library authors should be aware of these issues, but most of us can rely on library implementations of these functions to just work.
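A minimal sketch of the max-subtraction trick, comparing a naive softmax against the stabilized version on large inputs:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift by the max: softmax(x) == softmax(x - c) for any constant c,
    # and with c = max(x) the largest argument to exp is 0.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(x)
stable = softmax_stable(x)

print(naive)   # [nan nan nan] : exp overflowed to inf, inf/inf = nan
print(stable)  # [0.09003057 0.24472847 0.66524096]
```

Note that the stable version gives the same answer as softmax([1, 2, 3]), since shifting all inputs by a constant does not change the result.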
Theano
A software package that automatically detects and stabilizes many common numerically unstable expressions.
Poor Conditioning
Conditioning refers to how rapidly a function changes with respect to small changes in its inputs.
Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.
Consider the function f(x) = A⁻¹x on a vector x. Assuming A ∈ ℝⁿˣⁿ has an eigenvalue decomposition, its condition number is:

maxᵢⱼ |λᵢ / λⱼ|

This ratio compares the magnitudes of the largest and smallest eigenvalues. When this number is big, matrix inversion is particularly sensitive to error in the input.
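The eigenvalue ratio can be checked numerically. A sketch with an assumed diagonal matrix whose eigenvalues differ by six orders of magnitude:

```python
import numpy as np

# Condition number as the ratio of the largest to smallest
# eigenvalue magnitude. This A is diagonal, so its eigenvalues
# are just the diagonal entries 1 and 1e-6.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])

eig = np.linalg.eigvals(A)
cond = np.abs(eig).max() / np.abs(eig).min()
print(cond)  # 1e6 : A is poorly conditioned
```

With a condition number of 10⁶, a relative error of 10⁻⁶ in the input to A⁻¹x can become a relative error of order 1 in the output.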
Gradient Based Optimization
Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x.
The function we want to optimize is the Objective Function, sometimes called the Criterion.
When we are minimizing it, we call it the Cost Function, Loss Function or Error Function.
We often denote the value that minimizes or maximizes a function with a superscript *.
For a single variable function, the derivative f′(x) gives the slope, and gradient descent repeatedly takes a small step against it: x ← x − ε f′(x).
For a multi variable function, we use the gradient ∇f(x), the vector of partial derivatives, and the update becomes x ← x − ε ∇f(x).
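The two update rules can be sketched side by side. The example functions (f(x) = x² and f(x) = ‖x‖²) and the step size ε are assumptions chosen so the minimizer is x* = 0:

```python
import numpy as np

eps = 0.1  # step size (assumed)

# Single variable: x <- x - eps * f'(x), with f(x) = x**2, f'(x) = 2x
x = 5.0
for _ in range(100):
    x = x - eps * 2 * x
print(x)  # very close to the minimizer x* = 0

# Multi variable: x <- x - eps * grad f(x), with f(x) = ||x||^2, grad = 2x
v = np.array([3.0, -4.0])
for _ in range(100):
    v = v - eps * 2 * v
print(v)  # very close to [0, 0]
```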
The Directional Derivative
The directional derivative in direction u, a unit vector, is the slope of the function f in direction u. It’s the derivative of f(x + αu) with respect to α, evaluated at α = 0.
To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

min over unit vectors u of uᵀ∇f(x) = min over u of ‖u‖₂ ‖∇f(x)‖₂ cos θ

where θ is the angle between u and the gradient. Substituting ‖u‖₂ = 1 and ignoring everything that does not depend on u, this reduces to minimizing cos θ, which happens when u points in the opposite direction of the gradient. The negative gradient is therefore the direction of steepest descent.
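This can be checked numerically: among unit vectors u, the directional derivative is most negative when u points opposite the gradient. The example function f(x) = ‖x‖² and the finite-difference step are assumptions for the sketch:

```python
import numpy as np

def f(x):
    return np.dot(x, x)  # f(x) = ||x||^2, so grad f(x) = 2x

def directional_derivative(x, u, h=1e-6):
    # Finite-difference approximation of d/da f(x + a*u) at a = 0
    return (f(x + h * u) - f(x)) / h

x = np.array([1.0, 2.0])
grad = 2 * x
u_star = -grad / np.linalg.norm(grad)  # negative gradient direction

# Compare against many random unit directions: none should be steeper
rng = np.random.default_rng(0)
best = min(
    directional_derivative(x, u / np.linalg.norm(u))
    for u in rng.normal(size=(1000, 2))
)

print(directional_derivative(x, u_star))  # about -||grad|| = -2*sqrt(5)
print(best >= directional_derivative(x, u_star) - 1e-3)  # True
```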
Up Next…
In the next article, I’ll talk about Jacobians and Hessians. If you would like me to write another article explaining a topic in depth, please leave a comment.