Mathematical Optimization: ‘simplicity is all you need’

Animesh Karnewar · Published in BuzzRobot · Jan 3, 2018

Mathematical optimization lies at the heart of machine learning. It is ultimately the objective function that is optimised to statistically fit a curve through the given data. This is true for both supervised and unsupervised learning; in the case of the latter, the distinction lies only in the definition of the objective function, which then doesn’t involve any label information.

Mathematical Optimization (a.k.a. mathematical programming) forms a completely separate field of study and is almost as varied as the field of Machine Learning itself. Refer to the wiki page for a brief summary.

The main objective of this (Machine Learning specific) optimization is to minimize or maximize the value of a mathematical function (a.k.a. the objective function, loss function or cost function). Let the function be f(x). The outcome of minimizing this function is a value of x at which f(x) attains its minimum.

The most popular and most basic algorithm in use in Machine Learning is the ‘Gradient Descent’ algorithm (more precisely, Stochastic Gradient Descent, since we feed only a part of the data for each update). It is an iterative algorithm that updates the parameters in such a way that the cost (the function value) keeps decreasing over time. The update equation of gradient descent is:

x = x - (alpha * dy_dx)

where ‘alpha’ is the learning rate, which needs to be manually tuned, and ‘dy_dx’ is the derivative of ‘y’ w.r.t. ‘x’.
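To make the update concrete, here is a minimal sketch of plain gradient descent minimizing the square function that appears later in this post; the learning rate of 0.1 and the starting point of x = 1000 are arbitrary choices for illustration:

```python
def f(x):
    return x ** 2        # objective (cost) function

def df_dx(x):
    return 2 * x         # its derivative

# plain gradient descent
x = 1000.0               # starting point (illustrative)
alpha = 0.1              # learning rate; has to be tuned by hand
for _ in range(50):
    x = x - alpha * df_dx(x)

print(x, f(x))           # x ends up close to the minimum at 0
```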

This was followed by tremendous research effort to make the gradient descent algorithm faster and better. Two of the most prominent improvements were Momentum and RMSprop. It was later realised that alpha shouldn’t be a constant and should be decayed (mathematically) over time. This line of research led to the current standard optimizer: Adam. Adam combines momentum, RMSprop and learning rate decay, and currently gives the best optimization performance. However, it has quite a few parameters (hyper-parameters, technically): alpha, beta1, beta2 and epsilon. Although one rarely tweaks the beta and epsilon parameters, they are still there.

The full procedure of the Adam optimizer is:

Adam update pseudocode

The penultimate step, which I didn’t mention earlier, is known as the bias correction of the calculated moments.
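For reference, the standard Adam update for a single parameter can be sketched in a few lines; this follows the published algorithm, with the usual default hyper-parameter values:

```python
import numpy as np

def adam_step(x, dy_dx, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter x given its gradient dy_dx at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * dy_dx          # momentum (first moment estimate)
    v = beta2 * v + (1 - beta2) * dy_dx ** 2     # RMSprop term (second moment estimate)
    m_hat = m / (1 - beta1 ** t)                 # bias correction of the moments
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```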

It is clearly evident that the equations are complex. Moreover, very recently, this research has highlighted a number of problems in the Adam optimization algorithm. Most prominently, it points out a mistake in the convergence proof of the original paper; so, theoretically, Adam does not even converge for convex optimization problems. Some problems with generalization have also been raised through this work.

Much of Adam’s complexity is what accounts for its performance. I perceive that this complexity arises from the fact that the cost dimension is considered separate from the parameter dimensions. This is where I like to quote the concept of spacetime (not space-time) from Einstein’s theory of General Relativity: “Space and time are not separate, but one”. Following this principle, I discovered that using the cost dimension in the update equation can hugely affect the optimization. I hypothesise that ‘the cost dimension and the parameter dimensions are not separate and should be used together while making updates during optimization’.

With this thought, I furthered my study and came across the Newton-Raphson equation for root finding:

x = x - (y / dy_dx)

Note that dy_dx appears in the denominator, as opposed to the original gradient descent algorithm. In the Newton-Raphson method, the update is inversely proportional to the derivative. Perhaps this is the reason why gradient descent is so heavily dependent on alpha: to balance out this inverse relationship.

In order to check its validity, I tried running it on a one-dimensional function. And what better function to test it on than the square function?

Let y = f(x) = x², shown below:

Running the Newton-Raphson algorithm on this function produced a very smooth optimization curve. It was obtained by running the algorithm for 50 iterations and starting with x = 1000.

The cost dropped to (very nearly) zero in less than 10 iterations. This showed a lot of promise and convinced me that there was some meaning to my original hypothesis.
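A minimal sketch that reproduces this run (50 iterations, starting at x = 1000):

```python
def f(x):
    return x ** 2

def df_dx(x):
    return 2 * x

# Newton-Raphson root finding applied to f(x) = x^2
x = 1000.0
for i in range(50):
    x = x - f(x) / df_dx(x)   # derivative in the denominator
    print(i, f(x))            # x halves each step, so the cost drops by 4x per iteration
```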

The question that arises is: can we use the Newton-Raphson algorithm directly in Machine Learning? Well, the answer is no. But it definitely gets very close.

In the case of Machine Learning, objective functions are defined as positive real-valued functions, because of which their minimum value can only be 0 (or greater) and the function wouldn’t cut through the parameter axis; it could only touch it smoothly. So that’s perfect for using Newton-Raphson, right? Finding a root is equivalent to minimising the function, isn’t it? Well, unfortunately, it cannot be guaranteed that a root exists, and assuming that it does is just too optimistic, which I learnt the hard way.

I studied the behaviour of Adam closely to find out what makes it so good. And I found that the momentum term in the numerator and the RMS term in the denominator together kind of emulate this inverse-derivative behaviour. This (and a series of trial-and-error equation finding) led me to the following update equation:

x = x - ((y * dy_dx) / (y + (dy_dx)²))

Please note that (dy_dx)² is just the square of the derivative term, not the second-order derivative of y w.r.t. x.

Using this equation on the square function produced the following optimization curve:
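A minimal sketch of this experiment (assuming the same 50 iterations and starting point of x = 1000 as before):

```python
def f(x):
    return x ** 2

def df_dx(x):
    return 2 * x

# proposed update: uses both the cost y and the derivative dy_dx
x = 1000.0
for i in range(50):
    y, g = f(x), df_dx(x)
    x = x - (y * g) / (y + g ** 2)   # for x^2 this simplifies to x = 0.6 * x: a smooth geometric decay
    print(i, f(x))
```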

Quite smooth, right? When it comes to optimization equations, the study cannot be complete without testing it on the exponential function.

Let y = f(x) = exp(x)

Honestly, I was a little worried, because the derivative of exp(x) is itself. So, is the equation stable enough on the unshaken, undeviating (differentiation-proof) exp(x)?

Well, yes, the equation is stable on the exponential function as well. In the next section, I will explain why it is stable.
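A quick numerical check of that stability (the starting point of x = 10 is an arbitrary choice for illustration):

```python
import numpy as np

# proposed update applied to y = exp(x), whose derivative is exp(x) itself
x = 10.0
for i in range(50):
    y = np.exp(x)
    g = y                            # dy_dx = exp(x)
    x = x - (y * g) / (y + g ** 2)   # the step simplifies to exp(x) / (1 + exp(x)), which is always < 1
    print(i, y)
```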

The best way to understand this equation is to visualise the 3D surface of the update term in the equation. The surface can be interacted with here.

Let g(k, v) = (k * v) / (k + v²);

In this function, we substitute k = y (i.e. the cost) and v = dy_dx (i.e. the derivative).

The above image is a still shot of the surface from an explanatory angle. The x dimension is the cost dimension (k) and the y dimension is the derivative dimension (v), as defined in the g(k, v) function above.

It is clearly evident that the surface tapers toward the z = 0 plane as we move away in either direction along the y (derivative) axis. This means that the update slows down if the slope becomes too steep; a behaviour that is desired. Also, the surface cuts the z = 0 plane exactly at y = 0 (i.e. derivative = 0), which shows that the equation is stable and will converge when the derivative becomes 0. Finally, the most important feature is that the properties of these tapering lobes are controlled by the x (cost) dimension: as the cost becomes 0, z approaches zero too. This is the most desired behaviour, i.e. the algorithm needs to stop when it reaches a root (the minimum value).
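A sketch that reproduces a surface like the one above (the tiny epsilon in the denominator is only there to avoid a 0/0 at the origin when plotting):

```python
import numpy as np
import matplotlib.pyplot as plt

# surface of the update term g(k, v) = (k * v) / (k + v**2)
# k plays the role of the cost y, v the role of the derivative dy_dx
k = np.linspace(0, 10, 200)          # cost axis (non-negative by assumption)
v = np.linspace(-10, 10, 200)        # derivative axis
K, V = np.meshgrid(k, v)
Z = (K * V) / (K + V ** 2 + 1e-12)   # epsilon avoids 0/0 at k = 0, v = 0

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(K, V, Z, cmap="viridis")
ax.set_xlabel("cost (k)")
ax.set_ylabel("derivative (v)")
ax.set_zlabel("update g(k, v)")
plt.show()
```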

The following section presents the results obtained by using the equation in practical Machine Learning settings.

So, with all this synthetic mathematical experimentation done, can we assume that the equation can be used for Machine Learning? Why assume? Let’s try it and find out.

I carried out an experiment on the most popular Machine Learning dataset: MNIST (the Drosophila of machine learning).

Configuration: the classifier that I used is a three-layer-deep Neural Network with the ReLU activation function. I didn’t use batch normalization here, since the network is quite shallow and the input data is normalised; the data distribution wouldn’t shift too much in just three layers, giving ReLU enough scope to create non-linear boundaries. No regularization has been used, in order to keep things simple, since the focus is on the optimization part and not the classification.
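The full code lives in the repository linked at the end; purely as an illustration, here is a rough TF2/Keras-style sketch of such a setup with the proposed update plugged in as the training step (the hidden layer sizes, batch size and other details here are assumptions, not the original configuration):

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# three dense layers with ReLU, Xavier (glorot) initialization, no batch-norm, no regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(256, activation="relu", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(10, kernel_initializer="glorot_uniform"),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))
    grads = tape.gradient(loss, model.trainable_variables)
    # parameter-free update: w <- w - (y * dy_dw) / (y + dy_dw^2), applied element-wise
    for w, g in zip(model.trainable_variables, grads):
        w.assign_sub((loss * g) / (loss + tf.square(g)))
    return loss

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(60000).batch(128)
for epoch in range(12):
    for x, y in dataset:
        loss = train_step(x, y)
    print("epoch", epoch, "last batch loss", float(loss))
```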

I obtained the following loss plot after running the algorithm for 12 epochs:

An important configuration detail that needs to be mentioned here is the initialization of the weights: this experiment uses the Xavier initializer.

Comparing this curve with the Adam optimizer, with all default hyper-parameters and under the same configuration, gives the following plots:

I simulated so many optimizers that TensorBoard ran out of colours, and coincidentally the colours of the two optimizers turned out to be the same. So, the one that is below is Adam.

So, although both of them reached almost the same convergence point, one could argue that Adam is still better. Well, to some extent, I wouldn’t disagree. But one also has to admit that having no hyper-parameters to worry about, while not increasing the computational load with higher-order derivatives and still getting almost similar performance, is quite comforting. Besides, I’ll mention a further distinction between these two in a bit.

One avenue that is open for exploration is checking the performance of this optimizer under different initialization settings. I ran one more experiment with TensorFlow’s Truncated Normal initialization. The plots are as follows:

The orange coloured curve is for the Truncated Normal initialization. From the image, I’d say the TN is marginally better.

So far, all the comparison has been made only with Adam; how does this optimization perform compared to the others? Well, I ran the same configuration with a whole bunch of optimizers from TensorFlow:

The reason I didn’t mention the others is that they were way worse. With hyper-parameter tuning those curves might have come closer, but that would defeat the main purpose of this research: the whole point was to discover an optimization algorithm that doesn’t require any tuning at all!

There is one important distinction between Adam and this optimization algorithm that I observed (as mentioned earlier): this optimization equation trains the weights into a flatter (wider-ranged) distribution that closely approximates a Laplacian distribution, moving away from the initial Gaussian distribution, compared to Adam. The Histograms tab of TensorBoard provides the evidence:

Again, sorry for the similar colours, but the first one in each histogram pair is for Adam and the second one is for this optimization algorithm. It is important to note that the weights in both runs were initialized to the same values by setting the same seed.

The following image compares the Adam histograms to the Truncated Normal initialized ones (orange).

With this, I would like to conclude the explanation and comparison of this optimization algorithm. The two key advantages it has over Adam that I would like to highlight are simplicity and autonomy. Once again, note that I have not used any higher-order derivatives or Hessians here.

The code for this experiment can be found on my repository here in the IDEA_2 section.

So, the algorithm has now been tested both theoretically and practically. Given the promising results on MNIST, it would be interesting to see how far it can go with bigger models such as CNNs, ResNets, RNNs, etc.

Throughout my work, the main principle that I kept in mind was Ockham’s Razor. In my belief, ‘Simplicity is the Key’!
