H.Heierlein/Unsplash

Beautiful Trap of Neural Networks

Ilya Hopkins
Published in Analytics Vidhya · Nov 21, 2019


Nowadays, Neural Networks (NN) have become a ‘fashionable’ attribute of every discussion of modern tech trends. People with almost any degree of technical (and even non-technical) background speculate about them with varying levels of confidence, often leaving an informed listener skeptical. And trying to explain these mysterious technologies through the anatomical structure of the human brain is a grave folly, and certainly not the way to gain credibility as an experienced data scientist.

Simply put, Neural Networks are a family of machine learning algorithms able to recognize patterns. This definition might seem fairly vague and not very telling, yet it is probably the most accurate plain-language description. The algorithms are indeed similar in structure: they connect inputs to outputs through rigorous, computationally intense calculations, and they can take on practically any type of raw data with minimal pre-processing. Their outputs satisfy almost every engineer’s demand, whether a regression or a classification problem is at hand. Using these algorithms we can approximate any smooth (and even not-too-smooth) function to a required degree of precision. Sounds like a dream; what could go wrong?!
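
For instance, a tiny feed-forward network can learn a noisy sine curve from raw 1-D samples. The sketch below uses scikit-learn’s MLPRegressor purely as an illustration of the idea; any comparable library would do.

```python
# Illustrative sketch only: a small feed-forward network learning a
# noisy 1-D function. MLPRegressor is an arbitrary choice of library.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))                  # raw 1-D inputs
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=500)   # noisy target

# Two small hidden layers are enough to capture the wiggles of sin(2x).
model = MLPRegressor(hidden_layer_sizes=(32, 32), activation="tanh",
                     solver="lbfgs", max_iter=5000, random_state=0)
model.fit(X, y)

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
print(np.round(model.predict(X_test), 2))              # close to sin(2x)
```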

But, as is always the case, this dream comes at a steep price: even the simplest NN models carry a high level of computational complexity. The number of parameters grows multiplicatively with the network’s width and depth, so these models are very prone to overfitting and require regularization. Hyper-parameter tuning becomes an intuitive game of hide-and-seek with very lengthy iterations.
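
To see that multiplicative growth in numbers: a fully connected layer with n_in inputs and n_out outputs holds n_in × n_out weights plus n_out biases. A tiny helper (purely illustrative) adds these up:

```python
# Illustrative only: counting parameters of a fully connected network.
# Each dense layer holds (n_in * n_out) weights plus n_out biases, so
# widening or deepening the net multiplies the total very quickly.
def dense_param_count(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(dense_param_count([100, 64, 64, 1]))    # 10,689 parameters
print(dense_param_count([100, 256, 256, 1]))  # 91,905 parameters
```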

Given these limitations, what is the cornerstone of NNs’ success? Their high level of performance! In many situations they outperform other models, and their results are unparalleled. A Neural Network simply keeps adjusting its parameters until the error can be reduced no further. This is called parameter optimization, and there is a whole family of gradient-based algorithms for it; a non-exhaustive list follows, with a bare-bones sketch of the update step they all share right after the list:

  • Adam — adaptive moment estimation
  • AdaGrad — adaptive gradient
  • AdaDelta — AdaGrad extension
  • Nesterov’s algorithm
  • RMSProp — Root-Mean-Square propagation
  • SGD — Stochastic gradient descent
  • LBFGS — Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm
  • Line search — line gradient descent
  • Conjugate gradient
  • Hessian Free (aka truncated Newton) algorithm
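
All of these share the same skeleton: compute the gradient of the loss and move the parameters against it. The snippet below is an illustrative sketch of that shared update (plain SGD with optional momentum), not a production optimizer.

```python
import numpy as np

# Illustrative sketch of the update rule the optimizers above share:
# step against the gradient of the loss, here with optional momentum.
def sgd_step(params, grads, velocity, lr=0.01, momentum=0.9):
    new_velocity = momentum * velocity - lr * grads   # accumulate velocity
    return params + new_velocity, new_velocity

# One step on a toy quadratic loss L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_step(w, grads=2 * w, velocity=v)
print(w)   # parameters nudged toward the minimum at the origin
```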

Sounds quite straightforward so far, doesn’t it? A variety of optimization options with a common denominator: the gradient descent approach.

But wait a second, what could possibly be undesirable in such an idyllic situation? It is the very phrase “gradient descent” that rings a bell for anyone with some background in calculus and applied math. The basic algorithm itself is as old as the hills, but it has a very unpleasant caveat: it can fool you, to be blunt. As basic calculus tells us, we may be descending into a local minimum instead of the desired global one, which is understandably a problem. On top of local minima, the loss surface can contain saddle points, plateaus and flat regions, as well as other irregularities such as cliffs and exploding gradients. Even quite a simple polynomial, graphed around the origin, can turn out to be a surprisingly bumpy curve with several minima of different depths.
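
To watch the trap happen numerically, here is a toy run of plain gradient descent on one such polynomial, f(x) = x^4 - 3x^2 + x (my own illustrative pick): depending on where we start, we settle into minima of very different depth.

```python
# Illustrative only: plain gradient descent on f(x) = x**4 - 3*x**2 + x,
# a simple polynomial with two minima of different depth. Where we end
# up depends entirely on where we start.
def f(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

for x0 in (-2.0, 2.0):            # two different initializations
    x = x0
    for _ in range(2000):
        x -= 0.01 * grad(x)       # plain gradient descent step
    print(f"start {x0:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")

# One starting point reaches the deeper (global) minimum near x = -1.3;
# the other gets stuck in the shallower local one near x = 1.13.
```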

Yes, one can argue that such situations can be dealt with analytically. But the beauty of most machine learning algorithms begins exactly where calculus ends: we would not need these computational approaches at all if we could simply reach the result with a pen and some (much) scratch paper. We need and use these algorithms precisely for the functions whose desired minimum we could not find analytically, and quite often that is even provably impossible.

Local and Global Minima (Source — Wikimedia Commons)

So, it sounds like a trap: our best-performing models deliver results that are not entirely trustworthy. When there are many local minima with nearly equal loss values, this may not matter much. But what if the ‘true’ global minimum has a dramatically lower cost than any of the local minima found by our assortment of gradient-descent optimizers? In most cases we cannot even plot the loss function (it is high-dimensional for practically any model with more than a couple of features), and even if we could, the gradients often remain vanishingly small.
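
As a quick illustration of how run-dependent the outcome can be, the toy experiment below (illustrative only) trains the same deliberately small network from a few different random initializations; the final training losses tend to differ, a sign that the runs have landed in different regions of the loss surface.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative only: the same deliberately small network, trained from
# different random initializations, tends to settle at different final
# losses, i.e. in different regions of its loss surface.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(3 * X[:, 0])           # too wiggly for 5 hidden units to fit well

for seed in range(3):
    net = MLPRegressor(hidden_layer_sizes=(5,), activation="tanh",
                       max_iter=2000, random_state=seed)
    net.fit(X, y)
    print(f"seed {seed}: final training loss = {net.loss_:.4f}")
```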

This poses an open question and a large field for academic research. As Ian Goodfellow, Yoshua Bengio and Aaron Courville point out in their textbook on deep learning: “Whether networks of practical interest have many local minima of high cost and whether optimization algorithms encounter them remain open questions. For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Today, that does not appear to be the case. The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a point in parameter space that has low but not minimal cost.” (Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016)

So we can still use Neural Networks and achieve great results, hoping that future theoretical findings will explain more of their characteristics. We can experiment with their topologies, limited only by our imagination and computational power, but we must remember that our results might not (yet) be 100% credible.
