Gradient Based Optimizations Under The Deep Learning Lens

How much gradients you got? A lot…

Jake Batsuuri
Mar 12, 2020 · 11 min read

Consider the Keras operation,

output = relu(dot(W, input) + b)

… where W and b are tensors that are attributes of the layer. They’re the trainable parameters or the weights of the layer, specifically kernel (W) and bias (b). Consider the ReLU operation, from above. We’ll explore the significance of the Rectified Linear Unit operation in this article and the next few probably. Keep this in mind as we explore gradient learning throughout the article.

Quick quiz…

What are the 4 components of a machine learning or deep learning algorithm? Just try to guess or remember what they are…

I’ll wait…

  • The data set
  • The cost function
  • The model family
  • Optimization procedure

Remember that making machine learning algorithms component based makes it easy for us to combine them in different configurations to find appropriate uses for different solutions.

Machine learning has algorithms like linear regression, logistic regression and SVMs. These algorithms largely have guaranteed global minimums or analytic solutions. These are the convex set of optimization problems. Consider that even proving that there’s no global max or min is considered a solution, because there is proof and you can stop searching. While with deep learning we get to flirt with non-convex problems more, as you can imagine a lot of interesting problem may be the non-convex type.

Furthermore, neural networks are nonlinear, this nonlinearity causes many loss functions to become non-convex. Making it so that its hard to find a global minimum or maximum.

This forces us to start our search from a random place and use gradient based optimization to make the function as low as possible. This brings our model generated value to be as close to the true data generating value as possible, at least in a localized sense.

Stochastic Gradient Descent applied to cost functions with non convex solutions of course has no guarantee, and are heavily dependent on the initial conditions. One can always run several trials from different starting positions to make sure you have good enough of a solution.

Another consideration is that although linear models and convex set of models have analytics and guaranteed solutions, if the data type and size is overwhelming, we can use gradient descent to optimize them.

Stochastic Gradient Descent on Neural Nets

Generally the way it works is like this:

  1. The trainable parameters are initialized with random values
  2. Then we adjust these gradually, based on a feedback signal

If 2 alone is gradient descent, both steps 1 & 2 are stochastic gradient descent. Similar to encryption, sometimes you need an initialization vector. We will explore the initialization vector in depth later on.

If you do step 2 enough times, eventually you get a pretty good predictor program. That’s the general idea.

Let’s look at step 2 with even more detail now:

  • We have samples x and labels y
  • We run the network once to get prediction y_pred
  • This is called a forward pass
  • Then we compute the difference between the actual y and the predicted y_pred
  • This is called the loss function
  • We update all the weights so as the minimize the loss function, or the difference between y and y_pred
  • This is called backpropagation

Almost all of the above steps are easy enough, almost all of them. The last step of backpropagation is the most annoying one. Because how do you know which way to move minimize the loss?

Let’s say the loss function was a simple curve of:

We know that it’s minimum, its global minimum is at 0. If our first random guess, eventually lead us to the point of x at 2.3, how would we know to decrease x to go closer to 0.

We would take the derivative and realize that it’s 2x, and at 2.3, its 2(2.3) which gives us a positive value, so to we realize we have to go back to go lower. Otherwise if we go forward, we end up going higher and higher.

We adjust the weight so as to get to 1.5, then we calculate 2(1.5) and get positive, 3.0 to be specific. We are not at a minimum point, but at least we’re closer.

But the idea is we keep repeating till we get to 0.

Now this was a simple almost hypothetical example, what if we have to deal with more complicated functions with multiple variables and cost functions that look like this? What about not just a 2 dimensional graph, what about higher dimensional functions?

For the multi-variable case, we use the gradient, which is just a multivariable case of the derivative. Roughly the same principles apply to find the minimum point.

Unfortunately, there’s no guarantee of finding the global minimum, just the closest minimum, whether by luck it happens to be global or just local. Again this is non-convexity.

Now let’s define 2 things to help ourselves develop gradient learning for neural nets:

  • Cost or loss function
  • Output from the model, in essence the data structure of the prediction Y’

Again consider the Keras operation:

output = relu(dot(W, input) + b)

The weights W are initialized to some small random values and the biases b should be zero or small positive numbers. But the left side of the equation is the output. We will come back to this… but keep this in mind while we explore the cost function in depth.

Cost Functions

We generally use the cross-entropy between the training data and the model’s predictions as the cost function. Remember that cross entropy comes to us from Information Theory, it is a measure of the distance between probability distributions.

Typically we will have a parametric model that defines some sort of distribution and we simply use the principle of maximum likelihood. A conditional distribution is:

We have established that at least for linear models the principle of maximum likelihood and minimiziation of the mean squared error are equivalent.

So we will use maximum likelihood to develop our algorithm. You will see the benefits soon.

Maximum Likelihood Principle

Consider a set of m examples:

… drawn independently from the true data generating function:

Let p:

… be a parametric family of probability distributions over the same space indexed by:

In other words, p maps any configuration x to a real number estimating the true probability:

The maximum likelihood estimator for theta is then defined by:

This product of many probabilities can be inconvenient for many reasons:

  • It is prone to numerical underflow
  • Underflow in the sense that a certain nonsensical value ruins the final result, in this case if one of the probabilities is 0, it ruins the entire product

Logging the probabilities makes it slightly better and removes the underflow problem, yet it doesn’t change the arg max.

Now if we rescale this, it doesn’t change the arg max. And we can express the expression as an Expectation:

One way to interpret the maximum likelihood estimation is to think of it as minimizing the dissimilarity between the empirical distribution defined by the training set and the model distribution.

The degree of dissimilarity between the two measured by the Kullback-Leibler Divergence:

In other words, this divergence is the expectation of the logarithmic difference between the model and data sets.

Kullback-Leibler Divergence is a type of cross entropy measure.

The data generating process, is constant, therefore when we train the model to minimize the KL divergence, we need only minimize the second part of the expression:

Which is equivalent to the maximization of the probability generating process of the model.

Empirical distribution defined by the training set, is our closest approximation to the true data generating process, alpha.

The probability generating function defined by our trained model, is the thing we have, beta.

So what we are trying to do here is to minimize the cross entropy between alpha and beta.

  • Minimizing the cost function is an attempt to match the model distribution to the empirical distribution
  • Maximizing the likelihood of the estimator is an attempt to match the model distribution to the empirical distribution
  • Minimizing the KL divergence, cross entropy, is also an attempt to match the model distribution to the empirical distribution
  • When the above 2 are equivalent, the second becomes useful because we know that KL divergence has a known minimum value of zero
  • The negative log likelihood can become negative

Conditional Log Likelihood and Mean Squared Error

We can utilize the maximum likelihood estimator to estimate the conditional probability:

… which really is the point of machine learning, to predict y, given x.

This is the most common situation because it forms the basis for most supervised learning.

The conditional maximum likelihood estimator is:

… assuming the examples are assumed to be independent and identically distributed, the above can be decomposed into:

Linear Regression as Maximum Likelihood

Linear regression takes in inputs:

and spits out:

… we map from x to y, with the implicit criteria that we minimize the MSE. Now we revisit linear regression from the point of view of maximizing the likelihood estimation, which accomplishes the same goal.

Instead of a single y prediction, we can think of the model as producing the conditional probability of y given x.

If we had an infinitely large dataset, we might sample it a bunch of times. And sometimes you might get the exact same samples. But these identical samples might produce different conditional probabilities.

The goal of the learning algorithm now is to fit the distribution p(y|x) to all of the different y values.

What is a Parametric Model?

We define p(y|x):

… where the:

… gives the prediction of the mean of the Gaussian distribution. In this example, we assume the variance is fixed.

The maximum log likelihood has several desirable properties…

Properties of the Maximum Likelihood Estimator


The main appeal of the max likelihood estimator is that it can be shown to be the best estimator asymptotically, meaning as the number of examples m goes to infinity, the estimation gets closer and closer to the real deal.

Under the right conditions the max likelihood estimator has the property of consistency. So what are the right conditions?

  • The true distribution must lie within the model family distribution
  • The true distribution must only have one value of true parameter


On top of being consistent, max likelihood estimators are statistically efficient, meaning, they can reach the same level of generalization error with fewer examples, m.

These two properties then make max likelihood estimators the choice for machine learning.

We often denote our cost function as J and it is equivalent to the cross-entropy between the training data and the model generated data, given by:

This is the most generic cost function, depending on the specific model, it might change. But the part that changes is the log p part.

Another advantage of using the maximum likelihood method is that once you choose the model p(y|x), there is no need to determine a cost function as that is log p(y|x).

Learning a Conditional Statistic

Sometimes you can describe a distribution pretty well just from a couple of numbers. Take for example that the average IQ score is 100 with a standard deviation of 15. From that alone you know that 68% of people have IQs between 85 and 115.

Some of the most famous statistics measures are the central tendency measures like mean, median. Let’s say we wanted to learn the mean of y. We call our predictor f:

Our neural network is equivalent to f. Given a class of function, our neural network mimics this function. Minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.


Another cost function that gives a different statistic is:

yields a function that predicts the median value of y for each x. This cost function is commonly called mean absolute error.

Mean squared error and mean absolute error often result in poor results when used with gradient based optimization. The output units produced by these have really small gradients. Whereas cross-entropy cost functions have the desirable high gradients.

Up Next…

Coming up next, output units. If you would like me to write another article explaining a topic in depth, please leave a comment.

For table of contents and more content click here.


Adams, R. A. (2017). Calculus. Prentice Hall.

François, C. (2018). Deep Learning mit Python und Keras. MITP-Verlags GmbH & Co. KG.

Goodfellow, I. (2017). Deep Learning. MIT Press.

Nicholson, K. (2009). Linear Algebra with Applications.

Sutton, R. S. (2018). Reinforcement Learning. A Bradford Book.

Wackerly, D. D. (2007). Mathematical Statistics with Applications. Belmont, CA: Nelson Education.

(n.d.). A First Course In Linear Algebra — Open Textbook Library. Retrieved February 24, 2020, from

The Startup

Get smarter at building your thing. Join The Startup’s +724K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium