On Why Gradient Descent is Even Needed

Daniel Burkhardt Cerigo
4 min read · Oct 29, 2018


Gradient descent is taught as a de facto part of machine learning, but when I was asked some questions about why we even use it, I realised I wasn't crystal clear on the answer, so I went and made sure of it myself.

I was giving a presentation to a group of very talented young mathematicians at King's College London Mathematics School, and during the talk I showed a slide from Andrew Ng's classic Stanford Machine Learning MOOC. Here it is:

It shows how the cost function J (also called the error or loss) varies as we alter our model parameters θ1 and θ2, i.e. as we "move" in parameter space, thus tracing out a surface. The slide is shown to visually represent, and help explain, how gradient descent works. We start at the uppermost point (the black x-mark) and take a short step in the direction of the gradient of the surface at that point (strictly, the opposite direction of the gradient, so that we go "down" and not "up"), with the goal of reaching a trough, or minimum, of the cost function, so that our model makes predictions that are close(r) to the actual labels of our training data.
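That "short step against the gradient" can be sketched in a few lines. This is a minimal illustration on a made-up quadratic cost (a simple bowl standing in for the surface on the slide), not the actual example from the course; the cost function, starting point and learning rate are all assumptions chosen for clarity.

```python
# Hypothetical 2-parameter cost surface: a simple quadratic bowl with its
# minimum at (theta1, theta2) = (3, -1), standing in for J(theta1, theta2).
def cost(theta1, theta2):
    return (theta1 - 3.0) ** 2 + (theta2 + 1.0) ** 2

def gradient(theta1, theta2):
    # Analytic partial derivatives of the quadratic above.
    return 2.0 * (theta1 - 3.0), 2.0 * (theta2 + 1.0)

theta1, theta2 = 0.0, 0.0   # the "black x-mark" starting point
learning_rate = 0.1

for _ in range(100):
    g1, g2 = gradient(theta1, theta2)
    # Step in the *opposite* direction of the gradient, so we go "down"
    # the surface rather than "up" it.
    theta1 -= learning_rate * g1
    theta2 -= learning_rate * g2

print(theta1, theta2)  # converges towards the minimum at (3, -1)
```

Each pass of the loop is one "short step"; repeat enough times and the parameters settle into the trough.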

We had already had a Q&A after the talk, but afterwards a few students approached me with more detailed questions. There were a number of interrelated questions raised specifically by this slide, and at the time I don't believe I gave (or even had) a completely satisfactory answer to them. So here are the revised questions and answers.

Q: “Do you actually have that surface? If so why not just take the minimum of it?”
Q: “Do you really have all the gradients? If not then how do you do gradient descent? And if so, why not also just pick one where the gradient is zero?”

[EDIT]: These are very much contextual questions, specifically directed at that slide and at what we had already discussed during the talk. So to clarify:

1. By this point in the talk we had already agreed that we wouldn't be able to solve any of this analytically.
2. When they asked "just take the minimum of it?" they specifically meant taking the minimum over the points that had been plotted to generate the cost surface in the slide, not the true/analytical minimum.
3. Likewise with the gradients, their reasoning was: if we can calculate (and compute!) the gradients at enough points to do gradient descent, why not just pick the vector in that computed vector field with the smallest norm?

One-liner Answer: We use gradient descent because of the computational limitations we hit with large models, not because it is some special "learning" algorithm. For example, just repeatedly choosing random sets of parameters and keeping the best "would work", if we had a powerful enough machine and enough time.
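To make the "random guessing would work" point concrete, here is a minimal sketch of that strategy. The cost function is again a made-up quadratic stand-in, and the sampling range and iteration count are arbitrary assumptions; the point is that nothing about it is clever, it just needs vast amounts of compute as the parameter count grows.

```python
import random

# Made-up cost: a quadratic with its minimum at (1, 1, 1).
def cost(params):
    return sum((p - 1.0) ** 2 for p in params)

n_params = 3  # tiny on purpose; real models have millions of parameters

# Repeatedly draw random parameter sets and keep the best one seen so far.
best_params = [random.uniform(-10, 10) for _ in range(n_params)]
best_cost = cost(best_params)

for _ in range(100_000):
    candidate = [random.uniform(-10, 10) for _ in range(n_params)]
    c = cost(candidate)
    if c < best_cost:
        best_params, best_cost = candidate, c

print(best_cost)  # small for 3 parameters; hopeless for a million
```

With three parameters this lands near the minimum quickly; the same loop in a million-dimensional space would essentially never get close, which is exactly the computational wall the one-liner refers to.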

Abridged Answer:
Mathematically, you do have the entire cost surface: its functional form is fully specified (that's what your model and cost function are), and all you have to do is put in actual values for your model parameters and compute it. You also have, mathematically, the entire vector field of gradients for that surface, i.e. you have all the gradients. But in reality you only have what you are able to compute with your limited resources and time. The aim of the game is to find a set of parameters that gives a low error, and there are no rules on how you do that. But trying to do a grid search (which is akin to plotting/computing the loss surface) in a parameter space of a million or more dimensions is going to be a slow process (think years, even on current state-of-the-art computers). It is precisely a computational limitation that gradient descent overcomes, not a mathematical one. In reality it might be better to think of it as a "search" algorithm rather than an "optimisation" algorithm: it efficiently searches about in parameter space for areas of low(er) loss.
A nice way to grasp the computational gain gradient descent gives you over, say, a grid search across the model parameters is to think about the dimension of the space each one traverses. If your model has a million parameters, then a volume in that space is a million-dimensional volume, and volumes grow like V = l^D; that's going to be painfully big. Gradient descent, by contrast, only traverses a line through parameter space, and a line is only 1-D. Hopefully that's persuasive.
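The V = l^D blow-up is worth seeing with actual numbers. A quick back-of-envelope calculation (the grid resolution and step count below are illustrative assumptions, not measurements):

```python
# Back-of-envelope comparison: points a grid search must evaluate vs. the
# number of steps a gradient descent run might take.
grid_points_per_dim = 10  # a very coarse grid: just 10 values per parameter

for n_params in (2, 10, 100):
    grid_evaluations = grid_points_per_dim ** n_params  # V = l^D
    print(f"{n_params} parameters -> {grid_evaluations:.0e} grid evaluations")

# Gradient descent instead walks a 1-D path: a run of, say, 1,000 steps
# costs roughly 1,000 gradient evaluations, whether the model has 2
# parameters or a million.
gd_evaluations = 1_000
```

Even at a laughably coarse 10 points per dimension, 100 parameters already demands 10^100 evaluations, far more than the roughly 10^80 atoms in the observable universe, while the gradient descent path stays at a fixed, dimension-independent budget.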

Ultra-Extended And Complete With Coded Examples Answer:

See this notebook
