Nothing Deep about Deep Learning

But everything about its capabilities is.

Mustafa Eisa
5 min read · Mar 6, 2022

At Chicago, I recall undergraduate students gushing about deep learning to Professor Lafferty after class. (The class being a very popular machine learning fundamentals course.) There was hesitation in Professor Lafferty's voice at the time; it felt as though he were discussing a controversial, politically sensitive issue. Back then we knew only a fraction of what we know now, and many of us were still wondering how deep learning could be so much more than non-linear regression. I myself had no motivation or curiosity to understand the subject, and even the trio at Stanford, the ones who gave us the best-selling ML book of all time, devoted only a few paragraphs to it in the first edition of their textbook, saying much the same thing.

The first day of Prof. Lafferty's extremely popular machine learning course. There were so many attendees that a group of us had to sit on the floor. You can see the corner of my head in the lower right.

As someone who, in some ways, has been trying to keep one foot in academia and the other in industry, I can understand the mixed feelings: deep learning is effective to the point of being exciting, and it has led to a renaissance of sorts in the field over the past couple of decades. And yet, it can be theoretically unfulfilling and horribly challenging to intuit.

The current state of the art.

The past five years have seen the greatest interest yet in understanding how these things really work. Recent research from Google Brain points to the possibility that networks and kernel methods are closely related, suggesting that infinitely wide networks may in fact be kernel machines. Power players in the field have also made seemingly impossible discoveries about smoothness conditions and the structure of the manifolds we optimize on, like how the stochasticity of certain optimization techniques helps avoid saddle points and other ridge, cliff, or trench-like structures in the loss manifold. And I'll forever be amused by Professor Recht's 2008 paper, which suggests we could get deep learning performance without deep learning by trading optimization for randomization. This is the same guy who taught us a full-semester course on optimization algorithms and then advocated for using random search instead, claiming it is surprisingly effective. He is a realist.
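If you have never seen the trade of optimization for randomization in action, here is a minimal NumPy sketch in the spirit of random Fourier features (my own toy example, not code from the paper, and every constant is made up): the kernel is approximated by random cosine features, and the only "training" left is a linear ridge solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem: y = sin(3x) plus noise.
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

# Random Fourier features approximating an RBF kernel (Rahimi & Recht style).
D, sigma, lam = 300, 0.5, 1e-3                 # feature count, bandwidth, ridge penalty
W = rng.standard_normal((X.shape[1], D)) / sigma
b = rng.uniform(0, 2 * np.pi, size=D)

def features(Z):
    # phi(z) = sqrt(2/D) * cos(zW + b), so phi(x).phi(y) ~ exp(-||x - y||^2 / (2 sigma^2)).
    return np.sqrt(2.0 / D) * np.cos(Z @ W + b)

# The "learning" step: a closed-form ridge regression in the random feature space.
Phi = features(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

X_test = np.linspace(-2, 2, 5).reshape(-1, 1)
print(features(X_test) @ w)                    # roughly sin(3x), with no "deep" training at all
```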

The sweetest taboo.

Still, deep networks violate the "good" properties we hope to see in the carefully designed models we grew familiar with and studied. Those of us with more classical theoretical training (such as myself) find these machines hard to accept. Statistically, they are not parsimonious when it comes to parameter specification; they are grossly over-parametrized. From a mathematical programming point of view, parameter optimization not only departs from convexity, it can also depart from differentiability in a highly non-trivial way. Computationally, state-of-the-art deep learning technologies may be the only workloads that push our current digital (binary) computers to their limits. Lastly, and perhaps most dissatisfying of all, interpretability is as speculative as Wall Street analysts are with martingales, especially when you consider that you're just looking at a bunch of highly recursive terms inside a regression problem (the recursion is sketched below).

An example of how a datum x gets tossed around inside a deep network.
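To spell that out in code, here is a minimal sketch (my own illustration, not tied to any particular framework) of those "highly recursive terms": each layer is an affine map followed by an elementwise nonlinearity, and the network is just their composition feeding a plain regression readout.

```python
import numpy as np

def forward(x, layers):
    # The whole "deep" part: f(x) = W_L * relu(... relu(W_1 x + b_1) ...) + b_L,
    # i.e. repeated affine maps with a nonlinearity squeezed in between.
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, W @ h + b)      # ReLU hidden layers
    W, b = layers[-1]
    return W @ h + b                        # linear output layer: the regression readout

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 1]                      # a small network with two hidden layers
layers = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(forward(rng.standard_normal(4), layers))
```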

But are these things really not "good," or are we moving into a new age in which we must accept that the extent of our theoretical knowledge will trail far behind our empirical abilities? As magical as they seem, nothing is inherently deep, supernatural, or mysterious about deep networks or their design. But everything about their capabilities is.

Becoming fluent in deep learning.

My own journey and career in data science have taught me that, while I may often feel like an expert, I in fact know very little. This field is incredibly broad, and all I can truly appreciate is the fluency in mathematics that allows me to keep consuming new research, exploring, and learning. While deep learning is not something we fully understand, it is unlikely to be the final frontier of breakthroughs in our field. But for now, it's hard to argue that it isn't.

For those who would like the tools to get as close as possible to understanding these technologies, and potentially to contributing to their development, here is a list of recommended course material that will put you on equal footing with the best. I've broken the reading into three major categories. But I'll caution: the more you know about these things, the more perplexing their effectiveness becomes.

1. Some Fundamentals

Linear Algebra: For understanding dimensionality, how it applies to data, and why we're always running into limitations with it.

Mathematical Analysis: If you want to read any of the research papers on deep learning, you'll really need this. It's a life skill for a data scientist and will give you the ability to keep learning and understanding new technologies for years to come.

Mathematical Statistics: This may be a controversial addition to the list, but if you want to interpret the results of your models beyond the empirical level, it is critical. This is a huge, overly comprehensive text, so a bit of selectiveness in what to focus on is helpful.

2. Computational Mathematics & Methods

Matrix Algebra: For understanding computation, some of the basic algorithms, and how linear algebra is used practically. This text is short and a real pleasure to read.

Matrix Calculus: Not a popular topic since, in practice, it gets reduced to a few off-the-cuff rules in the appendix of other applied textbooks, but it is an important prerequisite for understanding optimization. I would skim this text and do a few proofs just to get comfortable with the concepts and notation; a few of the workhorse identities are written out at the end of this section. (Also, this is a great reference to keep handy.)

Algorithms for Non-Linear Optimization: For understanding the computational implications of network topologies and how their parameters are solved for. If you're already comfortable with optimization, you can probably skim this text. The algorithms section of Boyd & Vandenberghe is another good resource, and the book is chock-full of motivating applications; it's also a joy to read.
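As promised in the matrix calculus entry above, here are a few of the workhorse identities written out (standard results, stated from memory, so check them against your reference of choice before relying on them):

```latex
\nabla_x \,(a^{\top} x) = a, \qquad
\nabla_x \,(x^{\top} A x) = (A + A^{\top})\,x, \qquad
\frac{\partial}{\partial X}\,\operatorname{tr}(A X) = A^{\top}, \qquad
\frac{\partial}{\partial X}\,\log\det X = X^{-\top}.
```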

3. Modeling

Classical Statistical Learning Models: For understanding how networks are like extensions of all the basic models we study and intuitively understand. An extremely comprehensive text and a joy to read.

The Big Book of Deep Learning: This text is arguably the industry standard deep learning text. If you managed to read and understand the previous texts, you can skip part one entirely.

Theory of Deep Learning: I haven't read this text, as it is relatively new, but Yann LeCun co-signed it. If it's good enough for him, it's good enough for me. I skimmed a few pages and it seems very well written; from what I've seen, it consolidates and interprets what we know so far about how these black boxes work.

If you plan to work in specialized areas, such as audio recognition, you may need additional reading on, for example, digital signal processing or compressive sensing.


Mustafa Eisa

I am a computational mathematician with an interest in statistical learning, artificial intelligence, and various applications.