Reading the Deep Learning book

Chapter 1

Andreas Kirsch
BlackHC
3 min read · Mar 1, 2017

https://mitpress.mit.edu/books/deep-learning

I’ve started reading the Deep Learning book. You can find an online version at http://www.deeplearningbook.org/. I’m trying to take notes and really interact with the material. This post contains my thoughts for Chapter 1. The ones that still make sense to me, that is.

Disclaimer: The following all represents my personal opinion and is in no way related to my employer etc. Also I don’t know much, so please correct me when I’m wrong :)

Moravec’s Paradox

The main lesson of thirty-five years of AI research is that the hard problems are easy and the easy problems are hard.

Moravec’s paradox could be a simple reflection of our evolution. We have spent most of our evolution getting the ‘simple’ things right. This is why they feel simple to us, and this is why there is so much complexity hidden in them. By comparison, we have spent very little time playing chess or Go.

Definition of Deep Learning

We want to learn a hierarchy of concepts: more complex ones are learnt on top of simpler ones. Seen as a graph, this hierarchy is deep with many layers.

Meta-observation about the physical book

Modularity in Deep Learning

We want to have already learnt/trained modules easily available, so we don’t have to constantly relearn the same things. This would save both compute resources and time and lower the entry barrier for new experiments.

A summary

Machine Learning learns to predict or classify from given representations of data that are made up of features. The choice of representation is important, but it can be difficult to know what features to extract. Deep Learning helps by learning the representation itself, disentangling the factors of variation.

Deep Learning graphs as parallel programs

Another perspective on Deep Learning is seeing the Neural Network as a multi-step computer program in which each layer represents the execution of a parallel step. The information that is passed from layer to layer doesn’t necessarily only contain/encode factors of variation, but can also contain “program state”.
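To make the “program” view a bit more concrete for myself, here is a minimal sketch in plain Python. Everything in it is my own illustration and not from the book: the helper names (`relu`, `dense`, `run`) are made up, and the weights are picked by hand rather than learnt. Each layer is one step that computes all of its units from the same input (so they could run in parallel), and the list passed from step to step plays the role of the program state.

```python
def relu(values):
    """Element-wise rectifier: negative values become 0."""
    return [max(0.0, v) for v in values]


def dense(weights, biases):
    """Build one layer, i.e. one 'parallel step' of the program.

    Every output unit reads the same input state, so all units of a
    layer could be evaluated at the same time.
    """
    def step(inputs):
        pre_activations = [
            sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)
        ]
        return relu(pre_activations)
    return step


def run(program, inputs):
    """Execute the layers one after another, threading the state through."""
    state = inputs
    for step in program:
        state = step(state)
    return state


# Hand-picked (not learnt!) weights for a two-step program:
# step 1 computes max(0, a - b) and max(0, b - a),
# step 2 adds the two, which yields |a - b|.
program = [
    dense([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    dense([[1.0, 1.0]], [0.0]),
]

print(run(program, [3.0, 1.0]))
print(run(program, [1.0, 3.0]))
```

The intermediate list after step 1 is not really a nice “factor of variation” of the input. It is just scratch state that step 2 needs to finish the computation, which is how I read the “program state” remark.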

Some definitions

  • Computational neuroscience tries to understand how the brain works at an algorithmic level.
  • Connectionism is about the idea that “a large number of simple computational units can achieve intelligent behavior when networked together.” So what we call intelligence is emergent behavior of simple pieces, or put in a different way: intelligence is an emergent property of a chaotic system?
  • “Distributed representation” is about representing each input by multiple features to avoid a combinatorial explosion in the representations. So we avoid the explosion by actually making use of it. For example: we don’t want to have features like “black cat”, “white cat”, “black dog”, “white dog”. It’s better to have features like “black”, “white”, “cat”, “dog”. The number of features grows as a sum instead of a product. (This allows for nonsensical combinations like “black white” or “cat dog” though.) Obviously with Deep Learning, we don’t want to have to come up with the distributed representations ourselves; we want the computer to find the best ones (see also word embeddings).
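The sum-vs-product point from the last bullet can be sketched in a few lines of Python. This is only my own toy illustration using the cat/dog example from above. The function names (`encode_local`, `encode_distributed`, `feature_counts`) are made up for this post.

```python
from itertools import product

colors = ["black", "white"]
animals = ["cat", "dog"]

# One feature per combination: the count is a product of the factor sizes.
local_features = [f"{c} {a}" for c, a in product(colors, animals)]

# One feature per factor value: the count is a sum of the factor sizes.
distributed_features = colors + animals


def encode_local(color, animal):
    """Exactly one feature is active: the one naming the whole combination."""
    name = f"{color} {animal}"
    return [int(feature == name) for feature in local_features]


def encode_distributed(color, animal):
    """Several features are active at once: one per factor."""
    return [int(feature in (color, animal)) for feature in distributed_features]


def feature_counts(n_colors, n_animals):
    """Return (local, distributed) feature counts for the given factor sizes."""
    return n_colors * n_animals, n_colors + n_animals


print(local_features)
print(encode_local("black", "cat"))
print(encode_distributed("black", "cat"))
print(feature_counts(10, 10))
```

With only two colors and two animals both encodings need four features, so the toy example is a tie. The gap opens up quickly though: ten colors and ten animals need 100 combination features but only 20 distributed ones. The distributed vector `[1, 1, 0, 0]` would be the nonsensical “black white” mentioned above.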

Rule of thumb

When training a deep learning model:

  • around 5k labeled examples per class get you acceptable performance,
  • at least 10M labeled examples get you human-level or even super-human performance.

This is it. More substantial notes for other chapters to follow as I keep reading.

PS: it’s a pity Medium does not support LaTeX :-/

Andreas Kirsch
BlackHC

DPhil student at AIMS in Oxford; former RE at DeepMind, former SWE at Google; fellow at Newspeak House.