Published in Ingeniously Simple

Shallow learnings on Deep Learning

I’m really enjoying the Deep Learning course from Andrew Ng on Coursera. I’d highly recommend it! This article gives the insights I got (having done a bit of machine learning 20 years ago).

What did I know about roughly similar stuff years ago?

In a galaxy far away (about 20 years ago) I did a PhD in Computer Science. The half-baked idea was that everyone’s gait is unique; therefore, you can recognize a person by the way they walk. I wasn’t the only one to notice this; Shakespeare said as much in The Tempest:

Great Juno, comes; I know her by her gait.

So, how do you recognize someone by the way they walk? Here’s an overly simplified version (mainly because it was 20 years ago!).

Well, firstly, you gather a tremendous amount of training data. Back then, a vast data set meant filming a hundred or so people walking back and forth in front of a green screen. Each recording was labelled with the subject’s ID and whatever other information was available.

Pre-processing images to separate out a silhouette

Next up, you do some pre-processing in a consistent way across all of that data. High-resolution data was too complex to handle back then. Walking was filmed in front of a green screen, which made it pretty simple to extract a black and white recording of each subject’s silhouette. Pre-processing reduced the dimensionality of each frame from 1024 x 768 x 3 to a more manageable 64 x 64.
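Here is a minimal sketch of that kind of pre-processing, in plain Python so it runs without dependencies. The green-dominance threshold, the function names and the majority-vote downsampling are all assumptions made for illustration; they are not the original pipeline.

```python
def silhouette(frame, margin=40):
    """Return a binary mask: 1 where a pixel is NOT green screen, else 0.

    `frame` is a list of rows, each row a list of (r, g, b) tuples.
    A pixel counts as background when green exceeds both red and blue
    by at least `margin` (an assumed, hand-picked threshold).
    """
    return [
        [0 if (g - max(r, b)) >= margin else 1 for (r, g, b) in row]
        for row in frame
    ]


def downsample(mask, out_h, out_w):
    """Shrink a binary mask by majority vote over equal-sized blocks."""
    in_h, in_w = len(mask), len(mask[0])
    bh, bw = in_h // out_h, in_w // out_w  # block size; assumes exact division
    small = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            block = [
                mask[y][x]
                for y in range(i * bh, (i + 1) * bh)
                for x in range(j * bw, (j + 1) * bw)
            ]
            row.append(1 if 2 * sum(block) >= len(block) else 0)
        small.append(row)
    return small


# Toy 4 x 4 frame: green background with a dark 2 x 2 "subject" top-left.
GREEN, DARK = (20, 200, 30), (40, 35, 30)
frame = [
    [DARK, DARK, GREEN, GREEN],
    [DARK, DARK, GREEN, GREEN],
    [GREEN, GREEN, GREEN, GREEN],
    [GREEN, GREEN, GREEN, GREEN],
]
mask = silhouette(frame)
small = downsample(mask, 2, 2)
print(small)  # [[1, 0], [0, 0]]
```

The same two steps applied to a real frame would turn three colour channels into one binary channel and then shrink the grid, which is where the drop in dimensionality comes from.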

Now for the tricky bit. What features can you find in that data that uniquely identify the subject? People tried many things:

  • Label the key points and fit a mathematical model. The parameters of the model describe the gait (here’s a more recent example called the Eigenwalker model).
  • Project the training set into a high-dimensional space. Use principal component analysis to find the directions of maximum variation, keep the subset of those directions that explains a high percentage of the variance, and use the projection onto them as a feature vector (closely related to Eigenface).
  • Convolve each image with a masking function and observe how the area changes over time. Use this as a feature vector. I did this; the intuition is that (from the side) gait is self-occluding and gives you a sinusoid-like shape. It turns out this shape is pretty good at identifying humans!
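The masked-area idea in the last bullet can be sketched roughly as follows. This treats the masking step as an element-wise product with a fixed window; the mask itself (a lower-half "legs" window) and the function names are hypothetical stand-ins, not the masks used in the original work.

```python
def masked_area(silhouette, mask):
    """Sum of silhouette pixels that fall under the mask (element-wise product)."""
    return sum(
        s * m
        for s_row, m_row in zip(silhouette, mask)
        for s, m in zip(s_row, m_row)
    )


def gait_signature(frames, mask):
    """One area value per frame; the sequence over time is the feature vector."""
    return [masked_area(frame, mask) for frame in frames]


# Hypothetical mask selecting the bottom half of a 4 x 4 silhouette (the legs).
legs_mask = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]

# Three toy frames: legs together, legs apart, legs together again.
together = [[0, 1, 1, 0]] * 4
apart = [[0, 1, 1, 0]] * 2 + [[1, 1, 1, 1]] * 2
signature = gait_signature([together, apart, together], legs_mask)
print(signature)  # [4, 8, 4]
```

Over a real walking sequence the area under the mask rises and falls with each stride, which is the periodic shape described above.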

Once you’ve got your feature vector, it’s time to see how good it is and open the classifier toolbox. Again, you’ve got many options; here are a couple I tried.

  • Linear discriminant analysis (LDA): for the training data, find a space that maximizes the separation between classes and minimizes the spread within each class.
  • Support Vector Machines (SVM): one of the drawbacks of LDA is that it’s linear (a straight-line separation). An SVM lets you draw non-linear decision surfaces by using the “kernel trick”. The kernel trick can result in better accuracy, and you get to use more Greek symbols in your thesis.

Straight lines are not much good for separation on the left image! (Image by Alisneaky, SVG version by User:Zirguezi, own work, CC BY-SA 4.0.)
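To make the kernel trick a little more concrete, here is a small sketch using a degree-2 polynomial kernel. The choice of kernel and the sample points are mine, picked only because the arithmetic is easy to follow; an SVM library would do all of this for you.

```python
import math


def phi(v):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return dot(a, b) ** 2


a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(a), phi(b))  # inner product in the 3-D feature space
implicit = poly_kernel(a, b)    # same value, computed in the original 2-D space
print(math.isclose(explicit, implicit))  # True
```

The kernel gives the inner product in the higher-dimensional space without ever building that space. In this particular feature space a circle x1² + x2² = r² becomes a flat plane (the first and third coordinates sum to a constant), which is why a separator that is linear there looks curved in the original picture.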

The final challenge: break out your cross-validation toolbox, evaluate your combination of pre-processing, feature detection and classification, and see what you get (90% recognition!).
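As a rough sketch of what that evaluation loop looks like, here is k-fold cross-validation in plain Python. To keep it short it uses a nearest-neighbour classifier rather than LDA or an SVM, and the feature vectors and subject labels are made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def nearest_neighbour(train, query):
    """Label of the training vector closest (squared Euclidean) to `query`."""
    best = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    return best[1]


def cross_validate(samples, k):
    """Fraction of held-out samples classified correctly across all k folds."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        train = [samples[i] for i in train_idx]
        for i in test_idx:
            features, label = samples[i]
            correct += nearest_neighbour(train, features) == label
    return correct / len(samples)


# Made-up feature vectors for two subjects, interleaved so every fold sees both.
samples = [
    ([4.0, 8.0, 4.0], "A"), ([6.0, 6.5, 6.0], "B"),
    ([4.2, 7.9, 4.1], "A"), ([6.1, 6.4, 5.9], "B"),
    ([3.9, 8.1, 4.0], "A"), ([5.9, 6.6, 6.1], "B"),
]
accuracy = cross_validate(samples, k=3)
print(accuracy)  # 1.0 on this deliberately easy toy data
```

The important property is that a sample is never in the training set of the fold that tests it, so the score reflects how the whole pipeline copes with recordings it has not seen.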

What’s changed with all this deep learning stuff?

OK, after that trip down memory lane, what stood out for me after a few hours of running through some of the courses?

The core idea behind Deep Learning is the neural network. A neural network (NN) is a network of functions that together map an input (commonly known as X) to an output (Y). As an example, a deep learning network might take a bunch of inputs representing an advert and compute an output Y that represents the probability of a user clicking on that advert. Or an NN might take an image and output the probability that it contains a face.

Each neuron has one or more inputs and one output. Each input has an associated weight (the larger the weight, the more difference that input makes to the result). Each neuron also has an activation function that determines the output value.

A neuron outputs something like tanh(WX + b), where W holds the weights, X the inputs, b is a bias term and tanh is the activation function.
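As a minimal sketch, a single neuron of that form fits in a few lines of plain Python. The input values and weights below are made up, and a real network would use a library such as NumPy to compute many neurons at once.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through tanh."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(z)


# Made-up numbers: the second input has the larger weight, so it dominates.
x = [0.5, -1.0]
w = [0.1, 2.0]
b = 0.0
y = neuron(x, w, b)  # tanh(0.05 - 2.0) = tanh(-1.95), close to -1
print(y)
```

Because tanh squashes its argument into the range (-1, 1), the output stays bounded however large the weighted sum gets.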

The basic (hugely oversimplified) idea is that you train the network by:

  • Running it forwards (forward propagation) to see how good its predictions are against the training data.
  • Adjusting it backwards (backward propagation) to improve its performance against the training data.
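Those two steps can be sketched for the smallest possible network: one tanh neuron with one input, trained by gradient descent on a squared-error loss. The data, learning rate and number of passes are arbitrary choices for illustration.

```python
import math


def forward(w, b, x):
    """Forward propagation: compute the prediction for one input."""
    return math.tanh(w * x + b)


def backward(w, b, x, target):
    """Backward propagation: gradients of 0.5 * (y - target)^2 w.r.t. w and b."""
    y = forward(w, b, x)
    dz = (y - target) * (1.0 - y * y)  # chain rule through tanh
    return dz * x, dz


def mean_loss(w, b, data):
    """Average squared-error loss over the whole data set."""
    return sum(0.5 * (forward(w, b, x) - t) ** 2 for x, t in data) / len(data)


# Toy training data: negative inputs map to -0.5, positive inputs to +0.5.
data = [(-2.0, -0.5), (-1.0, -0.5), (1.0, 0.5), (2.0, 0.5)]
w, b, learning_rate = 0.0, 0.0, 0.1
loss_before = mean_loss(w, b, data)

for _ in range(200):
    for x, target in data:
        dw, db = backward(w, b, x, target)
        w -= learning_rate * dw  # step against the gradient
        b -= learning_rate * db

loss_after = mean_loss(w, b, data)
print(loss_before, loss_after)  # the loss should drop after training
```

In a deep network the backward step repeats the same chain-rule bookkeeping layer by layer, which is the part that gets tedious to write by hand.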

Modern frameworks like TensorFlow, Keras and PyTorch make this much easier! You build the forward step by composing functions, and the library handles the more complicated process of backward propagation. Neat!

The learning environment in Coursera makes heavy use of Jupyter notebooks to see deep learning in action. Python is a lovely language for writing this and (at least for toy examples) running it in an interactive notebook is a great way to explore.

What about this “deep” stuff? Deep learning networks have a HUGE thirst for data. One of the face recognition datasets used in the course consisted of over 100M images! And it’s not just training data; there’s an enormous number of parameters too. The State of AI Report gives some examples, ranging from “just” 94M for ELMo through to 175B for GPT-3. Training these models costs a tremendous amount of computation and time. I’m looking forward to firing up my GPU to do this (again, this should be relatively simple).
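To get a feel for how quickly parameter counts grow, here is a small sketch that counts the weights and biases in a fully connected network. The layer widths are arbitrary and chosen only for illustration; the 64 x 64 input simply echoes the silhouette size from earlier.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases for a fully connected network.

    `layer_sizes` lists the width of every layer, input first, output last.
    """
    return sum(
        n_in * n_out + n_out  # one weight per connection, one bias per neuron
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# A 64 x 64 silhouette flattened to 4096 inputs, two hidden layers, 100 classes.
total = dense_parameter_count([64 * 64, 256, 128, 100])
print(total)  # 1094628
```

Even this tiny network has over a million parameters, almost all of them in the first layer, so it is easy to see how larger inputs and deeper stacks reach the millions and billions quoted above.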

Neural networks come in various shapes:

  • Standard neural networks map inputs to outputs. They work well for structured data.
  • Convolutional neural networks find patterns in 2D / 3D sources. You’d typically use these for image processing applications.
  • Recurrent neural networks work well for sequences, such as text in natural language processing or notes in music generation.

If you’ve ever done a computer vision course, you’ll probably be familiar with edge detection (such as Sobel). You take a matrix (that you define), convolve it with the image and, voilà, the edges appear.

The Sobel operator in action.
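For reference, here is a dependency-free sketch of the horizontal Sobel kernel applied to a toy image. The image is made up, and in practice you would reach for an image library rather than nested loops.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def filter2d(image, kernel):
    """Slide the kernel over the image ("valid" region only, no padding).

    Strictly this is cross-correlation (the kernel is not flipped), which is
    also what most deep learning libraries call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4 x 6 toy image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
edges = filter2d(image, SOBEL_X)
print(edges)  # [[0, 36, 36, 0], [0, 36, 36, 0]]
```

The response is zero over the flat regions and large where the window straddles the jump from dark to bright, so the edge shows up as a band of high values.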

The intuition for convolutional neural networks is that you don’t need to define the matrix yourself; you can instead let the network derive the parameters necessary to distinguish between the objects.
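A toy version of that idea, in one dimension: start with an all-zero filter and let gradient descent recover a hand-made edge filter purely from example inputs and outputs. The signals, learning rate and iteration count are assumptions chosen so the sketch converges quickly; a real convolutional network learns many filters at once, in 2D, against a classification loss.

```python
def filter1d(signal, kernel):
    """Slide a 1-D kernel over a signal ("valid" region only)."""
    k = len(kernel)
    return [
        sum(signal[i + d] * kernel[d] for d in range(k))
        for i in range(len(signal) - k + 1)
    ]


def loss_and_grad(kernel, signals, targets):
    """Summed squared error and its gradient with respect to each kernel entry."""
    loss = 0.0
    grad = [0.0] * len(kernel)
    for signal, target in zip(signals, targets):
        pred = filter1d(signal, kernel)
        for i, (p, t) in enumerate(zip(pred, target)):
            err = p - t
            loss += err * err
            for d in range(len(kernel)):
                grad[d] += 2.0 * err * signal[i + d]
    return loss, grad


# The "answer" the network should discover: a simple edge detector.
hand_made = [-1.0, 0.0, 1.0]
signals = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
]
targets = [filter1d(s, hand_made) for s in signals]

kernel = [0.0, 0.0, 0.0]  # start with no knowledge of the filter
loss_before, _ = loss_and_grad(kernel, signals, targets)
for _ in range(500):
    _, grad = loss_and_grad(kernel, signals, targets)
    kernel = [k - 0.01 * g for k, g in zip(kernel, grad)]
loss_after, _ = loss_and_grad(kernel, signals, targets)

print([round(k, 3) for k in kernel])  # very close to [-1.0, 0.0, 1.0]
```

Nobody told the training loop what an edge is; the filter values fall out of minimizing the error on the examples.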

The paper, Visualizing and Understanding Convolutional Networks, explores this and visualizes how different layers of the network (trained on huge databases) activate.

Visualizing features in a fully trained model

My takeaway is that with enough parameters, training data and computational resources, it’s possible to throw data at the right shape of neural network and have it derive the interesting characteristics. Unfortunately, it’s not quite as simple as that! Neural networks have not just parameters to train, but also hyperparameters that control the network itself (such as the learning rate or the number of layers). There are hard-won heuristics available for many of these, but it seems to be an area that is hand-tuned at the moment, and I’m sure that for every success I’ve read about there are many more failures!

OK, so I’ve done a course on the basics. What next?

Well, it’s time for me to try some Kaggle problems and see if I can put the theory into practice.


