A dummy’s guide to Deep Learning (part 2 of 3)

Kun Chen
Published in The Bleeding Edge
Apr 3, 2016

Now it’s time for us to see how deep learning really works! In case you missed the previous part and are now wondering what deep learning has to do with you, go check it out!

In this part, we’ll show you all the basic concepts you need to get started with deep learning.

First, in case you know absolutely nothing about machine learning…

Machine learning problems are typically ones where you want a computer to answer questions without being explicitly programmed to do so. For example, the question can be something like “What’s the price of my 1800 sqft apartment in Seattle?” or “Is this news article telling the truth?”

Such questions can often be translated into the following form:

Given some input X, what is the correct output Y?

For the example questions above: information about my apartment is the input, and an estimated price is the output. The news article is the input, and the output should be “yes” or “no” indicating whether it’s telling the truth.

There are two broad kinds of machine learning: supervised and unsupervised. Here we are only going to talk about supervised learning, where we show the computer program a bunch of example inputs along with the correct answer to each of them.

These examples are called “training data”, and the process of showing them to the program is called “training”. You show them to the program over and over again until it becomes good at predicting the right answer to something it hasn’t seen yet. The process of checking whether the program is good enough is called “testing” or “evaluation”. And we call the program a “model”.
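To make this concrete, here’s a minimal sketch of the train-then-test cycle using scikit-learn. The apartment numbers are invented purely for illustration; a real model would need far more data.

```python
# A minimal sketch of supervised learning: train on examples with known
# answers, then test on examples the model has never seen.
# All apartment data below is made up for illustration.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Input X: [square feet, number of bedrooms]; output y: price in dollars.
X = [[1800, 2], [2400, 3], [900, 1], [3000, 4], [1500, 2], [2100, 3]]
y = [650_000, 820_000, 400_000, 1_050_000, 560_000, 740_000]

# Split the examples: most for training, a few held back for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Training: show the model the examples and their correct answers.
model = LinearRegression().fit(X_train, y_train)

# Testing/evaluation: how well does it predict answers it hasn't seen?
print(model.score(X_test, y_test))
print(model.predict([[1800, 2]]))  # estimated price of my 1800 sqft apartment
```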

Some of the biggest challenges in this process are obtaining accurate training data, and designing the inputs, outputs, and internal structure of the model so that it forms a good solution to the problem we are trying to solve.

What’s special about deep learning?

Let’s take an interesting question as an example: given an image, tell me whether there’s a cat in it. This is a good example because it’s a really simple question for a human, but it turns out to be extremely difficult to program.

Traditional machine learning approaches require us humans to first define a set of “features” that the program should look for. For example: “is it fluffy”, “does it have a head with two eyes and a nose”, “does it have four legs”, “is it yellow”, etc. You then feed the program a large number of examples, but for each example you don’t show the full image to the program; you only show the values of your predefined features, and you tell it which ones are cats. After training, the program would figure out things like “if it’s not fluffy, it’s probably not a cat” and “it doesn’t really matter whether it’s yellow or not”.
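Here’s a hedged sketch of what that traditional pipeline might look like in Python with scikit-learn. We pick the features by hand, and the program never sees the images themselves, only our feature values; the examples and labels are made up.

```python
# Traditional approach: humans define the features; the model only sees
# feature values, never the raw image. All data here is invented.
from sklearn.linear_model import LogisticRegression

# Hand-picked features per example: [is_fluffy, has_four_legs, is_yellow]
X = [
    [1, 1, 0],  # a cat
    [1, 1, 1],  # a yellow cat
    [0, 1, 0],  # a pig: four legs, but not fluffy
    [0, 0, 0],  # a fish
    [1, 0, 0],  # a fluffy slipper
]
y = [1, 1, 0, 0, 0]  # 1 = cat, 0 = not a cat

model = LogisticRegression().fit(X, y)

# After training, the learned weights show which features mattered;
# "is_yellow" should end up with a weight near zero.
print(model.coef_)
```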

The tricky part is, it’s really hard for us to define the right set of features. If you ask me what a cat is, I would say it’s a fluffy little animal with a small head, two eyes, two ears, a body with four legs and a tail… until you show me this:

There are no legs! There are no eyes! But how could we still tell it’s a cat? What “features” did our brain use to distinguish cats from other objects? If we can’t even figure out what features are useful, how can we define good features for the model to use?

This is where traditional machine learning approaches start to struggle, and it is exactly where deep learning shines. The magic of deep learning is that we don’t have to define these features anymore. We just need to build an empty “brain” that’s structurally “smart” enough to automatically figure out which features are useful and how to use them to make predictions.

How does a deep learning model work?

To solve the same cat detection problem, a suitable deep learning model would be a convolutional neural network (CNN). It’s usually made of layers. I’ll explain how it works by assuming we already have a well-trained CNN, and we’ll see how it figures out whether there’s a cat or not.

First, instead of giving the model some feature values as input, we give it the full image. Let’s say the image is of size 200 x 200. The first layer of the network would take this image, and scan it from the top-left to the bottom-right, one 10 x 10 block at a time, to find out some basic features like “is there a horizontal line in this block”, “is there a vertical line”, “is this block mostly black”, or “does this block look fluffy”.

There could be tens to hundreds of such feature detectors in the first layer. They are called neurons. Each neuron would scan the image block by block and find out which blocks match the pattern it’s looking for. Assuming we decide to have no overlaps between the blocks, just to simplify things, the result from each neuron would be a 20 x 20 grid of boolean values. If there are 100 such neurons in the layer, there will be 100 such grids generated, and these altogether would be the input for the second layer. These grids are sometimes called feature maps.
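To make this less abstract, here’s a plain-NumPy sketch of what a single neuron in that first layer computes, assuming non-overlapping 10 x 10 blocks and a hand-made “horizontal line” pattern (both choices are just for illustration):

```python
# One first-layer "neuron": scan a 200 x 200 image in non-overlapping
# 10 x 10 blocks and record, per block, whether its pattern is present.
# The result is a 20 x 20 boolean feature map.
import numpy as np

image = np.random.rand(200, 200)  # stand-in for a real grayscale image

# A hand-made 10 x 10 "horizontal line" detector: rewards bright pixels
# in two middle rows and penalizes bright pixels elsewhere.
pattern = -np.ones((10, 10))
pattern[4:6, :] = 1.0

feature_map = np.zeros((20, 20), dtype=bool)
for i in range(20):
    for j in range(20):
        block = image[i * 10:(i + 1) * 10, j * 10:(j + 1) * 10]
        response = np.sum(block * pattern)   # how strongly does it match?
        feature_map[i, j] = response > 0     # thresholded to a boolean

print(feature_map.shape)  # (20, 20): one grid per neuron, as described
```

A real CNN learns the pattern weights instead of having them written by hand, and uses smooth activations rather than hard booleans, but the scanning idea is the same.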

The first layer extracts basic features like edges, colors, etc.

Layer 2 then looks at all the feature maps generated by layer 1, and again scans all of them from top-left to bottom-right, block by block, except that this time the blocks are bigger. On the 20 x 20 grids, layer 2 might look at blocks of 3 x 3, which is equivalent to a 30 x 30 region from the original 200 x 200 image. With all the basic features already detected by layer 1, this time layer 2 would be able to look for more complex shapes like a fluffy tail, a colored eye, or a small ear, etc.
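The two layers described so far map directly onto convolutions. Below is a minimal PyTorch sketch of that structure; the 100 first-layer neurons come from the text above, while the 64 second-layer detectors and the final classifier head are assumptions added for illustration.

```python
# A two-layer CNN matching the dimensions described above.
import torch
import torch.nn as nn

model = nn.Sequential(
    # Layer 1: 100 feature detectors ("neurons"), each scanning the image
    # in non-overlapping 10 x 10 blocks -> 100 feature maps of 20 x 20.
    nn.Conv2d(in_channels=1, out_channels=100, kernel_size=10, stride=10),
    nn.ReLU(),
    # Layer 2: scans the 100 maps in 3 x 3 blocks (each covering a 30 x 30
    # region of the original image), looking for more complex shapes.
    nn.Conv2d(in_channels=100, out_channels=64, kernel_size=3, stride=3),
    nn.ReLU(),
    nn.Flatten(),
    # Final decision: "cat" or "not a cat".
    nn.Linear(64 * 6 * 6, 2),
)

x = torch.randn(1, 1, 200, 200)  # one fake 200 x 200 grayscale image
print(model(x).shape)            # torch.Size([1, 2])
```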

The layers can go on and on, until one layer finally gets enough information to say it sees a cat. Below is another illustration, this time of facial recognition with a convolutional neural network. The key is that each layer takes the findings of the previous layer and tries to identify patterns that are more complex.

A figure from the NVIDIA developer blog.

The most amazing thing about this is that we don’t have to tell the first layer to look for horizontal lines and fluffiness, or the second layer to look for a fluffy tail. In fact, we don’t have to tell the model anything about how to look for a cat. We simply set up the structure of the model and let it learn from the training data. If horizontal lines are important for identifying cats, the model will dedicate a neuron to them. If colors are not really useful, color-related features will automatically fade away and the model will reuse that neuron for something more important.
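If you’re curious what “letting it learn” looks like in code, here’s a hedged sketch of a training loop in PyTorch, reusing the model structure from the previous sketch. The random images and labels stand in for a real labeled dataset; nothing here tells the model what a cat looks like, only how wrong each guess was.

```python
# Training: never tell the model what features to look for; only tell it
# how wrong its answers are, and let backpropagation adjust the neurons.
import torch
import torch.nn as nn

# The same two-layer CNN sketched earlier.
model = nn.Sequential(
    nn.Conv2d(1, 100, kernel_size=10, stride=10), nn.ReLU(),
    nn.Conv2d(100, 64, kernel_size=3, stride=3), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 6 * 6, 2),
)

images = torch.randn(8, 1, 200, 200)             # a tiny fake training batch
labels = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = cat, 0 = not a cat

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                    # show the examples over and over
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)  # how wrong are the guesses?
    loss.backward()                        # trace blame back through the layers
    optimizer.step()                       # nudge every neuron's weights
```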

Alright, that’s it! That’s pretty much everything you need to know before you can start building the coolest stuff. How awesome is that?! In Part III, we will show you a real application that can recognize numbers in any image, and you will be amazed by how easy it can be.

Part III: writing a real deep learning program

Thank you for reading! If you enjoyed this piece, please recommend it by clicking on the little heart button below, or share it with your friends! Follow The Bleeding Edge to stay up to date with the latest technologies!
