A Super Quick Introduction to Deep Learning

Aditya Naganath
Published in Around10 · Jun 1, 2018

One of the most talked-about technology trends today is artificial intelligence. Over the past couple of years, we've seen a panoply of innovations, numerous articles, and interviews with the valley's elite, all touting a new age of technology. This piece will not make an argument for or against the hype around AI, but will instead serve to quickly introduce you to the fundamental technology currently underpinning it: deep learning.

The deep learning movement has largely been driven by the deployment of what are called neural networks. Contrary to popular belief, they do not mimic the workings of the brain. Furthermore, many of the ideas behind neural networks were developed in the '80s and '90s, but they didn't see mainstream adoption until this decade. This is because the differentiated performance of neural networks over traditional machine learning algorithms only becomes apparent on huge datasets — which until this decade, we just did not possess or have the ability to work with. However, with the repurposing of GPUs for general computation and the cloud's ability to store humongous datasets, we have finally been able to deploy neural networks at scale against many learning tasks. This has resulted in computers being able to understand the world significantly better — from images and videos to spoken and written language.

What is a neural network and how does it work?

A three-layered neural network

The above image shows a 3-layered neural network (2 hidden layers + 1 output layer; by convention, the input layer isn't counted). Deep learning basically means that we've deployed a neural network with more than one hidden layer.

Every neural network begins with an input layer and ends with an output layer. The layers in between, called hidden layers, successively transform the input until it reaches the output layer, which is responsible for producing the final answer (a house price, "cat or dog?", etc.). The goal, then, of deploying a neural network against a learning task is to get the network's hidden layers to learn the most appropriate and generalizable transformations of the input from prior examples.

What does this mean?

In the simplest cases, the input to a neural network is a vector of numbers, denoted x = [x1, x2, …, xn]. Each component of x, like x1, is represented by a single circle in the input layer in the image above. Thus, you can think of each circle in the input layer as holding a single, unique number from x. Each of these numbers is fed (represented by the forward-pointing arrows) to multiple circles in the first hidden layer, called hidden units (or artificial neurons).

A single hidden unit does two things:

First, it takes each of the input components fed to it (say x1, x2 and x3) and multiplies them by weights (w1, w2, w3), producing w1*x1, w2*x2 and w3*x3. It then adds these products together along with a single "bias" number b, producing: w1*x1 + w2*x2 + w3*x3 + b. We denote this intermediate result with the letter z.

Second, the hidden unit simply throws z into a function "g" (called a non-linearity in academic parlance), whose output g(z) is a number between, say, 0 and 1, or -1 and 1, depending on the choice of g (the sigmoid and tanh functions are common choices).
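
To make those two steps concrete, here is a minimal sketch of a single hidden unit in Python with NumPy. This is my own illustration, not code from any particular library; the variable names are mine, and I've picked the sigmoid as the non-linearity g:

```python
import numpy as np

def sigmoid(z):
    # One common choice of non-linearity g: squashes z to a number between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def hidden_unit(x, w, b):
    # Step 1: multiply each input component by its weight, sum, and add the bias.
    z = np.dot(w, x) + b
    # Step 2: pass the intermediate result z through the non-linearity g.
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # input components x1, x2, x3
w = np.array([0.1, 0.4, -0.3])   # weights w1, w2, w3
b = 0.2                          # the unit's single bias number
print(hidden_unit(x, w, b))      # a single number between 0 and 1
```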

This number, the output of the hidden unit, is similarly fed as an input component to a hidden unit in the next hidden layer. This is what a single arrow between hidden units in layers 1 and 2 of the above image represents.

Putting this together, we basically start with some input that is a collection of numbers → have each hidden unit apply the two transformations above to those numbers → feed each of their outputs to the hidden units in the next layer. This process ends when we reach the output layer — which computes its own g(z) that gives us a final answer. This flow of numbers, from input to answer through the network, is called forward propagation.
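
The forward propagation of a whole layer can be written compactly by stacking each of its units' weights into a matrix and its biases into a vector. Here is a minimal sketch, again in Python with NumPy; the layer sizes and random weights are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, layers):
    # layers is a list of (W, b) pairs, one per layer. Each layer computes
    # g(W·a + b), where a is the previous layer's output (or the input x).
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a  # the output layer's final answer

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), rng.normal(size=4)),  # hidden layer 1: 3 inputs -> 4 units
    (rng.normal(size=(4, 4)), rng.normal(size=4)),  # hidden layer 2: 4 units -> 4 units
    (rng.normal(size=(1, 4)), rng.normal(size=1)),  # output layer:   4 units -> 1 answer
]
x = np.array([0.5, -1.2, 3.0])
print(forward_propagate(x, layers))  # an untrained network's (meaningless) answer
```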

This is half of the workings of a neural network! However, as mentioned, the goal of the network is to learn the most appropriate and generalizable transformations of the input. This basically means that we want to learn the best weights and biases, for every hidden unit in every layer, such that the output layer will get a majority of its answers correct.

To do so, we first need to have a dataset of examples (i.e. with input vectors x and a correct answer for each x) for the network to learn from. This is called a training set. Equally important, we need to have a quantifiable way to determine how well the network performed on each input x. This is done via a loss function.

A loss function takes the answer provided by the output layer and compares it to the correct answer provided by the training set, for the same x. The more similar the answers are, the smaller the loss we incur, and vice versa. Think of it as a feedback mechanism for the network. Roughly speaking, then, the goal of training the network is to find the weights and biases, for each hidden layer, that minimize the overall loss (i.e. roughly, minimize the total number of incorrect answers) we incur on our training set.
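
As an illustration, one simple loss function is the squared error; the argument here doesn't depend on any particular choice, so this is just my pick. It is small when the network's answer is close to the correct one and large otherwise:

```python
def squared_error(network_answer, correct_answer):
    # Small when the two answers are similar, large when they differ.
    return (network_answer - correct_answer) ** 2

print(squared_error(0.9, 1.0))  # close answer -> small loss (~0.01)
print(squared_error(0.1, 1.0))  # far answer   -> large loss (~0.81)
```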

A loss curve: the total loss plotted (in blue) against the network's weights, with its minimum at Jmin

Pictorially, a loss function looks something like the blue curve in the above image. The network’s goal is to find weights and biases for its hidden layers such that its overall loss on the training set is at or near “Jmin” — the point of smallest overall loss. For simplicity, think of Jmin = 0 i.e. we want 0 errors when we are done training the network.

Concretely, let's say we have 10 training examples. Recall that each training example consists of a vector x and a correct answer. We will forward propagate each x through our network and get 10 final answers. We will compare each of those answers against the corresponding correct one and compute how well we did on the 10 examples by computing the total loss. This total loss will be some point on the blue curve.
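
Continuing the sketch, that total loss might be computed as follows, reusing the illustrative forward_propagate and squared_error helpers from above:

```python
def total_loss(training_set, layers):
    # Sum the loss over every (x, correct_answer) example; this sum is one
    # point on the blue curve.
    return sum(
        squared_error(forward_propagate(x, layers)[0], correct_answer)
        for x, correct_answer in training_set
    )
```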

Given that our goal is to reach a total loss of Jmin, we begin to tweak our weights and biases for each hidden layer (i.e. alter the value of each weight and each bias slightly), based on how well we did, such that the next time we forward propagate our training examples, we are at a point on the blue curve that is closer to Jmin.

The process of systematically tweaking our weights and biases based on how well we did is called gradient descent, and the method for efficiently computing those tweaks for every hidden layer, by propagating the loss backwards through the network, is called back-propagation.
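
Here is a toy sketch of gradient descent on a one-dimensional loss curve. For clarity it approximates the gradient numerically; in a real network, back-propagation computes the exact gradients far more efficiently:

```python
def numerical_gradient(loss_fn, w, eps=1e-6):
    # Approximate the slope of the loss curve at w by nudging w slightly.
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

def gradient_descent_step(loss_fn, w, learning_rate=0.1):
    # Move w a small step in the direction that decreases the loss.
    return w - learning_rate * numerical_gradient(loss_fn, w)

# Toy loss curve: loss(w) = (w - 3)^2 has its minimum (Jmin = 0) at w = 3.
loss = lambda w: (w - 3.0) ** 2
w = 0.0
for _ in range(100):
    w = gradient_descent_step(loss, w)
print(w)  # very close to 3.0, the bottom of the curve
```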

Once the network's weights and biases are such that they consistently produce a total loss close to 0, we say that the network is trained and is ready to be deployed. We can then accept new inputs x whose correct answer we don't know, and trust that our network will produce the correct answer — whether it's recognizing an image of a cat or predicting the price of a house.

Summary

  1. We are given a dataset, often large, of examples (x, answer) for the network to learn from.
  2. We forward propagate those examples through our network to see how the network’s answers for each x matched up with the correct answers given in the dataset.
  3. We tweak the network's weights and biases for each hidden layer, such that we'll do better on our next iteration of forward propagation.
  4. We repeat steps 2 and 3 until our total loss is as close to 0 as possible.
  5. We deploy our network on new inputs x to obtain answers we don't already have. (The sketch below ties all five steps together.)
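
Putting these five steps together, here is a minimal end-to-end training sketch in Python with NumPy. The toy task ("is x1 + x2 > 1?"), the layer sizes, the loss and the learning rate are all my own illustrative choices, and the back-propagation formulas are written out by hand for this specific network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny network: 2 inputs -> 3 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

# Step 1: a training set of (x, correct answer) examples.
X = rng.uniform(0, 1, size=(100, 2))
Y = (X.sum(axis=1) > 1.0).astype(float)

learning_rate = 0.5
for epoch in range(2000):
    # Step 2: forward propagate every example through the network.
    H = sigmoid(X @ W1.T + b1)               # hidden layer outputs
    answers = sigmoid(H @ W2.T + b2)[:, 0]   # output layer answers

    # Compare against the correct answers with a squared-error loss.
    loss = np.mean((answers - Y) ** 2)
    if epoch % 500 == 0:
        print("loss:", loss)  # shrinks toward Jmin as training proceeds

    # Step 3: back-propagation computes the gradients, and gradient descent
    # tweaks every weight and bias a small step downhill.
    d_out = 2 * (answers - Y) * answers * (1 - answers)  # error at the output
    dW2 = d_out[None, :] @ H / len(X)
    db2 = np.array([d_out.mean()])
    d_hid = (d_out[:, None] @ W2) * H * (1 - H)          # error at the hidden layer
    dW1 = d_hid.T @ X / len(X)
    db1 = d_hid.mean(axis=0)
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Step 4 is the loop above. Step 5: deploy on a new input whose answer we
# don't know; the output should be close to 1, since 0.9 + 0.4 > 1.
x_new = np.array([0.9, 0.4])
print(sigmoid(sigmoid(x_new @ W1.T + b1) @ W2.T + b2))
```

In practice, frameworks like TensorFlow and PyTorch compute these back-propagation gradients automatically, so you rarely write them out by hand.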

Hopefully, you've gotten a flavor for how neural networks, and consequently deep learning, operate under the hood. For anyone who's acquainted with deep learning: I've intentionally simplified certain things and left out others so that this piece is accessible to those who have no prior exposure to the topic.

Closing Thoughts

This piece was a very simplistic introduction to the field of deep learning, but it should still have given you a flavor for the core processes that occur under the hood of this technology. There are many innovative (and far more complicated!) variants of neural networks being deployed by companies like Google, Facebook and Amazon to produce technologies and products like self-driving cars, Alexa and the like. It must be noted that, as far as I know, neural networks still have a long way to go before they can mimic human intelligence. They are better than humans at certain specific tasks, but nowhere close to perceiving the world the way humans do. Lastly, a big concern with neural networks (and deep learning) is that we still do not fully understand how networks learn abstract concepts. That is to say, as far as I understand, we are limited in explaining how a network formed certain conclusions to arrive at a specific answer. This is an interesting area of research today.

If you’d like to learn more about deep learning, I recommend Andrew Ng’s 5 course specialization on deep learning. I recently completed it and went from knowing nothing about the subject to having a fairly decent grasp of all the core ideas in the field.

If you have any questions or want to chat about this topic, feel free to reach out! I’m @anaganath on Twitter.

