Intro to Deep Neural Networks — Is it really that hard?

Tirth Shah · Published in Analytics Vidhya · Nov 18, 2019 · 8 min read
Concepts of Neural Networks are inspired by the way our brain learns, understands and retains information.

In the world of technology, “Machine Learning” (ML), “Artificial Intelligence” (AI), and “Deep Learning” (DL) have become the perfect buzzwords that attract employers, investors and tech geeks! Although such concepts are initially hard to understand, with the right guidance and reliable documentation, they do get easier!

That being said, it’s important to solidify the basics of Deep Neural Networks (DNN). This includes definitions, key algorithms, core concepts and the various types of models. It is equally important to be able to apply techniques that train models and improve their performance on the chosen data. In this article we will do exactly that: perfect the fundamentals!

The content is inspired by Deep Learning, an MIT Press book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

1. Definitions

What is Deep Learning?

Deep learning is a subfield of machine learning that focuses on techniques for teaching computer models tasks that come naturally to humans. Some popular applications of DL methodologies include image recognition, voice search and voice-activated intelligent assistants, earthquake prediction, and neural networks for brain cancer detection, to name a few.

These algorithms, called artificial neural networks, are inspired by the way our brains work. The structure of a neural network closely mirrors that of our nervous system: a complex system made up of millions of neurons, each communicating back and forth with the others to learn, recognize and understand various tasks.

Fig 1. Slide by Andrew Ng, all rights reserved.

Initially, it may not be clear why such complex networks, with many layers and hundreds (if not thousands) of neurons, would be any better. Comparing them with older algorithms helps explain why. As shown in Fig 1, as the amount of available data increases, older models approach a saturation point and stop improving, while deep learning models are built to keep improving with larger data sets.

Splitting Data

In machine learning, the total available data is split into 3 categories: the training set, the validation set and the test set. This split helps address the root problems of ML, overfitting and underfitting, which are covered later in the article. As the name suggests, the training set is used to train the algorithm and is the largest portion of the available data. The test set is used to evaluate the performance of the fully trained model using some performance metric. It is important that the training and test sets are mutually exclusive, so the model generalizes rather than memorizes. Finally, the validation set is used to tune hyperparameters, the variables that control how the model learns. (A short code sketch of such a split follows Fig 2.)

Fig 2. Visualization of the split between the 3 categories, Tarang Shah, all rights reserved.
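
Here is a minimal sketch of such a split using scikit-learn’s train_test_split. The arrays X and y below are dummy placeholders for your features and labels, and the 60/20/20 proportions are a common choice rather than a rule.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)        # 1000 examples, 10 features (dummy data)
y = np.random.randint(0, 2, 1000)   # binary labels (dummy data)

# First carve out the test set (20% of the data).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training (60% overall) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200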

2. What are we really “learning” though?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” — Tom Mitchell, as quoted in Goodfellow et al.

a) Tasks, T

ML tasks are usually described in terms of how the ML system should process an example, where an example is a collection of features that has been quantitatively measured from some object or event that we want the system to process. Some of the most common ML tasks include:

  • Classification with / without missing inputs
  • Regression
  • Transcription
  • Anomaly detection
  • Synthesis and sampling
  • Denoising
  • Density estimation or probability mass function estimation

b) Performance Measure, P

To evaluate the abilities of an ML algorithm, we must design a quantitative measure of its performance. Two common measures are:

  • Accuracy
  • Error Rate

Realistically, we usually want to know how well our algorithm will perform on data it has not seen before, so evaluating performance on the test set (kept separate from the training set) is the right approach, as in the short sketch below.
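
As an illustration, here is how accuracy and error rate can be computed by hand on a handful of hypothetical test predictions (y_true and y_pred are made-up arrays, not the output of any real model).

import numpy as np

# Hypothetical true labels and model predictions on the (unseen) test set.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_true == y_pred)   # proportion of examples classified correctly
error_rate = 1.0 - accuracy            # proportion classified incorrectly

print(accuracy, error_rate)  # 0.75 0.25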

c) Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by the kind of experience they are allowed to have during the learning process. This is an important topic which deserves another article on its own.

3. Overfitting and Underfitting Models

There are 2 key factors that determine how well an ML algorithm performs. That is, its ability to:

  • Make the training error small
  • Make the gap between training and test error small

However, there are two recurring problems with machine learning algorithms:

a) Overfitting

This (frequent) problem occurs when the model captures the noise of the data. In other words, the model fits the training data “too well”: it learns patterns specific to the training data and essentially “memorizes” them, to the extent that performance on new, unseen data suffers. In simple words, it is when the gap between the training error and the test error is too large.

A common fix is to simplify the model, specifically to reduce the network’s capacity by removing hidden layers or reducing the number of units in them. Another possible solution is to add weight regularization, which adds a cost to the loss function for large weights and thereby forces the model to learn only the relevant patterns in the training set. For example, check out this article published by researchers from the University of Toronto, which explains the use of dropout as a technique to prevent overfitting in neural networks. A short sketch combining these fixes follows.
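
Below is a minimal, illustrative Keras sketch (not the author’s original code) that combines three of these fixes: modest capacity, L2 weight regularization and dropout. The input size and layer widths are arbitrary choices for the example.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# A small dense network with three common anti-overfitting measures:
# modest capacity, L2 weight regularization, and dropout.
model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001),   # penalize large weights
                 input_shape=(784,)),                         # input size chosen for illustration
    layers.Dropout(0.5),                                      # randomly drop units during training
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])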

b) Underfitting

This type of problem occurs when the model CANNOT capture the basic, underlying patterns of the training data. In contrast to overfitting, it happens when the model DOES NOT fit the training data well, leading to poor predictions on the training set itself as well as on the validation and test sets. In other words, it is when the model is not able to obtain a sufficiently low error value on the training set.

A simple fix could be to increase the model’s capacity, for example by adding layers or units, or to train it for longer so it can capture the underlying patterns.

For both of these situations, some common changes can lead to better results (a short cross-validation sketch follows this list), such as:

  • Changing initial weights / biases
  • Playing around with different activation functions
  • Switching up order of layers (architecture)
  • Cross-validation
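
As an example of the last point, here is a small sketch of 5-fold cross-validation using scikit-learn’s KFold. The data is dummy, and build_and_evaluate is a hypothetical placeholder for whatever training and evaluation routine you actually use.

import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(500, 10)         # dummy features
y = np.random.randint(0, 2, 500)    # dummy labels

def build_and_evaluate(X_train, y_train, X_val, y_val):
    # Placeholder: train a model on the training fold and return its validation accuracy.
    return np.random.rand()

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    scores.append(build_and_evaluate(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))

print("mean CV accuracy:", np.mean(scores))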

4. How does it really work? A simple CNN example.

There are many types of Deep Learning models, for instance Recurrent Neural Networks, Recursive Neural Networks and Residual Networks (ResNets), to name a few. However, one of the most common ones known to ML practitioners is the Convolutional Neural Network (CNN).

We will look at a simple example of the CNN architecture using diagrams and Python code to understand its functionality.

Architecture Overview

CNNs are a specialized kind of neural network for processing data that has a known grid-like topology, such as time-series data and image data. Essentially, they take an input, assign weights and biases to various parts of the network and, in the end, classify the image into one of the defined classes. Of course, we are skipping over a lot of the fun stuff that happens in the middle.

There are fundamental layers used in CNNs. Although the order in which they are placed may vary, their core functionality and importance don’t change. We will discuss the following layers: convolutional, pooling, and fully connected.

P.S. I will be using images and GIFs from this article, as it has some good visualizations of how these layers work.

Convolutional Layer

In mathematics, convolution is an operation in which two functions (f and g) produce a third function (h), where h expresses how one function is modified by the other. Similarly, in a CNN the convolutional layer traverses the entire input based on the filter / kernel size and, at every spatial position, computes the dot product between the current patch and the filter matrix, producing one unit of the output matrix. (Look at Fig 4.)

Here are examples of different possible filters (features), which are used to detect and extract low-level patterns such as edges, curves and other simple shapes, depending on the feature map.

Fig 3. Examples of feature maps.

The GIF below is an example of how a convolution layer traverses through the input image and produces an output based on the dot product with a particular feature.

Fig 4. Feature / kernel size of 3x3 with input 6x6, producing a 4x4 matrix, Sumit Saha, All rights reserved.

The resulting matrix can have reduced dimensionality compared to the input (as shown above), or the dimensionality can stay the same or even increase when padding is applied. A small numeric sketch of this operation follows.
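
To make the operation concrete, here is a small NumPy sketch of the step in Fig 4: a 3x3 filter sliding over a 6x6 input with stride 1 and no padding, producing a 4x4 output. Strictly speaking, deep learning libraries compute cross-correlation (the filter is not flipped), and the filter values here are arbitrary.

import numpy as np

image = np.random.rand(6, 6)          # dummy single-channel 6x6 input
kernel = np.array([[1, 0, -1],        # an edge-detection-style 3x3 filter, chosen arbitrarily
                   [1, 0, -1],
                   [1, 0, -1]])

out_h = image.shape[0] - kernel.shape[0] + 1   # 6 - 3 + 1 = 4
out_w = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        output[i, j] = np.sum(patch * kernel)   # element-wise product, then sum

print(output.shape)  # (4, 4)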

Pooling Layer

The pooling layer is another frequently used layer in CNNs. Its main purpose is to reduce the spatial size of the representation, which decreases the computational power needed for further processing. It is also very useful for extracting the dominant features in a given input.

The most common type of pooling is max-pooling: given a stride s and a window size n, it slides over the input and writes to the output matrix the maximum value of the current n x n window, as shown below (a short code sketch follows Fig 5).

Fig 5. Input size 5x5 matrix, with 3x3 max-pooling, Sumit Saha, All rights reserved.
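
Here is a small NumPy sketch of max-pooling with a configurable window size n and stride s. The 4x4 feature map and the 2x2 window are illustrative choices rather than the exact setup of Fig 5.

import numpy as np

def max_pool(x, n=2, s=2):
    # Slide an n x n window with stride s over x and keep the maximum of each window.
    out_h = (x.shape[0] - n) // s + 1
    out_w = (x.shape[1] - n) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(x[i * s:i * s + n, j * s:j * s + n])
    return out

feature_map = np.array([[1, 3, 2, 9],
                        [5, 6, 1, 7],
                        [4, 2, 8, 0],
                        [3, 1, 2, 6]])

print(max_pool(feature_map))   # [[6. 9.]
                               #  [4. 8.]]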

Fully Connected / Classification Layer

Finally, after a combination of convolutional and pooling layers, the network usually ends with fully connected layers, which capture the high-level reasoning in the network. In a fully connected layer, every neuron is connected to every neuron in the previous layer. Typically, the output of the last pooling layer is flattened and passed through one or more fully connected layers to classify the image.

The diagram below shows a typical architecture of a convolutional network.

Fig 6. A CNN with 2 convolutional layers, 2 pooling layers and a fully connected layer to help classify the image, Sumit Saha. All rights reserved.

5. Let’s Python it!

Till now, you’re probably just like “ya, ya I get all this theoretical stuff, but how do we code this?”.

We will use TensorFlow and the Keras API to build a very basic CNN: a Sequential model with ReLU as the activation function and Adam as the optimizer.

To begin, certain libraries need to be imported (a likely set is sketched below). To make life easier, I’m running Jupyter through the Anaconda3 Navigator.
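
The original import cell is not reproduced here, but a typical set of imports for this kind of example looks roughly like the following. The cv2 and pickle imports are assumptions based on the usual image-preprocessing workflow and may differ from the original gist.

import os
import pickle                   # for loading/saving the preprocessed dataset
import numpy as np
import cv2                      # for reading and resizing images
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense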

I used the sentdex YouTube channel to get started.

For this example, I am using the Cat and Dog 64 x 2 CNN dataset, which can be downloaded from here. A sketch of the model and training step follows.
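
Below is a minimal sketch of the model and training step, not a copy of the original gist. It assumes the images have already been preprocessed and pickled as X.pickle and y.pickle (as in the sentdex tutorial), and the two 64-filter convolutional blocks are my interpretation of the “64 x 2” naming.

import numpy as np
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Load the preprocessed images and labels (0 = cat, 1 = dog) and scale pixels to [0, 1].
X = np.array(pickle.load(open("X.pickle", "rb"))) / 255.0
y = np.array(pickle.load(open("y.pickle", "rb")))

model = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=X.shape[1:]),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),     # binary output: cat vs. dog
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Hold out 10% of the training data as a validation set while fitting.
model.fit(X, y, batch_size=32, epochs=3, validation_split=0.1)

Running model.fit prints training and validation accuracy per epoch; watching the gap between the two is a quick way to spot the overfitting discussed in Section 3.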
