An introduction to Deep Learning using Flux Part II: Multi-layer Perceptron

SophieB
7 min read · Mar 22, 2022


Photo by Milad Fakurian on Unsplash

In a previous post, we showed how to use Flux to build a very simple model (linear regression) for predicting linearly correlated data. However, that post didn’t cover many details of what Deep Learning is or how to create more complex models. In this post, we go a bit deeper into Deep Learning concepts and cover the following:

  • What a neural network is.
  • The basic unit (perceptron) that you use to create a neural network.
  • What a Multi-layer perceptron is.
  • How the training algorithm for a neural network works.
  • How to build a neural network that recognizes digits from the MNIST dataset using Flux.

Deep Learning is a subset of Machine Learning that extracts patterns from raw data. In Machine Learning, you teach an algorithm to extract patterns from data without explicitly programming how to perform this task. Instead, you provide data to the algorithm and use a training routine so that it learns how to extract patterns. Most of the time, you need to preprocess the data to make important features more discoverable for the algorithm. However, in Deep Learning, you just provide raw data and let the algorithm figure out by itself all of the important features it needs to extract patterns from the data.

Neural networks

Source: https://www.tibco.com/reference-center/what-is-a-neural-network

Deep Learning is all about creating neural networks that can be used to solve different problems such as recognizing human language or classifying objects from an image. A neural network is a model of computation that was inspired by the human brain. It consists of basic units (perceptrons or neurons) that are arranged into layers. Each layer can be of different types (input, hidden, and output). Moreover, a neural network learns to predict new data by adjusting its internal parameters (weights and biases) via a training routine.

The basic unit of a neural network

A perceptron is the basic unit of a neural network. Individually, it computes the output of a linear equation and then passes that output through a non-linear activation function. This activation function makes the output of the perceptron non-linear. We avoid a purely linear output because stacking linear units would still produce a linear model, and in most cases a non-linear model is more accurate for making predictions. You can use many different activation functions, but the most common one is the sigmoid function.

Source: MIT 6.S191 Introduction to Deep Learning

The screenshot above shows an example of a perceptron. It has the following components:

  • Inputs x1, x2, …, xm and a bias term.
  • Weights w0, w1, w2, …, wm.
  • Sum operation Σ.
  • Non-linearity function or activation function g.

The output of a perceptron is computed as follows:

  1. Inputs are multiplied by their corresponding weight (xi wi).
  2. The weighted inputs and the bias are all added together (w0 + x1 w1 + x2 w2 + … + xm wm), where the bias weight w0 is multiplied by a constant input of 1.
  3. The activation function g is applied to the output of the previous step.

You may have noticed that we use a bias term so that the input for g can be shifted to the left or right. This is useful because it lets the perceptron place its decision boundary away from the origin instead of forcing it to pass through it.

We can simplify the computation by using Linear Algebra (matrices and vectors). It allows us to parallelize computation and thus make the process faster. This is especially useful when we are training a neural network that has thousands or even billions of parameters (weights and biases). First, we create a vector X with the inputs and a second vector W that contains all of the weights. Then, we rewrite the computation equation of the perceptron as:

ŷ = g(w₀ + XᵀW), where ŷ denotes the perceptron’s output.
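
As a toy illustration in Julia (a hand-rolled sketch, not Flux code), the vectorized perceptron with a sigmoid activation could be written as:

```julia
# Sigmoid activation: squashes any real number into (0, 1).
g(z) = 1 / (1 + exp(-z))

# Perceptron output: activation applied to the bias plus the weighted sum of inputs.
perceptron(x, w, w0) = g(w0 + w' * x)

# Example with three inputs, three weights, and a bias term.
perceptron([0.5, -1.0, 2.0], [0.1, 0.4, -0.2], 0.3)
```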

We choose the activation function g according to the problem we want to solve. The most common activation function is the sigmoid function (see image below). However, most Deep Learning models in practice use other functions, such as ReLU.

Source: https://commons.wikimedia.org/wiki/File:Sigmoid-function-2.svg

Note: This section was heavily inspired by Lecture 1 of the MIT Intro to Deep Learning course, which is a great course with lots of very interesting lectures.

Multi-layer perceptron

A neural network can have different architectures depending on the problem we are attempting to solve. In this post, we create a multi-layer perceptron (MLP). An MLP is a neural network with the three basic layer types: input, hidden, and output. Each perceptron in a layer is connected to all of the perceptrons in the next layer (the layers are fully connected).

Source: http://d2l.ai/chapter_multilayer-perceptrons/mlp.html

Training algorithm

As we mentioned above, we need a training algorithm (routine) so that the neural network learns to extract patterns from data. More specifically, we need to find the values for W that enable the neural network to make the fewest mistakes when predicting new data. We also need a cost function to measure the mistakes the neural network makes. Thus, training the neural network boils down to finding the values for W that minimize the cost function.

We use the Gradient Descent algorithm to find the best values for W. It approximates these values step by step. First, the algorithm initializes all weights W randomly. Then, it computes the cost function’s gradients, which point in the direction of steepest increase, so we move in the opposite direction. Finally, it updates the values for W, taking them one step towards the cost function’s minimum. We stop the algorithm when it finds values for W that are good enough.
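
In symbols, each update moves W a small step, scaled by a learning rate η, against the gradient of the cost function J. A one-line sketch in Julia:

```julia
# One gradient descent step: move W against the gradient of the cost J.
gradient_descent_step(W, ∇J, η) = W .- η .* ∇J(W)
```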

Example of a Multi-layer perceptron using Flux

For this example, we use the MNIST dataset which contains images of handwritten digits. The images of the handwritten digits are of size 28x28. The goal of our model is to predict the class of a digit (0 to 9). We create an MLP with the following parameters:

  • Elements in the input layer: 28x28 = 784. This is the flattened size of each digit image.
  • Perceptrons in the hidden layer: 32.
  • Elements (number of classes) in the output layer: 10. Recall that the dataset contains all digits 0 to 9.

We import the libraries we need:
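
A minimal sketch, assuming a Flux version from around the time of writing (0.12/0.13), where DataLoader lives under Flux.Data:

```julia
using Flux
using Flux: onehotbatch, onecold
using Flux.Losses: logitcrossentropy
using Flux.Data: DataLoader   # newer Flux versions provide DataLoader at the top level
using MLDatasets
using Statistics: mean
```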

We use the Julia package MLDatasets to get the MNIST dataset, but you can also download it directly from the MNIST database webpage. In addition to MNIST, the MLDatasets package contains other datasets such as FashionMNIST, CIFAR-10, and many more.
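
Loading the data with the MLDatasets API that was current at the time might look like this (newer releases expose MNIST(split=:train) with .features and .targets instead):

```julia
# Load the train and test splits as Float32 arrays.
x_train, y_train = MLDatasets.MNIST.traindata(Float32)
x_test,  y_test  = MLDatasets.MNIST.testdata(Float32)
```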

Check the size of the data:
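
Assuming the arrays loaded above, a minimal check:

```julia
println(size(x_train))   # (28, 28, 60000)
println(size(x_test))    # (28, 28, 10000)
```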

The code above prints the size of the MNIST train and test data. The train dataset has 60000 examples and the test dataset has 10000. Each element of both datasets is a 28x28 matrix. Flux expects the data to be in a different shape. Therefore, we flatten the input data, that is, we convert each 28x28 matrix into a 784-dimensional vector:
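
One way to do this is with Flux.flatten, which collapses all but the last (sample) dimension:

```julia
# Each 28×28 image becomes a 784-element column; samples stay along the last dimension.
x_train = Flux.flatten(x_train)   # 784×60000
x_test  = Flux.flatten(x_test)    # 784×10000
```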

Now, we need to one-hot encode (see image below) the labels so that we can feed them into the MLP.

Source: https://chrisalbon.com/code/machine_learning/preprocessing_structured_data/one-hot_encode_nominal_categorical_features/
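
With Flux’s onehotbatch, this might look like:

```julia
# Each label becomes a 10-element vector with a single 1 in the slot for its digit.
y_train = onehotbatch(y_train, 0:9)   # 10×60000
y_test  = onehotbatch(y_test, 0:9)    # 10×10000
```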

Whenever we are working with a very large dataset, it is more convenient to work with mini-batches of the data. We use Flux’s DataLoader type so we can iterate over the data in mini-batches:
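
A sketch with an assumed batch size of 256:

```julia
batch_size = 256

# Shuffle the training data each epoch; the test loader is used only for evaluation.
train_loader = DataLoader((x_train, y_train), batchsize=batch_size, shuffle=true)
test_loader  = DataLoader((x_test, y_test), batchsize=batch_size)
```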

As we mentioned above, the model has the following layers: input (28x28, same as image size), hidden (with 32 perceptrons), and output (of size num_classes). We create the model using Flux’s Chain and Dense functions and use the relu activation function:
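
A sketch of the model described above:

```julia
num_classes = 10

# 784 inputs → 32 hidden perceptrons with relu → 10 output scores, one per digit.
model = Chain(
    Dense(28 * 28, 32, relu),
    Dense(32, num_classes),
)
```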

As we mentioned above, we need a cost function (loss function) to measure how good the model’s predictions are during training. But we also need an accuracy function to measure the model’s predictions after training. We create a loss function and an accuracy function:
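
One common choice is logitcrossentropy for the loss (it works on the model’s raw output scores, so no final softmax layer is needed) and onecold, which inverts the one-hot encoding by picking the highest-scoring class:

```julia
# Loss: cross-entropy between the model's raw output scores and the one-hot labels.
loss(x, y) = logitcrossentropy(model(x), y)

# Accuracy: fraction of examples where the highest-scoring class matches the true label.
accuracy(x, y) = mean(onecold(model(x), 0:9) .== onecold(y, 0:9))
```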

We use the Descent optimiser (Gradient Descent) and set a value η for the learning rate (how fast we want to approach the cost function’s minimum):
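
For example, with an assumed learning rate of 0.1:

```julia
η = 0.1           # learning rate: how large a step we take towards the minimum
opt = Descent(η)  # plain gradient descent
```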

We train the neural network in stages (or epochs). In each epoch, we perform one gradient descent step per mini-batch and then output the loss and accuracy:
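
A sketch of the training loop, assuming 10 epochs and the implicit-parameters API that Flux provided at the time (newer Flux versions use explicit gradients with Flux.setup and Flux.update!):

```julia
epochs = 10
ps = Flux.params(model)   # the weights and biases that training adjusts

for epoch in 1:epochs
    for (x, y) in train_loader
        # One gradient descent step per mini-batch.
        gs = gradient(() -> loss(x, y), ps)
        Flux.Optimise.update!(opt, ps, gs)
    end
    println("epoch $epoch: loss = $(loss(x_train, y_train)), accuracy = $(accuracy(x_test, y_test))")
end
```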

Final remarks

We went through some basic concepts of Deep Learning and created a neural network that classifies digits from the MNIST dataset. Flux enables us to create more complex models such as CNNs and RNNs. For more information on Flux and how you can use other examples as a starting point for your own projects, see Flux’s docs and the Model Zoo.
