Mastering the Multi-Layer Perceptron (MLP) for Image Classification

A Step-by-Step Guide to Building and Training Neural Networks Using the MNIST Dataset

Filip Jerga
Eincode
7 min read · Sep 12, 2024

--

In machine learning, one of the most fundamental tasks is image classification. Multi-Layer Perceptrons (MLPs) provide an excellent foundation to understand how neural networks work. MLPs are a type of neural network composed of multiple layers of neurons, making them capable of learning complex relationships in data. In this blog post, we’ll explore how MLPs function and how to use them to classify images from the famous MNIST dataset, which contains handwritten digits.

First, Resources!

This blog article is based on this course: https://eincode.com/courses/master-neural-networks-build-with-javascript-and-react

Github Repo (Full Code): https://github.com/Jerga99/neural-network-course

What is a Multi-Layer Perceptron (MLP)?

A Multi-Layer Perceptron (MLP) is a type of artificial neural network consisting of multiple layers of neurons. It’s called “multi-layer” because it includes an input layer, one or more hidden layers, and an output layer.

  • Input Layer: This layer receives the raw data. In the case of image classification, each neuron in the input layer represents one pixel. For a 28x28 image, the input layer would have 784 neurons. This layer simply passes the data to the next layer without modification.
  • Hidden Layers: These layers perform the heavy lifting of the model. Each neuron in a hidden layer receives input from the previous layer, computes a weighted sum of these inputs, adds a bias, and passes the result through an activation function like ReLU. ReLU introduces non-linearity, allowing the network to capture complex patterns.
  • Output Layer: The output layer generates the final prediction. For classification tasks, the output layer has one neuron per class (e.g., 10 neurons for digits 0–9 in MNIST). The softmax function converts the raw scores into probabilities, and the class with the highest probability is chosen as the prediction.

MLPs are fully connected, meaning each neuron in one layer is connected to every neuron in the next. The goal of training an MLP is to adjust the weights and biases to minimize the prediction error.

Practical Example: Classifying MNIST Digits

To demonstrate how MLPs work in practice, we’ll walk through a simple application: classifying handwritten digits from the MNIST dataset. MNIST consists of 28x28 grayscale images of digits (0–9), and the goal is to build a model that correctly identifies these digits.

Step 1: Preparing the Data

Data preparation is crucial before feeding the data into the MLP. Let’s look at how this works with the MNIST dataset:

  1. Flatten the Images: MNIST images are 28x28 pixels. Since MLPs expect a vector as input, each image is “flattened” into a 1D array of 784 pixels.
  2. Normalize Pixel Values: Pixel values in images range from 0 to 255. To make training more efficient, we normalize them to values between 0 and 1 by dividing by 255. Keeping the inputs in a small, consistent range stabilizes the gradient updates and improves the learning process.
  3. One-hot Encode the Labels: MNIST labels are digits from 0 to 9. We need to convert them into a format the network can understand for classification. One-hot encoding is a method where each label is transformed into a binary vector. For example, the digit “3” becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. The network outputs probabilities for each class, and the position with the highest value corresponds to the predicted digit.

Pseudo code for normalization and one-hot encoding (Full Code in “Resources”):

Normalize data:
  for each image in dataset:
    for each pixel in image:
      pixel_value = pixel_value / 255   # normalize pixel values to range [0, 1]

One-hot encode labels:
  for each label in dataset:
    create a vector of length 10 filled with zeros
    set the position of the correct label to 1
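
Since the course builds everything in JavaScript, here is a minimal sketch of these two steps. The function names (normalizeImage, oneHotEncode) are illustrative and not taken from the course repo:

function normalizeImage(image) {
  // image: flat array of 784 pixel values in the range 0-255
  return image.map((pixel) => pixel / 255);
}

function oneHotEncode(label, numClasses = 10) {
  // label: a digit 0-9; e.g. 3 becomes [0,0,0,1,0,0,0,0,0,0]
  const vector = new Array(numClasses).fill(0);
  vector[label] = 1;
  return vector;
}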

Step 2: Building the MLP Model

An MLP typically has multiple layers of neurons. For this task, we’ll build a model with:

  1. Input Layer: 784 neurons (one for each pixel in the flattened image).
  2. Hidden Layer: 64 neurons. This layer uses an activation function called ReLU (Rectified Linear Unit). ReLU introduces non-linearity to the model, which is important for learning complex patterns in the data.
  3. Output Layer: 10 neurons, one for each possible digit (0–9). This layer uses the softmax activation function, which converts the raw output into probabilities that sum to 1. The class with the highest probability is chosen as the predicted label.

Pseudo code for model structure (Full Code in “Resources”):

MLP structure:
  Input Layer:  784 neurons
  Hidden Layer: 64 neurons, activation = ReLU
  Output Layer: 10 neurons, activation = Softmax
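
Before the network can learn anything, its weights and biases need initial values; a common starting point is small random weights and zero biases. The sketch below is an assumption for illustration (createMLP and its shape conventions are not the course's actual code). It stores one weight array per neuron so the later examples can stay simple:

function createMLP(inputSize = 784, hiddenSize = 64, outputSize = 10) {
  // Small random weights break the symmetry between neurons
  const randomMatrix = (rows, cols) =>
    Array.from({ length: rows }, () =>
      Array.from({ length: cols }, () => (Math.random() - 0.5) * 0.1)
    );
  return {
    hiddenWeights: randomMatrix(hiddenSize, inputSize),
    hiddenBiases: new Array(hiddenSize).fill(0),
    outputWeights: randomMatrix(outputSize, hiddenSize),
    outputBiases: new Array(outputSize).fill(0),
  };
}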

What is ReLU?

The ReLU activation function is defined as f(x) = max(0, x). It introduces non-linearity into the network; without it, the MLP would collapse into a linear model, which couldn’t solve complex problems. ReLU sets all negative values to 0 and passes positive values through unchanged, which helps the network focus on important features.

Softmax ensures that the output values represent probabilities, i.e., values between 0 and 1 that sum to 1.
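
Both functions are small enough to write by hand. A minimal JavaScript sketch could look like this (subtracting the maximum inside softmax is a standard trick to keep the exponentials from overflowing):

const relu = (x) => Math.max(0, x);

function softmax(values) {
  // Shift by the max value for numerical stability, then normalize
  const max = Math.max(...values);
  const exps = values.map((v) => Math.exp(v - max));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}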

Step 3: Forward Propagation (Making Predictions)

Forward propagation is the process by which the network makes a prediction. During this step, the input data is passed through the network, and the model computes the output probabilities.

  1. Input to Hidden Layer: The input data (a flattened image) is multiplied by the hidden layer’s weights, the bias is added, and the result is passed through the ReLU activation function. This produces the hidden layer’s activation values.
  2. Hidden Layer to Output Layer: The output of the hidden layer is passed to the output layer, where a similar process occurs. The final output is passed through the softmax activation function, which produces a probability distribution across the 10 possible classes (0–9).

Pseudo code for forward propagation:

Forward pass:
  for each neuron in hidden layer:
    hidden_sum = sum(input[i] * hidden_weight[i]) + hidden_bias
    hidden_activation = ReLU(hidden_sum)

  for each neuron in output layer:
    output_sum = sum(hidden_activation[j] * output_weight[j]) + output_bias

  output_probabilities = Softmax(all output_sums)   # softmax is applied to the whole output vector

The result of forward propagation is the network’s prediction, which is a probability distribution across the 10 possible digits.
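
Putting the pieces together, a forward pass over the illustrative model from the earlier sketches could look like this (forwardPass is an assumed helper name, not the course API):

function forwardPass(model, input) {
  // Hidden layer: weighted sum + bias, then ReLU
  const hidden = model.hiddenWeights.map((weights, i) => {
    const sum = weights.reduce((acc, w, j) => acc + w * input[j], model.hiddenBiases[i]);
    return relu(sum);
  });

  // Output layer: weighted sum + bias, then softmax over all 10 scores
  const scores = model.outputWeights.map((weights, i) =>
    weights.reduce((acc, w, j) => acc + w * hidden[j], model.outputBiases[i])
  );
  return { hidden, probabilities: softmax(scores) };
}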

Step 4: Backpropagation (Learning from Errors)

Backpropagation is the learning phase of the network. After making a prediction in the forward pass, the network calculates the error by comparing the prediction to the actual label. It then “propagates” this error backward through the network to update the weights and biases, minimizing future errors.

  1. Calculate the Error: The error (or loss) is calculated by comparing the predicted output with the actual label. For classification the standard choice is cross-entropy loss, although simpler implementations sometimes use Mean Squared Error (MSE); both measure how far the predictions are from the correct values.
  2. Compute Gradients: Backpropagation computes the gradients, i.e., the partial derivatives of the loss function with respect to each weight and bias. Each gradient tells us how much, and in which direction, to change a weight to reduce the error; gradient descent then uses these gradients to perform the update.
  3. Update Weights and Biases: Once the gradients are computed, the network updates the weights and biases to reduce the error. The learning rate controls how large these updates are. A smaller learning rate results in slower but more precise adjustments, while a larger learning rate speeds up learning but can overshoot the optimal weights.

Pseudo code for backpropagation:

Backward pass:
  for each output neuron:
    output_error = predicted - target
    compute gradient for output weights and biases using the output error

  for each hidden neuron:
    hidden_error = sum(output_errors * output_weights) * ReLU_derivative(hidden_sum)
    compute gradient for hidden weights and biases using the hidden error

Update weights:
  for each weight and bias:
    weight = weight - learning_rate * gradient

This process is repeated for each training sample. Over time, as the network adjusts its weights and biases, it learns to make better predictions.
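
Here is a sketch of one training step on the illustrative model, assuming a softmax output with cross-entropy loss so that the output error reduces to predicted minus target, as in the pseudo code above (trainOnExample is an assumed helper name):

function trainOnExample(model, input, target, learningRate = 0.1) {
  const { hidden, probabilities } = forwardPass(model, input);

  // Output error: for softmax with cross-entropy this is simply (predicted - target)
  const outputErrors = probabilities.map((p, i) => p - target[i]);

  // Hidden error: propagate the output errors backward and apply the ReLU derivative
  const hiddenErrors = hidden.map((h, j) => {
    const backSum = outputErrors.reduce(
      (acc, err, i) => acc + err * model.outputWeights[i][j], 0
    );
    return h > 0 ? backSum : 0; // ReLU derivative: 1 if the neuron was active, else 0
  });

  // Gradient descent step on the output layer
  outputErrors.forEach((err, i) => {
    model.outputWeights[i] = model.outputWeights[i].map(
      (w, j) => w - learningRate * err * hidden[j]
    );
    model.outputBiases[i] -= learningRate * err;
  });

  // Gradient descent step on the hidden layer
  hiddenErrors.forEach((err, i) => {
    model.hiddenWeights[i] = model.hiddenWeights[i].map(
      (w, j) => w - learningRate * err * input[j]
    );
    model.hiddenBiases[i] -= learningRate * err;
  });
}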

Step 5: Training the MLP

The training process involves running forward propagation and backpropagation for multiple epochs (iterations over the entire dataset). With each epoch, the network becomes better at making accurate predictions.

  1. Epochs: An epoch refers to one complete pass through the entire training dataset. Typically, training is done for several epochs (e.g., 30–50).
  2. Mini-batches: Instead of updating the weights after every single training example (which can be slow), the dataset is divided into smaller mini-batches. The weights are updated after each mini-batch, speeding up training while still making accurate adjustments.

Pseudo code:

Train the model:
  for each epoch:
    shuffle training data
    for each mini-batch:
      perform forward propagation
      compute the loss
      perform backpropagation
      update the weights and biases
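
A minimal training loop over the illustrative helpers might look like this; for brevity it updates the weights after every example inside a batch, whereas a strict mini-batch step would average the gradients across the batch before updating:

function shuffleIndices(length) {
  // Fisher-Yates shuffle of the indices 0..length-1
  const indices = Array.from({ length }, (_, i) => i);
  for (let i = indices.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  return indices;
}

function train(model, images, labels, epochs = 30, batchSize = 32) {
  for (let epoch = 0; epoch < epochs; epoch++) {
    const order = shuffleIndices(images.length);
    for (let start = 0; start < order.length; start += batchSize) {
      const batch = order.slice(start, start + batchSize);
      batch.forEach((i) => trainOnExample(model, images[i], labels[i]));
    }
  }
}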

Step 6: Testing the Model

Once the model is trained, it’s important to evaluate its performance on unseen data (the test set). This allows us to measure how well the model generalizes to new data.

Testing the model involves running forward propagation on the test set and calculating the accuracy. The model’s accuracy is the percentage of test samples it correctly classifies.

Pseudo code for testing:

Test the model:
  correct_predictions = 0
  for each test input:
    perform forward propagation
    if predicted class == actual class:
      correct_predictions++

  accuracy = correct_predictions / total_test_samples
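
And a matching evaluation sketch, assuming the test labels are kept as plain digits (0–9) rather than one-hot vectors:

function evaluate(model, testImages, testLabels) {
  let correct = 0;
  testImages.forEach((image, i) => {
    const { probabilities } = forwardPass(model, image);
    // The predicted class is the index with the highest probability
    const predicted = probabilities.indexOf(Math.max(...probabilities));
    if (predicted === testLabels[i]) correct++;
  });
  return correct / testImages.length;
}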

Why Use MLPs for Image Classification?

MLPs are an excellent starting point for building neural networks, especially for beginners. While they are powerful enough to handle various tasks, including image classification, they are not as efficient for image data as Convolutional Neural Networks (CNNs). Still, mastering MLPs is crucial because they lay the groundwork for understanding more advanced architectures. Some key benefits include:

  • Simplicity: MLPs are relatively straightforward to implement, making them a great choice for learning about neural networks.
  • Versatility: They can be used for a wide range of tasks beyond image classification, such as regression and natural language processing.
  • Foundation for Deep Learning: Understanding how MLPs work prepares you to tackle more advanced models like CNNs, RNNs, and deep learning frameworks.

Stepping Beyond MLPs

Although MLPs are powerful for learning basic neural network concepts, they are less efficient for image data, where spatial relationships between pixels are important. This is where Convolutional Neural Networks (CNNs) come in. CNNs are designed to take advantage of the 2D structure of images and are much better suited for large-scale image classification tasks.

Learn More in My Course

In this blog, we’ve covered the basics of building a Multi-Layer Perceptron for image classification.

To dive deeper, check out my complete course on neural networks.
