Multi-Layer Perceptron (MLP) in PyTorch

Xinhe Zhang
Deep Learning Study Notes
Dec 26, 2019

--

Tackle MLP!

Last time, we reviewed the basic concept of MLP. Today, we will work on an MLP model in PyTorch. Specifically, we are building a very, very simple MLP model for the Digit Recognizer challenge on Kaggle, with the MNIST data set.

Disclaimer

This is not a tutorial or study reference. In Fall 2019 I took an introduction to deep learning course, and I want to document what I learned before it leaves my head. If you spot any mistakes, please let me know; I will appreciate your help and then fix them. Also, I will not post any code I wrote while taking the course.

Get the Files Ready in Place

Download the data from Kaggle. Next, unzip the train and test data sets. I unzipped them to a folder named data. I used Google Drive and Colab, so in the end my file structure looks like this:

My file hierarchy in Google Drive

(Optional) Download data from Kaggle to Google Drive on Colab

First, follow the Kaggle API documentation and download your kaggle.json. Upload this kaggle.json to your Google Drive. Remember to change the kaggle.json path in the script to where you actually stored it.
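The original script embed is not shown here, but a Colab cell for this step could look roughly like the sketch below. The Drive folder name kaggle is an assumption; point KAGGLE_CONFIG_DIR at wherever you actually uploaded kaggle.json.

```python
# A sketch of a Colab setup cell -- only runs inside Colab, and assumes
# kaggle.json was uploaded to a (hypothetical) Drive folder named `kaggle`.
import os
from google.colab import drive  # available only in the Colab runtime

drive.mount('/content/gdrive')

# Point the Kaggle CLI at the uploaded credentials.
os.environ['KAGGLE_CONFIG_DIR'] = '/content/gdrive/My Drive/kaggle'

# Download and unzip the competition files into a folder named `data`.
os.system('kaggle competitions download -c digit-recognizer -p data')
os.system('unzip -o data/digit-recognizer.zip -d data')
```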

Read the Data from CSV to numpy array

We are using pd.read_csv from the pandas library. It is a nice utility function that reads a CSV file into a pandas DataFrame; remember to call .values at the end to get the underlying numpy array.

In this challenge, we are given the train and test data sets. In the train data set, there are 42,000 hand-written images of size 28x28. The first column of the CSV is the digit the image represents (we call this the ground truth, or label), and the rest are 28x28=784 pixel values in the range [0, 255]. The test data set contains 28,000 entries and it does not have the ground truth column, because it is our job to figure out what the label actually is.
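As a minimal sketch of this step, here is the same call on a tiny inline two-row CSV instead of the real data/train.csv:

```python
import io
import pandas as pd

# A tiny stand-in for data/train.csv: a label column first, then the
# pixel columns (only two of the 784 shown here). In the real notebook
# you would pass the file path to pd.read_csv instead of a StringIO.
csv_text = "label,pixel0,pixel1\n5,0,255\n3,128,64\n"

train = pd.read_csv(io.StringIO(csv_text)).values  # .values -> numpy array
labels = train[:, 0]   # first column: the ground-truth digit
pixels = train[:, 1:]  # remaining columns: pixel values in [0, 255]
print(labels)          # [5 3]
```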

If we were not keeping this demonstration simple, we would also split the train data set into an actual train set and a validation/dev set. With this separate group of data, we can test our model's performance during training. And since the model is never trained on it, it gives us a sense of how the model would perform in general. We can't achieve this effect with the train data alone, because during training the model gets more and more overfitted to the train set.
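Such a split can be sketched in a few lines of numpy; the 90/10 ratio and the small random stand-in array below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend `train` is the (42000, 785) array read from the CSV; here we
# use a small random stand-in so the snippet is self-contained.
train = rng.integers(0, 256, size=(100, 785))

# Shuffle the row indices, then hold out the last 10% as validation/dev.
indices = rng.permutation(len(train))
split = int(0.9 * len(train))
train_part = train[indices[:split]]
val_part = train[indices[split:]]
print(train_part.shape, val_part.shape)  # (90, 785) (10, 785)
```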

Prepare the data with PyTorch

Ultimately, we want to create the data loader. But to obtain the data loader, we first need a dataset. The dataset makes direct contact with our freshly read data and processes it on the fly, while the data loader does the labor of loading the data when we need it. The data loader asks the dataset for a batch of data at a time, and the dataset pre-processes that batch only, not the entire data set. There's a trade-off between pre-processing all the data beforehand and processing it when you actually need it.

To customize our own datasets, we define TrainDataset and TestDataset, which inherit from PyTorch's Dataset. We separate the train and test dataset classes because their __getitem__ outputs differ. Alternatively, we could save a flag in __init__ that indicates how many outputs there are for the corresponding class instance.
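Since the original embeds are not shown, here is an independent sketch of what such dataset classes could look like, with the on-the-fly processing described above:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class TrainDataset(Dataset):
    """Wraps the (N, 785) train array; column 0 holds the label."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Process on the fly: cast pixels to float and scale to [0, 1].
        pixels = torch.from_numpy(self.data[idx, 1:]).float() / 255.0
        label = int(self.data[idx, 0])
        return pixels, label

class TestDataset(Dataset):
    """Wraps the (N, 784) test array; no label column, so __getitem__
    returns only the pixels."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(self.data[idx]).float() / 255.0

# A tiny synthetic stand-in for the real CSV data.
train_ds = TrainDataset(np.arange(3 * 785).reshape(3, 785) % 256)
pixels, label = train_ds[0]
print(pixels.shape, pixels.dtype, label)
```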

We divided the pixel values by 255.0. This step does two things: 1. it converts the values to float; 2. it normalizes the data to the range of [0, 1]. Normalization is a good practice.

We also shuffled our train data when building the data loader. This randomness helps training because otherwise the model would see the samples in the same order every epoch and get stuck in the same pattern.

Batch size depends on the capability of our GPU and our configuration of the other hyperparameters. I like to use a batch size of 2 when debugging my model. Yes, unfortunately, we will need to debug the model sometimes if we want to build our own wheels, and it is not an easy task. During actual training, I find values between 16 and 512 sensible.
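Putting the shuffling and batch size together, building the train loader could look like this sketch (the stand-in tensors and the batch size of 64 are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors for the normalized pixels and their labels.
pixels = torch.rand(100, 784)
labels = torch.randint(0, 10, (100,))

# shuffle=True randomizes the sample order every epoch; batch_size is
# the knob discussed above (2 for debugging, 16-512 for real runs).
train_loader = DataLoader(TensorDataset(pixels, labels),
                          batch_size=64, shuffle=True)

batch_pixels, batch_labels = next(iter(train_loader))
print(batch_pixels.shape)  # torch.Size([64, 784])
```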

Model

Finally, the model!

Ok, this model is a very simple one. But it is not so naive: it actually achieves 91.2% accuracy in this Kaggle challenge, though two thousand contestants scored better.

In this model, we have 784 input units and 10 output units, because we have 784 input pixels and 10 output digit classes. In PyTorch, that's represented as nn.Linear(input_size, output_size). Note that we don't have a hidden layer in this example.

We also defined an optimizer here. The optimizer adjusts the model's parameters to find a minimum of the loss.

We are using the CrossEntropyLoss function as our criterion here. The criterion tells the model how well it performed.
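A sketch of these three pieces together; the choice of SGD and its learning rate are assumptions, since the original embed is not shown:

```python
import torch
import torch.nn as nn

# The single-layer model: 784 input pixels in, 10 digit classes out,
# with no hidden layer in between.
model = nn.Linear(784, 10)

# Optimizer and criterion. SGD with lr=0.1 is an assumption here; any
# torch.optim optimizer would fit the same slot.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# A forward pass on a fake batch of 4 images.
logits = model(torch.rand(4, 784))
print(logits.shape)  # torch.Size([4, 10])
```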

Train the model

Epochs are just how many times we would like the model to see the entire train data set. During each epoch, we iterate through the data loader in mini-batches.

We let the model take a small step on each batch. To do so, we clear the previous gradients with optimizer.zero_grad() before the step, then call loss.backward() and optimizer.step().

Notice that for all variables we have variable = variable.to(device). This ensures all tensors stay on the same device, either the CPU or the GPU, not both, because PyTorch does not support computation across devices.
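The whole loop, on a tiny synthetic stand-in for the train loader, could be sketched as follows (the epoch count, batch size, and optimizer settings are assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Tiny synthetic stand-in for the real train loader.
loader = DataLoader(TensorDataset(torch.rand(32, 784),
                                  torch.randint(0, 10, (32,))),
                    batch_size=8, shuffle=True)

num_epochs = 2  # how many times the model sees the entire train set
for epoch in range(num_epochs):
    for pixels, labels in loader:
        # Keep every tensor on the same device as the model.
        pixels, labels = pixels.to(device), labels.to(device)
        optimizer.zero_grad()                    # clear old gradients
        loss = criterion(model(pixels), labels)  # forward pass
        loss.backward()                          # compute gradients
        optimizer.step()                         # take a small step
print(loss.item())
```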

Get the predictions

Yay. We are finally getting the results.

This is also called the inference step. It looks a lot like the training process, except we are not taking the backward steps now. Also, we can wrap it in torch.no_grad(), which skips the gradient bookkeeping, freeing up memory and speeding up the process.
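A sketch of the inference step, again on a fake test batch since the real test loader is not shown here:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
model.eval()  # switch layers like dropout/batch norm to eval behavior

test_pixels = torch.rand(5, 784)  # stand-in for a test batch
with torch.no_grad():             # no gradients needed at inference time
    logits = model(test_pixels)
    preds = logits.argmax(dim=1)  # the predicted digit for each image
print(preds.shape)  # torch.Size([5])
```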

An actual MLP

In the model above we do not have a hidden layer. So here is an example of a model with 512 hidden units in one hidden layer.
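Such a model could be sketched as below; the ReLU activation between the layers is an assumption, since the original embed does not show which activation was used:

```python
import torch
import torch.nn as nn

# An MLP with one hidden layer of 512 units. The ReLU in the middle is
# an assumed choice of activation.
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

out = model(torch.rand(4, 784))
print(out.shape)  # torch.Size([4, 10])
```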

The model has an accuracy of 91.8%, barely an improvement over the single-layer model. Inside an MLP there are a lot of multiplications that map the input domain (784 pixels) to the output domain (10 classes). By adding more layers inside the model, we are not fundamentally changing this underlying mapping, so performance won't improve by a lot. Instead, we introduce the risk of vanishing and exploding gradients.

Recap

We built a simple MLP model with PyTorch in this article. Without anything fancy, we got an accuracy of 91.2% on the MNIST digit recognition challenge. Not a bad start.

Reference

The PyTorch master documentation for torch.nn.

The Kaggle digit recognizer challenge.

Thank you for reading. See you next time.


Full-time student and research assistant at Carnegie Mellon University | part-time cook at home.