Why You Need to Learn PyTorch’s Powerful DataLoader

Akmel Syed
Dec 31, 2020 · 6 min read

In case you missed it, in a previous article we went over batch gradient descent in depth and saw how it vastly improved the vanilla gradient descent approach. In this article, we'll revisit batch gradient descent, but this time we'll take advantage of PyTorch's powerful Dataset and DataLoader classes. By the end of this article, you'll be convinced to never go back to a life of deep learning without PyTorch's DataLoader.

Before we begin, we'll rerun the steps we performed in the previous batch gradient descent article. Just like last time, we'll use the Pima Indians Diabetes dataset, set aside 33% of it for testing, standardize it and set the batch size to 64. We'll also keep the same neural network architecture: one hidden layer of size 4.

I've labeled the code cells with very high-level titles, but if you wish to see an in-depth explanation of the code below, please refer to the previous article where we introduced batch gradient descent.

Import libraries

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch.nn as nn

Import data and standardize

df = pd.read_csv(r'https://raw.githubusercontent.com/a-coders-guide-to-ai/a-coders-guide-to-neural-networks/master/data/diabetes.csv')
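
For completeness, here's a rough sketch of the rest of that preprocessing (the 33% split and standardization described above). The random_state below is just an illustrative choice; see the previous article for the exact cell.

X = df[df.columns[:-1]].values
y = df[df.columns[-1]].values

# set aside 33% of the data for testing (random_state is illustrative)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# standardize the features using statistics from the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# convert everything to tensors for PyTorch
X_train, X_test = torch.tensor(X_train), torch.tensor(X_test)
y_train, y_test = torch.tensor(y_train), torch.tensor(y_test)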

Create neural network architecture

class Model(nn.Module):

    def __init__(self):
        super().__init__()
        self.hidden_linear = nn.Linear(8, 4)
        self.output_linear = nn.Linear(4, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        hidden_output = self.sigmoid(self.hidden_linear(X))
        output = self.sigmoid(self.output_linear(hidden_output))
        return output

Make variables which will be reused

def accuracy(y_pred, y):
    # threshold the predictions at 0.5, then return the fraction that match the labels
    return torch.sum((((y_pred>=0.5)+0).reshape(1,-1)==y)+0).item()/y.shape[0]
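
The other variables we'll reuse are defined in this same cell. The batch size of 64 comes straight from the setup above; the epoch count and learning rate below are only placeholders, so plug in whatever values you used in the previous article.

batch_size = 64
epochs = 1000   # placeholder; use the value from the previous article
lr = 0.01       # placeholder; use the value from the previous article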

Instantiate Model class and set loss and optimizer

model = Model()
BCE = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)

Run batch gradient descent without PyTorch’s DataLoader

import numpy as np
# number of full batches; the -1 drops the last, partial batch
train_batches = int(np.ceil(len(X_train)/batch_size))-1
test_batches = int(np.ceil(len(X_test)/batch_size))-1
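
And here's roughly what the manual loop looks like when we slice the batches by hand. Treat it as a sketch that follows the pattern from the previous article rather than the exact cell.

for epoch in range(epochs):

    iteration_loss = 0.
    iteration_accuracy = 0.

    model.train()
    for i in range(train_batches):
        # slice the current batch out of the training tensors by hand
        X = X_train[i*batch_size:(i+1)*batch_size]
        y = y_train[i*batch_size:(i+1)*batch_size]

        y_pred = model(X.float())
        loss = BCE(y_pred, y.reshape(-1,1).float())

        iteration_loss += loss.item()
        iteration_accuracy += accuracy(y_pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print('Train: epoch {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/train_batches, iteration_accuracy/train_batches))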

Using PyTorch’s DataLoader Class

Sorry for the chunks of code above before starting the topic at hand. The above is so that we can compare the results with and without PyTorch’s DataLoader class.

Let’s get right into it. We’re going to import our required classes before moving forward.

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

You’re going to see above that we imported the DataLoader class, but along with it, we also imported the Dataset class. This is because the DataLoader class accepts data in the form of a Dataset object.

To get our data in the form of a Dataset object, we're going to create a custom class which inherits from the Dataset class. Other than defining an __init__ method, inheriting from the Dataset class requires us to also override the __getitem__ and __len__ methods. The __len__ method returns the length of our Dataset object and the __getitem__ method returns the (X, y) pair at a given index.

Let’s write the code for the class.

Note: Something you'll notice is that I'm not preprocessing any of my data in the class; rather, I'm reusing all of my preprocessing from above. Due to the standardization and the splitting up of the data for train/test, I was unsure how to accomplish those steps in a clean way inside the class. I would love to get some advice in case I've been using the Dataset class the wrong way.

class PimaIndiansDiabetes(Dataset):
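
    # a minimal sketch of the class body, following the requirements above:
    # keep hold of the X and y passed in, return the (X, y) pair at a given
    # index from __getitem__, and report the number of records from __len__

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return len(self.X)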

Due to the preprocessing being done above, the class is very straightforward. It’s really just acting as a wrapper for our training and testing datasets, allowing them to be in a format acceptable for the DataLoader class.

We’ll continue by making Dataset objects for both the training and testing data.

train_data = PimaIndiansDiabetes(X_train, y_train)
test_data = PimaIndiansDiabetes(X_test, y_test)
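
As a quick sanity check, the Dataset objects now behave like indexable collections: len() calls the __len__ method we overrode and indexing calls __getitem__.

print(len(train_data))   # 514 training records
print(train_data[0])     # the first (X, y) pair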

Finally, time to use the DataLoader class.

What we have below is the creation of two DataLoader objects: one for the training data and the other for the testing data. The DataLoader class has many parameters we can pass; let's talk about the arguments we're passing here.

batch_size — this one is self-explanatory. The DataLoader creates the batches for us to iterate through, so we no longer have to care about slicing the data to retrieve them ourselves.

shuffle — this shuffles our data and, more importantly, reshuffles it every epoch, so each batch is a different random set of 64 records every time. This helps with generalization.

drop_last — this is something I only set to True when I also have shuffle set to True. It drops the last, non-full batch. If you noticed, our training set has 514 records; divide that by 64 and you'll see that the last batch would contain only 2 records, which is too small a sample to give a meaningful update to the model.

There's one more parameter which I haven't set but do wish to bring up: num_workers. We're not making use of it here because our dataset is small and we're training on the CPU, but num_workers spawns that many worker processes to load batches in parallel. When data loading is the bottleneck (for example, when feeding a GPU), it can significantly improve the speed of the training process.
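
If you do want to experiment with it, it's just one more argument to pass (the 4 below is only an illustrative value):

train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=4)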

To read more about the DataLoader class and its capabilities, I suggest you head on over to PyTorch’s documentation and have a look at its capabilities: https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader

train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(dataset=test_data, batch_size=batch_size, shuffle=True, drop_last=True)

Let’s reset our model, along with the loss and optimizer.

model = Model()
BCE = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)

Let’s run our model again, but this time, using the DataLoader class.

for epoch in range(epochs):

    iteration_loss = 0.
    iteration_accuracy = 0.

    model.train()
    for i, data in enumerate(train_loader):
        X, y = data
        y_pred = model(X.float())
        loss = BCE(y_pred, y.reshape(-1,1).float())

        iteration_loss += loss.item()
        iteration_accuracy += accuracy(y_pred, y)

        # backpropagate and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print('Train: epoch {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))
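
    # a sketch of the evaluation pass that produces the test-side numbers
    # discussed below: the same pattern as the training loop, but in eval
    # mode and with gradients disabled
    model.eval()
    iteration_loss = 0.
    iteration_accuracy = 0.
    with torch.no_grad():
        for i, data in enumerate(test_loader):
            X, y = data
            y_pred = model(X.float())
            loss = BCE(y_pred, y.reshape(-1,1).float())
            iteration_loss += loss.item()
            iteration_accuracy += accuracy(y_pred, y)
    print('Test: epoch {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))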

Cool stuff! You can see that our code is much cleaner when using the DataLoader class, and on top of that, our results are slightly better too: both the accuracy and the loss improve a little. It's the combination of simple tricks like these that allows us to stand out in a crowd.

That concludes our little bit with the DataLoader class. Hopefully, you’re convinced as to why you need to add this to your arsenal when implementing neural networks.

As always, you can run the code in Google Colab — https://cutt.ly/cg2ai-pytorch-dataloader-colab

