Creating a custom Neural Network with PyTorch

João Victor Aquino Batista
Academy@EldoradoCPS
5 min read · Sep 16, 2020

Hello, today I’m going to be talking about PyTorch, an optimized tensor library for deep learning using GPUs and CPUs. It’s all based on Python, and it has a lot of key features and capabilities that make it really powerful!

I’ll walk you through a Google Colaboratory notebook in which my friend Gabriel Teston and I created a custom neural network for object detection (in this case, we are detecting a hand and the letter of the American Sign Language alphabet it’s representing). In this article, I’ll cover the implementation of the Dataset as well as the model.

First of all, we import a lot of modules which are essential for our code. They are:

PyTorch itself:

import torch

torch.nn — the basic building block for graphs:

import torch.nn as nn

torch.utils.data.DataLoader class — it represents a Python iterable over a dataset:

from torch.utils.data import DataLoader

torchvision — a package that consists of popular datasets, model architectures, and common image transformations for computer vision:

from torchvision import transforms

xml.etree.ElementTree — a library for reading XML files, i.e., our annotations:

import xml.etree.ElementTree as ET

Now, we need to create a class that will represent our dataset; in other words, we need something that gets all of our data and organizes it the way we want.

So, we define ASLDataset (a sketch of the full class follows the method descriptions below):

We define the method __init__, which will initialize our dataset with its path and some image transformations from torchvision.

We also need to implement the method __len__, which returns the length of our dataset.

Last but not least, we implement the method __getitem__, which is responsible for returning a specific item of our dataset. It is the most complex of the three, but it’s still easy to understand. First, we get the xml file for that specific item and extract the bounding box and the class name from it. We define the class index number from the string index of the class name: “spa” and “del” (which represent space and delete) take class indices 0 and 1, and we subtract 8 from the letters’ string indices so that “a” (whose string index is 10) becomes class index 2, right after them. Then, we open the associated image and apply a series of transformations to it, so we can work with a representation of the image made only of tensors. We also create a dictionary with the annotation info and return both the image and the annotation.
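Since the notebook’s code isn’t embedded in this article, here is a minimal sketch of what such a class could look like. It assumes one Pascal VOC-style .xml annotation per .jpg image in a single folder, and that the “string index” mentioned above is the character’s position in Python’s string.printable (where “a” is indeed at index 10); the actual notebook may differ in the details.

import os
import string
import xml.etree.ElementTree as ET

import torch
from PIL import Image
from torch.utils.data import Dataset

class ASLDataset(Dataset):
    # Hypothetical sketch: one Pascal VOC-style .xml annotation per .jpg image.
    CLASSES = ["spa", "del"]  # space and delete come first (indices 0 and 1)

    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # One entry per annotation file; each one describes a single image.
        self.files = sorted(f[:-4] for f in os.listdir(root) if f.endswith(".xml"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        name = self.files[idx]
        # Read the bounding box and class name from the annotation file.
        obj = ET.parse(os.path.join(self.root, name + ".xml")).getroot().find("object")
        label = obj.find("name").text
        box = obj.find("bndbox")
        bbox = torch.tensor([float(box.find(t).text)
                             for t in ("xmin", "ymin", "xmax", "ymax")])
        # "spa" and "del" map to 0 and 1; letters map through their position in
        # string.printable ("a" sits at index 10, so subtracting 8 makes it 2).
        if label in self.CLASSES:
            class_idx = self.CLASSES.index(label)
        else:
            class_idx = string.printable.index(label) - 8
        # Open the image and let the transforms turn it into a tensor
        # (rescaling the box to match the resized image is omitted for brevity).
        img = Image.open(os.path.join(self.root, name + ".jpg")).convert("RGB")
        if self.transforms is not None:
            img = self.transforms(img)
        annotation = {"bbox": bbox, "class": class_idx}
        return img, annotation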

We define the transformations like this:
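Something along these lines; note that the mean and standard deviation values shown are the widely used ImageNet statistics, which is my assumption here (the notebook may use different numbers).

img_transforms = transforms.Compose([
    transforms.Resize((416, 416)),  # make every image 416x416 pixels
    transforms.ToTensor(),          # PIL image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # per-channel normalization
                         std=[0.229, 0.224, 0.225]),  # (ImageNet statistics)
])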

In this composition, we resize all the images to 416x416 pixels, transform them to tensors, and normalize each tensor image with a mean and standard deviation, so we don’t have to worry about images taken under different physical conditions (brightness, exposure, etc.). You can find more about torchvision transforms here. You can also learn more about tensors here.

Then, we can initialize our dataset like this:

dataset = ASLDataset("/content/drive/My Drive/ASLDataset/all", img_transforms)

Now that we have our dataset, we can split it into training, validation, and testing sets. To achieve this, we define the “size” of each one (70% for training, 10% for testing, and 20% for validation):

dataset_len = len(dataset)

train_len = int(dataset_len*0.7)

test_len = int((dataset_len - train_len)*1/3)

val_len = dataset_len - train_len - test_len

And make a random split:

train_dataset, test_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_len, test_len, val_len])

Now that we have created our splits, we need to load them into something that Python can iterate over, so we use a DataLoader for each one. You can read more about it here.

BATCH_SIZE = 300

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, pin_memory=True)

test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, pin_memory=True)

val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, pin_memory=True)
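Just to make the “iterable” part concrete, here is what pulling one batch out of a loader looks like (the annotation keys “bbox” and “class” come from the dataset sketch earlier, so they are my assumption rather than necessarily the notebook’s):

images, annotations = next(iter(train_loader))
print(images.shape)               # e.g. torch.Size([300, 3, 416, 416])
print(annotations["bbox"].shape)  # e.g. torch.Size([300, 4])
print(annotations["class"][:5])   # the first five class indices in the batch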

That’s pretty much it for our dataset (quite a few steps, huh?).

Now, let’s dive into the Model!

To create the model, you’ll have to define another class, say ASLModel (a sketch follows the description below):

This model needs to first detect where a hand is and then predict which letter from the ASL alphabet it is representing. For our class, we need to define the method __init__ with two parameters: how many classes the model will be able to detect and the shape of its input. We also need to choose every operation that it will perform, assigning each one to an attribute. For example, our feature extraction (self.feature_extractor) is based on convolutions (Conv2d — applies a 2D convolution over an input signal composed of several input planes), pooling (MaxPool2d — applies a 2D max pooling over an input signal composed of several input planes) and non-linear activation (ReLU — applies the rectified linear unit function element-wise). I won’t discuss these operations’ parameters here, both because you can understand them better by reading the docs and because there is no exact recipe: it comes down to experimentation and seeing what gives you a more accurate result. To sum up, we manipulate the image with some operations (Conv2d, MaxPool2d, Linear and Flatten) and then call an activation function (ReLU, Sigmoid). You can use any operation that torch.nn offers, and it’s up to you to choose, so I strongly recommend studying what you’re trying to achieve and what is better for your project (there are a lot of papers on creating neural networks and their architectures, like this one).

In this model, we first extract features from the images and then build an inner representation of the image, which gives the model an abstract representation of the hand and its region to reason about. From that representation, the model predicts the bounding box: four coordinates, the minimum “x” and “y” as well as the maximum “x” and “y” of the region in which it thinks there is a hand. Finally, it analyzes what’s inside the bounding box defined by those coordinates and predicts which class (letter) it belongs to.

The other method that we have to create is forward, which is responsible for taking an input image (or a batch of them) and passing it through the operations we defined, such as the feature extractor. It returns the classification and the bounding box for each item. You can think of it as information being passed from one neuron to another.
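Again, here is a rough sketch of what such a class could look like. The specific layer counts, channel sizes, and the 256-unit hidden dimension are placeholder choices of mine, not the notebook’s actual architecture; only the overall structure (feature extractor, inner representation, then a bounding-box head and a classification head) follows the description above.

class ASLModel(nn.Module):
    # Hypothetical sketch: layer sizes are illustrative placeholders.
    def __init__(self, input_shape, n_classes):
        super().__init__()
        channels, height, width = input_shape
        # Feature extraction: stacked Conv2d + ReLU + MaxPool2d blocks.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 416 -> 208
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 208 -> 104
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 104 -> 52
        )
        # Inner (abstract) representation of the image.
        self.representation = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * (height // 8) * (width // 8), 256),
            nn.ReLU(),
        )
        # Bounding-box head: predicts (xmin, ymin, xmax, ymax); the Sigmoid
        # keeps the coordinates as fractions of the image size (an assumption).
        self.bbox_head = nn.Sequential(nn.Linear(256, 4), nn.Sigmoid())
        # Classification head: one logit per class.
        self.class_head = nn.Linear(256, n_classes)

    def forward(self, x):
        features = self.feature_extractor(x)
        rep = self.representation(features)
        return self.class_head(rep), self.bbox_head(rep)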

Now you can create a model by just typing:

model = ASLModel((3, 416, 416), 28).cuda()

Note that for the input parameter we pass (3, 416, 416), which tells the model that its input has 3 channels (red, green and blue) and 416x416 dimensions. We also pass the number of classes (28): all 26 letters of the alphabet plus “space” and “delete”. One last thing to notice is the “.cuda()” at the end of the statement, which allocates the model on the GPU (for faster training 😬).

That’s it for the first article of this series. In the next one, I’ll talk a little bit more about the training loop, how to implement it, and some other things to consider when you are creating your own custom neural network!
