Traffic Sign Recognition

Built and trained a deep neural network to classify traffic signs

Barney Kim
8 min read · Mar 27, 2019

Introduction

This is my attempt to tackle the traffic sign classification problem with a deep neural network implemented in PyTorch (reaching 99.33% accuracy). The highlights of this solution are the data preprocessing, training with heavily augmented data, and the use of a Spatial Transformer Network.

The code can be found on my GitHub.

Dataset

The GTSRB dataset (German Traffic Sign Recognition Benchmark) is provided by the Institut für Neuroinformatik group. It was published for a competition held in 2011. Images are spread across 43 different types of traffic signs, with a total of 39,209 training examples and 12,630 test ones.

The pickled dataset summary:

  • Number of training examples = 34799
  • Number of validation examples = 4410
  • Number of testing examples = 12630
  • Image data shape = (32, 32)
  • Number of classes = 43

Let’s visualize the German Traffic Signs Dataset using the pickled file.

Issues with data

There are numerous issues with this dataset which reflect the real-world problems faced in actual traffic sign recognition.

  1. Class imbalance: The dataset is very unbalanced; some classes have around 2,000 samples while others have only about 200. This imbalance biases the model toward predicting the classes with more samples, since doing so yields better overall accuracy.
  2. High contrast variation: The images differ significantly in contrast and brightness. It is difficult even for a human to understand and classify some of these signs that are in near-total darkness.
  3. Small dataset: Even though we have about 35k training images, that is not quite enough to perform well in all general scenarios. More data also helps reduce overfitting.

Preprocessing

High contrast variation among the images calls for contrast normalization. The Contrast-Limited Adaptive Histogram Equalization (CLAHE) algorithm partitions the image into contextual regions and applies histogram equalization to each one. This evens out the distribution of grey values and thus makes hidden features of the image more visible. CLAHE seems to be a good algorithm for obtaining a good-looking image directly.

As Pierre Sermanet and Yann LeCun mentioned in their paper, using color channels didn't seem to improve things a lot, so I only used a single channel in my model. To create a grayscale image, I simply convert the image from RGB to YCbCr and use only the Y channel.

import cv2

class CLAHE_GRAY:
    def __init__(self, clipLimit=2.5, tileGridSize=(4, 4)):
        self.clipLimit = clipLimit
        self.tileGridSize = tileGridSize

    def __call__(self, im):
        # Keep only the luminance (Y) channel of the YCrCb image
        img_y = cv2.cvtColor(im, cv2.COLOR_RGB2YCrCb)[:, :, 0]
        clahe = cv2.createCLAHE(clipLimit=self.clipLimit,
                                tileGridSize=self.tileGridSize)
        img_y = clahe.apply(img_y)
        # Restore a trailing channel dimension: (H, W) -> (H, W, 1)
        img_output = img_y.reshape(img_y.shape + (1,))
        return img_output
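As a quick sanity check, here is a minimal usage sketch; the random array simply stands in for a real RGB sign image of shape (32, 32, 3):

import numpy as np

preprocess = CLAHE_GRAY()
image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)  # stand-in for a real sign image
gray = preprocess(image)
print(gray.shape)  # (32, 32, 1)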

Preprocessed images look like this:

Handling imbalanced dataset

Flipping

As I mentioned, the amount of training data is not enough for the model to classify well. So I used a couple of tricks to extend the training dataset by flipping. I noticed that there are many similarities between classes that I can exploit to enlarge my small dataset.

You might have noticed that some traffic signs are invariant to horizontal or vertical flipping. Some signs can be flipped either way. And some signs turn into a different class when flipped; for example, the turn-left sign flipped horizontally becomes the turn-right sign.

This simple trick extends the original 39,209 training examples to 63,538 images.
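Below is a minimal sketch of the flipping idea. The class ID lists are illustrative assumptions based on the GTSRB label ordering, not the exact lists from my code:

import numpy as np

# Hypothetical class groups (illustrative guesses, not the exact lists from the project)
flippable_horizontally = [11, 12, 13, 15, 17, 18, 22, 26, 30, 35]  # sign looks the same mirrored
cross_flippable = [(19, 20), (33, 34), (36, 37), (38, 39)]         # mirroring changes the class

def extend_by_flipping(X, y):
    """X: (N, H, W, C) images, y: (N,) labels."""
    X_ext, y_ext = [X], [y]
    for c in flippable_horizontally:
        X_ext.append(X[y == c][:, :, ::-1, :])   # mirror left-right, label unchanged
        y_ext.append(y[y == c])
    for a, b in cross_flippable + [(b, a) for a, b in cross_flippable]:
        X_ext.append(X[y == a][:, :, ::-1, :])   # mirroring class a produces class b
        y_ext.append(np.full(int((y == a).sum()), b, dtype=y.dtype))
    return np.concatenate(X_ext), np.concatenate(y_ext)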

Dataset class distribution after flipping

Augmentation

However, it is still not enough. The 43 classes are not equally represented. The relative frequency of some classes is significantly lower than the mean.

I expected that a balanced dataset would improve the results. So I built a jittered dataset by geometrically transforming (rotation, translation, shear mapping, scaling) the same sign pictures. This can be done easily in PyTorch by applying torchvision.transforms.

import PIL.Image
from torchvision import transforms

train_data_transforms = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomApply([
        transforms.RandomRotation(20, resample=PIL.Image.BICUBIC),
        transforms.RandomAffine(0, translate=(0.2, 0.2),
                                resample=PIL.Image.BICUBIC),
        transforms.RandomAffine(0, shear=20,
                                resample=PIL.Image.BICUBIC),
        transforms.RandomAffine(0, scale=(0.8, 1.2),
                                resample=PIL.Image.BICUBIC)
    ]),
    transforms.ToTensor()
])

WeightedRandomSampler

I tried using WeightedRandomSampler, which samples all of the classes in equal amounts, and passed it to PyTorch's data loader. This sampler samples elements with given probabilities (weights).

import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each sample by the inverse frequency of its class
class_count = np.bincount(dataset.labels)
weights = 1 / np.array([class_count[y] for y in dataset.labels])
sampler = WeightedRandomSampler(weights, 43 * 20000)  # draws samples with replacement
data_loader = DataLoader(dataset, batch_size=64, sampler=sampler)
Class distribution produced by the data loader

Training

Custom DataSet & DataLoader

The given dataset is in numpy array format and has been stored using pickle, a Python-specific format for serializing data. So let's create a class that inherits from the Dataset class.

torch.utils.data.Dataset is an abstract class representing a dataset. A Dataset can be anything that has a __len__ function (called by Python’s standard len function) and a __getitem__ function as a way of indexing into it.

import pickle
from torch.utils.data import Dataset

class PickledDataset(Dataset):
    def __init__(self, file_path, transform=None):
        with open(file_path, mode='rb') as f:
            data = pickle.load(f)
            self.features = data['features']
            self.labels = data['labels']
            self.count = len(self.labels)
            self.transform = transform

    def __getitem__(self, index):
        feature = self.features[index]
        if self.transform is not None:
            feature = self.transform(feature)
        return (feature, self.labels[index])

    def __len__(self):
        return self.count
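A short usage sketch; the pickle file names here are assumptions, so substitute your own paths:

train_dataset = PickledDataset('train.p', transform=train_data_transforms)
valid_dataset = PickledDataset('valid.p', transform=transforms.ToTensor())

train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
valid_loader = DataLoader(valid_dataset, batch_size=64, shuffle=False)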

Now, we can iterate over the created dataset with a torch.utils.data.DataLoader. However, if you’re lucky enough to have access to a CUDA-capable GPU, you can use it to speed up your code. So, I created WrappedDataLoader to move batches to the GPU.

import torch

class WrappedDataLoader:
    def __init__(self, dl, func):
        self.dl = dl
        self.func = func

    def __len__(self):
        return len(self.dl)

    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield self.func(*b)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_device(x, y):
    return x.to(device), y.to(device, dtype=torch.int64)

train_loader = WrappedDataLoader(train_loader, to_device)

Model

I implemented the original IDSIA MCDNN model with an extra batch normalization layer and a number of other modifications.

import torch.nn as nn
import torch.nn.functional as F

class TrafficSignNet(nn.Module):
    def __init__(self):
        super(TrafficSignNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 100, 5)
        self.conv1_bn = nn.BatchNorm2d(100)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(100, 150, 3)
        self.conv2_bn = nn.BatchNorm2d(150)
        self.conv3 = nn.Conv2d(150, 250, 1)
        self.conv3_bn = nn.BatchNorm2d(250)
        self.fc1 = nn.Linear(250 * 3 * 3, 350)
        self.fc1_bn = nn.BatchNorm1d(350)
        self.fc2 = nn.Linear(350, 43)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.pool(F.elu(self.conv1(x)))
        x = self.dropout(self.conv1_bn(x))
        x = self.pool(F.elu(self.conv2(x)))
        x = self.dropout(self.conv2_bn(x))
        x = self.pool(F.elu(self.conv3(x)))
        x = self.dropout(self.conv3_bn(x))
        x = x.view(-1, 250 * 3 * 3)
        x = F.elu(self.fc1(x))
        x = self.dropout(self.fc1_bn(x))
        x = self.fc2(x)
        return x
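A quick sanity-check sketch: a 32x32 single-channel input shrinks to 3x3 with 250 channels after the three conv/pool stages, which is why fc1 expects 250 * 3 * 3 inputs:

import torch

model = TrafficSignNet().eval()        # eval mode so batch norm uses running stats
dummy = torch.zeros(1, 1, 32, 32)      # one grayscale 32x32 image
print(model(dummy).shape)              # torch.Size([1, 43]), one logit per class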

Regularization

I use the following regularization techniques to prevent overfitting:

  • Dropout : A simple way to prevent neural networks from overfitting. Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. During training, some number of layer outputs are randomly ignored.
  • Batch Normalization : During training time, the distribution of the activations is constantly changing. The intermediate layers must learn to adapt themselves to a new distribution in every training step, so training process slow down. Batch normalization is a method to normalize the inputs of each layer, in order to fight the internal covariate shift problem.
  • Early Stopping : A major challenge in training neural networks is deciding how long to train them. If we use too few epochs, we might underfit; if we use too many, we might overfit. A compromise is to train on the training dataset but stop at the point when performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping. I used early stopping with a patience of 100 epochs: once the validation loss has not improved for 100 epochs, training is stopped. A minimal sketch follows this list.
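Below is a minimal training-loop sketch showing how the early stopping described above might look; the loss function, optimizer, and maximum epoch count are assumptions, not the exact training code:

import copy
import torch
import torch.nn as nn

model = TrafficSignNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_loss, best_state, patience, epochs_without_improvement = float('inf'), None, 100, 0
for epoch in range(1000):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # Evaluate on the validation set
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in valid_loader) / len(valid_loader)

    if val_loss < best_loss:
        best_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping: validation loss stopped improving

model.load_state_dict(best_state)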

GPU Utilization

I trained the network with various batch_size and num_workers settings, but GPU utilization was low and oscillating. So I decided to find the bottleneck.

torch.utils.bottleneck is a tool that can be used for debugging bottlenecks in your program. It summarizes runs of your script with the Python profiler and PyTorch’s autograd profiler. Run it on the command line with:

python -m torch.utils.bottleneck your_script.py [args]

Initially, the image preprocessing ran in real time each time a batch was read from the dataset; in other words, CLAHE_GRAY and several transforms were chained together using transforms.Compose. That turned out to be the bottleneck. So I preprocessed all images of the dataset (train, valid, test) and saved them to new pickle files before training. A sketch of this offline step follows.
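A minimal sketch of that offline preprocessing step; the pickle file names are assumptions:

import pickle
import numpy as np

preprocess = CLAHE_GRAY()

for name in ('train', 'valid', 'test'):
    with open(f'{name}.p', 'rb') as f:
        data = pickle.load(f)
    # Apply contrast normalization once, up front, instead of per batch
    data['features'] = np.array([preprocess(img) for img in data['features']])
    with open(f'{name}_preprocessed.p', 'wb') as f:
        pickle.dump(data, f)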

Spatial Transformer Networks

The problem with CNNs is that they don't efficiently learn spatial invariances. A few years ago, DeepMind released an excellent paper called Spatial Transformer Networks, aiming at boosting the geometric invariance of CNNs in a very elegant way.

The goal of a Spatial Transformer Network (STN for short) is to add to your base network a layer able to perform an explicit geometric transformation on an input. STNs are a generalization of differentiable attention to any spatial transformation. They allow a neural network to learn how to perform spatial transformations on the input image in order to enhance the geometric invariance of the model.

The layer is composed of 3 elements:

  • The localization network takes the original image as an input and outputs the parameters of the transformation we want to apply.
  • The grid generator generates a grid of coordinates in the input image corresponding to each pixel from the output image.
  • The sampler generates the output image using the grid given by the grid generator.

The given dataset shows that the traffic signs were photographed at a wide variety of angles and distances. So I transform the input images with a Spatial Transformer Network before feeding them to the classifier. A minimal sketch of such a module is shown below.
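This sketch follows the spirit of the PyTorch spatial transformer tutorial; the exact layer sizes in the localization network are assumptions, not the ones used in this project:

import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self):
        super(STN, self).__init__()
        # Localization network: predicts the six parameters of an affine transformation
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True)
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 4 * 4, 32), nn.ReLU(True),
            nn.Linear(32, 2 * 3)
        )
        # Initialize the transformation to the identity
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.fc_loc(self.localization(x).view(-1, 10 * 4 * 4)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # grid generator
        return F.grid_sample(x, grid, align_corners=False)           # sampler

In this setup the STN is applied to the (preprocessed, 32x32 grayscale) input image, and its output is what the TrafficSignNet classifier sees.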

Now, let's train the model. The network learns the classification task in a supervised way, and at the same time it learns the STN automatically in an end-to-end fashion. After training, I inspected the results of the learned visual attention mechanism.
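One way to inspect it (a minimal sketch, assuming stn is the trained spatial transformer module from above and valid_loader yields preprocessed batches) is to push a batch through the STN alone and compare input and output:

import matplotlib.pyplot as plt
import torch
import torchvision

with torch.no_grad():
    x, _ = next(iter(valid_loader))
    x = x[:8].cpu()                     # a few samples are enough for a visual check
    warped = stn(x)                     # stn is assumed to be the trained STN module
    side_by_side = torch.cat([x, warped], dim=0)
    grid = torchvision.utils.make_grid(side_by_side, nrow=x.size(0))
    plt.imshow(grid.permute(1, 2, 0))
    plt.title('Top: inputs, bottom: STN outputs')
    plt.axis('off')
    plt.show()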

Conclusion

We see that preprocessing images with contrast normalization, using augmented data, removing class imbalance, and using a spatial transformer network help us classify traffic signs with high accuracy. After training, this model scored 99.33% accuracy on the test set consisting of 12,630 images.

Personally, I thoroughly enjoyed this project and gained practical experience using PyTorch. I hope you enjoyed reading this post. Feel free to leave comments and claps :)


Barney Kim

Software Engineer@KakaoMobility, Self-Driving Car Engineer@Udacity Graduate https://wolfapple.github.io