Fast object detection with SqueezeDet on Keras

This is part one of our blog posts on the SqueezeDet object detection architecture. Here at omni:us, we are already using this architecture in production to detect regions of interest in business documents. Instead of using the original TensorFlow code, we decided to re-implement it ourselves in Keras; you can find our implementation here.

In this part, we give a short overview of the network, explain why we rewrote the code, and tell you why you might want to use it. The upcoming second part will feature some juicy implementation details for those of you who want to get their hands dirty and dig into the source themselves.

So, what is SqueezeDet?

SqueezeDet is a lightweight, single-shot, fully convolutional object detection architecture published by the gentlemen Bichen Wu, Alvin Wan, Forrest Iandola, Peter H. Jin and Kurt Keutzer. Now let us untangle this:

Single-shot Object Detection:

This means that the network does not merely classify an image into a predefined group. Instead of providing a single label for the whole image, it identifies the locations, usually as boxes, and the classes of multiple objects within the image. For each prediction, the network also tells you how sure it is by providing a confidence value, which you can use for filtering. As an example, see the image above. The green boxes denote the coordinates and classes the network should predict, and the red ones show what the network does predict and with which level of confidence.

How does this work in SqueezeDet? Before training, you define a grid over your image. At each point of this grid sit a couple of boxes with different aspect ratios, tailored to your dataset. These are called the anchor boxes. If you want to look at it in a Bayesian way, they are a prior distribution over the boxes. During training, instead of predicting the raw box coordinates, the network learns to move and stretch these anchor boxes to fit the ground truth boxes. If we put on our Bayesian hat again, we are observing data and updating our posterior distribution, which should eventually converge to the true distribution of boxes. Check out the paper to get a more mathematical view of what is going on.

Sketch of the anchor box regression from the original paper
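
To make the "move and stretch" idea concrete, here is a minimal sketch in plain Python. It assumes the parametrization commonly used for anchor-based detectors: the center is shifted in proportion to the anchor size, and width and height are scaled by an exponential. The function names, and the choice to place anchors at the cell centers of the grid, are our assumptions for illustration and are not taken from the repository; check the paper for the exact form.

```python
import math

def make_anchors(img_w, img_h, grid_w, grid_h, shapes):
    """Place every anchor shape at every grid point.

    Assumption: anchors sit at the centers of the grid cells.
    shapes is a list of (width, height) pairs tailored to the dataset.
    Returns a list of (center_x, center_y, width, height) tuples.
    """
    anchors = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx = (gx + 0.5) * img_w / grid_w
            cy = (gy + 0.5) * img_h / grid_h
            for w, h in shapes:
                anchors.append((cx, cy, w, h))
    return anchors

def decode_box(anchor, deltas):
    """Move and stretch one anchor box with the predicted offsets.

    anchor: (center_x, center_y, width, height)
    deltas: (dx, dy, dw, dh) as predicted by the network
    """
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    cx = ax + aw * dx        # shift the center, relative to the anchor width
    cy = ay + ah * dy        # shift the center, relative to the anchor height
    w = aw * math.exp(dw)    # stretch; exp keeps the width positive
    h = ah * math.exp(dh)    # stretch; exp keeps the height positive
    return cx, cy, w, h

anchors = make_anchors(100, 100, 10, 10, [(20, 10), (10, 20)])
print(len(anchors))                              # 200 anchors: 10 x 10 grid, 2 shapes
print(decode_box(anchors[0], (0.5, 0.0, 0.0, 0.0)))
```

With all offsets at zero, the decoded box is just the anchor itself, which is why well-chosen anchor shapes give the network an easy starting point.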

At the same time, we minimize a standard cross-entropy loss for each box to do classification, just as we would for image classification. To obtain a meaningful confidence score, each box’s predicted confidence is regressed against the Intersection over Union (IoU) of the ground truth box and the predicted box. So, in one single shot, we obtain the bounding box prediction, the classification, and the confidence score, while earlier architectures split these tasks into two stages.
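
The Intersection over Union used as the confidence target can be computed as follows. This is a small illustrative sketch in plain Python, assuming boxes are given by their corners as (x_min, y_min, x_max, y_max); the function name and box format are our choices for this example and not necessarily those used in the repository.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max).
    Returns a value between 0.0 (no overlap) and 1.0 (identical boxes).
    """
    # Corners of the overlapping rectangle, if there is one
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an empty intersection
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0, identical boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0, no overlap
```

A prediction that overlaps the ground truth perfectly should therefore be pushed towards a confidence of 1, and one that misses it completely towards 0.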

Lightweight:

This is, of course, a relative term. SqueezeDet’s architecture consists of approximately 2 million trainable parameters. For a newcomer to neural networks, or for people who are used to linear regression, this might seem like a lot. But note that the widely loved ResNet-50 has around 25 million parameters, while VGG19 has about 143 million. In general, more parameters tend to mean higher accuracy. Hence, it is astounding that the small SqueezeDet architecture attains such a high level of accuracy while requiring far fewer parameters.

Fully convolutional:

Fully convolutional networks contain no recurrent connections, but the term usually just refers to the absence of fully connected layers. In recent classification architectures, the traditional fully connected last layer has largely been replaced by convolutional or pooling layers. As the name implies, in a fully connected layer each neuron is connected to every input, and each connection has its own weight. In a convolutional layer, a neuron, sometimes called a filter, slides over the input, processing only a small part at a time, but always with the same weights. How big is the difference in parameters? Let’s use a bounding box example:

Imagine your penultimate layer has an output size of 100 x 100 x 1, and we want bounding box predictions on a grid of 10 by 10, with only one anchor box shape and a single class. We need 4 values to move and stretch the box in the horizontal and vertical directions, plus 1 class score and 1 confidence score. This means that your last layer has an output size of 10 x 10 x (4 + 1 + 1). If we omit biases, a fully connected layer would need (100 x 100) x 10 x 10 x (4 + 1 + 1) = 6 million parameters.

If you instead use a convolutional layer with (4 + 1 + 1) filters of size 10 x 10 and a stride of 10, you get the same output size, but with only 10 x 10 x (4 + 1 + 1) = 600 parameters.
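
The arithmetic of this example can be reproduced with a few lines of plain Python. The helper names below are hypothetical and only serve this illustration; biases are omitted, as in the text.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights of a fully connected layer: every input connects to every output."""
    return in_h * in_w * in_c * out_units

def conv_params(kernel_h, kernel_w, in_c, filters):
    """Weights of a convolutional layer: one small kernel per filter, shared everywhere."""
    return kernel_h * kernel_w * in_c * filters

def conv_output_size(in_size, kernel, stride):
    """Output size along one dimension for a convolution without padding."""
    return (in_size - kernel) // stride + 1

values_per_box = 4 + 1 + 1   # box offsets + class score + confidence
grid = 10

fc = dense_params(100, 100, 1, grid * grid * values_per_box)
conv = conv_params(10, 10, 1, values_per_box)

print(fc)                             # 6000000
print(conv)                           # 600
print(conv_output_size(100, 10, 10))  # 10, so the output is again a 10 x 10 grid
print(fc // conv)                     # 10000 times fewer weights
```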

Why use SqueezeDet?

As mentioned before, this network’s architecture, and therefore the weight files it produces, are significantly smaller than those of other networks. The weight files take up around 8 MB, while other networks produce files up to about 50 times larger. This is vital and economical for users of systems with limited storage space, for example smartphones, and for those who do not want to do any pruning afterwards.

Storage savings aside, SqueezeDet also reduces the time spent on training, a part many people neglect. Training on a medium-sized dataset with a single GPU now takes a matter of hours, as opposed to days. This allows you to iterate on different ideas quickly. So if you are a startup and cannot spin up a cluster with 100 GPUs, you might want to go small for your initial experiments. Once the trained model is deployed, it also makes rapid predictions on new images. In short, SqueezeDet saves both time and space.

So, why did we rewrite the code in Keras?

Rewriting the code in Keras makes it much simpler for potential users to get a good overview of the whole code base, as there is simply less code to read. It also makes the code much easier to maintain. The rewrite gave us the liberty to remove things we found redundant or unnecessary. For example, the original implementation contained code for benchmarking against different architectures. We also added features we wanted, such as showing the three sub-losses during training and writing all kinds of useful metrics, like mean average precision, to TensorBoard.

Losses and evaluation metrics in TensorBoard

Furthermore, Keras enables you to add new evaluation metrics quite easily. Additionally, as long as you can wrap it in a callback, you can add pretty much any function you want executed after each batch or epoch.

We also took the time to properly implement a visualization of the training progress, which shows the evolution of the network’s predictions on some images with added ground truth boxes. This comes in quite handy, as it allows you to monitor things easily, check different confidence thresholds visually and show non-technical people what your network is actually doing.

You can move the slider to see how your network’s predictions change during training.

Another motivation was to upgrade from Python 2.7 to Python 3.5, with the intention of ensuring future compatibility of the code, since Python 2 will only be maintained until 2020.

As a sweet bonus, the command line output of Keras is also pretty cool. It shows you the current batch number and the total number of batches, as well as the expected remaining time of the current epoch and how the loss decreases from batch to batch.

Command line output during training

If you want to check out SqueezeDet, head over to our git repository. There you will find a guide on how to set up everything you need in a virtual environment. We also provide an example of how to run model training and evaluation on the KITTI dataset, which unfortunately does not feature cute kitties, but road scenery with cars, cyclists and pedestrians marked. Happy experimenting.