Object detection with neural networks — a simple tutorial using keras

TLDR: A very lightweight tutorial on object detection in images. We will bootstrap simple images and apply increasingly complex neural networks to them. In the end, the algorithm will be able to detect multiple objects of varying shape and color. You should have a basic understanding of neural networks to follow along.

Image analysis is one of the most prominent fields in deep learning. Images are easy to generate and handle, and they are exactly the right type of data for machine learning: easy to understand for human beings, but difficult for computers. Not surprisingly, image analysis played a key role in the history of deep neural networks.

In this blog post, we’ll look at object detection — finding out which objects are in an image and where they are. For example, imagine a self-driving car that needs to detect other cars on the road. There are lots of complicated algorithms for object detection. They often require huge datasets, very deep convolutional networks and long training times. To make this tutorial easy to follow, we’ll apply two simplifications: 1) We don’t use real photographs, but images with abstract geometric shapes. This allows us to bootstrap the image data and use simpler neural networks. 2) We predict a fixed number of objects in each image. This makes the entire algorithm a lot easier (it’s actually surprisingly simple apart from a few tricks). At the end of the post, I will outline how one can expand on this approach to detect many more objects in an image.

I tried to make this tutorial as simple as possible: I will go step by step, starting with detection of a single object. For each step, there’s a Jupyter notebook with the complete code in this github repo. You don’t need to download any additional dataset. The code is in Python plus keras, so the networks should be easy to understand even for beginners. Also, the networks I use are (mostly) very simple feedforward networks, so you can train them within minutes.

Detecting a single object

Let’s start simple: We will predict the bounding box of a single rectangle. To construct the “images”, I created a bunch of 8x8 numpy arrays, set the background to 0, and a random rectangle within the array to 1. Here are a few examples (white is 0, black is 1):
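In code, bootstrapping these images could look roughly like this (a sketch with my own variable names; the rectangle sizes I sample here are an assumption, not necessarily what the notebook uses):

import numpy as np

num_imgs = 40000
img_size = 8
min_object_size, max_object_size = 1, 4          # assumed range of rectangle sizes

imgs = np.zeros((num_imgs, img_size, img_size))   # background = 0 (white)
bboxes = np.zeros((num_imgs, 4))                  # x, y, w, h for each image

for i in range(num_imgs):
    w, h = np.random.randint(min_object_size, max_object_size + 1, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    imgs[i, y:y + h, x:x + w] = 1.                # rectangle = 1 (black)
    bboxes[i] = [x, y, w, h]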

The neural network is a very simple feedforward network with one hidden layer (no convolutions, nothing fancy). It takes the flattened image (i.e. 8 x 8 = 64 values) as input, and predicts the parameters of the bounding box (i.e. the coordinates x and y of the lower left corner, the width w and the height h). During training, we simply regress the predicted bounding boxes onto the expected ones using the mean squared error (MSE) as the loss. I used adadelta as an optimizer here — it’s basically standard stochastic gradient descent, but with an adaptive learning rate. It’s a really great choice for experimentation, because you don’t need to spend a lot of time on hyperparameter optimization. Here’s how the network is implemented in keras:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential([
    Dense(200, input_dim=64),   # input: flattened 8x8 image (64 values)
    Activation('relu'),
    Dropout(0.2),               # a bit of regularization
    Dense(4)                    # output: x, y, w, h of the bounding box
])
model.compile('adadelta', 'mse')
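Training then boils down to a single fit call. Here’s a minimal sketch, assuming imgs and bboxes are the arrays bootstrapped above (scaling the boxes by the image size is my own choice; it just keeps the targets in a small range):

# flatten the images and scale the bounding boxes to [0, 1]
X = imgs.reshape(num_imgs, -1)
y = bboxes / img_size

model.fit(X, y, epochs=50, validation_split=0.2, verbose=2)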

I trained this network with 40k random images for 50 epochs (~1 minute on my laptop’s CPU) and got almost perfect results. Here are the predicted bounding boxes on the images above (they were held out during training):

Quite good, isn’t it? You can see that I also plotted the IOU values above each bounding box: this metric is called Intersection over Union and measures the overlap between the predicted and the real bounding box. It’s calculated by dividing the area of intersection (red in the image below) by the area of union (blue). The IOU is between 0 (no overlap) and 1 (perfect overlap). In the experiment above, I got an almost perfect mean IOU of 0.9 (on held-out test data). The code for this section is in this Jupyter notebook.
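For reference, the IOU of two boxes in (x, y, w, h) format can be computed with a few lines (a simple sketch; the notebook may implement it slightly differently):

def iou(box1, box2):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    # width and height of the overlapping region (0 if the boxes don't overlap)
    inter_w = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    inter_h = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    intersection = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union if union > 0 else 0.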

Detecting multiple objects

Predicting a single object isn’t that much fun, so let’s add another rectangle. Basically, we use the same approach as above: Bootstrap the images with 8x8 numpy arrays and train a feedforward neural network to predict two bounding boxes (i.e. a vector x1, y1, w1, h1, x2, y2, w2, h2). However, if we just go ahead and do this, we get the following (quite disappointing) result:

Both bounding boxes seem to be in the middle of the rectangles. What happened? Imagine the following situation: We train our network on the leftmost image in the plot above. Let’s say that the expected bounding box of the left rectangle is at position 1 in the target vector (x1, y1, w1, h1), and the expected bounding box of the right rectangle is at position 2 in the vector (x2, y2, w2, h2). As a result, our optimizer will change the parameters of the network so that the first predictor moves to the left, and the second predictor moves to the right. Imagine now that a bit later we come across a similar image, but this time the positions in the target vector are swapped (i.e. left rectangle at position 2, right rectangle at position 1). Now, our optimizer will pull predictor 1 to the right and predictor 2 to the left — exactly the opposite of the previous update step! In effect, the predicted bounding boxes stay in the center. And as we have a huge dataset (40k images), there will be quite a lot of such “duplicates”.

The solution is to “assign” each predicted bounding box to a rectangle during training. Then, the predictors can learn to specialize on certain locations and/or shapes of rectangles. In order to do this, we process the target vectors after every epoch: For each training image, we calculate the mean squared error between the prediction and the target A) for the current order of bounding boxes in the target vector (i.e. x1, y1, w1, h1, x2, y2, w2, h2) and B) if the bounding boxes in the target vector are flipped (i.e. x2, y2, w2, h2, x1, y1, w1, h1). If the MSE of A is smaller than the MSE of B, we leave the target vector as is; if the MSE of B is smaller, we flip the vector. I’ve implemented this algorithm here. Below is a visualization of the flipping process:

Each row in the plot above is a sample from the training set. From left to right are the epochs of the training process. Black means that the target vector was flipped after this epoch, white means no flip. You can see nicely that most flips occur at the beginning of training, when the predictors haven’t specialized yet.
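For a single training sample, the flipping check could look roughly like this (a sketch with my own variable names; pred and target are the length-8 vectors described above):

import numpy as np

def maybe_flip_target(pred, target):
    """Return the target vector with its two boxes flipped if the flipped
    order matches the current prediction better (i.e. has a lower MSE)."""
    flipped = np.concatenate([target[4:], target[:4]])
    if np.mean((pred - flipped) ** 2) < np.mean((pred - target) ** 2):
        return flipped
    return target

# after each epoch, apply this to every sample in the training set:
# y_train = np.array([maybe_flip_target(p, t)
#                     for p, t in zip(model.predict(X_train), y_train)])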

If we train our network with flipping enabled, we get the following results (again on held-out test images):

Overall, the network achieves a mean IOU of 0.5 on the training data (I haven’t calculated the one for the test dataset, but it should be pretty similar). Not quite as good as for a single rectangle, but still pretty good (especially considering that it’s such a simple network). Note that the leftmost image is the same as in the plot before (the one without flipping) — you can clearly see that the predictors have learned to specialize on the rectangles.

Finally, two more notes on the flipping algorithm: Firstly, the approach presented above is of course only valid for two rectangles. However, you can easily extend it to multiple rectangles by looking at all possible combinations of predictors and rectangles (I will explain this in some more detail below). Secondly, you don’t necessarily have to use the mean squared error to decide whether the target should be flipped or not — you can just as well use the IOU or even the distance between the bounding boxes. In my experiments, all three metrics led to pretty similar results, so I decided to stick with the MSE, as most people are familiar with it.

Classifying objects

Detecting objects works pretty well by now, but of course we also want to say what an object is. Therefore, we’ll add triangles and classify whether an object is a rectangle or a triangle. The cool thing is that we don’t need any extra algorithm or workflow for this. We’ll use the exact same network as above and just add one value per bounding box to the target vector: 0 if the object is a rectangle, and 1 if it’s a triangle (i.e. binary classification; code is here).
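Concretely, the target vector just grows by one value per object. One possible layout (my own; the notebook may order the values differently):

import numpy as np

# [x1, y1, w1, h1, class1, x2, y2, w2, h2, class2]
# class: 0 = rectangle, 1 = triangle
bbox1, is_triangle1 = [1, 2, 4, 3], 0    # a rectangle at (1, 2) of size 4x3
bbox2, is_triangle2 = [9, 5, 3, 3], 1    # a triangle at (9, 5) of size 3x3
target = np.concatenate([bbox1, [is_triangle1], bbox2, [is_triangle2]])
# -> array([1, 2, 4, 3, 0, 9, 5, 3, 3, 1])

The last Dense layer of the network then grows from 8 to 10 outputs accordingly.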

Here are the results (I increased the image size to 16x16 so that small triangles are easier to recognize):

A red bounding box means the network predicted a rectangle, and yellow means it predicted a triangle. The samples already indicate that classification works pretty well, and indeed we get an almost perfect classification accuracy.

Putting it all together: Shapes, Colors, and Convolutional Neural Networks

Alright, everything works, so let’s have some fun now: We’ll apply the method to some more “realistic” scenes with different colors, more shapes, and multiple objects at once. To bootstrap the images, I used the pycairo library, which can write RGB images and simple shapes to numpy arrays (I’ll show a short sketch of the drawing code below). I also made some modifications to the network itself, but let’s first have a look at the results:

As you can see, the bounding boxes aren’t perfect, but most of the time they are kind of in the right place. The mean IOU on the test dataset is around 0.4, which is not bad for recognizing three objects at once. The predicted shapes and colors (written above the bounding boxes) are pretty much perfect (test accuracy of 95 %). Apparently, the network has really learned to assign the predictors to different objects (as we aimed for with the flipping trick introduced above).
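As a side note, drawing such a scene with pycairo boils down to a few lines. Here’s a rough sketch for a single red rectangle (the notebook draws several shapes in several colors; the image size here is an assumption):

import numpy as np
import cairo   # pycairo

img_size = 32   # assumed; the notebook may use a different size

# pycairo renders into a (height, width, 4) uint8 buffer
data = np.zeros((img_size, img_size, 4), dtype=np.uint8)
surface = cairo.ImageSurface.create_for_data(data, cairo.FORMAT_ARGB32, img_size, img_size)
ctx = cairo.Context(surface)

ctx.set_source_rgb(1, 1, 1)    # white background
ctx.paint()

x, y, w, h = 5, 8, 12, 6       # an example rectangle
ctx.set_source_rgb(1, 0, 0)    # red
ctx.rectangle(x, y, w, h)
ctx.fill()

img = data[:, :, :3]           # keep the three color channels (stored as BGR in memory)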

In comparison to the simple experiments above, I made three modifications:

1) I used a convolutional neural network (CNN) instead of a feedforward network. CNNs scan the image with learnable “filters” and extract more and more abstract features at each layer. Filters in early layers may for example detect edges or color gradients, while later layers may register complex shapes. I won’t go into the technical details here, but you can find excellent introductions in the lectures from Stanford’s CS231n class or this chapter from Michael Nielsen’s book. For the results shown above, I trained a network with four convolutional and two pooling layers for about 30–40 minutes (a rough sketch of such a network is shown at the end of this section). A deeper, more optimized, or longer-trained network would probably get even better results.

2) I didn’t use a single (binary) value for classification, but one-hot vectors (0 everywhere, 1 at the index of the class). Specifically, I used one vector per object to classify shape (rectangle, triangle or circle) and one vector to classify color (red, green or blue). Note that I added some random variation to the colors in the input images to see if the network can handle this. All in all, the target vector for an image consists of 10 values for each object (4 for the bounding box, 3 for the shape classification, and 3 for the color classification).

3) I adapted the flipping algorithm to work with multiple bounding boxes (as mentioned above). After each epoch, the algorithm calculates the mean squared error for all combinations of one predicted and one expected bounding box. Then, it takes the minimum of those values, assigns the corresponding predicted and expected bounding boxes to each other, takes the next smallest value out of the boxes that were not assigned yet, and so on.
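In code, that greedy assignment could look roughly like this (a sketch with my own names; pred_boxes and true_boxes are arrays of shape (num_objects, 4) for a single image):

import numpy as np

def reorder_targets(pred_boxes, true_boxes):
    """Greedily match each predicted box to the closest expected box (by MSE)
    and return the expected boxes reordered accordingly."""
    n = len(true_boxes)
    # pairwise MSE between every predicted and every expected box
    dists = np.array([[np.mean((p - t) ** 2) for t in true_boxes] for p in pred_boxes])
    order = np.zeros(n, dtype=int)
    for _ in range(n):
        i, j = np.unravel_index(np.argmin(dists), dists.shape)   # best remaining pair
        order[i] = j
        dists[i, :] = np.inf   # predictor i is now assigned ...
        dists[:, j] = np.inf   # ... and so is expected box j
    return true_boxes[order]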

You can find the final code in this notebook.
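For completeness, here is roughly what a convolutional network of the kind described in modification 1 could look like in keras. This is a sketch with my own layer sizes, not the exact architecture from the notebook (I’m also assuming 32x32 RGB input and three objects with 10 target values each):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2),
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.2),
    Dense(30)   # 3 objects x (4 bbox values + 3 shape classes + 3 color classes)
])
model.compile('adadelta', 'mse')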

Real-world objects

Recognizing shapes is a cool and easy example, but obviously it’s not what you want to do in the real world (there aren’t that many abstract 2D shapes in nature, unfortunately). Also, our algorithm can only predict a fixed number of bounding boxes per image. In the real world, however, you have diverse scenarios: A small side road may have no cars on it, but as soon as you drive on the highway, you have to recognize hundreds of cars at the same time.

Even though this seems like a minor issue, it’s actually pretty hard to solve — how should the algorithm decide what’s an object and what’s background, if it doesn’t know how many objects there are? Imagine yourself looking at a tree from close by: Even though you only see a bunch of leaves and sticks, you can still clearly say it’s all one object, because you understand what a tree is. If the leaves were lying around on the floor instead, you would easily detect them as individual objects. Unfortunately, neural networks don’t quite understand what trees are, so this is a pretty hard challenge for them.

Out of the many algorithms that do object detection on a variable number of objects (e.g. Overfeat or R-CNN; have a look at this lecture for an overview), I only want to highlight one, because it’s pretty similar to the method we used above: It’s called YOLO (You Only Look Once). In contrast to older approaches, it detects objects in an image with a single pass through a neural network. In short, it divides the image into a grid, predicts two bounding boxes for each grid cell (i.e. exactly the same thing we did above), and then tries to find the best bounding boxes across the entire image. Because YOLO only needs a single pass through a network, it’s super fast and even works on videos. Below is a demo, and you can see more examples here.


If you have any questions or ideas, write me on Twitter (@johannes_rieke) or comment below!