This will be the first post in a series that describes how convolutional neural networks (CNNs) can be used for image segmentation. As an example, we will use a set of aerial images (from a city in Germany). In those images, we will automatically detect buildings, streets, sidewalks and parking spots, and mark them with different colours.
In this first post, we will provide some background information and describe our neural network architecture. In the following post, we will focus more on data processing and explore how we can find the right hyper-parameters.
I am the type of person who wants to know what they are buying before spending the money, so I will show you the final result first and then explain how we got there.
On the left side, you can see the ground truth (created manually), in the middle is the input image and on the right-hand side is the output from our convolutional neural network. Every colour corresponds to one particular class that we are trying to detect. There are five classes in total: streets (light blue), sidewalks (turquoise), parking spots (green), buildings (blue) and background (violet).
As you can see, our CNN is pretty good at picking up all the different classes. In fact, around 90% of the pixels are classified correctly. This is quite impressive if we consider that no complicated pre- or post-processing steps were required and that the network had never seen this particular image before.
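To make the "90% of the pixels" figure concrete, here is a minimal sketch of how such a pixel accuracy could be computed. The function name and the toy label maps are assumptions made for illustration; the exact evaluation code behind the number above is not shown in this post.

```python
import numpy as np

def pixel_accuracy(prediction, ground_truth):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    prediction = np.asarray(prediction)
    ground_truth = np.asarray(ground_truth)
    if prediction.shape != ground_truth.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    return float(np.mean(prediction == ground_truth))

# Toy 2x3 label maps with hypothetical class ids between 0 and 4.
truth = np.array([[0, 1, 1], [4, 4, 2]])
pred = np.array([[0, 1, 3], [4, 4, 2]])
accuracy = pixel_accuracy(pred, truth)  # 5 of the 6 pixels match
```

Note that a single accuracy number treats all classes equally per pixel, so a large background area can dominate it.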
Before we explore network architectures and dive into the technical details, we want to explain why this is useful and refer to related research.
2.1. Image Segmentation
One of the reasons neural networks became so successful is their ability to solve complex tasks without any domain knowledge. This means that even though we used aerial images in this project, the same architecture can be used for any other image segmentation task. In fact, at no point did we use any of our knowledge about what roads or buildings should look like! In contrast, ten years ago, people would have handcrafted specific features and restricted the search space by manually specifying shape grammars or other mathematical constructs (describing the shapes of roads and buildings).
As with everything in life, there is an upside and a downside to having such a well-generalized tool at our disposal. The downside is that we are most likely giving up a bit of accuracy. For example, requiring parking spots to be rectangular might have given us another one percent increase in accuracy.
But on the upside, we:
- Produce a more general model that can be re-purposed.
- Avoid making any assumptions, reducing our (inaccurate) bias¹.
- Create results that are easier to understand and reproduce.
¹ Side note: We refer to bias that gets introduced when designing a solution. Whenever you select specific features and ignore others, you are inserting bias (which can have a positive or negative effect). An example would be to assume that all buildings are rectangular.
Historically CNNs were mainly used for classifying images, but in recent years image segmentation has become increasingly popular. You can now find numerous papers describing various approaches. For a start, we recommend “Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Image Recognition” (by Wei, X.S. et al. 2016) and “Fully Convolutional Networks for Semantic Segmentation” (by Shelhamer, E et al. 2016). In this article, we draw upon many of their ideas, especially from the second paper.
2.2. Aerial Images
Aerial and satellite images are the perfect use case for image segmentation because they provide us with lots of different applications. We will discuss some here, but please let us know if you come across any other interesting applications (firstname.lastname@example.org).
Autonomous vehicles are the first use case. Safety is the most important factor when designing an autonomous vehicle, but safety requires an understanding of the environment. The width of the road, position of sidewalks or pedestrian crossings are only a few things an autonomous vehicle needs to know. Current maps do not provide this information and driving around to collect that knowledge is slow and expensive. By analysing satellite and aerial images, we can build high definition maps that can help all autonomous vehicles to navigate the world more safely.
The detection of parking spots is another interesting application. Even though aerial images are usually not available in real time, the initial position of parking spots, good drop-off points and pick-up locations is of value for both transportation companies and service providers such as Parkopedia. Furthermore, acquiring multiple images from different days/weeks/months allows us to estimate the usage intensity of streets, parking spots and urban areas in general, which is not only highly relevant for city planning but can also help companies make more accurate decisions.
Last but not least, we are also detecting buildings, which can give us a good estimate of how populated a given area is. In fact, large NGOs are often interested in population estimates for areas where the government is not releasing reliable information. Assessing how much area is covered by buildings can significantly improve those estimates.
3. Convolutional Neural Network
Now, we will get a bit more technical and describe our neural network architecture. For classification, a neural network usually consists of a number of convolutional layers followed by dense layers that output the probability that the object belongs to one particular class.
We can use a somewhat similar idea to classify every single pixel in the image. A pixel might be classified as either “background”, “street”, “sidewalk”, “building” or “parking”. Classifying every single pixel then gives us a segmentation for the whole image.
However, to do this efficiently, we want to classify each pixel into all classes at the same time. The easiest way to achieve this is to have n convolutional layers, followed by one final convolutional layer where the number of filters equals the number of classes we want to predict; in our case five. Essentially, all we are doing is to replace the dense layers with a final convolution layer. This architecture would look something like this (the colours are just for the visual effect):
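To make the final layer concrete, here is a minimal NumPy sketch of a last convolution with five filters. It assumes a 1x1 filter size, which turns the convolution into a per-pixel matrix product; the feature map size, the 16 input channels and the random weights are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

CLASSES = ["background", "street", "sidewalk", "building", "parking"]

# Hypothetical feature map produced by the preceding convolutional layers:
# height 4, width 4, 16 channels (sizes chosen only for illustration).
features = rng.normal(size=(4, 4, 16))

# A 1x1 convolution with five filters is a per-pixel matrix product.
weights = rng.normal(size=(16, len(CLASSES)))
bias = np.zeros(len(CLASSES))
scores = features @ weights + bias   # shape (4, 4, 5): one score per class

# The predicted class of each pixel is the filter with the highest score.
label_map = scores.argmax(axis=-1)   # shape (4, 4)
```

Every pixel receives five scores in a single pass, which is what lets us classify each pixel into all classes at the same time.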
The drawback with this architecture is that every time we go from one layer to the next, we can only capture local regions because the filters we are using are only five pixels wide and tall.
To understand more complex objects, for example, a road, the network needs to be able to combine pixel values from the whole image. Without this ability, a piece of road that is occluded by a tree or a building cannot be correctly classified by the network.
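How local is "local"? A small sketch of the standard receptive-field arithmetic shows how many input pixels a single output value can see after stacking layers with 5x5 filters. The helper name is our own, and the stride-2 variant is included only as a comparison.

```python
def receptive_field(num_layers, kernel_size=5, stride=1):
    """Width in input pixels seen by one output value after stacking layers."""
    field, jump = 1, 1
    for _ in range(num_layers):
        # Each layer widens the view by (kernel_size - 1) steps of the
        # current spacing between neighbouring values, measured in input pixels.
        field += (kernel_size - 1) * jump
        jump *= stride
    return field

plain = [receptive_field(n) for n in (1, 3, 6)]   # stride 1
strided = receptive_field(6, stride=2)            # six stride-2 layers
```

With stride 1 the view grows slowly (5, 13 and 25 pixels after 1, 3 and 6 layers), while six stride-2 layers already cover 253 pixels, roughly half the width of a 512x512 image.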
To fix this issue, we use a modified version of an autoencoder. An autoencoder reduces the size of the features in every layer until all the information is compacted into a much smaller space. This process is commonly known as dimensionality reduction. It then reverses the process to generate a new image with the original size of the input image.
This process provides us with two main benefits:
- It allows our network to combine elements of the whole image. To visualise this, we can imagine the first part working like a pyramid where every layer down-scales the image:
- Reducing the information forces the network to store only the essential information, making over-fitting less likely. We can think of this as if we are asking somebody to remember the whole image for a long time, using only a tiny piece of paper. What would they do? They would draw a tiny map, which is exactly what we want the network to do.
After we reduced the size of the input, we have to upscale the image to its original size. For every down-scaling convolutional layer, we will use a transposed convolutional layer to upscale the image again. Our new architecture looks like this:
From the above visualisation, you can see that we first use six convolutional layers to downscale the image (we can use pooling layers or a stride size of 2) until we have compacted the image into an 8x8x256 region. While the original image contained roughly 800 thousand (512*512*3) values, this region can only store 16 thousand (8*8*256) values, forcing the network to find a more compact representation. We then use transposed convolutional layers to restore our output image. Compared to autoencoders, the only twist is that we do not expect the output image to equal our input image but instead introduce a cost function that pushes the output to resemble our map. If you are not familiar with convolutions and their transposes (sometimes called deconvolutions), you can read this excellent Theano tutorial.
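The numbers above can be checked with a few lines of arithmetic. This sketch assumes that every downscaling layer halves the width and height exactly (for example, stride 2 with padding that preserves the halved size); the helper name is hypothetical.

```python
def encoder_sides(size=512, num_layers=6, stride=2):
    """Side length of the feature map after each downscaling layer."""
    sides = [size]
    for _ in range(num_layers):
        sides.append(sides[-1] // stride)
    return sides

sides = encoder_sides()                            # 512 halved six times
input_values = 512 * 512 * 3                       # RGB input image
bottleneck_values = sides[-1] * sides[-1] * 256    # 8x8x256 region
compression = input_values / bottleneck_values
```

Here `sides` ends at 8, `input_values` is 786,432, `bottleneck_values` is 16,384, and the bottleneck therefore holds 48 times fewer values than the input.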
A quick visual comparison might be enough to give you some intuition.
In the example to the left, we go from a 4x4 input to a 2x2 output. We use a convolutional filter of size 3x3.
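For readers who prefer code to pictures, here is a minimal NumPy sketch of the same idea: an unpadded 3x3 convolution that maps a 4x4 input to a 2x2 output, and its transpose, which maps a 2x2 input back to a 4x4 output. Stride 1, no padding and the all-ones filter are assumptions chosen to keep the example short; real layers learn their filter values.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_transpose(image, kernel):
    """Spread every input value over a kernel-sized patch of the output."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] + kh - 1, image.shape[1] + kw - 1))
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i:i + kh, j:j + kw] += image[i, j] * kernel
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
small = conv2d_valid(image, kernel)          # shape (2, 2)
restored = conv2d_transpose(small, kernel)   # shape (4, 4)
```

The transposed convolution restores the shape, not the values: `restored` has the same size as `image` but different content, which is why the upscaling layers have to be trained as well.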
At this point, we would have a model that can produce a decent map, but you would not be very satisfied with the accuracy. The reason is that it is extremely hard for the network to get every pixel right. We can imagine it like this:
First, we forced the network to compress all of its knowledge into a small space (like a miniature map). Then, we scaled that map up to the original size. Even though the network knows that there should be, for example, a road or a building at the top right border, it does not know the exact location. It might move it a few pixels to the side or not get the outline completely right because it could only store an approximation of that building or road and not the perfect location and shape.
The solution and final step (at least for this article) is to introduce shortcuts in the network. Basically, we take the existing architecture and enable the network to look back towards its past layers. This enables, for example, the last layer to take a look at the input image and realise that some buildings are not perfectly aligned or that the shape of a particular road is not perfect. It can then shift the buildings slightly or adapt the shape of the road.
Our final architecture will then look like this:
So far, we have failed to mention how exactly the layers will be able to get information from past layers. This is, in fact, an open research question but for our architecture we decided that the n-th transposed convolutional layer should only be able to get information from the (6-n)-th convolutional layer (indicated by the colour matching arrows). Furthermore, we considered two possible ways to exchange the information:
- Adding the outputs of the layers or,
- Concatenating them.
If we are adding the layers, the n-th transposed convolution will be executed as normal and create an output with shape (h, w, x), where x is the number of filters in that layer and h and w are the height and width. Luckily, the (6-n)-th convolutional layer will have created an output with exactly the same dimensions. This means that, before handing the output over to the next transposed convolutional layer (n+1), we add the outputs from the (6-n)-th and n-th layers elementwise.
The second option is to concatenate both layer outputs. In that scenario, we do not add the values but create a new output of size (h, w, 2*x) or, more generally, (h, w, x+y). The benefit is that the number of filters in the n-th transposed convolutional layer and the corresponding convolutional layer can be different (for example x and y). On the downside, this makes our model larger and increases training time.
While Shelhamer, E. et al. prefer the first option (adding layers), we found that our results improved slightly when we were concatenating layers.
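The difference between the two options is easiest to see in the resulting shapes. The following NumPy sketch uses random arrays in place of real layer outputs; the sizes h, w, x and y are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

h, w, x, y = 16, 16, 128, 64

# Stand-ins for real layer outputs (random values, illustrative sizes).
decoder_out = rng.normal(size=(h, w, x))     # n-th transposed convolution
encoder_same = rng.normal(size=(h, w, x))    # (6-n)-th convolution with x filters
encoder_other = rng.normal(size=(h, w, y))   # the same layer if it had y filters

# Option 1: elementwise addition requires identical shapes.
added = decoder_out + encoder_same                                      # (h, w, x)

# Option 2: concatenation along the channel axis allows different filter counts.
concatenated = np.concatenate([decoder_out, encoder_other], axis=-1)    # (h, w, x + y)
```

Addition keeps the channel count at x, whereas concatenation hands x + y channels to the next layer, which is where the extra model size and training time come from.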
Having fixed our architecture, all that is left to do is to train the model and find optimal hyper-parameters. We will describe our approach and choice in the second part, but we want to show one (or five) additional outputs from the network.
As we already described, the last layer uses five filters, one for every class. Visualising the output of every filter gives us an easy way to see the probability maps for all different classes. Below is an example based on the same image we used at the beginning of the article. Green means the output value is higher and blue means it is lower.
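As a rough sketch of how such per-class maps can be obtained, the snippet below applies a softmax over the five filter outputs of every pixel and then splits the result into one map per class. Using a softmax is our assumption here, and the random scores merely stand in for the real output of the last layer.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

CLASSES = ["background", "street", "sidewalk", "building", "parking"]

rng = np.random.default_rng(0)
scores = rng.normal(size=(512, 512, len(CLASSES)))   # stand-in for the last layer

probabilities = softmax(scores)                       # sums to 1 for every pixel
class_maps = {name: probabilities[..., k] for k, name in enumerate(CLASSES)}
```

Each entry of `class_maps` is a 512x512 array that can be rendered with any colour map, for example one that runs from blue (low) to green (high).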
You might be able to see some grid artefacts in the images. This comes from the fact that the input images are much larger than 512x512 and we need to tile them. We will discuss those steps in a future post.
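Until then, here is one simple way tiling could look. This is only an illustrative sketch, not necessarily the procedure we will describe later: it zero-pads the image to a multiple of 512 and cuts it into non-overlapping tiles. The function name and the example image size are hypothetical.

```python
import numpy as np

def tile_image(image, tile=512):
    """Split an (H, W, C) image into non-overlapping tile x tile patches.

    The image is zero-padded at the bottom and right so that both sides
    become multiples of the tile size. Returns (top, left) offsets with tiles.
    """
    h, w = image.shape[:2]
    pad_h = (-h) % tile
    pad_w = (-w) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    tiles = []
    for top in range(0, padded.shape[0], tile):
        for left in range(0, padded.shape[1], tile):
            tiles.append(((top, left), padded[top:top + tile, left:left + tile]))
    return tiles

# Hypothetical 1000x1300 RGB image, padded to 1024x1536 and cut into 2x3 tiles.
tiles = tile_image(np.zeros((1000, 1300, 3)))
```

Because every tile is predicted independently, the network has less context near tile borders, which is one plausible source of visible seams.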
Stay in touch, so you do not miss the next episode!