On the importance of proper data handling (part 1)

Handling XView satellite imagery with YOLO

Roger Fong
Picterra
8 min read · Dec 18, 2018


These days everyone is focused on building the best CNN models, with the newest layers and the most interesting loss functions. But before all of that comes an equally important step that is often overlooked: proper data handling. For images from datasets such as VOC and COCO there is perhaps not much to do here: the standard rotation, translation and scaling augmentations will do. Satellite imagery, however, is a different beast altogether, and without handling it properly we’ll run into issues training our models efficiently.

Despite all the advances in deep learning in recent years, arguably the most important component of your network is not the network at all, but the data you give it. If you do not handle your data in a way that is suitable for your problem, even the most cutting-edge architectures will struggle to find an optimal solution.

In this article we will go over the various issues we encountered when playing with the XView dataset and how we solved them. But first, we’ll put the problem in context.

Problem Statement:

The problem we will be focusing on is object detection from satellite imagery, using the XView dataset. Object detection is one of the big three problems in computer vision today, alongside segmentation and classification. It is the problem of localizing objects in an image, typically by finding a tight-fitting bounding box around each object of interest and then classifying it. To do this we need two things: a detection model and a dataset.

The Model: We will use a CNN model, more specifically the well-known YOLO (v2) model. YOLO takes an input image of a certain size (say 416x416) and divides it into a grid by downsampling the image by half 5 times, resulting in a 13x13 grid. We call this the proposal grid, because for each grid cell we have some number of boxes (the paper uses 5 for the VOC dataset) of predetermined size centered on the grid cell. These boxes serve as initial proposals for the locations and sizes of objects of interest in the image. The YOLO network then transforms these boxes into ones that actually wrap around the objects of interest and classifies them. There are usually many redundant overlapping boxes, which are then removed in post-processing (more specifically, non-maximum suppression). The full pipeline can be seen pictorially below:

We start with a proposal grid with some number of anchors per cell (left). The network transforms these proposals into boxes that wrap around objects of interest (center). Redundant and low probability proposals are removed using non maximum suppression (right)
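To make the grid arithmetic concrete, here is a minimal Python sketch (purely illustrative, not Picterra's actual code) of how the proposal grid size and proposal count follow from the input size, the number of 2x downsampling steps, and the anchors per cell:

    def proposal_grid(input_size=416, num_downsamples=5, anchors_per_cell=5):
        """Grid resolution and total proposal count for a YOLO (v2)-style model."""
        grid = input_size // (2 ** num_downsamples)   # 416 / 32 = 13
        return grid, grid * grid * anchors_per_cell   # 13 * 13 * 5 = 845 proposals

    grid, num_proposals = proposal_grid()             # (13, 845)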

YOLO (v2) is one of the top-performing models for object detection. It is also very fast, running at 90 fps on a GTX 1080 Ti with the default 416x416 input size, which makes it ideal for user-facing applications.

The Dataset:

We went over the XView dataset in a previous post here. Let’s revisit a few key points that will be important for us later.

  • It is currently the largest object detection dataset for satellite imagery, with a million annotations over 847 images taken from a wide variety of scenes.
  • The images are large (3300 px by 3300 px), at 0.3 meter spatial resolution, each covering an extent of 1 square km.
  • It contains 60 classes and has some very severe class imbalance.
  • It has a lot of density imbalance: some scenes (or parts of scenes) are extremely sparse or completely empty, while other areas are very dense.
  • There is a large range of object sizes, from 5 px up to 1100 px in bounding box diagonal.

The naive approach:

Now that we have our model and dataset defined, let's start with the naive approach and just feed our huge images straight into the YOLO network along with their annotations. Each image gets divided into roughly a 100x100 proposal grid, and we choose 9 anchors per grid cell to cover the large range of object sizes, which gives us about 90,000 proposals per image.
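Just to make those numbers concrete (again purely illustrative):

    grid = 3300 // 32           # 103, i.e. roughly a 100x100 proposal grid
    proposals = grid * grid * 9
    print(grid, proposals)      # 103, 95481: on the order of 90,000 proposals per image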

This is where we run into our first issue.

Issue 1 — That’s a huge image!

We simply cannot fit such a huge input into GPU memory! With a GTX 1080 Ti, which has about 11 GB of memory, we can fit a batch size of approximately 0.25. That's not even one whole image, so we can't train on this at all. Even if we could somehow fit a small batch size of 1 or 2, training would take forever. Additionally, YOLO (v2) uses a batch normalization layer with every convolutional layer, and batch normalization does not play well with small batch sizes.

We could try gradient accumulation over multiple batches, accumulating gradients until we reach an equivalent batch size of, say, 16 before performing the weight update. While this would work around the memory limit, it wouldn't really solve the batch normalization issue (the statistics are still computed per forward pass), and it would still take an exorbitantly long time to train. As I mentioned earlier, there is a lot of empty space in the XView images: areas like open water, fields and countryside with very little complexity and variation. We don't need to look at all of this information, and it's a waste of time to train on all of it; we need to be more selective about what we train on.
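For reference, gradient accumulation would look roughly like this in a PyTorch-style training loop. This is a generic sketch, not our training code; `model`, `criterion`, `optimizer` and `loader` are assumed to be defined elsewhere. Note that each forward pass still only sees one or two images, which is why batch normalization does not benefit.

    accumulation_steps = 16   # effective batch size / per-step batch size
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):   # loader yields batches of 1-2 images
        loss = criterion(model(images), targets)
        (loss / accumulation_steps).backward()          # gradients accumulate in .grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                            # one weight update per 16 mini-batches
            optimizer.zero_grad()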

Solution 1 — Tiled sampling

Let's stick to an input tile size of 416x416 for training, with a 13x13 proposal grid. At each epoch we sample one tile from each of our training images, to make sure every epoch sees some information from every scene. At prediction time we simply scan across the image in 416x416 chunks and collect the predictions (see the sketch below). For now, let's ignore the problem of being selective about how we choose our tiles and just sample randomly from the image; we'll get to that later. Before that, we must deal with another issue.
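The tiling scheme itself can be sketched roughly as follows, using NumPy and assuming each scene is loaded as an array; annotation cropping and edge handling (scenes whose sides are not multiples of 416) are omitted:

    import numpy as np

    TILE = 416

    def sample_training_tile(scene, rng=np.random):
        """Randomly crop one TILE x TILE training tile from a large scene."""
        h, w = scene.shape[:2]
        y = rng.randint(0, h - TILE + 1)
        x = rng.randint(0, w - TILE + 1)
        return scene[y:y + TILE, x:x + TILE]

    def prediction_tiles(scene, stride=TILE):
        """Scan across the scene in TILE x TILE chunks at prediction time."""
        h, w = scene.shape[:2]
        for y in range(0, h - TILE + 1, stride):
            for x in range(0, w - TILE + 1, stride):
                yield (y, x), scene[y:y + TILE, x:x + TILE]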

Issue 2 — YOLO and clusters of small objects:

YOLO is known to be not so good with clusters of small objects, because the density of proposals is inherently limited by the spatial resolution of the proposal grid. This is easiest to see with images.

13x13 proposal grid overlaid on an XView image

What you can see is that each grid cell can contain multiple cars. Imagine a case where each cell contains 10 cars and we have 9 anchors per cell: we would not be able to predict all the cars no matter what we do. Here we have 2 or 3 cars per cell and 9 anchors, but remember that the purpose of having more anchors is not to account for a large number of objects in a cell, but rather to handle objects of different sizes. Ideally, we would have one cell for each car.
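One way to quantify this limitation is to count how many ground-truth box centers fall into each proposal grid cell; if that count exceeds the number of anchors, some objects can never be predicted. A small hypothetical diagnostic along those lines:

    from collections import Counter

    def max_objects_per_cell(box_centers, tile_size=416, grid_size=13):
        """box_centers: (x, y) ground-truth box centers in tile pixel coordinates."""
        cell = tile_size / grid_size
        counts = Counter((int(x // cell), int(y // cell)) for x, y in box_centers)
        return max(counts.values(), default=0)

    # If this exceeds the number of anchors per cell (9 in our setup),
    # the proposal grid is too coarse for that tile.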

Solution 2 — Smaller tiles

Let's shrink our tile size down to 208x208. The resulting tile and proposal grid look like this:

Smaller tile size, now we have a denser grid that properly fits over the parking lot cars.

This looks much better: now each cell is roughly responsible for a single car. As far as the XView dataset goes, objects don't get much smaller or denser than this.

Issue 3 — Different object sizes:

So our input now fits clustered groups of small objects quite well. But what about larger objects? Object diagonal sizes in the XView dataset range from 5 to 1100 pixels. We can see the problem below:

The big boat cannot be captured by the small tile size.

In the example above, the appearance of the large boat simply cannot be captured by a small tile size. What's the solution? We'll go with the easiest one!

Solution 3 — Multiple tile sizes, Multiple scales, Multiple networks

Let's just use different tile sizes for different object sizes! We chose to split the dataset into small, medium and large objects. A 208x208 tile size works well for small objects, 416x416 for medium, and 832x832 for large.

Large tile for the large boat, medium for the medium sized boat, small for the small boat.

You may note that we will run into the small-batch-size issue again with the larger tile size. Our solution is to scale the 832x832 extent down to a 416x416 input, the assumption being that detection of larger objects will not suffer as much from the loss of information during downscaling. For consistency we also scale the 208x208 tile up to 416x416 (in reality this was not necessary, but it simplifies our setup slightly).

This means we have three input types at three different fixed scales (because of the up- and downsampling). There is no reason to make a single network learn all of these scales at the same time if we know ahead of time what they are. Let's simplify the problem and split it into three separate YOLO networks; we call this ensemble MultiYOLO. With MultiYOLO, one network focuses on small objects via the upscaled 208x208 tiles, another on medium objects using 416x416 tiles, and the third on large objects using the downscaled 832x832 tiles.
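As a rough sketch of how such a split can be wired up (the size thresholds and helper names below are illustrative choices, not the exact ones we use):

    import math
    import cv2   # only used for resizing; any image library would do

    BRANCH_TILE = {"small": 208, "medium": 416, "large": 832}   # tile size in source pixels
    NET_INPUT = 416                                             # shared network input size

    def branch_for_box(width_px, height_px):
        """Route an annotation to a branch by its bounding-box diagonal."""
        diagonal = math.hypot(width_px, height_px)
        if diagonal < 32:
            return "small"
        if diagonal < 128:
            return "medium"
        return "large"

    def prepare_tile(tile, branch):
        """Up- or downscale a branch's tile to the shared 416x416 network input."""
        assert tile.shape[0] == tile.shape[1] == BRANCH_TILE[branch]
        return cv2.resize(tile, (NET_INPUT, NET_INPUT))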

Some results…

Here is a sample of the output from the small, medium and large object networks. Note how each network appropriately focuses on a different object size range.

From left to right, the small, medium and large predictions. As we can see each network focuses on different object sizes.

However, there is a missing piece of the puzzle that we have yet to explain. We only told you which tile sizes we use for training, not how we actually sample the tiles from each XView image. This sampling plays a crucial role in training the MultiYOLO ensemble. We will discuss it in Part 2 of this post!

https://picterra.ch/news/
