On the importance of proper data handling (part 2)

Efficient Sampling with a Quadtree

Roger Fong
Picterra
6 min read · Dec 20, 2018


In Part 1 of this post we ended with the question of how we can actually sample from our XView images. If you haven't read that post yet, I highly suggest that you do! To recap, we are now at the point where we need to sample tiles from the XView dataset, and we want to focus on areas containing objects of interest in an efficient manner. Continuing in the same manner as before…

Problem 4: How do we pick our tiles (efficiently)?

Naively, we could just sample a tile randomly from each XView image. However, this does not work. A major problem that we ran into when training our models was the spatial sparsity of the data. The issue is easiest to see with the “small object” MultiYOLO model. Let’s randomly sample some small tiles from part of an XView image.

We only sample empty tiles!

Just by sampling randomly we get mostly empty patches with no objects, and a large number of the XView images look like this. The result? The model learns that predicting nothing at all is the best solution. This is not ideal: we need to focus more on the tiles that contain objects. In addition, while this is less of an issue with larger tile sizes, it still takes away from the amount of valuable data seen during training. The obvious solution is to keep resampling until we hit a patch that contains at least one annotation of interest. This would work, except that some images are so sparse that we could be sampling for a very long time before we hit anything. In practice, doing this made training the small network take an order of magnitude longer, which is not feasible.
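To make the cost concrete, here is a minimal sketch of that naive rejection loop (the function name and the annotation format, object centers as (cx, cy) pairs, are assumptions for illustration, not our actual code):

    import random

    def sample_tile_naive(image_w, image_h, centers, tile_size):
        """Keep drawing random tiles until one contains at least one
        annotation. On sparse images this loop can run almost forever."""
        while True:
            x = random.randint(0, image_w - tile_size)
            y = random.randint(0, image_h - tile_size)
            if any(x <= cx < x + tile_size and y <= cy < y + tile_size
                   for cx, cy in centers):
                return x, y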

Creating a model that trains quickly is just as important as, if not more important than, finding a high-performing architecture. Indeed, in many deep learning applications the bottleneck is not the potential of the architecture to solve the problem but the time it takes to train and iterate on said architecture.

Solution 4: Quadtrees!

So what can we do better than random? Quadtree random! Before we get into it, let’s provide a bit more background. A quadtree is an acceleration data structure that will greatly speed up our sampling.

So what is a quadtree? Well, most of us computery people know what a binary tree is. A quadtree is just the same concept applied in 2D (in our case, over image space).

On the left, a binary tree. On the right, a quadtree applied to an image. The root is the whole image, and its 4 children are the 4 quadrants of the root image. Each child is then further subdivided into another 4 children, and so on.

So what values do we store at the nodes of our tree? At each node we store the probability of sampling within the box corresponding to that node, based simply on the number of objects within that box: it is the percentage of the parent box’s objects that fall inside it. We can then sample by selecting a box at each level of the quadtree according to these probabilities. Once we reach a predetermined depth of the tree, we take the chosen box (at this point a much smaller area) and randomly sample a tile centered somewhere in that box, with a much higher chance of getting an object of interest. Here’s an illustration of the sampling process:

On the left, we calculate that the top right box contains half the objects in the entire scene, so we have a 50% probability of sampling within that box. Let’s say this is the box we end up sampling from. If we divide this chosen box up into 4 quadrants again and recalculate the probabilities using just the objects within it, we see that the smaller top right region again has a 50% probability of being sampled. Within that much smaller box we can then sample a tile randomly, with a high chance of including an object of interest.
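As a rough sketch of how this could look in code (the node layout and the max_depth value are our own assumptions for illustration, not Picterra’s actual implementation):

    import random

    class QuadtreeNode:
        """A node covering the box (x, y, w, h); leaves are reached at max_depth."""
        def __init__(self, x, y, w, h, centers, depth=0, max_depth=4):
            self.box = (x, y, w, h)
            self.count = len(centers)  # number of object centers inside this box
            self.children = []
            if depth < max_depth and centers:
                hw, hh = w / 2.0, h / 2.0
                for qx, qy in ((x, y), (x + hw, y), (x, y + hh), (x + hw, y + hh)):
                    inside = [(cx, cy) for cx, cy in centers
                              if qx <= cx < qx + hw and qy <= cy < qy + hh]
                    self.children.append(
                        QuadtreeNode(qx, qy, hw, hh, inside, depth + 1, max_depth))

    def sample_box(root):
        """Walk down the tree, picking each quadrant with probability
        proportional to its object count, and return the leaf's box."""
        node = root
        while node.children:
            weights = [c.count for c in node.children]
            if sum(weights) == 0:
                break
            node = random.choices(node.children, weights=weights)[0]
        return node.box  # then center a random tile somewhere inside this small box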

The other thing we haven’t discussed is that in order to properly train each of our three networks on small, medium and large objects, we actually need three separate quadtrees per image, each one computed on a different subset of object sizes. The small object quadtree is computed using only the smallest third of the objects, the medium object quadtree using the middle third, and the large object quadtree using the largest third. Note that these quadtree probabilities can be pre-computed over each image for each object size group (small, medium, large) to save time during training.
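A sketch of that precomputation, reusing the QuadtreeNode above (splitting by sorted object area into per-image thirds is our reading of the grouping; the annotation format is hypothetical):

    def build_size_group_trees(image_w, image_h, annotations, max_depth=4):
        """annotations: list of (cx, cy, area) tuples. Returns one quadtree
        per size group, built once per image and cached before training."""
        anns = sorted(annotations, key=lambda a: a[2])  # sort by object area
        n = len(anns)
        groups = {"small": anns[: n // 3],
                  "medium": anns[n // 3 : 2 * n // 3],
                  "large": anns[2 * n // 3 :]}
        return {name: QuadtreeNode(0, 0, image_w, image_h,
                                   [(cx, cy) for cx, cy, _ in group],
                                   max_depth=max_depth)
                for name, group in groups.items()}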

Problem 5: Class imbalance

As discussed earlier, class imbalance is a major issue with the XView dataset, where the instance count per class ranges anywhere between 17 and ~300,000. This makes the classification task very difficult, since it’s harder to learn about the rarer, less frequently seen classes.

Solution 5: Weighted quadtree sampling (and focal loss)

There are typically two ways to deal with class imbalance: sampling the rare classes more often and adjusting the loss function to learn more from said classes.

We can do the former very easily given the quadtree we just built. Before, we counted each object as a single object when calculating our probabilities. Instead, how about we count an object of a rarer class as more than one object (a more “valuable” object)? The result is that tiles with rarer objects end up being assigned higher sampling probabilities, and we are more likely to train on tiles that contain them. We can see this process below.

The same setup as before, but now a rare object counts as 9 objects while everything else still counts as 1. The probability of the upper right quadrant on the left goes from 0.5 up to 0.65. We also see that within that quadrant the lower right sub-quadrant now has a higher probability of being sampled than the upper right sub-quadrant.
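In code, the only change to the quadtree sketch above is how each node’s count is computed; the specific class names and weights here are made up for illustration:

    # Hypothetical per-class weights; a rare class counts as multiple objects.
    CLASS_WEIGHT = {"helicopter": 9.0}  # unlisted classes default to 1.0

    def weighted_count(objects):
        """objects: list of (cx, cy, cls) tuples. Replaces len(centers) when
        building the quadtree, so rare classes inflate node probabilities."""
        return sum(CLASS_WEIGHT.get(cls, 1.0) for _, _, cls in objects)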

For adjusting the loss, we simply use focal loss as described in the RetinaNet paper. Focal loss is a variant of cross-entropy loss where the loss is scaled for each sample individually based on how difficult it is to classify properly. This differs from previous approaches, which scale the loss for an entire class. Focal loss indirectly addresses the rarer classes, since it is typically the samples from these rarer classes that are the hardest to classify. It is much more flexible, however: if a class is rare but also very easy to distinguish, focal loss doesn’t focus too much on it.
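For reference, here is a common PyTorch-style sketch of focal loss (the multi-class form with a single scalar alpha is a simplification; the RetinaNet paper defines it for binary classification with a per-class balancing term):

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        """Cross entropy scaled by (1 - p_t)^gamma, so well-classified
        (easy) samples contribute almost nothing to the total loss."""
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)  # the model's probability for the true class
        return (alpha * (1.0 - p_t) ** gamma * ce).mean()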

To summarize…

So finally, we have the following:

  • Three YOLO v2s (MultiYOLO)
  • Each trained on a different tile extent and scale
  • Each using quadtree sampling to accelerate tile selection
  • Where the quadtree probabilities are weighted using class frequencies
  • Using focal loss instead of cross entropy loss for the classification task

Note that we have no affine data augmentation (translation, rotation, scale), which is pretty nice. Here’s why:

  • Translation: Already handled by the fact that we randomly sample tiles from each XView image, which is equivalent to randomly shifting the tile.
  • Scale: We don’t need to, and don’t want to, augment scale. We know exactly what the scale of our images is, so it is not something our network needs to learn. This makes our problem easier.
  • Rotation: Satellite imagery is top down. This advantage is twofold. First, it means that there is a very limited set of rotational representations our network has to learn (rotations around the vertical axis only). In natural imagery this is not the case, as networks need to learn about all possible rotations of an object around any axis. Second, since fewer rotations are possible, we have probably already seen all possible rotations of an object in our XView dataset, and thus do not need any additional rotation augmentation. Indeed, we tested this and found that it did not improve our score.

Results!

At this point I’m sure you’re ready to see some results, and so without further ado:

Buildings!
Planes! (and buildings)
Construction stuff! (and more buildings)
Shipping containers!
Results on the Picterra platform on vehicles (yellow), buildings (blue) and trains (orange)

As usual, thanks for reading! Picterra out.

https://picterra.ch/news/
