Using YOLO algorithms to conquer the global Data Science Challenge

Bartol Freškura
Styria.AI Tech Blog
10 min read · Oct 17, 2018

During the last few years, deep learning has entered numerous industries and helped them deliver products of unprecedented quality. In the early phases of the deep learning revolution we mostly had relatively simple use cases such as classification, but as the technology gained traction, algorithms and use cases started to evolve rapidly. Nowadays, deep neural networks are employed to translate languages, personalize user content, optimize various processes, etc. Although there are many applications across different industries, here we want to focus on the ones requiring object detection algorithms.

Object detection is the task of finding and classifying the objects of interest in a given image. Formally, object detection is composed of two subtasks, object localization and classification, but an image is worth a thousand words when explaining concepts:

Object detection task described in a single image.

Given an arbitrary image (in this case, an image of a penguin), the goal is to find the bounding box around the object(s) of interest (the localization part) and to correctly classify them (the classification part). The bounding box can be described with two pairs of values: the first pair is the coordinates of the bounding box center, while the bounding box width and height make up the second pair. The final piece of the puzzle is the correct classification, which follows the localization task.
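As a minimal sketch (the coordinate values below are made up for illustration and normalized to [0, 1]), a single detection can be stored as a class label plus a center-based box, which is trivially converted to corner coordinates when drawing:

```python
# A single detection: class label plus a center-based bounding box.
# The coordinate values are illustrative, normalized to [0, 1].
detection = {"label": "Penguin", "cx": 0.52, "cy": 0.48, "w": 0.30, "h": 0.55}

def center_to_corners(cx, cy, w, h):
    """Convert a (center x, center y, width, height) box to (x_min, y_min, x_max, y_max)."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(center_to_corners(detection["cx"], detection["cy"], detection["w"], detection["h"]))
```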

Solving a problem such as object detection without proper motivation would be pointless, but fortunately, industries have already found ways to utilize object detection algorithms. A hot topic today is autonomous cars, and one of their crucial components is the object detection algorithm. In order to reach its final destination safely and efficiently, an autonomous car has to precisely detect all the objects of interest in its path. Other vehicles, traffic signs, traffic lights, and pedestrians are all objects of interest that must be detected by an autonomous car so it can react properly and in time.

Environment seen from a perspective of an autonomous car. Source: https://medium.com/@karol_majek/bdd100k-dataset-25e83e09ebf8

Object detection is a supervised machine learning task meaning that the algorithms require labeled data to train on. There are many public datasets which are used to benchmark object detection algorithms, one of them being the famous COCO (Common Objects in Context) dataset. COCO 2017 contains around 120k images and supports 80 classes, totaling more than 800k labeled bounding boxes. When compared to the size of the widely used classification dataset ImageNet (ILSVRC challenge has around 1M images and 1000 classes), COCO seems rather small, but there exists one object detection dataset that matches the magnitude of the ImageNet dataset.

Google Open Images Dataset and Challenge

A few years back, Google released the Open Images dataset which is a

… dataset of ~9 million images that have been annotated with image-level labels and object bounding boxes.
The training set of Version 4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average).

Comparison of three famous object detection datasets — Pascal VOC, COCO, and Open Images V4. The Open Images version used in the challenge in fact had 500 classes instead of the 600 in the original dataset; 100 classes were removed due to their infrequency or because they were very broad (e.g. “clothing”).

This year, Google organized the Open Images Challenge, which adopted the Open Images Dataset to create the biggest object detection competition hosted on Kaggle so far. The challenge actually consisted of two separate tracks, the Object Detection track and the Visual Relationship Detection track, but this article focuses on the former.

Despite its size, the Open Images dataset is a “hard” dataset. To see why it is hard, we can look at some examples from the dataset:

Bounding box labeled as an “Aircraft” on an image that is actually a depiction of an aircraft.

The first example is a hydroplane sign which was labeled as an aircraft. The issue with this image is that it is a depiction of an aircraft, an artwork of some kind. Images like this one come from a different distribution than the “real” images, making the training of a deep neural network much harder.

An example of a grouped tree annotation.

Grouped labels are a novelty in the object detection domain; the idea is to create a single “grouped” bounding box label for objects that appear in groups. In the image above there is a forest, and instead of creating a bounding box for each individual tree, there is a single bounding box surrounding multiple trees, labeled as the “Tree (Group)” class. It is important to note that grouped labels do not introduce new classes; instead, each label has a “GroupOf” attribute which indicates whether a bounding box is a grouped label or not.
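For example, when reading the Open Images box annotations with pandas, the grouped boxes can be separated through this attribute (the file and column names below follow the publicly released annotation CSVs; treat them as an assumption if your copy differs):

```python
import pandas as pd

# Load the Open Images box annotations; column names follow the public CSV release
# and may need adjusting for other dataset versions.
boxes = pd.read_csv("train-annotations-bbox.csv")

# Grouped labels are ordinary rows flagged with IsGroupOf == 1, not separate classes.
grouped = boxes[boxes["IsGroupOf"] == 1]
single = boxes[boxes["IsGroupOf"] == 0]
print(f"{len(grouped)} grouped boxes, {len(single)} single-object boxes")
```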

An example of missing bounding boxes.

The final issue is related to missing labels. Many of the images were missing a large fraction of the labels that were actually supported in the dataset. The image above shows three people, but only the boy on the left has a bounding box around him. This also poses a problem for the learning algorithm, because the network gets confused by examples similar to this one, resulting in sub-optimal accuracy of the model.

Our team, named StyriaAI, was one of 454 highly competitive teams, all striving to win the $30,000 prize given to the top three contestants. The evaluation server was up from July 3rd until August 30th, which gave us close to two months to form our solution.

The architecture

Over the past few years, the object detection community has flourished and given us many different models, each with specific drawbacks and advantages. Models can be roughly separated into two categories: single-stage and two-stage models. While two-stage models have separate region proposal and decision refinement processes, single-stage models group these processes into a single module, making the algorithm much faster but, on average, less precise. We opted for the so-called You Only Look Once (YOLO) model. YOLO (we used the 3rd version of the algorithm) is a single-stage object detection model that can make predictions in real time on a single GPU (~30 FPS), which also makes it lightweight to train. Training time was of utmost importance in our experiments because Open Images is a huge dataset and our hardware, along with the time for experimenting, was limited.

Instead of going for a library with built-in object detection algorithms such as the Tensorflow Object Detection API or ChainerCV, our team decided to implement the whole object detection pipeline from scratch in the PyTorch library. The reasoning behind this decision is that we wanted to learn every nook and cranny of the YOLO algorithm, enabling us to later adapt the algorithm to our own custom use case. Moreover, this project was part of a student internship and we wanted to give our students a real challenge to work on :)

The pipeline

The first problem we ran into was the great class imbalance in the dataset.

Class imbalance displayed on the graph. Number of occurrences on a logarithmic scale on the Y-axis, classes on the X-axis.

Class Balancing

The class “Person” was the most frequent, with more than 800k occurrences, while on the opposite side of the spectrum was the class “Pressure Cooker”, totaling only 13 occurrences. Without balancing, the network failed to reach optimal performance because it rarely saw images containing the infrequent classes. To combat this issue, we oversampled images containing infrequent classes and undersampled images with the more frequent classes. The result of this technique was that the model saw an equal number of images for each supported class, lifting the prediction performance on the less frequent classes. The balancing method described above can easily be implemented using PyTorch's WeightedRandomSampler.
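A rough sketch of how this can look in PyTorch; the per-image weighting scheme below (inverse frequency of the rarest class in each image) is a simplification for illustration, not our exact code:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def build_balanced_sampler(image_classes, class_counts, num_samples):
    """image_classes[i] is the set of class indices present in image i,
    class_counts[c] is the number of boxes of class c in the whole training set."""
    # Weight each image by the inverse frequency of its rarest class, so images with
    # infrequent classes are oversampled and images with only frequent classes undersampled.
    # The default=1 guards against images without any box annotations.
    weights = [1.0 / min((class_counts[c] for c in classes), default=1)
               for classes in image_classes]
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=num_samples, replacement=True)

# Hypothetical usage with an existing detection dataset:
# sampler = build_balanced_sampler(image_classes, class_counts, num_samples=len(dataset))
# loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)
```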

Augmentations

The general rule is that the more data you have, the better your deep network will perform. Although Open Images is a large dataset, it doesn't hurt to expand it with new synthetic examples generated by various augmentation methods. We chose a set of fairly standard augmentations: horizontal and vertical flipping, scaling, random cropping, and light and saturation shifts. All were applied with probability p=0.5, except for vertical flipping, where the probability was set to 0.005 because not many images are taken upside-down.
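A sketch of how such a probabilistic pipeline can be composed; only the two flips are shown, and the box format (normalized center coordinates) is an assumption made for the illustration:

```python
import random
import torch

def hflip(image, boxes):
    """Horizontal flip for a CHW image tensor and normalized (cx, cy, w, h) boxes."""
    return torch.flip(image, dims=[2]), torch.stack(
        [1.0 - boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]], dim=1)

def vflip(image, boxes):
    """Vertical flip: mirror the image and the box centers along the Y axis."""
    return torch.flip(image, dims=[1]), torch.stack(
        [boxes[:, 0], 1.0 - boxes[:, 1], boxes[:, 2], boxes[:, 3]], dim=1)

def maybe(aug_fn, p):
    """Wrap an augmentation so it is applied with probability p and skipped otherwise."""
    def _apply(image, boxes):
        return aug_fn(image, boxes) if random.random() < p else (image, boxes)
    return _apply

# Flips only; scaling, random cropping, and color shifts would be plugged in the same way.
pipeline = [maybe(hflip, 0.5), maybe(vflip, 0.005)]

def augment(image, boxes):
    for step in pipeline:
        image, boxes = step(image, boxes)
    return image, boxes
```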

CoordConv layer

An interesting feature we tried was the Coordinate Convolutional (CoordConv) layer, published in a recent paper by Uber.

A regular Convolutional layer compared to a Coordinate Convolutional layer.

The idea behind the CoordConv layer is that, instead of using only the feature map, X-grid and Y-grid coordinate channels are concatenated to the original feature map, as shown in the image above. The idea can be applied to any convolutional layer in the network, but we only used it on the input image, creating an input with five channels: three RGB channels, an X-grid channel, and a Y-grid channel.
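A sketch of how the five-channel input can be built in PyTorch; normalizing the grids to [-1, 1] is a common choice, and the remaining details are our own simplification:

```python
import torch

def add_coord_channels(images):
    """Append X- and Y-coordinate grids (scaled to [-1, 1]) to a batch of NCHW images."""
    n, _, h, w = images.shape
    ys = torch.linspace(-1.0, 1.0, steps=h, device=images.device)
    xs = torch.linspace(-1.0, 1.0, steps=w, device=images.device)
    y_grid, x_grid = torch.meshgrid(ys, xs, indexing="ij")  # each of shape (H, W)
    x_grid = x_grid.expand(n, 1, h, w)
    y_grid = y_grid.expand(n, 1, h, w)
    return torch.cat([images, x_grid, y_grid], dim=1)       # (N, 5, H, W) for RGB input

# A batch of two 465x465 RGB images becomes a five-channel input for the first conv layer.
batch = torch.rand(2, 3, 465, 465)
print(add_coord_channels(batch).shape)  # torch.Size([2, 5, 465, 465])
```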

The final performance boost was achieved by using an ensemble of multiple YOLOv3 models, each having a different backbone architecture. The backbones were ResNet18, ResNet50, DenseNet161, and InceptionV3, all pre-trained on the ImageNet dataset.
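One straightforward way to merge such an ensemble (a sketch of the general idea, not our exact merging code) is to pool the detections from all backbones and run class-wise non-maximum suppression over the pooled set, for example with torchvision's batched_nms:

```python
import torch
from torchvision.ops import batched_nms

def merge_ensemble(per_model_outputs, iou_threshold=0.5):
    """Merge single-image detections from several models.

    per_model_outputs: list of (boxes, scores, labels) tuples, where boxes are
    (x_min, y_min, x_max, y_max) tensors produced by one backbone each.
    """
    boxes = torch.cat([out[0] for out in per_model_outputs])
    scores = torch.cat([out[1] for out in per_model_outputs])
    labels = torch.cat([out[2] for out in per_model_outputs])
    # Class-wise NMS over the pooled predictions removes duplicates across backbones.
    keep = batched_nms(boxes, scores, labels, iou_threshold)
    return boxes[keep], scores[keep], labels[keep]
```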

Achievements

After two months of hard work, lots of bugs, and bug fixes, our team ranked 66th out of 454 teams on the private leaderboard.

66th place out of 454. Not bad for a team of students who developed a small framework from scratch.

The 66th place may not look like something to brag about, but considering that our team implemented everything from scratch (in PyTorch), and that none of the team members had ever worked on a problem in the object detection domain before, it becomes an amazing achievement of which our Data Science team is very proud.

Furthermore, our solution was selected, together with those of 20 other teams, to be presented at the poster session of the prestigious ECCV conference held this September in Munich.

Explaining our approach at the ECCV poster session.

Post-mortem analysis

Developing an object detection pipeline from scratch is a complicated process in which numerous issues and bugs can go undetected. After the competition finished, many teams disclosed their solutions on the competition forum. Among the disclosed solutions, a few used the same YOLOv3 algorithm with much greater success than our team, some of them ranking in the top 20. The huge gap between our accuracy score of 0.22 and a solution that used the same YOLOv3 model yet achieved an accuracy of 0.43 indicated that we had serious flaws in our codebase.

The first bug was pinpointed to the random crop function we implemented. Given a raw image, our function would randomly choose the new width and height of the cropped image, with the constraint that the minimum width and height equal 0.7 * raw_width and 0.7 * raw_height, respectively. The issue arises when objects are partially cropped: as there are many objects per image on average, it very often happens that the partially cropped objects end up so small that they are indistinguishable even to a human.

Example of a randomly cropped image. The bottom bounding box is labeled as a flowerpot, but that is very hard to tell because only a small part of the object remains after the crop.

To remedy this issue, we simply remove all partially cropped bounding boxes whose width or height is smaller than 4% of the raw image's width or height, respectively.
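In code, the fix boils down to a simple filter after cropping; the sketch below assumes boxes in absolute pixel corner coordinates, already clipped to the crop window:

```python
def drop_tiny_boxes(boxes, raw_width, raw_height, min_fraction=0.04):
    """Remove partially cropped boxes whose remaining width or height falls below
    min_fraction of the raw image's width or height.

    boxes: iterable of (x_min, y_min, x_max, y_max) in pixels, clipped to the crop window.
    """
    return [
        (x_min, y_min, x_max, y_max)
        for x_min, y_min, x_max, y_max in boxes
        if (x_max - x_min) >= min_fraction * raw_width
        and (y_max - y_min) >= min_fraction * raw_height
    ]
```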

In the submitted code, all our images were resized to a 465x465 resolution to match the network input size. The problem here lay in the resizing method, which squashed and warped the image to match the desired input resolution.

Apple bounding boxes get squashed to match the 465x465 resolution, resulting in even less recognizable objects.

While not having a significant effect on bigger bounding boxes, squashing and warping can significantly affect smaller objects, which become less distinguishable for the model. Letterbox resizing is an alternative to the default resizing method in which the aspect ratio of width and height is preserved. The end result is bounding boxes without distortions that could make them unrecognizable.

Letterbox resizing will preserve the width and height aspect ratio of the bounding boxes.
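A sketch of letterbox resizing using PIL; the padding color and the centering of the resized image are our own choices, and the corresponding shift of the bounding boxes is left out for brevity:

```python
from PIL import Image

def letterbox(image, target=465, fill=(128, 128, 128)):
    """Resize a PIL image to target x target while preserving the aspect ratio,
    padding the remaining area with a constant color."""
    w, h = image.size
    scale = target / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = image.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new("RGB", (target, target), fill)
    # Center the resized image on the square canvas; the bounding boxes must be
    # scaled by `scale` and shifted by the same padding offsets.
    canvas.paste(resized, ((target - new_w) // 2, (target - new_h) // 2))
    return canvas
```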

The final fix was related to our misinterpretation of the original YOLOv3 paper where it states:

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of 0.5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

Our interpretation resulted in removing all labeled bounding boxes that had an IoU (Intersection over Union) with the best matching anchor smaller than 0.5, and we only considered the best overlapping anchors. The correct implementation always incurs a loss for the best matching anchor, while all other anchors from the same grid cell with IoU > 0.5 are not penalized in the loss function.
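A simplified sketch of the corrected assignment logic for a single ground-truth box; IoU computation and grid indexing are omitted:

```python
def assign_anchors(anchor_ious, ignore_threshold=0.5):
    """Classify each anchor prior for one ground-truth box as positive, ignored, or negative.

    anchor_ious: list of IoU values between the ground-truth box and each anchor prior.
    """
    best = max(range(len(anchor_ious)), key=lambda i: anchor_ious[i])
    assignment = []
    for i, iou in enumerate(anchor_ious):
        if i == best:
            assignment.append("positive")   # always incurs the full loss, even if IoU < 0.5
        elif iou > ignore_threshold:
            assignment.append("ignore")     # good overlap but not the best: no objectness penalty
        else:
            assignment.append("negative")   # treated as background (objectness loss only)
    return assignment

# Example: anchors with IoUs [0.3, 0.6, 0.8] -> ['negative', 'ignore', 'positive']
print(assign_anchors([0.3, 0.6, 0.8]))
```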

Among other improvements, we increased the model input resolution to 576x576, integrated the color shift augmentation, added more convolutional layers to our model, and fixed a bug in the non-maximum suppression function. In the end, we managed to increase our single-model accuracy from 0.20 to 0.37, an increase of 85%.

After many bug fixes our single ResNet50 model achieved a mAP score of 36.73% on the private leaderboard.

Further improvements could be made with more complex backbones such as ResNet101 or the Darknet53 used in the original paper, and with an even larger input resolution. Many disclosed solutions split the dataset into sub-datasets according to the data distribution, training a separate model for each sub-dataset. For the best accuracy, one could simply switch to two-stage object detection models as many top-performing contestants did, but we chose to use a one-stage model nevertheless. Two-stage models are still 3–4x slower than one-stage methods, while the performance gain is “only” in the range of 10–30%. As we may implement object detection algorithms in our future products, a good balance between the accuracy and the speed of the algorithm plays a considerable role in delivering the best user experience for the client.
