Solving an object detection problem with Google’s TPUs

Artyom Palvelev
5 min read · Oct 18, 2019

Recently I participated in a Machine Learning competition on Kaggle: Google’s Open Images 2019 Challenge, Object Detection track. I was very limited in time: I only had four weeks, and training a single model takes about two weeks on four 1080Ti GPUs. But I was lucky enough to have some TPU credits from Google, so I decided to give it a try. Here is the story.

Cloud TPU Pod

What are TPUs?

TPUs are specialized chips that are very good at matrix multiplication, which makes them ideal for Deep Learning. On a single TPU, a model typically trains 10 to 100 times faster than on a single GPU.

Currently, there are TPU v2 and v3; v3 has twice as much memory and works faster (2x faster according to the specs; I observed about a 1.5x speedup). I used the v2–8 and v3–8 configurations (8 cores each, which is the smallest configuration).

A TPU works as a dedicated computer in Google’s datacenter. It has no local storage, so all data has to be streamed in and out of Google Cloud Storage. To use a TPU, you create a pair of a VM and a virtual TPU device in the GCP Control Panel. You run your program on the VM; it connects to the TPU device and just sends commands to it. Input data is read from Google Cloud Storage, logs are streamed back to the control VM, and all results (including TensorBoard data and saved models) are written back to Google Cloud Storage.
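To make this more concrete, here is a minimal sketch of the VM-side program in TF 1.14: it resolves the TPU by name and points all input and output at Cloud Storage. The TPU name, bucket paths, and the model_fn/input_fn stubs below are placeholders for illustration, not my actual competition code (the RetinaNet repo I describe later provides its own).

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Build the network and return a TPUEstimatorSpec here;
    # the RetinaNet repo ships its own model_fn.
    raise NotImplementedError

def input_fn(params):
    # Return a tf.data.Dataset that reads TFRecords from GCS.
    raise NotImplementedError

# The resolver finds the TPU by the name it was given in the GCP Control Panel.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')

run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/model_dir',  # checkpoints and TensorBoard data go to GCS
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100),
)

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    train_batch_size=64,
    use_tpu=True,
)

estimator.train(input_fn=input_fn, max_steps=10000)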

API and limitations

Ok, TPUs are powerful, but everything comes at a price. Unfortunately, TPU support in PyTorch wasn’t fully ready, so I had to use TensorFlow. TensorFlow 2.0 was also not ready in terms of TPU support, so I stuck with good old TF 1.14.

Currently, TPUs require a completely static TF graph, which imposes some limitations on the models you can use. For example, I could not use a random input image size for RetinaNet, and hard negative mining is not possible.

Input data is streamed over the network, so the data format must be efficient. Loose image files would probably make the network a bottleneck; a better way is to pack everything into TFRecord files. We also lose some freedom here: I don’t know how to implement balanced sampling with TFRecords. Actually, I have an idea: maybe it can be achieved with specially crafted TFRecord files, one per class (I didn’t try it; a sketch of the idea is below).
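For what it’s worth, here is roughly how that untried idea could look with tf.data: one TFRecord file per class, sampled with equal probability. The file names and uniform weights below are made up for illustration.

import tensorflow as tf

# Hypothetical per-class TFRecord files, one file per class.
class_files = ['gs://my-bucket/records/class_000.tfrecord',
               'gs://my-bucket/records/class_001.tfrecord']  # ... one per class, 500 in total

per_class = [tf.data.TFRecordDataset(f).repeat() for f in class_files]
weights = [1.0 / len(per_class)] * len(per_class)  # uniform sampling over classes

balanced = tf.data.experimental.sample_from_datasets(per_class, weights)
balanced = balanced.shuffle(1024).batch(64)  # then parse the examples and feed the model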

Object detection models

I used this repo by Google: https://github.com/tensorflow/tpu/tree/master/. It implements ResNet and EfficientNet for image classification and RetinaNet (https://arxiv.org/abs/1708.02002) for object detection. Here is my fork of this repo with a few fixes and customizations: https://github.com/artyompal/tpu_models.

This repo comes with very good tutorials, which I highly recommend if you want to try training on TPUs.

More about this particular RetinaNet implementation

Originally, the only supported backbones were ResNet50/101/152/200. I added EfficientNet support. I also added an SE-ResNeXt backbone, but it’s not final (it runs too slowly; maybe the image channel order is not right).

This implementation also supports both a regular FPN and an improved version called NAS-FPN (https://arxiv.org/abs/1904.07392), which, as the name implies, is the result of a Neural Architecture Search for a better FPN architecture. The latter works significantly better.

Both RetinaNet and the ResNet backbone also support DropBlock (https://arxiv.org/abs/1810.12890), a better alternative to dropout for convolutional networks; it does improve the score for both classification and object detection models.
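For intuition, here is a simplified DropBlock sketch in TF 1.x (NHWC layout, square blocks, float32). Unlike the paper, it samples seed positions over the whole feature map for brevity; the repo’s real implementation differs in such details.

import tensorflow as tf

def dropblock(x, keep_prob=0.9, block_size=7, is_training=True):
    # x: feature map of shape [batch, height, width, channels].
    if not is_training or keep_prob == 1.0:
        return x
    _, h, w, _ = x.get_shape().as_list()
    # Seed probability chosen so the expected dropped fraction is about 1 - keep_prob.
    gamma = ((1.0 - keep_prob) / block_size ** 2
             * (h * w) / ((h - block_size + 1) * (w - block_size + 1)))
    # Sample block centers, then grow each seed into a block_size x block_size square.
    seeds = tf.cast(tf.random_uniform(tf.shape(x)) < gamma, x.dtype)
    block_mask = 1.0 - tf.nn.max_pool(
        seeds, ksize=[1, block_size, block_size, 1],
        strides=[1, 1, 1, 1], padding='SAME')
    # Rescale so the expected activation magnitude stays the same.
    return x * block_mask * tf.to_float(tf.size(block_mask)) / tf.reduce_sum(block_mask)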

Open Images dataset

It’s a huge dataset: 1.8M images annotated with 12M bounding boxes across 500 classes. The labels are noisy and not always correct. The dataset also has severe class imbalance: as far as I remember, the most common class has about 440K annotations, while the least frequent one has only 22. To deal with the imbalance, I split the classes into 6 groups by frequency.
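The exact grouping isn’t important; roughly, it can be done like this (the annotation file name and the log-spaced split below are just for illustration, not necessarily the rule I used).

import numpy as np
import pandas as pd

# Count bounding boxes per class from the Open Images annotation CSV
# (the file name here is a placeholder for a local copy of the train boxes).
boxes = pd.read_csv('train-annotations-bbox.csv')
counts = boxes['LabelName'].value_counts()

# Split the 500 classes into 6 roughly log-spaced frequency buckets.
edges = np.logspace(np.log10(counts.min()), np.log10(counts.max()), num=7)
groups = np.digitize(counts.values, edges[1:-1])  # group id 0..5 for each class

for g in range(6):
    print('group', g, ':', int((groups == g).sum()), 'classes')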

The dataset classes also form a hierarchy.

I only used leaf classes for training, which was actually a mistake: the results could have been better if the model had seen more data. The problem is that not all objects are labeled rigorously. For example, some people are labeled as Man, some as Woman, and some just as Person (the parent of both classes).
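For reference, collecting the leaf classes from the class hierarchy JSON looks roughly like this (assuming the usual LabelName/Subcategory layout of the hierarchy file; this is an illustration rather than the exact script from my pipeline).

import json

def collect_leaves(node, leaves):
    # A node without a Subcategory list is a leaf class.
    children = node.get('Subcategory', [])
    if not children:
        leaves.append(node['LabelName'])
    for child in children:
        collect_leaves(child, leaves)
    return leaves

with open('bbox_labels_600_hierarchy.json') as f:  # hierarchy file shipped with the dataset
    hierarchy = json.load(f)

leaf_classes = collect_leaves(hierarchy, [])
print(len(leaf_classes), 'leaf classes')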

My training process

So basically, I trained RetinaNet with those bells and whistles in various combinations. Training took 1–2 days on a single TPU. A TPU v3 allows a batch size of 64 with 1024x1024 images (it has 16 GB of HBM memory per core, and there are at least 8 cores).

One more problem: I only had pretrained weights for ResNet50. But I had plenty of TPU power, so I just downloaded ImageNet and trained my own pretrained ResNet101/152/200 models. This actually worked great.

I added EfficientNet backbone support, but unfortunately it didn’t work as well. I trained some models with EfficientNet, but I hit a weird export error on the very last night, so they weren’t part of the final solution.

Ensembling

Since this was Kaggle, you have to build an ensemble of strong models to get the best possible score. One could merge the predictions with Non-Max Suppression, but a better approach is Soft-NMS (https://arxiv.org/abs/1704.04503). The code looks like this: https://github.com/artyompal/tpu_models/blob/master/scripts/inference/soft_nms.pyx.
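To show the idea, here is a simplified NumPy version of the Gaussian Soft-NMS variant (the linked soft_nms.pyx is a faster Cython implementation with more options). For an ensemble, you concatenate the boxes and scores produced by all models for a given image and class, then run this instead of plain NMS.

import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    # boxes: [N, 4] as (x1, y1, x2, y2); scores: [N].
    # Returns indices of kept boxes and the decayed scores.
    scores = scores.astype(np.float32).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        top = np.argmax(scores[idxs])
        best = idxs[top]
        keep.append(best)
        idxs = np.delete(idxs, top)
        if len(idxs) == 0:
            break
        # IoU of the current best box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[best, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[best, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[best, 3], boxes[idxs, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_best + area_rest - inter)
        # Decay the scores of overlapping boxes instead of removing them outright.
        scores[idxs] = scores[idxs] * np.exp(-(iou ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]
    return np.array(keep), scores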

Final thoughts

TPUs are really powerful. Of course, we still need a stable PyTorch implementation :). I hope TPUs will make Deep Learning easier and at least partially break Nvidia’s monopoly, which should make cloud computing more accessible.
