YoloV3 on TPU

Vikas S Shetty
inspiringbrilliance
Aug 28, 2020

Problem Statement

Given a set of images, detect the target logos (localise and classify them). There are multiple ways to solve the problem of object detection and localisation, but for this walkthrough we made use of YoloV3.

Dataset

We decided to evaluate the effectiveness of YoloV3 on logos and picked two logos at random from datasets that are available for research purposes. We used the Belgalogos¹ and Web-2m² datasets to procure images and annotations; the Web-2m images had to be annotated manually. The code we used relied on Pascal VOC XML files for the image annotations.
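
The code expected annotations in Pascal VOC XML, which is straightforward to read with the Python standard library. Below is a minimal sketch of how one annotation file can be parsed into bounding boxes; the file paths are illustrative and not part of our pipeline.

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    """Read one Pascal VOC XML file and return (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        bndbox = obj.find("bndbox")
        boxes.append((
            label,
            int(float(bndbox.find("xmin").text)),
            int(float(bndbox.find("ymin").text)),
            int(float(bndbox.find("xmax").text)),
            int(float(bndbox.find("ymax").text)),
        ))
    return boxes

# Example usage (path is illustrative):
# boxes = parse_voc_annotation("annotations/image_0001.xml")
```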

Training Data

We had roughly 2.7k images (logo1 and logo2 combined) and relied on augmentation to create 5 augmented instances per image in the training set. We also added around 900 negative samples (images containing neither logo1 nor logo2) to reduce the number of false positives.

For augmentation we used ImgAug and randomly applied the following techniques (a minimal pipeline sketch follows the list):

  • Horizontal flipping
  • Vertical flipping
  • Cropping
  • Padding
  • Rotation by 60 to 90 degrees
  • Hue
  • Color temperature
  • Brightness
  • Sharpening
  • Blurring
  • Embossing with Gaussian filter, median filter, etc.
  • Elastic transformation
  • Perspective transformation
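
Below is a minimal sketch of how such a pipeline can be put together with ImgAug. The parameter ranges are assumptions for illustration; only the 60 to 90 degree rotation comes from the list above.

```python
import imgaug.augmenters as iaa

# Apply a random subset of the listed augmenters to each image.
# Parameter ranges are illustrative, not the exact values we used.
augmenter = iaa.SomeOf((1, 4), [
    iaa.Fliplr(1.0),                                     # horizontal flipping
    iaa.Flipud(1.0),                                     # vertical flipping
    iaa.Crop(percent=(0, 0.1)),                          # cropping
    iaa.Pad(percent=(0, 0.1)),                           # padding
    iaa.Affine(rotate=(60, 90)),                         # rotation by 60 to 90 degrees
    iaa.AddToHue((-20, 20)),                             # hue
    iaa.ChangeColorTemperature((4000, 11000)),           # colour temperature
    iaa.MultiplyBrightness((0.7, 1.3)),                  # brightness
    iaa.Sharpen(alpha=(0.0, 0.5)),                       # sharpening
    iaa.GaussianBlur(sigma=(0.0, 2.0)),                  # blurring
    iaa.Emboss(alpha=(0.0, 0.5), strength=(0.5, 1.5)),   # embossing
    iaa.ElasticTransformation(alpha=(0, 30), sigma=5),   # elastic transformation
    iaa.PerspectiveTransform(scale=(0.01, 0.1)),         # perspective transformation
], random_order=True)

# 5 augmented instances per training image, with bounding boxes kept in sync.
# `image` and `boxes` are placeholders in imgaug's expected formats.
# aug_images, aug_boxes = augmenter(images=[image] * 5, bounding_boxes=[boxes] * 5)
```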

Validation Data

For the validation data, we gathered around 800 images from similar sources as well as from videos (usually ads). We made sure there was no imbalance between the number of logo1 and logo2 images, since an imbalance would bias the validation set.

Results on GPU:

We trained YoloV3-SPP as our model on the dataset we had prepared to detect logo1 and logo2. As the results show, the model did quite well, with precision and recall above 85%.

Throughout this post, we will focus on the results we got on the validation set.

Why TPU?

TPUs have been shown to give a 10x performance increase over high-end GPUs while being 50% costlier. The next logical step was to try training YoloV3 on a TPU and find out whether we could achieve similar results while reducing both the time taken and the cost.

Cost and Time Benefits:

Looking up the GCP prices of standard GPUs such as the V100 and of TPUs, it is clear that TPUs are cost-effective and can also reduce the time it takes to train a model. The figures above were recorded for a dataset of roughly 23k images.

Changes that were required:

  1. Porting Keras code to tensorflow.keras: This was a minor change; mostly the imports changed and almost everything else remained the same.
  2. Using the tf.data API instead of Keras generators: Google recommends the tf.data API so that the TPU can be used efficiently. It also offers prefetching and parallelised data fetching, which reduce the time the TPU spends waiting for data.
  3. Moving to TFRecords: Along with tf.data, Google also recommends TFRecord files for the fastest possible input pipeline; in one of their experiments, TFRecords gave at least a 2x speedup. We reused the Keras generators we already had to create the required TFRecords. Since the training data was large (around 30 GB of tensors), we had to shard the records before storing them in the GCP bucket.
  4. Changing the model implementation to use statically shaped tensors: These changes were mostly in the Keras model code, where some tensor shapes were decided dynamically at runtime. This took a while to debug since we had to find the exact sizes of the tensors involved; TPUs cannot work with dynamically shaped tensors and need to know every tensor's exact shape before processing it. A minimal input-pipeline sketch covering points 2–4 follows this list.
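
To make points 2–4 concrete, here is a minimal sketch of the kind of input pipeline they describe: sharded TFRecords on a GCS bucket read through tf.data with parallel parsing and prefetching, a fixed image size, and drop_remainder=True so every batch has a static shape. The feature names, image size and box encoding are assumptions for illustration and would have to match whatever was written when the records were created.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 64        # the batch size mentioned later in the post
IMAGE_SIZE = 416       # assumed YoloV3 input resolution
MAX_BOXES = 20         # assumed cap so the box tensor is statically shaped

# Illustrative feature spec: a JPEG-encoded image plus a flattened list of
# [class, xmin, ymin, xmax, ymax] values per box.
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "boxes": tf.io.VarLenFeature(tf.float32),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])    # static spatial shape
    image = tf.cast(image, tf.float32) / 255.0
    boxes = tf.reshape(tf.sparse.to_dense(example["boxes"]), [-1, 5])
    boxes = boxes[:MAX_BOXES]
    boxes = tf.pad(boxes, [[0, MAX_BOXES - tf.shape(boxes)[0]], [0, 0]])
    boxes = tf.ensure_shape(boxes, [MAX_BOXES, 5])               # static box shape
    return image, boxes

def make_dataset(shard_pattern):
    # e.g. shard_pattern = "gs://<bucket>/train-*.tfrecord" (illustrative)
    files = tf.data.Dataset.list_files(shard_pattern)
    dataset = files.interleave(tf.data.TFRecordDataset, num_parallel_calls=AUTOTUNE)
    dataset = dataset.map(parse_example, num_parallel_calls=AUTOTUNE)
    # drop_remainder=True keeps every batch statically shaped, which the TPU requires.
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    return dataset.prefetch(AUTOTUNE)
```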

Initial Results (LR: 8*10e-4, Epochs: 5 + 50)

Possible reasons for not being closer to GPU run:

The initial results were far off from the GPU run, so we had to debug and figure out why. After some research and looking up issues that are common with TPUs, this is what we came up with.

  1. We used the tf.keras Adam optimizer instead of tf.train Adam, and the tf.keras version does not work well on TPU. Generally, SGD variants work better on TPUs³.
  2. If your model uses batch normalization, a total batch size of less than 256 (for example, less than 32 per core) might reduce accuracy⁴. [We were using a batch size of 64, and every convolution block in YoloV3 has a BN layer.]
  3. The ideal batch size when training on the TPU is 1024 (128 per TPU core), since this eliminates inefficiencies related to memory transfer and padding⁵.
  4. The way warmup is implemented in Yolo might behave differently in different environments, so one idea was to reduce the number of warmup epochs and see how that affected performance.
  5. The learning rate has to be adjusted whenever the batch size is increased or decreased, but by what factor was still uncertain (see the sketch after this list).
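
On the learning-rate question in particular, a common heuristic (an assumption on our part, not something these experiments verified) is the linear scaling rule: multiply the base learning rate by the ratio of the new batch size to the old one. A minimal sketch, also using SGD with momentum as a TPU-friendlier alternative to tf.keras Adam:

```python
import tensorflow as tf

# Linear scaling rule (a common heuristic, not verified in our runs):
# scale the learning rate in proportion to the batch size.
BASE_LR = 8e-4          # illustrative base learning rate for batch size 64
BASE_BATCH_SIZE = 64
TPU_BATCH_SIZE = 256    # e.g. to satisfy the batch-norm guidance above

scaled_lr = BASE_LR * (TPU_BATCH_SIZE / BASE_BATCH_SIZE)

# SGD with momentum, per the guidance that SGD variants tend to behave
# better on TPUs than tf.keras Adam.
optimizer = tf.keras.optimizers.SGD(learning_rate=scaled_lr, momentum=0.9)
```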

Results with Fewer Warmup Epochs (LR: 8*10e-4, Epochs: 2 + 53, Optimizer: tf.keras Adam):

Results with a Lower Learning Rate (LR: 10e-4, Epochs: 2 + 53, Optimizer: tf.keras Adam):

Results with tf.train Adam (LR: 8*10e-4, Epochs: 2 + 53, Optimizer: tf.train Adam):

Results with Variational SGD (LR: 10e-4, Epochs: 2 + 53, Optimizer: tfp VariationalSGD):

Future Work:

  • Even though we tried both Adam and SGD, better convergence may be possible with different learning rates, and there is room to experiment with other optimisers and LR schedulers.
  • In the case of Variational SGD (which implements burn-in), there are many parameters that could be tuned in further experiments.
  • The model might converge better on TPU if run for more epochs; only further experimentation with the parameters can confirm this.
  • Our implementation relied on tf.keras models and .fit() to train the model on the dataset. There are implementations of YoloV3 that use pure TensorFlow functions to represent the model and manual training loops that compute the losses per iteration⁶ (a minimal sketch of such a loop follows this list).
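
For reference, a manual training loop of that kind typically looks like the sketch below; `model`, `train_dataset` and `yolo_loss` are placeholders standing in for a real YoloV3 model, the input pipeline and the YoloV3 loss, not our actual implementation.

```python
import tensorflow as tf

@tf.function
def train_step(model, loss_fn, optimizer, images, targets):
    """One iteration: forward pass, per-iteration loss, gradient update."""
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_fn(targets, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

# Illustrative usage with placeholder objects:
# optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)
# for epoch in range(epochs):
#     for images, targets in train_dataset:
#         loss = train_step(model, yolo_loss, optimizer, images, targets)
```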

Conclusion:

Across the different experiments we tried, we were still not able to get close to the equivalent GPU run on the same dataset. A batch size of 256 or more is needed for batch normalization to work well on TPU, which forces us onto high-end instances that incur a higher cost. So it makes more sense to use TPUs when the amount of data is huge (> 1 million images).
