Introducing DecaVision to train image classifiers with Google’s free TPUs

Train your own computer vision models in a fraction of the time with Decathlon’s Python package

Published in Decathlon Digital · 10 min read · Dec 11, 2020

At Décathlon Canada, one of the main tasks of the AI team is to extract meaningful information from images. To do so, we have spent a fair amount of time building a solid pipeline to train image classification models in Python with TensorFlow. My colleague Samuel already wrote a wonderful series of articles (see parts 1, 2 and 3) about the algorithms used in our pipeline. To summarize, we combine data augmentation, transfer learning and hyperparameter optimization to get the best models possible.

To make this pipeline as easy to use and as accessible as possible, we released the whole codebase as a Python package called DecaVision. We also wrote documentation that describes all the details of the library, with links to the GitHub repository and a Colab notebook with examples. The objective of this article is to show how easy it is to train a model from scratch with our package by walking through an example that we recently built: a yoga pose classifier. The trained model is deployed and available for free on the sport vision API.

The idea behind the yoga pose classifier came from a request made by the communication leader of the yoga vision at Decathlon. She discovered our social listening tool (see this post for more info) and found it extremely useful for finding relevant images of people practicing yoga to help catch her audience’s attention. Being able to identify the specific pose in yoga images and, as a consequence, the most popular poses will create a lot of value for her.

Example of information about images in the social listening tool

Why choose DecaVision?

The main feature that sets DecaVision apart is that it is optimized to work with Google Colab’s free TPUs. Those are made available to democratize deep learning for people who don’t have access to heavy computing resources. They are, however, not easy to use at full capacity, so DecaVision relies on tfrecords and TensorFlow 2 to gain as much speed and performance as possible.

There are many ways to use the DecaVision package to train a model, but we describe the most efficient method, which relies on Google Colab and Google Cloud Storage (GCS). You will need a Google account to follow this protocol with your own data, but there is a free tier of GCS if you don’t use too much storage.

Dataset preparation

Let’s dive directly into building our yoga classifier. The first and most important thing to do is collect images of interest. There are many sources we can use for that: Google images, Instagram, Kaggle, etc. This step is crucial because of the well-known GIGO principle (garbage in, garbage out): a model can only be as good as the data it is trained on, so we suggest going through this step carefully. Once we have at least 100 images per category, we organise them into the following folder structure on our computer:

image_dataset/
  train/
    category_1/
      image_1.jpg
      image_2.jpg
    category_2/
      image_3.jpg
      image_4.jpg
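Before moving on, it can help to verify that every category folder actually holds enough images. The following is a minimal sketch using only the standard library; the helper names and the list of accepted file extensions are our own assumptions and are not part of DecaVision.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your own dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_category(train_dir):
    """Return {category_name: number_of_images} for each subfolder of train_dir."""
    counts = {}
    for category_dir in sorted(Path(train_dir).iterdir()):
        if category_dir.is_dir():
            counts[category_dir.name] = sum(
                1
                for f in category_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


def categories_below_minimum(counts, minimum=100):
    """Return the sorted names of the categories with fewer than `minimum` images."""
    return sorted(name for name, n in counts.items() if n < minimum)
```

Calling `categories_below_minimum(count_images_per_category('image_dataset/train'))` lists the categories that still need more images.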

In our specific example, we are focusing on the following 18 categories: bridge, camel, chair, child, cobra, downward_dog, forward_bends, horizontal_handstand, lotus, plank, standing_forward_bend, tree, triangle, twists, vertical_stand, warrior_1, warrior_2 and wheel. Our main sources are Google images and Instagram, with images of varying difficulty.

Examples of images for the warrior 1 pose. The one on the left, taken from Instagram, is clearly harder to identify than the one on the right, taken from Google images.

The next steps can be performed either in a local Jupyter notebook or on Colab, but we suggest doing them locally to have a fixed version of the dataset to experiment on instead of creating a new one for each experiment. We start by installing and importing DecaVision, ideally in a new virtual environment:

!pip install decavision
import decavision

Since so far we have only collected training data, we separate a part of our images to create a validation dataset. In our case we chose the fraction to be 10%.

decavision.utils.data_utils.split_train(path='image_dataset', split=0.1)
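To make the effect of this step concrete, here is a rough sketch of what such a split does to the folder structure: a random fraction of each category is moved from train/ to a sibling val/ folder. This is a hypothetical illustration written with the standard library only; it is not DecaVision's implementation, and the function name and `seed` parameter are our own.

```python
import random
import shutil
from pathlib import Path


def split_train_sketch(dataset_dir, split=0.1, seed=0):
    """Move a random fraction `split` of each category from train/ to val/.

    Hypothetical illustration only, not DecaVision's implementation.
    Returns {category_name: number_of_images_moved}.
    """
    train_dir = Path(dataset_dir) / "train"
    val_dir = Path(dataset_dir) / "val"
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    moved = {}
    for category_dir in sorted(p for p in train_dir.iterdir() if p.is_dir()):
        images = sorted(f for f in category_dir.iterdir() if f.is_file())
        n_val = int(len(images) * split)
        target = val_dir / category_dir.name
        target.mkdir(parents=True, exist_ok=True)
        # Pick n_val distinct images at random and move them to val/<category>/
        for image in rng.sample(images, n_val):
            shutil.move(str(image), str(target / image.name))
        moved[category_dir.name] = n_val
    return moved
```

Splitting per category, rather than over the whole dataset at once, keeps every class represented in the validation set.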

We now use the data augmentation functionality of DecaVision, which creates new images from the original ones by applying random transformations. Doing this step helps the model generalize better. It is important to perform data augmentation only on the training data. At testing time we do not apply data augmentation and simply evaluate our trained network on the unmodified testing data.

augmentor = decavision.dataset_preparation.data_augmentation.DataAugmentor(
    path='image_dataset/train', distortion=True, flip_horizontal=True, flip_vertical=True,
    random_crop=True, random_erasing=True, rotate=True, resize=True, skew=True, shear=True,
    brightness=True)
augmentor.generate_images(250)

In this specific case, we use the following transformations: distortion, flip_horizontal, flip_vertical, random_crop, random_erasing, rotate, resize, skew, shear and brightness. We ask the function to aim for around 250 images total per class, after augmentation. The last step to do on the local machine before jumping to Colab is to zip the image_dataset folder.
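The zipping step can be done with any archive tool. As one possible approach, the sketch below uses Python's standard `shutil.make_archive`; the helper name `zip_dataset` is our own and is not part of DecaVision.

```python
import shutil
from pathlib import Path


def zip_dataset(folder="image_dataset"):
    """Create <folder>.zip next to the folder and return the archive path."""
    folder = Path(folder).resolve()
    return shutil.make_archive(
        base_name=str(folder),        # the archive is written to <folder>.zip
        format="zip",
        root_dir=str(folder.parent),  # paths inside the archive are relative to this
        base_dir=folder.name,         # so every entry starts with "image_dataset/"
    )
```

Keeping the top-level folder name inside the archive means that unzipping it later recreates the same image_dataset/ structure.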

Example of what data augmentation does to images. In this case, we clearly see that only a part of the image was kept by the random crop and that a part of the image was removed by the random erasing.

The only remaining step in the preparation of the data is to transform the images into tfrecords. This is a special data format that was created to play well with TensorFlow, and it is essential to access the maximal speed improvements that TPUs can provide. Converting the data to tfrecords could be done locally as well, but we would have to upload the resulting files manually to Google Cloud Storage, so we prefer doing it directly on Colab.

In a Colab notebook, we start by installing and importing the package as we did above. Be careful not to activate the TPU for the data preparation step, since TensorFlow is not able to access local files with a TPU. Next we upload our zip file to the notebook’s local storage (we could also use the function decavision.utils.colab_utils.download_dataset to load the dataset from Google Drive) and unzip it using:

!unzip image_dataset.zip

As we just mentioned, TensorFlow cannot access local files when working with a TPU. This is why we have to go through Google Cloud Storage to train. It’s at this step that we have to create a bucket to store the tfrecords. The advantage is that the data stays in the bucket and does not have to be loaded every time we restart the Colab notebook. To let Colab access the GCS bucket, we have to authenticate our Google account:

decavision.utils.colab_utils.authenticate_colab()

We then simply use DecaVision to create the training and validation tfrecords and upload them directly to our bucket:

generator = decavision.dataset_preparation.generate_tfrecords.TfrecordsGenerator()
generator.convert_image_folder(img_folder='image_dataset/train', output_folder='gs://myproject/image_dataset/train')
generator.convert_image_folder(img_folder='image_dataset/val', output_folder='gs://myproject/image_dataset/val')

The previous content of the bucket will be deleted to create the new files. By default, the images are split into 16 different files, but we can change the number of shards to create files of around 100 MB each. The images are resized to (299, 299) by default, which is the correct size for most of the pretrained models. However, we always need to make sure that we use the correct size for the model we want to try.
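To choose a number of shards that lands near the 100 MB target, a quick back-of-the-envelope calculation is enough. The sketch below is only a starting estimate under our own assumptions: it measures the size of the image files on disk, whereas the resulting tfrecords contain resized images and may differ in size. The helper names are hypothetical and not part of DecaVision.

```python
import math
from pathlib import Path


def dataset_size_bytes(folder):
    """Total size in bytes of all the files found under `folder`."""
    return sum(f.stat().st_size for f in Path(folder).rglob("*") if f.is_file())


def suggested_num_shards(total_bytes, target_shard_bytes=100 * 1024 * 1024):
    """Number of files needed so that each one holds at most the target size."""
    return max(1, math.ceil(total_bytes / target_shard_bytes))
```

For instance, `suggested_num_shards(dataset_size_bytes('image_dataset/train'))` gives a first guess that can then be adjusted after looking at the size of the files actually written to the bucket.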

Training a model

Good job, all of the data processing is now done and we never need to repeat it! From now on, we simply authenticate our Google account when we open a notebook and start training with the data saved directly on Google Cloud Storage. For optimal results, we can now activate the TPU.

When training a deep neural network, there are many hyperparameters that need to be tuned in order to obtain the best model, for example the learning rate or the number of epochs. DecaVision does all this testing for us with a single function:

classifier = decavision.model_training.tfrecords_image_classifier.ImageClassifier(tfrecords_folder='gs://myproject/image_dataset', batch_size=256, transfer_model='B5')
classifier.hyperparameter_optimization(num_iterations=25, n_random_starts=10)

This function starts by training a model 10 times with random combinations of hyperparameters. It then uses what it learned from these random combinations to find 15 (hopefully) better ones. All these tests are printed with their results, so we can choose our favorite combination at the end. Each training run consists of training an extra layer added on top of a frozen pretrained model and then unfreezing the last block of that model to fine-tune it. To learn more about this technique, called transfer learning, refer to the article series mentioned in the introduction (in particular part 1)!
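The idea of exploring at random first and then exploiting what was learned can be illustrated with a toy example. The sketch below is not DecaVision's optimizer: as a simple stand-in for the guided phase, it perturbs the best combination found so far. The objective function, the hyperparameter ranges and all the names are made up for the illustration.

```python
import math
import random


def toy_objective(lr, dropout):
    """Made-up validation error with its minimum at lr=1e-3 and dropout=0.2."""
    return (math.log10(lr) + 3) ** 2 + (dropout - 0.2) ** 2


def two_phase_search(objective, num_iterations=25, n_random_starts=10, seed=0):
    """Run random trials first, then refine around the best trial found so far.

    Returns the list of (score, learning_rate, dropout) for every trial.
    """
    rng = random.Random(seed)
    trials = []
    for i in range(num_iterations):
        if i < n_random_starts:
            # Exploration: learning rate sampled on a log scale, dropout uniformly
            lr = 10 ** rng.uniform(-5, -2)
            dropout = rng.uniform(0.0, 0.5)
        else:
            # Exploitation: perturb the best trial so far, clipped to the same ranges
            _, best_lr, best_dropout = min(trials)
            lr = min(1e-2, max(1e-5, best_lr * 10 ** rng.gauss(0, 0.3)))
            dropout = min(0.5, max(0.0, best_dropout + rng.gauss(0, 0.05)))
        trials.append((objective(lr, dropout), lr, dropout))
    return trials
```

With the default arguments, `two_phase_search(toy_objective)` runs 10 random trials followed by 15 guided ones, and `min(trials)` returns the best combination found.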

The only options to choose when doing hyperparameter optimization with DecaVision are the batch size and the pretrained model. In this case, we are taking EfficientNet B5, but many others are available, like ResNet, Xception and EfficientNet B3.

Once the hyperparameters that produce the model with the best validation accuracy are found, we need to train the final model. The hyperparameters dictated by the optimization are the number of epochs, a distinct learning rate for each of the two steps of training, the dropout rate, the number of units in the extra layer and whether or not to do the fine-tuning step.

classifier = decavision.model_training.tfrecords_image_classifier.ImageClassifier(tfrecords_folder='gs://myproject/image_dataset', batch_size=256, transfer_model='B5')
classifier.fit(hyperparameters)

Here we do not show the hyperparameters that were found for the yoga poses classifier. The final step in the training portion is to save the model. This is as easy as using the following method:

classifier.model.save('model.h5')

If the extension .h5 is not specified, the model will be exported to a SavedModel format. This format can however only be used on a TPU if we save the model to GCS.

TPU vs GPU vs CPU

Before moving along, we would like to report the results of a small experiment that we did to show that using Colab’s TPUs is worth all the trouble. We trained a model with the same hyperparameters (except batch size, which has to be higher for TPUs and lower for GPUs for performance) on our yoga dataset (around 11k images) and tried the different processing units available on Colab Pro. Here are the results:

The TPU available was the v2, and we averaged around 40 seconds per epoch on both steps of the training. The GPU that we tried was the Tesla V100, and we averaged 55 seconds per epoch for the first step of training and 70 seconds for the fine-tuning step. Finally, just for fun, the first epoch took more than 30 minutes on the available CPU…
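To put these timings in perspective, the small computation below turns them into speedup ratios relative to the TPU. It uses only the per-epoch numbers quoted above; since the CPU epoch took more than 30 minutes, its ratio is a lower bound.

```python
# Seconds per epoch reported above
TPU_SECONDS = 40
GPU_FIRST_STEP_SECONDS = 55
GPU_FINE_TUNING_SECONDS = 70
CPU_SECONDS_AT_LEAST = 30 * 60  # "more than 30 minutes", so a lower bound


def speedup(other_seconds, tpu_seconds=TPU_SECONDS):
    """How many times faster the TPU is than the other processing unit."""
    return other_seconds / tpu_seconds


gpu_first_step_ratio = speedup(GPU_FIRST_STEP_SECONDS)    # 55 / 40
gpu_fine_tuning_ratio = speedup(GPU_FINE_TUNING_SECONDS)  # 70 / 40
cpu_ratio_at_least = speedup(CPU_SECONDS_AT_LEAST)        # 1800 / 40
```

In other words, the TPU was about 1.4 times faster than the GPU on the first step, 1.75 times faster on the fine-tuning step, and at least 45 times faster than the CPU.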

Evaluating the model

Once a satisfactory model has been saved, it is important to use it a little before declaring victory. We can use DecaVision to evaluate the model with different tools. The simplest test to do is evaluating the accuracy on different datasets where the images are separated in a folder per category:

tester = decavision.model_testing.testing.ModelTester('model.h5')
tester.evaluate(path='image_dataset/val')

The results that we got were an accuracy of 96% on the training set, 95% on the validation set and 83% on the test set. The model performs much worse on the test set because it was made only from Instagram images, which are harder than the Google images that make up a part of the other two datasets.

To see explicitly how our yoga classifier performs on new images that it has not seen, we can just place them in a folder and use the following function:

import os
categories = os.listdir('image_dataset/train')
tester.classify_images(image_path='image_dataset/test', categories=categories)

This feature requires an ordered list of the possible categories, which we infer here from the training dataset itself. Here are a few examples of results for the yoga poses classification model:

Examples of predictions from our best model.

To determine whether the model repeatedly makes mistakes on certain categories, we can plot the confusion matrix:

tester.confusion_matrix(path='image_dataset/val')

Confusion matrix evaluated on the validation set.

For the yoga classifier, we can see that the model sometimes confuses bridge with wheel and forward bend with child. This is to be expected, since these poses look pretty similar even to a human. If we found something unexpected in the confusion matrix, it would be an indication to take a second look at the dataset. In any case, we now know which classes to focus on when trying to improve the model.
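The same information can be read off programmatically. The sketch below counts (true, predicted) label pairs and ranks the most frequent mistakes. It uses only the standard library, and the labels in the usage example are invented for illustration; they are not results from our model.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how many times each (true, predicted) pair of labels occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused_pairs(counts, top=3):
    """Return the most frequent (true, predicted) pairs where the labels differ."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    # Most frequent first; ties are broken alphabetically for a stable order
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:top]


# Made-up labels, for illustration only
true_labels = ["bridge", "bridge", "bridge", "wheel", "child", "forward_bends"]
predicted_labels = ["bridge", "wheel", "wheel", "wheel", "child", "child"]
pairs = most_confused_pairs(confusion_counts(true_labels, predicted_labels))
```

Here `pairs` ranks the bridge/wheel confusion first because it occurs twice, followed by forward_bends/child.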

A final method of assessing the reliability of the model is by directly looking at images that it makes mistakes on:

tester.plot_errors(path='image_dataset/val', num_pictures=9)

Once we are sure that the model is satisfactory, the final step is to deploy it. We will save the story of how we did this for another time. Do not hesitate to try out the model for yourself!

The actual process

The steps explained above seem simple and straightforward but of course reality is not like that. The process of building a satisfactory image classification model is iterative, which means that we have to repeat the same steps many times.

In the case of classifying yoga poses, we started with only 10 easy poses and no data augmentation to assess the difficulty of the task. We ended up with a very good model surprisingly fast, so we decided to add poses to reach 18. We were not satisfied with the model that we got from that data, even after a lot of hyperparameter tuning, so we decided to use data augmentation. The final model that we obtained was satisfactory enough to deploy online, but we can still see a few categories that need improvement just by looking at the confusion matrix. We will work on improving the model in the future.

We are hiring!

Are you interested in computer vision and the application of AI to improve sport accessibility? Luckily for you, we are hiring! Follow https://developers.decathlon.com/careers to see the different exciting opportunities.

Let us know if you have any comments or suggestions about the topic of this article and don’t hesitate to share it with your network if you liked it :) If you have any ideas to improve the package, take a look at the GitHub repository to see how you can contribute to the project.

A special thanks to the members of the AI team at Décathlon Canada for the comments and review, in particular Samuel Mercier, Heri Rokotomalala and René Lanciné Doumbouya.