Improving performance of image classification models using pretraining and a combination of labeled and unlabeled data

Train image classification models using “semi-supervised transfer learning” to improve performance by leveraging unlabeled data at scale.

Hasib Zunair
Decathlon Digital
10 min read · Mar 4, 2021


Image taken from Unsplash.

Image classification is a fundamental problem in computer vision. It refers to the process of organizing a collection of images into a known number of classes, and then assigning new images to one of these classes.

Conventional machine learning models for image classification rely on labeled data. Labeled data is a collection of samples (i.e., images) that are manually assigned to a class or category among a number of classes. This collection, also referred to as the training dataset, is used to teach the model to predict the right class for a given sample. To learn more, we refer interested readers to a series of articles (part 1, part 2 and part 3) on image classification, as well as this article on building models that run on TPUs using our DecaVision Python package.

While unlabeled data can generally be obtained with minimal human labor, labeling data is often laborious, costly and requires the efforts of experienced human annotators.

At Décathlon Canada, we are building image classification pipelines that can leverage both labeled and unlabeled data at scale. One of our main products, the Sports Vision API (SVAPI), consists of several endpoints based on image classification, object detection and image captioning, which allow users to describe the content of a given product or sport image. More information is available in our documentation here.

Our approach

We use semi-supervised learning (SSL) to address the data labeling problem. SSL leverages large volumes of unlabeled data together with a relatively small amount of labeled data to learn better classifiers [1,2]. It has seen a resurgence in recent years, in large part thanks to its ability to improve model accuracy on important benchmarks.

Out of the many SSL methods available, we found self-training, or pseudo labeling, methods [4,5,6] to best fit our use cases. As part of our pipeline, we also use EfficientNets [3] as base models for both feature extraction and fine-tuning.

We also capitalize on existing models through transfer learning, in particular models that have already learned representative features from large-scale benchmark datasets such as ImageNet [7]. Using a pre-trained network with transfer learning is typically much faster and reduces computational overhead compared to training a network from scratch, since far fewer parameters need to be learned [8]. This strategy also achieves better performance on new tasks. More details on transfer learning can be found here.

We follow a known SSL method to leverage unlabeled data and adapt it with transfer learning for our use cases. We term this approach “semi-supervised transfer learning”.

Our pipeline consists of three stages:

  1. Train a supervised classifier on the labeled dataset, called the teacher model.
  2. Assign a class to each unlabeled sample using the teacher model (pseudo labeling) to construct a new pseudo labeled dataset (a code sketch of this step follows the figure below).
  3. Train a bigger (i.e., larger architecture) model with strong data augmentation on the combined labeled and pseudo labeled dataset, called the student model.
Illustration of Noisy Student Training (Image source: Amit Chaudhary)
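
To make stage 2 concrete, here is a minimal sketch of pseudo label generation in TensorFlow. The names `teacher` and `unlabeled_ds` are hypothetical stand-ins for the trained stage-1 classifier and a tf.data pipeline of unlabeled images; this is an illustration, not the exact DecaVision implementation.

```python
import tensorflow as tf

def pseudo_label(teacher: tf.keras.Model, unlabeled_ds: tf.data.Dataset):
    """Assign a hard pseudo label to every unlabeled image (stage 2)."""
    images, labels = [], []
    for batch in unlabeled_ds:
        probs = teacher.predict(batch, verbose=0)  # teacher's class probabilities
        labels.append(tf.argmax(probs, axis=-1))   # keep the most likely class
        images.append(batch)
    return tf.concat(images, axis=0), tf.concat(labels, axis=0)
```

The resulting pseudo labeled pairs are then combined with the labeled dataset for stage 3.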

Data loading and preprocessing

Prior to training the teacher and student models, the training and validation datasets are converted into the TFRecords format at the desired input size (i.e., 299 × 299 × 3) by resizing. TFRecords is a file format that is well optimized for TensorFlow: since it stores objects in binary format, training goes faster, especially for large datasets. As we are processing more than half a million images, this was a good fit for us. A benchmarking study of input data pipelines can be found here.
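
As an illustration, the sketch below shows such a conversion with raw TensorFlow. The feature schema, output file name and the `dataset_index` list of (path, label) pairs are assumptions made for the example; in practice DecaVision handles this step for us.

```python
import tensorflow as tf

IMG_SIZE = (299, 299)  # target input size used in our pipeline

def image_to_example(path: str, label: int) -> tf.train.Example:
    """Read an image, resize it to 299x299 and wrap it in a tf.train.Example."""
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img_bytes = tf.io.encode_jpeg(tf.cast(img, tf.uint8)).numpy()
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

# `dataset_index` is a hypothetical list of (image path, integer label) pairs.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for path, label in dataset_index:
        writer.write(image_to_example(path, label).SerializeToString())
```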

Data augmentation

We apply data augmentation only to the training samples, to improve generalization performance and add more variation. We use techniques such as random_flip_left_right, random_brightness and random_saturation, and finally clip_by_value to ensure the images keep values between 0 and 1. An example is shown below, followed by a short code sketch. Note that data augmentation is done only when training the student models.

Examples of different data augmentation methods.
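
For reference, a minimal sketch of this augmentation chain with tf.image is shown below; the exact magnitudes (brightness delta, saturation range) are illustrative assumptions rather than our tuned values.

```python
import tensorflow as tf

def augment(image: tf.Tensor) -> tf.Tensor:
    """Randomly augment a float image with values in [0, 1]."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)  # keep pixel values in [0, 1]

# Applied to the training split only, e.g.:
# train_ds = train_ds.map(lambda x, y: (augment(x), y))
```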

Hyperparameter tuning and optimization

We train our models in two stages. First, to find the optimal hyperparameters for our model, we perform a hyperparameter optimization process. This is done using the hypertuning feature in the DecaVision library, which is inspired by the scikit-optimize library. The function starts by training a model 10 times with random combinations of hyperparameters drawn from a predefined search space (i.e., hidden size, learning rate, learning rate finetune, finetune, etc.). It then uses what it learned from these random combinations to pick 15 better ones. In total, a single model configuration goes through 25 iterations of hyperparameter search to find the best model possible.
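
For illustration, the sketch below expresses this kind of search directly with scikit-optimize; the search space bounds and the `train_and_evaluate` helper are hypothetical, and DecaVision wraps the equivalent logic in its hypertuning function.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Real

# Illustrative search space; names mirror the hyperparameters mentioned above.
space = [
    Categorical([256, 512, 1024], name="hidden_size"),
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(1e-6, 1e-4, prior="log-uniform", name="learning_rate_finetune"),
    Categorical([True, False], name="finetune"),
]

def objective(params):
    # Hypothetical helper that trains a model with these hyperparameters and
    # returns the negative validation accuracy (gp_minimize minimizes).
    return -train_and_evaluate(*params)

# 10 random configurations, then 15 guided by the surrogate model: 25 in total.
result = gp_minimize(objective, space, n_initial_points=10, n_calls=25)
```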

Training with optimal hyperparameters

Second, once we have found the optimal hyperparameters, we train a new model configuration using them. This stage starts by training an extra layer or layers (depending on the hyperparameter optimization) on top of the frozen pretrained model, and then fine-tunes a few blocks of the pretrained model by unfreezing them. In all our hyperparameter search experiments, we found fine-tuning to return better results.
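
A minimal Keras sketch of these two phases is shown below, assuming an EfficientNet-B3 base and hypothetical `train_ds`/`val_ds` datasets; the head architecture and learning rates are illustrative, not our tuned settings.

```python
import tensorflow as tf

num_classes = 18  # e.g., the yoga pose task

# Phase 1: train a new head on top of the frozen pretrained base.
base = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(299, 299, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(512, activation="relu"),  # hidden size from the search
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)

# Phase 2: unfreeze the base and fine-tune with a much smaller learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```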

Implementation details

All experiments are performed using the DecaVision library on Linux workstations with NVIDIA RTX 2080 Ti and 3080 GPU cards. Models pretrained on ImageNet are trained for our tasks using the Adam optimizer to minimize a binary or sparse categorical cross-entropy loss, depending on the task. The learning rate is reduced by a factor of 0.1 once the loss stagnates. Training continues until the validation loss stagnates, using an early stopping mechanism.
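
In Keras terms, this schedule corresponds to callbacks like the following; the patience values here are assumptions for the sketch.

```python
import tensorflow as tf

# Reduce the learning rate by 10x when the validation loss stagnates, and stop
# training early once it stops improving altogether.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
]
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```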

Sport and content classification tasks

Using the setup discussed earlier, we are interested in building models to identify whether a given image displays people practicing sports, and to identify the pose (content) in a picture from a pool of 18 popular yoga poses including bridge, lotus, tree and more. More details about the yoga pose classification task can be found here.

We pick these tasks as they are directly related to the tasks served by the SVAPI endpoints, with the goal of improving them. More details about sport and content classification are available in the SVAPI documentation here (sport, content).

Labeled data details

For easy reference, we name these tasks sportornot and yogapose-18.

The sportornot dataset consists of two classes, with around 14K images of people either practicing or not practicing sport. The yogapose-18 dataset consists of 3K images of 18 different popular yoga poses, including bridge, lotus and tree.

Sample images from the sportornot dataset

The labeled datasets are split into training, validation and test sets, which are kept fixed during the different steps of the process. The test set for the yoga dataset, however, consists only of images from Instagram, which are in general harder to classify and correspond more closely to our use case. A separate set, named test-add, is used as an additional test set only for the sport or not classification task. This is discussed in later sections.

Sample images from the yogapose-18 dataset

Unlabeled data details

We use two sets of unlabeled data from different sources for sportornot and yogapose-18 which consist of over 500K and 120K images respectively. While the labeled data is sourced from Instagram and Google Images, the unlabeled data is sourced from Instagram (#decathlon for sportornot and various hashtags for yoga poses) and Getty Images (for yoga) using appropriate search keywords.

Having a team of 2 people label more than half a million images would have required 3 months of work; using our approach, we obtained a working model in less than 4 weeks.

Results

Sport or not classification

The table below summarizes the classification performance for the sportornot task. The teacher model is an EfficientNet-B3 architecture trained on the labeled dataset. Pseudo labels are then generated for the 500K unlabeled images. This is followed by training an EfficientNet-B5 architecture on the combined labeled and pseudo labeled dataset, with real-time data augmentation as noise. Notice that the accuracy on the test set drops even after training on more than 500K images!

Classification performance on the sport or not classification task

After a thorough inspection, we find that many images in the test set are very ambiguous, and it is hard even for a human to say whether the image is sport or not. An example is a picture of someone standing in front of mountains.

Example image of a person standing with mountains in background (Image source: Emily Polar)

This image could be considered sport if the person went mountain climbing or hiking, or non-sport if the person is just taking a picture with mountains in the background (can you tell which it is?). This motivated us to build another unseen test set containing only unambiguous images, test-add, in order to properly evaluate our models. On test-add, the student model outperforms the teacher by 2%.

Content classification

In the table below, we summarize the results on the yogapose-18 dataset. In this setting, the teacher model is an EfficientNet-B5 architecture trained on the labeled dataset. Similar to the sportornot task, we generate 120K pseudo labels from the unlabeled data and train an EfficientNet-B7 architecture in the same fashion. Here, we notice significant performance improvements compared to the sportornot task: the student model makes better predictions on both the validation and test set samples for individual classes, which results in an improvement by a large margin on the test set.

Classification performance on the yogapose-18 classification task

Adapting student model with transfer learning

From the table below, we can conclude that training student models with transfer learning not only achieves better accuracy, but also reduces training time and computational overhead by a large margin, since far fewer parameters need to be learned. We believe this makes “semi-supervised transfer learning” more practical in industry for quickly developing models using unlabeled data.

Classification performance on the sport or not classification task with and without transfer learning. An up arrow means higher is better, and vice versa.

Things we tried which didn’t work

Throughout this study, we also tried a number of “tricks” that did not work for the teacher or student models. We tried optimizing weighted loss functions by assigning a weight to each class, with the goal of addressing class imbalance. This approach worked for the teacher models but hurt performance when used for the student models, suggesting that data balancing works well for smaller models. A sketch of the class weighting we tried is shown below.
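
For reference, this kind of class weighting can be computed as in the following sketch; `train_labels` is a hypothetical array of integer class labels, and `model`, `train_ds` and `val_ds` are assumed from earlier.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# `train_labels` is a hypothetical array of integer labels for the training set.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(train_labels), y=train_labels)
class_weight = dict(enumerate(weights))

# Keras scales each sample's loss by the weight of its class.
model.fit(train_ds, validation_data=val_ds, class_weight=class_weight)
```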

We also tried iterative training, where the student model becomes the new teacher and generates even finer pseudo labels. This approach did not work in our setting, which could be due to the size of the unlabeled dataset: iterative training was found useful in cases where the unlabeled samples number in the millions [4].

Lastly, we also tried data augmentation while training the teacher models. This led to a drop in performance when generating the pseudo labels and training the student models. This is likely because the teacher sees clean images when generating pseudo labels at inference time, whereas during training it was mostly exposed to augmented images.

Conclusion

We find that SSL methods play a key role in building better image classification models by leveraging large amounts of unlabeled data. These methods, along with unlabeled data, do in fact improve some of our SVAPI models by an acceptable margin. Coupled with transfer learning strategies, they help achieve not only even better results, but also a greatly reduced computational overhead.

In the future, we intend to apply this method to SVAPI endpoint tasks such as identifying the sport practiced in a picture from more than 150 possibilities, as well as implementing these methods for our object detection use cases. Thanks for reading!

About the author

Hasib Zunair is a Master's student at Concordia University working on computer vision applications for low-resource medical image domains. He is also a research intern with Décathlon Canada via the MITACS Accelerate Fellowship, working on improving the existing computer vision models in the SVAPI using semi-supervised learning. See his related publication here.

We are hiring!

Are you interested in computer vision and the application of AI to improve sport accessibility? Luckily for you, we are hiring! Follow https://joinus.decathlon.ca/en/annonces to see the different exciting opportunities.

Let us know if you have any comments or suggestions about the topic of this article, and don't hesitate to share it with your network if you liked it :) If you have any ideas to further improve performance, do reach out!

A special thanks to the members of the AI team at Décathlon Canada for their comments and review, in particular Yan Gobeil, Samuel Mercier and Heri Rokotomalala. Thanks also to Professor A. Ben Hamza from Concordia University for supervision and discussions throughout the project.

References

[1] X. Zhu, “Semi-supervised learning literature survey,” Technical Report, University of Wisconsin-Madison, 2005.

[2] O. Chapelle, B. Schölkopf, and A. Zien, editors, “Semi-supervised learning,” MIT Press, 2006.

[3] M. Tan and Q.V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” In ICML, 2019.

[4] Q. Xie, M.-T. Luong, E. Hovy, and Q.V. Le, “Self-training with noisy student improves ImageNet classification,” arXiv:1911.04252, 2020.

[5] I.Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan, “Billion-scale semi-supervised learning for image classification,” arXiv:1905.00546, 2019.

[6] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E.D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv:2001.07685, 2020.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” In CVPR, 2009.

[8] H. Zunair and A. Ben Hamza, “Melanoma detection using adversarial training and deep transfer learning,” Physics in Medicine & Biology, 2020.
