Automatic image annotation of weather conditions using active learning

Mahesh M Dhananjaya
Published in Analytics Vidhya · Mar 23, 2021

Implement a multi-label weather classification model: day, night, or twilight for visual conditions, and clear, rainy, or snowy for weather conditions.

Implement a "loss prediction module" to select the data to be annotated and trained on.

A large part of the dataset can then be automatically annotated.

Overview:

Figure: Overview of the weather classification model and the auto-labeling criteria.

Data imbalance

Almost all real-world data contains class imbalance, yet the model should perform well across all labels. Below is an example of the dataset used.

The class distribution shows an imbalance in the data, so selecting trainable data for each class is important.

Solution:

  1. Under-sampling of data: An imbalanced class distribution has one or more classes with few examples (the minority classes) and one or more classes with many examples (the majority classes). Under-sampling reduces the data by eliminating examples belonging to the majority classes, with the objective of equalizing the number of examples per class. This reduces the skew from, say, a 1:100 to a 1:10 or 1:2 class distribution.
  2. Class weights: One way to address the imbalance is to assign a weight to each label, giving more importance to minority labels so that all the labels present end up balanced. Class weights using median frequency balancing are computed as w_class = median_freq / freq_class (see the sketch after this list).
  3. Data augmentation: Data augmentation is a technique to increase the diversity of the dataset without collecting any more real data. It also helps prevent the model from overfitting.
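A minimal sketch of the class-weight idea, assuming a simple list of integer labels and PyTorch's CrossEntropyLoss (which accepts a per-class weight tensor); the label counts below are made up for illustration:

```python
import numpy as np
import torch
import torch.nn as nn

def median_frequency_weights(labels, num_classes):
    """Compute per-class weights w_class = median_freq / freq_class."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()                 # relative frequency of each class
    median_freq = np.median(freq[freq > 0])      # median over classes that actually occur
    weights = np.where(freq > 0, median_freq / freq, 0.0)
    return torch.tensor(weights, dtype=torch.float32)

# Hypothetical example: 3 visual-condition classes (day, night, twilight)
labels = np.array([0] * 800 + [1] * 150 + [2] * 50)   # heavily skewed toward "day"
weights = median_frequency_weights(labels, num_classes=3)
criterion = nn.CrossEntropyLoss(weight=weights)        # minority classes now contribute more to the loss
```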

Transfer learning

Initialize the model weights from a model pre-trained on another, similar dataset.
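A minimal sketch of this kind of initialization, assuming a torchvision ResNet-18 backbone pre-trained on ImageNet and two small heads for the two label groups (the head structure and names are illustrative, not from the article):

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
num_features = backbone.fc.in_features
backbone.fc = nn.Identity()                      # reuse the convolutional features only

# Two classification heads: one per label group (assumed structure).
visual_head = nn.Linear(num_features, 3)         # day / night / twilight
weather_head = nn.Linear(num_features, 3)        # clear / rainy / snowy

# Freeze most of the backbone early on; only the heads (and later layers) are trained.
for param in backbone.parameters():
    param.requires_grad = False
```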

Loss prediction module

The loss prediction module is connected to several intermediate layers of the target model; the intermediate features are mapped to a scalar value that serves as the predicted loss. Given an input, the target model outputs a target prediction, and the loss prediction module outputs a predicted loss.
  1. A loss prediction module attached to the target model predicts the loss value from the input, without needing its label.
  2. All data points in the unlabeled pool are evaluated by the loss prediction module. The data points with the top-k predicted losses are annotated and added to the training dataset (a module sketch follows this list).
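A minimal sketch of such a module, following the design described above: each intermediate feature map is globally average-pooled, passed through a small fully connected layer, and the concatenated embeddings are mapped to a single scalar. The feature channel sizes and hidden dimension are assumptions, not values from the article:

```python
import torch
import torch.nn as nn

class LossPredictionModule(nn.Module):
    """Maps intermediate feature maps of the target model to a single predicted loss."""

    def __init__(self, feature_channels=(64, 128, 256, 512), hidden_dim=128):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fcs = nn.ModuleList([nn.Linear(c, hidden_dim) for c in feature_channels])
        self.out = nn.Linear(hidden_dim * len(feature_channels), 1)

    def forward(self, features):
        # `features` is a list of intermediate feature maps from the target model.
        embeddings = []
        for fc, feat in zip(self.fcs, features):
            pooled = self.gap(feat).flatten(1)            # (batch, channels)
            embeddings.append(torch.relu(fc(pooled)))
        return self.out(torch.cat(embeddings, dim=1)).squeeze(1)  # predicted loss per sample
```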

Active learning

Problem:

  • Which data should be labelled and trained on?
  • How can the model's performance be estimated without ground truth?

Solution:

  • The network identifies the images that are difficult to predict.
  • It scans for unique images across the whole database.
  • It identifies best-case and worst-case performance.
  • It gives an approximate error estimate without ground truth.

Loss prediction output:

  • High predicted loss = high error = high likelihood of the prediction being wrong.
  • Low predicted loss = low error = high likelihood of the prediction being right.
  • The top-k samples (highest predicted loss) give the worst-case accuracy.
  • The bottom-k samples (lowest predicted loss) give the best-case accuracy.
  • Choose the top-k images to manually label and retrain (a selection sketch follows this list).
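A minimal sketch of the top-k selection step, assuming the loss prediction module from above, an unlabeled DataLoader that yields (image, sample-id) pairs, and an `extract_features` helper on the target model that returns the intermediate feature maps; all of these names are illustrative:

```python
import torch

@torch.no_grad()
def select_top_k(target_model, loss_module, unlabeled_loader, k=100, device="cuda"):
    """Rank unlabeled images by predicted loss and return the ids of the top-k."""
    target_model.eval()
    loss_module.eval()
    predicted_losses, sample_ids = [], []
    for images, ids in unlabeled_loader:
        features = target_model.extract_features(images.to(device))  # assumed helper/hook
        predicted_losses.append(loss_module(features).cpu())
        sample_ids.append(ids)
    predicted_losses = torch.cat(predicted_losses)
    sample_ids = torch.cat(sample_ids)
    top_k = torch.topk(predicted_losses, k).indices
    return sample_ids[top_k]          # these samples go to the human annotators
```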

Automatic annotation

  • Which data gets labeled automatically can be decided by thresholding the predicted loss itself.
  • A very small amount of very-high-error data is manually labeled for retraining.
  • The majority of low-error data can be auto-labeled with high accuracy.
  • The rest remains unlabeled and is expected to be labeled in further cycles of active learning using very few new samples (a thresholding sketch follows this list).
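A minimal sketch of splitting the pool by predicted loss; the two thresholds are assumptions, since the article does not give concrete values:

```python
import torch

def split_by_predicted_loss(predicted_losses, low_thresh=0.1, high_thresh=1.0):
    """Partition sample indices into auto-label, manual-label, and still-unlabeled sets."""
    auto_label = (predicted_losses < low_thresh).nonzero(as_tuple=True)[0]     # low error: trust the model
    manual_label = (predicted_losses > high_thresh).nonzero(as_tuple=True)[0]  # high error: send to annotators
    in_between = (predicted_losses >= low_thresh) & (predicted_losses <= high_thresh)
    unlabeled = in_between.nonzero(as_tuple=True)[0]                            # revisit in later cycles
    return auto_label, manual_label, unlabeled
```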

Detailed process

  1. Data analysis:
  • Understand the distribution and patterns in the dataset.
  • Find corrupted images / labels.
  • Look for data imbalances and bias.
  • Visualize the outliers and border conditions (see the sketch after this list).
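A minimal sketch of two of these checks (class distribution and corrupted images), assuming a folder of JPEGs plus a CSV of labels; the paths and column names are hypothetical:

```python
from pathlib import Path
import pandas as pd
from PIL import Image

labels = pd.read_csv("labels.csv")               # assumed columns: filename, visual, weather
print(labels["visual"].value_counts())           # class distribution -> reveals imbalance
print(labels["weather"].value_counts())

corrupted = []
for path in Path("images").glob("*.jpg"):
    try:
        with Image.open(path) as img:
            img.verify()                         # raises if the file is truncated or corrupted
    except Exception:
        corrupted.append(path)
print(f"{len(corrupted)} corrupted images found")
```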

2. Set up the end-to-end training/evaluation skeleton

  • Establish a human baseline.
  • Get the model's baselines with fixed parameters.
  • Overfit the model on a small batch to check its capacity.
  • Tune the hyperparameters (an overfitting check is sketched after this list).
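A minimal sketch of the overfitting sanity check: train repeatedly on one small, fixed batch, and the loss should approach zero, otherwise the model or pipeline is likely broken. The model, data, and single-label loss here are placeholders:

```python
import torch
import torch.nn as nn

def overfit_one_batch(model, images, targets, steps=200, lr=1e-3, device="cuda"):
    """Train repeatedly on a single batch; the loss should drop close to zero."""
    model.to(device).train()
    images, targets = images.to(device), targets.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for step in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```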

Steps to use Active learning on an unlabelled dataset:

  • The first step is to manually label a very small sub-sample of the data. This captures the essential differences between the labels.
  • Once there is a small amount of labeled data, the model is trained on it. The model will not be great, but it gives some insight into which areas of the parameter space need to be labeled further.
  • Use transfer learning: in the early stages of training, when there is significantly less data, only a few layers should be trained while the other layers stay frozen with pre-trained weights. The number of trained layers is increased as the amount of trainable data grows.
  • After the model is trained, it is used to predict the classes of the remaining unlabelled data points. A score is computed for each unlabelled data point based on the model's prediction; in our case, the loss prediction module provides it.
  • The data points with high predicted losses are manually labeled and used for retraining.
  • This process is iterated: a new model is trained on the newly labeled dataset, which has been labeled based on the priority score. Once the new model has been trained on this subset of data, the unlabelled data points are run through it again to update the prioritization scores and continue labeling. In this way, the labeling strategy keeps improving as the models get better and better (a loop sketch follows this list).
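Putting the steps together, a compact sketch of one possible cycle. The callbacks (train_fn, score_fn, annotate_fn, auto_label_fn) are hypothetical hooks for the steps described above, labeled_ids and unlabeled_ids are plain lists of sample ids, and score_fn is assumed to return one predicted-loss tensor per unlabeled sample:

```python
import torch

def active_learning_cycle(train_fn, score_fn, annotate_fn, auto_label_fn,
                          labeled_ids, unlabeled_ids, k=100, low_thresh=0.1):
    """One cycle: retrain, score the pool, send hard samples to humans, auto-label easy ones."""
    train_fn(labeled_ids)                                    # retrain on everything labeled so far
    losses = score_fn(unlabeled_ids)                         # predicted loss per unlabeled sample

    hard = torch.topk(losses, min(k, len(losses))).indices   # hardest samples -> human annotators
    labeled_ids = labeled_ids + annotate_fn([unlabeled_ids[i] for i in hard.tolist()])

    easy = (losses < low_thresh).nonzero(as_tuple=True)[0]   # easiest samples -> auto-label
    labeled_ids = labeled_ids + auto_label_fn([unlabeled_ids[i] for i in easy.tolist()])

    handled = set(hard.tolist()) | set(easy.tolist())
    unlabeled_ids = [s for i, s in enumerate(unlabeled_ids) if i not in handled]
    return labeled_ids, unlabeled_ids                         # the rest waits for the next cycle
```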

Check out Andrej Karpathy's recipe for training neural networks: http://karpathy.github.io/2019/04/25/recipe/
