Subsets of ImageNet
ILSVRC 2012, commonly known as ImageNet, is a large-scale dataset for image classification. It contains 1,000 classes, 1.28 million training images, and 50,000 validation images. You can find more information about the dataset at http://image-net.org or at https://www.tensorflow.org/datasets/catalog/imagenet2012.
Why take a subset of ImageNet?
While the dataset is invaluable to the research community, its sheer size puts full-scale experiments out of reach for many researchers. It requires more than 150 GB of storage, and training a ResNet-50 on it takes around 215 hours on a T4 GPU on Google Colab, not to mention that Colab limits each session to 12 hours.
Taking a subset of ImageNet reduces both storage requirements and training time. One can first test ideas (an ablation study) on the subset; if they work well there, bring them to the full ImageNet and hope they still hold. Note that the optimal hyper-parameters on the subset usually differ from those on ImageNet.
A good subset can benefit not only researchers on a low budget, but also researchers with a big budget. Google's EfficientNet was found via Neural Architecture Search, training many small models on ImageNet for just 5 epochs; after finding a good small model, they used "compound scaling" to scale it up. Similarly, Facebook's RegNet was found by manually narrowing down a good architectural design space, training small models for 10 epochs, and then scaling up within that design space. Even 5 epochs on ImageNet may be too time-consuming. If we can find a subset that also differentiates good models from bad models, we can reduce the training time further.
How do we evaluate the quality of the subset?
Take a list of existing models. Train them for 100 epochs on ImageNet, tracking validation accuracy at each epoch, and train them for 50 epochs on the subset. Plot validation accuracy as a function of the number of epochs. If the final performance of model A on ImageNet is higher than that of model B, the same ordering should be clear on the subset. Moreover, find the earliest epoch, on each set, at which the ranking of the models already implies the ranking at the final epoch. The number of images in the training set times that earliest epoch gives the total number of images looped through. If that total is smaller on the subset, the subset has value, since it reduces both training time and training size. On subsets with fewer training images, models need to be trained for more epochs and the hyper-parameters need to be re-tuned. Care also needs to be taken when assembling the list of models: it should be demanding enough to genuinely test the quality of the subset.
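The ranking comparison above can be sketched in code. This is a minimal version, assuming per-epoch validation accuracies are already recorded; Kendall's tau as the agreement measure and the 0.9 cutoff are my own arbitrary choices, not part of any standard procedure:

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two lists of model scores."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def earliest_decisive_epoch(acc_history, final_acc, threshold=0.9):
    """First epoch whose model ranking already agrees with the
    final-epoch ranking; acc_history[e][m] is the validation
    accuracy of model m at epoch e."""
    for epoch, accs in enumerate(acc_history):
        if kendall_tau(accs, final_acc) >= threshold:
            return epoch
    return None
```

Multiply the returned epoch by the number of training images to get the total images looped through, and compare that total between ImageNet and the subset.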
What are some ways of taking subsets of ImageNet?
In order for ideas to transfer well from the subset to the full set, the validation set has to be big enough that accuracy differences are not just due to random chance. A validation size of 10K or larger seems reasonable. Also, the training image size on the subset has to be the same as on ImageNet, usually 224x224. A model that works well on 32x32 images probably does not work as well on 224x224.
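As a rough sanity check on the 10K figure, the binomial approximation gives the standard error of a measured accuracy as a function of validation-set size:

```python
import math

def accuracy_std_error(p, n):
    """Standard error of the measured accuracy when the true accuracy
    is p and the validation set has n images (binomial approximation)."""
    return math.sqrt(p * (1 - p) / n)

# With 10K validation images and a true accuracy of 75%, the standard
# error is about 0.43%, so differences of ~1% are meaningful; with only
# 1K images the standard error grows to about 1.37%.
print(round(accuracy_std_error(0.75, 10_000) * 100, 2))  # ~0.43
print(round(accuracy_std_error(0.75, 1_000) * 100, 2))   # ~1.37
```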
Suppose we want a subset with N classes and M training images in each class.
First, we discuss some methods to find N classes. The 1000 classes on ImageNet are fine classes, and a class that contains several fine classes is a coarse class.
- Randomly sample N classes from the 1000 classes.
- Randomly group the 1000 classes into N coarse classes, each corresponding to around 1000/N fine classes.
- Since the classes of ImageNet are the leaf nodes in the WordNet tree, merge leaf nodes that have a common parent. Repeat the merging many times until you are left with N coarse classes.
- Group easily confused classes together to form N coarse classes. Build a 1000x1000 confusion matrix by running a pre-trained model on the validation set. Define the confusion rate of a set of classes as the sum, over all pairs A and B in the set, of the probability that A and B are confused. Find the subset of 1000/N classes that has the maximum confusion rate among all subsets of that size, remove it, and repeat N times to form N coarse classes.
- Use the coarse labels here https://github.com/noameshed/novelty-detection/blob/master/imagenet_categories.csv and make adjustments to form N coarse classes.
- Find N coarse classes yourself by looking at the names of the 1000 classes: https://github.com/anishathalye/imagenet-simple-labels
In all the methods except the first, the number of images is unchanged; the fine labels are simply grouped into coarse labels.
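The confusion-matrix method needs some care, since finding the maximum-confusion subset exactly is a combinatorial search. A greedy approximation, assuming a confusion matrix is already available (all names here are illustrative):

```python
import numpy as np

def greedy_coarse_classes(confusion, n_groups):
    """Greedily group classes so that classes within a group confuse
    each other the most. confusion[i, j] is the rate at which class i
    is predicted as class j. This greedy version only approximates the
    exact max-confusion-subset search described in the text."""
    C = confusion.shape[0]
    sym = confusion + confusion.T          # symmetric pairwise confusion
    np.fill_diagonal(sym, 0.0)
    group_size = C // n_groups
    remaining = set(range(C))
    groups = []
    for _ in range(n_groups):
        # seed with the most-confused remaining pair
        idx = sorted(remaining)
        sub = sym[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        group = {idx[i], idx[j]}
        remaining -= group
        while len(group) < group_size and remaining:
            # add the class most confused with the current group
            best = max(remaining, key=lambda c: sym[c, list(group)].sum())
            group.add(best)
            remaining.remove(best)
        groups.append(sorted(group))
    return groups
```

If 1000 is not divisible by N, some classes are left over at the end; they can be appended to whichever group they are most confused with.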
After finding the N classes, we can think of ways to sample M training images from each of the N classes.
- Randomly sample M images from the available images for that class.
- Randomly sample half of the images in each class into split A, and put the rest into split B. Train on split A, evaluate on split B, and record the probability assigned to the correct label for each image in B; then swap the splits and repeat. The probability assigned to the correct label indicates how easy the image is, and one can use it to impose a distribution of difficulty, for example to avoid having too many images of the same difficulty. Round the probabilities to the nearest 1%. For each class, find a threshold such that the number of images at each probability (difficulty) level is capped at the threshold, and such that the total number of images is M. This can also be done at the set level instead of per class, but then the number of images in each class will vary.
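The per-class difficulty capping above can be sketched as follows; a binary search finds the smallest per-bucket cap whose total reaches M. The function and variable names are illustrative, not from any library:

```python
import random
from collections import defaultdict

def cap_by_difficulty(image_probs, m, seed=0):
    """Pick up to m images from one class so that no difficulty bucket
    (correct-label probability rounded to the nearest 1%) dominates.
    image_probs maps image id -> probability given to the true label."""
    buckets = defaultdict(list)
    for img, p in image_probs.items():
        buckets[round(p * 100)].append(img)
    rng = random.Random(seed)
    for b in buckets.values():
        rng.shuffle(b)

    def total(cap):
        return sum(min(len(b), cap) for b in buckets.values())

    # binary search for the smallest cap whose total reaches m
    lo, hi = 1, max(len(b) for b in buckets.values())
    while lo < hi:
        mid = (lo + hi) // 2
        if total(mid) >= m:
            hi = mid
        else:
            lo = mid + 1
    cap = lo
    chosen = [img for b in buckets.values() for img in b[:cap]]
    return chosen[:m]
```

If the class has fewer than m images in total, the function simply returns all of them.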
A few configurations that I think will work quite well are:
N=200, M=500, Total=100K
N=100, M=1000, Total=100K
N=20, M=5000, Total=100K
N=20, M=2500, Total=50K
N=10, M=5000, Total=50K. This serves as a direct replacement for CIFAR-10.
N=100, M=500, Total=50K. This serves as a direct replacement for CIFAR-100.
Try out all the methods and configurations to see which one works best.
Object Detection and Semantic Segmentation
If you intend to find a subset of ImageNet that transfers not just to ImageNet, but also to object detection and semantic segmentation tasks like Pascal VOC, COCO, and Cityscapes, a few things need to be noted.
More recently, researchers have realized that a model can achieve SOTA performance on COCO and Cityscapes without pre-training on ImageNet, and some have tried to find good models directly on COCO and Cityscapes without ImageNet's help. However, most models used in object detection and semantic segmentation are still selected because of their performance on ImageNet, so a subset of ImageNet that preserves relative performance on object detection is still important. For example, models like Wide ResNet and EfficientNet were found on ImageNet but are popular in object detection.
The training images are commonly resized to 513x513 or even larger because high resolution increases the granularity of the localization and because some operators such as DeepLab’s ASPP do not work well on small resolutions. Thus, our training image size should be 513x513.
Although ImageNet has 1000 classes, COCO and Cityscapes have fewer than 100. Models found on ImageNet may have too many channels in their convolution layers. It is common for a model to have 2048 channels in the last convolutional layer before average pooling. That makes some sense when it feeds a fully connected layer from 2048 to 1000, but it does not quite make sense to have 2048 channels before rapidly reducing to 19 (Cityscapes has 19 classes). Thus, models that work well on a subset with 100 classes might work even better than the models that work well on ImageNet.
The advice in this section comes from my experience working with Google’s semantic segmentation model DeepLab. Take a look at my PyTorch implementation if interested.
This is my first ever blog post!!
I look forward to bringing more stories in the future.
Thank you.