Domain Apposite Pre-processing to Improve Classification Performance

The closer you look, the less you see? (Taken from Now You See Me) Image credit: olly/Adobe Stock.

With the advent of Deep Learning, we have some of the most sophisticated image classifiers available to us today, offering performance that seemed improbable just a decade ago. Progress on the image classification task has been so tremendous that the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), arguably the flagship image recognition competition, was ended in 2017 because the task was considered largely solved, with 29 of the 38 participating teams achieving greater than 95% accuracy. At GumGum, we develop deep learning models for image classification as part of many of our contextual intelligence and brand safety solutions. One of the products offered on the Verity platform is an intelligent Contextual Recognition Pipeline (CRL) that allows advertisers to target specific themes and scenes of interest in long form content such as movies and TV shows.

A majority of the ILSVRC-winning architectures, and the models developed thereafter, follow the convention of feeding the model an input image of constant size with a 1:1 (square) aspect ratio. This is because most image classification architectures have fully connected layers in their final stages that require a fixed-size input. The popular ones, including AlexNet [1], Inception [2], VGG [3], ResNet [4] and EfficientNet [5], all take inputs with spatial dimensions between 224 and 480 pixels per side and an aspect ratio of 1:1. This was most likely adopted for simplicity: when a classifier has to handle a myriad of aspect ratios, both wide (width > height) and tall (height > width), a 1:1 aspect ratio introduces the least amount of error during resizing and works as a good compromise. Following these conventions, however, may not always be ideal. One of the tasks we try to solve in our CRL is scene categorization for long form content, such as movies and TV shows. Data from this domain almost always has wider aspect ratios (width > height). In this piece, we discuss why these widely adopted rescaling conventions may not give the best results for our scene categorization task, especially during inference, and how we develop models that operate on wider content to do better on scene categorization.

In scene categorization, we try to recognize the context or location of where a frame was shot. This helps us identify popular themes and interesting events in long form content for highly relevant contextual targeting. To investigate whether domain-specific pre-processing actually helps, we conduct experiments on a small five-class subset of our scene categorization dataset and leverage transfer learning to develop an EfficientNet-B0 model [5] pre-trained on ImageNet. The architecture is simple: we use the convolutional layers of the pre-trained EfficientNet [5] as a frozen feature extractor backbone, followed by a series of fully connected dense layers that produce the final classification predictions. We compare the classification performance of the standard rescaling used by most models trained on ImageNet against an approach that could be better suited to our use case. Let's dive right in.

The Norm

Most image classification models follow the convention of feeding the model an input image of constant size with a 1:1 (square) aspect ratio. Every model training scheme includes image resizing and cropping steps in its input pre-processing pipeline, as images are rarely of the exact desired dimensions. The need to conform to a constant size and aspect ratio stems from the fully connected layers at the head of the architecture that produce the final class predictions. Feeding a square input with sides between 224 and 480 pixels to a deep learning classifier was probably adopted for simplicity and efficiency; it also introduces the least amount of error when resizing and works as a good compromise. Taking a center crop proves helpful because the subject of an image is usually located close to the center. Convolutional layers, however, are capable of handling an input with arbitrary spatial dimensions (the channel count excluded), as they are spatially invariant and operate in a sliding window fashion. Some fully convolutional models have been developed for image classification as well, and these have no spatial dimension requirements. We can also place an adaptive average pooling layer between the convolutional and fully connected layers to ensure we pass a constant-sized input to the dense layers. Although this avoids breaking things in the network, it may negatively impact performance when used with pre-trained weights, though it would likely work fine for a model trained from scratch [6].
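As a quick illustration of this point, here is a minimal sketch (not our production code) showing that a convolutional stack accepts different spatial sizes, while an adaptive average pooling layer hands the dense head a fixed-size vector either way. The toy backbone and layer sizes here are made up for the example.

```python
import torch
import torch.nn as nn

# Toy convolutional "backbone": a stand-in for any feature extractor.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
pool = nn.AdaptiveAvgPool2d(1)   # collapses any HxW feature map to 1x1
head = nn.Linear(128, 5)         # dense head expects a fixed 128-D vector

# Both a square crop and a 16:9 frame flow through the same network.
for h, w in [(224, 224), (360, 640)]:
    x = torch.randn(1, 3, h, w)
    features = pool(backbone(x)).flatten(1)   # shape (1, 128) in both cases
    print(head(features).shape)               # torch.Size([1, 5])
```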

Figure 1. Some sample crop windows used in the EfficientNet [5] PyTorch implementation to prepare input for classification shown on a 2.39:1 frame. Image credit: Palm Springs on HULU.
Figure 2. More sample crop windows used in the EfficientNet [5] PyTorch implementation to prepare input for classification shown on a 1.77:1 (HDTV) frame. Image credit: Silicon Valley on HBO.

All popular past winners of the ImageNet challenge, such as AlexNet [1], Inception [2], VGG [3] and ResNet [4], use some form of isotropic rescaling followed by a crop with a 1:1 aspect ratio, which then becomes the input to the neural network. For AlexNet [1], the image is first rescaled so that the shorter side has length 256, and the central 256x256 patch is cropped out; multiple 224x224 crops of this patch are then fed to the network during training. The VGG [3] training methodology also defines a training scale variable, related to the shortest side of the image, that is used to bring the image to the required dimensions. Finally, ResNet [4] pre-processing rescales the input image so that its shorter side lies between 256 and 480, after which a random 224x224 crop is taken as the final input image.
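For reference, this classic recipe can be approximated with torchvision transforms roughly as follows; the exact resize and crop parameters varied from paper to paper, so treat this as a sketch rather than a faithful reproduction of any single training pipeline.

```python
from torchvision import transforms

# Training: isotropic rescale (shorter side -> 256), then a random 224x224 patch.
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Evaluation: same rescale, but a deterministic central square patch.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```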

For the purposes of this discussion, we use an EfficientNet-B0 model [5] pre-trained on ImageNet, specifically the open source PyTorch implementation with pre-trained weights provided by NVIDIA on TorchHub. This implementation uses the following rescaling and cropping methodology: for an image of arbitrary width and height, a random crop is first taken such that the cropped patch covers between 8% and 100% of the original image area and has an aspect ratio (width / height) between 0.75 and 1.33. Some examples of this scheme on movie and TV show frames are shown in Figures 1 and 2. The patch is then resized to 224x224 for the B0 model configuration. During inference, the image is isotropically rescaled so that the shorter side is 224, followed by a 1:1 center crop that yields a 224x224 input image. This strategy works well if we want the model to handle a large variety of images. However, it falls short on samples from our scene categorization dataset, as illustrated in Figures 3 and 4: even the largest possible crop does not adequately capture the information contained in the image.
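In torchvision terms, the training and inference pre-processing described above looks roughly like the sketch below (RandomResizedCrop's default scale and ratio ranges happen to match the numbers quoted above); this is an approximation, not the NVIDIA implementation itself.

```python
from torchvision import transforms

# Training: random patch covering 8-100% of the image area with aspect ratio
# between 3:4 and 4:3, resized to 224x224 (RandomResizedCrop defaults).
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.ToTensor(),
])

# Inference: shorter side -> 224, then a 1:1 center crop.
infer_tf = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```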

Figure 3. Example showing the portion of image used for inference in the EfficientNet [5] PyTorch implementation shown for a 1.77:1 (HDTV) frame. Only the area in the red window is used to make classification predictions. Image credit: Friends on Warner Bros.
Figure 4. Example showing the portion of image used for inference in the EfficientNet [5] PyTorch implementation shown for a 2.39:1 anamorphic widescreen format frame. Only the area in the red window is used to make classification predictions. Image credit: Palm Springs on HULU.

In scene categorization for movies and TV shows, we try to recognize the context or location of where a frame was shot, which helps us identify popular themes and interesting events in long form content for highly relevant contextual targeting. Visual content from these sources usually has a wider aspect ratio: the most popular aspect ratios used to view such content are 1.33:1, 1.77:1, 1.85:1 and 2.39:1, so we see videos with aspect ratios ranging from 1.33 to 2.39. Figure 5 shows a rough distribution of the aspect ratios present in our scene categorization test set. If we apply the same pre-processing strategy used for EfficientNet-B0 [5] training on ImageNet, the best we can do is to resize the shortest side (in this case, the height) to 224 and then take a center crop as the network input. This, however, discards a significant chunk of the information on either side of the center crop, information that is crucial for recognizing scenes or context because it contains many cues about the surrounding environment or location. For instance, the best possible crop we can use for inference in Figure 3 is shown by the red frame, which uses only 56.25% of the image information to make predictions for a 16:9 input. The issue becomes even worse for a 2.39:1 input, as shown in Figure 4, where we use only 41.84% of the available information.
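The retained fraction is easy to verify: after resizing the height to 224, a square center crop keeps height x height out of width x height pixels, i.e. 1 divided by the aspect ratio of the frame.

```python
# Fraction of a wide frame that survives a square center crop after resizing
# the shorter side (the height): height * height / (width * height) = 1 / AR.
for ar in (16 / 9, 2.39):
    print(f"{ar:.2f}:1 frame -> {100 / ar:.2f}% of pixels used")
# 1.78:1 frame -> 56.25% of pixels used
# 2.39:1 frame -> 41.84% of pixels used
```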

Figure 5. Distribution of aspect ratios (AR) present in the Scene Categorization dataset.

The Novelty

The analysis of the open source EfficientNet [5] pre-processing strategy makes it clear that the pre-processing pipeline we use for long form content needs to be improved. An exploratory data analysis of our scene categorization dataset reveals that the majority of the content lies on the wider end of the spectrum, as shown in Figure 5. The distribution has a large peak at 1.77, which corresponds to the HDTV or 16:9 aspect ratio, and a few more clusters around the ultra-wide values of 1.85 and 2.39. Using knowledge of the problem and of cinematography, we can simplify this even further, as we can be confident that our data will almost always conform to one of these popular spatial proportions used to shoot content for film and television.

Taking all this into consideration, we can change the image pre-processing pipeline to rescale images closer to their actual proportions and train a model that operates on images with a wider aspect ratio. A better suited aspect ratio and input size for the network are 1.77 (16:9) and 640x360 respectively. From the model's perspective, this is a relatively easy switch. The convolutional layers of EfficientNet-B0 [5], which act as the feature extractor backbone, now produce an output feature map of shape (BATCH_SIZE, CHANNELS, 12, 20), instead of the (BATCH_SIZE, CHANNELS, 7, 7) map returned for a 224x224 input. The channel count for an EfficientNet-B0 [5] model is 1280 in both cases. A PyTorch AdaptiveAvgPool2d layer makes the backbone output compatible with the dense layers by reducing it to a 1280-D vector, which the fully connected layers use to make the final class predictions. A minimal sketch of this setup is shown below.
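We use torchvision's EfficientNet-B0 here purely for illustration (the layer names differ from the NVIDIA TorchHub build), and the dense head sizes and dropout value are arbitrary choices for the example.

```python
import torch
import torch.nn as nn
from torchvision import models


class SceneClassifier(nn.Module):
    """Frozen EfficientNet-B0 backbone + adaptive pooling + dense head (sketch)."""

    def __init__(self, num_classes: int = 5):
        super().__init__()
        base = models.efficientnet_b0(weights="IMAGENET1K_V1")
        self.backbone = base.features          # conv feature extractor, 1280 output channels
        for p in self.backbone.parameters():   # keep the ImageNet weights frozen
            p.requires_grad = False
        self.pool = nn.AdaptiveAvgPool2d(1)    # any HxW feature map -> 1x1
        self.head = nn.Sequential(             # illustrative dense head
            nn.Linear(1280, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        feats = self.pool(self.backbone(x)).flatten(1)   # (N, 1280) for 224x224 or 360x640 inputs
        return self.head(feats)


model = SceneClassifier()
print(model(torch.randn(2, 3, 360, 640)).shape)   # torch.Size([2, 5])
```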

The logic used for rescaling images is as follows. Images with a width / height ratio greater than or equal to 1.25 (the majority of the dataset) are simply resized to the input image dimensions. The few remaining images with a ratio below 1.25 are slightly squished along the vertical axis to bring the ratio up to 1.25, followed by a center crop with an aspect ratio of 1; these are finally rescaled to the input dimensions and fed to the model. After trying many configurations, we found that these parameter settings strike a good balance and work best overall. Other data augmentation techniques we experimented with during training were RandomHorizontalFlip, RandomRotation and ColorJitter. Finally, input images are normalized using ImageNet statistics before being fed to the model. A sketch of this pipeline is shown below.
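The following is a hedged sketch of this pipeline as we have described it; the interpolation mode, rotation and color jitter magnitudes are illustrative assumptions rather than our exact production settings.

```python
from PIL import Image
from torchvision import transforms

TARGET_W, TARGET_H = 640, 360   # 16:9 model input


def landscape_rescale(img: Image.Image) -> Image.Image:
    w, h = img.size
    if w / h >= 1.25:
        # Wide enough: resize directly to the model input dimensions.
        return img.resize((TARGET_W, TARGET_H))
    # Tall-ish image: squish vertically until the ratio reaches 1.25 ...
    new_h = int(round(w / 1.25))
    img = img.resize((w, new_h))
    # ... take a 1:1 center crop ...
    side = min(w, new_h)
    left, top = (w - side) // 2, (new_h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    # ... and bring the result to the model input dimensions.
    return img.resize((TARGET_W, TARGET_H))


# Training transform: custom rescale, the augmentations listed above, and
# ImageNet normalization (rotation / jitter magnitudes are placeholders).
train_tf = transforms.Compose([
    transforms.Lambda(landscape_rescale),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```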

The Results

To compare the two image rescaling techniques, we train two EfficientNet-B0 [5] based models on the same data. For the first model, called Vanilla, we use the open source PyTorch EfficientNet [5] rescaling method, while for the second, called Landscape, we use the strategy discussed in the previous section. Both models are trained for 25 epochs (they converge quickly on a small subset of data), with most hyperparameters and settings kept identical to make a fair comparison. The models are trained and evaluated on high quality subsets of our scene categorization train and test datasets respectively. This dataset was created by sampling from five popular scene recognition categories: travel categories such as airports and beaches, sports games for entertainment, and lifestyle categories like living room and bedroom environments. The train and test splits have roughly 500 and 50 samples per category respectively.

Table 1. Classification performance of models trained using two different rescaling techniques.

The classification performance of the models trained with the two rescaling techniques is shown in Table 1, which reports the standard quantitative metrics of precision, recall and F1 score, all weighted averages over the five classes in the test set. As the numbers show, the Landscape model trained with the new rescaling method does significantly better on scene categorization than the Vanilla model, with an increase in all three performance metrics.
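For completeness, weighted-average precision, recall and F1 of the kind reported in Table 1 can be computed with scikit-learn as sketched below; the labels and predictions here are made up for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels/predictions over the five scene classes, for illustration only.
y_true = [0, 1, 2, 3, 4, 1, 2, 0, 4, 3]
y_pred = [0, 1, 2, 3, 3, 1, 2, 0, 4, 3]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```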

Conclusion

In this article, we discussed the image pre-processing conventions followed by popular open source image classification models. These conventions have broad applicability and work well for many use cases, but fall short for our application, as shown by our analysis of applying them to long form content for scene categorization. We also discussed how the image pre-processing pipeline can be modified to do better on long form content with wider aspect ratios, such as movies and TV shows, and compared the performance of EfficientNet-B0 [5] models trained with the two pre-processing methods on a high quality sampled dataset. Designing a pre-processing pipeline that better suits the problem proved helpful, as shown by the increase in classification performance. There may still be benefits to keeping some aspects of the standard conventions, for example with respect to regularization during training or training and inference speed, and a hybrid of the two methods discussed here might prove even more helpful, but more experimentation is needed to confirm these notions.

References

[1] https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[2] https://arxiv.org/pdf/1409.4842.pdf
[3] https://arxiv.org/pdf/1409.1556.pdf
[4] https://arxiv.org/pdf/1512.03385.pdf
[5] https://arxiv.org/pdf/1905.11946.pdf
[6] https://stackoverflow.com/questions/57421842/image-size-of-256x256-not-299x299-fed-into-inception-v3-model-pytorch-and-wo

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
