Multi-label image classifier for threat detection with FP16 inference (part 2)
In the previous post we discussed using a multilabel classifier for threat classification to handle cases where the threat in the image is not the salient object. In this post, we discuss exploratory data analysis (EDA) for our multilabel dataset and the results. Since our model needs to analyze millions of images, keeping its inference time low is a top priority for scalability. To that end, we also present results on using mixed precision training for our EfficientNet-based threat classifier. Since this is a proprietary dataset, its specifics (for example, the exact labels) will not be revealed; instead, the labels are referred to by simple numbering from L0 to L9. L0 represents safe content in the image, while L1 to L9 represent the presence of threatening content.
Exploratory Data Analysis
For multilabel, it’s important to look at label co-occurrences. This is in a sense equivalent to looking at class imbalances in a multiclass scenario.
Since this is a multilabel use case, the order of co-occurrences is also a parameter. Order here is defined as the number of classes in each label set. The scikit-multilearn library, together with the NetworkX graph library, lets you construct and analyze label relationships as graphs. This can provide key insights into the dataset and its distribution.
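As a rough illustration of what such an analysis counts, here is a minimal sketch that uses only the Python standard library rather than scikit-multilearn or NetworkX. The label sets are hypothetical toy data, not our dataset. It tallies the order of each label set and how often each pair of labels appears together; the pair counts are the kind of edge weights a label co-occurrence graph would carry.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: one set of labels per image.
label_sets = [
    {"L0"},
    {"L1", "L3"},
    {"L1", "L3"},
    {"L1", "L3", "L7"},
    {"L2"},
]

# Order = number of classes in each label set.
order_counts = Counter(len(labels) for labels in label_sets)

# Pairwise co-occurrence: how often two labels appear on the same image.
pair_counts = Counter()
for labels in label_sets:
    for pair in combinations(sorted(labels), 2):
        pair_counts[pair] += 1

print(dict(order_counts))          # {1: 2, 2: 2, 3: 1}
print(pair_counts.most_common(1))  # [(('L1', 'L3'), 3)]
```

A pair that dominates the counts, or a label that almost never appears alone, is the multilabel analogue of a class imbalance and is worth knowing about before balancing the dataset.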
The model and image preprocessing steps used are the same as mentioned in the first part of this series. You can read more about it here.
At the time of writing, a gold standard dataset for evaluating all our models consistently on one set of data was not yet complete. Because multiclass and multilabel classification require balancing the classes differently, the test datasets used in this post are not identical. However, for the scope of this post, the test datasets have sufficient merit to show that the multilabel approach is more promising than a multiclass classifier. The performance metrics for the 10 main classes (i.e., L0-L9) are shown in Figure 1.
This bar graph compares the performance of the multilabel classifier against the multiclass classifier. Note that the multilabel classifier shows a significant increase in recall for all classes. For an application like threat classification, recall is the most important metric, because failing to detect a threat (marking it as safe) is costlier than marking a safe image as a threat, which results in loss of inventory. Precision, on the other hand, has decreased for certain classes due to imbalanced classes and noisy labels. Another factor that could lead to less-than-ideal precision is that some of these classes are prone to annotator labeling bias. This bias is amplified significantly when the number of samples in a class is very low, which was the case for the L2 and L9 classes.
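For readers who want to reproduce this kind of per-class comparison, the following is a minimal sketch of per-label precision and recall on multi-hot vectors. The function name and the toy data are hypothetical and are not taken from our evaluation code.

```python
def per_label_precision_recall(y_true, y_pred):
    """Return a (precision, recall) tuple for each label column.

    y_true and y_pred are lists of multi-hot rows, e.g. [1, 0, 1].
    """
    n_labels = len(y_true[0])
    results = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        # Guard against labels that were never predicted or never present.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results.append((precision, recall))
    return results


# Hypothetical toy data with two labels and four images.
y_true = [[1, 0], [0, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [1, 1], [0, 1], [1, 0]]
scores = per_label_precision_recall(y_true, y_pred)
```

In this toy case the first label has perfect recall but a precision of 2/3, because it was predicted once on an image where it was absent. That is the trade-off described above: every instance is caught, at the cost of a false alarm.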
In summary this iteration of the threat classifier is much better than the previous version for two reasons:
- The performance improves on key classes and on the key metric, namely recall.
- Multilabel classifier is able to provide the end user with more information about the classes present in the image.
Performance-wise everything looks good, but deployment to production requires taking scalability into account. One way to improve vertical scaling for deep learning models with millions of parameters and thousands of layers is to use 16-bit floating point (FP16) computation.
As easy as that sounds, it comes with a few problems:
- Not all GPUs are capable of performing FP16 computations. More on that in a bit.
- Training with full FP16 computations results in unstable training and therefore it is important to use Mixed Precision training routines.
- Not every layer has a corresponding FP16 implementation available.
All Nvidia GPUs with the Volta or Turing architectures support FP16 computation using tensor cores, which accelerate matrix multiplications. This is available on AWS through G4 instances, which run Tesla T4 GPUs. More information on the T4 is available here.
Figure 2 summarizes mixed precision training. The steps are:
- The model is converted to FP16 so that all its weights are in FP16. A copy of the weights in FP32 (the master weights) is also stored and kept in sync with the model's FP16 weights.
- After the forward pass we obtain a FP16 loss which is converted to FP32 loss because we will be scaling the loss with a large number.
- Loss scaling is applied to prevent gradient underflow, where small gradient values would otherwise round to zero in FP16. If the scaled gradients overflow instead, the loss scale is halved and that gradient update is skipped.
- Using this scaled loss, we perform the backprop and obtain a scaled FP16 gradient which is then converted to FP32.
- We perform descaling and gradient clipping if necessary on this. These FP32 gradients are used to update our FP32 master weights. The updated master weights are then converted to FP16 to get our updated model weights in FP16.
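The steps above can be sketched in plain Python. This is a toy illustration, not Apex's implementation: `to_fp16`, `DynamicLossScaler`, and `mixed_precision_sgd_step` are hypothetical names, Python floats stand in for FP32, and doubling the scale after a run of successful steps (`growth_interval`) is an assumption modeled on common dynamic loss scaling schemes. Gradient clipping is omitted.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; return +/-inf if it is out of range."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval  # assumption, see lead-in
        self._good_steps = 0

    def update(self, overflow: bool) -> None:
        if overflow:
            self.scale /= 2.0  # halve the scale on overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= 2.0


def mixed_precision_sgd_step(master_w, grad_fn, scaler, lr):
    """One SGD step. Returns (new_master_w, new_fp16_w, skipped)."""
    fp16_w = [to_fp16(w) for w in master_w]  # FP16 copy used by the model
    scale = scaler.scale
    # Scaling the loss by `scale` scales every gradient by `scale`;
    # the backward pass then produces FP16 gradients.
    scaled_grads = [to_fp16(g * scale) for g in grad_fn(fp16_w)]
    overflow = any(math.isinf(g) or math.isnan(g) for g in scaled_grads)
    scaler.update(overflow)
    if overflow:
        return master_w, fp16_w, True  # skip this update entirely
    grads = [g / scale for g in scaled_grads]  # unscale in higher precision
    new_master = [w - lr * g for w, g in zip(master_w, grads)]
    return new_master, [to_fp16(w) for w in new_master], False


# A gradient of 1e-8 rounds to zero in FP16 unless the loss is scaled.
scaler = DynamicLossScaler()
weights, fp16_weights, skipped = mixed_precision_sgd_step(
    [1.0], lambda w: [1e-8], scaler, lr=0.1
)
```

Without scaling, the gradient in the usage example would be exactly zero in FP16 and the master weight would never move. With scaling it survives the FP16 backward pass and is only unscaled once it is back in higher precision.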
The reason for keeping FP32 master weights is to accommodate small weight updates. Gradients, and therefore updates, are small compared to the parameters, and FP16 does not have enough precision to register them: an update smaller than about 2⁻¹¹ times the magnitude of the weight is rounded away when added in FP16.
Beyond the master weights, some layers that EfficientNet uses, such as batch normalization, also need their parameter updates in FP32. Too complicated?
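The precision limit can be checked directly. The sketch below uses the standard library's `struct` module to round values to FP16 (`"e"`) and FP32 (`"f"`); the helper name `round_to` is hypothetical. Adding 2⁻¹¹ to a weight of 1.0 is lost in FP16 but preserved in FP32.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the IEEE format given by a struct code: "e"=FP16, "f"=FP32."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


weight = 1.0
update = 2.0 ** -11

# The sum is computed exactly in Python, then rounded to the target format.
fp16_result = round_to("e", round_to("e", weight) + round_to("e", update))
fp32_result = round_to("f", weight + update)

print(fp16_result == weight)  # True: the update vanished in FP16
print(fp32_result - weight)   # 0.00048828125: the update survived in FP32
```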
Thankfully Nvidia has wrapped this up into a simple library for us to use called APEX. This library implements Automatic Mixed Precision training on PyTorch models with ~4 lines of code.
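For reference, a typical Apex AMP integration looks roughly like the sketch below. `amp.initialize` and `amp.scale_loss` are Apex calls; the wrapper functions `setup_amp` and `train_step`, and the choice of `criterion`, are hypothetical and are not our training code. Apex is imported inside the functions so the sketch can be read and loaded without it; actually running it requires PyTorch, Apex, and a CUDA GPU.

```python
def setup_amp(model, optimizer, opt_level="O1"):
    """Wrap an existing model and optimizer for mixed precision training."""
    from apex import amp  # requires NVIDIA Apex to be installed

    return amp.initialize(model, optimizer, opt_level=opt_level)


def train_step(model, optimizer, criterion, images, targets):
    """One training step with Apex-managed loss scaling."""
    from apex import amp

    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    # Apex scales the loss, runs backward, and unscales the gradients.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss.item()
```

Compared with an ordinary PyTorch loop, the only changes are the `amp.initialize` call and the `amp.scale_loss` context manager around `backward()`.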
This brings us to the last but most important part: how effective is mixed precision training for EfficientNet?
In our experiments we used the EfficientNet B4 model. The speedup is summarized as follows:
We observed approximately a 50% speedup on our model and a significant decrease in model size. The two FP16 versions correspond to two of the modes that Apex can operate in; O1 is the preferred and suggested mode for mixed precision training. One caveat is that we run 1 image per batch, which is not ideal; larger batches might yield better speedups. The decrease in model size allows us to load more images per batch, or more models on the same machine, which is another win for FP16 inference.
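To make such comparisons repeatable, a small timing harness helps. The following is a minimal sketch with hypothetical function names, not our benchmarking code. It assumes speedup is defined as the ratio of baseline to candidate latency minus one, which is one common reading of "50% speedup". For GPU models, pass a synchronization callable such as `torch.cuda.synchronize` as `sync`, since CUDA kernels run asynchronously.

```python
import statistics
import time


def median_latency_ms(fn, warmup=10, iters=50, sync=None):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def speedup_percent(baseline_ms, candidate_ms):
    """Speedup of candidate over baseline, as a percentage."""
    return (baseline_ms / candidate_ms - 1.0) * 100.0


# Example: a baseline of 30 ms and a candidate of 20 ms is a 50% speedup.
example = speedup_percent(30.0, 20.0)
```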
There were a few key insights we gained working on FP16 inference for EfficientNet architectures which we hope will help the community in avoiding pitfalls.
- Be sure to check your cuDNN version.
- A cuDNN version < 7.2 requires the number of input and output channels to be a multiple of 8 for tensor core usage to be triggered.
- With a cuDNN version > 7.2, this restriction is lifted on convolution operations.
- Nvidia claims an 8x speedup using FP16 inference in an ideal setup. EfficientNet is far from achieving these impressive numbers. In our experiments, we found that the speedup is limited to about 50% because the EfficientNet implementation relies heavily on 5x5 depthwise convolutions, and unfortunately cuDNN does not have an implementation that accelerates 5x5 depthwise convolutions. The issue is discussed in detail here.
- A very neat way of checking if your model is indeed being accelerated using tensor cores ( using FP16 ) is to use a profiler and check where each kernel is running. This is a sample profiler output for EfficientNet B4 on a Tesla T4:
A value of 1 in the TC column means that a tensor core is being used for that particular kernel operation. To obtain this kernel-specific analysis we used pyprof, which ships with the Apex library.
Another benefit that we observed from using mixed precision training routines is that the performance of our model increased, which we attribute to FP16 acting as a natural regularizer.
The performance increase, albeit marginal, comes as an auxiliary benefit at no additional cost!