Fire Alert System with Multi-Label Classification Model Explained by GradCAM

Natthasit Wongsirikul
5 min read · Oct 20, 2021


This is the 2nd post in the fire alert system series, where I share interesting things I learned. In my previous post, I trained a classification model to predict whether a scene was safe or on fire, then used CAM to visualize which regions of the image were responsible for the prediction. In this post, I will talk about training a multi-label classification model and using Grad-CAM to investigate my model.

What is a multi-label classification model, and how is it different from a multi-class classification model? In multi-label classification, classes can occur at the same time, while in multi-class classification, classes are mutually exclusive. In this case, an image can contain both flame and smoke at the same time, making it a multi-label task.
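To make the distinction concrete, here is a minimal sketch of the two output layers. I use TensorFlow/Keras for all snippets in this post; the framework and every name in the snippets are my own illustrative assumptions, not the original code.

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.5]])  # raw scores for [flame, smoke]

# Multi-class: softmax makes the classes compete; probabilities sum to 1.
print(tf.nn.softmax(logits).numpy())  # ~[[0.62, 0.38]]

# Multi-label: each class gets an independent sigmoid probability,
# so flame and smoke can both be "on" for the same image.
print(tf.nn.sigmoid(logits).numpy())  # ~[[0.88, 0.82]]
```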

In this experiment, I wanted to see whether breaking fire down into two components, flame and smoke, would improve fire detection. The intuition is that flame and smoke look very different: flame often has higher contrast and brightness, with yellow, orange, and red color tones, while smoke has lower contrast and ranges from white to black. I was interested to see whether the classification model would yield a class activation map that resembles this human intuition.

Dataset

I used the open-source fire & smoke dataset with bounding-box labels, from which I could separate flame-only and smoke-only images. The totals were 691 flame-only images, 3,721 smoke-only images, and 4,207 fire (flame & smoke) images, so the dataset was imbalanced.

To alleviate this issue, I needed to add more flame-only images and some smoke-only images. I scraped the internet for these images, and I also wrote a script to synthetically create flame-only and smoke-only images. I first downloaded transparent images (4-channel images with an alpha mask) of flame and smoke; some examples are shown below.

Transparent Smoke Images
Transparent Flame Images

I overlaid these transparent flame & smoke images onto a background image of a scene. The script randomly picked flame or smoke images and performed random resizing, horizontal flipping, shearing, and color jitter before overlaying them at random positions on the background. Each generated image also came with a corresponding label file containing the bounding box of the smoke/flame in YOLO format, as sketched below. Lastly, I also created random object images (pedestrians, plants, rocks, and trash bags) on the same backgrounds, in equal proportion, to serve as counter-examples to flame and smoke.
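The original script isn't shown here, but the compositing step could look like the following sketch, assuming Pillow and RGBA cut-outs that have already been augmented and resized smaller than the background:

```python
import random
from PIL import Image

def overlay(background_path, cutout_path, class_id, out_img, out_label):
    bg = Image.open(background_path).convert("RGB")
    fg = Image.open(cutout_path).convert("RGBA")  # 4th channel = alpha mask

    # Random top-left position that keeps the cut-out fully inside the frame
    # (assumes the cut-out was already resized smaller than the background).
    x = random.randint(0, bg.width - fg.width)
    y = random.randint(0, bg.height - fg.height)
    bg.paste(fg, (x, y), mask=fg)  # the alpha channel drives the blending
    bg.save(out_img)

    # YOLO format: class x_center y_center width height, all normalized.
    xc = (x + fg.width / 2) / bg.width
    yc = (y + fg.height / 2) / bg.height
    with open(out_label, "w") as f:
        f.write(f"{class_id} {xc:.6f} {yc:.6f} "
                f"{fg.width / bg.width:.6f} {fg.height / bg.height:.6f}\n")
```

The pasted rectangle directly gives the normalized YOLO bounding box, so images and labels stay consistent by construction.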

I used synthetic data to fill in the lacking classes and balance the class distribution, which is an important condition for a multi-label classification model. The generation and use of synthetic data has many uses in vision-based supervised learning; here is a post with another example where synthetic data helped improve model performance.

The final compiled dataset used for this experiment is summarized below. No synthetic data were included in the validation set. The dataset was larger and more balanced, and the train-to-validation ratio was 8:2. Each class label is a binary (multi-hot) vector: flame-only = [1, 0], smoke-only = [0, 1], and fire = [1, 1].

Models

In the previous experiment, I set the input image size to 224x224, but in this experiment I increased it to 456x456. The model used here was EfficientNetB5 with ImageNet pre-trained weights. I only allowed the last conv layer and the newly added fully connected layers to be trained. I also changed the final activation function from softmax to sigmoid, making it a multi-label model.
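A sketch of this transfer-learning setup, assuming tf.keras (the `top_conv` layer name comes from tf.keras's EfficientNet implementation; verify against `base.summary()`):

```python
import tensorflow as tf

base = tf.keras.applications.EfficientNetB5(
    include_top=False, weights="imagenet", input_shape=(456, 456, 3))

# Freeze the backbone except the last conv layer.
for layer in base.layers:
    layer.trainable = layer.name.startswith("top_conv")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="sigmoid"),  # [flame, smoke], independent
])
```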

As in the previous experiment, I created a CustomNet, a simple CNN with the traditional Conv, batch-norm, ReLU, and max-pool block stacked together. There were 4 of these blocks before a GAP layer flattened the feature map into a vector. Lastly, I stacked 3 fully connected layers on top. To view the class activation map, I used Grad-CAM instead of CAM, because I added multiple dense layers after the GAP layer to improve model performance, which breaks plain CAM's assumption of a single linear layer after the GAP.
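A sketch of what such a CustomNet could look like; the filter counts and dense-layer sizes are my guesses, not the original configuration:

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # Traditional Conv -> BatchNorm -> ReLU -> MaxPool block.
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D()(x)

inputs = layers.Input(shape=(456, 456, 3))
x = inputs
for filters in (32, 64, 128, 256):  # 4 stacked blocks
    x = conv_block(x, filters)
x = layers.GlobalAveragePooling2D()(x)  # GAP flattens the feature map
for units in (256, 64):                 # extra dense layers after the GAP
    x = layers.Dense(units, activation="relu")(x)
outputs = layers.Dense(2, activation="sigmoid")(x)  # multi-label head
model = models.Model(inputs, outputs)
```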

The loss function used was the macro soft-F1 score, optimized with Adam at an initial learning rate of 1.0E-05.
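The soft-F1 loss replaces hard 0/1 predictions with probabilities, which makes the F1 score differentiable. A common formulation (my sketch, not necessarily the exact one used here) is:

```python
import tensorflow as tf

def macro_soft_f1_loss(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    # "Soft" counts: predicted probabilities stand in for hard decisions.
    tp = tf.reduce_sum(y_pred * y_true, axis=0)
    fp = tf.reduce_sum(y_pred * (1.0 - y_true), axis=0)
    fn = tf.reduce_sum((1.0 - y_pred) * y_true, axis=0)
    soft_f1 = 2.0 * tp / (2.0 * tp + fn + fp + 1e-16)
    return tf.reduce_mean(1.0 - soft_f1)  # macro average over classes

# `model` from the CustomNet sketch above.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss=macro_soft_f1_loss)
```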

Results

The pretrained EfficientNetB5 yielded a validation F1 score of 0.67, while the CustomNet yielded a validation F1 score of 0.85; the results are summarized below. The EfficientNetB5 seemed to converge very quickly, while the CustomNet learned slowly, its validation F1 score increasing gradually.

Using the CustomNet, I generated Grad-CAM heatmaps of both class labels for some images from the validation set.
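For reference, a minimal Grad-CAM sketch for one class, assuming the functional CustomNet above (`conv_layer_name` is the name of its last conv layer, e.g. found via `model.summary()`):

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name):
    # Model that returns both the last conv feature map and the predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis].astype("float32"))
        class_score = preds[:, class_index]

    # Channel weights = gradients of the class score, pooled over space.
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))

    # Weighted sum of feature maps, then ReLU and normalization.
    cam = tf.einsum("bhwc,bc->bhw", conv_out, weights)[0]
    cam = tf.nn.relu(cam)
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```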

In addition, I created heatmap plots on the synthetic dataset used in training. Interestingly, the localization of flame and smoke was more accurate there.

Synthetic dataset used in training

This level of localization precision was likely due to the large number of synthetic images, which contained flame and smoke overlaid onto the same backgrounds. They helped the model learn what flame and smoke look like by providing plentiful counter-examples.

But the moment of truth was evaluating on the test set. The model got an F1 score of only 0.36, a large drop from the validation F1 of 0.85. This again showed that the model didn't generalize to the test set, even though the synthetic dataset used in training contained the test-set backgrounds. The Grad-CAM heatmap for the smoke class showed no localization on the smoke.

Test set heatmap

I was disappointed by the results; I had hoped that the model would be able to identify smoke and flame. This is one pitfall of deep learning: you don't have direct control over what the model will learn.


Natthasit Wongsirikul

I'm a computer vision engineer. My interests span from UAV imaging to AI CCTV applications.