Fire Alert System with Classification Model Explained by Class Activation Map

Natthasit Wongsirikul
5 min read · Oct 19, 2021


This project was based on a client's request for a fire detection system implemented on top of their existing CCTV system. This post is part of a three-part series summarizing all the experimentation done to investigate different approaches to detecting fire, from simple image classification to object detection.

I trained a binary classification model to predict two classes, fire vs. safe, where the fire class includes both flame and smoke. The model was also modified to include class activation mapping (CAM), which visualizes, as a heatmap, the spatial locations in the image that contribute most to the model's decision. The idea is to see whether my classification model was focusing on the regions of the image that we as humans consider to be fire.

Dataset

The dataset used for this task is summarized below. The fire and smoke images came from the Gaiasd dataset, the FIRE-SMOKE-DATASET, and the Scenic dataset. The data was split 9:1 between training and validation. The validation set is made up of ¾ of the FIRE-SMOKE-DATASET as well as some images collected from the internet (~100).

Left: training set; right: validation set

The test set was created in cooperation with the client by lighting a fire within the area the camera can see. An empty metal barrel was filled with paper, set alight, and placed in the scene the camera points at.

Frames containing smoke or flame were sampled from the video and manually labeled, yielding a total of 1,122 images that make up the test set.

Models

I experimented with 4 base models and compared their performance; the details are summarized in the table below. In every model, the final convolutional output passed through a global average pooling layer, which down-samples the final spatial feature map into a single vector before the final fully connected layers. All base models used pre-trained weights from ImageNet, which meant the input images had to be resized to 224 by 224. The base models were frozen; only the final convolution layer and the attached dense layer were trained, for 50 epochs. The CustomNet was a simple classic block of 2D-Conv, Batch-Norm, and MaxPool layers, stacked 3 times before flattening into a single fully connected layer.
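As a rough illustration, the setup looks something like the sketch below. It assumes TensorFlow/Keras (the post doesn't name a framework) and shows the EfficientNetB0 variant; "top_conv" is the name Keras gives EfficientNetB0's final convolution layer.

```python
# Minimal sketch of the training setup, assuming TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.EfficientNetB0(
    include_top=False,           # drop the ImageNet classifier head
    weights="imagenet",
    input_shape=(224, 224, 3),   # pre-trained weights expect 224x224 inputs
)

# Freeze everything in the base except its final convolution layer
for layer in base.layers:
    layer.trainable = layer.name.startswith("top_conv")

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # spatial map -> one value per channel
    layers.Dense(2, activation="softmax"),  # fire vs. safe
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=50)
```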

Lastly, the weights feeding the final classification layer were combined with the model's last convolutional feature maps to produce the class activation map, which shows which areas of the image are most responsible for the class prediction. The CAM heatmap was then overlaid on top of the image.
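For reference, classic CAM needs only those two ingredients: the last conv feature maps and the dense layer's weights for the predicted class. A sketch, continuing the assumptions above ("top_activation" is EfficientNetB0's last activation layer in Keras; the helper and the class ordering are mine, not from the post):

```python
import numpy as np
import tensorflow as tf

def compute_cam(model, base, image, class_idx):
    """image: one preprocessed 224x224x3 array; class_idx: e.g. 1 = fire (assumed)."""
    last_conv = base.get_layer("top_activation")           # final conv feature maps
    extractor = tf.keras.Model(base.input, last_conv.output)
    fmaps = extractor(image[None, ...])[0].numpy()         # (7, 7, C)
    w = model.layers[-1].get_weights()[0][:, class_idx]    # dense weights for class, (C,)
    cam = np.tensordot(fmaps, w, axes=([2], [0]))          # weighted sum over channels
    cam = np.maximum(cam, 0.0)                             # keep positive evidence only
    return cam / (cam.max() + 1e-8)                        # normalize to [0, 1]
```

The small 7x7 map is then up-sampled to the input resolution and blended with the frame (e.g. with cv2.resize and cv2.applyColorMap) to produce the overlay.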

Results

The best performing model on the validation set was EfficientNetB0. It demonstrated some ability to correctly localize fire, but in the majority of cases non-fire pixels were the main trigger of the classification decision. The model's performance on the test set (real-life footage from the client's site) was poor, with a loss of 0.692 and an accuracy of 0.544, showing that the model failed to generalize to footage from the actual site.

The class activation maps showed that the trained model's decision, even when correct, may not come from the region of the image containing the object; in this case, fire pixels were not the main contributors to the classification of fire. This is NOT to say that the model wasn't looking at the fire region at all: it most likely extracted features unique to fire and smoke in its latent space and propagated those features through the model. They just didn't align spatially at the last convolution layer.

Interestingly, I was able to convert the heatmap into bounding boxes by thresholding it into binary blobs and then drawing a rectangle around each one. The result looks like the output of an object detector!

Bounding boxes generated from CAM heatmap
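A minimal version of that conversion, assuming OpenCV and the normalized CAM from the sketch above (the 0.6 threshold is an illustrative choice, not a value from the post):

```python
import cv2
import numpy as np

def cam_to_boxes(cam, frame_shape, thresh=0.6):
    h, w = frame_shape[:2]
    heat = cv2.resize(cam.astype(np.float32), (w, h))   # up-sample CAM to frame size
    mask = (heat >= thresh).astype(np.uint8) * 255      # binarize into blobs
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]      # one (x, y, w, h) per blob

# for x, y, bw, bh in cam_to_boxes(cam, frame.shape):
#     cv2.rectangle(frame, (x, y), (x + bw, y + bh), (0, 0, 255), 2)
```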

Another observation is that fire is often made up of flame and smoke, two things that look very different. In this experiment, I didn't differentiate between the two. However, it is possible for a fire to be burning while only the smoke is visible, so it is worth being able to detect flame and smoke separately.

When testing on images from the customer, the results were not good, because the test set is very different from the training and validation sets. For starters, the size of the fire and smoke relative to the whole image is very small compared to the fire & smoke datasets found online. In addition, down-sampling the images to 224 by 224 loses much of the detail that makes up the texture of the smoke.

In the next post, I'll try multi-label image classification with two classes, smoke and flame. The idea is to create a better model by training it to differentiate between the two.


Natthasit Wongsirikul

I'm a computer vision engineer. My interests span from UAV imaging to AI CCTV applications.