Which Model Is Best For Smoke Detection?

Prathamesh Gadekar
6 min read · Apr 2, 2023


Comparing a range of classification models to pick the best one.

Photo by Chris Karidis on Unsplash

Introduction:

Smoke detectors sense smoke and trigger an alarm to alert people nearby. They are typically found in offices, homes, factories, and similar spaces. Generally, smoke detectors fall into two categories (a sketch of the threshold logic they share follows the list):

  1. Photoelectric Smoke Detector - The device monitors light intensity inside a sensing chamber and raises an alarm when it falls below a set threshold, since smoke and dust particles reduce the light reaching the sensor.
  2. Ionization Smoke Detector - This type contains an electronic circuit that measures the current flowing between charged plates. Smoke and dust particles attach to the ions and impede their movement, so the current decreases; the alarm triggers when the current drops below a certain threshold.
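
Both designs reduce to the same control loop: read a sensor value, compare it against a calibrated threshold, and raise the alarm when the reading drops below it. A minimal sketch of that shared logic (the threshold value and the reading are hypothetical placeholders, not a real device API):

# Shared threshold logic of both detector types (illustrative only)
THRESHOLD = 0.5  # hypothetical calibrated value, set per device

def should_alarm(reading: float) -> bool:
    # Photoelectric: 'reading' is the light intensity, which smoke reduces.
    # Ionization: 'reading' is the circuit current, which smoke reduces.
    return reading < THRESHOLD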

Using the provided dataset, we aim to develop a model that can accurately raise an alarm when smoke is detected. Our objective is to compare several classification models, such as KNN and logistic regression, based on their accuracy, visualize the results, and select the best one.

The data is taken from here.

Importing Required Libraries:

#Importing all essential libraries
import numpy as np
import pandas as pd
import seaborn as sns
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#Importing Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import ExtraTreeClassifier


from sklearn.metrics import accuracy_score
import time

import warnings
warnings.filterwarnings('ignore')

Data Exploration:

Feature Description:

  1. UTC - The time at which the experiment was performed.
  2. Temperature - Temperature of the surroundings, measured in degrees Celsius.
  3. Humidity - The air humidity during the experiment.
  4. TVOC - Total Volatile Organic Compounds, measured in ppb (parts per billion).
  5. eCO2 - CO2-equivalent concentration, measured in ppm (parts per million).
  6. Raw H2 - The amount of raw hydrogen present in the surroundings.
  7. Raw Ethanol - The amount of raw ethanol present in the surroundings.
  8. Pressure - Air pressure, measured in hPa.
  9. PM1.0 - Particulate matter with a diameter of less than 1.0 micrometer.
  10. PM2.5 - Particulate matter with a diameter of less than 2.5 micrometers.
  11. NC0.5 - Concentration of particulate matter with a diameter of less than 0.5 micrometers.
  12. NC1.0 - Concentration of particulate matter with a diameter of less than 1.0 micrometers.
  13. NC2.5 - Concentration of particulate matter with a diameter of less than 2.5 micrometers.
  14. CNT - A simple sample count.
  15. Fire Alarm - The ground truth: 1 if fire was present, 0 otherwise.

data = pd.read_csv('../input/smoke-detection-dataset/smoke_detection_iot.csv', index_col=False)
data.head()
First five rows of the data(Source: Author)
data.shape
data.describe().T.sort_values(ascending=False, by="mean") \
    .style.background_gradient(cmap="BuGn") \
    .bar(subset=["std"], color="red").bar(subset=["mean"], color="blue")
Describing Data (Source: Author)
# Getting all the unique values in each feature
features = data.columns
for feature in features:
    print(f"{feature} ---> {data[feature].nunique()}")
Unique Values for all variables (Source: Author)

Null Value Distribution:

data.isna().sum()
Null Value Count (Source: Author)
msno.matrix(data)
Null Value Visualization (Source: Author)

Data Cleaning:

There are no missing values in the dataset, which allows us to analyze the data and build accurate prediction models without any imputation.

If a dataset does contain missing values, the following resources cover data cleaning (a minimal imputation sketch also follows this list):

  1. Getting Started With Kaggle
  2. Geek for Geeks
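
For completeness, here is a minimal sketch of how missing values could be imputed with scikit-learn's SimpleImputer; it is illustrative only, since this dataset needs no imputation:

from sklearn.impute import SimpleImputer

# Illustrative only: this dataset has no missing values.
# Median imputation is a reasonable default given the outliers noted later.
imputer = SimpleImputer(strategy='median')
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)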

Some features, however, carry no predictive information and can even hamper our model:

  1. UTC - It merely indicates when the experiment was conducted, so it has no bearing on the result.
  2. Unnamed: 0 - It is just the row index.
  3. CNT - It is a running count (effectively another index).

Since these attributes are useless, we will drop them.

del_features = ['Unnamed: 0', 'UTC', 'CNT']
for feature in del_features:
    data = data.drop(feature, axis=1)
data.head()
Deleting Unwanted Features (Source: Author)

⭐ Important Observations:

  • There are a total of 62,360 rows and 16 columns in the data.
  • The data does not contain any missing values.
  • We drop the UTC, Unnamed: 0, and CNT attributes, as they are of no use to us.
  • After these modifications we are left with 13 attributes on which to perform EDA.
  • That gives a total of 810,680 (62,360 × 13) individual values; a quick programmatic check follows.
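
These counts are easy to verify programmatically:

print(data.shape)                     # (62360, 13) after dropping the three columns
print(data.isna().sum().sum())        # 0 -> no missing values
print(data.shape[0] * data.shape[1])  # 810680 individual values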

Exploratory Data Analysis:

Feature Analysis Using the Target Variable:

sns.set_style("whitegrid")
sns.histplot(data['Fire Alarm'])
Histogram of Frequency (Source: Author)
plt.figure(figsize=(6, 6))
sns.kdeplot(data=data, x='TVOC[ppb]')
Probability Density Function (Source: Author)

Heatmap:

plt.figure(figsize=(12, 12))
sns.heatmap(data.corr(), annot=True, cmap='GnBu')
Heatmap (Source: Author)

⭐ Important Observations:

  • Considering a correlation of >= 0.65 as high, we can say that Pressure and Humidity are highly correlated.
  • All the PM and NC features are highly correlated with one another.
  • For TVOC and the PM and NC features, the mean is far larger than the median, which tells us that many outliers are present.
  • These same features are likely very important for classification, because their values differ sharply between the two classes of the target variable; the snippet below quantifies both observations.
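
The following snippet, an illustrative addition rather than part of the original analysis, lists the feature pairs whose absolute correlation is at least 0.65 and each feature's mean-median gap:

# Feature pairs with |correlation| >= 0.65 (upper triangle only, no self-pairs)
corr = data.corr()
pairs = corr.where(corr.abs() >= 0.65).stack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
print(pairs)

# Mean-median gap as a rough per-feature indicator of skew/outliers
print((data.mean() - data.median()).sort_values(ascending=False))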

Modeling:

Data Preprocessing:

X = data.copy()
X.drop('Fire Alarm', axis=1, inplace=True)
y = data['Fire Alarm']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
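
Note that the scaler is fitted on the training split only and then applied to the test split via transform; fitting it on the full dataset would leak information about the test distribution into training.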

Model Implementation:

models = [KNeighborsClassifier(), SGDClassifier(), LogisticRegression(), RandomForestClassifier(),
          GradientBoostingClassifier(), AdaBoostClassifier(), BaggingClassifier(),
          SVC(), GaussianNB(), DummyClassifier(), ExtraTreeClassifier()]

Name = []
Accuracy = []
Time_Taken = []
for model in models:
    Name.append(type(model).__name__)
    begin = time.time()
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    end = time.time()
    accuracyScore = accuracy_score(y_test, prediction)
    Accuracy.append(accuracyScore)
    Time_Taken.append(end - begin)

Dict = {'Name': Name, 'Accuracy': Accuracy, 'Time Taken': Time_Taken}
model_df = pd.DataFrame(Dict)
model_df
Accuracy and Time Taken (Source: Author)
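
A single train/test split can be noisy. As an optional sanity check, not part of the comparison above, k-fold cross-validation averages accuracy over several splits; putting the scaler inside a pipeline keeps each fold's scaling independent:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Optional: 5-fold cross-validated accuracy for one of the models
pipe = make_pipeline(StandardScaler(), ExtraTreeClassifier())
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())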

Accuracy vs Model:

model_df.sort_values(by='Accuracy', ascending=False, inplace=True)
fig = px.line(model_df, x="Name", y="Accuracy", title='Accuracy VS Model')
fig.show()
Accuracy Vs Model (Source: Author)

Time Taken vs Model:

model_df.sort_values(by='Time Taken', ascending=False, inplace=True)
fig = px.line(model_df, x="Name", y="Time Taken", title='Time Taken VS Model')
fig.show()
Time Taken Vs Model (Source: Author)

Conclusion:

From the above analysis, we can see that ExtraTreeClassifier requires the least training and prediction time while also providing the highest accuracy, making it the best model for this task.
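
To put the winning model to use, here is a minimal sketch that retrains it and persists both the scaler and the model with joblib (the filename is an arbitrary choice):

import joblib

# Fit the chosen model and save it together with the scaler for later inference
final_model = ExtraTreeClassifier()
final_model.fit(X_train, y_train)
joblib.dump({'scaler': ss, 'model': final_model}, 'smoke_detector.joblib')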

👋 Greetings!

Thanks for sticking around to the end of the blog! I hope you had a great time!

I cover all kinds of Data Science & AI stuff…. and sometimes Programming.

To have stories sent directly to you, subscribe to my newsletter.
