Classifying Air Pollution Standard Index

Dimas Buntarto
Published in Analytics Vidhya
Aug 9, 2021

Introduction

Air pollution is the presence of one or more physical, chemical or biological substances in the atmosphere in quantities that can harm human, animal and plant health, interfere with aesthetics and comfort, or damage property.

Air pollution can be caused by natural sources as well as human activities. Under some definitions, physical nuisances such as noise, heat, radiation, or light pollution are also considered air pollution. The nature of air means that the impact of air pollution can be direct and local, regional, or global.

In this paper, we will try to classify the air pollution index using several supervised learning classifiers.

Objective

  • Identify a classifier that can predict the category of the Air Pollution Standard Index.

Methods

1. Getting the dataset.
2. Checking the dataset.
3. Visualizing the dataset.
4. Doing feature engineering on the dataset.
5. Comparing classification models.

The classifier models that we will use are:
1. Logistic Regression
2. Linear SVC
3. Decision Tree
4. Random Forest
5. SVC

About the Dataset

We get the dataset from https://data.jakarta.go.id/dataset/indeks-standar-pencemaran-udara-ispu-tahun-2020 and https://data.jakarta.go.id/dataset/indeks-standar-pencemaran-udara-ispu-tahun-2021. I’ve combined the two datasets, and you can find the result at https://github.com/dhiboen/Project/blob/main/ISPU.csv.

This dataset contains the Air Pollution Standard Index measured at 5 air quality monitoring stations in Jakarta Province in 2020 and 2021. Each variable is listed below with its name and a brief description:

1. tanggal : Date of air quality measurement
2. stasiun : Measurement location at the station
3. pm10 : Particulate matter, PM10 (one of the measured parameters)
4. pm25 : Particulate matter, PM2.5 (one of the measured parameters)
5. so2 : Sulfur Dioxide, SO2 (one of the measured parameters)
6. co : Carbon Monoxide (one of the measured parameters)
7. o3 : Ozone (one of the measured parameters)
8. no2 : Nitrogen Dioxide (one of the measured parameters)
9. max : The highest measurement value of all the parameters measured at the same time
10. critical : Parameters with the highest measurement results
11. categori : Categories of air pollution standard index calculation results

In this dataset, the target is the variable `categori`.

Importing Libraries

First, we import the libraries we need.

Getting The Dataset

We load the dataset from https://github.com/dhiboen/Project/blob/main/ISPU.csv.

The dataset has 4,875 samples with 11 variables.
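As a minimal sketch of this loading step, assuming pandas is used: the raw-file URL below is our guess at the download counterpart of the GitHub page linked above, and the two sample rows (including the station names) are made up so the example also runs offline.

```python
import io

import pandas as pd

# Assumption: raw-file counterpart of the GitHub page linked in the article.
RAW_URL = "https://raw.githubusercontent.com/dhiboen/Project/main/ISPU.csv"


def load_ispu(source=RAW_URL):
    """Read the ISPU CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Offline demonstration: two made-up rows in the article's column layout.
sample_csv = io.StringIO(
    "tanggal,stasiun,pm10,pm25,so2,co,o3,no2,max,critical,categori\n"
    "2021-01-01,DKI1,38,---,29,6,31,13,38,PM10,BAIK\n"
    "2021-01-02,DKI2,55,72,30,10,40,15,72,PM25,SEDANG\n"
)
demo = load_ispu(sample_csv)
print(demo.shape)
```

Calling `load_ispu()` without arguments would try to download the full file, after which `df.shape` should report the sample and variable counts mentioned above.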

Checking The Dataset

We first check the condition of the data which includes the types of features and data descriptions.

In the pm25 column, only 1,081 of the 4,875 samples are filled, which is about 22% of the data. Therefore, we will remove the pm25 column.

We will also delete the max and critical columns because they only repeat information about the highest measured value.

Next, let’s look at the target’s condition first.

There are 40 samples that have the target label “TIDAK ADA DATA” which means that there is no data in the sample. Therefore, we will delete the samples which have the target label “TIDAK ADA DATA”.

The dataset now has 4,835 samples.
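The column and row removal described above can be sketched as follows, assuming the data sits in a pandas DataFrame. The helper name `drop_unused` and the toy rows are ours, for illustration only.

```python
import pandas as pd


def drop_unused(df):
    """Drop pm25, max and critical, then remove rows without a target label."""
    out = df.drop(columns=["pm25", "max", "critical"])
    out = out[out["categori"] != "TIDAK ADA DATA"]
    return out.reset_index(drop=True)


# Made-up rows: the last one has no usable target label.
toy = pd.DataFrame({
    "pm10": [38, 55, 0],
    "pm25": [None, 72, None],
    "max": [38, 72, 0],
    "critical": ["PM10", "PM25", None],
    "categori": ["BAIK", "SEDANG", "TIDAK ADA DATA"],
})
cleaned = drop_unused(toy)
print(cleaned)
```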

Let’s look at the condition of the data again.

At first glance, every column appears to be filled. However, if we look at the original data on the website, some empty values are marked with “---”.

We will deal with these missing values first.

Handling Missing Value

We first replace “---” with NaN, which means Not a Number.

We look once again at the measured parameters. These columns have the object data type, so we will first convert them to float or integer.

We can see that the measured parameters column has missing values.

We will fill in the missing values of each measured parameter with the average of that parameter within the same stasiun group.

Now, all samples are filled.
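One possible way to express these three steps (placeholder to NaN, numeric conversion, per-station mean fill) with pandas is sketched below. The helper name `fill_by_station`, the station labels "A" and "B", and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

PARAMS = ["pm10", "so2", "co", "o3", "no2"]


def fill_by_station(df, params=PARAMS, placeholder="---"):
    """Replace the placeholder with NaN, convert to numbers, fill by station mean."""
    out = df.copy()
    for col in params:
        # Placeholder -> NaN, then object/string column -> float.
        out[col] = pd.to_numeric(out[col].replace(placeholder, np.nan), errors="coerce")
    # Mean of each parameter within each station, aligned row by row.
    station_means = out.groupby("stasiun")[params].transform("mean")
    out[params] = out[params].fillna(station_means)
    return out


# Made-up example with two stations and two parameters.
toy = pd.DataFrame({
    "stasiun": ["A", "A", "A", "B", "B"],
    "pm10": ["10", "---", "30", "50", "---"],
    "so2": ["1", "3", "---", "---", "7"],
})
filled = fill_by_station(toy, params=["pm10", "so2"])
print(filled)
```

In the toy frame, the gap in pm10 at station A is filled with 20.0, the mean of 10 and 30 at that station, rather than with the overall column mean.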

Data Preprocessing and Visualization

Splitting The Dataset

We first look at the distribution of the target.

From the target distribution, we know that we are dealing with an imbalanced dataset. Therefore, we will perform a stratified split of the data.

But first we delete the tanggal and stasiun columns, because we only need the measured parameters.
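A stratified split along these lines could look like the sketch below, assuming scikit-learn's `train_test_split`. The test fraction of 0.2 and the seed are our assumptions, since the article does not state them, and the toy frame is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, target="categori", test_size=0.2, seed=42):
    """Drop tanggal and stasiun, then split while keeping class proportions."""
    X = df.drop(columns=["tanggal", "stasiun", target])
    y = df[target]
    # stratify=y keeps the share of each label the same in both parts.
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Made-up imbalanced data: 15 SEDANG rows and 5 BAIK rows.
toy = pd.DataFrame({
    "tanggal": ["2021-01-01"] * 20,
    "stasiun": ["A", "B"] * 10,
    "pm10": list(range(20)),
    "so2": list(range(20, 40)),
    "categori": ["SEDANG"] * 15 + ["BAIK"] * 5,
})
X_train, X_test, y_train, y_test = stratified_split(toy)
print(y_test.value_counts())
```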

Data Visualization

We will visualize the data using a pairplot and principal component analysis (PCA) to help us estimate a suitable model.

Next, we will use principal component analysis (PCA) to visualize the data.

We first separate the data into X_train, X_test, y_train, and y_test.

Next, we visualize the training data using PCA.
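As a rough sketch of the projection behind such a plot, assuming scikit-learn's `PCA` and assuming the features are standardized first (the article does not say whether they were): the random matrix below only stands in for the five measured parameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(X):
    """Standardize the features and project them onto two principal components."""
    scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    coords = pca.fit_transform(scaled)
    return coords, pca.explained_variance_ratio_


# Random stand-in for X_train: 50 samples, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
coords, ratio = project_2d(X_demo)
print(coords.shape, ratio)
```

The two columns of `coords` can then be drawn as a scatter plot colored by the target label, for example with matplotlib or seaborn.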

From the visualization, we expect that linear classifiers will not work optimally on this data.

Modeling

Next, we will model the data. First, we encode the target.

Encoding Target

There are 4 types of labels on the target, namely “BAIK” (good), “SEDANG” (moderate), “TIDAK SEHAT” (unhealthy), and “SANGAT TIDAK SEHAT” (very unhealthy). Ordered from best to worst air quality, the labels are “BAIK” > “SEDANG” > “TIDAK SEHAT” > “SANGAT TIDAK SEHAT”, so the label codes are as follows:

  • BAIK = 3
  • SEDANG = 2
  • TIDAK SEHAT = 1
  • SANGAT TIDAK SEHAT = 0
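The label codes listed above can be applied with a plain dictionary mapping, as in this sketch. The helper name `encode_target` is ours, and the article may have used a different encoder.

```python
import pandas as pd

# Codes exactly as listed in the article.
LABEL_CODES = {
    "BAIK": 3,
    "SEDANG": 2,
    "TIDAK SEHAT": 1,
    "SANGAT TIDAK SEHAT": 0,
}


def encode_target(labels):
    """Map the text labels to their integer codes and fail on unknown labels."""
    codes = labels.map(LABEL_CODES)
    if codes.isna().any():
        raise ValueError("unexpected label in target column")
    return codes.astype(int)


encoded = encode_target(
    pd.Series(["BAIK", "SEDANG", "TIDAK SEHAT", "SANGAT TIDAK SEHAT"])
)
print(encoded.tolist())
```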

We split the data again into X_train, X_test, y_train, and y_test.

Let’s look again at the label conditions on y_train and y_test.

First modeling

For the first modeling, we will use the logistic regression, linear SVC, and SVC models. We group them together because these models benefit from feature scaling, which speeds up convergence.
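A sketch of this first group, assuming scikit-learn: each model is wrapped in a pipeline so that the scaler is fitted only on the data passed to `fit`. The `max_iter` values, the seed, and the synthetic data are assumptions, not settings from the article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC


def build_scaled_models(seed=42):
    """Return the three scale-sensitive models, each behind a StandardScaler."""
    return {
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000, random_state=seed)
        ),
        "Linear SVC": make_pipeline(
            StandardScaler(), LinearSVC(max_iter=10000, random_state=seed)
        ),
        "SVC": make_pipeline(StandardScaler(), SVC(random_state=seed)),
    }


# Synthetic stand-in for the five measured parameters.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
scaled_models = build_scaled_models()
for name, model in scaled_models.items():
    model.fit(X_demo, y_demo)
    print(name, round(model.score(X_demo, y_demo), 3))
```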

Second Modeling

For the second modeling, we will use Decision Tree and Random Forest. Neither model requires feature scaling.
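For comparison, here is a sketch of the tree-based group, assuming scikit-learn. The models are fitted directly on unscaled features, using default hyperparameters and synthetic data for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def build_tree_models(seed=42):
    """Return the two tree-based models; no scaler is placed in front of them."""
    return {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }


# Synthetic stand-in for the five measured parameters, left unscaled.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
tree_models = build_tree_models()
for name, model in tree_models.items():
    model.fit(X_demo, y_demo)
    print(name, round(model.score(X_demo, y_demo), 3))
```

Tree splits compare one feature against a threshold at a time, so rescaling a feature changes the threshold values but not the order of the samples, which is why scaling can be skipped here.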

Result and Discussion

From the models we use, we get the following confusion matrices:

Logistic Regression Confusion Matrix
Linear SVC Confusion Matrix
SVC Confusion Matrix
Decision Tree Confusion Matrix
Random Forest Confusion Matrix
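A confusion matrix of this kind can be computed with scikit-learn's `confusion_matrix`, as in the sketch below. The true and predicted labels are made up and are not the article's results.

```python
from sklearn.metrics import confusion_matrix

# Made-up encoded labels (0 to 3) for eight test samples.
y_true = [0, 1, 1, 2, 2, 2, 3, 3]
y_pred = [0, 1, 2, 2, 2, 1, 3, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)
```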

Class 0 (SANGAT TIDAK SEHAT)

Based on the confusion matrix, the class 0 conditions can be summarized in the following table:

Precision, Recall, Specificity, and F1-Score can be summarized in the following table:

Precision for class 0 is 1.00 in all models. Recall is also 1.00 for all models except logistic regression, where it is 0.50. This is because the other four models identify class 0 perfectly, while logistic regression has 1 False Negative. The number of False Positives is zero for all models, so the True Negative Rate is 1.00 for all of them.

The F1 score for class 0 is 1.00 for those four models and 0.67 for logistic regression.
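To make the link between these numbers explicit, the sketch below derives one-vs-rest counts for a class from a confusion matrix and computes the four metrics. The class 0 row and column follow the logistic regression figures quoted above (1 True Positive, 1 False Negative, 0 False Positives); all other cells of the matrix are hypothetical, because the original matrices are not reproduced in the text.

```python
import numpy as np


def one_vs_rest_counts(cm, k):
    """Return (tp, fp, fn, tn) for class k; rows are true, columns predicted."""
    cm = np.asarray(cm)
    tp = int(cm[k, k])
    fn = int(cm[k].sum()) - tp
    fp = int(cm[:, k].sum()) - tp
    tn = int(cm.sum()) - tp - fn - fp
    return tp, fp, fn, tn


def class_metrics(tp, fp, fn, tn):
    """Precision, recall (TPR), specificity (TNR) and F1 from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical matrix; only the class 0 row and column reflect the article.
cm_demo = [
    [1, 1, 0, 0],
    [0, 20, 5, 0],
    [0, 4, 600, 10],
    [0, 0, 8, 300],
]
counts = one_vs_rest_counts(cm_demo, 0)
metrics = class_metrics(*counts)
print(counts, {name: round(value, 2) for name, value in metrics.items()})
```

With 1 True Positive and 1 False Negative, recall is 1 / (1 + 1) = 0.50, and with precision 1.00 the F1 score is 2 × 1.00 × 0.50 / 1.50 ≈ 0.67, which matches the values reported for logistic regression.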

Class 1 (TIDAK SEHAT)

Based on the confusion matrix, the class 1 conditions can be summarized in the following table:

Precision, Recall, Specificity, and F1-Score can be summarized in the following table:

The highest precision and True Negative Rate were obtained by the Linear SVC model. The highest True Positive Rate and F1 score were obtained by the Random Forest model.

Class 2 (SEDANG)

Based on the confusion matrix, the class 2 conditions can be summarized in the following table:

Precision, Recall, Specificity, and F1-Score can be summarized in the following table:

The Random Forest model classifies class 2 very well: it has the highest precision, true positive rate, true negative rate, and F1 score of all the models.

Class 3 (BAIK)

Based on the confusion matrix, the class 3 conditions can be summarized in the following table:

Precision, Recall, Specificity, and F1-Score can be summarized in the following table:

The Random Forest model classifies class 3 very well: it has the highest precision, true positive rate, true negative rate, and F1 score of all the models.

Based on the results above, in this paper, we choose the Random Forest model to classify the standard index of air pollution. The Random Forest model classifies all classes well, especially when viewed from the F1 score: in each class, it has the highest score. In class 0, its F1 score is tied with three other models, but that shared score is still the highest one obtained.

Based on these data, the accuracy of the Random Forest model is also the highest among the models. The accuracy value of the Random Forest model is 0.96. The accuracy of the models used in this paper is presented in the following table:

Conclusion

Based on the results above, the following conclusions can be drawn:

  1. The best model in this paper that can be used to classify the standard index of air pollution is the Random Forest model.
  2. The accuracy value of the Random Forest model is 0.96, with the following parameters:
  • criterion=entropy
  • max_depth=9
  • max_features=5
  • n_estimators=200
  • random_state=42
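The reported parameter names match scikit-learn's `RandomForestClassifier`, so, assuming that is the implementation used, the final model could be configured as in this sketch. The synthetic data only stands in for the five measured parameters and the four classes, and it will not reproduce the reported accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def build_reported_forest():
    """Random Forest configured with the hyperparameters listed in the article."""
    return RandomForestClassifier(
        criterion="entropy",
        max_depth=9,
        max_features=5,
        n_estimators=200,
        random_state=42,
    )


# Synthetic stand-in: 5 features (so max_features=5 is valid) and 4 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
forest = build_reported_forest().fit(X_demo, y_demo)
print(len(forest.estimators_), forest.n_features_in_)
```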

Further Analysis

If we look at the target composition of the dataset, we can see that the dataset is imbalanced. Therefore, resampling the dataset using over-sampling, under-sampling, or other methods should be considered when classifying this dataset.
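As one simple illustration of over-sampling, the sketch below duplicates minority-class rows at random until every class is as large as the biggest one, using scikit-learn's `resample` utility. This is only one of several options and was not part of the experiments above. The helper name `oversample_minorities` and the toy data are ours, and in practice such resampling should be applied to the training split only.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minorities(df, target="categori", seed=42):
    """Randomly duplicate rows of smaller classes up to the largest class size."""
    largest = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        if len(group) < largest:
            # Sample with replacement until the group matches the largest class.
            group = resample(group, replace=True, n_samples=largest, random_state=seed)
        parts.append(group)
    # Shuffle so that the classes are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up imbalanced training data: 6 SEDANG rows and 2 BAIK rows.
toy = pd.DataFrame({
    "pm10": [50, 52, 55, 60, 61, 70, 20, 25],
    "categori": ["SEDANG"] * 6 + ["BAIK"] * 2,
})
balanced = oversample_minorities(toy)
print(balanced["categori"].value_counts())
```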
