Classifying Air Pollution Standard Index
Introduction
Air pollution is the presence of one or more physical, chemical or biological substances in the atmosphere in quantities that can harm human, animal and plant health, interfere with aesthetics and comfort, or damage property.
Air pollution can be caused by natural sources as well as human activities. Under some definitions, physical disturbances such as noise, heat, radiation, or light pollution are also considered air pollution. Because air moves freely, the impact of air pollution can be direct and local, regional, or global.
In this paper, we will try to classify the air pollution index using several supervised learning classifiers.
Objective
- Identify the classifier that best classifies the air pollution standard index.
Methods
1. Getting the dataset.
2. Checking the dataset.
3. Visualizing the dataset.
4. Performing feature engineering on the dataset.
5. Comparing classification models.
The classifier models that we will use are:
1. Logistic Regression
2. Linear SVC
3. Decision Tree
4. Random Forest
5. SVC
About the Dataset
We obtained the data from https://data.jakarta.go.id/dataset/indeks-standar-pencemaran-udara-ispu-tahun-2020 and https://data.jakarta.go.id/dataset/indeks-standar-pencemaran-udara-ispu-tahun-2021. I have combined the two datasets, and you can find the combined file at https://github.com/dhiboen/Project/blob/main/ISPU.csv.
Each attribute is listed below with its name and a brief description.
This dataset contains the Air Pollution Standard Index measured at 5 air quality monitoring stations in Jakarta Province in 2020 and 2021. The variables in the dataset are as follows:
1. tanggal : Date of air quality measurement
2. stasiun : Measurement location at the station
3. pm10 : Particulate matter up to 10 micrometers in diameter (one of the measured parameters)
4. pm25 : Particulate matter up to 2.5 micrometers in diameter (one of the measured parameters)
5. so2 : Sulfur dioxide, SO2 (one of the measured parameters)
6. co : Carbon Monoxide (one of the measured parameters)
7. o3 : Ozone (one of the measured parameters)
8. no2 : Nitrogen dioxide (one of the measured parameters)
9. max : The highest measurement value of all the parameters measured at the same time
10. critical : The parameter with the highest measurement result
11. categori : Categories of air pollution standard index calculation results
In this dataset, the target is the variable `categori`.
Importing Libraries
We first import the libraries we need.
Getting The Dataset
We load the dataset from https://github.com/dhiboen/Project/blob/main/ISPU.csv.
The dataset has 4,875 samples with 11 variables.
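A minimal sketch of the loading step, assuming pandas. The helper name `load_ispu` is hypothetical, and the small inline CSV (with made-up station names and values) stands in for the real file. Note that `pd.read_csv` would need the raw-file address of the CSV rather than the GitHub `blob` page linked above.

```python
import io

import pandas as pd


def load_ispu(source):
    """Read the ISPU CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)


# Tiny stand-in for ISPU.csv (illustrative values, not real measurements).
sample_csv = io.StringIO(
    "tanggal,stasiun,pm10,pm25,so2,co,o3,no2,max,critical,categori\n"
    "2021-01-01,DKI1,38,,29,6,31,13,38,PM10,BAIK\n"
    "2021-01-02,DKI2,56,70,30,12,40,20,70,PM25,SEDANG\n"
)
df = load_ispu(sample_csv)
print(df.shape)  # (rows, columns); the real file should report 11 columns
```

On the real file, `df.shape` is where the 4,875 samples and 11 variables would show up.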
Checking The Dataset
We first check the condition of the data which includes the types of features and data descriptions.
In the pm25 column, only 1,081 of the 4,875 samples are filled, which is only about 22% of the data. Therefore, we will remove the pm25 column.
We will also delete the max and critical columns because they only contain information about the highest measured value.
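The column removal described above could look roughly like this. It is a sketch on a toy frame with made-up values, not the original notebook code.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are illustrative).
df = pd.DataFrame({
    "pm10": [38, 56],
    "pm25": [None, 70],
    "so2": [29, 30],
    "max": [38, 70],
    "critical": ["PM10", "PM25"],
    "categori": ["BAIK", "SEDANG"],
})

# Share of non-missing values per column; on the real data pm25 is about 0.22.
filled_ratio = df.notna().mean()
print(filled_ratio)

# Drop the sparse pm25 column and the two derived columns.
df = df.drop(columns=["pm25", "max", "critical"])
print(list(df.columns))
```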
Next, let’s look at the target’s condition first.
There are 40 samples that have the target label “TIDAK ADA DATA” which means that there is no data in the sample. Therefore, we will delete the samples which have the target label “TIDAK ADA DATA”.
The dataset now has 4,835 samples.
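Filtering out the unlabeled samples can be sketched as follows, again on a toy frame with illustrative values.

```python
import pandas as pd

df = pd.DataFrame({
    "pm10": [38, 12, 56],
    "categori": ["BAIK", "TIDAK ADA DATA", "SEDANG"],
})

# Inspect the label counts before filtering.
print(df["categori"].value_counts())

# Keep only rows that carry a real category label.
df = df[df["categori"] != "TIDAK ADA DATA"].reset_index(drop=True)
print(len(df))
```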
Let’s look at the condition of the data again.
We can see that every column appears to be filled. However, if we look at the original data on the website, some empty entries are marked with “---”.
We will deal with these missing values first.
Handling Missing Value
We first replace “---” with NaN (Not a Number).
Let’s look once more at the columns for the measured parameters. Their data type is object, so we will convert them to float or integer.
We can now see that the measured-parameter columns have missing values.
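A sketch of the replacement and type conversion, assuming the placeholder in the file is exactly the string `"---"` (if it carries surrounding spaces, it would need to be stripped first). The toy values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pm10": ["38", "---", "56"],
    "so2": ["29", "30", "---"],
})
param_cols = ["pm10", "so2"]

for col in param_cols:
    # Turn the placeholder into NaN, then convert the strings to numbers.
    df[col] = pd.to_numeric(df[col].replace("---", np.nan))

# Missing values are now visible to pandas.
missing_counts = df[param_cols].isna().sum()
print(df.dtypes)
print(missing_counts)
```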
We will fill in the missing values based on the stasiun column: each missing value is replaced by the average of that parameter at the same stasiun.
Now, all samples are filled.
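One way to fill each gap with the per-station mean is a grouped `transform`. This is a sketch on toy data; the station names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "stasiun": ["DKI1", "DKI1", "DKI1", "DKI2", "DKI2"],
    "pm10": [40.0, np.nan, 60.0, 20.0, np.nan],
})
param_cols = ["pm10"]

for col in param_cols:
    # Mean of this parameter within each station, aligned row by row.
    station_mean = df.groupby("stasiun")[col].transform("mean")
    df[col] = df[col].fillna(station_mean)

print(df)
```

Here the missing DKI1 value becomes 50.0 (the mean of 40 and 60) and the missing DKI2 value becomes 20.0.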
Data Preprocessing and Visualization
Splitting The Dataset
We first look at the target’s condition.
From the target distribution, we can see that the dataset is imbalanced. Therefore, we will perform a stratified split of the data.
But first we delete the tanggal and stasiun columns, because we only need the measured parameters.
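A stratified split can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The `test_size=0.2` and `random_state=42` values are assumptions for illustration, and the frame is toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 15 SEDANG rows and 5 BAIK rows.
df = pd.DataFrame({
    "tanggal": ["2021-01-01"] * 20,
    "stasiun": ["DKI1"] * 20,
    "pm10": range(20),
    "so2": range(20, 40),
    "categori": ["SEDANG"] * 15 + ["BAIK"] * 5,
})

# Keep only the measured parameters and the target.
df = df.drop(columns=["tanggal", "stasiun"])
X = df.drop(columns="categori")
y = df["categori"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.value_counts())
print(y_test.value_counts())
```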
Data Visualization
We will visualize the data using a pair plot and principal component analysis (PCA) to help choose a suitable model.
Next, we will use PCA to visualize the data.
We first separate the data into X_train, X_test, y_train, and y_test.
Then we project the data with PCA and plot it.
From the visualization, we expect that linear classifiers will not perform optimally on this data.
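A two-component PCA projection like the one discussed here could be computed as below. Standardizing the features first is an assumption on our part, and the random matrix is only a stand-in for the five measured parameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))  # synthetic stand-in for the training features

# PCA is scale-sensitive, so standardize before projecting (an assumption here).
X_scaled = StandardScaler().fit_transform(X_train)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be drawn as a scatter plot colored by class label to judge whether the classes look linearly separable.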
Modeling
Next, we model the data. First, we encode the target.
Encoding Target
There are 4 types of labels on the target, namely “BAIK” (good), “SEDANG” (moderate), “TIDAK SEHAT” (unhealthy), and “SANGAT TIDAK SEHAT” (very unhealthy). Ordered from best to worst air quality, the labels are “BAIK” > “SEDANG” > “TIDAK SEHAT” > “SANGAT TIDAK SEHAT”, so the label codes are as follows:
- BAIK = 3
- SEDANG = 2
- TIDAK SEHAT = 1
- SANGAT TIDAK SEHAT = 0
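The label codes above can be applied with a plain dictionary mapping. This is a sketch; the variable names are hypothetical.

```python
import pandas as pd

# Ordinal codes taken from the list above.
label_codes = {
    "BAIK": 3,
    "SEDANG": 2,
    "TIDAK SEHAT": 1,
    "SANGAT TIDAK SEHAT": 0,
}

y = pd.Series(["SEDANG", "BAIK", "TIDAK SEHAT", "SANGAT TIDAK SEHAT", "SEDANG"])
y_encoded = y.map(label_codes)
print(y_encoded.tolist())  # [2, 3, 1, 0, 2]
```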
We split the data again into X_train, X_test, y_train, and y_test.
Let’s look again at the label conditions on y_train and y_test.
First modeling
For the first round of modeling, we will use the logistic regression, linear SVC, and SVC models. We group them together because these models benefit from feature scaling, which accelerates convergence.
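A sketch of how the three scaled models might be fitted, with a `StandardScaler` in front of each classifier. The synthetic data, the default hyperparameters, and the `max_iter` values are assumptions, not the settings used for the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in: 5 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVC": LinearSVC(max_iter=10000),
    "SVC": SVC(),
}

scores = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, inside the pipeline.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)

print(scores)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.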
Second Modeling
For the second round of modeling, we will use Decision Tree and Random Forest. Neither model requires feature scaling.
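The tree-based models can be fitted directly on unscaled features, as in this sketch. The data is synthetic, with one feature deliberately put on a much larger scale, and the models use default hyperparameters rather than the tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 5 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, random_state=42,
)
X[:, 0] *= 1000  # one feature on a very different scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

tree_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

tree_scores = {}
for name, model in tree_models.items():
    # No scaler: tree splits compare one feature against a threshold at a time.
    model.fit(X_train, y_train)
    tree_scores[name] = model.score(X_test, y_test)

print(tree_scores)
```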
Result and Discussion
From the models we used, we get the following confusion matrices.
Class 0 (SANGAT TIDAK SEHAT)
Based on the confusion matrix, the class 0 conditions can be summarized in the following table:
Precision, Recall, Specificity, and F1-Score can be summarized in the following table:
Precision for class 0 is 1.00 in all models. Recall is also 1.00 for every model except logistic regression, where it is 0.50. This is because four of the models identify class 0 perfectly, while logistic regression has 1 false negative. All models have zero false positives, so the true negative rate (specificity) is 1.00 for all of them.
The F1 score for class 0 is 1.00 for four of the models and 0.67 for logistic regression.
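The per-class metrics follow directly from the confusion-matrix counts. The sketch below recomputes the logistic regression figures for class 0: a recall of 0.50 with one false negative implies one true positive. The true-negative count of 100 is a hypothetical placeholder, since specificity is 1.00 for any positive count when there are no false positives.

```python
def class_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, and F1 from one-vs-rest counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Logistic regression, class 0: tp=1, fn=1, fp=0; tn=100 is a placeholder.
metrics = class_metrics(tp=1, fp=0, fn=1, tn=100)
print({name: round(value, 2) for name, value in metrics.items()})
# {'precision': 1.0, 'recall': 0.5, 'specificity': 1.0, 'f1': 0.67}
```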
Class 1 (TIDAK SEHAT)
Based on the confusion matrix, the class 1 conditions can be summarized in the following table:
Precision, Recall, Specificity, and F1-Score can be summarized in the following table:
The highest precision and true negative rate were obtained by the Linear SVC model. The highest true positive rate and F1 score were obtained by the Random Forest model.
Class 2 (SEDANG)
Based on the confusion matrix, the class 2 conditions can be summarized in the following table:
Precision, Recall, Specificity, and F1-Score can be summarized in the following table:
The Random Forest model classifies class 2 very well: it achieves the highest precision, true positive rate, true negative rate, and F1 score of all the models.
Class 3 (BAIK)
Based on the confusion matrix, the class 3 conditions can be summarized in the following table:
Precision, Recall, Specificity, and F1-Score can be summarized in the following table:
The Random Forest model classifies class 3 very well: it achieves the highest precision, true positive rate, true negative rate, and F1 score of all the models.
Based on the results above, we choose the Random Forest model to classify the air pollution standard index, because it classifies all classes well, especially as measured by the F1 score. The Random Forest model has the highest F1 score in every class; in class 0 it ties with three other models, but that shared score is still the highest obtained.
The accuracy of the Random Forest model is also the highest among the models, at 0.96. The accuracy of each model used in this paper is presented in the following table:
Conclusion:
Based on the results above, the following conclusions can be drawn:
- The best model in this paper for classifying the air pollution standard index is the Random Forest model.
- The accuracy of the Random Forest model is 0.96, with the following parameters:
- criterion=entropy
- max_depth=9
- max_features=5
- n_estimators=200
- random_state=42
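The parameters listed above map onto scikit-learn's `RandomForestClassifier` as sketched below. The data here is synthetic with five features (matching `max_features=5`), so the printed accuracy will not reproduce the 0.96 reported for the ISPU data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hyperparameters as listed in the conclusion.
rf = RandomForestClassifier(
    criterion="entropy",
    max_depth=9,
    max_features=5,
    n_estimators=200,
    random_state=42,
)
rf.fit(X_train, y_train)

acc = accuracy_score(y_test, rf.predict(X_test))
print(round(acc, 2))
```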
Further Analysis
If we look at the target composition of the dataset, we can see that it is imbalanced. Therefore, resampling the dataset with over-sampling, under-sampling, or other methods should be considered when classifying this dataset.
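As one possible starting point, random over-sampling can be sketched with pandas alone: every class is sampled with replacement up to the size of the largest class. This is an illustration on toy data, not a method evaluated in this paper, and it should be applied to the training split only so that duplicated rows do not leak into the test set.

```python
import pandas as pd

# Toy training split with an imbalanced, already encoded target.
train = pd.DataFrame({
    "pm10": range(12),
    "categori": [2] * 8 + [3] * 3 + [1] * 1,
})

# Size of the largest class.
target_size = train["categori"].value_counts().max()

# Sample each class with replacement until it reaches the target size.
balanced = pd.concat(
    [
        group.sample(n=target_size, replace=True, random_state=42)
        for _, group in train.groupby("categori")
    ],
    ignore_index=True,
)
print(balanced["categori"].value_counts())
```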