A Machine Learning Solution to Air Pollution Problem In Chiang Mai

Worasom Kundhikanjana
Mar 31, 2023


In this blog, I will show my machine learning approach to understand, model, and propose a solution to the high PM2.5 in Chiang Mai due to agricultural burning activities.

The first step to solving the air pollution problem in any city is to understand the source of the problem at the quantitative level. This needs an air pollution model. However, building one can be challenging due to the complex interaction between the weather factors and the sources of the pollution, which can be both locally generated and transported to the city. For this type of multivariable problem, machine learning models often perform better than traditional models, see references.

My approach is to build a machine learning model that can accurately predict the PM2.5 level using the satellite-detected hotspots, and weather data (wind speed, humidity, etc.). Then I will ask the model to simulate the PM2.5 level in scenarios where the hotspots are removed. This way, we can quantify the effect of reduced burning activities.

Heads up: the next sections dive into the technical details of building the machine learning model. If you are not interested, please skip to the last section to see the animation. Now, let’s get to it.

Here are the data sources:

  1. Historical air pollution PM2.5 data is from Thai PCD. I also scraped some parts from air4thai.
  2. Chiang Mai weather data is scraped from Weather Underground.
  3. MODIS hotspot data is from NASA data products.

Seasonal Pattern of PM2.5

I will show that there is a strong seasonal pattern to the PM2.5 level in Chiang Mai and that this pattern matches that of the burning activities. Let’s start with the raw PM2.5 level shown below. In some years, the peak values reach the very unhealthy range. In my previous blog, I showed that in years with low burning activities, the peak pollution levels are not as high as in years with a lot of burning.

PM2.5 level in Chiang Mai since 2012. The data is an average of two monitoring stations

To see the seasonal pattern, I plot the PM2.5 level against the day of the year for different years. This is shown in the plot below. The year is defined as the seasonal year. For example, December 2021 and January 2022 both belong to the 2021 season year.

seasonal pattern of PM2.5. The blue line is the average behavior.
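The seasonal-year bookkeeping can be sketched in a few lines of pandas. This is only an illustration, not the code from the notebook: the column names (`date`, `season_year`, `day_of_season`) and the choice of December as the first month of a season are assumptions made for the example.

```python
import pandas as pd


def add_season_columns(df, date_col="date", season_start_month=12):
    """Add 'season_year' and 'day_of_season' columns (hypothetical names).

    Assumption: a season starts on the 1st of `season_start_month`, so
    December 2021 and January 2022 both fall in season year 2021.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    # Months before the season start belong to the season that began
    # in the previous calendar year.
    out["season_year"] = dates.dt.year - (dates.dt.month < season_start_month).astype(int)
    parts = pd.DataFrame(
        {"year": out["season_year"], "month": season_start_month, "day": 1}
    )
    season_start = pd.to_datetime(parts)
    # Number of days elapsed since the first day of the season.
    out["day_of_season"] = (dates - season_start).dt.days
    return out


demo = add_season_columns(pd.DataFrame({"date": ["2021-12-15", "2022-01-10"]}))
print(demo)
```

Plotting `day_of_season` against PM2.5, grouped by `season_year`, then keeps each pollution season on one continuous curve instead of splitting it at New Year.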

We can see that the pollution season in Chiang Mai is between December and April. This seasonal pattern is what I will ask the machine learning model to reproduce in the simulation section below.

Note that the seasonal pattern of the number of hotspots within a 900 km radius of Chiang Mai is the same as the PM2.5 pattern. This is further strong evidence that these burning activities are the cause of the air pollution.

overlay seasonal patterns of PM2.5 and burning activities

Divide Burning Activities into Zones

To effectively feed the burning activities into a machine learning model, I divided the hotspots based on their distance from Chiang Mai. The figure below shows how I divided the area into 5 zones. The first zone is within 100 km of the downtown area. This area is still within the Thailand border and is meant to capture the local burning activities. The next two zones are 100–200 km and 200–400 km from the city. They cover the lower part of Thailand, Myanmar, and Laos. The outer zones are 400–600 km and 600–900 km. Studies have shown small but finite contributions from such far distances. I also found that the features from these far zones help improve the accuracy of the model.

fire zones defined by the distance from Chiang Mai
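As a rough illustration of the zoning step, the sketch below assigns a hotspot to one of the five distance bands using the great-circle (haversine) distance. The city-centre coordinates and the function names are assumptions for the example; the actual notebook may compute distances differently.

```python
import math

# Approximate Chiang Mai city-centre latitude/longitude (assumption).
CHIANG_MAI = (18.7883, 98.9853)
# Zone boundaries in km, taken from the text: 0-100, 100-200, 200-400, 400-600, 600-900.
ZONE_EDGES_KM = [0, 100, 200, 400, 600, 900]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def zone_label(lat, lon):
    """Return the distance band of a hotspot, or None if it is beyond 900 km."""
    distance = haversine_km(CHIANG_MAI[0], CHIANG_MAI[1], lat, lon)
    for lower, upper in zip(ZONE_EDGES_KM[:-1], ZONE_EDGES_KM[1:]):
        if lower <= distance < upper:
            return f"{lower}-{upper}km"
    return None


print(zone_label(13.7563, 100.5018))  # a point near Bangkok
```

Counting hotspots per day and per zone label then gives one feature column per zone.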

The hotspots are also weighted differently depending on the distance from the city. I also accounted for the time of travel to the city.
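The exact weighting and travel-time formulas are in the notebook and are not reproduced here. Purely as an illustration of the idea, the sketch below uses two simple stand-ins that are assumptions of this example, not the author's method: an inverse-distance weight, and a travel time equal to distance divided by wind speed.

```python
def hotspot_weight(distance_km, min_distance_km=1.0):
    """Hypothetical inverse-distance weight: closer hotspots count more.

    `min_distance_km` avoids division by zero for hotspots at the city centre.
    """
    return 1.0 / max(distance_km, min_distance_km)


def travel_time_hours(distance_km, wind_speed_kmh):
    """Hypothetical travel time of smoke: distance divided by wind speed."""
    if wind_speed_kmh <= 0:
        return float("inf")  # no wind, the smoke never arrives in this toy model
    return distance_km / wind_speed_kmh


# A hotspot 200 km away with a steady 10 km/h wind would be shifted by 20 hours.
print(hotspot_weight(200), travel_time_hours(200, 10))
```

In a feature table, the travel time would translate into a time shift of the hotspot counts before they are matched with the PM2.5 readings.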

In addition to the hotspot data, wind speed, humidity, and temperature are contributing factors used by the model. Anyone who has built a machine learning model knows that roughly 70% of ML work is feature engineering. I will not cover that in this blog. If you are interested, please see the source code in this notebook.

Model Performance

I use a random forest regressor for the model. In general, tree-based models tend to perform better than neural networks for this type of problem. The data was split into training and test sets before training.

In the figure below, you can see the model prediction on the test set (red). It can predict the seasonal pattern in PM2.5. The model has an r2 score of 0.75 on the test set, meaning it explains about 75% of the variance in the held-out PM2.5 values, which is pretty good.

model’s prediction accuracy on the test data r2 score = 0.75
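For readers who want to see the shape of this step, here is a minimal scikit-learn sketch of training a random forest regressor and scoring it on a held-out test set. The data is synthetic and the feature set, split ratio, and hyperparameters are assumptions for the example; the real features and settings are in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_days = 500

# Synthetic stand-ins for the real features (assumption, for illustration only).
hotspots = rng.poisson(30, size=n_days).astype(float)
wind_speed = rng.uniform(0, 20, size=n_days)
humidity = rng.uniform(30, 90, size=n_days)
X = np.column_stack([hotspots, wind_speed, humidity])
# Toy target: more hotspots raise PM2.5, more wind lowers it, plus noise.
y = 10 + 1.5 * hotspots - 0.8 * wind_speed + rng.normal(0, 3, size=n_days)

# Hold out 20% of the days for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

score = r2_score(y_test, model.predict(X_test))
print(f"test r2 = {score:.2f}")
```

With real time series data, a chronological split (train on earlier seasons, test on later ones) is often a safer check than a random split, because neighbouring days are strongly correlated.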

Simulation Results

The fun actually begins after getting the model working! I can ask the model to simulate the PM2.5 seasonal pattern if we cut the burning activity in the area 0–100 km from the city down to 90%, 50%, and 10% of the average activity.

simulate reduced burning activities in area within 100 km from Chiang Mai to 90%(left), 50%(middle), 10%(right) of the average activities. The number of hotspots in the map includes all the burning in 900 km radius.

Because there are still a lot of burning activities left in the outer zones, I don’t expect much change in the PM2.5 level when cutting down the burning in the 100 km zone. This is what the simulation shows. In the plot below, you can see that the PM2.5 seasonal pattern changes very little when reducing the burning activities from 90% (green) to 10% (blue). The peak pollution is still in the unhealthy range. This is because the number of hotspots only decreases from 40 spots/day to 20 spots/day.

simulated seasonal PM2.5 pattern when the burning activities in the area within 100 km of Chiang Mai are reduced to 90%, 50%, and 10%
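The scenario itself boils down to scaling the hotspot features of the chosen zones and leaving everything else untouched before asking the model for a prediction. The sketch below shows that scaling step only; the column names are hypothetical and the helper is an illustration rather than the notebook's code.

```python
import pandas as pd


def scale_hotspots(features, zone_columns, fraction):
    """Return a copy of `features` with the given hotspot columns scaled.

    `fraction` is the share of burning that remains, e.g. 0.5 for 50%.
    Weather columns and hotspot columns of other zones stay as they are.
    """
    simulated = features.copy()
    simulated[zone_columns] = simulated[zone_columns] * fraction
    return simulated


# Hypothetical feature table with two hotspot zones and one weather column.
features = pd.DataFrame(
    {
        "fire_0_100km": [40.0, 20.0],
        "fire_100_200km": [120.0, 80.0],
        "wind_speed": [5.0, 7.0],
    }
)
half_local_burning = scale_hotspots(features, ["fire_0_100km"], 0.5)
print(half_local_burning)
```

A trained model would then be queried with something like `model.predict(half_local_burning)`, and the predictions compared with those for the unmodified `features`.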

In the picture below, the simulation expands to the area within 400 km of the city. There are many more burning activities in this area.

simulate reduced burning activities in area within 400 km from Chiang Mai to 90%(left), 50%(middle), 10%(right) of the average activities. The number of hotspots in the map includes all the burning in 900 km radius.

The effect on the PM2.5 level for this scenario is summarized in the plot below. When cutting down the burning activities by half, the model predicts a larger PM2.5 reduction. This makes sense.

simulated seasonal PM2.5 pattern when the burning activities in the area within 400 km of Chiang Mai are reduced to 90%, 50%, and 10%

Similarly, I can simulate what would happen if the burning activities within the 700 km radius decrease.

simulate reduced burning activities in area within 700 km from Chiang Mai to 90%(left), 50%(middle), 10%(right) of the average activities. The number of hotspots in the map includes all the burning in 900 km radius.
simulated seasonal PM2.5 pattern when the burning activities in the area within 700 km of Chiang Mai are reduced to 90%, 50%, and 10%

One could ask why cutting the burning activities to 10% still does not get rid of the PM2.5 completely. There are two possibilities: (1) the simulation still includes finite burning activities in the area, or (2) we are seeing the limitation of the random forest model in extrapolating to edge cases.

In my opinion, the 400 km radius is the most critical zone. If the burning activities here decrease, we might see a big improvement in the average air quality around Chiang Mai.

Let’s summarize the different burning reduction scenarios in the plot below. Each dot represents the simulated average PM2.5 for the entire season. Each line comes from simulating reduced burning in the area within a given radius of the city. For example, the 0–100 km scenario corresponds to the blue line, and the 0–200 km scenario corresponds to the orange line. The plot shows how much the PM2.5 level would decrease if the number of hotspots in each area decreases.

simulate PM2.5 level when reducing burning activities at a large scale

For example, follow the orange line, which simulates reduced burning activities in the area within 200 km of Chiang Mai. If there is no change, the average PM2.5 level would be about 60 µg/m³. If the burning activities decrease to 25%, the level would be around 52 µg/m³. This is enough to bring the average PM2.5 level out of the unhealthy range. In my opinion, this 200 km radius is the most critical for improving the air quality. The area is not too large, and the majority of it is still within Thailand.
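To make the arithmetic concrete, the sketch below computes the relative reduction and maps both values to an air quality category. It assumes the categories follow the US EPA PM2.5 breakpoints in use before the 2024 revision (unhealthy starts at 55.5 µg/m³), which is consistent with the statement above but is an assumption of this example.

```python
# Upper bounds (µg/m³, 24-hour PM2.5) of each category, assuming the
# US EPA breakpoints in use before the 2024 revision.
PM25_BREAKPOINTS = [
    (12.0, "Good"),
    (35.4, "Moderate"),
    (55.4, "Unhealthy for Sensitive Groups"),
    (150.4, "Unhealthy"),
    (250.4, "Very Unhealthy"),
]


def pm25_category(concentration):
    """Return the category name for a PM2.5 concentration in µg/m³."""
    for upper_bound, label in PM25_BREAKPOINTS:
        if concentration <= upper_bound:
            return label
    return "Hazardous"


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


for level in (60, 52):
    print(level, pm25_category(level))
print(f"reduction: {relative_reduction(60, 52):.1%}")
```

Under these assumptions, going from 60 to 52 µg/m³ is a reduction of roughly 13% in the seasonal average, enough to cross the 55.5 µg/m³ threshold.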

If more aggressive measures are implemented, the air quality will improve greatly. For example, the green line shows the scenario where the burning activities in the area within 400 km decrease. The model suggests a large reduction effect. If the burning activities are cut to 10%, the average PM2.5 level could reach the moderate range, a desirable outcome.

Note that this simulation approach is just an estimate. The model cannot capture all contributing factors. During the pollution season, the PM2.5 level is volatile and difficult for the model to capture. Also, a random forest model cannot predict cases it has not seen before.

The previous three plots are a bit abstract. Let me put these plots alongside the map.

overall effect of reduced burning activities toward PM2.5 level in Chiang Mai

In summary, burning activities in an area as far as 900 km away could contribute to the PM2.5 level in Chiang Mai. We will probably see the largest effect from reducing burning activities in the zone within 400 km of the city center. This area extends beyond the Thailand border, and solving this problem will not be easy. For folks living in the north of Thailand, please take care.

Thank you for reading. Until next time!

The code can be found on my GitHub page. I would like to thank Mr. Matthew Perkins from UN-ESCAP for the collaboration and Thai PCD for the data.
