This is part 2 of a four-part series on predicting air pollution (specifically PM2.5 levels) in Ulaanbaatar, the capital city of Mongolia. Part 2 introduces the data used to predict PM2.5 levels and shows several visualizations to better understand the situation. Part 1 detailed the problem of air pollution, some solutions, and how prediction may help. Here you can find Part 1, Part 3, and Part 4.
Any successful machine learning project starts with data, preferably a large amount of high-quality, labeled, structured data. Of course, reality is normally at odds with your wants and desires.
In order to predict PM2.5 air pollution I needed input data from which to infer the air pollution level. This could be all manner of things: time of day, number of cars, number of gers, TV usage metrics, and many others could all be helpful in inferring PM2.5 levels. However, weather data seemed the obvious choice. After all, many people I spoke with said, “If it’s cold there is pollution.” So I would need to know when it was cold to get started.
For this project I started with the most obvious thing: air pollution data. When I first decided to undertake this project in October of last year, my options were very limited. At the time there was only one publicly accessible source of pollution data, the US Embassy in Ulaanbaatar. In 2015 the US Embassy launched an air quality monitoring program and has been tracking hourly PM2.5 levels since then. The data are available as CSV files and are aggregated by month.
The station records PM2.5 at one-hour intervals. Occasionally the station stops reporting for some reason, so there are holes in the data for some hours.
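One way to locate those holes is to walk the expected hourly grid and note every hour with no reading. The snippet below is a minimal sketch using only the Python standard library; the function name and the sample timestamps are hypothetical, not taken from the Embassy files.

```python
from datetime import datetime, timedelta


def find_missing_hours(timestamps):
    """Return the hourly timestamps absent between the first and last reading."""
    seen = set(timestamps)
    if not seen:
        return []
    current, end = min(seen), max(seen)
    missing = []
    while current <= end:
        if current not in seen:
            missing.append(current)
        current += timedelta(hours=1)
    return missing


# Hypothetical readings: the 03:00 and 04:00 records are absent.
readings = [
    datetime(2016, 1, 1, 0),
    datetime(2016, 1, 1, 1),
    datetime(2016, 1, 1, 2),
    datetime(2016, 1, 1, 5),
]
gaps = find_missing_hours(readings)
print(gaps)
```

Counting the gaps this way gives a quick sense of how much of the record is unusable before any modeling starts.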
PM2.5 levels are measured in micrograms per cubic meter (μg/m³, where 1 μg/m³ = 1.0 × 10^-9 kg/m³). To put this in perspective, 50 micrograms (μg) weighs about as much as a fingerprint. That amount of PM2.5 in a single cubic meter of air is considered unhealthy. Pretty potent stuff.
What is PM2.5?
PM2.5 stands for particulate matter 2.5, which means particles in the air that are ≤2.5 microns in diameter. These particles are so small they can enter your bloodstream and cause all manner of ailments.
There are many other ways to measure air pollution, including PM10 (particulate matter between 2.5 and 10 microns in diameter), carbon monoxide, sulfur dioxide, nitrogen dioxide, and ozone (O3). I chose PM2.5 because 1) it is one of the most dangerous forms of air pollution and 2) data were readily available going back to October 2015.
Using weather data created a whole host of questions. What weather indicators were useful? What do dry bulb temperature, dew point, and wind speed tell us? Does it matter whether it rains or snows? I quickly learned why people spend years of their lives studying the weather.
Finding weather data was another issue. The National Agency of Meteorology and Environmental Monitoring does have a data archive. However, you are required to fill out a request form, and a fee is involved in retrieving the data. Thankfully, after a few days of Googling around, I found that NOAA (National Oceanic and Atmospheric Administration, part of the US Department of Commerce) collects data from thousands of weather stations around the world. These data are freely available through the Global Hourly Surface Dataset Common Access System. It turns out they have Ulaanbaatar weather data going back to 1956!
Putting It Together
Now that I had both weather and pollution data (my inputs and output, respectively) I could put them together. In the end I had over 19,000 rows of data, with each row representing an hour. The charts above show some of the most striking views of pollution in Ulaanbaatar. It is clear there is a predictable pattern we may be able to exploit to create our predictions.
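Putting the two sources together amounts to joining them on the hour. The following is a minimal sketch of such a join, not the exact code used for this project: the keys `timestamp`, `temp_c`, `wind_mps`, and `pm25` are hypothetical names, the values are made up, and it assumes both sources have already been converted to the same time zone.

```python
from datetime import datetime


def join_on_hour(weather_rows, pollution_rows):
    """Inner-join two lists of dicts on their 'timestamp' key."""
    pm25_by_hour = {row["timestamp"]: row["pm25"] for row in pollution_rows}
    joined = []
    for row in weather_rows:
        ts = row["timestamp"]
        if ts in pm25_by_hour:
            # Keep the weather features and attach the PM2.5 target.
            joined.append({**row, "pm25": pm25_by_hour[ts]})
    return joined


# Made-up sample rows for illustration only.
weather = [
    {"timestamp": datetime(2016, 1, 1, 0), "temp_c": -25.0, "wind_mps": 1.0},
    {"timestamp": datetime(2016, 1, 1, 1), "temp_c": -26.0, "wind_mps": 0.5},
    {"timestamp": datetime(2016, 1, 1, 2), "temp_c": -27.0, "wind_mps": 2.0},
]
pollution = [
    {"timestamp": datetime(2016, 1, 1, 0), "pm25": 310.0},
    {"timestamp": datetime(2016, 1, 1, 2), "pm25": 280.0},
    {"timestamp": datetime(2016, 1, 1, 3), "pm25": 150.0},
]
merged = join_on_hour(weather, pollution)
print(len(merged))
```

An inner join keeps only the hours present in both sources, so every remaining row has both inputs and an output to learn from.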
There were also several challenges. As you can see from the time series above, there appear to be several outliers that are probably erroneous data points. In addition, our weather data set has many missing values (some features have more missing values than actual data).
I chose to remove any record with a PM2.5 value above 600 μg/m³. After conversations with subject matter experts I learned that measured values above 500 μg/m³ may not be reliable. As the AQI scale only goes to 500, values above this are simply considered extremely hazardous. Looking at the data seems to support this assertion.
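As a minimal sketch, the cutoff is a one-line filter. The `pm25` key and the sample values below are hypothetical; only the 600 μg/m³ threshold comes from the text.

```python
PM25_CUTOFF = 600.0  # μg/m³, the threshold described above


def drop_outliers(rows, cutoff=PM25_CUTOFF):
    """Keep only records whose PM2.5 value is at or below the cutoff."""
    return [row for row in rows if row["pm25"] <= cutoff]


# Made-up sample values for illustration only.
sample = [{"pm25": 120.0}, {"pm25": 600.0}, {"pm25": 985.0}, {"pm25": 45.0}]
cleaned = drop_outliers(sample)
print(cleaned)
```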
The next decision was what to do with the missing data points. There were literally thousands of records with missing data. I attempted to fill these values with the mean of the other records, the previous value, and the next value. In the end I chose to simply remove records with missing values. This will be discussed in more detail in part 3, where the machine learning model will be introduced and explained.
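The four options mentioned above can be sketched on a single column of values, with `None` marking a missing reading. This is an illustrative sketch with hypothetical function names and made-up temperatures, not the code used for this project.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    avg = mean(v for v in values if v is not None)
    return [avg if v is None else v for v in values]


def forward_fill(values):
    """Replace each missing value with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled


def backward_fill(values):
    """Replace each missing value with the next observed value."""
    return forward_fill(values[::-1])[::-1]


def drop_missing(values):
    """Discard missing values entirely."""
    return [v for v in values if v is not None]


# Made-up hourly temperatures (°C) with two missing readings.
temps = [-20.0, None, -18.0, None, -25.0]
print(fill_with_mean(temps))
print(forward_fill(temps))
print(backward_fill(temps))
print(drop_missing(temps))
```

Filling keeps every row but invents values, while dropping keeps only measured values at the cost of fewer rows.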
All visualizations shown in this article and series were made using Matplotlib, a great plotting library for Python that is very flexible. My preferred environment is Jupyter Notebook, an amazing Python (among others) development environment that allows you to write code and documentation in one place.
After many hours of exploring and visualizing the data I was able to determine the following:
- While temperature is important, it isn’t as simple as cold = pollution
- The time of day has a great impact. When people are away from home, pollution is lowest. The day of the week has some small bearing as well.
- Winter months have the worst pollution levels, but some summer months have average levels in the unhealthy range.
- Wind speed has a negative correlation with pollution. The higher the wind speed, the lower the pollution level.
- Humidity level (our data has this measured as dew point) has some bearing on pollution.
These are only a few of the many visualizations and tables I created when going through the data. Many hours were spent poring over the data to find the relationships between our key variables.
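Two of the findings listed above, the time-of-day effect and the negative relationship with wind speed, can be checked numerically with a per-hour average and a Pearson correlation coefficient. The sketch below uses only the standard library; the key names and all numbers are made up for illustration and are not results from the actual data set.

```python
from collections import defaultdict
from statistics import mean


def mean_pm25_by_hour(rows):
    """Average PM2.5 for each hour of the day (0-23)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["hour"]].append(row["pm25"])
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up rows: two early-morning and two mid-afternoon readings.
rows = [
    {"hour": 3, "pm25": 100.0},
    {"hour": 3, "pm25": 120.0},
    {"hour": 14, "pm25": 40.0},
    {"hour": 14, "pm25": 60.0},
]
hourly = mean_pm25_by_hour(rows)

# Made-up values in which PM2.5 falls steadily as wind speed rises.
wind_speed = [1.0, 2.0, 3.0, 4.0]
pm25 = [200.0, 150.0, 100.0, 50.0]
r = pearson(wind_speed, pm25)
print(hourly, r)
```

A coefficient near -1 indicates a strong negative linear relationship and a value near 0 indicates little linear relationship; real data will almost never be as clean as this toy example.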
The next step is to run our cleaned data through the machine learning algorithms and check our results. Stay tuned for part 3 where I dive into the details of the model results.
City of Smoke
A great introduction to the problem of Ulaanbaatar’s air pollution is the City of Smoke Project by Peter Bittner. Check out the trailer below and then check out the project at CityofSmokeProject.com.
Have a question? Send me an email at email@example.com.