# Predicting PM2.5 Using Machine Learning — Part 3, The Model

Apr 25, 2018 · 13 min read

This is part 3 of a 4 part series on predicting air pollution (specifically PM2.5) in Ulaanbaatar, the capital city of Mongolia. This article will detail data cleansing, feature engineering, and finally the model itself using Python 3.6 in a Jupyter Notebook. This article will be more technical than the previous two. Part 1 introduced the problem and part 2 detailed the data. Here you can find Part 1, Part 2, and Part 4.

So far in this series we’ve looked at the problem, available data, and some analysis and visualizations of the dataset. Now we are going to use machine learning to answer the question: Can we predict PM2.5 air pollution?

Setting an evaluation metric is a good idea so we can measure our success. My goal for this project is to identify the PM2.5 concentration with minimal error. The metric I chose is Root Mean Squared Error (RMSE), a measure of the average distance between my predictions and the ground truth. For a discussion of RMSE versus Mean Absolute Error (another popular metric), see here.
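As a quick illustration of the metric (with hypothetical numbers, not the project's data), RMSE is simple to compute with NumPy:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical PM2.5 readings (µg/m³) and predictions
actual = [80, 120, 45]
predicted = [70, 130, 45]
print(rmse(actual, predicted))  # ~8.16 = sqrt((100 + 100 + 0) / 3)
```

Because the residuals are squared before averaging, RMSE punishes large misses more than MAE does, which suits a use case where badly underpredicting a pollution spike is worse than many small errors.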

Now that we have an evaluation metric we have something to measure our success (or failure) with. For the purposes of this project I am assigning a somewhat arbitrary goal of an RMSE of 30 micrograms per cubic meter or less. A recent project to predict PM2.5 was able to reach an RMSE of 13.68 micrograms per cubic meter, using deep learning with a combination of station data and satellite imagery. However, to my knowledge this will be the first model of its kind to use machine learning to predict pollution in Mongolia. As such we should keep our expectations reasonable and iterate on our model if needed.

You can find the Github repository that contains the raw dataset, cleaned dataset, and Jupyter Notebook here.

# Classification vs Regression

I chose to do a regression model over a classification model (although I actually tried both). This was an important decision as I intend to deploy this model to production and actually have it used by those living in Ulaanbaatar.

The reason for this decision came down to analytics. In a classification problem you either predict the correct class or you don't. I didn't want a simple hit-or-miss metric, but rather a value I could plot. At the same time, most people don't care about the exact microgram-per-cubic-meter concentration of PM2.5; they just want to know whether the air is good, bad, or really bad, and the health implications of each.

These two demands are somewhat at odds, so I tried to find a compromise. Regression was chosen to give a value prediction, and from there the AQI category (from good to extremely hazardous) will be calculated and shown to users. In this way we will be able to balance getting the data we need to improve the model while simplifying the output for our end users.
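A minimal sketch of that second step: mapping a predicted concentration to its AQI category. This uses the standard US EPA PM2.5 breakpoints (in µg/m³); the exact cut-offs used in production may differ:

```python
def pm25_category(conc):
    """Map a PM2.5 concentration (µg/m³) to a US EPA AQI category."""
    breakpoints = [
        (12.0, "Good"),
        (35.4, "Moderate"),
        (55.4, "Unhealthy for Sensitive Groups"),
        (150.4, "Unhealthy"),
        (250.4, "Very Unhealthy"),
    ]
    for upper, label in breakpoints:
        if conc <= upper:
            return label
    return "Hazardous"

print(pm25_category(90))  # Unhealthy
```

This way the regression model's raw output is kept for evaluation, while end users only ever see the category label.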

To get started with our model we first need to load our data in the notebook into a dataframe and import the relevant libraries to get started.

```python
# Import relevant items
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import datetime as dt
from datetime import datetime
import math

# Import CSV file into a dataframe
df = pd.read_csv('weather-and-aqi-v5.csv')
```

This file, “weather-and-aqi-v5.csv” was created by merging our two datasets, detailed in part 2. They were merged in Excel Power Query by using a key field (full date: mm/dd/yyyy hh-mm). This could have also been done in Python using a merge or join.
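For reference, the equivalent merge in pandas might look like the following sketch (the column values here are illustrative, not the actual data):

```python
import pandas as pd

# Illustrative: each dataset shares a full-date key (mm/dd/yyyy hh-mm)
weather = pd.DataFrame({"Date Key": ["01/01/2018 08-00", "01/01/2018 09-00"],
                        "TEMP": [-15, -12]})
aqi = pd.DataFrame({"Date Key": ["01/01/2018 08-00", "01/01/2018 09-00"],
                    "Value": [0.2, 0.25]})

# Inner join keeps only the hours present in both datasets
df = weather.merge(aqi, on="Date Key", how="inner")
print(df.shape)  # (2, 3)
```

An inner join drops hours missing from either source, which matches what the Power Query merge on a key field produces.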

Our dataframe has 19,615 rows and 42 features (columns). Our list of weather features is quite long:

• DIR — WIND DIRECTION IN COMPASS DEGREES, 990 = VARIABLE, REPORTED AS “***” WHEN AIR IS CALM (SPD WILL THEN BE 000)
• SPD & GUS = WIND SPEED & GUST IN MILES PER HOUR
• CLG = CLOUD CEILING — LOWEST OPAQUE LAYER
• SKC = SKY COVER
• L = LOW CLOUD TYPE, SEE BELOW
• M = MIDDLE CLOUD TYPE, SEE BELOW
• H = HIGH CLOUD TYPE, SEE BELOW
• VSB = VISIBILITY IN STATUTE MILES TO NEAREST TENTH
• MW MW1 MW2 MW3 = MANUALLY OBSERVED PRESENT WEATHER — LISTED BELOW IN PRESENT WEATHER TABLE
• AW AW1 AW2 AW3 = AUTO-OBSERVED PRESENT WEATHER — LISTED BELOW IN PRESENT WEATHER TABLE
• W = PAST WEATHER INDICATOR, SEE BELOW
• TEMP & DEWP = TEMPERATURE & DEW POINT IN FAHRENHEIT
• SLP = SEA LEVEL PRESSURE IN MILLIBARS TO NEAREST TENTH
• ALT = ALTIMETER SETTING IN INCHES TO NEAREST HUNDREDTH
• STP = STATION PRESSURE IN MILLIBARS TO NEAREST TENTH
• MAX = MAXIMUM TEMPERATURE IN FAHRENHEIT (TIME PERIOD VARIES)
• MIN = MINIMUM TEMPERATURE IN FAHRENHEIT (TIME PERIOD VARIES)
• PCP01 = 1-HOUR LIQUID PRECIP REPORT IN INCHES AND HUNDREDTHS
• PCP06 = 6-HOUR LIQUID PRECIP REPORT IN INCHES AND HUNDREDTHS
• PCP24 = 24-HOUR LIQUID PRECIP REPORT IN INCHES AND HUNDREDTHS
• PCPXX = LIQUID PRECIP REPORT IN INCHES AND HUNDREDTHS
• SD = SNOW DEPTH IN INCHES

In addition we have our AQI and Value field, several date fields (some are copied), location, duration, source, and unit. Now that we have our data we can move to the next step.

# EDA

After loading your data, you will generally perform exploratory data analysis (EDA). EDA was covered in part 2, but let's recap the key points.

• While temperature is important, it isn’t as simple as cold = pollution
• The time of day has a great impact. When people are away from home pollution is the lowest. The day of the week has some small bearing as well.
• Winter months have the worst pollution levels, but some summer months have average levels in the unhealthy range.
• Wind speed has a negative correlation with pollution. The higher the wind speed, the lower the pollution level.
• Humidity level (measured as dew point) has some bearing on pollution.

# Cleaning the Data

Often (if not most of the time) in real datasets there will be outliers, erroneous data, or even typos. As you saw in the time series chart in Part 2, there seem to be several of those issues in our data.

First let’s remove any unneeded features. Since we will be predicting pollution, we should look at what data is actually available as a forecast so we can use it when making predictions. Several weather APIs exist that we can retrieve forecasts from for our model. Here is the information consistently available from weather APIs:

• Temperature
• Humidity
• Wind speed
• Wind direction

OpenAQ has an API from which we can collect our ground-truth PM2.5 concentrations. In addition, we know month, hour, and day of the week can have predictive power (from our EDA in part 2), so we want to keep our date and time features. With that, let’s remove our unneeded features.

```python
# Drop unneeded features
df = df.drop(['Year', 'Day', 'Date Key', 'Date Key.1', 'AQI', 'Source.Name',
              'Site', 'Parameter', 'Unit', 'Duration', 'USAF', 'WBAN', 'GUS',
              'CLG', 'SKC', 'L', 'M', 'H', 'VSB', 'MW', 'MW_1', 'MW_2', 'MW_3',
              'AW', 'AW_4', 'AW_5', 'AW_6', 'W', 'SLP', 'ALT', 'STP', 'MAX',
              'MIN', 'PCP01', 'PCP06', 'PCP24', 'PCPXX', 'SD'], axis=1)
```

This leaves us with a much smaller set of features.

Our Date (LST) feature contains a lot of information, but it is currently stored as text, which Python doesn’t recognize as a date. Since our date is in a standard format, it is easy to convert the feature to a datetime type and rename it to remove punctuation.

```python
df['Date (LST)'] = pd.to_datetime(df['Date (LST)'])
df = df.rename(columns={"Date (LST)": "Date"})
```

In my first runs through the model, performance was not ideal. I theorized this was because the current data ignores everything that came before each observation: air pollution rises and falls gradually, but our dataset carried no memory of previous hours. As a result I copied and shifted our features to create previous-hour features for everything except time, going back 5 hours. I found this was the optimal number, as extra hours gave no noticeable performance increase and fewer hours reduced performance.

The code to accomplish this is pretty verbose (it could be cleaned up in a custom function), so I included only the first hour’s code. You can see new features are created using the Pandas shift method. The same method is used to create the previous-hours 2–5 features.

```python
df['Value_1'] = df.Value.shift(periods=1)
df['TEMP_1'] = df.TEMP.shift(periods=1)
df['SPD_1'] = df.SPD.shift(periods=1)
df['DEWP_1'] = df.DEWP.shift(periods=1)
df['DIR_1'] = df.DIR.shift(periods=1)
```

This is done before any other data cleaning activities so we have an uninterrupted dataset for the shift method. After the shift, the first several rows of the new time-shifted features contain NaNs (blanks), since there is no earlier data to shift in.
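The shift step could be written more compactly as a loop over the lag horizon. A sketch, shown here on a tiny stand-in dataframe (the real data would loop over Value, TEMP, SPD, DEWP, and DIR):

```python
import pandas as pd

df = pd.DataFrame({"Value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                   "TEMP": [10, 11, 12, 13, 14, 15, 16]})

# Create lagged copies of each feature for the previous 1-5 hours
for lag in range(1, 6):
    for col in ["Value", "TEMP"]:
        df[f"{col}_{lag}"] = df[col].shift(periods=lag)

print(df.columns.tolist())
```

Each `shift(periods=n)` pushes the column down by n rows, so row t gets the value from hour t-n; rows without n hours of history become NaN.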

Next we can look at outliers. We can see several Values at or near the 1.0 mark. Considering the time of these points and how quickly they come and go, these are clear outliers. To remove them we can drop any values above 0.7. I found after the first run of the model that performance increased noticeably once these outliers were removed.

```python
df = df[df.Value <= .7]
```

Next we will look at the beginning of the dataset, where you can see some odd behavior.

This is from when the station was first installed, and this behavior clearly looks irregular. Without knowing more it is safest to remove this from the dataset.

```python
df = df[df.Date > '2015-10-20 01']
```

After these outliers have been removed we have our final dataset. The trend is quite clear. More analysis could be done to find further issues, however that was the low hanging fruit and sufficient for now.

Removing blank values (NaNs in Pandas dataframes) is quite important, as many machine learning algorithms can’t use them as inputs. Our dataframe currently has a large number of rows with NaNs.

```python
# Show rows where any cell has a NaN
df[df.isnull().any(axis=1)].shape
# Output: (11818, 33)
```

This shows we have 11,818 rows that contain NaN values. The simplest and most straightforward option is to simply remove all rows with NaN values.

```python
# Shape of dataframe
df.shape
# Output: (18583, 33)

# Drop rows that contain NaN
df = df.dropna(axis=0)

# Shape of dataframe
df.shape
# Output: (6765, 33)
```

You can see that our dataset is considerably smaller. I experimented with forward fill, back fill, mean imputation, and other methods of filling these NaNs, but none performed better than simply dropping the rows.

Now we have a clean dataset with no missing values or outliers.

# Feature Engineering

## Convert Value Field from mg to µg

Our Value field is currently in mg/m³. We need our values in µg/m³, as that is the global standard. This is a simple conversion.

```python
# 1 mg = 1,000 µg
df['Value'] = df.Value * 1000
df['Value_1'] = df.Value_1 * 1000
df['Value_2'] = df.Value_2 * 1000
df['Value_3'] = df.Value_3 * 1000
df['Value_4'] = df.Value_4 * 1000
df['Value_5'] = df.Value_5 * 1000
```

## Convert TEMP and DEWP from F to C

In addition, our temperature and dew point fields are in Fahrenheit, and we can convert them to Celsius with a simple conversion.

```python
# Formula to convert F to C is: [°C] = ([°F] - 32) × 5/9
df['TEMP'] = (df.TEMP - 32) * 5.0/9.0
df['TEMP_1'] = (df.TEMP_1 - 32) * 5.0/9.0
df['TEMP_2'] = (df.TEMP_2 - 32) * 5.0/9.0
df['TEMP_3'] = (df.TEMP_3 - 32) * 5.0/9.0
df['TEMP_4'] = (df.TEMP_4 - 32) * 5.0/9.0
df['TEMP_5'] = (df.TEMP_5 - 32) * 5.0/9.0

# Same conversion for dew point
df['DEWP'] = (df.DEWP - 32) * 5.0/9.0
df['DEWP_1'] = (df.DEWP_1 - 32) * 5.0/9.0
df['DEWP_2'] = (df.DEWP_2 - 32) * 5.0/9.0
df['DEWP_3'] = (df.DEWP_3 - 32) * 5.0/9.0
df['DEWP_4'] = (df.DEWP_4 - 32) * 5.0/9.0
df['DEWP_5'] = (df.DEWP_5 - 32) * 5.0/9.0
```

## Convert SPD from Mph to Kph

Our speed values are in miles per hour, and we would like them in kilometers per hour.

```python
# 1 mph = 1.60934 kph
df['SPD'] = df.SPD * 1.60934
df['SPD_1'] = df.SPD_1 * 1.60934
df['SPD_2'] = df.SPD_2 * 1.60934
df['SPD_3'] = df.SPD_3 * 1.60934
df['SPD_4'] = df.SPD_4 * 1.60934
df['SPD_5'] = df.SPD_5 * 1.60934
```

## Convert DEWP to HUM

Our dataframe contains dew point, but most weather forecast APIs give humidity instead. These are closely related measurements; you can find more about them on Wikipedia — Dew Point. However, unlike speed and temperature, there is no exact conversion between them that I could find.

I did find a conversion method from researchers at the University of Miami that uses the August-Roche-Magnus approximation. Using this formula we can get a good approximation and convert our dew point to a humidity percentage.

```python
df['HUM'] = 100*(np.exp((17.625 * df['DEWP'])/(243.04 + df['DEWP']))/np.exp((17.625 * df['TEMP'])/(243.04 + df['TEMP'])))
df['HUM_1'] = 100*(np.exp((17.625 * df['DEWP_1'])/(243.04 + df['DEWP_1']))/np.exp((17.625 * df['TEMP_1'])/(243.04 + df['TEMP_1'])))
df['HUM_2'] = 100*(np.exp((17.625 * df['DEWP_2'])/(243.04 + df['DEWP_2']))/np.exp((17.625 * df['TEMP_2'])/(243.04 + df['TEMP_2'])))
df['HUM_3'] = 100*(np.exp((17.625 * df['DEWP_3'])/(243.04 + df['DEWP_3']))/np.exp((17.625 * df['TEMP_3'])/(243.04 + df['TEMP_3'])))
df['HUM_4'] = 100*(np.exp((17.625 * df['DEWP_4'])/(243.04 + df['DEWP_4']))/np.exp((17.625 * df['TEMP_4'])/(243.04 + df['TEMP_4'])))
df['HUM_5'] = 100*(np.exp((17.625 * df['DEWP_5'])/(243.04 + df['DEWP_5']))/np.exp((17.625 * df['TEMP_5'])/(243.04 + df['TEMP_5'])))
```
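The six near-identical lines above could be factored into a small helper. A sketch using the same August-Roche-Magnus constants:

```python
import numpy as np

def dewpoint_to_humidity(dewp_c, temp_c):
    """Relative humidity (%) via the August-Roche-Magnus approximation.
    Inputs are dew point and air temperature in degrees Celsius."""
    actual = np.exp((17.625 * dewp_c) / (243.04 + dewp_c))       # actual vapor pressure term
    saturation = np.exp((17.625 * temp_c) / (243.04 + temp_c))   # saturation vapor pressure term
    return 100 * actual / saturation

# Sanity check: dew point equal to air temperature means saturated air
print(dewpoint_to_humidity(10.0, 10.0))  # 100.0
```

Because the function is vectorized through NumPy, it works on whole dataframe columns as well as scalars, so each `HUM_n` column could be produced with one call.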

## Create day of the week feature

We also found from our exploratory data analysis that certain days of the week have higher or lower average air pollution concentrations. We will use Pandas again to create this feature.

```python
df['day_week'] = df['Date'].dt.weekday_name
# Note: in newer pandas versions, weekday_name has been removed; use df['Date'].dt.day_name()
```

This gives us a text description of each day (Monday, Tuesday, Wednesday, etc.). Our machine learning models will need numerical features, so we will convert these to numbers: first by converting the feature to a categorical type, then using the cat codes method (again in Pandas) to convert the days of the week to numbers. I could also have mapped the days of the week using an explicit dictionary.

```python
df['day_week_cat'] = df.day_week.astype("category").cat.codes
```
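The explicit-dictionary alternative mentioned above would look like the sketch below. One design difference: a dictionary fixes the ordering (Monday=0 through Sunday=6), whereas cat codes assigns codes in alphabetical order of the labels:

```python
import pandas as pd

day_map = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
           "Friday": 4, "Saturday": 5, "Sunday": 6}

df = pd.DataFrame({"day_week": ["Monday", "Sunday", "Friday"]})
df["day_week_cat"] = df["day_week"].map(day_map)
print(df["day_week_cat"].tolist())  # [0, 6, 4]
```

For a tree-based model like XGBoost the ordering matters little, but an interpretable, stable encoding is easier to reproduce at prediction time.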

That concludes our feature engineering. Now we can get to the really interesting part: checking which model performs best and getting our results!

# Split into Training and Test Data

Cross validation is always desirable when training machine learning models so we can trust the generality of the model. We will split our data into training and test sets using scikit-learn’s built-in tools. For scikit-learn we also need to separate our dataset into inputs and the feature being predicted (X and y). We split the dataframe 70% for training and 30% for testing.

```python
from sklearn.model_selection import train_test_split

y = df['Value']
X = df.drop(['Value'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=1234)
```

This yields 4,735 training rows and 2,030 rows for testing.

# Implement ML Algorithms

Our training set was run through seven algorithms in Scikit-learn to determine the best model. The model with the lowest RMSE is the one we will select. To see the full details of each model you can find the Jupyter notebook on Github here with all of the code. Here are the results of the seven algorithms tested.
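The comparison itself can be sketched as a loop over candidate regressors, scoring each on held-out data. This is a minimal illustration on synthetic stand-in data, not the notebook's exact code or model list:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features are weather plus lagged pollution
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

models = {"Linear Regression": LinearRegression(),
          "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0)}

# Fit each candidate and report its test-set RMSE; lowest wins
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.3f}")
```

Each candidate is trained on the same split and scored with the same metric, so the lowest RMSE gives a fair head-to-head comparison.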

XGBoost (a gradient boosting algorithm) performed best. Our next best model was a decision forest, which builds a series of decision trees and averages their outputs to make a prediction. We can plot our predictions against our actual measured values to see how well our model performs.

The XGBoost model is a tighter fit around our “perfect line”, which is the straight line on the plot. Points falling on this line would have a perfect prediction with no error.

Both models seem to have the majority of their predictions above 100 falling below the line. This means our model is predicting lower levels than are actually measured. Below 100 the XGBoost model has a relatively balanced error (the neural network seems to have more points above the line). These plots are useful for judging how accurate our model will be when deployed in the real world.

For reference below are the PM2.5 values and their corresponding AQI values/categories. Using these categories we can posit that our XGBoost model will predict lower values (on average) than reality for higher AQI categories and higher values (on average) for lower AQI categories.

The AQI category ranges are much smaller for the Good, Moderate, and Unhealthy for Sensitive Groups categories. As such, with an RMSE of 26.73, we can expect our model to be less accurate at lower AQI categories.

# Going Forward

This article described the creation of our model, including the problem description, cleansing our data, feature engineering, and algorithm results. When making a machine learning model it is always somewhat unknown how it will perform in the real world. Overall our model showed acceptable results that were within our expectation of an RMSE below 30. Now we can move on to the deployment of the model.

Part 4 will detail the deployment of this model using Azure ML, our testing and monitoring architecture, and the future of the project.

Written by

## Mongolian Data Stories

#### Using data analysis, visualization, and machine learning to tell Mongolia’s story. Interested in writing for Mongolian Data Stories? Send an email to Robert Ritz (editor of MDS) at robertritz@outlook.com.
