Forecasting PM2.5 using AutoAI time series API with supporting features

Jun Wang
Jul 29, 2022 · 4 min read


AutoAI in IBM Cloud Pak for Data as a Service has recently introduced a new feature for time series data — supporting features, also known as exogenous features. Now you can use supporting features to provide context for the prediction. For example, if you are predicting sales revenue over time, you can include discounts and promotions as supporting features to make the predictions more accurate.

Let me demonstrate how you can use the IBM AutoAI Python API to easily include supporting features when you train a model using PM2.5 data to more accurately forecast air pollution levels for the next few days.

Setting up the environment

To work with AutoAI for time series, you must have a Watson Machine Learning service instance (included with the free plan for Cloud Pak for Data as a Service). Watson Machine Learning provides a Python interface via the ibm-watson-machine-learning package (available on PyPI). You can install the package by running the following pip command:

pip install ibm-watson-machine-learning

Next, you must provide authentication information to initialize the Python client. For example, with an IBM Cloud API key and your region's endpoint:

from ibm_watson_machine_learning import APIClient

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "***"
}
client = APIClient(credentials)

The time series data set

The original hourly data set contains PM2.5 readings from the US Embassy in Beijing, along with meteorological data from Beijing Capital International Airport. The prepared data set is formed by aggregating the hourly records into daily data that tracks the pollution from January 1, 2010 through November 30, 2014. There are 8 columns: date, pollution, dew, temp, press, wnd_spd (wind speed), snow, and rain. The date column is used to sort the time series data, pollution is the prediction column, and the remaining columns are included as supporting features to support the forecast.
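The hourly-to-daily aggregation can be sketched with pandas. The column names below follow the prepared data set; the values and the choice of a daily mean are illustrative assumptions, since the article does not state the exact aggregation rule used.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly frame with the same columns as the prepared data set.
rng = np.random.default_rng(0)
hours = pd.date_range("2010-01-01", periods=48, freq="h")
hourly = pd.DataFrame({
    "date": hours,
    "pollution": rng.uniform(0, 300, len(hours)),
    "dew": rng.uniform(-10, 20, len(hours)),
    "temp": rng.uniform(-5, 30, len(hours)),
    "press": rng.uniform(990, 1040, len(hours)),
    "wnd_spd": rng.uniform(0, 15, len(hours)),
    "snow": rng.integers(0, 2, len(hours)),
    "rain": rng.integers(0, 2, len(hours)),
})

# Collapse hourly records to one row per day (mean values here,
# as a stand-in for the actual preparation).
daily = (hourly.set_index("date")
               .resample("D").mean()
               .reset_index())
```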

Here is a visualization of this data set prepared using the plotly package.

We can use the Python client to upload the data to Cloud Object Storage, to make it available for modeling with AutoAI.
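A minimal sketch of that upload, assuming the daily data has been written to a local CSV first. The data_assets.create call requires an authenticated client against a live service, so it is shown commented out.

```python
import pandas as pd

# Hypothetical daily frame standing in for the prepared PM2.5 data.
daily = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=3, freq="D").astype(str),
    "pollution": [129.0, 148.5, 159.25],
})
daily.to_csv("pm25_daily.csv", index=False)

# With an authenticated client (see above), the file can be stored as a
# data asset backed by the project's Cloud Object Storage bucket:
# asset_details = client.data_assets.create(
#     name="pm25_daily.csv", file_path="pm25_daily.csv")
```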

AutoAI for time series

Using the Python API, we can easily define an AutoAI experiment for time series data. We will set the following parameters for our experiment’s optimizer:

· name — experiment name

· prediction_type — problem type; in this case, time series forecasting

· prediction_columns — names/indices of target columns

· timestamp_column_name — date/time column name/index

· feature_columns — names/indices of supporting feature columns

· forecast_window — future date/time range to be predicted

· holdout_size — number of holdout records

· lookback_window — past date/time range used for model training; -1 means auto-determined

· backtest_num — number of backtests

· supporting_features_at_forecasting — whether to leverage future values of supporting features; if future values (for example, temperature) are known, they can be included as part of the deployment payload data

· pipeline_types — specify an individual pipeline or a group of pipelines by type

Now, call the fit() method to start the training job.
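Putting the parameters together, here is a sketch of the optimizer definition and training call. The keyword names follow the parameter list above and may differ slightly in the released API; values such as holdout_size and backtest_num are illustrative, and the calls that need a live service are commented out.

```python
# Illustrative optimizer configuration for the PM2.5 data set.
optimizer_params = dict(
    name="PM2.5 forecasting with supporting features",
    prediction_type="forecasting",         # time series problem type
    prediction_columns=["pollution"],
    timestamp_column_name="date",
    feature_columns=["dew", "temp", "press",
                     "wnd_spd", "snow", "rain"],
    forecast_window=7,                     # predict the next 7 days
    holdout_size=20,                       # illustrative value
    lookback_window=-1,                    # -1: auto-determined
    backtest_num=4,                        # illustrative value
    supporting_features_at_forecasting=True,
)

# With an AutoAI experiment bound to a deployment space:
# from ibm_watson_machine_learning.experiment import AutoAI
# experiment = AutoAI(credentials, space_id=space_id)
# pipeline_optimizer = experiment.optimizer(**optimizer_params)
# pipeline_optimizer.fit(training_data_references=[data_connection],
#                        background_mode=False)
```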

After training is completed, we can list all of the candidate pipelines AutoAI trained for us.

We can retrieve each pipeline’s details by calling the get_pipeline_details() method.

Each pipeline’s details contain data for visualization. Here is a simple comparison of observed vs. predicted values on the holdout and backtest data sets.
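Such a comparison amounts to overlaying the two series and computing an error metric. A self-contained sketch with made-up holdout values (not the actual experiment output):

```python
import numpy as np

# Made-up observed and predicted PM2.5 values on a holdout window.
observed = np.array([120.0, 135.0, 150.0, 110.0, 95.0])
predicted = np.array([118.0, 140.0, 145.0, 115.0, 100.0])

# Mean absolute error and symmetric MAPE between the two series.
mae = np.mean(np.abs(observed - predicted))
smape = 100 * np.mean(2 * np.abs(predicted - observed)
                      / (np.abs(observed) + np.abs(predicted)))
print(f"MAE: {mae:.2f}, sMAPE: {smape:.2f}%")
```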

Pipeline_4 is the best model returned by AutoAI, so we will save Pipeline_4 as a model for deployment and scoring.

Deployment and scoring

In this section, we will deploy the best pipeline model as a web service (an online deployment). Then we will use the web service’s scoring endpoint to forecast PM2.5 for the next 7 days.

After we create the deployment, we can submit new input data and ask for predictions using the score() method.
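Because supporting features at forecasting were enabled, the scoring payload can carry future values of the supporting features for the 7-day window. A sketch of such a payload follows; the values are made up, the payload key names are an assumption based on typical Watson Machine Learning scoring formats, and the score call itself needs a live deployment, so it is commented out.

```python
import pandas as pd

# Made-up future supporting-feature values for the next 7 days.
future = pd.DataFrame({
    "date": pd.date_range("2014-12-01", periods=7, freq="D").astype(str),
    "dew": [-8, -7, -9, -6, -5, -7, -8],
    "temp": [1, 2, 0, 3, 4, 2, 1],
    "press": [1030, 1028, 1032, 1029, 1027, 1031, 1030],
    "wnd_spd": [10.5, 8.0, 12.3, 6.7, 9.1, 11.0, 7.4],
    "snow": [0, 0, 1, 0, 0, 0, 0],
    "rain": [0, 0, 0, 0, 0, 0, 0],
})

# With a deployed web service (e.g. created via
# ibm_watson_machine_learning.deployment.WebService):
# predictions = service.score(payload={
#     "observations": observations_df,   # recent observed rows
#     "supporting_features": future,     # known future feature values
# })
```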

We can visualize the next 7 days forecasting values together with the observed vs. predicted values on holdout data for a better comparison.

Based on the prediction, we can expect PM2.5 to be around 130 over the next week.

If we get new observations, say for the next week, we can then forecast the week after that.

Go to IBM Cloud and check this new feature out.
You can also find sample AutoAI notebooks here.

Thanks to Julianne Forgo


Jun Wang

Architect, Data Scientist and Master Inventor @IBM Data&AI