FindMyAQI: Geolocation-based Air Quality Index Prediction

Kevin Hare
13 min read · Dec 13, 2021


Kevin Hare, Mark Penrod, and Sivananda Rajananda

This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.

Air quality is an important determinant of quality of life. High-resolution estimates of air quality allow individuals to take protective measures and assist policymakers in providing local guidance in many situations, including seasonal environmental variations (e.g. pollen), industrial impacts (e.g. pollution), and poor air quality resulting from the increasing incidence of forest fires. However, these estimates are often inaccessible: air quality monitors can be expensive for personal use and are not portable. Moreover, even if an individual were to carry a personal monitor, it could only report conditions at a given moment in time. By the time the air quality index (AQI) is measured, preventive measures may be more difficult to take.

Fortunately, the EPA monitors and records high-fidelity measures of the air quality across thousands of stations in the United States. This is a rich data set containing daily observations from over a thousand sites, across eight common pollutants and meteorological factors for nearly forty years. Additionally, those estimates are publicly available (albeit with a lag) and thus easily accessible.

From a technical perspective, deep learning and transfer learning have begun to be applied to time series. Because of the sequential nature of the data, recurrent neural networks have shown great promise [see e.g. Hewamalage et al. (2020)]¹, and novel attention-based methods, originally developed for natural language tasks, have been applied as well [Wu et al. (2020)]². Convolutional networks have also been extended to the time domain: both one-dimensional and multi-dimensional convolutions can extract representations from patterns in time series, just as they succeeded in extracting image representations. Finally, probabilistic forecasting methods are core to the science of time series, as modelers value not only accurate estimates but also a reasonable range of possible outcomes.

In this project, we unite these two principles by developing a scalable application — FindMyAQI — that leverages state-of-the-art deep learning models to forecast air quality for addresses across the United States. Below we highlight our application development process and then walk through the Data & Models, Frontend UI, and Deployment components of the project.

Application Development Process

A summary of the application development process, which took place almost entirely in the cloud, is shown below:

Fig. 1: FindMyAQI cloud-based development process

Data

For this project we leveraged time series data from the US Environmental Protection Agency (EPA). The data is aggregated by day, contains various measures (e.g. concentrations of particulates, meteorological information, toxics, and Air Quality Indices), and spans records from 1980 to the present.

Collection sites distributed across the country gather the air particulate data, though the frequency of collection varies by particulate type. More densely populated regions host a higher concentration of sites, as shown in the map below. Note that the points in the lower right of the map denote sites in Puerto Rico and the U.S. Virgin Islands.

Fig. 2: Map of Air Quality Monitoring Locations in United States

Different sites collect different pollutants, and the collection of each pollutant varies in consistency. In the chart below we plot the pollutant data from a single randomly chosen site in 2020. There are a total of 7 pollutant measurements in the dataset (Carbon monoxide, Nitrogen dioxide (NO2), Ozone, PM2.5 — Local Conditions, Acceptable PM2.5 AQI & Speciation Mass, PM10 Total 0–10um STP, and Sulfur dioxide), and this particular site collected 6 of them. We also observe that PM2.5 has fewer data points than the other pollutants (as indicated by the straight line segments that span many days).

Fig. 3: Sample data for a given air quality monitoring station, highlighting variability in measurements.

The figure below shows the AQI distributions for each of the pollutants. Most pollutants have concentrated peaks at low AQI values, with PM10 the exception, showing a roughly uniform distribution of AQI values from Good (0–50) to Unhealthy (150–200).

Fig. 4: Histogram of AQI values by pollutant

For our modeling, we opted to predict Carbon monoxide, Sulfur dioxide, Ozone, and PM2.5, as these pollutants have more consistent and reliable data across collection sites.

Models

To develop FindMyAQI, we investigated a number of deep learning-based models for time series. Many traditional time series models rely on autoregressive forecasting and statistical modeling of noise processes.³ Deep learning is a relative newcomer in this area, so we opted to explore a variety of model frameworks for this problem. To this end, we constructed one baseline model, five distinct point-estimate forecasting models, and lastly a single probabilistic model. For each of these models, we evaluate performance based on the mean absolute percentage error (MAPE) over a holdout set.⁴
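
For reference, here is a minimal sketch of the MAPE computation we describe; the small epsilon guarding against zero AQI values is our own illustrative addition, not necessarily how the metric was implemented in the project.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-8) -> float:
    """Mean absolute percentage error, averaged over all series and steps.

    The eps term guards against division by zero for days with an AQI of 0;
    how such cases are actually handled is an implementation detail.
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100.0)
```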

Because this is time series forecasting, we divided the data sequentially by time window in order to generate the necessary train/validation/test splits for training, hyperparameter tuning, and model evaluation. One subtlety about this data is that it consists of measurements from many different in-situ monitors. Thus, we divided the data for each monitor into training, validation, and test sets by time period and created datasets from those disaggregated components.

This approach mitigates the potential data leakage of a random split, or even a split stratified by collection site. Specifically, the risk of data leakage arises from the high correlation between geographically proximate collection sites: there is more than one site within the city of Boston alone, all of which are likely to report similar values. If data from one site appears in the training set and data from a neighboring site appears in the validation set, both covering the same time period, then the tuning and evaluation process may be biased toward overfitting. A sketch of the per-monitor chronological split is shown below.
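
The column names and split fractions here are illustrative stand-ins, not our exact configuration:

```python
import pandas as pd

def split_by_time(df: pd.DataFrame, train_frac: float = 0.7, val_frac: float = 0.15):
    """Chronologically split a single monitor's record into train/val/test."""
    df = df.sort_values("date")  # hypothetical date column
    n = len(df)
    i, j = int(n * train_frac), int(n * (train_frac + val_frac))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]

# Split each monitor independently, then pool the per-monitor pieces
# into the final train/validation/test datasets:
# splits = {site: split_by_time(g) for site, g in data.groupby("site_id")}
```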

In addition to the considerations discussed above, we also set a list of criteria that a satisfactory model should meet. First, it should beat naïve but realistic baselines. Expecting tomorrow’s air quality to be much like today’s is a reasonable prior for a user of our application, so our models should beat that standard. Second, the model should reflect the time-dependence relationships within the data. Finally, we hope to have a model that produces a probabilistic forecast, as we expect users will value knowing the uncertainty of the predictions and the range of possible outcomes.

Persistence Model

Our first baseline was a naïve persistence model. Intuitively, the persistence model takes today’s AQI for each pollutant (i.e. the final values of the training sequence) and forecasts that value into the future. While this is of course naïve, any model that cannot beat such a simplistic heuristic will not be useful. To facilitate comparable model development and tracking in the cloud, we implemented this model under the standard Keras framework that we subsequently used to develop the more complex models.
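
A minimal sketch of such a persistence baseline under Keras; the forecast horizon and tensor shapes here are illustrative:

```python
import tensorflow as tf

class PersistenceModel(tf.keras.Model):
    """Forecasts by repeating the last observed value of each series."""

    def __init__(self, out_steps: int = 5):
        super().__init__()
        self.out_steps = out_steps

    def call(self, inputs):
        # inputs: (batch, time, features); take today's values...
        last = inputs[:, -1:, :]
        # ...and tile them across every step of the forecast horizon.
        return tf.tile(last, [1, self.out_steps, 1])
```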

Single-Shot LSTM

Much research demonstrates the quality of LSTM models for prediction with univariate and multivariate time series data. Our review of the relevant literature showed a preference for LSTMs over other RNN-based architectures such as vanilla RNNs or GRUs (for example, see Thai-Nghe and Thanh-Hai (2020)).⁵ One potential pitfall of the LSTM architecture is its high parameter count compared to vanilla RNNs or GRUs. However, given the large number of samples in our training dataset, we believe there is sufficient data to fit an LSTM. In our single-shot formulation, we pass the result of the LSTM through a dense layer whose number of nodes equals the number of series multiplied by the output length (in our case, 20). As such, each output is fit directly and parameterized separately.
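
A sketch of the single-shot formulation in Keras; the hidden size of 64 is illustrative, while the 20 output nodes follow from our four pollutants times the five-day horizon:

```python
import tensorflow as tf

N_SERIES, OUT_STEPS = 4, 5  # four pollutants, five-day forecast horizon

single_shot = tf.keras.Sequential([
    # Encode the full input window into one hidden representation.
    tf.keras.layers.LSTM(64),
    # One node per (series, step) pair: 4 * 5 = 20 outputs, fit directly.
    tf.keras.layers.Dense(N_SERIES * OUT_STEPS),
    tf.keras.layers.Reshape([OUT_STEPS, N_SERIES]),
])
```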

Temporal Convolutional Network

The temporal convolutional network (TCN) is a time-series-focused architecture derived from CNNs (Lara-Benitez et al., 2021; Bai et al., 2018⁶). In particular, TCNs offer two advantages relative to baseline CNNs. First, the convolutions are causal, meaning each output depends only on current and past inputs, so no information from the future leaks into the learned representation. Second, TCNs can be extended to multi-step output, whereas a traditional CNN would be restricted to a single forecast step. Thus, while CNNs may be promising for classification, our multi-step forecasting problem is well suited to a TCN.
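
Below is a minimal causal-convolution stack in the spirit of a TCN; the filter counts and dilation rates are illustrative, and a full TCN as in Bai et al. (2018) also adds residual blocks:

```python
import tensorflow as tf

tcn = tf.keras.Sequential([
    # 'causal' padding ensures each output sees only current and past inputs.
    tf.keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=1, activation="relu"),
    tf.keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=2, activation="relu"),
    tf.keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=4, activation="relu"),
    tf.keras.layers.Lambda(lambda x: x[:, -1, :]),  # representation at the last step
    tf.keras.layers.Dense(4 * 5),                   # multi-step, multi-series output
    tf.keras.layers.Reshape([5, 4]),
])
```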

Seq2Seq LSTM

One limitation of single-shot models for multi-step time series forecasting is that they must encode a single representation that accounts for each and every time step in the output. Much of the success of autoregressive models for multi-step forecasting comes from their ability to iteratively take outputs produced by the model and repurpose them as inputs. This is strikingly similar to the motivation behind Seq2Seq architectures, first developed by Sutskever et al. (2014).⁷ Thus, we sought to improve upon the single-shot models by implementing an architecture similar to our Single-Shot LSTM but introducing an encoder and decoder.⁸
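
A sketch of the encoder-decoder wiring in Keras; the layer sizes and the decoder input scheme (e.g. teacher forcing during training) are illustrative rather than our exact implementation:

```python
import tensorflow as tf

N_SERIES, OUT_STEPS, UNITS = 4, 5, 64

# Encoder: compress the input window into final hidden and cell states.
enc_in = tf.keras.Input(shape=(None, N_SERIES))
_, h, c = tf.keras.layers.LSTM(UNITS, return_state=True)(enc_in)

# Decoder: unrolls over the forecast horizon, seeded with the encoder states.
dec_in = tf.keras.Input(shape=(OUT_STEPS, N_SERIES))
dec_seq = tf.keras.layers.LSTM(UNITS, return_sequences=True)(dec_in, initial_state=[h, c])
preds = tf.keras.layers.Dense(N_SERIES)(dec_seq)

seq2seq = tf.keras.Model([enc_in, dec_in], preds)
```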

DeepAR

Despite the potential success of the Seq2Seq model, estimating the uncertainty of its outputs is still difficult. Because our task is regression rather than classification, there are fewer tools at our disposal for reporting the confidence of our predictions beyond costly bootstrapping. Fortunately, DeepAR, a novel deep learning architecture for time series, combines the Seq2Seq LSTM described above with probabilistic sampling during inference (Salinas et al. (2019)).⁹ In a traditional Seq2Seq model, the outputs of the encoder — both the hidden state and the final prediction — are fed through the decoder step by step. There are two key changes for DeepAR:

  1. Rather than predict a single output at each time step, DeepAR parameterizes the mean and standard deviation of a Gaussian distribution. As a result, the loss calculation transforms from comparing the true and predicted values to evaluating the negative log likelihood of the true value, assuming the true outputs follow the proposed distribution.¹⁰
  2. By using ancestral sampling, DeepAR produces probabilistic estimates at each time step. At each decoder step, given the input and hidden state, the output is a distribution. We sample from that distribution and use the sample, in a walk-forward manner, as the input to the next step. Repeating this process produces a range of samples for each time step.

The advantage of this method is twofold. In a vanilla Seq2Seq, if the AQI prediction at t + 1 for a given pollutant is 50 and the true value is 30, this error will be propagated through all subsequent outputs. If, on the other hand, the output is a Gaussian distribution centered at 50 with a standard deviation of 20, then we would expect to capture the true value quite often. By repeating the walk-forward sampling, we obtain a set of realistic paths the data could follow.

Fig 5: DeepAR process, from Salinas et al. (2019)
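
To make the two changes concrete, here is an illustrative sketch of the Gaussian negative log likelihood and the walk-forward sampling loop; `step_fn` stands in for one decoder step, and all names here are ours, not the DeepAR reference implementation:

```python
import tensorflow as tf

def gaussian_nll(y_true, mu, sigma):
    # Negative log likelihood of y_true under N(mu, sigma^2); the constant
    # 0.5*log(2*pi) is dropped since it does not affect optimization.
    # sigma is assumed positivity-constrained upstream (e.g. via softplus).
    return tf.reduce_mean(tf.math.log(sigma) + 0.5 * tf.square((y_true - mu) / sigma))

def ancestral_sample(step_fn, state, last_obs, horizon, n_paths=100):
    """Walk forward `horizon` steps, feeding each Gaussian sample back in."""
    paths = []
    for _ in range(n_paths):
        x, s, path = last_obs, state, []
        for _ in range(horizon):
            mu, sigma, s = step_fn(x, s)  # one decoder step -> distribution params
            x = tf.random.normal(tf.shape(mu), mean=mu, stddev=sigma)
            path.append(x)
        paths.append(tf.stack(path))
    return tf.stack(paths)  # (n_paths, horizon, ...): a set of sampled futures
```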

For hyperparameter tuning, we considered the size of the hidden dimension of the LSTM as well as the dropout rate for the dense layers within the LSTM cell.¹¹ Ultimately, we found the optimal combination to be an LSTM dimension of 128 and 30% dropout. However, we note that the results were not particularly sensitive to the hyperparameters.

Results

Below we present the results of our initial models. For comparison between the probabilistic and point-estimate models, we rely on the median of the samples from the probabilistic model. As can be seen below, DeepAR significantly outperforms the point-estimate models, validating the approach of using a probabilistic architecture. One potential weakness of this method, however, is that the median may exhibit some shrinkage relative to the single-shot predictions. As the development process for our application is dynamic, we hope to evaluate other probabilistic frameworks in a similar fashion.

Fig 6: Results over validation and test set for each model

Frontend & UI

Fig 7: Demonstration of FindMyAQI UI.

FindMyAQI’s user interface follows a simple HTML/Javascript framework and leverages the D3 package for the time series visualizations. When users come to the site, they are invited to enter their desired address, which they may select from a set of suggestions served by the Google Maps API or simply type in. In the backend, Google Maps identifies the coordinates of the provided address and passes them to our API, which in turn finds the nearest collection site and initiates model prediction. The results are then served in two forms. First, the plot on the left shows the AQI trend for the next five days, along with a shaded uncertainty region, for each pollutant. On the right, the user can find the precise median AQI point estimates for each day and each pollutant.

Deployment

Fig 8: Schematic of Kubernetes deployment architecture

Deployment of the app is carried out via a number of components as outlined below:

1. Google Cloud Storage

Google Cloud Storage stores the trained model, training data, and scalers used to scale the data for training. This allows us to seamlessly pull the models and data from Google Cloud Storage to the Virtual Machines on Google Cloud Platform that will run the web app.
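
A minimal sketch of pulling these artifacts at container startup; the bucket and blob names below are placeholders, not our actual paths:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("findmyaqi-artifacts")  # hypothetical bucket name

# Download the trained model and the fitted scaler to the local filesystem.
for blob_name, local_path in [
    ("models/best_model.h5", "/app/best_model.h5"),
    ("scalers/scaler.pkl", "/app/scaler.pkl"),
]:
    bucket.blob(blob_name).download_to_filename(local_path)
```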

2. API Service

The API Service is a Docker container that handles the backend operations of the app. On startup, a script creates a tracker service that downloads the models, scaler, and data. This tracker service also periodically checks whether there are new models in the GCP bucket and downloads the best model if there are any changes. This ensures that any improvements to the model are picked up automatically and the best model is served as soon as possible.

The API Service also receives prediction requests from the frontend container via a POST method (the request includes the coordinates of the location to predict). It then pulls the relevant data from the Google Cloud Storage bucket based on the closest data collection site, scales the data, runs it through the best model to get the prediction, and finally returns the predictions to the frontend container.
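
A rough sketch of what such an endpoint could look like; the route name, payload shape, and helper functions are all hypothetical stand-ins for our actual implementation:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    coords = request.get_json()  # e.g. {"lat": 42.36, "lon": -71.06}
    # find_nearest_site, load_recent_window, scaler, and best_model are
    # assumed to be defined elsewhere in the service; placeholders here.
    site = find_nearest_site(coords["lat"], coords["lon"])
    window = load_recent_window(site)      # recent observations for that site
    scaled = scaler.transform(window)      # same scaler used during training
    forecast = best_model.predict(scaled[None, ...])
    return jsonify(forecast.tolist())
```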

3. Frontend

The frontend runs in a separate Docker container. When deployed with Kubernetes, the frontend is routed via NGINX so that it can be accessed via a URL. The Google Maps API is integrated into the frontend so that the user can key in an address to search for a location. The coordinates of the location are returned by the Google Maps API and then sent to the model via a POST request to the API service described above. The AQI predictions are returned to the frontend app, and D3 is used to create the animated chart displaying the predictions for the four pollutants over the next five days. A table is also created to show the specific predicted values.

4. Ansible

Ansible is used to deploy the web app on Google Cloud Platform from a local machine. After images of the containers are built and pushed to Google Container Registry, Ansible is used to spin up the VMs on GCP, including setting up HTTP rules and opening ports. The dev server is also provisioned on GCP by installing Docker on the VMs and ensuring that the environment can deploy the containers. Once the environment is set up, Ansible deploys the containers on GCP and serves the web app. Finally, the app is deployed to a Kubernetes cluster, wherein the containers are orchestrated for efficiency and scalability.

5. Kubernetes

Kubernetes deployment follows directly from the previously described Ansible deployment, only with a different target host. Specifically, the two together spin up a cluster of VMs, each containing a collection of pods, which Kubernetes orchestrates to manage load balancing and scheduling, among other tasks. This framework allows for greater scalability and ensures greater robustness to system failures. Both the frontend and API containers for FindMyAQI are hosted within a single cluster, with the services distributed among individual pods.

Conclusions

As a part of this project, we have accomplished three main goals. First, we successfully implemented a distributed data pipeline using Dask to process large volumes of air quality data. Second, we experimented with state-of-the-art deep learning models for time series, namely DeepAR. This required extensive research and custom implementations, engaging with the research frontier in this area. Finally, we were able to take the first two components and successfully package them into a working application, deploying it in a scalable manner. While of course we do not expect enormous volumes of users from the start, our Kubernetes deployment ensures that many instances of the API service and frontend can be spun up to serve an increasing number of users.
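
As a flavor of the kind of distributed preprocessing the Dask pipeline enables (the file pattern and column names below are placeholders for the EPA daily files, and reading from GCS assumes gcsfs is installed):

```python
import dask.dataframe as dd

# Lazily read the daily EPA files in parallel across workers.
df = dd.read_csv("gs://findmyaqi-data/daily_*.csv", parse_dates=["date"])

# Keep the four modeled pollutants and compute a per-site daily mean AQI.
pollutants = ["Carbon monoxide", "Sulfur dioxide", "Ozone", "PM2.5"]
daily = (df[df["pollutant"].isin(pollutants)]
         .groupby(["site_id", "date"])["aqi"].mean())

result = daily.compute()  # executes the distributed task graph
```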

From a process perspective, our development lifecycle has facilitated experimentation with models and data sources. By working with automated, containerized deployments, we are able to develop modeling solutions locally while emulating the cloud-based environment. Thus, as future areas for development, we can continue to refine our models as well as deploy additional frontend components to enhance the user experience.

Footnotes & References

  1. Hewamalage, Hansika, Christoph Bergmeir, and Kasun Bandara, “Recurrent Neural Networks for Time Series Forecasting: Current Status and Future Directions,” 2020, arXiv preprint, available at https://arxiv.org/pdf/1909.00590.pdf.
  2. Wu, Neo, Bradley Green, Xue Ben, and Shawn O’Banion, “Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case,” 2020, arXiv preprint, available at https://arxiv.org/pdf/2001.08317.pdf.
  3. Lim, Bryan and Stefan Zohren, “Time-series forecasting with deep learning: a survey” Philosophical Transactions of the Royal Society: Mathematical, Physical, and Engineering Sciences, Feb. 12, 2021, available at https://royalsocietypublishing.org/doi/10.1098/rsta.2020.0209.
  4. Lara-Benitez, Pedro, Manuel Carranza-Garcia, and Jose C. Riquelme, “An Experimental Review on Deep Learning Architectures for Time Series Forecasting,” International Journal of Neural Systems, Vol. 31, No. 3 (2021).
  5. Thai-Nghe, Nguyen and Nguyen Thanh-Hai, “Forecasting Sensor Data Using Multivariate Time Series Deep Learning,” International Conference on Future Data and Security Engineering, November 2020.
  6. Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun, “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling,” arXiv pre-print, 2018. Available at https://arxiv.org/pdf/1803.01271.pdf.
  7. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” In Advances in neural information processing systems, pp. 3104–3112. 2014.
  8. Conceptually, the Single-Shot LSTM can be viewed as an encoder and then a dense layer on top of the encoder.
  9. Salinas, David, Valentin Flunkert, and Jan Gasthaus, “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks,” arXiv preprint, February 2019. Available at https://arxiv.org/abs/1704.04110.
  10. DeepAR is highly flexible and not limited to only Gaussian distributions. In fact, in the original paper, Salinas et al. (2019) demonstrate the negative binomial case for count distribution. Because each parameter is simply a dense neural network layer, this can be adapted to fit an arbitrary number of distributions as long as the log likelihood is able to be computed through the deep-learning framework of your choice.
  11. We elected to treat the three-level stacked LSTM architecture intact from Salinas et al. (2019) as all parameterizations and time series produced similar results along this hyperparameter.
