Predicting Indoor Air Quality

Jerry Liu
Lion IQ
4 min read · Jun 30, 2017

Air quality forecasting, like weather forecasting, is incredibly complex. This blog post explores a deep learning approach to forecasting indoor PM2.5 air quality with a type of recurrent neural network called Long Short-Term Memory (LSTM).

GAMS provides indoor air quality monitoring for commercial venues such as schools and offices. Their monitors record indoor air metrics such as PM2.5 and CO2 and send them to their backend roughly every minute. IoT data is interesting because it is naturally data-rich, regular (numeric floats), and automatically framed as a time series.

Our project with GAMS was to build a prototype that predicts PM2.5 in the near future, e.g. look at the last N hours and predict the next hour's measurement. Typically, a data scientist might approach the problem by exploring correlations between the columns. Our approach instead models air quality measurements as a time series, using recent history to predict near-future measurements, i.e. predict X(i) given [X(i-1), X(i-2), X(i-3), …, X(i-k)].

Long Short-Term Memory Networks

A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained: they accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes) [1]. Recurrent neural networks (RNNs), by contrast, allow us to operate over a sequence of inputs.

Long Short Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used. [2]
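
For intuition, here is a minimal numpy sketch of a single LSTM step. This is an illustrative reimplementation, not what Keras actually runs, and the gate ordering and weight layout are assumptions made for exposition:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: returns the new hidden and cell states."""
    n = h_prev.shape[0]
    z = np.concatenate([h_prev, x])  # combine previous state with current input
    gates = W @ z + b                # W: (4n, n + x.size), b: (4n,)
    f = sigmoid(gates[0*n:1*n])      # forget gate: what to drop from memory
    i = sigmoid(gates[1*n:2*n])      # input gate: what to write to memory
    o = sigmoid(gates[2*n:3*n])      # output gate: what to expose downstream
    g = np.tanh(gates[3*n:4*n])      # candidate values to write
    c = f * c_prev + i * g           # updated long-term cell state
    h = o * np.tanh(c)               # new hidden state (this step's output)
    return h, c

The cell state c is what lets the network carry information across many time steps; the gates learn when to keep, overwrite, or reveal it.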

Dataset

The dataset [3] consists of several air quality measurements recorded by our IoT device in an undisclosed office space in Shanghai, as well as public outdoor air quality readings from a nearby US Consulate station.

The dataset is described in detail in this blog post.

Prototyping

Before we feed our data into the LSTM, we need to “roll up” the data. Suppose that at time step i we use historical data from i back to i-k to predict the measurement at time i+n. Our input data therefore needs, for each row i, an array of tensors [X(i), X(i-1), …, X(i-k)], and our labels need, for each row i, the target value X(i+n).

Let’s use a simple example: look back at the past 3 hours of data to predict PM2.5 one hour into the future. (Note: we’ll want to do some kind of backward and forward filling first to avoid NaNs.)

import numpy as np

def preprocess(features, labels, timesteps=3, ahead=1):
    """Roll rows up into (samples, timesteps, features) windows,
    pairing each window with the label `ahead` steps past its end."""
    dataX, dataY = [], []
    for i in range(len(features) - timesteps - ahead + 1):
        dataX.append(features[i:(i + timesteps), :])
        dataY.append(labels[i + timesteps + ahead - 1])
    return np.array(dataX), np.array(dataY)

# generate data for input to the LSTM
labels = dataset[:, 3]  # suppose pm25 is in column 3
X_train, Y_train = preprocess(dataset, labels, timesteps=3, ahead=1)
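
On the NaN note above, a minimal pandas sketch of the fill step, assuming the raw readings arrive in a DataFrame named df (a name used here just for illustration):

import pandas as pd

# forward-fill sensor gaps, then back-fill any NaNs left at the start
df = df.ffill().bfill()
dataset = df.values  # hand a plain numpy array to preprocess()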

We can then feed the preprocessed inputs into an LSTM. With Keras, the model is something like this:

from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
# input_shape is the per-sample shape (timesteps, features), without the batch axis
model.add(LSTM(8, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
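
Training and prediction are then one-liners. The epoch and batch-size values below are illustrative, and X_test is assumed to come from running the same preprocess step on held-out data:

model.fit(X_train, Y_train, epochs=50, batch_size=32,
          validation_split=0.1, verbose=2)
predictions = model.predict(X_test)  # one PM2.5 estimate per window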

There are some hyperparameters to play with, and the network can be made more sophisticated by “stacking” LSTM layers on top of each other, as sketched below. For our purpose of predicting X_pm25(i+1) given X(i), X(i-1), X(i-2), a simple network captures some of the sequential structure in the data (and, more importantly, runs on a CPU just fine).
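
A minimal sketch of the stacked variant (the layer sizes are arbitrary): the key detail is that every LSTM layer except the last must return its full output sequence, so the next layer has a sequence to consume.

from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(16, return_sequences=True,  # emit the whole sequence for the next layer
               input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(LSTM(8))                         # consumes that sequence, emits one vector
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')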

What’s next

The above experiment outlines our prototype for deep-learning-powered predictive analytics on IoT data. There’s still a lot of work left to go from a Jupyter notebook plot to an actual product!

  • Data loaders to read from, and write predictions back to, production data stores, e.g. InfluxDB or Spark.
  • Model loaders and continuous updates
  • Configurable models, to allow us to run multiple models with different parameters
  • Monitoring to evaluate models — TensorBoard is a good candidate.
  • Asynchronous pipelines, so we can run multiple models in production
  • Deployment and containers — this is underrated and seriously non-trivial for deep learning platforms.

References

[1] Andrej Karpathy, “The Unreasonable Effectiveness of Recurrent Neural Networks”, karpathy.github.io, 2015.
[2] Christopher Olah, “Understanding LSTM Networks”, colah.github.io, 2015.
[3] The GAMS indoor/outdoor air quality dataset, described in detail in the blog post linked above.

Originally published at lioniq.wordpress.com on June 30, 2017.
