Nowcasting Air Quality Using Social Media

Smoke billowing from a fire in an open field in Riau

How authorities prepare and respond to forest and peatland fires is crucial to prevent adverse effects on the local population. With satellite data and ground sensors to monitor these fires, the Indonesian National Board of Disaster Management has been improving its response and coordination efforts.

These efforts are constrained by the speed at which air quality information is disseminated; in some cases the local authorities only receive this information a few times a day. This gap inspired our data science team’s attempt to predict air quality information in near real-time using deep learning.

Our team is familiar with the subject of haze, having gained a considerable amount of knowledge while developing Haze Gazer. With the idea of developing a model that leverages real-time sensing to nowcast air quality levels, we identified four types of data to use:

  1. Air Quality Index — from the Indonesian authorities’ ground sensors,
  2. Fire Hotspot Data — from NASA’s satellites,
  3. Air Temperature — from US National Oceanic and Atmospheric Administration, and
  4. Social Media Photos — from Twitter and Instagram users.

These data sets cover eight months in 2014, and are unique to Pekanbaru city in Riau Province (a region that has experienced frequent episodes of forest and peatland fires over the years).

We generated 21 features based on these data sets, including measures of daily air temperature and the daily number of hotspots detected by sensors (the list of features is below; we’ll explain some of the terms later). To decide which features to use in the model, our team computed the Pearson correlation coefficient for each feature and kept only those that correlated significantly with the air quality index.
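A minimal Python sketch of this kind of correlation filter is shown here. The feature names, the toy values and the `min_abs_r` cut-off are assumptions made for illustration; the post does not state which threshold or significance test was used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(features, target, min_abs_r=0.5):
    """Keep features whose |r| with the target meets an assumed threshold."""
    selected = {}
    for name, values in features.items():
        r = pearson_r(values, target)
        if abs(r) >= min_abs_r:
            selected[name] = r
    return selected

# Toy daily values, invented for illustration only.
aqi = [50, 80, 120, 200, 310, 150]
features = {
    "hotspot_count": [1, 3, 6, 12, 20, 8],
    "unrelated_feature": [5, 1, 4, 2, 3, 6],  # hypothetical noise column
}
selected = select_features(features, aqi)
```

With these toy values only `hotspot_count` passes the filter. A fuller version would likely replace the fixed cut-off with a p-value test.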

Nowcasting Air Quality in Three Steps:

1. Classifying Outdoor Photos Shared on Social Media

This pre-processing step classifies the photos shared on social media. We used VGG-16, a deep learning image classifier trained on images from ImageNet and Places 365. The model was trained to categorise photos shared by social media users in Pekanbaru with a binary label: indoor or outdoor.
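The post does not describe how the binary label is derived from the classifier's output. One hypothetical way to do it, sketched here in Python, is to sum the predicted probability over scene categories that are known to be outdoor. The scene names, the `OUTDOOR_SCENES` set and the 0.5 threshold are all assumptions for illustration.

```python
# Hypothetical set of scene categories treated as outdoor.
OUTDOOR_SCENES = {"street", "field", "forest_path"}

def is_outdoor(scene_probs, outdoor_scenes=OUTDOOR_SCENES, threshold=0.5):
    """Return True when the share of probability on outdoor scenes meets the threshold."""
    total = sum(scene_probs.values())
    if total <= 0:
        raise ValueError("scene probabilities must sum to a positive value")
    outdoor_mass = sum(p for scene, p in scene_probs.items() if scene in outdoor_scenes)
    return outdoor_mass / total >= threshold

# Invented classifier outputs for two photos.
photo_a = {"street": 0.55, "field": 0.25, "living_room": 0.20}
photo_b = {"living_room": 0.70, "kitchen": 0.20, "street": 0.10}

labels = [is_outdoor(photo_a), is_outdoor(photo_b)]
```

Only photos labelled outdoor would then move on to the visibility step.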

2. Inferring Visibility Levels from Outdoor Photos Shared on Social Media

The visibility level is defined as the difference between the original and dehazed images. We trained DeHazeNet (an end-to-end system for single image haze removal) and AOD-Net (an image dehazing method) on original images together with computer-generated hazy images, which were synthesised by applying an atmospheric scattering model to the NYU-Depth V2 dataset. The trained models could then infer visibility levels from the images shared on social media.
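To make the two ideas in this step concrete, here is a small Python sketch. It uses the standard atmospheric scattering model, I = J * t + A * (1 - t) with transmission t = exp(-beta * d), to add synthetic haze to a clear image given a depth map, and then scores haze as the mean absolute pixel difference between a photo and its dehazed version. The choice of mean absolute difference, the parameter values and the tiny 2x2 "images" are assumptions for illustration. The clear image stands in for the output of a dehazing network.

```python
import math

def add_haze(clear, depth, beta=1.0, airlight=1.0):
    """Synthesise a hazy image: I = J * t + A * (1 - t), with t = exp(-beta * d)."""
    hazy = []
    for row_j, row_d in zip(clear, depth):
        hazy_row = []
        for j, d in zip(row_j, row_d):
            t = math.exp(-beta * d)  # transmission falls as scene depth grows
            hazy_row.append(j * t + airlight * (1.0 - t))
        hazy.append(hazy_row)
    return hazy

def haze_score(original, dehazed):
    """Mean absolute pixel difference between a photo and its dehazed version."""
    diffs = [
        abs(o - d)
        for row_o, row_d in zip(original, dehazed)
        for o, d in zip(row_o, row_d)
    ]
    return sum(diffs) / len(diffs)

# Grayscale intensities in [0, 1] and depths in arbitrary units (toy data).
clear = [[0.2, 0.4], [0.6, 0.8]]
depth = [[1.0, 2.0], [3.0, 4.0]]

light_haze = add_haze(clear, depth, beta=0.1)
heavy_haze = add_haze(clear, depth, beta=1.0)

# The clear image plays the role of a perfect dehazing result here.
light_score = haze_score(light_haze, clear)
heavy_score = haze_score(heavy_haze, clear)
```

In this toy setup the heavier haze gives the larger score, which is the property that lets the difference act as a proxy for visibility.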

Boxplot of haze inference results by AQI level

3. Producing A Near Real-time Air Quality Index Level

We built two models to test whether the visibility information from social media images made a difference. Both models were trained on historical air quality related data covering a few days: the first model (a baseline for air quality) included meteorological, satellite and air quality index data; the second model used the same data with the addition of social media images.
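The difference between the two inputs can be sketched in Python as follows. Each training row aggregates the previous `window` days of features and is paired with the current day's AQI level. Averaging as the aggregation, the field names, the level labels and the toy values are all assumptions for illustration; the post does not specify them.

```python
def build_rows(days, feature_names, window=7):
    """For each day, average every feature over the previous `window` days."""
    rows = []
    for i in range(window, len(days)):
        history = days[i - window:i]
        features = [
            sum(day[name] for day in history) / window
            for name in feature_names
        ]
        rows.append((features, days[i]["aqi_level"]))
    return rows

# Ten invented days of data; every field name here is hypothetical.
days = [
    {
        "temperature": 30 + i % 3,
        "hotspots": i,
        "aqi": 50 + 10 * i,
        "visibility": 0.1 * i,
        "aqi_level": "unhealthy" if i >= 5 else "moderate",
    }
    for i in range(10)
]

BASELINE_FEATURES = ["temperature", "hotspots", "aqi"]
WITH_IMAGES_FEATURES = BASELINE_FEATURES + ["visibility"]

baseline_rows = build_rows(days, BASELINE_FEATURES)
with_images_rows = build_rows(days, WITH_IMAGES_FEATURES)
```

Both sets of rows share the same targets, so any gap in accuracy between models trained on them can be attributed to the extra visibility column.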

Comparing the performance of both models

As visualised in the chart above, the model that integrates social media images consistently outperforms the baseline model. At best, the model with visibility information produces 87.24% forecast accuracy (using data aggregated from the previous 7 days), an improvement of 18.11% over the baseline model.

The team also developed a severity map by extracting spatio-temporal predictions from the model using social media images at the district level. The red and black colours represent heavily and severely polluted regions, respectively.

Map showing haze severity in near real-time between March 8th and March 9th in 2014 at the district level

Our data science team’s efforts at combining these conventional data sets with photos shared on social media demonstrate how social media images, with the smarts of deep learning, can be used to nowcast the AQI level with reasonable accuracy at the city level. This adds an important real-time resource that can potentially improve disaster management and mitigation efforts during haze crises.

Up next, we will explore the inclusion of other data sources that could improve the model’s accuracy. In the coming months we hope to test the model with haze cases in other cities, and see how it can ultimately be integrated into our Haze Gazer platform. Get in touch if you’re interested in nowcasting air quality in other cities or countries.


Pulse Lab Jakarta is grateful for the generous support from the Government of Australia.