Analyze Global Air Pollution Using Apache Spark & BigQuery

How to analyze global air pollution data on the cloud

Kapil Sreedharan
Analytics Vidhya
Published in
5 min readFeb 7, 2021


According to WHO, 7 million people die every year from exposure to fine particles in polluted air that lead to diseases such as stroke, heart disease, lung cancer, chronic obstructive pulmonary diseases, and respiratory infections, including pneumonia.

91% of the world’s population live in places where air quality exceeds WHO guideline limits.

Treemap of the worlds most polluted countries (pm2.5) 2020

Here we analyze global air quality data from openaq:

  • Extract and aggregate global historical air pollution data from openaq s3 bucket using Apache Spark
  • Find the worlds most polluted cities and countries (measured by PM2.5 levels)
  • Calculate monthly and yearly averages for air quality indicator
  • Visualize using Google Data Studio, create Treemaps and Geo charts
  • Load all of openaq data into Bigquery and find insights


Here we use the dataset from OpenAQ, publicly available on AWS s3.

OpenAQ is a non-profit organization empowering communities around the globe to clean their air by harmonizing, sharing, and using OpenAQ data.

OpenAQ platform measures the following pollutants:


In this project, we will analyze PM2. 5 (particles less than 2.5 micrometers in diameter) pollutant since it is the most hazardous*.

Life Is Too Short to RTFM: Source code

Infrastructure Setup

We will need AWS credentials to read from a public S3 bucket, as well as a GCP account to analyze the dataset on Google cloud.

  1. Create AWS IAM user credentials to read NOAA data in Amazon S3

Sign up with AWS free tier if you don’t have an AWS account, and then follow the instructions to Create IAM user

Grant your user AmazonS3ReadOnlyAccess

Create an access key for your user. (Note down your Access Key ID and Secret access key. We will provide these credentials to our spark job.)

2. Create a GCP account if you don’t have one

GCP Free Tier — Free Extended Trials and Always Free | Google Cloud

Once you signed up and logged in to the GCP console. Activate Cloud Shell by clicking on the icon highlighted in the below screenshot. Cloud Shell provides command-line access to a virtual machine instance, and this is where we are going to set up our use case. (Alternatively, you can also do this on your laptop terminal)

GCP Console

Google Cloud Shell

Now that we have our AWS and GCP credentials, let’s get started.

3. Set up your GCP Data Lake

To set up our GCP environment, we need to:

  • Create a project
  • Enable billing(will still be in the free tier)
  • Enable Dataproc and BigQuery Services
  • Create a google cloud storage bucket
  • Create a BigQuery Dataset
  • Create a Dataproc cluster

To set up your GCP data lake, execute the below commands on your google cloud shell

GCP service setup commands

Read Aggregate and Load OpenAQ data into BigQuery

  1. Build and package the spark application

Execute the below script in your cloud shell

Verify the generated application configuration file, this should contain your AWS key and secret:

cat $HOME/application.conf

Check your AWS key and secret. The start date and end date are the range of data set that we will extract from openAQ.

2. Submit Dataproc job to aggregate and load data into BigQuery

To submit spark job on Dataproc, execute the below command

It took 7 mins to complete the load and aggregation of 4 years worth of historical openaq air quality data.

Now that we have the data loaded into BigQuery, let's take a look.

OpenAQ Dataset insights

Go to your BigQuery console and review the newly created table

A preview of the dataset should look like this:

OpenAQ pm2.5 dataset

Most polluting cities in the world 2020(pm2.5)

SELECT city, country, yearly_avg, Jan, Feb, March, April, May, June, July, Aug, Sept, Oct, Nov, Dec
FROM `world-air-quality-openaq.openaq.pm25_global`
WHERE year = 2020
ORDER BY yearly_avg DESC
Top 100 polluted cities 2020(PM2.5)

Top polluting countries

Yearly averages for 2020 and 2019

TRUNC(AVG(IF(year = 2019, yearly_avg, NULL)), 2) AS `AVG_2019`,
TRUNC(AVG(IF(year = 2020, yearly_avg, NULL)), 2) AS `AVG_2020`
FROM `kapilsreed12-1dataflow.OpenAirAQ.pm25_global`
GROUP BY country

DataStudio Visualization:

Geo Chart PM2.5 yearly average 2020
Google Maps view of pm2.5 yearly average 2020
Treemap of the worlds most polluted countries 2020


Here is what we did:

  1. Using Apache Spark, we connected to the OpenAQ dataset hosted on AWS S3, applied aggregations, and enhanced the dataset with Geographical information. Finally, we loaded this dataset into BigQuery
  2. Gained insights on PM2.5 pollutant globally using BigQuery
  3. Visualized a few data points using Data Studio

The goal of this blog is to show how to use cloud and open-source tools to explore and gain insights from this large dataset.

Create and share your own insights.