Analyze Global Air Pollution Using Apache Spark & BigQuery
How to analyze global air pollution data on the cloud
--
According to WHO, 7 million people die every year from exposure to fine particles in polluted air that lead to diseases such as stroke, heart disease, lung cancer, chronic obstructive pulmonary diseases, and respiratory infections, including pneumonia.
91% of the world’s population live in places where air quality exceeds WHO guideline limits.
Here we analyze global air quality data from openaq:
- Extract and aggregate global historical air pollution data from openaq s3 bucket using Apache Spark
- Find the worlds most polluted cities and countries (measured by PM2.5 levels)
- Calculate monthly and yearly averages for air quality indicator
- Visualize using Google Data Studio, create Treemaps and Geo charts
- Load all of openaq data into Bigquery and find insights
Dataset
Here we use the dataset from OpenAQ, publicly available on AWS s3.
OpenAQ is a non-profit organization empowering communities around the globe to clean their air by harmonizing, sharing, and using OpenAQ data.
OpenAQ platform measures the following pollutants:
In this project, we will analyze PM2. 5 (particles less than 2.5 micrometers in diameter) pollutant since it is the most hazardous*.
Life Is Too Short to RTFM: Source code
Infrastructure Setup
We will need AWS credentials to read from a public S3 bucket, as well as a GCP account to analyze the dataset on Google cloud.
- Create AWS IAM user credentials to read NOAA data in Amazon S3
Sign up with AWS free tier if you don’t have an AWS account, and then follow the instructions to Create IAM user
Grant your user AmazonS3ReadOnlyAccess
Create an access key for your user. (Note down your Access Key ID and Secret access key. We will provide these credentials to our spark job.)
2. Create a GCP account if you don’t have one
GCP Free Tier — Free Extended Trials and Always Free | Google Cloud
Once you signed up and logged in to the GCP console. Activate Cloud Shell by clicking on the icon highlighted in the below screenshot. Cloud Shell provides command-line access to a virtual machine instance, and this is where we are going to set up our use case. (Alternatively, you can also do this on your laptop terminal)
GCP Console
Google Cloud Shell
Now that we have our AWS and GCP credentials, let’s get started.
3. Set up your GCP Data Lake
To set up our GCP environment, we need to:
- Create a project
- Enable billing(will still be in the free tier)
- Enable Dataproc and BigQuery Services
- Create a google cloud storage bucket
- Create a BigQuery Dataset
- Create a Dataproc cluster
To set up your GCP data lake, execute the below commands on your google cloud shell
Read Aggregate and Load OpenAQ data into BigQuery
- Build and package the spark application
Execute the below script in your cloud shell
Verify the generated application configuration file, this should contain your AWS key and secret:
cat $HOME/application.conf
Check your AWS key and secret. The start date and end date are the range of data set that we will extract from openAQ.
2. Submit Dataproc job to aggregate and load data into BigQuery
To submit spark job on Dataproc, execute the below command
It took 7 mins to complete the load and aggregation of 4 years worth of historical openaq air quality data.
Now that we have the data loaded into BigQuery, let's take a look.
OpenAQ Dataset insights
Go to your BigQuery console and review the newly created table
A preview of the dataset should look like this:
Most polluting cities in the world 2020(pm2.5)
SELECT city, country, yearly_avg, Jan, Feb, March, April, May, June, July, Aug, Sept, Oct, Nov, Dec
FROM `world-air-quality-openaq.openaq.pm25_global`
WHERE year = 2020
ORDER BY yearly_avg DESC
Top polluting countries
Yearly averages for 2020 and 2019
SELECT
country,
TRUNC(AVG(IF(year = 2019, yearly_avg, NULL)), 2) AS `AVG_2019`,
TRUNC(AVG(IF(year = 2020, yearly_avg, NULL)), 2) AS `AVG_2020`
FROM `kapilsreed12-1dataflow.OpenAirAQ.pm25_global`
GROUP BY country
HAVING AVG_2020 IS NOT NULL AND AVG_2019 IS NOT NULL
ORDER BY AVG_2020 DESC
DataStudio Visualization:
Recap
Here is what we did:
- Using Apache Spark, we connected to the OpenAQ dataset hosted on AWS S3, applied aggregations, and enhanced the dataset with Geographical information. Finally, we loaded this dataset into BigQuery
- Gained insights on PM2.5 pollutant globally using BigQuery
- Visualized a few data points using Data Studio
The goal of this blog is to show how to use cloud and open-source tools to explore and gain insights from this large dataset.
Create and share your own insights.