Explore & Visualize 200+ Years of Global Temperature
Visualize observable changes in global temperature using NOAA’s historical weather data, Apache Spark, BigQuery, and Data Studio
--
We have all read about and experienced the effects of climate change around us every day. We have seen numbers like these: the current global average temperature is 0.85ºC higher than it was in the late 19th century, and each of the past three decades has been warmer than any preceding decade since records began in 1850*.
I got curious about how climatologists determine these numbers. There is a whole lot of research going on in this area, and I came across one important weather dataset from NOAA that is widely used in it.
In this blog, I will explain how I performed some simple analysis and visualized important patterns from NOAA’s historical weather data.
There are lots of insights that can be gained from this dataset. Keep reading to learn how you can do your own weather analyses using free cloud resources and open source technologies.
Prerequisites:
We will need AWS credentials to read from a public S3 bucket, as well as a GCP account to analyze the dataset on Google Cloud.
1. Create AWS IAM user credentials to read NOAA data in Amazon S3
Sign up for the AWS Free Tier if you don’t have an AWS account, and then follow the instructions to create an IAM user.
Grant the user the AmazonS3ReadOnlyAccess policy.
Create an access key for the user, and note down the Access Key ID and Secret Access Key; we will provide these credentials to our Spark job. If you prefer the command line, the sketch below shows the equivalent steps with the AWS CLI.
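Here is a minimal AWS CLI sketch of the same three steps. The user name noaa-ghcn-reader is a placeholder I made up; the console instructions above achieve the same result.

```bash
# Create a read-only IAM user for the Spark job (user name is a placeholder)
aws iam create-user --user-name noaa-ghcn-reader

# Grant read-only access to S3
aws iam attach-user-policy \
  --user-name noaa-ghcn-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Create an access key; note down AccessKeyId and SecretAccessKey from the output
aws iam create-access-key --user-name noaa-ghcn-reader
```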
2. Create a GCP account if you don’t have one
GCP Free Tier — Free Extended Trials and Always Free | Google Cloud
Once you have signed up and logged in to the GCP console, activate Cloud Shell by clicking the icon highlighted in the screenshot below. Cloud Shell provides command-line access to a virtual machine instance, and this is where we are going to set up our use case. (Alternatively, you can do all of this from your own laptop's terminal.)
For the source code, here is the GitHub project.
Now that we have our AWS and GCP credentials, let's get started.
Set up your GCP Data Lake
To set up our GCP environment, we will need to perform the following:
- Create a project
- Enable billing (you will still be within the free tier)
- Enable Dataproc and BigQuery Services
- Create a Google Cloud Storage bucket
- Create a BigQuery Dataset
- Create a Dataproc cluster
To perform the above tasks:
Check out my project ghcn-daily-analysis on GitHub, and execute the data lake creation script, as shown below.
In your Google Cloud Shell, run the commands below.
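The script in the repository is the source of truth; what follows is a hedged sketch of the gcloud, gsutil, and bq commands such a setup amounts to. The project ID, region, bucket, cluster name, billing account, and repository URL are placeholders I chose for illustration.

```bash
# Clone the project (replace the URL with the actual repo location)
git clone https://github.com/YOUR_GITHUB_USER/ghcn-daily-analysis.git
cd ghcn-daily-analysis

# Placeholder names -- adjust to your own
export PROJECT_ID=my-ghcn-project
export REGION=us-central1
export BUCKET=${PROJECT_ID}-data-lake

# Create the project and link a billing account (still within the free tier)
gcloud projects create ${PROJECT_ID}
gcloud config set project ${PROJECT_ID}
gcloud billing projects link ${PROJECT_ID} --billing-account=YOUR_BILLING_ACCOUNT_ID

# Enable the Dataproc and BigQuery services
gcloud services enable dataproc.googleapis.com bigquery.googleapis.com

# Create a Cloud Storage bucket and a BigQuery dataset
gsutil mb -l ${REGION} gs://${BUCKET}/
bq --location=US mk --dataset ${PROJECT_ID}:GlobalHistoricalWeatherData

# Create a small single-node Dataproc cluster for the Spark job
gcloud dataproc clusters create ghcn-cluster \
  --region=${REGION} \
  --single-node \
  --master-machine-type=n1-standard-4
```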
Here is what the output will look like:
So far so good. Next, let's get the data into BigQuery.
Load NOAA Global Historical Climatology Network Daily (GHCN-D) data from Amazon S3 to BigQuery
To copy the NOAA GHCN-D data from the public Amazon S3 bucket into your BigQuery dataset, follow the two steps below.
1. Build and package the Spark application by running the script below
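The actual build script lives in the repository; here is a rough sketch of what it does, assuming an sbt-based Scala Spark project. The configuration key names, the year range, and the build tool are assumptions for illustration.

```bash
# Placeholders -- paste in the credentials of the IAM user created earlier
export AWS_ACCESS_KEY_ID="REPLACE_WITH_YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="REPLACE_WITH_YOUR_SECRET_ACCESS_KEY"

# Generate the application configuration the Spark job will read.
# Key names and the year range are illustrative assumptions.
cat > $HOME/application.conf <<EOF
aws {
  accessKey = "${AWS_ACCESS_KEY_ID}"
  secretKey = "${AWS_SECRET_ACCESS_KEY}"
}
ghcn {
  startYear = 1781
  endYear   = 2020
}
EOF

# Build the fat jar (assumes the sbt-assembly plugin is configured in the project)
sbt clean assembly
```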
You will see an output like this:
Verify the generated application configuration file:
cat $HOME/application.conf
Check your AWS key and secret. The start year and end year define the range of data we will extract from the AWS Open Data bucket.
2. Submit the Dataproc job to load the data into BigQuery
Run the script below to submit the Dataproc job:
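A hedged sketch of that submission is shown here; the cluster name, main class, jar path, connector version, and job arguments are assumptions, so defer to the script in the repository.

```bash
# Submit the Spark job to the Dataproc cluster created earlier.
# The spark-bigquery connector jar is pulled from the public spark-lib bucket.
gcloud dataproc jobs submit spark \
  --cluster=ghcn-cluster \
  --region=${REGION} \
  --class=com.example.GhcnDailyLoader \
  --jars=target/scala-2.12/ghcn-daily-analysis-assembly-1.0.jar,gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
  --files=$HOME/application.conf \
  -- GlobalHistoricalWeatherData.ghcn_daily

# Optionally watch the job from the shell instead of the console
gcloud dataproc jobs list --region=${REGION} --cluster=ghcn-cluster
```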
You can check the Dataproc console to see the job status. On the free tier infrastructure, this job will take a couple of hours to complete.
Now that we have the NOAA Global Historical Climatology Network Daily data in our BigQuery dataset, let's go ahead and do some analysis.
BigQuery (GHCN-D) Dataset insights:
Go to your BigQuery console and review the newly created table GlobalHistoricalWeatherData.ghcn_daily.
The Details tab shows the size of the dataset: 125 GB, with 1.75 billion rows.
Note that the table is partitioned by year on the partition_date column, which is set to the first day of each observation's year.
Also, to improve query performance, I have clustered the table data on the country column, as this is the most commonly used filter. BigQuery uses the clustering columns to colocate related data. The order of the clustering columns is important: it determines the sort order of the data.
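For illustration only, the DDL below shows roughly how a table with this partitioning and clustering scheme would be declared. In this walkthrough the table is actually created by the Spark job, and apart from the columns named in this post (country, date, min_temp, max_temp, partition_date) the schema shown here is an assumption.

```bash
bq query --use_legacy_sql=false '
CREATE TABLE IF NOT EXISTS GlobalHistoricalWeatherData.ghcn_daily_example (
  id STRING,            -- station identifier
  name STRING,          -- station name, from the GHCN-D station metadata
  country STRING,       -- country, added by the Spark job
  state STRING,         -- state/province, where available
  date DATE,            -- observation date
  min_temp FLOAT64,     -- daily minimum temperature
  max_temp FLOAT64,     -- daily maximum temperature
  partition_date DATE   -- first day of the observation year
)
PARTITION BY partition_date
CLUSTER BY country
'
```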
Let’s run some queries on our BigQuery dataset
The total number of observations:
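A simple count over the whole table; replace the project ID placeholder with your own.

```bash
bq query --use_legacy_sql=false '
SELECT COUNT(*) AS total_observations
FROM `my-ghcn-project.GlobalHistoricalWeatherData.ghcn_daily`
'
```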
The warmest day of 2020 so far in the US by state:
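One way to express this while keeping one row per state (a sketch; the state and max_temp column names and the "US" country value are assumptions based on how the table was described above):

```bash
bq query --use_legacy_sql=false '
SELECT state, date, max_temp
FROM (
  SELECT state, date, max_temp,
         ROW_NUMBER() OVER (PARTITION BY state ORDER BY max_temp DESC) AS rn
  FROM `my-ghcn-project.GlobalHistoricalWeatherData.ghcn_daily`
  WHERE country = "US"
    AND EXTRACT(YEAR FROM date) = 2020
)
WHERE rn = 1
ORDER BY max_temp DESC
'
```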
Hottest days for the San Francisco station since 1980:
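A sketch of this query; matching the station by name is an assumption, and the exact station name string may differ.

```bash
bq query --use_legacy_sql=false '
SELECT date, name, max_temp
FROM `my-ghcn-project.GlobalHistoricalWeatherData.ghcn_daily`
WHERE name LIKE "%SAN FRANCISCO%"
  AND EXTRACT(YEAR FROM date) >= 1980
ORDER BY max_temp DESC
LIMIT 10
'
```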
Next, let's do some visualization to better understand the trends in the weather pattern.
Visualize global historical weather data with Google Data Studio
Data Studio lets you easily connect to a variety of data sources and visualize your data through highly configurable charts and tables.
Use the Data Studio BigQuery connector to connect to your ghcn-d weather dataset.
Here is the visualization of the global weather stations and the total number of observations.
Minimum and maximum temperatures for Toronto City, 2019
To plot the time-series chart of the max and min temperatures for Toronto City in 2019, set the date range dimension to the ghcn_d.date column and include the metrics min_temp and max_temp.
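If you prefer to point the chart at a pre-filtered source instead of applying filters inside Data Studio, a small view like the following works (a sketch; the view name and the station-name match are assumptions):

```bash
bq query --use_legacy_sql=false '
CREATE OR REPLACE VIEW GlobalHistoricalWeatherData.toronto_city_2019 AS
SELECT date, min_temp, max_temp
FROM `my-ghcn-project.GlobalHistoricalWeatherData.ghcn_daily`
WHERE name LIKE "%TORONTO CITY%"
  AND EXTRACT(YEAR FROM date) = 2019
'
```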
Temperature anomaly at Hohenpeißenberg Meteorological Observatory
An anomaly is when the conditions depart from average conditions for a particular place at a given time of year. A positive anomaly means that the temperature was warmer than normal; a negative anomaly indicates that the temperature was cooler than normal.
Here we calculate the temperature anomaly at Hohenpeissenberg, Germany. Augustinian monks carried out the first meteorological observations on the site, and measurements have continued there without interruption since 1781.
The measurement series ranks among the longest ever recorded, and it was never impacted by urban heat island effects.
Here is the query to calculate the annual mean temperature anomaly.
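A sketch of that calculation: compute the annual mean temperature for the station and subtract a long-term baseline. The station-name match, the use of the TMIN/TMAX midpoint as the daily mean, and the 1781 to 1980 baseline period are assumptions for illustration.

```bash
bq query --use_legacy_sql=false '
WITH annual AS (
  SELECT EXTRACT(YEAR FROM date) AS year,
         AVG((min_temp + max_temp) / 2) AS mean_temp
  FROM `my-ghcn-project.GlobalHistoricalWeatherData.ghcn_daily`
  WHERE name LIKE "%HOHENPEISSENBERG%"
  GROUP BY year
),
baseline AS (
  SELECT AVG(mean_temp) AS baseline_temp
  FROM annual
  WHERE year BETWEEN 1781 AND 1980
)
SELECT a.year,
       ROUND(a.mean_temp - b.baseline_temp, 2) AS temp_anomaly
FROM annual AS a
CROSS JOIN baseline AS b
ORDER BY a.year
'
```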
Visualizing rising temperatures at Hohenpeißenberg Meteorological Observatory
Let's plot the monthly average temperature at Hohenpeissenberg, Germany, from 1781 to 2000 and from 2000 to 2020.
Here is the query to calculate the monthly averages. Create a view and import this view into Google Data Studio.
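A sketch of such a view, splitting the series into the two periods so Data Studio can draw one line per period. The view name, the period split, and the TMIN/TMAX midpoint are assumptions.

```bash
bq query --use_legacy_sql=false '
CREATE OR REPLACE VIEW GlobalHistoricalWeatherData.hohenpeissenberg_monthly AS
SELECT EXTRACT(MONTH FROM date) AS month,
       IF(EXTRACT(YEAR FROM date) < 2000, "1781-2000", "2000-2020") AS period,
       AVG((min_temp + max_temp) / 2) AS avg_temp
FROM `my-ghcn-project.GlobalHistoricalWeatherData.ghcn_daily`
WHERE name LIKE "%HOHENPEISSENBERG%"
GROUP BY month, period
'
```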
The gap between the two lines below shows the increase in temperatures over more than 200 years.
Recap
Here is what we did:
- Using Apache Spark, we connected to NOAA's Global Historical Climatology Network-Daily dataset hosted on Amazon S3, applied simple transformations, and enriched the dataset with geographical information. Finally, we loaded this dataset into BigQuery.
- Gained insights on the historical weather data using BigQuery
- Visualized historical trends in the weather using Data Studio
The goal of this blog is to show how to use cloud and open-source tools to explore and gain insights from this large weather dataset.
As I said, many more explorations can be performed on this dataset. Create and share your own insights.