
Explore & Visualize 200+ Years of Global Temperature

Visualize observable changes in global temperature using NOAA’s historical weather data, Apache Spark, BigQuery, and Data Studio

Kapil Sreedharan
Google Cloud - Community


We have all read about and experienced the effects of climate change around us. We have seen numbers like these: the current global average temperature is 0.85ºC higher than it was in the late 19th century, and each of the past three decades has been warmer than any preceding decade since records began in 1850.

I got curious about how climatologists determine these numbers. There is a whole lot of research going on in this area. I came across one important weather dataset from NOAA that is widely used in research.

In this blog, I will explain how I performed some simple analysis and visualized important patterns from NOAA’s historical weather data.

There are lots of insights that can be gained from this dataset. Keep reading to learn how you can do your own weather analyses using free cloud resources and open source technologies.

Prerequisites:

We will need AWS credentials to read from a public S3 bucket, as well as a GCP account to analyze the dataset on Google Cloud.

  1. Create AWS IAM user credentials to read NOAA data in Amazon S3

Sign up for the AWS Free Tier if you don’t have an AWS account, and then follow the instructions to create an IAM user.

Grant your user the AmazonS3ReadOnlyAccess policy.

AWS IAM Permissions

Create an access key for your user. (Note down your Access Key ID and Secret Access Key; we will provide these credentials to our Spark job.)

AWS IAM Permissions
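
If you prefer the command line, the same IAM setup can be sketched with the standard AWS CLI; the user name noaa-s3-reader below is a placeholder of my choosing:

# Create a read-only user for the NOAA bucket (user name is a placeholder)
aws iam create-user --user-name noaa-s3-reader
aws iam attach-user-policy --user-name noaa-s3-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# Prints the Access Key ID and Secret Access Key; note them down
aws iam create-access-key --user-name noaa-s3-reader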

  2. Create a GCP account if you don’t have one

GCP Free Tier — Free Extended Trials and Always Free | Google Cloud

Once you have signed up and logged in to the GCP console, activate Cloud Shell by clicking the icon highlighted in the screenshot below. Cloud Shell provides command-line access to a virtual machine instance, and this is where we are going to set up our use case. (Alternatively, you can do this from your own laptop’s terminal.)

GCP Console
Google Cloud Shell

The source code is available in the GitHub project.

Now that we have our AWS and GCP credentials, let's get started.

Set up your GCP Data Lake

To set up our GCP environment, we will need to perform the following:

  • Create a project
  • Enable billing (we will stay within the free tier)
  • Enable the Dataproc and BigQuery services
  • Create a Google Cloud Storage bucket
  • Create a BigQuery dataset
  • Create a Dataproc cluster

To perform the above tasks:

Check out my ghcn-daily-analysis project on GitHub and execute the data lake creation script, setup_gcp_datalake.sh, as shown below.

Run the script in your Cloud Shell:
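
The exact commands are in the repository; as a rough sketch of what setup_gcp_datalake.sh does (the project ID, region, bucket, cluster name, and billing account below are placeholders of my choosing), it runs something like this:

# Placeholders: pick your own project ID and region
PROJECT_ID=my-ghcn-project
REGION=us-central1

# Create the project and link a billing account (still within the free tier)
gcloud projects create ${PROJECT_ID}
gcloud config set project ${PROJECT_ID}
gcloud billing projects link ${PROJECT_ID} --billing-account=XXXXXX-XXXXXX-XXXXXX

# Enable the Dataproc and BigQuery services
gcloud services enable dataproc.googleapis.com bigquery.googleapis.com

# Storage bucket, BigQuery dataset, and a small Dataproc cluster
gsutil mb -l ${REGION} gs://${PROJECT_ID}-datalake
bq mk --dataset ${PROJECT_ID}:GlobalHistoricalWeatherData
gcloud dataproc clusters create ghcn-cluster --region=${REGION} --single-node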

Here is what the output will look like:

Output of setup_gcp_datalake.sh

So far so good. Next, let's get the data into BigQuery.

Load NOAA Global Historical Climatology Network Daily (GHCN-D) data from Amazon S3 into BigQuery

To copy the NOAA GHCN-D data from the public Amazon S3 bucket into your BigQuery dataset, follow these two steps.

  1. Build and package the Spark application by running the build script, build_spark_app.sh, in your Cloud Shell.

You will see output like this:

build_spark_app.sh execution output

Verify the generated application configuration file:

cat $HOME/application.conf

Check your AWS key and secret. The start year and end year define the range of data we will extract from the AWS open data bucket.

Spark application configuration file
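
As a sketch of the shape this file takes (the key names below are my assumptions in the usual Typesafe Config/HOCON style; verify them against the file the build script generates):

# $HOME/application.conf (sketch; key names are assumptions)
aws {
  accessKeyId = "YOUR_ACCESS_KEY_ID"          # from the IAM step above
  secretAccessKey = "YOUR_SECRET_ACCESS_KEY"
}
extract {
  startYear = 1781   # first year of data to pull; choose your own range
  endYear = 2020
}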

  2. Submit the Dataproc job to load the data into BigQuery

Run the following script to submit the Dataproc job:
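
The submission the script performs looks roughly like this (the cluster name matches the placeholder above; the main class and jar path are hypothetical and depend on the build):

# Sketch of the Dataproc job submission (class and jar path are placeholders)
gcloud dataproc jobs submit spark \
  --cluster=ghcn-cluster \
  --region=us-central1 \
  --class=com.example.ghcn.GhcnDailyLoader \
  --jars=gs://my-ghcn-project-datalake/ghcn-daily-analysis-assembly.jar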

You can check the Dataproc console to see the job status. On free-tier infrastructure, this job will take a couple of hours to complete.

Now that we have the NOAA Global Historical Climatology Network Daily data in our BigQuery dataset, let's go ahead and do some analysis.

BigQuery (GHCN-D) Dataset insights:

Go to your BigQuery console and review the newly created table GlobalHistoricalWeatherData.ghcn_daily.

The Details tab shows the size of the dataset: 125 GB, with 1.75 billion rows.

Note that the table is partitioned by year on the column partition_date, which is set to the beginning of each year.

Also, to improve query performance, I have clustered the table data on the country column, as this is the most commonly used filter. BigQuery uses the clustering columns to colocate related data. The order of the clustering columns is important: it determines the sort order of the data.
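
In BigQuery DDL, the layout described above reads roughly as follows; the columns besides partition_date and country are assumptions based on the standard GHCN-D schema, not the exact table definition:

-- Sketch of the table layout (non-key columns assumed from GHCN-D)
CREATE TABLE GlobalHistoricalWeatherData.ghcn_daily (
  id STRING,            -- station identifier
  date DATE,            -- observation date
  element STRING,       -- observation type, e.g. TMAX, TMIN, PRCP
  value FLOAT64,        -- observation value (temperatures in tenths of ºC)
  country STRING,       -- country name added during the Spark load
  partition_date DATE   -- first day of the observation year
)
PARTITION BY partition_date
CLUSTER BY country;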

Let’s run some queries on our BigQuery dataset

The total number of observations:
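
A sketch of such a query, using the table created above:

SELECT COUNT(*) AS total_observations
FROM `GlobalHistoricalWeatherData.ghcn_daily`;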

The warmest day of 2020 so far in the US by state:
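
A query along these lines produces that result; the state column and the country value are assumptions based on the geographic enrichment described earlier, and GHCN-D stores temperatures in tenths of a degree Celsius:

SELECT
  state,
  -- Keep only the single warmest observation per state
  ARRAY_AGG(STRUCT(date, value / 10 AS max_temp_c)
            ORDER BY value DESC LIMIT 1)[OFFSET(0)] AS warmest_day
FROM `GlobalHistoricalWeatherData.ghcn_daily`
WHERE country = 'United States'   -- assumed country label
  AND element = 'TMAX'
  AND EXTRACT(YEAR FROM date) = 2020
GROUP BY state
ORDER BY state;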

Query result for the warmest day of 2020 so far in the US by state

Hottest days for the San Francisco station since 1980:

SQL: hottest days in San Francisco since 1980
Query output: hottest days in San Francisco, top 3 active stations
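
As a sketch of such a query (the station ID below is hypothetical; look up the actual San Francisco station IDs in the table first):

SELECT date, value / 10 AS max_temp_c
FROM `GlobalHistoricalWeatherData.ghcn_daily`
WHERE id = 'USW00023272'   -- hypothetical San Francisco station ID
  AND element = 'TMAX'
  AND EXTRACT(YEAR FROM date) >= 1980
ORDER BY max_temp_c DESC
LIMIT 10;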

Next, let's do some visualization to better understand the trends in the weather pattern.

Visualize global historical weather data with Google Data Studio

Data Studio lets you easily connect to a variety of data sources and visualize your data through highly configurable charts and tables.

Use the Data Studio BigQuery connector to connect to your GHCN-D weather dataset.

Here is a visualization of the global weather stations and the total number of observations.

Google Data Studio Dashboard for global weather stations and observations

Minimum and maximum temperature for Toronto in 2019

To plot a time-series chart of the minimum and maximum temperatures for Toronto in 2019, set the date range dimension to the ghcn_d.date column and include the metrics min_temp and max_temp.
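
One way to expose those metrics is a view that pivots TMIN/TMAX into columns; this is a sketch, and the name column used to filter Toronto stations is an assumption:

CREATE OR REPLACE VIEW GlobalHistoricalWeatherData.toronto_2019 AS
SELECT
  date,
  MIN(IF(element = 'TMIN', value / 10, NULL)) AS min_temp,
  MAX(IF(element = 'TMAX', value / 10, NULL)) AS max_temp
FROM `GlobalHistoricalWeatherData.ghcn_daily`
WHERE name LIKE '%TORONTO%'   -- assumed station-name filter
  AND EXTRACT(YEAR FROM date) = 2019
GROUP BY date;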

Min and max temperature (ºC), Toronto, 2019

Temperature anomaly at Hohenpeißenberg Meteorological Observatory

An anomaly is a departure from the average conditions for a particular place at a given time of year. A positive anomaly means that the temperature was warmer than normal; a negative anomaly means that it was cooler than normal.

Here we calculate the temperature anomaly at Hohenpeißenberg, Germany. Augustinian monks carried out the first meteorological observations at the site, and observations have continued there continuously since 1781.

The measurement series ranks among the longest ever recorded, and it has never been affected by urban heat island effects.

Here is the query to calculate the annual mean temperature anomaly.
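
A sketch of that calculation, using the full-series mean as the baseline (a simplification; the station ID is hypothetical):

WITH annual AS (
  SELECT
    EXTRACT(YEAR FROM date) AS year,
    AVG(value) / 10 AS mean_temp_c
  FROM `GlobalHistoricalWeatherData.ghcn_daily`
  WHERE id = 'GM000004204'   -- hypothetical Hohenpeissenberg station ID
    AND element IN ('TMIN', 'TMAX')
  GROUP BY year
)
SELECT
  year,
  -- Anomaly: departure of each year's mean from the long-term mean
  mean_temp_c - (SELECT AVG(mean_temp_c) FROM annual) AS anomaly_c
FROM annual
ORDER BY year;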

Visualizing rising temperatures at Hohenpeißenberg Meteorological Observatory

Let's plot the monthly average temperature at Hohenpeißenberg, Germany, from 1781 to 2000 and from 2000 to 2020.

Here is the query to calculate the monthly averages. Create a view from it and import that view into Google Data Studio.
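
A sketch of such a view, with one column per period (the station ID is hypothetical, as above):

CREATE OR REPLACE VIEW GlobalHistoricalWeatherData.hohenpeissenberg_monthly AS
SELECT
  EXTRACT(MONTH FROM date) AS month,
  AVG(IF(EXTRACT(YEAR FROM date) < 2000, value / 10, NULL)) AS avg_temp_1781_2000,
  AVG(IF(EXTRACT(YEAR FROM date) >= 2000, value / 10, NULL)) AS avg_temp_2000_2020
FROM `GlobalHistoricalWeatherData.ghcn_daily`
WHERE id = 'GM000004204'   -- hypothetical Hohenpeissenberg station ID
  AND element IN ('TMIN', 'TMAX')
GROUP BY month;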

The gap between the two lines below shows the increase in temperature over 200 years.

Recap

Here is what we did:

  1. Using Apache Spark, we connected to NOAA's Global Historical Climatology Network Daily (GHCN-D) dataset hosted on Amazon S3, applied simple transformations, and enhanced the dataset with geographical information. Finally, we loaded this dataset into BigQuery
  2. Gained insights on the historical weather data using BigQuery
  3. Visualized historical trends in the weather using Data Studio

The goal of this blog is to show how to use cloud and open-source tools to explore and gain insights from this large weather dataset.

As I said, many more explorations can be performed on this dataset. Create and share your own insights.
