Investigating Global Temperature Trends with BigQuery and Tableau

Using the NOAA GHCN and GSOD datasets

Reto Meier
Google Cloud - Community

--

The analysis in today’s Today I Learned with BigQuery was performed by @savio_lawrence using BigQuery, Tableau, and the NOAA GHCN and GSOD datasets to see what observations we can make about changes in the average temperatures recorded at weather observation stations around the world.

Before we dive-in, note that while interesting, this analysis by itself is insufficient to draw conclusions for something as complex as climate change or global warming.

Scientific organizations and researchers work hard to account for the challenges we’ll highlight throughout the analysis, as well as combining datasets like these with remote sensing observations to obtain global and consistent coverage of the Earth.

One of the beauties of having access to such vast datasets is opportunity to explore some of the data that underpins scientific research to find areas for potential future exploration and analysis.

Using BigQuery and Tableau to Analyze Climate Data

Using BigQuery we processed and summarized 196 tables (74GB+ with over 2.5 billion records) from the GHCN and GSOD public datasets.

NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate observations subjected to a common suite of quality assurance reviews.

The NOAA Global Surface Summary of the Day (GSOD) dataset includes weather observations from a much larger set of stations (over 9,000), but these haven’t undergone any review. As a result they tend to be less reliable than the GHCN observations.

We used the SQL statement below to perform the following functions, and applied it to both datasets.

  • Find the yearly average temperature per station using the aggregate function.
  • Group stations into latitude bands.
  • Group years into decades using conditional statements and the like operator.
  • Finding the number of readings per year for each station, using the count function.
SELECT
ID AS STN,
PERIOD_TS,
NUMOFDAYS,
AVERAGE_TMP_C,
LATITUDE,
LONGITUDE,
NAME,
STATE,
CASE
WHEN LATITUDE >= 66 AND LATITUDE <=90 THEN
"ARCTIC CIRCLE"
WHEN LATITUDE >= 23 AND LATITUDE < 66 THEN
"NORTHERN TEMPERATE ZONE - BETWEEN TROPIC OF CANCER & ARCTIC CIRCLE"
WHEN LATITUDE >= 0 AND LATITUDE < 23 THEN
"NORTHERN TROPICS - BETWEEN TROPIC OF CANCER & EQUATOR"
WHEN LATITUDE >= -23 AND LATITUDE < 0 THEN
"SOUTHERN TROPICS - BETWEEN TROPIC OF CAPRICORN & EQUATOR"
WHEN LATITUDE >= -66 AND LATITUDE < -23 THEN
"SOUTHERN TEMPERATE ZONE - BETWEEN TROPIC OF CAPRICORN & ANTARCTIC CIRCLE"
WHEN LATITUDE >= -90 AND LATITUDE < -66 THEN
"ANTARCTIC CIRCLE"
END AS STN_ZN,
IF (NUMOFDAYS >= 300, "GREATER THAN 300", "LESS THAN 300") AS NUMOFDAYS_IND,
CASE
WHEN YEAR LIKE "192%" THEN "20S"
WHEN YEAR LIKE "193%" THEN "30S"
WHEN YEAR LIKE "194%" THEN "40S"
WHEN YEAR LIKE "195%" THEN "50S"
WHEN YEAR LIKE "196%" THEN "60S"
WHEN YEAR LIKE "197%" THEN "70S"
WHEN YEAR LIKE "198%" THEN "80S"
WHEN YEAR LIKE "199%" THEN "90S"
WHEN YEAR LIKE "200%" THEN "2000S"
WHEN YEAR LIKE "201%" THEN "2010S"
END AS DECADES
FROM (
SELECT
A.ID,
A.PERIOD_TS,
COUNT(*) AS NUMOFDAYS,
ROUND(AVG(A.VALUE/10),1) AS AVERAGE_TMP_C,
B.LATITUDE,
B.LONGITUDE,
B.NAME,
B.STATE,
A.YEAR
FROM (
SELECT
*,
TIMESTAMP_TRUNC(TIMESTAMP(date), YEAR) AS PERIOD_TS,
CAST((EXTRACT(YEAR from DATE)) as String) AS YEAR
FROM
`bigquery-public-data.ghcn_d.ghcnd_*` ) A
INNER JOIN
`bigquery-public-data.ghcn_d.ghcnd_stations` B
ON
A.ID = B.ID
WHERE
A.ELEMENT = 'TAVG'
GROUP BY
B.LATITUDE, B.LONGITUDE,
B.NAME, B.STATE, A.ID, A.PERIOD_TS, A.YEAR)

Once we summarized the data — along with the derived attributes including latitude zone, decade, and station observation count — we used Tableau to visualize the results.

Tableau allows us to provide a SQL statement to use as an input into a dashboard, and provides multiple visualization techniques including:

  • Map view with colored labeling to help visualize temperature trends by station location.
  • A bar chart with reference lines to plot average temperatures by decades.
  • Line charts to plot temperature averages by year.

Using this approach, we created simple dashboards that made it easy to understand our results, and perform more detailed analysis.

GHCN Tableau Dashboard | GSOD Tableau Dashboard

Both sets of data indicate a trend of global temperature increase since the 1970s

The graphs below show a gradual increase in net average global temperature each decade using both the GSOD and GHCN station data.

Net average global temperature observations each decade using GSOD and GHCN station data

To generate these graphs we used stations with at least 300 measurements in a given year, starting in 1973 (as explained below.)

Of course, this doesn’t tell the whole story — global temperature isn’t consistent, so some places may have gotten warmer and others cooler. Later in this article we’ll look at trends at different latitudes to better understand the data.

We want to consider the full data, so we’ll perform as little filtering as possible

Significant changes in the number and location of weather stations is likely to have a significant impact on the average temperature, as will big shifts in population, and relative density of reporting locations.

GSOD and GHCN reporting stations in 1972 and 1973

For our analysis to be meaningful, we want to start in a year after which there aren’t big jumps in the number of reporting stations, and there’s a reasonable global coverage of stations.

Our data suggests that 1973 was the first year in which the GSOD and GHCN databases both matched these criteria, as shown in these graphs comparing 1972 and 1973.

Notice, for example, that in 1972, there are few stations in South America; in 1973, there are a lot more. It’s also important to note that the GSOD database includes many stations that record only a small number of observations per year — for example, some station observations are only in the summer.

To avoid potential seasonal bias in our data we’ll filter out any station with fewer than 300 observations in a given year.

Comparing the GSOD to GHCN trends since 1973 for stations with 300+ observations

The graphs below break down the decade-averages into a year by year plot for each data set.

Year by year average temperatures for GHCN and GSOD reporting stations

This data looks like it’s gotten a lot warmer, but let’s not rush to any conclusions. Averages can be influenced in many ways, and our data might be biased. Notice that over the same time period, there’s a significant increase in the number of reporting stations.

Number of reporting stations reporting average temperature each year within GHCN and GSOD datasets

If those new stations are disproportionately located at warmer regions, that could bias the average towards warming.

Reporting station global locations

Looking at the overall distribution of stations for both datasets, the map shows that there are more stations in the most heavily populated latitudes (that have more temperate temperatures), and that there are fewer stations in more remote areas (that are frequently much colder).

As a result the regions with most stations will have disproportionate influence on the overall climate trends.

To begin to account for these potential issues, we can use a common data science technique of separating out our analysis into smaller slices of the data — in this case regions based on latitude bands.

The Arctic: The North Pole (90°N) to the Arctic Circle (66°N)

Both datasets indicate a significant gradual increase in the average temperature reported over the arctic circle each decade.

Average daily temperature per decade within the Arctic Circle using GHCN and GSOD data

GSOD shows a 1.08°C increase from the 70’s to the current decade, and GHCN a 0.67525°C increase over the same period .

Observing both results suggests that the Arctic circle has been warming.

If we take a look at the number of stations used each year, it suggests there’s no correlation between the reporting station count and the temperature recorded.

Northern Temperate Zone: The Arctic Circle (66°N) to The Tropic of Cancer (22.5°N)

The Northern Temperate Zone includes the largest number of stations, and represents the most highly populated region.

In the graph below, notice the gradual increase in average temperature as we move from the Arctic Circle to the tropics as indicated by the color of each dot representing the average temperature at each measurement station.

Location and average temperature of stations within the Northern Temperate Zone

As with the Arctic Circle, both the GSOD and GHCN datasets show a gradual increase in the average temperature reported across the Northern Temperate Zone each decade, though at a slower rate resulting in 0.455°C and 0.65275°C increases in the GSOD and GHCN sets respectively.

Northern Tropics: The Tropic of Cancer (22.5°N) to The Equator (0°)

When we reach the Northern Tropics, the graphs below show that while the number of stations providing readings has increased significantly since the 1970’s, a gradual temperature increase over time is seen in both datasets — we observe an average temperature increase of 0.1312°C and 0.0279°C per decade in GSOD and GHCN sets respectively.

This further reinforces the observation that the larger number of stations isn’t a significant factor in increased global average temperature increases, and also demonstrates that the overall warming trend isn’t consistently observed in all locations.

Southern Tropics: The Equator (0°) to The Tropic of Capricorn (23°S)

The same is true for the Southern Tropics, with only a slight increase in average temperature and a significant increase in reporting stations.

The average increase in temperature per decade for GSOD and GHCN has been 0.186°C and 0.283°C respectively.

Southern Temperate Zone: The Tropic of Cancer (22.5°S) to The Antarctic Circle (66°S)

In the Southern Temperate Zone, both datasets saw a small, but gradual decrease from the 1970’s to 2000’s of around 0.5°C.

Interestingly, the 2010’s have so far almost entirely countered that drop, recording the highest average since the 1970’s thanks principally to a very hot 2015 and 2016.

Are these two years outliers or the beginnings of a new warming trend for the Southern Temperate Zone? It’s too early to say based on only two data-points, but it’s an observation worth tracking.

The Antarctic: The Antarctic Circle (66°S) to The South Pole (90°S)

The number of stations in the Antarctic is very low compared to the other latitude bands— the GHCN records don’t include any stations with over 300 readings until 1980 within the Antarctic Circle.

The measurements we do have, indicate a general trend of decreasing average temperature each decade, of 6°C and 10°C for the GSOD and GHCN data respectively.

Note that most of the weather stations are on the edge of the continent, with very few stations in deeper Antarctica.

There is a pattern that Northern Antarctic stations have higher temperature than those located deeper within Antarctica.

The newer reporting stations are skewed towards those deeper, lower temperature, locations that may be contributing to the observed temperature decrease.

These challenges make the Antarctic data especially difficult to analyze, and require additional steps to gain a better understanding of changing temperatures there.

You’ll find that investigation in a future article.

The raw GSOD data generally agrees with the quality controlled GHCN results

Both datasets, when observed globally, indicate an increase in observed temperatures of 0.7125°C (GSOD) and 0.532°C (GHCN) each decade since the 1970s.

However, when we dig into the details on a regional basis the picture is more complex and a lot less open to simple conclusions.

The Arctic circle appears to be warming at the highest rate of approximately 1.084°C (GSOD) and 0.675°C (GHCN) each decade; and average temperatures in the Northern Temperate zone are also increasing at 0.455°C(GSOD) and 0.653°C (GHCN) / decade.

Meanwhile the tropics have maintained a fairly constant to marginal temperature rise per decade — Northern Tropics 0.13125°C (GSOD) and 0.0279°C (GHCN) while Southern Tropics have a rate of 0.186°C (GSOD) and 0.283°C (GHCN).

And the Southern Temperate zone has been slowly cooling at around 0.01905°C (GSOD) and 0.13°C (GHCN) each decade.

Climate science is… complicated

When measuring something as complex as climate change, it’s important to understand the many variables that can influence our results. For example, even in our relatively simple datasets new stations can be added, measurement stations can move, the surrounding landscape can change, and instruments can become more precise / accurate.

Analyzing this data can provide useful starting points to form hypotheses and begin further investigations into the broader trends — and causes — that they may represent.

If you’re new to BigQuery follow these getting started instructions, and remember that everyone gets 1TB at no charge every month to run queries.

Share your investigations with us at reddit.com/r/bigquery and subscribe to Today I Learned with BigQuery for more BigQuery public dataset investigations.

--

--

Reto Meier
Google Cloud - Community

Developer Advocate @ Google, software engineer, and author of “Professional Android” series from Wrox. All opinions are my own.