(Part 3 of 3) Embrace Choice — Curate & Analyze NOAA Weather Dataset using Big Data Engines Hive/Spark & Presto

In the previous two parts of this story, we looked at how we can curate the NOAA GHCN Weather dataset using Hive or Spark to achieve more or less the same end result i.e., a curated refined Weather Table stored in an optimized ORC format.

In this 3rd and final part we will explore how we can enable accelerated access to the curated dataset using Presto and solve the final piece of the puzzle i.e, a BI / Reporting use case using Tableau to explore and visualize historical weather trends in New York.

For this purpose, lets zero down on New York Central Park weather station with ID: USW00094728 . To bring the New York weather data into Tableau lets create a view in Presto using the below SQL.

DROP VIEW IF EXISTS GHCN_WEATHER_NY_CNTRL_PK;
CREATE VIEW GHCN_WEATHER_NY_CNTRL_PK AS
select * from
(select station_id, latitude, longitude, elevation, name, state, year, month, day, ROUND(1.8*(element_at(element_map,'TMAX')/10)+32,2) as tmax, ROUND(1.8*(element_at(element_map,'TMIN')/10)+32,2) as tmin,
element_at(element_map,'PRCP') as PRCP, element_at(element_map,'SNOW') as SNOW, element_at(element_map,'SNWD') as SNWD, element_at(element_map,'AWND') as AWND FROM
(
select station_id, latitude, longitude, elevation, name, state,year, month, day, MAP(element_list,value_list) as element_map FROM
(
select station_id, latitude, longitude, elevation, name, state,year, month, day,
array_agg(element) element_list, array_agg(value) value_list
from noaa_weather.GHCN_DAILY_WEATHER_REFINED where
station_id='USW00094728' and
element in ('PRCP','SNWD','SNOW','TMAX','TMIN','AWND')
group by station_id, latitude, longitude, elevation, name, state,year, month, day
)x)y)
where tmax>-1767 and tmin>-1767 --Filter out the rows/records where temperature observations were not recorded
;

Using the view, lets answer a few curious questions about extreme weather in New York. We can validate the results from a NY Central Park Extreme weather report published by weather.gov at https://www.weather.gov/media/okx/Climate/CentralPark/extremes.pdf

  1. What was the maximum recorded temperature in New York and when was it recorded?

Answer: 105.98 Fahrenheit, recorded on 9th July 1936

select  tmax, month, year, day , rank() over (order by tmax desc) from GHCN_WEATHER_NY_CNTRL_PK limit 1
################Result####################
tmax month year day _col4
105.98 7 1936 9 1

2. What was the lowest recorded temperature in New York and when was it recorded?

Answer: -14.98 Fahrenheit, recorded on 9th February 1934

select  tmin, month, year, day , rank() over (order by tmin) from GHCN_WEATHER_NY_CNTRL_PK limit 1
################Result####################
tmin month year day _col4
-14.98 2 1934 9 1

3. What was the warmest month in New York and which month & year was it recorded in.

Answer: July 1999, recorded 81.36 Fahrenheit as average daily temperature.

select  avg((tmax+tmin)/2) as avg_monthly_temp, month, year from GHCN_WEATHER_NY_CNTRL_PK group by month, year order by avg((tmax+tmin)/2) desc limit 1
################Result####################
avg_monthly_temp month year
81.36 7 1999

4. What was the coldest month in New York and which month & year was it recorded in.

Answer: February 1934, recorded 19.90 average daily temperature

select  avg((tmax+tmin)/2) as avg_monthly_temp, month, year from GHCN_WEATHER_NY_CNTRL_PK group by month, year order by avg((tmax+tmin)/2) asc limit 1
################Result###################
avg_monthly_temp month year
19.90 2 1934

5. What was the wettest month in New York on record and which year was it recorded in.

Answer: August 2011, recorded a total precipitation of 18.95 inches.

select  SUM(PRCP * 0.003937) as prcp_inches, month, year from GHCN_WEATHER_NY_CNTRL_PK group by month, year order by SUM(PRCP) desc limit 1
################Result###################
prcp_inches month year
18.95 8 2011

Using Qubole’s ODBC driver, I was able to connect the Presto engine with Tableau to visualize the curated weather data set for New York. For other Presto distributions the open source ODBC PrestoDB driver should work. Below is the tableau public link that has metadata links to all the interactive worksheets and dashboards.

https://public.tableau.com/profile/pradeep6567#!/vizhome/NewYorkCentralParkWeather/AverageTemperatureTrends

Conclusion:

Embracing choice in big data is vitally important. No one big data engine, tool or technology is the be all end all. This entire blog series was curated using Qubole’s Hive, Spark , Presto & ODBC driver for Tableau connectivity. Qubole is a cloud native big data activation platform that supports multiple cloud providers (AWS, Azure & Oracle) and offers choice & support for multiple open source engines (viz. Hive, Spark & Presto), tools & technologies. You may try the solution options detailed in this blog on Qubole by signing up for a free account at https://www.qubole.com/products/pricing/