How to Build a Weather Data Platform
Climate change has increased the frequency and severity of extreme weather events around the world. Although this trend is just beginning, extreme weather is already wreaking havoc on billion-dollar businesses. Consequently, business leaders are scrambling to understand how the weather impacts their businesses and prevent the next catastrophe.
The challenge of integrating weather data into these organizations naturally falls to engineering teams. Many organizations are adding business logic surrounding weather forecasts to existing applications. Estimates of weather-related risk are becoming critical components of business intelligence suites. Mature data teams already have weather-driven predictive models in production. Software is eating the world, and weather is eating software.
While petabytes of weather datasets have been published by climate agencies over recent decades, this data is geared towards climate researchers and requires specific expertise and tooling. Hundreds of variables are available and documented at length, but this documentation is tailored towards meteorologists, not software engineers. Climate-specific data formats and services are not compatible with mainstream data tools, such as Pandas or Apache Spark. As a result, working directly with these datasets is time-consuming and error-prone. A well-architected weather data platform will allow software engineers, data scientists and analysts to be immediately productive with weather data.
Data platforms must provide insights into the weather’s impact on each affected department. Using weather forecasts, agriculture companies estimate crop yields, hedge market risk, and predict demand for their products. Each use case might focus on different locations and variables. A well-architected weather data platform will satisfy multiple use cases and evolve as new ones emerge.
To serve these use cases, weather data can be consumed in many ways. Let’s consider a logistics firm’s usage of weather data. First, it trains a machine learning model to predict delivery times using historical weather and operational data. It then uses weather forecasts as inputs to the model during inference. Finally, it analyzes historical data to learn how the weather actually impacted operations. Using the same systems to serve each of these use cases is a recipe for disaster.
I built Weather 20/20’s data platform and have navigated these challenges with our expert meteorologists. This post outlines the most important factors and tradeoffs in designing a weather data platform. I recommend considering how each factor affects your use cases; I’ll discuss Weather 20/20’s choices in parallel.
Each variable captures information about the state of the climate at a particular point in or interval of time, e.g. the temperature at a certain time or the snowfall on a particular day. It’s important to understand which variables impact both current and potential use cases, as long-term ingestion of large weather datasets is both time-consuming and costly.
The most commonly used variables are temperature, precipitation, humidity, air pressure, and wind speed and direction. However, this only scratches the surface. Agriculture operations are especially concerned with solar radiation, evapotranspiration, soil temperature, and soil moisture. Weather forecasting systems calculate how much solar radiation is trapped by the atmosphere using many variables, including snow albedo and cloud cover. Airlines and militaries are not only interested in surface conditions, but also conditions at various levels in the atmosphere. Take some time to discuss with domain experts which variables might matter to your use cases.
Having access to the necessary variables is important. Without it, your users will be frustrated and you will struggle to gain adoption of your weather data platform. Adding the desired variables could take a while depending on the scope of your platform. However, having too many variables can be expensive and confusing. In sum: explore variables that potentially impact your use cases, curate datasets for each use case, and don’t confuse your data platform’s users with extraneous variables.
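To make that last point concrete, here’s a minimal sketch of per-use-case curation. The variable and use-case names are purely illustrative, not a prescription:

```python
# Illustrative sketch: expose only the variables each use case needs.
# All names here are made up for the example.
ALL_VARIABLES = [
    "temperature", "precipitation", "humidity", "pressure",
    "wind_speed", "wind_direction", "soil_moisture", "evapotranspiration",
]

USE_CASE_VARIABLES = {
    "crop_yield": ["temperature", "precipitation", "soil_moisture", "evapotranspiration"],
    "delivery_eta": ["temperature", "precipitation", "wind_speed"],
}

def curate(record: dict, use_case: str) -> dict:
    """Return only the variables relevant to a given use case."""
    wanted = USE_CASE_VARIABLES[use_case]
    return {k: v for k, v in record.items() if k in wanted}

record = {v: 0.0 for v in ALL_VARIABLES}
print(sorted(curate(record, "delivery_eta")))
```

The curated views can grow independently as new use cases emerge, without every user seeing every column.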
A deep neural network can perform “automated feature engineering” on a weather dataset. While this is an appealing shortcut, it could reduce your model’s ability to predict the most extreme events: with only a handful of examples, the model could attribute an event’s severity to an extraneous variable.
How We Do It
At Weather 20/20 we ingest the variables used by our weather model: temperature, precipitation, pressure, humidity, cloud cover, wind speed and wind direction. We’re also interested in ingesting data relevant to agriculture, such as soil temperature, soil moisture and evapotranspiration.
Spatial coverage is the geographical region you’re interested in. This might consist of a set of points, countries, continents, or even the entire world. This factor has a bearing on data availability and total cost of ownership (TCO).
You might limit the dataset to regions or even countries in which the business operates, perhaps ingesting some unnecessary data as a result. Alternatively, you could ingest data for known points-of-interest, such as an agriculture company ingesting data for specific farms, potentially spanning the globe.
The size of your dataset is proportional to its spatial coverage. Different departments might be interested in totally different regions (e.g. a chain of cafés has locations in the United States and Canada, but coffee farms in South America and Southeast Asia). The spatial coverage necessary to satisfy all use cases is liable to change, and your data platform must respond in a timely manner. This means data will be consistently available within agreed-upon boundaries, which can expand upon request within a sprint or so.
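One simple way to enforce “agreed-upon boundaries” is a set of bounding boxes checked at ingestion time. A sketch, with made-up region names and coordinates:

```python
# Illustrative sketch: restrict ingestion to agreed-upon bounding boxes.
# Region names and coordinates are invented for the example.
REGIONS = {
    "us_canada": (24.0, -170.0, 72.0, -52.0),      # (min_lat, min_lon, max_lat, max_lon)
    "southeast_asia": (-11.0, 92.0, 29.0, 141.0),
}

def in_coverage(lat: float, lon: float, regions=REGIONS) -> bool:
    """True if a point falls inside any region the platform covers."""
    return any(
        min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        for (min_lat, min_lon, max_lat, max_lon) in regions.values()
    )

print(in_coverage(45.5, -73.6))   # Montréal: inside the us_canada box -> True
print(in_coverage(-33.9, 151.2))  # Sydney: outside both boxes -> False
```

Expanding coverage then amounts to adding or widening a box, which keeps the “within a sprint or so” promise realistic.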
How We Do It
We have a global data platform and organize our data into a hierarchical discrete global grid comprising 2,016,842 cells. This allows modeling interactions of weather patterns over long distances and also quickly serving new international use cases. However, if you want to know about sandstorms on Mars, you’ll have to wait.
Spatial resolution is the distance between points in a dataset. The Earth’s (spherical) shape makes understanding spatial resolution a bit more difficult.
Gridded datasets come in all different shapes and sizes, but most are 1˚, 0.5˚, 0.25˚ or 0.1˚, meaning the points are that far apart in both latitude and longitude. Ground station datasets do not follow this pattern except for a subset of stations in North and Central America.
There are two major tradeoffs: precision and data size. Certain variables, particularly precipitation, vary more over short distances than others. Depending on your platform’s spatial coverage, the irregular resolution of ground station datasets could provide insufficient precision. Think about how small a change could affect your use case: if you’re deciding how much dry ice to pack with your ice cream shipments, an error of a few degrees could cause an enormous loss.
Data size increases with the inverse square of grid spacing. Consequently, the marginal cost of increasing resolution can become staggering. For example, a 1˚ global grid contains 64,800 data points, but a 0.1˚ grid contains 6,480,000: a 100-fold increase in data size for a 10-fold increase in resolution.
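You can verify the arithmetic directly:

```python
# Quick check of how point count scales with grid spacing.
def global_grid_points(spacing_deg: float) -> int:
    """Number of points in a global latitude/longitude grid at a given spacing."""
    # round() guards against floating-point artifacts like 180 / 0.1 != 1800 exactly
    return int(round(180 / spacing_deg)) * int(round(360 / spacing_deg))

print(global_grid_points(1.0))   # -> 64800
print(global_grid_points(0.1))   # -> 6480000, 100x more points for 10x finer spacing
```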
Your organization cares about how relevant your data is, and the distance between points is a directionally correct indicator of this. However, a dataset’s nominal resolution in degrees does not translate directly into distance between points: on a “uniform” coordinate grid, the arc spanned by a degree of longitude shrinks with the cosine of latitude, so east-west point spacing in kilometers decreases towards the poles. Weather 20/20 ultimately provides a more uniformly spatially distributed dataset by using the H3 grid, which is based on the “Dymaxion” map projection (the one that looks like a stegosaurus).
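The shrinking arc of a degree of longitude is easy to compute from the spherical-Earth approximation:

```python
import math

# East-west distance spanned by one degree of longitude, on a spherical Earth.
# It shrinks with the cosine of latitude, so a "uniform" degree grid is
# denser east-west near the poles than at the equator.
EARTH_RADIUS_KM = 6371.0

def km_per_degree_longitude(lat_deg: float) -> float:
    return (math.pi / 180.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))

for lat in (0, 45, 60):
    print(f"{lat:>2}N: {km_per_degree_longitude(lat):.1f} km per degree of longitude")
```

At 60˚N a degree of longitude spans half the distance it does at the equator (roughly 56 km versus 111 km).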
How We Do It
We ingest gridded data at 0.1˚ and 0.25˚ resolution and project this data onto the H3 grid.
Temporal coverage is the time interval of data your users have access to, often beginning at a fixed point in the past and extending to the present, or, when using forecasts, to the present plus a fixed interval. It’s especially important when working with historical data, as it could be the difference between gigabytes and terabytes of weather data.
Many popular weather datasets extend to the 1980s, with some station datasets extending to the 19th century, albeit with comparatively limited spatial resolution and granularity. Climate agencies provide forecasts up to sixteen days in advance. At Weather 20/20 we provide a 100+ day forecast by using an alternative forecasting methodology.
The marginal benefit curve here depends on the depth of comparative data sources. If you only have sales data going back to 2012, you cannot use weather data from 2010 to better understand the impact of weather on your sales, much less train a machine learning model. Extreme weather events are rare, so deeper historical coverage will allow you to better understand the impact of rare weather events. It’s best to use at least 10 years if you have matching operational data.
How We Do It
Our historical dataset extends to 1980. This allows us to improve our weather forecasts by analyzing global climate trends and patterns to the extent reliable data exists. It also allows us to satisfy even the most extensive customer demands for historical data analysis.
Our forecast dataset is one of the longest-range in the industry, calculated on a rolling 365-day basis.
Temporal resolution is the frequency of data your platform provides. You might provide data at multiple frequencies, allowing users to choose the best one for their use case. You can down-sample a high-frequency time series to serve lower-frequency data, but you cannot accurately reconstruct a high-frequency series from a low-frequency one.
Hourly, daily and monthly data are broadly available. Your platform might provide several or all of those.
The size of your dataset is proportional to the frequency of your time series. Hourly data is 24 times as large as daily data. This can have a large impact on storage and computing costs.
The right frequency depends on your use case’s sensitivity to intra-day climate variability. For example, a delivery service might experience especially high traffic when there’s a lot of rain between 5pm and 7pm local time and need to increase supply of drivers during that window. If the rain actually arrives between 7pm and 9pm, the delivery service will have wasted money and frustrated its drivers. However, if your use case varies more on a narrative such as “people buy ice cream when it’s warm and sunny”, it might not matter.
Hourly data can be aggregated into daily data, but if users want hourly data but you only have daily data, you’ll need to ingest a new dataset.
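Down-sampling hourly data to daily aggregates is a one-liner in Pandas. A sketch using synthetic values in place of real ingested data:

```python
import pandas as pd

# Down-sample an hourly temperature series to daily min/mean/max.
# The values are synthetic; in practice they come from your ingested dataset.
idx = pd.date_range("2024-07-01", periods=48, freq="h")  # two days of hourly data
hourly = pd.Series(range(48), index=idx, name="temperature")

daily = hourly.resample("D").agg(["min", "mean", "max"])
print(daily)
```

The reverse direction has no such one-liner, which is why offering only daily data can force a whole new ingestion effort later.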
How We Do It
We store and ingest both hourly and daily data, but analyze and forecast on a daily resolution to be more cost-effective. We are working towards providing hourly forecasts based on location-specific patterns of intra-day variability.
There are several types of weather datasets, and understanding which should be used in your data platform is critical. While I tried to make these factors as orthogonal as possible, this one depends on the rest.
The three categories are ground-stations, satellites, and re-analysis. You may end up using all three data sources in your platform.
Ground stations contain instruments including, but not limited to, barometers, anemometers and rain gauges. Some measurements, such as visibility or cloud cover, are manually recorded by a human at the station. I’m sure they’re great at their jobs, but am not a big fan of having a human in the loop there. Ground stations are concentrated around populated areas and airports, meaning your point-of-interest could be over one hundred miles from the nearest station. This can be problematic since small-scale, high-impact weather events could go completely undetected.
Since the Space Race, the collection of weather data via satellites has revolutionized meteorology. Climate agencies around the world have weather satellites in orbit, constantly monitoring electromagnetic radiation reflected from the Earth. Variables such as precipitation and cloud cover can be calculated from the raw data, and then post-processed to yield a uniform grid. These satellite data are the backbone of modern weather forecasting systems.
Re-analysis datasets are one level above the satellite datasets, using similar machinery as numerical weather forecasting models to yield a dataset with additional variables, and a higher spatial and temporal resolution. The bad news is this process inevitably adds some error. The good news is this error is random so you should not see the sum of errors dramatically accumulate over time.
Ground stations are only sufficient if your points-of-interest are near at least one station, but preferably multiple stations as this data has gaps, say if coffee gets spilled on the equipment. All but the most popular ground station datasets require extensive data munging due to archaic formatting choices. Beyond a minimal baseline of temperature and precipitation variables, the set of variables differs among ground stations, so variables corresponding to a particular location in your output dataset could originate from different stations, which can also be problematic.
For example, if:
- The nearest stations reporting high and low temperatures are different, and
- The reported low temperature is higher than the reported high temperature,
then voilà: there is a bug in your dataset.
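A cheap sanity check catches this class of bug after merging station data. A sketch with illustrative field names:

```python
# Illustrative sanity check: after merging variables from different nearby
# stations, flag records whose reported low exceeds the reported high.
# The field names (tmin/tmax) are assumptions for this example.
def find_inconsistent(records):
    """Return records whose reported low temperature exceeds the high."""
    return [r for r in records if r["tmin"] > r["tmax"]]

merged = [
    {"date": "2024-01-01", "tmin": 2.0, "tmax": 9.0},
    {"date": "2024-01-02", "tmin": 7.5, "tmax": 6.0},  # low > high: flag it
]
print(find_inconsistent(merged))
```

Running checks like this at ingestion time is far cheaper than letting the inconsistency surface downstream in a model or dashboard.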
Satellite and re-analysis datasets tend to be more productive to work with due to their uniformity, consistency and often wide range of variables. It’s also nice that these datasets stretch all the way back to the 1980s (with work being done to extend them even further). There’s a specific toolset developed for working with this gridded weather data, but it does not play nicely with the modern data stack. There’s some added uncertainty when using re-analysis data, especially early in the dataset’s temporal coverage. Some industries with strict auditing requirements (think electricity trading) also require data to be traced back to a ground station.
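One way to bridge gridded data and the modern data stack is to flatten the (time, lat, lon) cube into tidy rows that Pandas or Spark can load directly. A minimal sketch with synthetic values standing in for a real gridded dataset:

```python
# Illustrative sketch: flatten a (time, lat, lon) grid into long-format rows.
# The coordinates and values are synthetic.
times = ["2024-01-01", "2024-01-02"]
lats = [40.0, 40.25]
lons = [-105.0, -104.75]
grid = [[[15.0 + t + i + j for j in range(len(lons))]
         for i in range(len(lats))]
        for t in range(len(times))]

rows = [
    {"time": times[t], "lat": lats[i], "lon": lons[j], "temperature": grid[t][i][j]}
    for t in range(len(times))
    for i in range(len(lats))
    for j in range(len(lons))
]
print(len(rows))  # -> 8 (2 times x 2 lats x 2 lons)
print(rows[0])
```

The long format trades storage efficiency for compatibility: every tabular tool understands it, at the cost of repeating the coordinates on every row.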
How We Do It
We use a combination of station, satellite and re-analysis datasets to provide a unified spatially and temporally gapless historical dataset, choosing which values to use based on a combination of proximity and quality control indicators (easier said than done). Using satellite and re-analysis data allows maintaining a gapless, uniform global grid.
By now, the contours of your weather data platform and Weather 20/20’s data platform have begun to take shape. The factors outlined in this post influence the components and characteristics of your data platform, especially evolvability, time to operationalization, and TCO.
If you think Weather 20/20 might be a good fit for your organization you can contact us through our website. Even if not, we would love to provide guidance and hear about your weather data use cases.