In this post, I want to tell you about the data challenges we face at the UK Met Office, and how we think Pangeo could address these problems. Pangeo has lots of different useful features, but here we focus on two we are particularly excited about: elasticity and laziness. But first…
Our use case
The UK Met Office is the UK’s national weather forecaster and climate research institute. We spend a lot of time thinking about how to deal with data. Our on-site high performance computer (HPC) runs physical models of the Earth system, generating hundreds of terabytes of gridded model output every day. Our archive of data is approaching the exabyte scale.
In addition to the sheer volume of data, we also have a wide range of users. Our primary business is issuing the weather forecast for the UK: we have a team of meteorologists who work 24/7 inspecting model output and issuing forecasts. However, we also have several hundred climate and weather scientists on site who want to explore the output from their experimental model runs and publish papers or reports.
Weather forecasting affects a great many things: our customers include the public, aviation, supermarkets, insurance companies, the military, and local councils, to name just a few. In addition, weather is very predictable compared to other external factors.
As the computational capacity of HPCs has grown, so have the detail and skill of our Earth system models. To make this data valuable, we need to analyse it and turn it into useful information for stakeholders. As more areas of the economy start to leverage “big data”, we also increasingly need the ability to mix our datasets with consumer datasets.
What’s the problem?
This all boils down to being able to ask questions of the data — a process that might look something like this:
You iterate round this loop getting closer and closer to your eureka moment of insight, be it a scientific paper, a new product, or advice to a consumer. At the Met Office, we pride ourselves on our expertise in weather and climate science. In an ideal world, these experts would spend the vast majority of their time doing what they’re good at — thinking about atmospheric science — and a very short amount of time waiting for computations to run.
However, this ideal workflow presents a problem — the compute load is inherently volatile.
We spend too much time waiting
We have previously processed all this data using on-site compute clusters, which presents an unpleasant choice:
We can soak up the volatile compute load by having a very large on-site cluster, which copes with inevitable spikes in demand, allowing experts to quickly gain insight from their calculations. However, it is inevitably under-utilised during periods of low demand (nights, weekends), arguably providing bad value for money.
Or, we can smooth out the spikes in demand by putting the experts’ calculations on queues. Our compute cluster now has high utilisation, but our experts have to wait for their calculations to run, again arguably providing bad value for money.
Our approach so far has been to try to sit somewhere in the middle, but as data volumes continue to grow, we are struggling to allow analysts to quickly get insight from their data.
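To put some rough numbers on this trade-off, here is a minimal sketch in Python. The hourly demand figures and the cluster sizes are made up for illustration (they are not Met Office figures), and the queue model is deliberately simplistic: any work that does not fit into a given hour simply waits for the next one.

```python
# Hypothetical demand, in compute nodes wanted per hour, over one day:
# quiet overnight, busy during office hours, with a short afternoon spike.
demand = [2] * 8 + [10] * 4 + [40] * 2 + [10] * 4 + [2] * 6


def simulate(demand, capacity):
    """Return (utilisation, peak_backlog) for a fixed-size cluster with a queue."""
    backlog = 0       # node-hours of work still waiting
    done = 0          # node-hours of work completed
    peak_backlog = 0
    for wanted in demand:
        work = backlog + wanted
        processed = min(work, capacity)
        done += processed
        backlog = work - processed
        peak_backlog = max(peak_backlog, backlog)
    utilisation = done / (capacity * len(demand))
    return utilisation, peak_backlog


# Option 1: a large cluster sized for the peak. Option 2: a small cluster with a queue.
big_utilisation, big_backlog = simulate(demand, capacity=40)
small_utilisation, small_backlog = simulate(demand, capacity=10)

print(f"large cluster: {big_utilisation:.0%} utilised, peak backlog {big_backlog} node-hours")
print(f"small cluster: {small_utilisation:.0%} utilised, peak backlog {small_backlog} node-hours")
```

With these made-up numbers, the large cluster never makes anyone wait but sits idle for about 80% of the day, while the small cluster is busy about 73% of the time but builds up a backlog of 60 node-hours during the spike.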
But it’s not just about efficiency
When computation time is a limiting factor, there is an inevitable pressure on analysts to look for things they expect to see in places they expect to see them. This is a very bad thing, as it stops us getting proper value for money from our data.
How does Pangeo help?
The Pangeo approach has many attractive qualities: support for gridded data, thin-client, modularity, flexibility, and of course its burgeoning international community. But there are two principles in particular that can help with the challenges mentioned above.
Firstly, Pangeo can be run elastically in the cloud. This means that it can quickly run compute-intensive calculations while remaining cost-efficient. This is possible because cloud compute providers serve so many customers simultaneously that they are able to smooth out spikes in demand. Currently in Pangeo we use Kubernetes to elastically scale our compute clusters.
Secondly, Pangeo is “lazy” when it evaluates and accesses data. That is, computations are only executed when they are needed, with as little as possible pre-calculated, and only the data points needed for a given calculation are accessed. We use Dask to parallelise and distribute our computations. Dask has lazy evaluation (also called just-in-time evaluation) built in.
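As a small illustration of what lazy evaluation looks like in practice, here is a sketch using dask.array. The array size and chunking are arbitrary choices for the example, not values taken from our systems.

```python
import dask.array as da

# Describe a 4,000 x 4,000 array of ones, split into 1,000 x 1,000 chunks.
# Nothing is allocated yet: Dask only records how to build each chunk.
x = da.ones((4_000, 4_000), chunks=(1_000, 1_000))

# Still lazy: this extends the task graph rather than doing any arithmetic.
total = (x + 1).sum()
print(type(total))

# Only now is the work actually executed, chunk by chunk.
result = total.compute()
print(result)
```

Until `compute()` is called, `total` is just a description of work to be done, which is what lets Dask decide how to split that work up and what it can skip.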
From an analyst’s point of view, this system feels very similar to a non-lazy system. For example, we might want to…
- Load multiple data files covering lat x long x time, and 12 physical variables
- Calculate the climatologies (i.e. the mean over all points in the same calendar month, for each variable)
- Plot the 12 monthly values for temperature climatology for the grid point over Exeter, UK.
A non-lazy system might load the entire data set into compute node memory, calculate climatology fields for each of 12 variables, and then return the 12 monthly values comprising the temperature climatology for Exeter. But most of the data we calculated is never used in the final returned values (as it’s not over Exeter, or not temperature data).
In contrast, a lazy system loads only metadata in step one, and registers a bunch of latent computations in step two. Only in step three does a lazy system start to actually load and process data. At this point, a lazy system can inspect the latent computations in order to load only the data needed to perform only the calculations required. So, while parallelisation speeds up computations, laziness ensures we do the minimum amount of work possible.
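The three steps above can be sketched with Dask directly. This is a toy example under stated assumptions, not our production code: the `load_block` function is a hypothetical stand-in for reading from a file, the grid is tiny, only three variables are used instead of 12, and the grid indices for Exeter are invented.

```python
import dask
import dask.array as da
import numpy as np

N_YEARS, N_LAT, N_LON = 3, 4, 4
N_MONTHS = 12 * N_YEARS
VARIABLES = ["temperature", "pressure", "humidity"]  # stand-ins for the 12 variables
EXETER_LAT, EXETER_LON = 2, 1  # hypothetical grid indices for Exeter

loaded = []  # records every (variable, latitude row) that is actually read


def load_block(variable, lat_index):
    """Hypothetical stand-in for reading one latitude row of one variable from a file."""
    loaded.append((variable, lat_index))
    month_of_year = (np.arange(N_MONTHS) % 12).astype(float)
    # Fake data: every value is simply its calendar month (0 to 11).
    return np.tile(month_of_year[:, None, None], (1, 1, N_LON))


def open_variable(variable):
    """Step 1: describe the data (shape and dtype) without reading any of it."""
    rows = [
        da.from_delayed(
            dask.delayed(load_block)(variable, lat_index),
            shape=(N_MONTHS, 1, N_LON),
            dtype=float,
        )
        for lat_index in range(N_LAT)
    ]
    return da.concatenate(rows, axis=1)  # dimensions: time x lat x long


def climatology(field):
    """Step 2: register the mean over all years for each calendar month."""
    return da.stack([field[month::12].mean(axis=0) for month in range(12)])


fields = {name: open_variable(name) for name in VARIABLES}
climatologies = {name: climatology(field) for name, field in fields.items()}
loaded_before_compute = list(loaded)  # still empty: nothing has been read

# Step 3: asking for actual values triggers loading, but only of what is needed.
exeter = climatologies["temperature"][:, EXETER_LAT, EXETER_LON].compute()

print(loaded_before_compute)
print(sorted(set(loaded)))
print(exeter)
```

In this sketch, nothing is read during steps one and two, and step three reads only the temperature row containing the chosen grid point. The pressure and humidity data, and all the other latitude rows, are never touched.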
This “lazy” approach means we do the minimal amount of work, which is always a good thing. However, it also means we can now choose between crystallised “non-lazy” data (which comes with the cost of storing and serving it) and latent “lazy” data (which transfers the cost onto compute cycles).
For further reading, keep an eye out for the book chapter Giving Analysts back their flow: Analysing Big Atmospheric Data-sets in the Cloud, in Big Earth Data Analytics in Earth, Atmospheric and Ocean Sciences, Wiley, Ed: T. Huang et al., in review.
The Future of Pangeo at the Met Office
Ultimately, we’d like analysts to use Pangeo without even knowing it exists. That’s why we’ve released a new major version of Iris, our Python module for analysing Earth science data. Now high-level, expressive Iris operations can be distributed across elastic Pangeo instances under the hood.
We also want to make full use of the flexibility of Pangeo. We want to be able to seamlessly orchestrate tasks across our HPC and on-site compute clusters, and to burst out to the elastic cloud when needed. This next-level orchestration is a work in progress, but all the technologies are in place.