Analysis ready data
Here at the Informatics Lab, we’ve been looking at tools, techniques and technologies for what we are calling Cloud-Ready Data. Getting the most out of moving to the cloud means much more than just sticking your data in the object store of your choice and continuing to do what you always did.
Cloud-Ready Data should:
- Efficiently support complex queries on huge datasets
- Support highly parallel read and write
- Store highly complex many-dimensional datasets
- Ensure metadata and data relations are maintained
- Integrate with high-level tools that allow working in the domain paradigm instead of the data paradigm.
To this end, we have been researching the capacity of Zarr and TileDB to meet these requirements. We have blog posts on our findings so far with Zarr and TileDB, so do read those if you want a more technical dive. This piece, however, focuses on a higher-level concept that has come out of this work: Analysis Ready Data.
Analysis ready data
At the Met Office we create a lot of data, and great care is taken to ensure that it is accompanied by a fully descriptive set of metadata, usually conforming to one or more international standards or conventions. This gives you everything you need to understand the data. Job done, right? Not quite.
Let’s illustrate this with our AWS Earth open datasets. This is a 7-day rolling archive of our four most important atmospheric models. Conceptually it’s four datasets, one for each model. Consider one of these datasets, MOGREPS-UK. It consists of fully self-describing NetCDF files: each file perfectly describes the domain, phenomenon, aggregation methods, creation time, etc. of the data it contains. The problem? A 7-day archive contains about 2 million files. What’s more, these files represent a series of complex rolling lagged ensembles with overlapping validity times and different run lengths at different times of day; each phenomenon is in a different file; files have randomly assigned filenames; and there are other obscurities galore. On top of this, the whole dataset is so huge that you need specialist environments and toolsets to work with it. Each file is a beautifully self-describing entity, but the whole dataset is an incomprehensible beast that’s virtually impossible to work with. This is the opposite of Analysis Ready Data.
Analysis ready data is data made available with the tools, documentation and infrastructure to allow instant and easy analysis across the entire domain.
A simple real-world example
We are working with some colleagues to expose a high-resolution climate reanalysis of China for the past 160 years. Compared to the MOGREPS-UK dataset mentioned above, it is significantly smaller in data volume, has far fewer files, is less dimensionally complex and is static. Still, at approximately three terabytes and thousands of files, it’s proving a challenge to transform into an Analysis Ready Dataset (see the other blog posts mentioned above for some details). Here I want to demonstrate why this significant effort is worth every drop of blood, every bead of sweat and every bitter tear.
To compare our various approaches to Cloud-Ready Data, we are creating benchmarks for this dataset and performing the same analysis with each of the different data formats. We want to understand the raw performance, but also to have some sort of gauge of the cognitive burden and other ‘soft’ metrics. One of our benchmarks is plotting a rolling mean of some data for a given grid point. With our Analysis Ready Dataset stored as a Zarr this is how we do it:
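For illustration, here is a minimal sketch of such a rolling mean using xarray. It is not the benchmark code itself: it assumes xarray, pandas and numpy are installed, and it builds a small synthetic dataset in memory so that it runs without any data. The variable name `air_temperature`, the coordinate names, the grid point and the 30-day window are all hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_synthetic_dataset() -> xr.Dataset:
    """Build a small stand-in for an analysis-ready dataset (hypothetical layout)."""
    time = pd.date_range("2000-01-01", periods=365, freq="D")
    latitude = np.linspace(20.0, 50.0, 7)
    longitude = np.linspace(75.0, 135.0, 13)
    rng = np.random.default_rng(seed=0)
    shape = (time.size, latitude.size, longitude.size)
    values = 280.0 + rng.normal(scale=5.0, size=shape)
    return xr.Dataset(
        {"air_temperature": (("time", "latitude", "longitude"), values)},
        coords={"time": time, "latitude": latitude, "longitude": longitude},
    )


def rolling_mean_at_point(ds: xr.Dataset, lat: float, lon: float, window: int = 30) -> xr.DataArray:
    """Rolling mean over time at the grid point nearest to (lat, lon)."""
    point = ds["air_temperature"].sel(latitude=lat, longitude=lon, method="nearest")
    # The first (window - 1) values are NaN because the window is not yet full.
    return point.rolling(time=window).mean()


# Assumption: with a real Zarr store, ds would come from xr.open_zarr(...) instead.
ds = make_synthetic_dataset()
smoothed = rolling_mean_at_point(ds, lat=39.9, lon=116.4, window=30)
```

With a real Zarr store, the dataset would typically be opened with `xarray.open_zarr` rather than built synthetically, and `smoothed.plot()` would draw the line if matplotlib is available. The point of the sketch is that selecting a grid point and taking a rolling mean are each a single call once the whole dataset is exposed as one object.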
Now here is the same analysis but using a collection of a few hundred NetCDF files (that the Zarr above was derived from).
Without going into the details of these examples it is clear that there was a huge burden placed on me as a result of the dataset not being Analysis Ready. Here are some of the things I had to solve.
- The dataset can’t be loaded all at once, not even just the metadata, because the data format is not appropriate for object storage. I had to manually load the odd file here and there and try to figure out the ‘schema’ of the dataset.
- The filenames encode information about what is in the files, but this encoding was not intuitive. I had to seek out the data authors to find the set of rules that decode a filename into a date-time.
- The HDF5 library used under the hood of Iris didn’t like it when the same file was opened by more than one process. I had to further complicate my analysis by working out how to batch the rolling mean in such a way that no one file was being accessed by two processes concurrently.
- I had to manually manage batching, processing and re-aggregating.
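To give a flavour of the manual work in the list above, here is a rough sketch using only the Python standard library. It is not the code used for the benchmark. The filename rule (`reanalysis_YYYYMMDD.nc`), the `read_point_value` stub and the batch and window sizes are all hypothetical, chosen only so that the sketch runs without any data.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

# Hypothetical naming rule, for illustration only: "reanalysis_YYYYMMDD.nc".
FILENAME_PATTERN = re.compile(r"reanalysis_(\d{8})\.nc$")


def filename_to_datetime(filename):
    """Decode the date encoded in a filename under the assumed naming rule."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unrecognised filename: {filename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d")


def split_into_batches(filenames, n_batches):
    """Split files into contiguous, non-overlapping batches in time order."""
    ordered = sorted(filenames, key=filename_to_datetime)
    size = -(-len(ordered) // n_batches)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def read_point_value(filename):
    """Stub: a real version would open the NetCDF file and extract one value.

    Here it returns the day of the month so the sketch needs no data.
    """
    return float(filename_to_datetime(filename).day)


def process_batch(batch):
    """Read every file in one batch; each file is opened by this worker only."""
    return [(filename_to_datetime(name), read_point_value(name)) for name in batch]


def rolling_mean(values, window):
    """Trailing rolling mean; positions without a full window are None."""
    means = []
    for i in range(len(values)):
        if i + 1 < window:
            means.append(None)
        else:
            means.append(sum(values[i + 1 - window:i + 1]) / window)
    return means


start = datetime(2000, 1, 1)
filenames = [
    "reanalysis_" + (start + timedelta(days=d)).strftime("%Y%m%d") + ".nc"
    for d in range(10)
]

batches = split_into_batches(filenames, n_batches=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_batch, batches))

# Re-aggregate: flatten the batches back into one time-ordered series.
series = sorted(pair for batch in results for pair in batch)
smoothed = rolling_mean([value for _, value in series], window=3)
```

Because every file belongs to exactly one batch, no file is ever opened by two workers at once, and the rolling mean is only taken after the per-batch results have been stitched back together. All of this scaffolding exists purely to work around the layout of the data, not to express the analysis.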
As a result of the above cognitive burden, A) it took a lot longer than working with Analysis Ready Data and B) I’m fairly sure I’ve done it wrong. Frustratingly, the fact that I’ve gone through this cognitively taxing process doesn’t necessarily help another user of the data, who may well be battling with the same issues right now. Finally, it doesn’t even necessarily help me when I want to perform a different analysis, as many of the solutions I found were specific to this one and may be of limited use elsewhere.
The above complaints are in stark contrast to Analysis Ready Data. With Analysis Ready Data, the task of understanding the quirks and intricacies of the data is largely handed to the data author, who is best placed to understand them. By placing this responsibility on the author, the solutions to these issues become reusable and their benefits are reaped by every subsequent user of the data.
Analysis Ready Data removes cognitive burden from every single user of a dataset and places it where the answers are most likely to be known and the solution reused.
Increasingly the projects and datasets the Met Office produce have the mandate to be “open”. However, if you publish a dataset that isn’t Analysis Ready it is not truly open. It becomes open to a small niche of domain specialists or perhaps a small number of companies or organisations that can afford and understand how to run a compute cluster and create sophisticated distributed algorithms. This excludes many people including the small to medium enterprises that I work with each week who are keen to use this data to power their businesses but find it inaccessible to them even if it is “open”.
Be a hero to your users
If you are a data author or publisher, take the hit and try to publish Analysis Ready Data. The not inconsiderable effort will be quickly recouped by your users, and likely by you as well.