How to (and not to) handle metadata in high momentum datasets

Theo McCaie
Apr 11, 2019 · 8 min read

In my previous article on big and fast moving (high momentum) datasets I discussed ways to efficiently manage the data within huge and rapidly changing datasets. However, I concluded that there was currently no clear way of safely and efficiently managing the metadata (data about the data). Here I present some of the challenges and potential solutions to that problem. The aforementioned post is recommended but not required reading.

What’s metadata?

Here is some data:

13.7

It’s a nice number but without metadata, it’s meaningless. Metadata is the context that gives numbers, figures or data meaning.

Let’s give our data (13.7) some metadata. Units: “Billion Years”, description: “Age of the universe”, source: “hubblesite.org”. Suddenly our data has meaning; that’s the power of metadata.
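As a minimal sketch (the key names here are illustrative, not a fixed convention), that data/metadata pairing can be written down directly:

```python
# Data without metadata is just a number; the metadata supplies context.
# The key names below are illustrative, not a fixed convention.
data = 13.7

metadata = {
    "units": "Billion Years",
    "description": "Age of the universe",
    "source": "hubblesite.org",
}

# Together they carry meaning: 13.7 billion years, the age of the universe.
labelled = {"value": data, **metadata}
```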

At the Met Office we handle petabytes of data in all different shapes, sizes and file types. If this data didn’t have reliable metadata associated with it we couldn’t do what we do.

A typical data set

To explore the problems of metadata handling in large, fast-moving datasets, let’s work with an air temperature dataset as a simple example. This dataset gives you forecasts for the next three days on a two by four latitude-longitude grid. Here is a pictorial representation:

This is a 3D dataset with the dimensions time of forecast (forecast reference time), latitude and longitude. Many of our datasets have more dimensions, such as forecast period, height or ensemble member.

Saving as Zarr

Sticking with our simple example, if we use xarray to save this data set as Zarr at root my.zarr it would look something like this:

This places the data in one array, my.zarr/air_temperature, and includes some metadata, like the units, in the .zattrs of this array. However, dimension metadata (such as which lat-long points the data covers) gets placed in extra arrays, one for each dimension. An element (dimensions) in the .zattrs informs the client that these other arrays are dimensions of this data array and the order in which they apply.
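A sketch of the resulting layout, modelled as a plain dict of object paths (the paths and the _ARRAY_DIMENSIONS attribute follow the Zarr v2 convention that xarray uses; the chunk contents are placeholders):

```python
# The data array's own metadata (units etc.) lives in its .zattrs, while
# each dimension's coordinate values live in a separate Zarr array.
# _ARRAY_DIMENSIONS is the attribute xarray uses to link them.
store = {
    "my.zarr/air_temperature/.zattrs": {
        "units": "K",
        "_ARRAY_DIMENSIONS": ["forecast_reference_time", "latitude", "longitude"],
    },
    "my.zarr/air_temperature/0.0.0": "<data chunk bytes>",
    # one extra array per dimension holds that dimension's coordinates
    "my.zarr/forecast_reference_time/0": "<coordinate chunk bytes>",
    "my.zarr/latitude/0": "<coordinate chunk bytes>",
    "my.zarr/longitude/0": "<coordinate chunk bytes>",
}

# The client reads _ARRAY_DIMENSIONS to find which arrays are dimensions
# of air_temperature and the order in which they apply.
dims = store["my.zarr/air_temperature/.zattrs"]["_ARRAY_DIMENSIONS"]
```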

Updates on moving data

This structure works ok with static datasets, but if we consider a dataset that moves or changes we come across problems. Let’s work with the example of a forecast that rolls forward in time, providing a new forecast every day and dropping the oldest forecast at 3 days old. In my previous article, I discussed how using an offset attribute could allow the data in a Zarr to be safely rolled forward without creating any inconsistencies. Now, however, there is the added complication of also updating the metadata.

The simplest course of action would be to update the chunk my.zarr/forecast_reference_time/0 on each new forecast. This chunk contains the array of forecast datetimes that describe the data. On update, the new forecast datetime would be added at the end and the earliest dropped. This would work if the Zarr were backed by some sort of ACID storage, such that the rolling of the data and the updating of the metadata either both happen or both fail, and this consistent view is shared with anyone who accesses the data. This is not the case for the likes of S3 or other object stores, which are the foundation of a lot of big data storage.

When reading data on an eventually consistent storage backend like S3, users could get one of four outcomes:

  1. They access the updated data and updated metadata, this is the ideal scenario.
  2. They access the old data and old metadata. This scenario is ok because they access valid data and metadata, though not the very latest.
  3. They access the updated data but old metadata. This is very bad, they are being misled about what data they are looking at.
  4. They access the old data but updated metadata. Again, a bad scenario where the data is being misrepresented.

The first case is great, the second is fine but the other two are dangerous. To the user this looks like success and so they carry on and use the data, but believe it to be something it is not. If instead of daily forecast updates you imagine 12-hourly updates then in the two fail cases the user would get night time temperatures presented as daytime and vice versa. This is clearly not acceptable and we are much more worried about these silent failures/corruptions than a situation where things ‘just break’ and the data is inaccessible/unreadable.
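These four outcomes can be enumerated in a toy model of an eventually consistent read, where the reader may see old or new versions of the data chunk and the metadata chunk independently:

```python
# A toy model of reading after a non-atomic update on an eventually
# consistent store: the data chunk and the metadata chunk may each be
# old or new, independently of one another.
from itertools import product

versions = ["old", "new"]
outcomes = [
    (data_v, meta_v, data_v == meta_v)  # consistent only when they match
    for data_v, meta_v in product(versions, versions)
]

# Only 2 of the 4 combinations are self-consistent; the other two
# silently misrepresent the data.
consistent = sum(1 for *_, ok in outcomes if ok)
```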

Solution one: put it all in one place

The fundamental problem is that the metadata and the data have been split out into different objects. In our example we have units information in one place (the .zattrs) and coordinate information in another (different Zarr arrays). When these different objects are updated we cannot guarantee that changes to all objects will be in sync on read.

Perhaps the simplest option is to take all the metadata and place it in the .zattrs object. This way when the dataset is rolled by updating the offset in .zattrs the metadata could be updated simultaneously. Either you get the newer rolled data set and the updated metadata or you get the older non-rolled dataset and older metadata. Either way the dataset is consistent.
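A sketch of this approach (the offset attribute comes from the previous article; the coordinate values and other attribute names are illustrative), where rolling the dataset becomes a single object write:

```python
# Solution one: all coordinate metadata lives inside the data array's
# .zattrs, so rolling the dataset updates only this one object.
import json

zattrs = {
    "_ARRAY_DIMENSIONS": ["forecast_reference_time", "latitude", "longitude"],
    "units": "K",
    "offset": 0,  # rolling offset from the previous article
    # coordinate values stored inline instead of in separate Zarr arrays
    "forecast_reference_time": ["2019-04-09", "2019-04-10", "2019-04-11"],
    "latitude": [50.0, 55.0],
    "longitude": [-10.0, -5.0, 0.0, 5.0],
}

# rolling forward: bump the offset and the coordinates in one object, so
# a reader sees either the old pair or the new pair, never a mixture
zattrs["offset"] += 1
zattrs["forecast_reference_time"] = ["2019-04-10", "2019-04-11", "2019-04-12"]
new_zattrs_object = json.dumps(zattrs)
```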

In most cases metadata is small compared to the data, so adding it to an array’s attributes shouldn’t be too much of a concern. For a fairly typical example, our UK-V model is on a lat-lon grid of 970 by 1042 points. If the values of the grid are stored as 1D arrays in the .zattrs JSON, we estimate they would take up about 20kB, which doesn’t bloat our .zattrs object overly much. If instead the coordinates are expressed as 2D arrays, which is not uncommon, then for the same 970 by 1042 lat-long points the metadata is more like 40MB. That is a much heavier addition, though still small next to the data itself, so this remains a feasible option.
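As a back-of-envelope check on those figures (value counts only; the bytes-per-value of the JSON encoding is the rough assumption behind the 20kB and 40MB estimates):

```python
# How many coordinate values the UK-V grid needs when stored in .zattrs.
nlat, nlon = 970, 1042

# 1D coordinates: one value per row plus one per column
values_1d = nlat + nlon      # ~2 thousand values -> roughly 20kB of JSON
# 2D coordinates: a latitude and a longitude value for every grid point
values_2d = 2 * nlat * nlon  # ~2 million values -> tens of MB of JSON
```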

This solution would look something like this:

Schematic of an air temperature dataset saved as Zarr with metadata stored in the attributes of the data array.

Solution two: version your metadata

The other option is to version your metadata and ensure that the data array references a specific version of the metadata. This could be done by using the versioning system present in many object store implementations but perhaps simpler is to create new metadata at a new path.

In this scenario, when the data array is rolled we create a new array for the new metadata and put it at some unique path, such as my.zarr/forecast_reference_time_v1/, then my.zarr/forecast_reference_time_v2/ on the next update, and so on. The element (dimensions) in the data array’s .zattrs that records the link between the data array and the metadata array(s) would then need to be updated to this new path at the same time that the offset is updated. This is illustrated below, with modified objects in blue, deleted objects in red and new objects in green:
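A sketch of the roll, with a plain dict standing in for the object store (roll_forward is a hypothetical helper, not a real zarr API; the attribute names follow the xarray convention):

```python
# Solution two: write the new coordinates at a fresh versioned path, then
# update the data array's .zattrs to point at that path. A reader either
# sees a consistent pairing or fails to find an object and gets an error.
import json

def roll_forward(store, version, new_times):
    # 1. write the new coordinate array at a unique, versioned path
    new_path = f"forecast_reference_time_v{version}"
    store[f"my.zarr/{new_path}/0"] = new_times
    # 2. update the data array's attributes to reference the new path
    attrs = json.loads(store["my.zarr/air_temperature/.zattrs"])
    attrs["_ARRAY_DIMENSIONS"] = [new_path, "latitude", "longitude"]
    attrs["offset"] = attrs.get("offset", 0) + 1
    store["my.zarr/air_temperature/.zattrs"] = json.dumps(attrs)

# a plain dict standing in for an eventually consistent object store
store = {
    "my.zarr/air_temperature/.zattrs": json.dumps(
        {"_ARRAY_DIMENSIONS": ["forecast_reference_time_v0", "latitude", "longitude"]}
    ),
    "my.zarr/forecast_reference_time_v0/0": ["2019-04-09", "2019-04-10", "2019-04-11"],
}
roll_forward(store, 1, ["2019-04-10", "2019-04-11", "2019-04-12"])
```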

The updates to these objects can happen simultaneously because they cannot lead to corrupt data. If on read you get the updated .zattrs, you either get the updated metadata too or you cannot find the updated metadata and get an error. On the other hand, if you get the old .zattrs, you either also get the old metadata or, if it has been deleted, you get an error. In every scenario you get consistent data or an error that you can handle. The dangerous misrepresented-data problem is avoided. Since the consistency windows of these object stores are short, any errors would likely be resolved with a retry.

In this system we can continue to treat metadata in the same manner as data, storing it in Zarr arrays. This is good as metadata is just data and the line between what is data and what is metadata isn’t hard and sharp. Treating all data the same does have a certain appeal. It also allows for more than one data array to point to the same metadata, a potentially useful feature.

The third way: generative metadata

There is a third way (although perhaps it’s a special case of one of the other solutions) which uses generative/algorithmic metadata. Algorithmic metadata could work for data that evolves predictably, just like our regularly issued forecast.

Using this method, rather than storing the metadata in the .zattrs (or in some other way ‘with’ the data), the rules for creating the metadata are stored. In English this might be something like: “there are 24x7 forecast reference times, one every hour, starting from now minus 7 days rounded down to the nearest hour”. The beauty of this system is that you never need to update the metadata as long as your data production continues to follow the same predictable algorithm.

In this algorithmic system, the indexing needs to be aware of this evolution. By this we mean that when you ask for data[-1] today you get the latest forecast, and when you ask for data[-1] tomorrow you get the new latest forecast. One way of achieving this, as mentioned in the previous post, is by setting the offset to an algorithm instead of a static value (e.g. offset = x * the number of whole hours past some date). Another way is for the indices of the objects to be intrinsically tied to the metadata that represents them. For example, asking for data[-1] would not look for object my.zarr/3 but rather for object my.zarr/{todays_date}_{current_hour}. This is both simpler and more complex, and we are making exciting progress with work in this area, so watch this space.
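A sketch of both ideas (the helper names are illustrative): the coordinate values are computed from the rule above rather than stored, and the object key for a chunk is derived from the datetime it represents:

```python
# Generative metadata: the forecast reference times are computed from a
# rule, so they never need updating. Helper names are illustrative.
from datetime import datetime, timedelta

def forecast_reference_times(now):
    """24x7 hourly reference times, one every hour, starting from
    now minus 7 days rounded down to the nearest hour."""
    start = now.replace(minute=0, second=0, microsecond=0) - timedelta(days=7)
    return [start + timedelta(hours=h) for h in range(24 * 7)]

def chunk_key(t):
    """Tie the object key to the metadata it represents."""
    return f"my.zarr/{t:%Y-%m-%d_%H}"

times = forecast_reference_times(datetime(2019, 4, 11, 9, 30))
# the last of the 168 hourly times lands one hour before "now",
# and its object key follows directly from its datetime
latest = chunk_key(times[-1])  # my.zarr/2019-04-11_08
```

Asking for data[-1] tomorrow re-runs the same rule against the new “now”, so the index rolls forward without any metadata object ever being rewritten.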

Not Zarr’s problem

Many of the ideas I’ve talked about here are based around the Zarr storage specification, but they are likely relevant to other storage paradigms too. However, much like NetCDF is a specification built on top of HDF5, it’s likely that the solution to these problems lies higher up the specification hierarchy than Zarr. We believe that Zarr (and other) specification development needs to be aware of these problems, because these tools need to facilitate the solutions even if they don’t directly provide them.

Embrace eventual consistency, embrace constant change…

If there is one takeaway from this article, it is that we need future storage specifications to embrace eventual consistency and constant change. Our current thinking is that the following three points are probably necessary to achieve this:

  1. Ensure that versioning can be achieved. When updates happen in an eventually consistent world you will have multiple versions of one or more objects. This needs to be explicitly handled rather than left to chance.
  2. Allow for partial updates. Versioning needs to allow changing the things that need to change without needing to change those that don’t. Perhaps the best example of this is the current problem with prepending to Zarrs.
  3. A single entry point from which all else cascades. There needs to be a single source of truth from which to start when reading a dataset, and from which access to all other objects proceeds predictably. For example, if I update a Zarr by changing attributes in .zarray and .zattrs, there is no telling which combination of new and old versions of these objects a user will get on read. This could create misleading or corrupt data. If .zarray pointed to a specific version of .zattrs (or .zattrs was a field within .zarray) this would be resolved.

Informatics Lab

Met Office Informatics Lab — Pushing the boundaries of technology & design to make environmental science and data useful

Thanks to Rachel Prudden and Kevin Donkers
