Storing Cloud-Ready Geoscience Data with TileDB
Representing thousands of NetCDF files in a single logical dataset
The Helicopter View
Not enough time to read the whole blog post? Here’s a quick summary:
TileDB offers some great functionality which could be very powerful if applied to earth system science. We are actively collaborating with the TileDB Team to make TileDB functionality available alongside existing earth system science tooling, which we think could be a game-changer for storing and interacting with earth system science datasets.
Data volumes in computational data science are increasing. This is as true in earth system science as in other disciplines. For the Met Office this means about 14TB per day of operational (that is, weather and climate forecast) data being produced by the Met Office supercomputer, with some estimates putting daily research data volumes at 8x that.
All this data needs to be found and accessed by people who wish to use it in their own research and, more pertinently for this blog post, it also needs to be stored somehow. Typically the ‘how’ of storing earth system data has been a variety of formats, with a slight preference towards NetCDF files. Often a single logical dataset is represented with hundreds or even thousands of discrete files. Reading and working with these files can introduce a significant code and mental overhead that is a blocker to getting on with using their contents in earth system research.
In this blog post I will detail some recent exploratory work we’ve been doing in the Informatics Lab with TileDB, a reasonably new and cloud-ready approach for storing large volumes of data.
TileDB is a new, open-source storage engine for chunked, compressed nD arrays. It is similar to Zarr as it introduces a cloud-native, parallel engine for dense arrays, and is an alternative to HDF5. Over and above the features offered by Zarr, TileDB also supports sparse arrays, offers multiple language APIs — it is built in C / C++, and provides efficient wrappers for Python, R, Java and Go — implements time traveling (more on this later), and integrates with many popular tools (Spark, dask, PrestoDB, MariaDB, GDAL and PDAL). TileDB can store data on multiple cloud backends, including AWS S3, Azure Blob Storage, and Google Cloud Storage.
TileDB and Earth System Science Data
We can use TileDB to store earth system science data, which opens up some exciting new possibilities. In the Lab we are interested in using TileDB to store very large datasets, which may be composed of thousands or tens of thousands of individual files, or of fewer, much larger files on the order of tens of terabytes. What excites us about TileDB is that it offers a genuine solution for representing all this data, however it is formed, in a single logical TileDB array.
In order to completely store the contents of NetCDF files in any storage engine (TileDB, Zarr, or something else), you effectively need to be able to store three things: nD arrays of data, key-value pairs of arbitrary array metadata, and label arrays for the axes of the nD arrays. With TileDB you can store nD arrays with arbitrary metadata, but there is currently no native support for labelled array axes, although this is on the TileDB roadmap.
TileDB does not currently integrate with Iris or xarray, the common libraries for interacting with earth system science data in Python. These libraries provide lots of useful functionality that we'd like to make use of for TileDB arrays too. On account of this, we have written an adaptor that bridges the gap between earth system data, TileDB arrays and high-level libraries like Iris and xarray.
We’ve seen two areas where there’s a need for an interoperability library that connects TileDB to an existing Python ecosystem: for making the link between TileDB arrays and analysis libraries such as Iris and xarray, and for writing TileDB arrays containing extra metadata that’s specific to earth system science data.
We wrote a Python package to allow us to create TileDB arrays from NetCDF files and interact with these arrays using Iris and xarray. The package provides three primary functions:
- A data model for NetCDF files as loaded using the netcdf4-python library, which classifies NetCDF variables as describing data arrays, coordinate arrays, or other metadata.
- A writer to convert NetCDF files represented using the data model (data and metadata) to TileDB arrays.
- A reader to load TileDB arrays (data and metadata) into high-level libraries like Iris and xarray.
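To give a feel for the first of those functions, the classification step can be thought of roughly as follows. This is a simplified sketch of the idea, not the package's actual implementation: a variable named after one of the file's dimensions labels that axis, a variable spanning several dimensions carries the data payload, and everything else is treated as other metadata.

```python
def classify_variables(variables, dimensions):
    """Crudely classify NetCDF variables as data, coordinates or other.

    variables: mapping of variable name -> tuple of its dimension names
    dimensions: set of dimension names present in the file
    """
    classified = {"data": [], "coords": [], "other": []}
    for name, dims in variables.items():
        if name in dimensions:
            # A variable named after a dimension labels that axis.
            classified["coords"].append(name)
        elif len(dims) >= 2:
            # Multi-dimensional variables hold the data payload.
            classified["data"].append(name)
        else:
            # Scalar and similar variables: other metadata.
            classified["other"].append(name)
    return classified

variables = {
    "time": ("time",),
    "latitude": ("latitude",),
    "longitude": ("longitude",),
    "air_temperature": ("time", "latitude", "longitude"),
    "forecast_period": (),
}
result = classify_variables(variables, {"time", "latitude", "longitude"})
# result["data"] == ["air_temperature"]
```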
This package is available on GitHub. Writing it was necessary to bridge the gap between existing, metadata-rich file formats such as NetCDF and high-level libraries such as Iris on the one hand, and the TileDB array specification on the other. With it, we've been able to create TileDB arrays composed of data and metadata from thousands of individual NetCDF files, and load these into Iris as a single Iris cube object.
This is a really exciting result, as it is the biggest Iris cube that we have created in the Informatics Lab to date! It shows that TileDB can be used to store the data and metadata from many NetCDF files in a form that is loadable as a single Iris cube. As we've recently blogged about elsewhere, it's possible to make such an Iris cube directly from the input NetCDF files, but the cognitive burden is much higher than simply loading a single TileDB array directly as a single Iris cube.
Going further, we can leverage dask and functionality available in Iris to perform operations on such cubes that are backed by TileDB arrays, such as extracting a single latitude/longitude grid point, or performing a statistical collapse across all time points.
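In spirit, these two operations reduce to familiar array manipulations. With a (time, latitude, longitude) array, the NumPy equivalents of what Iris and dask carry out lazily over the TileDB-backed cube look roughly like this:

```python
import numpy as np

# A toy (time, latitude, longitude) data cube.
data = np.arange(24.0).reshape(4, 2, 3)

# Extract a single latitude/longitude grid point: a time series.
point_series = data[:, 1, 2]  # shape (4,)

# Statistically collapse across all time points: a mean field.
time_mean = data.mean(axis=0)  # shape (2, 3)
```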
TileDB’s Unique Capabilities
Our exploration has uncovered a lot of interesting TileDB functionality that we can make use of when storing earth system data. Let's take a look at some of it in a bit more detail.
All data within a TileDB array must be contained within the array’s domain — the labelled, nD space that describes the limits of the array. One or more axes of the domain can be made effectively infinite in length, however, meaning that near unlimited appends of data to that axis become practicable. One good use-case for this is storing data from a very long time period in a single array.
You may think that an infinite domain could make indexing a very expensive operation. Fortunately TileDB includes clever indexing, and the concept of a non-empty domain — the extent of the indices of your array that have actually been written — to make sure this isn’t a hefty overhead.
The contents of each write to a TileDB array are stored in a fragment, but successive writes do not overwrite previous ones. Instead each write fragment is time-stamped and kept. On read, the TileDB API by default returns the data matching the read keys from the most recent fragment.
The fact that older fragments are retained, however, means that you can explore the array from times in its history, or time travel through the array. Another way of looking at this is that TileDB implements a form of data version control.
Of course, storing all this historical array data requires storage space, and can make read operations slow as multiple fragments may need to be accessed. TileDB has a consolidation algorithm that takes the most recent write to each cell within the array and makes it available in just a single fragment, which reduces the read overhead.
Time Travel and Met Office Model Data
Data from Met Office forecast models have two time dimensions. Each model is executed on a certain cadence (a new model run kicks off every three hours, for example), and each model run forecasts a certain time period into the future (one model run may forecast only 48hr ahead while the next model run may forecast out to a whole week, for example).
The time at which a model run is kicked off is its forecast reference time, and the offset in time between the model run’s forecast reference time and each given timestep in the model run is that timestep’s forecast period. From these two values a single standard time coordinate can be calculated by adding the forecast period value at each timestep in the model run to the model run’s forecast reference time, but the extra forecast period and forecast reference time metadata are also retained.
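As a concrete (invented) example of that calculation:

```python
from datetime import datetime, timedelta

# The model run kicked off at this forecast reference time...
forecast_reference_time = datetime(2020, 5, 1, 6, 0)

# ...and produced hourly timesteps with these forecast periods.
forecast_periods = [timedelta(hours=h) for h in range(4)]

# The standard time coordinate at each timestep is simply their sum.
times = [forecast_reference_time + fp for fp in forecast_periods]
# times[-1] == datetime(2020, 5, 1, 9, 0)
```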
Storing and comprehending this data makes for an interesting challenge! We’re excited to apply TileDB’s time travel functionality to this challenge. For example, we could use the time travel axis to store all model data at a given forecast reference time; that is, all the data from a single model run.
Fancy Array Functionality
TileDB includes some array functionality that's unique to it. TileDB arrays can have both multiple attributes and variable-length attributes. Let's take a look at each of these in turn:
- multiple attributes: a single TileDB array can store multiple attributes in the same array. In TileDB terminology, an attribute is a named nD array of data, equivalent to a phenomenon in Iris terminology, or a NetCDF data variable.
- variable-length attributes: in addition, a single cell in a TileDB array can hold a variable-length list of multiple data values. More information on this is available in the TileDB developer documentation.
Both of these effectively equate to increased array dimensionality. This functionality might also provide new ways of storing existing earth system science data. For example, you could use either of these examples of TileDB fancy array functionality to store x-wind and y-wind — either as multiple attributes of the same array, or as multiple elements in the same cell of a single xy-wind array.
Virtual Datasets
This is not directly a piece of TileDB functionality, but something that could be built as an extension on top of the fancy array functionality that TileDB does provide. We just saw how you could, for example, store both x-wind and y-wind in the same array; you could also extend this to create virtual datasets from the multiple-attribute array containing x-wind and y-wind.
Wind speed and direction are commonly derived from x-wind and y-wind so we could create virtual datasets of wind speed and direction, based on the multiple-attribute x-wind and y-wind. These virtual datasets would contain no actual data, just the knowledge of how to index the underlying array to get x-wind and y-wind data at a given point, and the mathematical operations needed to turn these points from x-wind and y-wind into wind speed and direction.
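The derivation itself is straightforward vector arithmetic; a virtual dataset would in effect bake these operations in. A NumPy sketch, using the mathematical angle convention (meteorological wind direction conventions differ):

```python
import numpy as np

x_wind = np.array([3.0, 0.0])
y_wind = np.array([4.0, 5.0])

# Wind speed is the magnitude of the (x, y) wind vector...
wind_speed = np.hypot(x_wind, y_wind)  # [5.0, 5.0]

# ...and wind direction the angle of that vector, here in radians
# anticlockwise from the x-axis.
wind_direction = np.arctan2(y_wind, x_wind)
```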
Multi-Language Support
As mentioned earlier, TileDB also supports a number of languages, not just Python. While written in C++, it has bindings that provide a TileDB API in C, Python, R, Java and Go as well. TileDB's cross-language interoperability is therefore very good, and means that any earth system science data (for example, from NetCDF files) stored in a TileDB array is immediately accessible from multiple languages.
This is, incidentally, a nice step away from common Informatics Lab thinking on data provision and data pipelines, which has unintentionally become quite Python-centric, helped by the fact that a lot of data science now takes place almost exclusively in Python, with many high-quality data science tools provided as Python libraries. Making data available in a format that's accessible from multiple languages is a good first step towards a whole data pipeline that's multi-language, and not so Python-first.
Comparing with other Formats
In the Informatics Lab we are exploring the best ways to store large volumes of earth system science data in a cloud-ready format. We feel that TileDB represents a strong candidate for doing this, but we are also experimenting with Zarr, and with NetCDF — the current de facto file format for earth system science data.
One of the aims of this work, then, is to compare TileDB to these other formats, particularly for storing large volumes of logically structured earth system data on the cloud. For example, we’re interested in exploring:
- How does TileDB compare to Zarr in terms of speed and scale when reading and writing data?
- How much of a reduction in cognitive load is there from using earth system science data stored in TileDB or Zarr over multiple NetCDF files, particularly when we can continue to use Iris and xarray to interact with this TileDB and Zarr data?
We hope we’ll be able to answer these questions as we continue with the next steps in this investigation (see below). We’ll share our findings in upcoming blog posts. So far, though, we’ve successfully converted large volumes of data from NetCDF to TileDB. We can perform experiments on these TileDB arrays that will help us answer these questions and others.
One of the main aims of the broader Informatics Lab project that this work is a part of is benchmarking the performance of different cloud data storage formats. We are particularly interested in comparing TileDB to Zarr, but also to NetCDF. Some of the benchmarks we’ll run target common earth system science data analysis operations, which can be challenging with very large volumes of data. These include calculating point means and rolling window calculations over a long time period.
The Lab has an ongoing partnership with Microsoft Azure, so all our experiments with TileDB have been run on Azure. Initially we were using FUSE (Filesystem in Userspace) to connect TileDB to Azure blob stores, but we have been trialling the recent dev release of TileDB that includes a direct interface to Azure blob, which we have found to be a great improvement over using FUSE.
The direct interface shows much better performance for both reads and writes to the TileDB array on blob, as we would expect. It has also, as we hoped, resolved a few odd errors we hit when distributing reads and writes using dask. The errors we were seeing were typical of parallelisation problems: a key that exists not being found when running in parallel, file lock errors and similar. It seems that having FUSE as an intermediary between the Python runtime and the TileDB array stored on Azure blob was making operations non-thread-safe, so we're very pleased to have this solved by using TileDB's direct Azure blob interface.
Next Steps
There are a number of improvements that could be made to the Lab's Python library that provides interoperability between TileDB, NetCDF and Iris / xarray. For example, not all NetCDF metadata is currently preserved by the library when writing from NetCDF to TileDB; only essential metadata (data values, the name and units of the data values, and dimension-describing coordinates) is. This can be improved by passing more metadata out of the data model and into the TileDB array when it is written.
We are also working closely with the TileDB team on the technical side. Together we are working to add native support for labelled axes to the core of TileDB. This will add to the functionality offered by TileDB, and the Lab's Python library will be able to make use of it rather than providing its own opinionated approach for storing labelled axes in TileDB. We are also working on providing integrations for TileDB with Iris and xarray.
Summary
We have successfully used TileDB to store the contents — data and metadata — of hundreds of discrete NetCDF files in a single logical TileDB array. We have been able to interact with the data stored in this array directly. We have also been able to interact with this data using Iris and xarray, meaning we can take advantage of the rich data exploration APIs provided by these libraries. This means we have also proven the concept of extending the functionality of Iris and xarray to support a new data storage format.
There’s still more work to do — for example, the Python interoperability library we’ve written as part of this work is missing some useful features. What we’re really interested in doing now is using the data we’ve successfully stored in TileDB and exploring how it is to work with, and how fast we can run operations common to earth system science data analysis on this data stored in TileDB. We’re also excited to continue to collaborate with the TileDB team on future enhancements to TileDB.