Representing thousands of NetCDF Files using TileDB

Peter Killick
Met Office Informatics Lab
15 min read · Dec 3, 2020

This blog post supports a poster that I presented at AGU2020. You can find the poster here. Note: the QR code in the poster links to this blog post, so if you scan the QR code you’ll just end up right back here!

Introduction

The Cloud is increasingly being used to process very large volumes of scientific data. This reflects the increasing attractiveness of the Cloud for data processing: ease of access to large amounts of compute and storage, limited only by your budget; a wide array of available hardware (including fast CPUs, accelerators such as GPUs, ARM-based chips, and even HPC systems); different managed services to meet your needs (VMs, machine learning studios, Kubernetes and more); the simplicity of not having to manage your own hardware; the ability to access other people's data that is also Cloud-hosted; and the offering of Cloud-based services and platforms such as Pangeo.

Given this increasing use of the Cloud, it follows that it is also increasingly important that data be easily findable, accessible and usable on the Cloud. To put it more succinctly, we require that scientific data that's available on the Cloud be both Cloud-optimised and analysis-ready.

Cloud-optimised data: file formats and data storage formats that are specifically designed to work effectively on Cloud data stores (that is, object stores).

Analysis-ready datasets: datasets that are completely ready to be analysed by a user, without the user first having to perform data cleaning tasks, locate missing elements of the dataset, or combine the contents of multiple files in order to use the dataset.

Let’s take a look at improving the provision of weather and climate data on the Cloud by providing this data in a cloud-optimised, analysis-ready format.

Data on the Cloud

Classical data storage (that is, methods of storing data that pre-date the Cloud) and Cloud data storage are very different entities. They have differing strengths and weaknesses, meaning data storage formats optimised for one will likely exhibit challenges for use on the other. This is certainly true of using classical data storage formats on Cloud data storage, and is the reason why — at least for data storage and processing on the Cloud — pursuing cloud-optimised data storage formats is important.

Let’s take a look at the key attributes of classical and Cloud data storage in turn. From these attributes we’ll draw some conclusions on ideal characteristics for data storage formats optimised both for classical storage and Cloud data storage.

Note that I'm intentionally only considering one classical data storage system and one Cloud data storage system. Other systems of course exist in both cases, and the comments below may not apply to them so directly. But for the sake of brevity we will consider just one case each for classical and Cloud data storage systems.

Classical Data Storage

Hardware: typically made up of a single locally-mounted disk, or possibly a few network-mounted disks.

Network access: latency — very low; throughput — high
Locally mounted disks exhibit very low latency and high throughput as there is a very short data transfer path between disk and compute.

File access: sequential — fast; parallel — very slow.
Network access characteristics to a single local disk give fast sequential access to parts of files or complete files. The single disk means that parallel access is very slow as the parallel accesses compete for the disk’s IO bandwidth.

File and folder hierarchy: typically POSIX-like hierarchical files and folders.

File browsing: POSIX-like paths on local or network-mounted disks.

Byte-range requests: always available, but must be supported by the application requesting the bytes from disk.

A data storage format optimised for classical data storage would then be a single, reasonably small file, with a logically separated internal byte structure. This single file can either be read in a single operation, or via multiple small byte-range requests that take advantage of the logical internal byte structure. This takes advantage of fast sequential reads without requiring slow parallel reads.

Cloud Data Storage

Hardware: thousands of disks logically connected to make a vast single storage volume.

Network access: latency — high; throughput — low
For typical access to a remote cloud data store, high latency and low throughput are natural outcomes of relying on the Internet for all interactions with the data store. If compute is close to the data — in the same Cloud region, for example — this pattern does not hold and network access is closer to that of locally mounted storage (as compute and storage are likely in the same datacenter on the same network).

File access: sequential — slow; parallel — very fast
Sequential file access is limited by the latency and throughput of the network, so will be slower than for locally-mounted disks. The fact that Cloud storage is made up of numerous disks makes parallel file access very fast, as different requests can be served by different disks. And while each parallel request is subject to the same network overhead as a sequential one, the requests proceed concurrently, so this slowdown is far outweighed by the speedup that parallel access provides.

File and folder hierarchy: none; instead there is a flat file structure made up of key-value pairs, where keys are file names and values are the data bytes of those files.

File browsing: via HTTP requests to the Cloud data store, which may be provided via a web interface.

Byte-range requests: only if exposed by the API for the Cloud data store, which must also be exposed by the application making the request. The high latency of requesting data from the Cloud data store means byte-range requests are often impractical to utilise.

A data storage format optimised for a Cloud data store is thus very different to one optimised for a classical data store. The high cost of making requests over a network means that individual requests are ideally kept small, with larger requests ideally being parallelised. The cost of requesting large amounts of data means that ideally the data bytes of a dataset are chunked into many smaller files that can be logically reconstructed into a complete dataset. The dataset's metadata is held in separate metadata files, which contain both information about the values in the data bytes (such as what the data represents and units of measure) and a mapping that describes how the data chunks fit together to re-form the complete dataset. This allows the metadata to be requested separately from any of the data bytes, which can be requested in parallel in subsequent operations.
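To make this concrete, here is a purely illustrative sketch of how such a format might lay its objects out in a data store: one small metadata object describing the dataset and its chunking, plus many independently addressable chunk objects. The keys, metadata fields and chunk-naming scheme below are invented for the example and don't correspond to any particular format.

```python
# Purely illustrative: keys, metadata fields and the chunk naming scheme
# are invented for this example, not taken from any particular format.
object_store = {
    "air_temperature/.metadata": (
        '{"units": "K", "shape": [8760, 970, 1042], "chunks": [24, 485, 521]}'
    ),
    "air_temperature/0.0.0": b"<compressed bytes for chunk (0, 0, 0)>",
    "air_temperature/0.0.1": b"<compressed bytes for chunk (0, 0, 1)>",
    "air_temperature/1.0.0": b"<compressed bytes for chunk (1, 0, 0)>",
    # ... one object per chunk ...
}

# A client reads the metadata object first, works out which chunks intersect
# its region of interest, then fetches just those chunks, in parallel.
```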

Some Cloud-Optimised Data Storage Formats

We have laid out the characteristics of Cloud data storage and how it is different to classical data storage. We’ve used this information to draw up how a Cloud-optimised data storage format would work. Let’s now take a look at some examples of cloud-optimised data storage formats.

TileDB

TileDB is described as a universal data engine. It can store gridded and sparse arrays, dataframes and custom scientific datasets and provide database-like access to the data. TileDB is cloud-native and stores data in a cloud-optimised format, but can also store data to a local filesystem.

The core TileDB engine is a fast, efficient and parallel C++ library that exposes a rich API. This API is also made available in C, Python, Java, R and Go.
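For a flavour of that API, the snippet below is a minimal sketch using the TileDB Python package to create, write and read a small dense array on the local filesystem. The array name, dimension names and sizes are illustrative only.

```python
import numpy as np
import tiledb

uri = "example_dense_array"  # a local path; an s3:// URI works similarly

# Define a 2D dense array: 24 time steps x 100 grid points,
# stored in tiles (chunks) of 6 x 50.
dom = tiledb.Domain(
    tiledb.Dim(name="time", domain=(0, 23), tile=6, dtype=np.int32),
    tiledb.Dim(name="grid", domain=(0, 99), tile=50, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="air_temperature", dtype=np.float32)],
)
tiledb.Array.create(uri, schema)

# Write some data, then read back just a small slice.
with tiledb.open(uri, mode="w") as arr:
    arr[:] = np.random.rand(24, 100).astype(np.float32)

with tiledb.open(uri, mode="r") as arr:
    subset = arr[0:6, 0:10]["air_temperature"]

print(subset.shape)  # (6, 10)
```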

Zarr

Zarr is a very lightweight, open-source, cloud-first data storage specification for storing chunked, compressed nD arrays. It offers chunked storage of datasets and parallel reads and writes to and from the data store. Zarr can also be used to create data stores in memory and on local disks.

Zarr is an intentionally lightweight specification for storing nD arrays on the Cloud. There is a reference implementation written in Python, but the lightweight nature of the specification means implementations in other languages are easy to produce.
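A comparably minimal sketch with the Zarr Python package, again with illustrative names and sizes:

```python
import numpy as np
import zarr

# Create a chunked, compressed 2D array backed by a directory store; the
# same code can target a Cloud object store via an fsspec-style mapping.
z = zarr.open("example.zarr", mode="w",
              shape=(24, 100), chunks=(6, 50), dtype="f4")
z[:] = np.random.rand(24, 100).astype("f4")

# Read back a subset; only the chunks overlapping the slice are fetched.
subset = zarr.open("example.zarr", mode="r")[0:6, 0:10]
print(subset.shape)  # (6, 10)
```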

There is also work ongoing to produce an implementation of NetCDF that uses Zarr as its storage backend. The aim of this work is to make the NetCDF fileformat itself cloud-optimised by means of the Zarr data storage backend, and so provide all the benefits of NetCDF and all the benefits of Zarr in a single package.

Others

There are other Cloud-optimised data storage formats too. These include Cloud-Optimised GeoTIFF (COG) and Parquet. We won't consider these formats further as they are not designed to store nD arrays, being targeted more at image files and columnar data respectively. So while they enable other sectors to work with cloud-optimised data, they aren't directly applicable to our use-case.

Producing Cloud-Optimised TileDB arrays

A series of volumes of an encyclopaedia. Just as a series of files makes a dataset, so these volumes make an encyclopaedia.
Photo by James on Unsplash

Let's now take a look at how we represent the contents of thousands of NetCDF files in TileDB. The NetCDF files that we wish to translate are effectively a single dataset containing high-resolution UK weather forecast data produced by the Met Office. The dataset covers approximately 20 commonly forecast weather phenomena (such as air temperature and humidity, forecast rainfall, surface pressure and wind), and spans roughly the period from 2016 to 2020 at three different temporal resolutions: daily, hourly and five-minute timescales. Our aim is to take all of these individual NetCDF files and represent them in a single cloud-optimised and analysis-ready TileDB array (although technically we will create one TileDB array per temporal resolution).

The high-level overview of the process for converting thousands of NetCDF files into a single TileDB array looks like the following:

  1. Generate an in-memory representation of the contents of a single NetCDF file.
  2. Create a TileDB array from the contents of that NetCDF file.
  3. Extend the TileDB array by adding the contents of further NetCDF files to the array.
  4. Access and interact with the resultant array.

We ran all of these steps in Python, making use of existing Python libraries for loading NetCDF files and interacting with weather data, and using the TileDB Python API. As there is no existing translation layer between NetCDF and TileDB we wrote a Python library containing this functionality: tiledb-netcdf.

1. Representing a NetCDF file

The Python NetCDF interface provides access to the contents of NetCDF files as in-memory `Dataset` objects containing attribute dictionaries of metadata items and a number of `Variable` objects. These represent anything in the NetCDF file that has data bytes, which includes data variables, coordinate metadata variables, and other more bespoke variables as well.

What the `Dataset` object does not include is any indication of what sort of `Variable` any of the variables are — there’s no programmatic method of determining with certainty whether a given `Variable` is a data variable, a coordinate variable, or something else. For an accurate translation of a NetCDF file to a TileDB array we need to ensure that all the elements of the NetCDF file are equivalently stored in the TileDB array.

To enable this, tiledb-netcdf includes a custom data model for representing NetCDF files, based on the CF-NetCDF specification. This uses a series of heuristics to try to infer the type of each variable in the NetCDF file. The outcome of this inference is stored in a series of lookup dictionaries in the custom data model, ready to be interrogated when the TileDB array is created in the next step.
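As an illustration of the kind of inspection involved, the sketch below uses the netCDF4 Python library to list a file's variables and applies one simple CF-style heuristic: a variable named after one of its own dimensions is treated as a coordinate variable. The file name is hypothetical, and the heuristics in tiledb-netcdf are more extensive than this single check.

```python
from netCDF4 import Dataset

# Hypothetical file name; any CF-NetCDF file will do.
ds = Dataset("forecast_file.nc")

coord_vars, data_vars = [], []
for name, var in ds.variables.items():
    # Simple CF heuristic: a variable named after one of its own dimensions
    # is a coordinate variable. Fuller heuristics also consider bounds,
    # cell methods, grid mappings and so on.
    if name in var.dimensions:
        coord_vars.append(name)
    else:
        data_vars.append(name)

print("Coordinate variables:", coord_vars)
print("Data variables:", data_vars)
```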

2. Create a TileDB array

We can use the `Writer` interface of the tiledb-netcdf library to convert the contents of a NetCDF file into a TileDB array. The `Writer` instance interacts with a NetCDF file via the custom data model described above, and requests data variables, coordinate variables, metadata and more from the NetCDF file via the lookup dictionaries populated in the data model.

These are used to create a TileDB array with a slightly bespoke format that groups data variables according to the number of dimensions of the data and the coordinate variables that describe them. These data variables are grouped together in multi-attribute TileDB arrays that include metadata relevant to the group of data variables, with separate arrays that store data and metadata for the coordinates that describe each group of data variables.
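The multi-attribute grouping can be pictured directly with the TileDB Python API: phenomena that share the same dimensions become attributes of a single dense array. This is a sketch of the idea only, with an illustrative array name and grid sizes, not the exact schema that tiledb-netcdf writes.

```python
import numpy as np
import tiledb

uri = "uk_hourly_group"  # illustrative name

# Two phenomena sharing the same (time, y, x) grid are stored as two
# attributes of one dense array.
dom = tiledb.Domain(
    tiledb.Dim(name="time", domain=(0, 8759), tile=24, dtype=np.int32),
    tiledb.Dim(name="y", domain=(0, 99), tile=50, dtype=np.int32),
    tiledb.Dim(name="x", domain=(0, 99), tile=50, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[
        tiledb.Attr(name="air_temperature", dtype=np.float32),
        tiledb.Attr(name="relative_humidity", dtype=np.float32),
    ],
)
tiledb.Array.create(uri, schema)

# Write both attributes for the first 24 time steps in one operation.
with tiledb.open(uri, mode="w") as arr:
    arr[0:24, :, :] = {
        "air_temperature": np.zeros((24, 100, 100), dtype=np.float32),
        "relative_humidity": np.zeros((24, 100, 100), dtype=np.float32),
    }
```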

3. Extend the array

Once a TileDB array has been created it can be extended to include the contents of other NetCDF files — ultimately so that the contents of all of the thousands of NetCDF files in the dataset are stored in the single TileDB array. There are a number of complexities to extending an array with extra data. These complexities come down to accurately placing the data to be appended within the index-space of the existing array and ensuring that appropriate metadata is preserved.

In tiledb-netcdf the complexities associated with appending extra data are handled by the `append` method of the `Writer` class. The `append` method determines the offset along the append dimension for each NetCDF file to be appended and handles metadata and coordinate values. Note that a single append operation can only extend the array along one dimension.
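At the TileDB level, an append boils down to writing the new file's data at the right offset along the append (time) dimension. The sketch below assumes the array from the previous sketch, with a time domain sized large enough up front; the real `append` method also computes these offsets from coordinate values and keeps metadata consistent.

```python
import numpy as np
import tiledb

uri = "uk_hourly_group"   # the array created in the previous sketch
offset = 24               # next free index along the time dimension
new_steps = 24            # number of time steps in the file being appended

# Write the appended file's data into the slot starting at `offset`.
with tiledb.open(uri, mode="w") as arr:
    arr[offset:offset + new_steps, :, :] = {
        "air_temperature": np.zeros((new_steps, 100, 100), dtype=np.float32),
        "relative_humidity": np.zeros((new_steps, 100, 100), dtype=np.float32),
    }
```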

4. Interact with the array

Finally, TileDB arrays created with tiledb-netcdf can be interacted with using the Python libraries Iris and Xarray. The `Reader` interface in tiledb-netcdf includes functionality to access the contents of the arrays via Iris or Xarray by loading the metadata contents of the TileDB array from storage and populating an Iris Cube or an Xarray DataArray from it. A TileDB array containing the contents of all the NetCDF files included in the dataset will be very large, so it would be infeasible to load all of the data bytes into memory at once. To avoid this, tiledb-netcdf includes functionality to defer loading the actual data bytes by maintaining only a reference to the array that can be interrogated for byte ranges on request.
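At the TileDB level, the deferred-loading pattern looks roughly like the sketch below: open the array, inspect only its (small) schema and metadata, and fetch data bytes only for the slice that is actually requested. The `Reader` interface builds on this to hand back Iris cubes and Xarray DataArrays.

```python
import tiledb

uri = "uk_hourly_group"  # illustrative; could equally be an s3:// URI

with tiledb.open(uri, mode="r") as arr:
    # Inspecting the schema touches only a small amount of metadata...
    print(arr.schema)

    # ...while data bytes are fetched only for the slice requested.
    subset = arr[0:6, 0:10, 0:10]["air_temperature"]

print(subset.shape)  # (6, 10, 10)
```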

The advantage of having this functionality is that we can interact with the contents of the whole dataset, represented in a TileDB array, using tools that are well known for interacting with weather data. This reduces the bar to entry for scientists and data users, meaning they do not need to learn new approaches for interacting with the data. It also means we can continue to benefit from the domain-specific, often highly optimised functionality provided by Iris and Xarray without having to re-write it specifically for dealing with TileDB arrays.

A short code example

The following code snippet shows the code necessary to run all of the above steps in Python, using tiledb-netcdf, the TileDB Python API and the Iris Python library:
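Here is a minimal sketch of steps 1 to 4, using the `Writer` and `Reader` interfaces as described above; the import path, method names, keyword arguments and file names are assumptions for illustration and may differ from the released tiledb-netcdf API.

```python
# Sketch only: names and signatures below are assumed from the descriptions
# in this post, not taken verbatim from the tiledb-netcdf documentation.
from tiledb_netcdf import NCDataModel, Writer, Reader  # assumed import path

# 1. Represent a single NetCDF file with the custom data model.
data_model = NCDataModel("first_forecast_file.nc")  # hypothetical file name

# 2. Create a TileDB array from the contents of that file.
writer = Writer(data_model,
                array_path="./uk_forecast_arrays",  # illustrative location
                array_name="uk_hourly")
writer.create_arrays()

# 3. Extend the array with the contents of further NetCDF files, appending
#    along a single (time) dimension.
more_files = ["second_forecast_file.nc", "third_forecast_file.nc"]
writer.append(more_files, append_dim="time")

# 4. Interact with the resultant array via Iris.
reader = Reader("uk_hourly", array_path="./uk_forecast_arrays")
cubes = reader.to_iris()  # lazily-loaded Iris cubes describing the dataset
print(cubes)
```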

Benefits

Let’s take a look at some of the benefits of representing our NetCDF files as a cloud-optimised, analysis-ready TileDB array.

Data access: it is much easier to interact with the contents of thousands of NetCDF files via a TileDB array than it is via the individual files directly. The fact that there is just a single TileDB array in place of thousands of NetCDF files means that accessing all the data is a lot quicker and a lot simpler as well. Much less pre-processing of the data is needed, as it comes ready-made as a single logical dataset rather than the contents of all the NetCDF files needing to be combined each time you wish to work with them. The data can still be interacted with using common tools like Iris and Xarray, and the whole dataset can be subset to retrieve specific elements of interest.

Data volume: as noted above, we can use a single TileDB array to represent the contents of thousands of NetCDF files. This allows a very large volume of data to be accessed from a single array, and all this data can be represented (metadata-only) in memory in a Python script. For example, the image below shows an Iris cube constructed from such a TileDB array.

A very large Iris cube constructed from a TileDB array.

Data interaction: we can load all the data we need — be it an entire TileDB array or a subset of it — in a single operation and interact with this data via common Python tools like Iris and Xarray. The fact that this is done in a single operation means preparing the data is no longer a time-consuming process, nor a blocker on performing analysis operations on the data. Further, TileDB allows parallel interaction with actual data values, so analysis operations such as statistical aggregations can be distributed on scalable cloud data processing platforms.

Next Steps

There are substantial benefits to using cloud-optimised, analysis-ready datasets to represent large volumes of NetCDF data in the Cloud. There are further steps that we can take, though, to better understand the nature of the datasets that we have created and to further improve their usability. Some possible routes for further work, then, are detailed below.

User Experience

A major driver for representing NetCDF files as cloud-optimised, analysis-ready TileDB arrays is the experience of the data user. As we have explored already, it is much easier to load a single analysis-ready TileDB array than to wrestle with loading and combining thousands of NetCDF files.

There are still improvements we can make to the experience of the data user in accessing and interacting with such cloud-optimised, analysis-ready data. For example, there is a reasonable amount of setup needed to connect TileDB arrays to an object store, and you need to know the name of the object store and possibly an access key as well. Ideally this complexity would not be exposed to the data user (and the access key never made public!). Similarly, without access to the object store you wouldn't necessarily know what datasets are available.

One way to improve this would be to catalogue all the available cloud-optimised, analysis-ready datasets using Intake, for example. We could include all the complexity of accessing the object store in the Intake catalogue entry and remove that from the data user, so that loading a cloud-optimised, analysis-ready dataset from the Intake catalogue returned the dataset as an Iris cube or Xarray DataArray directly.
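A sketch of what that could look like with Intake, using a hypothetical catalogue URL and entry name; the point is that the object-store location and credentials live in the catalogue rather than in the user's code.

```python
import intake

# Hypothetical catalogue; its entries would carry the object-store location
# and any credentials needed to reach the TileDB arrays.
catalogue = intake.open_catalog("https://example.org/cloud-datasets.yaml")

print(list(catalogue))  # discover which datasets are available

# With a suitable driver, reading an entry could hand back the dataset
# directly as an Iris cube or Xarray object.
dataset = catalogue["uk_hourly_forecast"].read()  # hypothetical entry name
```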

Library Improvements

The Python library tiledb-netcdf provides us with an interface between NetCDF and TileDB to store NetCDF data in TileDB, and a further interface to interact with TileDB arrays using Iris and Xarray. This is very beneficial functionality, but there are improvements that can still be made to the library.

From a functionality point of view, a key improvement that can be made is to utilise TileDB’s built-in labelled array functionality rather than using a bespoke system. The library was written before TileDB’s labelled array functionality was released. The downside of the bespoke system is that TileDB arrays created using tiledb-netcdf are not fully portable, so will only be correctly interpreted when being read via tiledb-netcdf.

From a usability point of view, the library would benefit from a test suite to ensure complex code is functioning as expected, as well as expanding the documentation to make it easier to use. The library is installable from the Python Package Index, but would also benefit from being conda installable.

Benchmarking

We will get a better understanding of the value of storing NetCDF data in cloud-optimised, analysis-ready TileDB arrays by benchmarking various elements of creating and using such arrays, and comparing TileDB to other options, such as Zarr and plain NetCDF. Another valuable attribute of such arrays to benchmark is their usability from a user experience perspective.

Performance benchmarks to run and compare include:

  • speed of writing data from NetCDF files to cloud-optimised storage formats.
  • speed of reading data from Cloud storage to an analysis platform to create an analysis-ready dataset. Here it would be valuable to compare reads of metadata only as well as reads of data bytes, and comparing reads to a local analysis platform as well as Cloud analysis platforms.

A major driver for using cloud-optimised, analysis-ready datasets is their usability. We can use this fact to create usability benchmarks that we can use to compare different cloud-optimised data storage formats, including:

  • Ease of creating analysis-ready datasets from input data.
  • Ease of accessing analysis-ready datasets once they’re created.
  • Ease of performing custom analysis on analysis-ready datasets once accessed.

To Sum Up

By representing a dataset composed of thousands of NetCDF files as a single cloud-optimised, analysis-ready dataset, we can greatly simplify the process of accessing and analysing its contents, and entirely remove the requirement to pre-process the contents of the dataset to make it ready for analysis. This means data scientists are able to start analysing the datasets directly, without having to first pre-process and tidy the data. Cloud-optimised, analysis-ready datasets are also ideally suited to representing the increasingly large volumes of data typical of many science domains.

A number of different cloud-optimised, analysis-ready data storage formats exist, so formats can be chosen to meet the requirements of many different use-cases. We have primarily focused on TileDB here, as this (along with Zarr) best suits our use-case. There are different advantages and disadvantages to each format, but they all share the advantage of providing cloud-optimised data storage.

There are still some challenges to providing and using cloud-optimised, analysis-ready datasets. For data providers, there is extra work needed to make datasets cloud-optimised and analysis-ready, from translating existing datasets to ensuring the datasets accurately represent all the data they contain. For data users there can be challenges around learning new methods of finding and interacting with such datasets. Providing tooling to improve interactions with cloud-optimised, analysis-ready datasets can reduce the impact of these challenges, both for data providers and data users.

