Metadata Queries

Another option for user interactions with large datasets

Peter Killick
Met Office Informatics Lab
17 min read · Mar 30, 2021


Photo by Anthony Martino on Unsplash

Analysis-ready, cloud-optimised (ARCO) data has become an area of particular interest recently — within the Pangeo community and more widely. The concept of ARCO data reflects that, for large volumes of data stored in the Cloud to be maximally useful to end users, it must be both:

  • analysis-ready; that is, the data can be used for a variety of analysis operations without any processing other than loading the data.
  • Cloud-optimised; or provided in formats, such as Zarr or COG, that are designed to work effectively on Cloud data stores. We’ve blogged recently about a few different options for Cloud-optimised data storage formats, and what makes them particularly suited for use on the Cloud.

These attributes of ARCO data mean that users can work with it as provided without first having to expend significant amounts of time making the data usable. In this sense, ARCO data is a good step towards also having FAIR data (Findable, Accessible, Interoperable and Reusable data) — ARCO data being Accessible, Interoperable and Reusable — and, via new tools like Pangeo-Forge, also Findable as well.

Often it’s the case that the data attributes of analysis-ready and Cloud-optimised come together. So, for example, hundreds of small files in a legacy format may be able to be combined together to form a single larger dataset. By converting all these files to a single Zarr, we can make one dataset that is analysis-ready (the user doesn’t need to combine the small files before analysing the whole dataset) and Cloud-optimised (the data has been converted from legacy formats to the Cloud-ready Zarr format).
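The many-small-files-to-one-Zarr conversion can be sketched in a few lines of xarray. This is a hedged, in-memory illustration: the tiny per-timestep Datasets below stand in for the legacy files (in practice you would open them with `xr.open_mfdataset`), and the file/store names are hypothetical.

```python
import numpy as np
import xarray as xr

# Stand-ins for many small legacy files: one small Dataset per timestep.
# In practice these would come from, e.g., xr.open_mfdataset("data/*.nc").
per_file = [
    xr.Dataset(
        {"air_temperature": (("lat", "lon"), np.random.rand(3, 4))},
        coords={"time": t, "lat": [0.0, 1.0, 2.0], "lon": [0.0, 1.0, 2.0, 3.0]},
    ).expand_dims("time")
    for t in range(6)
]

# Superset: one analysis-ready dataset with a new time dimension.
combined = xr.concat(per_file, dim="time")

# Cloud-optimise: write once to a chunked Zarr store (requires the zarr package):
# combined.chunk({"time": 2}).to_zarr("combined.zarr", mode="w")
print(combined.sizes)
```

The user of the resulting Zarr store then loads one logical dataset rather than six files.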

Usage Patterns

The nature of ARCO data means it lends itself to one particular usage pattern.

This pattern is that of supersetting to form a single large dataset structure that can be loaded and analysed. This is certainly a useful pattern for many workflows, particularly workflows that operate on large volumes of data such as are provided by whole ARCO datasets. It is not the only usage pattern, however, and other workflows may be better served by different usage patterns, such as those that require a specific or known subset of the ARCO data rather than the whole dataset.

To demonstrate this, here’s a diagram that shows three conceptual, orthogonal axes of options for interacting with ARCO data. All three axes are familiar, but we’ve only considered the top one in relation to ARCO data so far:

Axes of interaction with ARCO data.

At the centre of this diagram is some “ARCO data”. We don’t know too much about it, and in a sense we don’t need to know too much about it either — from the perspective of the data user, we shouldn’t need to care about how the data is provided, just that it is provided and that we can access and use it without difficulty. For the sake of the example, though, let’s assume that the ARCO data has been provided as a number of individual files, and the data provider has done some work to ensure the files are provided in a cloud-optimised format, and to ensure they’re analysis-ready somehow.

As discussed above the diagram, ensuring the files are analysis-ready would typically mean combining them into a single object that can be used for large-scale analysis. The diagram, however, suggests two further patterns by which users of the ARCO data can interact with it. It also includes a secondary vertical axis that indicates whether each pattern tends to provide the user with a superset of all available data, or access to a subset of it. To briefly define these two terms:

  • superset: combining inputs to form a single logical entity made up of all the inputs.
  • subset: extracting elements of the inputs based on specified criteria.

For a more in-depth consideration of these two terms, see §On Supersetting and Subsetting below. Let’s now explore the three access patterns introduced in the diagram above in a little more detail.

Dataset

We can take ARCO data and superset it to produce a single dataset that can be easily loaded and analysed. This is particularly suited to bulk analysis patterns that need to operate on large volumes of data.

For example, suppose we need to find the difference over time between two possible future climate scenarios. To do this we need to load a century of climate forecast data for each scenario, find the spatial mean of each scenario at all timesteps, and finally plot the two resultant datasets together.
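The core of that workflow can be sketched with plain NumPy, using small synthetic arrays standing in for a loaded century of forecast data (the shapes and values here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two loaded climate scenarios:
# shape (time, lat, lon), e.g. 1200 monthly steps over a 10 x 20 grid.
scenario_a = rng.normal(loc=287.0, scale=1.0, size=(1200, 10, 20))
scenario_b = rng.normal(loc=288.5, scale=1.0, size=(1200, 10, 20))

# Spatial mean at every timestep: collapse the lat/lon axes.
mean_a = scenario_a.mean(axis=(1, 2))   # shape (1200,)
mean_b = scenario_b.mean(axis=(1, 2))

# Difference between the two scenarios over time, ready to plot.
difference = mean_b - mean_a
print(difference.shape)
```

With real data the loading step is the expensive part, which is exactly what the supersetted ARCO dataset makes straightforward.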

Catalog

We can present all of our ARCO data in a data catalog, such as might be presented by Intake or STAC. This allows users to get an overview of the contents of the ARCO data from the metadata, so that specific elements of the ARCO data can be selected from the catalog and then loaded and analysed.

For example, exploring the catalog of the ARCO data could reveal that the data describes global weather forecast data produced by Met Office models; specifically of air temperature, humidity, pressure and wind data, for the whole of 2019, on hourly, daily and monthly timescales.
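If Intake were used, the catalog behind that example might be described in YAML along these lines. This is a sketch only: the source names, driver and bucket paths are all illustrative, not taken from a real catalog.

```yaml
# Illustrative Intake catalog for the 2019 forecast data described above.
sources:
  air_temperature_hourly:
    description: Global air temperature forecasts, hourly, 2019
    driver: netcdf
    args:
      urlpath: "s3://example-bucket/forecasts/2019/air_temperature/hourly/*.nc"
  wind_daily:
    description: Global wind forecasts, daily, 2019
    driver: netcdf
    args:
      urlpath: "s3://example-bucket/forecasts/2019/wind/daily/*.nc"
```

A user would then browse the catalog's entries and descriptions before choosing a source to open.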

Query

We can make queries against the metadata of the ARCO data. This allows users to select specific elements of the ARCO data that match a metadata query and analyse these elements only.

For example, in a dataset containing air quality observations over the UK, the user could submit a query to retrieve all data describing PM2.5 (fine particulate matter) measurements within Birmingham in March 2018.
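In a MongoDB-backed engine, that example could be expressed as a query document along these lines. The field names are hypothetical (they depend entirely on the metadata schema), though `$gte` and `$lt` are standard MongoDB comparison operators:

```python
from datetime import datetime

# Illustrative query document for "PM2.5 over Birmingham in March 2018".
# Field names are assumptions about the schema; the operators are MongoDB's.
query = {
    "observed_property": "mass_concentration_of_pm2p5_ambient_aerosol_in_air",
    "site.city": "Birmingham",
    "time": {"$gte": datetime(2018, 3, 1), "$lt": datetime(2018, 4, 1)},
}

# With pymongo this would be submitted as, for example:
# matching = collection.find(query)
print(sorted(query))
```

The engine would return references to just the data items whose metadata satisfies every clause.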

Catalog vs. Query

The catalog and query access patterns are quite similar. Both provide a route for a data user to explore the data provided and select specific subsets of that data. Indeed, you could consider searching the contents of a STAC Catalog to find a specific STAC Item to be a form of query.

There are some differences between the two patterns as well:

  • Catalogues offer functionality for browsing a dataset to explore its contents before selecting items of interest.
  • A query engine allows you to select items from a dataset based on specific search terms.

The idea of querying a dataset based on metadata is not intended as a competitor to catalogues. It is simply another option that a data user can employ to explore and interact with a large dataset. In practice, it may prove easier to catalog some datasets and provide metadata query engines for others. The following §Possible Applications provides some examples of where metadata query engines may prove valuable.

Technical Detail

Let’s draw the diagram above in a different way, which provides more focus on technical detail. Doing so reinforces the idea that these three usage patterns are effectively three different views, or ways of interacting with the same ARCO data:

A more detailed representation of options for interaction with ARCO data.

In this stylised example, our ARCO data is composed of six NetCDF (.nc) files, named a to f, and we can interact with this data using all three usage patterns.

We can take the superset of all the data by combining all the data into a single dataset via a bespoke curation. This bespoke curation could be defined by the data provider, or the dataset could be registered on Pangeo Forge, for example.

We can also subset the data to only retrieve elements of it. The catalog approach, backed by Intake or STAC, for example, registers all of the files in the ARCO data in the catalog and allows a user to explore all the files via the catalog. Similarly, users could submit metadata queries to a query engine, which would return only matching files from the ARCO data (a.nc and c.nc in this case).

Note that this also highlights a functionality gap — in the provision of a metadata query engine. The rest of this blog post will focus on producing an implementation of a metadata query engine as an example of this functionality.

On Supersetting and Subsetting

Let’s take a moment to better define what we mean by supersetting and subsetting. To do this, let’s assume that our ARCO data is made up of hundreds of small files — represented by the six files a.nc — f.nc in the diagram above — that could be combined into a single logical dataset. For example, each file might represent one particular timestep (such as successive hours or days). The logical dataset in this case would have a new dimension for time, with each point along this dimension described by a particular file.

Taking this example, then, a superset of the ARCO data would combine the files together as described above and provide that as the single logical dataset with a time dimension, ready for analysis without data users first having to combine each file to build the time dimension themselves. On the other side, a subset of the ARCO data would provide the contents of all the individual files that match either to some query criteria, such as all the files that fall within a certain time period; or are certain files selected from browsing a catalog of the dataset.
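The subsetting half of this example can be sketched in a few lines of plain Python. The per-file time coverage below is hypothetical (one day of data per file), and dates are kept as ISO strings so they compare chronologically:

```python
# Hypothetical per-file time-coverage metadata for files a.nc to f.nc:
# each file holds one day of data, as (start, end) ISO date strings.
file_times = {
    "a.nc": ("2019-01-01", "2019-01-02"),
    "b.nc": ("2019-01-02", "2019-01-03"),
    "c.nc": ("2019-01-03", "2019-01-04"),
    "d.nc": ("2019-01-04", "2019-01-05"),
    "e.nc": ("2019-01-05", "2019-01-06"),
    "f.nc": ("2019-01-06", "2019-01-07"),
}

def subset(files, start, end):
    """Return the files whose coverage overlaps the [start, end) period."""
    return sorted(
        name for name, (lo, hi) in files.items() if lo < end and hi > start
    )

# Files covering 2-4 January:
print(subset(file_times, "2019-01-02", "2019-01-04"))
```

The superset, by contrast, would hand the user all six files pre-combined along the time dimension.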

Metadata Query Engine

A metadata query engine provides the functionality needed to enable data subsetting based on matches to metadata queries. All metadata pertinent to the data to be queried is stored in a database along with references that allow each metadata item to be related back to the original data item. Or, to put it more plainly, it’s a regular database put to a particular use.

A metadata query engine therefore allows you to select, out of a large volume of data, only the subset that matches a submitted query. Thus you could select data that describes a certain phenomenon, such as air temperature; or select all data with certain attributes, such as being observational data or having been produced by a specific version of a model; or select data that covers a specified spatio-temporal area, such as data from 2018 over London.

Overview

To demonstrate how a metadata query engine could work, let’s look at one practical example. This example takes a collection of files describing weather and climate data, exports each file’s metadata into a document database, and provides a query client for connecting to and submitting queries against the document database, along with a demonstration notebook. This example explicitly works at the file level — so each file is interrogated for all metadata, which is included, along with the path to the file, in the document database.


Detail

To build a metadata query engine we need both a database of metadata entries linked to files that provided the metadata in each entry, and a mechanism for submitting queries to the database to retrieve matching files. Thus, a query run against the document database finds all metadata matching the query and returns the file path stored in all matching documents. These paths to matching files can then be consumed by downstream functionality.
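The document-plus-path structure just described can be sketched in plain Python, with a list of dicts standing in for the document database and a hypothetical `query_paths` helper standing in for the query mechanism:

```python
# Each document pairs a file's metadata with a reference back to the file.
documents = [
    {"path": "/archive/a.nc", "standard_name": "air_temperature", "year": 2018},
    {"path": "/archive/b.nc", "standard_name": "relative_humidity", "year": 2018},
    {"path": "/archive/c.nc", "standard_name": "air_temperature", "year": 2019},
]

def query_paths(docs, query):
    """Return the path from every document whose metadata matches the query."""
    return [
        doc["path"]
        for doc in docs
        if all(doc.get(key) == value for key, value in query.items())
    ]

# All air-temperature files, ready for downstream loading:
print(query_paths(documents, {"standard_name": "air_temperature"}))
```

A real document database does the matching far more capably, but the shape of the operation — metadata in, file paths out — is the same.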

The following architectural diagram provides an overview of how the different elements of the metadata query engine connect together, including the functionality of the metadatabase library as an adaptor between weather and climate data and mongoDB:

Architectural diagram of the metadata query engine, including the metadatabase Python library.

All document database functionality is provided by mongoDB Atlas, mongoDB’s managed Cloud database service. This handles storage of all database documents and executes queries submitted via pymongo. The metadatabase library provides an intermediate layer between weather and climate datasets and mongoDB, handling the conversion of dataset metadata into JSON format and providing a custom interface to query the database, which handles only returning the file paths to matching files.

The full set of operations, then, is as follows:

  1. One or more files are passed to metadatabase. These files are loaded as Iris cubes by functionality provided by the CubeToJSON class. Each cube’s metadata is saved to JSON in a standardised format.
  2. The JSON files are passed to mongoDB Atlas (via a Client method) to be inserted into a specified database and collection.
  3. Further Client methods wrap mongoDB database read and query functionality, allowing users to make metadata queries against a specified database and collection. These bespoke wrappers provide extra functionality, such as returning query results ready-loaded as Iris cubes.
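Step 1's output might look something like the following sketch. The metadata values and the `to_document` helper are hypothetical stand-ins — the real CubeToJSON class extracts this information from a loaded Iris cube — but the shape of the standardised document is the point:

```python
import json

# Hypothetical stand-in for the metadata of one loaded Iris cube.
cube_metadata = {
    "standard_name": "air_temperature",
    "units": "K",
    "dim_coords": {"time": {"points": 24, "bounds": False}},
}

def to_document(metadata, path):
    """Step 1: a standardised JSON document pairing metadata with its file."""
    return json.dumps({"file_path": path, **metadata}, sort_keys=True)

doc = to_document(cube_metadata, "/archive/a.nc")

# Step 2 would hand this document to mongoDB Atlas for insertion, e.g.
# (not run here): collection.insert_one(json.loads(doc))
print(json.loads(doc)["file_path"])
```

Step 3's queries then run against many such documents, each one pointing back at its source file.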

Possible Applications

There are some particular situations where providing a metadata query engine for interacting with a large volume of data may prove beneficial.

The particular use-case this demonstration was put together for is to simplify access to data stored in very large volume data archives, such as those attached to institutional data centres. Typically the sheer amount of data stored in these archives is too large and heterogeneous to be able to be represented as a single logical ARCO dataset. The interface to such archives can also be cumbersome, and both of these challenges together make the data stored in these archives neither easily findable nor accessible. By building up an index of the contents of all the files in the archive, users can more easily search the contents of the archive by running queries against the metadata query engine to find and extract files of interest from the archive.

With a little bit of work, a metadata query engine such as the one demonstrated here could be adapted to run queries against a data lake / data warehouse. Again, the database would need to maintain an index of the metadata of all the files contained in the data lake, but with that in place, a user could explore and retrieve files from the data lake by running queries against the metadata query engine.

One more use-case could be querying ARCO datasets directly. For example, if a user of an ARCO dataset knew they needed a specific subset of the ARCO dataset described by a few of the original files used to make up the dataset, the user could extract those files and interact with them directly by making queries to the metadata query engine, rather than first loading the entire logical ARCO dataset and extracting from that the specific subset they needed.

GUI

The final element of this practical example of a metadata query engine demonstrates running metadata queries via a GUI to provide a simple user interface to the metadata database. A preview is shown in the image below, and the original Jupyter notebook used to make the GUI example is included in the metadatabase git repo.

A screengrab of the metadatabase GUI demo.

The GUI is provided as a Jupyter notebook with embedded IPywidgets providing authenticated access, database and collection selection, and a query field. In the image above the notebook has been served by voila. File matches to the query within the database are presented both as the full path to the file, and also as Iris cubes in a cubelist.

Limitations

Considering this demonstration of a metadata query engine even for a short while will make it clear that it includes a number of inherent limitations. Some of these are simply to keep the complexity low, some may be excusable on account of the intended applications of such a metadata query engine, while still others point to genuine limitations in the approach being demonstrated.

Let’s take a look at some of the limitations in more detail.

Tight Coupling
The metadata query engine works by maintaining a record of the specific location of the file that provided each metadata document in the database. Thus, if the file is changed in any way (including being moved, modified, or deleted) then the query engine will return results that are out-of-date. Similarly, it means the database is not portable, but is tightly coupled to the archive (or similar) against which it has been constructed.

If we do not expect a given database of metadata to be portable, but fixed to a particular archive, this limitation is partly mitigated. Nevertheless, a system would still be needed to ensure that any changes in the archive are reflected in the metadata database, so that queries always return accurate results.
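One simple synchronisation mechanism (a sketch of an idea, not part of the demo implementation) would be to record each file's modification time when it is indexed, and flag any document whose file has since moved, changed, or disappeared:

```python
import os
import tempfile

def is_stale(document):
    """True if the indexed file has been moved, deleted, or modified."""
    path = document["file_path"]
    if not os.path.exists(path):
        return True
    return os.path.getmtime(path) != document["indexed_mtime"]

# Demonstration with a temporary file standing in for an archive file.
with tempfile.NamedTemporaryFile(suffix=".nc", delete=False) as f:
    path = f.name

doc = {"file_path": path, "indexed_mtime": os.path.getmtime(path)}
fresh = is_stale(doc)     # False: the file is unchanged since indexing

os.remove(path)           # simulate the file being deleted from the archive
stale = is_stale(doc)     # True: the document is now out of date

print(fresh, stale)
```

Stale documents could then be re-indexed or dropped on a periodic sweep of the archive.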

Files are the quantum unit
Although this sounds spectacular, in reality all it means is that the smallest unit a query can return is a reference to a single whole file. This has a couple of particular implications.

Firstly, it is common for a single file of weather and climate data to contain multiple data arrays. This can happen when all the data arrays are on the same grid, or when all the data arrays were produced in the same operation. A common weather modelling example of this is the vector components of wind (u-wind, v-wind and w-wind: the components of wind along the x, y and z axes of the underlying model grid, respectively). If a single file contained all three components but the user queried for a standard name of u-wind, the single file (which contains all three components) would be returned, so all three components would incongruously be loaded in a subsequent load operation.

Secondly, this demo implementation of a metadata query engine cannot return a subset of a file. On a technical level, this comes about because the documents in the database store only file metadata and a reference to the file, but not the actual data points from the file. A significant advantage of this is that the document database remains small (metadata volumes are typically significantly smaller than data volumes), but it also means that only references to whole files can be returned, and not a subset or an index of the file.

For example, if a file contained 100 years of data, you could not use the metadata query engine to return to you only the 10 years of data you were particularly interested in. Instead, the query would return the file containing all 100 years of data, and it would be up to the user to subsequently subset the file once loaded to retrieve the 10 years’ data of particular interest.

Limited-scope queries
Simple equality-test queries can be run against any metadata contained in the documents making up the document database. More complex queries have not been implemented, even though such queries should be possible to implement. Examples of these more complex queries that have not been implemented are inequality queries (for example, locate all files containing data for years before 1980, or locate all files that measure height data above 1km), and spatio-temporal queries (for example, locate files that only contain data over the UK — perhaps by supplying a bounding box that must completely encompass all a file’s spatial extent).

Incomplete metadata
Not all of the metadata present in files being indexed by the metadata query engine is currently included in the metadata documents that make up the metadata database. An example of this is the bounds attribute of coordinate metadata, which, if present, describes the extent (that is, the lower and upper bound) of each point in the coordinate array. With the coordinate bounds metadata not translated, queries based on bounds (even a query for whether bounds are present at all) will fail.

UX
Queries must be written and passed as Python dictionary objects in this demonstration. While this is arguably simpler than writing SQL-like queries, it is not necessarily the most intuitive way to pass queries to the metadata database. There is no hinting or autocompletion of allowed keys and values, requiring that the data user has an understanding of the schema used to generate the metadata documents, or at least of the type of metadata expected in the files being indexed. Worse still, metadata elements can be nested, and the syntax for specifying nested elements is quite unintuitive. So, while an un-nested metadata query only requires getting Python dictionary syntax correct:

{"standard_name": "air_temperature"}

The syntax for checking if a coordinate has bounds, were the example from the previous limitation to be fixed, would presumably look quite unintuitive:

{"dim_coords.<coord_name>.bounds": True}
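For what it's worth, MongoDB's own idiom for "does this nested field exist" is the `$exists` operator combined with a dotted field path, which is valid but hardly friendlier. Here a hypothetical coordinate name, time, stands in for `<coord_name>`:

```python
# MongoDB's idiom for "the time coordinate has bounds metadata":
# a dotted path reaches into the nested document, and $exists tests presence.
has_time_bounds = {"dim_coords.time.bounds": {"$exists": True}}

# Valid Python and valid MongoDB, but the user still has to know both the
# document nesting and the operator syntax.
print(has_time_bounds)
```

This rather reinforces the UX point: correctness is achievable, intuitiveness less so.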

Where Next?

Despite the limitations of the metadata query engine that have just been considered, the core concept of a metadata query engine has proven to be sound, with the possibility of having genuine practical application. To move this from a proof-of-concept towards being a useful application, we could look at taking the following steps.

Make it Go

A good first step would be trialling a deployment of the metadata query engine that indexes and provides a search function for a real-world data archive. By exploring its useability — and its usefulness — based on feedback from users accessing the archive, we can determine what works well about the metadata query engine, what doesn’t, and whether it provides beneficial functionality.

A good indication that the metadata query engine provides a valuable service will be if we find that users are consistently able to find the data they need in the archive more quickly than by searching it manually. Such a finding would suggest that the metadata query engine is making life easier for archive users and improving the findability and accessibility of data stored in the archive.

If we find instead, however, that users choose not to use the metadata query engine when accessing the archive, or that inaccurate results are returned, or that the query engine produces errors in use, then this will be an indication that, at the minimum, more work is needed before we can deploy metadata query engines more widely.

One more thing we would need to consider when deploying a metadata query engine for searching an archive is the scalability of the query engine. That is, how does it handle multiple concurrent queries? We would need to consider both the client and the database server and ensure that both can handle concurrent requests, and have sufficient resources to handle requests in a timely manner.

Make it Better

We noted in §Limitations that there are a number of improvements and optimisations that could be made to the query functionality and UX of the metadata query engine. Implementing these would make the metadata query engine more functionally useful and easier to use.

Of particular value would be providing functionality that enables spatio-temporal queries to be submitted. Selecting data based on it covering a particular area in space and/or time is a common data access method, so adding this extra functionality would certainly be a valuable addition. We could further extend the functionality of the metadata query engine so that it also extracts, from the files returned and loaded by a query, the subsets that precisely match the extents requested. At that point the metadata query engine would be able to supply an in-memory, analysis-ready dataset matching the query exactly.

Improving the manner in which query strings are provided to the query engine — so that they are not so prescriptive in format — would improve the experience for users of the metadata query engine. This perhaps particularly applies to the GUI, which requires a Python dictionary passed as a string. This is quite unintuitive! A possible improvement would be to provide discrete text entry fields for keys and values. This would remove the need for the end user to type a syntactically-correct Python dictionary into a single text entry field, as the dictionary could be built automatically behind the scenes from the discrete entries in the keys and values fields.
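The dictionary-building step itself is trivial; the following sketch shows the kind of helper (hypothetical, not from the metadatabase repo) that discrete key and value fields could feed, without needing any ipywidgets machinery to illustrate the idea:

```python
def build_query(keys, values):
    """Assemble a query dict from paired key and value entry fields."""
    if len(keys) != len(values):
        raise ValueError("each key field needs a matching value field")
    # Empty key fields (left blank in the GUI) are simply skipped.
    return {key: value for key, value in zip(keys, values) if key}

# What two filled-in key/value field pairs in the GUI might produce:
print(build_query(["standard_name", "units"], ["air_temperature", "K"]))
```

The GUI would then submit the assembled dict to the query engine exactly as the notebook interface does today.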

In Summary

In this post we’ve explored a new concept for user interaction with large volumes of data: querying the data, based on its metadata, to return a subset of the data that matches the metadata query that was made. This sits alongside existing methods for interacting with such large volumes of data — those of making an ARCO dataset of the data, or cataloguing its content.

We’ve taken a look at a demo implementation of a metadata query engine, and noted that while there are some limitations to it, we’ve also seen that it’s functional and an effective method for interacting with large volumes of data. Finally, we’ve explored some next steps that would improve the existing implementation to make it more full-featured, as well as considering situations where we could deploy an operational trial of a metadata query engine.


Peter Killick
Met Office Informatics Lab

Cloud Platform Architect, open-source software engineer and technology researcher in the UK Met Office Informatics Lab. I tend to blog on these themes.