What should a geospatial data service look like?

Niall Robinson
Met Office Informatics Lab
6 min read · Jan 28, 2020

This research was supported by the ERDF through the Impact Lab.

In a recent poll of Pangeo users, ~70% of respondents said that one of the main challenges of using public datasets was “ease of access”, closely followed by ~60% citing “finding data”. Work environments and scalable compute are increasingly available from cloud providers as a Service (aaS), but this is now exposing the next bottleneck in our workflow: just getting the data.

We’ve worked with the major cloud providers on several open data programmes to make Met Office data available for research. However, the access patterns we see are telling: people tend to hoover up entire datasets wholesale. This tells us that we aren’t yet providing an effective data service for users to interact with the data at source.

An effective geospatial data service needs to do a few different things:

  • Represent geospatial datasets (that is, multidimensional gridded data) as meaningful, analysis-ready objects, not as a set of chunked data files
  • Let people discover what datasets are available
  • Scale well, so they can be used by distributed compute clusters

In this post we’re going to talk about our mental model for how data services could be architected with a primary focus on cloud services, discuss some of the relevant technologies, and describe a project we’re planning in the Informatics Lab to look at this further.

A generic architecture

A Data Service gives analysts Data Sets as Analysis Ready Data Objects…but what does all this mean? And what comprises a Data Service?

There are a lot of terms with the word “data” in them — increasingly there is confusion about where component services begin and end. So, it’s important from the off to be precise about the difference between all these terms.

A Data View is a representation of some raw data which is conceptually meaningful. For instance, in gridded atmospheric data this might be a temperature field as a function of space, time, and forecast. Note that the same underlying data can often be represented by different views: for instance, a lagged ensemble weather forecast can be thought of as a function of forecast time, or as a function of time since the start of the forecast.
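
As a minimal sketch of those two views (the dimension names are hypothetical, and Xarray stands in for any such library), the same array can carry either indexing:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy lagged ensemble: three forecasts started 6 hours apart, four steps each.
starts = pd.date_range("2020-01-28", periods=3, freq="6H")
leads = pd.timedelta_range("0H", periods=4, freq="6H")

temperature = xr.DataArray(
    np.random.rand(3, 4),
    dims=("forecast_reference_time", "forecast_period"),
    coords={"forecast_reference_time": starts, "forecast_period": leads},
    name="temperature",
)

# View 1: a function of time since the start of each forecast (as built above).
# View 2: the same values as a function of valid time, added as a 2D coordinate.
valid_time = temperature.forecast_reference_time + temperature.forecast_period
by_valid_time = temperature.assign_coords(valid_time=valid_time)
```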

An Analysis Ready Data Object is something which represents a Data View in a way that a human can interactively interface with to perform their analysis. In our context, that really means a Python object like an Iris Cube, an Xarray DataArray, a Pandas DataFrame, or something similar. Note that, whilst these are human-friendly objects, they can equally well be used in back-end code-based services.
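
For example (a toy sketch, with made-up variable names), analysis on such an object is expressed in meaningful terms, with no chunk files in sight:

```python
import numpy as np
import xarray as xr

# A toy analysis-ready temperature field.
temperature = xr.DataArray(
    np.random.rand(4, 3, 5),
    dims=("time", "latitude", "longitude"),
    name="temperature",
)

# Meaningful, interactive operations on the Data View itself.
zonal_mean = temperature.mean(dim="longitude")
first_step = temperature.isel(time=0)

# The same view handed over as a different Analysis Ready Data Object
# (conversion requires the iris package to be installed).
cube = temperature.to_iris()
```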

In our model a Data Service is operationally defined as something which provides consumers with a Data View in the form of an Analysis Ready Data Object. So what could comprise this Data Service?…

At the most primitive level, a Data Store deals with…unsurprisingly, storing raw data. Big datasets are often broken up into chunks for storage. This confers lots of practical advantages, primarily availability: different chunks can be accessed simultaneously by distributed compute processes.
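
A sketch of this chunked layout (a small in-memory field written locally; Zarr and Dask are one possible pairing):

```python
import numpy as np
import xarray as xr

# A year of daily global fields, to be stored as monthly-ish chunks.
field = xr.DataArray(
    np.zeros((365, 90, 180)),
    dims=("time", "latitude", "longitude"),
    name="temperature",
)

# Each ~30-day chunk becomes a separate object on disk (requires dask),
# so distributed workers can later read different chunks simultaneously.
field.chunk({"time": 30}).to_dataset().to_zarr("temperature.zarr", mode="w")
```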

Traditionally, chunks have been stored on POSIX-based file systems or tape archives. However, as we move to the cloud, object stores such as AWS S3 and Azure Blob Storage are coming to the fore. These services put more emphasis on high availability at the expense of indexing what is being stored.
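
Reading chunked data straight from an object store looks like this (the bucket name here is hypothetical):

```python
import s3fs
import xarray as xr

# Anonymous access to a hypothetical public bucket of chunked Zarr data.
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="met-office-example/temperature.zarr", s3=fs)

# Opening is lazy: only metadata is read up front, and chunks are then
# fetched on demand, potentially by many distributed workers at once.
ds = xr.open_zarr(store)
```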

Chunks have long been difficult for humans to interface with directly, pushing them towards anti-patterns in their analysis. Additionally, cloud object stores are not really designed for humans to interact with directly. Which brings us to…

The term Data Base means a lot of different things to different people. Broadly speaking, we’re using this term to mean a service which manages and optimises access to raw chunked data. Note that this definition is slightly more limited than a lot of established data base technologies, such as MongoDB and other NoSQL stores, which also deal with storing the data, and don’t generally scale well with multidimensional gridded data. In this context, the focus is on defining Data Views and managing access to the composite chunks. But how do we find out about all the different Data Views that might be available to us?

A Data Catalogue is defined as a service which stores information about all the different Data Views which are available. It doesn’t store any data per se; rather, it stores metadata about the views: when a view was created, what physical quantity it represents, and so on. It is designed to be searched to find the Data Views you want to utilise. But once you’ve found a Data View, how is it best to analyse it?
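
One concrete option for this role is the Pangeo ecosystem’s intake library; a minimal sketch, assuming a hypothetical catalogue file and entry name:

```python
import intake

# A catalogue holds metadata and pointers, not the data itself.
cat = intake.open_catalog("master-catalogue.yml")  # hypothetical file

# Discovery and inspection happen against metadata only.
print(list(cat))                       # names of the available Data Views
print(cat.air_temperature.describe())  # units, dimensions, provenance, ...

# No chunk data moves until the analyst actually asks for the view.
ds = cat.air_temperature.to_dask()
```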

A Data Broker is responsible for turning Data Views into Analysis Ready Data Objects. Our analyst could, in principle, write queries against the Data Base directly, but in reality there are often well-established Python objects for interacting with data. The Data Broker is simply responsible for wrapping the underlying services and presenting them in a way that is useful for our analyst. Note that the same Data View could be presented as multiple Analysis Ready Data Objects: for instance, a forecast temperature field could equally be presented as an Iris Cube or an Xarray DataArray, depending on what the analyst wants.
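
A minimal broker sketch (the view registry, store URL, and variable name are all hypothetical; fsspec and xarray do the heavy lifting):

```python
import fsspec
import xarray as xr

# Hypothetical registry mapping Data View names to their chunked stores.
VIEWS = {"air_temperature": "s3://met-office-example/air_temperature.zarr"}

def get_data_object(view_name, flavour="xarray"):
    """Wrap a Data View as the caller's preferred Analysis Ready Data Object."""
    mapper = fsspec.get_mapper(VIEWS[view_name], anon=True)
    ds = xr.open_zarr(mapper)  # lazy: metadata only, no chunk reads yet
    if flavour == "xarray":
        return ds["air_temperature"]
    if flavour == "iris":
        # Same Data View, different object (requires the iris package).
        return ds["air_temperature"].to_iris()
    raise ValueError(f"unknown flavour: {flavour!r}")
```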

Some specific technologies

We’ve been experimenting with a bunch of emerging technologies which can fulfil these different component purposes. Note that some of these technologies may not fit neatly into their assigned boxes.

[Image: several different possible stacks.]

Optimisation considerations

There are multiple trade-offs when choosing how to implement these component services.

To munge or not to munge? First, we need to decide whether we are happy to munge data when putting it into Data Stores: for instance, converting data from NetCDF, which is the standard data format in the Met Office, to the cloud-native Zarr format. If we munge the data, it’s harder for people to access with legacy tools. We also have nearly 1 EB of data in our archive, which is too much data to munge into new formats. However, munging does standardise data access, and can lead to optimisations.
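
The munging step itself is mechanically simple (the source file here is hypothetical):

```python
import xarray as xr

# Read a NetCDF file and rewrite it as chunked, cloud-friendly Zarr.
ds = xr.open_dataset("forecast.nc")  # hypothetical source file
ds.chunk({"time": 24}).to_zarr("forecast.zarr", mode="w")
```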

What about chunking policy? Choosing the shape of our chunks is an opinionated step which is inevitably optimised for particular access patterns and not others. In time, we would like to investigate a model where we can easily change the chunking policy in response to access patterns. We think it may be possible to do this automatically using machine learning algorithms, much like some data caches are primed.
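
Re-chunking today means a rewrite; a sketch (assuming the hypothetical store above has time, latitude, and longitude dimensions):

```python
import xarray as xr

ds = xr.open_zarr("forecast.zarr")  # hypothetical store from above

# Re-optimise for time series at a point: thin in space, long in time.
rechunked = ds.chunk({"time": -1, "latitude": 10, "longitude": 10})

# Drop the stored chunk encoding so the new policy takes effect on write.
for name in rechunked.variables:
    rechunked[name].encoding.pop("chunks", None)

rechunked.to_zarr("forecast-timeseries.zarr", mode="w")
```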

Indexed vs generative Data Bases? Data Bases can often be split another way: indexed vs generative. Indexed Data Bases keep a central record of which data chunk is where, and this record is queried to retrieve data. Generative approaches, in contrast, find data chunks by generating chunk addresses from a standardised nomenclature rather than querying a centralised index. As such, generative approaches tend to scale very well naturally. Zarr is an example of a generative approach.
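
The generative idea is easy to sketch: a chunk’s address is a pure function of the index you want, so any worker can compute it without consulting anything central. Zarr’s default chunk naming works exactly this way:

```python
def chunk_key(index, chunk_shape):
    """Compute a Zarr-style chunk key directly from an array index."""
    return ".".join(str(i // c) for i, c in zip(index, chunk_shape))

# Element (12, 345, 678) of an array chunked as (10, 100, 100) lives in
# chunk "1.3.6": no central index was queried to find it.
print(chunk_key((12, 345, 678), (10, 100, 100)))  # -> 1.3.6
```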

Declarative vs ingestive Data Bases? In our experience, gridded Data Base technologies can often be broadly split into two categories: declarative vs ingestive. Ingestive Data Bases inspect incoming data before building a Data View. In declarative Data Bases, on the other hand, Data Views are declared based on a priori knowledge. This approach is often useful for us: we’ve normally generated our data from an earth system model, which means we know the structure to expect; that is, the scientists who generate the data tend to know the Data View. Ingestive Data Bases, by contrast, often struggle to algorithmically create meaningful Data Views.
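
A declarative Data View can be as simple as a specification written down before any data exists (everything in this sketch is hypothetical):

```python
# Hypothetical declarative Data View: the structure is declared up front
# from knowledge of the model run, not inferred by inspecting files.
view_spec = {
    "name": "global_air_temperature",
    "units": "K",
    "dims": {
        "forecast_reference_time": 4,  # model runs per day
        "forecast_period": 55,         # output steps per run
        "latitude": 1920,
        "longitude": 2560,
    },
    # With a generative scheme, chunk addresses follow from this spec alone.
    "chunks": {"forecast_period": 1, "latitude": 960, "longitude": 1280},
}
```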

We’ve done a lot of development on generative, declarative Data Bases, such as our Hypotheticube approach.

Concluding remarks

Various systems have tackled this problem in the past. However, we think that the problem has grown more acute, and relevant new technologies have emerged. Some of the existing approaches to this challenge encompass more than one of the concepts presented here, which is a double-edged sword: it can lead to a better-integrated, holistic system, but we think that now is the time for a more deliberate separation of concerns, as described above. That way we can create an ecosystem of data services which can all work together.
