What is the Open Data Cube?
The Open Data Cube (ODC) is a community of people and organisations building capability for working with earth observation data. At its core, the ODC is a software library and a set of command line tools. Around that core is a group of organisations that collaborate on strategy and may maintain their own implementations of the ODC.
According to the OpenDataCube.org website:
The objective of the ODC is to increase the impact of satellite data by providing an open and freely accessible exploitation tool, and to foster a community to develop, sustain, and grow the breadth and depth of applications.
At a technical level, an implementation of the ODC is made up of three things: data, an index and software.
- Data is usually file based, stored in local directories of GeoTIFF or NetCDF files, but it can be anything that GDAL can read, including Cloud Optimised GeoTIFFs stored on AWS S3.
- For the Index, the ODC uses a PostgreSQL database to store a list of Products (a specific data type, like Landsat 8 Analysis Ready Data) and Datasets (a single instance of a Product, for example, a single Landsat 8 scene). The index enables a user to ask for data for a given time and location, without needing to know where the required files are stored or how to access them.
- The Software at the core of the ODC is a Python library that enables a user to index data (add records to the Index), ingest data (optimise indexed data for performance) and query data (returning it in a standard data format), along with a wide range of other data management functions.
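The relationship between Products, Datasets and a time-and-place query can be sketched in a few lines of plain Python. This is a toy in-memory model, not the ODC's actual API or database schema: the class names, fields and the `find_datasets` helper are all hypothetical, and the file locations are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Product:
    """A kind of data, for example Landsat 8 Analysis Ready Data."""
    name: str


@dataclass(frozen=True)
class Dataset:
    """One instance of a Product, for example a single scene."""
    product: Product
    acquired: date
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat)
    location: str  # where the file lives; the person querying never needs this


def overlaps(a, b):
    """Return True if two (min_x, min_y, max_x, max_y) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_datasets(index, product_name, bbox, start, end):
    """Return the datasets of a product that match a place and a time range."""
    return [
        ds for ds in index
        if ds.product.name == product_name
        and start <= ds.acquired <= end
        and overlaps(ds.bbox, bbox)
    ]


# Hypothetical records: the paths and extents are illustrative only.
ls8 = Product("ls8_ard")
index = [
    Dataset(ls8, date(2018, 1, 5), (148.0, -36.0, 150.0, -34.0), "file:///data/ls8/scene_a.tif"),
    Dataset(ls8, date(2018, 3, 9), (148.0, -36.0, 150.0, -34.0), "s3://example-bucket/ls8/scene_b.tif"),
    Dataset(ls8, date(2018, 1, 7), (115.0, -33.0, 117.0, -31.0), "file:///data/ls8/scene_c.tif"),
]

# Ask for a product, a place and a time range; the index supplies the locations.
matches = find_datasets(index, "ls8_ard", (149.0, -35.5, 149.3, -35.1),
                        date(2018, 1, 1), date(2018, 1, 31))
for ds in matches:
    print(ds.acquired, ds.location)
```

The query names only a product, a bounding box and a date range. Whether a matching file sits on local disk or in a bucket is a detail held by the index.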
The broad goal of the ODC is to make it easier to manage large data holdings, without requiring data to be stored in a specific way or in a specific place. In practice, you can point the ODC at your data repository and index the data where it sits, which abstracts away the complexity of managing large, distributed data holdings.
There are also a number of projects that sit on top of the ODC to add further capabilities, such as a WMS server, a Django-based user interface and a holdings and provenance dashboard. We are also finding that the ODC, when used in combination with Jupyter notebooks, is very useful for data science applications.
A little bit of history
A long, long time ago, satellite imagery was distributed on large rolls of tape. A less long time ago, in 2011, Geoscience Australia worked with a number of other organisations to copy the Landsat data they had stored on tapes and in other locations onto spinning disks at the National Computational Infrastructure, as part of the Unlocking the Landsat Archive project. Some time after this, the Australian Geoscience Data Cube (AGDC) was developed: a Landsat-specific tool built to improve access to these Landsat archives.
More recently, AGDC was rewritten as AGDCv2. The newer version had a number of goals, including supporting arbitrary coordinate reference systems and file formats, and keeping track of processing provenance. In 2017, AGDCv2 was renamed the Open Data Cube, and governance structures were set up to ensure that the project would have continuing long-term support.
Where are we now…
The Open Data Cube project has major operational deployments in three countries, and many other countries are in various stages of implementation.
While the focus of the ODC has been big installations, such as Digital Earth Australia, there has been some recent work at the infrastructure level which aims to make it easier to get started. This includes Docker images and reference implementations using Docker Compose, so that getting started with the ODC only takes a few minutes. And some of the team are working on a JupyterHub sandbox environment, so that the ODC can be evaluated without having to deploy any infrastructure at all.
And that brings us to where the ODC is heading, and how that relates to developments in the broader community.
… and where are we going?
So what is the future? Without straying into speculation, there are a number of key technologies that leading earth observation organisations are pushing forward. Chris Holmes has written extensively on most of these topics.
Firstly, Cloud Optimised GeoTIFF (COG) is a fantastic standard that makes it possible to store data on AWS S3 or similar, and to access small parts of a file without downloading the whole thing. Because the ODC uses Rasterio and GDAL to read data, it handles COGs natively and can index data on S3 in place. This means there’s no need to transfer vast holdings of data to a local workspace: it can all be streamed on demand.
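To see why partial reads are possible: a COG stores its pixels in fixed-size internal tiles, so a reader only needs to fetch the tiles that overlap the requested window, typically using HTTP range requests. The sketch below is plain Python arithmetic rather than Rasterio or GDAL code. The function name, image size and tile size are assumptions chosen for illustration.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return the (tile_row, tile_col) pairs of the tiles covering a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


# Hypothetical 8192 x 8192 pixel image stored as 512 x 512 pixel tiles.
image_size = 8192
tile_size = 512
total_tiles = (image_size // tile_size) ** 2

# A 600 x 600 pixel window starting at column 1000, row 2000.
needed = tiles_for_window(1000, 2000, 600, 600, tile_size)
print(f"{len(needed)} of {total_tiles} tiles needed")
```

Only the tiles in `needed` would have to be transferred. The rest of the file stays where it is.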
Secondly, SpatioTemporal Asset Catalog (STAC) is a metadata standard designed to sit alongside spatial data stored on a cloud service. It describes what data is available in a catalog, along with the specific details needed to use that data easily. While the ODC can’t read STAC yet, work is underway to explore the use of STAC files as a source of information to index data, and there’s further potential for the ODC to expose a STAC representation of its index.
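As a rough sketch of what indexing from STAC could involve, the snippet below pulls the fields an index would need out of a minimal STAC Item. The Item follows the general shape of the STAC Item specification (a GeoJSON Feature with `id`, `bbox`, `properties.datetime` and `assets`), but the values, the URLs and the `stac_item_to_record` helper are hypothetical. It is not a description of how the ODC does, or will, read STAC.

```python
from datetime import datetime

# A minimal, hand-written STAC Item; all values are illustrative.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "bbox": [148.0, -36.0, 150.0, -34.0],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [148.0, -36.0], [150.0, -36.0], [150.0, -34.0],
            [148.0, -34.0], [148.0, -36.0],
        ]],
    },
    "properties": {"datetime": "2018-01-05T00:12:34Z"},
    "assets": {
        "red": {"href": "https://example.com/scenes/001/red.tif"},
        "nir": {"href": "https://example.com/scenes/001/nir.tif"},
    },
    "links": [],
}


def stac_item_to_record(item):
    """Extract the when, where and what that an index entry would need."""
    # Swap the trailing "Z" for an explicit UTC offset so fromisoformat accepts it.
    timestamp = item["properties"]["datetime"].replace("Z", "+00:00")
    return {
        "id": item["id"],
        "time": datetime.fromisoformat(timestamp),
        "bbox": tuple(item["bbox"]),
        "measurements": {name: asset["href"] for name, asset in item["assets"].items()},
    }


record = stac_item_to_record(item)
print(record["id"], record["time"].isoformat(), sorted(record["measurements"]))
```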
But ultimately, the future of the ODC is in its users.
So join us. Help to ensure that the ODC is useful to you.