Create time-series datacubes for supervised machine learning with ICEYE SAR images

Published in ICEYE Analytics · 9 min read · Oct 19, 2021

This is the second blog of a five-part series on the AI4SAR project. If you haven’t heard about AI4SAR yet, now is a good time to see our project page!

My colleague, Arnaud Dupeyrat, and I recently presented ICEcube at ESA Φ-week. We were thrilled to see the interest in it and got some amazing questions about it!

In this blog, I’m diving a bit deeper into the icecube toolkit that enables you to create time-series datacubes for supervised machine learning using ICEYE SAR images. At the end of this blog, you’ll learn that ICEcubes can be used for more than just a margarita ;)

But first, what are datacubes?

Datacubes have been widely used for storing, accessing, and analyzing massive amounts of data. Think of them as information vaults that are user-friendly and platform-agnostic, and that make it easier to reach the desired piece of information.

Many definitions exist, depending on the context and application. For the sake of simplicity in this blog, I see a datacube as a multidimensional array for handling big data. Ideally, such a datacube handles large volumes of data efficiently and in a memory-optimized way.

High-frequency and high-fidelity Synthetic Aperture Radar (SAR) images are an example of such massive data. Datacubes are beneficial here because they make SAR images easier to manage and easier to ingest into downstream applications, whether those use traditional algorithms or machine learning.

AI4SAR, a project sponsored by ESA Φ-lab, is an attempt to lower the entry barrier to SAR-based machine learning applications.

And why can’t we use geospatial images in the raw format?

Remote-Sensing/Earth Observation (EO) data requires deep domain knowledge for data preparation and preprocessing. It cannot be readily used by non-EO experts, and working with SAR data adds another layer to that existing knowledge barrier.

For Machine Learning (ML) engineers in particular, who are used to training models on natural images, preprocessing steps such as calibration, map projection, coregistration, and transforming labels from map geometry to SAR geometry add unnecessary overhead and distract from the primary purpose: training a machine learning model.

It is a known problem that the EO big data is not readily consumed by the Data Science/Machine Learning community due to this knowledge gap, and SAR experts try to bridge it through instructions on how to preprocess the data for different applications. But the burden of learning the ropes is still on the data consumers.

To shift this burden back to the data providers, EO data can be provided as analysis-ready data (ARD). ARD is user-friendly, as it simplifies the underlying knowledge that’s needed to prepare and manage the data.

Back to the icecube toolkit

With the icecube toolkit, we intend to provide a datacube for time-series SAR data, specifically designed for the Data Science/Machine Learning community. With the toolkit, we aim not only to bridge the gap between EO data and the ML community, but also to provide an open-source infrastructure that helps the community grow.

At the same time, we don’t intend the icecube toolkit to be a panacea that solves all the problems for machine learning engineers when they work with the EO data. Instead, we hope that it ignites a spark for active collaboration and ongoing communication within this community so we can all better understand the needs and let the toolkit evolve to serve them.

With the icecube toolkit, we want to ensure that the community can easily:

Analyze time-series ICEYE data
Configure ICEYE’s time-series data for critical analysis and A/B testing
Leverage the power of datacubes for accessing, sharing, and managing ICEYE data

icecube refers to the toolkit, and ICEcube is the datacube you build with this toolkit.

Let’s answer the following questions to give you some insight into the toolkit:

How can the toolkit be used to create ML-friendly datacubes with a time-series stack of images?
What are the building blocks of an ICEcube?
What is the significance of the datacube configuration and how does it reduce some of the ML engineering burden?

An ML-oriented datacube

ML models can be broadly divided into two categories: Unsupervised (no labels required) and Supervised (labels needed).

The aim of the icecube toolkit is to ensure that the generated datacubes preserve not only the SAR signal (amplitude/complex phase), but also the associated labels needed to train supervised machine learning models. This removes the burden of managing the annotations and SAR images separately.

(Image by ICEYE) Illustrates how the generated cubes fuse together with the ML pipeline for model training. This makes it easier for ML engineers to train ML models without worrying about the domain-specific knowledge of SAR.

A self-sufficient datacube oriented to facilitate AI/ML should not only contain the SAR pixel values, but also have the capacity to ingest labels when needed. Datacubes generated with the icecube toolkit can be divided into two major sub-datacubes: SAR-derived datacube and Labels datacube.

SAR-derived datacubes contain amplitude information for ground range detected (GRD) images and complex information (amplitude and phase) for single look complex (SLC) images. Retaining the complex information broadens the application of datacubes to use cases like coherence analysis and SAR interferograms. The metadata is preserved in the datacube’s dictionary-like objects, which lets users perform geospatial and computational analysis.

Pretty soon, we will be releasing a sample dataset that contains ICEYE SLC and GRD images so that you can play with the icecube toolkit!

Labels are an important component of supervised learning algorithms. ICEcubes can preserve annotations in vector and raster formats to enable end-to-end ML training. This makes an ICEcube self-contained from the ML perspective: input data (X) and output (y) can be easily accessed within the same data object.

Raster labels contain pixel labels (mostly binary) for training neural networks. They are most widely used in segmentation tasks, such as water segmentation. Vector labels are mostly stored as well-known text (WKT) geometries and support polygons, bounding boxes, points, and multipoints. In practice, though, these labels don’t have to be WKT geometries and can be arbitrary vector labels.

The Labels datacube notebook in the icecube documentation shows how to build a Labels datacube with the toolkit.
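To make the X/y idea concrete, here is a minimal sketch that uses plain xarray rather than the icecube API. The variable names (sar_amplitude, water_mask), the dimension names, and the random data are all illustrative assumptions, and the layout of a real Labels datacube may differ. Only raster labels are shown.

```python
import numpy as np
import xarray as xr

# Hypothetical stack: 3 acquisitions of a 4x5 pixel scene (random stand-in data).
rng = np.random.default_rng(0)
times = np.array(["2021-01-01", "2021-01-08", "2021-01-15"], dtype="datetime64[ns]")
amplitude = rng.random((3, 4, 5)).astype("float32")

# Binary raster labels (1 = water, 0 = not water), one mask per acquisition.
# Thresholding random data is only a placeholder for real annotations.
water_mask = (amplitude > 0.5).astype("uint8")

dims = ("time", "azimuth", "range")
cube = xr.Dataset(
    data_vars={
        "sar_amplitude": (dims, amplitude),
        "water_mask": (dims, water_mask),
    },
    coords={"time": times},
)

# Input (X) and target (y) for one acquisition come from the same object.
sample = cube.isel(time=1)
X = sample["sar_amplitude"].values
y = sample["water_mask"].values
print(X.shape, y.shape)
```

Because both variables share the same dimensions, any selection along time, azimuth, or range is applied to the image and its mask together, so the two stay aligned.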

The ICEcube data structure

In state-of-the-art datacube implementations, the most widely used data structure is xarray. It is an open-source project that provides out-of-the-box support for creating massive multidimensional arrays.

It enables you to easily manage different data types, build massive multidimensional data, perform parallel computing, and manage data in grids. Xarray integrates well with numpy, which makes it a great choice for ML pipelines because most frameworks work with numpy arrays to create tensors for model training.

Datacubes are usually massive and can contain up to several terabytes of data that cannot fit into the average computer memory. This is where Dask comes to the rescue with its parallel computational power.

It divides the array(s) into smaller chunks and queues operations on the chunks. Computations are performed on the chunks, block by block, when required (and this is where data is loaded into memory).
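The chunk-and-queue behavior described above can be seen in a small dask.array sketch. The array shape and chunk sizes are arbitrary assumptions chosen to keep the example fast. They are not values used by the icecube toolkit.

```python
import dask.array as da

# A lazily defined 3D array standing in for a (time, azimuth, range) stack.
# Dask only records the chunk layout here; no pixel data is allocated yet.
stack = da.ones((10, 400, 400), chunks=(1, 200, 200), dtype="float32")

# Queue a per-pixel temporal mean. This builds a task graph, not a result.
temporal_mean = stack.mean(axis=0)

# compute() executes the graph chunk by chunk and returns a NumPy array.
result = temporal_mean.compute()
print(stack.numblocks, result.shape)
```

Nothing is evaluated until compute() is called, which is why a stack far larger than memory can still be described and reduced piece by piece.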

Xarray integrates with Dask to support parallel computations and streaming computation on datasets that don’t fit into memory. This enables the icecube toolkit to generate huge datacubes of arbitrary size, thereby solving the memory bottleneck.

I like that Xarray has built-in support for Zarr, which helps create cloud-optimized datacubes with chunking and compression capabilities.

Finally, the generated ICEcube is stored in the netCDF4 format (.nc). Think of netCDF4 as a wrapper on top of HDF5 with improved features for data compression, parallelization, chunking, and support for huge multidimensional data arrays.

Alright! Now that we know a bit more about the data structure of ICEcube, let’s explore its design and architecture.

The ICEcube architecture

Given a coregistered stack of ICEYE SAR images, you can use the icecube toolkit to easily generate a datacube. You can use either a directory that contains the stack or a list of image paths, as illustrated in the following image.

The icecube toolkit facilitates the creation of datacubes for multi-temporal SAR images, commonly referred to as a SAR stack. It is important to coregister SAR images before creating a cube because coregistration ensures that the same pixel refers to the same location in the map (longitude, latitude) or image (azimuth, range) coordinates across the time span of the stack.

If you are a beginner, this is a great source to learn about coregistration.

(Image by ICEYE) Illustrates the architecture diagram of the icecube toolkit. An object-oriented architecture keeps the Python library modular. The IceyeProcessGenerateCube class abstracts low-level details away from users, so they can create cubes without worrying about the implementation details.

The primary components are briefly described as follows:

local ICEYE images: Indicates a local directory containing the coregistered ICEYE stack.
user_config.json: Contains the user-specified configuration for the datacube.
Labels.json: Contains labels for the ICEYE stack in the icecube-formatted JSON structure.
IceyeProcessGenerateCube: Is the main class that users interact with. It contains the logic to trigger the correct classes (or code blocks).
SARDataCubeMetadata: Builds the metadata from the specified stack of images. This is an efficient way to create a datacube without having to read images in memory.
SARDatacube: Is the parent class of GRDDatacube and SLCDatacube that generates a datacube for GRD and SLC images respectively.
LabelsDatacube: Is the parent class of RasterLabels and VectorLabels that generates a datacube for raster and vector labels respectively.

The following image illustrates the structure of a simple datacube that is generated using the icecube toolkit.

(Image by ICEYE) Illustrates a simple datacube. A datacube is basically an xarray dataset. Its values represent the concatenated xarray data arrays with dask and the coordinates are azimuth, range, and time. The metadata of the data arrays (or SAR images) are preserved in the dataset attributes.

SAR images inside the stack are converted to xr.DataArrays and concatenated along the third dimension. The metadata of each SAR image is preserved in the attributes of the xarray.DataArray.

Each xarray data array in the dataset is identified by a name called a data variable. With more stacks, you can generate more datasets. You can also concatenate xarray data arrays and datasets. With the parallel computing power of Dask, you can merge as many datasets as you need and create arbitrarily massive datacubes.
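As a rough illustration of this structure, the following sketch builds a tiny stack with plain xarray. It is not the toolkit's implementation: the data variable name (intensity), the product_type attribute, and the constant-valued images are hypothetical stand-ins, and Dask chunking is left out to keep the example small.

```python
import numpy as np
import xarray as xr

# Three hypothetical coregistered acquisitions of the same 4x5 scene.
dates = np.array(["2021-01-01", "2021-01-08", "2021-01-15"], dtype="datetime64[ns]")
images = [
    xr.DataArray(
        np.full((4, 5), float(i), dtype="float32"),  # constant stand-in pixels
        dims=("azimuth", "range"),
    )
    for i in range(len(dates))
]

# Concatenate along a new "time" dimension, attach the dates as its coordinate,
# and move "time" to the third position.
stack = (
    xr.concat(images, dim="time")
    .assign_coords(time=dates)
    .transpose("azimuth", "range", "time")
)

# Wrap the stack in a dataset under a named data variable, with metadata
# stored as attributes.
cube = xr.Dataset({"intensity": stack}, attrs={"product_type": "GRD"})

last_image = cube["intensity"].isel(time=-1)
print(cube["intensity"].shape, float(last_image.mean()))
```

A second stack could be added to the same dataset under another data variable name, or combined with this one using xarray's concat and merge functions.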

ICEcube configuration

Configuring datacubes by a JSON file provides a convenient way to build datacubes from a monolithic directory of SAR images without worrying about the manual selection of SAR data. You can dump all the SAR images into a common directory and slice the information based on your need by specifying the date range.

Similarly, for certain applications it is useful to take an observation only every ‘d’ days, which is what the temporal_resolution parameter of the configuration is for.

Be it an event of interest with a short time period or of a repeated nature, you can configure your ICEcube to suit your application. Moreover, you can use a range of incidence angles to study the effect of different incidence angles for a specific application.

The following is a brief description of the parameters that you can pass to configure the datacubes:

start_date: The date from which the SAR images will be considered for the stack.
end_date: The last date of the image stack.
min_incidence_angle: The lowest incidence angle in the image stack.
max_incidence_angle: The highest incidence angle in the image stack.
temporal_resolution: The observation interval (number of days) to be considered between the images in the stack.
temporal_overlap: Whether images with the same date should be considered.

If you use the default user configuration, all the images inside the directory are considered for building the datacubes and no pruning on dates or incidence angles is performed.
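To show how such a configuration could prune a directory of images, here is a standard-library sketch. The parameter names come from the list above, but the value formats (ISO dates, angles in degrees, interval in days), the image metadata tuples, and the select_images function are assumptions made for illustration. This is one plausible reading of the parameters, not the toolkit's actual selection logic.

```python
import json
from datetime import date

# Hypothetical configuration; value formats are assumptions.
config = json.loads("""
{
  "start_date": "2021-01-01",
  "end_date": "2021-01-31",
  "min_incidence_angle": 20,
  "max_incidence_angle": 35,
  "temporal_resolution": 7,
  "temporal_overlap": false
}
""")

# Hypothetical per-image metadata: (acquisition date, incidence angle in degrees).
images = [
    ("2020-12-30", 25.0),  # before start_date
    ("2021-01-02", 25.0),
    ("2021-01-02", 26.0),  # same date as the previous image
    ("2021-01-05", 25.0),  # fewer than 7 days after the last kept image
    ("2021-01-10", 40.0),  # incidence angle out of range
    ("2021-01-12", 30.0),
]

def select_images(images, cfg):
    """Keep images that satisfy the date, angle, and spacing constraints."""
    start = date.fromisoformat(cfg["start_date"])
    end = date.fromisoformat(cfg["end_date"])
    kept = []
    for day_str, angle in sorted(images):
        day = date.fromisoformat(day_str)
        if not (start <= day <= end):
            continue
        if not (cfg["min_incidence_angle"] <= angle <= cfg["max_incidence_angle"]):
            continue
        if kept:
            gap = (day - date.fromisoformat(kept[-1][0])).days
            if gap == 0 and not cfg["temporal_overlap"]:
                continue  # drop same-date duplicates
            if 0 < gap < cfg["temporal_resolution"]:
                continue  # too close to the previously kept image
        kept.append((day_str, angle))
    return kept

selected = select_images(images, config)
print(selected)
```

With this reading, only the images from 2021-01-02 (25.0 degrees) and 2021-01-12 remain. Setting temporal_overlap to true would also keep the second image from 2021-01-02.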

To sum it up, the power of ICEcube combined with custom configuration provides convenient ways to arrange, manage, and analyze data. This helps you focus on analytics instead of worrying about how to manage and filter your data for analysis.

Build your own ICEcube!

If this blog has piqued your interest, jump right into this detailed notebook to build your first ICEcube now!

Happy Cubing :)