Serverless Querying of Remote Sensing Data at Scale

Published in

IBM Data Science in Practice

7 min readApr 15, 2021

This article and exploration is a joint effort from the IBM Research (Linsong Chu, Raghu Ganti, and Mudhakar Srivatsa) and The Weather Company (Chad Johnson, Sam Baskinger, and Marcio Saban)

Background

Availability of remote sensing data and other satellite imagery data has been increasing significantly in the last decade, with some estimates putting these data volumes at several hundreds of petabytes a year. Such data is the basis for weather forecasting, insurance loss estimation, and climate change. Providing access to these data in a fast, easy, and cloud native manner is key to driving innovations in this space. These data come in raster and vector formats. Raster data are digital imagery that capture Earth wide phenomena using satellites, aerial vehicles, or scanned maps. Vector data represent real world features such as places of interest, buildings, and road networks. In this article, we will focus on storing large volumes of raster data on Object stores and using simple python APIs without a database to query such data in an efficient manner.

Figure 1: Remote sensing meets the Cloud (source: Unsplash, credit(l) @usgs, credit(r): @dianmia)

With the advances in Cloud combined with cloud native formats for raster data, IBM Research and The Weather Company jointly explored the efficacy of using spatiotemporally indexed raster data on Cloud Object Storage (an S3 compatible Object storage) for querying varying regions and time frames. We will examine in the rest of this article the performance of querying 10 years of MODIS NDVI (vegetation index) data for the entire Earth.

Before we deep dive into querying and its performance, we note that today’s MODIS data (among other remote sensing data) typically is distributed in scientific formats such as netCDF, HDF, and GeoTiff files. These formats are great for distribution, but not good for querying data at rest. Popularity of querying data at rest (in object stores) has gained significant traction in the last few years and with big data formats such as Parquet and ORC, indeed querying data at rest is feasible and at performant query times as well as attractive price point. Can we follow a similar approach for Raster data and do we have all the “pieces” in place to leverage this query at rest primitive?

An image of a chessboard before playing has begun — Figure 2: Pieces in place (source: Unsplash, credit: @hpzworkz)

Given our years of expertise in this space and the development of key technologies such as Cloud Optimized GeoTiff (COG) , GDAL’s Virtual File System, and Object storage range reads, we felt that the data format pieces are in place and with the right transformations and layout of the raster data, we should be able to query this data using a simple Python API without having a large database or noSQL store stood up.

Ingestion and layout on object store

For the ingest process that transforms the scientific format to the COG format with the virtual file system index, we use simple transforms using a wrapper on the Python API to GDAL. For larger scale ingests, we deploy this in a Kubernetes service. For more details on the ingest step, the reader can refer to our related article here since it is out of scope of this article. We followed the process in the article and laid out 10 years of all of MODIS NDVI data as a virtual dataset on Cloud Object Store (using Cloud Code Engine, a serverless Kubernetes service). Now, we are ready to do some query performance testing! The total volume of data in the object storage is 2TB.

The layout on Object Storage is a Hive style layout with each “key” holding multiple TIF images and a index.vrt key holding the index information.

modis/processed_images/product=MOD13Q1.006/layer=MOD13Q1_250m_16_days_NDVI/date=20200609/

Query performance

In our experience, end users and developers are most interested in queries such as: given a region and timeframe, can we retrieve the corresponding raster images? We characterize the performance of retrieval of images from the 2TB dataset for different types of queries using a Watson Studio Python notebook with 8vCPUs and 16GB RAM. All the query times are measuring the time taken to query object storage and then write the TIFF images to the local storage and are averaged over 10 runs each.

The first query is a simple point and radius for a given day, where we increase the radius from 1km to 9km and measure the time taken to retrieve the images to the notebook. The results are shown below (Figure 3).

Query time in milliseconds versus Radius in kilometers for a point in Amazon. At 1 kilometer radius, the query time takes between 450 and 500 milliseconds. The query time reaches a maximum of close to 550 milliseconds when the radius is at 2 kilometers, and a minimum of slightly below 400 milliseconds at a radius of 4 kilometers. — Figure 3: Query time vs Radius for a point in Amazon

The average query times did not change much when varying the radius from 1 to 10km and took anywhere between 400ms and 500ms! The NDVI TIFF images are visualized in Figure 4 for 1km and 9km radius and the reader can see the zoomed out view of the Amazon river.

Figure 4: Amazon river, 1km view on the left and 9km view on the right (Source: MODIS NDVI data)

We observe a similar behavior when we change the query to a bounding box of size 9km² and increase the size of the bounding box in increments of 200m². The query retrieval times are shown in Figure 5.

Query time in milliseconds versus bounding box size in km². The query time reaches to a global maximum of between 500 and 550 milliseconds at a box size of 9200 km² and a local maximum at just slightly over 500 milliseconds when the box size is close to 10,200 km². The query time is at its minimum shown at just slightly larger than 350 milliseconds when the box size is close to 10,600 km² — Figure 5: Query time vs bounding box size in km²

The code corresponding to the bounding box queries is illustrated below (for brevity, we have implemented wrapper functions to easily query the data, which is part of the pyrip package).

Radius query

Bounding box query

It is interesting to see from Figure 5 that as the region sizes are increased from 9km² to 10km², the query times do not change much. This is due to the fact that MODIS NDVI data is at 250m granularity and the density is lower. We increase the query radius in steps of 50km and then measure the performance, which shows a linear increase.

Query times in milliseconds versus radius in increments of 50km. The query time grows steadily from the radius at 50 kilometers. At the beginning the query time is below 750 milliseconds, but when the radius reaches 450 kilometers, the query time reaches 2,250 milliseconds. — Figure 6: Query times vs radius (increments of 50km)

Number of round trips to Object Storage

A deeper understanding of the query performance can be obtained by measuring the number of metadata reads and range reads to the objects in the object storage. Depending on the density of the data and the indexing granularity, the number of queries to the object storage are affected. For the radius query, we measure the number of queries to the metadata (e.g., file list, file size) and the number of range reads (i.e., downloads from the object storage). These numbers are illustrated in Figure 7.

We observe from Figure 7 that the number of round trips to object storage is following a staircase pattern. This is because of range compaction when querying COGs.

Multiprocessing

For time range queries, where we have to retrieve multiple images over a period of time, we use the layout combined with multiprocessing to achieve efficient querying.

The key is that backend object storage inherently supports parallel queries, enabling us to fire off parallel queries and retrieve multiple images using multiprocessing.

We plot the query times for a given bounding box of size 9km² and varying time range from 2 weeks to 32 weeks in Figure 8. We observe from this figure that the query times are more or less a constant.

Multiprocessing query over a time range. The query time is in milliseconds and the time range is in two week increments. The query time reaches a global maximum of 1300 milliseconds when the time range reaches approximately 13 weeks. Query time reaches a local maximum of 1100 milliseconds when the time range is approximately 3 weeks. — Figure 8: Multiprocessing query over a time range

The code corresponding to this multiprocessing query is illustrated below.

As we observed with the radius query, when we increase the time range to tens of weeks, we see a linear increase in retrieval times as multiple processes are saturated.

Query times for large time ranges using multiprocessing. The query time is measured in milliseconds and the time range is given in two week increments. The query time grows mostly steadily, achieving a local maximum of close to 1600 milliseconds when the time range is at 40 weeks. The query time maximum on the plot is the last value charted: between 1800 and 2000 milliseconds when the time range is 90 weeks. — Figure 9: Query times for large time ranges using multiprocessing

Dataframes

A final note for data scientists is to provide simple tools to convert these raster images to usable pandas dataframes.

Our open source library pyrip provides convenient utilities to convert raster images to a Pandas dataframe and perform machine learning tasks. Some related articles from our team on this can be found here and here.

Conclusion

In conclusion, we have shown how to store and query large volumes of remote sensing data on Object storage. This enables the data scientist to leverage simple query APIs so that they can quickly retrieve spatiotemporal data of interest. We encourage the reader to explore more of what is possible using Watson Studio’s spatiotemporal and timeseries functionalities.