Skip the download! Stream NASA data directly into Python objects

Scott Henderson
Published in pangeo · 5 min read · Dec 9, 2020

Authors: Scott Henderson (University of Washington eScience Institute), Matt Hanson (Element84, Inc.)

NASA, STAC, Interoperability

Scientists looking for NASA data often start by going to Earthdata Search, a wonderful web application for quick search and discovery of NASA’s huge Earth Science archive (32 petabytes today and projected to grow to 250 petabytes in the next five years!). This is a tremendous international resource for the geosciences, but there is even more valuable data from other space agencies (ESA, JAXA, DLR…) and commercial operators (MAXAR, Planet, Capella…)! That is why it is exciting to see modern metadata standards like SpatioTemporal Asset Catalogs (STAC 1.0) being adopted for better search interoperability among data providers.

As Earth observation archives continue to grow at a blistering pace, it is becoming increasingly desirable not to download and manage files at all. Instead, you often want to either 1) open small pieces of large files directly in your favorite programming environment, or 2) stream large quantities of data on platforms co-located with the data archive for high performance. Fortunately, the combination of STAC metadata, cloud-optimized data formats, and open-source software is a powerful system for consistent and efficient access to geographically dispersed Earth Observation archives. In this short article we’ll illustrate the system in practice and highlight the Python libraries sat-search and intake-STAC.

Intake-STAC, a new approach to data management

Most search and discovery starts with a geographic bounding box, a time of interest, and perhaps a specific dataset in mind. Here is a small function to perform a search with the sat-search library, and then load the found items with intake-STAC:
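The function itself is not reproduced in this text, so here is a minimal sketch of what such a helper could look like. It assumes that sat-search's Search.search(...) accepts url, collections, datetime, and bbox keywords and that intake-STAC provides intake.open_stac_item_collection; the helper names build_search_kwargs and get_stac_items are illustrative and not necessarily the authors' originals.

```python
def build_search_kwargs(url, collection, dates, bbox):
    """Gather the search parameters in one dict so they are easy to inspect.

    bbox is [min lon, min lat, max lon, max lat]; dates is a
    'start/end' interval string such as '2020-01-01/2020-12-31'.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have four values: west, south, east, north")
    return {
        "url": url,
        "collections": [collection],
        "datetime": dates,
        "bbox": list(bbox),
    }


def get_stac_items(url, collection, dates, bbox):
    """Search a STAC API endpoint and wrap the results as an intake catalog."""
    # Imported here so the parameter helper above works without these packages.
    import intake      # intake-STAC registers open_stac_item_collection
    import satsearch

    results = satsearch.Search.search(
        **build_search_kwargs(url, collection, dates, bbox)
    )
    items = results.items()
    print(f"Found {len(items)} Items")
    return intake.open_stac_item_collection(items)
```

Calling get_stac_items needs network access and both libraries installed; build_search_kwargs can be checked offline.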

During the STAC 6 sprint, we spent some time releasing a new version of intake-STAC to work with the increasing number of STAC 1.0 catalogs, including NASA’s own STAC endpoint. For example, maybe you are interested in Sentinel-1 radar data from the Alaska Satellite Facility (ASF) covering Mt. Shasta (Úytaahkoo) in Northern California in 2020:

The great thing about standardized STAC API endpoints is that you simply have to change the URL to search other data providers. For example, you can use Element84’s Earth Search to find public datasets on Amazon Web Services (AWS) in the same area of interest. In the example below, we’ve just changed the URL endpoint and the collection to search for public Sentinel-2 multi-band Cloud-Optimized GeoTIFFs:
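As a hedged illustration of that swap, the sketch below keeps the area and time window fixed and changes only the endpoint URL and collection. The bounding box, the two URLs (as they were believed to stand around the time of this post), and the Sentinel-2 collection name are assumptions; the ASF collection identifier is deliberately left as a placeholder because it is not given in this text.

```python
# Search parameters shared by both providers (assumed values around Mt. Shasta).
COMMON = {
    "datetime": "2020-01-01/2020-12-31",
    "bbox": [-122.4, 41.3, -122.1, 41.5],  # min lon, min lat, max lon, max lat
}

# Only these two fields change from one provider to the next.
PROVIDERS = {
    "asf_sentinel1": {
        "url": "https://cmr.earthdata.nasa.gov/stac/ASF",
        "collection": "<ASF Sentinel-1 collection id>",  # placeholder, not a real id
    },
    "aws_sentinel2": {
        "url": "https://earth-search.aws.element84.com/v0",
        "collection": "sentinel-s2-l2a-cogs",
    },
}


def search_kwargs(provider):
    """Combine the shared parameters with one provider's URL and collection."""
    entry = PROVIDERS[provider]
    return {"url": entry["url"], "collections": [entry["collection"]], **COMMON}


s1 = search_kwargs("asf_sentinel1")
s2 = search_kwargs("aws_sentinel2")
changed = sorted(key for key in s1 if s1[key] != s2[key])
print(changed)
```

The printed list names the only keys that differ between the two searches, which is the point of a shared API standard.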

There are many ways to interact with an intake catalog such as items_s2. A common starting place is to list the Item identifiers with list(items_s2), or to explore the metadata as a GeoPandas GeoDataFrame with gf = items_s2.to_geopandas():
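With intake-STAC, the two calls named above do this work for you. To make the shape of the metadata concrete, the sketch below flattens a couple of hand-written, STAC-like Items into table rows using only the standard library. The Items and their values are invented for illustration, and this is not a description of how to_geopandas() is implemented.

```python
def items_to_rows(items):
    """Flatten STAC Items (GeoJSON Features) into one flat dict per Item."""
    rows = []
    for item in items:
        row = {"id": item["id"]}
        row.update(item.get("properties", {}))
        rows.append(row)
    return rows


# Two made-up Items; real ones also carry geometry, links, and assets.
sample_items = [
    {
        "type": "Feature",
        "id": "example-item-a",
        "properties": {"datetime": "2020-11-10T19:03:00Z", "eo:cloud_cover": 3.2},
    },
    {
        "type": "Feature",
        "id": "example-item-b",
        "properties": {"datetime": "2020-11-05T19:03:00Z", "eo:cloud_cover": 71.0},
    },
]

rows = items_to_rows(sample_items)
ids = [row["id"] for row in rows]                       # like list(items_s2)
clearest = min(rows, key=lambda row: row["eo:cloud_cover"])
print(ids)
print(clearest["id"])
```

Once the metadata is tabular, picking the least cloudy scene or sorting by date is a one-line operation, which is what makes the GeoDataFrame view convenient.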

Representation of an intake-STAC Item Collection catalog as a GeoPandas GeoDataFrame

Lazy loading high-resolution data cubes

Here is where things get especially exciting! Each STAC Item can have multiple Assets (such as thumbnails, additional metadata files, and various data files). intake-STAC is designed to easily stream STAC Assets into an Xarray DataArray. A unique feature of Xarray is its integration with the Dask library for lazily loading data for distributed computing. What this means is that you can read just enough metadata to describe a dataset, but only read bytes corresponding to data values when required. This integration is critical in order to start interacting with large datasets in milliseconds:
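A minimal sketch of that step follows. It relies on two intake conventions: calling a catalog entry with keyword arguments overrides its parameters, and to_dask() returns a lazy, Dask-backed container. The asset key "B04", the chunk size, and the helper name open_asset_lazily are assumptions for illustration.

```python
def open_asset_lazily(catalog, item_id, asset_key="B04", chunk=2048):
    """Return a Dask-backed xarray.DataArray for one asset of one STAC Item.

    Only metadata is read here; pixel values are fetched when a computation
    or plot actually needs them.
    """
    source = catalog[item_id][asset_key]           # intake entry for the asset
    chunks = {"band": 1, "x": chunk, "y": chunk}   # assumed chunk layout
    return source(chunks=chunks).to_dask()
```

With a catalog like items_s2, a call such as open_asset_lazily(items_s2, some_item_id) would return quickly because no pixel data has been transferred yet; calling .compute() or plotting the result triggers the actual reads.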

HTML representation of an Xarray DataArray (Sentinel-2 band4 raster image)

This particular dataset is a great example of a large file (241 MB uncompressed) where you might only want to analyze a small subset. For example, you’d like to home in on Mt. Shasta rather than work with the full coverage of this satellite image:
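Some back-of-the-envelope arithmetic shows why subsetting pays off. The 241 MB figure is consistent with a 10980 × 10980 pixel band stored as 16-bit integers, which is an assumption about this image rather than something stated in the post. The 2048-pixel chunk size and the pixel window below are likewise hypothetical.

```python
import math

SIDE = 10980          # pixels per side (assumed 10 m Sentinel-2 tile)
BYTES_PER_PIXEL = 2   # 16-bit integers (assumed)
CHUNK = 2048          # hypothetical Dask chunk edge, in pixels


def n_chunks_touched(start, stop, chunk=CHUNK):
    """Number of chunks overlapped by the half-open pixel range [start, stop)."""
    return (stop - 1) // chunk - start // chunk + 1


full_bytes = SIDE * SIDE * BYTES_PER_PIXEL       # 241,120,800 bytes
total_chunks = math.ceil(SIDE / CHUNK) ** 2      # chunks in the whole image

# Hypothetical 2000 x 2000 pixel window around the area of interest.
window = {"x": (1000, 3000), "y": (4000, 6000)}
touched = n_chunks_touched(*window["x"]) * n_chunks_touched(*window["y"])

# Upper bound on uncompressed bytes loaded for the window.
subset_bytes = touched * CHUNK * CHUNK * BYTES_PER_PIXEL

print(f"full image : {full_bytes / 1e6:.0f} MB in {total_chunks} chunks")
print(f"window     : {touched} chunks, at most {subset_bytes / 1e6:.0f} MB")
```

These numbers count uncompressed Dask chunks, not bytes on the wire; the file's internal tiling and compression determine the actual network transfer.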

Screenshot of hvplot interactive visualization of Sentinel-2 band4 image. Standout features with this particular color scale include Mt. Shasta on the left and forest clear-cuts to the east.

What about NASA data?

Accessing public data without authentication is very straightforward, but what about the Sentinel-1 data from NASA’s Alaska Satellite Facility that we started with? To work with NASA data we need to use Earthdata Login credentials:
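The authentication snippet is not reproduced in this text. One common convention is to keep Earthdata Login credentials in a ~/.netrc entry for the host urs.earthdata.nasa.gov, and the sketch below only shows how to read such an entry with the standard library. How the credentials are then handed to the file reader is not covered here, and the helper name read_earthdata_login is illustrative.

```python
import netrc

EDL_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host


def read_earthdata_login(netrc_path=None):
    """Return (username, password) for Earthdata Login from a netrc file.

    With netrc_path=None the default ~/.netrc is used. The file is expected
    to contain a line of the form:
        machine urs.earthdata.nasa.gov login <user> password <password>
    """
    auth = netrc.netrc(netrc_path).authenticators(EDL_HOST)
    if auth is None:
        raise LookupError(f"no entry for {EDL_HOST} in the netrc file")
    login, _account, password = auth
    return login, password
```

Keeping credentials in a netrc file with owner-only permissions avoids hard-coding a password in a notebook.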

Once you’ve configured your authentication, you can easily load data into Xarray just as before! Because this particular data is stored in NetCDF4 format with subdataset Groups, we’re required to specify some additional Xarray options:
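As a hedged sketch of what those additional options can look like, the helper below builds keyword arguments for xarray.open_dataset, whose group argument selects a NetCDF4 group. The group path for this product is not given in this text, so it must be supplied by the caller; the h5netcdf engine and the helper names are assumptions.

```python
def netcdf_group_kwargs(group, engine="h5netcdf"):
    """Keyword arguments for xarray.open_dataset when data sits in a group."""
    if not group:
        raise ValueError("a NetCDF4 group path is required")
    # chunks={} asks xarray for a lazy, Dask-backed dataset (requires dask).
    return {"group": group, "engine": engine, "chunks": {}}


def open_netcdf_group(file_or_path, group):
    """Open one group of a NetCDF4 file lazily with xarray."""
    import xarray as xr  # deferred so the helper above has no dependencies

    return xr.open_dataset(file_or_path, **netcdf_group_kwargs(group))
```

Without the group argument, xarray reads only the root group, which for files organized into subdataset Groups may contain no data variables at all.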

Screenshot of hvplot interactive visualization of Sentinel-1 radar backscatter image. Standout features with this particular color scale include Mt. Shasta (-122.2, 41.4) and bright lava flows to the northwest.

Concluding thoughts and outlook

This article is a glimpse of how STAC metadata, cloud-optimized data formats, and open-source software are promising to revolutionize the way that scientists interact with Earth Observation data. Streaming subdatasets on demand with a common syntax, rather than downloading entire files, will greatly help with data management and facilitate high-performance workflows.

  1. Data format and authentication matter
    For lazy loading and subsetting, the data storage format can have a very large impact on computation efficiency. As seen in this post, it’s possible to open legacy NetCDF or HDF5 formats from URLs, but it’s quite slow. We recommend data providers embrace cloud-native formats like Cloud-Optimized GeoTIFF, Parquet, Zarr, or TileDB. All these formats enable reading subsets of data efficiently over a network connection. Efficient authentication mechanisms and the ability to directly access object storage rather than going through proxies and redirects can also significantly impact data access speed for high-performance use cases.
  2. Public data needs standard metadata for maximum utility!
    There are many public geospatial datasets hosted on cloud providers that do not have accompanying STAC metadata (follow links to public data on: AWS, GCP, Azure). You can find currently available public STAC search endpoints using the fantastic STAC Index. We’d love to see more data providers offering public endpoints! This would be particularly useful for disaster response data (for example Planet or Maxar), which is currently provided via distinct and sometimes cumbersome interfaces. If you’d like to help out with creating valid STAC metadata, check out the pystac library.
  3. Help develop intake-STAC
    As a relatively new project, there is a lot of additional functionality and documentation that could be added, so please consider contributing! Check out more detailed examples of working with intake-STAC or experiment with interactive notebooks on mybinder.org: https://mybinder.org/v2/gh/pangeo-data/intake-stac/master?filepath=examples?urlpath=lab

The code highlighted in this blog post is available as a single notebook viewable on NBviewer.

Acknowledgements: STAC #6 contributors included Joe Hamman, Anderson Banihirwe, and Alex Mandel. intake-STAC is built on top of the fantastic intake and filesystem_spec libraries created by Martin Durant.

Scott Henderson, editor for pangeo. Research geophysicist at University of Washington eScience Institute.