Cloud Native Geoprocessing of Earth Observation Satellite Data with Pangeo
If you are familiar with satellite imagery you’ve likely heard that we are entering a “golden era” of Earth Observation. It’s true! New satellites are generating Petabyte-scale publicly available archives of imagery at unprecedented rates, enabling new insights and fast global impacts.
This deluge of imagery is forcing scientists to reconsider traditional workflows of downloading thousands of image files to work on a personal computer. The alternative is a “Cloud Native” approach: forgo downloading data and operate on the data where it is stored on the Cloud. This approach has the major advantage of being able to utilize vastly scalable computing resources to crop, transform, and apply algorithms to imagery very quickly, quickly enough to enable interactive analysis at full resolution over the entire globe.
For scientists this scalability and interactivity are fundamental to the process of discovery. A few years ago a seminal paper quantified global deforestation by applying a Cloud Native approach to the entire Landsat archive (Hansen et al. 2013). This paper was truly inspirational, demonstrating that questions of global scope are not limited to a select few with access to supercomputers. And in the last several years, Cloud Native tools for scientific research have been growing rapidly. In this article we will highlight a number of these tools, but focus on the Pangeo project: “A community platform for Big Data geoscience”.
Cloud Native Landsat Analysis with Pangeo
We’ve developed an example Cloud Native quantitative analysis of Landsat 8 satellite imagery. What is special about this example is that the analysis is easily reproduced, scalable, and interactive: 100 Gigabytes of Landsat 8 images covering Washington State (representing the entire archive back to 2013-03-21) are found using NASA’s Common Metadata Repository (CMR). Then, using URLs instead of local file paths, the Normalized Difference Vegetation Index (NDVI), a simple index commonly used for landcover classification, is computed in seconds on a Cloud-based cluster. Compare this to a traditional workflow, in which a scientist must wait hours or days to download and decompress 100 scenes from a USGS server, then run analysis locally.
One of the hallmark features of the Pangeo project is a community-developed JupyterHub instance running on Google Cloud with a preconfigured Python environment and Kubernetes cluster. This environment can be customized and launched with the click of a button using Binder, allowing anyone to run Python code interactively in a web browser. A more detailed blog post on the implementation of Pangeo’s Binder instance can be found here. And you can interactively run the full Landsat example simply by clicking the button below!
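To make the NDVI step concrete, here is a minimal sketch of the index itself, computed as (NIR - Red) / (NIR + Red). It is not the code from the Pangeo example: the function name `ndvi` and the tiny synthetic arrays are assumptions for illustration. In a real Landsat 8 workflow the red values would come from band 4 and the near-infrared values from band 5.

```python
import numpy as np


def ndvi(red, nir):
    """Return (NIR - Red) / (NIR + Red), with NaN where the denominator is zero."""
    red = np.asarray(red, dtype="float64")
    nir = np.asarray(nir, dtype="float64")
    total = nir + red
    # Start from NaN so pixels with no signal in either band stay undefined.
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


# Synthetic 2x2 reflectance values (hypothetical, for illustration only).
red = np.array([[0.10, 0.20], [0.30, 0.00]])
nir = np.array([[0.50, 0.20], [0.10, 0.00]])
result = ndvi(red, nir)
```

Values near 1 suggest dense green vegetation, values near 0 suggest bare ground, and negative values typically correspond to water, snow, or clouds.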
One very special feature of the Landsat example is that computations are done in parallel and on-the-fly, made possible by several independently developed Python packages (xarray, rasterio, dask, holoviews) coming together with magical results! For example, take a look at the following screencast, which demonstrates dynamically computing NDVI for selected dates at a resolution suitable to the current zoom level.
You can also easily extract a time series for a particular pixel or patch. Interested in a different region, a different index, or a different color scale? The example can easily be modified and rerun, and the code, graphs, and images can be saved to your local computer for future use.
Archives, formats, and analysis ready data
One appealing feature of the Landsat example is that a user needs only familiarity with Python, which has become one of the most pervasive programming languages in the scientific community. Parallel computation and memory management are taken care of by Dask behind the scenes. Rasterio and Xarray know how to pull down chunks of full resolution images, and Dask knows how to distribute computations that use those chunks. What’s more, if local memory is exceeded, data is written to and read from disk, allowing computations to be run without fear of “out of memory” errors.
However, this workflow would not be possible if the images weren’t stored in a format amenable to Cloud Native analysis. In the Earth Observation community there is a lot of excitement surrounding Cloud Optimized GeoTIFFs (COGs), which are described in detail on https://www.cogeo.org and well-advocated for in a series of blog posts by Chris Holmes (start here). In brief, COGs are GeoTIFF files that have internally organized overviews and image tiles and support HTTP range requests (enabling downloading of specific tiles rather than the full file). COGs are also nice because they work normally in GIS software such as QGIS. Strictly speaking, the Landsat 8 images on Google Cloud are not in the COG format because they do not include built-in overviews, but critically, HTTP range requests still work.
Our Landsat example is not necessarily optimized in terms of computational efficiency. One simple way to speed up the analysis would be to work with “Analysis Ready Data”: at a basic level, images with the same dimensions that are aligned to the same coordinate grid, so that chunks are uniform and can be retrieved efficiently. The USGS has created such an archive for Landsat 8, but it is not available on a public Cloud.
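The chunked, parallel pattern described above can be illustrated without Dask. The sketch below is a stand-in for the idea, not Dask's API or implementation: it tiles an in-memory array into blocks, reduces each block to a small partial result on a thread pool, and combines the partial results. The function names, the chunk size, and the synthetic image are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def iter_chunks(shape, chunk):
    """Yield (row_slice, col_slice) pairs that tile a 2-D array of the given shape."""
    rows, cols = shape
    for r in range(0, rows, chunk):
        for c in range(0, cols, chunk):
            yield slice(r, min(r + chunk, rows)), slice(c, min(c + chunk, cols))


def chunk_sum_count(image, window):
    """Reduce one chunk to (sum, count) so only small partial results are kept."""
    block = image[window]
    return float(block.sum()), block.size


def chunked_mean(image, chunk=256, workers=4):
    """Compute the mean of a 2-D array one chunk at a time, in parallel."""
    windows = list(iter_chunks(image.shape, chunk))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda w: chunk_sum_count(image, w), windows))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Synthetic "image" (hypothetical); a real workflow would read each chunk
# from remote storage instead of slicing an array already in memory.
image = np.arange(1000 * 700, dtype="float64").reshape(1000, 700)
mean_value = chunked_mean(image)
```

Because each worker only ever holds one chunk, peak memory depends on the chunk size rather than on the size of the full image. That is the property that lets this style of computation scale to archives far larger than a single machine's memory.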
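As a rough illustration of what an HTTP range request looks like, the sketch below builds (but does not send) a request for a single block of bytes from a remote file. The URL, offset, and length are hypothetical; in a tiled GeoTIFF the real values would be read from the file's tile offset and byte count tags, a step that libraries such as GDAL and Rasterio handle for you.

```python
from urllib.request import Request


def range_request(url, offset, length):
    """Build (but do not send) a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last = offset + length - 1  # HTTP byte ranges are inclusive at both ends
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Hypothetical file URL and tile location, for illustration only.
req = range_request("https://example.com/scene_B4.TIF", offset=1_048_576, length=262_144)
range_header = req.get_header("Range")
```

Passing such a request to `urllib.request.urlopen` would fetch only the requested bytes; a server that supports range requests answers with status 206 (Partial Content) rather than sending the whole file.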
For now, the reality is that most Earth Observation data is not stored on the Cloud, and of the data that is, much of it is not in a format amenable to Cloud Native workflows. There are innovative solutions using on-demand format conversion, such as the amazing GOES-16 data staging tool created by Element 84. Nevertheless, we hope that as NASA moves public archives to AWS, Cloud Native formats will be used and will lead to rapid and exciting new discoveries!
Pangeo and other geospatial processing platforms
There are many “platforms” currently under development that are designed to harness the power of commercial Cloud compute resources for scalable and fast analysis of Earth Observation data: Raster Foundry, EOBrowser, and GBDX Notebooks, to name a few. What distinguishes Pangeo from these platforms is that Pangeo is based purely on general purpose, open source, community based tools. Since these tools are designed well, they combine easily into something that is greater than the sum of the parts. It’s important to acknowledge that while the constituent tools are open and free, running analyses on the commercial Cloud is not. This is why some platforms charge hefty subscription fees to use their services. For now, Pangeo is generously supported by grants from the National Science Foundation and NASA which include credits on Google Cloud Platform.
In order to ensure flexibility and long term sustainability of Cloud Native tools, it is important to focus on Cloud-agnostic tools and recipes. Pangeo is tackling this issue by streamlining deployment with multiple Cloud providers. There are other great efforts on this front, generally spearheaded by academic groups. For example, OpenEO is providing “A Common, Open Source Interface between Earth Observation Data Infrastructures and Front-End Applications”; see here for an excellent description of why this effort is so timely. It’s also worth noting that the Earth Sciences are not the only academic discipline confronting the issue of reproducing large scale analyses on Cloud infrastructure. For example, REANA comes from particle physics analyses and “… helps researchers to structure their input data, analysis code, containerised environments and computational workflows so that the analysis can be instantiated and run on remote compute clouds.”
Large archives of public satellite data on the Cloud can be a tremendous resource for the scientific community. They can also seem at first like being gifted the proverbial white elephant: how can researchers manage and conduct reproducible research with 45 TB of Landsat data? Fortunately, tools are emerging that enable researchers to make the most of Earth Observation data.
Pangeo is an exciting resource for Earth Observation because it is a collection of free and open cutting-edge computational resources geared toward Earth Scientists. It is also a collaborative community of scientists and developers who are eager to discuss these resources and continue to advance them. We hope this article and example analysis have piqued your interest in the Pangeo project. If you’d like to get involved, please visit the Pangeo website, or make direct contributions on the project GitHub repository.
This blog post and the Landsat analysis example were a team effort with important contributions from Daniel Rothenberg, Matthew Rocklin, Ryan Abernathey, Joe Hamman, Rich Signell, and Rob Fatland. Landsat data is made publicly available by the U.S. Geological Survey.
The Pangeo project is currently funded through grants from the National Science Foundation and the National Aeronautics and Space Administration (NASA). Google provides compute credits on Google Cloud Platform.