RasterFrames: A Data Scientist’s Point of View

Published in Astraea.Earth, Oct 23, 2019

At Astraea, we believe that RasterFrames will revolutionize the practice of analyzing Earth-observation (EO) data by enabling data scientists to efficiently work with tremendous amounts of this data. RasterFrames, our open-source toolkit, allows us to process EO data in a DataFrame for the first time. Backed by Apache Spark, RasterFrames brings a very mature parallel compute capability to this inherently big data. The rest of this post explains some basics about EO data and how, until now, it has been tied up in specialized tools. We will show how RasterFrames aligns with and enhances exciting trends to enable data scientists to easily find valuable data, focus on higher-level abstractions, and ignore many of the idiosyncrasies of how EO data are published by satellite operators.

Before we dive into that, it’s important to point out that RasterFrames is not a narrow tool. It extends general-purpose Spark DataFrames, Pandas DataFrames, and GeoPandas GeoDataFrames to work with EO data, without interfering with these libraries’ excellent general-purpose functionality for all kinds of data. This means you can integrate EO data into your existing analysis to search for meaningful signals and lift in models.

We believe that data scientists are curious and excited about Earth-observation data.

As you read this, there are satellite missions imaging all the Earth’s landmass and vast swathes of the seas. Some have years or decades of historical data and ongoing data collection. EO provides undeniable measurement regardless of borders, enabling practitioners to make global impact.

https://rasterframes.io/getting-started.html#pip-install-pyrasterframes

Big Geospatial Raster Data

In the first post of this series, we described RasterFrames as a tool for big geospatial raster data. Let’s unpack that bit of jargon. Geospatial rasters are typically overhead images or measurements of the Earth’s surface: from cameras or telescopes aboard satellites, aircraft, or drones. Raster means that measurements are organized as an image, where the pixel values can represent just about any measurable phenomenon: such as temperature, elevation, plant health, etc. Geospatial means that the image is tied to a location. We sometimes also see the term spatiotemporal to emphasize that the data has both place and time of its capture.
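To make "geospatial" concrete, here is a minimal sketch in plain Python of how a raster ties pixels to locations. It assumes a north-up image described by an upper-left origin and a square pixel size, which is a common but not universal convention. The class name `GeoRaster` and the sample values are hypothetical and are not part of RasterFrames.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeoRaster:
    """A grid of measurements plus the information tying it to a place."""
    values: List[List[float]]   # rows of pixel values, e.g. temperature
    origin_x: float             # x coordinate of the upper-left corner
    origin_y: float             # y coordinate of the upper-left corner
    pixel_size: float           # ground size of one pixel, in map units

    def pixel_center(self, row: int, col: int) -> Tuple[float, float]:
        """Return the map coordinates of the center of pixel (row, col)."""
        x = self.origin_x + (col + 0.5) * self.pixel_size
        y = self.origin_y - (row + 0.5) * self.pixel_size  # rows run southward
        return x, y


# Hypothetical 2 x 3 raster with 10-unit pixels.
raster = GeoRaster(
    values=[[21.5, 22.0, 22.4],
            [20.9, 21.3, 21.8]],
    origin_x=500000.0,
    origin_y=4400000.0,
    pixel_size=10.0,
)
center = raster.pixel_center(row=1, col=2)
```

The pixel values alone are just an image; it is the origin and pixel size (plus a coordinate reference system, omitted here) that let each measurement be placed on the Earth.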

And this data is staggeringly big.

The State of EO

The growing explosion of Earth-observation data offers unprecedented opportunities for firms and NGOs to better execute…

medium.com

In the US, NASA is poised to produce 24PB of EO data in 2019. There are also public satellite operators in Europe, China, and Brazil; and worldwide private organizations operating hundreds of small but mighty satellites. Local governments and private operators routinely capture extremely high resolution images from aircraft and drones. Add all these sources together, and the growing archive of geospatial raster data is well into exabyte territory.

A Legacy of Big Data

As stated in the first post, EO data was Big Data before the term even existed. EO data providers published the data for scientists and had to balance factors like spatial coverage and high data compression. For those users, searching and downloading files was a key part of the legacy workflow. Users had to choose carefully and download as few of these 50 to 100 megapixel files as feasible. That was an acceptable condition when studying a well-defined phenomenon in a specific region and time.

Once data was downloaded, analysis was typically undertaken with specialized programs and libraries. These tools defined the work that could be done with them. They afforded workflows tightly focused on opening, viewing, editing, and writing a small set of files. Tools oriented to visual tasks will always have an important place but are usually poor choices for machine learning and other data science workflows.

EO as Found Data

One way to distinguish data science from pure science is how data is collected. In pure scientific practice, data collection is designed to test a hypothesis. However, in data science many hypotheses are brought to found data, meaning available data that may have been collected for another purpose.

For data scientists to consider EO as found data, we need to eliminate the friction of finding, accessing, and processing a few files at a time.

Because we don’t know which data is valuable up front, we want to consider as much found data as possible.

There are some tools that support very large spatial extents and long time spans. They require deep expertise in EO concepts, knowledge of GIS software, and dedicated data engineers to use them successfully. They also require a priori data selection and design decisions over the selected dataset. This falls short of allowing us to use EO as found data. Let’s take a look at some important trends and technologies RasterFrames is building on and how the library enables data scientists to do just that.

To the Cloud

One shift in the use of EO data is the emerging trend to publish it in Cloud Optimized GeoTIFF (COG) format. Two advantages of COGs are that readers can efficiently fetch the file metadata and small portions of the image. This enables parallel in-memory computation on these files, without having to copy data around.

And parallel computation enables speed!
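A rough way to see why partial reads matter: a COG stores its pixels in internal tiles, so a reader only needs the tiles that overlap the window of interest. The sketch below is plain Python arithmetic, not a COG reader. The function name `tiles_for_window`, the 512-pixel tile size, and the image dimensions are assumptions chosen for illustration; real files declare their own block size.

```python
from typing import List, Tuple


def tiles_for_window(col_off: int, row_off: int, width: int, height: int,
                     tile_size: int = 512) -> List[Tuple[int, int]]:
    """List the (tile_row, tile_col) blocks overlapping a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# Hypothetical 10,240 x 10,240 pixel image stored as 512 x 512 tiles.
image_size = 10240
tile_size = 512
total_tiles = (image_size // tile_size) ** 2

# A 600 x 600 pixel window starting at column 1000, row 2000.
needed = tiles_for_window(col_off=1000, row_off=2000, width=600, height=600)
fraction_read = len(needed) / total_tiles
```

Under these assumptions the window touches 9 of the 400 tiles, so only a small fraction of the file has to be fetched. Because each window is independent, many workers can read different windows of the same file at once.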

https://www.eclipse.org/community/eclipse_newsletter/2018/december/rasterframes.php

A second thread is the emerging SpatioTemporal Asset Catalog (STAC) standard and related catalogs. Much of the imagery in STAC catalogs is hosted in publicly accessible cloud stores. At Astraea, we host a public STAC catalog that is growing from 6 million data products and 3 PB of imagery. COG format and STAC catalogs are becoming the standard even for established scientific organizations: the US government plans to publish Landsat 8 and 9 using these technologies.

Landsat Collection 2

A primary characteristic of Collection 2 data is the implementation of the Sentinel 2 Global Reference Image (GRI) into…

www.usgs.gov

In RasterFrames, a catalog is a DataFrame of image metadata and URLs to access the imagery. Catalogs are the primary entry point for reading raster data. To enable efficient data exploration, raster reads are lazy and take advantage of COGs so that the data can be processed in parallel. The data in the catalog can be heterogeneous, meaning it can come from different instruments and publishers.

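from pyrasterframes.rasterfunctions import st_intersects

# Keep scenes with a ground sample distance under 50 that intersect
# mallorca_geom (a geometry defined elsewhere) within the July 2019 date range.
catalog = catalog[
    (catalog.eo_gsd < 50) &
    (st_intersects(catalog.geometry, mallorca_geom)) &
    (catalog.datetime > '2019-07-01') &
    (catalog.datetime <= '2019-07-31')]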

Starting with a catalog, we can cheaply reason over coarse data selection, then move into exploration, summarizing, and building models within the same workflow. Building on the extensive and excellent work of GeoTrellis, RasterFrames provides dozens of efficient functions to operate on raster data. These functions provide the flexibility to describe the analysis over heterogeneous imagery data.

All of this work can be done with Spark or Pandas DataFrames, bringing custom EO functionality to those familiar and general-purpose tools, and providing seamless integration with popular machine learning libraries.

The example below uses SparkML to create an unsupervised image segmentation model.

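from pyrasterframes import TileExploder
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Read the four bands listed in the catalog as tile columns
df = spark.read.raster(catalog, ['red', 'green', 'blue', 'nir'])

# RasterFrames specific transformation for SparkML with imagery
exploder = TileExploder()

# Pack features into dense vector for SparkML
# (KMeans reads its input from the 'features' column by default)
assembler = VectorAssembler() \
    .setInputCols(['red', 'green', 'blue', 'nir']) \
    .setOutputCol('features')

# K-means clustering from SparkML
kmeans = KMeans().setK(5)

# Spark ML pipeline and model fitting
pipeline = Pipeline().setStages([exploder, assembler, kmeans])
model = pipeline.fit(df)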
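The `TileExploder` step is what turns imagery into the tabular shape SparkML expects. As a rough plain-Python illustration of the idea (not the RasterFrames implementation), the sketch below flattens small band tiles into one record per pixel, with one value per band. The function name `explode_tiles`, the record layout, and the tiny 2 x 2 tiles are hypothetical.

```python
from typing import Dict, List

Tile = List[List[float]]


def explode_tiles(bands: Dict[str, Tile]) -> List[Dict[str, float]]:
    """Turn same-shaped band tiles into one record per pixel."""
    names = list(bands)
    first = bands[names[0]]
    n_rows, n_cols = len(first), len(first[0])
    # All bands must cover the same pixel grid.
    for name in names:
        tile = bands[name]
        if len(tile) != n_rows or any(len(r) != n_cols for r in tile):
            raise ValueError(f"band {name!r} has a different shape")
    records = []
    for row in range(n_rows):
        for col in range(n_cols):
            record = {"row_index": row, "column_index": col}
            for name in names:
                record[name] = bands[name][row][col]
            records.append(record)
    return records


# Two hypothetical 2 x 2 band tiles covering the same area.
tiles = {
    "red": [[0.1, 0.2], [0.3, 0.4]],
    "nir": [[0.5, 0.6], [0.7, 0.8]],
}
pixels = explode_tiles(tiles)

# One feature vector per pixel, analogous to what the assembler packs.
features = [[p["red"], p["nir"]] for p in pixels]
```

Once every pixel is a row with one column per band, a clustering algorithm such as k-means can treat pixels as ordinary observations and assign each one to a segment.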

What’s next

RasterFrames' latest releases in the 0.8 series provide important capability that aligns with and leverages trends in how EO data is being published and made searchable today. The combination of querying vast catalogs and on-the-fly data reading through a friendly DataFrame API unlocks EO in new ways and to a much broader set of users than ever before.

In future posts in this series, we will take you on a tour of RasterFrames features and provide use cases to inspire what you can do with it. In the meantime, check out the RasterFrames documentation to learn more. As the RasterFrames community progresses towards the 1.0.0 release, we hope you will try it out, and we invite you to connect with us.

Written by Jason Brown
Senior Data Scientist at Astraea

I am grateful to my colleagues at Astraea who reviewed drafts of this post and provided useful feedback and encouragement.