RasterFrames: A Data Scientist’s Point of View

Jason Brown
Oct 23, 2019

At Astraea, we believe that RasterFrames will revolutionize the practice of analyzing Earth-observation (EO) data by enabling data scientists to efficiently work with tremendous amounts of this data. RasterFrames, our open-source toolkit, allows us to process EO data in a DataFrame for the first time. Backed by Apache Spark, RasterFrames brings a very mature parallel compute capability to this inherently big data. The rest of this post explains some basics about EO data and how, until now, it has been tied up in specialized tools. We will show how RasterFrames aligns with and enhances exciting trends to enable data scientists to easily find valuable data, focus on higher-level abstractions, and ignore many of the idiosyncrasies of how EO data are published by satellite operators.

Before we dive into that, it’s important to point out that RasterFrames is not a narrow tool. It extends general-purpose Spark DataFrames, Pandas DataFrames, and GeoPandas GeoDataFrames to work with EO data, without interfering with these libraries’ excellent general purpose functionality for all kinds of data. This means you can integrate EO data into your existing analysis to search for meaningful signals and lift in models.

We believe that data scientists are curious and excited about Earth-observation data.


Big Geospatial Raster Data

And this data is staggeringly big.

In the US, NASA is poised to produce 24PB of EO data in 2019. There are also public satellite operators in Europe, China, and Brazil; and worldwide private organizations operating hundreds of small but mighty satellites. Local governments and private operators routinely capture extremely high resolution images from aircraft and drones. Add all these sources together, and the growing archive of geospatial raster data is well into exabyte territory.

A Legacy of Big Data

Once data was downloaded, analysis was typically undertaken with specialized programs and libraries. These tools defined the work that could be done with them. They afforded workflows tightly focused on opening, viewing, editing, and writing a small set of files. Tools oriented to visual tasks will always have an important place but are usually poor choices for machine learning and other data science workflows.

EO as Found Data

For data scientists to consider EO as found data, we need to eliminate the friction of finding, accessing, and processing a few files at a time.

There are some tools that support very large spatial coverages and long time spans, but they require deep expertise in EO concepts, knowledge of GIS software, and dedicated data engineers to use successfully. They also require a priori data selection and design decisions over the selected dataset. This falls short of allowing us to use EO as found data. Let’s take a look at some important trends and technologies RasterFrames is building on and how the library enables data scientists to do just that.

To the Cloud

And parallel computation enables speed!


A second thread is the emerging SpatioTemporal Asset Catalog (STAC) standard and related catalogs. Much of the imagery in STAC catalogs is hosted in publicly accessible cloud stores. At Astraea, we host a public STAC catalog that is growing from 6 million data products and 3 PB of imagery. The COG format and STAC catalogs are becoming the standard even for established scientific organizations: the US government plans to publish Landsat 8 and 9 using these technologies.

In RasterFrames, a catalog is a DataFrame of image metadata and the URLs used to access the imagery. Catalogs are the primary entry point for reading raster data. To enable efficient data exploration, raster reads are lazy and take advantage of the COG format so that the data can be processed in parallel. The data in a catalog can be heterogeneous, meaning the images can come from different instruments and publishers.

catalog = catalog[
    (catalog.eo_gsd < 50) &
    (st_intersects(catalog.geometry, mallorca_geom)) &
    (catalog.datetime > '2019-07-01') &
    (catalog.datetime <= '2019-07-31')]
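
To make the logic of this filter concrete without a Spark session, here is a minimal plain-Python sketch of the same three predicates (resolution, spatial overlap, date range) applied to a toy catalog. It is not RasterFrames code: the rows, the example.com URLs, the `area_of_interest` box standing in for `mallorca_geom`, and the helper names `boxes_intersect` and `select` are all invented for illustration, and the bounding-box test is a simplification of a true geometry intersection.

```python
from datetime import date

# Toy catalog: each row is metadata plus a URL. All values are made up.
catalog = [
    {"id": "scene-a", "eo_gsd": 30.0, "datetime": date(2019, 7, 4),
     "bbox": (2.0, 39.0, 3.6, 40.1), "red": "https://example.com/a_red.tif"},
    {"id": "scene-b", "eo_gsd": 250.0, "datetime": date(2019, 7, 10),
     "bbox": (2.0, 39.0, 3.6, 40.1), "red": "https://example.com/b_red.tif"},
    {"id": "scene-c", "eo_gsd": 10.0, "datetime": date(2019, 8, 2),
     "bbox": (2.5, 39.2, 3.0, 39.9), "red": "https://example.com/c_red.tif"},
    {"id": "scene-d", "eo_gsd": 10.0, "datetime": date(2019, 7, 20),
     "bbox": (10.0, 45.0, 11.0, 46.0), "red": "https://example.com/d_red.tif"},
]

# Hypothetical area of interest as (min_lon, min_lat, max_lon, max_lat).
area_of_interest = (2.3, 39.2, 3.5, 40.0)


def boxes_intersect(a, b):
    """True when two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select(rows, aoi, max_gsd, after, until):
    """Keep rows finer than max_gsd that overlap aoi, with after < datetime <= until."""
    return [
        r for r in rows
        if r["eo_gsd"] < max_gsd
        and boxes_intersect(r["bbox"], aoi)
        and after < r["datetime"] <= until
    ]


selected = select(catalog, area_of_interest, 50, date(2019, 7, 1), date(2019, 7, 31))
print([r["id"] for r in selected])
```

Because only metadata is touched, a selection like this stays cheap no matter how large the referenced imagery is; no pixels are read at this stage.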

Starting with a catalog, we can cheaply reason over coarse data selection, then move into exploration, summarizing, and building models within the same workflow. Building on the extensive and excellent work of GeoTrellis, RasterFrames provides dozens of efficient functions to operate on raster data. These functions provide the flexibility to describe the analysis over heterogeneous imagery data.

All of this work can be done with Spark or Pandas DataFrames, bringing custom EO functionality to those familiar and general-purpose tools, and providing seamless integration with popular machine learning libraries.

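from pyrasterframes import TileExploder
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Read the four bands referenced by the catalog into tile columns
df = spark.read.raster(catalog, ['red', 'green', 'blue', 'nir'])
# RasterFrames specific transformation for SparkML with imagery
exploder = TileExploder()
# Pack features into dense vector for SparkML; KMeans reads the 'features' column by default
assembler = VectorAssembler().setInputCols(['red', 'green', 'blue', 'nir']).setOutputCol('features')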
# K-means clustering from SparkML
kmeans = KMeans().setK(5)
# Spark ML pipeline and model fitting
pipeline = Pipeline().setStages([exploder, assembler, kmeans])
model = pipeline.fit(df)
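
For intuition about the exploder stage: a clustering algorithm expects one row per observation, whereas each raster row holds a whole tile per band. The plain-Python sketch below illustrates the idea of turning tiles into one row per pixel. It is a conceptual stand-in, not the TileExploder implementation; the function name `explode_tiles`, the 2x2 tile size, and the pixel values are invented for illustration.

```python
def explode_tiles(row, bands):
    """Yield one dict per pixel, pairing the values found at the same
    (row_index, column_index) position in every band's tile."""
    first = row[bands[0]]
    n_rows, n_cols = len(first), len(first[0])
    # All bands must share one tile shape for pixels to line up.
    for band in bands:
        tile = row[band]
        if len(tile) != n_rows or any(len(line) != n_cols for line in tile):
            raise ValueError(f"tile for band {band!r} has a different shape")
    for i in range(n_rows):
        for j in range(n_cols):
            pixel = {"row_index": i, "column_index": j}
            for band in bands:
                pixel[band] = row[band][i][j]
            yield pixel


# One raster row holding a tiny 2x2 tile per band (made-up values).
raster_row = {
    "red": [[10, 20], [30, 40]],
    "nir": [[50, 60], [70, 80]],
}

pixels = list(explode_tiles(raster_row, ["red", "nir"]))
# Per-pixel feature lists, analogous to what an assembler stage packs into a vector.
features = [[p["red"], p["nir"]] for p in pixels]
print(len(pixels), features[0], features[-1])
```

Each resulting feature list plays the role of one observation for the clustering step, which is why the pipeline places the exploder before the assembler and the k-means stage.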

What’s next

In future posts in this series, we will take you on a tour of RasterFrames features and provide use cases to inspire what you can do with it. In the meantime, check out the RasterFrames documentation to learn more. As the RasterFrames community progresses towards the 1.0.0 release, we hope you will try it out, and we invite you to connect with us.

Written by Jason Brown
Senior Data Scientist at Astraea

I am grateful to my colleagues at Astraea who reviewed drafts of this post and provided useful feedback and encouragement.

