How we judge Earth Observation foundation model quality, part 1: Intuition building

Ben Strong
Earth Genome
Oct 24, 2023

There are now several AI “foundation models” for Earth Observation (EO) data. These vast and versatile neural networks can be rapidly fine-tuned for many downstream tasks, making them an appealing tool. But choosing the right one for a given project is easier said than done. So what is the best way to decide?

Quantitative benchmarks are one excellent option. However, benchmarks that reduce model performance to a single set of metrics can miss qualitative understanding and nuance. In practice, I believe it’s often best to start by building intuition about a model through a targeted series of simple experiments.

Benchmarking and intuition building are complementary tools, like focus groups and surveys. Both have their appropriate time and place — and both deserve their own blog posts.

This is the first post in a two-part series where we’ll discuss both of these approaches. In this post, we cover the qualitative techniques of intuition building. In the next post, we’ll discuss considerations for quantitative methods of foundation model benchmarking.

How to build intuition

I’m a firm believer that you should always start work with a model by exploring a series of tightly scoped “demo” use cases. By getting your hands a little dirty, you can quickly discover more nuanced insights that a set of metrics alone would miss.

However, this intuition building shouldn’t be completely unguided. At Earth Genome we have developed a set of four guiding principles that have served us well:

1. Explore use cases that cover several kinds of signal: temporal, spatial, and spectral

It’s not possible to judge a general foundation model’s holistic performance from a single task. Think about the famous “hot dog/not a hot dog” detector from Silicon Valley: it performs perfectly on the “hot dog” task, but doesn’t exactly extend well to others! Instead, intuition building should center on performing a variety of tasks that stress-test the model across all three kinds of signal present in Earth observation data:

  • Temporal: Do things change over time? How?
  • Spatial: Are clear patterns or textures visible in the data?
  • Spectral: How does the data vary depending on band/spectral slice? (To oversimplify: what “color” is the thing we’re looking for?)

When testing a foundation model for Earth Index, our search tool for the environment, we try searching for a diverse set of targets that spans these signals, noting for each target whether the search requires the foundation model to encode spectral, temporal, or spatial information.

Consider deforestation. Forest loss has an obvious temporal signal: tree cover decreases over time. It also has a spectral signal: generally speaking, things turn from “green” to “brown” (for remote sensing pros, NDVI decreases over time). Finally, it has a spatial signal: a forest has a particular “texture” in an image, and the boundary line between forest and non-forest is an important feature to track.
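To make the spectral side of that example concrete, here’s a minimal sketch of the NDVI calculation, assuming Sentinel-2 NIR (B08) and red (B04) chips for one location at two dates, saved as NumPy arrays (the file names are hypothetical placeholders):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / np.clip(nir + red, 1e-6, None)  # avoid divide-by-zero

# Hypothetical chips: the same location observed in two different years.
nir_2020, red_2020 = np.load("chip_2020_B08.npy"), np.load("chip_2020_B04.npy")
nir_2023, red_2023 = np.load("chip_2023_B08.npy"), np.load("chip_2023_B04.npy")

# A drop in mean NDVI between observations is a simple deforestation proxy.
delta = ndvi(nir_2023, red_2023).mean() - ndvi(nir_2020, red_2020).mean()
print(f"Mean NDVI change: {delta:+.3f}")
```

A sustained NDVI drop isn’t proof of deforestation on its own (clouds and seasonality also move it), which is exactly the kind of nuance that hands-on experiments surface.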

Deforestation in 2023 in western Cambodia. Imagery courtesy Planet’s NICFI program.

2. Visualize input imagery whenever possible

Connecting model outputs with raw imagery inputs sounds basic. But I suspect many people don’t invest enough in this step, especially when coming to geospatial ML from an ML background with more abstract datasets. If a model is performing strangely, visualize a number of inputs and outputs and see if patterns are discernible. Also keep in mind:

  • Geospatial data exists in geospatial context, so plot imagery on maps whenever possible (and view across time, if relevant).
  • Satellite imagery is often multispectral, so visualize individual bands or false-color composites.
  • Always try to visualize the actual input data and sensors. It might be simpler to pull up Google satellite imagery of your ROI, but if you’re using Sentinel-2 data you might miss the effect of clouds or other sensor-specific factors.
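As a minimal sketch of the second point, assuming a four-band Sentinel-2 GeoTIFF at a hypothetical path (bands ordered blue, green, red, NIR), here is one way to compare true-color and false-color composites with rasterio and matplotlib:

```python
import numpy as np
import rasterio
import matplotlib.pyplot as plt

# Hypothetical path to a Sentinel-2 chip with bands B02, B03, B04, B08.
with rasterio.open("sentinel2_chip.tif") as src:
    blue, green, red, nir = (src.read(i).astype(np.float32) for i in (1, 2, 3, 4))

def stretch(band, lo=2, hi=98):
    """Percentile stretch to [0, 1] so composites aren't washed out."""
    p_lo, p_hi = np.percentile(band, (lo, hi))
    return np.clip((band - p_lo) / (p_hi - p_lo + 1e-6), 0, 1)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(np.dstack([stretch(b) for b in (red, green, blue)]))
axes[0].set_title("True color (RGB)")
# False color: NIR in the red channel makes healthy vegetation glow bright red.
axes[1].imshow(np.dstack([stretch(b) for b in (nir, red, green)]))
axes[1].set_title("False color (NIR-R-G)")
for ax in axes:
    ax.axis("off")
plt.show()
```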

Sentinel-2 observations of a single poultry CAFO in Alabama being built in 2020/2021.

3. Careful error analysis is always worth the effort

Spend time probing where your model performs well and where it performs poorly, and make sure to spend time with all kinds of results (true/false positives and true/false negatives). Patterns in model errors can reveal the limits of what a model can discern. For example, in an early version of Amazon Mining Watch, which tracks gold mining in the Amazon, we saw several false positives caused by logging; fine-tuning the foundation model was required to reduce this error mode.
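A minimal sketch of this kind of triage, assuming binary labels, predictions, and chip identifiers saved as hypothetical NumPy files: bucket every chip into true/false positives and negatives, then sample a few from each bucket for visual inspection.

```python
import numpy as np

# Hypothetical arrays: binary labels/predictions and per-chip identifiers.
y_true = np.load("labels.npy")        # 1 = mining, 0 = background
y_pred = np.load("predictions.npy")
chip_ids = np.load("chip_ids.npy")

buckets = {
    "true_pos":  (y_true == 1) & (y_pred == 1),
    "false_pos": (y_true == 0) & (y_pred == 1),  # e.g. logging mistaken for mining
    "false_neg": (y_true == 1) & (y_pred == 0),
    "true_neg":  (y_true == 0) & (y_pred == 0),
}

rng = np.random.default_rng(0)
for name, mask in buckets.items():
    ids = chip_ids[mask]
    if len(ids) == 0:
        print(f"{name}: 0 chips")
        continue
    sample = rng.choice(ids, size=min(10, len(ids)), replace=False)
    print(f"{name}: {len(ids)} chips; inspect e.g. {list(sample[:3])}")
```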

4. Use scatter plots to explore the relationships in data

We deploy foundation models to produce “embeddings”: vector representations of input data. Not every embedding space is semantically meaningful, but in many cases you can learn a little about how a model “thinks” by exploring the “space” of embeddings. Tools like PCA and t-SNE can help reduce the high dimensionality of embeddings to human-interpretable levels (but be careful; they can require some fine-tuning and experimentation).
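As a minimal sketch, assuming a hypothetical array of per-chip embeddings and integer land cover labels (e.g. Dynamic World classes) for coloring, here is the common PCA-then-t-SNE pattern with scikit-learn and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical inputs: one embedding vector per image chip, one label per chip.
embeddings = np.load("embeddings.npy")    # shape (n_chips, embed_dim)
labels = np.load("landcover_labels.npy")  # shape (n_chips,), integer classes

# Running PCA first is a common trick: it denoises the embeddings and
# speeds up t-SNE considerably on high-dimensional inputs.
coarse = PCA(n_components=50).fit_transform(embeddings)
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(coarse)

plt.figure(figsize=(8, 8))
scatter = plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=2, cmap="tab10", alpha=0.5)
plt.legend(*scatter.legend_elements(), title="Land cover", loc="best")
plt.title("t-SNE of foundation model embeddings")
plt.show()
```

If the resulting clusters line up with land cover classes, that’s a quick signal the embedding space is semantically meaningful; if they don’t, vary the perplexity and PCA dimensionality before drawing conclusions.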

One great tool that we’ve recently loved playing around with is Nomic AI’s deepscatter library. It makes visualizing and interacting with millions or billions of points snappy and fun. As an example, check out the animation below, where we cluster the state of Alabama based on a t-SNE reduction of foundation model embeddings. When we color points by Dynamic World land cover labels, it’s clear that the embedding space cleanly separates locations by content. Zooming in and visualizing the raw imagery used to generate individual embeddings shows that semantically meaningful sub-clusters emerge: things like mining, airports, and poultry CAFOs all cluster together. (If you want to deploy a deepscatter visualization through GitHub Pages, check out our deepscatter-template repo.)

Clustering embeddings can help visualize relationships in data

Building a deeper understanding

In the rapidly evolving world of Earth Observation foundation models, navigating the sea of options can be overwhelming. Qualitative intuition building is a crucial first step: it gives us a holistic feel for a model’s strengths and weaknesses, rather than depending solely on quantitative metrics. By combining intuitive exploration with rigorous benchmarking, we can ensure we’re choosing the best models for our specific needs.

If you’ve had similar experiences, differing viewpoints, or innovative methods you’d like to share, I invite you to leave a comment below.

Stay tuned for our next post where we’ll delve into the quantitative benchmarks that complement this intuition-building process. We’ll unpack the metrics and methods that help us judge these powerful models with precision and clarity.

Special shout out to former Earth Genome team member Caleb Kruse, who led much of our embedding exploration effort and built many of the visualizations in this article.
