Ensuring Dataset Replication Integrity in Earth Engine: A Collaborative Approach

Google Earth and Earth Engine
Dec 13, 2023


By Simon Ilyushchenko, Google Earth Engine Data Tech Lead

December is the month when the AGU (American Geophysical Union) conference is traditionally held. Over 100 AGU 2023 session abstracts mention the use of Google Earth Engine, and the Earth Engine team has a large presence at AGU. In this post we highlight one of our contributions.

Earth Engine users often ask how accurately datasets are replicated in the Earth Engine Data Catalog. In collaboration with Brianna Pagan and Mahabal Hegde from NASA GES DISC, we examined this question and presented our findings in an AGU poster.

AGU poster on replica repository case studies (click here to see PDF version)

While the Earth Engine team performs initial data validation during dataset ingestion, our investigation showed the value of additional checks. The recently released xarray interface to Earth Engine proved to be a real time-saver for this work.

Our investigation yielded two main results:

1. Pixel Value Consistency:

When we compared pixel values between the source data and the Earth Engine replicas, they were consistently identical. This is not surprising, as Earth Engine’s ingestion process is designed to replicate data losslessly.

2. Challenges in Comparison:

Although the pixel values matched, getting to the point where they could be compared was difficult. Both NASA and Earth Engine have extensive cataloging and data access infrastructure, yet the sheer number of datasets and the variety of access methods occasionally confuse catalog users. For example, many datasets are released in multiple versions, at multiple processing levels (early, late, or final), or at multiple cadences (half-hourly, daily, monthly, and so on). Choosing the right pair of datasets to compare requires a detailed understanding of the dataset structure.
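To make the two results above concrete, here is a minimal sketch of the comparison workflow in plain Python: first pin down exactly one dataset variant by version, processing level, and cadence, then check that source and replica grids match pixel for pixel. The `Variant` class, the helper functions, the dataset identifiers, and the grid values are all hypothetical and only illustrate the idea. This is not the code used for the poster, where the data was read through the xarray interface.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One published flavor of a dataset (all values below are hypothetical)."""
    dataset_id: str
    version: str
    level: str    # e.g. "early", "late", or "final"
    cadence: str  # e.g. "half-hourly", "daily", "monthly"


def find_variant(catalog, version, level, cadence):
    """Return the single variant matching all three attributes, or raise."""
    matches = [v for v in catalog
               if (v.version, v.level, v.cadence) == (version, level, cadence)]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


def same_pixel(a, b):
    """Exact equality, treating NaN (a common no-data marker) as equal to NaN."""
    return a == b or (math.isnan(a) and math.isnan(b))


def count_mismatches(source, replica):
    """Count differing pixels between two equally shaped 2-D grids."""
    if len(source) != len(replica):
        raise ValueError("grids have different numbers of rows")
    mismatches = 0
    for row_a, row_b in zip(source, replica):
        if len(row_a) != len(row_b):
            raise ValueError("grids have different row lengths")
        mismatches += sum(not same_pixel(a, b) for a, b in zip(row_a, row_b))
    return mismatches


# Hypothetical catalog entries: same product, different versions and cadences.
provider_catalog = [
    Variant("provider/precip_v06_early", "06", "early", "half-hourly"),
    Variant("provider/precip_v06_final", "06", "final", "half-hourly"),
    Variant("provider/precip_v07_final", "07", "final", "half-hourly"),
    Variant("provider/precip_v07_final_monthly", "07", "final", "monthly"),
]
chosen = find_variant(provider_catalog, version="07", level="final",
                      cadence="half-hourly")

# Tiny made-up grids standing in for the same tile read from each platform.
nan = float("nan")
source_grid = [[0.0, 1.5, nan], [2.25, 0.0, 3.0]]
replica_grid = [[0.0, 1.5, nan], [2.25, 0.0, 3.0]]
mismatch_count = count_mismatches(source_grid, replica_grid)
```

The comparison is deliberately exact rather than tolerance-based, since a lossless replica should reproduce every value bit for bit. Requiring exactly one catalog match makes an ambiguous selection fail loudly instead of silently comparing the wrong variant.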

Addressing Challenges:

To address the complexity of reconciling large catalogs, particularly with the exponential growth in remote sensing data, we propose work in several directions:

1. Enhanced Catalog Metadata:

We recommend providing more standardized catalog metadata, for example information about related dataset versions. This would help users make informed comparisons and selections.

2. Owner-Managed Catalogs:

Recently introduced publisher- and community-managed catalogs can ease this challenge. With these catalogs, data providers have detailed control over the choice and presentation of datasets.

3. Cross-Platform Collaboration:

Reconciling large catalogs requires industry-wide cooperation. Data providers and cloud platform redistributors should collaborate to establish standardized practices that make it easier for users to find and compare datasets. A good starting point would be more uniform cataloging requirements across data providers, so that cloud replicas can correctly identify the most appropriate datasets and versions to mirror.
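As an illustration of the first recommendation above, the sketch below shows how metadata about related versions could let a user, or a mirroring tool, move from any dataset record to its newest version. The record layout and dataset identifiers are hypothetical. The `successor-version` link relation name is borrowed from the STAC version extension, and whether a given provider catalog exposes such links is an assumption.

```python
# Hypothetical catalog records keyed by dataset ID; each may link to a successor.
records = {
    "example/dataset_v1": {
        "version": "1",
        "links": [{"rel": "successor-version", "href": "example/dataset_v2"}],
    },
    "example/dataset_v2": {
        "version": "2",
        "links": [{"rel": "successor-version", "href": "example/dataset_v3"}],
    },
    "example/dataset_v3": {"version": "3", "links": []},
}


def latest_version(catalog, dataset_id):
    """Follow successor-version links until a record has no successor."""
    seen = set()
    current = dataset_id
    while True:
        if current in seen:
            raise ValueError(f"version links form a cycle at {current}")
        seen.add(current)
        successors = [link["href"]
                      for link in catalog[current].get("links", [])
                      if link["rel"] == "successor-version"]
        if not successors:
            return current
        current = successors[0]


newest = latest_version(records, "example/dataset_v1")
```

With links like these in place, checking whether a replica mirrors the most recent version becomes a lookup rather than a manual search through documentation.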

While Earth Engine users can currently report dataset issues, the exponentially growing volume of remote sensing data necessitates a broader, collaborative effort.
