How AI can decode the Earth’s “Genome”

Ben Strong
Published in Earth Genome
9 min read · Sep 25, 2023

Imagine if we had access to a true “genome” for the Earth, an actionable blueprint that distilled the tsunami of data we collect about the planet. It would usher in a new era in diagnosis and treatment of our global environmental challenges. This vision is now obtainable with AI.

One thing that is truly impossible for me to wrap my head around — despite being a practicing Earth scientist who uses remote sensing data every day — is just how quickly the volume of Earth Observation (EO) data is increasing. In addition to the unprecedented volume of in situ sensors and observations, it seems like almost every day a new EO satellite is being announced or launched.

And in truth, that feeling isn’t far off; according to the Union of Concerned Scientists, 168 EO satellites were added to the skies in 2022 alone — an average of about one satellite every other day.

The upshot: we’re rapidly transitioning from an EO “data trickle” to a “data tsunami.” But despite all this data, we’ve made inadequate progress towards many of our most important environmental goals.

I think you could label our current predicament as a full-blown paradox: why is it that, despite having access to petabytes of data, the right data is not ending up in the right hands in order to drive better decisions for the environment?

A framework for understanding the Earth Observation data paradox

Let me propose one framework for understanding the “Earth Observation data paradox.” I’d like you to reach all the way back to high school biology and recall the difference between “phenotype” — a set of observable characteristics — and “genotype” — the set of distilled instructions, encoded in DNA, that serve as the blueprint for life.

With today’s volume of Earth Observation data, we have an unprecedented understanding of Earth’s phenotype. But while understanding of phenotype is certainly a useful thing, in many cases it might not be sufficient.

Just look at the revolution in personalized medicine over the past two decades. By understanding the genetics of a cancer like Chronic Myeloid Leukemia (CML), we’ve made astounding progress: the same diagnosis that would have been quickly terminal 20 years ago is now seen as extremely treatable, with almost no shortening of life expectancy.

Through the power of genetics, we’ve made rapid progress in treating diseases like Chronic Myeloid Leukemia in the years since the Human Genome Project took place (1990–2003). (Source: Bower et al., 2016. Inflection point annotation is mine. To answer your other question: yes, I did include this chart partly because I find delight in inserting a plot from a medical journal in a post about environmental data science.)

The plot above shows the clear “inflection point” in our ability to treat CML coincident with our ability to sequence its genetics.

Imagine if we had access to a true “genome” of the Earth — the essential blueprint for how the Earth appears and functions. This breakthrough would mark an “inflection point” in our capacity to understand and address global environmental issues: Want to know the carbon biomass of a given region of interest? There’s a gene that encodes for that. How about finding all cattle ranches on the planet? We can look across the globe for where the “cattle ranch” gene is expressed. What about change over time — things like deforestation? You guessed it — just look for genes “mutating.”

Okay, well, what does this “genome” look like, and how could we build solutions on top of it? This is the question that we (at the “Earth Genome”) have been pondering for quite some time, and I think we’ve found an initial answer.

How to sequence a planet’s genome

In order to sequence a genome, you first need to invent DNA sequencing technology. My belief is that we’re starting to see a rudimentary form of this technology emerge: large AI “foundation models” for Earth observation data. Recently, there have been many exciting announcements about teams pursuing foundation models for Earth Observation data. We see this as the starting gun in a race to develop “planetary DNA” sequencing technology.

For those unfamiliar with the term “foundation model,” the idea is actually pretty simple. Instead of training custom, one-off machine learning models for every possible task (say, determining carbon stock, locating illicit gold mines, and mapping burn scars), we can train one very large, self-supervised neural network to learn the underlying, essential patterns inherent in EO data. This model can then be rapidly fine-tuned to complete arbitrary downstream tasks. It’s an acknowledgement that many of the underlying technical requirements of Earth observation workflows are fundamentally the same: a model needs to discern objects, learn how multispectral data is useful, understand geospatial correlations, etc. A foundation model takes care of all this ahead of time, and can output a compressed “embedding” that represents the distilled information contained in the original data. In brief, an EO foundation model produces the essential, encoded representation of an EO data cube, kind of like “planetary DNA”!

We’ve started experimenting with this concept and have been very excited by our initial results. Check out the animation below, which shows results for sequencing the “DNA” of Alabama. If we rearrange Alabama to plot similar points close to each other, we see clear clusters emerge. (The bright colors that appear in the scatter plot represent Dynamic World landcover classifications.)

If we zoom in, we see a high degree of semantically meaningful and nuanced clustering: things like mines, airports, and poultry operations share similar “DNA”. This is where you can start seeing the power of planetary DNA: instead of building custom ML models for these datasets, the task can be as simple as drawing a circle around the appropriate cluster.
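To make "drawing a circle around the appropriate cluster" concrete, here is a minimal sketch. It assumes each tile's embedding has already been projected down to 2-D scatter-plot coordinates; the tile ids, coordinates, and the `select_in_circle` helper are all hypothetical and not part of any Earth Genome tool.

```python
import math

# Hypothetical 2-D coordinates for a handful of tiles, as might come out of
# projecting high-dimensional embeddings onto a scatter plot.
# All ids and values are made up for illustration.
projected = {
    "tile_a": (0.10, 0.20),
    "tile_b": (0.15, 0.25),
    "tile_c": (0.12, 0.18),
    "tile_d": (0.90, 0.85),
    "tile_e": (0.88, 0.80),
}

def select_in_circle(points, center, radius):
    """Return the ids of all points within `radius` of `center`, sorted."""
    return sorted(
        tile_id
        for tile_id, xy in points.items()
        if math.dist(xy, center) <= radius
    )

# "Draw a circle" around the lower-left cluster.
cluster = select_in_circle(projected, center=(0.12, 0.21), radius=0.1)
print(cluster)
```

Every tile that falls inside the circle becomes a candidate member of the dataset, with no task-specific model trained along the way.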

As you can see, there is tremendous potential that can be derived from using the EO foundation models of today. But here’s what’s wild: the foundation models of today will pale in comparison to what we can achieve with the foundation models of tomorrow. There are a whole host of issues that need to be addressed to fully realize the potential of EO foundation models, including multi-sensor/multimodal support, incorporation of in situ data and other reference data (e.g. soil maps), and true geospatial and temporal contextual awareness. Ultimately, we envision a future where any data with a latitude, longitude, and timestamp associated with it could be ingested and used by a “complete” foundation model for the Earth.

What can you do with “planetary DNA” anyway?

Of course, DNA on its own is not so useful. A raw sequence of A, T, G, and C bases means little by itself. That’s where a host of other technologies, layered on top of DNA, need to come into play, translating the encoding to something meaningful.

As an initial proof of concept of what such a technology could look like, we’ve started building a tool we call Earth Index. Earth Index makes searching the planet as easy as clicking on a map. Here we show how a user can find locations of illicit gold mining in the Amazon, supporting our work as part of Amazon Mining Watch.

Earth Index puts the power of dataset creation directly in the hands of groups who know what to look for — and how to use that information. In the Amazon, our data has led to numerous investigations and was even cited in criminal proceedings against a law-breaking mining company.

Earth Index demonstrates that we are rapidly approaching a world of zero marginal cost for “watch applications” that monitor the state of the planet. We’ve already seen this in our own applications, and we can’t wait to share this technology outside of Earth Genome.
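One plausible way to picture "search by clicking on a map" is nearest-neighbor lookup over embeddings. The sketch below is an assumption about the general technique, not a description of how Earth Index is actually implemented; the tile names, the three-number embeddings, and the `search` function are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(index, query_id, top_k=2):
    """Return the ids of the `top_k` tiles most similar to the query tile."""
    query = index[query_id]
    scored = [
        (cosine_similarity(query, emb), tile_id)
        for tile_id, emb in index.items()
        if tile_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [tile_id for _, tile_id in scored[:top_k]]

# Hypothetical index: tile id -> toy embedding (real embeddings would be
# much longer vectors produced by a foundation model).
index = {
    "known_mine": [0.9, 0.1, 0.0],
    "candidate_1": [0.8, 0.2, 0.1],
    "candidate_2": [0.85, 0.15, 0.0],
    "forest": [0.0, 0.1, 0.9],
    "river": [0.1, 0.9, 0.2],
}

# "Click" on a known mine and ask for the most similar tiles.
results = search(index, "known_mine", top_k=2)
print(results)
```

A brute-force scan like this is fine for a toy index; a planet-scale index would presumably need an approximate nearest-neighbor structure instead.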

Earth Index is about finding things, but “planetary DNA” can be used for so much more than that. The analogy actually extends pretty nicely to a number of routine Earth Observation tasks:

  1. Relative Finding: DNA can find your relatives; “planetary DNA” can find sets of similar things across the Earth (this is the Earth Index use case).
  2. Mutations: Mutations in DNA can lead to evolution or things like cancer; we can track the change in planetary DNA over time to quickly diagnose where change is happening on the planet.
  3. Predicting “physical traits”: DNA encodes things like height; planetary DNA can be used as a basis for models to predict traits like population density or carbon biomass.
  4. Predispositions: DNA can tell you that you have an elevated chance of developing a disease in the future; “planetary DNA” can be used to estimate the future chance of land conversion or other important trends.
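Two of the items above, "mutations" and "physical traits," can be sketched with a few lines of code. This is a toy illustration under stated assumptions: embeddings are short made-up vectors, change is scored as cosine distance between two dates, and a trait is estimated by averaging the labels of the nearest labeled embeddings. The function names, threshold, and values are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity; near 0 means 'same direction'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def changed_tiles(before, after, threshold):
    """'Mutations': ids whose embedding moved more than `threshold`."""
    return sorted(
        tile_id
        for tile_id in before
        if tile_id in after
        and cosine_distance(before[tile_id], after[tile_id]) > threshold
    )

def predict_trait(labeled, query, k=2):
    """'Physical traits': mean label of the k nearest labeled embeddings."""
    nearest = sorted(labeled, key=lambda item: cosine_distance(item[0], query))[:k]
    return sum(value for _, value in nearest) / k

# Made-up embeddings for the same tiles at two dates.
before = {"tile_1": [1.0, 0.0], "tile_2": [0.0, 1.0], "tile_3": [0.7, 0.7]}
after = {"tile_1": [1.0, 0.05], "tile_2": [1.0, 0.1], "tile_3": [0.7, 0.7]}
flagged = changed_tiles(before, after, threshold=0.2)
print(flagged)

# Made-up (embedding, trait value) pairs, e.g. a biomass-like quantity.
labeled = [([1.0, 0.0], 10.0), ([0.9, 0.1], 20.0), ([0.0, 1.0], 200.0)]
estimate = predict_trait(labeled, [0.95, 0.05], k=2)
print(estimate)
```

In practice the trait model would more likely be a small regression head trained on the embeddings; the nearest-neighbor average is used here only because it fits in a few lines.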

Imagining a future built on planetary genetics

By decoding the “genetic patterns” of our planet, we’re ushered into a new era of understanding and interaction. Here, we delve into three pivotal areas where this newfound knowledge could revolutionize our approach.

1. Proactive Conservation

Traditionally, our conservation efforts have been predominantly reactive, responding to evident challenges and crises. However, with a deeper dive into the planet’s “DNA,” we gain the ability to foresee potential disturbances in our ecosystems, almost akin to predicting genetic vulnerabilities in a living organism. Such foresight could manifest in numerous ways:

  • Detecting the initial signs of forest regions showing vulnerability to illegal logging or disease, even before the first tree falls.
  • Identifying coastal areas whose “genetic markers” suggest they are at heightened levels of climate risk, helping us protect blue carbon and other important nature before it’s too late.

The potential is vast, enabling conservationists to shift from firefighting to preventive action, securing our planet’s health with unprecedented precision.

2. Carbon Management

As we grapple with the climate crisis, the ability to manage and mitigate carbon emissions is paramount. Our planet’s “genetics” could provide unparalleled insights into the carbon cycle:

  • Recognizing areas where carbon sequestration occurs most effectively, enabling targeted reforestation or afforestation efforts.
  • Detecting regions with high carbon release susceptibility, perhaps due to land-use changes, providing valuable data for policy-makers and industry leaders to mitigate emissions at the source.

In essence, Earth’s “genetic blueprint” could act as a real-time ledger of carbon transactions, ensuring that our mitigation efforts are timely, efficient, and impactful.

3. Biodiversity Preservation

Our planet’s myriad species are not just a testament to the wonders of evolution, but also crucial cogs in the machinery of ecosystems. The “genetic patterns” observed through EO foundation models could:

  • Highlight “hotspots” where genetic diversity is rich, yet precariously balanced, allowing for tailored conservation strategies.
  • Predict potential threats to specific species by correlating changes in habitat “genetics” with species vulnerabilities, long before physical symptoms of threat manifest.

Such insights could guide resource allocation, ensuring that we focus on areas where biodiversity is under maximum threat, and also identify regions where the reintroduction of species could restore genetic equilibrium.

It’s time for a “human genome project” for the Earth

In summary, we stand at the cusp of a revolution for how data can be used to drive insights and decisions for our planet. But revolutions require momentum, commitment, and above all, involvement. Whether you’re a scientist, a data enthusiast, a policymaker, or simply someone who cares deeply for our Earth, your perspective and effort can amplify the impact of what we’ve started to explore.

Join us at Earth Genome as we navigate this fascinating landscape of planetary genetics. Collaborate on our projects, enrich our datasets with your expertise, or simply spread the word about the importance of this endeavor. The “DNA” of our planet Earth is something that should be owned and developed by all of us.
