The State of Data Science for Earth Sciences

Report on ESIP 2021 Summer Meeting

Stephen Haddad
Met Office Informatics Lab
10 min read · Sep 17, 2021


The 2021 Summer Meeting of the Earth Science Information Partners (ESIP) took place (virtually, of course) from 19 to 23 July 2021. ESIP is a group of Earth Science organisations, currently primarily from the United States, that aims to promote “the collection, stewardship and use of Earth science data, information and knowledge that is responsive to societal needs” (from the ESIP website). The group is organised around clusters: small interest or working groups, each with a particular focus within the broad remit of the Earth Sciences. Examples include the Cloud Computing cluster, which focuses on applying cloud-based architectures to the Earth Sciences; the Machine Learning cluster, which focuses on best practices for applying Machine Learning to the Earth Sciences; and the Data Readiness cluster, which is developing standards for evaluating how easily a dataset can be used for Machine Learning or Artificial Intelligence applications.

The clusters come together twice a year, for the Summer and Winter meetings, to share knowledge and ideas between clusters as well as with the wider community. In this blog post I will look at some of the key themes and take-away messages, from my perspective, from the sessions I attended and the discussions I participated in, and consider what they mean for the application of Data Science and Machine Learning techniques in the Earth Sciences context.

At each conference there are a few key themes that emerge, popping up in almost every talk, representing the common challenges and opportunities in that field at that time. While distilling my conference experience into a few key themes, the words of the song “Nowadays” from Chicago popped into my head:

There’s men everywhere
Jazz everywhere, booze everywhere
Life everywhere, joy everywhere
Everywhere, nowadays

Lyrics of “Nowadays” from the musical Chicago, music and lyrics by John Kander & Fred Ebb (DOI not available)

The musical Chicago presents a slightly glamourised, slightly cynical view of life and fame in the roaring twenties, in many ways a golden age before the Great Depression. We’re currently in a golden age in the adoption of data-driven techniques. The talks and discussions at the ESIP 2021 Summer Meeting represent a snapshot of the state of Data Science and Data Engineering in the Earth Sciences domain in 2021: both where we are now and where we would like to be. What key technologies and principles are “everywhere”, at least on the minds of everyone attending the ESIP meeting? What does the Data Science for Earth Sciences landscape look like “nowadays”?

Data everywhere

Sessions:

  • OpenDAP for Data Providers (recording)
  • AI Data Readiness: Designing A Community-Driven Road Map for Data Standards and Tools (recording)
  • The Saga Continues: Cloud-Optimized Data Formats (recording)

Ideas & Tools:

  • Streaming data and optimising data for cloud access e.g. OpenDAP, EDR
  • Making data easier to find through catalogs and search e.g. STAC

It’s hardly novel for data to be a key theme of an Environmental Sciences conference: data from both observations and simulations have long been essential. New data sources do, though, continue to be developed, and existing sources are being made easier to access. In the new-sources category we have sensors like “smart webcams”, which use Machine Learning to infer measurements such as cloud cover from images or video. There has also been a lot of effort devoted to making data findable through data catalogs, using technologies like STAC or EDR, as well as through data search engines and portals such as NASA GIBS or Radiant ML Hub.
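To make catalog-based discovery concrete, here is a minimal sketch of searching a public STAC API. It assumes the pystac-client library and the public Earth Search endpoint; the collection, bounding box and date range are illustrative only, and method names vary slightly between library versions.

```python
# A minimal sketch of finding data through a STAC catalog, assuming the
# pystac-client library (pip install pystac-client) and the public
# Earth Search STAC API; any STAC-compliant API works the same way.
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v1")

# Search for Sentinel-2 scenes over an illustrative bounding box and
# date range (the week of the ESIP 2021 Summer Meeting).
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-3.6, 50.2, -3.0, 50.6],  # lon/lat box, south-west England
    datetime="2021-07-19/2021-07-23",
)

for item in search.items():
    # Each item carries metadata plus links (assets) to the actual files.
    print(item.id, item.datetime, list(item.assets))
```

Increasingly, we see applications that derive value from the large quantity and variety of data produced not by using a single dataset in isolation for a particular purpose, but by combining many different datasets from different domains, in ways the producers may not have expected. Such users are often not experts in the domain of the data or its producer. Which brings us on to the next key theme: users.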

Users everywhere

Sessions:

  • Delivering Trusted Data to Real Users & Decision Makers (recording)
  • Distributed Rapid Collaboration on Disaster related Information (recording)
  • New Frontiers in AI for Earth and Space: Big Data and Parallel Computing (recording)

Ideas & Tools:

  • Data Readiness levels & Decision-ready Information
  • Different data use cases e.g. GIS users, disaster responders

What do users really require from our data? Many users join data from different sources and domains to draw insights, so it can be difficult for data providers to understand all their users. This doesn’t mean that, as data producers or curators, we should throw up our hands and give up trying to understand our users or how the data is used. Rather, it is a key ongoing task to monitor who is using the data, for what, and how, and to let that inform the contents of datasets and the processing tools built around them, so that they meet actual users’ needs as far as possible.

For example, in the Analysis-ready Data cluster I am a part of, we are putting together a checklist for how ready a dataset is to be used in AI applications. Although some properties apply to all datasets and users, we need to consider specific use cases, and the checklist’s usefulness to specific users, to create a genuinely helpful document.

One user category that is fairly new to me as a Data Scientist is disaster responders, who don’t want access to huge amounts of raw data, but rather want to be presented with decision-ready information. They also need to know to what degree they can trust this information, and whether they can have confidence in the decisions they make based on it. How do we present our datasets and build our pipelines so that users can trust the information products that come out at the end?

FAIR everywhere

Sessions:

  • Best Practices for FAIR Research Software (recording)
  • Best Practices for Reusability of Machine Learning Models: Guideline and Specification (recording)

Ideas & Tools:

  • FAIR principles (Findable, Accessible, Interoperable, Reusable) applied beyond data, to code and ML models
  • Sharing and reuse e.g. Zenodo, DLHub, conda, Docker, ONNX

Many readers will have realised that the challenges described so far are addressed by the FAIR principles: Findable, Accessible, Interoperable, Reusable. I’m familiar with FAIR applied to data; what was new to me at this conference was extending these principles to all digital assets in the processing pipeline, including code and machine learning models.

Making code or ML models findable and accessible is fairly straightforward: code can live in git repositories or Zenodo archives. Standard pathways are less well defined for ML models, and are not yet at the level of the data search engines and catalogs described earlier, but they are gradually evolving and developing, with services like DLHub.

Interoperability is perhaps better defined for code and models. Code should have a well-defined interface (API) so that it can be joined with other analysis code; models should have well-defined inputs and outputs so they can be joined with other models or analysis code. Reusable is a more contested concept. What sort of reuse do we expect? Do we really mean reproduce or replicate, terms that are often used imprecisely or interchangeably but mean different things? Technologies like conda environments and Docker containers help make code and models reusable, and often reproducible, especially in combination with FAIR data. Formats now exist for exchanging and reusing trained model parameters, such as ONNX.
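As a small illustration of model exchange, the sketch below exports a toy model to ONNX. It assumes PyTorch is available; the model and input shape are purely illustrative, not any particular Earth Science model.

```python
# A minimal sketch of exporting a trained model to ONNX so it can be run
# from other frameworks and runtimes. Assumes PyTorch; the toy model and
# input shape are purely illustrative.
import torch
import torch.nn as nn

# A toy regression model standing in for a real trained model.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
model.eval()

# A dummy input defines the expected input shape and dtype for the export.
dummy_input = torch.randn(1, 4)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                 # portable artefact to archive and share
    input_names=["features"],     # well-defined inputs...
    output_names=["prediction"],  # ...and outputs aid interoperability
)
```

The exported file can then be archived alongside the code and data, and run with any ONNX-compatible runtime, which is exactly the kind of reuse the FAIR principles are after.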

Ultimately, what the FAIR principles mean for each asset and context comes down to the expected users and the use cases to be supported. One can never perfectly apply FAIR for unknown uses, but by considering broad usage patterns one can hopefully get close for most users.

Metadata everywhere

Sessions:

  • Machine-Readable Descriptors for Heterogeneous Tabular Data (recording)
  • Toward Improving Representation of Data Quality Information (recording)

Ideas & Tools:

  • Metadata is vital for FAIR and ARCO (Analysis-Ready, Cloud-Optimised) data, and builds trust in data and derived products.
  • Metadata for tabular data e.g. CSV on the Web, CSV YAML, ERDDAP

We will get the most value out of our datasets when they can easily be used by people from different backgrounds and domains. It is essential for these diverse users, who may not be from the data provider’s domain, to be able to get the information they need to use the data. All datasets therefore need comprehensive and comprehensible metadata: descriptions of the fields, their units, the origin of the data, the processing that has been applied, and any other information necessary for understanding and using the data. Metadata also matters for efficient access patterns, as tools that extract a subset or stream data need to know the detailed structure of the data from the metadata, without reading any of the actual data.
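As a small illustration, the sketch below attaches descriptive metadata to a dataset using xarray, whose attributes map onto NetCDF/CF-style metadata; the variable names, units and provenance strings are all hypothetical.

```python
# A minimal sketch of attaching descriptive metadata to a dataset using
# xarray, whose attributes map onto NetCDF/CF metadata. All names, units
# and provenance strings here are hypothetical.
import numpy as np
import xarray as xr

temperature = xr.DataArray(
    np.random.rand(24, 10),
    dims=("time", "station"),
    attrs={
        "standard_name": "air_temperature",  # CF-style field description
        "units": "K",
        "source": "hypothetical station network",
        "history": "hourly mean of 1-minute samples",
    },
)

ds = xr.Dataset(
    {"air_temperature": temperature},
    attrs={"title": "Example station temperatures",
           "institution": "Example Organisation"},
)

# Tools can subset or stream from the structural metadata alone,
# without touching the data values themselves.
print(ds)
```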

The challenge for providers is finding the time for the labour-intensive task of gathering and documenting metadata alongside all the other tasks required of them. Providers are often also primary users of their own data, but rarely need the metadata themselves because they know the data intimately, so curating metadata means devoting time to a task that primarily benefits others, or possibly one’s future self. The pay-off is that metadata builds confidence that downstream products are built on correct use of the underlying datasets, and therefore confidence in the decisions made from the derived decision-ready information.

Another challenge is that it is not always clear what the best way is to store metadata so that it stays attached to the data. For example, CSV files are a common way of storing tabular data, but have no good innate support for detailed metadata. The session on Machine-Readable Descriptors for Heterogeneous Tabular Data suggested several emerging standards that extend the CSV format in a backward-compatible way, such as CSV on the Web (CSVW). It is important, in building a solid foundation for a data-driven future, that metadata is not an afterthought or optional extra, but is seen as vital and used extensively by data provider and user communities.
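To make that concrete, the sketch below writes a CSVW metadata sidecar describing the columns of a CSV file; the data file, column names and descriptions are hypothetical.

```python
# A minimal sketch of a CSV on the Web (CSVW) metadata sidecar, written
# from Python. The data file and its columns are hypothetical.
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",
    "tableSchema": {
        "columns": [
            {"name": "time",
             "titles": "Observation time (UTC)",
             "datatype": "datetime"},
            {"name": "air_temperature",
             "titles": "2 m air temperature (K)",
             "datatype": "number"},
        ]
    },
}

# By convention the sidecar sits alongside the CSV file as
# <name>.csv-metadata.json, so tools can discover it automatically.
with open("observations.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Such tools help, but ultimately it is a culture shift that communities of interest and practice must foster, which leads on to the final theme: Community Everywhere.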

Community Everywhere

Sessions:

  • AI Data Readiness: Designing A Community-Driven Road Map for Data Standards and Tools (recording)
  • ESIP in 2031: how we got here, from a pandemic to a bright new future (recording)

Ideas & Tools:

  • Community-driven & owned recipes & standards
  • Mentorship and training to increase uptake of these ideas and tools

There were a lot of great ideas and tools in every session of the conference. I’ve never gone away from a conference with so many different things I wanted to try and so many tools that would be directly relevant to projects I’m currently working on. I also really enjoyed the space allowed for discussion, breakout sessions and socials throughout the conference. The organisers did an excellent job creating a community feel that encouraged real engagement.

After the post-conference high, though, you get back to the daily reality of your job, with its deadlines and deliverables, and the high-minded ideals you intended to adopt and the new tools you were eager to try are moved to the bottom of your to-do list. To ensure that principles such as FAIR, and tools such as metadata-rich cloud-optimised formats and searchable data catalogs, become part of the usual working practices in the Earth Sciences, there needs to be a community to facilitate this and drive the process forward, developing community recipes and standards that make it easy to adopt best practices, along with mentorship and learning opportunities to develop the skills required.

One very interesting session was the last of the conference, a time-travel thought experiment. Imagine it is 2031 and the ESIP community has grown and thrived over the past ten years of the “post-pandemic new normal”. What decisions were made and actions taken, from the 2021 Summer Meeting onwards, to make that happen? I think all conferences should include this combined retrospective and future-planning exercise, as a satisfying way to draw things to a close while ensuring the right “next steps” are taken. One overwhelming theme in this discussion was the importance of inclusive, active communities in making it all happen.

Conclusions: The Future of Data Science for Earth Sciences

You can like the life you’re living, you can live the life you like.

Chicago (again)

What do we want the application of Data Science to the Earth Sciences to look like in 10 years’ time? There is a lot of excitement about how technologies such as cloud computing and machine learning can revolutionise the Earth Sciences. We’re moving beyond the initial excitement phase of “Look, I applied machine learning to my subject” or “I ran my script on the cloud”. We need to consider what can be improved, or even transformed, for real users with specific use cases, and what principles and tools can provide a framework for turning those possible benefits into reality.

The musical Chicago depicts the roaring twenties, an economic boom period, but the excesses of that era led to the devastating Great Depression that followed. Machine Learning, like most technologies, has gone through boom-and-bust cycles of hype and scepticism. We are currently in a boom of adoption and excitement, and the themes above reflect the maturation of the field. We want to ensure that we don’t over-promise and under-deliver, leading to a bust, but rather create real value for our organisations, communities and the wider world, and deliver some of the transformational capabilities long promised and expected.

Further information and links:

ESIP Summer meeting 2021:
