Data Stewards Network

Responsible Data Leadership to Address the Challenges of the 21st Century

Fourth Wave of Open Data Seminar: Future-Proofing Open Data: Making Data AI-Ready

Jun 3, 2025



This blog is part of a continuing series on the Fourth Wave of Open Data. To read the first blog in the series, click here.

Over the last three months, the Open Data Policy Lab has explored how the Fourth Wave of Open Data fundamentally changes the way we conceive of the open data movement.

In the first episode, we spoke with leading data practitioners about how the combination of open data and generative AI offers significant potential. In the second episode, we discussed how open data might be made conversational through the application of AI tools.

On 5 May, the discussion looked at something more foundational: How do we make data ready for AI? How do we future-proof open data?

Joined by an international panel of experts that included Stefan Baack (Mozilla Foundation), Zhamak Dehghani (Nextdata), and Chris Shaffer (BrightQuery), Stefaan Verhulst sought to understand what AI readiness for open data means and what strategies we might pursue around data curation, standardization, and quality.

Research on Open Datasets for LLM Training

The conversation opened with a brief discussion with Stefan Baack on recent research from the Mozilla Foundation and EleutherAI that examined best practices for creating open datasets for LLM training.

As he noted, the piece was the result of a short workshop, one that asked the question, “Is copyrighted data truly essential to build AI, or can we rely on openly licensed datasets to make AI more equitable?”

The results, which drew on input from 30 AI scholars and practitioners, found significant opportunities to build open, transparent datasets for AI development. Acting on this potential meant overcoming certain challenges, chief among them scalability.

“I would summarize all the challenges that were identified as a lack of scalable solutions to identify and acquire this kind of data, scalable is really the key word here,” said Stefan.

He noted that there was no shortage of permissibly licensed material. Much of this work existed in the public domain due to a failure to renew copyright. However, much of this content has never been digitized or has poor optical character recognition (OCR) quality, which makes it difficult to use for training.

Still other material existed on HTML pages with non-standard terms of use, or with terms that required a human to interpret. A page carrying a Creative Commons license might intend only part of its content to be in the commons, not the page in its entirety.

The paper identified possible solutions to these problems. The open data movement could develop technical standards for encoding usage preferences across the web (for example, explicitly indicating which content should not be used for AI training). There could also be greater effort around making terms of use machine readable.
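To make the idea of machine-readable usage preferences more concrete, here is a minimal sketch in Python. The file format, field names (license, scope, ai-training), and the decision logic are hypothetical and invented purely for illustration; they are not part of any standard discussed in the seminar.

```python
import json

# Hypothetical machine-readable usage preferences that a publisher might ship
# alongside a page or dataset. The field names are illustrative only.
EXAMPLE_PREFERENCES = """
{
  "license": "CC-BY-4.0",
  "scope": ["articles/", "data/"],
  "ai-training": "disallow"
}
"""

def allowed_for_training(preferences_json: str, path: str) -> bool:
    """Return True only if the given path may be used for AI training."""
    prefs = json.loads(preferences_json)
    # Content outside the declared scope carries no open-license grant at all.
    in_scope = any(path.startswith(prefix) for prefix in prefs.get("scope", []))
    return in_scope and prefs.get("ai-training", "disallow") == "allow"

print(allowed_for_training(EXAMPLE_PREFERENCES, "articles/2025/open-data.html"))  # False
```

The point of a sketch like this is that a crawler could answer the question "may I use this for training?" without a human reading the terms of use, which is exactly the scalability problem the workshop identified.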

Stefan closed, though, with a call to build a community of AI builders and open data curators around shared interests, a community that could better frame these conversations and identify the specific solutions each domain needs.

“I think there needs to be a community made up of both AI builders and open data curators around shared interest. I think it is important that the AI community is not just a consumer of open data, but also in various forms, a contributor to it,” he said. “I think the Open Data community could be really important in developing [technical] standards and adopting them at scale.”

Federated Models of Data Ownership

Following these remarks, Stefaan turned to Zhamak Dehghani. Based on her experiences, he sought to understand how she saw data evolving. How did changes in how we handle data affect its readiness for AI?

Zhamak responded by noting that the challenges raised around data sharing, discoverability, and scalability had been a major focus of her work. At Nextdata, the solution they had settled on was one built around a federated model of data ownership.

“The reason for that is the problems that Stefan mentioned around discoverability, around understanding what the data is about and then how it can be used. […] We’re really looking at this intelligent block that can describe itself. It has all the metadata to describe itself. It can computationally govern whether this particular use case that is trying to access my underlying data ultimately should or should not be used.”

“At Nextdata, we’re trying to solve this problem of data sharing at scale with trust and product thinking built into it with a technology that abstracts the complexity that sits beneath this concept of autonomous data as a product,” she continued.
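As a rough illustration of what a self-describing, computationally governed data product could look like, here is a minimal sketch in Python. The class, metadata fields, and policy check are hypothetical and do not describe Nextdata's actual technology.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A hypothetical self-describing data product: data bundled with the
    metadata and policy needed to govern access computationally."""
    name: str
    description: str
    schema: dict                              # column name -> type, for discoverability
    allowed_purposes: set = field(default_factory=set)
    records: list = field(default_factory=list)

    def request_access(self, purpose: str):
        """Return the underlying records only if the stated purpose is allowed."""
        if purpose not in self.allowed_purposes:
            raise PermissionError(f"Purpose '{purpose}' is not permitted for {self.name}")
        return self.records

# Illustrative usage: the product describes itself and enforces its own policy.
product = DataProduct(
    name="city-air-quality",
    description="Hourly PM2.5 readings from municipal sensors",
    schema={"timestamp": "datetime", "station_id": "str", "pm25": "float"},
    allowed_purposes={"research", "public-dashboard"},
    records=[{"timestamp": "2025-05-05T00:00", "station_id": "A1", "pm25": 12.4}],
)

print(product.request_access("research"))     # returns the records
# product.request_access("ad-targeting")      # would raise PermissionError
```

The design choice the sketch tries to capture is that discoverability (the description and schema) and governance (the purpose check) travel with the data itself, rather than living in a separate catalog or contract.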

Stefaan followed up by asking whether there were ways to accelerate this approach and what kinds of investment that would take.

Zhamak responded by noting that standards always accelerate innovation. While standards are hard to establish because they require people to agree with one another, even a few would start a conversation and allow people to adapt them to their own contexts.

“We need the hourglass of data innovation, a narrow waist with a small set of standards that, frankly, right now are missing, including those around access control.”

Building New AI

Finally, Stefaan turned to Chris of BrightQuery. Noting that the organization had significant expertise working with statistical agencies, including on opendata.org, Stefaan wanted to know what challenges they saw emerging in the field and what approaches might make data AI-ready.

Chris, in turn, drew on his recent project experiences. He spoke on his work with the US government to evaluate its data, build a government data chatbot, and help establish an open entity dataset.

The question of how to make open data AI-ready was one that came up very quickly in BrightQuery’s effort to evaluate government data. Namely, what does “AI-ready” mean?

There’s AI-ready for training an LLM, AI-ready for building a commercial chatbot that provides the right answer, AI-ready for answering questions about structured data, and AI-ready for predictive analytics, among other applications.

BrightQuery considered what all these possible applications have in common and settled on issues of accessibility and interoperability.

“One of the things that we found is almost universal is we have to look at the difficulty of just crawling a website, scraping data off of it, pulling information from OCR. And this is all data that is intended to be open. This is data that is published by government agencies that’s intended to be as easy to work with as possible.”

He also mentioned the importance of data provenance for building systems, noting that, when it comes to accurate AI results, having well-bounded data whose origin can be identified is critical.

“It’s not only important to not abuse work that people put out into the wild. It’s also important just to get the answers right. You’ve got to know where every number is coming from. Not knowing if your financials might be coming from Reddit or a sci-fi book is terrifying.”
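Here is a minimal sketch in Python of what provenance-aware answering could look like: each value carries its source alongside it, and questions are answered only from records whose origin can be identified. The records, field names, and sources are hypothetical, invented purely for illustration.

```python
# Hypothetical records: each value carries its provenance alongside it.
records = [
    {"metric": "annual_revenue", "value": 4_200_000, "source": "SEC 10-K filing, 2024"},
    {"metric": "employee_count", "value": 310, "source": "BLS establishment survey, 2024"},
    {"metric": "annual_revenue", "value": 9_999_999, "source": None},  # origin unknown
]

def answer(metric: str) -> str:
    """Answer only from records whose origin can be identified."""
    trusted = [r for r in records if r["metric"] == metric and r["source"]]
    if not trusted:
        return f"No well-bounded data available for '{metric}'."
    best = trusted[0]
    return f"{metric} = {best['value']} (source: {best['source']})"

print(answer("annual_revenue"))   # cites the SEC filing, ignores the unsourced figure
print(answer("market_share"))     # declines to answer rather than guess
```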

***

These are just a few of the reflections offered in this seminar on the Fourth Wave of Open Data. To follow the full discussion, watch the video here.

Stay tuned for our reflections on our next seminar, “Data Commons for the Public Good.”


Written by The GovLab

The Governance Lab: improving people’s lives by changing how we govern. http://www.thegovlab.org @thegovlab #opendata #peopleledinnovation #datacollab
