Tackling 4 “Common” Problems: Accelerating the Next Generation of Data Commons at a time of Artificial Intelligence
By Hannah Chafetz, Adrienne Schmoeker, Andrew J. Zahuranec, and Stefaan G. Verhulst
Last week, MIT’s Data Provenance Initiative published a study indicating a decline in web data availability for AI training, providing more evidence of an imminent “Data Winter.” Examining 14,000 web domains, the study finds that the holders of 5% of data and 25% of high-quality data used for training AI models have implemented mechanisms to limit access. These include content paywalls, fees to access data for AI, technical mechanisms to block automated web crawlers, and more. This decline in data availability is problematic because it can directly impact the quality of AI output, leading to biased responses to AI queries and the spread of misinformation.
As “generative AI-nxiety” continues to permeate the data ecosystem, it is becoming increasingly important to determine how to increase access to high-quality data for AI developers in a way that is equitable and sustainable. This shift represents a pivotal moment to reimagine the role of data and knowledge commons (repositories of data and information managed by communities or entities operating in the public’s interest) as a means of increasing data availability for AI and making sure no one is left behind.
Over the last few months, The Open Data Policy Lab (a collaboration with Microsoft) has begun a series of research activities to understand the challenges and opportunities generative AI poses to data commons and how to accelerate the next generation of data commons in the age of AI. Most recently, the team brought together a diverse group of experts from across the AI, technology, government, and data ecosystems to discuss our 10-Part Framework of unresolved issues that, if addressed, could help accelerate the next generation of data commons, along with solutions to make progress on those issues. Following these discussions, the Open Data Policy Lab team conducted two brainstorming sessions to synthesize the ideas gathered and identify an actionable area of focus. Below we provide a summary of these findings.
4 Common Problems for Data Commons
Drawing on participant suggestions around our 10-Part Framework, we found four common problems for the next generation of data commons. Participants subsequently provided a range of solutions to make progress on those problems (see our mind map below). In what follows we provide a summary of these problems and the solutions suggested.
1. No Common Access to Data: Generating access to high-quality data for AI remains a challenge across the data ecosystem. Participants explained that a lack of machine-understandable open data from official sources has in part contributed to a skewed AI ecosystem. Participants discussed the need for new approaches to improve access to quality data for AI (whether for fine-tuning, pretraining, or data augmentation) to ensure no one is left behind.
Solutions:
- A. Identify the datasets useful for high-value AI applications: Participants provided a range of examples of data types that could be useful for AI, including knowledge graphs, supervised fine-tuning data, metadata itself, or unstructured data from “data graveyards” (data repositories that are no longer managed or up to date) such as government databases or open science repositories. Participants also discussed the possibility of tapping into non-digital data such as local heritage information or public domain and out-of-print books.
- B. Establish the role of government and national statistics offices in making data accessible and AI-ready: Participants (particularly from the public sector) discussed the need to solidify the government’s role in improving data access. This might include making its own data ready for different AI scenarios, deploying data quality assessment tools, aligning interests and incentives among data producers, stewards, and consumers, generating awareness of responsible data re-use practices, and setting up new institutional models to facilitate data access (e.g. legal and governance frameworks for responsible data re-use).
- C. Implement new data access mechanisms: There was general consensus that generative AI could help facilitate broader access to data when implemented responsibly. Participants explained that generative AI could be used to create new data commons interfaces that allow users to find and analyze data based on search queries.
- D. Identify and disseminate data commons use cases: Participants emphasized the opportunity to identify and share use cases from existing data commons and data commons for AI. Use cases involving genomic data, environmental data, supervised fine-tuning, biology data, and street mapping were mentioned.
- E. Foster continuous learning and feedback: Documenting and sharing lessons learned from prior efforts to make data ready for AI could be valuable. Participants suggested exploring new tools to provide feedback on data quality and AI-readiness.
- F. Accelerate the creation of non-English data commons: There was general consensus that AI data commons for non-English languages could provide a pathway to increase representativeness within generative AI technologies. However, these data commons must reflect local values and norms, be transparent about sourcing, and create value for the communities that contributed.
2. No Common Financial Structure: Making data commons sustainable and avoiding “data graveyards” remain significant challenges. Several participants voiced concerns around a lack of motivation for contributing to data commons communities and the resulting small pool of data resources only representing subsets of a population. Others emphasized that a lack of financial contributions to common infrastructure has made data commons challenging to maintain. There is a need for new approaches to manage data and financial contributions and make data commons durable in a rapidly evolving AI ecosystem.
Solutions:
- A. Develop mechanisms to govern data contributions: Participants expressed the need for formal guidance on how to balance what users are contributing to the data commons with how much they are using it. However, participants stressed that this would need to be handled cautiously to avoid becoming a closed system. They recommended exploring both legal (e.g. rules) and social (e.g. social pressure) mechanisms.
- B. Increase incentives to contribute data: Participants suggested developing new strategies to increase the motivation to contribute data. This might include offering specific rewards for contributions, such as access to the generative AI tools their data helped create, or identifying unique value propositions for different personas.
- C. Operationalize new funding models: Participants suggested a range of financing models to make data commons durable in the age of AI, including patronage, micropayments, and the creation of new public investments in data commons infrastructure.
3. No Common Based Licensing Practices: Participants expressed concerns around the potential misuse of datasets and the need to better manage which actors can access datasets for specific purposes, in line with community preferences. They also shared concerns around the lack of a common licensing solution for data re-use in the age of AI. They explained that traditional licenses and technical infrastructure are no longer sufficient in the rapidly evolving AI landscape. There is a need for new approaches to ensure official data for AI is accessed and sourced responsibly.
Solutions:
- A. Design a data commons license: Participants suggested creating a specific data and knowledge commons license for use across different AI scenarios. This license could also include the type of AI initiative being developed using the data.
- B. Accelerate access systems to protect boundaries: A tiered access approach for different volumes and uses (e.g. commercial vs. non-commercial) was frequently mentioned. Participants also suggested creating firewalled infrastructure, leveraging federated learning, and replacing licenses with consent preference signaling that includes an opt-in/opt-out feature.
- C. Develop scalable infrastructure: Participants recommended advancing a common infrastructure or architecture that could help facilitate equitable access to data. One participant suggested a web-based infrastructure that houses all the information required to use AI datasets, such as data and metadata standards, catalogs, governance models, standardized APIs, and licenses. Another suggestion was to implement generative AI interfaces within data commons.
- D. Establish a social license for AI: Participants discussed how a dedicated social license could be helpful in making individuals feel more in control of their data. This would include developing new, broadly agreed-upon rules and processes through public participation.
4. No Common Clarity: There is no shared understanding of what “data commons” actually means and who ought to be involved. At the same time, many high-quality data sources are not included in generative AI training and fine-tuning. There is, for instance, a need for common signals to indicate whether data can or cannot be scraped by AI technologies.
Solutions:
- A. Establish data and metadata standards for an AI age: Participants suggested developing new data and metadata standards specific to data commons for AI. This could help increase representativeness within generative AI technologies. A metadata catalog and standards to increase transparency on sourcing were mentioned.
- B. Publish scoping documents: Participants recommended creating new scoping frameworks to narrow and refine what should and should not be included in data commons.
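To make the idea of a scraping signal concrete, here is a minimal sketch using robots.txt, one existing convention that the crawler-blocking mechanisms mentioned at the top of this post commonly rely on. It is an illustration only, not a standard proposed by participants. The robots.txt contents, the `ExampleAIBot` and `ResearchCrawler` agent names, the `may_crawl` helper, and the URL are all hypothetical; the parsing uses Python's standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imagined data commons site (an assumption
# for illustration): one named AI crawler is blocked, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    dataset_url = "https://example.org/datasets/census.csv"  # hypothetical URL
    for agent in ("ExampleAIBot", "ResearchCrawler"):
        print(agent, may_crawl(ROBOTS_TXT, agent, dataset_url))
```

A signal like this is tied to a crawler's name rather than to the purpose of the re-use, and it cannot express conditions such as commercial versus non-commercial use. That gap is one reason a shared, purpose-aware signal for data commons could be useful.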
***
In the coming weeks, we will be hosting another convening specifically on the first common problem: No Common Access to Data. Through this effort, we aim to develop a blueprint for new data commons for AI training and fine-tuning data, thus providing pathways to making more high-quality data available for AI models.
Are you working on data commons for AI or have you come across any interesting examples from the field?
If you have any questions or feedback or are interested in collaborating, please contact us at datastewards@thegovlab.org.
Interested in learning more about the thinking behind this work? Read our blog: “Data Commons”: Under Threat by or The Solution for a Generative AI Era? Rethinking Data Access and Re-use
***
Thank you to Gretchen Deo, Jordan Gimbel, and Sonia Cooper from Microsoft and our Open Data Action Labs participants for your contributions to this work.
***
About the authors
Hannah Chafetz and Andrew Zahuranec are Research Fellows and Adrienne Schmoeker is a Senior Research Fellow at The Governance Lab. Stefaan G. Verhulst is the co-founder of The Governance Lab and The DataTank, and one of the Editors-in-Chief of Data & Policy.
***
This is the blog for Data & Policy (cambridge.org/dap), a peer-reviewed open access journal published by Cambridge University Press in association with the Data for Policy Community Interest Company. Read on for ways to contribute to Data & Policy.