Data Commons Version 1.0: A Framework to Build Toward AI for Good
Elena Goldstein, Urs Gasser, and Ryan Budish
Alongside the recent explosion and mythologization of artificial intelligence (AI), we have seen a surge of interest in “data for good.” While the concept has long served as a cornerstone for individuals working at the intersection of innovation and development, evolutions in big data and now machine learning have provided new tools and traction. “Data for good” has since galvanized UN agencies in pursuit of the Sustainable Development Goals, ushered in a new era and method of philanthropy, and catalyzed diverse private and public partnerships for improving transit infrastructure, crop yields, and school performance, among other systems.
AI, a class of technologies built upon and sustained by massive amounts of data, has motivated a reexamination of “data for good” and its purview. As is apparent in the early adoption of machine learning in child welfare and criminal justice contexts, AI can exacerbate existing challenges, and generate new ones, related to decision-making with incomplete and unrepresentative datasets. Yet these technologies offer unprecedented opportunities for data-driven interventions and collaborations in the public interest, connecting information and partners previously irreconcilable. Identifying and supporting these opportunities was a primary goal of the AI for Good Global Summit, which took place last month in Geneva, Switzerland. At this Summit, the Berkman Klein Center co-organized a series of data-focused sessions and led a team of rapporteurs tasked with synthesizing, on the final day, findings pertaining to data availability, access, and sharing.
Among our most salient takeaways was that, given data’s fundamental relationship to AI, advancing “AI for good” means advancing our collective “data for good” capacity as well. Summit participants continually referenced the concept of a data commons as a promising mechanism that could boost the potential for societally beneficial AI by lowering the barriers to data collection, sharing, and use. Drawing upon these discussions and the findings of the Summit’s data rapporteurs, we have developed an emergent version 1.0 Data Commons Framework to help leaders in the public and private spheres better understand and implement these commons. Often, when people speak of a data commons, they refer to a repository and the technical infrastructure that enables it — what we call the narrow data commons. Our framework uniquely highlights the fact that building an effective data commons requires consideration of an additional broader set of societal and institutional layers that ensure interpretation of such repositories for the social good — what we call the broad data commons.
Framework Overview: Layers and Axes
Our Data Commons Framework is based upon a layered model of functionality, with each layer building upon those below it.
The model begins with three layers of core functionality: a) the technical infrastructure used to store the data in the cloud, government servers, or decentralized ledgers; b) the data itself in its various forms, i.e., qualitative/quantitative, structured/unstructured, ordinal/nominal, and discrete/continuous; and c) the formats and labels used to sort this information as set out by metadata and taxonomies of datasets. These three layers comprise the narrow data commons, circumscribed by the bounds of the infrastructure and data contents.
Situated on top of the narrow data commons, and no less important, are three layers that determine how the core functionality layers interact with societies and institutions. Specifically, they include: d) the organizational practices that dictate how a data commons fosters collaboration and incentivizes multistakeholder, multidisciplinary participation; e) institutions, law, and policy, the enablers and barriers that govern the accessibility of a data commons as well as the mitigation of risks pertaining to privacy and human rights; and f) the humans “in the loop” whose knowledge feeds into the development and preservation of the other layers, and whose inclusion and education can correct, enhance, and supplement them. The addition of these three layers establishes the broad data commons, which encompasses a full range of applications and audiences.
As these layers build upon one another, they highlight the importance of interoperability. The types of interoperability shift as one moves higher up the framework; within the narrow data commons, interoperability is driven by technical and data standards, whereas semantic interoperability and the translation of shared knowledge become dominant within the broad data commons. Our framework also differentiates between input and output data, allowing us to plot questions and proposals related to data at different stages in the AI development and application process. Here input data refers to information used to train an AI system, and output data refers to information an AI system may generate.
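The layered model described above can be written down as a small data structure. The following is only an illustrative sketch: the names `Layer`, `NARROW_COMMONS`, `BROAD_COMMONS`, `dominant_interoperability`, and `layers_below` are our own hypothetical labels for this example, not part of the framework itself.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The six framework layers (a through f), ordered bottom to top."""
    TECHNICAL_INFRASTRUCTURE = 1
    DATA = 2
    FORMATS_AND_LABELS = 3
    ORGANIZATIONAL_PRACTICES = 4
    INSTITUTIONS_LAW_POLICY = 5
    HUMANS = 6


# The narrow data commons consists of the three core functionality layers.
NARROW_COMMONS = frozenset(
    layer for layer in Layer if layer <= Layer.FORMATS_AND_LABELS
)
# The broad data commons adds the three societal and institutional layers.
BROAD_COMMONS = frozenset(Layer)


def dominant_interoperability(layer: Layer) -> str:
    """Return the kind of interoperability that dominates at a layer."""
    if layer in NARROW_COMMONS:
        return "technical and data standards"
    return "semantic interoperability and shared knowledge"


def layers_below(layer: Layer) -> list:
    """List the layers that a given layer builds upon."""
    return [lower for lower in Layer if lower < layer]
```

For instance, `layers_below(Layer.FORMATS_AND_LABELS)` returns the technical infrastructure and data layers, reflecting that each layer builds upon those beneath it.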
To demonstrate how one can map specific considerations and apply this framework across different layers, we have included some examples that emerged at the AI for Good Global Summit within its four tracks: AI + Health, AI + Satellite Imagery, AI + Smart Cities, and Trust in AI.
Framework Application: Examples from the AI for Good Global Summit
- AI + Satellites: Data on the Ground - Training AI for image recognition requires ground truth data — especially for satellites orbiting hundreds of miles above earth’s surface that literally need “ground” truth. In order to maximize satellites’ potential for tracking crop yields and predicting famines worldwide, it is worth pursuing additional tools allowing for the crowdsourcing and ingestion of standardized, georeferenced data that can be paired with the data satellites produce. Mobile phones, equipped with an array of sensors that can help build rich datasets, hold promise for enabling us to better understand what is happening in areas where there are currently no standardized field surveys. Utilizing mobile devices to supply ground truth data involves building interfaces at the technical infrastructure layer and the data layer for collecting this kind of information.
- Trust in AI: Verify, Then Trust - Building trust in AI demands transparency. While labeled data is useful for training AI systems, labeled datasets can help increase trust in the system as a whole. By implementing something akin to a nutrition label for datasets, we can bring greater transparency by demystifying the type of data and its provenance, collection, diversity, and limitations. This ensures that someone who did not collect the information directly still has the relevant context they need to make informed decisions about how to properly use the data. Nutrition labels and other dataset labeling tools (both human and machine readable) can act as guardrails to prevent abuses stemming from applications that may be inappropriate or introduce bias into outcomes. Creating such labels requires building interfaces at the data layer and formats & labels layer to collect and share the right contextual metadata.
- AI + Smart Cities: Graveyards and Sandboxes - As cities across the world become “smart” and look to automate services, it is crucial that they consult other municipalities that have attempted, or may be looking to attempt, similar efforts. How can we incentivize city governments to share their failures and successes, and to collaborate in pursuit of more ethical design? Creating “graveyard” and “sandbox” platforms for cities to log and learn from each other’s experiences could facilitate such an exchange and an “Internet of cities” approach for implementing solutions. Drawing upon the takeaways of other municipalities in deploying AI demands openness and action at the organizational practices layer and the institutions, law, & policy layer to establish pathways for adopting practices and reporting back.
- AI + Health: Trusting AI Like a Doctor? - The health field presents unique challenges for a data commons, as it encompasses both a complex ecosystem of stakeholders with distinct incentive models and a complex data ecosystem with information derived from sources as varied as medical imaging, claims databases, social media, and nurses’ notes. As practitioners look to incorporate medical diagnostic tools into their practices, reckoning with the ways in which these tools may amount to life-threatening black boxes or shine a light on the black boxes that are our doctors, it is important to explore opportunities to educate and include humans in implementing proper safeguards, whether through doctor/patient/AI overrides or comparisons with traditional thinking. Given the highly regulated and personal nature of relationships with doctors, these opportunities require interventions at both the institutions, law, & policy layer and the human layer.
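To make the dataset “nutrition label” idea from the Trust in AI example more concrete, here is a minimal sketch of what a label that is both human and machine readable might look like. It is an assumption-laden illustration rather than an established standard: the `DatasetLabel` class, its field names, and the sample values are all hypothetical, chosen only to mirror the attributes named in the example (type of data, provenance, collection, diversity, and limitations).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetLabel:
    """Hypothetical 'nutrition label' carrying context about a dataset."""
    name: str
    data_type: str
    provenance: str
    collection_method: str
    diversity_notes: str
    limitations: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Machine-readable form of the label."""
        return json.dumps(asdict(self), indent=2)

    def to_text(self) -> str:
        """Human-readable form of the label."""
        lines = [
            f"Dataset: {self.name}",
            f"Type: {self.data_type}",
            f"Provenance: {self.provenance}",
            f"Collection: {self.collection_method}",
            f"Diversity: {self.diversity_notes}",
            "Limitations:",
        ]
        lines.extend(f"  - {item}" for item in self.limitations)
        return "\n".join(lines)


# Illustrative values only; they do not describe any real dataset.
label = DatasetLabel(
    name="example-field-survey",
    data_type="structured, quantitative",
    provenance="hypothetical crowdsourced mobile submissions",
    collection_method="georeferenced reports from volunteers",
    diversity_notes="coverage concentrated in a few regions",
    limitations=["no standardized field survey for validation"],
)
print(label.to_text())
```

Someone who did not collect the data could read the text form before deciding whether a given use is appropriate, while tools could consume the JSON form automatically.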
We envision our Data Commons Framework as a starting point for discussing the needs that a data commons can help to address. At this stage, we offer the framework as a version 1.0 roadmap to assist those who may be looking to create or advance data commons for the social good; as we learn from these efforts, we hope to iterate upon our framework in conversation with colleagues and collaborators. Special thanks to AI for Good Global Summit co-organizer Amir Banifatemi of XPRIZE and to our fellow data rapporteurs: Marie-Ange Boyomo, Alex Cadain, Kenny Chen, John Enevoldsen, Mathilde Forslund, Sheridan Johns, Trent McConaghy, and Sean McGregor.
For more information on work the Berkman Klein Center is carrying out to support AI in the public interest, check out our Ethics and Governance of AI Initiative webpage.