Using Open Data to improve AI initiatives
The value of openness to promote transparent, effective, and fair AI
By Renato Berrino Malaccorto, Research Manager, Open Data Charter
AI technologies are being used across many aspects of life, from platforms and apps to research, and to tackle different SDG challenges such as environmental issues, transparency, and inequality. Many of these efforts are carried out collaboratively by the public and private sectors and civil society organizations. These collaborations can take different forms, including sharing a basic input for AI systems: data.
Data is at the foundation of AI models, and that is why attention should be paid to how it is collected, processed, shared, and used. In order to democratize AI, we first need a proper discussion of what that means in terms of data governance: how data is created and distributed (the data lifecycle), how we can make datasets representative of the populations and regions in which projects are developed and operate, how privacy laws and open data are regulated, and how data quality can affect AI. If the data is not democratized, AI systems will never be.
Open data principles for better performing AI
As the official portal for European data pointed out in its article “Open data and AI: A symbiotic relationship for progress”, open data and AI have the potential to support and enhance each other’s capabilities. This has also been recognized by UNESCO in its “Open Data for AI guidelines”. Having raw data is the first essential step in processing and transforming it into actionable information. But the required data needs to have certain characteristics, such as being accurate, timely, and reliable. A further critical aspect is that data should be findable, accessible, interoperable, and reusable (FAIR) by anyone, for any purpose. The aim of the guidelines is to apprise governments of the value of open data, and to outline how data are curated and opened.
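To make the FAIR characteristics concrete, here is a minimal sketch of how a publisher might check whether a dataset's catalog record carries the basic fields that support each principle. This is not an official FAIR assessment tool; the record structure and field names are illustrative assumptions, loosely modeled on DCAT-style catalog metadata.

```python
# Illustrative mapping from FAIR principles to catalog fields that support them.
# Field names are hypothetical, loosely inspired by DCAT-style metadata.
FAIR_CHECKS = {
    "findable":      ["identifier", "title", "keywords"],
    "accessible":    ["access_url"],
    "interoperable": ["format"],
    "reusable":      ["license"],
}

def fair_gaps(record: dict) -> dict:
    """Return, per FAIR principle, the catalog fields missing from a record."""
    return {
        principle: [field for field in fields if not record.get(field)]
        for principle, fields in FAIR_CHECKS.items()
    }

# A hypothetical catalog record for an agricultural dataset
record = {
    "identifier": "doi:10.0000/example",                 # hypothetical DOI
    "title": "Crop yield observations",
    "keywords": ["agriculture", "yield"],
    "access_url": "https://data.example.org/crops.csv",  # hypothetical URL
    "format": "text/csv",
    # note: no "license" field, so reuse conditions are unclear
}

print(fair_gaps(record))
```

A record that passes every check is not automatically FAIR, but gaps flagged this way (here, the missing license) are exactly the kind of omission that blocks reuse in AI projects.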
In a research project that is still in progress, we asked local practitioners whether access to open data would be important for developing their projects; the vast majority (95% of respondents) answered yes. According to them, the main benefits of open data would be complementing existing information, enabling trust, improving quality, and supporting interoperability.
There was also a mention of the added value of having final products in open repositories (e.g. GitHub, Airtable, etc.). There is a huge opportunity in sharing knowledge and advancing research in connection with Open Science. Several open data initiatives in the scientific field demonstrate the impact of open repositories with structured catalogs of data and standardized data formats.
The Open Data principles contribute directly to making AI systems work more transparently, efficiently, and reliably.
The “Open by default” principle recognizes that free access to, and subsequent use of, government data is of significant value to society and the economy, and that government data should therefore be open by default. It proposes moving from passive transparency, in which requests have to be made and more resources have to be invested, to proactive transparency, in which the flow of information becomes bidirectional. We believe that this principle favors access to data for AI projects and, in turn, allows for the creation of links between different actors and institutions, enabling another flow of information and different levels of feedback. We also recognize that Access to Public Information is a human right that has to be balanced with the Protection of Privacy. This dialogue is fundamental for AI systems to operate securely, for example in sensitive areas such as health.
The free availability of diverse datasets through open data is essential to drive innovation and create new economic opportunities. Innovative AI systems can then be used to address development challenges. Data should be free of charge and released under an open license, for example one developed by Creative Commons. This facilitates access and usability for more people, and helps democratize AI systems.
The completeness of open data contributes to the ability of AI systems to generalize to unseen examples once they are deployed into “real-world” operation, and it also contributes to data quality. It is important that the data is complete and contains all necessary information (e.g. metadata). This allows AI systems to operate more efficiently and to match real-world conditions more closely, as they are trained with a higher quantity and quality of data.
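A simple completeness check of the kind described above can be sketched in a few lines. The schema here (a weather-observation CSV with station, date, and rainfall columns) is a hypothetical example, not taken from any specific open data portal.

```python
import csv
import io

# Hypothetical required schema for a weather-observation dataset
REQUIRED_COLUMNS = {"station_id", "date", "rainfall_mm"}

def completeness_report(csv_text: str) -> dict:
    """Count rows, missing required columns, and empty cells in a CSV payload."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    present = set(rows[0].keys()) if rows else set()
    empty_cells = sum(
        1 for row in rows for value in row.values() if value in ("", None)
    )
    return {
        "rows": len(rows),
        "missing_columns": sorted(REQUIRED_COLUMNS - present),
        "empty_cells": empty_cells,
    }

sample = "station_id,date,rainfall_mm\nA1,2024-01-01,5.2\nA2,2024-01-02,\n"
print(completeness_report(sample))
# → {'rows': 2, 'missing_columns': [], 'empty_cells': 1}
```

Checks like this are deliberately crude; real data-quality pipelines go further (type validation, range checks, duplicate detection), but even this level of reporting makes gaps visible before a dataset is fed to a model.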
Open data standards: repositories, relevance and re-usability
Open data is only valuable if it’s still relevant and being reused. Getting information published quickly and in a comprehensive way is central to its potential for impact. Historical data is important to make an application work (e.g. in agriculture), but we need up-to-date data to make it work effectively. For example, in such applications, we need data that reflects different weather conditions, light conditions, and crop growth stages.
Data has a multiplier effect. The more quality datasets you have access to, and the easier it is for them to talk to each other, the more potential value you can derive from them. Commonly agreed data standards play a crucial role in making this happen. This is where the principles of comparability and interoperability come in: adopting international standards ensures that data from various sources is compatible and interoperable.
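The practical work behind interoperability is often mapping differently published data into one shared schema. The sketch below assumes two hypothetical publishers that expose the same rainfall measurements under different column names and date formats; the agency names, field maps, and target schema are all illustrative.

```python
from datetime import datetime

# Hypothetical per-publisher mappings onto a shared schema
FIELD_MAPS = {
    "agency_a": {"Fecha": "date", "Lluvia": "rainfall_mm"},
    "agency_b": {"obs_date": "date", "precip": "rainfall_mm"},
}
DATE_FORMATS = {"agency_a": "%d/%m/%Y", "agency_b": "%Y-%m-%d"}

def to_standard(source: str, row: dict) -> dict:
    """Map one publisher's row into the shared schema with ISO dates."""
    mapping = FIELD_MAPS[source]
    out = {mapping[key]: value for key, value in row.items() if key in mapping}
    # Normalize the publisher's date format to ISO 8601
    out["date"] = (
        datetime.strptime(out["date"], DATE_FORMATS[source]).date().isoformat()
    )
    out["rainfall_mm"] = float(out["rainfall_mm"])
    return out

print(to_standard("agency_a", {"Fecha": "01/03/2024", "Lluvia": "4.5"}))
# → {'date': '2024-03-01', 'rainfall_mm': 4.5}
print(to_standard("agency_b", {"obs_date": "2024-03-02", "precip": "3.0"}))
# → {'date': '2024-03-02', 'rainfall_mm': 3.0}
```

Once both sources speak the shared schema, their rows can be pooled into one training set, which is exactly the multiplier effect that common standards unlock.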
Achieving open repositories with common standards can help AI systems perform better, as we have seen in agriculture use cases. It is interesting to see projects like Source Cooperative, from Radiant Earth Foundation, a data publishing utility that allows trusted organizations and individuals to share data products using standard HTTP methods; this initiative aims to make access to data more uniform and efficient. It also helps build bridges between institutions, and it is encouraging, for example, that datasets such as the Google and Microsoft building footprint data are available under an open license.
It is also important to ask whether available repositories of datasets are really useful for AI initiatives; in some cases they have to be complemented with local data. Some recent studies suggest that synthetic data may be useful to promote inclusion and fill gaps. Harnessing open data within generative AI models holds the potential to democratize both access to open data and AI systems themselves.
Open data for improved governance, inclusive development and innovation
When we talk about improved governance, we are talking about the improvement of internal processes, both in public institutions and in other sectors. Open data allows for greater coordination between different agencies and, therefore, better results. This can be seen in both public and private sector developments in the field of AI. In relation to citizen engagement, we have already talked about two-way information flows: an informed society can deliberate better, generating a virtuous feedback loop, and AI initiatives can contribute to this by, for example, surfacing information from public databases through projects such as chatbots.
Finally, when we talk about open data for inclusive development and innovation, we are talking about the value of open data for AI systems that are representative of different regions of the world, that can address important development issues in today’s global context (such as anti-corruption or climate change challenges), and that allow for experimentation and innovation with these technologies. Open data allows us to access data from different locations, and while this will often need to be supplemented with local data, it brings us closer to greater accessibility and representation of information (e.g. documents in more languages, images from different parts of the world, etc.).
Conclusion
As the EU data team mentions, the potential of AI systems in society is vast. When combined with open data, new opportunities become possible both for deriving new insights from open data and powering AI systems for new uses.
Open data can help an AI project through its ability to supplement existing information, foster trust among stakeholders, enhance data quality, and promote interoperability.
Exposing AI systems to a larger volume and variety of data increases the chance of the system returning accurate and useful predictions. As such, open data can be a supply of large amounts of diverse information for AI systems. In this way, the availability of open data contributes to better performing AI.
In the run-up to the UN World Data Forum 2024, and with a view to committing to deepening our work on the connection between open data and AI, we have joined the #Commit2Data campaign organized by Open Data Watch. Our commitment is called: “Using the Open Data Charter’s principles to promote transparent, effective, and fair AI”.
Keep in touch with the author, Renato, to discuss the possibilities at the intersection of Open Data and AI.