Towards a Fourth Wave of Open Data? Selected Readings on Open Data and Generative AI

Published in Data Stewards Network, Sep 14, 2023

By: María Esther Cervantes, Hannah Chafetz, Sampriti Saxena, & Stefaan G. Verhulst

Generative AI tools are increasingly used across sectors, including in governments. However, there is limited research on how these generative AI tools could impact open data policies and programs. What are the opportunities for generative AI and open data? What are the risks? Could generative AI transform the role of statistical agencies? Is there a need for a global charter to govern generative AI?

Towards this end, in May 2023, The GovLab’s Open Data Policy Lab (a collaboration between The GovLab and Microsoft) hosted a panel discussion on the intersections of generative AI and open data and the ways in which generative AI could alter our existing conception of a third wave of open data. Building on the takeaways from this discussion, below we provide a curated list of annotated readings (listed alphabetically) on these topics.

These selected readings focus on three main areas: (1) the opportunities and risks of applying generative AI for open data, (2) generative AI governance models and discussion, and (3) the new role of national statistical agencies in the advent of these technologies. Given the speed at which these technologies are changing, we incorporate a wide variety of sources such as journal articles, reports from international organizations and think tanks, and blog posts.

We found several common themes across these readings. First, there is general consensus that generative AI tools can provide value for open data and National Statistical Offices, whether by increasing data discovery, accessibility, or stakeholder collaboration. However, privacy, security, and safety risks remain prevalent and must be balanced against these benefits. Second, there is a lack of common standards or policies for generative AI specifically. There are concerns that without a common language or standardization, algorithms may be misconstrued across borders. Third, governments are recommending synthetic data as a way to minimize privacy concerns with open data. If done responsibly, generative AI could help produce synthetic data at a larger scale. Lastly, governments around the world do not all have the same capabilities and resources for applying generative AI in their work. The countries that lag behind on these capabilities may face more challenges and risks when trying to incorporate generative AI into their public services.

*****

Alam, Zaidul. “Harnessing the Power of Generative AI in a World of Open Government Data.” LinkedIn Blog, June 15, 2023. https://www.linkedin.com/pulse/harnessing-power-generative-ai-world-open-government-data-zaidul-alam.

In this LinkedIn article, the author discusses the opportunities to leverage Open Government Data (specifically, census data) for generative AI.
The author explains that Open Data and generative AI could be merged in several ways, including increasing interactions between citizens and governments, developing tools to engage with public institutions, and answering search queries about domain-specific data (e.g. health data).
The author provides an example of how census data and AI applications could be merged: “By leveraging data APIs from the ABS and other similar institutions globally, Census Chat GPT could generate real-time, data-driven insights about demographic trends, socio-economic disparities, housing statistics, and more.”
There are many possible intersections between generative AI and Open Government Data: “In the future, we could see more sophisticated applications of generative AI to government open data. For example, AI could be used to generate comprehensive city planning scenarios based on urban development data, or to create personalized learning plans for students based on education data. Governments could also develop AI ‘public assistants’ that can explain complex legislation, provide real-time updates on policy changes, or guide citizens through bureaucratic procedures. Such AI assistants could democratize access to public information, reduce administrative burdens, and enhance civic engagement.”

Boom, Cedric, and Michael Reusens. Changing Data Sources in the Age of Machine Learning for Official Statistics, 2023. https://doi.org/10.48550/arXiv.2306.04338.

This paper gives an overview of the main risks, liabilities and uncertainties associated with changing data sources in the context of machine learning for official statistics.
The use of machine learning for official statistics has the potential to provide more timely, accurate and comprehensive insight into a wide range of topics. By leveraging the vast amounts of data that individuals and entities generate on a daily basis, statistical agencies can gain a more nuanced understanding of trends and patterns. There are risks, however: chiefly concerns about data quality, privacy and security, and the need for technical skills and infrastructure in government.
Machine learning can be used to complement or even replace official statistics, and its ability to nowcast and forecast is an extremely valuable addition. By incorporating machine learning into official statistical production, one can benefit from the strengths of both approaches and make more informed decisions based on the most current and accurate data.
National statistics agencies are used to having their data completely under their control, so relying on external data sources to power innovative statistics can become problematic. Establishing proper protocols and procedures for external data management is therefore necessary.

Goasduff, Laurence. “Is Synthetic Data the Future of AI? Q&A with Alexander Linden.” Gartner Interview, November 20, 2022. https://www.gartner.com/en/newsroom/press-releases/2022-06-22-is-synthetic-data-the-future-of-ai.

In this interview, Alexander Linden, a VP Analyst at Gartner, discusses the potential of synthetic data as a complement to open data to drive the development of more accurate AI models.
He says, “Synthetic data can increase the accuracy of machine learning models. Real-world data is happenstance and does not contain all permutations of conditions or events possible in the real world. Synthetic data can counter this by generating data at the edges, or for conditions not yet seen.”
While synthetic data may offer a way to address biases and issues of quality in open data, Linden emphasizes the importance of transparency and explainability when it comes to the models creating and using synthetic data.

Loukis, Euripidis, Stuti Saxena, Nina Rizun, Maria Ioanna Maratsi, Mohsan Ali, and Charalampos Alexopoulos. “ChatGPT Application Vis-a-Vis Open Government Data (OGD): Capabilities, Public Values, Issues and a Research Agenda.” In Electronic Government, edited by Ida Lindgren, Csaba Csáki, Evangelos Kalampokis, Marijn Janssen, Gabriela Viale Pereira, Shefali Virkar, Efthimios Tambouris, and Anneke Zuiderwijk, 95–110. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland, 2023. https://doi.org/10.1007/978-3-031-41138-0_7.

In this paper, the authors analyze the opportunities and risks of using ChatGPT for Open Government Data from an Affordances Theory perspective. Through 12 expert interviews, the authors develop a series of research agendas to accelerate the understanding of how ChatGPT could impact Open Government Data.
ChatGPT could have a positive impact on Open Government Data in several ways. These include: increasing user engagement, awareness, and accessibility, helping develop new Open Government strategies, offering new ways for data discovery through government chatbots, and balancing the supply and demand of Open Government Data. Additionally, from a public values perspective, ChatGPT could provide service-related and professionalism-related values for Open Government Data. It could help design user-driven Open Government Data initiatives and lower barriers to accessing Open Government Data amongst different stakeholders (e.g. citizens) — increasing transparency around government initiatives.
The authors point to several issues that ChatGPT could pose for Open Government Data such as unknowingly collecting personal information from registered users and inaccurate summaries of Open Government Data from ChatGPT. Also, the lack of governance frameworks could lead to larger problems such as inadequate results, cybersecurity issues, and algorithmic biases caused by language differences across countries.
In order to harness the value of ChatGPT for Open Government Data, additional research is needed on how ChatGPT could be used to increase use and value generation from Open Government Data, how ChatGPT could benefit the publishing of Open Government Data, and the potential issues of ChatGPT for Open Government Data.

Sallier, Kenza, and Kate Burnett-Isaacs. “Unlocking the Power of Data Synthesis with the Starter Guide on Synthetic Data for Official Statistics.” Statistics Canada, March 10, 2023. https://www.statcan.gc.ca/en/data-science/network/synthetic-data.

In this piece, Statistics Canada provides a set of guidelines for National Statistics Offices to use when leveraging synthetic data.
Using UNECE’s report as the guide, the piece explains that using synthetic data can help increase access to statistical data in a privacy-compliant manner. It can help with publishing data, testing analysis, education, and testing software. Additionally, it explains the three main ways in which synthetic data can be generated: sequential modeling, simulated data, and deep learning methods.
The article provides an overview of the pros and cons of using Generative Adversarial Networks to create synthetic data for National Statistics Offices.
Pros: “GANs have been used in NSOs to generate continuous, discrete and textual datasets, while ensuring that the underlying distribution and patterns of the original data are preserved. Furthermore, recent research has been focused on the generation of free-text data which can be convenient in situations where models need to be developed to classify text data.”
Cons: “GANs can be seen as too complex to understand, explain or implement where there is only a minimal knowledge of neural networks. There is often a criticism associated with neural networks as lacking in transparency. The method is time consuming and has a high demand for computational resources. GANs may suffer from mode collapse, and lack of diversity, although newer variations of the algorithm seem to remedy these issues. Modelling discrete data can be difficult for GAN models.”
In sum, the article explains that synthetic data can provide benefits for National Statistics Offices and Generative Adversarial Networks can help produce the synthetic data. However, those undertaking the initiative need to balance the many associated risks.
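To make the idea of sequential modeling more concrete, here is a minimal sketch in Python. It is not the method described in the Statistics Canada guide; it only illustrates the general pattern of drawing one variable from its observed distribution and then drawing the next variable conditional on the first. The records, the variable names (`region`, `income_band`), and the helper functions are all hypothetical and invented for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_sequential(records):
    """Count how often each region occurs, and each income band within a region."""
    marginal = Counter(region for region, _ in records)
    conditional = defaultdict(Counter)
    for region, band in records:
        conditional[region][band] += 1
    return marginal, conditional


def draw(counter, rng):
    """Draw one value with probability proportional to its observed count."""
    values = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(values, weights=weights, k=1)[0]


def synthesize(marginal, conditional, n, seed=0):
    """Generate n synthetic records: region first, then income band given region."""
    rng = random.Random(seed)
    synthetic_records = []
    for _ in range(n):
        region = draw(marginal, rng)
        band = draw(conditional[region], rng)
        synthetic_records.append((region, band))
    return synthetic_records


# Toy, made-up records (hypothetical; not real statistical data).
original = [
    ("north", "low"), ("north", "low"), ("north", "high"),
    ("south", "high"), ("south", "high"), ("south", "mid"),
]

marginal, conditional = fit_sequential(original)
synthetic = synthesize(marginal, conditional, n=1000)
```

A sketch like this preserves the observed relationship between the two variables, but it offers no privacy guarantee on its own. Assessing disclosure risk and utility, which the guide treats as part of responsible use, would still be needed before releasing any synthetic dataset.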

Ziesche, Soenke. “Open Data for AI: What Now?” UNESCO Digital Library, 2023. https://unesdoc.unesco.org/ark:/48223/pf0000385841.

This report summarizes UNESCO’s guidelines for Member States in opening up data for AI systems.
The report explains that there is an enormous amount of data already being collected through automated systems (building off of the COVID-19 pandemic). This data is often too large to be manually processed. AI and data science methods have the capacity to discover new information from these large data sources.
The report is divided into three phases: the preparation phase, the opening data phase, and the follow-up phase for data re-use: “The preparation phase guides Member States in preparing for opening their data, and includes the following suggested steps: drafting an open data policy, gathering and collecting high quality data, developing open data capacities and making the data AI-ready. The opening of the data phase consists of the following steps: selecting datasets to be opened, opening the datasets legally, opening the datasets technically, and creating an open-data-driven culture. The follow-up for reuse and sustainability phase consists of the following steps: supporting citizen engagement, supporting international engagement, supporting beneficial AI engagement, and maintaining high quality data.”

*****

We plan to explore these topics further over the coming months. Professionals interested in collaborating with The GovLab on these topics can contact Stefaan Verhulst, Co-Founder & Chief Research and Development Officer at sverhulst@thegovlab.org.

Stay up-to-date on the latest developments of this work by signing up for the Data Stewards Network Newsletter.

Learn more about the Open Data Policy Lab by visiting our website: https://opendatapolicylab.org/.

