What is ‘public data’? And who should be allowed to collect and use it?

Jared Robert Keller
Published in Canvas
Jun 30, 2024

This piece was written by Sasha Moriniere, Claudine Tinsman and Jared Robert Keller.

Early in 2024, Meta dropped its lawsuit against web-scraping company Bright Data, following a significant legal setback. The District Court for Northern California ruled that Meta had failed to provide sufficient evidence proving Bright Data had scraped non-public data from its platforms. Bright Data, which advocates that “public data must remain free and accessible”, considered it a legal victory for the web-scraping community.

But what is ‘public data’? Why would a social media platform and technology company go to court over access to it? And how do different stakeholders such as policymakers, platform representatives, researchers and journalists conceive of public data and its fair collection and use?

Our team at the ODI has been investigating these questions in an effort to generate consensus on guidelines for fair and ethical collection and use of public data.

Throughout our work within the Global Data Infrastructure programme at the ODI, we’ve been investigating ways to enable access to social media data to support public-interest research. Social media platforms have become crucial spaces where individuals express themselves, share political opinions, mobilise on issues that are important to them, and consume news and information. To better understand these dynamics and the role that social media platforms play in shaping them, researchers must be able to study the platforms and the ways in which users interact with them. Within the programme we have therefore investigated how different countries across the globe are attempting to enable access to social media data and mapped the different ways that researchers can access data about social media platforms.

Throughout this work, a challenge frequently raised by researchers and policymakers was the lack of clarity around what constitutes ‘public data’ (generally understood as data that is published on news websites and social media platforms) and, by extension, what constitutes fair collection and use of that data. They felt that this lack of clarity was in some ways limiting access to this important data, or being used to limit it. This data is often sensitive, whether for personal or commercial reasons, but it also has the potential to enable public-interest research and support open-source investigations and journalism. The lack of clarity also potentially leaves researchers in a legal and ethical grey area when collecting or accessing this type of data. Questions about what constitutes ‘public data’ or ‘publicly available data’, and what constitutes fair collection and use, are therefore very pressing.

To help address some of these challenges, we worked to develop a proof-of-concept Delphi survey that could serve as the foundation for future consensus-generation exercises in this space.

In this Canvas piece we outline the research we conducted as part of this pilot study and our early-stage findings.

Though these are only preliminary findings, we believe they demonstrate the important and timely nature of this issue and we encourage others working in this space to join us in investigating it further.

Methods

Background research

It will come as no surprise that, like 99% of good research projects, ours began with desk research. We consulted academic and ‘grey literature’ to understand the range of definitions and perceptions of the term ‘public data’ in use by a variety of stakeholders, with special emphasis on how that term is understood in relation to the data and information found on social media platforms. We followed this with six informal interviews with experts in our network to provide additional context to the literature. We then conducted semi-structured interviews with experts recruited through convenience sampling. Participants who were unavailable for face-to-face interviews received questionnaires via email.

Survey development

Alongside this background research, we sought out research methods capable of not only gathering disparate viewpoints, but ultimately helping people with differing views begin to align and agree on next steps. Because the internet and social media platforms span the globe, conducting research in this area cuts across many geopolitical, organisational and disciplinary boundaries. We therefore felt the research required methods capable of making sense of this complexity and gathering diverse inputs.

The Delphi method is well suited to this type of research. It involves surveying a panel of experts and stakeholders within a particular area (usually two to three times) and asking them to describe the evidence or opinions that support their responses. Often the survey consists of statements about a particular topic, and participants are asked to respond to those statements using a Likert scale to indicate the extent to which they agree or disagree (eg ‘strongly agree’, ‘agree’, ‘disagree’, ‘strongly disagree’). The research team then summarises the responses and sends that summary back out to participants so they can see the average scores and anonymised summarised responses from their fellow participants. Where appropriate, participants can update their responses from the previous round based on the new information and viewpoints from the rest of the group. The process can help lead to consensus around how to respond to important issues, or alignment around a set of principles or guidelines within a community. It is therefore well suited to our work attempting to generate consensus around guidelines for the fair collection and use of public data.
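To make the between-round step more concrete, here is a minimal sketch in Python (using pandas) of how a research team might summarise one round of Likert responses before circulating the anonymised results back to the panel. The statements, data and column names are purely illustrative, not drawn from our survey.

```python
import pandas as pd

# Each row is one participant; each column is one survey statement.
# Likert scale: 1 = strongly disagree ... 5 = strongly agree.
# None/NaN marks statements a participant chose to skip.
responses = pd.DataFrame({
    "S1: Public data is any data accessible online": [5, 4, 4, 2, 5],
    "S2: Consent is needed to reuse public posts": [3, 5, 2, 4, None],
})

def summarise_round(df: pd.DataFrame) -> pd.DataFrame:
    """Build the anonymised per-statement summary sent back to participants."""
    return pd.DataFrame({
        "n_responses": df.count(),                     # skips excluded
        "median": df.median(),
        "iqr": df.quantile(0.75) - df.quantile(0.25),  # spread of views
        "pct_agree": ((df >= 4).sum() / df.count()) * 100,
    }).round(2)

print(summarise_round(responses))
```

In a real Delphi round, a table like this, along with anonymised free-text comments, would accompany the next survey round so participants can revisit their answers in light of the group’s views.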

Having chosen to utilise the Delphi method, we set out to develop our Delphi survey. We took inspiration from a study entitled ‘A Delphi study to build consensus on the definition and use of big data in obesity research’. Similar to our own aims, the authors sought to “establish agreed approaches for the use of big data in obesity-related research”. We adapted their approach to study public data in the context of research about social media platforms. This included adopting their structure, which was split into two parts: 1) definitions of public data; and 2) the use of public data. We also adapted some of their statements to fit our context, such as ‘It is unethical to use big data in obesity research when consent has not been obtained for this purpose’ and ‘The data governance requirements associated with using big data in obesity research are clear’. Adapting relevant aspects of this study not only ensured that our research was drawing on established methods and a tried-and-tested survey, it also potentially allows for comparison of perceptions of ‘big data’ and ‘public data’ down the line.

The final survey, which you can see here, consists of 74 statements in seven thematic areas across two main sections:

  1. Definition of Public Data: General
  2. Use of Public Data: Value, impact, purpose; Ethics; Data governance; Training and infrastructure; Reporting and transparency; and Quality and inference.

We designed the statements to gauge respondents’ levels of agreement and stimulate discussion on areas of alignment and disagreement.

To finalise the survey, we sent it in draft form to people within our network, including some of those we had spoken to during our round of expert interviews. We iterated the survey based on their feedback. Finally, we settled on Google Forms as the host software for the survey, preferring its combination of accessibility and back-end analytics.

Recruitment of participants

The initial idea for the project was ambitious: we aimed to generate consensus among the various stakeholders involved in this debate, including policymakers, regulators, researchers, platforms, tech companies and journalists. However, our desk research and expert interviews revealed significant divergences of opinion within the research community itself. Across different research communities (eg researchers in academia, industry, civil society, government and advocacy organisations, as well as journalists) there appeared to be numerous differing perceptions of what constitutes public data and fair collection and use of that data. In recognition of this, we decided to focus our pilot study on investigating agreement and disagreement across a range of research communities.

The two communities we ended up selecting for the pilot study were:

  1. The European Open Source Intelligence Organisations Observatory (ObSINT), a collective of seven organisations involved in fact-checking activities or open source investigations, focusing on online harms and disinformation. They use public data to investigate misinformation and disinformation on social media, verify information, and conduct analyses to produce accurate, ethical, and relevant information for the public.
  2. A web scraping community coordinated by Bright Data, a company providing a tool positioned as an ethical web scraping solution for public websites. Bright Data operates on a commercial model where users pay for the technology they utilise and the data they collect. Alongside this, Bright Data supports researchers in the non-profit and academic sectors by providing pro-bono access to their tools and data for social media research.

Pilot study

Having developed the survey and secured two research communities to take part, we launched the pilot study. Though the intention at the beginning of the project had been to conduct multiple rounds of a Delphi survey, time constraints meant that we were only able to gather one initial round of survey responses. As a result, we are leaving the proof-of-concept Delphi survey open for responses for the time being. The responses and feedback we receive will serve to influence the development and iteration of future surveys and we intend to conduct a more traditional, multi-round Delphi survey in the future.

Survey respondents were asked to evaluate each of the 74 statements using a Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). Respondents who felt unqualified to answer a given statement could simply skip it. Each section concludes with an optional long-form text box encouraging respondents to provide detailed explanations or evidence supporting their Likert scale responses, enriching the quantitative data with additional context.

Responses are currently being collected anonymously. Once the survey period has concluded, the research team will analyse the quantitative data from the Likert scale responses and perform a thematic analysis of the qualitative data from the long-form text boxes. The analysis will document areas of consensus and disagreement and propose potential steps to address the concerns raised by respondents.
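As an illustration of what that quantitative analysis could look like, the sketch below (again in Python with pandas) classifies each statement as showing consensus agreement, consensus disagreement, or no consensus. The 70% threshold is a common rule of thumb in Delphi studies rather than our finalised criterion, and skipped responses are simply excluded.

```python
import pandas as pd

def classify_statement(scores: pd.Series, threshold: float = 70.0) -> str:
    """Classify one statement's Likert scores (1-5; NaN = skipped)."""
    scores = scores.dropna()              # respondents may skip statements
    if scores.empty:
        return "no responses"
    pct_agree = (scores >= 4).mean() * 100       # 4-5 counts as agreement
    pct_disagree = (scores <= 2).mean() * 100    # 1-2 counts as disagreement
    if pct_agree >= threshold:
        return "consensus: agreement"
    if pct_disagree >= threshold:
        return "consensus: disagreement"
    return "no consensus"

likert = pd.DataFrame({
    "S1": [5, 4, 4, 5, 4],        # strong agreement
    "S2": [1, 2, 4, 5, None],     # divided views
})
print(likert.apply(classify_statement))
```

Statements landing in the ‘no consensus’ bucket are exactly the ones worth probing in later survey rounds and in the thematic analysis of the free-text responses.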

Results from the interviews and questionnaire

The results from the pilot survey will continue to come in over the coming weeks and months, but in the meantime two major themes have emerged from the semi-structured interviews and the long-form written responses to our questionnaire.

Diverging definitions of public data and its fair collection and use

Our interviews confirmed that the definition of public data is complex and multifaceted, with significant variations in interpretation. Some participants understood public data to include any data accessible to anyone with an internet connection, shared with no expectation of privacy. For instance, one respondent described public data as data that is “accessible to anyone with a working internet connection”. Another described it as “any data that is open to the public, not behind any password protected walls”.

A different group of participants suggested that the concept of public data could be expanded to encompass data that, although having more restricted access, is collected and retained for the benefit of the public. In this sense, the definition of ‘public’ would seem to align more with the broader notion of ‘public good’.

Participants also commented that the varying interpretations of public data by different stakeholders were likely influenced by each group’s specific needs, incentives and ethical considerations. They felt that academic researchers, for instance, might have a more nuanced view of public data due to stringent ethical guidelines and a focus on advancing scientific understanding. In contrast, participants felt that industry representatives, focused on commercial opportunities, might view public data primarily as an asset to be monetised within the bounds of existing regulations. Finally, participants thought that law enforcement officials and journalists often operated with fewer boundaries, driven by resource constraints and the nature of their work, sometimes leading them to prioritise the ends over the means.

In our view, time limitations may also impact how different groups of researchers define and perceive public data. For instance, the timeline of research and publication within academic circles is generally longer than in fields like journalism and open-source investigations. That being the case, academic researchers may feel more able to adopt comparatively stricter ethical and legal requirements for the collection and use of public data. For instance, they are likely more able to engage in thorough ethical reviews, such as Institutional Review Board (IRB) processes, which are both time and resource intensive. On the other hand, researchers in the non-profit sector, open source intelligence practitioners or investigative journalists often operate under more constrained conditions. They tend to face more limitations in terms of funding and must react quickly to unfolding news events. Their time constraints may necessitate a more immediate and pragmatic approach to accessing and using public data, as they cannot afford extensive and expensive ethical reviews. These disparities underscore the challenges faced by different research communities in navigating the ethical landscape of public data collection, use and sharing. They also highlight the need for any guidelines for the collection and use of public data to be co-designed with input from diverse research groups.

The challenges caused by divergent definitions of public data for public-interest research

Participants felt that these differing perceptions of public data and fair use were making it difficult to access and use public data that could support important public-interest research.

Lawsuits such as those brought by X and Meta against civil society researchers and web scraping companies, who were accused of collecting data from their platforms illegally, were cited as creating caution among researchers with regard to collecting and using public data. Despite these two lawsuits being unsuccessful, respondents felt that the potential legal risks and uncertainties surrounding platforms’ terms of service were still discouraging the collection and use of public data. All of this leads to a situation where datasets that could be used for public-interest research are left siloed within private sector organisations.

Despite this, our expert interviews and discussions with experts during the tutorial we delivered at the 2024 ICWSM conference raised questions about the legitimacy of social media companies banning scraping on their platforms. From a legal perspective, such bans might not be justified or legitimate. Indeed, both X and Meta lost the cases they brought against researchers who had used scraping tools to collect data from their platforms. In the case of X, the accusation was dismissed because judges ruled that scraping for public-interest research is integral to public-interest speech. The dismissal of the Bright Data case potentially extends this further, as it involved a for-profit organisation scraping public data for commercial, not just public-interest, research purposes. This suggests that training and awareness raising might help empower public-interest researchers and show them that it is possible to collect and use data found on social platforms in legal and ethical ways.
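As a small taste of what such training could cover, here is a sketch in Python of two basic good-faith scraping practices: checking a site’s robots.txt before fetching anything, and rate-limiting requests. The target URL and user agent are placeholders, and this is illustrative rather than legal advice; terms of service and data protection law still apply.

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "PublicInterestResearchBot/0.1 (contact: research@example.org)"
BASE_URL = "https://example.org"  # placeholder, not a real target

# Parse the site's robots.txt once, up front.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE_URL}/robots.txt")
robots.read()

def polite_fetch(path: str, delay_seconds: float = 2.0):
    """Fetch a page only if robots.txt allows it, then pause briefly."""
    url = f"{BASE_URL}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        print(f"robots.txt disallows {url}; skipping")
        return None
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        body = response.read()
    time.sleep(delay_seconds)  # rate-limit so we don't burden the server
    return body

page = polite_fetch("/a-public-page")
```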

Discussion of next steps

In terms of how to respond to these challenges, every person we spoke to advocated for the development of clear, consensus-based ethical frameworks and best practices for the collection, use and sharing of public data. Respondents suggested that if this work progresses, it should be conducted in a way that ensures that diverse stakeholders (eg researchers, universities, policymakers, platform representatives, platform users, etc) are given the chance to feed into the co-definition of guidelines. This will help ensure that the guidelines are not only legal and ethical, but are perceived to be legitimate and actionable. This in turn would be key to driving broader uptake.

Participants also expressed that policy and regulatory measures, such as the Digital Services Act (DSA), offered a promising approach to clarifying and enforcing fair data use practices. This legislative work is accompanied by strong advocacy from civil society organisations like the Mozilla Foundation and AlgorithmWatch, which advocate directly to EU policymakers for more actionable guidelines on the definition and use of public data, and for more clarity from the DSA. This work is crucial for designing researcher access systems that comply with the DSA.

Finally, some respondents argued that technological solutions, like privacy-enhancing technologies (PETs), could help mitigate privacy, security, and legal concerns, enabling broader access to valuable datasets. They suggested that by employing these technologies, data holders could make more data accessible without compromising individual privacy or security. It is worth pointing out, however, that in our view the solutions to increasing access to data about social media platforms for public-interest research need not be overly technical. Companies could be encouraged to make good-faith efforts to safely and securely enable researchers to access public data in a wide range of different ways — some novel, like PETs, and some more traditional — all while taking the necessary steps to ensure that the data is shared/ accessed in safe and responsible ways. We have compiled many of these different approaches in a register of initiatives that enable access to social media data.
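To make the PET suggestion slightly more concrete, here is one deliberately simple example: adding calibrated Laplace noise to aggregate counts before release, the basic mechanism behind differential privacy. This is an illustrative fragment with assumed parameters; a real deployment would rely on an audited library and careful privacy budgeting.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count with Laplace noise (sensitivity 1).

    Smaller epsilon means stronger privacy but a noisier answer.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# eg 'how many accounts shared this URL?' released without the exact value
print(dp_count(1342, epsilon=0.5))
```

The point is not this particular mechanism but the general pattern: data holders can expose useful aggregate signals without handing over raw, individual-level data.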

Conclusion and recommendations

We believe that this pilot project has demonstrated the need for further work in this area and outlined a method and proof-of-concept survey capable of serving as the foundation for that work. Future work could seek to conduct a consensus-generation exercise with the aim of helping people and organisations begin to agree on what should and shouldn’t be considered ‘public data’ and establish early-stage guidelines for use. Where consensus cannot be generated, this work could document where views diverge and outline potential next steps to address concerns raised and differing viewpoints. The long-term goal of this work could be to generate consensus amongst a wide range of different stakeholders in different regions across the globe, eg different types of researchers, regulators, policymakers, digital platforms/ publishers, industry bodies and customers/ users. However, the initial phase of the project could decide to focus on generating consensus amongst researchers first, especially considering that our initial research suggests there are differing points of view amongst the wider research community.

We believe the findings and outputs of this type of project would be useful for:

  • Journalists, researchers inside/ outside academia and the open source intelligence community seeking clarity on how other researchers conceive of public data. Ultimately, by generating consensus, the research community may be able to come together to more effectively petition platforms and/ or policymakers for access to public data that meets their needs.
  • Policymakers seeking insights into the diverse conceptions of public data across different sectors and regions.
  • Digital platforms such as social media sites, crowdsource platforms, search engines and news sites interested in understanding the broader understandings of ‘public data’ in various communities and regions and whether their view of public data matches those of researchers, policymakers and users.

We also believe that the Delphi method has great potential for research within this space and deserves to be explored further. In particular, we believe the method is well suited for research at a global scale (such as the research within the Global Data Infrastructure programme) where input will need to be gathered from diverse viewpoints, regions, time zones and cultures. The Delphi method seems better suited to this task than more traditional methods for gathering viewpoints like workshops and roundtables because it can be conducted anonymously to limit potential bias and groupthink and is conducted asynchronously to gather input from across the globe. This helps ensure that all stakeholders can participate on equal footing and that the consensus produced is (hopefully) less influenced by existing power imbalances across the globe.

In order to continue gathering viewpoints and evidence, the proof-of-concept Delphi survey will remain open for the time being. Please feel free to submit a response and share the survey with your communities if interested. We will continue to monitor and analyse the results, which we intend to use to pursue further funding and work in this area.

If you would like to support or join us in this work, please get in touch by emailing us at research@theodi.org.
