
We must fix researcher access to data held by social media platforms

Sasha Moriniere · Published in Canvas · Oct 19, 2023


My name is Sasha Moriniere, and I'm a researcher for the Open Data Institute (ODI), working on issues related to global data infrastructure, power dynamics within data ecosystems, and their impact on data access and sharing.

In October 2022, Elon Musk took over X (formerly Twitter) after months of twists and turns. For researchers working to tackle problems like disinformation and online harms, X was an analytical goldmine: it yielded far more insight than comparable social media platforms, including the finding that false information is 70% more likely to be retweeted than true stories. But early in 2023, the new leadership progressively tightened access to its previously open API for researchers, ultimately implementing a payment model starting at $100 per month for 'low-level usage'. In April, it was Reddit's turn to restrict access, leaving researchers investigating important topics, such as the negative consequences of the Covid-19 pandemic on teacher resignation and mental health, in the dark.
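To make concrete what researchers are losing, here is a minimal sketch in Python of the kind of programmatic access the open API used to offer: querying X's v2 recent-search endpoint for posts matching a keyword. The endpoint and parameters are real, but the bearer token is a placeholder, and under the new pricing model even this level of usage now sits behind the paywall.

```python
# Minimal sketch of the kind of API access researchers relied on.
# Assumes a valid bearer token (placeholder below); under X's new
# pricing, even this 'low-level usage' starts at $100 per month.
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder credential

def search_recent_posts(query: str, max_results: int = 100) -> list[dict]:
    """Fetch recent posts matching `query` from X's v2 recent-search endpoint."""
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "query": query,              # e.g. a disinformation-related keyword
            "max_results": max_results,  # capped at 100 per request
            "tweet.fields": "created_at,public_metrics",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("data", [])

# Example: trace how widely a false claim is being amplified
for post in search_recent_posts('"5g causes covid" -is:retweet'):
    print(post["created_at"], post["public_metrics"]["retweet_count"])
```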

These recent instances of closing down access have been criticised by a wide range of stakeholders, including civil society research organisations, journalists, academics and decision-makers. The new policies also raise a number of important questions: How will they affect researchers in the short and long term? What useful datasets will researchers examining misinformation and online harms be left with? How can we collect and analyse the critical data circulating on these platforms, and make sense of the events taking place there, without legitimate access?

I believe there are two competing forces at play here, and two competing interpretations of how data is valued. On the one hand, social media platforms are locking themselves down from public-interest research; on the other, several major pieces of legislation specifically targeting those platforms' lack of transparency are being discussed: the Online Safety Bill (OSB) in the UK, which became law on 26 October 2023, and the Digital Services Act (DSA) in the EU.

So where are we heading? And why does researcher access to social media platforms matter?

Social media platforms are spaces where individuals publicly express their political views, share and comment on local events, discuss the news they consume, and engage, organise and mobilise in public and civic life. In the past, forums such as Speakers' Corner in Hyde Park were public spaces that weren't provided or maintained by private actors. Now, our political activities largely take place on privately held and maintained digital platforms. Used in this way, social media platforms have accidentally come to function as public data infrastructure, leading some experts to argue that they should be democratically 'governed' and funded.

The importance of being able to access these virtual spaces for research becomes clear when considering examples like the use of social media data to trace the spread of disinformation narratives preceding violent anti-democratic events offline, such as the riot at the US Capitol in January 2021. Another illustration is the data researchers have used to assemble photographic and video evidence suitable for prosecuting potential war criminals in international tribunals. Some have even suggested that social media network analysis can shed light on marginalised communities that might be falling through the cracks in terms of accessing public services. Losing access to this data is not an option. In my view, access is not too much to ask, considering that the alternative could mean substantial setbacks for society and policymaking, especially on political, social and democratic issues.

In this piece, I discuss the adverse consequences of keeping public-interest researchers from accessing social media platform data. I also discuss the potential for collaborations between researchers and social media companies to help improve the functioning of these platforms, address detrimental algorithmic drifts, and make meaningful contributions towards addressing broader societal concerns.

Private ownership of critical data and its impact on public-interest research

Over the last 10 years, the biggest social media platforms, such as X, Meta and YouTube, have built closed systems containing large amounts of data, ranging from public posts to private and sensitive information. What sets a new precedent, however, is the contrast between the private ownership of that data and its public influence on politics and society. As a result, these platforms are increasingly becoming key realms for public-interest researchers.

The dynamic between tech companies, governments, researchers and the public is characterised by an information imbalance, owing to multiple factors: legitimate concerns about user privacy, companies' lack of incentive to share data they consider valuable, and a reluctance to disclose such data to the wider public. Social media platforms have a history of sharing only small portions of the data they hold. While there is a growing precedent and imperative for them to undergo audits and external reviews of the content spreading on their platforms, the foundation for such actions is still lacking due to policy shortcomings. Without processes in place to make these companies more accountable, researchers lose critical data for research and analysis. The data is hoarded by platforms that know 'nearly everything about us, and we know next to nothing about them', as the scholar Nate Persily has put it.

Both X and Reddit used to be relatively generous in the access they granted to researchers trying to understand the extent of online harms and inform policy. The recent developments on both platforms illustrate why we shouldn't depend on 'the goodwill of a few businesses whose policies might change at the whim of a new owner.'

The unavailability of platform data hampers the investigation of larger societal challenges.

The data held by social media platforms could support a wide range of research with significant social benefits. According to the Institute for Strategic Dialogue (ISD), three types of social media data would be beneficial to the research community:

- User-generated data: public and private information about user behaviour and content, eg individual and aggregated data accessible through platform APIs, with varying restrictions.
- Platform curation data: eg data concerning the moderation and content-sorting processes carried out by both human and algorithmic systems on a platform.
- Platform decision-making data: information about internal decision-making, including decisions related to the introduction of new features or experiments conducted by platforms to test and evaluate the ranking algorithms of their recommender systems.
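Purely as an illustration of how these three categories differ in practice, here is a small Python sketch of how a researcher might document access requests against them. The class and field names are my own hypothetical invention, not an ISD or platform schema.

```python
# Illustrative sketch only: names below are hypothetical, not an ISD
# or platform schema. It models the three ISD data categories and the
# varying access restrictions attached to each.
from dataclasses import dataclass
from enum import Enum

class DataCategory(Enum):
    USER_GENERATED = "user-generated data"              # user behaviour and content
    PLATFORM_CURATION = "platform curation data"        # moderation and content sorting
    DECISION_MAKING = "platform decision-making data"   # internal decisions, experiments

@dataclass
class AccessRequest:
    """One researcher request for a slice of platform data."""
    category: DataCategory
    description: str
    restrictions: str  # eg aggregation, anonymisation, vetted researchers only

requests = [
    AccessRequest(DataCategory.USER_GENERATED,
                  "Aggregated engagement with flagged health misinformation",
                  "aggregated, via platform API"),
    AccessRequest(DataCategory.PLATFORM_CURATION,
                  "Moderation decisions on reported hate-speech posts",
                  "anonymised, vetted researchers only"),
    AccessRequest(DataCategory.DECISION_MAKING,
                  "A/B tests evaluating recommender ranking changes",
                  "summary reports only"),
]

for r in requests:
    print(f"[{r.category.value}] {r.description} ({r.restrictions})")
```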

Not being able to access this data hinders our understanding of narratives, trends and events taking place in our societies. For instance, it keeps researchers from shedding light on how hate speech and disinformation spread within social media groups and comments. This research is essential for informing policies on online harms, child safety, cyber harassment and bullying, and gender-based violence, and for showing how those behaviours spread and can erode trust in democratic processes. Social media data is also useful for investigating broader societal concerns: it can serve health research purposes such as 'health interventions, health campaigns, medical education, and disease outbreak surveillance'. Likewise, climate change misinformation spreading on social media undermines accurate public understanding and reduces support for effective mitigation efforts. Social media platforms could play an important role in signalling when social and environmental issues are emerging, but that potential value is squandered when researchers lack access to the data they hold.

Continuing to lock down data could lead some researchers to develop alternative data-collection or scraping methods, which may expose them to legal action if those methods contradict platforms' terms and conditions.

The inability to access platform data hinders research focused on the platforms’ characteristics.

In addition to addressing broader societal challenges, access to social media data would allow researchers to understand the structures of social media themselves: platform features, algorithms and so on. With robust processes and standards in place for data access and sharing, along with comprehensive information about what social media platforms produce and their effects, researchers could delve into a platform's structure and functionality and its role in exacerbating certain social issues. Over the remainder of this piece, I will elaborate on how a lack of access to data hinders this research and explore potential solutions for enabling greater access.

‘It’s not about the technology being the existential threat. It’s the technology’s ability to bring out the worst in society. And the worst in society being the existential threat.’ — Tristan Harris, The Social Dilemma

Getting access to that data will help us understand how social media platforms shape online behaviour: by amplifying harmful content, targeting individuals with political advertising, reinforcing structural biases in society, silencing minority groups through content-moderation decisions, and creating inequalities and polarisation within digital ecosystems. Going forward, it would benefit the research community if these access conditions were standardised and formalised, not dependent on the goodwill of platforms.

My concern is that without access to this data, it becomes difficult to gauge the effectiveness of social media platforms' efforts to combat disinformation and other online harms. It also hinders our ability to assess the true impact of policies and regulations on users. This information asymmetry may result in policymakers designing and adopting inadequate or even harmful regulations, all of which could erode society's trust in the internet in the long run. Enabling researchers to access platform data is therefore the first step towards well-informed public scrutiny and regulatory oversight of digital ecosystems: the content circulating online, its impact on public discourse and the development of policies to safeguard individuals and society as a whole.

How do we turn this around? How do we provide researchers with systematic and robust means of accessing the data held by social media platforms?

It is critical to advocate for the establishment of robust, systematised data-sharing models between researchers and large digital platforms: first, because these platforms don't disclose data voluntarily when they have no interest in doing so; and second, because we can't rely solely on their goodwill or altruism. Helpfully, a handful of initiatives in recent years have explored ways of enabling access to this data, and a number of mechanisms for incentivising and/or leveraging access are emerging.

One way of enabling access might be to mandate it through regulation. This is not inconceivable, considering that many other industries and private companies are required to disclose data for public accountability. The UK's Companies Act, for instance, whose predecessors go back to 1862 and which has evolved over time, requires companies to file public accounts containing financial and operational information. While the Act predates digital technology and the internet, it laid the foundation for today's principles of corporate transparency and accountability. If social media companies knew they would one day have to report on things like the spread of mis/disinformation or extremism on their platforms, we might hope this would change how they operate.

Another way of enabling access might be to build and empower an organisation that facilitates safe access to the data held by social media platforms, on mutually agreeable terms. Our work on data intermediaries and data institutions suggests that establishing an entity capable of facilitating trusted data sharing for research and regulatory purposes is both feasible and desirable. It also suggests there are ways to ensure that sharing data responsibly is valuable and beneficial for every stakeholder involved.

Social Science One (SSO), coordinated by a consortium of researchers from different universities, including Harvard's Institute for Quantitative Social Science, together with industry partners including Facebook, is an interesting attempt to get data out of social media platforms and create the relevant infrastructure. The initiative acted as an intermediary, enabling researchers worldwide to access Facebook data without pre-approval from the company, granting them the freedom to investigate the data without Facebook's intervention, even if their findings might reflect negatively on the platform. Researchers could apply for data access, and their requests were approved by Social Science One, not Facebook.

One output of this collaboration was the Facebook Privacy-Protected Full URLs Data Set, which includes 57 million URLs, more than 1.7 trillion rows and nearly 40 trillion cell values. Another was the research conducted on Facebook into the impact of social media on democracy and elections. Although the SSO collaboration had its flaws, it shows that such partnerships are possible, and that the way they function can be improved.

Ways to push this forward

An organisation that gathers data should not necessarily have exclusive control over decisions regarding its release. Instead, there should be a public discussion that takes democratic and societal factors into account, as suggested by organisations such as the Coalition for Independent Technology Research and the European Digital Media Observatory (EDMO). Both have stated the need to safeguard API access in order to preserve public-interest research in areas such as national security, public health, child safety, polarisation and online violence.

There are different ways to push this forward beyond regulation and the enforcement of legal statutes. Incentivising digital platforms to share data is one way to enable access for public-interest research.

As for-profit organisations, these companies rely on what is called the reputational effect. If more and more disinformation, extremism and hate speech were to circulate on a social media platform, boosted by that platform's algorithms, this could affect the platform's reputation and therefore users' trust. While this would not necessarily affect the company's business model, it could be a good incentive to convince social media platforms to disclose some of their data according to a precise collaboration model. Indeed, Social Science One emerged in 2018 as a response to the Cambridge Analytica debacle. Good public relations and reputation can motivate access to a certain extent. However, there is a risk that tech platforms use openness and transparency as selling points without committing to what that actually means in practice for researchers, as the NYU postdoctoral researcher Laura Edelson has noted. My concern with this approach is that once the motivation is gone or diminished, researchers trying to gain access lose all leverage and any ability to incentivise platforms. It is an interesting avenue for effecting a cultural shift, but it is neither sufficient nor robust enough to ensure public-interest researchers have sustainable, robust access to the data held by social media platforms.

We can't let platforms' goodwill be the arbiter here. Because of the way they operate, such companies have the authority and power to modify their terms of service at any moment, limit access, or implement content filters and censorship, thereby impeding the use of the data they hold in research and practical applications. Nevertheless, there are initial signs that research, advocacy, media pressure, and policy and legislative discussions are yielding results: recent decisions have prompted TikTok and YouTube to promise to share some of their data for certain research purposes. It remains to be seen whether these commitments are implemented in practice.

It is time we gathered and examined existing data-sharing mechanisms, advocated for the development of new approaches, and established what works and what doesn't. I'm sure there are plenty of avenues, not yet explored, that could demonstrate the importance of the collaborative effort both researchers and tech platforms need to make. It's not in our interest for social media platforms and social science research to drift further apart.

The major pieces of legislation mentioned above (the OSB and the DSA) are soon to be implemented. A question I keep asking myself is: will this be sufficient? Can these legislative texts alone serve as adequate mechanisms to enforce such collaborations? Will there be ways for platforms to circumvent the obligation to disclose data to researchers? In any case, we need to have that conversation now. Data-sharing collaborations could also readily apply to AI, where there is a pressing need to encourage AI companies to be forthcoming about the datasets used to train their models, in order to identify and rectify biases, harmful content and other negative aspects.

This is already a crowded area, with many individuals and organisations addressing the issue daily. But we need further research to identify and document examples of public-private data sharing, to convene public and private stakeholders to identify challenges and discuss potential solutions, and to communicate the importance of these challenges to the people and communities affected by them. These are just some of the things I am excited to begin work on as part of the ODI's new research project around Global Data Infrastructure, with a primary focus on enabling data access for public-interest research. The project will investigate how global data infrastructure can be used to break down silos and facilitate collaboration across boundaries, enabling research aimed at addressing pressing global challenges. If you're keen to engage with the project or continue this conversation about enabling access to social media data, please do reach out via Medium.


Sasha Moriniere researches how to make the internet a better place, with a main focus on online harms.