Burning the Hooks: What happens when we lose our subreddits, APIs and exchanges?

Dominique Carlon
Automated Decision-Making and Society
Jun 13, 2023

Co-authored with Dennis Leeftink

Rarely do we see fires this large ravaging our digital landscape. Over the past days, the decision of two major platforms, Reddit and Stack Exchange, to restrict third-party access to their websites’ content has ignited a blaze of protest, resulting in sitewide moderation strikes, with millions of subscribers greeted by their respective message boards going dark, in some cases indefinitely. Although moderator strikes are not new to these platforms, the widespread and impassioned response speaks to the heart of what Reddit and Stack Exchange represent: support for community-driven initiatives and user-generated content.

While the rationales differ between platforms, the common thread is clear: the rapid development of generative AI technologies is making platform and data governance increasingly difficult. The wide-scale training of Large Language Models (LLMs) on Reddit and Stack Exchange data is rekindling questions about attribution, fair use and user compensation, complicated by trained bots and models starting to inject themselves into the discussions on each site. It is here that admins, moderators, and users have reached an unfortunate but now all too familiar impasse: who gets to decide who can access our content?

The attempts by Reddit and Stack Exchange to restrict and seek compensation for data and API access have been met with resistance and outright rebellion from their user communities. The backlash is not a protest against user data being used to train LLMs per se, but rather stems from concern about platform overreach affecting the developers, moderators and power users who create, use, and maintain the third-party applications that make the websites better, richer, and more creative. Moderators (user community leaders) have coordinated protests in response, and in a strategic move to regain a foothold over their data, users are reaching for the only thing left in their arsenal: abandoning or shredding their digital content on the way out. When digital self-immolation seems the only feasible option, it is timely to explore how platforms got themselves into this mess, why users are going to such extremes and, importantly, whether their tactics will have any bearing on the course of each platform.

The platform pyres

The blackout of Reddit and Stack Exchange may come as a surprise to those observing from outside. For moderators and users within, however, the coordinated actions arose from a series of unresolved, simmering tensions. What triggered the subreddits and exchanges to go dark? What do APIs have to do with the current fallout? And what has set Reddit and Stack Exchange alight?

At first glance, Reddit and Stack Exchange seem to share only a passing likeness. The first is an ad-funded message board with disparate subs and nested comment chains; the second is a meticulous listing of burning questions (and, if you’re lucky, some answers) banking on team products and talent referrals. Their similarity comes from the way each platform fosters a ‘collective self’ (or rather collective selves): a large body of users acting ‘as one’ towards shared goals, ideals or interests without necessarily sharing close personal ties (Brewer & Gardner, 1996). Taking away this collective power is a sure way for platform owners to push through new policies, guidelines or other site-wide changes, at least temporarily. Setting aflame the API ‘hooks’ that users plug into heightens the reaction. However, as we are witnessing, the strength of these platforms lies in the collective response of their users. After all, the collective can hit them where it hurts: the advertising bottom line.

Reddit: burning the front page of the internet

Browsing Reddit on the morning of June 12, 2023 resembles walking past empty shelves at a shop. There is a hint as to what should be there, but much of the content remains unavailable and out of reach. Reddit’s shelves are largely empty because the pillars on which the platform is built have gone dark. In the days prior, Reddit was replete with discussions about which subreddits were ‘going dark’, for how long and why. So what does ‘going dark’ mean, why is it happening and what are the consequences of the ‘blackout’ for Reddit?

Pillars built on subs and mods: Reddit is organised and built upon subreddit communities, or message board forums centred around topics both niche and broad, from gaming and science to peculiar interests, products and pop culture icons. Since 2008, Reddit users have been able to create their own subreddits, and the content, rules and moderation of these communities have relied upon the labour and efforts of volunteer moderators and user contributors. The community-centric experience of Reddit has meant that the platform’s functionality depends on moderators, who use various apps and bots to complement the manual processes of improving usability and moderating content to meet the expectations of each sub-community.

This central role of unpaid moderators has also meant that subreddit communities can, in effect, switch off access. Moderators can make subreddits private and inaccessible, meaning that the public can no longer access the content, and advertisers no longer have an audience to reach. In effect, the way Reddit was built means that each subreddit community and its moderators can decide to go dark. If they choose, moderators can coordinate to black out or burn down the existing platform from within.

Given their central role on Reddit, it is no small feat that by the morning of June 12, more than 28,000 moderators and over 7,200 subreddits had committed to a blackout (according to Reddark, a website that is live-tracking the movement). By the morning of the second day, the number of participating subreddits had grown to 8,300, with a combined subscriber count of over 2.8 billion. The participating subreddits range from some of Reddit’s most niche communities, like r/TreeAbuse, to its largest, r/funny, with just shy of 50 million subscribers, alongside some of Reddit’s historically most influential subreddits: r/science, r/music and r/gaming, each with over 30 million subscribers.

Why did Reddit moderators coordinate a blackout? The immediate motivation stems from Reddit’s announcement on May 31 that the platform would start charging for access to its Application Programming Interface (API): in practice, a set of web URL endpoints from which requests for structured data can be made (i.e. ‘hooking into’ the platform). However, the reasons that inform this course of action extend further back into Reddit’s history. API access to Reddit was previously free, allowing third-party developers to create their own applications and features. Although Reddit has its own app and website, moderators and Reddit users alike have enjoyed the convenience of, and often fully relied upon, third-party apps for improved functionality. The r/blind community, for instance, relies on external apps for accessibility, and its members have played a prominent role in negotiating with Reddit to exempt certain accessibility apps from fees. Similarly, in their announcement in support of the blackout, the moderators of r/funny explain that the capacity of users to select effective tools and apps to access Reddit is absolutely necessary to ensure that ‘creativity and in-depth discourse’ thrive while ‘spammers, bad actors.. recycled content [and] bot driven activity’ are removed.
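The idea of an API as URL endpoints that return structured data can be made concrete with a minimal sketch of what a third-party app does: build the URL for an endpoint, then read fields out of the JSON that comes back. The host `api.example.com`, the path layout and the response shape used here are hypothetical, chosen purely for illustration; they are not Reddit’s actual API. The sample response is hard-coded so the sketch runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical host and path layout, for illustration only (not Reddit's real API).
BASE_URL = "https://api.example.com"


def build_endpoint(community: str, listing: str = "new", limit: int = 25) -> str:
    """Build the URL a third-party client would request for a community's posts."""
    query = urlencode({"limit": limit})
    return f"{BASE_URL}/r/{community}/{listing}.json?{query}"


def extract_titles(response_body: str) -> list:
    """Pull post titles out of a JSON response.

    Assumed (hypothetical) response shape: {"posts": [{"title": "..."}, ...]}
    """
    payload = json.loads(response_body)
    return [post["title"] for post in payload.get("posts", [])]


# A hard-coded stand-in for what a server might return.
SAMPLE_RESPONSE = json.dumps(
    {"posts": [{"title": "Weekly discussion thread"}, {"title": "Accessibility tips"}]}
)

url = build_endpoint("blind", limit=2)
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

A real client would send an HTTP request to the URL rather than read a canned string, and every such request is what a paid API would meter. That is why a change in pricing lands directly on third-party apps, which are built from many calls of this kind.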

Although the initial announcement regarding API access was framed as targeting large companies that train LLMs, it became apparent that existing third-party apps would be the ones to feel the impact. Some are more sceptical of the LLM narrative being pushed, pointing to how Reddit wants to absorb the ad potential of third-party apps ahead of its expected IPO (Reuters, 2023). At the core of Redditors’ heated response is the sense of fighting for the rights of the developers. As a user posted in r/ModCoord, third-party app developers ‘are generally Reddit users just like you and me, making the app out of their houses’. Whether this reflects the reality of the situation for influential apps like Apollo appears irrelevant. What is clear is that Redditors have sided with the devs, who are considered Reddit’s kin and kindred, and not with the Reddit admins.

Stack Exchange: a drying well

Meanwhile, a series of similar events unfolded on Stack Exchange, a network of user-moderated ‘knowledge exchanges’ where people can ask, answer and vote on questions and solutions. Equally lauded and derided for connecting expert and niche knowledge domains (often SWM-dominated), it is no secret that the past decade has seen more and more questions go unanswered (Srba & Bielikova, 2016). Whether caused by the influx of ‘low effort questions’, disappearing expert incentives or other platforms filling the niche (e.g. Google’s ‘related questions’), the fact is that the once slow drip has turned into a massive drain since the release of ChatGPT late last year. By some estimates, there has been a staggering 80% decrease in question-answering activity since the start of January, as seen in the figure below.

Figure 1. Monthly Stack Exchange QA activity. Credit: Starball (2023).

Like others at the time, Stack Exchange responded with an outright ban on ChatGPT-generated posts. But as commenters were quick to point out, enforcing such a ban is near impossible because GPT-generated posts are so difficult to tag automatically. Just six months later, such concerns have become a harsh reality for the platform’s top brass, paid community managers, and spare-time moderators and users: one moderator claims to have flagged upwards of 10,000 GPT-generated posts since inception. While official figures state a far lower number (‘just’ over 330–700 posts per week, which now translates to ~10% of weekly answers), the Stack Exchange admins have deemed the issue serious enough to drain the platform’s most valuable community resource outright: the quarterly data dump.

The data dump has been distributed continuously via the Internet Archive since 2009. Jeff Atwood, one of the platform’s original founders, deemed user contributions so valuable that users would be given full access to the platform’s collective wisdom. Over the years, users have come to regard the dump as the platform’s collateral in case of future mismanagement. Surprisingly then, last Friday the platform’s upper echelons divulged that, in “working on a strategy to protect data from being misused by companies building LLMs”, they have put a stop to the quarterly refreshes until adequate guardrails against misuse have been found.

The announcement came at a time when moderators were already up in arms over a previous decision by the site’s management that bans on AI-generated content would no longer be enforced. The explanation offered was that the “potential for false-positives” would simply be too high, requiring extensive moderator scrutiny. After a spree of downvotes, moderators from various exchanges organised a strike preceding the data dump announcement, listing demands ranging from retracting the sanctioned use of ChatGPT to reinstating the data dumps and guaranteeing continued access to the site-wide APIs and Data Explorer. Mods have since stopped handling flagged posts, anti-spam bots, and review queues of low-effort posts.

As with Reddit, most friction seems to stem from drifting platform values and a lack of communication. Users wonder why changes to the data dump were not announced back in March when the decision was made, so that discussions and feedback on the matter could percolate along the site’s official ‘meta’ channels. Mods have also been issued a ‘private policy’ on the matter that cannot be openly discussed, compounding their frustration. Stack Exchange’s paid staff are further fuelling the flames by downplaying the percentage of moderators involved in the strike as well as the actual number of ChatGPT users, currently measured by the number of edits a user makes before posting, a measure easily imitated by certain Chrome extensions that can automatically generate Stack Exchange answers.

To shreds, we say

A level-headed response to the chaos taking place on Reddit and Stack Exchange might be to quell the flames, but there is too much at stake for this to be feasible. The API hooks and endpoints that moderators, users and researchers are tapped into are starting to smoulder. As a critic on Stack Exchange posted: “in order to prevent companies building LLMs from our data, which they already have and have had for years, the plan is to keep all of us from getting ours?” This raises the question of why historic changes are so often marked by the frenzied burning of our (now digital) books.

Raiding the collective shelf

It is troubling to imagine bookburning being commonplace, but if you look closely, on the web it is simply daily practice. Whether the target is bits, scrolls or Gutenberg Bibles, bookburning rears its head whenever authors and ideas are deemed unwanted, often instigated by moderators or the bots and tools they use. As stated by Ovenden (2020), “the significance of archival material is recognised not only by those who wish to protect knowledge but also by those who wish to destroy it”. Nowadays, the ritual of scrubbing unwanted content from our collective shelves is mainly symbolic, as modern storage and information circulation make it difficult to wipe published works from the shared memory (Fishburn, 2008: xiv; Hill, 2013).

Despite frequent acts of top-down erasure, the sheer size of web communities increases the chance that someone, somewhere is making a copy for posterity: the web never forgets (Lasica, 1998; Colker, 2001). In reality, however, it sometimes does: hyperlinks break, sites are walled off, and search engines favour the new. The gargantuan task set out by the Internet Archive to preserve all of web history goes to show that massive amounts of digital content are burned each day, whether due to practical or cost considerations, crashed servers or website discontinuations.

Sometimes, our collective shelves are raided more nefariously. Examples include Reddit’s recent closure of the ‘i.reddit.com’ domain that allowed users to access subreddits under certain access protections, and a recent A/B test in which users were forcibly logged out of the main site and herded towards the app. Stack Exchange, too, is explicitly prohibiting moderators from convening on the platform’s channels to self-organise the ongoing strike. But such acts are hardly putting a dent in their resolve. While each platform provides an environment for users to convene, users are hardly bound to a single one. They have options: Discord, Twitch, Tildes, Metafilter and Codidact are taking up central roles.

What is putting a dent in the collective resolve are the unreasonable timeframes and costs associated with the proposed API changes, at least in the case of Reddit. Forcing third-party developers to scramble to adapt their apps to the multi-million dollar costs of continued operation makes it clear that Reddit is “cutting off the water supply” for third-party uses that affect the ad bottom line. While Stack Exchange has not explicitly announced any changes to its (relatively open) API, if we have learned anything from the past decade’s platform playbook, it is that once a faucet is turned off, it is probably off for good (one user likens the missing data dump to ‘softening the blow’ for incoming API changes).

One thing is certain: LLM companies would like to see their models hooked up to popular APIs to tune them on the latest language data (NYT). But when their usage runs into the terabytes, there will be little left for the average Reddit and Stack Exchange mod or user. The incoming API restrictions, then, are not merely technical. There is serious money to be made by pivoting to taxing LLM companies for their access. But turning the user base into collective test subjects is not a contract that is being willingly signed. Quite the opposite: an incipient shreddit movement aims to devalue the data available for exploitation, with Stack Exchange users threatening the same. By turning to data shredders, users are cutting off their nose to spite the collective face.

From backburning to blackout

While moderators with a flair for the dramatic advocated for ‘switching off’ Reddit, it was Reddit users who turned up the heat. In the lead-up to the blackout, Redditors were actively engaged in formulating a response, coordinating action, upvoting content about the blackout, and signing the open letter. Discussion threads contained questions prompting responses from moderators, users were warned not to harass moderators, and notes were shared about which subreddits and moderators were cooperating in the protest and which were not. For instance, in the r/ModCoord subreddit, the decision of the r/politics moderators not to engage was scrutinised, and it was clear that a judgement of collective morality was taking place. Long-term Redditors were expressing solidarity with developers and moderators, and frustration towards the admins for not taking the needs of the people, particularly accessibility needs, seriously.

However, it quickly became apparent that for many Redditors, the proposed blackout was not sufficient. For some, the 48-hour time period was too short. One Redditor within r/gaming described it as the ‘thoughts and prayers to protest’, and many called for moderators to follow r/music in committing to an indefinite blackout until Reddit amends the API fees. Within these calls, there was a sense that Redditors needed to unite and stand strong to show, as another user in r/gaming put it, ‘that the site is built upon us, the people, not them’.

Some Redditors took the ‘us v the admin’ attitude further, stating that blackouts would not achieve the goal and that it was time to leave Reddit. Posters like the one below began appearing across subreddits like r/RedditAlternatives, and people were stating their intentions of engaging in a mass exodus. However, while there are examples to demonstrate the wide range of sentiments found across the platform, the majority of discussions collected around a shared theme: we will keep fighting to keep the existing Reddit alive.

Figure 2. Indefinitely Leave Reddit poster.

Over on Stack Exchange, some long-time users also described the current situation as untenable, threatening to leave the platform for good. Whereas Reddit users may have been ‘backburning’ to rejuvenate their environments, a different strategy emerged on Stack Exchange: incoming questions and answers would simply not be greenlit, flagged or reviewed by their respective moderator teams, at least until their demands are met. In other words, much of the platform has been put on the backburner.

Whatever the strategy, the outcome appears the same: the communities on both sites are increasingly blocking access to their content. As one Stack Exchange commenter remarks, “moderation use is a slow burn, and the attrition of striking will not be seen for weeks or months”. Indeed, the ‘data fuel’ that APIs provide may be cut off entirely if a critical mass of users keeps shredding their content, ‘burning the hooks’ third parties have been relying on for years. Yet who benefits from this blackout?

To the robot come the spoils

The image of scorching the earth to save data from rogue bots, while fighting the good fight on behalf of struggling devs along the way, does not seem too far-fetched for the cultural imaginaries of either Reddit or Stack Exchange. As the proposed blackout enters its final hours, the question remains as to what will come of the Reddit and Stack Exchange protests. There has already been some success: the number of blacked-out subreddits reached over 8,000, involving the coordinated efforts of over 28,600 unique moderators, while the demand letter for Stack Exchange has been signed by 1,245 users.

As far as Reddit is concerned, there has been no signal of backtracking on the fee structure, and there is no clear signal from Stack Exchange about what will happen to future data dumps. Furthermore, when data remains accessible to companies in cold storage, one starts to wonder how effective current shredding strategies might be. Ceddit, Pushshift and Removeddit all provide tools to recover previous versions of comments, just as the Internet Archive still hosts prior versions of the Stack Exchange data dump. Who, then, might the current impasse be benefitting? Spoiler: not the human end-user.

Who’s eating our lunch?

In machine learning, there has long been a saying that ‘there is no such thing as a free lunch’, referring to how, for some computational problems, there is no shortcut to optimisation. More broadly, the companies maintaining large machine learning models may be learning the hard way that when you keep taking everyone’s plate, your lunch may start to fight back. Operating under the guise of fair use, LLM companies have been accumulating massive datasets on the order of trillions of words, arguing that training their models and generating text with them produce protected derivative works.

Without rehashing this discussion in full, LLM companies are benefitting from the knowledge and creative work of online users: their models are trained on datasets such as Common Crawl or RedPajama that include thousands of websites, Reddit and Stack Exchange among them. The very existence of these models depends on web data being freely accessible in the first place. While we can only guess at the actual percentage of Reddit and Stack Exchange posts used in training models such as ChatGPT (try asking ChatGPT or Bard whether they have been trained on either of those datasets), the fact is that these datasets contain thousands of contributions from disparate sites and users.

The point is not whether this is fair, but rather the enormous knowledge asymmetries it entails. Whereas small third-party developers and social media researchers must fight over data scraps, some of our CS colleagues have been sucking up the entire web. To rub salt in the wound, the chat interfaces themselves draw enormous amounts of traffic, generating another set of valuable (question) data that has yet to be made available to social media researchers. The current blackouts, then, may provide a silver lining, as LLM builders will be sure to take notice when a large share of data disappears overnight. Whether this will lead to fairer compensation schemes and equal access to research data remains to be seen.

The final hookup

As researchers, we are stuck between two walls. On one side, we have come to depend on large social media data to inform us about digital phenomena. On the other, a growing body of evidence points to the dangers of prolonged social media use. We have been mainlining big data for years, and it might be difficult to fight our addiction. But quitting cold turkey might be a good thing: not only does it kick our social media habits, it also “encourages researchers to reduce their dependence on mainstream platforms and explore new sources and ways to collect online records that are closer to the digital fieldwork” (Rogers & Venturini, 2019).

But kicking our bad habits is hard. Indeed, in the past users have quickly flocked back to their respective platforms after ‘quitting for real’. Remember, what we have been describing is not Reddit’s first blackout: blackouts are in many ways part of Reddit’s cultural legacy, with varying degrees of impact. In 2012, for instance, Reddit underwent a blackout in protest against the Stop Online Piracy Act. While there have been many blackouts in response to internal and external factors (Matias, 2016), the 2015 ‘AMAgeddon’ stands out as one of the most renowned. Like the current blackout, it largely arose from miscommunication between admins and moderators, following the firing of Victoria Taylor (a liaison for moderators and for Ask Me Anything (AMA) sessions).

In the days following the 2015 blackout, Reddit admins issued an apology, and Ellen Pao, Reddit’s interim CEO at the time, resigned a few days later. Coincidentally, Pao was replaced by the original CEO Steve Huffman, or u/spez, who is at the centre of the current blackout controversy. While the 2015 blackout came about following a controversial AMA session, Steve Huffman’s recent AMA about API access is bringing about similar sentiments. While the AMAgeddon action had significant implications for Reddit, the question remains whether Redditors today can stay away from the platform. And just as an abusive relationship can be hard to shake, we might be at an inflection point in the site’s history: is this a final hookup after a prolonged stint with no one in particular, or a temporary hiccup leading to a more balanced relationship?

Rising from the ashes

In the end, what will happen when we lose our subs, APIs and exchanges? Is a new digital dark age upon us? Or will the current blackouts bring about platform-wide changes? While we do not know how the outcome will play out, it is clear Reddit will respond to the situation. We think it unlikely that Reddit will backtrack on its fee structures, but given that Reddit is dependent on the hearts and minds of its users, some level of accommodation is likely. Redditors themselves remain confident in the power of the user base, and the platform is centred upon their contributions. At the same time, technical precedents are being set. If rising API costs and the walling off of user data become the standard, online research will become even harder. Web 2.0 seems to be on its way out, with an era of more closely guarded systems taking its place. We are set to lose much of our sense of community in the process.

Reddit and Stack Exchange users have ignited the flame. While the reaction may seem troubling, there is value in the heated user-led upheaval. The fact that both communities can effectively form an uprising from within is testimony to their community-driven foundations. Conflict and protest, more than apathy, can be a sign of a healthy, flourishing community. As noted by Jon Ericson (2023): “Truth is, the only groups that don’t have conflict are those that are too young or insignificant. Since conflict happens when people care about something (otherwise they do not bother to push disagreements), conflict is a sign of maturity and significance. It’s the symptom and result of people working together on a common goal.”

From the fires of conflict, then, new platform practices and respect for users’ contributions might spring. What Reddit and Stack Exchange users have achieved is a reminder that platforms need to communicate effectively with moderators and users to ensure they do not bring down the platform pillars in their drive for protection and profit. If not, they stand to lose what made them successful in the first place: the communities upon which they are built.

About the authors

Dominique Carlon and Dennis Leeftink are PhD candidates at the Digital Media Research Centre (DMRC) and the Australian Research Council Centre of Excellence for Automated Decision Making and Society (ADM+S), Queensland University of Technology.
