Analyzing Content on Reddit

Using Reddit’s Search API, Python, PRAW, Pandas & R

UPDATE DECEMBER 2016: the PRAW package has been updated & its functions / syntax have changed. Find the latest docs here, or install the previous version via something like `pip uninstall praw && pip install praw==3.5.0`. Relevant updated code snippets have been added below.

Media companies are aflutter over Facebook’s increasingly dominant (& somewhat capricious) gatekeeping of digital audiences.

(A fact reiterated early & often during this weekend’s excellent Barcamp News Innovation ‘unconference’, hosted by Temple University’s School of Media & Communication.)

The concerns are well-founded. Facebook wields, by multiples, the largest active user-base known to the social web.


But Reddit, middle of the pack above, is not an apples-to-apples comparison with Facebook, Twitter or Instagram. As a platform devoted explicitly to sharing, rating & discussing news links & other information, it’s in a class of its own — with obvious relevance to digital publishers.

Given Reddit’s unique platform among social media’s heavyweights, we might predict unique user engagement patterns across our content. This post conveys an initial foray into that analysis, with code examples & some surprising findings.

Accessing Reddit Data

To pull an initial dataset for evaluation, I accessed Reddit’s search API via the convenient PRAW (Python Reddit API Wrapper) package. PRAW’s thorough documentation covers the basics of creating & authorizing a crawler bot, so I won’t waste time & space here. The result will resemble this:
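As a minimal sketch, assuming PRAW 4 or later and a registered Reddit "script" app, the setup might look like the following. The environment variable names and the user agent string are hypothetical placeholders, not values from PRAW or from this analysis.

```python
import os


def reddit_credentials(env=os.environ):
    """Collect app credentials from environment variables.

    The variable names below are hypothetical; use whatever your setup provides.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Reddit asks for a descriptive, unique user agent string.
        "user_agent": env.get(
            "REDDIT_USER_AGENT", "site-link-analysis by u/your_username"
        ),
    }


def make_reddit(env=os.environ):
    """Build a read-only praw.Reddit client using PRAW 4+ keyword arguments."""
    import praw  # imported lazily so the helper above works without PRAW installed

    return praw.Reddit(**reddit_credentials(env))
```

Calling `make_reddit()` with the variables set returns a read-only client, which is enough for search queries. PRAW 3.5 used a different authorization flow, so this does not apply to the pinned older version.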


Now to create a Reddit API call & store results:
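A hedged sketch of that step, assuming PRAW 4+ and a client named `reddit`. Here `example.com` stands in for the site's domain, the helper names `search_site` & `collect_posts` are hypothetical, and applying the 25-comment / 50-upvote thresholds in code is an assumption based on how the dataset is described in this post.

```python
from types import SimpleNamespace


def search_site(reddit, domain, limit=1000):
    """Query all of Reddit for recent link posts matching site:<domain>."""
    return reddit.subreddit("all").search(
        f"site:{domain}", sort="new", syntax="lucene", limit=limit
    )


def collect_posts(submissions, domain, min_comments=25, min_score=50):
    """Keep posts that link to `domain` and clear either engagement threshold."""
    rows = []
    for post in submissions:
        if domain not in post.url:
            continue  # false positive returned by the site: query
        if post.num_comments < min_comments and post.score < min_score:
            continue  # below both thresholds
        rows.append({
            "title": post.title,
            "subreddit": str(post.subreddit),
            "url": post.url,
            "num_comments": post.num_comments,
            "score": post.score,
        })
    return rows


# Offline check with made-up stand-ins for PRAW submissions (not real data).
fake = [
    SimpleNamespace(title="a", subreddit="news", url="http://www.example.com/story",
                    num_comments=30, score=10),
    SimpleNamespace(title="b", subreddit="news", url="http://www.notit.org/example",
                    num_comments=300, score=900),
    SimpleNamespace(title="c", subreddit="news", url="http://www.example.com/x",
                    num_comments=3, score=4),
    SimpleNamespace(title="d", subreddit="pics", url="http://example.com/y",
                    num_comments=1, score=75),
]
demo_rows = collect_posts(fake, "example.com")
```

With a live client, `collect_posts(search_site(reddit, "example.com"), "example.com")` would run the real query. Keeping the filter separate from the API call makes it easy to test without network access.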

Some notes: Reddit caps search API calls at 1000 results, so we stick with that for this initial exploratory analysis. The lucene search syntax with site:___ was the optimal query string in my experience, but it does return some false positives, which I filter out with the exact string in the final for-loop above.

Ok, we now have a dataset of the most recent Reddit posts with at least 25 comments and/or 50 upvotes (as of 10/1/16, when I did this analysis). So what does it look like?

Visualizing results

I used Pandas to write the relevant data to a dataframe, then exported it as a CSV file to read into R for easy analysis & plotting.
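Roughly, and only as a sketch: assuming each post has already been reduced to a dict, the export could look like this. The column names, the sample rows and the file name `reddit_posts.csv` are all made up for illustration.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["title", "subreddit", "url", "num_comments", "score"]


def posts_to_csv(rows, path):
    """Write a list of post dicts to CSV and return the dataframe."""
    df = pd.DataFrame(rows, columns=COLUMNS)
    df.to_csv(path, index=False)  # index=False keeps row numbers out of the file
    return df


# Made-up rows, for illustration only (not the post's data).
rows = [
    {"title": "Story one", "subreddit": "news",
     "url": "http://example.com/1", "num_comments": 30, "score": 120},
    {"title": "Story two", "subreddit": "sports",
     "url": "http://example.com/2", "num_comments": 45, "score": 40},
]
csv_path = os.path.join(tempfile.mkdtemp(), "reddit_posts.csv")
df = posts_to_csv(rows, csv_path)
```

On the R side, `read.csv()` pointed at the same file loads it back as a data frame.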

Not the prettiest, most concise or intelligent way to do this, I’m sure. Whatev.

This can all be done in Python/Pandas, of course, but I still find R more fluent & visually pleasing for quick exploratory analysis. (& Yes, I’m using base R for plotting… shoutout to #TeamBaseR — Jeff & Nathan. But truly I’m language & library agnostic, & this extensive post on plotting in Python has me interested to try some new [to me] non-Matplotlib/Pandas packages.)

The results (viewed in a Google Sheet for prettiness) looked like this:

But that doesn’t tell us much about the shape or distribution of the data, so let’s visualize.

In R:

Gives us:

giffing red rectangle not included

Here we can see that, while we have a decent number of outliers, most posts have between 0 and 1000 comments & upvotes, so let’s zoom in on those. (We’ll inspect some outliers in a moment.)

Now with a Y-axis range of 0–1000, we see a similar trend emerge, with some sparse outliers, but most of the action between 0–200 comments / upvotes.

By now the more quantish among us will recognize a familiar self-similar pattern in the distribution of our data: each zoom level shows roughly the same shape. Let’s take another perspective to make it more plain.

Viewed from this angle, it’s clear that our data mostly accrues into the first few segments of our histograms, quickly dropping off into the steep decay & long right tail familiar from power law distributions. The sparsely distributed outliers speckling the upper reaches of our initial scatterplots are now stretched thin along the rightmost length of our X-axis. We are indeed living in a (roughly) lognormal world.
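The same concentration can be checked numerically by binning the counts. This is a small illustrative sketch, not the code behind the histograms: the `bin_counts` helper is hypothetical and the numbers are made up.

```python
from collections import Counter


def bin_counts(values, width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Made-up comment counts, for illustration only (not the post's data).
comments = [26, 31, 40, 48, 55, 63, 90, 120, 150, 210, 480, 1900, 5200]
hist = bin_counts(comments, 500)

# Share of all posts that land in the first bin (0-499 comments).
first_bin_share = hist[0] / len(comments)
```

In a heavy-tailed sample like this one, the first bin holds most of the posts while a handful of outliers sit in distant bins of their own.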

Anyway, why does this matter, & what types of content do we find at each end of our distribution?

Picking on Data Points

One last plot to give us a better depiction of where each individual Reddit post with a link falls on a single plane of comments & upvotes.

Each point in the graph represents a post of an article on Reddit, plotted by # of Comments x # of Upvotes. There’s a reasonably intuitive positive correlation between upvotes & comments, with a simple regression line predicting about 1 comment per 6 upvotes.
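For readers who want to reproduce that kind of fit, here is a minimal ordinary least squares sketch in plain Python. It is not the R code used for the plot; the `fit_line` helper is hypothetical, and the points are made up to lie exactly on a line with a slope of 1/6.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up points on comments = upvotes / 6 + 10 (not the post's data).
upvotes = [60, 120, 300, 600, 1200]
comments = [20, 30, 60, 110, 210]
slope, intercept = fit_line(upvotes, comments)

# Upvotes needed per additional comment, according to the fitted line.
upvotes_per_comment = 1 / slope
```

A slope near 0.167 corresponds to the "1 comment per 6 upvotes" reading. With heavy outliers in the real data, a fit like this is only a rough summary.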

But more to the point, What ARE These Outliers?

I went and looked: I was a bit surprised.

Top performing content recently posted to Reddit (from top-right to left):

  1. 3 Soviet Workers Dived Into Chernobyl Pool May 16, 1986 (/r/TodayILearned)
  2. Cops: Immigrant hit robber with car in bank parking lot Sept 20, 2016 (/r/Offbeat)
  3. Newspaper gets heat for headline July 12, 2002 (TIL When a mental health facility in New Jersey caught fire, a newspaper ran the headline “Roasted Nuts”)
  4. Poll: Convention boosts Clinton to 11-point lead over Trump in Pa., Aug 5, 2016 (/r/Politics)

So half of the top-performing content on Reddit (in my admittedly limited dataset) is archival stories from 1–3 decades ago. The other half consists of a ~wEiRd nEwS~ story (amplified by the immigrant angle) & a hot-button political headline. All in all, these are exceptional or unusual stories, posted to large general-interest Subreddits. And they alone drove tens of thousands of views to our site.

This points to one major advantage we have as a Heritage media company (as our nonprofit owner’s new Executive Director Jim Friedlich likes to put it). Our rich (& not-so-rich) reporting archives give us a legacy of content with demonstrable continued appeal for contemporary audiences.

Still, outliers by definition are infrequent, freakish occurrences — hard to predict or replicate. So what’s going on with our more normally-performing content?

To answer, we have to zoom in again.

And here’s our zoomed-in view (notice the missing chunk < 25 comments x < 50 upvotes, per our criteria). So what sort of stories do we find here?

Here we find greater representation of our bread & butter core competencies: local public-interest journalism, breaking news & sports coverage. We also notice the Subreddits tend more toward regionally-specific, narrower interests. We finally notice the sheer number of posts in this range of engagement metrics.

True to long-tail dynamics, these posts might themselves each only drive hundreds or low-thousands of visitors to our stories. But in aggregate, they account for more overall exposure than a few (even very large) outperformers — & do so much more regularly & reliably.

To Be Continued…

So how does this measure up against engagement patterns on Facebook & Twitter?

Another post for another time. But suffice it to say that we’re probably not seeing thousands of comments on stories published before I could read.

(And no one called Reddit evil at #BCNI16)

Follow on twitter: @dnlmc