Medium.com, More Stats Please

How I analyzed our Medium publication stats in a Python notebook

Here in the lab we get excited about data; well, maybe not everybody does, but I do. Ever since we launched our publication here on medium.com I’ve been itching to analyze our story stats. What captures the reader’s attention, what doesn’t?

Our medium.com publication stats in a static report.

If you are a data nerd, static reports are like near-beer—not satisfying at all. It’s time for some amateur home-brewing to allow for custom analysis.

I’ve created a simple Python notebook that derives aggregated data from the post stats and post metadata, to allow for a more satisfying analysis. You can explore the notebook by following the GitHub repository link at the end of this post.

Water, barley, and hops

To provide advanced stats for internal use, I’ve manually created two data sets for our publication.

The first data set includes a publication-to-date snapshot of the views, reads, fans (formerly known as recommends and currently also known as “claps”), and read ratio, as reported on the post stats page:

Sample stats for posts published in May 2017.
Without access to more fine-grained stats (e.g., by week or month), the analysis results are only approximate: calculations are limited to averages, and trend analysis is not possible.

The second data set consists of post metadata, such as author, tags, and post URL:

Metadata describes each post.

Each row in both data sets is uniquely identified by the post title, and we’ve therefore got ourselves a join condition! By combining these two data sets, I can now analyze post popularity for each author and domain (as defined by tag associations) and can compare interest in “related” stories.

In the notebook, both (CSV) data sets are downloaded using the requests library, imported as pandas DataFrames, and merged using a left-outer join on title.

Merged stats and metadata.
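A minimal sketch of the load-and-merge step follows. The column names and sample rows are made up for illustration, and I'm assuming the stats table is the left side of the join. To keep the example self-contained it reads from in-memory CSV text; in the notebook the text would come from a download, e.g. `io.StringIO(requests.get(url).text)`.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two downloaded CSV files.
STATS_CSV = """title,views,reads,fans
Post A,1000,400,40
Post B,250,150,30
"""

METADATA_CSV = """title,author,tags,published
Post A,Alice,"Serverless,Alexa",2017-05-02
Post C,Bob,Python,2017-05-20
"""


def load_and_merge(stats_source, metadata_source):
    """Read both CSV sources and left-join the metadata onto the stats by title."""
    stats = pd.read_csv(stats_source)
    metadata = pd.read_csv(metadata_source)
    # how="left" keeps every stats row; posts without metadata get NaN columns.
    return stats.merge(metadata, how="left", on="title")


merged = load_and_merge(io.StringIO(STATS_CSV), io.StringIO(METADATA_CSV))
print(merged)
```

With a left join, a post that has stats but no metadata row (Post B here) survives with empty metadata columns, while a metadata-only post (Post C) is dropped.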

A scatter plot of the raw data provides a rough idea of how well-received posts are as of today, ignoring for now that a story that was published six months ago had more time to accumulate views/reads/fans than a story that was published last week.

Reads vs. views and fans vs. reads.

Having the raw input covered, it’s now time for some lightweight data engineering to derive additional measures that we can indulge in.

Brewing

Using the raw data, I can now calculate the publication-to-date base statistics for each post.

  • First, calculate how many months a story has been online. The result is used to normalize (although in a simplistic way) the view, read, and fan averages for each post.
  • Next, calculate average views, reads, and fans per month for each story. Without easy access to monthly stats (instead of to-date totals), these rudimentary measures are used as one input to estimate the overall “value” of a story.
  • Lastly, calculate the reads/views and fans/reads ratios. These ratios can be used to determine how successful a story is in capturing a visitor’s attention and the story’s perceived usefulness. Each ratio is a numeric value between 0 and 1, with 1 being the best.
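The three steps above can be sketched as follows. This is an illustration, not the notebook's code: the column names are hypothetical, a month is approximated as 30 days, and every post is treated as at least one month old so that very recent stories don't get inflated averages.

```python
import pandas as pd


def add_derived_stats(posts, as_of):
    """Return a copy of posts with months online, monthly averages, and ratios."""
    out = posts.copy()
    days_online = (pd.Timestamp(as_of) - pd.to_datetime(out["published"])).dt.days
    # Assumption: 30-day months, with a floor of one month.
    out["months_online"] = (days_online / 30).clip(lower=1)
    for col in ("views", "reads", "fans"):
        out[f"{col}_per_month"] = out[col] / out["months_online"]
    out["read_ratio"] = out["reads"] / out["views"]
    out["fan_ratio"] = out["fans"] / out["reads"]
    return out


# Made-up sample data.
posts = pd.DataFrame({
    "title": ["Post A", "Post B"],
    "published": ["2017-05-02", "2017-10-03"],
    "views": [1000, 250],
    "reads": [400, 150],
    "fans": [40, 30],
})

result = add_derived_stats(posts, as_of="2017-10-29")
print(result[["title", "months_online", "views_per_month", "read_ratio", "fan_ratio"]])
```

Posts with zero views or reads would produce undefined ratios; a real notebook would need to decide how to treat them.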

The result looks as follows:

Final stats analysis DataFrame.

First tasting

Taking a sneak peek at the new data, we can easily find out which posts readers liked the most (in relative terms):

The fans/reads ratio nicely complements the reads/views ratio. Because all ratios are “normalized,” it’s more of an apples-to-apples comparison.
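Such a ranking could be produced with a simple sort. The snippet below is a sketch on made-up numbers, with hypothetical column names:

```python
import pandas as pd

# Made-up sample data.
posts = pd.DataFrame({
    "title": ["Post A", "Post B", "Post C"],
    "reads": [400, 150, 90],
    "fans": [40, 30, 27],
})

# Share of readers who became fans, then rank from highest to lowest.
posts["fan_ratio"] = posts["fans"] / posts["reads"]
most_liked = posts.sort_values("fan_ratio", ascending=False).reset_index(drop=True)
print(most_liked[["title", "fan_ratio"]])
```

Note that the post with the fewest reads comes out on top here, which is exactly what "in relative terms" means.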

To make this even more exciting, one could also incorporate referrer information. I didn’t, but I might in the future should Medium expose this data via an API.

There are other questions we can ask. For instance, in order to learn if a post with a click-bait-ish title generates more views than another post with the same/similar tag(s), we also need to generate tag-based measures.

Second tasting

Tags help visitors find posts that might be of interest to them. Each post can be tagged with up to five values (e.g., Serverless, Alexa, Amazon Echo). Therefore I aggregated and recalculated the post-based stats for each tag, which yielded one more data set:

Aggregated tag statistics.
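One way to build such a tag-level table is to give each post one row per tag, sum the counts per tag, and recompute the ratios from the sums. The sketch below assumes the tags are stored as a comma-separated string in a hypothetical `tags` column; the notebook may store and aggregate them differently.

```python
import pandas as pd


def aggregate_by_tag(posts):
    """Sum views, reads, and fans per tag and recompute the ratios."""
    # Split the tag string into a list, then emit one row per (post, tag) pair.
    tagged = posts.assign(tag=posts["tags"].str.split(",")).explode("tag")
    tagged["tag"] = tagged["tag"].str.strip()
    totals = tagged.groupby("tag").agg(
        posts=("title", "count"),
        views=("views", "sum"),
        reads=("reads", "sum"),
        fans=("fans", "sum"),
    )
    # Ratios are recomputed from the totals rather than averaged per post.
    totals["read_ratio"] = totals["reads"] / totals["views"]
    totals["fan_ratio"] = totals["fans"] / totals["reads"]
    return totals


# Made-up sample data.
posts = pd.DataFrame({
    "title": ["Post A", "Post B"],
    "tags": ["Serverless, Alexa", "Serverless"],
    "views": [1000, 250],
    "reads": [400, 150],
    "fans": [40, 30],
})

tag_stats = aggregate_by_tag(posts)
print(tag_stats)
```

Because a post counts once for each of its tags, the per-tag totals overlap and should not be added up across tags.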

And there you have it. Author, post, and tag stats waiting to be analyzed.

Curious what the numbers look like for your publication? Follow the instructions in this GitHub repository to find out.

Cheers!