Background via all-free-download | Screenshot from Data Science Roundup

Analyzing the Sources of Tristan Handy’s Data Science Roundup

What can we learn from a meta-analysis of years of content?

Paul Singman

Published in Whispering Data

4 min read · Aug 24, 2020

For those who are only interested in the full results, click here.

One of my favorite data-related newsletters is Tristan Handy’s Data Science Roundup. I’ve been a loyal subscriber since early 2016, receiving a weekly, or sometimes bi-weekly, summary of “the internet’s most useful data science articles.”

While it is convenient to receive a concentrated collection of articles I can peruse from bed on Sunday mornings, after so many years a part of me started to wonder: how is he finding these articles?

Is he using the same 10–15 sites over and over? What does the distribution of sources look like?

Well, I’ve compiled the data and have the answers. There are 186 total issues of the Data Science Roundup, each with blurbs on approximately 5 articles, giving us a population of around 1000 articles to analyze.

Although tedious, aggregating the data manually would take more or less the same amount of time as an automated solution, with more accurate data to boot. Regardless, I decided to take an automated approach to show off some Python-hacking skills in the process.*

*As a side note, this would make for a great Mechanical Turk task.

Screenshot of my inbox filtered on the DSR

Collecting the Data

I’ve yet to figure out the dark art that is keeping your inbox tidy, so instead of scraping from the web, I was able to use my disgusting email inbox as the data source.

From there the steps were:

1. Connect to my inbox via imaplib and a Gmail app password
2. Filter messages to issues of the Data Science Roundup by subject and sender
3. Grab the HTML contents of each email and parse link tags with BeautifulSoup
4. Save the parsed link sources and analyze!

This code should not be emulated, but for anyone curious what it looks like to grab data from Gmail, here’s what my hacky code looked like:
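The snippet embedded at this point is not reproduced in this text. As a stand-in, here is a minimal sketch of the four steps listed above; it is not the original code. It assumes Gmail's IMAP host `imap.gmail.com`, and it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra installs. The function names and the `sender` and `subject` arguments are hypothetical.

```python
import email
import imaplib
from email import policy
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_from_raw_email(raw_bytes):
    """Return all link targets found in the HTML part(s) of one raw message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    collector = LinkCollector()
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            collector.feed(part.get_content())
    return collector.links


def fetch_roundup_links(user, app_password, sender, subject):
    """Log in over IMAP, search by sender and subject, and parse each hit.

    Assumption: the newsletter lives in INBOX and the account allows
    IMAP access with an app password.
    """
    links = []
    with imaplib.IMAP4_SSL("imap.gmail.com") as conn:
        conn.login(user, app_password)
        conn.select("INBOX", readonly=True)  # read-only: do not mark as seen
        _, data = conn.search(None, "FROM", f'"{sender}"', "SUBJECT", f'"{subject}"')
        for num in data[0].split():
            _, fetched = conn.fetch(num, "(RFC822)")
            links.extend(links_from_raw_email(fetched[0][1]))
    return links
```

Calling `fetch_roundup_links` with your own address, app password, and the newsletter's sender and subject would return a flat list of URLs. Newsletter links are sometimes wrapped in tracking redirects, so the raw hrefs may need resolving before they can be grouped by site.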

Results

For a full list of the results, click here.

A few seconds later, when all was said and done, I was left with 900 links from the 129 Data Science Roundup issues sent since Oct 1, 2017*.

*The newsletter has had a consistent format since this date.

Below is a table with all sites that were referenced more than 5 times.

+----------------------------+----------+
|            Site            | No. Refs |
+----------------------------+----------+
| towardsdatascience.com     |      107 |
| medium.com                 |       86 |
| eng.uber.com               |       19 |
| flowingdata.com            |       14 |
| locallyoptimistic.com      |       12 |
| kdnuggets.com              |        9 |
| arxiv.org                  |        9 |
| ai.googleblog.com          |        9 |
| oreilly.com                |        8 |
| blog.openai.com            |        7 |
| technologyreview.com       |        7 |
| eng.lyft.com               |        7 |
| blog.getdbt.com            |        7 |
| blog.fishtownanalytics.com |        6 |
| labs.spotify.com           |        6 |
+----------------------------+----------+
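A table like the one above can be produced by tallying the hostname of each collected link. The following is a minimal sketch under that assumption, not the original analysis code; `site_of`, `count_sites`, and the `min_refs` parameter are hypothetical names, and stripping a leading `www.` is an illustrative choice.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def count_sites(links, min_refs=1):
    """Count links per site and keep sites with at least min_refs references."""
    counts = Counter(site_of(link) for link in links)
    counts.pop("", None)  # drop links with no hostname, such as mailto: links
    return [(site, n) for site, n in counts.most_common() if n >= min_refs]
```

With the full list of links, `count_sites(links, min_refs=6)` would keep only the sites referenced more than 5 times.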

Far and away the most frequent source is the site you are most likely reading this analysis on: medium.com. The Towards Data Science publication specifically (which yours truly was recently featured in with two extremely popular articles) leads the pack with 107 links, and Medium more generally comes in second with 86.

Together that is 193 of the 900 links, or roughly 21% of everything featured in the Data Science Roundup, and is an argument to suck it up and drop the $50 for an annual Medium membership.

After that come popular blogs most people have heard of, including the engineering blogs of prominent companies like Google, Uber, Lyft, and Spotify. (Netflix hosts its tech blog on Medium and would be included in the medium.com category.)

The Long Tail

The point of this article is not to knock Tristan for repeatedly using the same sources in his newsletter; rather the opposite. There is a long tail of 150 or so sources that have been included in issues of DSR, which demonstrates the effort he puts in to remain integrated with the sprawling data community online.

Since we’re using data that goes back a number of years and personal blogs are a fickle domain, sadly, a number of the smaller sites are no longer maintained.

Here are three lesser-known sites that are active with quality content:

Eugene Yan (eugeneyan.com)
I work at the intersection of consumer data & tech to build machine learning systems to help customers, and write about…

Normcore Tech (vicki.substack.com)
A newsletter about making tech less sexy, more boring, and anything adjacent to tech that the mainstream media isn't…

Matthew Rocklin (matthewrocklin.com)
I build and maintain open source software for Python's data science ecosystem. This is part of a broader effort to…

Of course, it is impossible to manually keep tabs on hundreds of sites. Using the full list of sites I’ve compiled to build a centralized feed in Feedly or another RSS reader would be a great way to efficiently stay on top of a good chunk of the internet’s data-related content.
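One way to load many sites into a reader at once is an OPML file, an XML format that many RSS readers can import. The sketch below builds one from (name, feed URL) pairs. It is only an illustration: `build_opml` is a hypothetical helper, and the feed URL in the usage note is a placeholder, since each site's actual feed address would have to be looked up.

```python
import xml.etree.ElementTree as ET


def build_opml(feeds, title="Data Science Roundup sources"):
    """Build an OPML 2.0 document from (name, feed_url) pairs."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, feed_url in feeds:
        # One <outline> element per feed; xmlUrl holds the feed address.
        ET.SubElement(body, "outline", type="rss", text=name, xmlUrl=feed_url)
    return ET.tostring(opml, encoding="unicode")
```

For example, `build_opml([("Example Blog", "https://example.com/feed.xml")])` returns a string that can be saved as a `.opml` file and imported into a reader.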

If that’s something you’ve been meaning to do, but haven’t gotten around to yet… no more excuses: see the full list here!