
Clustering Medium Stories The Hard Way:
I read an excellent post recently on how tags are suggested for every story you write on Medium: keywords are extracted from your post, and tags found in other stories with similar keywords are proposed.
Cold Start Problem
Similar to Netflix’s recommendation engine, you need a lot of people using the product, providing tags for their stories (or ratings for movies), to make accurate recommendations. When you don’t have the users, or the diversity in content just yet, you can’t recommend tags, unless you really want to…
Choosing the red pill is not for everyone though, as it is the moment when scalpels turn into hacksaws, and chai tea turns into battery acid.
Goin’ Old School:
Collecting Stories
Medium doesn’t have a reading API, so you can’t make a request to pull a thousand stories and their metadata at once. You can, however, scrape the text, the author name, and the tags directly from the HTML source yourself.
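As a rough illustration, a per-story scrape might look something like this with the rvest package. The URL and CSS selectors below are placeholders, since Medium’s markup changes often; you’d need to inspect the page source to get the real ones.

library(rvest)

# hypothetical URL and selectors, for illustration only
story  <- read_html("https://medium.com/some-story-url")
text   <- story %>% html_nodes("p") %>% html_text() %>% paste(collapse = " ")
author <- story %>% html_node("a[rel='author']") %>% html_text()
tags   <- story %>% html_nodes("a[href*='/tag/']") %>% html_text()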
I selected all stories that the Medium Staff had recommended. I knew there would be a lot of posts and it would be a good way of seeing the type of stories most enjoyed by the creators of the platform.
Infinite scrolling was stalling my progress, however: I could only see 25 stories at first, and 10 more would be loaded each time I scrolled to the bottom. Since the URL remained the same, rather than changing to something with a ‘page=2’ parameter, gaining access to a reasonable set of stories wasn’t as straightforward as I had hoped.
# send 'page down' 10,000 times so the infinite scroll loads every story
lapply(1:10000, function(x){ webElem$sendKeysToElement(list(key = "page_down")) })
Undeterred, and in keeping with the ‘hard way’ motif, I loaded up the RSelenium package, which drives a web browser from R, navigated to https://medium.com/@MediumStaff/has-recommended, and ran a loop that sent the ‘page down’ key to the browser 10,000 times. Fifteen minutes later, presto! I had a web page with 1300 or so stories that the Medium Staff had recommended; all that was left was to parse out the URL for each story, click on every link, and scrape.
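For the curious, here is a minimal sketch of the setup that one-liner assumes. The Selenium server address, port, browser choice, and the body element as the keystroke target are my assumptions, not specifics from the run above.

library(RSelenium)
library(rvest)

# assumes a Selenium server is already running locally on port 4444
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://medium.com/@MediumStaff/has-recommended")

# the element that will receive the 'page down' keystrokes
webElem <- remDr$findElement(using = "css selector", value = "body")
lapply(1:10000, function(x){ webElem$sendKeysToElement(list(key = "page_down")) })

# once everything has loaded, pull the story URLs out of the page source
page  <- read_html(remDr$getPageSource()[[1]])
links <- page %>% html_nodes("a") %>% html_attr("href")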
Chump stuff.
Visualizing The Data

Each story usually has 3 tags, the first being more of a broad topic, with each subsequent tag getting incrementally more specific.
The top 20 tags are all broad (Music, Humor, Tech, Education, Politics, etc.) and are present in about half of the stories analyzed.
It is easy to see which tags are part of a larger topic: for example, Technology is a part of Tech, Hip Hop is a part of Music, Parenting is a part of Family, etc.
However, there are topics that may not be part of a larger topic but are instead additions to other topics: Writing and Art, for example, or perhaps Culture, Women, LGBTQ, and Media. Modelling these topics together can only be done via context, i.e. from the writing itself; this process in NLP is referred to as Topic Modelling.
LDA, or Latent Dirichlet Allocation, is an algorithm that models topics. Put simply, given a parameter for how many topics you are looking for, the algorithm represents each document as a distribution of topic probabilities, and each document can be assigned its highest-probability topic. LDA is what I’m going to use to model the 1289 stories.
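A minimal sketch of what that looks like in R, using the tm and topicmodels packages; the preprocessing steps and the story_text vector are my assumptions, since the exact pipeline isn’t shown here.

library(tm)
library(topicmodels)

# `story_text` is a hypothetical character vector holding the scraped stories
corpus <- VCorpus(VectorSource(story_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stemDocument)   # stems words to their roots

dtm <- DocumentTermMatrix(corpus)

# fit a 20-topic model, then take each story's most probable topic
lda_20 <- LDA(dtm, k = 20, control = list(seed = 1234))
story_topic <- topics(lda_20)   # highest-probability topic per story
terms(lda_20, 10)               # ten most likely (stemmed) terms per topic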
Side note

Nice to know: the Medium Staff doesn’t have much of a bias or skew toward any single author; everyone seems to get a fair shake.
Topic Modelling

Above is each topic and its share of all stories, along with its associated keywords. Although the words are stemmed to their roots, it is not difficult to label some topics using them (Topic 4 = books, Topic 8 = Tech, Topic 9 = Business, etc.).
The x-axis represents the proportion of all the stories. Previously the Music tag covered ~6% of stories; here it is split between Topic 15 (~2%) and Topic 16 (~4%).
The most common topic seems to be just a bag of words commonly used by writers, but as you go down, the same themes emerge: Tech, Business, Life, Education, etc.
I chose 20 topics, as I was trying to model the 20 broad topics identified from the original tag distribution.

For testing, I pulled the tags most associated with each topic. My model seems to have done a pretty good job, as similar tags are grouped together.
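One way to compute this, assuming a hypothetical story_tags data frame with one row per (story, tag) pair: attach each story’s most probable topic and count tag frequencies. Swapping the tag column for an author column gives the per-author view used further down.

library(dplyr)

# `story_tags` and `story_id` are hypothetical; `story_topic` comes from the sketch above
story_tags$topic <- story_topic[story_tags$story_id]

# the tags most associated with a topic are simply its most frequent ones
tags_by_topic <- story_tags %>%
  count(topic, tag, sort = TRUE) %>%
  group_by(topic) %>%
  slice_head(n = 5)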
Some benefits of topic modelling are already becoming clear:
Observe the difference between Topics 15 and 16: ostensibly both are about Music, yet one is about Hip Hop while the other is more of a mixture of Art and Music (“kurt” (Kurt Cobain), “photography”, “film”, etc.). Same story with Topics 7 and 8: both are about Tech, though one is about Products and the other is about UX.
There are some that don’t make much sense at first glance: in Topic 19, Guns, War, and Kurdistan seem to go together, but not really with Same Sex Marriage or Star Wars.

This time, instead of tags, I pulled the authors most associated with each topic.
Jeb Bush and San Francisco Magazine are seen in the Education Reform topic, Talib Kweli talks about Hip Hop, the Baltimore Sun talks about the Baltimore Riots, Dr. Ian O’Neil talks about space and aliens, and Leslie Lou talks about feminism, so it all seems to make sense.
Topic 19 is becoming a little clearer, as the Star Wars reference is used as commentary on Middle Eastern conflict, along with more straightforward stories about war, flags, and patriotism*.
*This story has no tags and would have been impossible to classify without NLP techniques.
Now We Know The Model Works, Let’s Look for More than 20 Clusters

The more clusters selected, the more specific the topics become; they are basically interpolations between the previous 20 topics, creating 40 more specific topics.
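Refitting is just a matter of raising k on the same document-term matrix, reusing the objects from the sketch above:

# same document-term matrix, just a larger number of topics
lda_40 <- LDA(dtm, k = 40, control = list(seed = 1234))
terms(lda_40, 10)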
Due to the increase in specificity, the average size of the topics decreased considerably, as they are now subsets of the previously broader topics.
Music, which was previously Topics 15 and 16, is now split into Topic 36 (the cultural side of hip hop and music), Topic 22 (EDM, Pop, and Hip Hop), and Topic 8 (Rock and Roll).
The old Topic 8, which was about Tech products, is now split into three: Topic 20 (Apps), Topic 23 (Startups), and Topic 16 (Tech Business).
Topic 14, which was Immigration and Police Brutality, is now Topic 38 (Privacy and Surveillance), Topic 3 (Human Rights), Topic 27 (Refugees), and Topic 29 (Racism).
With interpolation there is quite a lot of crossover (common themes between topics); we can visualize this crossover using a network map of the topics.
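A map like the one below can be built in more than one way; one plausible construction, connecting topics whose term distributions look alike, is sketched here with igraph. The cosine-similarity measure and the 0.2 threshold are my choices, not necessarily the ones used for the diagram.

library(igraph)

# per-topic term distributions; topicmodels stores them as log-probabilities
beta <- exp(lda_40@beta)   # 40 x vocabulary matrix

# crossover between two topics: cosine similarity of their term distributions
norms <- sqrt(rowSums(beta^2))
sim   <- (beta %*% t(beta)) / (norms %o% norms)
diag(sim) <- 0

# draw an edge wherever similarity clears a threshold (0.2 is an arbitrary choice)
adj <- (sim > 0.2) * 1
g   <- graph_from_adjacency_matrix(adj, mode = "undirected")
plot(g, vertex.label = 1:40)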

The topics with no connections are the most independent, as they have no crossover with other topics.
Topic 1 (Food), Topic 5 (Military and Warfare), and Topic 11 (Advertisement and Media).
You can clearly see how connected the common-word Topics (7, 31, 39) are, as well as the triangle of Tech Topics (20, 23, 16) and the line of Music Topics (36, 22, 8).
It is interesting to note how much closer Topic 38 (Privacy) and Topic 18 (Election 2016) are to Topic 3 (Human Rights), Topic 9 (Climate), and Topic 35 (LGBTQ) than to Topic 29 (Racism).

This diagram is exactly the same as the one above, just with the most frequent tag as the node name. One tag is not usually representative of the complete topic, so I left the topic number in as well, but it should give an idea of what each node might contain.
Conclusion
- We scraped 1289 Medium stories
- Found the most frequent tags, and thus the most frequent topics
- Performed topic modelling to create topics similar to the tags, without the need for historical data
- Created more specific topics using more clusters
- Found correlations between topics.