Analyzing Medium’s posts and building a simple prediction service for “Popular on Medium”

Published in Polar Tropics, Dec 27, 2017 (5 min read)

These days Medium is all the rage. Blogging has taken the form of stories with more focus on a personal touch, and Medium has become the platform for expressing your views and sharing them with a worldwide community.

I decided to do a quick analysis of the articles to see what separates the “Popular on Medium” posts from the rest. I also wanted to build a predictor that could tell whether a post would be featured in the popular section, and to identify which features matter most while writing a post. Another motive was to learn more about Apache Spark and machine learning models, so I did all the analysis and visualization in Spark and Zeppelin.

Exploratory Data Analysis of “General Posts” and “Popular on Medium”

The dataset consisted of around 11,940 posts published from January to November 2017, of which 1,069 were popular.

I preprocessed the text of the posts by removing punctuation and special characters, lemmatizing and stemming the words, and finally removing the stop words. Then I passed the content of the posts through a Latent Dirichlet Allocation (LDA) model to cluster the posts into 12 topics**. The purpose was to see what kind of articles dominate the Medium space. The words are reduced to their stems, so a little creativity may be required to associate them with meaningful terms.

** The number of topics was arrived at by a little trial and error.
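The cleanup steps described above can be sketched in plain Python. This is only an illustration, not the code from the repository: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude stand-in for a real stemmer. The sketch also drops stop words before stemming (so that a stemmed word like “thi” does not slip past the list), which is a choice made here and not something the post specifies. It does show why stemmed topic words can look odd.

```python
import re
from typing import List

# Tiny illustrative stop-word list (hypothetical); real pipelines use far larger ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on", "for", "with", "it", "this"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    """Lowercase, drop punctuation and special characters, remove stop words, stem."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = letters_only.split()
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The writers are writing stories, and sharing them!")
print(tokens)  # ['writer', 'writ', 'stori', 'shar', 'them']
```

Stems such as “stori” and “shar” are the kind of root words that need a little interpretation when reading the topics.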

Figure: Words that constitute popular topics for “Popular on Medium” posts

Visualizing differences in quantifiable features of posts between Popular and Not Popular Posts

I collected the number of claps, the number of users who clapped, and the number of responses for each post. I then divided each of these quantities by the time elapsed between the post's publication and the time of data collection, so that posts of different ages can be compared fairly on a per-unit-time basis.
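As a minimal sketch of that normalization: the post does not say which time unit was used, so the choice of days here is an assumption, and the function name and sample numbers are made up for illustration.

```python
from datetime import datetime


def per_day(count: int, published_at: datetime, collected_at: datetime) -> float:
    """Divide a raw count (claps, clappers, responses) by the post's age in days."""
    elapsed_days = (collected_at - published_at).total_seconds() / 86400
    if elapsed_days <= 0:
        raise ValueError("collection time must be after publication time")
    return count / elapsed_days


collected = datetime(2017, 12, 1)  # hypothetical collection date

# An older post with more total claps...
older = per_day(1200, datetime(2017, 11, 1), collected)
# ...can still have a lower rate than a recent post with fewer claps.
newer = per_day(300, datetime(2017, 11, 26), collected)

print(older, newer)  # 40.0 60.0
```

Without this step, older posts would look more successful simply because they have had more time to collect claps.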

Figure: Word count vs. claps per unit time (first graph) and image count vs. claps per unit time (second graph)

In the first graph, we can see numerous dense peaks in the range of 1,500–3,500 words. Articles with word counts in this range get the most claps per unit time.

In the second graph, we see that articles with 8–14 images have the most claps per unit time. This suggests that writers may benefit from including more images in their articles, although the graph shows a correlation and not a cause.

All the visualizations below are grouped by an isPopular attribute:

0 — Not Popular
1 — Popular

Figure: Comparison of average users who clapped per unit time, average claps per unit time, and average responses per unit time

Predicting which articles will become popular

For this I compared two algorithms: Naive Bayes and Random Forest.

The features used are all things that are available as soon as the post is written:

Reading time
Image count
Word count
Title word count
Words as TF-IDF features

After cleaning up the text, I computed TF-IDF vectors and combined all the features into a single vector using a VectorAssembler. Then I passed the features to the Naive Bayes and Random Forest algorithms, leaving their parameters at the default values. The label to be predicted was whether the post is popular or not, encoded as 1 or 0.
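To make the feature construction concrete, here is a small plain-Python sketch of TF-IDF followed by a concatenation step that plays the role of the VectorAssembler. It is not the Spark code itself: the helper names and sample values are hypothetical, raw term counts are used as the term frequency, and the IDF uses the smoothed form log((N + 1) / (df + 1)), which is the formula documented for Spark ML's IDF.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tf_idf(docs: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    idf = {term: math.log((n_docs + 1) / (doc_freq[term] + 1)) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts as the term frequency
        vectors.append([counts[term] * idf[term] for term in vocab])
    return vocab, vectors


def assemble(numeric_features: Sequence[float], text_vector: Sequence[float]) -> List[float]:
    """Concatenate numeric and text features into one vector per post."""
    return list(numeric_features) + list(text_vector)


docs = [["spark", "data", "data"], ["spark", "story"]]
vocab, vectors = tf_idf(docs)

# Hypothetical numeric features: reading time, image count, word count, title word count.
row = assemble([7.0, 3, 1500, 9], vectors[0])
print(vocab)
print(row)
```

A term that appears in every document, like “spark” here, gets an IDF of zero and so contributes nothing to the vector, which is the point of the weighting.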

Naive Bayes gave an F-score of 88.4% and an accuracy of 90.05%.

Random Forest gave an F-score of 87.6% and an accuracy of 91.6%.
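For reference, both metrics can be computed from the four confusion-matrix counts. The sketch below uses the F1 score of the positive (popular) class; the post does not say which F-score variant or which evaluation split produced the numbers above, so treat this as an illustration only. As a point of comparison it also evaluates a trivial baseline that always predicts “not popular”, applied to the class counts of the full dataset (11,940 posts, 1,069 popular), which is an assumption about class balance and not the author's test set.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Class counts taken from the dataset description.
total_posts = 11940
popular_posts = 1069

# Baseline: predict "not popular" for every post.
baseline = binary_metrics(tp=0, fp=0, fn=popular_posts, tn=total_posts - popular_posts)
print(round(baseline["accuracy"], 4), baseline["f1"])  # 0.9105 0.0
```

Because only about 9% of the posts are popular, accuracy alone is high even for this do-nothing baseline, so the F-score is the more informative of the two numbers.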

Looks like both classifiers perform well in predicting popular articles. One thing I found a little surprising was that the Random Forest classifier's F-score improved slightly when I removed the textual features.

Random Forest without textual features gave an F-score of 89.06% and an accuracy of 91.56%.

This may be because the textual features add complexity to the model, although they will probably generalize better with more data. Further analysis is required here. I have not fine-tuned the models yet because I just wanted to see the effect and viability of the classifiers. I’ll probably use a CrossValidator to improve on the accuracy in the future.
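Spark ML's CrossValidator performs k-fold cross-validation over a grid of parameters. The splitting idea behind it can be sketched in a few lines of plain Python; the function name, fold count and seed below are hypothetical, and no model is trained here.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_rows: int, k: int, seed: int = 42) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) row indices for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=10, k=3))
for training, validation in splits:
    print(len(training), len(validation))
```

Each candidate parameter setting would be trained on the training indices and scored on the validation indices of every fold, and the setting with the best average score would be kept.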

The ultimate aim of the prediction service is to convey to people what their articles are missing and what they can do to get the articles into the Popular section.

Whewww, that was a long post. But the analysis and the machine learning part were super fun and educational, to say the least. I learned a lot about Spark and Zeppelin and read about the classification and clustering algorithms Spark implements. The visualizations were done in Zeppelin, and even though they are limited in scope, I found them sufficient for my use case. Finally, I’m a newbie in the field of data and am still learning the ropes. Please share your thoughts and suggestions below so I can learn and correct my mistakes.

You can check out the source code at: https://github.com/masterlittle/SparkmediumAnalysis

Adios!!!

Written by Shitij Goyal