Analyzing Medium’s posts and building a simple prediction service for “Popular on Medium”
Medium is all the rage these days. Blogging has taken the form of stories, with more focus on a personal touch, and Medium has become the platform for expressing your views and sharing them with a worldwide community.
I decided to do a quick analysis of the articles and see what separates the “Popular on Medium” posts from the other posts. I also wanted to build a predictor that could tell whether a post would be featured in the popular section, and to identify which features matter most while writing a post. Another motive was to learn more about Apache Spark and Machine Learning models, so I did all the analysis and visualization in Spark and Zeppelin.
Exploratory Data Analysis of “General Posts” and “Popular on Medium”
The dataset consisted of around 11940 posts published between January and November 2017, of which 1069 were popular.
I preprocessed the text of the posts by removing punctuation and special characters, lemmatizing and stemming the words, and finally removing the stop words. Then I passed the content of the posts through a Latent Dirichlet Allocation (LDA) model to cluster the posts into 12 topics**. The purpose was to see what kinds of articles dominate the Medium space. The words are stemmed to their roots, so a little creativity may be required to associate them with meaningful terms.
** The number of topics was arrived at by a little trial and error.
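To make the preprocessing steps concrete, here is a minimal pure-Python sketch of them. It is not the Spark code used for the analysis: the stop-word list is a tiny illustrative one, and `naive_stem` is a hypothetical, very crude stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "for", "with"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and special characters, stem, remove stop words."""
    # Replace every character that is not a lowercase letter or whitespace with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    stems = [naive_stem(token) for token in cleaned.split()]
    # Drop stop words and leftover single-character fragments.
    return [s for s in stems if s not in STOP_WORDS and len(s) > 1]


tokens = preprocess("Trump is tweeting about Bitcoin's prices!")
print(tokens)  # ['trump', 'tweet', 'about', 'bitcoin', 'pric']
```

The stem `pric` shows why the topic words sometimes need a bit of imagination to read.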
Words that constitute popular topics for “Popular Medium Posts”
Words that constitute popular topics for “Not Popular Medium Posts”
The main themes visible here are Blockchain, Technology, Medical research, Relationships, Trump politics (again), Coding, Food, Economy, and Cultural Arts like photography and music.
Looks like articles involving Trump, Bitcoin, Relationships, Music and Data have a higher chance of moving into the Popular section.
Visualizing differences in quantifiable features of posts between Popular and Not Popular Posts
I collected the number of claps, the number of users who clapped, and the number of responses for each post. I then divided these quantities by the time elapsed between the publication of each post and the time of data collection, so that posts are compared fairly on a per-unit-time basis.
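The normalization can be sketched as follows. The function name, the choice of days as the time unit, and the example numbers are assumptions made for illustration; the post does not say which unit was used.

```python
from datetime import datetime


def per_unit_time(count, published_at, collected_at, unit_seconds=86400):
    """Divide a raw count by the time elapsed between publication and collection.

    unit_seconds sets the time unit; the default of one day is an assumption.
    """
    elapsed_units = (collected_at - published_at).total_seconds() / unit_seconds
    if elapsed_units <= 0:
        raise ValueError("collection time must be after publication time")
    return count / elapsed_units


# Hypothetical post: 500 claps, collected ten days after publication.
claps_per_day = per_unit_time(500, datetime(2017, 11, 1), datetime(2017, 11, 11))
print(claps_per_day)  # 50.0
```

Without this step, a post from January would have had far longer to collect claps than one from November, and the raw counts would not be comparable.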
- Word Count vs Claps per unit time & Image Count vs Claps per unit time
In the first graph, we can see numerous dense peaks in the range of 1500–3500 words. Articles with word counts in this range get the most claps per unit time.
In the second graph, we see that articles having 8–14 images have the most claps per unit time. Clearly writers should include more images in their articles.
All the visualizations below are grouped by an isPopular attribute:
- 0 — Not Popular
- 1 — Popular
1. Comparison of Average Users who clapped per unit time, Average Claps per unit time and Average Responses per unit time
Popular posts have:
- 10x more Average Users who clapped per unit time
- 10x more Average Claps per unit time
- 7x more Average Responses per unit time
2. Comparison of Average Reading time, Average word count and Average Unique Word Count per word
Popular posts:
- Have an Average Reading time of 8 minutes, compared to 5.2 minutes for Non-Popular posts.
- Have an Average Word Count of 2075, compared to 1375 for Non-Popular posts.
- Have an Average Unique Word Count per word of 0.333, compared to 0.313 for Non-Popular posts. This means that for every 100 words, Popular posts have 33 unique words, compared to 31 unique words for Non-Popular posts. ***
Popular articles are clearly longer and have a slightly richer vocabulary than normal articles.
*** Average Unique Word Count per word = Unique Word Count / Total Word Count
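As a small worked example of this ratio, here is a sketch in Python. Whether the words were counted before or after stemming and lowercasing is not stated in the post, so the tokenization here (lowercase, split on whitespace) is an assumption.

```python
def unique_word_ratio(words):
    """Unique Word Count / Total Word Count for a list of tokens."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# 6 words in total, 4 of them distinct: the, cat, saw, other.
sample = "the cat saw the other cat".lower().split()
print(round(unique_word_ratio(sample), 3))  # 0.667
```

Note that short texts naturally score higher on this ratio than long ones, because words repeat more as a text grows.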
3. Comparison of Average Image Count and Average Title Words Count
Popular posts:
- Have 6 images on average, compared to 4 images for Non-Popular articles.
- Have longer titles, with an average of 1 word more than Non-Popular articles' titles.
Predicting which articles will become popular
For this I compared two algorithms: Naive Bayes and Random Forest.
Features used are all the things we can obtain as soon as the post is written:
- Reading time
- Image count
- Word count
- Title word count
- Words as TF-IDF features
After cleaning up the text, I computed TF-IDF on it and combined all the features into a single vector using a VectorAssembler. Then I passed the features to the Naive Bayes and Random Forest algorithms, leaving all parameters at their default values. The label to be predicted was whether the post is popular or not, encoded as 1 or 0.
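To show what the feature vector looks like, here is a minimal pure-Python sketch of TF-IDF weighting followed by concatenation with the numeric features. It is not the Spark pipeline itself: `tf_idf` and `assemble` are hypothetical helper names, the example documents and numeric values are made up, and the smoothed IDF formula log((N + 1) / (df + 1)) is the one documented for Spark MLlib's IDF, used here on raw term counts.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists. Returns (vocabulary, TF-IDF vectors)."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def assemble(numeric_features, text_vector):
    """Concatenate numeric columns and the TF-IDF vector into one feature row."""
    return list(numeric_features) + list(text_vector)


docs = [["bitcoin", "price", "bitcoin"], ["music", "price"]]
vocab, vectors = tf_idf(docs)
# Hypothetical numeric features: reading time, image count, word count, title word count.
row = assemble([8.0, 6, 2075, 9], vectors[0])
print(vocab)     # ['bitcoin', 'music', 'price']
print(len(row))  # 7
```

A term such as `price` that appears in every document gets an IDF of zero, so it contributes nothing to the vector, which is the point of the weighting.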
Naive Bayes gave an F-score of 88.4% and an Accuracy of 90.05%
Random Forest gave an F-score of 87.6% and an Accuracy of 91.6%
Looks like both classifiers perform well at predicting popular articles. One thing I found a little surprising was that the Random Forest classifier got a slightly higher F-score when I removed the textual features, while its accuracy stayed essentially the same.
Random Forest without textual features gave an F-score of 89.06% and an Accuracy of 91.56%
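For readers who want to see how such scores are derived, and what to compare them against, here is a small sketch. The confusion-matrix counts are hypothetical, not results from this analysis. The F1 computed here is for the Popular class only; the post does not say how its F-score was averaged, so the two are not necessarily comparable. The baseline uses the dataset counts given earlier and assumes the evaluation data has roughly the same class balance.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (Popular) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def majority_baseline_accuracy(n_total, n_positive):
    """Accuracy of a model that always predicts the larger class."""
    return max(n_positive, n_total - n_positive) / n_total


# Hypothetical counts for a test set of 2000 posts.
metrics = binary_metrics(tp=60, fp=40, fn=140, tn=1760)
print(metrics["accuracy"])  # 0.91

# 1069 popular posts out of about 11940 in the whole dataset.
baseline = majority_baseline_accuracy(11940, 1069)
print(round(baseline, 3))  # 0.91
```

Because only about 9% of the posts are popular, always answering "Not Popular" already scores about 91% accuracy, so the accuracies above are best read against that baseline.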
This may be because textual features add more complexity to the model, although they will probably generalize better with more data. Further analysis is required here. I have not fine-tuned the models yet because I just wanted to see the effect and viability of the classifiers. I’ll probably use a CrossValidator to improve on the accuracy in the future.
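Spark's CrossValidator evaluates each candidate parameter setting with k-fold cross-validation. The sketch below shows only the splitting idea in plain Python; `k_fold_indices` is a hypothetical helper, and the fold count and seed are arbitrary choices.

```python
import random


def k_fold_indices(n_samples, k, seed=42):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(10, 3))
for training, validation in splits:
    print(len(training), len(validation))
```

Each post is used for validation exactly once, so the averaged score is less dependent on one lucky or unlucky train/test split. With only about 9% popular posts, a stratified split that keeps the class ratio in every fold would be a natural refinement.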
The ultimate aim of the prediction service is to convey to people what their articles are missing and what they can do to get the articles into the Popular section.
Whewww, that was a long post. But the analysis and the Machine Learning part were super fun and educational, to say the least. I learned a lot about Spark and Zeppelin and read about the classification and clustering algorithms Spark implements. The visualizations were done in Zeppelin, and even though they are limited in scope, I found them sufficient for my use case. Finally, I’m a newbie in the field of data and am still learning the ropes. Please share your thoughts and suggestions below so I can learn and correct my mistakes.
You can check out the source code at: https://github.com/masterlittle/SparkmediumAnalysis