In my previous post, I wrote about how to scrape search results for a particular query string from Medium. In this post, we will go into the details of analyzing the data scraped for the search term “Data Science”: grouping posts into different levels of popularity based on their number of claps and responses, and understanding what makes these posts popular.
The data scraped from Medium search results was a JSON file with extensive data about each search result. To explore the structure of the JSON file, I used Notepad++ with the JSON plugin. The JSON file had data about the posts, the author of each post and the publisher associated with that post (if any).
The code to extract data from the JSON file can be found here. In addition to extracting data from the JSON file, I also added a field with the date when the post was scraped.
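As a rough idea of what such an extraction step can look like, here is a minimal sketch using only the standard library. The JSON layout, the key names (`posts`, `title`, `claps`, `responses`) and the scrape date are hypothetical placeholders for illustration; they are not the real structure of the Medium search response or the code linked above.

```python
import json
from datetime import date

# NOTE: the layout and key names below ("posts", "title", "claps", "responses")
# are hypothetical placeholders, not the real structure of Medium's JSON.
SAMPLE_JSON = """
{
  "posts": [
    {"id": "a1", "title": "Intro to Data Science", "claps": 120, "responses": 3},
    {"id": "b2", "title": "Big Data in Practice", "claps": 45, "responses": 0}
  ]
}
"""

def extract_posts(raw_json, scraped_on):
    """Flatten each post into a dict and stamp it with the scrape date."""
    data = json.loads(raw_json)
    rows = []
    for post in data.get("posts", []):
        rows.append({
            "id": post.get("id"),
            "title": post.get("title", ""),
            "claps": post.get("claps", 0),
            "responses": post.get("responses", 0),
            # Extra field recording when the post was scraped.
            "scraped_date": scraped_on.isoformat(),
        })
    return rows

# The scrape date here is an arbitrary example value.
rows = extract_posts(SAMPLE_JSON, date(2018, 4, 30))
for row in rows:
    print(row)
```

Using `.get()` with defaults keeps the loop from failing on posts that lack a field, which is common in large scraped responses.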
Exploratory Analysis of Posts Related to “Data Science”
Scraping the results for the search term “Data Science” returned 831 posts, of which 31 were responses to a post and were excluded from the analysis. The scraped data covered March 2013 to April 2018.
All the date fields, such as Created Date, First Published Date and Last Updated Date, were in milliseconds elapsed since January 1, 1970. They were converted into a human-readable date format using the function below.
from datetime import datetime, timedelta

# Function to convert an epoch date (milliseconds since Jan 1, 1970) to a human-readable format
def convertToDateString(date):
    return (datetime(1970, 1, 1) + timedelta(milliseconds=date)).strftime("%Y-%m-%d %H:%M:%S")
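A quick way to sanity-check this kind of conversion is to feed it timestamps whose dates are known. The sketch below restates the function under a snake_case name (my own naming, not from the original code) and adds a hypothetical timezone-aware variant. It assumes the timestamps are in UTC, which the source data does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

def convert_to_date_string(ms):
    """Convert milliseconds since 1970-01-01 to 'YYYY-MM-DD HH:MM:SS'."""
    return (datetime(1970, 1, 1) + timedelta(milliseconds=ms)).strftime("%Y-%m-%d %H:%M:%S")

def convert_to_utc_datetime(ms):
    """Same arithmetic, but returns a timezone-aware datetime (assumes UTC)."""
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=ms)

print(convert_to_date_string(0))              # 1970-01-01 00:00:00
print(convert_to_date_string(1522540800000))  # 2018-04-01 00:00:00
print(convert_to_utc_datetime(1522540800000).isoformat())
```

Keeping a real `datetime` object around, rather than only the formatted string, makes later date arithmetic (such as days since publishing) simpler.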
The next step was to look at which words occurred most commonly in the titles of these posts. As you can see from the word cloud below, Data Science, Big Data, AI, Analytics, Machine Learning, Python and self driven (about self-driving cars) are some of the most frequently occurring words.
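A word cloud is essentially a picture of word frequencies, so the underlying counts can be sketched with the standard library alone. The titles and the stop-word list below are made up for illustration; they are not taken from the scraped data.

```python
import re
from collections import Counter

# Made-up titles for illustration only; not taken from the scraped data.
titles = [
    "Data Science for Beginners",
    "Why Data Science Needs Machine Learning",
    "Machine Learning with Python",
]

# A tiny, hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"for", "why", "with", "the", "a", "an", "of", "in", "to", "and"}

def word_counts(texts):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

counts = word_counts(titles)
print(counts.most_common(4))
```

The same counting approach works for tags: replace the tokenized title words with each post's list of tags.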
The distributions of Number of Claps and Number of Responses are highly skewed. 708 posts have fewer than 500 claps. This shows that only a few posts become popular.
The Reading Time (mins) of most articles is between 1 and 3 minutes.
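One simple way to see this kind of skew in numbers is to count the posts below a threshold and compare the mean with the median: in a right-skewed distribution, a few very large values pull the mean far above the median. The clap counts below are invented for illustration and are not the scraped values.

```python
from statistics import mean, median

# Made-up clap counts for illustration; not the scraped values.
claps = [0, 2, 5, 8, 12, 30, 75, 140, 900, 12000]

below_500 = sum(1 for c in claps if c < 500)
share_below_500 = below_500 / len(claps)

print("posts under 500 claps:", below_500)
print("share under 500 claps:", share_below_500)
print("mean:", mean(claps), "median:", median(claps))
```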
On Medium, each post can have a maximum of 5 tags. Tags help readers find content more easily. The more relevant the tags, the easier the post is to find. As we can see in the image, Data Science is the most frequently used tag, followed by Machine Learning, Big Data and Artificial Intelligence.
Creating Clusters Based on User Responses
There are three metrics that measure how popular a post is on Medium: #Claps, #Responses and #Recommends. To make a fair comparison, I also included the number of days between the First Published date and the data collection date as a feature. On this feature set, I applied k-means clustering and identified three clusters. As we can see from the image below, the three metrics differ hugely across clusters (Popularity Groups). We can also see that the less popular posts have very low engagement, even though their median number of days between publishing and scraping is the highest.
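To make the clustering step concrete, here is a minimal, dependency-free sketch of k-means (Lloyd's algorithm) on the four features. It is not the code used for this analysis, which most likely relied on a library implementation. The rows are made up, the z-score scaling is my own assumption, and the farthest-point seeding is a deterministic simplification chosen so the example is reproducible.

```python
import math

# Made-up rows for illustration: (claps, responses, recommends, days_since_published).
posts = [
    (5, 0, 2, 900), (12, 1, 4, 850), (8, 0, 3, 950), (20, 1, 6, 800),
    (800, 10, 150, 400), (950, 12, 180, 420), (700, 9, 140, 380),
    (9000, 80, 1500, 300), (11000, 95, 1700, 320), (10000, 90, 1600, 280),
]

def standardize(rows):
    """Rescale every column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

def initial_centroids(points, k):
    """Deterministic farthest-point seeding (a simplification, not k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    centroids = initial_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

labels, centroids = kmeans(standardize(posts), k=3)
print(labels)
```

Scaling matters here because claps run into the thousands while responses stay in the tens. Without it, the distance calculation would be dominated by whichever feature has the largest raw range.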
Understanding What Makes a Data Science Post Popular
As we can see from the image below, the medians for the high and medium popularity articles are 9 and 7 respectively. These articles also have more links than the less popular articles. This suggests that popular posts refer to other posts and other sources of information, which adds more value to the content.
From the image above, we can also see that the posts with medium popularity are closer to the highly popular group than to the less popular group.
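Comparisons like this come down to computing a median per popularity group. A minimal sketch of that calculation follows. The group names and link counts are invented for illustration and do not reproduce the figures from the analysis.

```python
from collections import defaultdict
from statistics import median

# Made-up (popularity_group, link_count) pairs for illustration only.
rows = [
    ("high", 12), ("high", 9), ("high", 15),
    ("medium", 7), ("medium", 10), ("medium", 6),
    ("low", 1), ("low", 3), ("low", 0), ("low", 2),
]

def median_by_group(pairs):
    """Collect values per group, then take the median of each group."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    return {group: median(values) for group, values in groups.items()}

medians = median_by_group(rows)
print(medians)
```

The median is a sensible summary here because, as noted earlier, the engagement metrics are heavily skewed and a mean would be dominated by a handful of very popular posts.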
With simple k-means clustering, we were able to separate popular and less popular posts on Medium related to Data Science. The code for this analysis can be found here.
Thanks for reading.