How to get more likes on your blogs (2/2)

Neerja Doshi
Towards Data Science
6 min read · Feb 12, 2018


Estimating the claps you get, the data science way

Ever wondered how you can get your story to trend? Is it the title, the images, the quotes or the content that gets you more claps? In this series by Alvira Swalin and me, we have tried to explore the relationship between the features of a blog and the number of claps it gets. Part 1 talks about feature extraction and preliminary Exploratory Data Analysis (EDA), while in Part 2 we build models to predict the claps a blog can get.

The Data

The EDA in Part 1 was based on 600 Data Science blogs, but for further analysis I have used ~4000 Medium blogs. For uniformity of content and audience, we’ve scraped these blogs from the Data Science, Artificial Intelligence, Technology and Programming categories. Our features include the length of the blog, images/word, the number of tags, the sentiment score of the title, the time elapsed since the blog was published and the number of followers. From preliminary EDA, we can see a positive correlation of claps with reading time and the number of tags. More tags seem to fetch more claps, whereas the sentiment of the title does not seem to have much effect.

Methodology

To investigate this further, we first tried treating this problem as regression and then as classification. As expected, regression did not do a very good job, since the range (and variation) of the target is huge. Thus, we settled on giving this data the classification treatment, and that is the approach discussed below.

To see whether our preliminary analysis holds true for a broader range of blogs, we use two setups:

  • 3-label classification to see whether a blog gets low, medium or high claps
  • 20-label classification to get more granular predictions

Feature Engineering

In all, I have included 24 features, comprising those extracted directly from the blog as well as ones related to its content, the author and the publication date.
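As a rough illustration, here is a minimal sketch of this kind of feature extraction; the column names and the TextBlob sentiment scorer are illustrative assumptions, not necessarily what our scraping code actually uses.

```python
import pandas as pd
from textblob import TextBlob  # assumed sentiment scorer, for illustration only

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive blog-level features from scraped fields.
    Assumes (hypothetical) columns: text, title, n_images, n_tags,
    published_date, followers."""
    feats = pd.DataFrame(index=df.index)
    feats["word_count"] = df["text"].str.split().str.len()
    feats["images_per_word"] = df["n_images"] / feats["word_count"].clip(lower=1)
    feats["n_tags"] = df["n_tags"]
    feats["title_word_count"] = df["title"].str.split().str.len()
    # Sentiment polarity of the title, in [-1, 1]
    feats["title_sentiment"] = df["title"].apply(lambda t: TextBlob(t).sentiment.polarity)
    # Days elapsed since the blog was published
    published = pd.to_datetime(df["published_date"])
    feats["days_since_published"] = (pd.Timestamp.today() - published).dt.days
    feats["followers"] = df["followers"]
    return feats

toy = pd.DataFrame({
    "text": ["a short toy blog post about data science"],
    "title": ["How to get more claps"],
    "n_images": [3],
    "n_tags": [5],
    "published_date": ["2018-01-15"],
    "followers": [1200],
})
print(extract_features(toy))
```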

Pre-processing based on the label

The claps range from 1–62,000 with a standard deviation of 2.8k! That’s a huge range to predict for, so to deal with this variation we clip the data at the 90th percentile.

Before and after clipping the data
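A minimal sketch of that clipping step, using a synthetic stand-in for the claps column:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the claps column (long right tail, like the real data)
claps = pd.Series(np.random.lognormal(mean=4, sigma=1.5, size=4000)).round()

cap = claps.quantile(0.90)             # 90th-percentile cap
claps_clipped = claps.clip(upper=cap)  # everything above the cap is set to the cap
print(f"max before: {claps.max():.0f}, cap: {cap:.0f}, max after: {claps_clipped.max():.0f}")
```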

3-label classification

For this, I have binned the blogs into Low, Medium and High, based on the number of claps.

  • Low: < 150 claps → corresponds to the 45th %ile
  • Medium: 150–750 claps → corresponds to the 85th %ile
  • High: > 750 claps
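A sketch of this binning with pandas; the thresholds are the ones above, and the toy DataFrame simply stands in for the scraped data:

```python
import pandas as pd

df = pd.DataFrame({"claps": [12, 150, 420, 980, 62000]})  # toy stand-in

# Low: up to ~150 claps, Medium: ~150-750, High: above 750
# (whether 150 itself falls in Low or Medium is a detail of pd.cut's right-closed bins)
bins = [0, 150, 750, float("inf")]
labels = ["Low", "Medium", "High"]
df["clap_class"] = pd.cut(df["claps"], bins=bins, labels=labels, include_lowest=True)
print(df)
```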

The classifiers I have tried in both setups are Logistic Regression and Random Forest, because both are interpretable. Random Forest outperformed Logistic Regression, as it could capture the non-linearity and interactions between features that Logistic Regression could not.

After some parameter tuning that I won’t be going into here, we get the following —

Our model is able to predict classes 0 and 1 relatively more accurately than class 2. This is due to the imbalance in the distribution of observations across classes in the training data.

20-label classification

Here, to convert claps into classes, I have binned them based on their distribution, i.e. into quantiles. This ensures that the observations are evenly distributed across all the classes (unlike in the 3-label approach), so class imbalance is dealt with. Below are a few of the clap ranges corresponding to the bins.
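A sketch of this quantile binning, again on a synthetic stand-in for the clipped claps; duplicates="drop" guards against ties at very low clap counts merging bins:

```python
import numpy as np
import pandas as pd

claps = pd.Series(np.random.lognormal(mean=4, sigma=1.5, size=4000)).round()  # stand-in

# 20 (approximately) equally populated bins; returns the bin index per blog and the bin edges
clap_bin, edges = pd.qcut(claps, q=20, labels=False, retbins=True, duplicates="drop")
print(edges)  # the clap range covered by each bin
```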

As in the previous case, we built logistic regression and random forest models on the training data of ~3300 blogs, using 5-fold cross validation to tune the parameters.
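A minimal sketch of that tuning setup with scikit-learn; the synthetic data and the parameter grid are illustrative only, not the ones actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for ~3300 blogs x 24 features, with 20 clap bins as the target
X_train, y_train = make_classification(
    n_samples=3300, n_features=24, n_informative=10, n_classes=20, random_state=0
)

param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                    # 5-fold cross validation
    scoring="neg_log_loss",  # log loss is the metric reported below
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```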

Results and Interpretation

We have used log loss as the metric here. The model gave a log loss of 2.7 when predicting the clap bin. Since our classes are ordered, we can also compute the Mean Absolute Error over the bin indices; for our model we get MAE = 5.16 (in bins).

Log loss for multi-class classification: -(1/N) Σᵢ Σⱼ yᵢⱼ log(pᵢⱼ), where N is the number of observations, yᵢⱼ is 1 if observation i belongs to class j (and 0 otherwise), and pᵢⱼ is the predicted probability that it does.
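For reference, both metrics can be computed with scikit-learn; the bin indices and probabilities below are toy values:

```python
import numpy as np
from sklearn.metrics import log_loss, mean_absolute_error

y_true = np.array([3, 17, 8, 12])                 # true bin index per blog (toy values)
proba = np.random.dirichlet(np.ones(20), size=4)  # predicted probabilities over the 20 bins

print("log loss:", log_loss(y_true, proba, labels=np.arange(20)))

# Because the bins are ordered, the absolute error in bin indices is also meaningful
y_pred = proba.argmax(axis=1)
print("MAE (in bins):", mean_absolute_error(y_true, y_pred))
```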

The predicted bin gives us the range of the claps that blog can get. Let’s take a look at some of the actual and predicted ranges:

While some predictions (the ones in red) are completely off the mark, there are some (in orange) that are actually very close but still misclassified.

So which features decide whether your blog will trend or not?
On computing the feature importances, we see that the number of followers and the quality of the content are the strongest indicators of a higher number of claps, other than (of course) the number of days elapsed since the blog was published. The proportion of images to the length of the blog is another important factor.

Surprisingly, the number of words in the title and the reading time do not seem to be very crucial factors, which contradicts what we observed in the initial EDA.
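For completeness, this is roughly how such importance rankings come out of a fitted random forest; synthetic data and generic feature names stand in for the real ones here:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 24-feature matrix and the clap bins
X, y = make_classification(n_samples=3300, n_features=24, n_informative=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # real names: followers, images/word, ...

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances.head(10))
```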

Going further, let’s take a couple of blogs and try to interpret whether the impact of these factors is positive or negative (and also how close our predictions are).

As an example, we take two sample blogs with ~760 and 1800 claps respectively. You can find the first and second blogs here. Let’s first look at their features and predictions:

The prediction for the first blog is bin 17, i.e. 754–1000 claps (bang on!), and bin 14 for the second, i.e. 380–490 claps, which is completely off the actual range of 1400–2000 claps.

To see why our prediction went so wrong for the second blog, let’s look at that blog again. For this blog, content is king, which points to a shortcoming of our model: it could not capture how engaging a blog is in terms of its content and writing style. Capturing that would require more analysis of the article text itself. At the moment, we can only judge blogs based on features like the number of followers, images, sentiment, length, etc.

End Notes

Other work that can be done on blogs could be —

  • Analyse the purpose of the blog — whether it aims to educate, explore a problem, provide a solution to a problem, etc
  • Compute similarity between 2 blogs to avoid repetition/plagiarism
  • Determine the popularity of a topic based on how many claps it can get
  • As mentioned earlier, apply more NLP to determine how popular a blog will get. For example, we could check whether a blog contains current buzzwords (e.g. cryptocurrency), whether it provides solutions to a relevant topic, etc.

I hope you enjoyed reading this! Any ideas, suggestions or comments are most welcome!

LinkedIn — https://www.linkedin.com/in/neerja-doshi/

References

  1. Github repo with Python implementations
  2. This really cool blog on writing good blogs by Quincy Larson!
