Decoding BuzzFeed Headlines Using Data Science
Love it or hate it (personally, I'm fascinated by it), BuzzFeed has a characteristic way of capturing our fleeting attention, and a big part of its mojo lies in how it writes its titles. I take an insight-driven approach to analyzing BuzzFeed’s most common headlines using a variety of machine learning and NLP techniques.
1. Data collection and cleaning
Using newsapi.org’s BuzzFeed API, I was able to query up to 50 top headlines from the website at a time. I scraped 50 headlines twice a day for a week, which gave me approximately 700 headlines to work with.
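As a rough sketch of what one such query could look like: the snippet below assumes NewsAPI's v2 `top-headlines` endpoint with `buzzfeed` as the source id and a page size of 50. The function names are hypothetical, and the original collection script may have been organized quite differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: NewsAPI v2 "top-headlines". The source id and page size
# follow the description in the text and are not taken from the original code.
BASE_URL = "https://newsapi.org/v2/top-headlines"


def build_url(api_key, source="buzzfeed", page_size=50):
    """Build the request URL for one batch of top headlines."""
    query = urlencode({"sources": source, "pageSize": page_size, "apiKey": api_key})
    return f"{BASE_URL}?{query}"


def extract_titles(payload):
    """Pull the non-empty headline strings out of a decoded JSON response."""
    return [a["title"] for a in payload.get("articles", []) if a.get("title")]


def fetch_titles(api_key):
    """Download one batch of headlines (needs network access and a valid key)."""
    with urlopen(build_url(api_key)) as response:
        return extract_titles(json.load(response))
```

Running `fetch_titles` on a schedule (twice a day in this case) and appending the results to a file would build up the dataset over the week.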
Once that was done, I cleaned the text data by removing non-English sentences and unhelpful Unicode characters.
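A minimal sketch of this kind of cleaning, using only the standard library. The ASCII-ratio check is an assumed, crude stand-in for language detection: it catches headlines in non-Latin scripts but not, say, Spanish ones, so a dedicated language-detection library would be needed for anything stricter. All names and the 0.9 threshold are hypothetical.

```python
import unicodedata

# Map typographic punctuation to ASCII so contractions like "You’ll" survive.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def strip_unhelpful_unicode(text):
    """Normalize the text, drop remaining non-ASCII characters, tidy spaces."""
    text = unicodedata.normalize("NFKD", text.translate(PUNCT_MAP))
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.split())


def looks_english(text, min_ascii_ratio=0.9):
    """Crude heuristic: at least 90% of non-space characters are ASCII."""
    chars = [c for c in text.translate(PUNCT_MAP) if not c.isspace()]
    if not chars:
        return False
    return sum(c.isascii() for c in chars) / len(chars) >= min_ascii_ratio


def clean_headlines(headlines):
    """Keep headlines that look English, with unhelpful characters removed."""
    cleaned = (strip_unhelpful_unicode(h) for h in headlines if looks_english(h))
    return [h for h in cleaned if h]
```

Note that this also strips emojis, which is exactly the kind of naive cleaning that can throw away sentiment signals later on.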
2. What does the data tell us
The common consensus on the internet is that the ideal length of a headline is between 50 and 70 characters. The average BuzzFeed title length falls in this ballpark. As this is a quest to find BuzzFeed’s most recurring title structure, I narrow my search to titles with 11 words and apply part-of-speech tagging to uncover the most popular sequence of words.
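The position-by-position counting can be sketched as follows. The `tagger` argument is assumed to be any callable that takes a list of tokens and returns `(token, tag)` pairs, which is the shape NLTK's `pos_tag` returns. The `toy_tagger` below is a hypothetical stand-in so the example runs without extra downloads, and splitting on whitespace is a simplification of proper tokenization.

```python
from collections import Counter


def most_common_tag_per_position(headlines, tagger, n_words=11):
    """For headlines with exactly n_words tokens, find the top tag per slot."""
    position_counts = [Counter() for _ in range(n_words)]
    for headline in headlines:
        tokens = headline.split()
        if len(tokens) != n_words:
            continue  # only compare headlines of the same length
        for i, (_, tag) in enumerate(tagger(tokens)):
            position_counts[i][tag] += 1
    # Each entry is a (tag, count) pair, or None if no headline matched.
    return [c.most_common(1)[0] if c else None for c in position_counts]


def toy_tagger(tokens):
    """Hypothetical stand-in tagger: digits become CD, everything else NN."""
    return [(t, "CD" if t.isdigit() else "NN") for t in tokens]


sample = ["14 Things You Need", "Ten Dogs Are Cute", "21 Cats We Love", "Too short"]
print(most_common_tag_per_position(sample, toy_tagger, n_words=4))
```

With a real tagger, the first few slots of the result are where a pattern such as a number followed by a plural noun would show up.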
We can see BuzzFeed loves starting its headlines with numbers. A headline starting with a number almost always leads to a list-based article such as “30 Things Under $10 That’ll Never Stop Being Useful”. Now I’m not an expert on the psychology of lists and why they appeal to us, but for anyone interested, there are several sources out there addressing this topic.
On to the next part of speech: the nouns. By quite a margin, these headlines follow their numbers with nouns (singular or plural).
Nothing too surprising here. These are nouns that all of us can relate to on some sensory level, clearly chosen for mass appeal.
What really captured my attention was the fourth word used in the sequence — “will”.
An example: “14 Things You’ll Understand If You Have A Gym Membership”.
Right off the bat, we can see how the word “will” creates an anticipatory effect. There is an invisible but apparent cost to not clicking on the headline: the cost of not finding out.
Now that we have the structure in place, we can look at other characteristics of the headlines, such as sentiment: are these common headlines generally positive, negative or neither?
The bimodal distribution does indicate that a few headlines skew positive, but overall they are neutral. There are limitations to this assessment, however. First, BuzzFeed titles tend to use emojis and expressive punctuation (like ‘!!!’), which I naively cleaned out during data collection. Second, I used a basic sentiment analysis model, which may have missed slang, unusual comparisons and other signals of sentiment. Despite these concerns, when I looked at the sentences the model classified as ‘neutral’ (shown below), there didn’t seem to be much misclassification. I couldn’t manually classify these sentences as positive or negative myself: they are not opinionated by nature, which ran contrary to what I was initially expecting. This goes back to my earlier point about how BuzzFeed uses anticipation as a tool to pique the interest of readers. These neutral headlines (which are in the majority) lead to articles that aim to convey some information you “need” to have.
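To make the positive/neutral/negative split concrete, here is a toy lexicon-based scorer. It is not the model used for the analysis above (that model is not named in the text); the word lists, the threshold and the function names are all hypothetical and only meant to show how a polarity score turns into a label.

```python
import re

# Tiny illustrative lexicons; a real model relies on far larger resources.
POSITIVE = {"best", "love", "amazing", "cute", "hilarious", "perfect"}
NEGATIVE = {"worst", "hate", "awful", "terrible", "sad", "annoying"}


def polarity(headline):
    """Score in [-1, 1]: (positive hits - negative hits) / number of words."""
    words = re.findall(r"[a-z']+", headline.lower())
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / len(words)


def label(headline, threshold=0.05):
    """Turn the polarity score into one of three coarse labels."""
    p = polarity(headline)
    if p > threshold:
        return "positive"
    if p < -threshold:
        return "negative"
    return "neutral"


print(label("14 Things You'll Understand If You Have A Gym Membership"))
```

A headline built on anticipation rather than opinion contains no polar words at all, so a scorer of this kind has nothing to latch onto and labels it neutral.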
3. Creating a simple BuzzFeed title generator
Before I continue with this section, let’s summarize what we’ve learned so far about BuzzFeed’s titles:
- Headlines are succinct, at around 7–14 words each.
- The most frequently used headline structure is [CD,NNS,WDT,VBP…] i.e. a number followed by a noun (mostly plural) followed by a Wh-word and a verb (mostly “will”).
- The headlines reveal a lot about the content of the article. They aren’t necessarily “clickbait”; quite the contrary, in my opinion. They tell you exactly what you’ll be getting: list-based articles that reveal some piece of information you want or need to know.
- They are mostly neutral in sentiment. I don’t believe “anticipation” falls under the blanket of “positive” or “negative”. The goal of these headlines is to trigger your curiosity as efficiently as possible. In most cases they don’t need to invoke any polar sentiment to achieve this.
- Note: I can’t stress this enough. The analysis above pertains only to the specific headline structure examined in this article. BuzzFeed uses several other headline structures that are interesting and worth studying.
Alright, let’s create a simple algorithm that tries to replicate this structure:
The output looks good to me. Some of these titles are nonsensical (although hilarious), but they work for our purposes.
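The generator code itself is not reproduced in the text, so here is a minimal sketch of how such a template filler might look. It assumes the structure summarized above (number, plural noun, wh-word, “will”, verb) and uses small hard-coded word lists, which are hypothetical; the original draws its random nouns and verbs from WordNet.

```python
import random

# Hypothetical word lists, kept tiny so the example is self-contained.
PLURAL_NOUNS = ["Things", "Photos", "Gifts", "Recipes", "Facts"]
WH_WORDS = ["That", "Which"]
VERBS = ["Change", "Improve", "Ruin", "Brighten", "Simplify"]
OBJECTS = ["Your Life", "Your Morning", "Your Kitchen", "Your Weekend"]


def generate_title(rng):
    """Fill the template: number, plural noun, wh-word, 'Will', verb, object."""
    parts = [
        str(rng.randint(5, 50)),
        rng.choice(PLURAL_NOUNS),
        rng.choice(WH_WORDS),
        "Will",
        rng.choice(VERBS),
        rng.choice(OBJECTS),
    ]
    return " ".join(parts)


rng = random.Random(42)  # seeded so repeated runs give the same titles
for _ in range(3):
    print(generate_title(rng))
```

Passing in a `random.Random` instance instead of using the module-level functions keeps the output reproducible, which helps when the generated titles are later fed to a classifier.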
To take it a step further, I wanted to see if a classification model would be able to pick up on this structure. For this I needed a dataset with headlines from a different source.
4. Testing
I decided to query around 500 headlines from ABC News. I combined them with the BuzzFeed dataset, vectorized the headlines to form the feature set, and built a target variable in which ‘1’ indicates a BuzzFeed title. I then fit a simple Support Vector Classifier to the training set.
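The data preparation step can be sketched in plain Python as below. The bag-of-words vectorizer, the 80/20 split and all function names are assumptions for illustration; the text does not say which vectorizer or split was used. The resulting lists could then be handed to a classifier, for example scikit-learn's `SVC` via `fit(X_train, y_train)`.

```python
import random
from collections import Counter


def tokenize(headline):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return headline.lower().split()


def build_vocabulary(headlines):
    """Map every distinct token to a column index, sorted for stability."""
    vocab = sorted({tok for h in headlines for tok in tokenize(h)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(headline, vocabulary):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(headline)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


def make_dataset(buzzfeed, other, test_fraction=0.2, seed=0):
    """Label BuzzFeed titles 1 and the rest 0, shuffle, split, vectorize."""
    rows = [(h, 1) for h in buzzfeed] + [(h, 0) for h in other]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    test_rows, train_rows = rows[:n_test], rows[n_test:]
    # The vocabulary comes from the training rows only, to avoid leakage.
    vocabulary = build_vocabulary([h for h, _ in train_rows])

    def to_xy(part):
        features = [vectorize(h, vocabulary) for h, _ in part]
        targets = [y for _, y in part]
        return features, targets

    return to_xy(train_rows), to_xy(test_rows), vocabulary
```

Building the vocabulary from the training rows alone means words that appear only in test headlines are ignored, which mirrors what the model would face with unseen titles.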
The model achieved an accuracy of 83.84% on the test set, with many false positives (low precision, high recall), i.e. there were 13 cases where the model thought an ABC title was a BuzzFeed title. When given a thousand titles from our title generator, the model classified all of them as BuzzFeed titles. So there is something about the structure of our generated titles that is indeed BuzzFeed-esque. Note that the random nouns and verbs I used to generate titles were not taken from the BuzzFeed corpus (I took them from the WordNet dictionary).
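For readers less familiar with these metrics, the sketch below shows how accuracy, precision and recall follow from true and predicted labels, with ‘1’ meaning BuzzFeed. The labels in the example are made up purely for illustration and are not the results reported above.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for the positive class 1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1  # e.g. an ABC title mistaken for a BuzzFeed title
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return accuracy, precision and recall as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels only: every BuzzFeed title is caught (recall 1.0),
# but two other titles are flagged too, which drags precision down.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
print(summarize(y_true, y_pred))
```

False positives appear only in the denominator of precision, which is why a model that over-predicts the BuzzFeed class can keep high recall while its precision suffers.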
Conclusion and Extras
My intention in writing this was not to best replicate BuzzFeed headlines, but rather to analyze their structure and gain insight into why they work. So I picked their most popular headline format and tried to make inferences from the data I had collected. You can access all the code and data I used here.
As an extra, I also used a pre-trained Recurrent Neural Network (RNN) to generate text based on the BuzzFeed headlines given as input. Below is a sample of the output I got (after 4 epochs):
Now they don’t look exactly like the titles we’ve dealt with, but in some uncanny way they do exhibit the properties we’ve discussed in detail. They start with a number followed by a plural noun and they use the future tense to further pique your attention. Fascinating stuff.