Deconstructing Bestselling Amazon Titles using Natural Language Processing and Bi-Grams

Published in Ascent Publication, Aug 21, 2018

A bi-gram is a pair of adjacent words pulled from a sentence. Bi-grams are typically collected around a topic of study, for example to learn the style of a famous author or to model a technical dialect. I will focus on top-selling Amazon titles to build a library of winning bi-grams. Once collected, they can help us understand how those titles are put together and which word combinations sell books. We can even use them to stencil our own titles over those of the pros and measure how ours stack up.
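As a minimal sketch of that definition, bi-grams can be produced by pairing each word with the one that follows it. The helper name `bigrams_from` and the sample title are made up for illustration; this is not the script used for the analysis.

```python
def bigrams_from(sentence):
    """Return the adjacent word pairs of a whitespace-split sentence."""
    words = sentence.split()
    # zip pairs word i with word i+1; the last word has no partner
    return list(zip(words, words[1:]))

# Hypothetical title, not taken from the collected data
print(bigrams_from("the definitive guide to critical thinking"))
```

A six-word title yields five bi-grams, and a one-word title yields none.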

As you can imagine, the longer the word sequence, the rarer the matches against other writing. By keeping the sequence to two words, we can find plenty of matches, even between titles that are radically different.
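To make the length trade-off concrete, here is a small sketch that counts the sequences two titles share for lengths one, two, and three. Both titles and the function names `ngrams` and `shared_ngrams` are invented for the example.

```python
def ngrams(words, n):
    """Return every run of n adjacent words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def shared_ngrams(title_a, title_b, n):
    """Return the set of length-n sequences that appear in both titles."""
    a = set(ngrams(title_a.split(), n))
    b = set(ngrams(title_b.split(), n))
    return a & b

# Two made-up titles
t1 = "the definitive guide to critical thinking"
t2 = "critical thinking for the modern investor"

for n in (1, 2, 3):
    print(n, len(shared_ngrams(t1, t2, n)))
# prints: 1 3, then 2 1, then 3 0
```

The two titles share three single words, one bi-gram, and no tri-grams, which is the drop-off described above.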

Scraping of Data Through Crowdsourcing

I enlisted the help of a few courageous souls to collect a few hundred top-selling titles. I then wrote a Python script to clean up the titles and make them more generalizable. Forcing the text to lower case and removing all punctuation is a compromise, but it pays off by producing more matches to work with. In essence, we’re reverse engineering titles into word bits that people respond to or don’t respond to.
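The clean-up step might look something like the sketch below. It is an assumption about the approach, not the original script: `clean_title` is a hypothetical name, and it only strips ASCII punctuation (curly quotes and other Unicode marks would need extra handling).

```python
import string

def clean_title(title):
    """Lower-case a title, strip ASCII punctuation, and collapse whitespace."""
    lowered = title.lower()
    # A translation table with a delete set removes each ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() followed by join() squeezes runs of whitespace into single spaces
    return " ".join(no_punct.split())

# Made-up title for illustration
print(clean_title("Critical Thinking: A Definitive Guide!"))
# prints: critical thinking a definitive guide
```

One side effect worth knowing: apostrophes are removed too, so "don't" becomes "dont".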

Top Bi-Grams Disproportionately Seen in Bestselling Titles and those Disproportionately Seen in Sleepers

As mentioned in my previous posts on building titles, this analysis by no means ensures that a title will be successful. Many variables at play can make a title successful regardless of its quality, starting with who wrote it.

Let’s take a look at some of the top bi-grams disproportionately seen in winning titles. The tables show each bi-gram along with how many times it was counted in high-ranking Amazon books versus low-ranking ones (here the Top_Count column is a lot larger than the Bottom_Count).

Then come the bottom bi-grams, disproportionately seen in poorly selling titles and not in winning ones (here the Bottom_Count column is a lot larger than the Top_Count).
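A rough sketch of how Top_Count and Bottom_Count could be tallied is shown here. It assumes "disproportionately" means a large gap between the two raw counts, which may not be the exact criterion used for the tables. The titles are toy placeholders rather than collected data, and `bigram_counts` and `compare` are hypothetical names.

```python
from collections import Counter

def bigram_counts(titles):
    """Count every adjacent word pair across a list of cleaned titles."""
    counts = Counter()
    for title in titles:
        words = title.split()
        counts.update(zip(words, words[1:]))
    return counts

def compare(top_titles, bottom_titles):
    """Return (bigram, Top_Count, Bottom_Count) rows, most top-heavy first."""
    top = bigram_counts(top_titles)
    bottom = bigram_counts(bottom_titles)
    # A Counter returns 0 for a pair it has never seen
    rows = [("+".join(bg), top[bg], bottom[bg]) for bg in set(top) | set(bottom)]
    # Sort by the gap between the counts; ties are broken alphabetically
    return sorted(rows, key=lambda r: (-(r[1] - r[2]), r[0]))

# Toy placeholder titles, assumed to be cleaned already
top_titles = ["the definitive guide to python", "the definitive guide to sql"]
bottom_titles = ["annotated bibliography of python", "report of the committee"]

rows = compare(top_titles, bottom_titles)
for row in rows:
    print(row)
```

Reading the sorted rows from the top gives the winners; reading from the end gives the sleepers. With very different numbers of top and bottom titles, comparing rates rather than raw counts would probably be fairer.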

Observations

We need to dig past obvious popular pairs like ‘for+dummies’ or ‘national+geographic’ to find interesting nuggets. On the losing side, we can start with two obvious ones, ‘wall+calendars’ and ‘for+windows’, which are apparently not big sellers.

More interestingly, the word ‘short’ appears twice in the top list and never in the bottom one: we all like our information given in concise chunks. The word ‘of’ is used in over a quarter of the sleepers and only once in the winning list. The words ‘in’ and ‘by’ are also only seen on the losing side.

Great Combinations

delicious+recipes
definitive+guide
critical+thinking
your+brain
transform+your

And the Not so Great

wall+calendar
for+windows
description+of
report+of
containing+the
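To stencil a title of your own over these two lists, something like the following sketch would do. It uses only the ten pairs listed above, so it is far narrower than a full bi-gram library, and `score_title` is a hypothetical helper rather than part of any published tool.

```python
GREAT = {"delicious+recipes", "definitive+guide", "critical+thinking",
         "your+brain", "transform+your"}
NOT_SO_GREAT = {"wall+calendar", "for+windows", "description+of",
                "report+of", "containing+the"}

def score_title(title):
    """Return (great pairs found, not-so-great pairs found) for a title."""
    words = title.lower().split()
    pairs = ["+".join(p) for p in zip(words, words[1:])]
    good = [p for p in pairs if p in GREAT]
    bad = [p for p in pairs if p in NOT_SO_GREAT]
    return good, bad

# Made-up candidate title
print(score_title("transform your brain with critical thinking"))
# prints: (['transform+your', 'your+brain', 'critical+thinking'], [])
```

As the article notes, a high count of winning pairs is a hint about phrasing, not a guarantee of sales.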

Conclusion

There is plenty of room for additional exploration, such as tri-grams, removing stop words, stemming, etc. I will pull more of these patterns in the near future and report back if I find interesting differences. Also, creating your own bi-grams is trivial and can be done in any programming language. If you are a Python user, you can do it with the ‘bigrams()’ function from the Natural Language Toolkit (NLTK) library.
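As one possible starting point for those extensions, here is a sketch that drops stop words before building tri-grams, in plain Python so it runs without installing anything. The stop-word list is a tiny illustrative one and the function name is made up. With NLTK installed, `nltk.util.ngrams(words, 3)` should yield the same tuples.

```python
# A tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {"a", "an", "the", "of", "to", "for", "in", "by", "and"}

def trigrams_without_stop_words(title):
    """Drop stop words, then return runs of three adjacent remaining words."""
    words = [w for w in title.lower().split() if w not in STOP_WORDS]
    return list(zip(words, words[1:], words[2:]))

# Made-up title
print(trigrams_without_stop_words("the definitive guide to critical thinking"))
# prints: [('definitive', 'guide', 'critical'), ('guide', 'critical', 'thinking')]
```

Note that removing stop words joins words that were not adjacent in the original title, so pairs like ‘report+of’ would disappear from the analysis entirely.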

I am a bit perplexed to find ‘bibliography+of’ and ‘annotated+bibliography’ among the bottom ranks. If you have any idea why, please let me know.

And if you want to see how your own blog titles and content stack up against the pros, test it out on the experimental Multi-Point Writing Analyzer on ViralML.com.

Please share and clap if you found this helpful — thanks for reading!

Manuel Amunategui

Get it and plenty more at amunategui.github.io and at ViralML.com.
