Keep Reading: The Story of a Nascent Recommendation Engine

Franklin Horn
7 min read · Jun 28, 2015


At The Hollywood Reporter (THR), we want you to read all of our great stories. Our editorial team is on the beat, breaking news and writing features about Hollywood and the world at large. If you haven’t seen the blistering Chris Rock article on race, or any of our recent cover stories like this revealing roundtable discussion or Natalie Portman, you’re missing out.

We want you to keep reading, and we want you to keep reading even when you’ve finished the first story. How can we invite you to stay for just one more? We, on the product team, are particularly obsessed with this question.

Right now on THR, we offer a “Next Up” widget at the end of each story. It’s powered by one of the most popular and “clicky” stories of the day. We saw a tidy improvement in bounce rate when we launched the widget last year. Here’s a screenshot from the bottom of our story Twitter CEO Dick Costolo Stepping Down.

Screenshot showing “Next Up” widget. Taken from www.hollywoodreporter.com/news/twitter-ceo-dick-costolo-stepping-801937

Here the widget is offering me a story about AwesomenessTV. While it’s an interesting topic, there’s an obvious gap between a story about Twitter and a hot story from our television section. If you’re one of our regular readers, chances are you’ve already read the day’s most popular story. If you’re one of our new readers, it looks like unrelated content. Either way, this story is unlikely to entice you to click. The recommendation pair here is about as compelling as a pinot noir paired with Cheez-Its.

The question becomes: what would be the next best story to put there? My hunch has been that a closely similar story would outperform just any popular story. So earlier this year, to test that hunch, I set out to build a prototype recommendation engine for THR that could beat our current “most popular and clicky” model.

Building the Prototype

We know it can be done well. Hulu saw a 3x growth in clicks when it rolled out its own recommendation secret sauce in 2011. Hulu tackled user-based (collaborative) filtering, which draws on user actions that indicate preference, like watching a video or selecting content. To keep my first pass simple, I decided to exclude analytics data on existing click rates. My model instead uses content-based filtering, which compares the stories themselves, with no reader data involved. It builds a similarity score with natural language processing tools that compute simple statistics on the words each story contains.

Data Collection

THR has a few APIs for accessing content, but none are available for public access. So while I can’t reveal my exact API calls, I’ve uploaded a sample raw dataset on GitHub.

In it, you’ll find the title, deck, teaser, body, and a few other features. Although the posted sample is much smaller, my full download contained 32,374 stories. None are duplicated, and they range from as short as ninety-five words (a story that published a single THR tweet) to 173,766 words (a transcript of an interview with Ethan Hawke). The stories were published between March 2014 and the first half of June 2015.

Data Transformation

All of the stories included HTML tags, which are irrelevant to our analysis, so I borrowed an HTML parser from this Stack Overflow post to strip them out. After stripping the tags, I combined the title, deck, teaser, and body of each story into a single column, “documents.” The next step is to break these documents into tokens, the individual terms the model can count. I then used Scikit-Learn’s term frequency / inverse document frequency (TF-IDF) vectorizer out of the box, applied it to the documents, and produced a similarity score between each pair of stories in my data set:

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams through trigrams, with English stop words removed
vect = TfidfVectorizer(min_df=1, ngram_range=(1, 3), stop_words='english')
tfidf = vect.fit_transform(df.documents)
# Rows are L2-normalized by default, so this dot product is cosine similarity
tfidf_score = (tfidf * tfidf.T).A
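Stepping back to the tag-stripping step for a moment: I won’t reproduce the exact parser from the Stack Overflow post here, but a minimal sketch along the same lines, using only the standard library’s html.parser, looks like this (class and function names are my own):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects text content while discarding all HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return "".join(stripper.chunks)

strip_tags("<p>Twitter CEO <b>Dick Costolo</b> stepping down</p>")
# → "Twitter CEO Dick Costolo stepping down"
```

Running each story’s raw fields through a function like this, then concatenating them, produces the “documents” column the vectorizer consumes.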

A snippet of the similarity score matrix for the first five stories.

The result is a 32,373 x 32,373 pandas data frame of similarity scores. The scores range from 0 to 1, and the matrix has a main diagonal of all 1s, because a story’s most similar neighbor is itself. We can exclude these 1s from the final results. We can also exclude one half of the matrix, since it is symmetric across the main diagonal.
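Pulling the top recommendations for a story then amounts to masking the diagonal and sorting its row. Here’s a small sketch of that lookup (the titles and the 3x3 score matrix below are invented for illustration; the real matrix is far larger):

```python
import numpy as np

def top_recommendations(scores, titles, story_index, n=3):
    """Return the n most similar stories to the one at story_index,
    excluding the story itself (its self-similarity is always 1)."""
    row = scores[story_index].copy()
    row[story_index] = -1.0           # mask the main-diagonal self-match
    best = np.argsort(row)[::-1][:n]  # highest scores first
    return [(titles[i], row[i]) for i in best]

# Toy symmetric similarity matrix with a diagonal of 1s
titles = ["Costolo Steps Down", "Rowghani Leaving", "Snoop for CEO"]
scores = np.array([[1.00, 0.14, 0.10],
                   [0.14, 1.00, 0.08],
                   [0.10, 0.08, 1.00]])

top_recommendations(scores, titles, 0, n=2)
# → [("Rowghani Leaving", 0.14), ("Snoop for CEO", 0.10)]
```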

The Results

Now, the fun part. Let’s start pulling stories and see what the model would recommend, beginning with the Dick Costolo story example.

Twitter CEO Dick Costolo Stepping Down

The site recommends:

Emmys: Why ‘House of Cards’ Star Robin Wright is Considering a Career Change

However, my model would recommend:

Twitter COO Ali Rowghani Leaving Company 0.13886891

Not bad. Here are some runners-up, and their respective scores, to show the breadth of the potential model recommendations:

  • Twitter Revenue Nearly Doubles But User Growth Remains a Problem 0.112381551
  • Snoop Dogg Wants to Be the Next Twitter CEO 0.103994517
  • Twitter Dropping 140-Character Limit on Direct Messages 0.091953628

This set is great: all of the stories pertain to Twitter’s growth, future, and other executive role changes. I bet we could keep more fans moving through our Twitter coverage with these than with the Robin Wright story. Let’s keep pulling stories.

The Making of ‘American Sniper’: How an Unlikely Friendship Kickstarted the Clint Eastwood Film

On the site, THR recommended:

Netflix Approves 7–1 Stock Split

My model recommended:

‘American Sniper’: Chris Kyle’s Widow at Center of Quiet Furor Over Profits 0.211415329

Now, these two stories would make an excellent pairing! Here are a few other choices:

  • ‘American Sniper’ Stars, Writer Went Beyond Chris Kyle’s Book 0.1850929
  • ‘American Sniper’: What the Critics Are Saying 0.174557328
  • Bradley Cooper on How He Brought ‘American Sniper’ to the Screen and ‘The Elephant Man’ to Broadway 0.169175456
  • First ‘American Sniper’ Trailer: Bradley Cooper Makes a Deadly Decision 0.158361895

In this set, all of the stories directly pertain to ‘American Sniper’ or involve the film’s backstory. The potential pairing on-site is palatable: after the first read about the blockbuster, if your appetite isn’t completely satisfied, here’s yet another one.

I used the out-of-the-box Scikit-Learn package without much modification, and it’s pulling fantastic results. The differences between some of the stories are incredibly small, with only a thousandth of a point or less separating them. Presumably, with more tweaking, I could separate them further, but for the purpose of establishing a baseline of which stories are similar to each other, this definitely passes my internal litmus test.

It’s important to understand any model’s limits, and some of the results, on the surface, appear less than intuitive. While I was writing this post, Jurassic World began making a splash at the box office, so naturally THR decided to dust off the old Jurassic Park review from the archives: ’Jurassic Park’: THR’s 1993 Review. My model recommended: Ethan Hawke on Stage Fright, Denzel Washington: “Don’t F*** With Him, Man” (Q&A) 0.153167886.

An Ethan Hawke interview appears to have little to do with a republished review about dinosaurs in Costa Rica. But because the document similarity is computed with TF-IDF, the model surfaces stories sharing words that are unusually significant to the document. It turns out that Ethan Hawke mentions the year 1993 several times, as does the THR review of Jurassic Park, and in the entire set of THR stories that is a particularly unusual token. The model is accurately finding the next most similar story, but it is not something that would catch a reader’s eye at the bottom of a story about a blockbuster hit. In cases like this, we’ll need to decide how to combine my prototype with other recommendation models to drive the greatest click rate.
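To see why a rare token like “1993” can dominate, here’s a toy illustration using the textbook idf formula, ln(N / df). This is a simplification (Scikit-Learn uses a smoothed variant, and the four documents below are invented), but it shows the mechanism: a term that appears everywhere scores zero, while a rare term carries real weight.

```python
import math

docs = [
    "the movie premiered in 1993 to strong reviews",
    "the movie opened to record box office numbers",
    "the actor recalled filming the movie back in 1993",
    "critics praised the movie and its director",
]

def idf(term, docs):
    """Textbook inverse document frequency: ln(N / document frequency)."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

idf("movie", docs)  # appears in all 4 docs → ln(4/4) = 0.0
idf("1993", docs)   # appears in only 2 docs → ln(4/2) ≈ 0.693
```

Two documents that share “1993” can therefore look quite similar even when everything else about them differs.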

Next Steps

This prototype is great, but it lives only on my laptop and only has a static data set. It’s time to go live with it and stand up a system that actually updates as new stories are published. We’ll have to retool the front-end templates to draw on the new system, but with any luck, we’ll soon have an A/B test running that compares the possible recommendations for “Next Up.” If we improve bounce rate, we’ll know we’ve done our job: getting you the stories that you deserve.

Acknowledgements

I took General Assembly’s (GA) winter data science class in Santa Monica. This recommendation engine prototype was my final project, and I couldn’t be more grateful to the entire GA team and our class for such an enriching experience. Special shout-out to Dan Wilhelm, our class’s instructor, and Scott LaPlante. I often asked them both for help wrapping my head around Python. This prototype is possible because of their generosity and expertise.

Special thanks as well to Scott, Nathan McGowan, and Jeff Wainwright for reading drafts of this post and for their insightful feedback.

I’m also much obliged to Adam Neary’s Publishing Data Science on Medium: A Step-by-Step Process Using GitHub for helping guide me through producing a sensible Medium post and setting up my first GitHub repository.
