Project Octopus: A smarter curation engine

Gokul Nath Sridhar
Published in Reads. · 3 min read · Aug 12, 2015

On March 26th, I woke up when my mobile chimed to announce the arrival of my morning Tenreads digest in my inbox. I swiped right and waited with groggy eyes for the email to load, and when it did, I wished it hadn’t. I was following Music on Tenreads, and all three stories in my email digest were about Zayn Malik quitting One Direction. If you are a Tenreads user, you have probably faced this yourself.

The problem of articles on the same topic flooding feeds arose because our systems were intelligent enough to work out, mathematically speaking, how important a particular article was, but they had no way of knowing what exactly the article was talking about. Therefore, when publishers wrote articles on a trending topic and those articles became popular, Tenreads would dutifully but unintentionally fill up a number of slots in a feed with stories on the same subject. Internally, we called this the Duplication Duplication Problem. We first noticed it when Zayn Malik exited One Direction in March, and again in July, when Jules Bianchi died of injuries sustained in a car crash.

A few hours ago, Larry Page announced the creation of a parent company for Google, named Alphabet, and promoted India-born Sundar Pichai to CEO of Google. And the Internet went crazy! If your Facebook and Twitter feeds are anything like mine, you are probably nauseated by the number of stories that have surfaced on this topic. Within minutes of the announcement, every major publisher across the world scrambled to cover the story, with squirm-worthy curry and rasam references, thanks to his Madras roots. Without doubt, Google’s Alphabet announcement is the most important news in Technology today. However, the barrage of articles published on this event eclipsed other news items that were definitely less important, but important nevertheless.

I opened my Tenreads feed this morning to find six stories about Google’s announcement but not a single one about Xiaomi assembling smartphones in India or, more sinister still, the HTC One Max security debacle, which allowed any app to access a user’s fingerprint data.

In April, we set our sights on this problem with renewed focus, and Project Octopus was born. Octopuses are organisms that have a single head but eight tentacles waving around, just as a single topic has eight stories written about it. Our goal with Project Octopus was to find the head.

And today, we are incredibly thrilled to report that we are making great progress. Towards the end of June, we successfully used Natural Language Processing (NLP) to extract important features from every article that enters the Tenreads database. After running a number of other pattern recognition and clustering algorithms, we group similar articles into a bucket and assign the bucket a cumulative rank, measured as a function of the individual ranks of each article in it. We then pick just the best article from the lot as the representative story of the bucket.

In simpler terms, Tenreads finds what articles are talking about, groups articles talking about the same subject, and surfaces just the best article among the lot of similar articles.
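To make the idea concrete, here is a minimal sketch of the bucketing step in Python. It is not the actual Octopus code: the word-count features, the cosine similarity threshold, the greedy grouping, and the use of a plain sum as the cumulative rank are all assumptions made for illustration, and the sample titles and rank values are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list stands in for real NLP preprocessing.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and", "as", "is", "its"}


def features(title):
    """Crude stand-in for feature extraction: lowercase word counts."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def bucket_articles(articles, threshold=0.3):
    """Greedily place each (title, rank) pair in the first similar bucket."""
    buckets = []
    for title, rank in articles:
        vec = features(title)
        for bucket in buckets:
            # Compare against the first article that opened the bucket.
            if cosine(vec, bucket["seed"]) >= threshold:
                bucket["items"].append((title, rank))
                break
        else:
            buckets.append({"seed": vec, "items": [(title, rank)]})
    return buckets


def build_feed(articles, threshold=0.3):
    """Return one (best_title, cumulative_rank, size) entry per bucket."""
    feed = []
    for bucket in bucket_articles(articles, threshold):
        items = bucket["items"]
        # Assumption: the cumulative rank is the sum of individual ranks.
        cumulative = sum(rank for _, rank in items)
        best_title, _ = max(items, key=lambda item: item[1])
        feed.append((best_title, cumulative, len(items)))
    return sorted(feed, key=lambda entry: entry[1], reverse=True)


# Hypothetical titles and rank values, for illustration only.
SAMPLE_ARTICLES = [
    ("Google creates parent company Alphabet", 90),
    ("Sundar Pichai named Google CEO under Alphabet", 80),
    ("Alphabet: Google announces new parent company", 70),
    ("Xiaomi begins assembling smartphones in India", 60),
    ("HTC One Max flaw exposes fingerprint data", 50),
]

if __name__ == "__main__":
    for title, cumulative, size in build_feed(SAMPLE_ARTICLES):
        print(f"{cumulative:>4}  ({size} stories)  {title}")
```

With this sample, the three Alphabet stories collapse into a single feed slot, which leaves room for the Xiaomi and HTC stories. Summing the ranks means a subject that many publishers cover rises to the top as a bucket, while only its highest-ranked article is shown.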

For instance, we found that there were about 28 articles from Wall Street Journal, TechCrunch, Verge, Next Web, etc. covering the Google story after the Alphabet announcement. As an acid test for Project Octopus, we fed all Technology-related articles pulled this morning into the new curation engine to generate the topic’s feed, and this was the result. You can see how the new feed generated by Octopus differs from the feed generated by our old engines.

At Tenreads, our goal is really simple: to remove information overload from people’s lives. Octopus, a smarter bucketing and curation engine that determines the importance of a particular subject based on how many people are writing about it, is just one step forward in that endeavour.

All credit for the work done on Octopus goes to Ganesh Srinivas, our Data Science intern, who worked with us on crafting the algorithms to get this system running with low error margins, and Santosh Venkatraman, our Junior Python Developer, who helped Ganesh convert the algorithms into production-ready code. We hope to roll out the new curation engine, as well as throw the doors of Tenreads open to everyone on web and Android, over the course of this week. We can’t wait to see the stories you discover on Tenreads.

Happy reading!

