Personalising category pages

Giridhar Samathipudi
Team Pratilipi
4 min read · Oct 11, 2018

Introduction to Pratilipi:

Pratilipi is the largest online self-publishing platform for Indian languages. The platform has about 50,000 authors who have published over 300,000 pieces of content in 8 Indian languages, and these stories have been cumulatively read over 150 million times (15 million times in the last month alone).

About category page:

The contents are categorised/tagged by authors and grouped under pages such as Love, Women, Horror, and Suspense.

Why is a static list not enough? Why is it a problem?

Earlier, our language experts curated the list for every category page once every two days. The lists were the same for everyone and limited in size, with no personalisation. There are three problems with this approach:

  1. It doesn’t take the reader’s tastes and preferences into account,
  2. It is time-consuming and error-prone,
  3. There is a lag between content being published and appearing on the list page.

Why is it important to solve this problem?

Our readers want to read high-quality, uniquely personalised, and fresh content, which is not possible with manual curation.

Delivering quality, personalised content to the reader is not the only goal of a list page. Two other equally important objectives are providing visibility for new authors and increasing engagement and interactions on the platform, for example user-author follow connections and reviews/ratings. So we decided to automate our list pages keeping these three key objectives in mind.

How to generate a quality item list

Our strategy for generating the list is divided into multiple steps, as explained below.

1. Filters

We identify the parameters that can be used to classify and provide a gist of the content.

Example:

Average rating (what quality?)

Read count (how popular?)

Word count (which set of users would like to read it?)

Rating count (how popular?)

Category (which genre?)

Published date (how fresh?) etc.
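The filters above can be modelled as simple, reusable predicates over a content record. The sketch below is illustrative only; the field names and the `Content` class are assumptions for this post, not Pratilipi's actual schema.

```java
import java.util.function.Predicate;

class Filters {
    // Minimal content record mirroring the filters listed above.
    // All field names here are hypothetical, not Pratilipi's real schema.
    static class Content {
        final double avgRating;   // what quality?
        final int readCount;      // how popular?
        final int wordCount;      // which set of users would like to read it?
        final int ratingCount;    // how popular?
        final String category;    // which genre?
        final int ageInDays;      // how fresh?

        Content(double avgRating, int readCount, int wordCount,
                int ratingCount, String category, int ageInDays) {
            this.avgRating = avgRating;
            this.readCount = readCount;
            this.wordCount = wordCount;
            this.ratingCount = ratingCount;
            this.category = category;
            this.ageInDays = ageInDays;
        }
    }

    // Each filter is a predicate that can later be combined into buckets.
    static Predicate<Content> minAvgRating(double x)    { return c -> c.avgRating >= x; }
    static Predicate<Content> minRatingCount(int y)     { return c -> c.ratingCount >= y; }
    static Predicate<Content> minWordCount(int z)       { return c -> c.wordCount > z; }
    static Predicate<Content> inCategory(String g)      { return c -> c.category.equals(g); }
    static Predicate<Content> publishedWithin(int days) { return c -> c.ageInDays <= days; }
}
```

Expressing each filter as a `Predicate` keeps them composable: the next step, buckets, is just a conjunction of these predicates.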

2. Create buckets

Buckets are algorithms built by mixing and matching the filters to identify a set of contents. Each bucket is formulated to contribute to one or more of the objectives.

Example:

Bucket 1: Added to library

Filter: Contents in the user’s library whose read percentage is less than X%

Result: Improvement in read completion rate

Bucket 2: Unread contents published by authors the current user follows

Filters: Avg_Rating >= X, Rating_count >= Y, Word_count > Z

Result: Improvement in user engagement.

Bucket 3: Contents published by popular authors that the user has not yet read

Filters: Avg_Rating >= X, Rating_count >= Y, Word_count > Z

Result: Improvement in user-author relations through follow actions

Bucket 4: Latest content, published in the last X days

Filter: Word_count > Z

Result: Improves the freshness of the listing pages

Bucket 5: Serialised contents

Filter: All unread `first chapters` and the `next chapter to read` from every series

Result: Improves the overall time a user spends reading
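A bucket, then, is a conjunction of filters plus a per-user exclusion set. The sketch below shows Bucket 2 under that reading; the class, its fields, and the thresholds X, Y, Z are all illustrative assumptions, not the production implementation.

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class Buckets {
    // Hypothetical content record for this example.
    static class Content {
        final String id;
        final String authorId;
        final double avgRating;
        final int ratingCount;
        final int wordCount;

        Content(String id, String authorId, double avgRating, int ratingCount, int wordCount) {
            this.id = id;
            this.authorId = authorId;
            this.avgRating = avgRating;
            this.ratingCount = ratingCount;
            this.wordCount = wordCount;
        }
    }

    // Bucket 2: unread contents by followed authors that pass the quality filters.
    // x, y, z correspond to the X, Y, Z thresholds in the post (values are a business call).
    static List<Content> bucketFollowedAuthors(List<Content> all,
                                               Set<String> followedAuthors,
                                               Set<String> readIds,
                                               double x, int y, int z) {
        Predicate<Content> quality =
                c -> c.avgRating >= x && c.ratingCount >= y && c.wordCount > z;
        return all.stream()
                  .filter(c -> followedAuthors.contains(c.authorId)) // by a followed author
                  .filter(c -> !readIds.contains(c.id))              // not yet read
                  .filter(quality)                                   // Avg_Rating, Rating_count, Word_count filters
                  .collect(Collectors.toList());
    }
}
```

The other buckets differ only in which predicates are conjoined, so they can share the same filter building blocks.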

3. Personalised list for users

Personalisation can be based on multiple inputs from the user. We currently look at the user’s previous reading history, authors they follow and their library actions.

For example, all content that the user has read completely is usually excluded from most buckets.

Preparing the final list for the response

Final list = (list from buckets) − (list of contents already read)

Fetch the final list from the cache. If it is not cached yet, or the cache has expired, prepare the list for the user and category combination, shuffle it, and put it into the cache before sending the response.

How to generate the final list, and how long to cache it, is purely a business call. For example, club all the buckets together and shuffle the combined list before caching, or prioritise the buckets and show the list in that order.

What we have implemented is to shuffle the entire list, so that a user sees a different set of contents after every X amount of time.
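Putting the pieces together, the response path described above might look like the sketch below. The in-memory map stands in for Redis (a real cache would also carry a TTL), and the deterministic seed is just one way to get a "different set after every X amount of time"; both are assumptions for illustration.

```java
import java.util.*;

class PersonalisedList {
    // Stand-in for Redis: key -> cached list. A real cache entry would also have a TTL.
    private final Map<String, List<String>> cache = new HashMap<>();

    // Final list = (union of bucket lists) - (contents already read), shuffled.
    static List<String> buildList(List<List<String>> buckets, Set<String> readIds, long seed) {
        LinkedHashSet<String> union = new LinkedHashSet<>();
        for (List<String> bucket : buckets) {
            union.addAll(bucket); // club all the buckets together
        }
        union.removeAll(readIds); // subtract already-read contents

        List<String> result = new ArrayList<>(union);
        // Seeding from the current time window would keep the order stable for
        // the cache lifetime and change it afterwards.
        Collections.shuffle(result, new Random(seed));
        return result;
    }

    // Fetch from the cache; on a miss, build the list, store it, and return it.
    List<String> getList(String userId, String category,
                         List<List<String>> buckets, Set<String> readIds, long seed) {
        String key = userId + ":" + category;
        return cache.computeIfAbsent(key, k -> buildList(buckets, readIds, seed));
    }
}
```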

Technical overview

We use neo4j as the database and Redis for caching; the service is written in Java.

(Figure: a high-level view of the schema in the neo4j database.)

Constraints & Mistakes to avoid

Real time & Caching

The more real-time the list, the higher the computation cost; the longer the cache, the staler the list.

Constructing so many buckets on each request is computationally intensive; trying to respond with real-time data for every request is a disaster. The solution is to cache the buckets. There must be the right trade-off between data served in real time and data served from the cache.

The computations not only incur financial costs but also hamper the user experience by increasing response latencies.

Non logged in users

In the app, login is mandatory, so user-specific filters can be applied to deliver more personalised content. On the web, where login is not mandatory, personalisation remains a constraint.

MECE: Mutually Exclusive, Collectively Exhaustive

When applying filters, do not use two filters that by nature have a similar classifying effect on the data.

For example, the filters rating_count (the number of times a content has been rated) and read_count (the number of times it has been read) yield similar results because their values are proportional. Avoid such combinations of filters.
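To see why such correlated filters are redundant, assume read_count is roughly proportional to rating_count, say one rating per ten reads. Filtering on both then selects the same set as filtering on either one. The numbers below are made up for illustration.

```java
import java.util.*;
import java.util.stream.Collectors;

class MeceDemo {
    // Keep only the content ids whose count meets the threshold.
    static Set<String> filterByThreshold(Map<String, Integer> counts, int min) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= min)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // ratingCount ~ readCount / 10 for every item, so the filters are proportional.
        Map<String, Integer> readCounts = Map.of("a", 1000, "b", 400, "c", 50);
        Map<String, Integer> ratingCounts = Map.of("a", 100, "b", 40, "c", 5);

        Set<String> byReads = filterByThreshold(readCounts, 100);
        Set<String> byRatings = filterByThreshold(ratingCounts, 10);

        // Both filters pick exactly {a, b}: the second filter adds no information.
        System.out.println(byReads.equals(byRatings)); // prints true
    }
}
```

Because the two predicates carve the data the same way, keeping both only adds compute without changing the result set.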

Next

Ask users what they like to read

Ranking contents, so that there is just one filter to act on

Similarity model and collaborative filtering.

My two cents:

Don’t be over-ambitious and try to deliver the best on the first try; find the easiest and quickest way to build the solution. We are still far from using similarity models, collaborative filtering, or any such recommendation algorithms.

Listen to users. Log and collect metrics, and evaluate how users react to the experiments.

Thanks

Ranjeet Prathap Singh (CEO of Pratilipi) for helping in building personalised category pages.

Michael Hunger & Andrew Bowman from the neo4j team for helping optimise queries and modify neo4j configurations to handle the load.
