3rd place winning solution for building a Movie Recommendation Engine for Hotstar

Phani Srikanth
Data Science | Analytics
Sep 10, 2017

IndiaHacks is HackerEarth's annual flagship event, and one of its tracks was Machine Learning. After two online shortlisting rounds that whittled down more than 500 participants, 15 finalists from all over India were selected for a final onsite round, where we had 9 hours to solve the problem. The following is a brief writeup of my solution.

Problem statement

Hotstar is an on-demand video streaming service in India. It boasts more than 100 million users and more than 35,000 hours of content on its platform. Given the sheer scale of users and content, leveraging user browsing history to personalise content for each user creates a lot of value for Hotstar.

In this challenge, Hotstar asked us to build a recommendation system so that they could personalize the user experience and improve content consumption on their platform (and hence revenues!).

Data

They gave us data about ~0.8 million users and some 3,000 movies, sampled from 11 different genres and 50 different languages spoken in India. Around 2 million (user, movie, watch_duration) interactions were provided: a significantly large dataset at first look. We were asked to predict the top 20 movies for each of ~6,200 users, against which our models would be evaluated, with NDCG@20 as the scoring criterion.

Approach

A quick exploratory analysis showed that

  • More than 90% of the users in the training dataset had watched fewer than 5 movies. We call them the rare_users.
  • More than 70% of the users in the test dataset fall into the rare_users category.

The remaining ~10% of non rare_users came out to around 80,000 users, accounting for around 1.3M user-movie interactions. These two observations formed the foundation of the modeling strategy.
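The split above can be sketched with a simple group-by on the interaction log. A minimal example, assuming a pandas DataFrame with hypothetical `user_id`/`movie_id` columns (the real competition schema may differ):

```python
import pandas as pd

# Toy interaction log: one row per (user, movie) watch event.
# Column names and values are illustrative, not the Hotstar schema.
interactions = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3, 3, 3, 3],
    "movie_id": [10, 11, 10, 10, 11, 12, 13, 14, 15],
})

# Count distinct movies per user and split on the threshold of 5.
movies_per_user = interactions.groupby("user_id")["movie_id"].nunique()
rare_users = movies_per_user[movies_per_user < 5].index
non_rare_users = movies_per_user[movies_per_user >= 5].index

print(sorted(rare_users))      # users with fewer than 5 movies
print(sorted(non_rare_users))  # users with 5 or more movies
```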

Modeling

My modeling strategy had two parts, one each for the rare_users and the non rare_users segments.

  • rare_users: Since we have little data for these users, I used a heuristics-based model that recommends the top movies ranked by the ratio of watch_time to the duration of the movie. This ratio is the fraction of the movie actually watched, which I treated as a proxy for the user's intent towards the movie. This model contributed more than 95% of the final score.
  • non rare_users: For the ~600 such users in the test set, drawing on the 80,000 such users in the train set, I built a collaborative filtering model using Keras with a TensorFlow backend. This approach learns an embedding vector for each user and each movie. Augmenting the model with genre and language features alongside the embeddings also helped.
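The rare_users heuristic above boils down to ranking movies by the average fraction watched. A minimal pandas sketch, with made-up column names and toy values:

```python
import pandas as pd

# Toy watch log; columns are assumptions, not the competition schema.
watches = pd.DataFrame({
    "movie_id":   [10, 10, 11, 11, 12],
    "watch_time": [50, 40, 90, 80, 10],       # minutes watched
    "duration":   [100, 100, 100, 100, 100],  # movie length in minutes
})

# Fraction of the movie actually watched: a proxy for user intent.
watches["pct_watched"] = watches["watch_time"] / watches["duration"]

# Rank movies by mean fraction watched; recommend the global top-20
# to every rare user, who has too little history for anything else.
top_movies = (watches.groupby("movie_id")["pct_watched"]
                     .mean()
                     .sort_values(ascending=False))
print(top_movies.head(20).index.tolist())
```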

Together, the heuristics and the CF model run in under a minute on an AWS GPU instance.
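The competition model was written in Keras with a TensorFlow backend; as a dependency-free illustration of the same embedding idea, here is a minimal matrix-factorization sketch in NumPy. All sizes, names, and interaction triples are made up, and plain SGD stands in for the actual Keras training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 users, 5 movies, 3-dimensional embeddings.
n_users, n_movies, dim = 4, 5, 3
# (user, movie, fraction_watched) triples used as implicit ratings.
data = [(0, 1, 0.9), (0, 2, 0.8), (1, 1, 0.7), (2, 3, 0.95), (3, 0, 0.6)]

# One learnable embedding vector per user and per movie; their dot
# product is trained to approximate the fraction of the movie watched.
U = 0.1 * rng.standard_normal((n_users, dim))
M = 0.1 * rng.standard_normal((n_movies, dim))

lr = 0.1
for _ in range(3000):  # plain SGD on squared error
    for u, m, y in data:
        err = U[u] @ M[m] - y
        u_grad = err * M[m]
        m_grad = err * U[u]
        U[u] -= lr * u_grad
        M[m] -= lr * m_grad

# Score every movie for every user and rank them; in the real model,
# these scores would come from the trained Keras network instead.
scores = U @ M.T
top_for_user0 = np.argsort(-scores[0])
print(top_for_user0.tolist())
```

The genre and language features the bullet mentions would enter as extra embedding inputs concatenated before the final scoring layer; this sketch omits them.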

How this approach can evolve

This approach is quite clean and works on real-world datasets, since the cold-start problem is common in recommendation engines and the heuristics-based model addresses it. As users consume more and more content, they transition into the non rare_users bucket and become part of the training process.

The collaborative filtering approach scales with the amount of data: add more GPU memory or more GPUs, and the approach itself remains the same.

Results

The benchmark submission scored about 0.07 NDCG, whereas the first version of this approach scored around 0.201. The heuristics-based model can be further improved to 0.24 (almost a 3x lift over the baseline) with smart sampling (dropping low-frequency movies and movies with a low percentage watched) before arriving at the final predictions.
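Since NDCG@20 is the scoring criterion, a sketch of the metric may help. This is one common formulation; the exact gain and discount used by the competition scorer are assumptions here:

```python
import numpy as np

def ndcg_at_k(recommended, relevant, k=20):
    """NDCG@k for one user.

    recommended: ranked list of movie ids, best first.
    relevant: dict mapping movie id -> relevance (e.g. fraction watched).
    """
    # Discounted cumulative gain of the predicted ranking.
    gains = [relevant.get(m, 0.0) for m in recommended[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    # DCG of the ideal ranking (relevances sorted descending).
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# An imperfect ranking scores below 1.0; a perfect one scores 1.0.
print(ndcg_at_k([10, 11, 12], {11: 1.0, 12: 0.5}))
print(ndcg_at_k([11, 12, 10], {11: 1.0, 12: 0.5}))
```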

This model got me to the #1 standing on the public leaderboard. After a short 5-minute presentation to the jury, the winners were announced: I finished 3rd in the hackathon and carried home good prize money! :D The other winners had similar approaches, but smart heuristics here and there had a decisive impact on the final results.

The top 10 in the Machine Learning track; I am on the extreme left. I finished as 2nd runner-up.

Conclusion

This edition of IndiaHacks was quite an experience in itself. Hotstar also organized another challenge in round 1, and that was one of the richest datasets I've worked with. Their data collection and storage appear quite impressive and well aligned with the long-term idea of having analytics and machine learning drive some of their products.

However, the competition setup could have been much better, be it the data sampling or the timeframe of the challenge. With more data and a longer timeframe, this problem would have been far more interesting and could have delivered much more value to Hotstar.

Anyway, I hope the folks at HackerEarth will take this feedback into consideration, and I look forward to more such interesting datasets and problems that create impact.


www.phani.io | I enjoy working on Machine Learning, attending live concerts and following Test Cricket.