Wynk Radio: The ML Behind India’s Seamless Streams

n0obcoder
Airtel Digital
Published in Airtel Digital
Nov 2, 2023 · 8 min read

Launched in 2014, Wynk Music, a part of Airtel Digital Limited, is currently the #1 music streaming app in India in terms of downloads and daily active users. The music catalog offers 22 million+ songs in major global & Indian regional languages. With 75 million monthly active users, and more than 100 million installs, Wynk Music continues to be India’s favorite music streaming app!

In this blog, we will look at the role of Machine Learning in building a recommender system that learns a user's music preferences and recommends the best-suited songs to them. The blog also sheds light on the challenges that come along the way, and of course, their solutions.

Wynk Radio

It is the brand-new player in the Wynk app, powered by a deep-learning-based Next-Song-Prediction model. The new player brings an entirely new listening and browsing experience to users.

Its contemporary swipe-based navigation, coupled with our recommender system, makes finding favourite tunes effortless.

With uninterrupted playback, music flows smoothly, creating a cohesive and user-friendly experience. Even more impressively, Wynk Radio offers automated song recommendations that align with the changing preferences of the users, without requiring the entire listening history of the user.

Rendering the manual selection of songs obsolete, Wynk Radio takes care of song preferences for users, letting them enjoy a tailored and uninterrupted music journey.

Session-Based Recommender System

It is a type of recommendation system designed to provide personalised recommendations for users based on their short-term behaviour or the current session.

Unlike traditional recommendation systems that rely on long-term user histories and preferences, session-based recommender systems focus on, and adapt to, the user's immediate context and behaviour within a single session.

Next-Song-Prediction Model

Being the Session-Based Recommender System that generates all the recommendations, NSP is truly the heart of Wynk Radio.

It is an LSTM-based deep learning model that captures a user's preferences from the songs consumed in the session. It predicts the next song most likely to be consumed in an ongoing session by understanding and adapting to the user's changing preferences in that session.

The NSP model showed a ~10% boost in streams per subscriber over the previous ML model in production

Data Preparation for NSP Model

A session is a sequence of song streams with no gap of inactivity longer than 30 minutes. This sessions dataset is used to prepare the training and evaluation sets for the NSP model.

Four weeks of streaming data are used to prepare the training set, and the next few days are used to prepare the evaluation set, ensuring that there is no data leakage in the evaluation step.
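
Sessionization itself is straightforward. The sketch below is a toy illustration (not Wynk's actual pipeline): it splits one user's stream log into sessions whenever the gap between consecutive streams reaches the 30-minute inactivity threshold described above.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(streams):
    """streams: list of (timestamp, song_id) tuples, sorted by timestamp."""
    sessions, current, last_ts = [], [], None
    for ts, song_id in streams:
        # a gap of 30+ minutes of inactivity starts a new session
        if last_ts is not None and ts - last_ts >= SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(song_id)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

streams = [
    (datetime(2023, 11, 1, 9, 0), "s1"),
    (datetime(2023, 11, 1, 9, 3), "s2"),
    (datetime(2023, 11, 1, 10, 0), "s3"),  # 57-minute gap -> new session
]
print(sessionize(streams))  # [['s1', 's2'], ['s3']]
```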

There is a set of very popular songs with very high stream counts. Such popular songs clearly dominate the other songs in terms of occurrences, which results in a popularity bias when the NSP model is trained on this data: the model learns to predict the most popular songs, ignoring other songs that might be equally or even more relevant in the session. This problem is tackled through data sampling techniques.
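
One common recipe is frequency-based subsampling in the spirit of word2vec's frequent-word subsampling; the sketch below is an illustrative assumption, not Wynk's actual pipeline, and the threshold value is arbitrary.

```python
import random
from collections import Counter

def keep_probability(freq, threshold=1e-4):
    """freq: this song's fraction of all streams in the log.

    Songs above the frequency threshold are kept with a probability
    that shrinks as their share of total streams grows.
    """
    if freq <= threshold:
        return 1.0
    return (threshold / freq) ** 0.5

def subsample(stream_log, threshold=1e-4, rng=None):
    rng = rng or random.Random(42)
    counts = Counter(stream_log)
    total = len(stream_log)
    return [s for s in stream_log
            if rng.random() < keep_probability(counts[s] / total, threshold)]
```

Training sequences built from the subsampled log contain fewer occurrences of chart-toppers, so the classification layer sees a flatter label distribution.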

Architecture of Next-Song-Prediction Model

Sequences of song IDs are the inputs to the model. There are two such sequences for a session: one consists of the songs that have been consumed (stream time ≥ 30 sec), and the other consists of the skipped songs (stream time < 30 sec).

These IDs are converted to song embeddings by the Embedding Layer. The embeddings are then fed to the LSTM layer along with the hidden states. These hidden states are vectors representing the complete sequence up to a particular time-step.

The output of the LSTM layer is passed through a series of Dense Layers, the last of which is the classification layer that assigns probability scores to all the songs in the vocabulary.
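
A minimal sketch of this architecture in TensorFlow/Keras, assuming illustrative vocabulary and layer sizes (the real model's dimensions are not stated in this post):

```python
import tensorflow as tf

# Illustrative sizes, not Wynk's actual configuration
VOCAB, EMB, HID = 10_000, 64, 128

# Two input sequences per session: consumed songs and skipped songs
consumed = tf.keras.Input(shape=(None,), dtype="int32", name="consumed_ids")
skipped = tf.keras.Input(shape=(None,), dtype="int32", name="skipped_ids")

# Shared embedding layer; id 0 is reserved for padding
embed = tf.keras.layers.Embedding(VOCAB, EMB, mask_zero=True)
h_consumed = tf.keras.layers.LSTM(HID)(embed(consumed))
h_skipped = tf.keras.layers.LSTM(HID)(embed(skipped))

# Dense layers ending in a softmax over the whole song vocabulary
x = tf.keras.layers.Concatenate()([h_consumed, h_skipped])
x = tf.keras.layers.Dense(256, activation="relu")(x)
probs = tf.keras.layers.Dense(VOCAB, activation="softmax")(x)

model = tf.keras.Model([consumed, skipped], probs)
```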

What is LSTM?

Standing for Long Short-Term Memory, it is a type of recurrent neural network that is good at understanding patterns in sequences of data. It makes predictions based on what it has seen before and what's happening now, which makes it very useful for tasks like predicting the next word in a sentence or the next song in a session.

LSTM can theoretically handle sequential inputs of any length. It maintains internal states that are updated with the input at every time-step. These states are vectors of floating-point numbers.

The LSTM layer forgets the irrelevant information, updates the hidden states with new inputs, and passes the updated hidden states on to the next time-step.

The Softmax Function

It takes an input vector and transforms it into a probability distribution, a set of values that sums to 1, which makes it a natural fit for classification tasks.

The computation involves exponentiating each element in the input vector and normalising by the sum of the exponentials. Like every great thing, Softmax too has its limitations.

As the size of vocabulary increases, the computational cost of these operations grows significantly, leading to slower training and inference times.
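
In code, the computation (with the standard max-subtraction trick for numerical stability) looks like this:

```python
import math

def softmax(logits):
    # subtract the max logit so the exponentials cannot overflow
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1, and larger logits get larger probabilities
```

The cost of this loop grows linearly with the vocabulary size, which is exactly the problem for a catalog of millions of songs.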

Sampled-Softmax to the Rescue!

Sampled-Softmax fixes the limitation of plain softmax function. It is a technique used to address the computational and memory inefficiencies associated with using the full softmax function in the case of a large vocabulary size.

This is done by approximating the full softmax using only a small subset of classes (i.e. song IDs) from the entire vocabulary. It is a faster alternative to the full softmax, with reasonable accuracy.
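
A toy sketch of the idea follows. The function names and the uniform negative sampling are illustrative assumptions; production implementations such as TensorFlow's `tf.nn.sampled_softmax_loss` also correct for the sampling distribution.

```python
import math
import random

def sampled_softmax_loss(logit_fn, true_id, vocab_size, num_sampled, rng):
    """Normalise over the true song plus a small random sample of
    negatives, instead of the full vocabulary."""
    negatives = rng.sample(
        [i for i in range(vocab_size) if i != true_id], num_sampled)
    logits = [logit_fn(c) for c in [true_id] + negatives]
    m = max(logits)                       # stabilise the exponentials
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]              # -log p(true song | candidates)
```

Here `logit_fn` stands in for the model's score for a candidate song, e.g. the dot product between the session vector and that song's output embedding.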

Training and Evaluation of NSP Model

The model training is powered by the formidable processing capabilities of multiple NVIDIA GPUs, enabling us to accelerate the training process and achieve impressive results.

Every machine learning model needs to be evaluated on an evaluation set using suitable metrics. The main metrics used to evaluate the NSP model are Short-term Prediction Success (SPS) and Recall.

How is NSP Model Evaluated?

The evaluation set consists of tens of thousands of test sequences, or user sessions. These sequences are split at the centre: the first half of each sequence is the input to the model, and the second half is the ground truth.

A test sequence with 8 items is split into two halves, input and ground-truth

Short-term Prediction Success (SPS)

SPS captures the ability of the model to predict the next item in the sequence, the ground truth. SPS is 1 if the next item (i.e. the first item in the second half of the test sequence) is present in the top-k recommendations, else 0.

For example, if the top 5 predictions from the model are [1, 3, 5, 7, 9], and the next item in the sequence is 7, then the SPS@5 is 1.
The final SPS is the average SPS for all the sequences in the eval set.
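
SPS@k is a few lines of code:

```python
def sps_at_k(top_k_preds, next_item):
    # 1 if the true next item is among the top-k recommendations, else 0
    return 1 if next_item in top_k_preds else 0

def mean_sps(predictions_and_truths):
    # average SPS over all (top-k predictions, next item) pairs in the eval set
    scores = [sps_at_k(p, t) for p, t in predictions_and_truths]
    return sum(scores) / len(scores)

print(sps_at_k([1, 3, 5, 7, 9], 7))  # 1
print(sps_at_k([1, 3, 5, 7, 9], 8))  # 0
```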

Recall

It is defined as the number of correct recommendations divided by the number of unique items in the second half of the test sequence, the ground truth.

For example, if the top 5 predictions from the model are [1, 3, 5, 7, 9], and the second half of the sequence is [3, 3, 8, 10], then the recall@5 is 0.33.
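
Recall@k, matching the example above:

```python
def recall_at_k(top_k_preds, ground_truth):
    # fraction of unique ground-truth items that appear in the top-k
    unique_truth = set(ground_truth)
    hits = unique_truth & set(top_k_preds)
    return len(hits) / len(unique_truth)

# unique ground truth is {3, 8, 10}; only 3 is in the top-5 -> 1/3
print(round(recall_at_k([1, 3, 5, 7, 9], [3, 3, 8, 10]), 2))  # 0.33
```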

Deploying the Model

NSP is written in TensorFlow, which makes it compatible for deployment with TensorFlow Serving.

TensorFlow Serving offers a lot of benefits:

  • Batching Parameters: Lets you configure request batching for a serving deployment, improving hardware utilisation.
  • Scalability: Designed for high-throughput, low-latency serving.
  • Model Versioning: Supports easy management and deployment of different versions of machine learning models.
  • Flexible Deployment: Allows serving multiple models simultaneously, enabling A/B testing and experimentation.
  • Platform Agnostic: Works across different platforms and deployment environments, making it versatile for various deployment scenarios.

Problem of Predictability and its Solution

Because of the nature of the model, the next song is predicted by picking the item with the highest probability. This means that for a fixed input, there will always be a fixed output: the output of the model is deterministic.

We use a technique called Temperature-Based Sampling to introduce a little variety into the model's outputs. Temperature is like a knob that can be adjusted to change the probability distribution coming from the softmax function and, in turn, the character of the model's outputs.

Higher temperature values (> 1.0) make the output more random and varied, leading to surprising and creative recommendations. A very high temperature can flatten the probability distribution towards a uniform one, adversely affecting the quality of the recommendations. Lower temperature values (< 1.0) make the output more focused and deterministic, producing more predictable and controlled content.

The temperature needs to be tuned to the desired balance of creativity and coherence.
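
A minimal sketch of temperature-based sampling over the model's output logits:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    # divide the logits by the temperature before applying softmax;
    # low temperature sharpens the distribution, high temperature flattens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # draw one index according to the tempered distribution
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With a very low temperature this collapses to argmax (always the same next song); with a higher temperature the model occasionally surfaces lower-ranked but still relevant songs.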

Long-Tail and OOV Items Handling

The items in the vocabulary with low stream counts are called long-tail items. Items that are not present in the model's vocabulary are called out-of-vocab (OOV) items.

If long-tail or OOV items are given as input to the NSP model, recommendations are generated by a fallback: a similar-songs model based on word2vec.
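
A toy sketch of such a fallback (the embedding values are made up for illustration; a real similar-songs model would use word2vec vectors trained on session co-occurrence):

```python
import math

# Toy song-embedding table; real vectors come from a trained word2vec model
EMBEDDINGS = {
    "song_a": [0.9, 0.1],
    "song_b": [0.8, 0.2],
    "song_c": [0.1, 0.9],
}

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def similar_songs(seed_vec, k=2):
    # nearest neighbours of the seed song in embedding space
    scored = sorted(EMBEDDINGS.items(),
                    key=lambda kv: cosine(seed_vec, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]
```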

Use of Skip-Signals

Streaming a song for less than 30 seconds is considered a skip. A skipped song shows negative intent from the user. Despite indicating negative intent, a skip is a very strong signal for understanding a user's preferences in a session.

NSP-with-skips showed a relative gain of ~5% in SPS over NSP without skips

Other Possible Use Cases

Song Radio

A single seed song can be used to start a song radio that generates a mix of songs well-suited to be consumed in a session.

Recommendations for User-Generated Playlists

On Wynk, users can create and modify their own playlists. NSP can be used to recommend songs to add to such playlists, making it easier for users to expand their existing playlists while maintaining their theme or feel.
