SimGeet: Wynk Music’s Similar Songs Recommender

Rajnish Kumar
Airtel Digital
Jun 12, 2024

Wynk Music, Airtel Digital Limited's music streaming platform, has been a dominant force in India's music scene since 2014. Its library of over 22 million songs across numerous languages caters to diverse tastes, and personalised recommendations help users discover music tailored to their preferences. With over 75 million monthly active users, it leads India's music streaming market.

Introducing SimGeet

Wynk Music's latest innovation, SimGeet, enhances the listening experience by suggesting tracks that complement the currently playing song, enabling smooth transitions. SimGeet is a hybrid recommender system that combines Word2Vec with side features to significantly improve the accuracy of similar-song recommendations. It analyses several facets of the current track, including the artist, era of release, language, and co-occurrence patterns, to recommend songs that match the listener's mood and broader musical preferences. Whether the user is enjoying the latest chart-topper or revisiting a classic favourite, the Similar Songs rail serves as a trusted guide to new music that feels like a natural extension of the current listening session.

Similar songs for the query songs Nadaan Parinde (Rockstar, left), Ram Siya Ram (Adipurush, middle), and Hymn for the Weekend (Coldplay, right)

The previous Similar Song Recommendation model and its limitations

The primary challenge for recommendation models on music streaming platforms like Wynk Music is accurately calculating the similarity between all items. Traditional collaborative filtering (CF) and Word2Vec methods focus solely on the co-occurrence of items in users' historical behaviour, potentially overlooking a broader spectrum of similarities between items. The previous similar-song recommendation model relied on a Word2Vec approach trained only on co-occurrence data, implemented using gensim. This approach led to several issues:

  • Sparsity: Users tend to interact with only a small number of items, making it challenging to train an accurate recommendation model, especially for users or items with few interactions.
  • Cold Start: Wynk continuously uploads thousands of new items each day, for which there are no user behaviours. Predicting user preferences for these items is difficult.

To address these limitations, the following solutions were proposed:

  • A novel approach that leverages users’ historical behaviour to construct an item graph. This graph serves as the basis for learning item embeddings using DeepWalk, a technique that captures the underlying structure of the graph.
  • Unlike CF-based methods, which rely solely on user interactions, this approach incorporates side information such as artist, release year, language, and audio features (MFCCs) to enhance the embedding process.
  • Incorporating side information ensures that items with minimal or no interactions still receive accurate embeddings.

Word2Vec

The Word2Vec model was originally developed to encode the words of a text corpus using a self-supervised shallow neural network. The network derives compact vector representations of words by analysing the context of surrounding words within sentences. The same principle has become a standard approach in recommender systems for generating item embeddings, which are computed from user interaction sequences such as site sessions or playlists.
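As an illustration, a co-occurrence-only item-embedding model of the kind the previous system used can be sketched in a few lines with gensim; the session data here is hypothetical.

```python
from gensim.models import Word2Vec

# Each "sentence" is one listening session: an ordered list of song IDs.
# These sessions are hypothetical placeholders.
sessions = [
    ["song_101", "song_205", "song_318"],
    ["song_205", "song_318", "song_442", "song_101"],
    ["song_442", "song_318"],
]

# Skip-Gram (sg=1) learns an embedding per song purely from co-occurrence.
model = Word2Vec(
    sentences=sessions,
    vector_size=75,   # embedding dimensionality (matches the 75-dim embeddings used later)
    window=5,         # context window within a session
    min_count=1,
    sg=1,             # Skip-Gram rather than CBOW
    negative=5,       # negative sampling
)

# Songs that co-occur in sessions end up close in embedding space.
print(model.wv.most_similar("song_318", topn=2))
```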

DeepWalk

DeepWalk is a method for learning representations of nodes in a network, typically applied in the context of graph data such as social networks, citation networks, and biological networks. It was introduced by Bryan Perozzi, Rami Al-Rfou, and Steven Skiena in 2014.

DeepWalk applies techniques from deep learning, specifically the Skip-Gram model from Word2Vec, to learn low-dimensional representations of nodes in a graph by treating random walks on the graph as sentences and nodes as words. The basic idea is to generate random walks in the graph, treating them as sequences of nodes, and then apply the Skip-Gram algorithm to these sequences to learn embeddings for nodes that preserve their structural properties in the graph.
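A toy sketch of the idea, on a hypothetical adjacency list: generate truncated random walks from every node, then treat each walk as a sentence for the Skip-Gram model above.

```python
import random

# Toy undirected graph as an adjacency list (hypothetical).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(graph, start, walk_length):
    """Uniform random walk of fixed length starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = graph[walk[-1]]
        if not neighbours:
            break
        walk.append(random.choice(neighbours))
    return walk

# Several walks per node; each walk becomes a "sentence" that can be
# fed to a Skip-Gram model (e.g. gensim's Word2Vec, as above).
walks = [random_walk(graph, node, walk_length=10)
         for _ in range(5) for node in graph]
```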

Data Preparation

Source: Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba

Session-Based Behaviour
Data preparation for the SimGeet model involves extracting users' consumption sequences, referred to as sessions. Within a designated time window, we split each user's streaming history into sessions, starting a new session whenever the gap between two consecutive streams exceeds 30 minutes. We then filter out invalid data and abnormal behaviours (a sketch of this sessionisation follows the list below):

  • Songs with a streaming time of less than 30 seconds (considered skips)
  • Users exceeding a certain streaming threshold
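A minimal sketch of this sessionisation, assuming each user's log is a time-sorted list of (timestamp_seconds, song_id, seconds_streamed) tuples; the data layout is hypothetical, while the 30-minute gap and 30-second skip thresholds are the ones stated above.

```python
SESSION_GAP_SECONDS = 30 * 60   # new session after a 30-minute gap
MIN_STREAM_SECONDS = 30         # streams shorter than this are skips

def build_sessions(streams):
    """Split one user's time-sorted (timestamp, song_id, seconds_streamed)
    stream log into sessions, dropping skipped songs."""
    sessions, current, last_ts = [], [], None
    for ts, song_id, seconds in streams:
        if seconds < MIN_STREAM_SECONDS:
            continue  # filter out skips
        if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
            if len(current) > 1:  # single-song sessions yield no transitions
                sessions.append(current)
            current = []
        current.append(song_id)
        last_ts = ts
    if len(current) > 1:
        sessions.append(current)
    return sessions
```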

An item graph is then constructed from users' session histories: two items are connected by a directed edge if they occur consecutively in a session. Drawing on the collaborative behaviour of all Wynk users, each edge is weighted by how often that transition occurs; specifically, the weight of edge (i, j) is the frequency with which item i is immediately followed by item j across all users' behaviour histories.
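Constructing the weighted directed item graph from those sessions then reduces to counting consecutive transitions. A minimal sketch (the self-loop handling is an illustrative choice):

```python
from collections import defaultdict

def build_item_graph(sessions):
    """Edge weight (i, j) = number of times song i is immediately
    followed by song j across all users' sessions."""
    edge_weights = defaultdict(int)
    for session in sessions:
        for i, j in zip(session, session[1:]):
            if i != j:  # skip self-loops from repeated plays (illustrative choice)
                edge_weights[(i, j)] += 1
    return edge_weights
```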

Generate Random Walks
After obtaining the weighted directed item graph, random walks are performed starting from each node. These walks simulate short paths through the graph and capture higher-order similarities between items. The resulting walks are truncated to a fixed length and used as training sequences. We use the Neo4j graph database to generate these random walks.
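In production the walks come from Neo4j; for illustration, an equivalent pure-Python version samples each next hop with probability proportional to the outgoing edge weight, using the edge_weights mapping from the sketch above.

```python
import random
from collections import defaultdict

def walks_from_graph(edge_weights, walks_per_node=10, walk_length=10):
    """Weighted random walks over the directed item graph: the next hop is
    sampled with probability proportional to the outgoing edge weight."""
    out_edges = defaultdict(list)
    for (i, j), w in edge_weights.items():
        out_edges[i].append((j, w))

    walks = []
    for start in out_edges:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                choices = out_edges.get(walk[-1])
                if not choices:
                    break  # dead end: node has no outgoing edges
                nodes, weights = zip(*choices)
                walk.append(random.choices(nodes, weights=weights, k=1)[0])
            walks.append(walk)
    return walks
```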

Side Information
We populate metadata such as language, artist, release year, and MFCC features for each item in the training sequences (a preprocessing sketch follows the list below):

  • Language: Retain the top 20 most frequent languages in the vocabulary
  • Year: Bucketize release years into 5-year intervals
  • Artists: Filter by role types such as main artist, composer, and singer. Retain only artists who appear more than three times in the vocabulary. For each song, keep only the top five most frequent artists
  • MFCC: A vector of the first 19 Mel-Frequency Cepstral Coefficients computed from the track's audio.
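A sketch of this side-information preprocessing, assuming a simple per-song metadata dict; the top-20, more-than-three, and top-five thresholds are the ones stated above, while the data layout and OOV token are illustrative.

```python
from collections import Counter

def bucketize_year(year, bucket=5):
    """Map a release year to a 5-year bucket label, e.g. 1997 -> '1995-1999'."""
    start = (year // bucket) * bucket
    return f"{start}-{start + bucket - 1}"

def build_side_features(songs):
    """songs: dict song_id -> {'language': str, 'year': int, 'artists': [str]}."""
    lang_counts = Counter(meta["language"] for meta in songs.values())
    top_languages = {lang for lang, _ in lang_counts.most_common(20)}

    artist_counts = Counter(a for meta in songs.values() for a in meta["artists"])

    features = {}
    for song_id, meta in songs.items():
        features[song_id] = {
            # languages outside the top 20 collapse to an out-of-vocabulary token
            "language": meta["language"] if meta["language"] in top_languages else "<other>",
            "year": bucketize_year(meta["year"]),
            # keep artists seen more than 3 times, then the top 5 by frequency
            "artists": sorted(
                (a for a in meta["artists"] if artist_counts[a] > 3),
                key=lambda a: -artist_counts[a],
            )[:5],
        }
    return features
```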

SimGeet Model Architecture: Word2Vec with side features

We treat the sequences of nodes produced by the random walks as "sentences" and apply the Skip-Gram model to learn node embeddings. Skip-Gram aims to predict the context (i.e., the neighbouring nodes) of a given node.

Because different types of side information contribute differently to the final representation, we use a weighted average layer to aggregate the embeddings of the side information associated with each item.

Source: Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba
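This aggregation can be sketched in TensorFlow as follows. Note this is a simplified, illustrative layer: the Alibaba paper learns a separate weight per item per feature, whereas this sketch learns one global weight per feature to stay short.

```python
import tensorflow as tf

class WeightedAverageEmbedding(tf.keras.layers.Layer):
    """Aggregate per-feature embeddings with learned, softmax-normalised weights."""

    def __init__(self, num_features, **kwargs):
        super().__init__(**kwargs)
        # One learnable scalar weight per feature; zeros -> uniform weights at init.
        self.feature_logits = self.add_weight(
            shape=(num_features,), initializer="zeros", name="feature_logits"
        )

    def call(self, feature_embeddings):
        # feature_embeddings: (batch, num_features, embed_dim)
        weights = tf.nn.softmax(self.feature_logits)          # (num_features,)
        return tf.einsum("bfe,f->be", feature_embeddings, weights)

# Usage: stack the item's self embedding and its side-information embeddings,
# then aggregate them into one vector per song.
layer = WeightedAverageEmbedding(num_features=5)
stacked = tf.random.normal((32, 5, 75))  # batch of 32 songs, 5 features, 75 dims
golden = layer(stacked)                  # (32, 75)
```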

Skip-Gram model

A Skip-Gram model predicts the context (or neighbours) of a word, given the word itself. Its training objective is to maximise the probability of the context words given each target word. For a sequence of words $w_1, w_2, \ldots, w_T$, the objective is the average log probability

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $c$ is the size of the training context. The basic Skip-Gram formulation defines $p(w_{t+j} \mid w_t)$ using the softmax function

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the target ("input") and context ("output") vector representations of word $w$, and $W$ is the vocabulary size.

Training Process

During training, the model learns meaningful song embeddings from positive and negative pairs. Positive pairs consist of a target song and a context song (a song that appears nearby in a sequence), following the Skip-Gram approach. Negative pairs are randomly sampled from the vocabulary to serve as contrastive examples.
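Pair generation of this kind can be illustrated with tf.keras.preprocessing.sequence.skipgrams, which emits (target, context) pairs labelled 1 along with randomly sampled negative pairs labelled 0; the sequence and vocabulary size here are hypothetical.

```python
from tensorflow.keras.preprocessing.sequence import skipgrams

# A training sequence of integer-encoded song IDs (hypothetical).
sequence = [12, 7, 43, 7, 99, 12]

pairs, labels = skipgrams(
    sequence,
    vocabulary_size=100,   # size of the song vocabulary
    window_size=2,         # how far a context song may be from the target
    negative_samples=1.0,  # one random negative pair per positive pair
)
for (target, context), label in zip(pairs[:4], labels[:4]):
    print(target, context, label)  # label 1 = co-occurring pair, 0 = negative
```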

The TensorFlow Word2Vec model enriches the embedding process by incorporating various side features. The embedding layer transforms each song into a dense vector, capturing its semantic attributes.

To further enhance the embeddings, we use weighted averaging together with the integrated metadata. The model is trained with a carefully selected loss function and optimised with Adam. For offline evaluation of the model's performance, we employ hit ratio and precision metrics.

Model training runs on GPUs for efficiency and spans multiple epochs with carefully tuned hyperparameters.
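A hedged sketch of one training step under these choices (sigmoid cross-entropy over dot-product scores, optimised with Adam as described above); the model and variable names are illustrative, and `golden_model` stands for whatever maps a song's features to its aggregated embedding.

```python
import tensorflow as tf

embed_dim, vocab_size = 75, 100_000
context_embeddings = tf.Variable(tf.random.normal((vocab_size, embed_dim)) * 0.01)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def train_step(golden_model, target_features, contexts, labels):
    """One gradient step on a batch of (target, context, label) examples.
    `labels` is a float32 tensor of 1s (positive pairs) and 0s (negatives)."""
    with tf.GradientTape() as tape:
        golden = golden_model(target_features)                 # (batch, embed_dim)
        ctx = tf.nn.embedding_lookup(context_embeddings, contexts)
        logits = tf.reduce_sum(golden * ctx, axis=-1)          # dot-product score
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
        )
    variables = golden_model.trainable_variables + [context_embeddings]
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```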

Golden Embedding Creation

As part of the model training process, we generate embeddings for each song, encompassing diverse dimensions such as:

  • Self Embedding
  • Artists Embedding
  • Language Embedding
  • Year Embedding
  • Audio Embedding
  • Weight matrix of the audio-processing dense layer

Each of these embeddings has 75 dimensions, encapsulating the nuanced attributes of songs and enabling the model to capture rich semantic relationships. The final item ("golden") embedding for each song_id in the vocabulary is built as a weighted sum of its self embedding and the metadata embeddings.

Similar Songs Creation

An approximate nearest neighbour (ANN) index is constructed over the items' golden embeddings. For each song in the vocabulary, we query the index for the top 200 similar songs. Deduplication logic is then applied, followed by a language filter, and the top 50 songs for each query song are stored in the database. These results are served in the Similar Songs rail.
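The post does not name the ANN library, so as one illustration, here is a cosine-similarity index built with faiss; an exact inner-product index stands in for the approximate one to keep the sketch short, and the embedding matrix is randomly generated.

```python
import numpy as np
import faiss

# golden: (num_songs, 75) float32 matrix of golden embeddings (hypothetical).
golden = np.random.rand(10_000, 75).astype("float32")
faiss.normalize_L2(golden)  # cosine similarity == inner product after normalising

index = faiss.IndexFlatIP(golden.shape[1])  # exact inner-product index
index.add(golden)

# For every song, retrieve its 200 nearest neighbours
# (+1 because the first hit is the song itself).
scores, neighbours = index.search(golden, 200 + 1)
```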

Evaluation: Offline vs Online AB Testing

We employ offline and online A/B testing methodologies to evaluate our model's performance. Offline evaluation involves assessing the model’s accuracy and effectiveness using historical data. Evaluation metrics such as hit ratio and precision are employed to quantify the model’s ability to recommend similar songs accurately.
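For concreteness, a minimal sketch of these two offline metrics, assuming hypothetical dicts mapping each query song to its recommendation list, its held-out next song, and its set of relevant songs.

```python
def hit_ratio_at_k(recommendations, ground_truth, k=10):
    """Fraction of queries whose held-out song appears in the top-k list."""
    hits = sum(1 for q, true_item in ground_truth.items()
               if true_item in recommendations[q][:k])
    return hits / len(ground_truth)

def precision_at_k(recommendations, relevant, k=10):
    """Mean fraction of the top-k recommendations that are relevant."""
    scores = [len(set(recommendations[q][:k]) & rel) / k
              for q, rel in relevant.items()]
    return sum(scores) / len(scores)
```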

Top 10 similar songs for song Raataan Lambiyan (From “Shershaah”)

We conducted online A/B testing to compare the previous gensim-based model with the new hybrid model in real-time user interactions. Comparing the two approaches gives comprehensive insight into the models' efficacy and their impact on user engagement and satisfaction.

In the A/B experiment, the hybrid model showed a relative gain of ~7% in SPS over the gensim-based model.

Get in the groove with India's top-rated music app, available on the App Store and Play Store.

Feel free to reach out to Vipul Gaba and Rajnish Kumar with any feedback on Wynk's SimGeet.

References

  1. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba
  2. DeepWalk: Online Learning of Social Representations
  3. Intuitive understanding of MFCCs
  4. Efficient Estimation of Word Representations in Vector Space
  5. Distributed Representations of Words and Phrases and their Compositionality
  6. Tensorflow Word2Vec Implementation
