How Spotify Uses Semantic Search for Podcasts

Jalaj Agrawal
14 min read · Aug 16, 2022


The market for podcasts has grown tremendously, with the number of global listeners increasing by roughly 20% annually in recent years.

Leading the charge in podcast adoption is Spotify. In a few short years, they have become the undisputed leader in podcasting. Despite only entering the game in 2018, by late 2021 Spotify had already overtaken Apple, the long-reigning leader in podcasts, with more than 28M monthly podcast listeners.

To back their podcast investments, Spotify has worked on making the podcast experience as seamless and accessible as possible, from their all-in-one podcast creation app (Anchor) to podcast APIs and, most recently, natural language podcast search.

Spotify’s natural language search for podcasts is a fascinating use case. In the past, users had to rely on keyword/term matching to find the podcast episodes they wanted. Now, they can search in natural language, in much the same way we might ask a real person where to find something.

This technology relies on what we like to call semantic search. It enables a more intuitive search experience because we tend to have an idea of what we’re looking for, but rarely do we know precisely which terms appear in what we want.

Imagine we wanted to find a podcast episode about healthy eating over the holidays. How would we search for that? It might look something like “how to eat better during the holidays”.

There is a podcast episode talking about precisely this. Its description is:

"Alex Straney chats to Dr. Preeya Alexander about how to stay healthy over Christmas and about her letter to patients."

Using term matching, there is no meaningful overlap between the query and the episode description, so keyword search would not return this result. To make matters worse, there are undoubtedly thousands of episode descriptions on Spotify containing the words “eat”, “better”, and “holidays”. These episodes likely have nothing to do with our intended query, yet keyword search could return them anyway.

If we were to swap that for a semantic search query, we would see much better results, because semantic search considers the meaning of words and sentences rather than the specific terms used.

Despite sharing no words, our query and episode description would be identified as having very similar meanings. They both describe being or eating healthier over the winter holidays.

Enabling meaningful search is not easy, but the impact can be huge if done well. As Spotify has proven, it can lead to a much better user experience. Let’s dive into how Spotify built its natural language podcast search.

Semantic Search

The technology powering Spotify’s new podcast search is more widely known as semantic search. Semantic search relies on two pillars, Natural Language Processing (NLP) and vector search.

These technologies act as two steps in the search process. Given a natural language query, a particular NLP model can encode it into a vector embedding, also known as a dense vector. These dense vectors can numerically represent the meaning of the query. We can visualize this behaviour:

These vectors are produced by a special type of NLP model called a sentence transformer. Queries with similar meanings cluster together, whereas unrelated queries do not.

Once we have these vectors, we need a way of comparing them. That is where the vector search component comes in. Given our new query vector, we compare it against the previously encoded vectors and retrieve those that are nearest, i.e., the most similar.

Given a query vector xq, we could calculate the distance between that and other indexed vectors to identify the top two nearest “neighbors”.
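
As a minimal sketch of this comparison step (the vectors and dimensionality below are random values, purely for illustration), we can rank the indexed vectors by cosine similarity against xq and keep the top two:

```python
import numpy as np

# toy "index" of previously encoded vectors and a new query vector xq
index_vectors = np.random.rand(5, 768)
xq = np.random.rand(768)

# cosine similarity between xq and every indexed vector
sims = index_vectors @ xq / (
    np.linalg.norm(index_vectors, axis=1) * np.linalg.norm(xq)
)

# positions of the top two nearest "neighbors"
top_2 = np.argsort(-sims)[:2]
print(top_2, sims[top_2])
```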

NLP and vector search have been around for some time, but recent advancements have acted as catalysts in the performance increase and subsequent adoption of semantic search. In NLP, we have seen the introduction of high-performance transformer models. In vector search, the rise of Approximate Nearest Neighbor (ANN) algorithms.

Transformers and ANN search have powered the growth of semantic search, but why they matter is not so obvious. So, let’s demystify how they work and why they’ve proven so helpful.

Transformers

Transformer models have become the standard in NLP. These models typically have two components: the core, which focuses on “understanding” the meaning of a language and/or domain, and a head, which adapts the model for a particular use case.

There is just one problem: the core of these models requires vast amounts of data and computing power to pretrain.

Pretraining refers to the training step applied to the core transformer component. It is followed by a fine-tuning step where the head and/or the core are trained further for a specific use case.

One of the most popular transformer models is BERT, and BERT costs a reported $2.5K–$50K to train; this shifts to $80K–$1.6M for the larger BERT model [4].

These costs are prohibitive for most organizations. Fortunately, that doesn’t stop us from using them. Despite these models being expensive to pretrain, they are an order of magnitude cheaper to fine-tune.

The way that we would typically use these models is:

  1. The core of the transformer model is pretrained at great cost by the likes of Google, Microsoft, etc.
  2. This core is made publicly available.
  3. Other organizations take the core, add a task-specific “head”, and fine-tune the extended model to their domain-specific task. Fine-tuning is less computationally expensive and therefore cheaper.
  4. The model is now ready to be applied to the organization’s domain-specific tasks.

In the case of building a podcast search model, we could take a pretrained model like bert-base-uncased. This model already “understands” general purpose English language.

Given a training dataset of user query to podcast episode pairs, we could add a “mean pooling” head onto our pretrained BERT model. With both the core and head, we fine-tune it for a few hours on our pairs data to create a sentence transformer trained to identify similar query-episode pairs.
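
A minimal sketch of that assembly using the sentence-transformers library (the max sequence length here is an arbitrary choice, not a value from the article):

```python
from sentence_transformers import SentenceTransformer, models

# pretrained BERT core: outputs one embedding per token
bert = models.Transformer("bert-base-uncased", max_seq_length=256)

# mean pooling "head": averages token embeddings into one sentence embedding
pooling = models.Pooling(bert.get_word_embedding_dimension(), pooling_mode="mean")

sentence_model = SentenceTransformer(modules=[bert, pooling])
```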

We must choose a suitable pretrained model for our use case. In our example, if our target query-episode pairs were English language only, it would make no sense to take a French pretrained model. It has no base understanding of the English language and could not learn to understand the English query-episode pairs.

Another term we have mentioned is “sentence transformer”. This term refers to a transformer model that has been fitted with a pooling layer that enables it to output single vector representations of sentences (or longer chunks of text).

Sentence transformers add a “pooling layer” to transform the many token embeddings output by a transformer model into a single sentence embedding.

There are different types of pooling layers, but they all consume the same input and produce the same output. They take many token-level embeddings and merge them in some way to build a single embedding that represents all of those token-level embeddings. That single output is called a sentence embedding.
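
To make the mechanics concrete, here is roughly what mean pooling does, sketched in PyTorch (tensor names are illustrative):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)  # sum real (non-padding) tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens
    return summed / counts                         # (batch, dim) sentence embeddings
```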

The sentence embedding is a dense vector, a numerical representation of the meaning behind some text. These dense vectors enable the vector search component of semantic search.

ANN Search

Approximate Nearest Neighbors (ANN) search allows us to quickly compare millions or even billions of vectors. It is called approximate because it does not guarantee that we will find the true nearest neighbors (most similar embeddings).

The only way we can guarantee that is by exhaustively comparing every single vector. At scale, that’s slow.

Rather than comparing every vector, we approximate with ANN search. If done well, this approximation can be incredibly accurate and super fast. But there is often a trade-off. Some algorithms offer speedier search but poorer accuracy, whereas others may be more accurate but increase search times.

In vector search, there is often a decision to be made on whether to prioritize latency or accuracy.

In either case, an approximate solution is required to maintain reasonable query times at scale.
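
We use a managed vector database later in this walkthrough, but to illustrate the exact-vs-approximate trade-off locally, here is a small sketch with the faiss library (random vectors and arbitrary index parameters, purely for illustration):

```python
import faiss
import numpy as np

d = 512
xb = np.random.rand(100_000, d).astype("float32")  # "indexed" vectors
xq = np.random.rand(1, d).astype("float32")        # query vector

# exhaustive (exact) search: guaranteed true nearest neighbors, slow at scale
flat = faiss.IndexFlatIP(d)
flat.add(xb)
_, exact_ids = flat.search(xq, 10)

# approximate search: cluster the vectors, then probe only a few clusters
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8  # more probes = better accuracy, slower search
_, approx_ids = ivf.search(xq, 10)
```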

How Spotify Did It

To build this type of semantic search tool, Spotify needed a language model capable of encoding queries and episodes so that similar (query, episode) pairs land close together in vector space. There are existing sentence transformer models like SBERT, but Spotify found two issues with this model:

  • They needed a model capable of supporting multilingual queries; SBERT was trained on English text only.
  • SBERT’s cross-topic performance without further fine-tuning is poor [5].

With that in mind, they decided to use a different, multilingual model called the Universal Sentence Encoder (USE). But this still needed fine-tuning.

To fine-tune their USE model to encode (query, episode) pairs in a meaningful way, Spotify needed (query, episode) data. They had four sources of this:

  1. Using their past search logs, they identified (query, episode) pairs from successful searches.
  2. They identified unsuccessful searches that were followed by a successful search. The idea is that the unsuccessful query is likely to be a more natural query, which was then used as a (query_prior_to_successful_reformulation, episode) pair.
  3. They generated synthetic queries with a query generation model, producing (synthetic_query, episode) pairs.
  4. A small set of curated queries, manually written for episodes.

Sources (1–3) were used to fine-tune the USE model, with some samples held back for evaluation. Source (4) was used for evaluation only.

Unfortunately, we don’t have access to Spotify’s past search logs, so there’s little we can do to replicate sources (1–2). However, we can replicate the approach behind source (3) using query generation models. And, of course, we can manually write queries as per source (4).

Data Preprocessing

Before generating any queries, we need episode data. Spotify describes episodes as a concatenation of textual metadata fields, including episode title and description, with the podcast show’s title and description.

We can find a podcast episodes dataset on Kaggle containing records for 881K podcast episodes, including episode titles and descriptions alongside podcast show titles and descriptions.

We use the Kaggle API to download this data, installing the client in Python with pip install kaggle. A Kaggle account and API key are needed (the key can be created in your Account settings). The kaggle.json key file should be stored in the location displayed when attempting to import kaggle; if no location or error appears, the key has already been added.

We then authenticate access to Kaggle.
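
Assuming the kaggle.json key is already in place, authentication is a one-liner:

```python
import kaggle

# raises an error if kaggle.json is missing or stored in the wrong location
kaggle.api.authenticate()
```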

Once authenticated, we can download the dataset using the dataset_download_file function, specifying the dataset location (found from its URL), files to download, and where to save them.
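
A sketch of the download step; the "<owner>/<dataset-name>" slug below is a placeholder for the value found in the dataset's Kaggle URL:

```python
# replace "<owner>/<dataset-name>" with the slug from the dataset's Kaggle URL
for filename in ["podcasts.csv", "episodes.csv"]:
    kaggle.api.dataset_download_file(
        "<owner>/<dataset-name>", file_name=filename, path="./"
    )
```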

Both podcasts.csv and episodes.csv will be downloaded as zip files, which we can extract using the zipfile library.
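
A sketch of the extraction step:

```python
import zipfile

# both files arrive zipped; extract them into the working directory
for name in ["podcasts.csv.zip", "episodes.csv.zip"]:
    with zipfile.ZipFile(name, "r") as f:
        f.extractall("./")
```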

We now have two CSV files: podcasts.csv describes the podcast shows themselves, including titles, descriptions, and hosts, while episodes.csv contains data on specific podcast episodes, including episode title, description, and publication date.

To replicate Spotify’s approach of concatenating podcast shows and episode-specific details, we must merge the two datasets. We do this with an inner join on the podcast ID columns.
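
A sketch of the merge, assuming hypothetical ID column names (check the actual column names in the CSVs):

```python
import pandas as pd

podcasts = pd.read_csv("podcasts.csv")
episodes = pd.read_csv("episodes.csv")

# inner join on the podcast ID columns (column names are assumed here)
df = episodes.merge(
    podcasts,
    left_on="podcast_id",
    right_on="id",
    how="inner",
    suffixes=("_episode", "_podcast"),
)
```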

Before concatenating the features we want, let’s clean up the data. We strip excess whitespace and remove rows where any of our relevant features contain null values.

We’re ready to concatenate, giving us our episodes feature.
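
A sketch covering both the cleanup and the concatenation, assuming the merged DataFrame df from above and hypothetical column names:

```python
# assumed names for the episode/show title and description columns
cols = ["title_episode", "description_episode", "title_podcast", "description_podcast"]

# drop rows with nulls, then strip excess whitespace and drop empty strings
df = df.dropna(subset=cols)
for col in cols:
    df[col] = df[col].str.strip()
df = df[(df[cols] != "").all(axis=1)]

# concatenate into the single "episodes" feature used for query generation
df["episode"] = df[cols].agg(". ".join, axis=1)
```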

Query Generation

We now have episodes but no queries, and we need (query, episode) pairs to fine-tune a model. Spotify generated synthetic queries from episode text, which we can do.

To do this, they fine-tuned a query generation BART model using the MS MARCO dataset. We don’t need to fine-tune a BART model as plenty of readily available models have been fine-tuned on the exact same dataset. Therefore, we will initialize one of these models using the HuggingFace transformers library.

We tested several T5 and BART models for query generation on our episodes data; the results are here. The doc2query/all-t5-base-v1 model was chosen as it produced more reasonable queries and has some multilingual support.
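
Initializing the chosen query generation model with the transformers library might look like:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("doc2query/all-t5-base-v1")
qgen = AutoModelForSeq2SeqLM.from_pretrained("doc2query/all-t5-base-v1")
qgen.eval()
```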

It’s time for us to generate queries. We will generate three queries per episode, in line with the approach taken by the GenQ and GPL techniques.

Query generation can take some time, and we recommend limiting the number of episodes (we used 100k in this example). Looking at the generated queries, we can see some good and some bad. This randomness is the nature of query generation and should be expected.
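
A sketch of the generation loop; the sampling parameters here are typical doc2query settings, not values confirmed by the article:

```python
import torch

pairs = []
for episode in df["episode"].tolist()[:100_000]:  # limit the number of episodes
    inputs = tokenizer(episode, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        outputs = qgen.generate(
            **inputs,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=3,  # three queries per episode
        )
    for output in outputs:
        query = tokenizer.decode(output, skip_special_tokens=True)
        pairs.append((query, episode))
```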

We now have (synthetic_query, episode) pairs that can be used in fine-tuning a sentence transformer model.

Models and Fine-tuning

As mentioned, Spotify considered using pretrained models like BERT and SBERT but found the performance unsuitable for their use case. In the end, they opted for a pretrained Universal Sentence Encoder (USE) model from TFHub.

We will use a similar model called DistilUSE that is supported by the sentence-transformers library. By taking this approach, we can use the sentence-transformers model fine-tuning utilities. After installing the library with pip install sentence-transformers, we can initialize the model like so:
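
```python
from sentence_transformers import SentenceTransformer

# multilingual DistilUSE checkpoint
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
```

Here we use the distiluse-base-multilingual-cased-v2 checkpoint, the same checkpoint the evaluation section later uses as the pre-fine-tuning baseline.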

When fine-tuning with the sentence-transformers library, we need to reformat our data into a list of InputExample objects. The exact format does vary by training task.

We will be using a ranking function (more on that soon), so each InputExample must contain two text items: the query and its paired episode.
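
A sketch of that reformatting, assuming the training split is held in a train_pairs list of (query, episode) tuples (the variable name is ours):

```python
from sentence_transformers import InputExample

train_samples = [
    InputExample(texts=[query, episode]) for query, episode in train_pairs
]
```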

We also took a small set of evaluation (eval_pairs) and test set pairs (test_pairs) for later use.

As mentioned, we will be using a ranking optimization function. That means that the model is tasked with learning how to identify the correct episode from a batch of episodes when given a specific query, e.g., ranking the correct pair above all others.

Given a query and set of episodes, the model must learn how to embed them so that the relevant episode embedding is the most similar to the query embedding.

The model achieves this by embedding similar (query, episode) pairs as closely as possible in a vector space. We measure the proximity of these embeddings using cosine similarity, which essentially measures the angle between embeddings (i.e., vectors).

When using cosine similarity, we are searching for embeddings that have the shortest angular distance, rather than Euclidean distance.

As we are using a ranking optimization function, we must make sure no duplicate queries or episodes are placed in the same training batch. If there are duplicates, this will confuse the training process as the model will be told that despite two queries/episodes being identical, one is correct, and the other is not.

The sentence-transformers library handles the duplication issue using the NoDuplicatesDataLoader. As the name would suggest, this data loader ensures no duplicates make their way into a training batch.

We initialize the data loader with a batch_size parameter. A larger batch size makes the ranking task harder for the model as it must identify one correct answer from a higher number of options.

It is harder to choose an answer from a hundred samples than from four samples. With that in mind, a higher batch_size tends to produce higher performance models.
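
A sketch of the data loader setup; we use a batch size of 64, matching the in-batch evaluation described below:

```python
from sentence_transformers.datasets import NoDuplicatesDataLoader

batch_size = 64  # larger batches make the ranking task harder and the model stronger
loader = NoDuplicatesDataLoader(train_samples, batch_size=batch_size)
```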

Now we initialize the loss function. As we’re using ranking, we choose the MultipleNegativesRankingLoss, typically called MNR loss.
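
With sentence-transformers, that is:

```python
from sentence_transformers import losses

loss = losses.MultipleNegativesRankingLoss(model)
```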

In-Batch Evaluation

Spotify describes two evaluation steps. The first can be implemented before fine-tuning using in-batch metrics. What they did here was calculate two metrics at the batch level (using 64 samples at a time in our case); those are:

  • Recall@k tells us if the correct answer is placed in the top k positions.
  • Mean Reciprocal Rank (MRR) calculates the average reciprocal rank of a correct answer.

We will implement a similar approach to in-batch evaluation. Using the sentence-transformers RerankingEvaluator, we can calculate the MRR score at the end of each training epoch using our evaluation data, eval_pairs.

Before initializing this evaluator, we need to remove duplicates from the eval data.

Then, we reformat the data into a list of dictionaries containing a query, its positive episode (that it is paired with), and then all other episodes as negatives.

We set the MRR@5 metric, meaning if the positive episode is returned within the top five results, we return a positive score. Otherwise, the score would be zero.

If the correct episode appeared at position three, the reciprocal rank of this sample would be calculated as 1/3. At position one we would return 1/1.

As we’re calculating the mean reciprocal rank, we take all sample scores and compute the mean, giving us our final MRR@5 score.
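
A sketch of building those evaluation samples and initializing the evaluator, assuming eval_pairs is a list of (query, episode) tuples:

```python
from sentence_transformers.evaluation import RerankingEvaluator

# deduplicated episode pool for the negatives
eval_episodes = list(dict.fromkeys(ep for _, ep in eval_pairs))

samples = [
    {
        "query": query,
        "positive": [episode],                                      # the paired episode
        "negative": [ep for ep in eval_episodes if ep != episode],  # all other episodes
    }
    for query, episode in eval_pairs
]

evaluator = RerankingEvaluator(samples, mrr_at_k=5, name="podcast-eval")
```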

Using our evaluator, we first calculate the MRR@5 performance without any fine-tuning.

This returns an MRR@5 of 0.68, which we will compare to the post-training MRR@5 score.

Fine-Tuning

With our evaluator ready, we can fine-tune our model. The Spotify article doesn’t give any information about the parameters they used, so we will stick with pretty typical training parameters for sentence transformer models using MNR loss. We train for a single epoch and “warm up” the learning rate for the first 10% of training steps.
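
A sketch of that training call with the sentence-transformers fit API, using the single epoch and 10% warmup described above:

```python
warmup_steps = int(len(loader) * 0.1)  # warm up over the first 10% of training steps

model.fit(
    train_objectives=[(loader, loss)],
    evaluator=evaluator,
    epochs=1,
    warmup_steps=warmup_steps,
    output_path="distiluse-podcast-nq",
)
```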

After fine-tuning, the model will be saved into the directory specified by output_path. In distiluse-podcast-nq, we will see all the required model files and a directory called eval. Here, we will find a post-training MRR@5 score of 0.89, a sizeable 21-point improvement from the previous MRR@5 of 0.68.

This score looks promising, but there’s further evaluation to be performed.

Evaluation

We want to emulate a more real-world scenario for the final evaluation step. Rather than calculating MRR@5 across small batches of data (as done previously), we should index many episodes and recalculate some retrieval metrics.

Spotify details their full-retrieval setting metrics as using Recall@30 and MRR@30, performed both on queries from the eval set and on their curated dataset.

Our eval set is small, so we can discard that. Instead, we will use the much larger test set test_pairs.

As before, we must deduplicate the episodes from the dataset.

This time, rather than keeping all of our embeddings stored in memory, we use a vector database, Pinecone.

We first sign up for a free account, enter the default project and retrieve the default API key.

Back in Python, we ensure the Pinecone client is installed with pip install pinecone-client. Then we initialize our connection to Pinecone and create a new vector index.
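
A sketch of that setup, based on the pinecone-client API available around the time this article was written; the index name is an arbitrary choice, and the dimension matches DistilUSE's 512-dimensional output:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# cosine metric matches the similarity used during fine-tuning
pinecone.create_index("podcast-search", dimension=512, metric="cosine")
index = pinecone.Index("podcast-search")
```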

The vector index is where we will store all of our episode embeddings. We must encode the episode text using our fine-tuned distiluse-podcast-nq model and insert the embeddings into our index.
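
A sketch of the encode-and-upsert loop, assuming the (query, episode) test split is held in test_pairs (variable names are ours):

```python
# deduplicated episode texts from the test set
test_episodes = list(dict.fromkeys(ep for _, ep in test_pairs))

batch = 64
for i in range(0, len(test_episodes), batch):
    chunk = test_episodes[i:i + batch]
    embeddings = model.encode(chunk).tolist()
    ids = [str(i + j) for j in range(len(chunk))]
    index.upsert(vectors=list(zip(ids, embeddings)))
```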

We will calculate the Recall@K score, which differs slightly from MRR@K: if the positive episode appears anywhere in the top K returned results, the sample scores 1; otherwise, it scores 0. As before, we take all query scores and compute the mean.
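
A sketch of that calculation; note that the exact query signature and response shape vary slightly across pinecone-client versions:

```python
def recall_at_k(pairs, k=30):
    hits = 0
    for query, episode in pairs:
        xq = model.encode(query).tolist()
        res = index.query(vector=xq, top_k=k)
        # map returned IDs back to the episode texts we upserted
        returned = [test_episodes[int(match["id"])] for match in res["matches"]]
        hits += int(episode in returned)
    return hits / len(pairs)

print(recall_at_k(test_pairs, k=30))
```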

So far, this looks great; 88% of the time, we are returning the exact positive episode within the top 30 results. But this does assume that our synthetic queries are perfect, which they are not.

We should measure model performance on more realistic queries, as Spotify did with their curated dataset. In this example, we have chosen a selection of episodes and manually written queries that fit the episode.

Using these curated samples, we returned a lower score of 0.57. Compared to 0.88, this seems low, but we must remember that there are likely other episodes that also fit these queries. In other words, we are calculating recall as if no other relevant episodes exist.

What we can do is measure this score against that of the model before fine-tuning. We create a new Pinecone index and replicate the same steps, but using the original distiluse-base-multilingual-cased-v2 sentence transformer. You can find the full script here.

Using this model, we return a score of just 0.29. By fine-tuning the model on this episode data, despite having no real user query pairs, we have improved episode retrieval performance by 28 points.

The technique we followed, informed by Spotify’s very own semantic search implementation, produced significant performance improvements.
