Building a Recommender System for Podcasts

Yamini Vibha Ananth
16 min read · Dec 19, 2022


Presented as a talk at Columbia University Applied Math Senior Seminar on Monday, October 10, 2022 by Yamini Ananth, Jafar Vohra, Abhiram Kolluri, & Kathy Wang (author details below). Slides available here. Code available as an interactive Colab notebook here & in a repository here.

Project Overview (abstract)

In a world full of recommender systems for every form of content, we aimed to focus on one medium, podcasts, and explore multiple ways to recommend them.

Podcasts are spoken documents across a wide range of genres and styles, with growing global listenership and a lowering barrier to entry for both listeners and creators. Major strides in recommendation in other industries such as shopping, text-based content, music, and video have yet to make an impact in the podcast space, where recommendations are still largely driven by word of mouth.

We scraped metadata from the top 100 podcasts within roughly the top 40 genres, gathering each show's title, producer, and description, as well as the 6 most recent episode titles and their descriptions. After pre-processing the text, we built a content-based filtering model to recommend podcasts using two types of embeddings. Finding the results of this model realistic but narrow, we then implemented a collaborative filtering model using randomly generated synthetic user data.

We assessed our results given the confines of the gathered and generated data, and further discussed ways to improve this model both in the short and long term, as well as the future of the field.

What are Recommender Systems?

As the quantity of available content on the Internet has grown exponentially, systems that surface the most relevant content have become critical to a meaningful user experience. A recommender system is an information filtering algorithm that leverages user profiles and item metadata to predict items a user might like (3).

At the end of the day, recommending content through these systems is a way to move past the simple fact that the time and attention span of users is limited.

Why Recommender Systems for Podcasts?

Podcasts have a discoverability problem — they have a hard time connecting to their audiences. Unlike music, it’s hard to gauge how good a podcast is just from listening to a sample, and episodes can be hours long. Like many other forms of media, including movies, music, and livestreams, podcasts suffer from the long tail problem — the top 10% of shows get around 95% of listeners.

Figure: the long-tail problem in podcast listenership (source: https://arxiv.org/pdf/2106.09227.pdf)

With the podcast industry expected to reach a global market value of ~$100B by 2030 with over $3.5B in advertiser revenue by 2024, it’s clear there is growing value in the industry (4). Currently, the main mechanism by which major distribution channels like Apple Podcasts and Spotify allow for user discovery is through the top charts, which show the top 100 podcasts for any given genre. This is unlikely to surface niche content that may be more tailored to a user’s interests.

The goal of this project is to demonstrate feasibility of connecting users to content they are likely to enjoy and return to using two types of recommendation algorithms.

A Framework for Developing Recommender Systems

We use the following framework for the development of a recommendation system, as reflected by the overarching architecture of the recommendation system at YouTube (1):

Figure: the candidate generation, scoring, and ranking funnel (image by Adrien Biarnes)
  1. Candidate generation: generating the subset of relevant items from the set of all available items, for example by computing a relevance score for each item. Of a potential 1M pieces of content, perhaps only 100k might be “high quality” / worth recommending in general, and of those, perhaps only 5k are relevant in a given context. Efficiency is a priority here given the potential size of the candidate pool. Sometimes multiple candidate generators are used, and their combined output forms the candidate set.
  2. Scoring: Of the limited candidate pool, we need to pick the top k items to serve to a user. If we used different candidate generators, the relevance scores generated by each likely cannot be compared. Thus, new relevance scores must be recalculated for the whole candidate pool. Since there are fewer items in the candidate pool, a richer feature-set can be used at this stage.
  3. Ranking & serving the top k items: We may not want to use the same relevance scores from the previous stage to rank candidates. For example, say we wanted to prioritize new content or content that experienced high traffic in the last day; we could introduce these biases at the ranking stage. In our case, we recommended the top five podcasts.

The algorithms underlying candidate generation and ranking can vary widely, and in industry many algorithms are often combined in an ensemble. We will implement two types of algorithms for podcasts that are useful for the candidate generation step, and we will treat the raw output of this step as our “recommendations”, either for a given podcast or for a given user.
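To make the funnel concrete, here is a minimal, hypothetical sketch in Python. The generator and scoring functions are placeholders for whatever algorithms (content-based, collaborative, or otherwise) fill those roles; nothing here is specific to our implementation.

```python
def recommend(user, all_items, candidate_generators, score, k=5):
    """Toy three-stage funnel: candidate generation -> scoring -> ranking/serving."""
    # 1. Candidate generation: cheaply narrow millions of items to a small pool,
    #    pooling the output of several generators.
    candidates = set()
    for generate in candidate_generators:
        candidates.update(generate(user, all_items))

    # 2. Scoring: re-score the much smaller candidate pool with a richer model,
    #    since scores from different generators are not directly comparable.
    scored = [(item, score(user, item)) for item in candidates]

    # 3. Ranking & serving: sort by the new scores (optionally re-weighted for
    #    freshness or other business rules) and serve the top k items.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:k]]
```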

Implementing a Recommender System for Podcasts

First, to recommend podcasts, we needed to collect some metadata. We scraped data from the top 100 podcasts for the top ~40 genres (~4300 total) on Apple Podcasts using BeautifulSoup. Specifically, we chose the following features:

  • Title (text)
  • Producer (text)
  • Description (text)
  • 6 Recent Episode Titles (text)
  • 6 Recent Episode Descriptions (text)

For our purposes, since users typically listen to the newest episode of a given podcast first, we allowed our data to have a recency bias: it considers only the most recent content and ignores older episodes entirely. To make the text more usable, we also performed some basic pre-processing (a minimal sketch follows the list):

  • Filtered out URLs and special characters
  • Tokenized (separated each word into its own string)
  • Removed stop-words (common words like articles, pronouns etc)
  • Lemmatized (reduced words to their dictionary base form, so ‘likes’ and ‘liked’ would both be converted to ‘like’)
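A minimal sketch of this kind of pre-processing with NLTK is below. The cleaning rules and the example text are illustrative assumptions, not our exact pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)            # filter out URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # filter out special characters/digits
    tokens = word_tokenize(text.lower())                  # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop-words
    return [lemmatizer.lemmatize(t) for t in tokens]      # lemmatize to dictionary form

preprocess("Subscribe at https://example.com! We like podcasts; she likes them too.")
# e.g. ['subscribe', 'like', 'podcast', 'like'] (exact output depends on the NLTK data installed)
```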

Content-Based Filtering

In this style of filtering content, we use the content itself and score how similar it is to other content. Essentially, we’re looking at recommendations from the perspective of the item: if a user likes one item, there’s a good chance that they will like a similar item.

Recommending similar content to content a user has already consumed

We can condense each item into a set of its relevant features and then represent it as an embedding. Thus, we can think of items as vectors in space, allowing us to perform any vector operation on them. Since we have very high-dimensional data, rather than using raw Euclidean distance, cosine similarity gives us an efficiently computable and effective numeric representation of the relationship between any given pair of items (2).

To filter content, we can generate a similarity matrix that records the similarity between every pair of items in our dataset. For a given item, finding the top k most similar items then amounts to finding the k largest entries in that item’s row of the similarity matrix.
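For instance, given any item embedding matrix (we build two kinds in the next subsections), the “recommend similar items” step reduces to a few lines. The `embeddings` array and `titles` list below are hypothetical placeholders standing in for our scraped data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(embeddings, titles, query_index, k=5):
    """embeddings: (n_podcasts, n_features) array; titles: matching list of show names."""
    sim = cosine_similarity(embeddings)           # full item-item similarity matrix
    row = sim[query_index]                        # similarities of the query show to all others
    order = np.argsort(row)[::-1]                 # indices sorted by decreasing similarity
    order = [i for i in order if i != query_index][:k]  # drop the show itself, keep top k
    return [(titles[i], float(row[i])) for i in order]
```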

Since we have text data, we chose to construct embeddings of our data in two different ways to see how it influenced recommendations.

Bag-of-Words Embeddings

Bag-of-Words embedding logic

The “bag of words” is a fixed-length vector (one entry per word in the vocabulary of known words), where each entry counts how many times that word occurs. This method for embedding text ignores the order of the words and only considers each word’s frequency in a given piece of data. All words are treated independently (i.e. the presence of one word does not imply or indicate the presence of another).
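A toy illustration with scikit-learn’s `CountVectorizer` (the two example descriptions are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "true crime stories from small towns",
    "small towns and the secrets small towns keep",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)          # (n_docs, vocab_size) sparse count matrix
print(vectorizer.get_feature_names_out())     # the fixed vocabulary of known words
print(bow.toarray())                          # per-document word counts; word order is ignored
```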

Term Frequency — Inverse Document Frequency (TF-IDF) Embeddings

TF-IDF embedding explained

This method assigns each word in the text a weight. The frequency of a term in a document (Term Frequency) is penalized by how many other documents that term appears in (Inverse Document Frequency), and each word’s TF-IDF weight is then normalized.

For instance, if the word ‘subscribe’ appears in every item, it isn’t a good signal and its tf-idf score would be low compared to a word like ‘economy’, which may only appear in 5% of items and thus provide a stronger signal.
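A toy scikit-learn illustration of that effect, using made-up descriptions in which ‘subscribe’ appears everywhere:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "subscribe for weekly economy news",
    "subscribe for true crime stories",
    "subscribe for interviews with comedians",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.get_feature_names_out()

# 'subscribe' appears in every document, so its weight in the first document is
# pushed down relative to rarer, more discriminative words like 'economy'.
print(dict(zip(vocab, weights[0].round(2))))
```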

Assessing this model

In general, the results from both models are reasonable, or at least interpretable. Our code is interactive, and you can find recommendations for a podcast of your choosing here.

We can also observe that, for a given item, the two embeddings yield different recommendations which sometimes (but not always) agree. Here, The Daily has completely overlapping recommendations, while The Joe Rogan Experience has totally divergent recommendations.

We attribute this to the different priorities of each embedding. In The Joe Rogan Experience, the number 1001 appeared somewhere in the metadata, and bag-of-words picked up on this as a strong signal. As a result, multiple shows containing ‘1001’ are recommended, even though they are not similar to The Joe Rogan Experience. The TF-IDF model yields higher quality recommendations in this case.

In this content-based filtering method, we can also observe the effect of the dataset’s recency bias. Broadly, there are two categories of podcasts: episodic shows where the content shifts radically from episode to episode, and shows that are consistent across their library.

Recommendations for shows similar to This American Life

This American Life, an episodic show that reports on a wide range of topics, happened to have aired a beer-related episode shortly before we scraped our data. Thus, several consistently beer-themed podcasts are recommended as similar content. Arguably, these shows are not similar at all to This American Life, but they do have the most similar embeddings given the data we collected. Whether or not this is a good thing is something to consider in the context of the larger goals of the system.

Pros & Cons of implementing content-based filtering

Perhaps the biggest advantage of content-based filtering is that it can easily scale. It also avoids the cold start problem of needing a user to interact with an item before it can be recommended. Additionally, content-based models tend to be interpretable, with the features that are similar between items being apparent in many cases.

However, significant domain knowledge can be required to understand which features matter most when doing feature engineering. Furthermore, this method only surfaces content similar to a given piece of content, so using it in isolation can yield a very uniform, stale feed. Since introducing variety into a user’s feed is one of the major reasons for using a recommendation system in the first place, this is a clear disadvantage.

Collaborative Filtering

The main idea of collaborative filtering is to recommend items that come from other users who have similar preferences to you. In contrast to content-based filtering, which tries to find similarities between items, collaborative filtering finds similarities between users instead.

The process is similar to the one described for content-based filtering, but the similarity matrix is built from vector representations of user ratings instead of item features. This can be approached as a matrix factorization problem, where we aim to decompose the original utility/ratings matrix R into a user matrix U and an item matrix P with compatible dimensions, such that U · P^T ≈ R.

Generating latent factor matrices from the original utility matrix R

We can approach the construction of U and P using Alternating Least Squares (ALS) matrix factorization.

First, we need an objective that gauges the distance between our predictions and the true ratings; we chose root mean squared error (RMSE).

We construct our loss function from this reconstruction error plus a \lambda-weighted L2-norm penalty on U and P (to prevent overfitting).

Using this loss function, we iteratively alternate between optimizing U with P held fixed and optimizing P with U held fixed. Each subproblem is a regularized least-squares problem with a closed-form solution, so the loss decreases at every step and converges to a (local) minimum.

Setting the partial of the loss w.r.t. a user’s factor vector u_i to zero (with P fixed) gives the closed-form update u_i = (P_i^T P_i + \lambda I)^{-1} P_i^T r_i, where P_i contains the rows of P for the items user i has rated and r_i the corresponding ratings. Setting the partial of the loss w.r.t. an item’s factor vector p_j to zero (with U fixed) gives the symmetric update p_j = (U_j^T U_j + \lambda I)^{-1} U_j^T r_j.

Once we have the U and P matrices, each row of U describes how much a user’s interests align with each latent factor, and each row of P describes how strongly an item expresses each latent factor; the dot product of a user’s row with an item’s row gives the predicted rating.
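A minimal NumPy sketch of those alternating closed-form updates is below (dense matrices, with observed entries marked by a mask). It is purely illustrative; it is not the PySpark implementation we actually used.

```python
import numpy as np

def als(R, mask, k=10, lam=0.1, iters=10, seed=0):
    """R: (n_users, n_items) ratings; mask: 1 where a rating is observed, 0 elsewhere."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    P = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix P, solve a small ridge regression for each user's latent factors.
        for u in range(n_users):
            rated = mask[u] > 0
            Pu = P[rated]
            U[u] = np.linalg.solve(Pu.T @ Pu + reg, Pu.T @ R[u, rated])
        # Fix U, solve symmetrically for each item's latent factors.
        for i in range(n_items):
            raters = mask[:, i] > 0
            Ui = U[raters]
            P[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[raters, i])
    return U, P   # predicted ratings: U @ P.T
```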

Generating Synthetic User Data

As previously discussed, this method relies only on the ratings that users provide for items in the dataset. However, although we have a dataset, we don’t have any user data. Thus, we generated synthetic data.

We naively generate data for 1000 users, with each user assigning random ratings to 5–20 randomly selected podcasts. In reality, this does not reflect human behavior: it suffices to demonstrate the method, but it means there is inherently little signal in a user’s preferences. One option to improve this would be to use the similarity matrices from the previous step to generate more realistic user data, reflecting the fact that people’s interests are not wholly independent.
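The naive generator looks roughly like this; the column names and the 1–5 rating scale are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users, n_podcasts = 1000, 4300    # roughly the size of our scraped catalog

rows = []
for user_id in range(n_users):
    n_ratings = rng.integers(5, 21)                                   # each user rates 5-20 shows
    for podcast_id in rng.choice(n_podcasts, size=n_ratings, replace=False):
        rows.append((user_id, int(podcast_id), int(rng.integers(1, 6))))  # uniform 1-5 rating

ratings_pd = pd.DataFrame(rows, columns=["userId", "podcastId", "rating"])
```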

Implementing Collaborative Filtering & Serving Recommendations

Using this user preferences matrix (R, as defined above), we run ALS matrix factorization using PySpark’s built-in functionality. With collaborative filtering, we have to determine the optimal number of latent factors, k, for a given dataset in order to construct the best possible U and P matrices. As such, we run the following:

  • Create a train/test split of 0.7/0.3
  • Set the maximum number of ALS iterations=10 (we can reach a fairly good estimate with 10 iterations, even if it is not a global minimum)
  • Sweep the number of latent factors k from 10 to 100
  • Run 3-fold cross-validation
  • Identify which k produced the least error on the test data (k=100)
  • Use the best model to generate predictions for the whole dataset

For our model, we chose RMSE loss, regParam=0.1 (the \lambda from the derivation above), and coldStartStrategy=“drop” (meaning that items and users with no ratings were dropped from the calculations), as sketched below.
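Putting those choices together, the PySpark pipeline looks roughly like the following sketch. The DataFrame and column names carry over from the synthetic-data sketch above, and the rank grid values are illustrative rather than our exact search grid.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("podcast-als").getOrCreate()
ratings = spark.createDataFrame(ratings_pd)              # synthetic ratings from the earlier sketch

train, test = ratings.randomSplit([0.7, 0.3], seed=42)   # 0.7/0.3 train/test split

als = ALS(userCol="userId", itemCol="podcastId", ratingCol="rating",
          maxIter=10, regParam=0.1, coldStartStrategy="drop")

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
grid = ParamGridBuilder().addGrid(als.rank, [10, 25, 50, 100]).build()  # sweep latent factors k
cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

best = cv.fit(train).bestModel
print("test RMSE:", evaluator.evaluate(best.transform(test)))

top5 = best.recommendForAllUsers(5)                      # serve each user their top five podcasts
```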

Results

Generally, these results seem reasonable, or at the very least, not unreasonable. A few examples are below, but feel free to try it out yourself on our interactive Colab notebook. It’s also possible to try out different values of \lambda to observe the effect on the predictions.

In general, we can’t comment too much on the accuracy or quality of these recommendations, for a few reasons. First, because the user profiles were generated randomly, there is not much signal or pattern to be derived from associations between profiles. Secondly, and more importantly, in an offline setting without user feedback or action on recommendations, we can only compare recommendations to previously consumed content. One option for assessing recommendation quality here is to use the similarity between items a user has rated highly and the items being recommended as a proxy metric. However, high-quality recommendations that surface novel content to a user would be penalized under that metric, which runs against the goal of the algorithm.

In production online recommender systems, collaborative filtering is used in ensembles with content-based filtering in order to benefit from the strengths of each algorithm while mitigating some of these issues.

Pros and Cons of Collaborative Filtering

One of the pros of collaborative filtering is that the people implementing the model don’t need the item-level domain knowledge and feature engineering that content-based filtering requires. Additionally, the model can provide more interesting recommendations, since it considers data beyond a single user’s preferences: a user need not have liked items in a given category before for something in that category to be recommended, as long as similar users liked it.

However, there are a few issues, notably the cold start problem for new items: until someone rates them, they don’t get recommended. In general, data sparsity can hurt the quality of user-based recommenders. Scaling can also be a challenge as datasets grow, since the matrix operations become expensive; content-based recommenders are faster when the dataset is massive. Also, with a straightforward implementation, the generated recommendations tend to be already-popular items, ignoring the long tail.

Future improvements for these models

In this proof-of-concept demonstration, we operated in an offline environment with no direct user feedback to influence recommendations. Even in this context, there are questions to explore:

  • How might using full podcast episode transcripts in the dataset, perhaps generated using an AI transcriber like OpenAI Whisper, change recommendations? Would this lead to higher quality recommendations, or does the metadata accomplish the same thing with less overhead?
  • Using a more complex and/or custom embedding model: rather than bag-of-words, an attention-based transformer model such as BERT could capture more relevant information (see the sketch after this list)
  • What if we generated user data in a less naive way? That is, using the similarity matrices from content-based filtering to generate the user profiles.
  • Combining multiple models into a hybrid recommendation system
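As a taste of the transformer idea above, a pretrained sentence-embedding model could drop in as a replacement for the bag-of-words/TF-IDF embeddings. A minimal sketch is below; the model name and toy descriptions are illustrative assumptions, and we did not implement this.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # a small BERT-family sentence encoder
descriptions = [
    "Daily news analysis from a major newsroom",
    "Long-form interviews about science and technology",
    "Two hosts review craft beer and local breweries",
]
embeddings = model.encode(descriptions)           # one dense vector per podcast description
similarity = cosine_similarity(embeddings)        # plugs directly into the earlier top-k lookup
```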

When a recommendation system incorporates user feedback in real time, it is ‘online’. There are a few other types of data that may become useful in generating high quality recommendations in an online environment:

  • Implicit feedback from users such as listen duration, pauses/plays, shares, etc. (which may produce higher quality data than explicit feedback like ratings, which can skew to extremes)
  • Time-weighting: newer/fresher content may be more popular in the short term
  • Personal/demographic data, like location (for example, recommending New York-themed podcasts to people in New York). Collaborative filtering can be used to find users with similar demographics and recommend similar content
  • Constructing a user community: groups of users that interact may share preferences, even if they have different demographics or profiles. It is promising to utilize social network graphs to find areas of shared interest among users.

User-community based recommender system concept

Collecting more user and demographic data brings the need to weigh the tradeoff between privacy/security and the quality of recommendations. There are also several other factors to consider carefully when implementing recommender systems.

Downsides of using recommender systems to serve content

  • Echo Chambers: Systems that only show users similar content to what they already consume can make it difficult for users to see alternate perspectives
  • Polarization: Recommendation systems can reinforce group identity and promote conflict between groups

Finding the balance between recommending content in a narrow window that a user enjoys and content that falls outside of that window is difficult. It is that balance that separates the good recommendations from the best recommendations.

Future of the Field

Recommendation systems utilizing ensemble models and deep learning are currently dominant; a popular architecture is the two-tower model, used by eBay and other e-commerce sites, which builds upon collaborative filtering with neural networks. In this method, user and item metadata are considered alongside user ratings of items, which are typically computed implicitly from click-through rate, engagement, etc.

For situations with multiple ranking objectives, the multi-gate mixture-of-experts (MMoE) architecture has emerged at YouTube among other sites. In this method, multiple expert networks focus on different patterns in the data, and each task has its own gating network, allowing the model to learn a per-task and per-sample weighting of the expert networks. Thus, MMoE can model the relationships between different tasks. However, basic concepts demonstrated in this implementation, such as constructing embeddings, using similarity measures, and collaborative filtering, underlie even the most complex MMoE models.

In general, as the amount of content being produced only increases, the role that recommender systems play in shaping the content that is surfaced to users becomes more and more important.

Reflection & Takeaways

We picked this topic because several of us love podcasts! Beyond that, though, we also cared about learning more about how the content we consume every day gets served to us. Through this project, we have developed a better understanding of why content gets recommended, and how data collected by platforms that serve us content can be used to further optimize this.

We all have different taste in shows and certainly different levels of enthusiasm for the medium of podcasts, but had the collective objective of improving our skills as collaborative researchers. We enjoyed working through the problem selection and refinement process, background research, data collection, and model development. Presenting slides and preparing this writeup also helped us practice our scientific communication skillset.

References

  1. Covington, P., Adams, J., & Sargin, E. (2016, September). Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM conference on recommender systems (pp. 191–198). https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf
  2. Fethi Fkih, Similarity measures for Collaborative Filtering-based Recommender Systems: Review and experimental comparison, Journal of King Saud University — Computer and Information Sciences, Volume 34, Issue 9, 2022, Pages 7645–7669, ISSN 1319–1578, https://doi.org/10.1016/j.jksuci.2021.09.014.
  3. F.O. Isinkaye, Y.O. Folajimi, B.A. Ojokoh, Recommendation systems: Principles, methods and evaluation, Egyptian Informatics Journal, Volume 16, Issue 3, 2015, Pages 261–273, ISSN 1110–8665, https://doi.org/10.1016/j.eij.2015.06.005.
  4. Jones, Rosie, Zamani, Hamed, et al. (2021). Current Challenges and Future Directions in Podcast Information Access. SIGIR. arXiv:2106.09227.
  5. Sarwar, Badrul & Karypis, George & Konstan, Joseph & Riedl, John. (2001). Item-based Collaborative Filtering Recommendation Algorithms. Proceedings of ACM World Wide Web Conference. 1. https://doi.org/10.1145/371920.372071.

Author Details

This project is co-authored by 4 seniors at Columbia as a portion of the applied math senior seminar final project — course details available here.

Yamini (linkedin)

  • senior, seas, applied math + cs minor // going into data science!
  • enthusiastic fan of 99% Invisible, This American Life, Radiolab, and Smithsonian Sidedoor

Kathy

  • senior, cc, applied math + econ minor + premed // biotech equity research + potentially med school or MD/MBA

Jafar (linkedin)

  • senior, seas, applied mathematics, combined plan // sports/tech industry + data science, engineering roles

Abhiram

  • senior, seas, applied math, combined plan // healthcare/biotech data scientist + also recruiting for pm roles
  • fan of Lex Fridman Podcast, Duncan Trussell Family Hour, and Hacks on Tap
  • focused on content that improves the quality of life of an individual, leaving them happier and healthier.
