A Recommendation Engine in Python for Podcast Episodes on Spotify

Julia Wu
Web Mining [IS688, Spring 2021]
Mar 2, 2021
source: Spotify

Listening to podcasts on Spotify is one of my favorite things to do while working on my homework. Spotify has offered a feature to recommend episodes to users since November 2019, and its recommendation engine seems to be based on my subscribed channels. However, many podcasts cover a wide range of topics, and I am often interested in only some of them. As a result, Spotify’s recommendations sometimes miss the episodes I would be most interested in, and I can end up spending more than 10 minutes deciding which episodes to listen to that day.

How can I quickly get my TOP 10 recommended episodes based on the episode I like?

If you have the same issue, this article will help you find recommended episodes based only on an episode you are interested in. In the following steps, I use a Jupyter notebook to build a recommendation engine for Spotify episodes.

The source code is on my GitHub!

Collect the data

Since I want to get recommended episodes from Spotify based on an episode I like, I am going to collect data associated with episodes from Spotify by making API calls through Spotipy in Python.

I have to import some libraries and files to help me collect and clean up the data. This time my credential file, which contains all my API IDs, keys, and tokens, lives in the parent folder, so I need the ‘os’, ‘sys’, and ‘inspect’ libraries to find its path and import my Spotify ID and API key.
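A minimal sketch of locating the parent folder and importing the credentials could look like the following; the module name ‘credentials’ and the variable names are hypothetical placeholders for my own file.

import os
import sys
import inspect

# Find the folder of the current notebook, then its parent folder,
# and add the parent folder to the module search path.
current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)

# 'credentials' is a hypothetical module name; it is assumed to define
# the Spotify client ID and client secret.
from credentials import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET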

I also need to install the ‘Spotipy’ library before I can import it and make API calls to Spotify.

!python -m pip install spotipy

As usual, I import the ‘json’ library and take a look at a JSON object returned from the API call in a more readable format, so I can decide which fields of an episode object I need to retrieve to build my recommendation engine.
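As a sketch, creating the Spotipy client and pretty-printing one episode object could look like this; the query string ‘health’ is only a placeholder, and the credential variable names follow the hypothetical ones above.

import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Authenticate with the client-credentials flow; no user login is needed for search.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,          # imported from the credential file
    client_secret=SPOTIFY_CLIENT_SECRET))

# Search for episodes and pretty-print the first result to see which fields it offers.
result = sp.search(q='health', type='episode', market='US', limit=1)
print(json.dumps(result['episodes']['items'][0], indent=2))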

Since Spotify returns at most 50 records from each API call and at most 1,000 records for each query, I use the ‘range’ method with a ‘step’ parameter to get 1,000 episodes from each query. To get a larger data set, I use a ‘for’ loop to run multiple queries. I also use the ‘set_option’ method so I can view the values in a data frame without ellipsis notation.
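Here is a rough sketch of that paging logic for a single query, assuming the ‘sp’ client created above; the query string is again just a placeholder.

import pandas as pd

# Show full cell values in the data frame instead of truncating them with '...'.
pd.set_option('display.max_colwidth', None)

items = []
# Each call returns at most 50 items and the offset can go up to 1,000,
# so stepping the offset by 50 collects 1,000 episodes for this query.
for offset in range(0, 1000, 50):
    response = sp.search(q='health', type='episode', market='US', limit=50, offset=offset)
    items.extend(response['episodes']['items'])

print(len(items))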

Bug Attention!

Although the Spotify documentation says the same search operations should work, keyword queries through Spotipy do not seem to behave very well. For example, if I query ‘year:2020’, it returns 1,000 random episodes that include ‘year:2020’ in their names instead of episodes released in 2020, as it is supposed to. It may also raise an error when I query some specific letters, such as ‘g.’ Therefore, I put a ‘print’ statement in my loop to see which letter the loop crashes at and then leave out the invalid letter when I am collecting my data set.

This getEpisodes() function returns a data frame of episodes with 7 columns: name, description, duration, explicit, language, release date, and URL. Finally, it drops all duplicate episodes before it returns the data set to me.

To get more than 1,000 episodes on different topics from Spotify, I query 4 letters one after another. For each letter, the function collects 1,000 episodes that contain that letter in one of their properties and drops duplicates before returning the whole data set to me as a data frame.
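My exact implementation is in the GitHub repo; a simplified sketch, assuming the ‘sp’ client from above and leaving out the invalid-letter handling, could look like this. The column names follow the list above, and the field names come from Spotify’s episode object.

import pandas as pd

def getEpisodes(keywords, per_keyword):
    # Collect up to per_keyword episodes for each keyword and return a
    # de-duplicated data frame with 7 columns.
    rows = []
    for keyword in keywords:
        print(keyword)  # shows which keyword is running, so a crash is easy to locate
        for offset in range(0, per_keyword, 50):
            response = sp.search(q=keyword, type='episode', market='US',
                                 limit=50, offset=offset)
            for item in response['episodes']['items']:
                rows.append({
                    'name': item['name'],
                    'description': item['description'],
                    'duration (ms)': item['duration_ms'],
                    'explicit': item['explicit'],
                    'language': item['language'],
                    'release date': item['release_date'],
                    'url': item['external_urls']['spotify'],
                })
    return pd.DataFrame(rows).drop_duplicates().reset_index(drop=True)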

episodes_data = getEpisodes(['h', 'b', 'f', 'm'], 1000)
print(episodes_data.shape)

To get a basic understanding of my data set, I use the ‘info’ method, and the output shows the data set is perfect, without any null values.

episodes_data.info()

Clean up data

To build this recommendation engine, I take ‘name’, ‘description’, ‘explicit’, ‘language’, ‘duration (ms)’, and ‘release date’ into account. To prepare the text data for feature extraction, I import the ‘numpy’ library and create a new column ‘name + description’ that combines ‘name’ and ‘description’ for each episode. To prepare the date data for feature extraction, I import the ‘datetime’ library and create a new column ‘days ago’ that counts the number of days from each episode’s release date to the current date. Finally, I have a set of feature columns: ‘name + description’, ‘explicit’, ‘language’, ‘duration (ms)’, and ‘days ago.’
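The two derived columns might be built along these lines (a sketch, using the column names above):

import pandas as pd
from datetime import datetime

# Merge the title and the description into one text field for the TF-IDF step.
episodes_data['name + description'] = (
    episodes_data['name'] + ' ' + episodes_data['description']
)

# Turn the release date into 'days ago' so recency becomes a plain number.
episodes_data['days ago'] = (
    datetime.now() - pd.to_datetime(episodes_data['release date'])
).dt.days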

Feature extraction

To preprocess my data set for building the recommendation engine, I install and import the ‘scikit-learn’ library. To featurize the feature columns associated with the episodes in my data set, I use different encoders based on the data type of each column. Suitable encoders for the text data may be CountVectorizer or TfidfVectorizer, whereas the appropriate encoders for the categorical and numerical data are OneHotEncoder and MinMaxScaler respectively.

Since my feature columns include two categorical columns, one text column, and two numerical columns, I use the ColumnTransformer() method to bundle the different encoders and set a weight for each group of columns. The OneHotEncoder is for my categorical columns ‘explicit’ and ‘language’, the TfidfVectorizer is for my text column ‘name + description’, and the MinMaxScaler is for my numerical columns ‘duration (ms)’ and ‘days ago.’ Then I call the fit_transform() method, which learns each encoder’s parameters from its columns (the vocabulary and IDF weights for the text, the categories for the categorical columns, and the minimum and maximum for the numerical columns) and transforms them in one pass, and I convert the result into an array for my next step.
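A sketch of that preprocessing step; the transformer weights here are placeholders, and the ones I actually use may differ.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

preprocessor = ColumnTransformer(
    transformers=[
        # Text column: TF-IDF over the combined title and description.
        ('text', TfidfVectorizer(stop_words='english'), 'name + description'),
        # Categorical columns: one-hot encode the explicit flag and the language.
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['explicit', 'language']),
        # Numerical columns: scale duration and recency into the [0, 1] range.
        ('num', MinMaxScaler(), ['duration (ms)', 'days ago']),
    ],
    transformer_weights={'text': 1.0, 'cat': 1.0, 'num': 1.0},  # placeholder weights
)

# Learn each encoder's parameters and transform all columns in one pass,
# then densify the (possibly sparse) result for the cosine-similarity step.
token = preprocessor.fit_transform(episodes_data)
token = token.toarray() if hasattr(token, 'toarray') else np.asarray(token)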

Find and sort by the cosine similarity

I use cosine similarity from ‘scikit-learn’ to calculate the similarity between all episodes in my data set. The concept is to measure the cosine of the angle between two vectors projected in a multi-dimensional space; since all of my features are non-negative, the cosine falls between 0 and 1.

from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(token)

In this case, the larger the value is, the more similar the two episodes are. That means I need to be able to sort all episodes by their cosine similarity score to a given episode. Therefore, I use the index of my data set as a primary and foreign key between the data set and the score list, and then I sort the list by the cosine similarity score in descending order.

For example, I find an episode ‘When Covid Hit Nursing Homes, Part 1: ‘My Mother Died Alone’, and I want to find episodes similar to it.

list(episodes_data['Name'])

I set the target name to this episode, and I get a list ‘sorted_scores’ sorted by the cosine similarity score between each episode and it.
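In code, that step might look like the following sketch; it assumes the data frame has a clean 0-to-n-1 index (so a row position matches a row of the similarity matrix) and a ‘name’ column as collected earlier.

# Locate the row index of the episode I like; the index doubles as the key
# into both the data frame and the cosine-similarity matrix.
target = "When Covid Hit Nursing Homes, Part 1: ‘My Mother Died Alone’"
idx = episodes_data[episodes_data['name'] == target].index[0]

# Pair every episode's index with its similarity score to the target episode,
# then sort the pairs from most to least similar.
scores = list(enumerate(cos_sim[idx]))
sorted_scores = sorted(scores, key=lambda pair: pair[1], reverse=True)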

Get the TOP 10 recommended episodes based on a given episode

Since I have the index as a primary and foreign key between my data set and the list ‘sorted_scores’, I can use the index to retrieve all the columns of an episode from my data set along with its cosine similarity score from the list. In this case, I write a ‘for’ loop to show the TOP 10 episodes, with their names and URLs, that are most similar to the episode ‘When Covid Hit Nursing Homes, Part 1: ‘My Mother Died Alone.’ Since the most similar item to an episode is the episode itself, I take the TOP 11 episodes and remove the first one.

I wrap these steps into a reusable function ‘getRecommendation’ that can be called as many times as I want, with a variable number for the TOP N, as sketched below.
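This is a sketch of such a function under the same assumptions as above; the ‘url’ column name comes from the columns collected earlier.

def getRecommendation(episode_name, number):
    # Print the `number` episodes most similar to `episode_name`, with their URLs.
    idx = episodes_data[episodes_data['name'] == episode_name].index[0]
    scores = sorted(enumerate(cos_sim[idx]), key=lambda pair: pair[1], reverse=True)
    # Skip the first pair: the episode most similar to an episode is itself.
    for i, score in scores[1:number + 1]:
        print(f"{episodes_data.loc[i, 'name']} ({score:.3f}) - {episodes_data.loc[i, 'url']}")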

Example 1

getRecommendation('A Broken System for Housing the Homeless', 10)

Example 2

getRecommendation('Healthy Relationships', 10)

Example 3

getRecommendation('How to Decide', 10)

Conclusion and limitations

In general, after I retrieve up to 1,000 records per query from Spotify as my data set, I have a set of feature columns: ‘name + description’, ‘explicit’, ‘language’, ‘duration (ms)’, and ‘days ago.’ I featurize the columns associated with the episodes using encoders suited to their data types: the OneHotEncoder for my categorical columns ‘explicit’ and ‘language’, the TfidfVectorizer for my text column ‘name + description’, and the MinMaxScaler for my numerical columns ‘duration (ms)’ and ‘days ago.’ After a few clicks, this recommendation engine can return the TOP N episodes most similar to any episode in the collected records, with their names and URLs, by calculating and sorting the cosine similarity scores between all episodes in my data set. Now I can get a list of recommended episodes with their links in two minutes!

There are three limitations to this project. First, each query returns at most 1,000 records, so I have to run multiple queries to get a larger data set. Second, since keyword queries through Spotipy do not seem to work very well, I have to use a ‘print’ statement to find and remove the invalid queries. Third, my versions of ‘scikit-learn’ (0.21.3) and ‘NumPy’ (1.20.1) seem to be incompatible with each other. When I use the fit_transform and cosine_similarity methods to encode my data set and calculate the cosine similarity between episodes, I receive deprecation warnings telling me that ‘np.int’ and ‘np.float’ are deprecated aliases for the built-in ‘int’ and ‘float.’ So far, this recommendation engine is executable and works perfectly fine. However, I still cannot find a way to silence the warnings correctly, even though I have done tons of research. I would appreciate any thoughts about how to silence them correctly.

References:

“6.1.4. ColumnTransformer for Heterogeneous Data.” Scikit-Learn Machine Learning in Python, https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data. Accessed 1 Mar. 2021.

Brownlee, Jason. “How to Use the ColumnTransformer for Data Preparation.” Machine Learning Mastery, 20 Dec. 2019, https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/.

Khanna, Chetna. “What and Why behind Fit_transform() and Transform() in Scikit-Learn!” Towards Data Science, 25 Aug. 2020, https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe.

Prabhakaran, Selva. “Cosine Similarity — Understanding the Math and How It Works (with Python Codes).” ML+ Let’s DataScience, https://www.machinelearningplus.com/nlp/cosine-similarity/. Accessed 1 Mar. 2021.

Science, Computer. Build A Movie Recommendation Engine Using Python. 2020, https://www.youtube.com/watch?v=ueKXSupHz6Q.

Tingle, Max. Getting Started with Spotify’s API & Spotipy. 3 Oct. 2019, https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b.

“Web API.” Spotify for Developers, https://developer.spotify.com/. Accessed 26 Feb. 2021.

Welcome to Spotipy! https://spotipy.readthedocs.io/en/2.17.1/. Accessed 24 Feb. 2021.

