Extracting the Joe Rogan Experience

A YouTube podcast data extraction and analysis how-to guide

Published in The Startup · 8 min read · Dec 30, 2020

I’ve been a fan of the Joe Rogan podcast for many years. His wide variety of guests, from scientists to comedians, has been both entertaining and educational. He is a great podcaster because he is willing to listen to both sides and have a direct and honest conversation.

Joe Rogan is leaving YouTube and moving his primary podcast network to Spotify. In light of the move, I decided to use common Python libraries to extract some data from the JRE YouTube playlists before the episodes were inevitably removed from YouTube. The primary purpose of this extraction was to practice data visualization techniques on data I was interested in. The data I extracted offers many visualization opportunities and a chance to gain some interesting insights into the JRE podcast, such as the diversity of topics, the sudden rise in listenership over the last few years, the most consistently liked and disliked guests, and the correlation between topical subjects and guests. I can begin to see just how much influence Joe Rogan has on our culture, and how the guests he has on change our opinions on subjects. Personally, I had never heard of people like Cornel West or Darryl Davis before they appeared on Joe Rogan, and now I enjoy listening to them and their incredible stories.

This article covers the steps I took to extract this data from YouTube using a library called pafy. Pafy allows you to get some high-level stats from YouTube videos. It has an easy-to-use and intuitive API and makes it easy to access and prepare data for analysis.

The code below shows how I extracted the data. If you want to get straight into it, however, check out the final dataset on Kaggle.

Let’s get started.

The Code

The following code snippets were run on Jupyter Lab. First, make sure you have Pandas and Pafy installed.

pip install pandas
pip install pafy

Make sure your imports work after you have installed the libraries.

import pandas as pd
import pafy

In order to get Pafy working, you will need to get an API key from YouTube. Once you have done that, you will need to set the API Key you generated.

pafy.set_api_key("YOUR_API_KEY")

Now, Pafy is able to access the YouTube API.
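If you would rather not paste the key directly into a notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name YOUTUBE_API_KEY and the helper load_api_key are my own assumptions, not part of pafy.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage, assuming the variable is set and pafy is imported:
# pafy.set_api_key(load_api_key())
```

This keeps the key out of the notebook itself, which helps if you plan to share the notebook or publish it as a gist.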

Before the move to Spotify, if you had a look at Joe Rogan’s YouTube channel you would have noticed that, under the playlists tab, he had multiple playlists of podcasts split up into batches. The screenshot below shows the current state of the channel, which looks similar to how it did before the Spotify move. The major difference is that there were more playlists back then.

The current state of the JRE YouTube channel

I had to manually obtain the URL for each of these playlists and add them to a list, because we will use pafy to pull information on all of these podcasts.

all_jre_playlists = [
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33Kt_vFRd3OyzqQ78Go4c1fyn",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33KuQyLE4RjEOdJ_-0epbcBb4",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33KvXucAFMo5Tc5p8e_mcc-5g",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33KtYIPnFjpI19BCz2unzWYlJ",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33Kvv8T6ZESpJ2nvEHT9xBhlb",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33KvtMA4mCQSnzGsZe8qsTdzV",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33Ku0Oa3t8MQjV7D_G_PBi8g1",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33KuU_aJDvMPPAy_SoxXTt_ub",
    "https://www.youtube.com/playlist?list=PLk1Sqn_f33KtVQWWnE_V6-sypm5zUMkU6",
]

all_jre_info = []

# Iterate through each playlist and add each item to a list
for plurl in all_jre_playlists:
    playlist = pafy.get_playlist(plurl)
    for i in playlist['items']:
        all_jre_info.append(i)

len(all_jre_info)
# 1325

The code snippet above goes through each playlist URL and gets the items in the playlist. In this case, the items are the individual JRE podcasts and their statistics. I appended these statistics to an array so I could generate a dataframe from the playlists. At the time I wrote the code, I obtained a total of 1,325 podcasts from the JRE YouTube channel.
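Playlists can disappear (as these ones eventually did), so it can help to keep going when one URL fails. Here is a minimal sketch of the same loop with the fetching step passed in as a function. The names collect_items and fake_fetch are hypothetical, and fake_fetch only stands in for pafy.get_playlist so the example runs without network access.

```python
def collect_items(playlist_urls, fetch_playlist):
    """Gather the items of every playlist, remembering which URLs failed."""
    items = []
    failed = []
    for url in playlist_urls:
        try:
            playlist = fetch_playlist(url)
        except Exception as exc:
            # Record the failure and move on to the next playlist
            failed.append((url, str(exc)))
            continue
        items.extend(playlist.get("items", []))
    return items, failed


def fake_fetch(url):
    """Stand-in for pafy.get_playlist, returning made-up data."""
    if "broken" in url:
        raise ValueError("playlist not found")
    return {"items": [{"playlist_meta": {"title": f"Episode from {url}"}}]}


items, failed = collect_items(["list-a", "broken-list", "list-b"], fake_fetch)
print(len(items), len(failed))  # prints: 2 1
```

With the real library you would pass pafy.get_playlist in place of fake_fetch.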

Now that I had the podcasts, I wanted to put them into a dataframe so I could add more columns and clean the data.

yt_jre = pd.DataFrame.from_dict(all_jre_info)

# Explode the metadata from the pafy API into the DataFrame
yt_jre = yt_jre['playlist_meta'].apply(pd.Series)

# Convert the creation time (seconds since the epoch) into a timestamp
yt_jre['timestamp_created'] = pd.to_datetime(yt_jre['time_created'], unit='s')

# Convert the timestamps into a DatetimeIndex and extract the year, month and day into their own columns
yt_jre['year'] = pd.DatetimeIndex(yt_jre['timestamp_created']).year
yt_jre['month'] = pd.DatetimeIndex(yt_jre['timestamp_created']).month
yt_jre['day'] = pd.DatetimeIndex(yt_jre['timestamp_created']).day

Notice, in the image above, that there are several columns I am not interested in. I could remove them to make this dataframe a little cleaner. For the task of extraction and basic cleaning, however, I decided to leave it as is, because I prefer to filter out columns only once I know I have no use for them.
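To make the steps so far easier to follow without the original playlists, here is a small sketch on made-up rows. The title, time_created and views fields mirror the columns used in this article. The thumbnail field and all the values are assumptions for illustration. I also build the frame from a list of dicts rather than with apply(pd.Series), which is a swapped-in technique that should give the same columns on data like this.

```python
import pandas as pd

# Made-up items shaped like the playlist entries described above
sample_items = [
    {"playlist_meta": {"title": "Example Show #1 - Guest A",
                       "time_created": 1262304000,
                       "views": "1,234",
                       "thumbnail": "a.jpg"}},
    {"playlist_meta": {"title": "Example Show #2 - Guest B",
                       "time_created": 1577836800,
                       "views": "56,789",
                       "thumbnail": "b.jpg"}},
]

raw = pd.DataFrame(sample_items)

# Expand the nested metadata dicts into one column per key
meta = pd.DataFrame(raw["playlist_meta"].tolist())

# Epoch seconds to timestamps, then pull out the date parts
meta["timestamp_created"] = pd.to_datetime(meta["time_created"], unit="s")
meta["year"] = meta["timestamp_created"].dt.year
meta["month"] = meta["timestamp_created"].dt.month
meta["day"] = meta["timestamp_created"].dt.day

# Later, once a column is known to be unneeded, it can be dropped
trimmed = meta.drop(columns=["thumbnail"])
print(trimmed[["title", "year", "month", "day"]])
```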

The column I was immediately interested in was the title column. The title column contained multiple data points that could be split out into their own columns, for example, episode number and guest name.

# Use regex to extract the episode number from the title
yt_jre['Episode'] = yt_jre['title'].str.extract(r'#(\d+)', expand=False)

# Use string manipulation to get the guest name out of the title
def get_guest(x):
    title = str(x)
    if "-" in title:
        # Split only on the first hyphen so hyphenated guest names stay intact
        return title.split("-", 1)[1].strip()
    elif "with" in title:
        return title.split("with", 1)[1].strip()
    else:
        return x

yt_jre['guest'] = yt_jre['title'].apply(get_guest)
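As an alternative to splitting strings, a single regular expression can pull both values out at once. This is only a sketch: the pattern assumes titles shaped like "... #123 - Guest Name" or "... #123 with Guest Name", and the sample titles below are made up.

```python
import re

# Episode number after "#", then "-" or "with", then the guest name
TITLE_PATTERN = re.compile(
    r"#(?P<episode>\d+)\s*(?:-|with)\s*(?P<guest>.+)$",
    re.IGNORECASE,
)


def parse_title(title):
    """Return (episode_number, guest); episode is None if no match."""
    text = str(title)
    match = TITLE_PATTERN.search(text)
    if match is None:
        # Fall back to the whole title, as the split-based version does
        return None, text.strip()
    return int(match.group("episode")), match.group("guest").strip()


for sample in [
    "Example Podcast #1000 - Jean-Claude Example",
    "Example Podcast #42 with Jane Doe",
    "Best of the Week",
]:
    print(parse_title(sample))
```

Titles that do not follow either shape come back with None as the episode number, which makes them easy to find and inspect by hand.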

Notice also that the duration column is a string-formatted value. I wanted to convert that into an integer column denoting the podcast duration in minutes. This can be handy for further analysis, for example if you wanted to gain insight into which types of podcasts run longer than others, or if you wanted to create a heat map of the most common podcast durations.

There is also a column called views, which consists of a string that contains a number (separated by commas). I wanted to extract the integer from this column and store a raw integer value for the views, too.

# Method to get the total minutes from the Duration Column
def convert_to_min(x):
    splits = str(x).split(":")
    if len(splits) == 3:
        hr = int(splits[0]) * 60
        total = hr + int(splits[1])
    else:
        total = int(splits[0])

    return total

yt_jre['duration_minutes'] = yt_jre['duration'].apply(lambda x: convert_to_min(x))

# Probably could have used regex here.
yt_jre['views_raw'] = yt_jre['views'].str.replace(",", "")
yt_jre['views_raw'] = yt_jre['views_raw'].astype(int)
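For reference, here is a sketch of how the same two conversions could look as small standalone functions, including the regex idea mentioned in the comment. It assumes durations arrive as "H:MM:SS", "MM:SS" or plain seconds, and that views are digits with optional separators. The function names and sample values are hypothetical.

```python
import re


def duration_to_minutes(text):
    """Convert 'H:MM:SS', 'MM:SS' or 'SS' into whole minutes."""
    parts = [int(p) for p in str(text).split(":")]
    if len(parts) > 3:
        raise ValueError(f"Unexpected duration format: {text!r}")
    # Left-pad with zeros so we always have hours, minutes, seconds
    parts = [0] * (3 - len(parts)) + parts
    hours, minutes, _seconds = parts
    return hours * 60 + minutes


def views_to_int(text):
    """Strip every non-digit character and return the number of views."""
    digits = re.sub(r"\D", "", str(text))
    if not digits:
        raise ValueError(f"No digits found in: {text!r}")
    return int(digits)


print(duration_to_minutes("2:30:15"))  # prints: 150
print(views_to_int("1,323,460"))       # prints: 1323460
```

Both functions can be passed straight to a column's apply method, in the same way as convert_to_min above.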

Now, I have all of the columns in a format that can be analysed or visualised.

Let’s take a look at the columns in the dataframe.

All the columns we have in the dataframe

I have done some super basic cleaning on the raw dataframe. There is still more cleaning to do, but for the basis of this article, I will leave it here.

Next, I ran a describe method on the dataframe to see some basic statistics.

Statistics from the JRE YouTube podcast data

yt_jre['views_raw'].mean()
# 1323460.4403323263

yt_jre['duration_minutes'].mean()
# 150.0725075528701

yt_jre['likes'].mean()
# 16713.877643504533

yt_jre['dislikes'].mean()
# 1566.5430513595165

Analysis

On average, each JRE podcast gets around 1.3M views. It would be interesting to see a time plot of when these views began to increase, because then you can identify the pivotal episodes that helped Joe Rogan improve his fan base.
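A first step towards that time plot could be averaging views per year and looking at where the biggest jump happens. The sketch below uses made-up numbers. Only the year and views_raw column names come from the dataframe built earlier.

```python
import pandas as pd

# Made-up view counts, one row per episode
sample = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2020],
    "views_raw": [100, 300, 500, 700, 900],
})

# Average views per episode for each year
views_by_year = sample.groupby("year")["views_raw"].mean()

# Year-over-year change; the first year has nothing to compare against
yearly_change = views_by_year.diff()

print(views_by_year)
print("Largest jump in:", yearly_change.idxmax())
```

From here, views_by_year.plot() would draw the trend if matplotlib is installed.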

His podcasts average about two and a half hours (150 minutes) each, and he gets roughly 16.7K likes and 1.6K dislikes per podcast on average. You can also do some benchmarking against other podcasts and see how they compare with similar guests. You can start to see which podcasts you might recommend to people based on the types of guests that are on. For example, is Joe Rogan the go-to for comedy-type podcasts, or for science-related podcasts?

With some fairly basic analysis out of the way, you can now start to dig in and answer the bigger questions. Who is the most influential podcaster on YouTube? Or maybe you can develop a recommender system to redirect a user to a new podcast based on topics they want to hear about.

What’s Next?

I hope you enjoyed this article and identified some use cases for this data. I would love to see what you create and build. I would also love to receive feedback on how you would have approached this and what kinds of analysis and cleaning you would do when presented with a raw dataset.

Currently, I am working on building an interactive dashboard for this data for accessibility and additional analysis. It is built using React.js front end and a Flask server back end. There is the opportunity here to do a guest-versus-guest comparison to analyze listener preferences on the JRE podcast. Since you have time-based information, it could also be fun to plot the different guests across the time dimension, to see how the diversity of Joe’s guests changed from when he started to the podcasts he produces today.
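As a rough idea of what a guest-versus-guest comparison might start from, here is a sketch that groups episodes by guest and ranks them by the share of likes among all reactions. The data is made up, and the like_ratio metric is just one assumption about how listener preference could be measured.

```python
import pandas as pd

# Made-up episodes; the column names follow the dataframe from this article
sample = pd.DataFrame({
    "guest": ["Guest A", "Guest B", "Guest A", "Guest C"],
    "likes": [900, 400, 1100, 300],
    "dislikes": [100, 100, 100, 300],
})

# Total reactions and number of episodes per guest
by_guest = sample.groupby("guest").agg(
    episodes=("likes", "size"),
    likes=("likes", "sum"),
    dislikes=("dislikes", "sum"),
)

# Share of likes among all likes and dislikes
by_guest["like_ratio"] = by_guest["likes"] / (
    by_guest["likes"] + by_guest["dislikes"]
)

ranked = by_guest.sort_values("like_ratio", ascending=False)
print(ranked)
```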

Another opportunity is to generate a transcript from his podcasts for detailed topic analysis. I am looking at a library called SpeechRecognition that attempts to generate text from audio files. Here is a cool HackerNoon article that outlines this process. Since you already have all the YouTube links to Joe’s podcasts, you can use a library such as youtube_dl to convert the .mp4 files to .mp3 or .wav files, then feed them into speech recognition to try to generate a transcript. This would be ideal for the dashboard application, as you can begin to highlight the main topics in each podcast and run detailed searches on topics across the dataset.

For all you talented data scientists and dataviz experts out there, I would love to see some cool analysis and visualisations with this form of data. I hope I made the data clean enough for use. I appreciate feedback on how I could have improved, and whether or not this data was useful.

Also, I would love input on your approach to cleaning and manipulating these dataframes, and what other insights you can find with the data provided.

Summary

I have just started out in data analytics and data science. I was a full stack developer before and I am enjoying the transition into the data space. I like working with data, trying to gain insights, and creating fun and interesting visualisations. I am still new to this space and look forward to learning and growing in this field.

I have written a few dataviz articles on Nightingale as well, mostly on UFC data. You can check them out here and here.

If you want to find me on Twitter and Instagram feel free to do so. :)

Gist

I have included a gist below with all the code from the article. It was originally written in a Jupyter Lab notebook, so please forgive me for the messy code.

Please Note: At the time of writing, the JRE YouTube playlists no longer exist because of the move to Spotify. You will not be able to fully replicate the code to get the Joe Rogan data, but feel free to get data from a YouTube playlist of your choice. The logic is still the same.