Data mining Apple Podcast reviews of “The Joe Rogan Experience” to skyrocket his Spotify launch!
Introduction and Motivation
First of all, I’m a big fan of JRE — it’s probably the only non-professional podcast I listen to regularly.
Second, the deal he made with Spotify to drop Apple Podcasts and YouTube (worth $100 mil!) was a massive statement about Spotify’s intentions regarding podcasts.
…And also a massive change for his fanbase. I personally watch his podcast exclusively on YouTube, so this will be a great disturbance to my experience.
It’s interesting to note that just a couple of days ago (I had already started writing this at the time) Spotify announced it is launching video podcasts, starting with some creators who previously didn’t include video in their podcasts. Maybe they did it because of JRE’s fanbase?
As a big fan of JRE, I am sure that it is absolutely impossible for Joe to go through the millions of comments, reviews and tweets that his fanbase of millions of people constantly writes about the podcast.
Having worked as a Product Manager on iOS and Android apps, I know that Apple and Google don’t provide any NLP analysis of the reviews you get, so you need to read them manually every day to get a good understanding of the opinions, feelings, comments and questions your users are posting.
Thus, I am motivated to change this and I want to help Joe with understanding what his fans are posting on Apple Podcasts!
1. Tools and Process
A. Scraping the data:
This is how reviews look on Apple Podcasts:
To understand what and how I did the scraping, cleaning and preprocessing of the data, please see my earlier post: “ASOS: USING MACHINE LEARNING TO GET ACTIONABLE PRODUCT INSIGHTS FROM APP STORE REVIEWS”.
I’m using the same scraper I had already built, but I added the ability to scrape Apple Podcasts. I also messaged the creator of the original scraper and provided him with the code that makes it work on Apple Podcasts, and he replied that he is going to include it — Yay!
B. Data Cleaning & Preprocessing:
In short (a code sketch follows the list):
- Removed empty reviews (on Apple Podcasts you can leave a star rating without writing a text comment)
- Removed unwanted characters (@, #, $, %, &, .)
- Removed stop words (“a”, “and”, “but”, “or”, etc.)
- Made the entire text lowercase
- Lemmatized words (walking -> walk, walked -> walk, etc.)
- Removed reviews shorter than 30 characters (there isn’t much insight in a review as short as “great”)
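Here is a minimal sketch of that preprocessing step using NLTK and pandas. The file name, the "review"/"clean" column names and the exact character list are illustrative assumptions, not necessarily the exact code I ran:

import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads for the stop-word list and the WordNet lemmatizer
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    """Lower-case, strip unwanted characters, drop stop words and lemmatize."""
    text = text.lower()
    text = re.sub(r"[@#$%&.,!?]", " ", text)              # unwanted characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # walking -> walk
    return " ".join(tokens)

# Assumed: the scraped reviews live in a CSV with a "review" column
df = pd.read_csv("jre_apple_podcasts_reviews.csv")
df = df.dropna(subset=["review"])                         # rating-only reviews
df["clean"] = df["review"].apply(clean_review)
df = df[df["clean"].str.len() >= 30]                      # drop very short reviews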
C. Data Analysis
As we can see, Joe Rogan has overwhelmingly positive reviews on Apple Podcasts — not a surprise given his huge fan base!
Below we can see the top 20 most commonly used words in his reviews:
Here is how they look as a “Bag of words”:
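Both figures can be produced in a few lines. Here is a rough sketch, where the wordcloud package and the "clean" column are my assumptions for illustration:

from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

all_words = " ".join(df["clean"]).split()
top_20 = Counter(all_words).most_common(20)

# Bar chart of the 20 most common words
words, counts = zip(*top_20)
plt.bar(words, counts)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# The "bag of words" figure rendered as a word cloud
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(all_words))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()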
I got curious why rape is mentioned so much and apparently this article explains it:
2. Using the Natural Language Toolkit (NLTK) and SentimentIntensityAnalyzer()
Moving forward, I wanted to focus only on the positive reviews so that we can see what the audience likes. For this I use the pre-trained SentimentIntensityAnalyzer(), which analyses text and returns values from -1 to +1 indicating how negative or positive a review is.
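A minimal sketch of that scoring step with NLTK’s VADER analyzer; the 0.05 cut-off for calling a review “positive” is a common convention and an assumption on my part:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# The compound score goes from -1 (most negative) to +1 (most positive)
df["sentiment"] = df["review"].apply(lambda t: sia.polarity_scores(t)["compound"])

positive_reviews = df[df["sentiment"] >= 0.05]    # assumed cut-off for "positive"
negative_reviews = df[df["sentiment"] <= -0.05]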
Below are the 20 most commonly used words in positive reviews, presented as a bar chart and as a “bag of words”.
Using the bag of words we can see that what motivates people to drop a 5* review on Apple Podcasts is:
- Comedy, interesting guests, talking about philosophy, having diverse guests and opinions
3. Topic Modelling
It’s impossible for any stakeholder to read, every day, the thousands of reviews that a podcast, app or product gets, just to form a general understanding of what customers think about it. A good way to solve this problem is topic modelling.
Topic modelling is a machine learning technique that automatically analyses text data to determine clusters of words for a set of documents.
I am going to use Latent Dirichlet Allocation (LDA) for the following reasons:
- Easy to implement in Python
- Probabilistic model with interpretable topics and can work on any dataset
- It works well on small amounts of data (such as the short reviews on Apple Podcasts)
Disadvantages:
- There is no objective way to choose hyper-parameters
- It’s not the fastest topic modelling approach to run
Base LDA
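For context, here is roughly how such a base model can be trained and printed with gensim; variable names like tokenized_reviews are placeholders and, apart from the number of topics, I’m assuming the library defaults:

from gensim import corpora
from gensim.models import LdaModel

# Each review becomes a list of tokens, e.g. ["great", "guest", ...]
tokenized_reviews = [text.split() for text in df["clean"]]

dictionary = corpora.Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_reviews]

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    random_state=42,
)

# 5 topics, top 10 words per topic
for topic in lda_model.print_topics(num_topics=5, num_words=10):
    print(topic)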
Running the basic LDA gives us the following 5 topics with the top 10 words per topic (explained below):
[(0,
  '0.058*"bad" + 0.040*"conspiracy" + 0.029*"theory" + 0.026*"star" + '
  '0.022*"stupid" + 0.020*"annoying" + 0.020*"racist" + 0.018*"drug" + '
  '0.015*"terrible" + 0.012*"idiot"'),
 (1,
  '0.020*"open" + 0.016*"country" + 0.014*"death" + 0.014*"sad" + 0.013*"jre" + '
  '0.012*"minded" + 0.012*"medium" + 0.012*"dangerous" + 0.012*"self" + 0.011*"complete"'),
 (2,
  '0.048*"rape" + 0.026*"dumb" + 0.022*"well" + 0.017*"human" + 0.016*"fear" + '
  '0.014*"ignorant" + 0.013*"next" + 0.013*"crap" + 0.012*"factor" + 0.011*"white"'),
 (3,
  '0.036*"man" + 0.017*"last" + 0.016*"big" + 0.014*"platform" + 0.013*"fan" + '
  '0.013*"nonsense" + 0.012*"old" + 0.012*"woman" + 0.012*"hard" + 0.012*"poor"'),
 (4,
  '0.031*"people" + 0.027*"guest" + 0.024*"time" + 0.018*"show" + '
  '0.017*"episode" + 0.016*"good" + 0.015*"thing" + 0.013*"guy" + 0.013*"day" + 0.012*"life"')]
It also gives us the following perplexity & coherence scores (explained below):
Perplexity: -7.430625960051932
Coherence Score: 0.36507586388016394
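Continuing from the sketch above, both numbers can be computed with gensim; the c_v coherence measure is an assumption on my part:

from gensim.models import CoherenceModel

# Bound on the per-word log-likelihood of the corpus (the "Perplexity" above)
print("Perplexity:", lda_model.log_perplexity(corpus))

# c_v coherence: higher means more semantically consistent topics
coherence_model = CoherenceModel(
    model=lda_model,
    texts=tokenized_reviews,
    dictionary=dictionary,
    coherence="c_v",
)
print("Coherence Score:", coherence_model.get_coherence())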
Explaining the results:
The numbers before the words show the weight each word has in each topic. So in reality every topic contains all of the words from all reviews; they just carry different weights.
For example, Topic 1 has a weight of 0.058 for the word “bad”; “bad” is also present in Topic 5, but with a weight of around 0.0001.
The number of words shown per topic is just an arbitrary cut-off of all the words the LDA has ranked. In my case I want enough words to get a broad picture of the topic, but not so many that the list becomes useless. It’s important to always look at the weights of the words.
For example, in Topic 1 we have 0.058*”bad” and 0.012*”idiot”: “bad” is more than 4 times as important for the topic as “idiot”.
What is Perplexity?
“It captures how surprised a model is of new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set.” *1
However, optimising for perplexity typically doesn’t lead to human-readable results.
What is Coherence?
From the book Intelligent Data Engineering and Automated Learning:
“Topic coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic.”
In simple words, higher coherence means better results.
So in our case we want to optimise for Coherence.
4. Running 540 different versions of the LDA to get the best Coherence score
To overcome the lack of an objective way to choose hyper-parameters, I am going to run 540 different versions of the LDA with all the possible variations of the following LDA parameters:
- Number of Topics (K): 2 to 10
- Dirichlet hyperparameter alpha (Document-Topic Density): 0.01, 0.3, 1, plus ‘asymmetric’
- Dirichlet hyperparameter beta (Word-Topic Density): 0.01, 0.3, 1
And then I am going to evaluate all models based on coherence and pick the best model.
Using a script to do everything for me and save the results in a .csv file (a sketch of it is below) — it took my 2017 MacBook Pro about 5 hours to run all 540 variations of the LDA.
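Here is a condensed sketch of what that script does, again with gensim. The helper name compute_coherence and the CSV file name are mine, and the grid below is a simplified illustration rather than the exact 540-run sweep:

import pandas as pd
from gensim.models import CoherenceModel, LdaModel

def compute_coherence(num_topics, alpha, beta):
    """Train one LDA configuration and return its c_v coherence score."""
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        alpha=alpha,
        eta=beta,               # gensim calls the beta prior "eta"
        random_state=42,
    )
    return CoherenceModel(
        model=model,
        texts=tokenized_reviews,
        dictionary=dictionary,
        coherence="c_v",
    ).get_coherence()

results = []
for k in range(2, 11):                             # number of topics: 2 to 10
    for alpha in [0.01, 0.3, 1.0, "asymmetric"]:   # Document-Topic density
        for beta in [0.01, 0.3, 1.0]:              # Word-Topic density
            results.append({
                "topics": k,
                "alpha": alpha,
                "beta": beta,
                "coherence": compute_coherence(k, alpha, beta),
            })

pd.DataFrame(results).to_csv("lda_tuning_results.csv", index=False)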
Filtering and sorting the data to see the best performing LDA and its parameters for all Topic #numbers (2 to 10):
The highest coherence is found with 2 topics, followed by 7, then 4 and 8. The coherence scores vary between 0.54 and 0.57, a great improvement over the baseline of 0.36.
Since the difference between the best models for each number of topics is minimal, I ran all of them and compared the inter-topic distance (how distinguishable the topics are):
After running all of the models and calculating the inter-topic distance, we can see that even though the LDA is good at picking up nuances in the topics, it makes much more sense to pick 3 clearly distinguishable topics rather than more, as beyond that the topics overlap for the most part.
Based on this I am running a final model consisting of 3 topics with the following parameters:
- Number of Topics: 3
- Dirichlet hyperparameter alpha: ‘asymmetric’
- Dirichlet hyperparameter beta: 0.1
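Continuing from the earlier sketches, the final model in gensim terms looks roughly like this (eta is gensim’s name for beta; everything else follows the parameters above):

from gensim.models import LdaModel

final_lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    alpha="asymmetric",
    eta=0.1,
    random_state=42,
)
for topic in final_lda.print_topics(num_words=10):
    print(topic)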
Topic 1:
'0.032*"guest" + 0.025*"good" + 0.020*"great" + 0.019*"time" + 0.019*"show" + 0.016*"episode" + 0.015*"interesting" + 0.015*"life" + 0.014*"day" + 0.013*"topic"'
Topic 2:
'0.025*"people" + 0.012*"opinion" + 0.009*"open" + 0.009*"view" + 0.009*"right" + 0.008*"bad" + 0.007*"point" + 0.007*"question" + 0.006*"side" + 0.006*"minded"'
Topic 3:
'0.010*"woman" + 0.009*"eye" + 0.005*"boy" + 0.005*"fear" + 0.005*"money" + 0.004*"deal" + 0.004*"hater" + 0.004*"industry" + 0.004*"air" + 0.004*"mouth"'
Now, because topics expressed as keywords are not perfectly human-readable, I decided to make them easier to comprehend and “ran the model backwards” by assigning topics to each review. After assigning topics, I sort all reviews and check the most representative reviews per topic.
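In gensim terms this boils down to scoring every review against the final model and sorting by topic probability; a rough sketch, with the DataFrame columns as placeholders:

# Probability that each review belongs to each of the 3 topics
doc_topics = [
    final_lda.get_document_topics(bow, minimum_probability=0.0)
    for bow in corpus
]
for i in range(3):
    df[f"topic_{i + 1}"] = [dict(dt)[i] for dt in doc_topics]

# The most representative reviews for, e.g., Topic 1
top_topic_1 = df.sort_values("topic_1", ascending=False).head(3)
print(top_topic_1[["review", "topic_1"]])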
Topic 1: “I like it, but…”
'0.032*"guest" + 0.025*"good" + 0.020*"great" + 0.019*"time" + '0.019*"show" + 0.016*"episode" + 0.015*"interesting" + 0.015*"life" + 0.014*"day" + '0.013*"topic"'),
Top 3 reviews that represent that topic
1. Review #1 → 98.8%
This was a great podcast that helps a person who wants information about how to improve their minds and bodies. However Joe Rogan and his political guests are usually Islamaphobes and Rogan himself displays little to no empathy for muslims or POC, and doesn’t seem to want guests who will defend Islam or give perspectives of such.
2. Review #2 → 98.4%
Never have peter shiff again that guy is so annoying and does not know how to have a conversation. He was just spitting out rants and not left joe talk. Did not even fully consider joes ideas or completely address them.
3. Review #3 → 98.3%
I listen to JRE more than I listen to music now and what I most love about it the range of topics he covers and he manages to still have a sense to make all topics fun.
Topic 2: “Biggest fans”
'0.025*"people" + 0.012*"opinion" + 0.009*"open" + 0.009*"view" +
'0.009*"right" + 0.008*"bad" + 0.007*"point" + 0.007*"question" +
'0.006*"side" + 0.006*"minded"')
Top 3 reviews that represent that topic
1. Review #1 → 97.6%
Easily the most entertaining, motivating and god damn interesting podcast EVER!!! Thank you Mr Rogan, you are an inspiration!!!!!
2. Review #2 → 97.3%
Keeping it real and have intelligent conversations! Gotta get Joey Diaz on more often, you and him together are HILARIOUS!
3. Review #3 → 96.9%
Love the show. Hate the repetition! Your confidence in belittling the virus is worrying. Isn’t alot in the real word for normalcy normality?
Topic 3: “Everything else?”
'0.010*"woman" + 0.009*"eye" + 0.005*"boy" + 0.005*"fear" + 0.005*"money" + '0.004*"deal" + 0.004*"hater" + 0.004*"industry" + 0.004*"air" + '0.004*"mouth"')]
Top 3 reviews that represent that topic
1. Review #1 → 90.1%
Covers everything, from ancient aliens to Bert getting attacked by a bear. By far my favorite show. I could handle more MMA and BJJ talk, Eddie is great. More Bert, Stanhope, Ari, Joey and the Deathsquad.
2. Review #2 → 90.01%
One of the first podcasts I ever listened to. I can confidently say it changed the way I viewed and understood the world and help made me a more outgoing, compassionate and understanding person…it’s also funny as hell
3. Review #3 → 90.01%
Joe has the best guests…John Heffron, Joey Diaz, Ari Shafer, Eddie Bravo. Plus he doesn’t care what anyone else thinks. He’s a genuine person, and respects anyone and everyone as long as they’re open minded. Great podcast, always makes me laugh, and sometimes really makes me later evaluate decisions I’ve made. Or rather why I chose to make those decisions.
Insights
1. Providing useful information at a glance
First of all, reading through 30,000–40,000 reviews is impossible, especially on a regular basis.
Because of that the charts:
- Top 20 most commonly used words from all reviews
- Top 20 most commonly used words from positive/negative reviews
- “Bag of Words” for all/positive/negative reviews
provide insightful information at a glance. They can easily be run on a monthly, weekly or daily basis and compared period over period. I’m not doing that here so this article doesn’t become too long.
2. Understanding the main topics (concerns, comments, questions, etc.) raised by your audience.
Imagine you post a picture on Instagram — it goes viral and 30,000 people comment below it. If this is the first time it happens, you might go through all of the comments; now imagine it happens for every picture you post. You are looking at the picture of yourself in a hot air balloon on vacation in Cappadocia and wonder: are these people commenting on my looks? Are they commenting on the balloon… or maybe they are sharing that they went to Cappadocia too and what their experience was?
By running an LDA we can very easily see the main topics from those 30,000 comments, identified in an unsupervised way — meaning the computer figures things out by itself without help from us. (So we don’t need to be experts on every picture and its contents, but it may also come up with topics we are not very interested in — like people commenting on the colours of the hot air balloon.)
Moreover, each review is assigned a % of which topic it belongs to.
Example with review #2330
lda_model[corpus[2330]]
Out[]:
[(0, 0.869107), (1, 0.07557727), (2, 0.055315714)]
This review is ~87% Topic 1, ~7.5% Topic 2 and ~5.5% Topic 3.
You can easily sort & read reviews by topics.
In our case, let’s say Joe is interested to read reviews that are around the theme “I like it, BUT” (which seems to be Topic 1, identified in an unsupervised way by the computer).
He can now very easily filter & read reviews on a monthly/weekly/daily basis that are made by fans of the podcast, BUT have some constructive feedback.
Something he wouldn’t be able to do before.
Going further
Areas for improvement
- In hindsight I think that picking 3 topics wasn’t the best decision, as Topic 3 seems way too broad to provide real insights. Ideally I would run everything again and look for insights with 4, 5, 6 or 7 topics.
- I would be curious to see the sentiment of each topic. Are any of the topics more positive or negative than the others?
- Also, I haven’t used the Title field at all, which provides very rich data (typically a summary of the problem), so it could potentially provide even better insights than the Review field. Another option is to merge the Title and Review fields, as both are supposed to describe the same issue, so we could end up with more data on it. Of course, extra words in the Title could also throw some things off.
- I could also try other topic modelling techniques such as LSA, PLSA & lda2Vec, compare the results and potentially get even more insights.
- If I had “industry knowledge” I would probably want to run a supervised model to look for specific things in the data, e.g. “find reviews which comment on the guests.” (But that’s the power of unsupervised learning: I could very easily have run this on reviews of nuclear reactors and still gotten insights!)
End Notes
It took me about 20 hours to write the “ASOS article”, and for this one I reused almost all of the code from it and spent an additional 20–30 hours to “upgrade” it. This may sound like a lot, but the really cool thing about such projects is that once you get everything running, continuing to get the same insights in the future is very easy. It’s almost just a matter of running a single Python file that produces all of the data above.