Data mining Apple Podcast reviews of “The Joe Rogan Experience” to skyrocket its Spotify launch!

Introduction and Motivation

https://newsroom.spotify.com/2020-07-21/spotify-fans-can-better-connect-to-creators-with-new-video-podcasts/

1. Tools and Process

  • Removed empty reviews (on Apple Podcasts you can assign a star rating without writing a text comment)
  • Removed unwanted characters (@, #, $, %, &, .)
  • Removed stop words (“a,” “and,” “but,” “or,” etc.)
  • Made the entire text lower case
  • Lemmatization (walking -> walk, walked -> walk, etc.)
  • Removed reviews shorter than 30 characters (a review as short as “great” doesn’t offer much insight)
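The cleaning steps above can be sketched as one small function. This is a hypothetical mini-pipeline, not the author’s actual code: a real run would use NLTK’s WordNetLemmatizer for lemmatization, where the crude suffix rule below stands in so the sketch stays self-contained.

```python
import re

# Tiny stand-in for a real stop-word list (NLTK ships a full one).
STOP_WORDS = {"a", "an", "and", "but", "or", "the", "is", "it"}

def clean_review(text, min_len=30):
    """Return a list of cleaned tokens, or None if the review is dropped."""
    if not text or len(text) < min_len:       # drop empty / too-short reviews
        return None
    text = re.sub(r"[@#$%&.]", "", text)      # remove unwanted characters
    tokens = text.lower().split()             # lower-case and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # crude stand-in for lemmatization: walking -> walk, walked -> walk
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

print(clean_review("Great!"))  # -> None (shorter than 30 characters)
print(clean_review("Walking and talking with the guests was great fun."))
```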
Distribution of Review Ratings
Most used words in the reviews

2. Using the Natural Language Toolkit (NLTK) and SentimentIntensityAnalyzer()

Most commonly used words from positive reviews
Bag of Words — Positive reviews
  • Comedy, interesting guests, talking about philosophy, having diverse guests and opinions
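The split into positive and negative reviews, and the word tallies behind charts like the one above, can be sketched as follows. The hard-coded compound scores are stand-ins for what NLTK’s SentimentIntensityAnalyzer().polarity_scores(text)["compound"] would return; ±0.05 is the commonly used neutrality threshold.

```python
from collections import Counter

# (review text, VADER compound score) pairs; scores are illustrative stand-ins
reviews = [
    ("love the comedy and the interesting guests", 0.81),
    ("great diverse guests and great opinions", 0.77),
    ("stupid annoying terrible", -0.85),
]

positive = [text for text, compound in reviews if compound >= 0.05]
negative = [text for text, compound in reviews if compound <= -0.05]

# "Most commonly used words from positive reviews"
word_counts = Counter(word for text in positive for word in text.split())
print(word_counts.most_common(3))
```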

3. Topic Modelling

  • Easy to implement in Python
  • Probabilistic model with interpretable topics that can work on any dataset
  • It works well on small amounts of data (such as the short reviews on Apple Podcasts)
  • There is no objective way to choose hyper-parameters
  • It’s not the fastest topic-modelling approach to run
[(0,
  '0.058*"bad" + 0.040*"conspiracy" + 0.029*"theory" + 0.026*"star" + '
  '0.022*"stupid" + 0.020*"annoying" + 0.020*"racist" + 0.018*"drug" + '
  '0.015*"terrible" + 0.012*"idiot"'),
 (1,
  '0.020*"open" + 0.016*"country" + 0.014*"death" + 0.014*"sad" + 0.013*"jre" + '
  '0.012*"minded" + 0.012*"medium" + 0.012*"dangerous" + 0.012*"self" + '
  '0.011*"complete"'),
 (2,
  '0.048*"rape" + 0.026*"dumb" + 0.022*"well" + 0.017*"human" + 0.016*"fear" + '
  '0.014*"ignorant" + 0.013*"next" + 0.013*"crap" + 0.012*"factor" + '
  '0.011*"white"'),
 (3,
  '0.036*"man" + 0.017*"last" + 0.016*"big" + 0.014*"platform" + 0.013*"fan" + '
  '0.013*"nonsense" + 0.012*"old" + 0.012*"woman" + 0.012*"hard" + '
  '0.012*"poor"'),
 (4,
  '0.031*"people" + 0.027*"guest" + 0.024*"time" + 0.018*"show" + '
  '0.017*"episode" + 0.016*"good" + 0.015*"thing" + 0.013*"guy" + '
  '0.013*"day" + 0.012*"life"')]
Perplexity:  -7.430625960051932

Coherence Score: 0.36507586388016394
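One detail worth clarifying about the perplexity figure (an interpretation, assuming it comes from gensim’s LdaModel.log_perplexity()): that method returns a per-word likelihood bound, not the perplexity itself, which is why the number is negative. gensim’s own logging converts the bound to a perplexity estimate as 2 ** (-bound), where lower is better:

```python
# "Perplexity: -7.43..." above is a per-word likelihood bound from
# LdaModel.log_perplexity(); convert it the way gensim's logging does.
log_perplexity_bound = -7.430625960051932
perplexity = 2 ** (-log_perplexity_bound)
print(round(perplexity, 1))  # roughly 172.5
```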

4. Running 540 different versions of the LDA to get the best Coherence score

  1. Number of Topics (K): 2 to 10
  2. Dirichlet hyperparameter alpha (Document-Topic Density): 0.01 to 1, in steps of 0.3
  3. Dirichlet hyperparameter beta (Word-Topic Density): 0.01 to 1, in steps of 0.3
Coherence per number of topics
  1. Number of Topics: 3
  2. Dirichlet hyperparameter alpha: ‘asymmetric’
  3. Dirichlet hyperparameter beta: 0.1
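A grid search like the one above can be sketched as follows. Here coherence_of() is a hypothetical stand-in for fitting a gensim LdaModel per configuration and scoring it with CoherenceModel (coherence="c_v"); the scores are made up so the sketch is self-contained, with the winning configuration matching the one reported above.

```python
import itertools

def coherence_of(num_topics, alpha, beta):
    # Stand-in scores; a real run fits and scores an LDA model per combination.
    fake_scores = {(3, "asymmetric", 0.1): 0.42}
    return fake_scores.get((num_topics, alpha, beta), 0.30)

grid = itertools.product(
    range(2, 11),                             # number of topics K: 2..10
    [0.01, 0.31, "symmetric", "asymmetric"],  # alpha candidates
    [0.01, 0.1, 0.31],                        # beta candidates
)
best = max(grid, key=lambda cfg: coherence_of(*cfg))
print(best)  # -> (3, 'asymmetric', 0.1)
```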
Topic 1:
'0.032*"guest" + 0.025*"good" + 0.020*"great" + 0.019*"time" + 0.019*"show" + 0.016*"episode" + 0.015*"interesting" + 0.015*"life" + 0.014*"day" + 0.013*"topic"'
Topic 2:
'0.025*"people" + 0.012*"opinion" + 0.009*"open" + 0.009*"view" + 0.009*"right" + 0.008*"bad" + 0.007*"point" + 0.007*"question" + 0.006*"side" + 0.006*"minded"'
Topic 3:
'0.010*"woman" + 0.009*"eye" + 0.005*"boy" + 0.005*"fear" + 0.005*"money" + 0.004*"deal" + 0.004*"hater" + 0.004*"industry" + 0.004*"air" + 0.004*"mouth"'

Topic 1: “I like it, but…”

'0.032*"guest" + 0.025*"good" + 0.020*"great" + 0.019*"time" + 0.019*"show" + 0.016*"episode" + 0.015*"interesting" + 0.015*"life" + 0.014*"day" + 0.013*"topic"'
  1. Review #1 → 98.8%

Topic 2: “Biggest fans”

'0.025*"people" + 0.012*"opinion" + 0.009*"open" + 0.009*"view" + 0.009*"right" + 0.008*"bad" + 0.007*"point" + 0.007*"question" + 0.006*"side" + 0.006*"minded"'
  1. Review #1 → 97.6%

Topic 3: “Everything else?”

'0.010*"woman" + 0.009*"eye" + 0.005*"boy" + 0.005*"fear" + 0.005*"money" + 0.004*"deal" + 0.004*"hater" + 0.004*"industry" + 0.004*"air" + 0.004*"mouth"'
  1. Review #1 → 90.1%
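The “Review #1 → 98.8%” figures above come from each review’s per-topic probability distribution. A sketch of picking the most representative review per topic, with hard-coded stand-ins for what lda_model[corpus[i]] returns (a list of (topic_id, probability) pairs; the review names and numbers here are illustrative):

```python
# Per-review topic distributions, mirroring the shape of lda_model[corpus[i]].
doc_topics = {
    "review_1": [(0, 0.988), (1, 0.007), (2, 0.005)],
    "review_2": [(0, 0.015), (1, 0.976), (2, 0.009)],
    "review_3": [(0, 0.060), (1, 0.039), (2, 0.901)],
}

def top_review(topic_id):
    """Review that puts the most probability mass on topic_id."""
    return max(doc_topics, key=lambda r: dict(doc_topics[r]).get(topic_id, 0.0))

print(top_review(0))  # -> review_1 (98.8% on topic 0)
```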

Insights

  1. Providing useful information at a glance
  • Top 20 most commonly used words from all reviews
  • Top 20 most commonly used words from positive/negative reviews
  • “Bag of Words” for all/positive/negative reviews
lda_model[corpus[2330]]
[(0, 0.869107), (1, 0.07557727), (2, 0.055315714)]

Going further

  1. In hindsight, I think that picking 3 topics wasn’t the best decision, as Topic 3 seems way too broad to provide real insights. In the best-case scenario, I would run everything again and look for insights with 4, 5, 6, or 7 topics.
  2. I would be curious to see the sentiment of each of the topics. Are any of the topics more positive or negative than the others?
  3. Also, I haven’t used the Title field at all, even though it provides very rich data (typically a summary of the problem), so it could potentially offer even better insights than the Review. Another option is to merge the Title and Review fields: both are supposed to describe the same problem, so we could end up with more data on the same issue. Of course, extra words in the Title could also throw some things off.
  4. I could also try other Topic Modelling techniques, such as LSA, pLSA, and lda2vec, compare the results, and potentially get even more insights.
  5. If I had “industry knowledge,” I would probably want to run a supervised model to look for specific things in the data, e.g. “find reviews which comment on the guests.” (But that’s the power of unsupervised learning: I could just as easily have run this on reviews of nuclear reactors and still gotten insights!)

End Notes

_____

If you like this post, please click and hold down the 👏 button for 10 seconds to show your support!

Stoimen Iliev

Pro Bono Machine Learning Consultant | Senior Product Manager & Certified Scrum Product Owner | MBA at Cornell Johnson | Fulbright Scholar | Software Engineer