Determining Popularity of Rising Pop Artists with Scraped Spotify Data and NLP Sentiment Analysis

James Pecore
Published in Analytics Vidhya
Oct 19, 2020
Executive Summary (image produced by author James Pecore)

Problem Statement:

Spotify uses a “popularity” parameter to rank songs, albums, and artists. This metric is based on how often users stream songs on Spotify. But it only shows how popular very recent artists are in general, not popularity within a genre or popularity by song or lyrical content. As a result, classic songs that were historically very popular are overlooked. Additionally, artists who are very popular within their genre are overshadowed by artists from higher-popularity genres like “pop.” We need a new metric for popularity. In fact, we need more than one.

The following questions will help us re-evaluate Spotify’s stream-popularity metric in the greater context of the data:

1. What can we say about a song’s popularity based on aspects of the music itself, like danceability, energy, and acousticness?

2. What can we say about a song’s popularity based on the content of an artist’s lyrics: the verbal connotations and vibe of the poetry?

3. How does each of these factors influence our ability to predict the popularity of an artist or song?

4. Finally, when using Regression modeling, Classification modeling, and NLP Clustering to predict the popularity of a musical artist, how can we evaluate whether or not to trust Spotify’s ranking of popularity?

Executive Summary:

I create two different datasets using Spotipy (a Python client for the Spotify Web API) and the Genius API. I also use a Kaggle dataset by Zaheen Hamidani to increase the size of my data.

Next, I build a wide variety of Regression Models for the dataset of around 150,000 songs. These models try to predict a song’s “stream-popularity” from the song’s musical attributes (like energy, valence, modality, time signature, and other characteristics). I also use many different Classification Models to measure whether we can predict that a song is popular (a popularity score above 75 on a scale of 0 to 100) from these same song attributes.
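To make the classification target concrete, here is a minimal sketch of turning the 0 to 100 popularity score into a binary “popular” label using the threshold of 75 described above. The rows and field names are made up for illustration and are not taken from the actual dataset. The sketch also computes the majority-class baseline, since any classifier should at least beat the accuracy of always guessing the most common class.

```python
# Hypothetical rows: in the real dataset each row would be one Spotify track.
tracks = [
    {"name": "song_a", "popularity": 82},
    {"name": "song_b", "popularity": 40},
    {"name": "song_c", "popularity": 76},
    {"name": "song_d", "popularity": 75},
    {"name": "song_e", "popularity": 12},
]

POPULAR_THRESHOLD = 75  # "popular" means a score strictly above 75


def label_popular(rows, threshold=POPULAR_THRESHOLD):
    """Return 1 for tracks above the threshold and 0 otherwise."""
    return [1 if row["popularity"] > threshold else 0 for row in rows]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


labels = label_popular(tracks)
baseline = majority_baseline(labels)
print(labels, baseline)
```

With a high threshold, popular songs are likely to be the minority class, so plain accuracy can look good even for a model that never predicts “popular.”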

For Lyric Attributes, I use the shorter list of playlist songs (just 700 songs) from Spotify as a basis for which lyrics to scrape. I scrape the lyrics for each of these songs from Genius’ lyric library. I use sentiment analysis and NLP (CountVectorizer) to perform EDA on the most common words and sentiments for each song. Finally, I try to evaluate whether a song’s most common words and sentiment correlate with its popularity.

Exploratory Data Analysis

Popularity Distribution of 150,000 Spotify Songs, image produced by author James Pecore
Correlation of Song Attributes with Stream-Popularity, image produced by author James Pecore

As data scientists, we should be surprised that “Loudness” predicts a Spotify song’s “stream-popularity” so accurately. Why is this?

Well, “stream-popularity” tends to favor more recently produced music (as current music is streamed more often and is thus more “stream-popular” than older music).

Image from Music Tech Student (Itsaam), link provided in Works Cited

Contemporary music (2007 and onward) sounds louder when streamed because of how musical compression has changed over time. Because late-2000s innovations in digital music allowed music to be less compressed, modern digital music is simply perceived as louder than digitized, compressed recordings from earlier years.

My point: past a certain point, loudness doesn’t make your music more popular. If it did, “Heavy Metal” would be everyone’s favorite genre.

Acousticness, however, does seem to impact a song’s popularity. As the infographic below details, more popular songs generally have fewer elements of acoustic music and more elements of digital music. Given the recent trend in pop music toward digital production in DAWs like Logic, Pro Tools, FL Studio, and Ableton, this data makes sense.

Correlation of Song Attribute (Acousticness) with Stream-Popularity, image produced by author James Pecore
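For readers who want to reproduce this kind of attribute-versus-popularity comparison, one common choice is Pearson’s correlation coefficient. The sketch below is an assumption about how such a number could be computed, not a record of how the charts were made, and the values are invented purely to show a negative relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values: more acoustic songs get lower popularity scores here.
acousticness = [0.9, 0.7, 0.5, 0.3, 0.1]
popularity = [20, 35, 50, 60, 80]

r = pearson_r(acousticness, popularity)
print(round(r, 3))  # a value close to -1 means a strong negative correlation
```

A value near 0 would mean the attribute tells us little about popularity on its own, while values near +1 or -1 indicate a strong linear relationship.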

Regression Modeling:

Lyrical Analysis:

Sentiment analysis is the process of creating binaries of words in order to determine whether a body of text is closer to one pole or another. For instance, I create a binary of “Love”-related words versus “Heartbreak”-related words. Then, I vectorize each word in each song’s lyrics using CountVectorizer. This converts the words into numerical vectors that can then be clustered based on similarity of words.
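The vectorizing step can be illustrated without any dependencies. The sketch below is a simplified standard-library stand-in for what scikit-learn’s CountVectorizer does: it builds a shared vocabulary and then one row of word counts per song. The tokenizer is deliberately naive, and the two lyric snippets are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def count_vectorize(documents):
    """Return (vocabulary, matrix) with one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


# Made-up lyric snippets, one string per song.
lyrics = ["love love me tonight", "you broke my heart tonight"]

vocabulary, matrix = count_vectorize(lyrics)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a numerical vector for one song, so songs that share many words end up with similar vectors, which is what makes clustering possible.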

Finally, I create a metric that normalizes each song’s sentiment to a score between -1 and +1: closer to +1 for “Love” songs and closer to -1 for “Heartbreak” songs. I can then use this lyrical metric (among other Sentiment Analysis binaries) as a feature for modeling.
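One way such a normalized score could be built is sketched below. The word lists and the formula, (love - heartbreak) / (love + heartbreak), are assumptions chosen for illustration and may differ from the exact metric used in this project.

```python
import re
from collections import Counter

# Hypothetical word lists; a real analysis would use much larger ones.
LOVE_WORDS = {"love", "kiss", "adore", "darling"}
HEARTBREAK_WORDS = {"broke", "cry", "goodbye", "lonely"}


def love_heartbreak_score(lyrics):
    """Score lyrics from -1.0 (all heartbreak words) to +1.0 (all love words)."""
    words = Counter(re.findall(r"[a-z']+", lyrics.lower()))
    love = sum(words[word] for word in LOVE_WORDS)
    heartbreak = sum(words[word] for word in HEARTBREAK_WORDS)
    total = love + heartbreak
    if total == 0:
        return 0.0  # no sentiment words found, so treat the song as neutral
    return (love - heartbreak) / total


print(love_heartbreak_score("love love love but I cry"))
```

Dividing by the total number of matched words keeps long and short songs on the same scale, and returning 0.0 for songs with no matches avoids a division by zero.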

Classification Modeling:

Clustering Analysis:

Recommendations

General Recommendation to Song Writers:

  • Increase Energy and Danceability to be around average values (60%)
  • Decrease Acousticness and use digital instruments / music production
  • Only increase Loudness to make it easy to listen to on a mobile phone
  • If you mention “love” more in your song, it can’t hurt

Recommendation 1: All-Time Stream Popularity

  • Create a new popularity metric based on “Total Number of Streams of All Time”
  • This will let us grade older songs comparably with newer songs
  • We could compare historical trends in music with current trends without worrying about the improper scaling of Stream Popularity

Recommendation 2: Personal Popularities

  • Bring back a 5-Star or “One-to-Ten” review system for each user’s songs
  • This will let us assess what styles each individual user prefers
  • This will allow us to create a Regression Model and Recommender System for each user based on their highest-rated songs, improving user turnout

Recommendation 3: Song Features Review

  • Create an optional Features Review section for each song in Spotify
  • Vectorize the words used in Features Review
  • Create Sentiment Analyses with these Vectors
  • Create a recommender system with these Vectorized Sentiments
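The steps in Recommendation 3 could be prototyped along these lines. This is a rough sketch of one possible approach, not an existing Spotify feature: it vectorizes free-text reviews as word counts and recommends the candidate whose review is most similar (by cosine similarity) to a review the user liked. The song names and review texts are invented, and a fuller version would add a sentiment step before comparing.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn review text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_review, candidate_reviews, top_n=1):
    """Return the songs whose reviews are most similar to the liked review."""
    liked = vectorize(liked_review)
    scored = [
        (cosine_similarity(liked, vectorize(text)), song)
        for song, text in candidate_reviews.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [song for _, song in scored[:top_n]]


# Hypothetical reviews keyed by hypothetical song names.
candidate_reviews = {
    "song_x": "upbeat dance track with bright synths",
    "song_y": "slow sad acoustic ballad",
}
picks = recommend("bright upbeat synths for dancing", candidate_reviews)
print(picks)
```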

Recommendation 4: Individual Research

  • Artists like Charlie Puth, Lizzo, and Lady Gaga studied music at established schools like Berklee, MSM, NYU, and the University of Houston
  • Once you’ve narrowed artists down to your “Top Five,” research each one individually to decide who to promote

Further Research and Future Projects

  1. Using Parallel Programming (AWS) not Serial Programming (Jupyter)
    - Processing all 150,000 song lyrics
    - Extending NLP Sentiment Analysis to all 150,000 song lyrics
    - Performing NLP Clustering with SpaCy on all 150,000 song lyrics
  2. Using Public Opinion on Pop Songs for Sentiment Analysis
    - Scraping News/Twitter/Reddit/Tumblr/etc. Posts for All Songs
    - Using NLP to Determine if Public Opinion Towards Artist is -, 0, or +
  3. Using Song Attributes & Reviews to Create a Recommender System
    - Publish online or submit to Record Labels / Streaming Companies

Works Cited

  • General Assembly Data Science Immersive 2020
  • Cho, Youngmin P. “Quantify Music and Audio.” NYC Data Science Academy, 3 June 2019, nycdatascience.com/blog/student-works/web-scraping/spotify-x-billboard/.
  • Georgieva, Elena, lecturer. “HitPredict: Using Spotify Data to Predict Billboard Hits.” ICML 2020, research by Nicholas Burton and Marcella Suta, Stanford University, 18 July 2020.
  • Gingeleski, Ashley. “Spotify Web API: How to Pull and Clean Top Song Data Using Python.” Ashley Gingeleski, 11 Nov. 2019, ashleygingeleski.com/2019/11/11/spotify-web-api-how-to-pull-and-clean-top-song-data-using-python/.
  • Hamidani, Zaheen. Kaggle, 2019, www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db.
  • Itsaam. “History and Development of Compression.” Music Tech Student, 11 Aug. 2013, musictechstudent.co.uk/music-technology/history-and-development-of-compression/.
  • Loot, Rare. “Extracting Spotify data on your favourite artist via Python.” Medium, 30 Dec. 2018, medium.com/@RareLoot/extracting-spotify-data-on-your-favourite-artist-via-python-d58bc92a4330.
  • Passy, Jacob. “How Spotify influences what songs become popular (or not).” MarketWatch, 18 June 2018, www.marketwatch.com/story/how-spotify-influences-what-songs-become-popular-or-not-2018-06-18.
  • Pierre, Sadrach. “Analysis of Top 50 Spotify Songs using Python.” Medium, Towards Data Science, 27 Dec. 2019, towardsdatascience.com/analysis-of-top-50-spotify-songs-using-python-5a278dee980c.
  • Sahu, Apratim. “Country-wise visual analysis of music taste using Spotify’s API & Seaborn in Python.” Medium, Towards Data Science, 12 June 2020, towardsdatascience.com/country-wise-visual-analysis-of-music-taste-using-spotify-api-seaborn-in-python-77f5b749b421.
  • SpaCy. spacy.io/. Sept. 2020.
  • Spotify for Developers. Spotify, developer.spotify.com/dashboard/. Sept. 2020.
  • Spotipy: Read the Docs. Spotipy, spotipy.readthedocs.io/en/2.16.0/. Sept. 2020.
  • Tweepy Documentation. Tweepy, docs.tweepy.org/en/latest/. Sept. 2020.
