Visualizing Spotify Top 200 data in Tableau & implementing a fast Python web-scraper

Anushree Bagwe
Published in Analytics Vidhya · 13 min read · Apr 22, 2021
Image by Author: Author’s Spotify Account

Long before the birth of music streaming platforms, like the majority of the Indian population, I got my song fixes by bootlegging or pirating them from some random website, or legally watching the songs on YouTube (Napster wasn't popular where I lived). Apple Music did exist then, but as a teen who predominantly used Windows, I barely knew how to use Apple Music, let alone that it was a paid service. Now, along with Apple, you have Amazon, YouTube, Pandora, SoundCloud, and multiple others competing in the music streaming industry.

But since its inception, Spotify has become a staple amongst mobile and computer users alike when it comes to streaming music. It is widely popular amongst both Apple and Android users, unlike Amazon Music or Apple Music, which some may say cater more towards Amazon or Apple products respectively.

Music is a universal language in itself that connects people across the globe & songs are just melodically described stories by the artist.

As a veteran Spotify user, I thought of how interesting it would be to analyze the data behind the musical stories and visualize this data-driven story.

Automating data collection through a web scraper

Image from Webharvy

First of all, what is web scraping? Web scraping is a technique for extracting data from a website, or part of one, and saving it in a readable format like a database or a JSON file. It is a good alternative to an API when a website doesn't offer one, or when your access to the API is limited. Web scrapers let you make use of the abundant data available on websites, whether to collect new data or to enrich existing data. Scraping can be done easily with dedicated web scraping software, or programmatically by writing code, which can get quite complex depending on the website's structure and the data to be scraped. Essentially, web scrapers automate the repetitive task of collecting similarly structured data from web pages that share the same layout. Good examples of web scraping include extracting product data from an e-commerce store or pulling stock price data to make better predictions.

Web scraping is a technique for extracting data from a website, or part of one, and saving it in a readable format like a database or a JSON file.

Is it Legal?

Image from Prowebscraper

Web scraping is considered illegal when you exploit or abuse the scraped data for commercial use or financial gain.

  • All websites have certain guidelines on which parts of the site can and cannot be scraped. As long as you scrape only the parts of the website that the owner has allowed access to, you are safe.
  • Additionally, as long as you stay within the public domain and don't extract prohibited data, scraping is considered legal.
  • As a human being, you can probably access or click through 10 pages or so in a minute, but a program can access hundreds or thousands of pages in that time. So it is important to pay attention to your crawl rate: don't make requests so frequently that the server can't handle them, causing congestion or making it think it is under attack.

How do I access the website's guidelines?

It is good practice to read the robots.txt file offered by each website. This file specifies which parts of the website can and cannot be accessed, and at what rate you may crawl it.
How do you access the robots.txt file? In your browser's URL box, type the URL of the website you want to scrape and append /robots.txt to the end. For example: https://google.com/robots.txt

The robots.txt file usually contains fields like User-agent, Disallow, Allow, and Crawl-delay.

Image by Author: Example of robots.txt for Buzzfeed

As an example, see the image above. Buzzfeed wants msnbot to wait 120 seconds before crawling each page, and it has disallowed msnbot from crawling any of the URL strings listed.

Here are a few examples of robots.txt rules and what they mean:

  • Blocking all web crawlers from all content
User-agent: *
Disallow: /
  • Allowing all web crawlers access to all content
User-agent: *
Disallow:
  • Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
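You can also check these rules programmatically. Python's standard library ships urllib.robotparser, which reads a site's robots.txt and tells you whether a given user agent may fetch a URL; here is a minimal sketch (the URLs are just illustrative):

from urllib import robotparser

# point the parser at the site's robots.txt and read it
rp = robotparser.RobotFileParser()
rp.set_url("https://spotifycharts.com/robots.txt")
rp.read()

# may a generic crawler ("*") fetch this page?
print(rp.can_fetch("*", "https://spotifycharts.com/viral/global/daily"))

# crawl delay for this user agent, if the site specifies one (None otherwise)
print(rp.crawl_delay("*"))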

Now, the data that I want to scrape and visualize lives on the Spotify Charts website.

Luckily for me, the robots.txt for Spotify Charts allows all crawlers to access all content!! So I didn’t do anything illegal!

Image by Author: Results for Spotify Charts robots.txt file — Happy legal

Web scraping process

The data I scraped is the Spotify Top 200 Charts.

Image by Author: Spotify Top 200 Charts

I web scraped data for all regions and for all dates of that region for which data was available on the Spotify charts website. The two essential libraries to build my web scraper were:

  1. The BeautifulSoup (bs4) library, for handling text extraction from the web page's source code (HTML and CSS)
  2. The requests library, for handling the interaction with the web page (using HTTP requests); the full set of imports used by the snippets below is collected right after the diagram
Image by Author: Web scraping process
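For reference, here is the import block that the code snippets below rely on, gathered in one place (the names match how they are used later in the post):

import io
import time
import datetime as dt
import concurrent.futures
from concurrent.futures import wait, ALL_COMPLETED

import requests
import pandas as pd
from bs4 import BeautifulSoup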

Automating region and dates collection

Spotify offers Daily and Weekly charts and I wanted to make a dashboard for the Daily data.

Additionally, I wanted to scrape data for all regions and all possible dates available for that region. Now how would I do that?

Image by Author: Chart for India-in(left) and the United Arab Emirates-ae(right) on different dates

The Spotify Charts website has many different regions and dates by which you can filter the data.
As you can see from the above two images, the URL changes depending on the region and date you select. Each region even has a specific code: India has the code in and the United Arab Emirates has the code ae. So I would need the codes of all regions, and the dates available for each, to access the data for every region and date.

Now, I could easily hard-code a Python list with the names of all regions and dates. But there are some 60 regions here, which would mean visiting all 60 pages to look up each country's region code. It is also important to note that each region has data available for a different range of dates. For example, the United States has data going back to January 1st, 2017, while India only has data going back to February 27th, 2017.
I could create a loop that collects the data from today back to, say, a year ago.

But why create lists manually, when you can programmatically fetch all regions, their codes, and dates available for those regions from the HTML code of the website by scraping it?

If you right-click in your browser, choose Inspect Element, and then use the inspector to click the region filter, you can see all the regions available in that drop-down list.

Image by Author: Using browser’s inspect element to see website HTML code

From the above image, you can see the names of all regions available in the drop-down list, and even their corresponding codes marked as data-value. You can use Inspect Element to see the dates list as well. Now we know what data we want to extract and from which HTML tags.

First, I gave my scraper a base URL, which it uses to scrape the website for all region names and their codes, save them to lists, and zip the codes and region names into a dictionary. Save this zipped dictionary to a CSV file; it will be useful for aliasing during visualization.

### get all content on the home_page
home_page = 'https://spotifycharts.com/viral/global/daily'
hsession = requests.Session()
hresponse = hsession.get(home_page)
soup = BeautifulSoup(hresponse.content.decode('utf-8'), "html.parser")

### parse the home_page to get all the region codes and their full names
# region codes live in the 'data-value' attribute of each <li> in the country drop-down
regions = [item['data-value'] for item in soup.find('div', {'data-type': 'country'}).find_all('li', attrs={'data-value': True})]

# the visible text of each <li> is the region's full name
region_fullname = []
for li in soup.find('div', {'data-type': 'country'}).find_all('li', attrs={'data-value': True}):
    region_fullname.append(li.get_text())

# map region code -> region full name
region_dictionary = dict(zip(regions, region_fullname))
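A minimal sketch for saving the zipped dictionary (the file name regions.csv is just an illustrative choice):

# save the code -> name mapping to CSV for aliasing in Tableau later
pd.DataFrame(list(region_dictionary.items()), columns=['region_code', 'region_name']).to_csv('regions.csv', index=False)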

Next, I wanted to extract all the dates for each region. For this, I would need the primary URL for each region. As you remember, only part of the URL changes according to the region, and the rest remains the same. For this purpose, I created a list that would store all my base URLs for each region.

#create a list of URLS for the landing page of each region collected earlier
URLS=['https://spotifycharts.com/viral/'+region+'/daily/' for region in regions]

Now we want to visit each URL in the URLS list, get all the dates available for that region, and build a new list in which each region URL has a date appended to it, giving one URL per region and date combination that we want to visit.

#### function for getting all dates available for each region
new_URLS = []

def downloadPage(URL):  # pass the landing page of a region
    s1 = requests.Session()
    response1 = s1.get(URL)
    soup_region = BeautifulSoup(response1.content.decode("utf-8"), "html.parser")  # get content of that page
    time.sleep(0.25)  # small pause between requests to keep the crawl rate polite
    all_dates = []  # list holding all dates available on that region's page
    for li in soup_region.find('div', {'data-type': 'date'}).find_all('li'):
        all_dates.append(dt.datetime.strptime(li.get_text(), '%m/%d/%Y').date())
    # build one download URL per available date for this region
    for date in all_dates:
        new_URLS.append(URL + str(date) + '/download')
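To actually collect the dates, downloadPage just needs to run once for every region landing page collected earlier; a minimal sketch:

# populate new_URLS with every region + date download URL
for URL in URLS:
    downloadPage(URL)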

Actual data to be collected

Image by Author: Scraping the table data

Now that we have a list of all regions and all their corresponding dates, we can move on to the actual web scraping and extract the Top 200 table displayed in the web browser. The above image shows the data we actually want to scrape: the highlighted table.

Image by Author: Scraping the table data-tags in the table

The above image shows all the data available in the table: the song's image, its position, its trend icon, the track details, and the number of streams the song has received.

Now, here's a neat little cheat trick. Instead of scraping each cell in the table, we can get the whole table at once by extracting the data behind the "Download to CSV" button. The CSV file contains the entire table shown in the browser, barring the images and trend icons.

Building the web scraper

Now we can parse each of the pages (one per unique region and date combination) for the CSV data and store it in a temporary data frame. We also append the URL to the data frame, because the CSV file doesn't include the region and date, and it is imperative during visualization to know which date and region the data belongs to. This is the actual web scraping code that collects the data for analysis.

# function to get the actual Top 200 data from a new_URL (which encodes the region and date)
def downloadData(new_URL):
    s2 = requests.Session()
    response2 = s2.get(new_URL)
    soup_data = BeautifulSoup(response2.content, "html.parser")  # get page content for that particular date and region
    # the /download page serves the table as raw CSV, so read it into a temporary data frame
    temp_df = pd.read_csv(io.StringIO(soup_data.decode("utf-8")))
    # add a column holding the page URL, so we can extract the region and date from it later
    temp_df['region_date'] = new_URL
    return temp_df

Faster Web Scraping through Multithreading

But each region had about 2 to 3 years' worth of data, meaning each region would have about 1000 dates to choose from. This means we need to go to 70 regions X 1000 dates = 70,000 pages!

Parsing 70,000 pages is no joke, because you need that much compute capacity on your computer. As this isn't a commercial project, I didn't have huge compute capacity, and parsing each link individually, one after another, was time-consuming! The web scraper would take anywhere from 2–3 hours or more to go through 100 or more web pages. This means 70,000 pages would take a day or more.

I needed a faster web scraper. Enter, multi-threading.

Imagine multiple tabs open on your browser and you are simultaneously downloading data on all of them. This is multi-threading.

You have multiple threads performing the same task in parallel, instead of performing it sequentially and waiting much longer. Since we are performing the same task and only the URL changes, multi-threading comes in handy. I used the ThreadPoolExecutor from the concurrent.futures library to perform the multi-threading.

# use concurrent.futures threading to simultaneously fetch the data for all dates in a region
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1510)  # at most 1510 web pages being downloaded simultaneously
futures = [executor.submit(downloadData, new_URL) for new_URL in new_URLS]  # schedule one task per URL and keep the futures in a list
wait(futures, timeout=60, return_when=ALL_COMPLETED)

Essentially, the above block of code passes the function (the task we want to perform again and again) and the list of URLs we want to run it on. The executor runs the tasks on parallel threads. The wait call blocks until all the futures have completed or the 60-second timeout elapses, whichever comes first; futures that have started keep running in the background, while those not yet picked up by a thread remain in a pending state.

Multi-threading significantly reduced the scraping time from 2–3 hours to about 15 to 20 minutes for 1000 web pages!

Next, just combine all the temporary data frames into one final data frame and you have the Top 200 data for all regions and dates; save it to a CSV or Excel file, ready to be used for visualization.
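One possible sketch of this combining step, assuming the futures list from the threading block above (the file name final_data.csv is just illustrative):

# gather the temporary data frames from the futures that completed successfully
temp_frames = []
for f in futures:
    try:
        temp_frames.append(f.result())  # blocks until this future finishes
    except Exception:
        pass  # skip pages that failed to download or parse

# stack everything into one final data frame and save it for Tableau
final_df = pd.concat(temp_frames, ignore_index=True)
final_df.to_csv('final_data.csv', index=False)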

Visualizing Data Using Tableau

Image by Author: Spotify Daily Streaming Dashboard for Top 200 Data (Check out the fully functional dashboard here)

Visualization through Tableau is super-intuitive.

Data Cleaning

Bring the stored CSV/Excel file into Tableau. Next, break up the URL field we appended (region_date in the code) with a Custom Split to get the region and date fields. Then delete any unnecessary fields created by the split and hide the URL field.

I decided to make 4 charts:

  1. Streams over time
  2. Most popular tracks
  3. Most Popular artists
  4. Streams by region map

Here’s a quick breakdown of how to create them.

Streams over time

In a new sheet, drag Continuous WEEK(Date) to Columns and SUM(Streams) to Rows and filter out the Global data. Format the sheet as desired.

Image by Author: Streams over time

Popular tracks- Embedding Spotify URL

In a new sheet, drag SUM(Streams) to Columns and Track URL to Rows. Add Track Name to Rows as well and sort the chart by streams. Now you can see both Track Name and Track URL next to each other on the chart, but we don't need the Track URL, so hide it.

Spotify track links are of the format: https://open.spotify.com/track/0A1hoCfMLkiAgvhWkkucJa

We only need the characters after track/ which is 0A1hoCfMLkiAgvhWkkucJa

To embed a Spotify URL, this is the format: https://embed.spotify.com/?uri=spotify%3Atrack%3APASTEHERE. So we just need to paste the part after track/ into this embedding URL template. If you needed to embed a playlist instead, all you would have to do is change %3Atrack%3A to %3Aplaylist%3A.

Now create a calculated field called Embeded URL. We will use this field to embed a player in our final dashboard that plays whichever song is selected in this chart. Add this formula to Embeded URL: "https://embed.spotify.com/?uri=spotify%3Atrack%3A" + SPLIT([Track URL], '/', 5)

Since we just need the part after track/, we use the SPLIT function to split the Track URL using "/" as the delimiter. Format the sheet as desired.

Image by Author: Popular tracks

Popular artists

In a new sheet, drag Track URL to Columns and change the Measure to Count (Distinct), and then drag the Artist field to Rows. Format the sheet as desired.

Image by Author: Popular artists

Streams by region map- Bulk re-aliasing

First of all, if you directly add the country codes to the sheet, you will see that it gives 18 unknowns. You can manually edit these unknowns and correct them, but we can also correct them automatically.

Image by Author: Country codes in chart- gives 18 unknowns

In Tableau, open the CSV file with the zipped dictionary of region codes and region names that we created earlier during web scraping.

Now go to Data >> Edit Blend Relationships, hit Custom, and select the country codes from both data sources to blend on.

Image by Author: Data Blending

Now, from the primary data set, drag the Country Codes to Rows, and from the zipped country codes data set, drag the Country Names to Rows. Then right-click on the Country Names field, click Edit Primary Aliases, and click OK. Now all the country codes will be mapped to the actual country names, just like on the website.

Image by Author: Creating primary aliases in bulk

Now, in a new sheet, drag the Country Codes to the blank space, and you should see only one unknown: Global. Filter it out by adding Country Codes to the Filters card. Add SUM(Streams) to Color to create a filled map. Format the sheet as desired.

Image by Author: Streams by region map

Finally, create a dashboard, add all the sheets to it, and add action filters for each chart. Then add a Web Page object to the dashboard and add a URL action to it, with the Popular Tracks sheet as the source sheet, Select as the action, and the calculated field Embeded URL that we created earlier as the URL.

Image by Author: Adding Embedded Spotify URL to object
