How to find duplicate songs in Pandora playlist: Pandas and Dictionary

Samuel Woojoo Jun
Published in Future Vision · 4 min read · Aug 7, 2019

I am a big fan of Pandora Radio, and I love its automated music recommendations, which are powered by the Music Genome Project. The one thing it is missing is a way to remove duplicated songs from a playlist, so I want to share how I do it. I also deployed this function as a web app: copy and paste the playlist URL into the website to check for duplicates.

The following packages are used in this work:

  • Requests: requests and receives the playlist data
  • BeautifulSoup: a good friend of the web scraper, used for parsing HTML
  • json: converts a string to a dictionary
  • Pandas: converts a dictionary to a DataFrame

import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

First, request playlist information by using ‘requests’ and parse with ‘BeautifulSoup’:

'''
This is python command

This is displayed result
'''
url = input("Input Pandora Playlist URL: ")
Input Pandora Playlist URL: https://www.pandora.com/playlist/PL:...
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
print(soup)
<!DOCTYPE html>
<html lang="en">
<head>
<script type="application/ld+json">{"@type":"MusicPlaylist","@id":"PL: ... }</script>
<script>
var hasCommand = ....
....
var storeData = {"v4/catalog/annotateObjects":[{"TR:11...":
{"name":"Candle In The Wind (Remastered)",
"sortableName":"Candle In The Wind (Remastered)","duration":229,"trackNumbe...}]}
...

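All the data you need is in ‘var storeData =’, which holds the playlist in dictionary form. The code below splits the page source on ‘var ’ and takes the piece at index 4, which is where storeData sat in this page, so that position depends on the page layout. Let’s extract it: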

page_str = str(soup)
# Parentheses let the chained calls continue on the next line
json_dict = (page_str.split('var ')[4].replace(';\n','')
             .replace('storeData = ',''))
json_dict
{"v4/catalog/annotateObjects": [{"TR:11...":{"name":"Candle In The Wind (Remastered)", "sortableName":"Candle In The Wind (Remastered)","duration":229, "trackNumber":9,"volumeNumber":1,...}]
...
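Taking index 4 after split('var ') works only while storeData stays the fifth piece of the page. As a minimal sketch of a less position-dependent alternative, the helper below searches for the marker text and lets the json module parse the single value that follows it. The helper name extract_store_data and the sample page are made up for illustration, and the sketch assumes the storeData literal is valid JSON, as the json.loads step in this article implies.

```python
import json

def extract_store_data(page_str, marker="var storeData ="):
    """Return the JSON value that follows `marker` (hypothetical helper)."""
    start = page_str.find(marker)
    if start == -1:
        raise ValueError("marker not found in page source")
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so step past it manually.
    while idx < len(page_str) and page_str[idx].isspace():
        idx += 1
    # raw_decode parses one JSON value and ignores whatever follows it
    # (here, the trailing ';' and the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(page_str, idx)
    return obj

# Made-up page fragment, for illustration only.
sample_page = (
    "<script>\n"
    "var hasCommand = false;\n"
    'var storeData = {"v4/catalog/annotateObjects": [{"TR:1": {"name": "Song A"}}]};\n'
    "</script>"
)
store = extract_store_data(sample_page)
print(list(store.keys()))  # ['v4/catalog/annotateObjects']
```

Because the parser stops at the end of the object, there is no need to strip the trailing ‘;’ by hand.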

Convert this information to Dictionary using Json:

type(json_dict)
str
dic = json.loads(json_dict)
type(dic)
dict

This dictionary contains two keys:

  • v7/playlists/getTracks: the order of the songs in the playlist, with each song denoted by its track ID
  • v4/catalog/annotateObjects: basic information about the songs in the playlist

dic.keys()
dict_keys(['v4/catalog/annotateObjects', 'v7/playlists/getTracks'])

Our plan is to create two DataFrames, one for each key in the dictionary, and merge them at the end:

df_tracks=pd.DataFrame(dic['v7/playlists/getTracks'][0]['tracks'])
df_tracks.head()
df_tracks

Each song appears only as a ‘trackPandoraId’, so we need to pull the song information from the other part of the dictionary, annotateObjects.

df_info=pd.DataFrame.from_dict(dic['v4/catalog/annotateObjects'][0],
orient='index')
df_info=df_info.reset_index()
df_info.rename(columns={'index':'trackPandoraId'}, inplace=True)
df_info
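To see what orient='index' does here, the sketch below builds df_info from a tiny hand-made dictionary shaped like the annotateObjects entry: track IDs as outer keys, song fields as inner keys. The IDs, names, and durations are invented for illustration and are not real Pandora data.

```python
import pandas as pd

# Made-up stand-in for dic['v4/catalog/annotateObjects'][0]
annotations = {
    "TR:1": {"name": "Song A", "artistName": "Artist X", "duration": 200},
    "TR:2": {"name": "Song B", "artistName": "Artist Y", "duration": 180},
}

# orient='index' turns each outer key into a row label
# and each inner key into a column.
df_info = pd.DataFrame.from_dict(annotations, orient="index")

# reset_index moves the row labels into a column named 'index',
# which is then renamed so it matches the key used in df_tracks.
df_info = df_info.reset_index().rename(columns={"index": "trackPandoraId"})
print(df_info)
```

The rename matters because the merge in the next step matches rows by the ‘trackPandoraId’ column in both DataFrames.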

Now we have all the information we need. Let’s merge these DataFrames:

df = df_tracks.merge(df_info, left_on='trackPandoraId', 
right_on='trackPandoraId').sort_values(by=['itemId'])
df: df_tracks and df_info joined on ‘trackPandoraId’
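The merge can be tried on small made-up frames to see how a track that occurs twice in the playlist picks up the same song information both times. This sketch uses the shorthand on='trackPandoraId' (equivalent to giving the same name to left_on and right_on) and adds validate='many_to_one', an optional pandas check that each track ID appears only once in df_info. The sample rows are invented for illustration.

```python
import pandas as pd

# Playlist order: TR:1 occurs twice (made-up data).
df_tracks = pd.DataFrame({
    "itemId": [1, 2, 3],
    "trackPandoraId": ["TR:1", "TR:2", "TR:1"],
})

# One row of song information per track ID (made-up data).
df_info = pd.DataFrame({
    "trackPandoraId": ["TR:1", "TR:2"],
    "name": ["Song A", "Song B"],
    "artistName": ["Artist X", "Artist Y"],
})

# validate='many_to_one' raises an error if df_info repeats a track ID.
df = df_tracks.merge(
    df_info, on="trackPandoraId", validate="many_to_one"
).sort_values(by="itemId")
print(df[["itemId", "name", "artistName"]])
```

Sorting by ‘itemId’ afterwards restores the playlist order, as in the article's own merge.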

Use the groupby function to show how many times each song appears in this playlist:

(df[['name','artistName','itemId']].groupby(['name','artistName'])
 .count().sort_values(by='itemId', ascending=False))
df grouped by name and artist displaying number of duplicates
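The grouped table lists every song, including those that appear only once. As a small sketch on made-up data, the variant below uses size() to count rows per group and then keeps only the songs that occur more than once. The variable names counts and repeats are illustrative.

```python
import pandas as pd

# Made-up playlist: "Song A" by "Artist X" appears twice.
df = pd.DataFrame({
    "itemId": [1, 2, 3, 4],
    "name": ["Song A", "Song B", "Song A", "Song C"],
    "artistName": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

# size() counts the rows in each (name, artistName) group.
counts = (
    df.groupby(["name", "artistName"])
    .size()
    .sort_values(ascending=False)
)

# Keep only the songs that occur more than once.
repeats = counts[counts > 1]
print(repeats)
```

Grouping on both name and artist keeps two different songs that happen to share a title from being counted together.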

Use the duplicated function to see where the duplicate songs are located in the playlist (‘True’ marks a song whose name already appeared in an earlier row):

df['duplicated'] = df.duplicated(subset='name')
df[['name','duplicated']]
df displaying name and duplicated column only
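Note that duplicated(subset='name') compares song names only, while the groupby step above uses both name and artist. The sketch below, on made-up data, shows how the result changes when the artist is included in subset, and how keep=False flags every copy of a repeated song instead of only the later ones. The column names by_name and any_copy are illustrative.

```python
import pandas as pd

# Made-up playlist: three songs titled "Hello", two of them by the same artist.
df = pd.DataFrame({
    "name": ["Hello", "Song B", "Hello", "Hello"],
    "artistName": ["Artist X", "Artist Y", "Artist Z", "Artist X"],
})

# Name only: any later row with a repeated title is flagged.
df["by_name"] = df.duplicated(subset="name")

# Name and artist: only a true repeat of the same song is flagged.
df["duplicated"] = df.duplicated(subset=["name", "artistName"])

# keep=False flags every copy, including the first occurrence.
df["any_copy"] = df.duplicated(subset=["name", "artistName"], keep=False)
print(df)
```

Which variant is appropriate depends on whether a cover of the same title by another artist should count as a duplicate.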

The entire function is shown below:

def dupli_check(url):
    '''
    Input
    url (String): Pandora Playlist URL

    Output
    df_numbers (DataFrame): Number of duplicate songs in playlist
    df_loc (DataFrame): Location of duplicate songs in playlist
    '''

    # Request playlist information by using 'requests'
    # and parse with 'BeautifulSoup'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html")
    page_str = str(soup)
    json_dict = (page_str.split('var ')[4]
                 .replace(';\n', '').replace('storeData = ', ''))

    # Convert this information to a dictionary using json
    dic = json.loads(json_dict)

    # This dictionary contains two keys:
    # ['v4/catalog/annotateObjects', 'v7/playlists/getTracks']
    # Our plan is to create two DataFrames, one for each key
    # in the dictionary, and merge them at the end:
    df_tracks = pd.DataFrame(dic['v7/playlists/getTracks'][0]['tracks'])

    # Each song appears only as a 'trackPandoraId',
    # so we need to pull song information from the other part of
    # the dictionary, annotateObjects.
    df_info = pd.DataFrame.from_dict(
        dic['v4/catalog/annotateObjects'][0], orient='index')
    df_info = df_info.reset_index()
    df_info.rename(columns={'index': 'trackPandoraId'}, inplace=True)

    # Now we have all the information we need.
    # Let's merge these DataFrames:
    df = df_tracks.merge(df_info, left_on='trackPandoraId',
                         right_on='trackPandoraId').sort_values(by=['itemId'])

    # Use the groupby function to count
    # how many times each song appears in this playlist:
    df_numbers = (df[['name', 'artistName', 'itemId']]
                  .groupby(['name', 'artistName']).count()
                  .sort_values(by='itemId', ascending=False))

    # Use the duplicated function to see where the duplicate
    # songs are located in the playlist:
    df['duplicated'] = df.duplicated(subset='name')
    df_loc = df[['name', 'duplicated']]

    return df_numbers, df_loc

All the functions mentioned above are deployed to the web app. Copy and paste the playlist URL into the website to check for duplicates. A quick demo of how to use the web app is shown below. Enjoy!

LinkedIn, Blog. Originally published at https://github.com.

Samuel Woojoo Jun
Future Vision

Chemical and materials engineer turned data scientist