How to find duplicate songs in a Pandora playlist: Pandas and Dictionary
I am a big fan of Pandora Radio, and I love its automated music recommendations, which are powered by the Music Genome Project. The only thing missing is a way to remove duplicated songs from a playlist, so I want to share how I do it. I also deployed this function as a web app: just copy and paste the playlist URL into this website to check for duplicates.
The following packages are used in this work:
- Requests: Requests and receives the playlist data
- BeautifulSoup: A web scraper's good friend, used for parsing HTML
- json: Converts a string to dictionary form
- Pandas: Converts a dictionary to a DataFrame
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
First, request the playlist information with ‘requests’ and parse it with ‘BeautifulSoup’:
'''
Each Python command is followed by the output it displays.
'''
url = input("Input Pandora Playlist URL: ")
Input Pandora Playlist URL: https://www.pandora.com/playlist/PL:...
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
print(soup)
<!DOCTYPE html>
<html lang="en">
<head>
<script type="application/ld+json">{"@type":"MusicPlaylist","@id":"PL: ... }</script>
<script>
var hasCommand = ....
....
var storeData = {"v4/catalog/annotateObjects":[{"TR:11...":
{"name":"Candle In The Wind (Remastered)","sortableName":"Candle
In The Wind (Remastered)","duration":229,"trackNumbe...}]}
...
All the data you need is in the ‘var storeData =’ assignment, which is in dictionary form. Let’s extract it:
page_str = str(soup)
json_dict = (page_str.split('var ')[4].replace(';\n','')
             .replace('storeData = ',''))
{"v4/catalog/annotateObjects": [{"TR:11...":{"name":"Candle In The Wind (Remastered)", "sortableName":"Candle In The Wind (Remastered)","duration":229, "trackNumber":9,"volumeNumber":1,...}]
...
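Taking the fifth chunk of `split('var ')` depends on the exact order of the `var` declarations in the page, so it can break if the markup changes. Here is a minimal sketch of a more tolerant alternative, assuming only that the page still contains `var storeData = {...}`. It finds the assignment by name and lets `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The helper name `extract_store_data` and the sample page are made up for illustration; this is not what the original code does.

```python
import json
import re


def extract_store_data(page_str):
    """Return the object assigned to `var storeData` in the page source.

    Hypothetical helper: assumes the page contains `var storeData = {...}`.
    """
    match = re.search(r"var\s+storeData\s*=\s*", page_str)
    if match is None:
        raise ValueError("'var storeData' not found in page")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and later scripts).
    data, _end = json.JSONDecoder().raw_decode(page_str, match.end())
    return data


# Made-up stand-in for the real page source.
sample_page = (
    "<script>var hasCommand = false;\n"
    'var storeData = {"v4/catalog/annotateObjects": [{"TR:1": {"name": "Song A"}}],'
    ' "v7/playlists/getTracks": [{"tracks": []}]};\n'
    "</script>"
)
store_data = extract_store_data(sample_page)
print(sorted(store_data))
```

Because the result is already a dictionary, this variant would also cover the `json.loads` step.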
Convert this string to a dictionary with ‘json’:
type(json_dict)
str
dic = json.loads(json_dict)
type(dic)
dict
This dictionary contains two keys:
- v7/playlists/getTracks: Contains order of songs, denoted as trackID, in the playlist
- v4/catalog/annotateObjects: Contains basic information of songs included in the playlist
dic.keys()
dict_keys(['v4/catalog/annotateObjects', 'v7/playlists/getTracks'])
Our plan is to create two DataFrames, one for each key in the dictionary, and merge them at the end:
df_tracks=pd.DataFrame(dic['v7/playlists/getTracks'][0]['tracks'])
df_tracks.head()
Each song is identified only by its ‘trackPandoraId’, so we need to pull the song information from the other part of the dictionary, annotateObjects:
df_info=pd.DataFrame.from_dict(dic['v4/catalog/annotateObjects'][0],
orient='index')
df_info=df_info.reset_index()
df_info.rename(columns={'index':'trackPandoraId'}, inplace=True)
Now we have all the information we need. Let’s merge these DataFrames:
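To see what `orient='index'` does here, the sketch below runs the same three steps on a tiny made-up dictionary. The track IDs and song details are invented, and the real annotateObjects entries carry many more fields.

```python
import pandas as pd

# Made-up stand-in for dic['v4/catalog/annotateObjects'][0]:
# a dictionary keyed by track ID, one inner dictionary per song.
annotate_objects = {
    "TR:1": {"name": "Song A", "artistName": "Artist X"},
    "TR:2": {"name": "Song B", "artistName": "Artist Y"},
}

# orient='index' makes each outer key a row label and each inner key a column.
df_info = pd.DataFrame.from_dict(annotate_objects, orient="index")

# Move the row labels into a regular column and give it the join-key name.
df_info = df_info.reset_index().rename(columns={"index": "trackPandoraId"})
print(df_info)
```

The renamed column has to match the column name in the tracks DataFrame exactly, otherwise the later merge has no key to join on.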
df = df_tracks.merge(df_info, left_on='trackPandoraId',
right_on='trackPandoraId').sort_values(by=['itemId'])
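The merge can be tried out on toy data. The sketch below uses invented rows and assumes that ‘itemId’ reflects playlist order. It also shows two optional `merge` arguments that the code above does not use: `how='left'` keeps playlist entries even when no song details are found, and `indicator=True` adds a ‘_merge’ column that makes such gaps visible.

```python
import pandas as pd

# Made-up playlist order: the same track ID appears twice.
df_tracks = pd.DataFrame({
    "itemId": [1, 2, 3],
    "trackPandoraId": ["TR:1", "TR:2", "TR:1"],
})
# Made-up song details, one row per unique track ID.
df_info = pd.DataFrame({
    "trackPandoraId": ["TR:1", "TR:2"],
    "name": ["Song A", "Song B"],
    "artistName": ["Artist X", "Artist Y"],
})

# on= is shorthand when both frames use the same column name.
# how='left' keeps every playlist entry; indicator=True adds a '_merge'
# column showing whether details were found for that entry.
df = df_tracks.merge(
    df_info, on="trackPandoraId", how="left", indicator=True
).sort_values(by="itemId")
print(df[["itemId", "name", "artistName", "_merge"]])
```

With the default inner join, an entry without details is dropped silently. That is fine for counting duplicates, but it is worth knowing about.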
Use the groupby function to count how many times each song appears in the playlist; any count above 1 is a duplicate:
(df[['name','artistName','itemId']].groupby(['name','artistName'])
 .count().sort_values(by='itemId', ascending=False))
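The grouped count lists every song, including those that appear only once. A small sketch on invented data shows one way to keep only the repeated songs, using `size()` to count rows per group and a boolean filter on the result.

```python
import pandas as pd

# Made-up merged playlist: "Song A" by "Artist X" appears twice.
df = pd.DataFrame({
    "itemId": [1, 2, 3, 4],
    "name": ["Song A", "Song B", "Song A", "Song C"],
    "artistName": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

# size() counts rows per (name, artistName) pair.
counts = df.groupby(["name", "artistName"]).size().sort_values(ascending=False)

# Keep only the songs that occur more than once.
repeated = counts[counts > 1]
print(repeated)
```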
Use the duplicated function to see where the duplicate songs are located in the playlist (‘True’ marks a repeat of a song that appeared earlier; the first occurrence stays ‘False’):
df['duplicated'] = df.duplicated(subset='name')
df[['name','duplicated']]
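Checking on ‘name’ alone also flags two different songs that happen to share a title. The sketch below, on invented data, passes both ‘name’ and ‘artistName’ as the subset so the check matches the grouping used for the counts. It also shows `keep=False`, which flags every copy instead of only the later ones. The column name ‘in_duplicate_group’ is made up for this example.

```python
import pandas as pd

# Made-up playlist: "Song B" appears twice, but by two different artists.
df = pd.DataFrame({
    "name": ["Song A", "Song B", "Song A", "Song B"],
    "artistName": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

# Default keep='first': only later repeats are flagged.
df["duplicated"] = df.duplicated(subset=["name", "artistName"])

# keep=False flags every member of a duplicate group, first one included.
df["in_duplicate_group"] = df.duplicated(
    subset=["name", "artistName"], keep=False
)
print(df)
```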
The entire function is shown below:
def dupli_check(url):
    '''
    Input
        url(String): Pandora Playlist URL
    Output
        df_numbers(DataFrame): Number of duplicate songs in playlist
        df_loc(DataFrame): Location of duplicate songs in playlist
    '''
    #Request playlist information by using 'requests'
    #and parse with 'BeautifulSoup'
    r = requests.get(url)
    soup = BeautifulSoup(r.content,"html")
    page_str = str(soup)
    json_dict = (page_str.split('var ')[4]
                 .replace(';\n','').replace('storeData = ',''))
    #Convert this information to Dictionary using Json
    dic = json.loads(json_dict)
    #This dictionary contains two keys:
    #['v4/catalog/annotateObjects', 'v7/playlists/getTracks']
    #Our plan is to create two DataFrames, one for each key
    #in the dictionary, and merge them at the end:
    df_tracks = pd.DataFrame(dic['v7/playlists/getTracks'][0]
                             ['tracks'])
    #Each song is displayed as a 'trackPandoraId',
    #so we need to pull song information from the other part of
    #dictionary, annotateObjects.
    df_info = pd.DataFrame.from_dict(
        dic['v4/catalog/annotateObjects'][0], orient='index')
    df_info = df_info.reset_index()
    df_info.rename(columns={'index':'trackPandoraId'}, inplace=True)
    #Now we have all the information we need.
    #Let's merge these DataFrames:
    df = df_tracks.merge(df_info, left_on='trackPandoraId',
                         right_on='trackPandoraId').sort_values(by=['itemId'])
    #Use groupby function to count how many times
    #each song appears in this playlist:
    df_numbers = (df[['name', 'artistName', 'itemId']]
                  .groupby(['name', 'artistName']).count()
                  .sort_values(by='itemId', ascending=False))
    #Use duplicated function to see where the duplicate
    #songs are located in the playlist:
    df['duplicated'] = df.duplicated(subset='name')
    df_loc = df[['name','duplicated']]
    return df_numbers, df_loc
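As a cross-check that does not need pandas, the same counting can be sketched with plain dictionaries and `collections.Counter`. The function name `count_duplicates` and the sample dictionary are hypothetical. The sketch assumes the parsed page data has the two keys and the nesting used in the function above.

```python
from collections import Counter


def count_duplicates(dic):
    """Return {(name, artistName): count} for songs appearing more than once.

    Hypothetical helper; assumes the nesting of the two keys described above.
    """
    tracks = dic["v7/playlists/getTracks"][0]["tracks"]
    info = dic["v4/catalog/annotateObjects"][0]
    counts = Counter()
    for track in tracks:
        details = info.get(track["trackPandoraId"])
        if details is None:
            continue  # no details for this ID, so it cannot be compared
        counts[(details["name"], details["artistName"])] += 1
    return {song: n for song, n in counts.items() if n > 1}


# Made-up stand-in for the dictionary parsed from the page.
sample_dic = {
    "v7/playlists/getTracks": [{"tracks": [
        {"itemId": 1, "trackPandoraId": "TR:1"},
        {"itemId": 2, "trackPandoraId": "TR:2"},
        {"itemId": 3, "trackPandoraId": "TR:1"},
    ]}],
    "v4/catalog/annotateObjects": [{
        "TR:1": {"name": "Song A", "artistName": "Artist X"},
        "TR:2": {"name": "Song B", "artistName": "Artist Y"},
    }],
}
duplicates = count_duplicates(sample_dic)
print(duplicates)
```

Keeping the counting separate from the download also makes it easy to test without a network connection.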
All the functions described above are deployed to the web app: just copy and paste the playlist URL into this website to check for duplicates. A quick demo of how to use the web app is shown below. Enjoy!
Originally published at https://github.com.