How to find duplicate songs in Pandora playlist: Pandas and Dictionary

Published in Future Vision · 4 min read · Aug 7, 2019

I am a big fan of Pandora Radio, and I love its automated music recommendations, which are powered by the Music Genome Project. The only thing missing is a way to remove duplicated songs from a playlist, so I want to share how I do it. I also deployed this function as a web app: all you need to do is copy and paste the playlist URL into the website to check for duplicates.

The following packages are used in this work:

Requests: Sends the request for the playlist page and receives the response
BeautifulSoup: A web scraper's good friend, used for parsing HTML
json: Converts a string to a dictionary
Pandas: Converts a dictionary to a DataFrame

import pandas as pd 
import requests 
from bs4 import BeautifulSoup 
import json

First, request the playlist page using ‘requests’ and parse it with ‘BeautifulSoup’:

'''
This is python command
This is displayed result
'''
url = input("Input Pandora Playlist URL: ")
Input Pandora Playlist URL: https://www.pandora.com/playlist/PL:...

r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
print(soup)
<!DOCTYPE html> 
<html lang="en"> 
<head> 
<script type="application/ld+json">{"@type":"MusicPlaylist","@id":"PL: ... }</script> 
<script> 
    var hasCommand = .... 
    .... 
    var storeData = {"v4/catalog/annotateObjects":[{"TR:11...":
    {"name":"Candle In The Wind (Remastered)","sortableName":"Candle
    In The Wind (Remastered)","duration":229,"trackNumbe...}]} 
    ...

All the data you need is in the ‘var storeData = …’ assignment, whose value is in dictionary (JSON) form. Let’s extract it:

page_str = str(soup)
json_dict = (page_str.split('var ')[4]
             .replace(';\n', '')
             .replace('storeData = ', ''))
print(json_dict)
{"v4/catalog/annotateObjects": [{"TR:11...":{"name":"Candle In The Wind (Remastered)", "sortableName":"Candle In The Wind (Remastered)","duration":229, "trackNumber":9,"volumeNumber":1,...}] 
...
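Picking the fifth piece of `split('var ')` depends on how many `var` declarations come before `storeData`, so it can break if the page layout changes. A minimal sketch of an alternative, assuming the page still embeds the data as `var storeData = {...};`: find the assignment by name and let the JSON decoder read exactly one value from that position. The helper name `extract_store_data` and the shortened sample page are hypothetical, for illustration only.

```python
import json
import re


def extract_store_data(page_str):
    """Return the object assigned to `var storeData` in the page source."""
    # Locate the assignment by name rather than by its position among
    # the page's `var` declarations.
    match = re.search(r"var\s+storeData\s*=\s*", page_str)
    if match is None:
        raise ValueError("storeData assignment not found in page source")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and later script code).
    store_data, _end = json.JSONDecoder().raw_decode(page_str, match.end())
    return store_data


# Hypothetical, heavily shortened page source for illustration only.
sample_page = """
<script>
    var hasCommand = false;
    var storeData = {"v4/catalog/annotateObjects": [{"TR:1": {"name": "Song A"}}],
                     "v7/playlists/getTracks": [{"tracks": [{"itemId": 1, "trackPandoraId": "TR:1"}]}]};
    var other = 1;
</script>
"""

store = extract_store_data(sample_page)
print(sorted(store.keys()))
```

With a real page, `page_str` would be the string built from the parsed soup as above. This approach returns a dictionary directly, so no string clean-up with `replace` is needed.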

Convert this string to a dictionary using json:

type(json_dict)
str

dic = json.loads(json_dict)
type(dic)
dict

This dictionary contains two keys:

v7/playlists/getTracks: Contains the order of the songs in the playlist, each identified by a track ID
v4/catalog/annotateObjects: Contains basic information about the songs included in the playlist

dic.keys()
dict_keys(['v4/catalog/annotateObjects', 'v7/playlists/getTracks'])

Our plan is to create two DataFrames, one for each key in the dictionary, and merge them at the end:

df_tracks=pd.DataFrame(dic['v7/playlists/getTracks'][0]['tracks'])
df_tracks.head()

Each song is identified only by a ‘trackPandoraId’, so we need to pull the song information from the other part of the dictionary, annotateObjects.

df_info=pd.DataFrame.from_dict(dic['v4/catalog/annotateObjects'][0],
        orient='index')
df_info=df_info.reset_index()
df_info.rename(columns={'index':'trackPandoraId'}, inplace=True)
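To make the reshaping step concrete, here is a small sketch on made-up data. The `annotate_objects` dictionary is a hypothetical stand-in for `dic['v4/catalog/annotateObjects'][0]`, and the field values are invented. It uses `rename_axis` before `reset_index`, which is one way to name the new column in a single chain.

```python
import pandas as pd

# Hypothetical stand-in for dic['v4/catalog/annotateObjects'][0]:
# keys are track IDs, values are per-track metadata dictionaries.
annotate_objects = {
    "TR:1": {"name": "Song A", "artistName": "Artist X", "duration": 229},
    "TR:2": {"name": "Song B", "artistName": "Artist Y", "duration": 187},
}

# orient='index' turns each outer key (the track ID) into a row label.
df_info = pd.DataFrame.from_dict(annotate_objects, orient="index")

# Name the row labels, then move them into a regular column.
df_info = df_info.rename_axis("trackPandoraId").reset_index()
print(df_info)
```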

Now we have all the information we need. Let’s merge these DataFrames:

df = df_tracks.merge(df_info, left_on='trackPandoraId', 
     right_on='trackPandoraId').sort_values(by=['itemId'])

df: df_tracks and df_info joined on ‘trackPandoraId’
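The merge can be tried on toy data as well. The sketch below uses invented rows, and it adds two options that the code above does not use: `how='left'`, so that a playlist entry without metadata is kept rather than dropped, and `validate='many_to_one'`, so that pandas raises an error if the metadata table ever repeats a track ID. Whether those choices suit real Pandora data is an assumption.

```python
import pandas as pd

# Hypothetical playlist order: track TR:1 appears twice.
df_tracks = pd.DataFrame({
    "itemId": [1, 2, 3],
    "trackPandoraId": ["TR:1", "TR:2", "TR:1"],
})

# Hypothetical metadata: exactly one row per track ID.
df_info = pd.DataFrame({
    "trackPandoraId": ["TR:1", "TR:2"],
    "name": ["Song A", "Song B"],
    "artistName": ["Artist X", "Artist Y"],
})

# how='left' keeps every playlist entry even if its metadata is missing;
# validate='many_to_one' raises if df_info repeats a track ID.
df = df_tracks.merge(
    df_info, on="trackPandoraId", how="left", validate="many_to_one"
).sort_values(by="itemId")
print(df)
```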

Simply use the groupby function to display how many times each song appears in this playlist:

(df[['name','artistName','itemId']].groupby(['name','artistName'])
   .count().sort_values(by='itemId', ascending=False))

df grouped by name and artist displaying number of duplicates
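The grouped table lists every song, including those that appear only once. As a sketch on invented data, the counts can be filtered so that only repeated songs remain. It uses `size()` instead of `count()`, which counts rows per group without needing a helper column.

```python
import pandas as pd

# Hypothetical merged playlist: "Song A" by "Artist X" appears twice.
df = pd.DataFrame({
    "itemId": [1, 2, 3, 4],
    "name": ["Song A", "Song B", "Song A", "Song C"],
    "artistName": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

# Number of playlist entries per (name, artist) pair, largest first.
counts = (
    df.groupby(["name", "artistName"])
      .size()
      .sort_values(ascending=False)
)

# Keep only the songs that occur more than once.
repeated = counts[counts > 1]
print(repeated)
```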

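Use the duplicated function to see where the duplicate songs are located in the playlist (‘True’ marks a repeat of a song that appeared earlier). Note that this compares song names only, so a song with the same title by a different artist is also flagged: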

df['duplicated'] = df.duplicated(subset='name')
df[['name','duplicated']]

df displaying name and duplicated column only
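The behavior of `duplicated` depends on which columns it compares and on its `keep` argument. The following sketch, on invented data with hypothetical column names for the flags, contrasts comparing by name only with comparing by name and artist, and shows `keep=False` for marking every occurrence of a repeated song.

```python
import pandas as pd

# Hypothetical playlist: "Song A" is repeated; "Song B" appears twice
# but by two different artists.
df = pd.DataFrame({
    "name": ["Song A", "Song B", "Song A", "Song B"],
    "artistName": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

# Comparing on name only: the second "Song B" is flagged even though
# it is by a different artist.
df["dup_by_name"] = df.duplicated(subset="name")

# Comparing on name and artist: only the repeated "Song A" is flagged.
df["dup_by_song"] = df.duplicated(subset=["name", "artistName"])

# keep=False marks every occurrence of a repeated song, including the first.
df["all_occurrences"] = df.duplicated(subset=["name", "artistName"], keep=False)
print(df)
```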

The entire function is shown below:

def dupli_check(url):
    '''
    Input
    URL(String): Pandora Playlist URL

    Output
    df_numbers(DataFrame): Number of duplicate songs in playlist
    df_loc(DataFrame): Location of duplicate songs in playlist
    '''

    #request playlist information by using ‘requests’ 
    #and parse with 'BeautifulSoup'
    r = requests.get(url)
    soup = BeautifulSoup(r.content,"html")
    page_str = str(soup)
    json_dict = (page_str.split('var ')[4]
                 .replace(';\n    ','').replace('storeData = ',''))
    
    #Convert this information to Dictionary using Json
    dic=json.loads(json_dict)
    
    #This dictionary contains two keys:
    #['v4/catalog/annotateObjects', 'v7/playlists/getTracks']
    #Our plan is to create two DataFrames, one for each key 
    #in the dictionary, and merge them at the end:
    df_tracks = pd.DataFrame(dic['v7/playlists/getTracks'][0]
                ['tracks'])

    #Each song is displayed as a 'trackPandoraId', 
    #so we need to pull song information from the other part of 
    #dictionary, annotateObjects.
    df_info = pd.DataFrame.from_dict(
              dic['v4/catalog/annotateObjects'][0], orient='index')
    df_info = df_info.reset_index()
    df_info.rename(columns={'index':'trackPandoraId'}, inplace=True)
    
    #Now we have all the information we need. 
    #Let’s merge these DataFrames:
    df=df_tracks.merge(df_info, left_on='trackPandoraId',
    right_on='trackPandoraId').sort_values(by=['itemId'])

    #Simply, use groupby function to display 
    #how many duplicates are in this playlist:
    df_numbers = (df[['name', 'artistName', 'itemId']]
                  .groupby(['name', 'artistName']).count()
                  .sort_values(by='itemId', ascending=False))

    #Use duplicated function to see where the duplicate 
    #songs are located in the playlist:
    df['duplicated'] = df.duplicated(subset='name')
    df_loc = df[['name','duplicated']]
    
    return df_numbers, df_loc

All the functions described above are deployed to the web app. All you need to do is copy and paste the playlist URL into the website to check for duplicates. A quick demo of how to use the web app is shown below. Enjoy!

Originally published at https://github.com.

Written by Samuel Woojoo Jun