Analysing UK Chart history — 1956 to 2017

Part 1: Extracting the data & Creating the data set

After several months learning Python on Dataquest.io, I’ve been looking for a real-world topic to analyse, as there is no better way to learn than by doing.

Rather than browsing datasets until something piqued my interest, I decided to choose a topic I was passionate about, and then find a dataset to match.

For me it was always going to be something to do with music, and the history of the album charts seemed a great place to start. Music charts say a lot about a period in time, both in music and in the country as a whole.

Defining the topic:

I had seen several pieces of analysis on the US music charts but nothing that focused on the UK. Initial topics I wanted to cover:

  • Has there been a change in the most popular genres over the past 40 years, from pop to rock to hip hop?
  • What impact did Britpop have on the 90s or Indie in the 00s?
  • Has there been a change to more electronic music since 2000?
  • Do albums of certain genres have better longevity?

Finding the dataset:

While it was easy to find a dataset with US chart history, a UK equivalent does not appear to exist. Necessity being the mother of invention, I decided to take this opportunity to improve my web scraping skills with Beautiful Soup and create the dataset myself.

The Plan:

Web scraping and creating my own datasets was one of the key reasons I wanted to learn Python. The flexibility gives me great satisfaction, and I always get a smile when I complete tasks that would be impossible in Excel.

My simple plan was:

  • Write a script to extract a top 100 album chart from officialcharts.com, retrieving weekly position, album name and artist
  • Find a link to the previous week and repeat the extraction on this week. Rinse and repeat.
  • Once I had a list of all albums across all time periods, create a unique list of artists, and then use the Spotify API to retrieve genre information for each artist.
  • Take my dataset, mix it all together, and create some brilliant but pretty graphs.

Extracting Chart Information

After several web scraping misadventures I was pleasantly surprised to find how clean and structured officialcharts.com was:

Each album had a clearly defined position, album name and artist tag which could be easily extracted via Beautiful Soup.

Each page also had a link to the previous week’s chart, which I retrieved based on its link text (‘prev’).

This allowed me to create an open-ended loop, with each page retrieved in turn until one failed to load (requests not returning a 200 OK status code).

The original plan was to build one data frame of all albums and then write it to CSV. However, any interruption (generally caused by missing pages) would crash the script, and it would need to start over again.

To increase reliability I changed this to collect each week’s albums in a list of lists and write that to CSV after each page. While this slowed the script down slightly, it means that if the loop fails for any reason, we can simply pick up where we left off.
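The "pick up where we left off" idea can be sketched as below. This is a minimal illustration under stated assumptions, not the script that was actually run: `run`, `scrape_week` and the `last_url.txt` checkpoint file are hypothetical names, and the scraping function is passed in as an argument so the loop can be tried without touching the website.

```python
import csv
import os


def run(start_url, scrape_week, csv_path="output.csv", checkpoint_path="last_url.txt"):
    """Follow 'previous week' links, saving rows and a checkpoint after every page.

    scrape_week(url) is a hypothetical function returning (rows, previous_url),
    with previous_url set to None when there is no earlier chart.
    """
    # Resume from the checkpoint if an earlier run was interrupted
    url = start_url
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            url = f.read().strip() or None

    while url:
        rows, previous_url = scrape_week(url)

        # Append this week's rows first...
        with open(csv_path, "a", newline="") as f:
            csv.writer(f).writerows(rows)

        # ...then record which page to fetch next, so a rerun continues from there
        with open(checkpoint_path, "w") as f:
            f.write(previous_url or "")

        url = previous_url
```

Because the checkpoint is only updated after a week has been written, a crash while fetching a page leaves the file pointing at that same page, and rerunning the function simply retries it.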

Full code to extract each week’s chart:

import csv
from urllib.parse import urljoin

import bs4
import pandas as pd  # not needed for scraping; used in the later steps
import requests


def getalbums(url):
    """Scrape one weekly chart, append it to output.csv and return the previous week's URL."""
    print('Getting Page %s' % url)
    req = requests.get(url)

    # Stop the loop if the status code is not 200
    if req.status_code != 200:
        return None

    soup = bs4.BeautifulSoup(req.text, 'lxml')

    # Retrieve the chart date and tidy the format
    sdate = soup.find_all('p', class_='article-date')
    date = sdate[0].text.split('-')[0].strip()

    # Retrieve album position, artist and album name
    positions = soup.find_all('span', class_='position')
    albums = soup.find_all('div', class_='title')
    artists = soup.find_all('div', class_='artist')

    # Create one row per album, tidying the format
    allalbums = []
    for position, artist, album in zip(positions, artists, albums):
        allalbums.append([date, position.text.strip(), artist.text.strip(), album.text.strip()])

    # Write the week's albums to CSV, appending to the existing file
    with open('output.csv', 'a', newline='') as resultFile:
        csv.writer(resultFile).writerows(allalbums)

    # Find the link to the previous week; stop the loop if it can't be found
    prevlink = soup.find('a', string='prev')
    if prevlink is None:
        return None
    return urljoin('http://www.officialcharts.com/', prevlink['href'])


# Enter the start page, then follow the 'prev' links until none is left.
# A loop is used rather than recursion so a long run cannot hit Python's recursion limit.
link = 'http://www.officialcharts.com/charts/albums-chart/19610702/7502/'
while link:
    link = getalbums(link)

After a few hours I had the source file I needed: a CSV with each week’s top album chart, going back to the first week in 1956.

As always, the data was not perfect. Some weeks didn’t exist, some weeks don’t add up, and the number of albums in the weekly chart changed over time: 100 since 1993, 75 from 1977 to 1993, and fewer before that, starting with only 5 in 1956.
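One way to spot weeks that don't add up is to count the rows scraped for each chart date. The sketch below is only an illustration: it uses a few made-up rows in place of output.csv and assumes the column order written by the scraper (date, position, artist, album).

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for output.csv (date, position, artist, album)
sample = io.StringIO(
    "29 July 1956,1,Artist A,Album A\n"
    "29 July 1956,2,Artist B,Album B\n"
    "05 August 1956,1,Artist A,Album A\n"
)

# Count how many chart entries were scraped for each week
entries_per_week = Counter(row[0] for row in csv.reader(sample))

for week, count in entries_per_week.items():
    print(week, count)
```

Weeks whose count differs from their neighbours are the ones worth checking by hand.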

However, we still have enough to use for analysis. Looking at the initial extracted CSV:

Getting genre information

Once we had a full list of albums, the next step was to retrieve genre information from Spotify.

I passed the information back into Pandas to extract the unique artist names, then looped these through the Spotify API to get the genre information.

I used Spotipy to retrieve the information from Spotify (in one big dictionary) and then extracted the first genre record. I’m sure there could have been an easier way, whether through Spotipy or Spotify itself, but this gave me the information I needed.

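import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# my csv wrote to the wrong format - due to size I couldn't rerun
albums = pd.read_csv('output.csv', encoding='ISO-8859-1')
uniqueartists = albums['artist'].unique()
ua = uniqueartists.tolist()
genres = pd.DataFrame(ua, columns=['artist'])

# The Spotify Web API now requires credentials; this reads them from the
# SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables
spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())


def getgenre(artist):
    print(artist)
    # Search Spotify for the artist
    try:
        results = spotify.search(q='artist:' + artist, type='artist')
    except Exception:
        return 'notfound'
    # Return the first genre of the first match, if there is one
    try:
        return results['artists']['items'][0]['genres'][0]
    except (KeyError, IndexError):
        return 'no genre'


# Apply to the artist column so getgenre receives each name as a string
genres['genre'] = genres['artist'].apply(getgenre)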

Unfortunately, Spotify genres weren’t as useful as expected. Some tags looked odd or unique:

  • 2 Pac : G Funk
  • Elvis Presley: Christmas
  • Michael Buble: Adult Standard

So, let’s skip Spotify and try LastFM tag information instead.

This time I used pylast to connect to the LastFM API and grabbed the top 3 tags for each artist.

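from itertools import islice

import pandas as pd
import pylast

# Replace these placeholders with your own LastFM API key and secret
api = 'YOUR_API_KEY'
secret = 'YOUR_API_SECRET'
last = pylast.LastFMNetwork(api_key=api, api_secret=secret)


def getlastfmtag(art):
    try:
        artist = last.get_artist(art)
    except Exception:
        return 'not found'
    try:
        # Keep the names of the artist's top 3 tags
        a = []
        for tag in islice(artist.get_top_tags(), 3):
            a.append(tag.item.get_name())
        return a
    except Exception:
        return 'not found'


# Assumption: the unique artist list is rebuilt from output.csv, as in the Spotify step
albums = pd.read_csv('output.csv', encoding='ISO-8859-1')
artists = pd.DataFrame(albums['artist'].unique(), columns=['artist'])

artists['lastfm'] = artists['artist'].apply(getlastfmtag)
artists.to_csv('genreswlastfm.csv', index=False)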

Now 2Pac is back in Rap, Elvis is back in Rock n Roll and Michael Buble is a Jazz singer.

This created a CSV of all the unique artists, complete with their top 3 LastFM tags.

We can then merge the two CSVs back together in Pandas (I love being able to use left joins rather than simple lookups).
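To illustrate why a left join is handy here, the sketch below merges a toy chart table with a toy tag table. All of the values are made up; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the chart CSV and the tag CSV (values are made up)
charts = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist C"],
    "album": ["Album 1", "Album 2", "Album 3"],
})
tags = pd.DataFrame({
    "artist": ["Artist A", "Artist B"],
    "lastfm": ["rock", "pop"],
})

# A left join keeps every chart row, even when the artist has no tags
merged = pd.merge(charts, tags, on="artist", how="left", indicator=True)

# Rows marked 'left_only' are chart entries with no tag information
missing = merged[merged["_merge"] == "left_only"]
print(missing["artist"].tolist())
```

Listing the unmatched artists this way is a quick check on how many chart rows will end up without a genre.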

To give a weighted album list, I allocated 100 points to the top album each week, with the score dropping by one point for each position below it. This means the 100th album in 2015 would get 1 point, while the lowest charting album (10th) in 1960 would get 91 points.

This isn’t strictly scientific, but because we are comparing each year’s proportion of genres against other years, the different totals for each year don’t cause any concern.
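The reasoning can be checked with a small sketch. The rows below are made up, and `year`, `genre` and `position` are assumed column names; the point is that dividing by each year's total makes a short early chart and a 100-album chart comparable.

```python
import pandas as pd

# Made-up rows: a short early chart and a longer recent one
df = pd.DataFrame({
    "year": [1960, 1960, 2015, 2015, 2015],
    "genre": ["rock", "pop", "rock", "pop", "pop"],
    "position": [1, 10, 1, 50, 100],
})

# 100 points for number 1, dropping by one point per place
df["weighting"] = 101 - df["position"]

# Total points per genre within each year...
points = df.groupby(["year", "genre"])["weighting"].sum()

# ...divided by that year's total, giving each genre's share of the year
share = points / points.groupby(level="year").transform("sum")
print(share)
```

Each year's shares sum to 1, so years can be compared with each other regardless of how many albums charted.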

To merge the CSVs and add the weightings:

import pandas as pd

annualalbums = pd.read_csv('output.csv', encoding='ISO-8859-1')
genres = pd.read_csv('genreswlastfm.csv')
allalbums = pd.merge(annualalbums, genres, on='artist', how='left')


def returnrank(num):
    # 100 points for position 1, down to 1 point for position 100
    return 101 - num


allalbums['weighting'] = allalbums['position'].apply(returnrank)
allalbums.to_csv('allalbums.csv', index=False)

This gave me one final CSV of all albums, complete with the top 3 tags from LastFM.

In total we have 218445 rows going back to 29/7/1956 and we are nearly done.

You’ll notice the genre column is in the wrong format, with the tags written out as a Python list. I resolved this in Excel (cheating, I know) by splitting the column with Text to Columns and removing the unnecessary characters.

The final step was to remove ‘tags’ that aren’t genres. For example, Wham is listed as 80’s, which is of no use to us; in this case we would want them to be included under the second option, Pop.

Tags removed included: 50s, 60s, 70s, 80s, 90s, Seen Live.

This meant that most albums had one or more genre tags. I took the most popular one as the final genre, and with that our dataset is complete.
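The same clean-up could also be done in Python rather than Excel. The sketch below is a hedged alternative, not the steps actually used: `pick_genre` is a hypothetical helper, and it assumes the tag column holds the string form of a Python list with the most popular tag first.

```python
import ast

# Tags that describe an era or a habit rather than a genre
NOT_GENRES = {"50s", "60s", "70s", "80s", "90s", "seen live"}


def pick_genre(raw):
    """Return the first tag that is a genre, or None if there isn't one."""
    try:
        tags = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None  # e.g. the plain string 'not found'
    if not isinstance(tags, list):
        return None
    for tag in tags:
        if tag.lower() not in NOT_GENRES:
            return tag
    return None


print(pick_genre("['80s', 'pop', 'new wave']"))  # pop
print(pick_genre("not found"))                   # None
```

Applied to the tag column (for example with `apply`), this would give one genre per artist and leave the untagged rows empty.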

Looking at unusable rows (those with no genres), we have 14759 unusable rows, giving us 203076 rows to review.

One final piece of repair completed in Excel:

A lot of early records (pre-1960) were soundtracks, with the genre listed as ‘mysterious’. This genre was replaced with ‘soundtrack’ to give a more accurate view of what the albums actually were.

This was a great learning experience. My dataset is now complete, and it’s time to move on to part 2: analysing our data.