Data Dive: What songs were hot in 2000?

Han Man
Feb 7, 2017

…besides the thong song? I can’t remember anything hotter than the thong song in 2000. As a middle-schooler when that came out…I can’t tell you how many of my friends (including me) loved that song…but had absolutely no clue what a thong was.

A few days ago, it came up on my playlist shuffle, so it got me thinking, what songs were hot in 2000?

Data Cleaning and Processing

Billboard data summary: the column titles for all 83 columns

I started with a dataset from billboard.com for the songs that made the Billboard top 100 in the year 2000.

Let’s start with some quick exploratory analysis on the data:

Looks like I have a healthy amount of data: 317 rows and 83 columns, all stored as strings. We have the typical song information (artist, track, song length, genre), some information about the song’s performance in 2000 (date entered the top 100, date peaked), and then one column per week, up to 76 weeks, holding the song’s rank while on the chart. The week columns stop at 76 because the longest-charting song in 2000 lasted 76 weeks; most songs have entries in only a fraction of those columns.
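That quick overview comes from pandas’ standard inspection calls. A minimal sketch, using a tiny two-row stand-in frame (the real billboard.csv has 317 rows and 83 columns; the column names below match the dataset, the values are hypothetical):

```python
import pandas as pd

# Hypothetical two-row stand-in for the Billboard CSV; the real file has
# 317 rows and 83 columns, all read in as strings (object dtype).
data = pd.DataFrame({
    "artist.inverted": ["Sisqo", "Creed"],
    "track": ["Thong Song", "Higher"],
    "time": ["4:05", "5:16"],
    "date.entered": ["2000-01-29", "1999-09-11"],
    "date.peaked": ["2000-03-25", "2000-07-22"],
})

print(data.shape)   # (rows, columns)
print(data.dtypes)  # every column loads as object (string) at this point
```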

First things first, I needed to transform the data into workable types so that I could perform analysis.

I decided I need to do two things:

  1. Transform the numeric data into workable forms: the song length, date entered, and date peaked columns.
  2. Create a couple of summary columns from the unwieldy 76 columns of week data so that I could work with them easily.

For step 1, I wrote a few functions and used the datetime converter built into pandas for the time columns. I created six new columns: song length in minutes and in seconds, time entered, time peaked, and days to peak (the difference between entered and peaked) in both timedelta and integer form.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

data = pd.read_csv("billboard.csv")
data2 = data.copy()  # work on a copy so the raw data stays intact

def minute(n):
    # "4:05" -> 4
    return int(n.split(":")[0])

def second(n):
    # "4:05" -> 5
    return int(n.split(":")[1])

data2["min_length"] = data2["time"].apply(minute)
data2["sec_length"] = data2["time"].apply(second)
data2["time_entered"] = pd.to_datetime(data2["date.entered"])
data2["time_peaked"] = pd.to_datetime(data2["date.peaked"])
data2["days_peaked"] = data2["time_peaked"] - data2["time_entered"]
data2["int_days_peaked"] = data2["days_peaked"].dt.days

For step 2, I created a function, applied only to the week columns, that let me extract the total weeks on the Billboard 100 and the average position during that time.

def weekcleaner(x):
    # The raw data marks off-chart weeks with "*"; convert ranks to floats
    if x == "*":
        return np.nan
    else:
        return float(x)

data2.loc[:, 'x1st.week':'x76th.week'] = data2.loc[:, 'x1st.week':'x76th.week'].applymap(weekcleaner)
data2["sum_of_weeks"] = data2.loc[:, 'x1st.week':'x76th.week'].count(axis=1)   # weeks actually on the chart
data2["average_place"] = data2.loc[:, 'x1st.week':'x76th.week'].mean(axis=1)   # mean rank while charting

Examining the Data

So let’s take a look at the 10 songs that stayed on the Billboard 100 the longest:

Top 10 songs staying on Billboard 100 for the longest time in the year 2000
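That top-10 table is just a sort on the weeks-on-chart column. A minimal sketch with a stand-in frame (the column names are real; the weeks counts here are made up, except Thong Song’s 28 from the dataset):

```python
import pandas as pd

# Stand-in frame: real column names, mostly hypothetical values.
data3 = pd.DataFrame({
    "artist.inverted": ["Creed", "Lonestar", "Sisqo", "Creed"],
    "track": ["Higher", "Amazed", "Thong Song", "With Arms Wide Open"],
    "sum_of_weeks": [57, 55, 28, 47],
})

# Sort descending by weeks on the chart and keep the top 10.
top10 = data3.sort_values("sum_of_weeks", ascending=False).head(10)
print(top10)
```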

Creed! I did not expect to see you there at #1. Two top 10 hits…we didn’t have the best taste in music in 2000.

Where is Sisqo?

Sisqo- killing it

Not bad: three songs in the top 100, and Thong Song spent 28 weeks there.
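Pulling out Sisqo’s rows is a simple string filter on the artist column. A sketch with a stand-in frame (`artist.inverted` is the real column name; the rows here are illustrative):

```python
import pandas as pd

# Stand-in frame with a few artists; the real data2 has 317 songs.
data2 = pd.DataFrame({
    "artist.inverted": ["Sisqo", "Creed", "Sisqo", "Sisqo"],
    "track": ["Thong Song", "Higher", "Incomplete", "Got to Get It"],
    "sum_of_weeks": [28, 57, 21, 20],
})

# Keep only rows whose artist field contains "Sisqo".
sisqo = data2[data2["artist.inverted"].str.contains("Sisqo")]
print(len(sisqo))  # number of Sisqo songs in the top 100
```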

I was interested in song performance, so I wanted to do a quick visualization of the song performance characteristics: days to peak, weeks on the Billboard 100, and average place:

Histogram of average place of songs in the top 100.
Histogram of days to peak.
Histogram of the number of weeks in top 100.

A curious trend emerged when I looked at the total weeks spent on the top 100: there is a big spike at 20 weeks. Now this is curious — why would so many songs drop off at exactly 20 weeks? Billboard must make an effort to boot songs that hit 20 weeks. Let’s turn to Wikipedia:

Wikipedia describing Billboard.com rules for recurrent songs.

Sure enough: songs that have spent 20 weeks on the chart and have fallen below rank 50 are automatically dropped.

Let’s check our dataset to see if this is the case:

I created a function to find the first rank of a song when it entered the top 100 and its last rank before it exited the top 100.

def first_rank(x):
    # Return the first non-missing rank in a row of weekly ranks.
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

weeks = data2.loc[:, 'x1st.week':'x76th.week']

data2["first_rank"] = weeks.apply(first_rank, axis=1)
# Reverse the column order so first_rank finds the *last* valid rank.
data2["last_rank"] = weeks.iloc[:, ::-1].apply(first_rank, axis=1)
data2["peak_rank"] = weeks.min(axis=1)  # best (smallest) rank reached

data3 = data2[['artist.inverted', 'track', "int_days_peaked", "average_place",
               "sum_of_weeks", "first_rank", "x20th.week", "last_rank"]]

print(np.mean(data3[data3['sum_of_weeks'] == 20]["last_rank"]))
data3[data3['sum_of_weeks'] == 20]['last_rank'].describe()
plt.figure(figsize=(4, 2))
sns.distplot(data3[data3['sum_of_weeks'] == 20]['last_rank'])

If we look at the last rank of the songs which were on the top 100 for exactly 20 weeks:

Summary statistics for the last rank in songs lasting exactly 20 weeks in the top 100
Histogram for the last rank in songs lasting exactly 20 weeks in the top 100

Among the 82 songs that stayed for exactly 20 weeks, the mean last rank was 80. The distribution shows that the vast majority ended with a rank number above 50, confirming that songs ranking worse than 50 were indeed booted from the top 100 once their 20 weeks were up. The couple that ended better than rank 50, I assume, simply fell out of the top 100 on their own after week 20.
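That split can be checked with a simple comparison against rank 50. A sketch with hypothetical final ranks for the 20-week songs:

```python
import pandas as pd

# Hypothetical final ranks for songs that lasted exactly 20 weeks.
last_rank = pd.Series([81, 92, 77, 45, 88, 95, 60, 38])

booted = (last_rank > 50).sum()      # worse than rank 50: removed by the rule
fell_out = (last_rank <= 50).sum()   # better than rank 50: dropped on their own
print(booted, fell_out)
```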

Drawing Conclusions from the Data

I also wanted to see if there were relationships I could tease out between variables. What made songs able to last past the 20 week threshold? One hypothesis I wanted to test was: songs that stay on the top 100 for more than 20 weeks start off at a higher rank and peak at a higher rank than songs that do not.

First, a quick look at the relationship between variables:

data3 = data2[['artist.inverted', 'track', "int_days_peaked", "average_place", "sum_of_weeks"]]
data3.sort_values('int_days_peaked', ascending=False)
data3.sort_values('average_place', ascending=True)
data3.sort_values('sum_of_weeks', ascending=False)

dataplot = data2[["int_days_peaked", "average_place", "sum_of_weeks"]]
sns.pairplot(dataplot, kind='reg', size=2, aspect=1.5)
Pairwise relationships for three variables: days to peak, average rank, and sum of weeks in top 100.

Using pairplot from the seaborn package is a great way to visualize pairwise relationships among a group of variables. All the relationships we expect do emerge: the longer a song takes to peak, the longer it stays in the top 100, and the better its average rank (smaller rank number), the longer it stays on the chart.

In order to see whether the population of songs on the top 100 for more than 20 weeks (songs_>20) and the songs unable to last more than 20 weeks (songs_<20) are truly different, I am going to use a two-sample Student’s t-test.

Assumptions: because I do not expect these two populations to have equal variances or equal sizes (n), I will use Welch’s t-test, a variant of the Student’s t-test for equal or unequal sample sizes and unequal variances.

My null hypothesis, H0: between the songs_>20 and songs_<20 populations, there is no difference in starting rank or peak rank.

I will set my p value threshold to be 0.05 or 5%.

In other words, are the songs who start hot or peak high the ones with staying power?

To run this test, I first have to separate the songs_>20 and songs_<20 groups and then run a t-test on the two samples. The stats module within Python’s scipy library has a function that does exactly this: stats.ttest_ind().

songs_above_20 = data2[data2['sum_of_weeks'] > 20][["first_rank", 'peak_rank']]
songs_stuck_at20 = data2[data2['sum_of_weeks'] <= 20][["first_rank", 'peak_rank']]

# Welch's t-test: equal_var=False allows unequal variances and sample sizes.
print("first rank:", stats.ttest_ind(songs_above_20['first_rank'], songs_stuck_at20['first_rank'], equal_var=False))
print("peak rank:", stats.ttest_ind(songs_above_20['peak_rank'], songs_stuck_at20['peak_rank'], equal_var=False))

fig, ax = plt.subplots()
sns.distplot(songs_above_20['first_rank'], label='above 20')
sns.distplot(songs_stuck_at20['first_rank'], label='below 20')
ax.legend(loc='upper left')

fig, ax = plt.subplots()
sns.distplot(songs_above_20['peak_rank'], label='above 20')
sns.distplot(songs_stuck_at20['peak_rank'], label='below 20')
ax.legend(loc='upper right')
Results for student’s t test for first rank and peak rank variables comparing songs_>20 and songs_<20 populations.

The results support my hypothesis. The p-values for both first rank and peak rank are well below the 0.05 threshold, so we can safely reject the null hypothesis: there is a statistically significant difference between the two populations with respect to the first rank and peak rank variables. In fact, we can halve these p-values, since they are two-tailed and we are only testing whether the peak rank and first rank of the >20 population are better (smaller rank number) than those of the <20 population.
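As a sanity check on that halving step, here is how a one-sided p-value follows from scipy’s two-sided result. A sketch with simulated ranks (the direction check on the t statistic is what makes the halving valid):

```python
import numpy as np
from scipy import stats

# Simulated peak ranks: the >20-week group peaks higher (smaller rank numbers).
rng = np.random.default_rng(1)
above_20 = rng.normal(15, 8, 60)
below_20 = rng.normal(45, 20, 200)

t, p_two_sided = stats.ttest_ind(above_20, below_20, equal_var=False)
# Halving is only valid when t points in the hypothesized direction
# (here: above_20 has the smaller mean, so t should be negative).
p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
print(t, p_one_sided)
```

Newer scipy versions also accept an `alternative='less'` argument to `stats.ttest_ind`, which computes the one-sided p-value directly.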

The first rank for two populations: >20 and <20.
The peak rank for two populations: >20 and <20.

We also see that, between the two variables, peak rank is the real distinguishing factor. Those songs that have true staying power peak high, and ride that wave to a long run in the top 100, powering through week 20.

Good thing thong song peaked at #3.
