Feb 9, 2017

Project 2 Billboard — The practice of using Python to process data and to generate insights

This is the second individual project from the General Assembly Data Science Immersive course. Its goal is to practice the data cleaning, data munging, data visualization, and data modeling skills learned in class. The dataset is a CSV file with 317 rows and 83 columns describing songs that charted on Billboard in 2000. The raw data is messy in several ways, which I address in the following sections. The workflow of this project is:

data cleaning & data munging
explorative analysis & data mining
data visualization
data modeling

Part I: Data Cleaning

Column naming convention. Replace “.” with “_” in the column names.
Inconsistent naming in the “artist.inverted” column. Some cells use the “firstname lastname” convention, while others use “lastname, firstname”. I transform all cells to follow the “firstname lastname” convention.
Data type conversion for “date.entered”, “date.peaked”, and the columns “x2nd.week” to “x76th.week”.
Missing-value placeholders. The columns “x1st.week” to “x76th.week” should contain NaN (np.nan) instead of “*” for further analysis.
The “time” column should really be named “length”; its values should be converted to minutes and stored as floats.
Rename “artist_inverted” to “artist”.
In the “genre” column, “R & B” and “R&B” should be treated as the same genre.
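The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration on a tiny made-up frame, not the original notebook: the sample rows, the helper names fix_artist and to_minutes, and the assumption that “time” is stored as an "m:ss" string are all hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics the problems listed above (not the real data).
raw = pd.DataFrame({
    "artist.inverted": ["Doe, Jane", "John Smith"],
    "time": ["3:30", "4:15"],  # assumed "m:ss" strings
    "genre": ["R & B", "R&B"],
    "date.entered": ["2000-01-08", "2000-02-12"],
    "date.peaked": ["2000-03-04", "2000-02-12"],
    "x1st.week": [78, 15],
    "x2nd.week": ["63", "*"],
})
df = raw.copy()

# Replace "." with "_" in the column names.
df.columns = df.columns.str.replace(".", "_", regex=False)


def fix_artist(name):
    """Turn "lastname, firstname" into "firstname lastname"."""
    if "," in name:
        last, first = name.split(",", 1)
        return f"{first.strip()} {last.strip()}"
    return name


def to_minutes(text):
    """Convert an "m:ss" string to minutes as a float."""
    minutes, seconds = text.split(":")
    return int(minutes) + int(seconds) / 60


df["artist_inverted"] = df["artist_inverted"].map(fix_artist)
df["time"] = df["time"].map(to_minutes)
df = df.rename(columns={"artist_inverted": "artist", "time": "length"})

# Parse the two date columns into datetimes.
for col in ["date_entered", "date_peaked"]:
    df[col] = pd.to_datetime(df[col])

# Swap the "*" placeholder for NaN, then make the weekly columns numeric.
week_cols = [c for c in df.columns if c.endswith("_week")]
df[week_cols] = df[week_cols].replace("*", np.nan).apply(pd.to_numeric)

# Treat "R & B" and "R&B" as one genre.
df["genre"] = df["genre"].replace({"R & B": "R&B"})
```

Renaming the columns first keeps every later step working with the underscore names, which is why the rename of “artist_inverted” comes after the “.” replacement.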

After cleaning the data, the dataset now looks like this:

Here is the data type of each column:

Data type of columns after data cleaning

Part II: Explorative Analysis

Before getting into data modeling, I looked at the dataset's descriptive statistics. In the chart below, the only useful column is length_in_sec, which holds the length of a song in seconds. Moving forward, I want to find correlations between columns, but no other column is ready for statistical analysis in its current form. I therefore create a derived column called days_before_top, which captures the number of days each song took to reach its peak position after first entering the Billboard chart.
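One way to derive such a column, assuming the entry and peak dates were already parsed into datetimes during cleaning, is to subtract the two date columns. The rows below are made up for illustration, and the column is named days_before_top to match the charts later in this post.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the cleaned dataset.
df = pd.DataFrame({
    "date_entered": pd.to_datetime(["2000-01-08", "2000-02-12"]),
    "date_peaked": pd.to_datetime(["2000-03-04", "2000-02-12"]),
})

# Days between chart entry and peak; 0 means the song peaked on entry.
df["days_before_top"] = (df["date_peaked"] - df["date_entered"]).dt.days
```

Subtracting two datetime columns gives a timedelta column, and .dt.days turns it into plain integers that work with describe() and with plotting.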

Descriptive statistics for numeric columns

Part III: Data Visualization

I created a histogram of the new days_before_top column to get a sense of its distribution. As the chart below shows, most songs reach their peak in fewer than 100 days, a fair number peak on the day they enter the chart, and the longest time to peak is more than 300 days.
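The binning behind a histogram like this can be sketched with NumPy. The values and the 50-day bin width below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
days = pd.Series([0, 0, 14, 35, 56, 91, 140, 310], name="days_before_top")

# Count songs per 50-day bin; the bin width is an arbitrary choice.
edges = np.arange(0, 351, 50)  # 0, 50, ..., 350
counts, _ = np.histogram(days, bins=edges)
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
summary = pd.Series(counts, index=labels)

# Share of songs that peak in fewer than 100 days.
share_under_100 = (days < 100).mean()
```

Passing the same column to a plotting call such as days.plot.hist(bins=edges), which requires matplotlib, would draw the corresponding chart.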

Next, I looked at the number of songs in each genre. Rock appears to be the most popular genre in 2000, with 103 songs on the Billboard chart.

To take a closer look at genre popularity, I charted the mean ranking of songs in each genre. R&B has the lowest mean ranking of all genres. Since a lower number means a better chart position, R&B songs tended to rank higher than songs in other genres.

Similarly, I drew a chart of the average days_before_top, which shows how quickly songs in each genre reach their peak on average. Country music takes far fewer days to peak than the other genres.
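The three per-genre summaries behind these charts (song count, mean ranking, and mean days_before_top) could be computed with a single groupby. This is only a sketch: the rows are made up, and peak_rank is a hypothetical column name standing in for whichever ranking column is averaged.

```python
import pandas as pd

# Made-up rows; "peak_rank" is a hypothetical column for a song's chart position.
df = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rock", "R&B", "Country", "Country"],
    "peak_rank": [10, 30, 20, 5, 40, 60],
    "days_before_top": [70, 90, 80, 50, 20, 40],
})

# One row per genre with the three summaries used in the charts.
by_genre = df.groupby("genre").agg(
    songs=("peak_rank", "count"),
    mean_rank=("peak_rank", "mean"),
    mean_days=("days_before_top", "mean"),
)

most_songs = by_genre["songs"].idxmax()     # genre with the most songs
best_rank = by_genre["mean_rank"].idxmin()  # lower rank number is better
fastest = by_genre["mean_days"].idxmin()    # fewest days to peak on average
```

Keeping the three aggregates in one frame makes it easy to sort by any of them before plotting, for example by_genre.sort_values("mean_rank").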

Written by Shiyang Feng