Know yourself Series —

Spotify Music Data Analysis: Part 2

Data Cleaning & Preprocessing

Pragya Verma
Analytics Vidhya


Before starting any data analysis, it is important to inspect the data and resolve any messiness in it so that the results are accurate.

In the first part, the dataset was scraped using the Spotify API. The data covers almost two years, starting from when I began listening to music on Spotify.

In this part of the series, I will look for any aberrations in the data that might later skew the results, followed by data preprocessing to transform the data as required.

The code used in this article is from the notebook data_cleaning.ipynb on my GitHub. Check it out!

Table of contents

  1. Importing Data
  2. Dropping Duplicate Columns
  3. Dropping Duplicate Rows
  4. Null Check
  5. Data Format Check
  6. Value Check
  7. Exporting Data
  8. Conclusion
  9. Links to other parts of this series

Importing Data

First, let’s import the data. I have two datasets: songs and playlists. The CSV file titled final stores the listening history and the audio characteristics of all previously played songs. The playlist data is stored in the CSV file titled playlist, which gives the playlist information along with the details of every song in each playlist.

The data can be loaded using read_csv(), with the data file path passed as the argument. The first few lines of the data can then be viewed using the head() function.

So, load the songs and playlist datasets into two different dataframes, df1 and df2 respectively.

import pandas as pd

# loading the song file
df1 = pd.read_csv(r'spotify data scraping\output\final.csv')
df1.head()
Result of df1.head()
# loading the playlist data file
df2 = pd.read_csv(r'spotify data scraping\playlist.csv')
df2.head()
Result of df2.head()

Dropping Duplicate Columns

Now, to declutter the data column-wise, let’s have a quick look at the data snapshots above. We can see that there are some unwanted columns, such as Unnamed: 0.

The whole list of data columns can be requested using the following code:

df1.columns
Columns in df1
df2.columns
Columns in df2

Glancing at these lists confirms that many of these attributes will not be required in the analysis. It is therefore worth removing such unnecessary columns to free up space and reduce processing time.

Data columns can be deleted using the drop() function, with the list of column names passed as an argument.

df1 = df1.drop(columns=['Unnamed: 0', 'name', 'endTime', 'type', 'uri', 'track_href', 'analysis_url', 'duration_ms' ])
df2 = df2.drop(columns=['Unnamed: 0', 'id', 'spotify_id', 'list_id' ])

Dropping Duplicate Rows

Like any human being, I have listened to a single song multiple times, so there are multiple entries of the same song in the dataset. I need information such as the total streaming time and the number of times a piece of music was heard, so I will not get rid of these duplicates; otherwise I would lose this information and the data exploration would not yield correct results.
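As a quick illustration of why these duplicates matter, the play count per song can be read straight off the repeated entries. A minimal sketch, assuming the trackName column used later in this article:

# number of times each track appears = number of plays
df1['trackName'].value_counts().head()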

On the other hand, duplicates are not needed to examine the characteristics of the tunes themselves. To handle this, I will create a new dataframe that stores each unique song and its properties by dropping the duplicate rows.

First, let’s have a look at the number of rows before removing duplicates.

# shape before dropping duplicate rows
print(df1.shape)
print(df2.shape)

The output for the above code is (8413, 17) and (342, 18) respectively. So there are 8413 rows in the songs data and 342 rows in the playlist data.

To delete the duplicate rows, I will use the drop_duplicates() function. For its arguments, set keep='last', as the latest instance of each row is needed. Further, the rules for finding duplicates are defined through an argument called subset. It takes the names of the columns on which duplicates will be identified and removed; if it is not specified, the function removes only exact duplicate rows. For the songs data, duplicates can be removed based on the track name. For the playlist dataset, multiple entries of a song belonging to the same playlist can be eliminated, in other words based on list_name (the playlist name) and song_name.

# we are keeping the last instance instead of the first as we need the latest entry of the song
df1_new = df1.drop_duplicates(subset='trackName', keep='last')
df2_new = df2.drop_duplicates(subset=['list_name', 'song_name'], keep='last')

After dropping the unrequired rows, I store the results in separate dataframes so that I can later export them into new CSV files.
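The effect can be verified by checking the shapes again:

# shape after dropping duplicate rows
print(df1_new.shape)
print(df2_new.shape)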

And the shapes of the resulting datasets are (2135, 17) and (242, 18) respectively.

Finally, I have four dataframes: df1 and df2, which store the original songs and playlist data, and df1_new and df2_new, which store the deduplicated songs and playlist data.

Null Check

Empty cells are the most common and anticipated data cleaning problem, so after eliminating the duplicate rows, checking for missing data points is the next important step.

In a pandas dataframe, the check for empty cells can be performed using isnull(), followed by the sum() function to get the total count of empty values per column. I am doing this because isnull() returns a True or False Boolean result for each cell, and scrolling through every row to look for a blank cell would be a rather tedious task.

df1.isnull().sum()
The output of null check for df1

The output of the above line shows that the sum of null cells is zero for every column. Therefore, there are no empty values to deal with in the songs dataset.

Similarly, I will perform the null check for the playlist data.

df2.isnull().sum()
The output of null check for df2

The output suggests that there is one empty value in each of the song-related columns.

So I perused the dataset and found that I had created a playlist in my Spotify that does not have any songs in it. Hence, in the output the list_name value is not empty, but the rest of the values for that row are empty because no song is present in the playlist. The simplest way to deal with this is to delete that playlist entry, as it contains no useful information.
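Before deleting it, the offending row can be inspected directly. A small sketch, filtering on song_name, one of the columns that came back null:

# show the playlist row(s) with no song attached
df2[df2['song_name'].isnull()]

With the row confirmed, it can be dropped: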

# columns that must be non-empty for a row to be kept
song_cols = ['song_id', 'song_name', 'artist', 'popularity', 'release_date',
             'energy', 'liveness', 'tempo', 'speechiness', 'acousticness',
             'instrumentalness', 'danceability', 'duration_ms', 'loudness',
             'valence', 'mode', 'key']
# playlist data
df2.dropna(subset=song_cols, inplace=True)
# deduplicated playlist data
df2_new.dropna(subset=song_cols, inplace=True)

After dropping the empty row, perform the null check again to verify.
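# re-check for missing values after removing the empty playlist entry
df2.isnull().sum()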

Null check for playlist data

From the output, it can be confirmed that there are no empty rows now.

Data Format Check

I will now check whether any attribute has a data type mismatch, because sometimes a float value is read as a string, a string value is read in a number format, and so on.

For this test, simply request the data types of all the columns using the dtypes attribute and verify that they all look right.

df1.dtypes
The output of df1.dtypes

From the above output, it can be inferred that there is no data mismatch and each column is stored in its expected data type.

Similarly, I checked the data type for the playlist dataset.

df2.dtypes
The output of df2.dtypes

Here also it can be verified that there is no data mismatch.
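No conversion was needed here, but if a mismatch had turned up, it could be repaired with an explicit cast. A hypothetical sketch (tempo is a real column; the mismatch is invented for illustration):

# if 'tempo' had been read as strings, coerce it back to numeric
df2['tempo'] = pd.to_numeric(df2['tempo'], errors='coerce')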

Value Check

In this step, I am going to check whether the content inside the cells is correct. By finding the maximum and minimum of the numerical values, we can determine whether they lie within an acceptable range.

So I will request the max value and the min value of the columns using .max() and .min() respectively.

df1.max(axis=0)
The output of df1.max()
df1.min(axis=0)
The output of df1.min()

The above output shows that there is no data value problem in the songs dataset: all the numerical values lie in the acceptable range.
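The same sanity check can be applied to the playlist data. A minimal sketch, assuming the audio-feature columns behave like those in the songs data:

# value check for the playlist dataset
print(df2.max(axis=0))
print(df2.min(axis=0))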

Exporting Data

I have identified and rectified all of the incongruous data. Now that the data is cleaned and processed, it can be exported into new CSV files using to_csv().

df1.to_csv(r'data\song_data.csv')
df1_new.to_csv(r'data\distinct_song.csv')
df2.to_csv(r'data\playlist_data.csv')
df2_new.to_csv(r'data\distinct_playlist.csv')
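One optional tweak, not used in the calls above: passing index=False stops pandas from writing the dataframe index as an extra column, which is exactly how the Unnamed: 0 column dropped earlier gets created on a round trip.

# avoids re-introducing an 'Unnamed: 0' index column on the next read
df1.to_csv(r'data\song_data.csv', index=False)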

Conclusion

All the aberrations and quirks in the data have been handled.

Now, in the next part, I will perform data exploration and analysis.
