Know yourself Series —

Spotify Music Data Analysis: Part 3

Data Visualization

Pragya Verma
Analytics Vidhya

--

In the previous part of this series, the dataset was checked for corrupted data points, i.e., incorrectly formatted, duplicate, or incomplete entries. After this examination, I found no such abnormalities in it. I then modified and transformed the dataset to suit the requirements of the analysis.

In this third part of the series, I am going to dig deeper to understand my streaming history and my moods. Moreover, I will review the characteristics and features of my music and the playlists I created.

The code used in this article can be found on my GitHub in the file exploratory_data_analysis.ipynb.

Table of Contents

  1. Music Attributes
  2. My Streaming History
  3. How often did I listen to music?
  4. Whom did I listen to most?
  5. My mood throughout the year
  6. Feature Analysis
  7. Mean of audio features
  8. Histogram of Features
  9. Histogram for Tempo
  10. Heatmap
  11. Scatterplot
  12. Pivot Table
  13. Playlist Analysis
  14. Songs by Year
  15. Songs by Key
  16. Conclusion
  17. Link to other parts of this series

Music Attributes

The dataset gathered in the first article of this series contains the song attributes. Before performing data analysis it is necessary to understand those features individually.

The song attributes in the dataset are explained below:

Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. The higher the value, the more energetic the song.

Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. The value ranges from 0 to 1. The higher the value, the more suitable the song is for dancing.

Loudness: The overall loudness of a track in decibels (dB), averaged across the entire track. Values typically range from -60 to 0 dB. The higher the value, the louder the song.

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

Mode: Songs can be classified as major or minor. A value of 1 represents major mode and 0 represents minor.

Key: The key is the pitch class, or set of notes and scale, that forms the basis of a song. The 12 keys are represented by integers ranging from 0 to 11.

My streaming history

In this section, I am going to analyse my behaviour and personality with the help of my Spotify listening history. I first started using Spotify in 2019 and continue to listen to songs on it. Hence, over the past two years, I have generated a large amount of data, which is enough for this analysis.

For analyzing streaming history, I will import the overall music streaming data stored in song_data.csv.
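A minimal sketch of this import step with pandas is shown below; the column names in the comment (endTime, artistName, trackName, msPlayed) are assumptions based on Spotify's streaming-history export and may differ in the actual file.

    import pandas as pd

    # Streaming history prepared in the earlier parts of the series;
    # columns are assumed to follow Spotify's export format
    # (endTime, artistName, trackName, msPlayed)
    song_data = pd.read_csv('song_data.csv')
    song_data.head()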

How often did I listen to music?

In this visualization, I will determine how long I listened to music each day. This plot is similar to GitHub's contribution graph, which shows the number of commits made every day, but here I will be showing the amount of time (in minutes) I listened to music.

To achieve this, first convert the listening time from milliseconds to minutes and the datetime column to a date format. Once the data is in the required format, group the minutes by date; this gives the number of minutes of music heard each day.

Once the streaming time is computed, shape the dataframe by removing unnecessary columns. The final dataframe is stored in new_daily_length.

Plot the heatmap of the listening time using calplot.
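A sketch of these steps is shown below. The msPlayed and endTime column names are assumptions based on Spotify's export; calplot expects a Series of values indexed by date.

    import pandas as pd
    import calplot

    # Convert milliseconds to minutes and the timestamp to a plain date
    song_data['minutes'] = song_data['msPlayed'] / 60000
    song_data['date'] = pd.to_datetime(song_data['endTime']).dt.date

    # Total listening minutes per day, as a Series indexed by date
    new_daily_length = song_data.groupby('date')['minutes'].sum()
    new_daily_length.index = pd.to_datetime(new_daily_length.index)

    # Calendar heatmap of daily listening time
    calplot.calplot(new_daily_length, cmap='YlGn',
                    suptitle='Minutes of music per day')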

The above code yields the following graph —

Calplot to visualize the streaming time

I used to listen to music every day while travelling to my college. As per the graph, I listened to songs for as little as 5 minutes, often over 80 minutes, and sometimes more than 100 minutes. Moreover, on many days I did not listen at all, probably because I was running late for college or never had the time to plug in. Ever since the lockdown was imposed in April 2020, I have listened for less time (roughly under 30 minutes a day). However, there were a few days when I would listen for more than an hour.

Whom did I listen to most?

I have heard songs from hundreds of artists. It would be fun to determine which artists I have streamed the most.

To determine this, I will group the dataframe by artist and sum the minutes heard for each artist. Then I sort the data by minutes heard to get the most-streamed artists.

Plotting all the artists would clutter the visual, so I have extracted the top 30 most-heard artists and stored them in most_heard_30.

The dataframe can now be visualized as a horizontal bar plot using barplot.
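A possible implementation with pandas and seaborn is sketched below; the artistName column name is an assumption based on Spotify's export.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Total minutes streamed per artist, sorted in descending order
    artist_minutes = (song_data.groupby('artistName')['minutes']
                      .sum()
                      .sort_values(ascending=False)
                      .reset_index())

    # Keep only the top 30 artists to avoid cluttering the plot
    most_heard_30 = artist_minutes.head(30)

    # Horizontal bar plot of the most streamed artists
    plt.figure(figsize=(8, 10))
    sns.barplot(x='minutes', y='artistName', data=most_heard_30)
    plt.xlabel('Minutes streamed')
    plt.ylabel('Artist')
    plt.show()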

The visual looks as shown below —

Barplot to visualize the most streamed artist

Since the bar graph can only show about 30 artists clearly, in the next plot I have made a word cloud representing my top 100 artists without compromising clarity.
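A sketch using the wordcloud package is shown below. It reuses the artist_minutes dataframe from the previous step; note that with this default colouring the colours come from the colormap and are not strictly tied to frequency.

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # Map the top 100 artists to the minutes streamed for each
    top_100 = artist_minutes.head(100)
    frequencies = dict(zip(top_100['artistName'], top_100['minutes']))

    # Word size is proportional to minutes streamed
    wordcloud = WordCloud(width=1200, height=600, background_color='white',
                          colormap='viridis').generate_from_frequencies(frequencies)

    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()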

The word cloud is as follows —

Wordcloud to show Top 100 artists

As you can see in the above graph, the bigger the artist name and the darker the blue, the more I listened to that artist. The size decreases and the colour gets lighter, shifting towards yellow, for less-heard artists. Moreover, this plot easily accommodates 100 artists. My most-heard artists are Jason Derulo, Halsey, The Chainsmokers, DEAMN, Zedd and many more.

Now, apart from finding the most-heard artists by streaming time, I will also extract the artists with the most songs in my listening history. To keep the list manageable, I will set a lower limit of 5 songs and then plot the bar chart.
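One way to compute and plot this is sketched below; the trackName column name is an assumption based on Spotify's export.

    # Number of distinct songs heard per artist
    songs_per_artist = (song_data.groupby('artistName')['trackName']
                        .nunique()
                        .reset_index(name='song_count'))

    # Keep only artists with at least 5 distinct songs, sorted by count
    songs_per_artist = (songs_per_artist[songs_per_artist['song_count'] >= 5]
                        .sort_values('song_count', ascending=False))

    plt.figure(figsize=(8, 8))
    sns.barplot(x='song_count', y='artistName', data=songs_per_artist)
    plt.xlabel('Number of songs')
    plt.ylabel('Artist')
    plt.show()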

The above code yields the following visual —

Barplot to visualize count of songs per artist

From the above visual we can see that most of my songs are by DEAMN, Kesha, Rak-Su, OMI and Iggy Azalea.

My mood throughout the year

After understanding the streaming history, it is pertinent to explore my mood as well. The dataset contains the valence attribute, which quantifies the emotion conveyed by a song. So I will use valence to determine my sentiments, as one often listens to music according to one's frame of mind.

The prepared dataframe can be visualized using an error plot that shows each day's minimum, maximum and average valence. The vertical line spans the minimum and maximum values, and the small triangle on it marks the average.
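A sketch of such an error plot with matplotlib is shown below. It assumes the streaming history has already been merged with the audio features, so each play carries a valence value.

    # Daily minimum, maximum and mean valence
    daily_valence = (song_data.groupby('date')['valence']
                     .agg(['min', 'max', 'mean'])
                     .reset_index())

    plt.figure(figsize=(14, 5))
    # Vertical bar from the daily minimum to the maximum, triangle at the mean
    plt.errorbar(daily_valence['date'], daily_valence['mean'],
                 yerr=[daily_valence['mean'] - daily_valence['min'],
                       daily_valence['max'] - daily_valence['mean']],
                 fmt='^', ecolor='lightblue', markersize=4)
    plt.xlabel('Date')
    plt.ylabel('Valence')
    plt.show()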

The above code produces the following chart —

Error plot to visualize everyday moods

As you can see, I listened to all types of music, be it festive or melancholic.

Similarly, I created a visual to determine each day's minimum and maximum danceability and energy values.
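A similar sketch for danceability and energy, under the same assumptions about the merged feature columns:

    # Daily min, max and mean of danceability and energy
    daily_stats = (song_data.groupby('date')[['danceability', 'energy']]
                   .agg(['min', 'max', 'mean']))

    fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
    for ax, feature in zip(axes, ['danceability', 'energy']):
        stats = daily_stats[feature]
        ax.errorbar(stats.index, stats['mean'],
                    yerr=[stats['mean'] - stats['min'],
                          stats['max'] - stats['mean']],
                    fmt='^', ecolor='lightgrey', markersize=4)
        ax.axhline(0.5, color='red', linestyle='--')  # reference line at 0.5
        ax.set_ylabel(feature.capitalize())
    plt.show()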

In the above graph, I have added a reference line at 0.5, and the average danceability and energy values lie above this line on almost all days. Even though I listen to melancholic songs, they tend to have high danceability and energy.

Feature Analysis

Now comes the section where I will analyze the features of my music list. For this analysis, I will import the unique song list, as I do not need the streaming time data. Import distinct_song.csv, since the required information is stored in this file.

I will analyze my music characteristics such as danceability, energy, speechiness, liveness, valence, and many more.

Mean of audio features

Before performing any analysis on the quantitative variables, it is pertinent to get an overall picture of them. So let's first get the mean value of each feature.

Let's plot the mean of each feature as a bar plot.
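A minimal sketch is shown below, assuming the feature columns follow Spotify's audio-features naming.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Unique songs with their audio features
    songs = pd.read_csv('distinct_song.csv')

    # Features on a 0-1 scale
    features = ['danceability', 'energy', 'speechiness',
                'acousticness', 'liveness', 'valence']

    # Mean value of each feature across all songs
    mean_features = songs[features].mean().sort_values(ascending=False)

    plt.figure(figsize=(9, 5))
    sns.barplot(x=mean_features.index, y=mean_features.values)
    plt.ylabel('Mean value')
    plt.xticks(rotation=45)
    plt.show()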

The above code yields the following plot —

Barplot to visualize the mean value of audio characteristics

From the above visual, it can be inferred that on average my songs have high danceability and energy and are less acoustic. The overall mood, however, is neutral.

Histogram of features

The dataset contains many audio features, so to get a quick overview I can plot a histogram for each of them.

To get the histogram for each attribute, I will run a loop. The code is as follows —
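The loop could look roughly like this, reusing the feature list defined above:

    # One histogram per audio feature
    for feature in features:
        plt.figure(figsize=(6, 4))
        sns.histplot(songs[feature], bins=30)
        plt.title(f'Distribution of {feature}')
        plt.show()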

The result looks something as shown below —

Histogram of Acoustic Features

The code yields a histogram for every feature; here I have included just 4 of the plots.

Tempo

Each song has a defined tempo. I will plot a histogram to visualize the number of songs at each tempo.

Additionally, the dataset includes the mode (major or minor) of each song. We can also segregate the graph by mode using the following code —
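A sketch of such a stacked histogram with seaborn:

    # Histogram of tempo, split by major (1) and minor (0) mode
    plt.figure(figsize=(9, 5))
    sns.histplot(data=songs, x='tempo', hue='mode', bins=30, multiple='stack')
    plt.xlabel('Tempo (BPM)')
    plt.show()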

The above code produces the graph as shown below.

Count of songs by tempo and mode

Music has 12 keys, so we can get the count of songs for each key and mode using a bar plot.

We can simply use a countplot to get this output with mode as hue.
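For example:

    # Number of songs in each key, split by mode
    plt.figure(figsize=(9, 5))
    sns.countplot(data=songs, x='key', hue='mode')
    plt.xlabel('Key (0-11)')
    plt.ylabel('Number of songs')
    plt.show()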

Bar chart to visualize the count of songs by key and mode

Heatmap

The dataset contains many characteristic song features. Whether there is any relationship between these features can be determined using a correlation heatmap, which will help to better understand the peculiarities of my music.
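A sketch of the correlation heatmap with seaborn; time_signature is included since it shows up in the correlations listed below.

    # Correlation matrix of the numeric audio features
    numeric_cols = features + ['loudness', 'tempo', 'key', 'mode', 'time_signature']
    corr = songs[numeric_cols].corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
    plt.show()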

Heatmap of song features

From the above heatmap, it can be inferred that there is a strong correlation between the following pairs of variables —

  1. loudness X energy
  2. valence X danceability
  3. valence X energy
  4. valence X loudness
  5. energy X time_signature

Scatterplot

Scatterplots are also a great way to examine the relationship between two variables, so I will plot them for the variable pairs that showed a strong correlation above. Additionally, I will add mode to the plot to split the analysis by major and minor modes.
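A sketch of these scatterplots:

    # Scatter plots for the strongly correlated feature pairs, coloured by mode
    pairs = [('loudness', 'energy'), ('valence', 'danceability'),
             ('valence', 'energy'), ('valence', 'loudness')]

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    for ax, (x, y) in zip(axes.flat, pairs):
        sns.scatterplot(data=songs, x=x, y=y, hue='mode', alpha=0.6, ax=ax)
    plt.tight_layout()
    plt.show()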

Scatterplot

Pivot table

Different artists favour different pitches, loudness levels, tempos, and so on. Key is also a major factor in music.

With the help of a pivot table, I can get the average key of the songs by each artist.
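A sketch of the pivot table; the artist column name is an assumption about how distinct_song.csv is laid out.

    # Average key per artist
    key_pivot = songs.pivot_table(index='artist', values='key', aggfunc='mean')
    key_pivot.head(10)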

Pivot table for artists and their music average key

In the above picture, we can see the average key value for a few artists: 4.2 for 3LAU, 5 for will.i.am, and so on for the other artists.

Playlist Analysis

In the previous sections, I have extensively analyzed the songs and their features. Now, in this section, let's get a brief analysis of the playlists that I created in my Spotify account.

So first, import the playlist_data.csv file. The analysis will focus on 5 major features: energy, danceability, valence, liveness, and acousticness. Group the data by playlist and compute the average value of these variables.

The best way to get an overall idea of the playlists' values is to plot a radar chart.
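A sketch of the radar chart using matplotlib's polar axes; the playlist column name is an assumption about the file's layout.

    import numpy as np

    playlist_data = pd.read_csv('playlist_data.csv')

    # Average of the five features per playlist
    radar_features = ['energy', 'danceability', 'valence', 'liveness', 'acousticness']
    playlist_means = playlist_data.groupby('playlist')[radar_features].mean()

    # Angles for each feature; repeat the first one to close the polygon
    angles = np.linspace(0, 2 * np.pi, len(radar_features), endpoint=False).tolist()
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(7, 7), subplot_kw={'polar': True})
    for name, row in playlist_means.iterrows():
        values = row.tolist() + row.tolist()[:1]
        ax.plot(angles, values, label=name)
        ax.fill(angles, values, alpha=0.1)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(radar_features)
    ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
    plt.show()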

Radar Chart of Playlists

The above chart gives an overall view of all my playlists.

Songs by Year

The playlists contain numerous songs. The release year of a song matters, as it shows whether I like to listen to older songs or newer ones.

The dataset contains the release_date attribute. The year can be extracted from release_date and plotted to get the count of songs released each year.

The plot can be a simple bar chart showing the count of songs in each year.
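A sketch of this step; the first four characters of release_date are taken as the year so that both full dates and year-only values are handled.

    # Extract the release year and count songs per year
    playlist_data['year'] = playlist_data['release_date'].astype(str).str[:4].astype(int)
    year_counts = playlist_data['year'].value_counts().sort_index()

    plt.figure(figsize=(12, 5))
    sns.barplot(x=year_counts.index, y=year_counts.values, color='steelblue')
    plt.xlabel('Release year')
    plt.ylabel('Number of songs')
    plt.xticks(rotation=90)
    plt.show()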

From the above visual, it can be inferred that I listen to recent songs much more than older ones.

Songs by key

Finally, I will also determine how many songs in my playlists belong to each key.

With a simple bar plot, I can get the number of songs in each key.
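A sketch of this plot, split by mode as in the caption below:

    # Count of playlist songs in each key, split by major/minor mode
    plt.figure(figsize=(9, 5))
    sns.countplot(data=playlist_data, x='key', hue='mode')
    plt.xlabel('Key (0-11)')
    plt.ylabel('Number of songs')
    plt.show()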

Count of songs by key and mode

Conclusion

In this article, visual analysis was performed to get a better understanding of the data. Many relevant deductions were made about my listening history, my moods and my choice of artists. Further, I dissected my songs and playlists and examined them quantitatively.

In the next article, I will perform cluster analysis on the dataset.


Pragya Verma
Analytics Vidhya

Data professional focused on end-to-end solutions, exploring data analytics and engineering to unlock data’s potential.