Apple Music Activity Analyser — Exploring the Data

Alexina Coullandreau
Published in The Startup · Jul 28, 2020

I started using Apple Music in 2016, and when I found out I could request an archive with all my usage data, I thought it would be great to dive into it and try to understand the patterns in the way I use this service!

It occurred to me, after a few hours of wrangling, cleaning, and looking at the data from different angles, that it may be useful for other people to be able to dive into their own data without going through the trouble of parsing and processing it all. And just like that, a Python package was born: apple_music_analyser!

This first article is going to walk you through my analysis journey. You are welcome to check out the Jupyter notebook I built over time, for more details!

The second article is more focused on how to use the package (let’s call it a tutorial).

And finally, a third article introduces the freshly released web interface!

Note: you can request your data from Apple, see Apple’s Data and Privacy page.

Another note (edit from Oct, 12th 2020): a web interface is now available for you to explore your data!

Understanding the data

Upon requesting the data from Apple, we receive an archive (Apple_Media_Services) that contains a lot of folders. We are interested only in the Apple Music Activity folder. And there, again, a lot of files are available. Let’s look at what each of them contains:

  • Apple Music — Recently Played Containers : albums, playlists recently played (recently I believe means within the past 6 months) → this is not relevant to figure out patterns or understand the usage of the service over time
  • Apple Music — Recently Played Tracks : tracks recently played (recently I believe means within the past 6 months) → this is not relevant to figure out patterns or understand the usage of the service
  • Apple Music Library Activity : records all the actions performed with the library (either user actions, or automated software actions) → relevant for our analysis!
  • Apple Music Library Playlists : describes the playlists created in the library (including their name, identifier, and the identifiers of each track they contain) → this is not relevant to figure out patterns or understand the usage of the service
  • Apple Music Library Tracks : describes each track of the library (including its title, artist, genre, release year, album, when it was added to the library…) → relevant for our analysis!
  • Apple Music Likes and Dislikes : lists the rating associated with a track, and when it was rated → relevant for our analysis!
  • Apple Music Play Activity : lists the play activity history, with associated track info such as genre, provider, duration, activity timestamp and type, timezone,… → extremely relevant for our analysis!
  • Identifier Information : matches a track title with an identifier → this can be useful for our analysis!
  • Music — Favorite Stations : I believe this lists the stations that you create → this is not relevant to figure out patterns or understand the usage of the service over time
  • Music — Onboarding Artists : the artists you select when first launching the service, for Apple to start recommending you content → this is not relevant to figure out patterns or understand the usage of the service over time
  • Music — Onboarding Genres : the genres you select when first launching the service, for Apple to start recommending you content → this is not relevant to figure out patterns or understand the usage of the service over time

Out of all the files in the archive provided by Apple, we are going to use only five: Apple Music Library Activity, Apple Music Library Tracks, Apple Music Likes and Dislikes, Apple Music Play Activity, and Identifier Information. We clearly see two types of files: those that carry information about tracks and listening activity, and the Apple Music Library Activity. This last file is going to be analyzed independently of the others.

Let’s dive into the exploration of the data!

Cleaning and wrangling

I am not going to walk you through the whole process, but simply outline what I did, and why I did it like this. If you want more details, feel free to take a look at the Jupyter notebook I built over time.

So here is the problem I am trying to solve: can I build a single dataframe that allows me to compute statistics and identify trends on the type of music I listen to, when I listen to it, whether the trends change from one year to another, and how I find the tracks (search, library, …)?

Of course, the answer is yes, and here is how we can get to this dataframe.

First observation: the file that contains the most information about the playing activity is… Apple Music Play Activity. So this dataframe is going to be our base, which we will enrich with information from the other dataframes.

Step 1: cleaning and restructuring of the Apple Music Play Activity dataframe

Let’s look at one row of the dataframe:

Row 121 of the Apple Music Play Activity dataframe

Just at first glance, there are many columns that look very interesting:

  • Event Type / End Reason Type → to spot whether a track was skipped
  • Feature Name → to spot how the track was found (e.g. a suggested playlist)
  • Genre
  • Content Name/Artist Name → to be able to fetch info about the track
  • Event Start/End Timestamp → to identify when the track was listened to

We also notice that this dataframe does not contain any column with an id that could help us match each row to information contained in the other dataframes…

The cleaning of this dataframe consists of the following steps (a sketch of the main ones follows the list of dropped columns below):
1. Rename the columns containing song title and artist
2. Time columns: first obtain a timestamp column without missing values, using Event Start Timestamp and Event End Timestamp
3. Time columns: add time columns from the timestamp column (year, month, day of the month,…), with conversion to local time
4. Remove outlier rows (Apple Music service started in 2015, so we drop rows with a year before 2015)
5. Add a column with a flag for partial vs complete listening of a given track
6. Add a column with a simplified ‘origin’ of the song, i.e. how it was found (search, suggestion, library,…)
7. Add a column with a calculation of the listening duration in minutes
8. Remove outliers of listening duration (99th percentile)
9. Drop unused columns

The columns we are dropping are the following:

['Apple Id Number', 'Apple Music Subscription', 'Build Version', 'Client IP Address','Content Specific Type', 'Device Identifier', 'Event Reason Hint Type', 'Activity date time','End Position In Milliseconds', 'Event Received Timestamp', 'Media Type', 'Metrics Bucket Id', 'Metrics Client Id','Original Title', 'Source Type', 'Start Position In Milliseconds','Store Country Name', 'Milliseconds Since Play', 'Event End Timestamp', 'Event Start Timestamp','UTC Offset In Seconds','Play Duration Milliseconds', 'Media Duration In Milliseconds', 'Feature Name']
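To give an idea of what these steps look like in pandas, here is a minimal sketch (not the exact code of the package): the input column names come from the Apple export, while the derived column names are simply the illustrative ones I use in the rest of this article.

import pandas as pd

# Minimal sketch of the main cleaning steps of the Play Activity dataframe.
df = pd.read_csv('Apple Music Play Activity.csv')

# 1. Rename the columns containing song title and artist.
df = df.rename(columns={'Content Name': 'Title', 'Artist Name': 'Artist'})

# 2. Build a single timestamp column, falling back on the end timestamp
#    when the start timestamp is missing.
timestamp = df['Event Start Timestamp'].fillna(df['Event End Timestamp'])
timestamp = pd.to_datetime(timestamp, errors='coerce')

# 3. Convert to local time and derive the time columns.
local_time = timestamp + pd.to_timedelta(df['UTC Offset In Seconds'], unit='s')
df['Play Year'] = local_time.dt.year
df['Month'] = local_time.dt.month
df['Day of the month'] = local_time.dt.day
df['Day of the week'] = local_time.dt.day_name()
df['Hour of the day'] = local_time.dt.hour

# 4. Remove outlier rows (the service only exists since 2015).
df = df[df['Play Year'] >= 2015]

# 7. Listening duration in minutes…
df['Play duration in minutes'] = df['Play Duration Milliseconds'] / 60000
# 8. …and removal of the outliers above the 99th percentile.
df = df[df['Play duration in minutes'] <= df['Play duration in minutes'].quantile(0.99)]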

And that’s it for this dataframe!

Step 2: cleaning and restructuring of the Apple Music Likes and Dislikes dataframe

Let’s look at the structure of this dataframe, by looking at the third row:

Item Description    Lara Fabian - Piano nocturne
Preference          LOVE
Created             2017-10-03T08:03:22.849Z
Last Modified       NaN
Item Reference      897148799
Title               Piano nocturne
Artist              Lara Fabian
Name: 2, dtype: object

We notice that both the title and the artist name are in the same column, called ‘Item Description’. So here the parsing will consist of creating two new columns, Title and Artist, by splitting this string.
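Concretely, the split can be sketched like this (assuming the ‘Artist - Title’ layout shown above, and splitting on the first ‘ - ’ only):

import pandas as pd

likes_df = pd.read_csv('Apple Music Likes and Dislikes.csv')

# Split 'Item Description' ("Lara Fabian - Piano nocturne") into two columns.
split = likes_df['Item Description'].str.split(' - ', n=1, expand=True)
likes_df['Artist'] = split[0].str.strip()
likes_df['Title'] = split[1].str.strip()

A title or artist that itself contains ‘ - ’ would need a smarter rule, but this covers the vast majority of rows.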

Step 3: cleaning and restructuring of the Apple Music Library Tracks dataframe

Let’s look at the structure of the Apple Music Library Tracks, by looking at the third row:

Track Identifier                182865686
Title                           Should I Stay or Should I Go
Artist                          The Clash
Album                           Combat Rock
Genre                           Acid punk
Track Year                      1982
Track Duration                  190119
Track Play Count                1
Date Added To Library           2014-03-25T10:41:02Z
Skip Count                      0
Date of Last Skip               NaN
Release Date                    NaN
Purchased Track Identifier      NaN
Composer                        The Clash
Last Played Date                2014-04-04T09:07:25Z
Track Like Rating               NaN
Apple Music Track Identifier    NaN
Tag Matched Track Identifier    NaN
Name: 2, dtype: object

With this dataframe, there is not much to do really, besides dropping some columns that are not used later on — and even this is not of great importance!

We decide to drop the following columns:

['Content Type', 'Sort Name','Sort Artist', 'Is Part of Compilation', 'Sort Album','Album Artist', 'Track Number On Album', 'Track Count On Album', 'Disc Number Of Album', 'Disc Count Of Album','Date Added To iCloud Music Library', 'Last Modified Date', 'Purchase Date', 'Is Purchased', 'Audio File Extension', 'Is Checked', 'Audio Matched Track Identifier', 'Grouping', 'Comments', 'Beats Per Minute', 'Album Rating', 'Remember Playback Position', 'Album Like Rating', 'Album Rating Method', 'Work Name', 'Rating', 'Movement Name', 'Movement Number', 'Movement Count', 'Display Work Name', 'Copyright', 'Playlist Only Track', 'Sort Album Artist', 'Sort Composer']

Step 4: let’s build a data structure for each Track

OK, so far what we have are four parsed dataframes (i.e. restructured and cleaned), but how do we manage to match items from one to the others? This is where a new data structure comes into play: the Track. The idea is to use a Track class and, for each instance, update its information from the various dataframes.

So basically, going through the rows of each input dataframe, we try to identify whether a row represents a track we have already seen (and for which we already have an instance) or a new one (with different titles, ids or associated genres), we create or update track instances accordingly, and we keep track, for each artist, of all the songs listened to.

The logic is the following (a rough sketch of the Track structure and of the matching is shown after the comments below):

  1. we loop through the Apple Music Library Tracks dataframe: we create a track instance when we encounter a new song, and we update an existing instance when we have already seen a similar song
  2. we loop through Identifier Information: as this dataframe contains only a title and an id, we are not going to be able to create new Track instances (too little information about a track), so we simply update existing instances when we find a match on the ids
  3. we loop through the Apple Music Play Activity dataframe: we create a track instance when we encounter a new song, and we update an existing instance when we have already seen a similar song
  4. we loop through the Apple Music Likes and Dislikes dataframe: again, as this dataframe contains very little information about each track, we only update existing instances when we have already seen a similar song (similar here meaning a similar combination of Title and Artist)

Some comments here:

  • a track instance is just a placeholder for information about songs, which simplifies the way we can match songs from one dataframe to songs from another dataframe. We do not have, in each dataframe, a column with a unique identifier that we could use for this purpose. Besides, it happens that the same song is identified by Apple as different tracks (because the album it comes from is different, so it is a different media file, while it really is the same song)
  • for each track instance, we record in which dataframe we gathered information from (using the row index)
  • to know whether we already ‘saw’ a song (i.e. we already have a track instance for a given item), we use either the ids, or a similarity score between the ‘Title && Artist’ string combinations (for example, comparing ’High Hopes && Pink Floyd’ and ’High Hopes (Edit) && Pink Floyd’ will return a high similarity score)
  • we exclude from this processing step the rows that do not contain a Title (‘NaN’), and those for which we could not find a close match, either using an id or a combined string of the form ‘Title && Artist’
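To make this more concrete, here is a rough sketch of what such a Track class and its matching could look like. The attributes, the difflib-based score and the threshold are illustrative choices, not necessarily the ones used by the package.

import difflib

class Track:
    """Placeholder for everything we know about one song across the dataframes."""
    def __init__(self, title, artist):
        self.titles = {title}      # the same song can appear under slightly different titles
        self.artist = artist
        self.apple_ids = set()     # identifiers found in the different files
        self.genres = set()
        self.is_in_library = False
        self.rating = None
        self.appearances = {}      # e.g. {'play_activity': [row indexes where it was seen]}

def similarity(combined_a, combined_b):
    # Similarity between two 'Title && Artist' strings: comparing
    # 'High Hopes && Pink Floyd' and 'High Hopes (Edit) && Pink Floyd'
    # returns a high score.
    return difflib.SequenceMatcher(None, combined_a.lower(), combined_b.lower()).ratio()

def find_existing_track(known_tracks, title, artist, threshold=0.9):
    # Return the Track instance matching this title/artist, or None if the song is new.
    target = f'{title} && {artist}'
    for track in known_tracks:
        for known_title in track.titles:
            if similarity(target, f'{known_title} && {track.artist}') >= threshold:
                return track
    return None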

Step 5: merge the Tracks information to the Apple Music Play Activity dataframe

We decided to use the Apple Music Play Activity dataframe as a base. Besides, for each track instance, we know in which rows of this dataframe this track was seen. It is now easy to add information on each row about the track, such as all the genres this track has, whether it is in the library or not, and its rating.
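A minimal sketch of this merge, assuming each Track instance recorded the Play Activity row indexes where it appeared (as in the sketch above):

# Enrich each Play Activity row with what we know about its track.
df['Genres'] = ''
df['Track is in library'] = False
df['Rating'] = None

for track in known_tracks:
    for idx in track.appearances.get('play_activity', []):
        df.at[idx, 'Genres'] = '; '.join(sorted(track.genres))
        df.at[idx, 'Track is in library'] = track.is_in_library
        df.at[idx, 'Rating'] = track.rating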

We then end up with a dataframe for visualizations, that has one row per listening activity, and a lot of information about each song that was listened to. We are now ready to visualize some stuff!

Example of a row of the visualization dataframe after parsing and processing

Data analysis and visualizations

And now for the cool stuff, the visualizations! Here are the questions that I was trying to answer:

  1. Are there any trends on the moment songs are listened to? Does it change from one year to another? What’s the average listening time per day/week/month across years?
  2. What are the genres most LIKED, most DISLIKED? Is there a link between the origin of the track and its rating?
  3. Where are songs most usually found? How are the songs in the library found (as an evaluation of the relevance of the suggestions)?
  4. What’s the ratio of songs skipped? Is there a link between the origin of the track, its rating, and whether it was skipped or not?
  5. Can we establish a ranking of artists? Of genres? Is there a difference between the years we observe?
  6. General appreciation of how much time songs are listened to, using some filters per year, artist, genre….
  7. Life cycle of a song (listened to for a long period, vs short period)

Note that here I am displaying static images, but as they are rendered using Plotly, the visualizations generated are interactive!

Distribution of the number of tracks listened to per month for different years
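As an illustration, a figure like the one above could be produced with a few lines of Plotly Express; this is a hedged sketch using the illustrative column names derived earlier, not the exact code of the package.

import plotly.express as px

# Count the tracks listened to per month, for each year.
counts = (df.groupby(['Play Year', 'Month'])
            .size()
            .reset_index(name='Count'))
counts['Play Year'] = counts['Play Year'].astype(str)   # one colour per year

fig = px.bar(counts, x='Month', y='Count', color='Play Year', barmode='group')
fig.show()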

Let’s notice that both 2016 and 2020 only have partial data!

Distribution of the count of tracks listened to per day of the month for different years

We cannot really see trends in the days of the month, so let’s look at the days of the week and the hour of the day instead!

Distribution of the percentage of tracks listened to per day of the week for different years

Now what about how much time I listen to music per day? Here is an example in 2018.

Listening time in 2018

Now filtering only on Pink Floyd in 2019

Listening time of Pink Floyd in 2019

Note about the filtering: I built a helper function that takes as an argument a dictionary of parameters to filter the dataframe on. It basically performs a query on the dataframe using a combination of the filter parameters passed in the dictionary.
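The helper in the package performs a query on the dataframe; an equivalent minimal sketch using boolean masks (with the illustrative column names used earlier) would be:

import pandas as pd

def get_filtered_df(df, filter_params):
    # Return the rows of df that match every filter in filter_params,
    # e.g. get_filtered_df(df, {'Play Year': [2019], 'Artist': ['Pink Floyd']}).
    mask = pd.Series(True, index=df.index)
    for column, allowed_values in filter_params.items():
        mask &= df[column].isin(allowed_values)
    return df[mask]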

Now taking a look at rankings

Ranking of genres listened to for different years

And if we filter only on tracks that have a rating ‘LOVE’:

Ranking of genres listened to for different years if the song was rated ‘LOVE’

Or looking at a list of the top 5 artists that have songs rated ‘LOVE’:

Or where I find the songs I liked:

Ranking per year of the origin of a song that has a rating ‘LOVE’

There is actually so much you can figure out with this data!

This analysis allowed me to dive into my listening habits and trends. Every conclusion that could be drawn from the data made sense, which is actually reassuring.

Some of the insights that I take away from this analysis are related to:

  • the genres, artists, tracks I prefer, and their ranking depending on a variety of filters
  • the trends over time of genres, artists listened to, and when I tend to listen to music
  • how I find songs, if I skip them a lot
  • what device I usually use to listen to music

Conclusion and further work

Data is power! I am really happy I could get a better understanding of what music I listen to and when, what my favorite genres, artists and tracks are, and how all of this evolves over time. You may think it is not worth spending hours on a dataset to get to this conclusion, but the fun part is the journey, and also the fact that now I have graphs to answer the question ‘What’s your favorite song?’ :)

Also, what started as a tiny project of exploration of my own data actually turned out to become two big projects: a python package, and a webpage for people to simply upload their data and get all these cool visualizations without needing to write a single line of code.

The Python package is already available! Check out the second article of this series, and the GitHub repository!

As for the web page, I’ll let you know when it is online!

I would be happy to know what you think about this work, please get in touch!
