Apple Music activity analyser — getting started with the package

Alexina Coullandreau
Towards Data Science
9 min read · Jul 28, 2020

Photo by Daniel Korpai on Unsplash

In another article (see here), we took a look at the analysis I made of my own data. It is now time for me to show you how to use the apple_music_analyser package, so that you can perform similar data analysis on your own data!

Note: you should request your data from Apple; see Apple’s Data and Privacy page.

Another note (edit from Oct 12th, 2020): a web interface is now available for you to explore your data! You can read about it here.

Install the package

First of all, install the package:

pip install apple-music-analyser

All the dependencies are installed automatically. There is also a docs folder and an examples folder that contain useful content for you! You can take a look at them in the GitHub repository.
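
If you want to double-check that the install worked, here is a quick sanity check that relies only on the standard library (so it assumes nothing about the package’s own attributes):

# quick sanity check that the package is installed
# importlib.metadata is part of the standard library (Python 3.8+)
from importlib.metadata import version
print(version('apple-music-analyser'))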

Imports and data input

You can use your own data, of course! But if you prefer, you can use the test data that comes with the package (used for the tests). Get it here.

Step 1, we import all the modules we will need.

# import to parse an archive, save and load a pickle file
from apple_music_analyser.Utility import Utility

# import to actually parse and process the data
from apple_music_analyser.VisualizationDataframe import VisualizationDataframe

# import to filter the df
from apple_music_analyser.Query import QueryFactory

# import to build visualizations
from apple_music_analyser.DataVisualization import SunburstVisualization, RankingListVisualization, HeatMapVisualization, PieChartVisualization, BarChartVisualization

Step 2, we extract just the files we need to build the visualizations. Let’s say the archive Apple provided to you is inside a folder called data.

path_to_archive = 'data/Apple_Media_Services.zip'
input_df = Utility.get_df_from_archive(path_to_archive)

Now if you are using test_df.zip, beware that the structure of the archive is slightly different from the structure of the Apple archive. Because of that, we need to pass an extra parameter to the get_df_from_archive method with the structure of the archive. Like so:

path_to_archive = 'data/test_df.zip'
target_files = {
    'identifier_infos_path' : 'test_df/Apple Music Activity/Identifier Information.json.zip',
    'library_tracks_path' : 'test_df/Apple Music Activity/Apple Music Library Tracks.json.zip',
    'library_activity_path': 'test_df/Apple Music Activity/Apple Music Library Activity.json.zip',
    'likes_dislikes_path' : 'test_df/Apple Music Activity/Apple Music Likes and Dislikes.csv',
    'play_activity_path': 'test_df/Apple Music Activity/Apple Music Play Activity.csv'
}
input_df = Utility.get_df_from_archive(path_to_archive, target_files)

And that’s it! input_df is a dictionary with the following structure:

{
    "identifier_infos_df" : identifier_infos_df,
    "library_tracks_df" : library_tracks_df,
    "library_activity_df" : library_activity_df,
    "likes_dislikes_df" : likes_dislikes_df,
    "play_activity_df" : play_activity_df
}
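
Each value of this dictionary is, as the names suggest, a dataframe. A quick way to get a feel for what was parsed (a minimal sketch, assuming each value is indeed a pandas dataframe):

# print the name and dimensions of each parsed dataframe
for name, df in input_df.items():
    print(name, df.shape)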

Visualization Dataframe structure

The package defines a class called VisualizationDataframe to build all the objects that we will use for analysis and visualization later on. So let’s instantiate this class, using the input_df object we got previously.

viz_df_instance = VisualizationDataframe(input_df)

OK, basically that’s it for this library :) You now have a cleaned, parsed and processed data structure, with a few objects you will want to use for your analysis (see more details below).

Now let’s pause for a second. This step of instantiating the VisualizationDataframe class may take a few seconds depending on the size of your data (for my data of a few tens of thousands of lines, it took a bit less than 30 seconds). You may not want to run all the cleaning, parsing and processing every time you want to look at your data, unless the files inside the archive have changed! So here come two handy functions: saving and loading pickle files.

You can save the instance of the VisualizationDataframe class as a pickle file, and load it later on whenever you want to analyse/visualize the data again.

# we want to save viz_df_instance, but we could decide to save only the visualization dataframe, or any other object really....
Utility.save_to_pickle(viz_df_instance, 'visualization_structure.pkl')

# we want to load the file that was saved
saved_visualization_structure = Utility.load_from_pickle('visualization_structure.pkl')

OK so let’s say we have in memory viz_df_instance, either because we just instantiated it, or because we loaded it from a pickle file. Let’s access some of its properties!

First of all, the most useful one, I would say, is what I call the visualization dataframe. This pandas dataframe contains one row per play activity, with as much information about each track as possible, such as its rating, all the genres associated with it, whether it is in the library, whether it was played partially, and so on.

# this returns the df_visualization property of the instance
df_viz = viz_df_instance.get_df_viz()

It is a pandas dataframe, so you can manipulate it like any pandas dataframe (get its shape, filter on some values, …)!
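
For example, here are a few standard pandas operations on it (the Play_Year column is the one used in the month filtering example further down this article):

# dimensions and available columns
print(df_viz.shape)
print(df_viz.columns)

# keep only the rows for 2019
# (the Play_Year column also appears in the filtering example later in this article)
df_2019 = df_viz[df_viz['Play_Year'] == 2019]
print(df_2019.head())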

There are other objects you may want to access, which live in the track_summary_objects attribute of the instance:

# the list of genres
genres_list = viz_df_instance.track_summary_objects.genres_list

# the list of titles for each artist
artist_titles = viz_df_instance.track_summary_objects.artist_tracks_titles

And count dictionaries (count of songs per year and genre, or year and artist):

# build a dictionary of counts per genre for each year
genre_counts_dict = viz_df_instance.track_summary_objects.build_ranking_dict_per_year(df_viz, 'Genres')

# or the same dictionary but with a count per artist
artist_counts_dict = viz_df_instance.track_summary_objects.build_ranking_dict_per_year(df_viz, 'Artist')
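
If you want to peek at what these count dictionaries contain, a small loop does the trick. Note that I am assuming here that the dictionary maps each year to a mapping of genre (or artist) to counts; print a sample first to confirm the exact shape on your data:

# peek at the top 3 genres per year
# (assumption: the dict maps year -> {genre: count})
for year, counts in genre_counts_dict.items():
    top_3 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(year, top_3)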

Use the Query module

Now let’s say you don’t want to be looking at the whole dataframe. And let’s say that you are not super comfortable with Pandas. No worries! The Query module is here to help you!

Basically, you build a dictionary with your query parameters and create a query instance, which will provide a filtered dataframe for you. Let’s look at an example:

# we define the conditions of the filter
# here we want only songs played in 2017, 2018 or 2019, that have a rating of 'LOVE' and were listened to completely (not skipped)
query_params = {
    'year':[2017, 2018, 2019],
    'rating':['LOVE'],
    'skipped':False
}
# we get the visualization dataframe
df_viz = viz_df_instance.get_df_viz()
# define the query
query_instance = QueryFactory().create_query(df_viz, query_params)
# get the filtered df
filtered_df = query_instance.get_filtered_df()

The query parameters dictionary accepts the following structure:

params_dict = {
    'year': list of int,
    'genre': list of str,
    'artist': list of str,
    'title': list of str,
    'rating': list of str,
    'origin': list of str,
    'offline': bool,
    'library': bool,
    'skipped': bool
}

This filtered dataframe is, again, a pandas dataframe that you can manipulate as usual with Pandas.
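
For instance, you can combine the Query module with plain pandas to rank the artists among the loved, non-skipped songs (the Artist column name is the one used in the ranking examples above):

# how many rows survived the filter
print(filtered_df.shape)

# rank the artists among the loved, non-skipped songs
# (the 'Artist' column name comes from the ranking examples above)
print(filtered_df['Artist'].value_counts().head(10))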

Play with visualizations

Now comes the fun part! I will not go through all the possible examples here; I invite you to execute the example scripts from the GitHub repository, and you will see the visualizations straight away.

Here I will show you two of my favorite visualizations, because I think they are the most meaningful: the sunburst and the heat map.

Important note: the module of the package we are going to use here is built with Plotly, and is actually just a wrapper to quickly get simple visualizations, not a replacement for Plotly. Each class of this module has a figure property that you can interact with just like you would with Plotly to get fancier visualizations!

Sunburst

This is a nice way to represent rankings.

Sunburst visualization of Genres

There are four types of data you will be able to rank using this visualization: genres, artists, titles and track origin.

And because the sunburst actually provides ranking information, we are going to pass the class not the dataframe itself but a ranking dictionary built from it:

# we get the ranking dictionary
ranking_dict = viz_df_instance.track_summary_objects.build_ranking_dict_per_year(df_viz, 'Genres')

You can replace ‘Genres’ with ‘Title’, ‘Artist’ or ‘Track_origin’.

# we create an instance of the SunburstVisualization class
# the second argument is the title we want to use for the plot, you can set it to whatever you want!
sunburst = SunburstVisualization(ranking_dict, 'Genre')
# we render the plot - note that the graph will appear in your browser automatically
sunburst.render_sunburst_plot()
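
And remember the note above about the figure property: since it is a regular Plotly figure, you can customize it further with standard Plotly calls. A minimal sketch (assuming figure is a plotly.graph_objects.Figure):

# the figure property is a regular Plotly figure, so standard Plotly
# customization applies
sunburst.figure.update_layout(title_text='My music, broken down by genre')
sunburst.figure.show()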

Heat Map

The Heat Map is a visualization I really like because it helps highlight active days. Basically, there are two types of Heat Maps:

  • the one that will plot the months on the x-axis and the days of the month on the y-axis,
Heat map per month for a single year
  • and the one that will plot the day of the week on the x-axis and the hour of the day on the y-axis.
Heat map per day of the week for a single year

In the first case, we can figure out on which days we listen to music the most. With the second, we can see at what time of day we listen.

Note: the listening time is summed up in each cell whenever the combination month/DOM or DOW/HOD appears multiple times in the input dataframe.

You can plot multiple subplots, which means that you can compare years (by plotting one subplot per year of the first type of heat map), or compare months.

In this example, let’s plot two heat maps:

  • case 1: plot the months on the x-axis and the days of the month on the y-axis, for the years 2018 and 2019, on two different subplots
  • case 2: plot the day of the week on the x-axis and the hour of the day on the y-axis, for the month of February 2019

Case 1

We define the years for which we want to plot the heat maps, using a query dictionary (we will actually need a filtered_df for each year).

Then we create an instance of HeatMapVisualization with the unfiltered visualization dataframe (you will understand why shortly), and two subplots, one for each year (2018 and 2019).

query_params = {
    'year':[2018, 2019]
}
heat_map = HeatMapVisualization(df_viz, 2)

For each year we want to plot, we will get a filtered_df with just the elements of that year (that is why we passed the whole dataframe when instantiating the HeatMapVisualization):

for year in query_params['year']:

    # we create a query parameters dictionary with a single year, and all the other
    # parameters that we had in the query_params dict defined above
    # (we copy the dict so we don't mutate the original)
    year_query_params = dict(query_params)
    year_query_params['year'] = [year]

    # we define the query and get the filtered df
    query_instance = QueryFactory().create_query(df_viz, year_query_params)
    filtered_df = query_instance.get_filtered_df()

    # we replace the dataframe initially passed with the year-filtered df
    heat_map.df = filtered_df

    # we render a single trace, in this case with month on the x-axis and day of the month (DOM) on the y-axis
    heat_map.render_heat_map('DOM', str(year))

And finally we render the plot:

# we render the whole figure that will contain 2 subplots, one per year,
# each showing the data just for that year
heat_map.figure.show()

And the plot will look something like this:

Heat Map for each day of the month for 2018 and 2019

Case 2

We are going to create a HeatMapVisualization instance using a dataframe that is filtered on February 2019. A quick note on why we perform the filtering like this: simply because there is currently no way to use the Query module to query on months… But if enough people comment that it would be great to have, I might add it!

# first we get a filtered dataframe on February 2019
df_viz_feb_2019 = df_viz[(df_viz['Play_Year']==2019)&(df_viz['Play_Month']==2)]
# we create the HeatMapVisualization instance
heat_map = HeatMapVisualization(df_viz_feb_2019)

Then we build and render the plot of type ‘DOW’ (day of the week), with the legend ‘Feb 2019’.

# generate the plot, the second argument is used as a legend
heat_map.render_heat_map('DOW', 'Feb 2019')
# display the plot rendered
heat_map.figure.show()

And the plot will look something like this:

Heat Map for each day of the week of Feb 2019
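
One more trick: since heat_map.figure is, again, a regular Plotly figure, you can save it as a standalone HTML file to share or reopen later (a minimal sketch, with a file name of your choosing):

# save the rendered figure to a standalone HTML file
# (works because heat_map.figure is a regular Plotly figure)
heat_map.figure.write_html('heat_map_feb_2019.html')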

One last thing

I hope this introduction gave you a good starting point for using this package! Its purpose is really to abstract away as much as possible, so that you can spend your time mostly on the analysis of your data.

I built comprehensive documentation that goes into more detail about the structure of the package, as well as a few files with many examples. Feel free to take a look at them on the GitHub repository!

Also, I would be happy to assist if you have any questions, and to hear what you think about this project, so please get in touch!
