Data Analysis & Visualisation of Netflix Viewing History

Gaurang Swarge
Published in Analytics Vidhya
5 min read · Nov 23, 2019

This post is part of a series documenting my small experiments with R and Python on data analysis and data science problems. These experiments may be redundant, and others may well have written about them already, but this is more of a personal diary of my learning process. Along the way, I hope to engage and inspire anyone going through the same process. If someone more knowledgeable stumbles upon this blog and sees a better way to do things, or spots a mistake, please feel free to share feedback so that we can all grow together as a community.

Recently, I was going through my Netflix “My Account” page and realised that you can download a profile's viewing activity as a CSV file. I immediately thought it would be pretty cool to visualise my Netflix usage. I have been practising my Python skills lately, so this seemed like a good opportunity to use the Matplotlib and seaborn libraries.

I was set up for disappointment, though, once I saw the data that Netflix exported.

The CSV file had only 2 columns: the date, and the name of the show, season and episode combined in a single column.

I figured there wasn't much I could do with this and thought of giving up on the project. But I didn't want to give up so easily, and besides, figuring out how to make things work is the essence of working with data. I took it up as a challenge to get at least two visualisations out of it and gain some insight into my Netflix habits.

Since I had only 2 columns to work with, I started tinkering with pandas functions to get more out of them, and by the time I finished I had gone from 2 columns to 10. I repeated the same process for my wife’s Netflix profile in order to compare our viewing habits.

First things first, let's start with the visualisations that I could extract from the data. Here they are:

Day-wise viewing distribution
Viewing history by day of the week
Month-wise viewing history
What do we prefer? A movie or a TV show?
Most watched shows

About the Data:

This data covers August 2018 to mid-November 2019. There are a few things it doesn't capture.

First, and obviously, the data cannot tell us when my wife and I watched Netflix together.

The other problem is that each row is a single viewing entry, so shows with many seasons and episodes appear more often in the dataset than shows with only a couple of seasons. Naturally, the most frequent shows are the ones with multiple seasons and episodes (e.g. Friends, Brooklyn 99).

With that out of the way, let's move on to some of the insights from the graphs:

  • January and December were when I spent the most time watching Netflix (for the obvious reason: holidays), whereas my wife watched the most Netflix in May, June and August (she was between jobs). Did you notice how July is lower than August? That's because her mom was visiting us in July, so she spent more time with her than with Netflix.
  • I usually watch Netflix on weekends, whereas my wife watches mostly on Sunday and Monday (an interesting insight: is she trying to beat the Monday blues?).
  • Her third most watched day is Friday, which is usually my least watched day.
  • In terms of shows, the one I spent the most time watching is Brooklyn 99, whereas my wife spent most of her time watching BoJack Horseman (so did I; BoJack is in second place in my viewing pattern).
  • Between TV shows and movies, both of us watch TV shows the most. Even when we do watch movies, it's almost always on a Saturday.

How did I do it?

Now that the insights are out of the way, this is how I went about generating the visualisations.

To begin, I imported the CSV file into my notebook.
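A minimal sketch of the import step, under a few assumptions: the column names `Title` and `Date`, the date format, and the file name `NetflixViewingHistory.csv` are my guesses at what the export looks like, and the rows below are made-up sample data rather than anything from the real history. An in-memory string stands in for the file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up rows standing in for the exported file (assumed layout: Title, Date).
# With the real export you would pass the file path instead, for example
# pd.read_csv("NetflixViewingHistory.csv") (file name is an assumption).
sample_csv = io.StringIO(
    "Title,Date\n"
    '"Example Show: Season 1: Pilot",11/17/19\n'
    '"Another Show: Season 2: Episode 3",11/16/19\n'
    '"Example Movie",1/5/19\n'
)

df = pd.read_csv(sample_csv)

# Confirm that only two columns are available to work with.
print(df.shape)
print(df.columns.tolist())
```

Checking `df.shape` and the column names first is a cheap way to see how little the raw export contains before deriving anything from it.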

I then turned to the date column. First I converted it to datetime format using the to_datetime function of pandas, which makes it possible to extract the individual components of a date. From the converted column I extracted Day, Month, Year and Day_of_week into separate columns (pandas exposes these through the .dt accessor).
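The date step could look roughly like the sketch below. The column name `Date`, the month/day/two-digit-year format and the lowercase names of the new columns are assumptions for illustration; the date format in an export may differ, in which case the `format` string would need to change.

```python
import pandas as pd

# Made-up dates in an assumed month/day/two-digit-year format.
df = pd.DataFrame({"Date": ["11/17/19", "11/16/19", "1/5/19"]})

# Convert the text column to real datetimes.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")

# Pull the individual components out through the .dt accessor.
df["day"] = df["Date"].dt.day
df["month"] = df["Date"].dt.month
df["year"] = df["Date"].dt.year
df["day_of_week"] = df["Date"].dt.day_name()

print(df)
```

Passing an explicit `format` avoids pandas guessing whether a value such as 1/5/19 means January 5 or May 1.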

After this, I turned my attention to the Title column. If you look carefully, its entries follow the format “Show Name: Season: Episode Name”. Since this pattern is mostly consistent across the dataset, we can split the string into 3 separate columns: show_name, season and episode_name.
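One way to do this split, sketched on made-up titles: `str.split` with `expand=True` returns one column per piece, and `n=2` caps it at two splits so that an episode name containing its own colon stays in one piece. The separator `": "` and the `Title` column name are assumptions about the export.

```python
import pandas as pd

# Made-up titles: two episodes and one movie.
df = pd.DataFrame({
    "Title": [
        "Example Show: Season 1: Pilot",
        "Another Show: Season 2: The Trip: Part 2",
        "Example Movie",
    ]
})

# Split on the first two ": " separators only; rows with fewer pieces
# get missing values in the remaining columns.
# Note: this assumes at least one title has two separators, otherwise
# fewer than three columns come back and the renaming below would fail.
parts = df["Title"].str.split(": ", n=2, expand=True)
parts.columns = ["show_name", "season", "episode_name"]

df = df.join(parts)
print(df[["show_name", "season", "episode_name"]])
```

The movie row ends up with a show_name and nothing in season or episode_name.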

I also noticed that the title of any movie in the dataset contains only the movie name, which leads me to believe that rows where season is null are most likely movies. Let's create another column that specifies whether an entry is a movie or a TV show.
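A small sketch of that rule, assuming the season column already exists and using a hypothetical column name `type` for the new label. It is a heuristic: a movie whose own title contains ": " would get a non-null season under a colon-based split and be labelled a TV show.

```python
import numpy as np
import pandas as pd

# Made-up rows; a missing season is taken to mean "movie".
df = pd.DataFrame({
    "show_name": ["Example Show", "Example Movie", "Another Show"],
    "season": ["Season 1", None, "Season 3"],
})

# Label each row based on whether a season was found in the title.
df["type"] = np.where(df["season"].isna(), "Movie", "TV Show")

print(df)
```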

Now that we have fleshed out our dataset with new columns, we can start visualising the data. I won't get into the details of how to visualise; if you are interested, you can check out the code for the visualisations at this link:

GitHub Repo: https://github.com/rckclimber/analysing-netflix-viewing-history
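For a rough idea of what feeds charts like these, here is a sketch of the counting step on made-up rows. It is not the code from the repository; the column names `show_name`, `day_of_week` and `type` are assumptions carried over for illustration.

```python
import pandas as pd

# Made-up viewing rows with the derived columns already in place.
df = pd.DataFrame({
    "show_name": ["Show A", "Show A", "Show B", "Movie C", "Show A"],
    "day_of_week": ["Saturday", "Sunday", "Saturday", "Saturday", "Monday"],
    "type": ["TV Show", "TV Show", "TV Show", "Movie", "TV Show"],
})

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Entries per weekday, in calendar order, with 0 for days never watched.
views_by_day = df["day_of_week"].value_counts().reindex(weekday_order, fill_value=0)

# Most frequent shows, counting TV entries only.
top_shows = df.loc[df["type"] == "TV Show", "show_name"].value_counts().head(10)

print(views_by_day)
print(top_shows)

# Either Series can be drawn with Matplotlib via views_by_day.plot(kind="bar").
```

These are counts of entries rather than minutes watched, so long-running shows will rank higher simply because they have more episodes.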

I’m sure far more can be done with this dataset to glean insights. One idea I have is to scrape the details of all the shows and add more columns, such as “Genre” and “Episode Time”, so that we can dig much deeper.

Well, maybe my next post can tackle these ideas :)

Gaurang Swarge
Ex-entrepreneur, data scientist. Love climbing, hiking and yoga.