An Exploratory Data Analysis on trending Television Networks and Shows in the Media Industry.

Published in

Web Mining [IS688, Spring 2021]

10 min read · Feb 6, 2021

Do you like watching TV shows and movies in your leisure time? I am sure most of us do. I am also fairly sure that you spend a good part of that time just looking for a good TV network with popular shows to watch. A recent study reports that users spend 17.8 minutes, on average, browsing TV and movie options before selecting something. There are also a lot of networks on the market, and every network carries a different set of TV shows.
In this blog, we will explore popular TV shows and streaming networks using the TVMAZE API. I primarily work on answering questions like: which network is the most popular and has the maximum number of shows, how ratings are distributed across the shows on different networks, which shows are still running and which have ended, and how long the shows run. Also, if you are specifically looking for Japanese shows, I will explore the popular Japanese shows based on their runtime and ratings.

So, for this analysis, I used Python in a Jupyter notebook. I took the relevant data source, cleaned the data into a usable format, and made insightful visualizations using the amazing libraries available in Python.

Let’s get started

So, to answer the above questions I need a data source from which I can extract significant features like the show's language, genre, type, and the network on which the show is featured.

After a lot of exploration, I chose to use TVMAZE API (Application Programming Interface) which is accessible for free. It provided all the features in the dataset that can answer our research question and that’s how I found it relevant.

If you are not familiar with APIs, you can have a quick look at this link to get an idea before moving forward. TVMAZE provides a REST API that returns JSON. TVMAZE API calls are rate limited to allow at least 20 calls every 10 seconds per IP address. If the rate is exceeded, we might get an HTTP 429 error.
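
The rate limit described above can be handled by waiting and retrying whenever a 429 comes back. The following is only a minimal sketch and not part of the original notebook: the helper name get_json_with_retry, its max_retries and wait_seconds parameters, and the injected get callable are all assumptions made for illustration.

```python
import time


def get_json_with_retry(url, get=None, max_retries=3, wait_seconds=10):
    """Fetch JSON from `url`, pausing and retrying on HTTP 429.

    Hypothetical helper: the name and parameters are illustrative only.
    `get` is any callable that behaves like requests.get.
    """
    if get is None:
        # Imported here so the helper can also be tried with a stub `get`
        import requests
        get = requests.get

    for attempt in range(max_retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429 and attempt < max_retries:
            # Rate limit hit: wait before trying again
            time.sleep(wait_seconds)
            continue
        raise RuntimeError(f"Request failed with status {response.status_code}")
```

For example, get_json_with_retry("http://api.tvmaze.com/shows") would return the parsed list of shows. The 10-second default simply mirrors the 10-second window of the rate limit.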

I used the URL: http://api.tvmaze.com/shows
This URL is used to make an HTTP request with the GET method, which retrieves data from a specific resource. To be clearer, the API acts as a service provider: it takes our input in the form of an HTTP request, passes it to the server on the other end, and returns the server's response to us in the form of data.

Data Collection

Importing required Libraries

import requests
import json
import pandas as pd
import ast
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

The requests library is the de facto standard for making HTTP requests in Python. It’s the main library for consuming data when using an API. The data we request over HTTP comes in JSON format, so the json library is used to parse a JSON string into Python objects. The ast library is used later to turn the stringified dictionaries stored in the CSV back into Python dictionaries. The pandas library is used for data manipulation and analysis, and the matplotlib library is used to create visualizations in Python.

I created a data_request function that is used to collect the data from the API as JSON and converted it into a data frame using pandas. The data frame is saved as a CSV file using pandas.

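def data_request(link):
    # Send a GET request to the API endpoint
    result = requests.get(link)
    if result.status_code == 200:
        # Parse the JSON body into a list of dictionaries
        data = result.json()
        print("\n Data is retrieved successfully. \n")

        # Build a data frame and save the raw data as a CSV file
        df = pd.DataFrame.from_records(data)
        df.to_csv('tvmazedata.csv', encoding='utf-8', index=False)
        print(df.head())
    else:
        print("\n ERROR: status code", result.status_code)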

Calling the function and retrieving the data. The data looks raw.

Loading the dataset
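
The loading step is not shown as code here, so the following is a minimal sketch of what it might look like. It uses an in-memory buffer in place of tvmazedata.csv so that it runs on its own, and the sample row is made up. The name df1 matches the data frame used in the rest of this post. Notice that nested fields such as network come back from the CSV as plain strings, not dictionaries.

```python
import io

import pandas as pd

# Made-up sample row shaped like one record returned by the API
raw = pd.DataFrame.from_records([
    {"name": "Example Show", "network": {"name": "CBS"}, "rating": {"average": 8.1}},
])

# Stand-in for tvmazedata.csv: write to and read from an in-memory buffer
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
buffer.seek(0)

# With the real file this would be: df1 = pd.read_csv("tvmazedata.csv")
df1 = pd.read_csv(buffer)

# The dictionary was written out as text, so it is read back as a string
print(type(df1.loc[0, "network"]))
print(df1.loc[0, "network"])
```

This is why the pre-processing steps parse these columns with ast.literal_eval before reading values out of them.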

Data Pre-Processing

Here, I will be cleaning the dataset into a more usable format

Dropping the columns that are not required

There are 20 columns in the dataset. I am dropping the columns ‘id’, ‘_links’, ‘url’, ‘officialSite’, ‘externals’, ‘schedule’, ‘image’, ‘weight’, ‘updated’, ‘webChannel’, and ‘summary’, as they are not required for the analysis.
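
The drop step can be sketched as follows. The tiny data frame is made up so that the snippet runs on its own; with the real data, df1 would be the frame loaded from tvmazedata.csv. Passing errors="ignore" is my own choice here, so that a column missing from the frame does not raise an error.

```python
import pandas as pd

# Columns listed above as not needed for the analysis
columns_to_drop = [
    "id", "_links", "url", "officialSite", "externals", "schedule",
    "image", "weight", "updated", "webChannel", "summary",
]

# Tiny made-up frame standing in for the raw dataset
df1 = pd.DataFrame({
    "id": [1],
    "name": ["Example Show"],
    "url": ["http://example.com"],
    "summary": ["text"],
})

# errors="ignore" skips any listed column that is not present
df1 = df1.drop(columns=columns_to_drop, errors="ignore")
print(df1.columns.tolist())
```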

Handling missing values

The “network” column is a categorical feature. It has 7 missing values. Since we don’t know what the missing values should be, we will fill them with {‘name’:’Others’}.

# Fill with a stringified dictionary so it can be parsed like the other rows
df1["network"] = df1["network"].fillna("{'name':'Others'}")

Further modification

The values in the “network” column are in the form of a dictionary consisting of key-value pairs. The “name” key contains the name of the show's network. We will extract the value of this key and put it in a new column, “Show_network”.

net = []
for i in df1["network"]:
    # Each value is a stringified dictionary; parse it back into a dict
    res = ast.literal_eval(i)
    net.append(res["name"])

df1["Show_network"] = net

Similarly, the values in the “rating” column are in the form of a dictionary consisting of a key-value pair. We extract the value of the key called “average” from the dictionary and put it in a new column, “Show_rating”.

net_2 = []
for i in df1["rating"]:
    # Parse the stringified dictionary and keep the average rating
    res_2 = ast.literal_eval(i)
    net_2.append(res_2["average"])
df1["Show_rating"] = net_2
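
The two loops above follow the same pattern, so they could be folded into one helper. This is a sketch of an alternative, not the code used for the analysis; extract_key is a hypothetical name, and the sample values are made up.

```python
import ast

import pandas as pd


def extract_key(value, key):
    """Parse a stringified dictionary and return the value stored under `key`."""
    if pd.isna(value):
        # Missing cell: nothing to parse
        return None
    return ast.literal_eval(value).get(key)


# Made-up sample shaped like the columns read back from the CSV
df1 = pd.DataFrame({
    "network": ["{'name': 'CBS'}", "{'name': 'HBO'}"],
    "rating": ["{'average': 8.1}", "{'average': None}"],
})

# Extra keyword arguments given to apply() are passed on to extract_key
df1["Show_network"] = df1["network"].apply(extract_key, key="name")
df1["Show_rating"] = df1["rating"].apply(extract_key, key="average")
print(df1[["Show_network", "Show_rating"]])
```

Using ast.literal_eval rather than eval keeps the parsing limited to Python literals, which is the safer choice for text read from a file.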

There are 6 missing values in the column “rating”. We will calculate the average of all the ratings in the column, and fill the missing values by the mean of the rating column.

avg = round(df1["Show_rating"].mean(), 2)
df1["Show_rating"] = df1["Show_rating"].fillna(avg)
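
The rounded ratings in the “Show_rating_round” column, which the later analysis relies on, are not derived in the code shown in this post. One plausible way to build that column, offered as an assumption rather than the original code, is to round “Show_rating” to the nearest whole number after the missing values are filled. The sample ratings are made up.

```python
import pandas as pd

# Made-up ratings standing in for df1["Show_rating"]
df1 = pd.DataFrame({"Show_rating": [8.1, 7.6, 9.2, None]})

# Fill the missing rating with the column mean, as described above
avg = round(df1["Show_rating"].mean(), 2)
df1["Show_rating"] = df1["Show_rating"].fillna(avg)

# Round to the nearest whole number and store as integers
df1["Show_rating_round"] = df1["Show_rating"].round().astype(int)
print(df1)
```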

The column "premiered" consists of the dates when the shows premiered. The values are strings. We will convert the string dates into datetime objects using pandas' to_datetime() function.

# Converts the string dates into datetime objects
df1['premiered'] = pd.to_datetime(df1['premiered'])

# Column consists of the years when the shows were released
df1['Show_premiered'] = df1['premiered'].dt.year

The new column “Show_premiered” consists of the years when the respective show was released.

Saving the cleaned dataset

The cleaned dataset is saved in a new CSV file named “tvmazedata_cleaned.csv”.

df1.to_csv("tvmazedata_cleaned.csv", encoding='utf-8', index=False)
print("\n The file is saved as tvmazedata_cleaned.csv \n")

The cleaned dataset contains 9 columns.

Cleaned Dataset Features:

The name column consists of the names of the shows.
The type column consists of the type of the show. For example, a show can be scripted.
The language column consists of the language of the show. For example, a show can be in English.
The genres column consists of the genres of the show. It contains a list of genres.
The status column consists of the status of the show. For example, a show might have ended.
The runtime column consists of the runtime of the show in minutes.
The premiered column consists of the exact date when the show premiered.
The Show_network column consists of the network to which the show belongs.
The Show_rating_round column consists of the rounded-off rating of the show. It has been rounded off for convenience.
The Show_premiered column consists of the year when the show premiered, extracted from the datetime value.

Exploratory Data Analysis

The info() function returns the number of non-null values and the data type of each column. There are 6 categorical features, 1 datetime feature, and 3 numerical features.

The describe() function computes summary statistics for the numeric columns. It gives the count, mean, standard deviation, minimum, maximum, and the quartiles, from which the IQR can be computed.
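
As a small illustration of these two functions, here is a sketch on made-up runtimes (the values are not from the dataset). It also shows how the IQR follows from the quartiles that describe() reports.

```python
import pandas as pd

# Made-up runtimes in minutes, for illustration only
df2 = pd.DataFrame({"runtime": [30, 30, 60, 60, 60]})

df2.info()                 # non-null counts and dtypes per column
summary = df2.describe()   # count, mean, std, min, quartiles, max
print(summary)

# IQR = 75th percentile minus 25th percentile
iqr = summary.loc["75%", "runtime"] - summary.loc["25%", "runtime"]
print("IQR:", iqr)
```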

The “Show_network” column contains the names of the networks. There are 44 distinct networks, and each show belongs to one of them. Our aim is to find the network that has the maximum number of shows.

counts = df2["Show_network"].value_counts() 
p = counts.sort_values().plot.barh(figsize=(25,20), fontsize=18) 
p.set_xlabel("Number of Shows",fontsize=18) 
p.set_ylabel("Networks",fontsize=18)
p.set_title("Show networks vs the number of shows for individual networks", fontsize=20)

The maximum number of shows, 29 out of 240, are from “CBS”. After that, “ABC” has the second-highest number of shows, 27.

Most rated shows

Now we want to see what rating most shows have. For this, the “Show_rating_round” column is used, which contains the rounded-off ratings.

We can see that 62.07% of the shows on CBS have a rating of 8 and only 3.45% have a rating of 9.

The most common rating (rounded off) on CBS is 8, with 18 shows.
The least common rating (rounded off) on CBS is 9, with 1 show.
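
The percentages above follow directly from the counts: 18 of 29 shows is 62.07% and 1 of 29 is 3.45%. A minimal sketch of that calculation with pandas is shown below. The counts for ratings 8 and 9 are the ones reported here, but the split of the remaining 10 CBS shows is invented purely so that the series adds up to 29.

```python
import pandas as pd

# 29 CBS shows: 18 rated 8 and 1 rated 9 as reported above;
# the split of the remaining 10 shows is invented for this sketch
cbs_ratings = pd.Series([8] * 18 + [9] * 1 + [7] * 10, name="Show_rating_round")

# normalize=True returns fractions instead of raw counts
percentages = (cbs_ratings.value_counts(normalize=True) * 100).round(2)
print(percentages)
```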

The “status” feature consists of the status of the shows. A show may be running, ended, or to be determined.

203 shows have the status ended, meaning that no new seasons are being released for them.

Now let's see which network has the maximum number of highest-rated shows with the status ended.

HBO has 5 shows that have a rating of 9 (the maximum rating) and have ended. After that, FOX has 4 shows that have a rating of 9 and have ended.

Cinemax, CTV Sci-Fi Channel, Channel 4, NBC, and Nickelodeon have the least number of such shows, i.e. 1 show each with a rating of 9 that has ended.
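
The kind of filter behind this result can be sketched as follows. The rows are made up for illustration, and the exact status labels ("Ended", "Running") are an assumption about how the values are spelled in the dataset.

```python
import pandas as pd

# Made-up rows for illustration; the real data frame has one row per show
df2 = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "status": ["Ended", "Ended", "Running", "Ended", "Ended"],
    "Show_network": ["HBO", "HBO", "HBO", "FOX", "NBC"],
    "Show_rating_round": [9, 9, 9, 9, 8],
})

# Keep only shows that have ended and have the top rounded rating
top_ended = df2[(df2["status"] == "Ended") & (df2["Show_rating_round"] == 9)]

# Count the remaining shows per network, largest first
counts = top_ended["Show_network"].value_counts()
print(counts)
```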

The shows in the dataset are in one of two languages, English or Japanese. 4 shows (Hellsing, Hellsing Ultimate, Berserk, Death Note) are in Japanese and 235 shows are in English.

The three bar graphs plotted above for the Japanese shows represent:

Japanese Shows VS Runtime: We can see that the maximum runtime is 50 minutes and the minimum runtime is 25 minutes. “Hellsing Ultimate” has the maximum runtime and “Berserk” has the minimum runtime.
Japanese Shows VS Ratings: We can see that the shows have one of two ratings, 8 or 9. “Hellsing” and “Hellsing Ultimate” have a rating of 8, whereas “Berserk” and “Death Note” have a rating of 9.
Japanese Shows VS Premiered Years: We can see that out of these 4 shows, the first to be released was “Berserk” in 1997. “Hellsing Ultimate” and “Death Note” were released in 2006.

Visualizing the “premiered” column using a histogram. This column contains the full date, including the day and the month, when the show was premiered.

The peak is from the year 2011 to 2014, when around 142 shows were released, the maximum for any period. This suggests that the networks released more shows as they were becoming popular in this period. The fewest shows were released from 1989 to 1992, and there are no shows from 1992 to 1996.
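
The counts behind a histogram like this can be reproduced by binning the years by hand. This is a minimal sketch only: the years are made up, and the 4-year bin width is an assumption, since the bins of the original plot are not stated.

```python
import numpy as np
import pandas as pd

# Made-up premiere years standing in for df2["Show_premiered"]
years = pd.Series([1990, 2005, 2010, 2011, 2012, 2012, 2013, 2016])

# Count how many shows fall into each 4-year bin
bin_edges = np.arange(1989, 2022, 4)   # 1989, 1993, ..., 2021
counts, edges = np.histogram(years, bins=bin_edges)

for start, end, count in zip(edges[:-1], edges[1:], counts):
    print(f"{start}-{end}: {count}")
```

Each bin includes its left edge and excludes its right edge, except for the last bin, which includes both.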

Limitations

One of the major limitations is that the dataset is quite small. With the free TVMaze API, calls are rate limited to allow at least 20 calls every 10 seconds per IP address. Also, more relevant features could be added for a more detailed analysis.
The second problem I faced was converting the data from nested JSON to a proper format and saving it into a data frame. I solved that problem with the help of online references from StackOverflow.
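
One possible way to address the small dataset is to request more than one page of shows. The sketch below assumes that the shows endpoint accepts a page query parameter and answers with HTTP 404 once the pages run out; please check the TVMaze documentation before relying on this. The helper name fetch_all_shows and its parameters are hypothetical and were not part of this analysis.

```python
def fetch_all_shows(get, base_url="http://api.tvmaze.com/shows", max_pages=3):
    """Collect shows page by page until a page is missing or max_pages is hit.

    Hypothetical helper: `get` is any callable that behaves like requests.get.
    """
    shows = []
    for page in range(max_pages):
        response = get(f"{base_url}?page={page}")
        if response.status_code == 404:
            # Assumed end-of-list signal
            break
        if response.status_code != 200:
            raise RuntimeError(f"Request failed with status {response.status_code}")
        shows.extend(response.json())
    return shows
```

With the requests library imported, fetch_all_shows(requests.get) would collect up to three pages. A short pause between calls would help to stay within the rate limit.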

Conclusion and Summary

Through our analysis, we can see which shows are popular and which networks carry which shows. The CBS network has the maximum number of shows, and most of the shows from CBS have been rated 8. It means that CBS accounts for more of the shows released over the years than any other network in the dataset. Apart from that, we can see that most of the shows run for 60 minutes. Most of the shows have ended, which means no new seasons of them are being released. From our analysis, we can see that HBO has the maximum number of shows that have ended with a rating of 9. HBO has one of the most successful shows, Game of Thrones, which is highly rated and has ended. We can also see that most of the shows are scripted, which may suggest that most people prefer scripted shows over other types of shows.

Further Analysis:

Further, we can also use location to find which shows are popular in which regions and how people choose what to watch. This can be related to the rating, runtime, or genres. We can also analyze the average age groups of the people who watch a particular genre of shows.

References

[1] https://stackoverflow.com/questions/53951554/add-column-of-dataframe-based-on-a-nested-json-in-column

[2] https://medium.com/python-in-plain-english/from-api-to-pandas-getting-json-data-with-python-df127f699b6b

[3] https://www.tvmaze.com/blogs/3/tv-api

[4] https://medium.com/datadriveninvestor/introduction-to-exploratory-data-analysis-682eb64063ff