Exploring Anime data

The purpose of this article is to share how to apply some basic exploratory data analysis techniques to things that might interest you. In this article, we’ll be swimming a bit in the anime data pool. Here is a GitHub link to the example code.

Why exploratory data analysis is important

Some of you might wonder why exploratory data analysis is important, even though we probably all know it is the first step in any data analysis process.

An analogy: before deciding what anime to draw and how to draw it, I'd examine my calendar, tools, layout, and space so that the output would be the best. I could jump right into drawing whenever and whatever I like, but the result would be disorganized and messy.

Similarly, for data work, we always want to understand the situation before making any assumptions or decisions. Exploration helps identify errors, reveal patterns, detect outliers or anomalies, and surface interesting relationships among the variables. In real industry work, identifying errors may even uncover undetected backend data bugs for the engineering team. Needless to say, accurate data is the foundation for any trustworthy downstream data work (analysis, dashboards, statistical models, machine learning, etc.).

Example data set

I chose the Anime Recommendations Database from Kaggle because I personally enjoy drawing and watching anime. Exploring anime data sounds interesting to me, and who knows, I might find some insights and useful information to add to my anime knowledge base.

Tools

I used Jupyter Notebook for this exercise, a web-based interactive computing platform. As a side note, two additional/alternative resources: 1) Google Colaboratory is a good platform where you can share your Colab notebooks conveniently; 2) Anaconda is a distribution of the Python/R programming languages for scientific computing that aims to simplify package management and deployment.

Libraries to import

pandas is a Python library for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.

matplotlib.pyplot is a state-based interface to matplotlib. pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation.

seaborn is a Python library for data visualization based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

%matplotlib inline sets the backend of matplotlib to the ‘inline’ backend. When using the ‘inline’ backend, your matplotlib graphs will be included in your notebook, next to the code.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Step 1: Load anime data to a pandas DataFrame

read_csv() is the main pandas function for reading CSV files. The na_values parameter specifies additional strings that pandas should treat as NaN (Not a Number); na_values = ['Unknown'] tells pandas to load "Unknown" values as NaN, preparing the data for analysis.

anime_df = pd.read_csv('anime.csv', na_values = ['Unknown'])
anime_df.head()
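As a quick sanity check (a minimal sketch on a synthetic CSV, since the exact Kaggle file may differ), you can confirm that "Unknown" values really become NaN and the column parses as numeric:

```python
import io
import pandas as pd

# A tiny CSV mimicking the anime file; "Unknown" marks a missing episode count
csv_text = "name,episodes\nNaruto,220\nMysteryShow,Unknown\n"

df = pd.read_csv(io.StringIO(csv_text), na_values=['Unknown'])

print(df['episodes'].isna().sum())  # 1 missing value
print(df['episodes'].dtype)         # float64: NaN forces a numeric float column
```

Without na_values, the "episodes" column would load as strings (object dtype), breaking numeric summaries downstream.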

Step 2: Descriptive statistics

pandas describe() shows basic statistical details such as percentiles, mean, and standard deviation. pandas info() prints information about a DataFrame, including the index and column dtypes, non-null counts, and memory usage.

For example, for the "episodes" field, the mean is about 12 episodes (roughly the 75th percentile), whereas the median is 2, with a maximum of 1,818 episodes. Since the median is more robust to outliers, let's take a look at the distribution.

anime_df.describe()
anime_df.info()
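To see why the mean and median diverge so sharply (a toy illustration with made-up numbers, not the actual dataset):

```python
import pandas as pd

# A skewed series: mostly short shows plus one very long-running outlier
episodes = pd.Series([1, 1, 2, 2, 3, 12, 1818])

print(episodes.mean())    # pulled far up by the 1818-episode outlier
print(episodes.median())  # stays at 2, robust to the outlier
```

A single extreme value drags the mean far above the typical show, while the median still describes the "middle" anime, which is why it's the better summary here.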

Step 3: Plot distributions

Let’s plot the distribution for “episodes”. We can see that it’s a right-skewed distribution with a very long tail. As a shortcut, I cut off the tail at ≥ 100 episodes (eyeballing where the “meat” of the distribution is) so that we can observe the shape better. In practice, we would explore those outliers and consider removing them before computing summary metrics. For outlier removal, a common practice is the box plot method for roughly normal distributions and the adjusted box plot method for right-skewed distributions, which could be its own article. Looking at the distributions below, a question for you: how would you present the results to others?

anime_df["episodes"].loc[anime_df["episodes"] < 100].hist(bins = 20)
anime_df["rating"].hist(bins = 20)
anime_df["members"].loc[anime_df["members"] < 20000].hist(bins = 20)
[Figure: number of episodes distribution]
[Figure: rating distribution]
[Figure: number of members distribution]
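The box plot method mentioned above flags points beyond 1.5×IQR from the quartiles. A minimal sketch on toy data (the helper name and threshold are my own, not from the article):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker bounds of the box plot method."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([1, 2, 2, 3, 3, 4, 500])  # 500 is an obvious outlier
lo, hi = iqr_bounds(s)
trimmed = s[(s >= lo) & (s <= hi)]
print(trimmed.max())  # 4: the 500 outlier is dropped
```

For a strongly right-skewed field like "episodes", the adjusted box plot variant (which shifts the whiskers based on a skewness measure) would be the more appropriate choice, as the article notes.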

Step 4: Plot more visual representations

Let’s plot anime counts by “genre” first. This field is a little tricky: we first have to split the comma-separated genres into one genre per row. Looking at the countplot (powered by seaborn) below, it’s obvious that “comedy” is the most common genre for anime, well ahead of the second most common genre, “action”.

A pie chart could be a good exploratory visual as well; let’s try it on “type”. Looking at the pie chart below, TV (31%), OVA (27%), and Movie (19%) are noticeably the top three anime types.

Note that OVA stands for “original video animation”: Japanese animated films and series made especially for release in home-video formats, without prior showings on television or in theaters.

# countplot for "genre"
all_genres = []

def extract_single_value(column):
    """Split comma-separated genre strings into a flat list, one genre per entry."""
    for item in column:
        item = item.strip()
        all_genres.extend(item.split(', '))
    return all_genres

genre_column = anime_df["genre"].astype(str)
all_genres = extract_single_value(genre_column)
genre_df = pd.DataFrame(all_genres, columns=['genre'])

ax = sns.countplot(x="genre", data=genre_df, order=genre_df['genre'].value_counts().index)
plt.xticks(rotation=90, ha='right')
plt.gcf().set_size_inches(15, 8)

# pie chart for "type"
sort_type = anime_df["type"].value_counts()  # counts per anime type, sorted descending
pie = sort_type.plot.pie(autopct='%1.1f%%', figsize=(10, 10))
[Figure: anime counts by genre]
[Figure: % of anime by type]
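For splitting the comma-separated genres, a more idiomatic pandas alternative (a sketch on toy data) is Series.str.split plus explode, which also handles missing values without an astype(str) workaround:

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Action, Comedy', 'Comedy', None]})

# Split each comma-separated string into a list, then expand to one genre per row
genres = df['genre'].str.split(', ').explode().dropna()
counts = genres.value_counts()
print(counts)  # Comedy: 2, Action: 1
```

The result is the same one-genre-per-row shape fed to the countplot above, without a loop or a module-level accumulator list.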

I know we've only covered a small part of exploratory data analysis, and there's so much more we could discover using this superpower to mine gold out of this “black sand”.

To summarize what we covered in this article:

First of all, it’s most important to find the “right” data source, as any downstream data work should be data-centric.

Second, exploratory data analysis is the first step of any data project; it lets us understand the data before making any assumptions or decisions.

Third, there are many useful Python libraries for data analysis that you can import to assist in discovering insights.

Fourth, there could be different data representations/visualizations, especially for different data types (e.g., continuous, discrete, categorical). Create mocks before jumping into visualizations.

Lastly, presenting the insights to influence others is the key to this deliverable.

I hope exploratory data analysis becomes a useful tool in your daily life. Happy “snorkeling”, and we’ll “deep dive” soon!
