# Exploring Anime data

## Let curiosity be your guide

The purpose of this article is to show how to apply some basic exploratory data analysis (EDA) techniques to data that interests you. In this article, we'll be swimming a bit in the anime data pool. Here is a GitHub link to the example code.

## Why exploratory data analysis is important

Some of you might wonder why exploratory data analysis matters, even though we probably know it is the first step in any data analysis process.

By analogy: before deciding what anime to draw and how to draw it, I'd examine my calendar, tools, layout, and workspace so the output would be the best it can be. I could jump straight into drawing whenever and whatever I like, but the result would be disorganized and messy.

Similarly, for data work, we always want to understand the situation before making any assumptions or decisions. EDA helps identify errors, reveal patterns, detect outliers or anomalies, and find interesting relationships among the variables. In real industry work, identifying errors may even uncover previously unnoticed backend app data bugs for the engineering team. Needless to say, accurate data is the foundation for any trustworthy downstream data work (analysis, dashboards, statistical models, machine learning, etc.).
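As a quick illustration of what "identifying errors" can look like in practice, here is a minimal sketch (using a small made-up DataFrame, not the anime data set) of two common first checks: counting missing values and detecting duplicate rows.

```python
import pandas as pd

# A tiny hypothetical sample with one missing rating and one duplicate row
df = pd.DataFrame({
    "name": ["Gintama", "Steins;Gate", "Gintama"],
    "rating": [9.0, None, 9.0],
})

# Missing values per column -- a first sanity check on data quality
missing = df.isnull().sum()

# Exact duplicate rows -- often a sign of an upstream ingestion bug
dup_count = df.duplicated().sum()
```

Checks like these take seconds to run and can surface problems long before they silently distort a dashboard or a model.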

## Example data set

I chose the Anime Recommendations Database from Kaggle to start with because I personally enjoy drawing and watching anime. Exploring anime data sounds interesting to me, and who knows, I might find some insights and useful information to add to my anime knowledge base.

## Tools

I used Jupyter Notebook for this exercise, which is a web-based interactive computing platform. On a side note, for additional/alternative resources: 1) Google Colaboratory is a good platform where you can share your Colab notebook conveniently; 2) Anaconda is a distribution of the Python/R programming languages for scientific computing that aims to simplify package management and deployment.

## Libraries to import

pandas is a Python library written for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.

matplotlib.pyplot is a state-based interface to matplotlib. pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation.

seaborn is a Python library for data visualization based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

%matplotlib inline sets the backend of matplotlib to the 'inline' backend. With this backend, your matplotlib graphs are rendered in the notebook itself, next to the code.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
```

## Step 1: Load anime data to a pandas DataFrame

read_csv() is the pandas function for reading CSV files. The na_values parameter specifies additional strings that pandas should treat as NaN (Not a Number). Here, na_values=['Unknown'] converts "Unknown" values to NaN as part of preparing the data.

```python
anime_df = pd.read_csv('anime.csv', na_values=['Unknown'])
anime_df.head()
```
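To see what na_values actually does, here is a minimal sketch using a tiny in-memory CSV (hypothetical rows, standing in for anime.csv):

```python
import io
import pandas as pd

# Two made-up rows: one numeric episode count, one "Unknown"
csv_text = "anime_id,episodes\n1,26\n2,Unknown\n"

# Without na_values, "Unknown" stays a string, so the column has object dtype
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "Unknown" is parsed as NaN and the column becomes numeric
clean = pd.read_csv(io.StringIO(csv_text), na_values=["Unknown"])
```

This matters because descriptive statistics and histograms only work cleanly on numeric columns; a single stray "Unknown" string would otherwise force the whole column to object dtype.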

## Step 2: Descriptive statistics

Pandas describe() is used to view basic statistical details such as the percentiles, mean, and standard deviation. Pandas info() prints information about a DataFrame including the index type, columns, non-null counts, and memory usage.

For example, **for the data field "episodes", the average is 12 episodes (about the 75th percentile), whereas the median is 2, with a maximum of 1,818 episodes**. Since the median is more robust against outliers, let's take a look at the distribution.

```python
anime_df.describe()
anime_df.info()
```
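The mean-versus-median gap above is worth a moment of intuition. Here is a minimal sketch with made-up episode counts (not the real data) showing how a single extreme value drags the mean far from the median:

```python
import pandas as pd

# Hypothetical episode counts: mostly short series plus one 1,818-episode outlier
episodes = pd.Series([1, 1, 2, 12, 26, 1818])

mean_value = episodes.mean()      # pulled way up by the outlier
median_value = episodes.median()  # barely affected by it
```

When a distribution has a long tail like this, reporting only the mean can badly misrepresent the "typical" anime.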

## Step 3: Plot distributions

Let's plot the distribution for "episodes". We can see that it's a right-skewed distribution with a very long tail. As a shortcut, I cut off the tail at ≥ 100 episodes (eyeballing where the "meat" of the distribution ends) so that we can observe the distribution better. In practice, we would explore the outliers and consider removing them before computing summary metrics. For outlier removal, a common practice is to use the box plot method for normal distributions and the adjusted box plot method for right-skewed distributions, which could be its own article. **Looking at the distributions below, a question for you: how would you present the results to others?**

```python
anime_df["episodes"].loc[anime_df["episodes"] < 100].hist(bins=20)
anime_df["rating"].hist(bins=20)
anime_df["members"].loc[anime_df["members"] < 20000].hist(bins=20)
```
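For readers curious about the box plot method mentioned above, here is a minimal sketch of the classic 1.5 × IQR rule on made-up episode counts (the adjusted box plot method for skewed data is a refinement of this and is left out here):

```python
import pandas as pd

# Hypothetical right-skewed episode counts with two long-tail outliers
episodes = pd.Series([1, 1, 2, 2, 3, 12, 13, 24, 26, 500, 1818])

# Box plot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = episodes.quantile(0.25), episodes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
trimmed = episodes[(episodes >= lower) & (episodes <= upper)]
```

Note that on strongly skewed data this rule can be too aggressive on the short side and too lenient on the long side, which is exactly why the adjusted box plot method exists.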

## Step 4: Plot more visual representations

Let's plot anime counts for "genre" first. This field is a little tricky: we first have to split the comma-separated genres into one genre per row. Looking at the countplot (powered by seaborn) below, it's obvious that **"comedy" is the most common genre for anime, with close to twice the count of the second most common genre, "action"**.

A pie chart could be a good exploratory visual for "genre" as well; let's try one on "type" instead. **Looking at the pie chart below, TV (31%), OVA (27%), and movie (19%) are noticeably the top three anime types**.

Note that OVA stands for "original video animation": Japanese animated films and series made especially for release in home video formats, without prior showings on television or in theaters.

```python
# countplot for "genre"
all_genres = []

def extract_single_value(column):
    for item in column:
        item = item.strip()
        all_genres.extend(item.split(', '))
    return all_genres

genre_column = anime_df["genre"].astype(str)
all_genres = extract_single_value(genre_column)
genre_df = pd.DataFrame(all_genres, columns=['genre'])

ax = sns.countplot(x="genre", data=genre_df, order=genre_df['genre'].value_counts().index)
plt.xticks(rotation=90, ha='right')
plt.gcf().set_size_inches(15, 8)

# pie chart for "type"
sort_type = anime_df["type"].value_counts()  # counts per anime type
pie = sort_type.plot.pie(y='type', autopct='%1.1f%%', figsize=(10, 10))
```
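As an aside, the genre-splitting loop above can also be written with built-in pandas string methods. Here is a minimal sketch on a small made-up slice of the "genre" column, using `str.split` plus `explode` to get one genre per row:

```python
import pandas as pd

# Hypothetical slice of the "genre" column: comma-separated genres per anime
genre_series = pd.Series(["Action, Comedy", "Comedy", "Drama, Romance"])

# str.split + explode yields one genre per row without a manual loop
genre_df = (
    genre_series.str.split(", ")
    .explode()
    .to_frame(name="genre")
)
counts = genre_df["genre"].value_counts()
```

This vectorized form is shorter, avoids mutating a module-level list, and drops straight into the same `sns.countplot` call.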

I know we only covered a small part of exploratory data analysis; there's so much more we could discover with this superpower, mining gold out of this "black sand".

**To summarize what we covered in this article:**

First of all, it's most important to find the "right" data source, as all downstream **data work should be data-centric**.

Second, **exploratory data analysis is the first step of any data project**, so that we understand the data before making any assumptions or decisions.

Third, there are **many useful Python libraries for data analysis** that you can import to help discover data insights.

Fourth, there are **different data representations/visualizations**, especially **for different data types** (e.g., continuous, discrete, categorical). Create mocks before jumping into visualizations.

Lastly, **presenting the insights to influence others is the key** to this deliverable.

I hope exploratory data analysis becomes a useful tool in your daily life. Happy "snorkeling", and we'll "deep dive" soon!