You watch Anime? What kind of fan are you?

Published in INST414: Data Science Techniques

Dec 8, 2023

Introduction

In the vast sea of shows to watch, no single person has the time to sample every one in order to figure out their preferences. This draws a parallel to the vast sea of data available on the internet for individuals to work with. Diving into that sea with the intention of uncovering non-obvious insights is a rewarding yet challenging task. For this analysis, we delve into the world of anime preferences, aiming to understand underlying patterns in viewer choices. The insight we seek is to identify distinct clusters of anime series based on their attributes, leading to a better understanding of viewer preferences and potentially informing content creation decisions.

Data Source and Features

The dataset under scrutiny is a compilation of anime series, each characterized by features such as the number of episodes, duration, popularity among members, and user scores. The similarity metric we employ is based on these features, allowing us to gauge the likeness between different anime series. This dataset was published by user MITTVIN on Kaggle.com.

Determining K-Values

Selecting an appropriate value for k, the number of clusters, is pivotal. In this analysis, we use the elbow method to determine the optimal k. By fitting the data to k-means clustering models for a range of k values and plotting the sum of squared distances, we identify the “elbow” point where the rate of decrease sharply changes. This inflection point serves as our guide, striking a balance between granularity and simplicity. The visualization constructed in this process is seen below.
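As a rough illustration of the elbow method described above, here is a minimal plain-Python sketch. It is not the code from our analysis: the toy points are made up, the function names are hypothetical, and the deterministic farthest-first seeding is an assumption chosen so the example gives the same result every run (library implementations typically use randomized seeding).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's k-means and return the sum of squared distances."""
    # Farthest-first seeding: start from the first point, then repeatedly
    # add the point farthest from all centroids chosen so far.
    # (An assumption made for determinism, not the method from the article.)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )

    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated

    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Made-up 2-D toy data with three visually obvious groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.5),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting these values against k gives the elbow curve. On this toy data the drop flattens after k = 3, which is where the "elbow" would be read off by eye.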

Results

Clusters in our dataset represent groups of anime series with similar characteristics or appeal. For instance, one cluster might consist of long-duration series with high user scores, indicating a preference for immersive, well-received shows. Another cluster could include shorter, more niche anime with a cult following. Understanding the characteristics of each cluster provides insights into the diverse preferences within the anime community. We used pyplot from Matplotlib to construct the cluster visualizations seen below.

Within these visualizations we can see that the anime are split into three clusters. The clusters are based primarily on the medium of the anime (movie, TV, OVA), its number of episodes, its fan score, and the number of self-proclaimed “members” that each anime has. Starting from the back and working forward, Cluster 2 consists solely of the show “One Piece”. This show is an anomaly in the sense that it has over 1000 episodes and has been running for much longer than every other item on the list. Clusters 0 and 1 appear to be formed based on the popularity of their respective shows, particularly each show's “score” rating and number of “members”. Cluster 0 holds the shows with the higher scores and more members. Notable items in Cluster 0 are Steins;Gate, Fullmetal Alchemist, and Attack on Titan. We believe that, ultimately, these clusters were formed based on popularity, with Cluster 1 holding average anime, Cluster 0 holding more renowned and popular pieces of media, and Cluster 2 holding only One Piece due to its devoted fanbase and episode count of over 1000. Our results tell us that One Piece is on a playing field of its own and that
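One way to check an interpretation like this is to summarize each cluster by its size and feature averages. The sketch below is only an illustration: the rows are placeholder values invented for the example (not figures from the dataset), and `profile_clusters` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows: (title, cluster label, episodes, score, members).
# These numbers are invented for illustration, not taken from the dataset.
rows = [
    ("Show A", 0, 24, 9.0, 2_000_000),
    ("Show B", 0, 64, 9.1, 3_000_000),
    ("Show C", 1, 12, 7.5, 150_000),
    ("Show D", 1, 13, 7.9, 250_000),
    ("Show E", 2, 1000, 8.7, 2_100_000),
]


def profile_clusters(rows):
    """Return size and mean episodes/score/members for each cluster label."""
    grouped = defaultdict(list)
    for _title, label, episodes, score, members in rows:
        grouped[label].append((episodes, score, members))

    profiles = {}
    for label, feats in sorted(grouped.items()):
        episodes, scores, members = zip(*feats)  # transpose rows to columns
        profiles[label] = {
            "size": len(feats),
            "mean_episodes": mean(episodes),
            "mean_score": mean(scores),
            "mean_members": mean(members),
        }
    return profiles


profiles = profile_clusters(rows)
for label, stats in profiles.items():
    print(label, stats)
```

A cluster of size 1 with an extreme episode average, like the placeholder "Show E" here, is the pattern that would flag a single show as an outlier.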

Cleaning and Limitations

One major limitation of our analysis is that “show duration” is absent from all of the visualizations pictured above. This is a result of difficulties encountered while cleaning.

When first presented with the dataset, it was formatted in a way that made visualization and clustering impossible without initial cleaning. First, the “members” field was a string containing a comma-separated number followed by the word “members” (“12,403 members”, for example). We had to remove all commas and then drop the word “members” before being able to manipulate this field.
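The “members” cleanup described above can be sketched as follows. This is a minimal illustration assuming every value has the form shown in the example; `parse_members` is a hypothetical name, not necessarily the function used in our repository.

```python
def parse_members(raw: str) -> int:
    """Convert a string such as "12,403 members" to the integer 12403."""
    # Drop the trailing word, remove the thousands separators, trim spaces.
    number_part = raw.replace("members", "").replace(",", "").strip()
    return int(number_part)


print(parse_members("12,403 members"))
```

Note that `int()` raises a `ValueError` on anything that does not reduce to digits, which is a useful signal that a row needs manual inspection.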

Extracting the number of episodes posed a similar problem. The original “episode” field held a string containing the medium of the anime followed by the number of episodes in parentheses. For example, the value of the “episode” field for the popular anime Hunter X Hunter was “TV (148 eps)”. We separated the medium of the show from the number of episodes to give us two different fields, one of which holds numeric values that are now available for analysis.
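A regular expression is one way to do this split. The sketch below assumes the “medium (N eps)” layout from the example; the pattern and the name `split_episode_field` are hypothetical, and strings that do not match are returned with no episode count rather than guessed at.

```python
import re

# Matches e.g. "TV (148 eps)": a medium, then a number and "eps" in parentheses.
EPISODE_PATTERN = re.compile(r"^\s*(?P<medium>.+?)\s*\((?P<count>\d+)\s*eps?\)\s*$")


def split_episode_field(raw: str):
    """Split "TV (148 eps)" into ("TV", 148); return (raw, None) if unparseable."""
    match = EPISODE_PATTERN.match(raw)
    if match is None:
        return raw.strip(), None
    return match.group("medium"), int(match.group("count"))


print(split_episode_field("TV (148 eps)"))
```

Returning `None` for the count keeps unparseable rows visible so they can be dropped or fixed deliberately before clustering.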

Ultimately, the problems with cleaning the “duration” field left our dataset with no values for “duration_years”, which negatively affected the accuracy of our analysis. Because of this, our results may be biased in the sense that a show that ran for 40 episodes in one season is seen as just as popular as a show that ran for 40 episodes across 5 seasons. Typically, a show's animator decides whether or not to renew a show based on its popularity. A show going on for multiple seasons is a sign that it is desired by fans and animators alike.

A link to the repository with the code used can be found here