Anime Watcher Status Clustering for Popularity

Ty Barker
INST414: Data Science Techniques
3 min read · Dec 11, 2023

For this assignment, I applied K-Means clustering to an anime dataset downloaded from Kaggle. The code explores a range of K values and justifies the chosen number of clusters using scikit-learn's silhouette score and matplotlib visualizations. The goal is to find which anime have the most similar distributions of user watch status. That could help production studios planning new anime see which existing shows have been the most successful.

After download, the data underwent preprocessing to ensure compatibility with the K-Means algorithm. This involved handling missing values, some basic dimensionality reduction, and encoding categorical features.
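The preprocessing steps could look roughly like the sketch below. It is only an illustration: the toy DataFrame, its column names, the median/"Unknown" fill strategy, and the use of PCA for the dimensionality reduction are all assumptions, not the exact steps in my notebook.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Kaggle data; every column name here is hypothetical.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type": ["TV", "Movie", None, "TV"],
    "watching": [100.0, None, 30.0, 5.0],
    "completed": [900.0, 400.0, None, 20.0],
    "dropped": [10.0, 5.0, 2.0, 1.0],
})

def preprocess(frame, n_components=2):
    """Fill missing values, one-hot encode, scale, and reduce with PCA."""
    data = frame.drop(columns=["name"])  # identifiers are not features
    numeric_cols = data.select_dtypes(include="number").columns
    # Missing numeric values: replace with the column median.
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())
    # Missing categorical values: treat as their own category.
    categorical_cols = data.columns.difference(numeric_cols)
    data[categorical_cols] = data[categorical_cols].fillna("Unknown")
    # One-hot encode the categorical columns.
    encoded = pd.get_dummies(data, columns=list(categorical_cols))
    # K-Means is distance based, so put all features on the same scale.
    scaled = StandardScaler().fit_transform(encoded.astype(float))
    return PCA(n_components=n_components).fit_transform(scaled)

features = preprocess(df)
print(features.shape)
```

Scaling matters here because raw watch counts can differ by orders of magnitude, and unscaled K-Means would be dominated by the largest column.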

K-Means Implementation:

The core of the code is the K-Means implementation, which uses the sklearn.cluster.KMeans class from scikit-learn. The code iterates over values of K from 2 up to a limit of 15, which I admittedly chose arbitrarily. For each K, the algorithm initializes K centroids (scikit-learn's default is k-means++ seeding rather than purely random placement), then alternates between assigning each data point to its closest centroid and recomputing the centroids until the assignments converge.
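A minimal sketch of that loop is shown here. The make_blobs data is a synthetic stand-in for the preprocessed anime features, and the n_init and random_state values are assumptions chosen for reproducibility rather than the settings from the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed anime features (an assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

models = {}
for k in range(2, 16):  # K = 2 .. 15 inclusive
    # n_init=10 reruns the algorithm from 10 seedings and keeps the best run.
    models[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

for k, model in models.items():
    # inertia_ is the within-cluster sum of squared distances for this K.
    print(k, round(model.inertia_, 1))
```

Keeping the fitted models in a dictionary makes it easy to pull out the labels for whichever K is chosen later without refitting.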

Silhouette Score and Elbow Method:

To determine the optimal number of clusters, we use the silhouette score and the elbow method. The silhouette score, computed with sklearn.metrics.silhouette_score, is the mean silhouette coefficient over all data points and ranges from -1 to 1. A value close to 1 indicates well-separated clusters, a value near 0 indicates overlapping clusters, and negative values suggest points assigned to the wrong cluster.
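To make the scale concrete, here is a small sketch that scores the same toy points under two labelings. The points and labels are invented for illustration; only silhouette_score itself comes from the assignment.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated groups of 2-D points (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])

good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the true groups
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

good = silhouette_score(X, good_labels)
bad = silhouette_score(X, bad_labels)
print(f"good: {good:.3f}, bad: {bad:.3f}")
```

The labeling that matches the real groups should score close to 1, while the mixed labeling should score far lower, which is the direction to keep in mind when comparing K values.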

The elbow method is a visual approach to choosing K. It is usually applied to inertia (the within-cluster sum of squared distances), but here I plotted the silhouette score against each K value and looked for the “elbow”, the point where the score starts to plateau. For this assignment that point was at K = 10.
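A plot like this can be produced with a few lines of matplotlib. The sketch below again uses synthetic make_blobs data as a stand-in, and the output file name is hypothetical, so the curve it draws will not match the one from the anime data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed anime features (an assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

k_values = list(range(2, 16))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

fig, ax = plt.subplots()
ax.plot(k_values, scores, marker="o")
ax.set_xlabel("Number of clusters (K)")
ax.set_ylabel("Mean silhouette score")
ax.set_title("Silhouette score vs. K")
fig.savefig("silhouette_vs_k.png")  # hypothetical output file name
```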

Elbow method plotted in Matplotlib

Cluster representation and meaning:

Based on the elbow plot, I used K = 10, since the plateau there suggests that a larger K would not improve the score much. This produced ten clusters, ranging in size from four shows to hundreds and even thousands. The clusters do appear to capture which shows are most popular among viewers, which is the insight I hoped to find: the shows with the most praise and the biggest fan bases at some point in time land in small clusters, while the larger clusters hold the more medium-hype shows. There are even clusters made up mostly of second seasons of shows whose first seasons were very well received but whose follow-ups were less popular.
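Checking cluster sizes and reading off the members of the smallest cluster can be done along these lines. The titles and feature values below are invented, and K = 2 is used only to keep the toy example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: 3 hypothetical "hit" shows far from 9 hypothetical average shows.
rng = np.random.default_rng(0)
hits = rng.normal(loc=10.0, scale=0.1, size=(3, 2))
average = rng.normal(loc=0.0, scale=0.1, size=(9, 2))
X = np.vstack([hits, average])
titles = [f"hit_{i}" for i in range(3)] + [f"show_{i}" for i in range(9)]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
clustered = pd.DataFrame({"title": titles, "cluster": labels})

# Number of shows per cluster, then the titles in the smallest cluster.
sizes = clustered["cluster"].value_counts()
smallest = sizes.idxmin()
small_titles = clustered.loc[clustered["cluster"] == smallest, "title"].tolist()

print(sizes.sort_index().to_dict())
print(small_titles)
```

On the real data, the same pattern with the K = 10 labels is how one would list the handful of shows in the smallest clusters and compare them against the larger ones.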

A cluster containing many spin-offs and second seasons.

GitHub link
