Recommendation System for Anime Data

Meimi Li
Published in Analytics Vidhya
5 min read · Aug 9, 2020

A simple recommendation system, plus TfidfVectorizer- and CountVectorizer-based recommendation systems, for beginners.

The Goal

Recommendation systems are widely used in many industries to suggest items to customers. For example, a radio station may use a recommendation system to compile the top 100 songs of the month to suggest to its audience, or to identify songs of a similar genre to those the audience has requested. Since recommendation systems are so widely used in industry, we are going to create one for anime data. It would be nice if anime followers could see an updated top 100 anime list every time they walk into an anime store, or receive an email suggesting anime based on the genres they like.

With the anime data, we will apply two different recommendation system models, a simple recommendation system and a content-based recommendation system, to analyse the data and create recommendations.

Overview

For the simple recommendation system, we need to calculate a weighted rating so that the same average score carries a different weight depending on the number of votes behind it. For example, an average rating of 9.0 from 10 people will carry less weight than an average rating of 9.0 from 1,000 people. After we calculate the weighted rating, we can produce a top chart of anime.

For the content-based recommendation system, we need to identify which features will be used in the analysis. We will use sklearn to measure the similarity between anime based on those features and create anime suggestions.

Data Overview

The anime dataset contains 12,294 anime and 7 columns: anime_id, name, genre, type, episodes, rating, and members.

Implementation

1. Import Data

We need to import pandas, as this will let us load the data neatly into a DataFrame.

import pandas as pd
anime = pd.read_csv('…/anime.csv')
anime.head(5)
anime.info()
anime.describe()

We can see that the minimum rating is 1.67 and the maximum rating is 10. The minimum number of members is 5 and the maximum is 1,013,917.

anime_dup = anime[anime.duplicated()]
print(anime_dup)

There are no duplicate rows that need to be cleaned.

type_values = anime['type'].value_counts()
print(type_values)

Most anime are broadcast on TV, followed by OVA.

2. Simple Recommendation System

First, we need to know how the weighted rating (WR) is calculated:

WR = (v / (v + m)) × R + (m / (v + m)) × C

where v is the number of votes for the anime; m is the minimum votes required to be listed in the chart; R is the average rating of the anime; and C is the mean vote across the whole report.
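To make the effect of these four quantities concrete, here is a minimal sketch that plugs the Overview example (an average rating of 9.0 from 10 votes versus 1,000 votes) into the same expression used by the WR function later in this section. The values m = 100 and C = 6.5 are made up for illustration and are not taken from the anime data, and the function name weighted_rating is only a placeholder.

```python
def weighted_rating(v, R, m, C):
    """Blend an item's own rating R with the global mean C.

    v: number of votes for the item
    m: minimum votes required to be listed (illustrative value below)
    C: mean rating across all items (illustrative value below)
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical threshold and global mean, chosen only for this example.
m, C = 100, 6.5

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)

print(round(few_votes, 2))   # 6.73
print(round(many_votes, 2))  # 8.77
```

With only 10 votes the score is pulled most of the way towards the global mean, while with 1,000 votes it stays close to the anime's own rating of 9.0.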

We need to determine what data will be used in this calculation.

m = anime['members'].quantile(0.75)
print(m)

Based on the result, we will only use anime with more than 9,437 members to create the recommendation system.

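# Keep only anime with more members than the threshold m
qualified_anime = anime.loc[anime['members'] > m].copy()
# C is the mean rating across the whole dataset
C = anime['rating'].mean()
def WR(x, C=C, m=m):
    v = x['members']
    R = x['rating']
    return (v / (v + m) * R) + (m / (v + m) * C)
qualified_anime['score'] = WR(qualified_anime)
# sort_values returns a new DataFrame, so assign the result back
qualified_anime = qualified_anime.sort_values('score', ascending=False)
qualified_anime.head(15)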

This is the list of the top 15 anime based on the weighted rating calculation.
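As a quick check of the ranking step, the sketch below runs the same calculation on a tiny made-up DataFrame. The titles, ratings, and member counts are invented, and m is fixed at 100 rather than taken from a quantile. Note that sort_values returns a sorted copy rather than sorting in place, so its result has to be kept before calling head.

```python
import pandas as pd

# Invented toy data; not taken from anime.csv.
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "rating": [9.5, 8.0, 8.5, 6.0],
    "members": [20, 5000, 900, 300],
})

m = 100                      # hypothetical minimum-members threshold
C = toy["rating"].mean()     # mean rating across the toy data

# Filter first, then score only the qualified rows.
qualified = toy.loc[toy["members"] > m].copy()
v, R = qualified["members"], qualified["rating"]
qualified["score"] = (v / (v + m)) * R + (m / (v + m)) * C

# sort_values returns a sorted copy, so keep the result.
top = qualified.sort_values("score", ascending=False)
print(top[["name", "score"]])
```

"Alpha" has the highest raw rating but is filtered out because it has too few members, and the remaining titles are ordered by their weighted score rather than by their raw rating.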

3. Genre Based Recommendation System

For the genre-based recommendation, we will use the sklearn package to help us analyse the text. We need to compute the similarity between genres. The two methods that we are going to use are TfidfVectorizer and CountVectorizer.

TfidfVectorizer weights the frequency of each word by how often the word occurs across all documents, so words that appear everywhere count for less. CountVectorizer is simpler: it only counts how many times each word has occurred.
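To illustrate the difference, the sketch below fits both vectorizers on three invented genre strings, written in the same comma-separated style as the dataset. Because "action" appears in every string, TF-IDF gives it a lower weight than the rarer genre in each row, while the raw counts treat every word the same. The default vectorizer settings are assumed here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented genre strings; not taken from anime.csv.
genres = [
    "Action, Comedy",
    "Action, Drama",
    "Action, Romance",
]

count = CountVectorizer()
count_matrix = count.fit_transform(genres).toarray()

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(genres).toarray()

# vocabulary_ maps each word to its column index; order the words by column.
vocab = sorted(count.vocabulary_, key=count.vocabulary_.get)
print(vocab)
print(count_matrix)
print(tfidf_matrix.round(2))
```

Both matrices have one row per genre string and one column per word, which is the same layout as the shape reported for the full dataset.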

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(lowercase=True, stop_words='english')
anime['genre'] = anime['genre'].fillna('')
tf_idf_matrix = tf_idf.fit_transform(anime['genre'])
tf_idf_matrix.shape

We can see that there are 46 different words from 12,294 anime.

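from sklearn.metrics.pairwise import linear_kernel
# TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity
cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
# Map each anime name to its row index, keeping the first entry for repeated names
indices = pd.Series(anime.index, index=anime['name'])
indices = indices[~indices.index.duplicated(keep='first')]
def recommendations(name, cosine_sim=cosine_sim):
    # Pair every anime's index with its similarity to the requested anime
    similarity_scores = list(enumerate(cosine_sim[indices[name]]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (expected to be the anime itself) and keep the next 20
    similarity_scores = similarity_scores[1:21]
    anime_indices = [i[0] for i in similarity_scores]
    return anime['name'].iloc[anime_indices]
recommendations('Kimi no Na wa.')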

Based on the TF-IDF calculation, these are the top 20 recommendations for anime similar to Kimi no Na wa.
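The recommendations function drops the first entry of the sorted list on the assumption that the most similar anime is the query itself. When several anime have exactly the same genres, that first entry may instead be a different title with the same score. A possible variation, sketched here in plain Python with an invented similarity row and a hypothetical helper name top_similar, removes the query by its index instead:

```python
def top_similar(sim_row, query_idx, k=20):
    """Return the indices of the k most similar items, excluding the query.

    sim_row: similarity scores between the query and every item
    query_idx: position of the query item in sim_row
    """
    scored = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]


# Invented scores: item 0 has the same genres as the query (item 2).
row = [1.0, 0.3, 1.0, 0.7, 0.0]
print(top_similar(row, query_idx=2, k=3))  # [0, 3, 1]
```

In this toy row, slicing off the first sorted entry would have removed item 0 and kept the query itself, whereas filtering by index keeps item 0 as the top suggestion.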

Next, we are going to look at another model, CountVectorizer, and compare the results of cosine_similarity and linear_kernel.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime['genre'])
# Cosine similarity on the raw counts
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)
# Plain dot product on the same counts (not length-normalized)
linear_sim2 = linear_kernel(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', linear_sim2)
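When comparing the two calls, it may help to remember that linear_kernel returns the plain dot product, while cosine_similarity divides the dot product by the lengths of the two vectors. The two agree only when the vectors already have unit length, which is the default for TfidfVectorizer rows but not for raw counts. A minimal plain-Python sketch with invented count vectors (the helper names dot and cosine are placeholders, not sklearn functions):

```python
import math


def dot(a, b):
    """Plain dot product, which is what a linear kernel computes."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Invented count vectors over a four-word vocabulary.
few_genres = [1, 1, 0, 0]
many_genres = [1, 1, 1, 1]

print(dot(few_genres, few_genres))                # 2
print(dot(few_genres, many_genres))               # 2
print(round(cosine(few_genres, few_genres), 3))   # 1.0
print(round(cosine(few_genres, many_genres), 3))  # 0.707
```

With the dot product, a title with many genres ties with an exact genre match, while cosine similarity ranks the exact match higher. This is one reason the two rankings on the count matrix can differ.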

Summary

In this article, we looked at the anime data and built two types of recommendation systems. The simple recommendation system lets us see the top chart of anime; we built it by applying the weighted rating calculation to the ratings and the number of members. We then built a recommendation system based on each anime’s genre, applying both TfidfVectorizer and CountVectorizer to see the differences in their recommendations.

I hope you enjoyed this article!

Meimi Li

A business intelligence graduate who is fascinated by machine learning and artificial intelligence.