Recommending Animes Using Nearest Neighbors
Recommender systems can be broadly divided into two categories: content-based and collaborative-filtering-based. Content-based recommender systems focus on the properties of the content to recommend items to the user.
In this case we focus on the property of “similarity”, with the assumption that if a user likes a certain item such as a movie, news article or product, they may like similar items in the future.
In entertainment/media products like anime or movies, the assumption holds to a certain extent. If I like to watch movies like Lord of the Rings, Pirates of the Caribbean and The Hobbit, it’s likely that I’ll like other fantasy/adventure movies like The Golden Compass or Stardust.
Anime is a specific style of animation originating in Japan. Per Wikipedia:
The word anime is the Japanese term for animation, which means all forms of animated media. Outside Japan, anime refers specifically to animation from Japan or as a Japanese-disseminated animation style often characterized by colorful graphics, vibrant characters and fantastical themes.
I’ve been a major anime fan all my life. I’ve seen nearly all the popular animes and I religiously follow the new animes in each season. Finding good anime suggestions is actually pretty hard because there are few websites for anime recommendation and rating. Of the popular ones, Myanimelist stands at the top, with a huge database of anime and a vibrant community who rate and review the animes with precise ratings.
Recently, Myanimelist launched a dataset on Kaggle and I ended up making a simple recommender system with the data. In this post I’ll go over the procedure of making such a recommender based on properties like anime genre, rating, number of members reviewing the anime and share some results.
This dataset contains user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating, and this dataset is a compilation of those ratings. There are two files: anime.csv with anime-related data and rating.csv with user preference data. I’ve used only the ‘content’-based features, so only the anime.csv file was used.
- anime_id — myanimelist.net’s unique id identifying an anime.
- name — full name of anime.
- genre — comma separated list of genres for this anime.
- type — movie, TV, OVA, etc.
- episodes — how many episodes in this show. (1 if movie).
- rating — average rating out of 10 for this anime.
- members — number of community members that are in this anime’s “group”.
K nearest Neighbor
To find similar items (animes), I’ve used K-nearest neighbors, a very simple machine learning algorithm. K-nearest neighbors finds the k most similar items to a particular instance based on a given distance metric such as Euclidean, Jaccard, Minkowski or a custom distance measure.
KNN is used for both classification and regression problems. In classification problems, to predict the label of an instance we first find the k closest instances to it under the distance metric, and then predict the label by majority voting or weighted majority voting (neighbors which are closer are weighted higher).
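To make the voting step concrete, here is a minimal sketch of KNN classification with plain (unweighted) majority voting and Euclidean distance. The function name `knn_predict` and the toy points are made up for illustration; this is not the code from the notebook.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): three points near the origin, two near (1, 1)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["a", "a", "a", "b", "b"]
prediction = knn_predict(points, labels, (0.15, 0.15), k=3)
print(prediction)
```

A weighted variant would add, for example, 1/distance to each label's tally instead of 1.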
In an unsupervised setting such as this one, we can simply find the neighbors and use them to recommend similar items. Roughly speaking, to suggest similar animes I first find the k most similar animes and recommend them to the user. In this case I’ve retrieved the top 5 most similar animes for a given query. For example, if I query “Naruto”, the recommender system returns the top 5 animes similar to Naruto.
I’ve used genre, type, episodes, rating and members as features and did not use the name feature by choice. I could have handled the text feature with tf-idf or other tactics like bag of words, but using the names would have made the recommendation ‘too easy’. It’s easy to show animes similar to Naruto if we show Naruto’s 2nd season and all the Naruto movies. I wanted to see how far a simple approach without text features would go.
Code Notebook and Kaggle results
I ended up getting a bronze medal for the kernel on my former account (which I’ve since deleted), but here goes the upvotes!
Notebook from GitHub Gist:
Load the dataset
As I’ve mentioned before, the anime name feature is dropped. The type and episodes features have many missing values. I’ll go feature by feature now to describe how they were handled.
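As a rough sketch of this step, the snippet below reads two made-up rows laid out like anime.csv, counts the missing values per column and drops the name column. The rows and the variable names are assumptions for illustration; with the real file, `pd.read_csv("anime.csv")` would take the place of the in-memory buffer.

```python
import io

import pandas as pd

# Two made-up rows in the anime.csv column layout (not real data)
sample_csv = io.StringIO(
    "anime_id,name,genre,type,episodes,rating,members\n"
    '1,Example A,"Action, Adventure",TV,Unknown,8.1,1000\n'
    "2,Example B,Comedy,,1,,50\n"
)
anime = pd.read_csv(sample_csv)

# Number of missing (NaN) entries in each column
missing_per_column = anime.isnull().sum()
print(missing_per_column)

# The name column is not used as a feature
features = anime.drop(columns=["name"])
```

Note that an episode count stored as the string "Unknown" is not counted as missing by `isnull()`, so it has to be handled separately.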
Many animes have an unknown number of episodes, even ones with otherwise similar ratings. On top of that, many super popular animes such as Naruto Shippuden and Attack on Titan Season 2 were ongoing when the data was collected, so their number of episodes was recorded as “Unknown”.
For some of my favorite animes I’ve filled in the episode numbers manually. For the other animes, I had to make some educated guesses. The changes I’ve made are:
- Animes that are grouped under the Hentai category generally have 1 episode, so I’ve filled the unknown values with 1.
- Animes that are grouped under “OVA”, which stands for “Original Video Animation”, are generally one or two episodes long (the popular ones often have 2 or 3 episodes, though), but I’ve decided to fill the unknown numbers of episodes with 1 again.
- Animes that are grouped under “Movie” are considered to have 1 episode, as per the dataset overview.
- For all the other animes with an unknown number of episodes, I’ve filled the missing values with the median, which is 2.
I’ve changed the type to categorical variables using pd.get_dummies, a pandas method for converting categorical features to dataframes with dummy/indicator variables.
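As a small illustration of what pd.get_dummies produces, here it is applied to a made-up type column:

```python
import pandas as pd

# Made-up values of the type column
types = pd.Series(["TV", "Movie", "OVA", "TV"], name="type")

# One indicator column per distinct value, in sorted order
type_dummies = pd.get_dummies(types)
print(type_dummies)
```

Each row has exactly one indicator set, in the column matching its original type.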
Rating, Members and Genre
For the members feature, I just converted the strings to float. Episode numbers, members and rating are numeric rather than categorical, and their value ranges are very different.
Rating ranges from 0–10 in the dataset, while the episode count can exceed 800 for long-running popular animes like One Piece or Naruto. This can bias the distance metric in KNN, because features with bigger numbers dominate the distance while the other features are effectively discounted.
So I ended up using MinMaxScaler from scikit-learn, as it scales the values to the range 0–1. Many animes have unknown ratings. These were filled with the median of the ratings.
Before the scaling I transformed the genre feature to categorical variables with pandas and concatenated the dataframe to the other features. The end result is the dataframe called anime_features, which has all the features.
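One way to build such a frame is sketched below on made-up rows. The use of `str.get_dummies` with a ", " separator is an assumption about how the comma-separated genre strings could be split; the notebook may do this differently.

```python
import pandas as pd

# Made-up rows with a comma-separated genre column
anime = pd.DataFrame({
    "genre": ["Action, Adventure", "Comedy", "Action, Comedy"],
    "type": ["TV", "Movie", "TV"],
    "rating": [7.8, 6.5, 8.1],
})

# One 0/1 column per genre; an anime can belong to several genres at once
genre_dummies = anime["genre"].str.get_dummies(sep=", ")

# Put genre indicators, type indicators and numeric columns side by side
anime_features = pd.concat(
    [genre_dummies, pd.get_dummies(anime["type"]), anime[["rating"]]],
    axis=1,
)
print(anime_features.columns.tolist())
```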
The columns of the anime_features dataframe are given below.
Fit KNN Model
The scaling function returns a NumPy array containing the features. Then we fit the KNN model from scikit-learn to the data and calculate the nearest neighbors, and the distances to them, for each anime. In this case I’ve used the unsupervised NearestNeighbors class for implementing neighbor searches. Note that I’ve used k=6 as a parameter because the first neighbor that KNN returns is always the query itself, since the distance of an instance to itself is 0, and we can’t use that.
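A minimal sketch of this step is shown below, with random numbers standing in for the scaled feature array. The other NearestNeighbors arguments are left at their defaults here, which may differ from what the notebook used.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-in for the scaled feature array (20 items, 4 features)
rng = np.random.default_rng(0)
features = rng.random((20, 4))

# k=6: the item itself plus its 5 nearest neighbors
nbrs = NearestNeighbors(n_neighbors=6).fit(features)
distances, indices = nbrs.kneighbors(features)

# Column 0 is each item itself (distance 0); columns 1-5 are the recommendations
print(indices[0])
print(distances[0])
```

The default metric of NearestNeighbors is Minkowski with p=2, which is the same as Euclidean distance.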
I wrote some helper functions to query and show the results.
get_index_from_name(name): Returns the index of the anime if given the full name.
get_index_from_partial_name(name): Returns the indices of all the animes that have that substring in their name. Many anime names have not been documented properly; in many cases the names are in Japanese instead of English, and the spelling is often different. So I created this one.
print_similar_animes(query,id): Prints the top 5 similar animes after querying. We can query by either name or id.
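The helpers could look roughly like the sketch below. The function names follow the descriptions above, but the bodies, the toy names, the one-dimensional features and the extra `similar_names` helper are assumptions for illustration; with only seven toy rows I use k=3 instead of k=6.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy names and one-dimensional features (made up for illustration)
anime = pd.DataFrame({"name": [
    "Naruto", "Naruto: Shippuuden", "Bleach", "Mushishi",
    "Gintama", "Noragami", "Noragami Aragoto",
]})
features = np.array([[0.0], [0.1], [0.25], [0.9], [0.6], [0.75], [0.8]])
distances, indices = NearestNeighbors(n_neighbors=3).fit(features).kneighbors(features)


def get_index_from_name(name):
    """Index of the anime whose name matches exactly."""
    return anime[anime["name"] == name].index.tolist()[0]


def get_index_from_partial_name(partial):
    """Indices of all animes whose name contains the given substring."""
    return anime[anime["name"].str.contains(partial, regex=False)].index.tolist()


def similar_names(idx):
    """Names of the neighbors of row idx, skipping the row itself."""
    return [anime.loc[i, "name"] for i in indices[idx][1:]]


def print_similar_animes(query=None, id=None):
    """Print the neighbors of an anime given either its name or its index."""
    idx = id if id is not None else get_index_from_name(query)
    for name in similar_names(idx):
        print(name)


print_similar_animes(query="Naruto")
print_similar_animes(id=5)
```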
Recommending second seasons and related animes:
As we’ll see below, the recommender performs surprisingly well. If we assume that an anime’s second season, related movies and other spin-offs should be very similar to the anime itself, the recommender finds them very well.
Naruto’s second season is Naruto Shippuden and Noragami’s 2nd season is Noragami Aragoto, and these are the first recommendations for the two animes. Mushishi and Gintama are also long-running animes with many related titles, which are all recommended. Shounen anime fans will know that the other recommendations for Naruto are also very similar, e.g. Katekyo Hitman Reborn, DBZ, Bleach and Boku No Hero Academia are all very popular action/shounen animes.
Anime Movie Recommendation
I wanted to check whether the recommendations for animes and anime movies are different or not. Naruto is a TV anime that also has several related anime movies. If I get movie recommendations for movies and anime recommendations for animes, I can conclude that the recommender is taking the type of the expected content into account properly. To my surprise, despite such a simple approach, the recommender does pick up the difference between the movies and the animes quite well.
First we check for the content that has “Naruto” in it to see the name of the movies.
Then we check the recommendations and, well, it works! Since the distance metric is Minkowski, the ‘type’ feature helps us differentiate between movie and anime types.
I’ve been thinking about implementing a collaborative filtering based recommender with the rest of the dataset soon.