Metis Project 2: Prediction Model on Anime Rating Score

Credits: Randyadr

As an anime lover, I am always on the lookout for new anime recommendations. So for my Metis Project 2, I decided to predict anime rating scores using the Top Anime ranking list on the MAL (MyAnimeList) website.

Methodology

Phase 1: Data Processing & Data Cleaning

For this project, I started off by web-scraping the top 1,000 ranked anime and their details using Beautiful Soup. However, while doing my EDA, I realized that the data I had scraped was biased, as it left out the anime with lower rating scores. Therefore, I went back to do more scraping. This time, I retrieved all the ranked anime that had a rating score, and I alternated the pages I scraped.

Moving on to data cleaning, I first dropped all the rows with missing values and then studied each feature column. I noticed that there were too many different types of sources and studios, so I kept the top few and reclassified the rest as "Others".
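To make the reclassification step concrete, here is a minimal sketch in plain Python. The helper name `lump_rare_categories`, the `top_n` cutoff, and the sample values are all made up for illustration; the actual notebook may well have done this with a library such as pandas.

```python
from collections import Counter


def lump_rare_categories(values, top_n=3, other_label="Others"):
    """Keep the top_n most frequent categories and relabel the rest."""
    counts = Counter(values)
    keep = {category for category, _ in counts.most_common(top_n)}
    return [v if v in keep else other_label for v in values]


# Hypothetical sample of a "source" column (values are illustrative only).
sources = ["Manga", "Original", "Manga", "Light novel", "Game",
           "Manga", "Original", "Light novel", "Visual novel", "Manga"]

cleaned = lump_rare_categories(sources, top_n=3)
print(Counter(cleaned))
```

The same idea applies to the studio column: pick a cutoff that keeps the categories with enough rows to be meaningful, and pool everything else.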

Phase 2: Feature Engineering & Selection

Now onto feature engineering. First, I studied the pair plots for each feature and, from there, selected 3 features for log transformation. Then I created a heatmap to check how strongly the features correlate with my target and with one another.
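As a rough illustration of why a log transformation can help, the sketch below compares the Pearson correlation of a right-skewed feature with the target before and after applying `log1p`. The feature name `members`, the numbers, and the hand-written `pearson` helper are assumptions made for the example, not values from the project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a heavily right-skewed count feature and a rating score.
members = [120, 950, 8_000, 65_000, 400_000, 2_100_000]
score = [5.9, 6.4, 7.0, 7.6, 8.1, 8.8]

# log1p(x) = log(1 + x), which stays defined when a count is zero.
log_members = [math.log1p(m) for m in members]

print("raw feature:", round(pearson(members, score), 3))
print("log feature:", round(pearson(log_members, score), 3))
```

On data shaped like this, the log-transformed feature lines up with the target much more closely than the raw counts, which is the kind of pattern a pair plot makes visible.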

Heatmap on Collinearity on Features

The heatmap above served as a guide for removing features that were highly collinear.

For all my categorical features, I created dummy variables, then ran an OLS summary to check the p-values. I dropped all the features with a p-value above 0.05, using that threshold as a reference for their level of significance to my target.
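For readers new to dummy variables, here is a small dependency-free sketch of what the encoding produces. The function `one_hot` and its `drop_first` option are hypothetical names written for this example; in practice a helper such as `pandas.get_dummies` would typically do this job.

```python
def one_hot(values, prefix, drop_first=True):
    """Return one dict of 0/1 indicator columns per input value."""
    categories = sorted(set(values))
    if drop_first:
        # Dropping one level avoids perfect collinearity with the intercept.
        categories = categories[1:]
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


# Illustrative values only.
sources = ["Manga", "Original", "Others", "Manga"]
rows = one_hot(sources, prefix="source")
for row in rows:
    print(row)
```

With `drop_first=True`, the dropped category ("Manga" here, the first in sorted order) becomes the baseline: a row with all zeros means the anime belongs to that baseline category.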

Finally, this is the OLS Regression Result on my Dataset.

OLS Regression Result

Using RidgeCV, I went on to explore the importance of the features selected for my model.

Phase 3: Model Selection (Train & Evaluate)

For modeling, I tried Polynomial, Lasso, Ridge, and ElasticNet regression to see which one would return the best R² score and the lowest MAE and RMSE. And guess who the winner is? Polynomial. Honestly speaking, I am not a fan of it, but since that was the result, let us accept it :)
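For reference, the three metrics used to compare the models can be written out by hand. This is a minimal sketch with made-up scores and a hypothetical helper name, `regression_metrics`; a library such as scikit-learn provides equivalent functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}


# Illustrative rating scores only.
y_true = [6.0, 7.0, 8.0, 9.0]
y_pred = [6.5, 7.0, 7.5, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

A higher R² and lower MAE and RMSE indicate a better fit. RMSE penalizes large misses more heavily than MAE, so looking at both helps show whether a model's errors are spread evenly or concentrated in a few bad predictions.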

I went on to check the polynomial model’s performance, and below is the result.

Not that bad, is it?

Conclusion

From the feature importance graph earlier, I concluded that good predictors of anime rating scores involve the following aspects:

  1. Whether the anime is produced by Production I.G
  2. Whether the anime is added as a Favorite on the MAL website
  3. Whether the anime's storyline originates from a manga

Insights

As you may have gathered by now, I also feel that my model for predicting anime rating scores is not the best. This is supported by its mediocre performance and by the check I did on Linear Regression Assumption 1, which states that the “regression is linear in parameters and correctly specified”. However, the graphs I derived for my model indicated otherwise.

Check on LR Assumption
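To show what a failed linearity check can look like, here is a minimal sketch on made-up data: it fits a straight line to a curved relationship and inspects the residuals. The helper `fit_line` and the numbers are illustrative assumptions, not the code behind the plots in this post.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data with a curved (quadratic) relationship.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(x, round(r, 2))
```

If the linear specification were correct, the residuals would scatter around zero with no visible shape. Here they are positive at both ends and negative in the middle, a U-shaped pattern that signals the straight-line model is missing curvature in the data.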

My insight was further confirmed when I checked back on the MAL website and realized that the anime scores (my target) are calculated using a non-linear formula. Therefore, my next step will be to explore non-linear regression models to improve on this, but… it has to wait until I have acquired the necessary knowledge and skills :)
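For context, MAL's help pages describe the displayed score as a weighted average of roughly the form (v / (v + m)) × S + (m / (v + m)) × C, where S is the anime's average user score, v is the number of users who scored it, m is a minimum vote threshold, and C is the mean score across the whole database. The sketch below assumes that form; the function name and the values used for `min_votes` and `global_mean` are placeholders, not figures taken from MAL.

```python
def weighted_score(avg_score, num_votes, min_votes, global_mean):
    """Weighted average that shrinks avg_score toward global_mean."""
    v, m = num_votes, min_votes
    return (v / (v + m)) * avg_score + (m / (v + m)) * global_mean


# Placeholder numbers only: min_votes and global_mean are assumptions.
for votes in (10, 100, 1_000, 100_000):
    value = weighted_score(9.0, votes, min_votes=50, global_mean=6.5)
    print(votes, round(value, 3))
```

Under this form, the score is not a linear function of the vote count: with few votes it sits close to the global mean, and it approaches the anime's own average only as votes accumulate. That is one plausible reason a purely linear model on count-based features would struggle.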

Here is the link to my GitHub, which has my notebooks with code and descriptions, as well as my presentation slides!