Anime ratings and Gradient Descent

Tyree Stevenson
8 min read · Apr 17, 2018


What the heck is Gradient Descent?

Gradient descent is an optimization technique used to find the minimum value of a function. So what does that mean in English? Imagine you are sliding down a bowl-shaped water slide: you will keep sliding until you reach the bottom of the slide. That is exactly what Gradient Descent is doing for a function! Essentially, you are a point on the slide, the slide is a function, your sliding motion is gradient descent, and the minimum point of the function is the bottom of the slide. In conclusion, Gradient Descent allows you to find the minimum value of a function, aka the bottom of the water slide.

Why is this important?

Machine Learning and Deep Learning! Machine Learning and Deep Learning are essentially one big optimization task. You give the computer some data and expect it to make the most accurate prediction, classification, categorization, etc. So how do you ensure your predictions are accurate? You do this by minimizing the error in your predictions. This error is known as the loss: the difference between your prediction and the actual value. In machine learning, you will use loss functions to assess accuracy and fine-tune your model. What is the best way to minimize such a function? Gradient Descent, of course.

Where’s the Anime?

In this blog post, we will use the number of members in an Anime’s fan club to predict that Anime’s rating. The predictive model we will use is Linear Regression.

Linear Regression Explained!

Linear Regression is a statistical model that allows you to predict continuous values: you use some independent variable x to predict some dependent variable y. A real-world example is predicting ice cream sales based on the temperature outside.

Your x value is the temperature and your y value is ice cream sales. As the temperature increases, ice cream sales increase, so your x and y are positively correlated.
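To make this concrete, here is a tiny sketch of such a model. The base of 10 sales and the slope of 3 extra sales per degree are made-up numbers, purely for illustration:

def predict_sales(temperature):
    # Hypothetical linear model: a base of 10 sales plus 3 sales per degree
    return 10 + 3 * temperature

print(predict_sales(30))  # 100 sales on a 30-degree day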

The Magic Line of Predictions aka The Line of Best Fit

The magic line of predictions, formally known as the Line of Best Fit, is used to predict values in your dataset. The Line of Best Fit is the bread and butter of Linear Regression; you can use this line to predict ice cream sales based on the temperature. We can calculate a prediction with this formula:

h(x) = θ₀ + θ₁ · x

h(x) is your prediction and x is your input. Using these points (x, h(x)) we can generate the Line of Best Fit. Why doesn’t this line go through every point? Because we can never have 100% accuracy in our predictions, so we strive to get close enough. The space between our points and the Line of Best Fit is known as the error.

Minimizing Error

Once again, error, or loss, is the difference between our prediction and the actual value. We can calculate this using our cost function:

J(θ₀, θ₁) = (1/2m) · Σ (h(xᵢ) − yᵢ)²

where m is the number of data points and the summation runs over all of them. If this is confusing google “Summation Notation”

The cost function tells you the cost of every prediction you make, and the cost of a prediction is its loss. So how will you ever minimize this function? With Gradient Descent of course! So what exactly are we adjusting? Theta! Gradient Descent searches for the theta values that make the cost as small as possible.

Everything is moving along in one big circle!

Calculating Gradient Descent

Let's Calculate Gradient Descent! On every iteration, both thetas are updated simultaneously:

θ₀ := θ₀ − α · (1/m) · Σ (h(xᵢ) − yᵢ)
θ₁ := θ₁ − α · (1/m) · Σ (h(xᵢ) − yᵢ) · xᵢ

If this is confusing google “Summation Notation”

Remember our water slide example: for every iteration, or step we move down our water slide, theta is updated. This goes on until we reach the bottom.

Learning Rate

The α in the formula is known as alpha; alpha is the learning rate of your formula. You can think of the learning rate as the speed at which you are going down the water slide. If you are going too fast, then you will fly right over the bottom of the water slide. If you are going too slow, you may never reach the bottom. The same goes for a function: if the learning rate is too high, you’ll overshoot the minimum, and if the learning rate is too low, you will never reach the minimum.
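To see this behavior numerically, here is a minimal sketch, separate from our Anime model, that minimizes f(x) = x² (whose gradient is 2x) with a sensible and an oversized learning rate:

def minimize(alpha, steps=10):
    x = 1.0  # Our starting position on the "slide"
    for _ in range(steps):
        x = x - alpha * 2 * x  # The gradient of x**2 is 2x
    return x

print(minimize(0.1))  # Slides toward the minimum at 0
print(minimize(1.5))  # Overshoots the bottom and diverges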

Can we talk about Anime again?

Yes! Now that you are a black belt in the art of Gradient Descent, you can implement it in code. The problem we will be solving is determining the rating of an Anime based on the size of its fan club.

Data Preparation

Let’s use the Anime Recommendation Database from Kaggle. Within this dataset, you are going to use the anime.csv file. If you have the Kaggle API installed on your machine, you can download the dataset from your terminal with this command:

kaggle datasets download -d CooperUnion/anime-recommendations-database

Now you can set up your project! The libraries you will be using are math, matplotlib, numpy, and pandas.

import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Next, we will set up a data frame. A data frame is an object, similar to a CSV file, that allows you to easily access your data in a structured manner. Think of each row as its own entity, with the columns being the row’s features. Pandas is an extremely popular data analysis library in Python, and it lets you easily create data frames.

data = pd.read_csv('anime.csv')

A common coding convention is to import pandas under the alias pd. You can find the correlation between fan club size and rating by running

print(data.corr())

The results were

          anime_id    rating   members
anime_id  1.000000 -0.284625 -0.080071
rating   -0.284625  1.000000  0.387979
members  -0.080071  0.387979  1.000000

members and rating have a positive correlation of about 0.39.
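If you only need that single pair rather than the whole matrix, pandas can compute it directly:

print(data['members'].corr(data['rating']))  # Roughly 0.39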

Now let’s graph your data. To avoid any issues with missing values, let’s check for nulls first:

print(data['members'].isnull().values.any()) # Prints False
print(data['rating'].isnull().values.any()) # Prints True

Since there are null values in the rating column, you must clean your data. For now, let’s just remove any rows without a rating.

members = []  # Corresponding fan club size for each row
ratings = []  # Corresponding rating for each row
for row in data.iterrows():
    if not math.isnan(row[1]['rating']):  # Checks for null ratings
        members.append(row[1]['members'])
        ratings.append(row[1]['rating'])
members = np.asarray(members)
ratings = np.asarray(ratings)

Now every element in the members array has a corresponding element in the ratings array; you can think of them as ordered pairs. Also, by making them numpy arrays, they gain some extra features, like the ability to multiply every element in the array by a number at once.
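As a side note, pandas can do the same cleanup in one step; this sketch using dropna should produce equivalent arrays:

clean = data.dropna(subset=['rating'])  # Drop rows whose rating is null
members = clean['members'].values  # numpy array of fan club sizes
ratings = clean['rating'].values   # numpy array of ratings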

Initial Thetas

You can set your initial theta values

theta0 = 0.00001 # Random guess
theta1 = 0.00001 # Random guess
error = 0 # Overall error of our model.

Since you are dealing with very large member counts, choosing small initial thetas keeps the line easy to graph. Also, the error variable will be used to track the overall error of our model.

Graphing Time

You can finally graph your points!

plt.scatter(members, ratings)
plt.show()

Let’s Generate your line of best fit

You can generate the line of best fit by implementing your hypothesis function.

def hypothesis(x, theta0, theta1):
    return theta0 + theta1 * x
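Since members is a numpy array, the same function works on a single value or on the entire dataset at once:

print(hypothesis(100000, theta0, theta1))       # Prediction for one fan club size
print(hypothesis(members, theta0, theta1)[:5])  # Predictions for the first five shows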

Now you can draw the line of best fit through your graph.

fig = plt.figure(dpi=100, figsize=(5, 4))
plt.scatter(members, ratings)
line, = plt.plot(members, hypothesis(members, theta0, theta1))
plt.xlabel('Members')
plt.ylabel('Ratings')
plt.xlim((0, max(members)))
plt.ylim((0, max(ratings)))
plt.show()

Optimization Time

Our final goal is to find the best fitting line. As explained earlier this can be done through Gradient Descent. Let’s implement our cost function first, this function will be used to track our model’s error.

def costFunction(x, y, theta0, theta1, m):
    loss = 0
    for i in range(m):  # Represents the summation
        loss += (hypothesis(x[i], theta0, theta1) - y[i])**2
    loss *= 1 / (2 * m)  # Represents 1/(2m)
    return loss
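Because x and y are numpy arrays, the same cost can also be computed without an explicit loop; this vectorized sketch should return the same value as costFunction:

def costFunctionVectorized(x, y, theta0, theta1, m):
    # Squared residuals over the whole dataset, scaled by 1/(2m)
    return np.sum((theta0 + theta1 * x - y)**2) / (2 * m)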

Finally, we can implement Gradient Descent!

def gradientDescent(x, y, theta0, theta1, alpha, m, iterations=1500):
    for i in range(iterations):
        gradient0 = 0
        gradient1 = 0
        for j in range(m):  # Represents the summation
            gradient0 += hypothesis(x[j], theta0, theta1) - y[j]
            gradient1 += (hypothesis(x[j], theta0, theta1) - y[j]) * x[j]
        gradient0 *= 1/m
        gradient1 *= 1/m
        temp0 = theta0 - alpha * gradient0  # Update both thetas simultaneously
        temp1 = theta1 - alpha * gradient1
        theta0 = temp0
        theta1 = temp1
        error = costFunction(x, y, theta0, theta1, len(y))
        print("Error is:", error)
    return theta0, theta1
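For reference, the inner loop over j can also be vectorized with numpy; this sketch should behave the same as gradientDescent, only faster:

def gradientDescentVectorized(x, y, theta0, theta1, alpha, iterations=1500):
    m = len(y)
    for _ in range(iterations):
        residuals = theta0 + theta1 * x - y  # h(x) - y for every point at once
        theta0 = theta0 - alpha * residuals.sum() / m
        theta1 = theta1 - alpha * (residuals * x).sum() / m
    return theta0, theta1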

Time to run this

Choosing your alpha, aka learning rate, will be the toughest part of this code. Since this dataset deals with extremely large numbers, it would be wise to choose an extremely low alpha.

alpha = 0.000000000001 # Learning Rate
m = len(ratings) # Size of the dataset
print('Our final theta\'s', gradientDescent(members, ratings, theta0, theta1, alpha, m))
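Once training finishes, you can redraw the scatter plot with the returned thetas to see where the line ends up; a quick sketch reusing the earlier plotting code:

theta0, theta1 = gradientDescent(members, ratings, theta0, theta1, alpha, m)
plt.scatter(members, ratings)
plt.plot(members, hypothesis(members, theta0, theta1))
plt.xlabel('Members')
plt.ylabel('Ratings')
plt.show()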

Results

Our Gradient Descent function outputs

Error is: 19.0094143945
Error is: 19.0064845906
Error is: 19.0035746131
Error is: 19.0006843276
Error is: 18.9978136011
Error is: 18.9949623011
Error is: 18.9921302961
Error is: 18.9893174557
Error is: 18.98652365
Error is: 18.9837487504
Error is: 18.9809926288
Our final theta's (1.0008743726505416e-05, 4.1218948214535892e-05)
The movement of the line is harder to see due to the scale of the graph

Great! We have now fitted our model and driven the cost down to roughly 19.

Full Code

The full code is available on GitHub.

In conclusion

Thanks to Gradient Descent, you are now able to find the optimal line of best fit, allowing your model to produce even more accurate predictions!
