Anime ratings and Gradient Descent
What the heck is Gradient Descent?
Gradient descent is an optimization technique used to find the minimum value of a function. So what does that mean in English? Imagine you are sliding down a bowl-shaped water slide:
you will continue to slide down until you reach the bottom of the slide. That is exactly what Gradient Descent is doing for a function! Essentially you are a point on the slide, the slide is a function, your sliding motion is gradient descent, and the minimum point of the function is the bottom of the slide. In conclusion, Gradient Descent allows you to find the minimum value of a function, aka the bottom of the water slide.
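The water-slide picture can be sketched in a few lines of Python. This toy example (the function f(x) = (x - 3)**2 and its starting point are my own illustration, not part of the anime model) slides a point downhill until it settles at the minimum:

```python
# Toy sketch of gradient descent on f(x) = (x - 3)**2,
# whose minimum (the "bottom of the slide") is at x = 3.
def f_prime(x):
    return 2 * (x - 3)  # derivative of f(x) = (x - 3)**2

x = 0.0      # starting point on the slide
alpha = 0.1  # step size (learning rate)
for _ in range(100):
    x -= alpha * f_prime(x)  # slide a little further downhill

print(round(x, 4))  # ends up very close to 3.0
```

Each step moves x a little in the direction that lowers f(x), which is the whole idea behind the algorithm.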
Why is this important?
Machine Learning and Deep Learning! Machine learning and deep learning are essentially one big optimization task. You give the computer some data and expect it to make the most accurate prediction, classification, categorization, etc. So how do you ensure your predictions are accurate? You do this by minimizing the error in your prediction, which is known as the loss. The loss is the difference between your prediction and the actual value. In machine learning, you will use loss functions to assess accuracy and fine-tune your model. What is the best way to minimize such a function? Gradient Descent, of course.
Where’s the Anime?
In this blog post, we will be using the number of members in an Anime’s fan club to predict the rating of that Anime. The predictive model we will be using is Linear Regression.
Linear Regression Explained!
Linear Regression is a statistical model that allows you to predict continuous values: you use some independent variable x to predict some dependent variable y. A real-world example is predicting ice cream sales based on the temperature outside.
Your x value is the temperature and your y value is ice cream sales. As the temperature increases, ice cream sales increase, so your x and y are positively correlated.
The Magic Line of Predictions aka The Line of Best Fit
The magic line of predictions, formally known as the Line of Best Fit, is used to predict values in your dataset. The Line of Best Fit is the bread and butter of Linear Regression: you can use this line to predict ice cream sales based on the temperature. We can calculate the prediction with this formula, the hypothesis function:

h(x) = theta0 + theta1 * x

h(x) is your prediction and x is your input. Using these points (x, h(x)) we can generate the Line of Best Fit. Why does this line not go through every point? Because we can never have 100% accuracy in our predictions, so we strive to get close enough. The gap between our points and the Line of Best Fit is known as the error.
Minimizing Error
Once again, error, or loss, is the difference between our predicted value and the actual value. We can calculate this using our cost function:

J(theta0, theta1) = 1/(2m) * sum of (h(x_i) - y_i)^2 over all m data points

The cost function tells you the cost of every prediction you make, and the cost of a prediction is its loss. So how will you ever minimize this function? With Gradient Descent of course! So what exactly are we minimizing over? Theta! By adjusting your theta values, you lower the cost.
Everything is moving along in one big circle!
Calculating Gradient Descent
Let's Calculate Gradient Descent!
Remember our water slide example: for every iteration, or step down the water slide, theta is updated with this rule:

theta_j = theta_j - alpha * (partial derivative of J with respect to theta_j)

This goes on until we reach the bottom.
Learning Rate
The squiggly symbol in the formula is known as alpha; alpha is the learning rate of your formula. You can think of the learning rate as the speed at which you are going down the water slide. If you are going too fast, then you will fly right over the bottom of the water slide. If you are going too slow, you may never reach the bottom of the water slide. The same goes for a function: if the learning rate is too high, you'll overshoot the minimum, and if the learning rate is too low, you may never reach the minimum.
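You can see all three behaviors with a tiny sketch (a toy function, f(x) = x**2, not the anime model): a small alpha crawls, a moderate alpha converges, and a large alpha overshoots and blows up.

```python
# Minimizing the toy function f(x) = x**2 (minimum at x = 0)
# with three different learning rates.
def step(x, alpha):
    return x - alpha * 2 * x  # gradient of f(x) = x**2 is 2x

results = {}
for alpha in (0.01, 0.1, 1.1):
    x = 10.0
    for _ in range(50):
        x = step(x, alpha)
    results[alpha] = x

print(results)
# alpha=0.01 barely moves toward 0, alpha=0.1 lands very close to 0,
# and alpha=1.1 overshoots back and forth until the value explodes.
```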
Can we talk about Anime again?
Yes! Now that you are a black belt in the art of Gradient Descent you can implement it in code. The problem we will be solving is determining the rating of an Anime based on the size of its fan club.
Data Preparation
Let’s use the Anime Recommendation Database. Within this directory, you are going to use the anime.csv file. You can get this dataset by clicking here. If you have the Kaggle API installed on your machine you can download the dataset in your terminal with this command:
kaggle datasets download -d CooperUnion/anime-recommendations-database
Now you can set up your project! The libraries you will be using are math, matplotlib, numpy, and pandas.
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Next, we will set up a data frame. A data frame is an object, similar to a CSV file, that allows you to easily access your data in a structured manner. Think of each row as being its own entity, with the columns being that row's features. pandas is an extremely popular data analysis library in Python, and it lets you easily create data frames.
data = pd.read_csv('anime.csv')
A common coding convention is to import pandas as pd. You can find the correlation between fan club size and rating by running
print(data.corr())
The results were
          anime_id    rating   members
anime_id  1.000000 -0.284625 -0.080071
rating   -0.284625  1.000000  0.387979
members  -0.080071  0.387979  1.000000
members and rating have a 39% positive correlation.
Now let's graph your data. To avoid any issues with missing values, let's check for nulls first:
print(data['members'].isnull().values.any()) # Prints False
print(data['rating'].isnull().values.any()) # Prints True
Since there are null values in the rating column, you must clean your data. For now, let's just remove any rows without a rating.
members = []  # Corresponding fan club size for each row
ratings = []  # Corresponding rating for each row

for row in data.iterrows():
    if not math.isnan(row[1]['rating']):  # Checks for null ratings
        members.append(row[1]['members'])
        ratings.append(row[1]['rating'])

members = np.asarray(members)
ratings = np.asarray(ratings)
Now every element in the members array has a corresponding element in the ratings array; you can think of them as ordered pairs. Also, by making them numpy arrays they gain some extra features, like being able to multiply every element in the array by a number at once.
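That elementwise behavior is easy to see in isolation (the values below are hypothetical fan club sizes, not from the dataset):

```python
import numpy as np

# A plain Python list can't be scaled elementwise, but a numpy array can.
members = np.asarray([100, 200, 300])  # hypothetical fan club sizes
print(members * 2)                     # doubles every element: [200 400 600]
```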
Initial Thetas
You can set your initial theta values
theta0 = 0.00001 # Random guess
theta1 = 0.00001 # Random guess
error = 0 # Overall error of our model.
Since you are dealing with very large numbers, choosing small initial thetas makes the starting line easier to graph. The error variable will be used to track the overall error of our model.
Graphing Time
You can finally graph your points!
plt.scatter(members, ratings)
plt.show()
Let’s Generate Your Line of Best Fit
You can generate the line of best fit by implementing your hypothesis function.
def hypothesis(x, theta0, theta1):
return theta0 + theta1 * x
Now you can draw the line of best fit through your graph.
fig = plt.figure(dpi=100, figsize=(5, 4))
plt.scatter(members, ratings)
line, = plt.plot(members, hypothesis(members, theta0, theta1))
plt.xlabel('Members')
plt.ylabel('Ratings')
plt.xlim((0, max(members)))
plt.ylim((0, max(ratings)))
plt.show()
Optimization Time
Our final goal is to find the best fitting line. As explained earlier this can be done through Gradient Descent. Let’s implement our cost function first, this function will be used to track our model’s error.
def costFunction(x, y, theta0, theta1, m):
loss = 0
for i in range(m): # Represents summation
loss += (hypothesis(x[i], theta0, theta1) - y[i])**2
loss *= 1 / (2 * m) # Represents 1/2m
return loss
Finally, we can implement Gradient Descent!
def gradientDescent(x, y, theta0, theta1, alpha, m, iterations=1500):
for i in range(iterations):
gradient0 = 0
gradient1 = 0
for j in range(m): # Represents summation
gradient0 += hypothesis(x[j], theta0, theta1) - y[j]
gradient1 += (hypothesis(x[j], theta0, theta1) - y[j]) * x[j]
gradient0 *= 1/m
gradient1 *= 1/m
temp0 = theta0 - alpha * gradient0
temp1 = theta1 - alpha * gradient1
theta0 = temp0
theta1 = temp1
error = costFunction(x, y, theta0, theta1, len(y))
print("Error is:", error)
return theta0, theta1
Time to run this
Choosing your alpha, aka learning rate, will be the toughest part of this code. However, since this dataset deals with extremely large numbers, it would be wise to choose an extremely low alpha.
alpha = 0.000000000001 # Learning Rate
m = len(ratings) # Size of the dataset
print('Our final theta\'s', gradientDescent(members, ratings, theta0, theta1, alpha, m))
Results
Our Gradient Descent function outputs
Error is: 19.0094143945
Error is: 19.0064845906
Error is: 19.0035746131
Error is: 19.0006843276
Error is: 18.9978136011
Error is: 18.9949623011
Error is: 18.9921302961
Error is: 18.9893174557
Error is: 18.98652365
Error is: 18.9837487504
Error is: 18.9809926288
Our final theta's (1.0008743726505416e-05, 4.1218948214535892e-05)
Great! We have now fitted our model, and the final cost has dropped to roughly 19.
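With the fitted thetas in hand, making a prediction is just one evaluation of the hypothesis. Here is a sketch using the final thetas printed above and a hypothetical fan club size of 200,000 (not a value from the dataset):

```python
# Plug the fitted thetas into the hypothesis h(x) = theta0 + theta1 * x
# to predict a rating from a fan club size.
theta0 = 1.0008743726505416e-05
theta1 = 4.1218948214535892e-05
members = 200000  # hypothetical fan club size
predicted_rating = theta0 + theta1 * members
print(round(predicted_rating, 2))  # roughly 8.24
```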
Full Code
Click here for github link
In conclusion
Thanks to Gradient Descent you are now able to find the most optimal line of best fit, allowing your model to produce even more accurate predictions!