Pokemon Stats and Gradient Descent For Multiple Variables

Tyree Stevenson
Apr 23, 2018 · 9 min read


Is Gradient Descent Scalable?

Yes! Gradient Descent can be used to minimize multiple parameters of a function. If you are not familiar with Gradient Descent, click here. In this post, you will use Python to implement Multivariable Gradient Descent for a Multivariate Linear Model, all from scratch! We will use this model to predict a Pokemon’s catch rate given its Total and Special Attack stats.

What is a Multivariate Linear Model?

Let’s talk about Multivariate Linear Regression. If you are not familiar with Linear Regression, click here. Recall that Linear Regression allows you to predict dependent y values using independent x values. For example, you can use the temperature outside to predict ice cream sales.

Multivariate Linear Regression, however, allows you to use multiple x’s (features) to predict a y value. A great example of this would be using temperature, distance from the ice cream shop, and distance from smoothie shops to predict ice cream sales.

Multiple features? How do I graph this?

Great question! Once again, features is Machine Learning terminology for variables. So how can you graph this model? Well, each feature of our dataset represents a dimension, including the y variable! You are using the x features to predict the feature y.

3D model

This model contains 3 features: Radio, Sales, and TV. Each feature corresponds to a dimension, so this model has 3 dimensions.

Multivariate Linear Regression as a function

Recall the original hypothesis function for Linear Regression, which uses a single feature x:
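h_θ(x) = θ₀ + θ₁x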

So we can easily assume the function for Multivariate Linear Regression is:
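h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ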

However, the most efficient representation is:
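h_θ(x) = θᵀx

Here θ = [θ₀, θ₁, …, θₙ] and x = [x₀, x₁, …, xₙ] are vectors, with x₀ = 1 by convention so that θ₀ survives the multiplication.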

What are these hieroglyphics on the screen?

Vectors and Matrices Explained

A matrix is a rectangular array that holds values, and we describe the dimensions of a matrix as Rows × Columns.

For example, the matrix shown below is a 2 × 4 matrix; the specific values are just for illustration. For the sake of this post, you can define a vector as an Amount of Rows × 1 matrix (a single column).
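[ 1  2  3  4 ]
[ 5  6  7  8 ]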

What are Feature Matrices?

Great question! In Machine Learning we can represent a group of data in a matrix. Each row of the matrix is an individual data point, and the columns of that row are the data point’s features. For example, consider this matrix of Pokemon data:

       Name       HP       Attack   Defense
1      Bulbasaur  45       49       49
2      Ivysaur    62       63       80
3      Venusaur   82       83       90

Our matrix has 3 rows (aka 3 data points). The features of those data points are Name, HP, Attack, and Defense. What are the numbers on the side? You can think of those numbers as unique indexes that allow us to distinguish between data points. We can call the matrix above pokeMatrix, and now we can utilize our indexes. If you were to type

pokeMatrix[1]

you would get back that data point

['Bulbasaur', 45, 49, 49]

Let’s Dig Deeper into the columns of our pokeMatrix

       Name       HP       Attack   Defense
1      Bulbasaur  45       49       49
2      Ivysaur    62       63       80
3      Venusaur   82       83       90

Recall that we defined a vector as an Amount of Rows × 1 matrix. Let’s visualize this:

       Vector1    Vector2  Vector3  Vector4
       Name       HP       Attack   Defense
1      Bulbasaur  45       49       49
2      Ivysaur    62       63       80
3      Venusaur   82       83       90

Before scrolling down, write the vector for the HP feature on a sheet of paper.

HP Vector
45
62
82

Great! Each value is on a separate line because each belongs to its own row.

Data Point vs Feature

What is the difference between indexing the pokeMatrix

pokeMatrix[1]

and grabbing the HP feature?

HP Vector
45
62
82

When you index the pokeMatrix, you are grabbing a specific data point. However, when you grab the HP feature you get every HP value in the matrix, which results in a vector containing those values. You can think of a matrix as a collection of vectors, so you are essentially grabbing one of those vectors within the pokeMatrix.
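To make this distinction concrete, here is a small NumPy sketch. Note two assumptions: NumPy indexes from 0 rather than 1, and I have dropped the Name column since a NumPy array holds one data type.

import numpy as np

# The numeric columns of the pokeMatrix: HP, Attack, Defense
poke_matrix = np.array([[45, 49, 49],
                        [62, 63, 80],
                        [82, 83, 90]])

print(poke_matrix[0])     # one data point (a row):    [45 49 49]
print(poke_matrix[:, 0])  # the HP feature (a column): [45 62 82]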

Feature Matrix Jargon

m = number of rows in your matrix aka number of data points

Data points are also known as training examples. Our pokeMatrix has three rows, thus it has three data points, which means it has 3 training examples.

n = number of columns, aka the number of features

Hypothesis Function

Recall our function for Linear Regression:
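h_θ(x) = θ₀ + θ₁x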

Let’s dive deeper into the value of x. The x in our equation is an independent variable, or independent feature; we are using this independent feature to predict the dependent feature y. Recall our pokeMatrix:

       X1         X2       X3       X4
       Name       HP       Attack   Defense
1      Bulbasaur  45       49       49
2      Ivysaur    62       63       80
3      Venusaur   82       83       90

Let’s say a Pokemon’s HP feature depends on its Defense feature. Then we could use Vector4 to predict Vector2. Now let’s imagine that a Pokemon’s HP depends on its Attack, Defense, and some number of other features. Instead of writing this pesky function
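h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ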

we can now put all of our features into a vector:
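x = [x₀, x₁, x₂, …, xₙ]   (with x₀ = 1)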

And just as we can place our features in a vector, we can do the same for our theta variables:
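θ = [θ₀, θ₁, θ₂, …, θₙ]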

Vectors within Vectors?

Let’s reimagine our x vector from above: each entry within this x vector is a data point, and each data point will be a 1 × n vector, where n is the number of features. So let’s make an x vector containing the HP, Attack, and Defense for each of our data points within the pokeMatrix.

[
  [45, 49, 49],  # row 1 of the pokeMatrix
  [62, 63, 80],  # row 2 of the pokeMatrix
  [82, 83, 90],  # row 3 of the pokeMatrix
]

Does this look familiar? Yes, our vectors within vectors have transformed into a feature matrix.

What is this Theta times X you speak of?

In this section, I have to recommend Basic Linear Algebra for Deep Learning by Niklas Donges. To complete these mathematical operations you will need to understand vector/matrix multiplication, vector/matrix transposes, and vector/matrix addition and subtraction. Unfortunately, those topics are dense enough that I would need an entire Medium post just to explain them. If you are not familiar with those operations, please read the suggested blog and re-join us! Due to time constraints, I must continue on to Gradient Descent.
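If you only need the one operation this post leans on, here is a minimal NumPy sketch of computing θᵀx; the numbers are made up for illustration:

import numpy as np

theta = np.array([0.4, 0.2])   # one theta per feature
x = np.array([318, 65])        # a Pokemon's Total and Special Attack
prediction = np.dot(theta, x)  # theta transpose times x = 140.2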

Gradient Descent for Multiple Variables

Great, now that we understand matrix/vector operations, let’s talk about Gradient Descent! Recall that Gradient Descent is an optimization technique that allows you to find the minimum value of a function. In our case, we are looking for the theta values that give our Multivariate Linear Model the smallest loss. Remember that loss is the difference between our predicted value and the actual value. So if we predicted a Pokemon’s attack power was 76 when in actuality the attack power was 81, then we have a loss of 5.

We can use the function below to calculate Gradient Descent for multiple thetas:
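θ₀ := θ₀ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₀⁽ⁱ⁾
θ₁ := θ₁ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₁⁽ⁱ⁾
θ₂ := θ₂ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₂⁽ⁱ⁾
… and so on, one hardcoded line per theta.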

However, this is not scalable at all and would require us to hardcode each theta. Instead we use the implementation below, which is scalable and easier to code:
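θⱼ := θⱼ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾   (for every j, updated simultaneously)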

By the power of vector operations, we can compute each theta simultaneously.

Your Thetas are your model

We are in search of the perfect thetas! Our goal in Gradient Descent is to find the optimal theta values so that for any given x we can accurately predict y. This is a point I see glossed over in many tutorials! I want to stress that we are training our thetas, and with these trained thetas we should be able to predict the y for an x value we have never seen before.

Let’s Talk Pokemon

Now that you understand the math, we can talk about the more important things in life, such as Pokemon! In this post we will use the Special Attack and Total stats to predict a Pokemon’s Catch Rate. The dataset we are using is pokemon_alopez247.csv; you can get it by clicking here. If you have the Kaggle API installed on your machine, you can download the dataset in your terminal with this command:

kaggle datasets download -d alopez247/pokemon

Let’s Graph the Data

Since the data is already clean, we can plot it right away. First, import these libraries:

import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import pandas as pd

Let’s convert our csv file into a DataFrame (our feature matrix):

data = pd.read_csv('pokemon_alopez247.csv') 

We can grab our desired features from the data variable:

total = np.asarray(data['Total'])
special_attack = np.asarray(data['Sp_Atk'])
catch_rate = np.asarray(data['Catch_Rate'])

# Gets our features
temp = np.asarray([[tot, spec_atk] for tot, spec_atk in zip(total, special_attack)])

# Splits our features into training and testing
training_features = temp[:int(len(temp) * 0.7)]
test_features = temp[int(len(temp) * 0.7):]

# Gets our y's
temp = np.asarray([rate for rate in catch_rate])

# Splits our y's into training and testing
training_output = temp[:int(len(temp) * 0.7)]
test_output = temp[int(len(temp) * 0.7):]

# Randomly initializes one theta per feature
theta = np.random.uniform(0.0, 1.0, size=2)

Using matplotlib.pyplot, we can build a 3D figure:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
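One caveat: the plotting and training code below calls a hypothesis function that is never defined in this post. Here is a minimal sketch, assuming the model is simply θᵀx with no separate bias term:

def hypothesis(x, theta):
    # Predicted value for one data point: theta transpose times x
    return np.dot(x, theta)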

Let’s plot our features.

def generateZValues(x, theta):
    # Predicted catch rate for every data point
    z_values = []
    for i in range(len(x)):
        z_values.append(hypothesis(x[i], theta))
    return np.asarray(z_values)

def graph(features, output, theta, figure_name):
    x = []
    y = []
    for feature in features:
        x.append(feature[0])
        y.append(feature[1])
    z = generateZValues(features, theta)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    # Predictions in red, actual catch rates in green, on the same axes
    ax.scatter(x, y, z, c='r', marker='o')
    ax.scatter(x, y, output, c='g', marker='d')
    ax.set_xlabel('Total')
    ax.set_ylabel('Special Attack')
    ax.set_zlabel('Catch Rate')
    plt.savefig(figure_name)

Now we can graph our initial thetas:

graph(training_features, training_output, theta, 'before_training.png')  # filename is up to you

Pretty close, right? Our end goal is to have those red circles cover as many of our data points as possible.

You can now set your learning rate.

alpha = 0.0001

You can finally implement multivariable Gradient Descent

def gradientDescent(x, y, theta, m, alpha, iterations=1500):
    for iteration in range(iterations):
        # Updates each theta using the gradient of the loss for that theta
        for j in range(len(theta)):
            gradient = 0
            for i in range(m):
                gradient += (hypothesis(x[i], theta) - y[i]) * x[i][j]
            gradient *= 1 / m
            theta[j] = theta[j] - (alpha * gradient)
    print(theta)
    return theta

Let’s train your model

gradientDescent(training_features, training_output, theta, len(training_output), alpha)

Now that we have our newly trained thetas, let’s graph the results.
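We can redraw the plot by calling graph again with the trained thetas; the filename is just an example:

graph(training_features, training_output, theta, 'after_training.png')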

Wow, your model is working like a charm!

Theta Reminders

Remember, your thetas are your model! You went through this elaborate process to find the optimal thetas so that, when paired with an unseen x value, they accurately predict a y value.

The Code

Final Words!

Great! Through this tutorial you have learned about feature matrices, the importance of theta, and how to implement Gradient Descent for multiple variables. Thank you for reading, and please comment if you have any questions.

