Gotta train ’em all!

Eesha Shetty
ACM VIT
5 min read · Jun 23, 2019

We’ve all been obsessed with Pokémon at some point in our childhood, and I know how difficult it is to remember every single Pokémon (around 800!). In the show, Ash didn’t have this problem because of the Pokedex Professor Oak gave him. So why not build our very own, cool Pokedex?!

In this post, I will teach you how to build a program that identifies different Pokémon using a very simple Machine Learning algorithm from scikit-learn.

The Dataset

The most essential part of any ML program is acquiring and preprocessing a dataset. For our Pokedex, I am going to start off by building a dataset for 5 Pokémon:

Pikachu — The very best!
Squirtle
Jigglypuff
Gengar
Eevee

I built a dataset of 20 images for each Pokémon. The next step is preprocessing the data. To avoid any complications, resize all the pictures to the same dimensions (I resized them to 300x300), and then convert them to grayscale to make the process easier.

You can easily do this step using the PIL library, or even OpenCV.

Finally, create the matrix of features ‘X’, which contains every image in our dataset flattened into a row of pixel values, and the dependent variable ‘y’, which holds the corresponding label for each image.
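Here is a minimal sketch of that preprocessing step using PIL and NumPy. The folder layout (one folder per Pokémon) and the variable names are my own assumptions, not part of the original code:

import os
import numpy as np
from PIL import Image

# Assumed layout: dataset/Pikachu/*.jpg, dataset/Squirtle/*.jpg, ...
DATASET_DIR = 'dataset'

X, y = [], []
for pokemon in sorted(os.listdir(DATASET_DIR)):
    folder = os.path.join(DATASET_DIR, pokemon)
    for filename in os.listdir(folder):
        img = Image.open(os.path.join(folder, filename))
        img = img.convert('L')               # grayscale
        img = img.resize((300, 300))         # same dimensions for every image
        X.append(np.asarray(img).flatten())  # 300*300 = 90,000 pixel features
        y.append(pokemon)                    # label taken from the folder name

X = np.array(X)
y = np.array(y)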

Don’t forget to encode your string variables!

Many classification models won’t work directly with string labels, so we need to convert them to numeric labels.

Here 0=Eevee, 1=Gengar, 2=Jigglypuff, 3=Pikachu, 4=Squirtle

# Encode string labels ('Pikachu', 'Squirtle', ...) as integers 0-4
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Let’s finally build our model!

We have a total of 100 pictures in our dataset. Let’s fit our model on 80 of them (our train set) and test it on the remaining 20 (our test set).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, shuffle=True)

Fitting the model

Since our challenge is to identify which Pokémon is in a picture, this is a classification problem, and we need to fit a classifier. You can try any classifier you want; after some trial and error, I found that SVC yielded the highest accuracy on this dataset.

Support Vector Machines (SVMs) often provide higher accuracy than simpler classifiers like Logistic Regression. A Support Vector Classifier (SVC) tries to fit the widest possible margin between the classes while keeping margin violations to a minimum.

Hence we fit the SVC classifier with the help of GridSearchCV for hyperparameter tuning, which searches over a grid of candidate values and tells us which combination works best.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'kernel': ['rbf', 'linear'],
              'C': [1000, 5000, 10000, 50000, 100000],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
classifier = GridSearchCV(SVC(class_weight='balanced'), param_grid)
grid_result = classifier.fit(X_train, y_train)
grid_result.best_params_

Running grid_result.best_params_ returns the parameter combination that performed best during cross-validation. For example,

{'C': 1000, 'gamma': 0.005, 'kernel': 'rbf'}

You can update your param_grid variable with these values and run it again.
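Alternatively, since GridSearchCV refits the model with the winning parameters by default, you can skip the manual update and grab the tuned classifier directly. A small sketch:

# The grid search keeps a copy of the classifier refit with the best parameters
best_classifier = grid_result.best_estimator_
print(grid_result.best_score_)   # mean cross-validation accuracy of that model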

Applying Principal Component Analysis

Now that you’ve fit your classifier to the train set, you must have realized that it takes a hell of a lot of time to do so. That is because each image in our Pokémon dataset has about 90,000 features (300 x 300 pixels)! To reduce the training time and build a simpler model, we can use PCA.

Principal Component Analysis, or PCA, is a dimensionality-reduction method: it transforms a large set of variables into a smaller one that still contains most of the information in the original data.

So let’s use PCA!

from sklearn.decomposition import PCA

# Learn 20 principal components from the training images only
pca = PCA(n_components=20, whiten=True).fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

This reduces our features from 90,000 to 20, which makes our lives much easier. Apply PCA to X_train and X_test before fitting your classifier.
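Since the earlier grid search was fit on the raw pixels, refit it once more on the PCA-reduced features. A rough sketch, reusing the classifier object from above:

# X_train and X_test now hold 20 PCA components per image
grid_result = classifier.fit(X_train, y_train)
print(grid_result.best_params_)   # the best parameters may change after PCA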

And now we’re almost done! All that’s left to do is test our model:

y_pred = classifier.predict(X_test)

A confusion matrix is a table used to describe the performance of a classifier: from it, you can see how many predictions were correct and where the model got confused.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
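If you want a single headline number alongside the matrix, accuracy is a quick sanity check. A small sketch:

from sklearn.metrics import accuracy_score
print(cm)                              # rows = actual class, columns = predicted class
print(accuracy_score(y_test, y_pred))  # fraction of test images classified correctly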

Your Pokedex is now ready!

If you want, you can complete the Pokedex with all the Pokémon and keep improving it. You could take an input image from the user, run it through classifier.predict to identify the Pokémon, and display the result.

I used matplotlib to display the final output.
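Here is one possible sketch of that last step. The file name is hypothetical, but the preprocessing mirrors what we did for the training set:

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Preprocess the user's image exactly like the training data
img = Image.open('my_pokemon.jpg')    # hypothetical input file
gray = np.asarray(img.convert('L').resize((300, 300))).flatten()

# Project onto the same 20 principal components, then classify
features = pca.transform(gray.reshape(1, -1))
label = labelencoder_y.inverse_transform(classifier.predict(features))[0]

# Show the original image with the predicted Pokémon as the title
plt.imshow(img)
plt.title(label)
plt.axis('off')
plt.show()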

You can check out my GitHub for the complete code and dataset.

