Rock-Paper-Scissors Image Classification Using CNN

Farah Amalia · Published in Geek Culture · May 14, 2021 · 6 min read


This is my first image classification project, completed as part of the Machine Learning Developer learning path at Dicoding Academy. I decided to write it up to give a clearer picture of image classification, especially for beginners.

You can access the complete code on GitHub.

Understanding the Dataset

The dataset is provided by Dicoding Academy and can be accessed on this link. It contains a total of 2,188 images of 300 x 200 pixels, corresponding to the hand gestures of the Rock-Paper-Scissors game: ‘Rock’ (726 images), ‘Paper’ (710 images), and ‘Scissors’ (752 images). Here are a few images taken from the dataset:

Loading the Dataset

We will use the wget command to download the dataset. We also need to import the libraries we will use throughout this project.
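Below is a minimal sketch of this step. The download URL is a placeholder, since the actual link is provided by Dicoding Academy.

```python
# Download the dataset archive (placeholder URL; substitute the link
# provided by Dicoding Academy).
!wget -O rockpaperscissors.zip https://example.com/rockpaperscissors.zip

# Libraries used throughout this project.
import os
import zipfile
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
```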

Data Pre-processing

First, we extract the file we just downloaded.
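A sketch of the extraction step; the directory names are assumptions.

```python
# Extract the archive into /tmp.
with zipfile.ZipFile('rockpaperscissors.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp')

# Base directory containing one subfolder per class (rock, paper, scissors).
base_dir = '/tmp/rockpaperscissors'
```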

The next step is applying data augmentation to help with the training process. The images in the dataset are fairly uniform, so we add some variation to them.
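A sketch of the augmentation setup; the exact parameter values here are illustrative assumptions.

```python
train_datagen = ImageDataGenerator(
    rescale=1./255,        # scale RGB values from 0-255 down to 0-1
    rotation_range=20,     # randomly rotate images by up to 20 degrees
    horizontal_flip=True,  # randomly flip half of the images horizontally
    shear_range=0.2,       # randomly apply shearing transformations
    fill_mode='nearest',   # fill pixels created by rotations/shifts
    validation_split=0.4   # reserve a fraction of the data for validation
)
```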

Depending on the parameters, the code above generates randomly transformed versions of the images. Here is an explanation of each argument[1]:

  • rescale is a value by which we multiply the data before any other processing. Our original images consist of RGB values in the 0-255 range, but such values would be too high for our model to process (given a typical learning rate), so we target values between 0 and 1 instead by scaling with a 1/255 factor.
  • rotation_range is a value in degrees (0-180), the range within which to randomly rotate pictures.
  • horizontal_flip randomly flips half of the images horizontally, which is relevant when there is no assumption of horizontal asymmetry (e.g., real-world pictures).
  • shear_range randomly applies shearing transformations.
  • fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
  • validation_split reserves a fraction of the data for validation.

Note that data augmentation should only be applied to the training set, not to the validation set, as shown in the sketch below.
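One way to respect this is to use a second, rescale-only generator with the same validation_split, so the validation images are never augmented. A sketch, assuming base_dir points to the folder with one subdirectory per class and a 100 x 150 target size:

```python
# Validation generator: rescaling only, no augmentation.
val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.4)

train_generator = train_datagen.flow_from_directory(
    base_dir,
    target_size=(100, 150),    # resize all images to 100 x 150
    class_mode='categorical',
    subset='training'
)
validation_generator = val_datagen.flow_from_directory(
    base_dir,
    target_size=(100, 150),
    class_mode='categorical',
    subset='validation',
    shuffle=False              # keep file order fixed for evaluation later
)
```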

Convolutional Neural Network (CNN)

A regular Neural Network representation (source)
CNN representation (source)

The Convolutional Neural Network (CNN for short) is one of the most popular techniques in image classification.

CNNs are very similar to ordinary neural networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels on one end to class scores at the other, and it still has a loss function (e.g. SVM/Softmax) on the last (fully connected) layer. All the tips and tricks developed for learning regular neural networks still apply.

The difference is that CNN architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network[2].

Model Construction

After we are done pre-processing and splitting our dataset, we are ready to build our CNN model.
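A sketch of the model, matching the layer configuration explained below:

```python
model = tf.keras.models.Sequential([
    # Block 1: 16 filters of size 3 x 3 on 100 x 150 RGB inputs
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
                           input_shape=(100, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Dropout(0.2),
    # Block 2: 32 filters
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Dropout(0.2),
    # Block 3: 64 filters
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Dropout(0.2),
    # Classifier head
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.summary()
```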

Here is a quick explanation of the above code:

  • The first convolutional layer has 16 filters, each of size 3 x 3, and takes inputs of shape 100 x 150 (with 3 color channels).
  • ReLU (rectified linear unit) is used as the activation function.
  • Next we have a 2 x 2 max-pooling layer. Max-pooling helps reduce overfitting by providing an abstracted form of the representation[3].
  • The dropout technique with a rate of 0.2 is used to minimize the effect of overfitting within the trained network[4].
  • The same pattern of layers is repeated, with the number of filters increasing to 32 and then 64.
  • A Flatten layer is used since fully connected layers expect one-dimensional input.
  • A dense hidden layer with 128 units follows.
  • The last is the output layer, which contains 3 units since we have 3 output classes. The activation function for this layer is softmax.

Callbacks, Compile and Fit

Next we will use a callback, and then compile and fit the model that we have constructed.
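A sketch of this step; the callback parameters and the number of epochs are assumptions.

```python
# Halve the learning rate when validation loss stops improving.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2, min_lr=1e-5)

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    train_generator,
    validation_data=validation_generator,
    epochs=20,
    callbacks=[reduce_lr]
)
```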

The callback (ReduceLROnPlateau) is used to reduce learning rate when a metric has stopped improving[5].

For model compilation, we use the Adam algorithm, a stochastic gradient descent method based on adaptive estimation of first- and second-order moments[6].

Model Evaluation

Now let’s evaluate the performance of our CNN model. I have created a function to plot the accuracy and loss for both the training and validation sets.
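A sketch of such a function, reading from the History object returned by model.fit:

```python
def plot_history(history):
    """Plot training/validation accuracy and loss side by side."""
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(epochs, acc, label='Training accuracy')
    ax1.plot(epochs, val_acc, label='Validation accuracy')
    ax1.set_title('Accuracy')
    ax1.legend()
    ax2.plot(epochs, loss, label='Training loss')
    ax2.plot(epochs, val_loss, label='Validation loss')
    ax2.set_title('Loss')
    ax2.legend()
    plt.show()
```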

I also created a function to show the confusion matrix and the classification report.
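A sketch using scikit-learn; this assumes the validation generator was created with shuffle=False, so the predictions line up with generator.classes.

```python
from sklearn.metrics import confusion_matrix, classification_report

def evaluate_model(model, generator):
    """Print the confusion matrix and classification report."""
    probs = model.predict(generator)
    y_pred = np.argmax(probs, axis=1)
    y_true = generator.classes
    labels = list(generator.class_indices.keys())

    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=labels))
```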

Next we just need to call the functions to show the plot and report. You can see the results below:
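Using the helper functions sketched above:

```python
plot_history(history)
evaluate_model(model, validation_generator)
```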

Accuracy and Loss Plot

The plots above show that the accuracy increases steadily while the loss decreases, for both the training and validation sets.

Confusion Matrix and Classification Report

Evaluating on the validation data gives 96% accuracy, which is quite high. We can test the model further by uploading new images and letting it predict whether each one shows a rock, paper, or scissors gesture. Here is the function I created to predict new images.
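A sketch of such a function; the file name in the usage example is hypothetical.

```python
from tensorflow.keras.preprocessing import image

def predict_image(path, model, class_indices):
    """Load one image, preprocess it like the training data,
    and print the predicted gesture."""
    img = image.load_img(path, target_size=(100, 150))
    x = image.img_to_array(img) / 255.0  # same rescaling as training
    x = np.expand_dims(x, axis=0)        # add a batch dimension

    probs = model.predict(x)[0]
    labels = {v: k for k, v in class_indices.items()}
    print(f'Predicted: {labels[np.argmax(probs)]} '
          f'(confidence {probs.max():.2f})')

# Example usage with a hypothetical uploaded file:
predict_image('new_hand.png', model, train_generator.class_indices)
```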

As we can see, the CNN model can assign images correctly to their respective gesture categories. Now let's also test the model on a different hand image, shown below.

Interestingly, the model classifies the image as paper, even though the hand does not represent a rock, paper, or scissors gesture. The model predicts paper anyway because that is the gesture the image most closely resembles; with a softmax output, the model must always assign the image to one of the three classes.

Conclusion

Our CNN model is able to classify rock-paper-scissors hand gestures with 96% validation accuracy. In the next post I will write about the implementation of transfer learning using ResNet50V2 on the same dataset.
