Applied Machine Learning: Part 2

Convolutional Neural Networks for Image Recognition

If you haven’t read it yet, check out the price prediction project in Applied Machine Learning: Part 1. As in the previous article, the model here is built using Python in the Spyder IDE. The first article also explains how to set up your development environment.

Now let’s move on to the second application of machine learning. This time we explore the domain of image recognition. For this purpose, I have chosen the ‘Sign Language MNIST’ dataset. The idea is to build a model that recognises which letter of the alphabet a sign language hand gesture refers to. The dataset contains images of hand gestures, each representing one of 24 letters (J and Z are left out because signing them involves motion). You are encouraged to choose your own dataset and create your own problem statement. The method of implementation will remain similar.

You can find a variety of datasets on Kaggle or by using the Google dataset search tool.

Reference for Sign Language

A note to the reader: if you wish to understand the technical basis of how a CNN works, you can refer to this article, Performance Analysis of Deep Learning Algorithms: Part 1.

This article primarily focuses on the practical implementation of a CNN on a non-standard dataset with a unique application.

Without any further delay, let’s get started with our project:

  • Given a hand gesture image, I want my model to recognize the corresponding sign language alphabet.
  • In most image recognition problems, a Convolutional Neural Network (CNN for short) works pretty well. A CNN is a general-purpose choice for image recognition, much as linear models such as regression are for numeric prediction problems, so that is what I will implement here. First, let’s import all the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
#Libraries for implementing a CNN
from keras.layers import Conv2D, Flatten, MaxPooling2D, Dense, Dropout
from keras.models import Sequential
from keras.utils import to_categorical
  • The next step is to load the dataset. Make sure the dataset files are copied into the main project folder. In my case, the training and testing datasets are separate files.
train = pd.read_csv('sign_mnist_train.csv')
test = pd.read_csv('sign_mnist_test.csv')
  • Now we need to arrange the data so that it represents images. In the raw data, each row stores one image as a flat array of 784 pixel values, which must be reshaped into a 28*28 matrix. Dividing by 255 scales the pixel values to the range 0 to 1. The ‘label’ column tells us which letter the image represents. Some input images are visualized below.
labels = train.pop('label')  #Pops the label column and stores in 'labels'
labels = to_categorical(labels)
train = train.values
train = np.array([np.reshape(i, (28,28)) for i in train])
train = train / 255
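#Show each sample in its own figure so the second does not overwrite the first
plt.imshow(train[0])
plt.show()
plt.imshow(train[6])
plt.show()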
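To make the preprocessing concrete, here is a minimal NumPy-only sketch of the same steps on made-up data. The helper names `to_images` and `one_hot` are hypothetical, and `one_hot` is only a stand-in for the kind of output `to_categorical` produces, not Keras's actual implementation.

```python
import numpy as np

def to_images(flat_pixels, side=28):
    """Reshape rows of side*side pixel values into (n, side, side) images scaled to [0, 1]."""
    flat_pixels = np.asarray(flat_pixels, dtype=np.float32)
    return flat_pixels.reshape(-1, side, side) / 255.0

def one_hot(labels, num_classes=None):
    """One-hot encode integer labels (illustrative stand-in for to_categorical)."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1  # columns run from 0 up to the largest label
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Made-up data: 5 fake "images" of 784 pixel values each, with labels up to 24
rng = np.random.default_rng(0)
fake_pixels = rng.integers(0, 256, size=(5, 784))
fake_labels = np.array([3, 0, 24, 10, 3])

images = to_images(fake_pixels)
targets = one_hot(fake_labels)
print(images.shape, targets.shape)  # (5, 28, 28) (5, 25)
```

Because the largest label is 24, the encoded targets have 25 columns, which is why the output layer of the network later needs 25 units.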
  • Next, we split the data into training and validation sets in a 70:30 ratio. The ‘random_state’ parameter fixes how the rows are shuffled before the split, so try experimenting with different values.
X_train, X_val, y_train, y_val = train_test_split(train, labels, test_size=0.3, random_state=41)
#Reshaping the training and validation sets
X_train = X_train.reshape(X_train.shape[0], 28,28,1)
X_val = X_val.reshape(X_val.shape[0], 28,28,1)
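If you are curious what the split and the reshape are doing, the idea can be sketched with NumPy alone. This is an illustration under my own assumptions (the function `holdout_split` is hypothetical), not how `train_test_split` is implemented internally, and a given seed will not reproduce the same rows as scikit-learn.

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.3, seed=41):
    """Shuffle the row indices once, then cut them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Made-up data: 10 fake 28x28 images with labels 0..9
X = np.arange(10 * 28 * 28, dtype=np.float32).reshape(10, 28, 28)
y = np.arange(10)
X_tr, X_va, y_tr, y_va = holdout_split(X, y)

# Add the trailing channel axis (1 = grayscale) that Conv2D expects
X_tr = X_tr.reshape(X_tr.shape[0], 28, 28, 1)
X_va = X_va.reshape(X_va.shape[0], 28, 28, 1)
print(X_tr.shape, X_va.shape)  # (7, 28, 28, 1) (3, 28, 28, 1)
```

Using the same seed always gives the same split, which is what makes an experiment repeatable.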
  • We are now ready to build our CNN. But how exactly do we do that? Follow the steps below. For the ‘input_shape’ parameter, use the dimensions of your own input image; the basic architecture stays the same. You can experiment with the numbers (filters, kernel sizes, units) to tweak your CNN. Make a note of ‘relu’ and ‘softmax’, both of which are activation functions.
  • Notice that Dropout is set to 0.4 near the end. You can tweak that value as well. Dropout randomly switches off a fraction of the units during training, which helps reduce overfitting, i.e. the phenomenon where the model performs really well on the training data but fails on the test data. In the final line, you can see ‘25’ inside the ‘Dense’ bracket. This is the number of output classes for my dataset: the labels run from 0 to 24, so to_categorical produces 25 columns even though only 24 letters occur. It will vary depending on your dataset.
  • This is what a typical CNN architecture looks like in code. The same architecture can be reused on other image datasets by changing just the input shape and the number of output classes.
#Building Our CNN
model = Sequential()
model.add(Conv2D(8, (3,3), input_shape=(28,28,1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=2))
model.add(Conv2D(16, (3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(25, activation='softmax'))
model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_3 (Conv2D)            (None, 26, 26, 8)         80
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 13, 13, 8)         0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 11, 11, 16)        1168
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 5, 5, 16)          0
_________________________________________________________________
flatten_2 (Flatten)          (None, 400)               0
_________________________________________________________________
dense_3 (Dense)              (None, 128)               51328
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_4 (Dense)              (None, 25)                3225
=================================================================
Total params: 55,801
Trainable params: 55,801
Non-trainable params: 0
_________________________________________________________________
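The output shapes and parameter counts in the summary can be checked by hand. The sketch below redoes that arithmetic in plain Python, assuming the Keras defaults used in the model above (no padding, stride 1 for the convolutions). The helper names are my own and are not part of Keras.

```python
def conv_output(size, kernel=3, stride=1):
    """Output width/height of an unpadded ('valid') convolution or pooling step."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    """kernel*kernel*in_channels weights per filter, plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units

size = 28
size = conv_output(size, kernel=3)            # Conv2D(8, (3,3))   -> 26
size = conv_output(size, kernel=2, stride=2)  # MaxPooling2D       -> 13
size = conv_output(size, kernel=3)            # Conv2D(16, (3,3))  -> 11
size = conv_output(size, kernel=2, stride=2)  # MaxPooling2D       -> 5
flat = size * size * 16                       # Flatten            -> 400

total = (conv_params(3, 1, 8) + conv_params(3, 8, 16)
         + dense_params(flat, 128) + dense_params(128, 25))
print(flat, total)  # 400 55801
```

Most of the parameters sit in the first Dense layer (51,328 of 55,801), so changing its size has the biggest effect on the model's footprint.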
A typical CNN architecture (Source: Wikipedia)
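For readers who want to see what ‘relu’, ‘softmax’ and Dropout do numerically, here is a small NumPy sketch. These are textbook definitions written for illustration, not the Keras source code, and the input values are made up.

```python
import numpy as np

def relu(x):
    """Keep positive values, replace negative ones with 0."""
    return np.maximum(0.0, x)

def softmax(x):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # dropout is only active during training
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
scores = np.array([[2.0, -1.0, 0.5]])

activated = relu(scores)   # the negative score becomes 0
probs = softmax(scores)    # the largest score gets the largest probability
activations = np.ones(10)
dropped = dropout(activations, rate=0.4, rng=rng)

print(activated)
print(probs, probs.argmax())
print(dropped)
```

The class with the highest softmax probability is the model's prediction, which is why the test code later uses `np.argmax`.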
  • Once you have your CNN ready, you can train your model by feeding in the training set we made earlier. There are several optimizers to choose from; I use ‘adam’ here, which is a common default. The number of epochs and the batch size are up to you, so try tweaking those values. A larger number of epochs means more training time but can produce better accuracy, at least until the model starts to overfit.
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#Code for Training our Model
history = model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs=50, batch_size=512)
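To give a feel for what the ‘categorical_crossentropy’ loss measures and how ‘batch_size’ relates to the number of weight updates per epoch, here is a rough NumPy sketch. The function names and all numbers are made up for illustration; Keras computes the loss internally with its own implementation.

```python
import math
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -log(probability assigned to the correct class)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def steps_per_epoch(n_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

# Two made-up samples with 3 classes each
y_true = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)
confident = np.array([[0.05, 0.90, 0.05], [0.80, 0.10, 0.10]])
unsure = np.array([[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)   # the confident, correct predictions score lower
print(steps_per_epoch(10_000, 512))  # 20 (10,000 is a hypothetical sample count)
```

A smaller batch size means more updates per epoch, so each epoch takes longer but the weights are adjusted more often.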
  • Wait for the model to train. Once it’s done, we can plot the variation of accuracy with the epoch to visualize how our model is improving with each epoch.
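#Newer Keras versions name these keys 'accuracy' and 'val_accuracy' instead of 'acc' and 'val_acc'
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title("Accuracy")
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend(['train','validation'])
plt.show()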
Accuracy vs Epoch
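Besides plotting, you can read a few numbers straight out of the history dictionary. The sketch below uses a made-up dictionary in place of `history.history`, and the function `summarize_history` is my own helper. A large gap between training and validation accuracy at the end is a hint that the model is overfitting.

```python
def summarize_history(history_dict):
    """Report the best validation epoch and the final train/validation gap."""
    # Older Keras releases use 'acc'/'val_acc'; newer ones use 'accuracy'/'val_accuracy'
    train_key = 'accuracy' if 'accuracy' in history_dict else 'acc'
    val_key = 'val_' + train_key
    train_acc = history_dict[train_key]
    val_acc = history_dict[val_key]
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        'best_epoch': best_index + 1,  # epochs are counted from 1
        'best_val_acc': val_acc[best_index],
        'final_gap': train_acc[-1] - val_acc[-1],
    }

# Made-up values standing in for history.history
fake_history = {
    'acc': [0.40, 0.70, 0.90, 0.97],
    'val_acc': [0.45, 0.72, 0.88, 0.87],
}
summary = summarize_history(fake_history)
print(summary)
```

In this fake run the validation accuracy peaks at epoch 3 and then dips while training accuracy keeps climbing, which is the pattern to watch for in your own plot.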
  • Now it’s time to test how well our model performs on the test dataset. Read each line of code carefully instead of blindly copying it, and you should be able to follow the flow of the program. Note that the test data goes through exactly the same preprocessing as the training data.
y_test = test.pop('label')
y_test = to_categorical(y_test)
y_test.shape
X_test = test.values
X_test = np.array([np.reshape(i, (28,28)) for i in X_test])
X_test = X_test / 255
X_test = X_test.reshape(X_test.shape[0], 28,28,1)
X_test.shape
#Recognizing images on the test dataset
predictions = model.predict(X_test)
test_accuracy = accuracy_score(np.argmax(y_test, axis=1), np.argmax(predictions, axis=1))
print("The test accuracy is: ", test_accuracy)
#Result
The test accuracy is: 0.9223368655883993
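A single accuracy number hides which letters the model confuses. Here is a minimal NumPy sketch of a confusion matrix and per-class accuracy on made-up predictions. The helper functions are hypothetical, and the label-to-letter mapping (0 is A, 1 is B, and so on) is my assumption about how this dataset encodes its labels.

```python
import numpy as np

def label_to_letter(label):
    """Assumed mapping: 0 -> 'A', 1 -> 'B', ..., 24 -> 'Y'."""
    return chr(ord('A') + int(label))

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[i, j] counts samples of true class i that were predicted as class j."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each class predicted correctly, skipping classes with no samples."""
    totals = matrix.sum(axis=1)
    correct = np.diag(matrix)
    return {cls: float(correct[cls] / totals[cls])
            for cls in range(len(totals)) if totals[cls] > 0}

# Made-up one-hot targets and predicted probabilities for 3 classes
y_true = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])

true_labels = np.argmax(y_true, axis=1)  # undo the one-hot encoding
pred_labels = np.argmax(y_prob, axis=1)  # most probable class per image
matrix = confusion_matrix(true_labels, pred_labels, num_classes=3)
accuracies = per_class_accuracy(matrix)
for cls, acc in accuracies.items():
    print(label_to_letter(cls), acc)
```

With the real model you would pass `np.argmax(y_test, axis=1)`, `np.argmax(predictions, axis=1)` and `num_classes=25` instead of the made-up arrays.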

The trained model reached a test accuracy of about 92.23%, which is pretty good. It may be possible to get higher accuracy by tuning the parameters we used, such as the number of filters, the dropout rate, the number of epochs and the batch size.

What does this signify in reality? Our model correctly recognises which letter a hand gesture refers to in about 92 out of every 100 test images. Given a 28*28 image of a hand gesture as input, it is highly likely that our model will identify it correctly.

A more advanced version of this approach could be used to build a system that converts sign language into text in real time, and that text into speech, helping people who are deaf or non-speaking to communicate more easily. Such is the potential of machine learning.

With that, we can conclude this project! If you have been following this series, your second application should now be ready. Using the same approach, you could build a flower recognition model, an animal recognition model, and so on. The possibilities are limited only by the dataset you choose.


If you have any questions about applied machine learning, or if you get stuck while implementing your model, feel free to ask in the responses below.

Stay tuned for the next article where we will explore more diverse models and their application in real life scenarios.

Clap and share if you found this useful and do follow ‘The Research Nest’ for more insightful content.