Classifying the ASL Alphabet using Machine Learning

Ishaan Guha
Published in Quick Code
Aug 26, 2019

The aim of this project was to build a neural network that could classify images of the American Sign Language alphabet from a Kaggle dataset with a high level of accuracy.

Data

The ASL alphabet dataset contains 87,000 images spanning 29 classes of 3,000 images each. 26 of these classes are the letters A-Z and the other 3 are the signs for nothing, space, and delete. These 87,000 images were divided into 78,300 training images to be fed into the model and 8,700 validation images. In addition to splitting the data, a generator was used to augment the images (rotating them, shifting them sideways, etc.) so that the data became less uniform and the model would have to generalize rather than simply memorize specific images.
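A minimal sketch of how such a split and augmentation can be set up with a Keras ImageDataGenerator; the directory path and augmentation strengths here are illustrative, not the project's exact settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 90/10 train/validation split (78,300 / 8,700 images) with basic augmentation.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,        # illustrative rotation strength
    width_shift_range=0.1,    # illustrative sideways shift
    height_shift_range=0.1,
    validation_split=0.1,
)

train_gen = datagen.flow_from_directory(
    "asl_alphabet_train",     # hypothetical dataset path
    target_size=(64, 64),
    class_mode="categorical",
    subset="training",
)
val_gen = datagen.flow_from_directory(
    "asl_alphabet_train",
    target_size=(64, 64),
    class_mode="categorical",
    subset="validation",
)
```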

Methods

The code is organized so that all the work is split into functions that are called in order at the end of the script. The training, validation, and test data were run through two models, one consisting of fully connected layers and the other a convolutional neural network, so every function was run twice (once for each model) and written so that, depending on the model type, the data is processed and passed to the appropriate model. The fully connected model is a standard deep learning model with two hidden layers, its input layer having 4096 nodes in order to take in the flattened 64x64 grayscale image. The convolutional neural network was adapted from Running Kaggle Kernels with a GPU, with the addition of Batch Normalization, Dropout, and kernel regularizers in an attempt to reduce overfitting.

The Models

Fully Connected Neural Network Representation
Convolutional Neural Network Representation

Keras was used to create the models because it was more efficient than alternatives such as writing raw TensorFlow or building a neural network from scratch, and it also produced noticeably better results than either of those approaches.
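A minimal Keras sketch of the two architectures described above. The layer widths, filter counts, and regularization strength are assumptions for illustration; only the 4096-node flattened grayscale input, the two hidden layers, the 29 output classes, and the use of Batch Normalization, Dropout, and kernel regularizers come from the description above.

```python
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES = 29  # A-Z plus nothing, space, and delete

def build_dense_model():
    """Fully connected model: flattened 64x64 grayscale input, two hidden layers."""
    model = models.Sequential([
        layers.Dense(512, activation="relu", input_shape=(64 * 64,)),  # assumed width
        layers.Dense(256, activation="relu"),                          # assumed width
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def build_conv_model():
    """Convolutional model with BatchNormalization, Dropout, and L2 kernel regularizers."""
    reg = regularizers.l2(1e-4)  # assumed regularization strength
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", kernel_regularizer=reg,
                      input_shape=(64, 64, 3)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```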

Results

The main metric monitored during training was validation loss, since the main goal was for the model to predict unseen data and generalize well, so a feature was added to save the weights from the epoch with the lowest validation loss and reload the model with those weights. The lowest validation loss for the fully connected network was 0.15 and for the convolutional neural network it was 0.37. The fully connected network reached a maximum validation accuracy of 94% and the convolutional model a maximum of 92%. Although the fully connected network had lower validation loss and higher accuracy, the data fed into it was grayscale and not augmented, which made it easier for the model to learn the patterns, as the training data appears to have been taken from the same hand in a series of burst photos. The convolutional model, on the other hand, was trained on more varied data, so even though its validation loss was higher it learned to generalize better, which shows in the results against outside data (pictures of my own hand).
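A minimal sketch of how such best-weights checkpointing can be set up in Keras; the filename and epoch count are illustrative, not taken from the project code:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the weights from the epoch with the lowest validation loss.
checkpoint = ModelCheckpoint(
    "best_weights.h5",        # hypothetical filename
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True,
)

model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=30,                # assumed epoch count
    callbacks=[checkpoint],
)

# Reload the best weights before evaluating on the test sets.
model.load_weights("best_weights.h5")
```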

After training, each model was fed two sets of test data: one from the dataset and one consisting of images of my own hand. Both sets were also rotated to see how the models perform on rotated images. As seen above, both models performed very well on the dataset's test data, since that data was also taken from the same hand in the same lighting; both predicted 28 out of the 29 test classes correctly. The results against the pictures of my own hand, however, were not as good: the fully connected model failed to correctly predict any of them, to the point where it was effectively guessing letters at random, while the convolutional model had more success (it predicted 10 of the 29 unrotated classes) and at least was not guessing randomly, though it was easily confused between similar-looking signs and predicted a few classes far more often than others. This happened because the training data was so similar that the models began memorizing the images outright rather than learning the hand shape for each letter, which left their predictive ability on outside data quite low. Furthermore, neither model was able to predict rotated images, whether from the dataset's test data or my own. My surmise is that the signs of the alphabet depend heavily on orientation, and rotating a sign slightly can change which letter it represents (for example, the letters I and J), so the models could not make sense of rotated signs.
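A small sketch of how a single outside photo could be loaded, optionally rotated, and run through one of the trained models; the file path, rotation angle, and helper name are hypothetical:

```python
import numpy as np
from PIL import Image

def predict_sign(model, path, angle=0, grayscale=False):
    """Load one test photo, optionally rotate it, and return the predicted class index."""
    img = Image.open(path).convert("L" if grayscale else "RGB").resize((64, 64))
    if angle:
        img = img.rotate(angle)
    x = np.asarray(img, dtype="float32") / 255.0
    # Flat vector for the dense model, image tensor for the CNN.
    x = x.reshape(1, 64 * 64) if grayscale else x.reshape(1, 64, 64, 3)
    return int(np.argmax(model.predict(x)))

# e.g. predict_sign(conv_model, "my_hand_A.jpg", angle=30)
```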

Fully Connected Neural Network Results against my own Test Data
Convolutional Neural Network Results against Database Test Data
Convolutional Neural Network Results against my own Test Data

Conclusion

While the models perform well on the given dataset, the results do not translate to more real-world scenarios, and further work is needed to determine how to take such uniform data and generalize it to any situation. The next step would be to use better augmentation techniques that change the lighting and background rather than just shifting and rotating the images. Preprocessing techniques that detect the hand in the image and crop out everything else could also help, since with such similar training data the shared background may play a part in the model's overfitting; this could be done by implementing object-detection algorithms such as Mask R-CNN and RetinaNet. Further research on object-detection algorithms can be seen here. Implementing feature-selection methods such as F-score selection and Recursive Feature Elimination, which find the most important features in a sample and eliminate the rest, would further help increase training speed and reduce overfitting.
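As one example of the lighting idea, Keras's ImageDataGenerator can vary brightness in addition to the shifts and rotations used earlier; the ranges below are illustrative, not tuned values:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Adds lighting variation so the model is less tied to the dataset's
# uniform lighting; all values here are illustrative.
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    brightness_range=(0.5, 1.5),
)
```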

Learnings and Experiences

This project was my first time delving into the world of machine learning; up to this point, the extent of my knowledge on the topic had been watching YouTube videos about different types of artificial intelligence and hearing about neural networks whenever machine learning and AI came up. So when I started this project I had essentially no idea how to go about building a neural network, let alone coding one. Additionally, even though I had prior programming experience with Java, I had never used Python, so I had to learn Python before I could start learning about the topic at hand. That did not take long, as most of what I had to learn was just the different syntax; the main concepts remained the same. Only once I had learnt Python syntax sufficiently well was I able to move on to machine learning.

Since I didn't know anything about machine learning, I decided to start from scratch with the basic methods such as Linear Regression and K-Nearest Neighbors, which were surprisingly straightforward: they rely on basic geometry such as the slope-intercept form of a line and Euclidean distance, making them easy to understand and code. After that I began learning about Support Vector Machines, which is where I got stuck for quite a while. Although I understood the concepts of the best separating hyperplane and the support vectors, the math behind them was hard to grasp. Since I had not yet taken calculus in school, things like gradients and Lagrange multipliers were far beyond my knowledge at the time. However, MIT OpenCourseWare had a great video about Support Vector Machines that explained the algorithm in a way where the derivation of the formulas made sense to me and I wasn't completely lost. It still took a lot of time, though: I spent around a week and a half learning SVMs and implementing them in Python.

After that, I finally got to neural networks. To understand them, I used Michael Nielsen's free online book Neural Networks and Deep Learning, which was really thorough and gave me a good grasp of the subject. Some of the math still evaded me, but the explanation was thorough enough that I could understand the gist of what was happening. It took me less time to understand neural networks than it did SVMs, simply because after SVMs I already had some experience with the more complex math such as gradients. However, actually coding neural networks was much more troublesome than SVMs, simply because of the time it takes to run one. At first, I tried to run raw TensorFlow code without Keras in a Jupyter notebook. I failed miserably: not only did I get bad results, but since my laptop does not have a good GPU, running a model took forever. So, after a few more tries, I decided that switching to Keras and using Kaggle Kernels, which have a built-in GPU option, would be the most efficient option and give me the best results. Now, even though Keras is a fairly simple library to use, it still requires a good amount of knowledge about neural networks to use it to its full potential. Learning about neural networks from scratch helped me make a few tweaks without which my model would have failed or not worked as well.
Also, learning how to properly preprocess data, which I believe is one of the most complicated parts of machine learning, is another skill I would not have picked up without learning how neural networks work from scratch. Even the algorithms I learned before starting on neural networks played a big role in helping me create the models, as they gave me a much-needed foundation in machine learning and made everything easier once I moved on to neural networks. All in all, I don't think anything I learned along this journey was a waste; even if I didn't use it directly in this project, I will definitely use it in projects to come.
