Sign Language to Text Conversion

Jan 8, 2022

Authors: Bhavesh Sood, Vishwajeet Kumar, Ajeet Yadav

Machine Learning (CSE343, ECE343) from Indraprastha Institute of Information Technology, Delhi.

Communication between hearing people and deaf or mute people faces a language barrier, because the structure of sign language differs from that of ordinary text. Deaf and mute people therefore depend on vision-based signs and gestures to communicate and interact. A common interface that converts sign language to text would let other people understand these gestures easily. Research has been carried out on vision-based interface systems that let deaf and mute people communicate with others without either side knowing the other's language.

Introduction

Our aim is to develop a user-friendly human-computer interface (HCI) in which the computer understands human sign language. There are various sign languages all over the world; we use American Sign Language (ASL) for this project.

Literature Survey

This project converts American Sign Language into speech and text using image segmentation and feature detection. The system goes through several phases, such as data capture, sensing, image segmentation, and feature detection and extraction.[1]
This project creates a desktop application that captures a person signing American Sign Language (ASL) gestures and translates them into the corresponding text and speech in real time.[2]
This project builds a machine learning model that classifies the various hand gestures used in sign language. Its classification algorithms are trained on a set of image data.[3]

Dataset

Our dataset is a large database of images representing the different gestures for American Sign Language letters.[4] The database contains 27,455 training images and 7,172 testing images, each of size 28x28. Each row has 784 columns, pixel1, pixel2, …, pixel784, which together represent a single 28x28 image. Each pixel value is in the range 0–255. The target column holds integer labels from 0 to 25, corresponding to the English letters A to Z. The dataset contains no examples for 9 = J and 25 = Z, because signing these letters requires motion.
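To make the row layout concrete, here is a minimal sketch of turning one flat row of 784 pixel values back into a 28x28 image. The pixel values are randomly generated placeholders rather than real data, and row-major pixel order is an assumption.

```python
import numpy as np

# Hypothetical row in the layout described above: one label plus 784 pixel
# values (pixel1 ... pixel784), each in the range 0-255. Values are random
# placeholders, not taken from the real dataset.
rng = np.random.default_rng(0)
label = 2
pixels = rng.integers(0, 256, size=784)

# Rebuild the 28x28 grayscale image from the flat row (row-major order assumed).
image = pixels.reshape(28, 28)

# Labels 9 (J) and 25 (Z) never occur, because those letters need motion.
MISSING_LABELS = {9, 25}
present_labels = [k for k in range(26) if k not in MISSING_LABELS]

print(image.shape)          # (28, 28)
print(len(present_labels))  # 24
```

So although the labels span 26 letters, a classifier trained on this data only ever sees 24 distinct classes.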

All of these pixel values can be fed directly to a model, but doing so can cause problems during modelling, such as slower-than-expected training. Instead, we believe it is beneficial to prepare the pixel values before modelling, for example by standardizing them.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data (e.g. Gaussian with zero mean and unit variance). Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, so that the distribution has a mean of zero and a standard deviation of one. Each pixel value is a feature of the dataset, and since every pixel matters for an image, we do not need any feature selection or reduction. Our dataset does not have any missing, NaN, noisy or inconsistent values, so we did not need any data cleaning.
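As a minimal sketch of this step, the snippet below standardizes pixel features with scikit-learn's StandardScaler. The arrays are random placeholders with the same number of columns as our data, and fitting the scaler on the training split only is an assumption about good practice, not a record of our exact code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder pixel matrices (rows = images, 784 columns = pixels, values 0-255).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(200, 784)).astype(np.float64)
X_test = rng.integers(0, 256, size=(50, 784)).astype(np.float64)

# Learn the per-pixel mean and standard deviation from the training data only,
# then apply the same shift and scale to the test data.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Each training column now has mean ~0 and standard deviation ~1.
print(X_train_std.mean(axis=0)[:3])
print(X_train_std.std(axis=0)[:3])
```

The test columns will be close to, but not exactly, zero mean and unit variance, because they reuse the training statistics.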

Methodology

The goal is to build a model that correctly converts the image of a sign into one of the 26 letters of the alphabet. In other words, the problem is to classify each input image into one of 26 labels (the 26 letters of the English alphabet).

1. Logistic Regression

We started with one of the most basic classification techniques and trained a simple Logistic Regression model. Logistic regression models class probabilities with the sigmoid hypothesis function and is fitted with a gradient-based optimisation algorithm. We used the LogisticRegression model from sklearn and set its penalty parameter to none so that there is no regularisation. Our problem is multiclass classification, for which this model uses a one-vs-rest scheme. We trained the model for 400 iterations.

1.1 Logistic Regression with L2 regularisation

After training plain logistic regression, the natural next step was to add a penalty to decrease the variance. We first trained on our data with L2 regularisation (the ridge penalty). We used the same logistic regression model from sklearn but set its penalty parameter to 'l2' (which is the default). We again trained for 400 iterations.

1.2 Logistic Regression with L1 regularisation

After trying L2 regularisation we tried L1 regularisation (the lasso penalty), which encourages sparse weights. This time we changed the penalty parameter of the logistic regression model to 'l1' and trained on the training data again for 400 iterations.
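The three logistic regression variants above can be sketched as follows. This is an illustration under stated assumptions, not our original training script: it uses scikit-learn's built-in 8x8 digits as stand-in data, wraps each model in OneVsRestClassifier to make the one-vs-rest scheme explicit, picks the liblinear solver for L1, and approximates "no penalty" with a very large C because the spelling of that option has changed across scikit-learn versions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's built-in 8x8 digit images, not the ASL dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

variants = {
    # A very large C makes the default L2 penalty negligible (approximately
    # unregularised); the solver may warn that it has not fully converged.
    "simple": LogisticRegression(C=1e6, max_iter=400),
    # Default L2 (ridge) penalty.
    "l2": LogisticRegression(C=1.0, max_iter=400),
    # L1 (lasso) penalty; liblinear is one of the solvers that supports it.
    "l1": LogisticRegression(penalty="l1", solver="liblinear", max_iter=400),
}

scores = {}
for name, base in variants.items():
    # One binary classifier per class (one-vs-rest).
    model = OneVsRestClassifier(base).fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

The accuracies printed here apply to the stand-in digits data only and are not comparable with the figures reported for our dataset.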

2. Decision Tree

We then built a Decision Tree to classify the images. A decision tree is a very direct, rule-based way to classify data. It chooses one column (feature) to split on at the root and then keeps splitting the data by choosing further columns at each child node. We used the 'DecisionTreeClassifier' from sklearn. By default it uses the Gini index to measure impurity and chooses the column for each split on that basis. Because we wanted a pure decision tree, we left the max_depth parameter at its default value of None. This means nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. After training the model on the training data we calculated its accuracy on the test data.
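A minimal sketch of this setup, using scikit-learn's built-in digits as stand-in data (an assumption made so that the example runs without the ASL files):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 8x8 digit images, not the ASL dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Gini impurity and unlimited depth are the defaults; they are written out
# here only to mirror the description in the text.
tree = DecisionTreeClassifier(criterion="gini", max_depth=None, random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"depth={tree.get_depth()} train={train_acc:.3f} test={test_acc:.3f}")
```

A fully grown tree typically fits the training set almost perfectly while scoring noticeably lower on the test set, which is the overfitting behaviour that motivates the ensembling in the next section.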

3. Ensembling on DT model (Random Forest Classifier)

Our Decision Tree was not giving good accuracy, so we decided to try ensembling. A Random Forest (RF) classifier is an ensemble method that trains several decision trees in parallel on bootstrapped samples and then aggregates their predictions, so we trained an RF model with default parameters. Test accuracy improved significantly compared with the Decision Tree, from 43.83% to 80%.

3.1 Hyper-parameter tuning of RF model

Up to this point our best accuracy was 80% with the RF model, so we tried to improve it further by tuning its hyper-parameters. We searched the following hyper-parameter values with grid search CV: 'n_estimators': [80, 100, 120, 200, 300], 'criterion': ['entropy', 'gini'], 'max_depth': [2, 4, 10, 18, 30]. With 5-fold cross-validation, grid search CV selected n_estimators=300, max_depth=30 and criterion='entropy' as the best combination. With these parameters, accuracy improved from 80% to 81.49%.
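The search described above can be sketched with scikit-learn's GridSearchCV. The full grid is reproduced from the text, but the sketch actually runs a much smaller demo grid on the built-in digits data so that it finishes quickly; both the demo grid and the stand-in data are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data: 8x8 digit images, not the ASL dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Grid as listed in the text (50 combinations).
full_grid = {
    "n_estimators": [80, 100, 120, 200, 300],
    "criterion": ["entropy", "gini"],
    "max_depth": [2, 4, 10, 18, 30],
}
# Reduced grid (8 combinations) so that this sketch runs in a few seconds.
demo_grid = {
    "n_estimators": [20, 40],
    "criterion": ["entropy", "gini"],
    "max_depth": [4, 10],
}

# cv=5 evaluates every combination with 5-fold cross-validation, then refits
# the best combination on the whole training split.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), demo_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

test_acc = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy = {test_acc:.3f}")
```

Swapping demo_grid for full_grid reproduces the search space from the text, at the cost of 250 model fits plus the final refit.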

4. SVM

Before moving to Neural Networks we tried SVM, which is an L2-norm soft-margin classifier. SVM is a supervised ML algorithm that can be used for both classification and regression. We used the 'RBF' kernel with no limit on the maximum number of iterations, since the SVM optimisation problem is convex and the solver is expected to converge. The main advantage of SVM is that it is highly effective in high-dimensional spaces. We trained the model on the training data, tested it on the test data, and got an accuracy of 84.88%, the best result so far on our classification problem.
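A minimal sketch of this configuration with scikit-learn's SVC, again on the built-in digits as stand-in data. All parameters other than the kernel and the iteration limit are left at their defaults, which is an assumption.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images, not the ASL dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# RBF kernel; max_iter=-1 means the solver runs without an iteration limit.
svm = SVC(kernel="rbf", max_iter=-1)
svm.fit(X_train, y_train)

test_acc = svm.score(X_test, y_test)
print(f"test accuracy = {test_acc:.3f}")
```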

5. MLP

After trying the common ML algorithms above, we decided to try Artificial Neural Networks for our classification problem. An artificial neural network is a network of connected neurons, loosely modelled on the structure of the human brain. Each connection transfers information from one neuron to another. Inputs are fed into the first layer of neurons, which processes them and passes the result on to further layers of neurons called hidden layers. After the information has been processed by multiple hidden layers, it is passed to the final output layer. After trying many combinations of hidden layer sizes, solvers, maximum iterations, activation functions and other hyperparameters, we created an MLP classifier with hidden layer sizes (600, 650, 700) and the 'relu' activation function. After training and testing, the model gave an accuracy of 84.54% on the test data. We calculated the confusion matrix and the weighted precision, recall and F1 score for further confirmation, and all of these values were about 0.85. We therefore concluded that about 85% is close to the best we can get from an MLP.
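The MLP configuration and the evaluation metrics mentioned above can be sketched as follows. The hidden layer sizes and activation come from the text; the stand-in digits data, the default solver, the short max_iter and the input scaling are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data: 8x8 digit images (pixel values 0-16), scaled to [0, 1].
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three hidden layers with ReLU, as in the text. max_iter is kept small for
# speed, so scikit-learn may warn that training has not fully converged.
mlp = MLPClassifier(
    hidden_layer_sizes=(600, 650, 700),
    activation="relu",
    max_iter=20,
    random_state=0,
)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
accuracy = cm.trace() / cm.sum()

# Weighted averages weight each class by its number of true samples.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0
)
print(f"acc={accuracy:.3f} P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Note that weighted recall is always equal to accuracy, so of the three weighted scores only precision and F1 carry information beyond the accuracy figure.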

6. CNN

Unlike regular Neural Networks, the neurons in the layers of a CNN are arranged in 3 dimensions: width, height and depth. The neurons in a layer are connected only to a small region (the window size) of the layer before it, instead of to all of its neurons in a fully-connected manner. Moreover, the final output layer has as many units as there are classes, because by the end of the CNN architecture the full image has been reduced to a single vector of class scores. In our model, we used three 2D convolutional layers, each followed by max pooling to reduce the spatial dimensions. We then compiled the model with tuned parameters and trained it while holding out a validation set (an 8:2 train/validation split) to keep a check on overfitting. Here is the summary of our CNN model.
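Separately from that summary, the effect of three convolution and max-pool stages on a 28x28 input can be worked out by hand. The sketch below assumes 3x3 convolutions with "same" padding, 2x2 max pooling with stride 2, and 64 filters in the last convolution; these sizes are illustrative assumptions, not values taken from our model.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer on one axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28          # input images are 28x28
sizes = [size]
for _ in range(3):
    # Assumed 3x3 convolution with padding 1: spatial size is unchanged.
    size = output_size(size, kernel=3, stride=1, padding=1)
    # Assumed 2x2 max pool with stride 2: spatial size is halved (rounded down).
    size = output_size(size, kernel=2, stride=2)
    sizes.append(size)

# Hypothetical number of filters in the last convolutional layer.
last_filters = 64
flat_features = size * size * last_filters

print(sizes)          # [28, 14, 7, 3]
print(flat_features)  # 576
```

Under these assumptions, the flattened 576-value vector is what a final dense layer would map to the vector of class scores.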

Results

Accuracies

Logistic Regression
(a) Simple: 65.29%
(b) Ridge: 70.65%
(c) Lasso: 63.30%
Decision Tree: 43.83%
Random Forest: 81.49%
Support Vector Machine: 84.88%
MLP classifier: 84.53%
CNN classifier: 95.50%

Conclusion

After analyzing all the models in terms of their accuracy and their loss curves on the training, validation and testing data, we found that the CNN gave a test set accuracy of 95.5%, which is good enough for our purpose. The other models also gave decent accuracy but did not exceed 85% even after hyperparameter tuning.

For the CNN we also tried several combinations of parameters. With a learning rate of the order of 0.001 the model converged very fast, while with 0.0001 the accuracy dropped to 92%. We therefore took a middle value of 0.0005 for the learning rate and trained for 20 epochs. This gave us good accuracy and loss curves.

For example, we took the image below (fig 6) as input from the live camera and converted it to a grayscale image (fig 7), shown below the original image. Our code predicted label 2 for this image, which is the label for C (labeling starts from 0 for A, then 1 for B and 2 for C).
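The two steps in this example, converting a camera frame to grayscale and mapping a predicted label back to a letter, can be sketched as below. The luma weights (0.299, 0.587, 0.114) are a common convention assumed here, the frame is a synthetic placeholder, and to_grayscale and label_to_letter are hypothetical helper names.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an HxWx3 RGB image to an HxW grayscale image (luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights


def label_to_letter(label):
    """Map an integer label (0 = A, 1 = B, 2 = C, ...) to its letter."""
    return chr(ord("A") + label)


# Placeholder frame: a pure red 28x28 image standing in for a camera capture.
frame = np.zeros((28, 28, 3), dtype=np.uint8)
frame[..., 0] = 255

gray = to_grayscale(frame)
print(gray.shape)          # (28, 28)
print(label_to_letter(2))  # C
```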

Finally, after trying all the models with different combinations of parameters, we concluded that CNN is the most suitable model for our problem, with an accuracy of 95.5%.

References

[1] Victoria A. Adewale and Dr. Adejoke O. Olamiti. Conversion of Sign Language to Text and Speech Using Machine Learning Techniques.
[2] Ankit Ojha, Ayush Pandey, Shubham Maurya, Abhishek Thakur, and Dr. Dayananda P. Sign Language to Text and Speech Translation in Real Time Using Convolutional Neural Network. 2014.
[3] Muskan Dhiman. Sign Language Recognition. 2017.