Sign Language Recognition using MediaPipe and Random Forest — Intel oneAPI Optimised Scikit-Learn Library

Vatika Agrawal
9 min read · Mar 19, 2023


Image Source: https://ritme.com/wp-content/uploads/2021/03/Bannie%CC%80re_INTEL_ONEAPI.png https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/American_Sign_Language_ASL.svg/1200px-American_Sign_Language_ASL.svg.png

American Sign Language (ASL) is a natural language that serves as the predominant sign language of Deaf communities. But deaf people often have difficulty communicating with hearing people, because not everyone knows the sign language alphabet. So we need a mechanism to automate this task, and one way to do it is with sign language recognition technology.

In this blog, we will explore how to detect the alphabet associated with a hand sign using the hand landmark model (from MediaPipe), a Random Forest classifier, and the Intel oneAPI-optimized Scikit-Learn library.

Image Source: https://miro.medium.com/v2/resize:fit:828/format:webp/1*loqTWz8bcVAvVhmE1wUDVA.png

First of all, we need to know what hand landmarks are. Hand landmark detection refers to the process of detecting and tracking the positions of specific points on a person’s hand using computer vision. These points, also known as landmarks or keypoints, include the fingertips, the base of the fingers, the wrist, and other points on the hand. Landmarks can be used to identify the different signs being made by the person. To detect hand landmarks we can use the MediaPipe library.

MediaPipe is a popular open-source framework for building computer vision and machine learning pipelines. It includes a pre-trained hand landmark model that can detect and track the positions of specific points on a person’s hand. MediaPipe Hands is a high-fidelity hand and finger tracking solution that employs machine learning (ML) to infer 21 3D landmarks of a hand from just a single frame. MediaPipe Hands utilizes an ML pipeline consisting of two models working together:

  • A palm detection model that operates on the full image and returns an oriented hand bounding box.
  • A hand landmark model that operates on the cropped image region defined by the palm detector and returns high-fidelity 3D hand keypoints.

The hand landmark model performs precise keypoint localization of 21 3D hand-knuckle coordinates inside the detected hand regions via regression, that is, direct coordinate prediction. The model learns a consistent internal hand pose representation and is robust even to partially visible hands and self-occlusions.

https://google.github.io/mediapipe/solutions/hands.html

To use the Mediapipe hand landmark model, you can follow these general steps:

  1. Install MediaPipe: You can install MediaPipe using pip or conda, depending on your preference (using pip: pip install mediapipe).
  2. Load the Hand Landmark model: You can load the pre-trained hand landmark model using the hands module provided by Mediapipe.
  3. Capture Video: You can use a webcam or pre-recorded video as an input source.
  4. Process frames: For each frame of the video, you can pass it through the hand landmark model to detect and track the positions of the hand landmarks.
  5. Visualize results: You can use the detected hand landmarks to draw on the video stream and visualize the positions of the different keypoints on the hand.
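
Putting these steps together, here is a minimal sketch of running the hand landmark model on a webcam stream. It assumes a webcam at index 0, and the confidence thresholds are illustrative values, not required ones:

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # webcam at index 0 (adjust if needed)

with mp_hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5) as hands:
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # MediaPipe expects RGB input; OpenCV delivers BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                # Draw the 21 keypoints and their connections on the frame
                mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow('hand landmarks', frame)
        if cv2.waitKey(1) == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()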

Output: Collection of detected/tracked hands, where each hand is represented as a list of 21 hand landmarks and each landmark is composed of x, y and z. x and y are normalized to [0.0, 1.0] by the image width and height respectively. z represents the landmark depth with the depth at the wrist being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of z uses roughly the same scale as x.

https://google.github.io/mediapipe/solutions/hands.html

Now let’s understand the Random Forest classifier. The random forest algorithm works by building a forest of decision trees, where each tree is trained on a random subset of the training data and a random subset of the features. During training, the algorithm randomly selects a subset of features and constructs a decision tree based on that subset. This process is repeated many times, creating multiple decision trees that are then combined to form a random forest.

To make a prediction, the random forest algorithm takes a new data point and passes it through each of the decision trees in the forest. Each tree then independently predicts the class of the data point, and the class with the most votes is selected as the final prediction.

Here are the main steps involved in how a random forest classifier works:

  1. Data Preparation: The first step is to prepare the dataset by splitting it into training and testing sets. The training set is used to build the random forest model, while the testing set is used to evaluate the model’s performance.
  2. Random Sampling: During training, the random forest algorithm randomly selects a subset of the data for each decision tree to be built on. This technique is called bootstrap aggregating or bagging. Each tree in the forest is trained on a different subset of the data.
  3. Feature Sampling: In addition to random sampling of data, the algorithm also randomly selects a subset of features for each decision tree. This helps to create diversity among the decision trees and reduce overfitting.
  4. Building Decision Trees: Each decision tree in the random forest is built using a subset of the data and features. The algorithm recursively splits the data into smaller subsets based on the values of the selected features, creating a tree-like structure.
  5. Voting for Prediction: Once the decision trees have been built, the algorithm uses them to make predictions on new data points. The random forest classifier combines the predictions of all the decision trees by taking the majority vote of the class predictions.
  6. Model Evaluation: Finally, the performance of the random forest classifier is evaluated on the testing set. Common evaluation metrics include accuracy, precision, recall, and F1 score.
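
To make the voting idea concrete, here is a small sketch on purely synthetic toy data. Note that scikit-learn’s RandomForestClassifier actually averages per-tree class probabilities rather than counting hard votes, but inspecting the individual trees still illustrates the idea:

from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # 200 samples, 4 features
y = np.where(X[:, 0] + X[:, 1] > 0, 'A', 'B')   # toy labels

forest = RandomForestClassifier(n_estimators=15, random_state=0)
forest.fit(X, y)

sample = X[:1]
# Each tree predicts an index into forest.classes_
votes = [forest.classes_[int(tree.predict(sample)[0])] for tree in forest.estimators_]
print('per-tree votes   :', Counter(votes))
print('forest prediction:', forest.predict(sample)[0])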

Now we will implement these steps in our project.

The Random Forest algorithm is provided by the Scikit-learn package. Scikit-learn is a Python module for machine learning. Intel® Extension for Scikit-learn seamlessly speeds up your scikit-learn applications on Intel CPUs and GPUs across single- and multi-node configurations. Acceleration is achieved through the use of the Intel® oneAPI Data Analytics Library (oneDAL). Intel® Extension for Scikit-learn contains the scikit-learn patching functionality that was originally available in the daal4py package. This extension package dynamically patches scikit-learn estimators to improve the performance of your machine learning algorithms.

The extension is part of the Intel® AI Analytics Toolkit (AI Kit), which provides the flexibility to use machine learning tools with your existing AI packages. Using Scikit-learn with this extension, we can speed up training and inference by up to 100x with equivalent mathematical accuracy.
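
If you want to gauge the speedup on your own machine, a rough sketch like the following compares a patched and an unpatched fit. The dataset and timings are illustrative; actual speedups depend on hardware and are usually well below the best-case 100x:

import time

import numpy as np
from sklearnex import patch_sklearn, unpatch_sklearn

# Synthetic data, purely for timing purposes
X = np.random.rand(50_000, 20)
y = (X[:, 0] > 0.5).astype(int)

def time_fit():
    # Re-import after (un)patching so the active implementation is picked up
    from sklearn.ensemble import RandomForestClassifier
    start = time.time()
    RandomForestClassifier(n_estimators=100).fit(X, y)
    return time.time() - start

patch_sklearn()
print('patched fit:', round(time_fit(), 2), 's')

unpatch_sklearn()
print('stock fit  :', round(time_fit(), 2), 's')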

Prerequisites:

  • Python 3.7 or above.
  • Packages Required : scikit-learn-intelex, mediapipe, opencv-python, numpy, pickle

Installation of scikit-learn-intelex library:

You can create a virtual environment or a new conda environment on your Linux system to install these packages (to create: conda create --name env1; to activate: conda activate env1). Run the following in a Jupyter Notebook cell or in the Linux terminal:

pip install scikit-learn-intelex

Then run:

from sklearnex import patch_sklearn
patch_sklearn()

If it is successfully installed and enabled, patch_sklearn() prints a confirmation message.

Dataset:

The dataset is made manually by collecting images from the webcam for the alphabets of American Sign Language. A separate directory is created for each letter, and each directory contains 100 images of the corresponding sign.

import os
import cv2

DATA_DIR = './data'
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

number_of_classes = 24
dataset_size = 100

cap = cv2.VideoCapture(0)
for j in range(number_of_classes):
    if not os.path.exists(os.path.join(DATA_DIR, str(j))):
        os.makedirs(os.path.join(DATA_DIR, str(j)))

    print('Collecting data for class {}'.format(j))

    # Wait until the user presses 'q' to start capturing this class
    while True:
        ret, frame = cap.read()
        cv2.imshow('frame', frame)
        if cv2.waitKey(25) == ord('q'):
            break

    # Capture dataset_size frames for the current class
    counter = 0
    while counter < dataset_size:
        ret, frame = cap.read()
        cv2.imshow('frame', frame)
        cv2.waitKey(25)
        cv2.imwrite(os.path.join(DATA_DIR, str(j), '{}.jpg'.format(counter)), frame)
        counter += 1

cap.release()
cv2.destroyAllWindows()

Data Preprocessing:

Now, we will process the data.

Import the modules:

import os
import pickle
import mediapipe as mp
import cv2
import matplotlib.pyplot as plt

Load the hand landmark model:

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

# Static-image mode, since we process saved images rather than a video stream
hands = mp_hands.Hands(static_image_mode=True, min_detection_confidence=0.3)

Read the images from the dataset directory:

DATA_DIR = './data'

data = []
labels = []
for dir_ in os.listdir(DATA_DIR):
    for img_path in os.listdir(os.path.join(DATA_DIR, dir_)):
        data_aux = []

        x_ = []
        y_ = []

        img = cv2.imread(os.path.join(DATA_DIR, dir_, img_path))

Convert image to RGB:

img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

Process image and detect hand landmarks:

results = hands.process(img_rgb)

Storing the hand landmarks (x and y) of each image in an array. This array will represent the image, and the label will be the name of the directory the image is in:

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        for i in range(len(hand_landmarks.landmark)):
            x = hand_landmarks.landmark[i].x
            y = hand_landmarks.landmark[i].y

            x_.append(x)
            y_.append(y)

        for i in range(len(hand_landmarks.landmark)):
            x = hand_landmarks.landmark[i].x
            y = hand_landmarks.landmark[i].y
            data_aux.append(x - min(x_))
            data_aux.append(y - min(y_))

    data.append(data_aux)
    labels.append(dir_)

Creating a dictionary with the keys ‘data’ and ‘labels’ and saving it to the file data.pickle:

f = open('data.pickle', 'wb')
pickle.dump({'data': data, 'labels': labels}, f)
f.close()

Model Training:

Now, we will train the model.

Required libraries:

import pickle

import numpy as np
from sklearnex import patch_sklearn

# Patch scikit-learn first, then import the estimators so the accelerated
# versions from Intel Extension for Scikit-learn are picked up
patch_sklearn()

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Loading the data and converting the ‘data’ and ‘labels’ lists into NumPy arrays:

data_dict = pickle.load(open('./data.pickle', 'rb'))
data = np.asarray(data_dict['data'])
labels = np.asarray(data_dict['labels'])

Splitting the dataset into training and testing sets. The training set is used to build the random forest model, while the testing set is used to evaluate the model’s performance:

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, shuffle=True, stratify=labels)

Creating the model using random forest classifier and training the model with the training dataset:

model = RandomForestClassifier()
model.fit(x_train, y_train)

Making predictions on new data points:

y_predict = model.predict(x_test)

Computing the accuracy of the model (model evaluation):

def compute_accuracy(Y_true, Y_pred):
    correctly_predicted = 0
    # iterating over every label and checking it against the true sample
    for true_label, predicted in zip(Y_true, Y_pred):
        if true_label == predicted:
            correctly_predicted += 1
    # computing the accuracy score
    accuracy_score = correctly_predicted / len(Y_true)
    return accuracy_score

score = compute_accuracy(y_test, y_predict)

An accuracy of 100% was achieved on the test set.
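
The testing section below loads the trained model from the file model.p, so we first save it with pickle (mirroring the data.pickle step used earlier):

f = open('model.p', 'wb')
pickle.dump({'model': model}, f)
f.close()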

Model Testing:

Now, we will test the model.
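
The testing code runs as its own script, so it needs its own imports (the same packages used earlier):

import pickle

import cv2
import mediapipe as mp
import numpy as np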

Loading the trained model and capturing video from the webcam:

model_dict = pickle.load(open('./model.p', 'rb'))
model = model_dict['model']

cap = cv2.VideoCapture(0)

Loading the hand landmark model:

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

hands = mp_hands.Hands(static_image_mode=True, min_detection_confidence=0.3)

Mapping the numeric labels to the letters they represent:

labels_dict = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'K', 10: 'L', 11: 'M', 12: 'N', 13: 'O', 14: 'P', 15: 'Q', 16: 'R', 17: 'S', 18: 'T', 19: 'U', 20: 'V', 21: 'W', 22: 'X', 23: 'Y'}

Now we will read the captured frame, convert it to RGB, process it, detect the hand landmarks, iterate through all the landmarks, and store the x and y coordinates in an array.

Then we will predict and display the character on the screen:


prediction = model.predict([np.asarray(data_aux)])
predicted_character = labels_dict[int(prediction[0])]

cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), 4)
cv2.putText(frame, predicted_character, (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX, 1.3, (0, 0, 0), 3, cv2.LINE_AA)
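
For completeness, here is a minimal sketch of the whole per-frame loop described above, using the model, cap, hands and labels_dict objects set up earlier. It assumes a single detected hand per frame, and the 10-pixel padding around the bounding box is an arbitrary choice for illustration:

while True:
    ret, frame = cap.read()
    if not ret:
        break

    H, W, _ = frame.shape

    # MediaPipe expects RGB input
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(frame_rgb)

    if results.multi_hand_landmarks:
        hand_landmarks = results.multi_hand_landmarks[0]

        x_ = [lm.x for lm in hand_landmarks.landmark]
        y_ = [lm.y for lm in hand_landmarks.landmark]

        # Same feature construction as in preprocessing
        data_aux = []
        for lm in hand_landmarks.landmark:
            data_aux.append(lm.x - min(x_))
            data_aux.append(lm.y - min(y_))

        # Bounding box in pixel coordinates, with a small padding
        x1, y1 = int(min(x_) * W) - 10, int(min(y_) * H) - 10
        x2, y2 = int(max(x_) * W) + 10, int(max(y_) * H) + 10

        prediction = model.predict([np.asarray(data_aux)])
        predicted_character = labels_dict[int(prediction[0])]

        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), 4)
        cv2.putText(frame, predicted_character, (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.3, (0, 0, 0), 3, cv2.LINE_AA)

    cv2.imshow('frame', frame)
    if cv2.waitKey(1) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()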

And the Output:

A small tribute to Intel.
Watch the video to see the detection of all ASL alphabets from A to Z (except ‘J’ and ‘Z’).

Github link to the project:

Taking the model to the next level:

  1. We can extend the capabilities of this model by training it to detect more ASL phrases and ASL numbers.
  2. Right now, it does not detect the letters ‘J’ and ‘Z’ because of the motion associated with their signs. We can train the model to detect those letters too.
  3. We can also train the model to frame sentences using the detected ASL alphabets.
  4. We can detect the alphabet and generate an audio-based output for the user.

This project was showcased by Vatika Agrawal (myself) at the Intel oneAPI-based Machine Learning Hackathon organized by Intel in association with the T&P Cell of Sikkim Manipal Institute of Technology.

Thank you for your time. Here is a token for coming this far.

Image Source: https://icoholder.com/files/img/5c53d14a91e5f9922897312f67d26d96.jpeg
