Sign Language Recognition in XR

Sourav Das · Published in XRPractices · Apr 18, 2023

Communication is a challenge for people with limited or no hearing, making it difficult for them to interact effectively with the hearing community. Sign language is their mode of communication, but it is not universally understood. This can lead to barriers in communication, social isolation, and difficulties in accessing education, employment, and other essential services.

Sign language recognition using AI is an emerging technology that has the potential to break down communication barriers for the hard of hearing community. Sign language recognition systems use computer vision and machine learning algorithms to translate sign language into text or speech. However, one of the main challenges in sign language recognition is the variability in sign language across regions and individuals.

In this blog, we will explore how advancements in AI and XR are helping to overcome these challenges and improve accessibility for the hard of hearing community.

Highlights

  1. Installing and importing all the dependencies
  2. Gathering data and structuring
  3. Creating and training model
  4. Usage of the model
  5. Example of how we can integrate the model and build a 3D XR app

Getting started

We will start by creating a new Jupyter notebook in Visual Studio Code. Open the Command Palette (Cmd + Shift + P) and search for “Create: New Jupyter Notebook” to create it. Then add code cells in the notebook to try out the code.

Creating Dataset And Training Model

1. Installing and importing all the dependencies

We will use TensorFlow and Keras to build and train the deep learning model, OpenCV and MediaPipe to capture video and extract key points, and scikit-learn (sklearn) to split the dataset. Now let’s install and import all the required modules.

# install the required Python modules
# (TensorFlow 2.x bundles GPU support, so a separate tensorflow-gpu install is not needed)
!pip3 install tensorflow==2.12.0
!pip3 install matplotlib
!pip3 install mediapipe
!pip3 install opencv-python
!pip3 install numpy
!pip3 install -U scikit-learn
import cv2
import numpy as np
import os
import time
import mediapipe as mp
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard
mp_holistic = mp.solutions.holistic # Holistic model
mp_drawing = mp.solutions.drawing_utils # Drawing utilities

2. Capturing data and structuring

Now we need to create the folder structure to store the data. For the first frame of the first video, the file path will look like ./OurProject/Data/Hello/video1/frame1.npy

Here ‘Hello’ is the sign action for which we are collecting data, and we store the data for multiple sign actions inside the ‘Data’ folder in the project.

DATA_PATH = os.path.join("Data")

# Actions that we want to train
actions = np.array(['Hello', 'How are you'])

# Thirty videos worth of data
no_of_videos = 30

# Videos are going to be 30 frames in length
FPS = 30

# Creating the folder structure
for action in actions:
    for sequence in range(1, no_of_videos + 1):
        try:
            dir_name = os.path.join(DATA_PATH, action, "video{}".format(str(sequence)))
            os.makedirs(dir_name)
        except FileExistsError:
            pass

In this step, we need to record some videos to train our model. It is important that the recorded signs are performed accurately and consistently, so that the model can learn well.

We can use OpenCV to capture the video and MediaPipe to extract the key points. The extracted key points are saved to disk with NumPy’s save() function.

def mediapipe_detection(image, model):
    # convert BGR (OpenCV) to RGB (MediaPipe), run the model, then convert back
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image.flags.writeable = False
    results = model.process(image)
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    return image, results

def extract_keypoints(results):
    # flatten pose, face and hand landmarks into a single 1662-value vector,
    # padding with zeros when a landmark group is not detected
    pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten() if results.pose_landmarks else np.zeros(33*4)
    face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten() if results.face_landmarks else np.zeros(468*3)
    lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() if results.left_hand_landmarks else np.zeros(21*3)
    rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() if results.right_hand_landmarks else np.zeros(21*3)
    return np.concatenate([pose, face, lh, rh])


# Capturing videos using OpenCV
cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    for action in actions:
        for sequence in range(1, no_of_videos + 1):
            for frame_num in range(FPS + 1):  # skipping frame 0, so FPS + 1
                _, frame = cap.read()
                image, results = mediapipe_detection(frame, holistic)

                # showing a preview
                cv2.putText(image, 'Collecting frames for {} Video Number {}'.format(action, sequence), (15,12), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1, cv2.LINE_AA)
                cv2.imshow('Captured image', image)
                if frame_num == 0:
                    # short pause at the start of every video so the signer can get ready
                    cv2.putText(image, 'STARTING COLLECTION', (120,200), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255, 0), 4, cv2.LINE_AA)
                    cv2.imshow('Captured image', image)
                    cv2.waitKey(500)
                    continue

                # Export keypoints
                keypoints = extract_keypoints(results)
                npy_path = os.path.join(DATA_PATH, action, "video{}".format(str(sequence)), "frame{}".format(str(frame_num)))
                np.save(npy_path, keypoints)

                # Break gracefully
                if cv2.waitKey(10) & 0xFF == ord('q'):
                    break

cap.release()
cv2.destroyAllWindows()
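Each saved frame should be a flat vector of 1662 values: 33 pose landmarks × 4 values, 468 face landmarks × 3 values, and 21 landmarks × 3 values for each hand. As a quick sanity check, we can load one of the saved files and confirm the shape, which also matches the input size used for the model later.

# optional sanity check: 33*4 + 468*3 + 21*3 + 21*3 = 1662 values per frame
sample = np.load(os.path.join(DATA_PATH, 'Hello', 'video1', 'frame1.npy'))
print(sample.shape)  # expected: (1662,)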

3. Creating and training the model

Next, a deep learning model can be built using Keras, which provides a user-friendly API for building neural networks. One common approach is to use the Sequential model in Keras, which allows for easy stacking of layers. For example, the model can include multiple LSTM layers for sequence modeling, followed by one or more Dense layers for classification. LSTM layers are particularly useful for modeling temporal dependencies and long-term memory in sequential data, which is important for sign language recognition. Dense layers can be used to map the LSTM output to a classification label.

# importing necessary modules
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard
# configuring the logging directory
log_dir = os.path.join('Logs')
tb_callback = TensorBoard(log_dir=log_dir)
#creating model
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(30,1662)))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(actions.shape[0], activation='softmax'))

model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])

Before training the model, it is important to split the dataset into training and validation sets. Scikit-learn’s train_test_split() function can be used for this purpose, which randomly splits the dataset into a specified ratio of training and validation data. This helps to prevent overfitting and provides a more accurate estimate of the model’s performance on unseen data.

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

actions_map = {action: action_index for action_index, action in enumerate(actions)}

# Loading the data from file
sequences, labels = [], []
for action in actions:
    for sequence in range(1, no_of_videos + 1):
        window = []
        for frame_num in range(1, FPS + 1):  # saved frames are frame1 ... frame30
            res = np.load(os.path.join(DATA_PATH, action, "video{}".format(str(sequence)), "frame{}.npy".format(frame_num)))
            window.append(res)
        sequences.append(window)
        labels.append(actions_map[action])


# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    np.array(sequences),
    to_categorical(labels).astype(int),
    test_size=0.05
)

Let’s now train the model using the split data.

model.fit(X_train, y_train, epochs=2000, callbacks=[tb_callback])
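After training, it is worth checking how the model performs on the held-out test set before saving it. A minimal check using the X_test and y_test arrays created above:

# evaluate on the held-out test data
loss, accuracy = model.evaluate(X_test, y_test)
print("Test categorical accuracy: {:.2f}".format(accuracy))

# compare a few predictions against the true labels
y_pred = model.predict(X_test)
print([actions[i] for i in np.argmax(y_pred, axis=1)])  # predicted
print([actions[i] for i in np.argmax(y_test, axis=1)])  # actual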

Once the model has been trained, it can be saved in a .h5 file format using Keras. This allows the model to be easily loaded and used for future predictions without the need for retraining. The saved model can be loaded using the load_weights() function provided by Keras.

model.save('model.h5')

"""
to load the model again from the h5 file,
delete the existing model object by using 'del model'
then recreate the model and instead of train the model again, use
model.load_weights('action.h5')
"""

Usage Of The Created Model

We can use OpenCV to capture the camera feed, or provide a streaming URL as the video input. Then we extract the key points using MediaPipe and feed them to the model for prediction.

predictions = []
sequence = []
threshold = 0.99

with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    cap = cv2.VideoCapture(0) # Here we can provide Streaming URL or camera

    if not cap.isOpened():
        exit()

    while True:
        _, frame = cap.read()
        if frame is None:
            cap.release()
            break

        _, results = mediapipe_detection(frame, holistic)
        keypoints = extract_keypoints(results)
        sequence.append(keypoints)
        sequence = sequence[-30:]
        cv2.imshow('video', frame)

        if len(sequence) == 30:
            res = model.predict(np.expand_dims(sequence, axis=0))[0]
            predictions.append(np.argmax(res))

            if np.unique(predictions[-10:])[0] == np.argmax(res):
                accuracy = res[np.argmax(res)]
                if accuracy > threshold:
                    words = actions[np.argmax(res)]
                    print(words)
                    # do something with detected word

        if cv2.waitKey(10) & 0xFF == ord('q'):
            cap.release()
            cv2.destroyAllWindows()
            break
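The “# do something with detected word” placeholder is where the translation becomes useful, for example by speaking the recognised word aloud. A minimal sketch, assuming the pyttsx3 text-to-speech package is installed (pip3 install pyttsx3):

import pyttsx3

tts_engine = pyttsx3.init()

def speak(word):
    # speak the detected sign out loud
    tts_engine.say(word)
    tts_engine.runAndWait()

# inside the prediction loop above, call speak(words) alongside print(words)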

Integrating all in an XR app

The trained sign language recognition model can be integrated into a variety of applications. For instance, the model can be integrated into a mobile app or a wearable device that can capture video data and translate them into corresponding text or speech output. The model can also be integrated into a smart home device that can recognise sign language commands and control home appliances such as lights, thermostat, or TV.

Here are the steps to create an AR application. We tried running it on a phone as well as on smart glasses such as the Lenovo ThinkReality A3.

1. Installing IP Webcam

We will start by installing the IP Webcam app, which is available on the Play Store. It streams the camera feed from the phone or smart glasses over the network. After installing it, open the app and tap “Start server”.
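Before wiring everything together, it is worth confirming that the stream is reachable from the machine that will run the model. A minimal check with OpenCV and urllib, assuming the app shows an address like 192.168.1.5:8080 (replace it with the one displayed on your device):

import urllib.request
import cv2
import numpy as np

# fetch a single frame from the IP Webcam app and decode it
shot_url = "http://192.168.1.5:8080/shot.jpg"  # use the address shown in the IP Webcam app
with urllib.request.urlopen(shot_url) as response:
    image_bytes = np.frombuffer(response.read(), dtype=np.uint8)
frame = cv2.imdecode(image_bytes, cv2.IMREAD_COLOR)
print("Frame shape:", None if frame is None else frame.shape)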

2. Setting up the script for sign language detection

We will write a class SignLanguageDetection whose predict_image method takes the streaming URL and a callback bound to the socket. It fetches the camera feed from the URL and starts predicting. Whenever an action matches, we use the socket to send back the detected word.

Save this code into a detectAction.py file.

import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense


class SignLanguageDetection():
    def __init__(self):
        self.mp_holistic = mp.solutions.holistic
        self.mp_drawing = mp.solutions.drawing_utils
        self.actions = np.array(['Hello', 'How are you'])  # must match the labels used for training
        model = Sequential()
        model.add(LSTM(64, return_sequences=True,
                       activation='relu', input_shape=(30, 1662)))
        model.add(LSTM(128, return_sequences=True, activation='relu'))
        model.add(LSTM(64, return_sequences=False, activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(self.actions.shape[0], activation='softmax'))
        model.compile(optimizer='Adam', loss='categorical_crossentropy',
                      metrics=['categorical_accuracy'])
        model.load_weights('model.h5')  # provide the h5 file path here
        self.model = model

    def extract_keypoints(self, results):
        pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten(
        ) if results.pose_landmarks else np.zeros(33*4)
        face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten(
        ) if results.face_landmarks else np.zeros(468*3)
        lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten(
        ) if results.left_hand_landmarks else np.zeros(21*3)
        rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten(
        ) if results.right_hand_landmarks else np.zeros(21*3)
        return np.concatenate([pose, face, lh, rh])

    def mediapipe_detection(self, image, model):
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False
        results = model.process(image)
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        return image, results

    def predict_image(self, url, send_data_func):
        predictions = []
        sequence = []
        threshold = 0.99
        holistic = self.mp_holistic.Holistic(
            min_detection_confidence=0.5,
            min_tracking_confidence=0.5
        )
        cap = cv2.VideoCapture(url)

        if not cap.isOpened():
            return
        while True:
            _, frame = cap.read()
            if frame is None:
                cap.release()
                break

            _, results = self.mediapipe_detection(frame, holistic)
            keypoints = self.extract_keypoints(results)
            sequence.append(keypoints)
            sequence = sequence[-30:]

            if len(sequence) == 30:
                res = self.model.predict(np.expand_dims(sequence, axis=0))[0]
                predictions.append(np.argmax(res))

                if np.unique(predictions[-10:])[0] == np.argmax(res):
                    accuracy = res[np.argmax(res)]
                    if accuracy > threshold:
                        words = self.actions[np.argmax(res)]
                        send_data_func(words, accuracy)

        holistic.close()

3. Setting up the Python server for sign language detection

We will use Flask to create the server and flask_sock to implement the WebSocket. We create a /start-prediction endpoint to start the prediction process. This endpoint is a WebSocket, so the connection stays open and we can send back a predicted word whenever one is detected.

Create a file called app.py with this code.

import json

from flask import Flask
from flask_sock import Sock
from detectAction import SignLanguageDetection

app = Flask(__name__)
socket = Sock(app)


def FetchData(send_data_func):
    """
    Start the IP Webcam application
    and replace the url with the one shown by the IP Webcam app.
    /shot.jpg should be there at the end of the URL.
    """
    ip_webcam_streaming_url = "https://192.0.0.0:8080/shot.jpg"

    signDetection = SignLanguageDetection()
    signDetection.predict_image(ip_webcam_streaming_url, send_data_func)


def send_detected_word_closure(sock):
    def send_detected_word(word, accuracy):
        # send the prediction as a JSON string so the client can parse it
        sock.send(json.dumps({"word": str(word), "accuracy": float(accuracy)}))
    return send_detected_word


@socket.route('/start-prediction', methods=['GET'])
def process_frame(sock):
    FetchData(send_detected_word_closure(sock))
    sock.close()


app.run(host="0.0.0.0", port=1234)

Start the server using “python3 app.py”. The console output shows the address where the server is running; we will use this URL when creating the Android app.
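Before building the Unity client, we can verify the WebSocket endpoint from Python. A quick sketch, assuming the websocket-client package is installed (pip3 install websocket-client) and the server is reachable at 192.168.1.10 (replace it with your server’s address):

import websocket  # from the websocket-client package

# connect to the prediction endpoint and print whatever the server sends
ws = websocket.create_connection("ws://192.168.1.10:1234/start-prediction")
while True:
    message = ws.recv()  # blocks until a prediction arrives
    print(message)       # e.g. {"word": "Hello", "accuracy": 0.99}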

4. Creating the app to display the result

Fire up Unity Hub and create a project with the ARCore template. (For the ThinkReality A3, we need to use the Snapdragon Spaces SDK to build the app.)

  • Add the WebSocketSharp package to the project (the script below also uses Newtonsoft.Json, so make sure the Newtonsoft Json package is available).
  • Add a canvas and a text box to the scene. Change the Render Mode of the canvas to Screen Space - Camera.
  • Create a script called FetchData.cs, attach it to the canvas, and drag and drop the text box GameObject into the script's textBox field. This script connects to the /start-prediction endpoint and displays the predicted words received from the Python server. Here is the code for the FetchData.cs file:
using System.Collections;
using System.Collections.Generic;
using Newtonsoft.Json;
using TMPro;
using UnityEngine;
using WebSocketSharp;

public class FetchData : MonoBehaviour
{
    [SerializeField] private TextMeshProUGUI textBox;

    private WebSocket _ws;
    private int _id;
    private Queue<Prediction> _predictions = new();

    void Start()
    {
        StartCoroutine(FetchPredictions());
    }

    private void Update()
    {
        UpdatePrediction();
    }

    private void UpdatePrediction()
    {
        if (_predictions.Count is 0) return;

        var prediction = _predictions.Dequeue();
        textBox.text = prediction.word;
        _id = new System.Random().Next();
        StartCoroutine(ClearText(_id));
    }

    private IEnumerator ClearText(int id)
    {
        yield return new WaitForSeconds(2f);
        if (_id == id)
        {
            textBox.text = "";
        }
    }

    private IEnumerator FetchPredictions()
    {
        const string url = "ws://192.0.0.0:1234"; // replace it with the python server URL
        var uri = $"{url}/start-prediction";
        _ws = new WebSocket(uri); // calling the /start-prediction API
        _ws.OnMessage += OnWebSocketMessage;
        _ws.OnError += OnWebSocketError;

        _ws.Connect();
        while (!_ws.IsAlive)
        {
            textBox.text = "Waiting for server";
            yield return new WaitForSeconds(2f);
            _ws.Connect();
        }
        OnWebSocketOpen();
    }

    private void OnWebSocketMessage(object sender, MessageEventArgs e)
    {
        _predictions.Enqueue(JsonConvert.DeserializeObject<Prediction>(e.Data));
    }

    private void OnWebSocketError(object sender, ErrorEventArgs e)
    {
        // do something
    }

    private void OnWebSocketOpen()
    {
        Prediction prediction = new()
        {
            word = "Connected to server",
            accuracy = 1f
        };

        _predictions.Enqueue(prediction);
    }

    private void OnDestroy()
    {
        _ws.Close();
    }
}
  • Create Prediction.cs to parse the JSON data with this code:
using System;

[Serializable]
public class Prediction
{
    public string word;
    public float accuracy;
}

That’s all. We can now build the app and try it out.

Here is a quick demo of the concept.

Result

Conclusion

This article shows the possibility of building a sign language detection use case in XR and using technology inclusively. With 5G and edge computing support, such use cases will become even more practical.

Acknowledgement

All the development work for this sign language recognition project was done by Sindhu Rathod, Sai Charan Abbireddy, and myself. We would like to thank Kuldeep Singh, Neelarghya, and Raju K for their constant guidance.
