Slovo: Russian Sign Language Dataset

Alexander Nagaev
May 30, 2023

This article introduces Slovo, a new dataset for Russian Sign Language (RSL) recognition. Below you will find the motivation for creating it, its description, and an example of using it to train a neural network. At the end, there are resources with more information about the dataset, links to download it, and pretrained models.

Introduction

The deaf community still struggles in many everyday situations. Since only a few organizations have a sign language interpreter on staff, native sign language users can be misunderstood in critical situations, such as interactions with healthcare providers or with consultants in banks, government institutions, airports, and other public places. Moreover, sign language users face social isolation, an education gap with the hearing population, and difficulties in finding employment.

Sign Language Recognition (SLR) systems can enable more transparent communication between people with different hearing and speaking abilities and can be integrated into human-computer interaction systems. A sign language learning app or a recognition feature embedded in video conferencing apps could simplify education and employment processes.

One of the main problems of SLR is data collection, since it is difficult to find signers who know RSL. In our paper, we introduced the Slovo dataset to address the sign language recognition task in Russia and described the pipeline used to create it.

Dataset Description

Dataset Content. Slovo is the most subject-heterogeneous dataset for Russian Sign Language (RSL) recognition: 194 crowdworkers participated in the video recording. It contains 20,000 videos divided into 1,000 frequently used glosses and short phrases in RSL. The dataset was collected mainly indoors and varies in scenes and lighting conditions.

Video Quality. The videos were recorded primarily in HD and FullHD formats. About 86% of the videos are oriented vertically, 13% are oriented horizontally, and 1% are in square format. The average video length is 1.67 seconds, and the overall duration of the dataset is about 19.81 hours.

Dataset Splitting. The data was split into training (75%) and test (25%) sets, containing 15 and 5 video samples per class, respectively. The training and test sets contain 112 and 174 subjects, respectively. Note that the groups of subjects in the two sets intersect; however, we tried to minimize the overlap by filling the test set with inactive users.

Time Interval Annotation. Collected videos may contain uninformative frames at the beginning and end, where workers turn the camera on and off and prepare to show the gesture. Therefore, where necessary, we annotated the start and end time of the gesture in each video. After trimming the gestures, we were left with segments at the beginning and end of the videos where no gesture is shown, and we used them as “no event” samples in training to predict the absence of action. This special class adds 400 samples to the dataset.
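As a rough illustration of this step, the trimming itself can be reproduced with ffmpeg once the start and end times are known. The paths, timestamps, and the assumption that the annotation stores times in seconds are all illustrative here:

import subprocess

def trim_clip(src_path: str, dst_path: str, begin: float, end: float) -> None:
    # Keep only the [begin, end] interval (in seconds) of src_path and drop the audio track.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path,
         "-ss", str(begin), "-to", str(end),
         "-c:v", "libx264", "-an", dst_path],
        check=True,
    )

# Hypothetical example: the timestamps come from the time-interval annotation.
trim_clip("raw/some_video.mp4", "trimmed/some_video.mp4", begin=0.8, end=2.4)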

Russian Sign Language Recognition

In this section, we provide a tutorial showing how to use Slovo to build an SLR system that recognizes Russian Sign Language gestures.
Three versions of the dataset are provided: Trimmed HD, Original HD, and Original 360. Let’s download the trimmed version:
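A minimal download sketch is shown below. The download link is not reproduced here; copy the link for the trimmed version from the repository README and substitute it for the placeholder:

# <TRIMMED_HD_LINK> is a placeholder – take the actual link from the repository README
wget -O slovo_trimmed.zip "<TRIMMED_HD_LINK>"
unzip -q slovo_trimmed.zip -d slovo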

We have already divided the dataset into training (15,300 videos) and test (5,100 videos) sets. Each video comes with its id, user id, and meta information (height, width, and length).
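A quick way to check the split and the metadata is to load the annotation file with pandas. The file name and column names below are assumptions based on the description above; adjust them to whatever the downloaded archive actually contains:

import pandas as pd

# Assumed layout: tab-separated file with video id, user id, gloss text,
# resolution, length, and a train/test flag.
ann = pd.read_csv("slovo/annotations.tsv", sep="\t")
print(len(ann))                        # total number of samples
print(ann["text"].nunique())           # number of classes, incl. "no event"
print(ann.groupby("train").size())     # train/test sizes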

For our experiments, we used mmaction2 as it provides an off-the-shelf solution for many computer vision tasks. So let’s install mmaction2:
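The commands below follow the OpenMIM installation route from the mmaction2 documentation; pin the versions if you need to reproduce our exact environment:

pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
mim install mmaction2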

mmaction2 uses its own dataloader, so we need to prepare annotation files for training and testing:
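mmaction2’s VideoDataset expects a plain-text annotation file with one “<video path> <label>” pair per line. A sketch of the conversion is given below; as before, the column names of the Slovo annotation file are assumptions:

import pandas as pd

ann = pd.read_csv("slovo/annotations.tsv", sep="\t")

# Map every gloss (plus the "no event" class) to an integer label.
label_map = {text: i for i, text in enumerate(sorted(ann["text"].unique()))}

for split, flag in [("train", True), ("test", False)]:
    part = ann[ann["train"] == flag]
    with open(f"slovo_{split}.txt", "w") as f:
        for _, row in part.iterrows():
            # One "<relative video path> <integer label>" pair per line.
            f.write(f"{row['attachment_id']}.mp4 {label_map[row['text']]}\n")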

Slovo is divided into 1,000 classes and one “no event” class.

The next step is creating a config file for model training. To demonstrate the use of our dataset, we decided to finetune MViT-Small pretrained on the Kinetics400 dataset.

You can find the full code for creating the config in our notebook on Kaggle.
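A shortened sketch of such a config is given below: we load one of the MViT-Small Kinetics-400 configs shipped with mmaction2 and override the head size, the annotation files, and the checkpoint to finetune from. The base config name, attribute paths, and checkpoint placeholder are assumptions that depend on your mmaction2 version; the full, working config is in the Kaggle notebook.

from mmengine.config import Config

# Base config name is an assumption – pick the MViT-Small Kinetics-400 config
# that ships with your mmaction2 version.
cfg = Config.fromfile(
    "configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py"
)

cfg.model.cls_head.num_classes = 1001                  # 1,000 glosses + "no event"
cfg.train_dataloader.dataset.ann_file = "slovo_train.txt"
cfg.val_dataloader.dataset.ann_file = "slovo_test.txt"
cfg.train_dataloader.dataset.data_prefix = dict(video="slovo/")
cfg.val_dataloader.dataset.data_prefix = dict(video="slovo/")
cfg.load_from = "<kinetics400_pretrained_checkpoint>"  # weights to finetune from

cfg.dump("configs/mvit_small_slovo.py")

The dumped file is what you later pass to tools/train.py.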

You can experiment with the SampleFrames options (see the pipeline excerpt after the list):

  • clip_len – number of input frames
  • frame_interval – frame sampling step from video
  • num_clips – number of resamplings from one video (set greater than 1 if input videos are long enough)
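For reference, the SampleFrames step sits inside the data pipeline of the config. The excerpt below is an illustrative train pipeline close to what the Kinetics configs use, not the exact pipeline from our experiments:

train_pipeline = [
    dict(type="DecordInit"),
    dict(type="SampleFrames", clip_len=16, frame_interval=2, num_clips=1),
    dict(type="DecordDecode"),
    dict(type="Resize", scale=(-1, 256)),
    dict(type="RandomResizedCrop"),
    dict(type="Resize", scale=(224, 224), keep_ratio=False),
    dict(type="FormatShape", input_format="NCTHW"),
    dict(type="PackActionInputs"),
]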

And start training:

python tools/train.py <path_to_config>

Resources

The whole dataset, the trial version with 100 images per class, pre-trained models, and the demo are publicly available in the repository.

Other links: arXiv, Kaggle, Habr
