GSoC 2023 - Red Hen Lab

Dhruv Tyagi
May 6, 2023



Extraction of Gesture Features

This blog is maintained by Dhruv Tyagi about the updates on the progress of the GSoC 2023 project with Red Hen Lab.

Introduction

Hello there! I am an undergraduate student in my 3rd year (as of May 2023) pursuing electrical engineering at Harcourt Butler Technical University (HBTU), Kanpur. I have a keen interest and decent experience in Python, Machine learning, Artificial Intelligence and Computer Vision. This summer, as part of Google Summer of Code 2023, I am grateful for the opportunity to make a valuable contribution to Red Hen Lab.

As a student developer, I will be working for Red Hen Labs and this blog is to document my weekly progress on the project that I have proposed to work on.


About Red Hen Lab

The International Distributed Little Red Hen Lab™ is a global big data science laboratory and cooperative for research into multimodal communication. Red Hen’s main goal is the theory of multimodal communication. See Overview of the Red Hen Vision and Program.

Red Hen builds tools across a range of tasks, including automated data acquisition, distributed data storage, data enhancement, joint text, sound, and vision parsing, statistical analysis, multimodal search engines, user interfaces, presentation tools, publishing platforms, and pedagogical applications. We develop open-source software at RedHenLab on GitHub.

Project Details

The project I am working on is ‘Extraction of Gesture Features’.

For the GitHub repo of the work on this project, check here: https://github.com/Dhruv-0001/RedHen-Gesture-Features

Check the proposal @ GSoC website: Click Here

Mentors :

Abstract: Human interaction is multimodal: we not only articulate words but also show them. Expressing concepts such as time, place, and emotion involves speech accompanied by body movements called gestures. This project aims to detect commonly co-occurring phrase and gesture combinations in order to extract meaningful patterns and insights. For this, I propose a multi-modal, multi-phase pipeline that captures different patterns of body gestures and speech phrases aligned in time. The three-stage pipeline begins with pose feature extraction, in which both supervised and unsupervised approaches are incorporated to classify and cluster various gesture classes. In the next stage, phrase features are extracted. The extracted features are then passed to the final stage, which produces a synchronized dataset by aligning them in time using late fusion. Lastly, the Apriori algorithm is applied to identify frequent phrase and gesture combinations. Further, the project can be merged into production in a Red Hen pipeline.

Project Goals

The main aim of this study is to find the relationship between the body language (gestures) and the sentences (phrases) of a speaker. For this, we need to find frequently occurring gesture and phrase combinations. After identifying the common gestures associated with each phrase, we can create a gesture dictionary. This dictionary could then be used to recognize and interpret the gestures of speakers in real time. To accomplish this, the following objectives need to be met:

Gesture Pipeline -

Stage 1 (Extracting Keypoints)

• Rebuild the Ellen DeGeneres Show video dataset.

• Dataset Pre-processing and shot segmentation for better accuracy.

• Create a Singularity sandbox from the openpose.def file.

• Extract pose key points using OpenPose containers.

• Track and assign unique IDs to the poses of each individual.

• Reshape the 3D tensor data into a 2D tensor.

Stage 2 (Supervised Approach)

• Apply the LRCN approach for supervised gesture prediction.

• Pass the 2D tensor through a combination of convolutional and LSTM layers in a single model.

• Store the result in a time series array.

Stage 3 (Unsupervised Approach, conditional)

• Normalize and pre-process key points according to the view-invariant transformation.

• Employ an RNN with an encoder-decoder architecture on the processed coordinates.

• Implement the K-nearest neighbour classifier with k = 1 to compute the action recognition accuracy and evaluate the performance of the model.

• Store the result in a time series array.

Phrase Pipeline -

• Extract audio (.mp3) from videos.

• Create a text transcript using an automatic speech recognition tool.

• Segment the video/audio based on the spoken phrases.

• Implement speaker diarization to identify the speaker when multiple persons are present.

• Extract the relevant phrases.

• Store the result in a time series array.

Synchronization Pipeline -

• Create a synchronized dataset by aligning the extracted set of features in time.

• Create a final feature array by late fusing the above two pipelines.

• Identify frequent combinations using the Apriori algorithm (a small sketch follows this list).

• Store the results in a gesture dictionary for further usage.
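
As an illustration of this last step, here is a minimal sketch of frequent-itemset mining with the Apriori algorithm using the mlxtend library. The library choice and the gesture/phrase labels below are illustrative, not part of the final pipeline.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Each "transaction" is one time-aligned window: the gesture class detected
# in that window plus the phrase category spoken at the same time (toy data).
transactions = [
    ["gesture:Expressed", "phrase:emotion"],
    ["gesture:Expressed", "phrase:emotion"],
    ["gesture:ArmsVertical", "phrase:place"],
    ["gesture:Expressed", "phrase:time"],
]

# One-hot encode the transactions and mine itemsets that occur in at least
# half of the windows; pairs like {gesture:Expressed, phrase:emotion} are
# exactly the gesture/phrase combinations we are after.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)
frequent = apriori(df, min_support=0.5, use_colnames=True)
print(frequent.sort_values("support", ascending=False))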

Community Bonding Period (May 4–28)

Starting: I received the proposal acceptance from Red Hen Lab on May 5th. After that, I contacted my mentors to discuss the details of the project. We also tackled the challenge of time-zone differences by establishing a regular meeting schedule that accommodates everyone’s availability. Mentor Swadesh provided me with a sample video from the dataset and a .eaf file with annotations. I started working on it to get more insight into the project requirements.

Setting up CWRU HPC Account: I also received my Case Western Reserve University (CWRU) ID to gain access to the High-Performance Computing (HPC) clusters and successfully set up the connection to the clusters through VPN access. This video helped me set up my server.

Welcome Meet: Attended the welcome meet held by the Red Hen mentors and founders. I got to know more about Red Hen and the whole team.

Coding Period

The long-awaited coding period for GSoC 2023 has finally begun, and I couldn’t be more thrilled and filled with anticipation!

Week 1 & 2 (29 May — 11 June 2023)

Unfortunately, I fell ill with a fever during this period. As a result, I was unable to work during the initial days as planned. But I still managed to do the work discussed below.

I had a meeting with one of my mentors, Swadesh. He explained Docker, sandboxes, and OpenPose, and how to build them. Later, I explored Singularity containers and got an idea of how they work.

I then emailed Mark Turner (co-director of Red Hen Lab) to get access to the dataset. The dataset, consisting of Ellen DeGeneres Show videos and gesture annotations, is available through Google Drive. However, some of the videos are not publicly available on any online platform. Out of the 30 videos in the dataset, 19 are present online (on YouTube or Dailymotion) and 11 are not available: 3 are private links, while 8 could not be found.

The dataset contains the video files as well as ELAN annotations in .eaf format that can be read with the ELAN software. ELAN is a professional tool for manually and semi-automatically annotating and transcribing audio or video recordings. I spent some time learning about it. The following article and video could be helpful.
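
For reading the .eaf annotations programmatically, here is a minimal sketch using the pympi-ling package; the package is my choice for this sketch, and the file name and the "Gesture" tier name are placeholders that depend on the actual annotation file.

from pympi import Elan

# Load an ELAN annotation file (file name is a placeholder)
eaf = Elan.Eaf("sample_annotations.eaf")
print(eaf.get_tier_names())

# Each annotation on a tier is a (start_ms, end_ms, value) tuple
for start_ms, end_ms, value in eaf.get_annotation_data_for_tier("Gesture"):
    print(f"{start_ms / 1000:.2f}s - {end_ms / 1000:.2f}s : {value}")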

I also got to know about the use of screen and tmux. Tmux helps to manage sessions efficiently, enabling us to detach from and reattach to sessions at will. This feature is particularly handy when we need to switch between different tasks or disconnect from a remote machine while preserving our work environment. Know more about tmux from here.

In the meantime, I also started to work on shot segmentation to distinguish between camera changes, i.e., changes in scene and shot angle. I decided to use shot segmentation before tracking because identifying the different shots, scenes, and angles in a video improves the accuracy and reliability of the tracking results.

Week 3 (12–18 June 2023)

This week I worked on subtitle generation and shot segmentation.

For shot segmentation, I used ‘PySceneDetect’, a Python library for detecting shot boundaries and dividing a video into scenes based on changes in the visual content. I used the code given below:

from scenedetect import detect, ContentDetector
from scenedetect.video_splitter import split_video_ffmpeg

# Detect shot boundaries based on changes in the visual content
scene_list = detect('File_Path.mp4', ContentDetector())

# Write each scene's start/end timecodes and frame numbers to a text file
with open('output.txt', 'w') as f:
    for i, scene in enumerate(scene_list):
        print(' Scene %2d: Start %s / Frame %d, End %s / Frame %d' % (
            i + 1,
            scene[0].get_timecode(), scene[0].get_frames(),
            scene[1].get_timecode(), scene[1].get_frames(),), file=f)

# Split the video into one file per detected scene using ffmpeg
split_video_ffmpeg('File_Path.mp4', scene_list)

You can check the output text and sample video files here.

Later this week, I had a meeting with my mentor Swadesh, and we discussed the next tasks I need to work on. We also discussed the tracking of speakers and the assignment of IDs.

In this project, we intend to detect the speakers and assign each speaker a unique ID throughout the video, so that we can tell which person/ID is speaking and making a gesture at a given instant. For this, we could build a CNN model that extracts the features of each ID and saves them in a database, so that the next ID is assigned according to those features. This process is known as Person Re-Identification. Check out the following papers and models to know more.

But the above process would take up a lot of our time, so my mentor and I decided to do the tracking on the segmented videos instead. This way, we need not worry about the speaker IDs changing across shots.

Going forward, I started to work on subtitle generation. For this, I followed two steps:

  • Converting the video to audio (.mp3 format).
  • Using OpenAI Whisper to generate subtitles with timestamps from the audio.

1. To convert video to audio, I used ‘ffmpeg’:

cd [Address of the folder containing the MP4 Video file]
ffmpeg -i [Video File Name].mp4 [Audio File Name].mp3

Here is a good article with the instructions. Initially, I faced problems installing ffmpeg and setting it up; for installation, you can use the following articles: article 1, article 2.

2. To extract subtitles from audio, I used OpenAI Whisper. I used the large multilingual model to get the best accuracy.

To know about the installation and usage check the official documentation of Whisper. Below is the code that I ran on Google Colaboratory -

!nvidia-smi # Make sure you are using GPU

! pip install git+https://github.com/m-bain/whisperx.git

from google.colab import files
uploaded = files.upload()

# Transcribe the audio and word-align it with whisperX
!whisperx output_audio.mp3 --model medium.en --output_dir . --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --highlight_words True

# Burn the generated word-highlighted .ass subtitles into a preview video
!ffmpeg -i Bethenny_Frankel_01.mp4 -vf "ass=output_ass.ass" new_video_out.avi

Check out the sample output files here.

Week 4 (19–25 June 2023)

I updated mentor Swadesh Jana on the work done so far. We decided on the work for the week, which is:

  • Subtitle Segmentation
  • Building an OpenPose container to extract key points.

I also got to know about the HPC OnDemand web portal, which is a hassle-free way to access the CWRU HPC server.

I started by working on subtitle segmentation. For this, I wrote a simple piece of code to segment the subtitles for each segmented video. It gives an output .srt file as shown below:

1
00:00:00,929 --> 00:00:01,049
<u>We're</u> back with Bethenny Frankel, and switching gears a little bit.

2
00:00:01,049 --> 00:00:01,069
We're back with Bethenny Frankel, and switching gears a little bit.

You can check out the function and the complete sample output .srt file here.
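
Below is a minimal sketch of the same idea, assuming the pysrt package and scene boundaries given in seconds; both are assumptions of this sketch, and the file names are placeholders.

import pysrt

def segment_subtitles(srt_path, scenes, out_prefix="scene"):
    """Write one .srt file per (start_seconds, end_seconds) scene."""
    subs = pysrt.open(srt_path)
    for i, (start_s, end_s) in enumerate(scenes, start=1):
        # Keep only the subtitles that fall inside this scene's time span
        part = subs.slice(starts_after={'seconds': start_s},
                          ends_before={'seconds': end_s})
        part.clean_indexes()                      # renumber entries 1, 2, 3, ...
        part.save(f"{out_prefix}_{i:02d}.srt", encoding="utf-8")

# Scene boundaries would come from the PySceneDetect output shown earlier
segment_subtitles("output_audio.srt", [(0, 12), (12, 40)])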

After this, I started working on building the OpenPose container. I came across this amazing blog on setting up OpenPose on the CWRU HPC. But since my local machine has neither a Linux environment nor a GPU, I was not able to build it there.

Week 5 (26 June–2 July 2023)

I started building the OpenPose container on Google Colaboratory. I used the following code, but I encountered the error below:

F0701 08:33:31.779273 18219 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
@ 0x7f671495c1c3 google::LogMessage::Fail()
@ 0x7f671496125b google::LogMessage::SendToLog()
@ 0x7f671495bebf google::LogMessage::Flush()
@ 0x7f671495c6ef google::LogMessageFatal::~LogMessageFatal()
@ 0x7f67146ac59a caffe::SyncedMemory::mutable_gpu_data()
@ 0x7f6714545336 caffe::Blob<>::mutable_gpu_data()
@ 0x7f6714582b60 caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
@ 0x7f67146e5ac1 caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x7f671466c2d2 caffe::Net<>::ForwardFromTo()
@ 0x7f6714fec2c6 op::NetCaffe::forwardPass()
@ 0x7f6715009222 op::PoseExtractorCaffe::forwardPass()
@ 0x7f67150042db op::PoseExtractor::forwardPass()
@ 0x7f6715001dd0 op::WPoseExtractor<>::work()
@ 0x7f6715032b1f op::Worker<>::checkAndWork()
@ 0x7f6715032cab op::SubThread<>::workTWorkers()
@ 0x7f67150407cd op::SubThreadQueueInOut<>::work()
@ 0x7f6715037831 op::Thread<>::threadFunction()
@ 0x7f6714c77de4 (unknown)
@ 0x7f67143f8609 start_thread
@ 0x7f6714ab3133 clone

The error is mentioned in the OpenPose GitHub issues; check here. It suggests using the following command while building the container.

!cd openpose && rm -rf build || true && mkdir build && cd build && cmake .. -DUSE_CUDNN=OFF && make -j`nproc`

But even this did not work for me, so I informed my mentor about the issue. In the meantime, I tried the TensorFlow MoveNet model and got good results with it. I used the following code.
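
For reference, here is a minimal sketch of single-image MoveNet inference via the public TensorFlow Hub single-pose "lightning" model; the frame path is a placeholder, and the exact settings I used may have differed.

import tensorflow as tf
import tensorflow_hub as hub

# Load the single-pose MoveNet "lightning" model from TensorFlow Hub
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

# Read one frame (placeholder path), pad/resize to 192x192 and cast to int32
image = tf.io.read_file("frame.jpg")
image = tf.image.decode_jpeg(image, channels=3)
image = tf.expand_dims(image, axis=0)
image = tf.cast(tf.image.resize_with_pad(image, 192, 192), dtype=tf.int32)

outputs = movenet(image)
# Shape [1, 1, 17, 3]: 17 keypoints, each (y, x, confidence) normalized to [0, 1]
keypoints = outputs["output_0"].numpy()
print(keypoints.shape)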

Later this week, I had a meeting with my mentor. He told me that the results obtained from MoveNet are not up to the mark, as we need more accuracy. Additionally, both of us made attempts to address the problem with OpenPose, yet it remained unresolved. So he told me to write an email to get access to his work on the HPC in order to save time.

The project I am working on is actually an extension of the project mentor Swadesh worked on in GSoC 2021. Check out his work here.

I wrote the email and was given access to the required key points.

Week 6 (3 — 9 July 2023)

As we enter the final week before the mid-term evaluations, I have begun working on speaker diarization and speaker tracking using Deep SORT. Once this is completed, Stage 1 of the Gesture Pipeline and the Phrase Pipeline will be finished.

I used a SpeechBrain model together with OpenAI Whisper for speaker diarization. The model was able to segment the subtitles, but it had some glitches. I updated my mentor about this, and he suggested trying NVIDIA NeMo.

Mid-Term Evaluations (10–14 July 2023)

Successfully passed my mid-term evaluations and continued working on the project, taking the mentors’ feedback into account.

Week 7 & 8 (17–30 July 2023)

The results of NVIDIA NeMo were not great, so I started looking for other methods. I came across Deepgram, a speech-to-text platform. I tried the following code, and after fine-tuning, the results achieved were good.

I also completed the tracking of IDs in the segmented videos. I discussed with my mentor how to implement it, and then used the neck keypoint for tracking: if the neck coordinates stay within a 5-pixel threshold across frames, they belong to the same person.
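
A minimal sketch of that nearest-neck matching idea follows; the actual implementation is in the repo, and the input format and helper below are illustrative. Only the 5-pixel threshold comes from the description above.

THRESHOLD = 5.0

def track_ids(frames):
    """frames: list of per-frame lists of (x, y) neck coordinates.
    Returns per-frame lists of person IDs with the same shape."""
    next_id = 0
    prev = []                          # (id, (x, y)) pairs from the previous frame
    all_ids = []
    for necks in frames:
        ids, current = [], []
        for (x, y) in necks:
            match = None
            for pid, (px, py) in prev:
                # Same person if the neck moved less than the threshold
                if abs(x - px) <= THRESHOLD and abs(y - py) <= THRESHOLD:
                    match = pid
                    break
            if match is None:          # otherwise start a new track
                match = next_id
                next_id += 1
            ids.append(match)
            current.append((match, (x, y)))
        prev = current
        all_ids.append(ids)
    return all_ids

# Two people whose necks barely move between frames keep their IDs:
print(track_ids([[(100, 50), (300, 60)], [(102, 51), (299, 61)]]))
# -> [[0, 1], [0, 1]]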

I also started to work on the LRCN approach, i.e., using LSTM and CNN layers in a single model.

Week 9 & 10 (31 July–13 Aug 2023)

My classes also started on 1st August, which initially made it hard to manage my time. I started to define various types of hand gestures, using the body keypoint coordinates. The defined gestures include movements such as horizontal and vertical movements of both arms, the arms touching each other, etc. I also defined a movement called ‘Expressed’ for movements in which the hands first come close and then move away from each other, as it was a gesture I observed repeatedly in the dataset.


# Compute gesture label, indexing from zero
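# The keypoints array below is assumed to be OpenPose output flattened as
# (x, y, confidence) triplets, so body point k sits at indices 3k, 3k+1, 3k+2.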

right_wrist_y = keypoints[13]
right_wrist_x = keypoints[12]

right_elbow_y = keypoints[10]
right_elbow_x = keypoints[9]

left_wrist_y = keypoints[22]
left_wrist_x = keypoints[21]

left_elbow_y = keypoints[19]
left_elbow_x = keypoints[18]

The above are the indices of the corresponding body keypoints, counting from zero.
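
As an illustration (not the exact rules used in the repo), the ‘Expressed’ gesture could be detected by tracking the distance between the two wrists across frames; the pixel thresholds below are placeholders.

import math

def wrist_distance(keypoints):
    # Indices follow the flattened layout shown above
    rx, ry = keypoints[12], keypoints[13]      # right wrist (x, y)
    lx, ly = keypoints[21], keypoints[22]      # left wrist (x, y)
    return math.hypot(rx - lx, ry - ly)

def is_expressed(frames, close_px=40, far_px=120):
    """True if the wrists first come within `close_px` of each other and
    later move more than `far_px` apart (thresholds are illustrative)."""
    came_close = False
    for keypoints in frames:
        d = wrist_distance(keypoints)
        if d < close_px:
            came_close = True
        elif came_close and d > far_px:
            return True
    return False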

Talking about the model’s progress, I first used the model developed by my mentor as the base. It is a simple TensorFlow deep-learning model built from ConvLSTM2D and Conv3D layers, with a final Dense layer with sigmoid activation as the output. It gave me an accuracy of 70 to 78%.
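
For reference, here is a minimal sketch of that kind of ConvLSTM2D + Conv3D architecture; the sequence length, frame size, filter counts, and number of classes are illustrative rather than the exact values used.

from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 20, 64, 64, 1       # frames, height, width, channels (illustrative)
NUM_CLASSES = 5                        # illustrative number of gesture classes

model = models.Sequential([
    # Convolutional LSTM over the frame sequence, keeping the time dimension
    layers.ConvLSTM2D(16, (3, 3), return_sequences=True,
                      activation="tanh", input_shape=(SEQ_LEN, H, W, C)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    # 3D convolution over (time, height, width)
    layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Flatten(),
    # Sigmoid output layer, as in the base model described above
    layers.Dense(NUM_CLASSES, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()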

I further tried different combinations with LSTM layers, but the accuracy dropped to 60% in some cases. Since the input is a sequence of 3D frames, plain LSTM layers cannot consume it directly. Conv3D layers in a VGG-like architecture seemed to give the best results, so I removed the LSTM layers from the model.

Week 11 & 12 (17–28 Aug 2023), Final Weeks

The final two weeks of GSoC began. My mentor asked me to start wrapping up and tidying everything and to prepare to showcase the project. I started creating a pipeline that merges speaker segmentation, the gesture recognition model, and the detection of the gesture type. I generated a final inference video containing all the components to showcase to the mentors.

With this, Stage 1 and Stage 2 of my Gesture Pipeline are complete. The Phrase Pipeline is also complete, and all the pipelines are synchronized together.

I also tried the encoder-decoder model one last time. It was built successfully, and I tried to reduce the validation loss as much as possible. The model trained successfully, and I built a classifier as well to check the accuracy. The accuracy came out to only 50 to 55%, which was not acceptable, so I kept my earlier LSTM + CNN model as the final model.
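
For reference, here is a minimal sketch of the kind of LSTM encoder-decoder (sequence autoencoder) experimented with; the sequence length, feature size, and latent dimension are illustrative.

from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES = 20, 50     # e.g. flattened keypoint coordinates per frame
LATENT_DIM = 64

inputs = layers.Input(shape=(SEQ_LEN, N_FEATURES))
encoded = layers.LSTM(LATENT_DIM)(inputs)                 # encoder -> fixed-size embedding
repeated = layers.RepeatVector(SEQ_LEN)(encoded)          # feed the embedding to every timestep
decoded = layers.LSTM(LATENT_DIM, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(layers.Dense(N_FEATURES))(decoded)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# The encoder alone embeds each pose sequence; a 1-nearest-neighbour classifier
# on these embeddings gives the accuracy check mentioned above.
encoder = models.Model(inputs, encoded)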

Post Coding Period

I had a meeting with mentor Peter and gave him a presentation on my work. We discussed things that can be improved, such as using PySceneDetect to remove unwanted detections when there is a change of scene. We also scheduled a meeting with ML engineers to look further into my autoencoder model.

I will try to keep contributing to Red Hen Lab. That is the essence of open source: staying with your organization even after GSoC.

Work to do to improve the model and the overall project:

  • Trying the Transformer model, attention layers, different combinations of RNN, GRU etc. (methods discussed with mentor Swadesh)
  • Building a singularity container after the model has been finalized.
  • Build a different model for speaker diarization or try active speaker detection (which mentor Peter told me about)

Final Evaluation

I am both nervous and excited for the final evaluations. Let’s see what happens! Hoping for a positive response.

Post GSoC Work

GSoC is just the beginning. After spending a summer on this project, I believe it can be improved a lot and can have great applications in the future. I will keep working on what is left, i.e., Stage 3 of my Gesture Pipeline, applying a voice modulation model, improving the accuracy of the model, etc.

Acknowledgements

I am immensely thankful to my mentors, especially Swadesh, for their invaluable guidance and unwavering support during my GSoC project. The Red Hen Lab community’s collaborative spirit has been a constant source of inspiration, and I extend my appreciation to them. I also want to express my gratitude to the GSoC team for providing me with this incredible opportunity to contribute and learn.

Conclusion

In conclusion, the GSoC period has been a thrilling ride of learning and growth. From tackling challenges to crafting solutions, every step has added to my skills. I’m excited about the positive impact my contributions will have on the Red Hen Lab community. This experience has deepened my commitment to open-source collaboration, and I’m eager to continue contributing and learning from it. The GSoC journey has sparked a passion for exploration and teamwork, and I’m enthusiastic about my future open-source endeavours.

💖 Thank you for joining me on this journey, and I look forward to sharing more updates as I continue contributing to the open-source community.
