Emotion identification using upper body language detection with Microsoft Kinect v2 and machine learning

Fouad Hannoun · Empathic Labs · Apr 23, 2020
Pepper, the empathic robot made by Aldebaran.

Hey, I’m a software engineer pursuing my PhD in computer science in Switzerland. This article is about my master’s thesis project, which aimed to detect emotions by analysing a person’s upper body language. Being able to detect emotions allows machines to understand us better and eases our communication with them (which is, by the way, the whole purpose of Empathic Labs, so I suggest you take a look at what those people do).

Introduction

The study of nonverbal communication emerged with Darwin, who stated that emotional expressions started as pre-human nonverbal displays and evolved into human displays [1]. Whether it is a facial micro-expression, a body language feature or the person’s voice, it can reveal an emotion the person might want to keep hidden. Reading emotions is becoming more and more important in our daily lives. Emotions have long been a central topic in psychology and neuroscience, but with advances in cameras, image processing and machine learning, machines are starting to be able to detect them too.

Human-machine interactions are becoming more and more common, with autonomous cars, smart assistants, chatbots, police robots, and customer service or sales agents; cooperating with machines that can detect emotions, for example in interviews, investigations or healthcare, will soon become necessary.

Objective

The goal of this project is to build a system that reads upper body language in order to recognise hidden emotions. The main objectives are body detection, a skeletal representation, body language feature extraction and finally emotion detection based on these features. The Microsoft Kinect v2 API was used for body detection, the SDK’s Visual Gesture Builder software was used to study the position of the head and the upper body and detect features, and an additional machine-learning algorithm was applied to compare the features and identify the emotion.

The solution is divided into 3 main parts:

Body detection and representation

The body detection and representation were implemented using the Microsoft Kinect library [2] for C#.

Kinect v2 detection

In order to implement the body detection and representation, a body mask (the isolation of the pixels belonging to the body, a feature already provided by the adopted library) and two classes [3] are required:

BodiesManager: a class that renders and draws the tracked bodies in the frame.

BodyInfo: a class that contains the detected body’s joints and bones information in order to couple it with graphical information like lines, circles and colors.

By default, each body has 25 joints, with a position and orientation for each joint. Since the project’s main focus is upper body language, some joints and bones are ignored, which leaves 15 joints per body (head, neck, SpineShoulder, SpineMid, SpineBase, and the left and right shoulders, elbows, wrists, hand tips and thumbs). The body data collected by the Kinect is saved in a Body array processed by the BodiesManager class, which draws the joints and bones.
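For illustration, here is a minimal Python sketch of that filtering step. The actual project does this in C# with the Kinect library, so the data structure below (a plain dictionary mapping joint names to positions) is hypothetical; only the list of kept joints comes from the project.

```python
# Hypothetical illustration of the upper-body filtering (the real code is C#):
# keep only the 15 joints listed above and drop the rest.

UPPER_BODY_JOINTS = {
    "Head", "Neck", "SpineShoulder", "SpineMid", "SpineBase",
    "ShoulderLeft", "ElbowLeft", "WristLeft", "HandTipLeft", "ThumbLeft",
    "ShoulderRight", "ElbowRight", "WristRight", "HandTipRight", "ThumbRight",
}

def keep_upper_body(joints):
    """Filter one tracked body, given a mapping of joint name -> (x, y, z) position."""
    return {name: pos for name, pos in joints.items() if name in UPPER_BODY_JOINTS}

# Example with one lower-body joint that gets dropped:
body = {"Head": (0.1, 0.8, 2.0), "Neck": (0.1, 0.6, 2.0), "KneeLeft": (0.0, -0.5, 2.1)}
print(keep_upper_body(body))  # only Head and Neck remain
```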

Skeleton representation of the body

Gesture recognition

The gesture recognition was built with Kinect Visual Gesture Builder, a tool that uses videos recorded with Kinect Studio. The Kinect SDK offers two applications, KStudio [4] and Visual Gesture Builder [4]. KStudio is used to record clips featuring a gesture. The gesture is then defined in Visual Gesture Builder, the joints it relies on are chosen (the lower body can be ignored, and so can the left or right arm) and the timestamps representing the gesture are marked.

Gesture Builder then analyses the marked videos and, using AdaBoost [5] as its machine-learning algorithm, trains itself to detect the gesture and generates a file with the features defining it. In total, 19 gestures were recorded and recognised.
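For intuition only, the sketch below trains a comparable binary AdaBoost classifier with scikit-learn on random, hypothetical per-frame joint features. It is not what Gesture Builder does internally beyond sharing the AdaBoost family, and the feature layout (15 joints × 3 coordinates) is an assumption.

```python
# Intuition only: a binary "gesture / no gesture" AdaBoost classifier trained on
# hypothetical per-frame joint features, mimicking what Gesture Builder learns
# from the marked clips. Random data stands in for the real recordings.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))       # 200 frames x (15 joints * 3 coordinates)
y = rng.integers(0, 2, size=200)     # 1 = frame marked as the gesture, 0 = not

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# At runtime the trained detector reports a confidence for each new frame,
# comparable to the 0..1 confidence exposed by GestureResultView.
frame = rng.normal(size=(1, 45))
print(clf.predict_proba(frame)[0, 1])
```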

Two additional classes are required for the project: GestureDetector and GestureResultView [3].

After the database is generated, it is added to the Visual Studio project. GestureResultView produces a confidence between 0 and 1 for each gesture (1 if it is 100% confident that the gesture is being performed). Once detection was working, a logging function was implemented; it captures the timestamp and all the gestures’ confidences and appends them to a log file (the work is done by an asynchronous thread). The logging function is called every time a gesture’s confidence changes. These logs are used as the data for emotion detection.
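As a rough illustration of that logging pattern, here is a Python sketch (the project’s implementation is in C#). The file name and record layout are assumptions: a timestamp line followed by one confidence line per gesture, matching the 19-line blocks described in the emotion identification section below.

```python
# Sketch of the asynchronous logging pattern (the project implements it in C#).
# Records are queued and appended to the log file by a background thread so the
# detection loop is never blocked. File name and layout are illustrative.
import queue
import threading
import time

log_queue: "queue.Queue[str]" = queue.Queue()

def writer(path="gesture_log.txt"):
    with open(path, "a") as f:
        while True:
            f.write(log_queue.get() + "\n")
            f.flush()

threading.Thread(target=writer, daemon=True).start()

def log_confidences(confidences):
    """Called whenever a gesture's confidence changes.

    `confidences` maps gesture name -> confidence in [0, 1]; the timestamp is
    written first, then one line per gesture.
    """
    log_queue.put(f"{time.time():.3f}")
    for name, value in confidences.items():
        log_queue.put(f"{name} {value:.2f}")

log_confidences({"hands_behind_head": 1.0, "head_backward": 1.0, "spine_backward": 0.9})
time.sleep(0.2)  # toy example only: give the writer thread a moment to drain the queue
```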

Logs
Gesture builder test showing the confidence it has for each defined gesture (hands behind head, head backward and spine backward are at confidence 1)

Emotion identification

The emotion identification was done with a multi-output regression machine-learning algorithm [6], using the log file as data. The gesture detection application was used to generate the training datasets. Three persons contributed, and 1540 samples, each representing a single emotion, were extracted. Eight emotions were represented in total: happiness, sadness, pride, guilt, defensiveness, interest, boredom and impatience. The link between the gestures and the emotions was based on articles [7] and on Allan and Barbara Pease’s book [8].

Link between the gestures and the emotions

The datasets are stored in two text files, X and Y. The first file contains the confidence of each gesture, with each line holding the values extracted at a specific timestamp while a known emotion was being expressed. The second file contains values between 0 and 100, with eight values per line, each value representing how strongly an emotion is expressed and each line matching the gestures logged on the same line of the first file.

Datasets used for the training of the system (on the left, the confidence for each gesture; on the right, the percentage of each emotion)
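A minimal training sketch using scikit-learn’s multi-output tools [6] is shown below. It assumes X.txt holds one row of 19 gesture confidences per line and Y.txt the matching 8 emotion percentages; the article does not name the base regressor, so RandomForestRegressor is used here purely as an example.

```python
# Training sketch, assuming X.txt (19 gesture confidences per line) and Y.txt
# (8 emotion percentages per line) as described above. The base regressor is an
# assumption; any scikit-learn regressor can be wrapped in MultiOutputRegressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

EMOTIONS = ["happiness", "sadness", "pride", "guilt",
            "defensiveness", "interest", "boredom", "impatience"]

X = np.loadtxt("X.txt")   # shape (n_samples, 19): confidence of each gesture
Y = np.loadtxt("Y.txt")   # shape (n_samples, 8): how strongly each emotion is expressed

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=0))
model.fit(X, Y)

# Predict the emotion mix for one new set of gesture confidences:
scores = model.predict(X[:1])[0]
print(EMOTIONS[int(np.argmax(scores))], dict(zip(EMOTIONS, scores.round(1))))
```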

The script keeps the log file open for reading; the timestamp is saved to a third file for plotting purposes, and the gestures’ confidences are stored in an array until the script reaches the next timestamp (19 lines later), which marks the end of the logged set. It then analyses the content of the array and estimates how strongly it belongs to each emotion. The result is stored in the plotting file and the dominant emotion is sent to the C# application to be displayed in real time. Another script is used to plot the stored emotions.
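The sketch below mirrors that parsing loop, reusing `model` and `EMOTIONS` from the training sketch above. The assumed log layout (a timestamp line followed by 19 gesture-confidence lines) matches the earlier logging sketch; continuously tailing the file and the hand-off to the C# application are left out.

```python
# Parsing sketch: group the log into 19-line blocks per timestamp, predict the
# emotion mix for each block and append it to the plotting file. The dominant
# emotion is printed here; in the project it is sent to the C# application.
def parse_log(model, emotions, path="gesture_log.txt", plot_path="emotions.txt"):
    timestamp, block = None, []
    with open(path) as log, open(plot_path, "a") as plot:
        for line in log:
            parts = line.split()
            if len(parts) == 1:                # a timestamp line starts a new set
                timestamp, block = parts[0], []
            else:                              # a "gesture confidence" line
                block.append(float(parts[1]))
                if len(block) == 19:           # the full set for this timestamp is in
                    scores = model.predict([block])[0]
                    plot.write(timestamp + " " + " ".join(f"{s:.1f}" for s in scores) + "\n")
                    print(emotions[int(scores.argmax())])

parse_log(model, EMOTIONS)  # using the objects from the training sketch
```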

Plotting file (showing the timestamp + confidence towards the emotions)
The plotting result

And that’s the application running:

Upper body detected in real time with the hands behind the head and the head and spine leaning backward, which together represent pride.

Conclusion

The project was successfully developed: the emotions can be identified, up to six bodies can be detected in parallel, the emotion is displayed in real time and the whole session can be plotted. However, it still needs some improvements, such as optimising the logging and emotion detection pipeline in order to reduce the lag. Gesture recognition could also be improved by training the gesture builder on more people and from multiple points of view. Finally, merging this system with facial and voice recognition, and even using the lower body to identify emotions while standing, would make for a far more complete detection scheme.

References

[1] C. Darwin, The expression of the emotions in man and animals, Oxford University Press (1872/1998).

[2] Microsoft Kinect library https://docs.microsoft.com/en-us/previous-versions/windows/kinect/dn799271(v=ieb.10)

[3] Additional classes used https://github.com/Kinect/tutorial

[4] Microsoft Kinect SDK https://www.microsoft.com/en-us/download/details.aspx?id=44561

[5] Adaboost https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/

[6] Sklearn http://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression

[7] Gestures and emotions http://people.ict.usc.edu/~gratch/CSCI534/Readings/ACII-Handbook-GestureSyn.pdf

[8] A. Pease and B. Pease, The Definitive Book of Body Language

[9] Project’s GitHub repository https://github.com/hannounfouad

