Using Machine Learning To Analyse Body Language And Facial Expressions

Frederik Calsius
Published in jstack.eu
Aug 23, 2019

During these last weeks I’ve been working on a system that can help analyse people’s emotional state by looking at their body postures, facial expressions and speech signals. In a series of two posts, I’ll go over the techniques and toolkits used to create a system like this. In this first post of the series, we’ll go over the Computer Vision part: recognition of body postures and facial expressions.

Disclaimer: in no way am I claiming that this system is flawless. It is purely a proof of concept that explores the current capabilities of machine learning algorithms that are, at the moment, considered “standards” in the field.

Project Introduction

The initial idea for this project was to create a system that could help people prepare for job interviews. Since body language is often considered the biggest factor in our communication, it is of interest to have a system that can analyse your posture and give feedback on it over a given period of time (e.g. during an entire interview). If you are interested in reading up on body language and getting some tips on how to improve yours, I’d suggest this article.

Next to the body posture that a person takes on, the expression on one’s face also tells a lot about how a person is feeling. By having a system that can recognize facial expressions as well as body postures, we aim to create a product that can recognize more specific emotions.

Posture Recognition: Considered Approaches

Neural Networks

To recognize body postures, different approaches were considered. A classifier based on a convolutional neural network (CNN) seemed like a possible solution. CNNs have achieved very high accuracy at recognizing objects in images (since we’re working with video, we can consider each frame of the video as an image). The idea was that different body postures would be treated as “objects” for the system to recognize and distinguish. However, there was a catch… Training a CNN requires huge amounts of data. A good dataset that fits the needs of this project couldn’t be found after some intense Google sessions, so we considered creating our own dataset in which different body postures were “acted” by different people.

From some of my previous work I have found that acted datasets are not the way to go. Generally, acted datasets tend not to capture the subtle, yet important, features that real-life data contains. This meant we put the idea of using CNNs in the freezer 🥶❄️.

Rule-Based System

One of the methods that did not need a big dataset, and seemed like a reasonable candidate, was a rule-based system. Side note: rule-based systems used to be a popular method in the field of AI, but they have been overshadowed by the machine-learning craze of recent years.

Is a rule-based system really AI?

Posture Recognition: Implemented Approach

In order to implement a rule-based system that can recognize body postures, the PoseNet toolkit is used. This toolkit is based on the TensorFlow-Lite framework, which allows for real-time processing on lightweight devices (e.g. smartphones). By using lightweight devices, the system can be used in pretty much any environment. It does, however, come with some flaws, which will be discussed later. The rules in our system are based on the positions of the different body joints of the human subject that appears on camera.

PoseNet example, taken from the TensorFlow-Lite repository (source: GitHub)

As can be seen from the image above, PoseNet is able to extract a total of 17 different joints. In turn, the location of each joint (x- and y-coordinates) is stored in memory for each new frame. The way we do that is shown by the following block of code:

Example of how to get the x and y values for the recognized joints. This loop is repeated for each frame (and for all 17 joints).
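As a minimal sketch of what such a loop can look like with the @tensorflow-models/posenet package (the `jointHistory` buffer and variable names are illustrative assumptions, not the project’s exact code):

```javascript
import * as posenet from '@tensorflow-models/posenet';

// jointHistory maps a joint name (e.g. 'leftWrist') to the per-frame
// {x, y} positions that are kept in memory for the rule checks.
const jointHistory = {};

async function run(video) {
  const net = await posenet.load();

  async function processFrame() {
    const pose = await net.estimateSinglePose(video, { flipHorizontal: false });

    // pose.keypoints holds the 17 detected joints.
    for (const keypoint of pose.keypoints) {
      const { x, y } = keypoint.position;
      if (!jointHistory[keypoint.part]) jointHistory[keypoint.part] = [];
      jointHistory[keypoint.part].push({ x, y, score: keypoint.score });
    }
    requestAnimationFrame(processFrame);
  }

  processFrame();
}
```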

Initially, there were at least four different body postures that we wanted to recognize with this project, namely: open, closed, relaxed and in-control. For each posture, we inspected the way the joints were positioned and whether there was any coherence between them. We then cast our findings in the form of rules that could be programmed into the system.

An example of one of the rules that we came up with, to detect whether someone is closing themselves off from their environment.
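Purely as an illustration (the joint choices and comparisons below are assumptions, not the project’s exact rule), a “closed” posture check could test whether both wrists sit between and below the shoulders, i.e. the arms are crossed in front of the chest:

```javascript
// Illustrative rule: flag a 'closed' posture when both wrists lie
// horizontally between the shoulders and below them, which roughly
// corresponds to the arms being crossed in front of the chest.
// `joints` is assumed to hold the latest {x, y} position per joint name.
function isClosedPosture(joints) {
  const { leftShoulder, rightShoulder, leftWrist, rightWrist } = joints;

  // PoseNet image coordinates: x grows to the right, y grows downward.
  const betweenShoulders = (p) =>
    p.x > Math.min(leftShoulder.x, rightShoulder.x) &&
    p.x < Math.max(leftShoulder.x, rightShoulder.x);

  const belowShoulders = (p) =>
    p.y > Math.max(leftShoulder.y, rightShoulder.y);

  return (
    betweenShoulders(leftWrist) && betweenShoulders(rightWrist) &&
    belowShoulders(leftWrist) && belowShoulders(rightWrist)
  );
}
```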

During the entire session that the camera/webcam is running, we count the amount of time that the human subject spends in each posture. By doing this, we are able to give the user insight into the impression they are conveying to their surroundings.
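As a rough sketch of that bookkeeping (the counter names are assumptions), one could simply count frames per posture and convert them to time shares at the end of the session:

```javascript
// Illustrative bookkeeping: count how many frames were spent in each posture.
const postureCounts = { open: 0, closed: 0, relaxed: 0, inControl: 0, unknown: 0 };
let totalFrames = 0;

// Call once per processed frame with the classified posture.
function updateCounts(posture) {
  postureCounts[posture] += 1;
  totalFrames += 1;
}

// At the end of the session, report the share of time spent in each posture.
function report() {
  return Object.fromEntries(
    Object.entries(postureCounts).map(([p, n]) => [p, (100 * n) / totalFrames])
  );
}
```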

Facial Expressions: Implemented Approach

Since the body posture recognition method uses tf-js, ideally the facial expression system would also use tf-js, to create a nice, coherent system.

Luckily for us, there is face-api. This is an API built on tf-js that allows for face detection, face recognition, face landmark detection and even an out-of-the-box feature for classifying facial expressions in real time.
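A minimal usage sketch with face-api.js (the model path, detector choice and polling interval are assumptions) could look like this:

```javascript
import * as faceapi from 'face-api.js';

async function runExpressionDetection(video) {
  // Load a lightweight face detector and the expression classifier.
  await faceapi.nets.tinyFaceDetector.loadFromUri('/models');
  await faceapi.nets.faceExpressionNet.loadFromUri('/models');

  setInterval(async () => {
    const detection = await faceapi
      .detectSingleFace(video, new faceapi.TinyFaceDetectorOptions())
      .withFaceExpressions();

    if (detection) {
      // detection.expressions holds a probability for each of the 7
      // expressions: neutral, happy, sad, angry, fearful, disgusted, surprised.
      console.log(detection.expressions);
    }
  }, 100);
}
```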

The model that face-api provides for recognizing facial expressions uses depth-wise separable convolutions and densely connected blocks. Even though separable convolutions have their own limitations, they can be very powerful. One of the major advantages is that they allow for much faster computations. To read up on what (depth-wise) separable convolutions are, I’d suggest this article.
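To get a feel for why they are faster, the sketch below (with made-up layer sizes, unrelated to the actual face-api model) compares the parameter count of a regular convolution with that of a depth-wise separable one in tf-js:

```javascript
import * as tf from '@tensorflow/tfjs';

// Compare parameter counts of a regular vs. a depth-wise separable
// 3x3 convolution on the same (made-up) input shape.
const inputShape = [64, 64, 32];

const regular = tf.sequential({
  layers: [tf.layers.conv2d({ inputShape, filters: 64, kernelSize: 3 })],
});
const separable = tf.sequential({
  layers: [tf.layers.separableConv2d({ inputShape, filters: 64, kernelSize: 3 })],
});

console.log('regular conv params:  ', regular.countParams());   // 3*3*32*64 + 64 = 18,496
console.log('separable conv params:', separable.countParams()); // 3*3*32 + 32*64 + 64 = 2,400
```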

Example of the Face-API

Current Issues

One of the flaws of using lightweight devices is that depth cannot be measured. A depth camera such as the Kinect can distinguish between a person having their arms in front of their body and holding their arms behind their body. Regular cameras cannot do this, although there is a machine learning solution for this (here).

According to different sources, this does make a difference when we are strictly looking at body language. The first position, arms crossed in front of the body, means that a person is closing themselves off from their environment. On the contrary, when holding the arms behind the back, a person tends to take up more space, keep their shoulders straight and push their chest forward. This is generally seen as a dominant posture.

Left: closed pose. Right: power pose.

Recap Of PoseNet And Its Implementation:

Since body language is often considered the biggest element of human communication, it is in our best interest to be able to analyse and classify different body postures.

To capture and distinguish the different body postures that the human subjects take on, a rule-based system is implemented.

PoseNet is the technology used to set up the rule-based system, whilst also allowing the system to run locally on lightweight devices.

Currently, four different body postures are recognized. The postures implemented in the system tend to reflect emotional states that differ strongly from one another.

Face-API allows us to recognize a total of 7 different facial expressions.

Because the system does not have a depth camera, some flaws exist. However, since this system is a proof of concept, we currently accept these flaws.

Furthermore, PoseNet can be used in many other ways. It can be used to create interactive games, build a ‘gesture recognition’ system, teach people a certain dance (where their movements are compared with a pre-recorded video), and much more.

The combination of the Face-API and PoseNet lets us aggregate the facial expressions and body postures of the human subject. By looking at the combination of these two features, we can get a much more accurate view of an emotion.
