How I Created A Motion Capturing System In Unity3D

Facial Landmark Detection Vector Art (Credits: Me!)

During my freshman year, I met a graphics professor who got me hooked on the idea of live virtual avatars. We were dissatisfied with the existing ways of controlling player characters in immersive games. Eventually, we decided that a better solution was needed for avatar locomotion and expression. With the help of vision-based technologies such as Microsoft Kinect and facial landmark recognition, we set out to create our own motion capturing system. I am writing this article to share my implementation of this system. This article is inspired by my final research paper.

There is something really powerful about being the protagonist of a story instead of the spectator. In the future, I imagine a multi-player campaign mode in video games that hands over control of the story to the players (a lot like Ender’s Game). This is possible through virtual avatars that mirror their body movements and facial expressions.

My “vision-based” approach to virtual avatars attempts to map visual sensory data onto a character model. I implemented it in a popular game engine called Unity3D. It handles two scenarios: locomotion and expression.

Choosing a character model was definitely the best part of the project. I spent a lot of time browsing Mixamo for character models. In the end, I chose a character from Mixamo Face-Plus, a popular package from Unity’s Asset Store.

My chosen character model (Credits: My paper, p. 23)

It has a fully-rigged skeleton for locomotion AND facial blendshapes for face animation. So it was perfect for my project. (I recently learned that this package is discontinued in Unity 5 😥.)

First, I had to figure out how to get the avatar moving. Here I took advantage of the pre-defined character rig. A rig consists of joints and bones which can be translated and rotated. Given a stream of images, a neural network should be able to learn to track the player’s joints and bones over time. However, RGB images lack depth information, which is crucial for animating the avatar in 3D. Fortunately, Xbox Kinect v2 captures depth through its array of cameras.

Xbox Kinect v2 (Credits: Silvio Giancola)

It uses a time-of-flight technique. Essentially, the depth sensor emits light signals and then measures how long it takes for them to return. This information is used to construct a depth map, which is used to segment the player’s body into areas of interest.
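To make the time-of-flight idea concrete, here is a minimal Python sketch. It is not Kinect SDK code: it assumes an idealized sensor that reports the round-trip time of each light signal directly, and the names `depth_from_round_trip` and `depth_map` are hypothetical. Real sensors usually infer that time indirectly, so treat this as an illustration of the principle only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def depth_from_round_trip(round_trip_seconds: float) -> float:
    """Distance in meters to a surface, given a light signal's round-trip time."""
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    # The signal travels to the surface and back, so halve the path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


def depth_map(round_trip_times):
    """Convert a 2D grid of round-trip times into a 2D grid of depths in meters."""
    return [[depth_from_round_trip(t) for t in row] for row in round_trip_times]
```

For a player standing 2 m from the sensor, the round trip takes roughly 13 nanoseconds, which is why this kind of measurement needs dedicated hardware.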

That sounds like a lot of low-level coding. Fortunately, all of this is implemented and available to the public via the Kinect SDK. Even better, a package called Kinect v2 Examples with MS-SDK and Nuitrack SDK ($25 💩) handles the interfacing between Kinect and Unity. This made my life a lot easier!

An Example of Kinect V2 Skeleton Tracking (Credit: Steve Dent)

By the way, if you are wondering how Kinect tracks a person’s skeleton in 3D, check out Microsoft’s research paper on human pose recognition here.

The ability to express oneself is a key component of any multiplayer game. Most online games have global chats. For my project, facial animations acted as the primary mode of communication.

My inspiration for live facial animations came from a very popular game called VRChat. In this game, players exist inside a social metaverse through their favorite avatar. The avatars can make any expression the player desires. The only problem is that players have to manually create those animations and port them to the game. This puts too much burden on the player. Plus, not all players are good animators, which kind of breaks the illusion of the game…

There needed to be an out-of-the-box solution. I decided to make the avatar mirror the player’s facial expressions! Points on a person’s face can be tracked and mapped onto the avatar’s mesh. For this task, I used Deformable Shape Tracking (DEST). It is an open-source library that detects points on the face, also known as landmarks.

DEST in action (Credits: Christoph Heindl)

To incorporate DEST into Unity, I used the Native Plugins feature. I imported DEST code from C++ into Unity via a Dynamic-link library (DLL) and was able to access its functions in Unity’s scripts.

It turned out that I couldn’t just change the vertices of the face mesh as I desired. Facial expressions are really complex and involve many moving parts. Instead of animating individual vertices, I used blendshapes.

According to a paper published in Eurographics 2014:

“A blendshape generates a facial pose as a linear combination of a number of facial expressions.”

Expressions are defined into states which are morphed into animations. Together they are called a blendshape. (Credits: Wikipedia)
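The "linear combination" in that definition can be shown with a small Python sketch. It assumes the common delta formulation, where each expression contributes its offset from the neutral face scaled by a weight. The function name `blend`, the plain-list mesh layout, and the expression names are all hypothetical and are not how Unity stores blendshapes internally.

```python
def blend(neutral, targets, weights):
    """Return the blended mesh as a list of (x, y, z) tuples.

    neutral: list of (x, y, z) vertices for the resting face.
    targets: dict mapping expression name -> list of vertices (same length).
    weights: dict mapping expression name -> weight, typically in [0, 1].
    """
    blended = []
    for i, base in enumerate(neutral):
        vertex = list(base)
        for name, weight in weights.items():
            target = targets[name][i]
            for axis in range(3):
                # Add this expression's offset from neutral, scaled by its weight.
                vertex[axis] += weight * (target[axis] - base[axis])
        blended.append(tuple(vertex))
    return blended


# A two-vertex "face" with two expressions, half a smile and a fully open jaw.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
targets = {
    "smile": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    "jaw_open": [(0.0, 0.0, -1.0), (1.0, 0.0, 0.0)],
}
pose = blend(neutral, targets, {"smile": 0.5, "jaw_open": 1.0})
```

Because every expression is just a weighted offset, mixing several of them never requires touching individual vertices by hand.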

Essentially, my avatar needed to know its state to make the corresponding facial expression. I used a blendweight to represent facial states. As the name suggests, a blendweight assigns a weight to a part of the face. Together with other blendweights, my avatar could form just about any expression. The steps below show blendweight calculation for one facial expression. There should be N blendweights to support N expressions.

/* Step 1: Kinect detects a face and sends the image to Unity */
float aspectRatio = 1f / frame.width;  // 1f avoids integer division

/* Step 2:
 * Unity sends the image to DEST
 * DEST returns landmark positions (landmark1Pos, landmark2Pos)
 * Compute the normalized pixel distance between the two points */
float dis = Vector2.Distance(landmark1Pos, landmark2Pos) * aspectRatio;

/* Update the running min and max seen across frames
 * (min starts at float.MaxValue, max at float.MinValue) */
min = Mathf.Min(dis, min);
max = Mathf.Max(dis, max);

/* Fraction of the observed range; 0 until max exceeds min */
float prcnt = (max > min) ? (dis - min) / (max - min) : 0f;

/* Compute blendweight (k is some constant) */
float blendweight = prcnt * k;

Kinect scales its detection resolution by how far away the player is. aspectRatio is multiplied by the distance between two landmarks for scale invariance. This makes dis the normalized pixel distance between two landmarks.

Different facial expressions have different min and max values. In some cases, min will never be 0 because two points of the face never intersect. For example, points located at the corners of the mouth never intersect with one another. But the distance between them tells us “how much” a person is smiling. In this example, prcnt is closer to 1 when dis between the two points reaches its max, and closer to 0 when it reaches its min. In other words, dis is normalized against the observed min and max to give prcnt, which is then scaled by some k to give the final blendweight.
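The calculation above can also be run outside Unity. The following is a minimal Python sketch of it for one expression, not the project's actual code. The class name `BlendweightTracker` is hypothetical, and `k = 100` is an assumption (Unity blendshape weights commonly run from 0 to 100). The sketch also adds a guard for the first frame, when min and max are still equal.

```python
import math


class BlendweightTracker:
    """Tracks one landmark pair and turns its distance into a blendweight."""

    def __init__(self, k=100.0):
        self.k = k                 # assumed scale factor
        self.min_dis = math.inf    # smallest normalized distance seen so far
        self.max_dis = -math.inf   # largest normalized distance seen so far

    def update(self, p1, p2, frame_width):
        # Normalize the pixel distance by frame width for scale invariance.
        dis = math.dist(p1, p2) / frame_width
        self.min_dis = min(self.min_dis, dis)
        self.max_dis = max(self.max_dis, dis)
        span = self.max_dis - self.min_dis
        if span <= 0.0:
            # Only one distinct distance observed so far; no range to compare to.
            return 0.0
        prcnt = (dis - self.min_dis) / span
        return prcnt * self.k


# Mouth-corner landmarks over three frames of a 640-pixel-wide image.
tracker = BlendweightTracker()
first = tracker.update((100, 200), (140, 200), 640)   # establishes the range
widest = tracker.update((90, 200), (150, 200), 640)   # new max distance
middle = tracker.update((95, 200), (145, 200), 640)   # halfway between min and max
```

One consequence of using running extremes is that the weights only become meaningful after the player has shown both ends of the expression at least once.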

This shows how a mouth may move over time. The red dots are landmarks used to determine the prcnt value. They tell us how much the mouth is horizontally stretched, i.e., smiling. (Credits: My paper, p. 21)

I had to stop here due to time constraints. At this point, the system could track a person’s skeleton in 3D and mirror their facial expressions. Originally, I wanted to explore how an avateering system could be created for immersive multiplayer games. But I also wanted to show that it could replace current implementations of player expression, like in VRChat. So, I tried to incorporate VR into the project. Unfortunately, the face tracking stopped working. Turns out DEST does not work on faces that are partially occluded by giant black rectangles.

Sigh. (Credits: Kaia Bennett)

Also, landmark detection accuracy decreases drastically as the image resolution decreases. In fact, there are scenarios where a vision-based system may not work at all. For example, if we wanted to pick up objects with the avatar’s hands, a depth image will not be enough to track fingers. VR headsets solve this issue via hand controllers. That may be an option until new VR glove technology like this becomes more affordable. But what if the player is turned away from the camera? How can the facial expressions be captured if the entire system relies on one source for its input?

Thankfully, NVIDIA may have come up with an alternate solution. In a recent paper, their researchers found a way to train a neural network to infer facial animation from speech. You can read more about the training process in their paper. But check out the results:

Audio-driven facial animations (Credits: Tero Karras Fl)


This is something I intend to explore in the future. Since vision-based technology often involves high-end sensors and cameras, like the ones in Kinect, an audio-based system can be cheaper and lighter. That’s not to underplay the value of good vision systems. Some things, like skeleton tracking, will still need high-end sensors and cameras. So, vision-based systems are here to stay. It’s up to us, developers and engineers, to figure out creative ways of compensating for the shortfalls of vision with other types of technology.

Researcher @ Cornell University
