Hidden Layers of Technology for an Expressive Avatar in VR

Real-time facial motion capture process through BinaryVR Dev Kit V1

hyprsense
7 min read · Nov 9, 2017

Inside virtual reality, the possibilities are endless. Most of what we imagine has already taken place in VR, including life-like games, education, socializing, and even sales. And at the center of it all, there is always a virtual avatar. Communicating with others and making connections within virtual reality would be difficult without one. An avatar represents your existence in VR, just as your body represents your presence in reality.

BinaryVR aims to bring the avatar system to life through an expressive avatar. We want users to feel that they are alive in VR and to socialize with others as they do in reality. So what is an expressive avatar, and how does BinaryVR enable it?

What is an expressive avatar?

The movie ‘Avatar’ 2009 Source: http://www.avatarmovie.com/images.html#15

The first thing that comes to mind at the word 'avatar' is the movie 'Avatar.' The ultimate goal of an expressive VR avatar is similar to the Avatars in the movie: it should be another body through which you can show your facial emotions and body gestures, an alternative identity. The main way this immersion is achieved is through facial expressions, which allow users to deliver their message with subtle nuance.

To achieve lifelike avatars, movie production teams capture the facial expressions of actors and actresses and map them onto CG characters afterward. Unfortunately, this not only requires a considerable budget but, most importantly, an extensive post-production process, which rules out real-time communication. That is why BinaryVR develops real-time facial motion capture technology. Our SDK is optimized for general users: no heavy computing power or overpriced equipment is needed. We aim to commercialize our facial motion capture technology so that everyone has the opportunity to enjoy social engagement through their alternative identities! Nor is it restricted to VR; for those who want to create their own Animoji like Apple's, BinaryVR can support building it on mobile with a depth camera.

Animoji Karaoke!

So how does BinaryVR SDK work?

In this article, we will dig out the hidden layers of technology underneath the BinaryVR SDK — how one’s own facial expression transfers to an avatar’s face.

When you smile, your avatar smiles in real-time.
(From left) Holotech Studios’ two avatars from the FACE project, BinaryVR avatar, High Fidelity avatar

BinaryVR also develops AR facial tracking for mobile using the same core technology. If you are interested, check out the BinaryFace SDK!

Behind the Scene: Hidden Layers of Technology

How BinaryVR Dev Kit V1 enables an expressive avatar

The pipeline of real-time motion capture

Personalized Expression Model Building Process

Before expression tracking begins, BinaryVR's SDK calibrates to the user's face and builds a personalized 3D facial expression model. The SDK scans the user's initial neutral face and constructs the personalized expression model on top of that personalized neutral face.

Personalized neutral model based on a user’s face

Since everyone has different facial characteristics, the same expression looks different from person to person. By building a personalized expression model, the algorithm knows which facial expressions a given user can make and achieves high-fidelity expression tracking. Once the user has calibrated and the personalized model has been created, the SDK reuses that model until a new user comes in. You can think of this step as the SDK learning each user's own facial expressions!
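To make this concrete, here is a minimal sketch of what a personalized model can look like under a common blendshape-style formulation: a personalized neutral mesh plus one displacement field per expression type. The class and data below are illustrative only, not the SDK's actual internals.

```python
import numpy as np

class PersonalizedExpressionModel:
    """Toy stand-in for a personalized expression model: a neutral 3D face
    mesh plus one displacement (blendshape delta) per expression type,
    both adapted to the calibrated user."""

    def __init__(self, neutral_vertices, blendshape_deltas):
        # neutral_vertices: (V, 3) array of the user's neutral face scan
        # blendshape_deltas: dict {expression_name: (V, 3) displacement}
        self.neutral = neutral_vertices
        self.deltas = blendshape_deltas

    def pose(self, weights):
        """Blend the neutral face with weighted expression deltas.
        weights: dict {expression_name: value in [0, 1]}."""
        mesh = self.neutral.copy()
        for name, w in weights.items():
            mesh += w * self.deltas[name]
        return mesh

# Calibration, conceptually: scan the user's neutral face, then adapt a
# generic template's expression deltas to it. Random data stands in here
# just to show the flow.
V = 5000                                           # hypothetical vertex count
neutral = np.random.rand(V, 3).astype(np.float32)
deltas = {"smile": np.random.randn(V, 3).astype(np.float32) * 0.01,
          "mouth_open": np.random.randn(V, 3).astype(np.float32) * 0.02}
model = PersonalizedExpressionModel(neutral, deltas)
posed = model.pose({"smile": 0.8, "mouth_open": 0.3})
```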

Real-time Data Refinement Process

The next step is refining the raw scan data that will later be used as input for expression tracking. The BinaryVR Dev Kit includes a depth camera that captures 2D IR images and 3D depth maps; together, we call these the raw scan data.

First, from the 2D IR image, the SDK uses its computer vision algorithm to localize 2D landmarks around the user's mouth. In layman's terms, tracking 2D landmarks captures how your mouth moves.

Next, the SDK applies bilateral smoothing to the depth map to reduce noise while preserving edges. The refined depth data provides the 3D distance from the camera to the facial skin, such as the lips, cheeks, and chin, by matching depth information to each pixel of the IR image.
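As a rough illustration of this refinement step, the sketch below smooths a depth frame with OpenCV's bilateral filter and reads off the depth at a few mouth landmarks. The landmark detector is a stub standing in for a real localizer; parameter values are made up for the example.

```python
import cv2
import numpy as np

def detect_mouth_landmarks(ir_image):
    """Placeholder for a real 2D landmark localizer. It simply returns a
    fixed set of points near the lower-center of the image so the sketch
    runs end to end."""
    h, w = ir_image.shape
    cx, cy = w // 2, int(h * 0.7)
    return np.array([[cx - 20, cy], [cx + 20, cy],
                     [cx, cy - 10], [cx, cy + 10]], dtype=np.float32)

def refine_raw_scan(ir_image, depth_map):
    """Refine one frame of raw scan data (sketch).
    ir_image:  (H, W) uint8 infrared image
    depth_map: (H, W) uint16 depth in millimeters, aligned to ir_image."""
    # Edge-preserving smoothing of the depth map. bilateralFilter wants
    # 8-bit or float32 input, so convert the raw depth first.
    depth_f = depth_map.astype(np.float32)
    depth_smooth = cv2.bilateralFilter(depth_f, d=5,
                                       sigmaColor=25.0, sigmaSpace=5.0)

    # 2D mouth landmarks from the IR image.
    landmarks_2d = detect_mouth_landmarks(ir_image)

    # Each landmark's camera-to-skin distance comes from the refined
    # depth map at the same pixel, since the two images are aligned.
    xs = landmarks_2d[:, 0].astype(int)
    ys = landmarks_2d[:, 1].astype(int)
    landmark_depth_mm = depth_smooth[ys, xs]
    return landmarks_2d, landmark_depth_mm, depth_smooth

# Usage with synthetic frames:
ir = np.zeros((480, 640), dtype=np.uint8)
depth = np.full((480, 640), 600, dtype=np.uint16)   # flat 60 cm scene
print(refine_raw_scan(ir, depth)[1])                # depths at landmarks
```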

Depth map & 2D landmark tracking

By combining the depth information with the 2D landmark information, the SDK reads off 3D motions and volumetric skin deformations such as kiss, puff, and jaw movement. For example, when the 2D landmarks indicate a lip pucker and the depth data shows a short distance from the camera to the lips, the expression is read as 'kiss.' If the 2D landmarks indicate a lip pucker but the depth data shows a short distance from the camera to the cheeks, the expression is predicted as 'puff.' The next step, expression tracking, uses this combined landmark-and-depth data as its input. Now we are ready for the expression tracking process!
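A toy version of that disambiguation might look like the following. The thresholds and the idea of comparing against a neutral baseline are assumptions for illustration, not the SDK's actual rule.

```python
def classify_pucker(lip_depth_mm, cheek_depth_mm,
                    baseline_lip_mm, baseline_cheek_mm,
                    threshold_mm=8.0):
    """Toy disambiguation of 'kiss' vs. 'puff' once the 2D landmarks
    already indicate a lip pucker. Depths are camera-to-skin distances;
    the baselines come from the neutral calibration frame."""
    lips_forward = baseline_lip_mm - lip_depth_mm        # lips moved toward camera
    cheeks_forward = baseline_cheek_mm - cheek_depth_mm  # cheeks bulged toward camera
    if cheeks_forward > threshold_mm and cheeks_forward > lips_forward:
        return "puff"
    if lips_forward > threshold_mm:
        return "kiss"
    return "neutral pucker"

print(classify_pucker(590, 640, 605, 645))   # lips 15 mm closer -> "kiss"
print(classify_pucker(600, 628, 605, 645))   # cheeks 17 mm closer -> "puff"
```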

Puff & kiss expression

Expression Tracking Process

From the previous steps, we obtained the personalized expression model and the pre-processed data. The goal of the expression tracking process is to measure a set of expression-type values frame by frame.

BinaryVR’s expression tracking algorithm currently covers 22 expression types for the lower part of the face. We will extend this to roughly 60 facial expression types as our upper-face tracking algorithm is perfected. Each distinct expression type is measured as a continuous value between 0 and 1 so that the face moves naturally, and a person’s facial expression is represented as the combination of the values across the 22 expression types. For example, when a user opens their mouth and smiles, the combination of the ‘mouth open’ value and the ‘smile’ value is tracked as their facial expression.

22 expression types
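In code, one tracked frame can be pictured simply as a vector of per-type weights. The type names below are illustrative rather than the SDK's exact list.

```python
# One tracked frame: a weight in [0, 1] per expression type.
# Names are hypothetical; the SDK's 22 lower-face types may differ.
frame_weights = {
    "jaw_open":    0.55,   # mouth clearly open
    "smile_left":  0.70,
    "smile_right": 0.65,
    "lip_pucker":  0.00,
    "cheek_puff":  0.00,
    # ... remaining lower-face types default to 0.0
}

# "Open mouth and smile" is just the combination of non-zero values,
# and this same vector later drives the avatar.
active = {name: w for name, w in frame_weights.items() if w > 0.05}
print(active)   # {'jaw_open': 0.55, 'smile_left': 0.7, 'smile_right': 0.65}
```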

So how does the algorithm determine the set of expression parameters? The SDK searches for the combination of values that makes the personalized model reproduce the user's real expression as captured in the pre-processed data. Figuratively speaking, it is quite similar to solving an equation for X.

f(X; W) = Y

X: a set of expression parameters
W: personalized model
Y: pre-processed data
f: the relationship mapping X and W to expected scan data; the solver minimizes the error between f(X; W) and Y in depth and 2D image space
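Under a linear blendshape assumption, "solving for X" can be sketched as a small bounded least-squares problem: find the weights whose posed mesh best explains the observed scan points. This stands in for the SDK's actual solver, which also fits in 2D image space; all shapes and data below are synthetic.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_expression_weights(deltas, neutral, observed_points, point_indices):
    """Solve f(X; W) = Y for X (sketch): find blendshape weights in [0, 1]
    whose posed mesh best matches the observed 3D points at tracked vertices.
    deltas:          (K, V, 3) per-expression displacement fields (part of W)
    neutral:         (V, 3) personalized neutral mesh (part of W)
    observed_points: (M, 3) refined scan points (Y)
    point_indices:   (M,) mesh vertex indices those points correspond to."""
    K = deltas.shape[0]

    def residual(x):
        posed = neutral + np.tensordot(x, deltas, axes=1)   # f(X; W)
        return (posed[point_indices] - observed_points).ravel()

    result = least_squares(residual, x0=np.full(K, 0.5), bounds=(0.0, 1.0))
    return result.x   # one weight per expression type

# Tiny synthetic check: recover a known "smile + jaw open" combination.
rng = np.random.default_rng(0)
V, K, M = 200, 5, 40
neutral = rng.random((V, 3))
deltas = rng.normal(scale=0.05, size=(K, V, 3))
true_x = np.array([0.7, 0.0, 0.3, 0.0, 0.0])
idx = rng.choice(V, size=M, replace=False)
observed = (neutral + np.tensordot(true_x, deltas, axes=1))[idx]
print(fit_expression_weights(deltas, neutral, observed, idx))  # ≈ true_x
```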

One crucial point in implementing the expression tracking algorithm is choosing the total number of expression types carefully. With too many, the result becomes noisy and unstable because the values interfere with one another; the tracked expressions tremble badly, and the added delay becomes an obstacle to a real-time process. With too few, the system cannot capture subtle facial expressions. In short, the set of expression types has to be controlled and optimized.

Avatar Remapping Process

The last step is bringing a virtual avatar to life. The BinaryVR SDK applies the tracked facial expression values to a rigged avatar, and the avatar shows the same facial expression the user made.

The essence of a virtual avatar is becoming someone else. People want the avatar to have its own unique facial expressions, so the avatar should be rigged in its own style to feel like a genuine character. Here's an extreme example of how differently the same action, puffing one's face, can look.
(From the collaboration project of 360channel, Holotech Studios, FOVE and BinaryVR)

Puffing one’s face

Although the expression tracking values are the same, the two characters show very different facial expressions. How the avatar's animation is rigged for each expression type therefore drives what the avatar experience feels like.
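A simple way to picture this remapping is a per-avatar table that translates each tracked expression type into that rig's own blendshape, possibly exaggerating it. Everything below is hypothetical and rig-specific, not part of the SDK.

```python
# Hypothetical remap table for a cartoon-style rig.
CARTOON_RIG_REMAP = {
    # tracked type ->  (avatar blendshape, gain)
    "cheek_puff":      ("BalloonCheeks", 1.6),   # exaggerated puff
    "jaw_open":        ("MouthOpen",     1.0),
    "smile_left":      ("SmileL",        0.9),
    "smile_right":     ("SmileR",        0.9),
}

def remap_to_avatar(tracked_weights, remap_table):
    """Convert tracked expression values into the avatar rig's own
    blendshape weights, clamped back to [0, 1]."""
    avatar_weights = {}
    for tracked_name, value in tracked_weights.items():
        if tracked_name in remap_table:
            shape_name, gain = remap_table[tracked_name]
            avatar_weights[shape_name] = min(1.0, max(0.0, value * gain))
    return avatar_weights

print(remap_to_avatar({"cheek_puff": 0.6, "jaw_open": 0.2}, CARTOON_RIG_REMAP))
# {'BalloonCheeks': 0.96, 'MouthOpen': 0.2}
```

The same tracked weights fed through a different avatar's table would produce a different look, which is exactly why the rigging choices shape the avatar experience.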

Why is real-time facial motion capture technology important?

The evolution of the medium: Text — Photo — Video — VR/AR. Source: https://www.youtube.com/watch?v=yJBoXQNVX3Q

Mark Zuckerberg claims that people will continually seek more immersive mediums. At the same time, the 'live content' trend is heating up, as users prefer a sense of realism in the moment.

source: https://www.dailyrindblog.com/live-streaming-youtube-mobile/ https://play.google.com/store/apps/details?id=com.snapchat.android&hl=en

BinaryVR believes that real-time social interaction in VR sits at the intersection of the immersion and live trends. Live VR content such as the Jim Jam Show from High Fidelity will keep being created, and users will participate in it with their own expressive avatars.

Jim Jam Show from High Fidelity

For the future of social interaction in VR, BinaryVR's real-time facial motion capture is an inevitable step. Looking ahead, BinaryVR is developing upper-face tracking to enable full facial motion capture. We hope our technology will help VR content creators provide high-quality, real-time social experiences inside VR!

Follow us and check out our blog to stay updated on BinaryVR's vision and technology! BinaryVR brings humanity into the virtual world from the very front line of cutting-edge technology.


hyprsense

Hyprsense develops real-time facial expression tracking technology to light up the creation of live animation.