ControllerPose: A New Solution for VR Full-Body Tracking?

Charles Yuan · Published in deMISTify · Aug 29, 2022

Written by Charles Yuan. A review of the paper titled “ControllerPose: Inside-Out Body Capture with VR Controller Cameras”.

Introduction

Virtual reality is perhaps the most effective method of escapism currently available. Personally, I’ve been spending more time in virtual reality (VR) than I’d like to admit. Ever play VRChat? Well, it’s a great place to escape to when your mental health takes a nosedive larger than Meta’s stock did. In fairness to Meta, virtual legs are overrated anyway. I’m kidding, of course, but this is a good segue into the main topic of this article: full-body tracking!

For those who are unfamiliar with the concept, or with VR technologies in general, there are essentially two types of tracking systems used in consumer-grade headsets. The first is outside-in tracking, in which the headset and other accessories are tracked by an external device. This is the method employed by most traditional VR headsets (e.g., the Oculus Rift, HTC Vive, PS VR, and Valve Index), albeit with slight variations between them [1]. This is also the type of system that allows for full-body tracking (i.e., tracking of your legs, hips, etc.), as the base stations positioned around a room can track not only the headset and controllers, but additional accessories attached to the body as well. It also does so with a high degree of precision and very low latency [1].

A tutorial on setting up Vive trackers for full-body tracking, made by Thrillseeker [2].

The second type is known as inside-out tracking, in which cameras, IMUs, and other sensors integrated into the headset and controllers form a positional tracking system without the need for external base stations. As you can probably imagine, this allows for more mobility and freedom of movement, thereby increasing one’s immersion in virtual reality. The problem is that, with only a headset and two controllers, VR systems such as the Oculus (Meta) Quest 2 are not capable of full-body tracking beyond simple position estimates, and this puts a hard limit on the immersion one can feel in virtual reality [3]. Of course, one can simply purchase base stations and trackers separately, but that would defeat the entire purpose of inside-out tracking.

Diagram explaining how base stations work [4].

Background and Hardware

Fortunately, some researchers from the Future Interfaces Group at Carnegie Mellon University have proposed a solution. Simply put, why not integrate cameras into the VR controllers? VR headsets already leverage them, and the controllers themselves already possess independent batteries, computational capabilities, and wireless communication [3]. In a small motion capture study, the authors found that users situate their hands in front of the body roughly 68.3% of the time when playing popular titles such as Beat Saber, Superhot, and Pistol Whip [3]. They therefore created a new pipeline that combines the views of the multiple controller-mounted cameras, performs 3D body pose estimation, and uses the resulting data to rig a human model for end-user applications [3]. In other words, inside-out full-body tracking!

Beat Saber, Superhot, and Pistol Whip. Three very popular titles on the Quest 2 store.

The ControllerPose Pipeline

To achieve this, ControllerPose uses two wireless cameras attached to each controller, one on the upper ring and another on the bottom of the grip. With a resolution of 640x480 pixels, the four raw camera feeds (below, B) are first filtered for unusable frames (C), corrected for fisheye distortion using OpenCV’s fisheye camera model API (D), and then cylindrically projected to preserve the relative proportions of the user’s body (E) [3]. For those who are unfamiliar, fisheye distortion arises from the wide-angle lenses used to achieve panoramic views, at the cost of strong visual warping, and hence must be corrected for. Cylindrical projection is essentially the process of unrolling a cylindrical surface into a flat plane, similar to how world maps are made. Finally, the two images from each controller are stitched together into a vertical panorama (F), providing a 185° vertical and 150° horizontal field of view [3].

The compositing and unwarping pipeline, which prepares the four camera inputs for the neural network [3].
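To make steps D and E more concrete, here is a minimal Python sketch using OpenCV. The intrinsics and distortion coefficients below are placeholder values for illustration; the paper’s actual calibration is not given in this article.

```python
import cv2
import numpy as np

# Placeholder calibration for a 640x480 fisheye camera (illustrative values,
# not the paper's actual calibration).
K = np.array([[300.0, 0.0, 320.0],
              [0.0, 300.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.1], [-0.05], [0.01], [0.0]])  # fisheye coefficients k1..k4

def undistort_fisheye(frame):
    """Step D: correct fisheye distortion with OpenCV's fisheye camera model."""
    h, w = frame.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)

def cylindrical_project(frame, focal=300.0):
    """Step E: re-project the image onto a cylinder, which "unrolls" the wide
    field of view and preserves the body's relative proportions."""
    h, w = frame.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    ys, xs = np.indices((h, w), dtype=np.float32)
    theta = (xs - cx) / focal           # horizontal angle on the cylinder
    src_x = np.tan(theta) * focal + cx  # inverse mapping back to the plane
    src_y = (ys - cy) / np.cos(theta) + cy
    return cv2.remap(frame, src_x, src_y, interpolation=cv2.INTER_LINEAR)
```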

Using this output, the 3D pose estimation pipeline extracts 2D pose estimates of the user with v1.7.0 of OpenPose [5]. The output comprises 17 key points: head, shoulders (x2), elbows (x2), hands (x2), torso, mid hip, pelvis (x2), knees (x2), ankles (x2), and feet (x2) [3].

The output of the 3D pose estimation pipeline, comprising 17 key points [6].

Utilizing these key points from the left and right streams, two skeletons are produced (top two images, G). Then, 136 direction vectors are calculated from the 17 joints (one per joint pair, since C(17, 2) = 136) and used as input to a multi-input neural network; after further post-processing, a final pose is produced in 3D Cartesian space (see the sketch after the figure below).

Overview of the entire ControllerPose pipeline [3].
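As a rough illustration of where the 136 comes from: with 17 joints there are C(17, 2) = 136 unordered pairs, hence one unit direction vector per joint pair. The sketch below is my own reconstruction of that counting argument, not the paper’s code; the keypoint names and ordering follow the article’s list and are otherwise assumptions.

```python
import itertools
import numpy as np

# The 17 keypoints from the article's list (ordering is illustrative).
KEYPOINTS = ["head", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
             "l_hand", "r_hand", "torso", "mid_hip", "l_pelvis", "r_pelvis",
             "l_knee", "r_knee", "l_ankle", "r_ankle", "l_foot", "r_foot"]

def direction_vectors(pose):
    """Unit direction vectors between every pair of joints.
    For 17 joints, C(17, 2) = 136 pairs, matching the 136 inputs
    described in the paper."""
    vecs = []
    for i, j in itertools.combinations(range(len(KEYPOINTS)), 2):
        d = pose[j] - pose[i]
        vecs.append(d / (np.linalg.norm(d) + 1e-8))  # normalize, avoid /0
    return np.stack(vecs)

pose_2d = np.random.rand(17, 2)          # stand-in for OpenPose output
print(direction_vectors(pose_2d).shape)  # (136, 2)
```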

Results and Performance

Given this complex pipeline, it is not surprising that the resulting frame rate is only 7.2 FPS [3]. The system’s mean end-to-end latency of ~297 milliseconds breaks down into the following components (a short sketch of how this latency relates to the frame rate follows the list):

  1. 75 ms to receive video frames as input
  2. 63 ms to perform image unwarping and compositing
  3. 128 ms to perform body pose estimation
  4. 8 ms for neural network overhead
  5. 17 ms for Unity to render graphics and run the IK solver
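
One way to reconcile a ~297 ms latency with 7.2 FPS (naively, 1 / 0.297 s would give only ~3.4 FPS): if the stages run concurrently as a pipeline, throughput is bounded by the slowest stage rather than by the total. This back-of-the-envelope sketch is my own reasoning, not a claim from the paper:

```python
# Stage timings from the list above, in milliseconds.
stages_ms = {
    "video input": 75,
    "unwarp + composite": 63,
    "pose estimation": 128,
    "network overhead": 8,
    "render + IK": 17,
}

latency_ms = sum(stages_ms.values())            # ~291 ms end to end
pipelined_fps = 1000 / max(stages_ms.values())  # ~7.8 FPS, bottlenecked
                                                # by pose estimation
print(f"latency ~{latency_ms} ms, throughput ~{pipelined_fps:.1f} FPS")
```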

Though this performance is currently lackluster, especially in comparison to the Quest 2’s 60 FPS, the authors expect that some processes would be eliminated or deeply optimized when adapted for commercial applications [3]. Nevertheless, the pipeline is a proof of concept that such a system works, and its current performance should be considered a lower bound on frame rate [3].

As for the precision of the system, it still cannot compare with the outside-in tracking performed by the HTC Vive Pro with its 2.0 Lighthouses and 2.0 trackers. In a separate paper, Bauer et al. found that outside-in tracking was capable of millimeter-level precision, while the ControllerPose system produced a significantly higher mean 3D Euclidean joint error of 6.98 centimeters [3][7]. Excluding the hip and torso points, which have comparatively low error, the mean rises to 8.59 cm [3].

Histograms of the horizontal deviations (a) orthogonal to and (b) along the direction of the lighthouse, measured 1.8 meters from a single lighthouse [7].
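For reference, the mean 3D Euclidean joint error reported above is a standard pose-estimation metric. A minimal sketch, assuming predicted and ground-truth joints as (17, 3) arrays in centimeters:

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean 3D Euclidean distance between predicted and ground-truth joints,
    averaged over all joints (and frames, if a leading axis is present)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.random.rand(17, 3) * 10  # stand-in predictions, cm
gt = np.random.rand(17, 3) * 10    # stand-in ground truth, cm
print(f"mean joint error: {mean_joint_error(pred, gt):.2f} cm")
```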

Limitations and Future Improvements

Needless to say, inside-out full-body tracking is nowhere near the precision or robustness of outside-in tracking. This is primarily because inside-out tracking relies on computer vision approaches, meaning that, beyond the lack of millimeter-level precision, it also suffers from the lower resolution of integrated cameras, poor lighting conditions, and occlusion from baggy clothing or controller positioning [3].

However, according to the authors, the pipeline was capable of detecting a multitude of different poses in the composited camera views; what it struggled with was the actual 3D pose estimation and inverse kinematic posing, both of which can be improved upon with further research [3]. As for the latency issues, the Quest 2’s hand tracking feature is already capable of 21-point tracking on each hand, for a total of 42 points, and its Qualcomm Snapdragon XR2 chipset offers impressive hardware-accelerated AI performance [3]. All of this is to say that, with more research into machine learning and computer vision approaches, it may very well be possible to achieve practical inside-out full-body tracking, as the hardware infrastructure to support it already exists.

ControllerPose tested on unusual poses. Top row: Failed. Middle row: Partial success. Bottom row: Works well [3].

Conclusion

So what’s the takeaway from all of this? For one, full-body movements are perfectly capable of being tracked in virtual reality, so why does the “Metaverse” look like this? The obvious explanation is that, for standalone Quest users, full-body tracking is not available unless they use external IMU-based systems such as the HaritoraX or SlimeVR. However, with research being done on systems such as ControllerPose, there is still hope that Meta will eventually release a standalone VR headset capable of full-body tracking. If the goal is increased immersion, then moving your legs in-game is an eventual necessity. Hopefully, one day we’ll all be able to dance, drive, and kickbox in virtual reality with the power of inside-out tracking.

Full-Body Tracking Showcase!

  1. 11pt Full Body Tracking, demo by ShanamoN_VR
  2. Runner’s Alibi, a short movie by ACMEJack
  3. Driving in VR, Varneon’s Udon Vehicles
  4. Blade and Sorcery Full-Body, fighting demo by Kentypoo
  5. FeetSaber, omotea
  6. Dancing in VR, KoizumiTV

References

  1. Langley, H. (2017, May 3). Inside-out v Outside-in: How VR tracking works, and how it’s going to change. Wareable. Retrieved from: https://www.wareable.com/vr/inside-out-vs-outside-in-vr-tracking-343
  2. Thrillseeker. (2019, Sep 22). FULL BODY Tracking in VRChat with Vive Trackers — Tutorial. YouTube. Retrieved from: https://www.youtube.com/watch?v=yE5NGI3RLUY
  3. Ahuja, K., Shen, V., Fang, C. M., Riopelle, N., Kong, A., Harrison, C. (2022, Apr 28). ControllerPose: Inside-Out Body Capture with VR Controller Cameras. ACM Digital Library. Retrieved from: https://dl.acm.org/doi/fullHtml/10.1145/3491102.3502105
  4. nesumtoj. (2016, Jun 26). How to Set Up Your Vive Base Stations for Cabled Sync. Tom’s Guide. Retrieved from: https://forums.tomsguide.com/faq/how-to-set-up-your-vive-base-stations-for-cabled-sync.111306/
  5. Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y. (2019, May 30). OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv. Retrieved from: https://arxiv.org/abs/1812.08008
  6. Future Interfaces Group. (2022, Apr 27). ControllerPose: Inside-Out Body Capture with VR Controller Cameras. YouTube. Retrieved from: https://www.youtube.com/watch?v=5p_glarZOdU
  7. Bauer, P., Lienhart, W., Jost, S. (2021, Feb 25). Accuracy Investigation of the Pose Determination of a VR System. Sensors (MDPI). Retrieved from: https://www.mdpi.com/1424-8220/21/5/1622

Charles Yuan is an Engineering Science student at the University of Toronto, specializing in Machine Intelligence.