“What is that?” — Multimodal Mixed Reality
Can we make interaction in Augmented and Virtual Reality as natural as day to day interaction with the real world around us?
In our everyday lives we interact with each other and our surroundings using speech, gesture, facial expressions, and a wide range of other input modalities. It’s very natural to point at something and ask “What is that?” and expect people near us to know what we are asking.
The vision of Augmented, Virtual, and Mixed Reality is to make interaction with virtual content as easy as interaction with the real world around us. We should be able to point at virtual objects, say “What is that?”, and have the Mixed Reality system understand us.
That vision is still far from reality, though. Although Microsoft’s HoloLens supports speech and gesture input, it currently recognizes only a limited set of point, pinch, and finger-tap gestures. Similarly, although many VR systems support gesture input, it is rare for them to combine it with speech, and almost no systems support eye-gaze input (with the exception of FOVE and a couple of others).
There are a number of excellent examples of how natural gestures could be added to AR and VR systems. Microsoft has conducted research for years on using depth sensors for gesture tracking. For example, in 2015 they published a paper showing how a single depth sensor could be used for gesture capture. They were able to create 3D models of the user’s hands and track them at 30 frames per second, enabling rich two-handed gestures that move far beyond simple pointing. The video below shows the system working.
However, one limitation of this technology is that it used the Kinect depth sensor, which is too big to mount on an AR/VR head-mounted display (HMD). Other sensors, such as the Leap Motion sensor or Intel’s RealSense sensor, are small enough to attach to an HMD. The video below shows the kind of gesture-based AR interaction made possible by combining a Leap Motion sensor with a see-through AR HMD. This shows that the technology can be used to create very natural interaction with AR content.
However, current gesture systems work either at a large scale over a longer distance or close to the user’s body. For example, the Kinect depth sensor has a range of 0.8 m to 4 m, while the Leap Motion works from 25 mm to 600 mm (0.6 m). In our recent work we have explored how different sensor technologies can be combined to create multi-scale gesture interaction. As the video in Figure 3 shows, we paired the Soli radar-based sensor with a Leap Motion, combining high-speed tracking of fine-scale finger movements with larger-scale hand-movement tracking. This enabled more intuitive interaction with AR content shown on a Microsoft HoloLens display.
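One simple way to think about mixed-scale sensor fusion of this kind is as a per-frame routing decision: when the whole hand is nearly still, trust the radar’s fine finger micro-gestures; when the hand is moving, trust the optical tracker’s larger-scale pose. The sketch below illustrates this idea only; the function names, event formats, and threshold are assumptions, not the design of the published system.

```python
# Illustrative sketch of mixed-scale gesture fusion: route each frame's
# input to the fine-scale (radar) or coarse-scale (optical) tracker
# based on how much the whole hand is moving. Threshold is hypothetical.

def fuse_gesture_input(hand_speed_mps, fine_gesture, coarse_pose,
                       speed_threshold=0.05):
    """Pick the tracking scale for this frame.

    hand_speed_mps: magnitude of whole-hand motion (m/s), e.g. from an
                    optical tracker such as the Leap Motion
    fine_gesture:   micro-gesture label from a radar sensor (e.g. Soli)
    coarse_pose:    full hand pose from the optical tracker
    """
    if hand_speed_mps < speed_threshold:
        # Hand is nearly still: trust fine finger micro-gestures.
        return ("fine", fine_gesture)
    # Hand is moving: trust the larger-scale optical pose.
    return ("coarse", coarse_pose)
```

In practice, a real system would also smooth the speed estimate and add hysteresis so the mode does not flicker at the threshold.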
In addition to gesture tracking, eye-gaze input is also possible for AR and VR displays. A number of manufacturers, such as Pupil Labs and Tobii, make eye-tracking systems that can be combined with existing AR or VR displays. This hardware fits into several commercial AR and VR displays and can provide good performance, provided the user carefully calibrates the system. The Magic Leap One and the newly announced HoloLens 2 are AR displays with integrated eye tracking, as are the FOVE for VR and the newly announced HTC Vive Pro Eye.
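The per-user calibration these trackers require is conceptually simple: the user fixates a sequence of known on-screen targets, and the system fits a mapping from raw pupil coordinates to screen coordinates. A minimal sketch, assuming an affine model fit by least squares (commercial trackers use richer models, but the idea is the same):

```python
import numpy as np

# Hypothetical sketch of per-user eye-tracker calibration: fit an affine
# map from raw pupil coordinates to known calibration-target positions.

def fit_affine_calibration(raw_xy, target_xy):
    """raw_xy, target_xy: (N, 2) sequences of corresponding points, N >= 3."""
    raw = np.asarray(raw_xy, dtype=float)
    tgt = np.asarray(target_xy, dtype=float)
    # Augment with a bias column so the fit includes translation.
    A = np.hstack([raw, np.ones((len(raw), 1))])
    # Solve A @ M ~= tgt in the least-squares sense; M is a 3x2 matrix.
    M, *_ = np.linalg.lstsq(A, tgt, rcond=None)
    return M

def apply_calibration(M, raw_point):
    """Map one raw gaze sample through the fitted calibration."""
    x, y = raw_point
    return np.array([x, y, 1.0]) @ M
```

Calibration drift over a session is one reason eye gaze alone remains inaccurate, which motivates the multimodal refinement techniques discussed below.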
However, research still needs to be conducted on how to use eye gaze for input in AR and VR systems. Users cannot simply select every object they look at, because the eyes are continuously scanning the environment. In our work we have explored three different ways to use eye gaze for selection in VR environments. The video in Figure 4 shows these methods in action, demonstrating how natural they can be.
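A common way to avoid this “Midas touch” problem of selecting everything the eyes pass over is dwell-time selection: an object is chosen only after gaze rests on it continuously for some threshold. The sketch below is a generic illustration of that idea, not one of the three techniques from our study; the class name and the default threshold are assumptions.

```python
# Illustrative dwell-time gaze selection: an object is selected only
# after gaze stays on it for a continuous dwell period.

class DwellSelector:
    def __init__(self, dwell_seconds=0.8):
        self.dwell_seconds = dwell_seconds
        self.current_target = None
        self.dwell_start = None

    def update(self, gazed_object, timestamp):
        """Feed one gaze sample; returns the object once dwell completes."""
        if gazed_object != self.current_target:
            # Gaze moved to a new object (or empty space): restart timing.
            self.current_target = gazed_object
            self.dwell_start = timestamp
            return None
        if (gazed_object is not None
                and timestamp - self.dwell_start >= self.dwell_seconds):
            self.dwell_start = timestamp  # avoid immediate re-selection
            return gazed_object
        return None
```

The dwell threshold is the key usability trade-off: too short and users select accidentally, too long and the interface feels sluggish.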
More interesting interactions are possible when different input modalities are combined to compensate for each other’s limitations. For example, we explored the combination of head pointing and eye gaze for AR selection. Head movements are deliberate and accurate, and provide the current state-of-the-art pointing technique for AR displays. Eye gaze can potentially be faster and more ergonomic, but suffers from low accuracy due to calibration errors and drift in wearable eye-tracking sensors. In the video below we show how it is possible to refine coarse eye-gaze and head pointing with fine hand-gesture, device-gyro, or head-motion input. Using eye gaze with gesture or head-pointing input was almost ten times more accurate than using eye gaze alone.
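The core of such two-stage pointing can be expressed very compactly: the eye tracker supplies a fast but inaccurate coarse cursor, and the refinement modality adds a small, scaled correction on top of it. This is a minimal sketch of that idea under assumed names and gain values, not the specific design evaluated in the paper.

```python
# Illustrative two-stage pointing: coarse gaze cursor plus a scaled
# fine-grained correction from a second modality (head motion, gyro,
# or hand gesture). The gain value is hypothetical.

def refined_cursor(gaze_xy, fine_offset_xy, fine_scale=0.2):
    """Combine a coarse gaze point with a fine-grained correction.

    gaze_xy:        (x, y) from the eye tracker, possibly off by a degree
                    or two due to calibration error and drift
    fine_offset_xy: accumulated (dx, dy) from the refinement modality
    fine_scale:     gain mapping fine input to small on-screen motion
    """
    gx, gy = gaze_xy
    dx, dy = fine_offset_xy
    return (gx + fine_scale * dx, gy + fine_scale * dy)
```

The low gain on the second modality is what makes the refinement fine-grained: large physical movements produce only small cursor corrections around the gaze point.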
One of the other important areas for multimodal input is the combination of gesture and speech. These modalities are complementary: gesture is a very natural way to input qualitative information, such as how an object moved, while speech is perfect for inputting quantitative information, such as a precise numerical value. Research dating back to the famous “Put-that-there” interface from 1980 has shown that combining speech and gesture in graphical interfaces can significantly improve usability. However, there have been relatively few AR or VR interfaces that do this.
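The essence of “Put-that-there”-style fusion is temporal alignment: deictic words like “that” and “there” are resolved against whatever the user was pointing at when each word was spoken. The sketch below illustrates this with a hypothetical event format; it is a simplification of Bolt’s system, which also handled continuous pointing and richer commands.

```python
# Illustrative speech-and-gesture fusion: resolve deictic words against
# a time-stamped log of pointing targets. Event formats are hypothetical.

def resolve_command(words, pointing_log):
    """words: list of (word, time); pointing_log: list of (time, target),
    both sorted by time. Returns the command with deictic slots filled."""
    def target_at(t):
        # Most recent pointing target at or before time t.
        candidates = [tgt for (pt, tgt) in pointing_log if pt <= t]
        return candidates[-1] if candidates else None

    resolved = []
    for word, t in words:
        if word in ("that", "there", "this", "here"):
            resolved.append(target_at(t))  # fill slot from the gesture
        else:
            resolved.append(word)
    return resolved
```

For example, saying “put that there” while pointing first at a lamp and then at a table resolves to the command “put lamp table”.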
In our own work we found that using speech plus gesture in an AR interface for furniture arranging enabled users to complete the task about 30% faster than using gesture alone, and with higher accuracy. Users overwhelmingly preferred being able to use speech and gesture together. In more recent work we developed a system that combined speech and free-hand gesture for AR 3D modelling and scene creation. In a user study with the system, people reported that the combined gesture and speech input provided a high level of usability and very natural interaction.
Today’s AR and VR interfaces support only simple interactions, mostly using handheld controllers, simple gestures, or head pointing. However, studies like the examples above show that far richer user input is possible. Researchers are on the path to developing truly natural input systems in which speech, gesture, and gaze can be combined to provide intuitive input.
Note — This blog post is an extended version of a presentation given on September 20th 2018. The slides from the presentation are available here.
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., … & Freedman, D. (2015, April). Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (pp. 3633–3642). ACM.
Ens, B., Quigley, A., Yeo, H. S., Irani, P., Piumsomboon, T., & Billinghurst, M. (2018, April). Counterpoint: Exploring Mixed-Scale Gesture Interaction for AR Applications. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (p. LBW120). ACM.
Piumsomboon, T., Lee, G., Lindeman, R. W., & Billinghurst, M. (2017, March). Exploring natural eye-gaze-based interaction for immersive virtual reality. In 2017 IEEE Symposium on 3D User Interfaces (3DUI) (pp. 36–39). IEEE.
Kytö, M., Ens, B., Piumsomboon, T., Lee, G. A., & Billinghurst, M. (2018, April). Pinpointing: Precise Head- and Eye-Based Target Selection for Augmented Reality. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (p. 81). ACM.
Bolt, R. A. (1980). “Put-that-there”: Voice and gesture at the graphics interface (Vol. 14, No. 3, pp. 262–270). ACM.
Irawati, S., Green, S., Billinghurst, M., Duenser, A., & Ko, H. (2006, November). An evaluation of an augmented reality multimodal interface using speech and paddle gestures. In International Conference on Artificial Reality and Telexistence (pp. 272–283). Springer, Berlin, Heidelberg.
Piumsomboon, T., Altimira, D., Kim, H., Clark, A., Lee, G., & Billinghurst, M. (2014, September). Grasp-Shell vs gesture-speech: A comparison of direct and indirect natural interaction techniques in augmented reality. In 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 73–82). IEEE.