The next generation of augmented and virtual reality: How a new gesture language will control virtual objects, how APIs will transfer 3D objects between virtual worlds, and how to (safely) drive with a VR headset on.

It's quite a task to predict what the long-term effects of AR/VR will be on our society.

Many will look at the technology that exists today, whether that is an HTC Vive, an Oculus Rift, or a HoloLens, and assume that it is a rough preview of what AR/VR will look like 15 years from now.

Let's take a different, bolder approach. Let's assume that within the next 15 years there will be a device that can both create a virtual experience and project images into the real world. Let's further propose that using this device will be a seamless experience, one that does not interfere with our ability to move around the way a clunky headset might.

When we envision this technology, I think the first question we should explore is: “What do we want to do with it?”

At the beginning, smartphones tried to replicate what was on the laptop, just smaller. You had the physical QWERTY keyboard on the BlackBerry and the Motorola Droid (I loved the "droid" sound it made when you received an email!). When the iPhone was released with touchscreen typing, it was something that truly challenged people. I don't think everyone was comfortable with the idea at first, but now we (except for die-hard BlackBerry users) see it as the standard way we type on a smartphone. Many of us don't even think about the fact that we are using this technology. We are not even aware that we are typing; the processing seems to happen in the back of our heads.

I predict that we are going to see similar trends in the evolution of AR/VR hardware. At the beginning (now), we are going to have applications and hardware that are very similar to what came before: VR versions of 3D technology for the PC or the Mac. As we move forward and reach a critical mass of VR users, people will demand VR-native content. How that content is created will depend on the ability of the developer community to create tools in VR. Hardware will need to evolve from clunky controllers to hand-fitting gloves or other haptic devices that let us easily create content in VR. The holy grail of VR interaction will be the ability to manipulate objects and create new objects using just your hands.

We know this is technically feasible using computer vision. The HoloLens already recognizes gestures such as tapping the thumb and index finger together or making the bloom gesture to bring up the Start menu.

There is no reason to think that we can't create a new language, a new sign language of sorts, that translates into actions in VR or AR. Technology that brings the human body as close as possible to the user interface tends to be adopted faster. A gesture-based user interface has to be simple and intuitive, but it will still take time to learn. You are not born knowing how to type, and you will need to take the time to learn how to manipulate objects in VR. Developers will also have to agree on a common language. For example, not every operating system will want to use the bloom gesture to bring up a Start menu. As the number of operating systems that support VR increases, the challenge of agreeing on common gestures will grow. In Windows Holographic the bloom gesture looks nice, but when we have Linux and macOS in VR, those developers might not want you to access a Start menu with a bloom gesture, in the same way that the big Windows Start button doesn't exist in those operating systems, and for good reason.

Taking it a step further, I can envision programming languages based on gestures. You might wonder: why not just type on a virtual keyboard? I believe that in order to fully experience immersive VR and have VR-native creation, gestures will be essential.

What will this gesture-based programming language look like?

It's natural to think about American Sign Language (ASL) and how it could be a basis for VR gestures, but the idea here is not that a gesture corresponds to a word or a letter. A gesture would correspond to an action such as rotate, drag, transport, or move up or down. That doesn't mean ASL can't be a good reference. In fact, we can repurpose ASL gestures and give them new meanings in VR; the analogy with written language would be keeping the same script but assigning different meanings to the words. Something you see very often in VR games is the ability to teleport. You obviously cannot physically move through the entire 3D space of a game unless it's the size of your living room. Teleporting will therefore need to be one of the first gestures the developer community agrees on, and it will need to be easy to perform because it will be used so often. Let's use the ASL gesture for the letter A.

Sources: http://lifeprint.com/asl101/topics/wallpaper1.htm, http://www.roadtovr.com/hands-on-budget-cuts-inventive-locomotion-is-a-lesson-for-vr-developers/

As you can see in the picture above, we can throw a teleport circle into a VR hallway just by making a fist. No controller required!

And the same idea works beautifully in AR. I've spent hours playing RoboRaid on the HoloLens, and after a thousand thumb-to-index-finger taps, I get really tired of fighting aliens in my living room. Maybe there is a gesture that could translate to "don't stop my current activity until I switch gestures." Let's call this a "continue" gesture and assign the ASL letter B to it. You can see in the picture below how much easier it would be to destroy those pesky home aliens with the "continue" gesture.

Sources: http://lifeprint.com/asl101/topics/wallpaper1.htm, https://www.microsoft.com/en-us/hololens/apps/roboraid

Some people will not be totally sold on the gesture-based approach and will focus more on voice. I agree that voice will be an important part of interacting with a VR experience. You can see how well this has worked with Alexa, Cortana, and Google Home, and there is no reason to believe it won't permeate into VR. But just look at how fun it would be to use the ASL sign for C to rotate Manhattan in Google Earth VR!

Sources: http://lifeprint.com/asl101/topics/wallpaper1.htm, https://uploadvr.com/google-teasing-brand-new-google-earth/
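To make this concrete, here is a minimal, purely hypothetical sketch of what a shared gesture-to-action mapping could look like in code. The ASL letters and the actions mirror the examples above, but the dictionary and dispatcher names are invented and do not correspond to any real SDK.

```python
# Hypothetical mapping from recognized hand poses to VR actions.
# The ASL letters and actions follow the examples in this article;
# nothing here corresponds to a real gesture-recognition API.

GESTURE_ACTIONS = {
    "ASL_A": "teleport",   # closed fist throws a teleport circle
    "ASL_B": "continue",   # flat hand keeps the current action going until the pose changes
    "ASL_C": "rotate",     # curved hand rotates the selected object
}


def dispatch(gesture: str, target: str) -> str:
    """Translate a recognized gesture into an action on the current target."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is None:
        return f"ignored unknown gesture '{gesture}'"
    return f"{action} -> {target}"


print(dispatch("ASL_A", "hallway"))    # teleport -> hallway
print(dispatch("ASL_C", "Manhattan"))  # rotate -> Manhattan
```

The point of a table like this is that it could be shared across operating systems, the same way keyboard shortcuts eventually converged.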

As Natural Language Processing (NLP) improves, we should expect VR applications to give us the option of manipulating objects with our voice. And maybe what will prevail as the ultimate user interface is some combination of voice, gestures, and sensors.

There are already some innovative applications, like the IBM Sandbox on the HTC Vive (an awesome app), that let you magically bring elephants into existence just by saying the word "elephant". I hope the next generation of devices will allow for a combination of all of these user interfaces. With one of these devices, looking at an object in VR might activate a particular feature of that object, such as making it glow, while a particular gesture (maybe the C in ASL) might rotate it. This combination of different user interfaces is going to feel much more like real life (whatever that means!) than our current-day headsets and controllers. The key takeaway is that VR will provide capabilities for user interfaces that you cannot get outside of VR.
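As a rough illustration of how gaze, gesture, and voice could be combined in one interaction loop, here is a hypothetical sketch; the function and gesture names are made up, and the logic simply mirrors the glow, rotate, and "elephant" examples above.

```python
# Hypothetical multimodal dispatcher: gaze picks the object,
# a gesture or a voice command decides what happens to it. Purely illustrative.

def handle_input(gazed_object, gesture=None, voice=None):
    events = []
    if gazed_object:
        events.append(f"{gazed_object}: glow")            # gaze highlights the target
    if gesture == "ASL_C" and gazed_object:
        events.append(f"{gazed_object}: rotate")          # gesture acts on the gazed object
    if voice and voice.startswith("create "):
        events.append(f"spawn {voice.split(' ', 1)[1]}")  # e.g. "create elephant"
    return events


print(handle_input("Manhattan", gesture="ASL_C"))
# ['Manhattan: glow', 'Manhattan: rotate']
print(handle_input(None, voice="create elephant"))
# ['spawn elephant']
```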

It won't be easy, and at the beginning we will feel that something is missing when we use VR applications. If you are used to manipulating something in the real world and you can't do it in VR, it will be frustrating. We have to understand that this is just a step in the overall direction of having natural interactions with artificial worlds. Once we master these interfaces, it will be difficult to tell which experiences are fully natural and which are enhanced by technology.

The critical element will be the number of frames per second and whether we can match in VR what the eye can do. It might take 10–20 years for that milestone to be reached. But in the same way that graphics have improved with every iPhone, Sega, Nintendo, and PlayStation iteration, we will see a similar increase in frame rate for VR (a Moore's law for VR fidelity). Some applications will not even be possible until that frame rate is reached. For example, if I am driving and a streaming 360° camera is part of my VR headset, the headset will need to digest what the camera records and display it in VR fast enough for safe driving, which will be very difficult to achieve. Applications where your vision is occluded while you simultaneously drive safely are therefore still many years off, but I think eventually we will get there.
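To put rough numbers on why this is hard, here is a back-of-the-envelope calculation; the speed, refresh rate, and pipeline latency are my own illustrative assumptions, not the specs of any real headset.

```python
# Back-of-the-envelope latency budget for driving through a pass-through headset.
# All numbers below are illustrative assumptions, not measurements of any real device.

speed_mph = 65                       # assumed highway speed
speed_m_per_s = speed_mph * 0.44704  # roughly 29 m/s

refresh_hz = 90                      # typical desktop-VR refresh rate today
frame_time_s = 1 / refresh_hz        # about 11 ms per frame

# Assumed extra delay for capturing, encoding, and re-rendering the 360 feed.
pipeline_latency_s = 0.050           # 50 ms, an optimistic guess

blind_distance_m = speed_m_per_s * (frame_time_s + pipeline_latency_s)
print(f"The car travels ~{blind_distance_m:.1f} m before the headset shows it.")
# Roughly 1.8 m of road covered before the pass-through view catches up,
# which is why occluded-vision driving is still many years off.
```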

This brings me to the idea that AR and VR sit at two ends of an occlusion spectrum. The virtual world is a simulation that exists on its own, without any real-world equivalent: a fully occluded view of a simulation, which we call VR. Augmented reality is where you look at the real world and simulated digital objects are superimposed on that world and interact with it. I think these are just two special effects, or "reality filters", that are part of a much larger set of effects and combinations of effects that will vastly increase our range of experience.

Imagine a next (or next-next) generation headset with four quadrants. In the upper left quadrant you see unvarnished reality. In the upper right you see a completely virtual world, unrelated to your current real experience. In the bottom left you see the real world with superimposed holograms. And in the bottom right you have some mix of all of these. This illustrates that we will have an endless supply of filters or additions to our reality. Some of these filters will be "replacement" filters that occlude our vision and create a brand new universe, but that will be only one variation.

At this early stage, we assume that AR will work better for situations where we need to move about, but that might not be the case. If you don't have a good sense of navigation, you might want to put on a headset that completely occludes your vision and shows you bright lights where the streets are and very clear indications of where you need to turn, without all of the trees and shops (or the other cars, for that matter) to distract you. In the same way, we might have an AR experience that is almost completely simulated. We might look at the sky and the filter in our headset or glasses will turn clouds into spherical objects (a coordinate transformation, for the math nerds). That would be a complete distortion of reality, more than just an augmentation of our experience, yet in our current lexicon it would still be called "augmented reality" because what you see is tied to images in the real world. As AR/VR technology evolves, we will therefore focus more on the level of distortion a headset provides than on the level of occlusion.

The different filters for perceiving the world around us (AR) and the different worlds we can enter (VR) will be limited only by the creative ability of the developer community, whose role will be to provide a particular experience for the use case that matters to you. Vision is the first thing that comes to mind, but with haptics we could make a bowling ball in a virtual world feel extremely light. That would be just as much a distortion of the physics of our world as a change in our visual perception. What we choose to treat as real is going to change. We will move between worlds that are completely simulated, with impossible physics, and worlds that are based on our observations but slightly modified with holographic images, and we will switch freely from one to the other.

In the same way that we have apps on our phones today, each meant for a specific use, we will have customizable reality filters that change the way we perceive the world. Some of these filters will be tailored for traveling on a plane and others for studying for an exam. And in the same way that we have APIs now to connect apps, we will have VR APIs that pull information or visualizations from different reality filters without having to superimpose all of those visualizations on one pane. For example, if there is a filter that shows bright yellow lights on the street, which might be helpful for bicycle navigation, you will be able to transfer that particular effect into a different world where you see holographic manifestations of animals or monsters. This will be a blended filter, and the permutations of blended filters will be infinite. We will also need access keys that give us the rights to take a particular object or visualization from one world or filter and move it to another. But that is a topic for another article.
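To make the blended-filter idea a bit more concrete before signing off, here is a purely hypothetical sketch of what such a filter API could look like; none of these classes or methods exist today, and names like RealityFilter, AccessKey, and blend are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AccessKey:
    """Hypothetical token granting the right to lift one effect out of a filter."""
    owner: str
    effect_id: str


@dataclass
class RealityFilter:
    """A named set of effects that modify what the wearer perceives."""
    name: str
    effects: dict = field(default_factory=dict)  # effect_id -> render function

    def export_effect(self, effect_id: str, key: AccessKey):
        # Only hand the effect over if the key actually covers it.
        if key.effect_id != effect_id or effect_id not in self.effects:
            raise PermissionError(f"No rights to export '{effect_id}'")
        return effect_id, self.effects[effect_id]

    def blend(self, other: "RealityFilter", key: AccessKey) -> "RealityFilter":
        """Create a new filter containing this filter plus one borrowed effect."""
        effect_id, effect = other.export_effect(key.effect_id, key)
        combined = dict(self.effects)
        combined[effect_id] = effect
        return RealityFilter(name=f"{self.name}+{effect_id}", effects=combined)


# Example: borrow the bicycle-navigation street lights and drop them
# into a fantasy world full of holographic monsters.
navigation = RealityFilter("bike-nav", {"street_glow": lambda scene: scene})
fantasy = RealityFilter("monster-world", {"monsters": lambda scene: scene})

key = AccessKey(owner="eddie", effect_id="street_glow")
blended = fantasy.blend(navigation, key)
print(blended.name, list(blended.effects))
# monster-world+street_glow ['monsters', 'street_glow']
```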

Cheers

Eddie