The Essential Guide to Mobile AR Gestures

Design interactions based on the user’s touch, look and proximity

Published in Inborn Experience (UX in AR/VR), Mar 22, 2018

ARKit and ARCore have made it easier than ever to build mobile AR experiences, but designing them is harder than designing traditional Apps. It requires a different way of thinking about gestures and user interactions.

This guide explores the challenges and solutions of adapting touch gestures to mobile AR. It also suggests some new categories of gesture that can be used to complement touch, or as alternatives to it: specifically, how a user might control an App with their look and proximity.

Touch

Tapping, dragging, swiping; most users are comfortable with an array of touch gestures. However, most traditional Apps have focussed on controlling an object in one or two dimensions e.g. scrolling up/down (Y-axis), swiping left/right (X-axis), dragging around on screen (X&Y).

In AR Apps, we generally want our users to manipulate objects in three dimensions, but with only two dimensions on the screen, conveying intent is difficult.

To illustrate this point, let’s say our user wants to move a virtual ball projected on their table, so they drag up 👆 on the screen. Should the ball move vertically up into the sky or back into the distance?

The problem is that the simple 2D gesture doesn’t provide enough information to make a precise manipulation in 3D. There are multiple solutions to this problem, and we’ll explore the merits of some below.

Option 1: Reduce the Dimensions

The simplest option is to reduce the number of dimensions being manipulated so that the manipulation can be represented by a lower dimensional gesture. This sounds more complicated than it is: we simply define a 2D plane/surface along which the object can move, and effectively one direction in which it cannot.

Take a furniture App like Ikea Place for example. In the App, users place furniture on the floor, and can move it across the surface of the floor but cannot change its distance from the floor. They have reduced the manipulation down to a 2D plane parallel to the floor.

With movement restricted to a plane, taps and other gestures can now be projected through the camera onto points in 3D space.

Casting a 2D screen point to a 2D plane in 3D space
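The geometry behind this projection is a ray-plane intersection. ARKit and ARCore provide their own hit-testing, so the following is only a minimal sketch of the underlying maths, assuming the touch has already been converted into a world-space ray from the camera. The function name ray_plane_hit and the example numbers are illustrative, not taken from either framework.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def ray_plane_hit(origin: Vec3, direction: Vec3,
                  plane_point: Vec3, plane_normal: Vec3) -> Optional[Vec3]:
    """Return the point where a ray meets a plane, or None if it never does."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-9:
        return None  # the ray runs parallel to the plane
    to_plane = (plane_point[0] - origin[0],
                plane_point[1] - origin[1],
                plane_point[2] - origin[2])
    t = dot(to_plane, plane_normal) / denom
    if t < 0:
        return None  # the plane is behind the camera
    return (origin[0] + t * direction[0],
            origin[1] + t * direction[1],
            origin[2] + t * direction[2])


# Hypothetical example: camera held 1.5 m above the floor (the plane y = 0),
# with the touch ray pointing forward (-Z) and down (-Y).
hit = ray_plane_hit((0.0, 1.5, 0.0), (0.0, -1.0, -1.0),
                    (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(hit)  # (0.0, 0.0, -1.5)
```

The object is then placed at the returned point, so a drag on screen becomes a slide across the floor.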

Key Takeaway: You should consider if your object manipulation actually needs to be in 3D. Most Apps can likely simplify it to two dimensions, as in the real world we often expect objects to be grounded on flat surfaces by gravity.

Option 2: Multi-touch Gestures

Whilst multi-touch gestures are less familiar to users (the exception being pinch), they do provide a means by which the user can give more information about their intent and thus perform more complex manipulations.

For example, performing a drag with one finger may move an object parallel to the floor, whilst the same gesture with two fingers could be used to move it along the Up-Axis.

Alternatively, multi-touch gestures can be used to control different types of manipulation. As well as moving a 3D object, the user may want to rotate it. Again, this could be achieved by using a single finger to control movement, but two fingers to control rotation or scaling.
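As a rough sketch of the first idea, the number of fingers can simply select which axes a drag affects. The function apply_drag and the metres_per_pixel scale are hypothetical, and the sketch assumes the camera faces along the world's -Z axis; a real App would also need to account for the camera's orientation.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def apply_drag(position: Vec3, finger_count: int, dx: float, dy: float,
               metres_per_pixel: float = 0.002) -> Vec3:
    """Move an object by a screen drag; the finger count picks the axes.

    One finger slides the object parallel to the floor (X and Z).
    Two fingers raise or lower it along the up axis (Y).
    dx and dy are in screen pixels, with y growing downwards.
    """
    x, y, z = position
    if finger_count == 1:
        # Dragging up the screen (negative dy) pushes the object away (-Z).
        return (x + dx * metres_per_pixel, y, z + dy * metres_per_pixel)
    if finger_count == 2:
        # Dragging up the screen (negative dy) lifts the object (+Y).
        return (x, y - dy * metres_per_pixel, z)
    return position  # other finger counts are ignored in this sketch


start = (0.0, 0.0, 0.0)
print(apply_drag(start, 1, 100.0, -50.0))  # slides right and away
print(apply_drag(start, 2, 0.0, -50.0))    # lifts off the floor
```

The same dispatch could instead route two-finger gestures to rotation or scaling, depending on which manipulation matters most in the App.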

Key Takeaway: Multi-touch gestures can be used to convey additional user intent but they are not common. If you include them in your App, good onboarding and instructions will be essential.

Option 3: Use a Virtual Gizmo

If complex 3D manipulations are required but multi-touch gestures are too complicated, then additional onscreen controls can be added. These controls can allow the user to toggle the manipulation they want to apply. This approach is used in 3D modeling packages and game engines, where interactive UI components called Transformation Gizmos are attached to the targeted object.

A Transformation Gizmo is a collection of handles that a user can drag to perform transformations in a specific dimension, whilst locking transformation along the other dimensions. For example, in the far left image above, pulling up/down on the green arrow will only allow the object to move along the Up-Axis.

Different types of gizmos can be used for different types of manipulations. Typically, there will be ones for movement, rotation, and scale, which the user can toggle between.
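The locking behaviour of a movement handle can be sketched as a vector projection: whatever displacement the drag suggests, only the part that lies along the handle's axis is kept. This is a minimal illustration rather than how any particular engine implements its gizmos, and the name constrain_to_axis is made up for the example.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def constrain_to_axis(delta: Vec3, axis: Vec3) -> Vec3:
    """Keep only the part of a displacement that lies along the given axis."""
    length_sq = dot(axis, axis)
    if length_sq == 0:
        raise ValueError("axis must not be the zero vector")
    k = dot(delta, axis) / length_sq
    return (axis[0] * k, axis[1] * k, axis[2] * k)


# The user drags the green (up) handle, but their finger wanders sideways.
UP_AXIS = (0.0, 1.0, 0.0)
wandering_drag = (0.3, 0.5, -0.1)
print(constrain_to_axis(wandering_drag, UP_AXIS))  # (0.0, 0.5, 0.0)
```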

Key Takeaway: Gizmos are the standard approach when precision is vital. If your App requires intricate 3D manipulation this is probably the right choice.

Option 4: Use a Real Gizmo

Head-mounted Displays for VR/AR often come with a controller that tracks 3D movement and rotation and so can be used to perform 3D manipulations. Whilst iPhones, Pixels and Galaxys don’t have external controllers, the devices themselves have position and rotation information and can be used in a similar way.

Let’s say our user wants to manipulate a 3D object. They could tap to select it, which would bind its position and rotation relative to that of the device. When the user then moves or rotates the device, the object would move and rotate with it. Once happy, the user can tap again to deselect and release the object to its new fixed location.
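One way to sketch this binding is with transform matrices: on select we store the object's pose relative to the device, and on every frame we reapply that offset to the device's latest pose. The helpers below (pose, grab, held_pose) are hypothetical and, to stay short, only handle rotation about the vertical axis; an AR framework would supply full 4x4 camera transforms.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(m: Matrix) -> Matrix:
    """Invert a rotation-plus-translation matrix (no scale or shear)."""
    r_t = [[m[j][i] for j in range(3)] for i in range(3)]  # transposed rotation
    t = [m[i][3] for i in range(3)]
    inv_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[0] + [inv_t[0]], r_t[1] + [inv_t[1]], r_t[2] + [inv_t[2]],
            [0.0, 0.0, 0.0, 1.0]]


def pose(x: float, y: float, z: float, yaw_degrees: float = 0.0) -> Matrix:
    """Build a pose from a position and a rotation about the up (Y) axis."""
    c = math.cos(math.radians(yaw_degrees))
    s = math.sin(math.radians(yaw_degrees))
    return [[c, 0.0, s, x],
            [0.0, 1.0, 0.0, y],
            [-s, 0.0, c, z],
            [0.0, 0.0, 0.0, 1.0]]


def grab(device: Matrix, obj: Matrix) -> Matrix:
    """On select: remember the object's pose relative to the device."""
    return matmul(rigid_inverse(device), obj)


def held_pose(device: Matrix, offset: Matrix) -> Matrix:
    """Each frame whilst selected: reapply the offset to the device pose."""
    return matmul(device, offset)


# Tap to select an object 1 m in front of the device (forward is -Z).
offset = grab(pose(0.0, 0.0, 0.0), pose(0.0, 0.0, -1.0))

# The user steps 2 m to the right and turns 90 degrees to their left.
new_pose = held_pose(pose(2.0, 0.0, 0.0, yaw_degrees=90.0), offset)
new_position = [new_pose[i][3] for i in range(3)]
print(new_position)  # roughly 1 m along +X: still 1 m in front of the device
```

On deselect, the App simply stops updating the object and leaves it at its last held pose.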

Key Takeaway: This is an interesting technique but complex manipulations would probably be quite hard to achieve without falling over, so best for more playful Apps.

Look

Touch gestures will probably still be the main interactions in your Apps, but we can also create new ways of detecting user intent, such as where they are looking.

Actual eye-tracking is not (yet) available on our devices, but we can use the device position and rotation to approximate where the user is looking, much like how touching a position on screen approximates touching an object in real space.
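As a minimal sketch of this approximation, we can measure the angle between the direction the device is facing and the direction from the device to an object. A small angle suggests the object is roughly where the user is looking. The function name and the example positions are assumptions for illustration only.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def look_angle_degrees(device_position: Vec3, device_forward: Vec3,
                       object_position: Vec3) -> float:
    """Angle between the device's forward direction and the object."""
    to_object = tuple(o - d for o, d in zip(object_position, device_position))
    distance = math.sqrt(sum(c * c for c in to_object))
    forward_length = math.sqrt(sum(c * c for c in device_forward))
    if distance == 0 or forward_length == 0:
        return 0.0  # degenerate case: treat as looking straight at it
    cos_angle = sum(f * t for f, t in zip(device_forward, to_object))
    cos_angle /= distance * forward_length
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


device = (0.0, 0.0, 0.0)
forward = (0.0, 0.0, -1.0)  # assuming the camera looks along -Z

straight_ahead = look_angle_degrees(device, forward, (0.0, 0.0, -2.0))
off_to_the_side = look_angle_degrees(device, forward, (2.0, 0.0, -2.0))
print(round(straight_ahead, 1), round(off_to_the_side, 1))  # 0.0 45.0
```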

Two categories we can break this into are Visual Field and Focus. The former is about ‘what the user sees’ and the latter is about ‘what the user pays attention to’.

Visual Field

In human vision, the central part of our view is most important, with objects on the periphery gaining only partial recognition or attention. This serves as a useful analogy for us, even though our device is fully contained within our central view.

By dividing our screen into three columns we can categorise objects that are in the central view as more important than those that are on the periphery, and use this as an indication of interest from the user.

As the user moves around, objects will trigger six possible events. They will:

enter the screen
exit the screen
enter a peripheral column
exit a peripheral column
enter the central column
exit the central column

We can simplify these events to create two useful gestures: Glimpse and Look Toward — we name them in a way that implies user intent. We can say that the user:

Glimpsed an object (when it enters the peripheral view from offscreen)
Looked Toward an object (when it enters the central view from either peripheral view or from offscreen)

Much like a press when dealing with touch gestures, we can also add opposing release states to these two new gestures, namely: Look Away and Cease Look. We can say that the user:

Looked Away from the object (when it exits the central view to the peripheral)
Ceased Looking at the object (when it goes offscreen from the central or peripheral view)

glimpse > look toward > look away > look toward > cease look (Faces from Freepik)
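The four visual field gestures can be sketched as transitions between three zones. The sketch below assumes three equal-width columns and only considers the object's horizontal screen position, normalised so that 0 is the left edge and 1 is the right edge; the function names are made up for the example.

```python
from typing import List, Optional

OFFSCREEN, PERIPHERAL, CENTRAL = "offscreen", "peripheral", "central"


def zone_for(x: float) -> str:
    """Classify a normalised horizontal screen position into a zone."""
    if x < 0.0 or x > 1.0:
        return OFFSCREEN
    if 1 / 3 <= x <= 2 / 3:
        return CENTRAL
    return PERIPHERAL


def gesture_for(previous: str, current: str) -> Optional[str]:
    """Name the gesture implied by moving from one zone to another."""
    if previous == current:
        return None
    if current == CENTRAL:
        return "look toward"  # from either the periphery or offscreen
    if current == OFFSCREEN:
        return "cease look"   # from either the centre or the periphery
    # The object is now in a peripheral column.
    return "glimpse" if previous == OFFSCREEN else "look away"


# An object drifting across the screen as the user pans the device.
positions = [-0.2, 0.1, 0.5, 0.9, 0.5, 1.3]
previous = OFFSCREEN
events: List[str] = []
for x in positions:
    current = zone_for(x)
    gesture = gesture_for(previous, current)
    if gesture:
        events.append(gesture)
    previous = current

print(events)
# ['glimpse', 'look toward', 'look away', 'look toward', 'cease look']
```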

Key Takeaway: Whilst these new gestures are unlikely to be the primary gesture in your Apps, they do give us an excellent way to create delightful experiences that are a glimpse into the future.

Focus

Whilst visual field gives a nice hint at user interest, it can be taken a step further by adding the dimension of time. Specifically, if an object is at the center of the screen, we make the assumption that the user is focussing on it. Then we quantify the amount of focus by how long they keep it there.

Like visual field, as the user moves around, objects will trigger events. In this case, there are only three we need to care about. Objects will:

enter focus
be in focus
exit focus

By counting the amount of time an object spends in focus, we can create two interesting gestures: Focus and Stare. We can say that the user:

Focused on the object (when it is held in focus for 1 second)
Stared at the object (when it is held in focus for 3 seconds)

The actual number of seconds requires some experimentation, but we can think of these as equivalent to a tap and a press in touch. Also, note that unlike visual field, only one object can be in focus. This means the user can look toward multiple items but only stare at one.

Again we can add a state to indicate when a focus or stare is terminated, namely: Ceased Focus. We can say that the user:

Ceased Focussing on the object (when it exits focus)

focus > stare > cease focus (Faces from Freepik)
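These gestures amount to a small dwell timer per object. The sketch below uses the 1 second and 3 second thresholds from above and assumes something else decides each frame whether the object is at the center of the screen. The FocusTracker class is hypothetical, and it makes one interpretive choice: Cease Focus only fires if a Focus had already been reached.

```python
from typing import List, Optional

FOCUS_SECONDS = 1.0  # thresholds from the text; tune by experiment
STARE_SECONDS = 3.0


class FocusTracker:
    """Turns 'is the object at screen center?' samples into gestures."""

    def __init__(self) -> None:
        self.entered_at: Optional[float] = None
        self.fired: List[str] = []

    def update(self, in_focus: bool, now: float) -> List[str]:
        events: List[str] = []
        if in_focus:
            if self.entered_at is None:
                self.entered_at = now  # the object has just entered focus
            held = now - self.entered_at
            if held >= FOCUS_SECONDS and "focus" not in self.fired:
                events.append("focus")
            if held >= STARE_SECONDS and "stare" not in self.fired:
                events.append("stare")
            self.fired.extend(events)
        else:
            if "focus" in self.fired:
                events.append("cease focus")
            self.entered_at = None
            self.fired = []
        return events


# Simulated frames: (timestamp in seconds, is the object at screen center?)
frames = [(0.0, True), (0.5, True), (1.0, True), (2.0, True),
          (3.2, True), (3.5, False), (4.0, True), (4.3, False)]
tracker = FocusTracker()
log: List[str] = []
for now, in_focus in frames:
    log.extend(tracker.update(in_focus, now))

print(log)  # ['focus', 'stare', 'cease focus']
```

The brief second look at the end of the simulation lasts under a second, so it produces no gestures at all.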

Key Takeaway: Unlike visual field gestures, Focus gestures aim to isolate a single object, and as such can work as a nice complement. They can even be used as a form of soft selection (or preview) alongside touch gestures.

Proximity

Just as the user’s device position and rotation let us approximate what they are looking at, we can use that same information to gauge their distance from virtual objects. We can then use this distance to categorise objects into different levels of interest from the user.

There are a number of possible distances we could use to categorise objects but the three most important for AR are probably:

social range (within roughly 3m)
arm’s reach (peripersonal space, within roughly 1m)
intimate range (within roughly 30cm)

Shorter distances do not make sense whilst holding a device, and any other distances are too specific to be fixed and should be dealt with on an App by App basis.

As the user moves around the scene, the objects will:

enter social range
enter arm’s reach
enter intimate range
exit intimate range
exit arm’s reach
exit social range

Renaming these to be more colloquial, we can create six gestures: Approach, Reach, Embrace, Leave Embrace, Leave Reach, and Retreat. We can say that the user:

Approaches an object (when it enters social range)
Reaches an object (when it enters arm’s reach)
Embraces an object (when it enters intimate range)
Leaves Embrace of an object (when it exits intimate range)
Leaves Reach of an object (when it exits arm’s reach)
Retreats from an object (when it exits social range)

Note that many objects may be approached or within reach, whereas most likely only a single object at a time will be embraced, i.e. embracing an object can be used as a form of explicit selection.

approach > reach > embrace > leave embrace > leave reach (Faces from Freepik)
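The six proximity gestures can be sketched as crossings of three distance thresholds. The sketch below uses the rough distances given above and measures a straight line from the device to the object. The names RANGES, level_for and gestures_between are illustrative, and a jump across several ranges in one frame reports each boundary crossed in order.

```python
import math
from typing import List, Sequence

# (limit in metres, gesture on entering, gesture on exiting), outermost first.
RANGES = [
    (3.0, "approach", "retreat"),       # social range
    (1.0, "reach", "leave reach"),      # arm's reach
    (0.3, "embrace", "leave embrace"),  # intimate range
]


def level_for(distance: float) -> int:
    """Count how many ranges the object is inside (0 = outside them all)."""
    return sum(1 for limit, _, _ in RANGES if distance <= limit)


def gestures_between(previous: int, current: int) -> List[str]:
    """List the gestures for every range boundary crossed, in order."""
    if current > previous:
        return [RANGES[i][1] for i in range(previous, current)]
    return [RANGES[i][2] for i in range(previous - 1, current - 1, -1)]


def distance_to(device_position: Sequence[float],
                object_position: Sequence[float]) -> float:
    return math.dist(device_position, object_position)


# The user walks up to an object at the origin and then backs away again.
obj = (0.0, 0.0, 0.0)
walk = [(0.0, 0.0, 4.0), (0.0, 0.0, 2.5), (0.0, 0.0, 0.8), (0.0, 0.0, 0.2),
        (0.0, 0.0, 0.8), (0.0, 0.0, 2.5), (0.0, 0.0, 4.0)]
level = 0
events: List[str] = []
for device in walk:
    new_level = level_for(distance_to(device, obj))
    events.extend(gestures_between(level, new_level))
    level = new_level

print(events)
# ['approach', 'reach', 'embrace', 'leave embrace', 'leave reach', 'retreat']
```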

Key Takeaway: Proximity gestures provide a way of encouraging the user to move around a space. Reach is probably the most immediately useful: in the real world we tend to interact with objects within ‘arm’s reach’, so we can use this as a means of enabling or disabling interactions.

Conclusion

There are countless gestures that can be utilised in mobile AR. Over time these will standardise, as the number of Apps increases and certain gestures prove to be more natural. The gestures in this article thus provide a starting point for exploration. In summary, they are:

Glimpse: object enters the user’s peripheral view from offscreen
Look Toward: object enters the user’s central view
Look Away: object exits the central view to the peripheral
Cease Look: object exits the screen
Focus: object stays at screen center for 1 second
Stare: object stays at screen center for 3 seconds
Cease Focus: object moves away from screen center after being in focus
Approach: object enters social distance (3m)
Reach: object enters arm’s reach distance (1m)
Embrace: object enters intimate distance (30cm)
Leave Embrace: object exits intimate distance (30cm)
Leave Reach: object exits arm’s reach distance (1m)
Retreat: object exits social distance (3m)

Jeremiah is the founder of wiARframe, the world’s first dedicated AR prototyping tool. The gestures demonstrated in this article were prototyped in the upcoming release of wiARframe. You can read more about Effortless AR Prototyping and also sign up for further updates and early access.