Why gesture-based interfaces haven’t lived up to the hype

Minority Report — 2002

A recent Twitter discussion on the hidden costs of touchscreens inevitably led to the question of gesture-based user interfaces. Or, as the topic often comes up at design conferences, “You know, like in Minority Report.” Spielberg’s movie has been a canonical reference point for designers since its debut in 2002. In March 2018, Magic Leap, the company that continually promises to build a way to “interact with digital devices in a completely visually cinematic way,” closed an additional $500 million in funding without a viable product. What works in film is rarely feasible, useful, or necessary in reality.

In 2008, I was invited to visit MIT’s Media Lab, and I begged one of the Media Lab researchers to let me try out Oblong’s g-speak:

If the interface looks similar to the one in Minority Report, that’s no coincidence. John Underkoffler, one of Oblong’s co-founders, was the science advisor on the film, and he evolved work done at the Media Lab since the 90s into the g-speak project.

From the moment I put on the motion-tracking gloves and learned a few basic gestures, I loved g-speak. For the first time, I felt like I had real control over digital media. And with so much space, I felt like I was finally freed from the confines of a tiny screen. On top of that, the gesture control interface looked dramatic, and I felt incredibly cool while using it.

But there were some problems. Accurate motion tracking required at least a dozen cameras mounted on the lab’s ceiling, and I had to wear black gloves with white dots. What if my hands were smaller or larger? How would a child use the system? Would different gloves need to be made for them?

The second issue was the learning curve for interactions. Many people can click a mouse, but not everyone has the fine motor control to learn a variety of precise gestures and perform them accurately. In addition, the system must recognize intended gestures while filtering out false positives, which takes far more computing resources than a simple button or multi-touch input.
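To see why this filtering costs more than a button press, here is a minimal sketch (entirely hypothetical, not Oblong’s actual pipeline): a typical recognizer only fires when a classifier’s confidence exceeds a threshold *and* the same gesture persists for several consecutive frames, so a stray hand wave doesn’t trigger anything.

```python
# Hypothetical sketch: suppressing false positives in a gesture recognizer.
# A gesture must exceed a confidence threshold AND persist for several
# consecutive frames before it fires -- far more machinery than a debounced
# button, which only needs one electrical contact.

class GestureFilter:
    def __init__(self, threshold=0.85, min_frames=10):
        self.threshold = threshold    # classifier confidence required
        self.min_frames = min_frames  # consecutive frames required (dwell)
        self.streak = 0
        self.current = None

    def update(self, label, confidence):
        """Feed one frame of classifier output; return the label once it fires."""
        if confidence >= self.threshold and label == self.current:
            self.streak += 1
        elif confidence >= self.threshold:
            self.current, self.streak = label, 1
        else:
            self.current, self.streak = None, 0
        return self.current if self.streak >= self.min_frames else None

# A brief, confident-looking wave (4 frames) never fires; only a gesture
# held long enough does.
f = GestureFilter()
frames = [("swipe", 0.9)] * 4 + [("none", 0.2)] * 6 + [("swipe", 0.95)] * 12
fired = [f.update(label, conf) for label, conf in frames]
```

Every knob here (threshold, dwell time) trades responsiveness against false triggers, and the classifier feeding it must run on every camera frame. A physical button has none of these trade-offs.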

Finally, I realized that my arms were getting tired. A lot of my movements involved gesturing with my hands above my heart. Less blood was pumping into my wrists and hands; I was putting in more effort than necessary. This kind of motion is great for gaming (for instance, I loved the Nintendo Wii), but for a professional setting, I couldn’t see myself using it for more than an hour a day.

We often inherit aesthetic ideas about technology from cinema. We see something in a film and we want to interact with it in the same way, but we don’t realize how difficult that might be for daily use. A Xerox PARC researcher and father of ubiquitous computing put it nicely:

“A good tool is an invisible tool. By invisible, we mean that the tool does not intrude on your consciousness; you focus on the task, not the tool.” — Mark Weiser, 1993

Business machines look boring for a reason: any extra steps to complete an objective quickly become tedious. This is why Google search succeeded while some of its flashier, more visual search engine competitors did not. They required too much interaction before getting to the goal.

Left: The Cooliris search engine (acquired by Yahoo in 2014) featured visual search results that required significant cognitive and network resources. In contrast, Google’s search service focuses on speed and repeated use, providing visual results based on contextual information.

Just like a good film or book slowly engages us by revealing bits of itself over time, a good technology works with our minds to bring us closer to it. A book is a good interface because it amplifies our brain’s imagination. When we add our imagination to a book, we become one with it temporarily. We care about (and know more about) characters in a book than we do our next-door neighbors. The Tamagotchi was successful because it required human interaction to survive. The act of caring for it created a bond with the technology.

Does a gestural control interface invite imagination? No. It demands precision. Those automatic hand dryers at the airport work against our human imprecision, instead of with it. When poorly implemented, gesture control interfaces can embarrass us. In some cases, they can even ignore us.

For serious interfaces (those used repeatedly by knowledgeable professionals, or used in potentially hazardous situations), designers should carefully consider whether gesture recognition improves accessibility or introduces risk.

A hands-free air dryer at a rest stop in Minnesota. Photo CC by Elizabeth Aja.

Accuracy issues: Unlike pressing a physical button, gesture control is never 100% accurate. Analog interface controls, by contrast, map 1 to 1 and are easier to get right; even complex, counter-intuitive layouts can be memorized and mastered. (Ask any dedicated Xbox, PlayStation, or Nintendo Switch gamer about this!) Gesture control also provides no tactile feedback.

Computational issues: Gestural user experiences require far more processing power than a button press. The system must continuously filter out false positives: the random background motions of kids, balloons, weather systems, and other ambient “gesture-like” information.

Reduced inclusivity: Gesture control interfaces can be less accessible. A controller or a set of physical buttons can be operated with whatever limb, assistive device, or sequence of presses works for the user. A gesture system, by contrast, may fail for someone whose hands are smaller than those in its training data, or whose skin color the machine wasn’t properly trained on.

Motion requirements: Gesture control interfaces are great for short periods of time, but they quickly fail under repeated use and long sessions. This makes them enjoyable for family fun and games in fixed environments like living rooms, but unnecessarily taxing for professional work. Tom Cruise would have been much better off fighting pre-crime with a control more like a laser pointer held at the waist, but it wouldn’t have looked as cool.

But wait, what if we had a solution all along? What if we had something that worked with human affordances, held up to stress, and worked for a wide variety of people without the need for additional computing resources?

We did: that ubiquitous affordance at the bottom of a stainless steel trash can — the foot pedal.

What happened to the foot pedal?

The foot pedal was the original hands-free interface. It provides a hygienic way of interacting with the environment, without the need for light, sensors, or machine learning.

The foot pedal does not discriminate. It can be pressed with a cane as easily as it can a foot. Children can stomp on it, and people can take their anger out on it.

The pedal came straight out of the Industrial Revolution, adding an extra affordance for working with machines. It was a mainstay of sewing machines, then of vehicles, and we grew used to it. Even computer mouse inventor Doug Engelbart and his team experimented with foot- and knee-operated devices! We still see foot pedals as indispensable parts of musical performance: in guitar effects and looper pedals, and on pianos and drums. Could you imagine trying to play these instruments with gestures alone?

As Andrew Hinton added on Twitter, “Physical affordances are far superior to virtual / simulated ones. It’s a huge misunderstanding in our industry”.

We need simpler solutions, not more complex ones, and sometimes we can learn from the past. I like to think of this quote from Virginia Woolf:

“How readily our thoughts swarm upon a new object, lifting it a little way, as ants carry a blade of straw so feverishly, and then leave it.”

So the next time you’re in a bathroom trying to get the “smart” sink or air dryer to understand your flailing limbs, consider how easy it would be to just stomp on a foot pedal.

Thanks to @Loflyt for posing the gestural interface question that inspired this post!