Talking Hands

Vilde Reichelt
Published in Bakken & Bæck
Feb 15, 2021

With immersive technologies on the brink of a breakthrough, what would it take for machines to read and understand our gestures?

Human beings communicate by making sounds and moving body parts — a combination that helps us connect the inside of our minds with the outside world. We show each other how we conceptualise space with a simple wave of the hand. However, we are not able to gesture when we interact with machines. Yet.

Louder than words

To fully interpret us, speech robots need to learn more than the words we use. Body language is fundamental — both for understanding others and ourselves. Try explaining how a lock works whilst sitting on your hands. It’s difficult. You will most likely use your shoulders, head and eyes to compensate. That’s because gestures are a spontaneous mode of expression, directly connected to speech. Pointing, gesticulating, miming or using conventional signs, such as the thumbs-up, is useful for both the speaker and the listener; you gesture to organise your own cognitive processes and to direct the listener’s attention to the message you want to convey.

Gestural communication is ubiquitous throughout the animal kingdom, and because hands are so helpful to our own expressive abilities, we gesture whether or not we can see the person we’re communicating with — for example when talking on the phone — because it helps the speech signal along. Psycholinguists study this unbreakable bond between hand movements and speech: we don’t learn gestures separately from learning words. Even people who are born blind, and have never seen anyone gesture, use gestures — both when listening and to visualise intricate ideas.

Every speaker, of every language, in every culture we know, gestures to express thoughts. As an integral part of interaction, we talk to each other not only with sounds and facial expressions but, in large part, with our bodies. Winking, sticking out our tongues, moving arms and hands alongside speech — all of it contributes meaning. Without saying a word, you can request something by holding out your hands, or tie people to a reference like “as you said earlier” by swinging your arms toward them.

Spatial cognition

Our hands tend to mirror language structures, and by using gestures — alone or accompanied by words — we express linguistic and visual knowledge that sounds cannot convey. Human beings understand this because we inhabit the same types of bodies and have the same physical restrictions as other humans. We relate to the world by mapping this sense of embodiment onto the things around us — even onto inanimate objects, claiming that a bottle stands or lies on the table, picturing the spout as a head and the base as its feet.

In the early stages of language acquisition, children go through a one-word-plus-one-gesture phase, where they might say “(I want) biscuit” and make a grabbing movement. And when we describe an item we have watched move from one side to the other, e.g. “the ball rolled downwards (from left to right)”, our gestures will, more than 90% of the time, trace the same direction and the same object we once observed.

Flipping you the bird

For computer systems to get a grasp of how humans communicate through body language and gestures, we have to annotate and program what our range of different experiences really means — not only the words that the physical movements represent.

Since machines don’t have real intelligence, immersive technologies try to support the way we communicate. We can do this by training algorithms to recognise, read and analyse different objects — as we did with human pose estimation in Improving your Tennis Game with Computer Vision, and when manipulating furniture in IKEA’s Everyday Experiments — basing the algorithms on the objects’ features, as well as on how we use our body language.
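To give a concrete sense of what such algorithms work with, here is a minimal sketch of detecting one conventional sign — a thumbs-up — on top of an off-the-shelf hand-landmark model. MediaPipe Hands is assumed here purely for illustration (it is not necessarily what the projects above used), and the thumbs-up rule is a hypothetical heuristic, not a production classifier.

```python
# Minimal sketch: read hand landmarks from a webcam with MediaPipe Hands and
# apply a hypothetical rule for spotting a thumbs-up. Thresholds and the rule
# itself are illustrative only.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def looks_like_thumbs_up(hand_landmarks) -> bool:
    """Rough heuristic: thumb tip above the wrist while the other fingertips
    are curled below their middle joints (image y grows downwards)."""
    lm = hand_landmarks.landmark
    H = mp_hands.HandLandmark
    thumb_up = lm[H.THUMB_TIP].y < lm[H.WRIST].y
    fingers_curled = all(
        lm[tip].y > lm[pip].y
        for tip, pip in [
            (H.INDEX_FINGER_TIP, H.INDEX_FINGER_PIP),
            (H.MIDDLE_FINGER_TIP, H.MIDDLE_FINGER_PIP),
            (H.RING_FINGER_TIP, H.RING_FINGER_PIP),
            (H.PINKY_TIP, H.PINKY_PIP),
        ]
    )
    return thumb_up and fingers_curled

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        for hand in results.multi_hand_landmarks or []:
            if looks_like_thumbs_up(hand):
                print("thumbs-up detected")
cap.release()
```

The point of the sketch is less the rule itself than the kind of feature it reasons over: the relative positions of joints, which is exactly the sort of signal pose estimation makes available to downstream interpretation.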

As text technologies get better, the precision of natural language processing (NLP) will improve — hopefully making XR an accessible part of interaction. For these systems to work, we must recognise the different structures of body language: for example, that gesticulating to emphasise a point is a gesture, but moving a cursor with a computer mouse is not; the finger’s motion carries no meaning-bearing information — only the button that was pressed matters. Psycholinguists like to divide gesturing into five different dimensions of meaning:

  1. Signs are language specific, with their own linguistic structures and norms. They are not spontaneous and don’t need to be coordinated with speech.
  2. Gesticulation is the most frequent type of hand and arm gesture, though the head, legs and feet can take over to co-express an idea together with phonation.
  3. Speech-framed gestures are hand movements that fill a slot in the syntax and thus work as grammatical objects, indicating words such as “here”, “him”, “fast”. In all natural language, these are anything but redundant, and so interconnected with verbal speech that conversations without them will miss important pieces of information.
  4. Pantomime is at the extreme end of the continuum of gestures, conveying a narrative without any speech signals, e.g. miming bending something back to reinforce the direction in which something is moving.
  5. Emblems, symbols and icons are the conventional signs — the thumbs-up or peace sign — that we have emojis for. Like words, they have non-universal meanings that we have to learn. These “quotable gestures” mostly occur in spoken conversations, but they can be unaccompanied by sound.

Furthermore, the co-speech gestures — gesticulation, speech-framed gestures and emblems — all have subcategories that can coexist and behave in different ways, as human communication is always more complicated than what meets the eye (and the hand).
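Before a classifier can make those distinctions, the taxonomy has to exist as something annotatable. The sketch below is a hypothetical annotation schema — the names and attributes are ours, not an established coding standard — showing the kinds of properties, such as speech coordination and conventionality, that would have to be marked for each gesture.

```python
# Hypothetical annotation schema for the five gesture dimensions described
# above. Attribute choices are illustrative, not a standard coding scheme.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class GestureDimension(Enum):
    SIGN = auto()            # language-specific, own grammar, speech-independent
    GESTICULATION = auto()   # spontaneous, co-expressed with phonation
    SPEECH_FRAMED = auto()   # fills a syntactic slot ("here", "him", "fast")
    PANTOMIME = auto()       # narrative without any speech signal
    EMBLEM = auto()          # conventional, learned signs (thumbs-up, peace sign)

@dataclass
class GestureAnnotation:
    dimension: GestureDimension
    coordinated_with_speech: bool          # must it be timed with the utterance?
    conventional: bool                     # must its meaning be learned, like a word?
    speech_span: Optional[Tuple[float, float]]  # start/end (seconds) of linked speech

# Example: a thumbs-up produced while saying "sounds good"
thumbs_up = GestureAnnotation(
    dimension=GestureDimension.EMBLEM,
    coordinated_with_speech=False,   # emblems can stand alone
    conventional=True,
    speech_span=(12.4, 13.1),
)
```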

Gestures are also a way to signal that you have understood the social norms of a society. For example, you can cover your face with your hands to show that you’re embarrassed to say something, or make scare quotes in the air to distance yourself from a statement. The relevant stretches of speech must also be closely linked to the gestures in time, so as not to be read as ambiguous; you can’t raise your eyebrows long before or after you’ve said something sarcastic.

Pointing with the brain

In the same way that natural language processing is hindered by a lack of context, computer interfaces don’t yet read gestures. Instead of letting all this useful information go to waste, we could build more intelligent and accessible dialogue services for people with visual or hearing impairments, motor disabilities, or autism — e.g. guidance for interpreting emotional input.

There is a lot of untapped potential in combining text technology with spatial computing through gestures — especially for characterising certain phenomena, like motion, or supplying context for “the person we talked about earlier” by pointing, instead of people having to describe every concept explicitly.
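As a toy illustration of what “pointing instead of describing” could look like computationally, the sketch below resolves an underspecified reference by picking the candidate that lies closest to the pointing direction. Everything here is hypothetical — the function name, the candidates and the coordinates — and a real system would fuse this with speech timing and 3D tracking rather than flat geometry.

```python
# Minimal sketch of resolving a deictic reference ("that one", "her") by
# combining a pointing direction with candidate positions from a vision
# system. Plain 2D geometry: choose the candidate closest in angle to the ray.
import math

def resolve_pointing(origin, direction, candidates):
    """origin/direction: 2D pointing ray (e.g. wrist -> index fingertip).
    candidates: dict of label -> (x, y) position. Returns the best label."""
    dir_angle = math.atan2(direction[1], direction[0])

    def angular_distance(pos):
        to_target = math.atan2(pos[1] - origin[1], pos[0] - origin[0])
        # wrap the difference into [-pi, pi] and take its magnitude
        return abs(math.atan2(math.sin(to_target - dir_angle),
                              math.cos(to_target - dir_angle)))

    return min(candidates, key=lambda label: angular_distance(candidates[label]))

# Speaker points roughly to the right while saying "the person we talked
# about earlier" — the resolver picks the candidate nearest that ray.
print(resolve_pointing(
    origin=(0.0, 0.0),
    direction=(1.0, 0.1),
    candidates={"chair": (-2.0, 1.0), "Ada": (3.0, 0.5), "lamp": (0.5, 3.0)},
))  # -> "Ada"
```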

This is not that far-fetched. Machines are advancing in revolutionary ways — gaining better sensory detection, 3D models and image recognition — to echo us. Some emerging technologies have started focusing on “in-air” gestures, and since visual input is lacking in today’s dialogue systems, the future is looking bright for more spatial interaction. Take a look at our prototype for IKEA’s Everyday Experiments.

If we can train machines to detect and read this sensory input from movement — the different angles, objects and facial expressions, through various materials — and still keep the information private, there is reason to believe that, after years of us learning their ways of interacting, machines will soon adapt to our innate communication. Wave hello to them when they do.

All illustrations by Nicolas Vittori

Interested in learning more about machines and language technology? Read:
