Robotics in Real Life: What makes robots maneuverable in social environments?

Aref Malek
MLPurdue
9 min read · Jan 24, 2023


When the phrase “social robotics” comes to mind, many of us picture a humanoid: maybe C-3PO, maybe a Skynet-backed Terminator (if you’re quite the cynic). Despite this humanoid image, as college students we interact with social robotics almost every time we step outside. Here at Purdue we have Starship robots that deliver food and groceries to students, taking the same sidewalks and crosswalks as everyone else. Just like students, roboticists in this area have to ask questions like “how do these robots know to avoid hitting people, and how do they deliver goods while dealing with obstacles and oncoming traffic?”

Beyond just delivering goods along roads and sidewalks, it is natural to wonder how robots will learn to navigate social settings like classrooms, offices, and homes. How will robots learn to gauge emotions, anticipate where people are walking, and give them appropriate personal space? These are the questions answered in the paper we will be going over today, EWareNet by Narayanan et al.

At a high level, the paper creates robots that are emotionally aware of how people are feeling and that plan paths respecting both their personal space and where they are headed. The paper is not the first to introduce this concept of robots being aware of humans and traveling alongside them (just think of all the self-driving cars in development), but the researchers’ new approach allowed them to beat state-of-the-art algorithms without investing in extremely expensive hardware. This means the research can be applied to everyday robots, advancing the tech everywhere.


Technical Details

What are ANNs?

Artificial Neural Networks (ANNs) are a subset of machine learning whose structure is inspired by the brain. ANNs have three important types of layers: input, hidden, and output. Each layer processes information from the previous layer and feeds the result forward into the next layer.

The core ingredient of ANNs is the neuron, also called the perceptron. Each neuron processes and transmits information, allowing the network to learn how to make intelligent decisions.

The computation of each neuron is simply the multiplication of each input feature (1 or 0) by its weight. After summing these values, we check whether the result exceeds our threshold: we output 1 if it does, and 0 otherwise. An example of this can be seen below:
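To make that concrete, here is a minimal Python sketch of the perceptron computation just described; the inputs, weights, and threshold are made-up values for illustration.

```python
# A minimal sketch of the perceptron computation described above.
# The inputs, weights, and threshold are made-up values for illustration.

def perceptron(inputs, weights, threshold):
    """Weighted sum of binary inputs, compared against a threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Example: three binary features with hand-picked weights.
inputs = [1, 0, 1]
weights = [0.6, 0.4, 0.3]
print(perceptron(inputs, weights, threshold=0.5))  # -> 1, since 0.6 + 0.3 > 0.5
```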

This paper uses several types of ANNs, including Convolutional Neural Networks (CNNs) and Transformers.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks are ANNs that operate on visual data and are the backbone of modern AI and computer vision. You can think of CNNs as scanning over an image multiple times and figuring out which features contribute most to an output. Consider an example where you have to figure out whether the animal in a picture is a cat or a dog: features like triangular ears and whiskers may be highly characteristic of cats, while long tongues and floppy ears may be a telltale sign of dogs.
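As a rough illustration (not the model used in the paper), here is a minimal PyTorch sketch of a tiny CNN that could separate cats from dogs; the layer sizes and input resolution are arbitrary choices.

```python
import torch
import torch.nn as nn

# A tiny, illustrative CNN for a two-class (cat vs. dog) problem.
# Layer sizes are arbitrary; this is not the network used in the paper.
class TinyCatDogCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # scan for low-level features (edges, textures)
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combine them into higher-level cues (ears, whiskers)
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)      # assumes 64x64 input images

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCatDogCNN()(torch.randn(1, 3, 64, 64))  # one fake 64x64 RGB image
print(logits.shape)  # torch.Size([1, 2])
```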


Transformers

Unlike the rest of the ANNs introduced today, transformers are a relatively new architecture. Many of the ANNs we see in use were developed decades ago; their resurgence comes mainly from the rapid growth of computational power and data resources [d2l].

Functionally, transformers are a type of neural network used to process data that occurs in a sequence, like a sentence or the steps in someone’s stride. Transformers work on the principle of attention. Before explaining what self-attention is and how it works, let me explain how transformers use that principle: imagine you have a sentence in Spanish that you want to translate into English, and each word in the sentence is assigned attention based on its importance. The final output of the transformer is simply a probability assigned to every word in English, and we pick the most likely word given what has come before (the probabilities change each time we choose a word).

Self-attention can be thought of as a way to weigh the importance of different elements in a sequence. Imagine you are trying to summarize a physics textbook. The self-attention mechanism would assign a weight to each sentence in the book, based on how important it is to understand the overall meaning of the book. When we learn from these textbooks, we remember certain formulas, principles, and theorems because they represent the core of the knowledge of the textbook — these are the ‘sentences’ that would have the largest assigned weight from the self-attention mechanism.
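Here is a minimal NumPy sketch of that idea, using plain scaled dot-product self-attention with random embeddings standing in for the sentences; real transformers add learned query, key, and value projections on top of this.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of embeddings X of shape (seq_len, d).
    For simplicity the queries, keys, and values are X itself (no learned projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # how much each element "looks at" every other
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1
    return weights @ X, weights                        # weighted mix of the sequence, plus the weights

# Five "sentences" represented by random 8-dimensional embeddings.
X = np.random.randn(5, 8)
output, weights = self_attention(X)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```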

What is RL?

Reinforcement Learning (RL) is a learning paradigm for making sequences of decisions optimally. RL aims to mimic how we as humans learn: specifically, an RL agent learns which action to take at any point in time to lead it toward its goal (i.e., the highest reward). Let’s break down the components of RL with analogies to Super Mario Bros.:

  • State space (also called observation space): This describes all the available information and features of the environment that we can use to make a decision. In Mario, this would include everything that could occur at a given position (an enemy, a pit of doom, a red mushroom, etc.).
  • Action space: The decisions you can take in each state of the system. In Mario, this means moving left or right, jumping, or crouching.
  • Reward Signal: This signals the performance of the current action, which takes us from our current state into the next state, and it leads us to prefer the path that gives the highest rewards (e.g., on a flat course in Mario, going right toward the flag will always have a higher reward than going left). Discounted rewards force RL algorithms to optimize their trajectory toward the goal: if Mario takes 200 steps to reach the flag compared to 10, the discounted reward signal should clearly favor the actions of the latter (a small numerical sketch follows this list).
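Here is that discounting effect in a small numerical sketch; the reward values and the discount factor of 0.99 are arbitrary illustrative choices.

```python
# A sketch of why discounting favors shorter paths to the same goal.
# The reward values and discount factor are arbitrary illustrative choices.

def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per time step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Reaching the flag (reward 100) after 10 steps vs. after 200 steps of zero reward.
short_path = [0] * 9 + [100]
long_path = [0] * 199 + [100]
print(discounted_return(short_path))  # ~91.4
print(discounted_return(long_path))   # ~13.5
```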

High-Level Overview

As with all research papers, there are multiple components that come together to make the whole thing novel. Here we break down each of the parts and the fundamental question it answers for our robotic friend.

Human Pose Extraction: “Where are the people, and what’s their body language like?”

This is the backbone of the entire AI pipeline. Although pose is not as direct a factor as emotion or trajectory, those factors are heavily influenced by body language and how we position ourselves, as we’ll explore in the next few points. Pose extraction was achieved using a CNN trained on images of people and their skeletal poses.
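The paper’s pose extractor isn’t reproduced here; as a loose sketch of the idea, a keypoint-regressing CNN might look like the following, where the number of joints and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# A rough sketch of a keypoint-regressing CNN, not the paper's pose extractor.
# NUM_JOINTS and all layer sizes are illustrative assumptions.
NUM_JOINTS = 17

class PoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, NUM_JOINTS * 2)  # an (x, y) coordinate per joint

    def forward(self, img):
        feats = self.backbone(img).flatten(1)
        return self.head(feats).view(-1, NUM_JOINTS, 2)

keypoints = PoseRegressor()(torch.randn(1, 3, 128, 128))
print(keypoints.shape)  # torch.Size([1, 17, 2])
```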

Human intent prediction: “Where is this person going?”

In each of the robots we’ve seen, an essential feature is that they never run into or interrupt a human on their path. Internally, robots accomplish this by predicting the trajectories of where people are heading based on their past few strides. While this may seem dramatic, people do the exact same thing when crossing the street: when we’re walking on the left side of the sidewalk and someone is running on the right side, we tend to assume they’ll stay on that side. This path prediction was accomplished using a transformer-based neural network trained on sequential data of a person’s strides and the trajectory that followed from them.
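As a hedged sketch of that idea (not EWareNet’s actual architecture), the following uses PyTorch’s built-in transformer encoder to map a short sequence of past 2-D positions to a predicted future trajectory; the sequence lengths and model dimensions are assumptions.

```python
import torch
import torch.nn as nn

# A hedged sketch of a transformer-based trajectory predictor, NOT EWareNet itself.
# It maps OBS_LEN past 2-D positions to PRED_LEN future positions; all sizes are assumptions.
OBS_LEN, PRED_LEN, D_MODEL = 8, 12, 64

class TrajectoryTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(2, D_MODEL)                        # lift (x, y) into model space
        encoder_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, PRED_LEN * 2)              # predict all future (x, y) at once

    def forward(self, past):                                      # past: (batch, OBS_LEN, 2)
        h = self.encoder(self.embed(past))                        # attend over the observed strides
        return self.head(h[:, -1]).view(-1, PRED_LEN, 2)          # decode from the last time step

future = TrajectoryTransformer()(torch.randn(4, OBS_LEN, 2))
print(future.shape)  # torch.Size([4, 12, 2])
```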

Proxemic Constraints Modeling: in a phrase, just think “personal space.”

This part of the paper is aimed at guessing how people are feeling based entirely on their body language and facial expressions. As humans, we intuitively know that someone’s personal space usually depends on how they feel, and that intruding on anyone’s personal space changes their mood (usually for the worse). Our AI needs to know how people are feeling in order to move without stepping into their personal space, and, if it must get up close and personal, to understand the consequences of doing so. Here the model is a CNN that takes in images of people and their body language (remember pose estimation?) and outputs predictions of their emotions and proxemic constraints (their personal space).
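As a loose sketch of such a network (not the paper’s exact design), the following two-headed CNN outputs emotion logits along with a single personal-space radius; the emotion categories and the single-radius output are assumptions.

```python
import torch
import torch.nn as nn

# A hedged sketch of a two-headed CNN: emotion logits plus a personal-space radius.
# The emotion categories and the single-radius output are assumptions, not the paper's exact design.
NUM_EMOTIONS = 4  # e.g., happy, sad, angry, neutral (illustrative)

class EmotionProxemicsNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.emotion_head = nn.Linear(64, NUM_EMOTIONS)   # how the person seems to feel
        self.space_head = nn.Linear(64, 1)                # how much personal space to give (meters)

    def forward(self, img):
        feats = self.backbone(img).flatten(1)
        return self.emotion_head(feats), torch.relu(self.space_head(feats))

emotion_logits, radius = EmotionProxemicsNet()(torch.randn(1, 3, 128, 128))
print(emotion_logits.shape, radius.shape)  # torch.Size([1, 4]) torch.Size([1, 1])
```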

Intent-Aware Navigation: I have all this information, now what?

This brings all the components together. This part of the experiment sets up an RL environment that factors in the paths people are taking, their emotions, and how the robot’s next action will affect these constraints. We now have pretty much all of the setup for the RL game from the previous points, so let’s match them up and see how our robot will optimize:

  • State space: This is all of the information the robot can take in before making its next move. Where people are, where they are going, and how they are feeling (and consequently, how much personal space they need) are all factors our robot needs in order to take its next action.
  • Action space: The most intuitive of the three: where is our robot going? The robot can move in any direction or stay still, just like any other person walking.
  • Reward Signal: Rewards for the robot depend on how people are feeling. Some rewards are obvious, like moving closer to the destination, not invading personal space, and not interrupting people who are walking. The paper also adds features to support a more cohesive environment; for example, the robot is negatively rewarded for jittery movements, since those would likely confuse everyone around it (a sketch of such a reward function follows this list).
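As a hedged sketch of what such a reward could look like (not the paper’s actual reward function), the following combines progress toward the goal, a personal-space intrusion penalty, and a jitter penalty; all weights and terms are illustrative assumptions.

```python
import numpy as np

# A hedged sketch of a reward along the lines described above; the weights and terms
# are illustrative assumptions, not the paper's exact reward function.
def step_reward(robot_pos, prev_pos, goal, people, personal_radii, prev_action, action,
                w_goal=1.0, w_space=5.0, w_jitter=0.5):
    # Progress toward the goal: positive if this step got us closer.
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(robot_pos - goal)
    # Personal-space intrusion: penalize being inside anyone's (emotion-dependent) radius.
    intrusion = sum(max(0.0, r - np.linalg.norm(robot_pos - p))
                    for p, r in zip(people, personal_radii))
    # Jitter: penalize abrupt changes of direction between consecutive actions.
    jitter = np.linalg.norm(action - prev_action)
    return w_goal * progress - w_space * intrusion - w_jitter * jitter

# One step: the robot moves toward the goal but brushes one person's personal space.
r = step_reward(robot_pos=np.array([1.0, 0.0]), prev_pos=np.array([0.0, 0.0]),
                goal=np.array([5.0, 0.0]), people=[np.array([1.2, 0.5])],
                personal_radii=[1.0], prev_action=np.array([1.0, 0.0]),
                action=np.array([1.0, 0.0]))
print(round(r, 3))  # net reward: progress gained, minus the intrusion penalty
```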

By optimizing this game, we get an AI-powered robot that moves from its start to its destination while minimizing travel time, respecting personal space, keeping its route smooth, and accounting for where people are headed. Now let’s see how it performs :^).

Analysis and Takeaways

With all this experimental design in place, we also need to look at the results. Since the novel parts of this paper deal with predicting the trajectories of others and with the performance of an RL-controlled robot, those are the results we’ll discuss, along with the dataset they were measured on.

Trajectory Prediction. Using a dataset of 3.6 million images of people posing, EWareNet was shown to beat many of the top trajectory prediction networks while significantly lowering the prediction time of each. In the paper, the authors credit the use of the emotion prediction network (the CNN we discussed earlier) for improving the trajectory prediction.

Comparison of EWareNet’s prediction times with other trajectory prediction algorithms.

When comparing the RL policy of EWareNet with other state-of-the-art systems, we see a major reduction in the amount of personal space the robot invades. Compared with the rest of the SOTA, EWareNet intrudes on approximately half as much personal space as its counterparts.

Comparison of EWareNet with other popular robotic routing algorithms.

Takeaways

  • The paper introduces a novel approach to predicting human trajectories using transformers, based on previous cycles of a person’s footsteps.
  • Based on these trajectories, the authors build a reinforcement-learning planner that finds optimal ways for the robot to reach its destination with minimum disturbance to its human counterparts.
  • The robot also estimates how much space people need based on its (perceived) understanding of how they feel.

Sources:

  1. EWareNet — https://arxiv.org/pdf/2011.09438.pdf
  2. Emotional Learning for Robotics Settings — https://openaccess.thecvf.com/content_CVPRW_2019/papers/MMLV/Bera_The_Emotionally_Intelligent_Robot_Improving_Socially-aware_Human_Prediction_in_Crowded_CVPRW_2019_paper.pdf
  3. Transformers explained — https://www.youtube.com/watch?v=4Bdc55j80l8
  4. Transformer explanation (and just general great AI utility) — https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html
  5. IBM tutorials, very useful for some quick definitions in AI and SWE — https://developer.ibm.com/learningpaths/get-started-automated-ai-for-decision-making-api/what-is-automated-ai-for-decision-making/
