Using RL to teach a drone to guide a human-controlled vehicle

Ido Glanz · Published in Analytics Vidhya · Sep 21, 2020 · 9 min read
AirSim running on UnrealEngine with drone and car

Tackling a Human-Robot Interaction (HRI) task: an autonomous drone learns to lead a human-controlled vehicle through the AirSim environment using a Reinforcement Learning algorithm.

A joint work with Matan Weksler

Among the recently emerging tasks for autonomous agents, and more specifically those involving interaction with a human being, there is a family of tasks that requires guiding a human through a path or a task. It could be helping someone cross a busy road, teaching a new task as done by Ali Shafti et al. in their work on “A Human-Robot Collaborative Reinforcement Learning Algorithm”, or leading the way in a robot-led museum tour as done by Wolfram Burgard et al. in their work on The Interactive Museum Tour-Guide Robot.

Such algorithms are bound to become more and more popular as robots take a bigger part in our daily lives and assist us in a growing variety of tasks.

With that in mind, we thought we’d share our recent project in the field of Human-Robot Interaction (HRI) in the form of robotic guidance, namely the task of an autonomous drone that needs to guide a human agent through a simulated 3D grid-world environment.

The task, as we see it, is for the robot agent to learn the human’s control constraints (such as latency, line-of-sight to the robot, manoeuvrability, etc.) so as to allow the human to properly follow it along the path, yet do so in minimal time.

In other words, the drone will need to learn to optimise its flight so that a human-controlled car can follow it along a path. The drone monitors their relative state and decides on its action, concluding a successful episode if both reach the goal within a reasonable lag.

AirSim environment

To even start simulating such a task, we set up the (honestly great) Microsoft AirSim plugin running on UnrealEngine and had it simulate both a car and a drone. While simulating either one alone, or even two of the same kind together, is quite easy to do, working with both simultaneously is not yet supported, and we had to work around this a bit to set it up. You can follow our guide here if you want to try it out as well.
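To give a feel for what driving both vehicles from Python looks like, here is a minimal sketch using the AirSim Python API. It assumes a settings.json that already exposes a multirotor (“Drone1”) and a car (“Car1”) to the same simulation; the vehicle names are illustrative, and our actual workaround is described in the guide linked above.

```python
import airsim

# Sketch only: assumes both "Drone1" and "Car1" are defined in settings.json.
drone = airsim.MultirotorClient()
drone.confirmConnection()
drone.enableApiControl(True, vehicle_name="Drone1")
drone.armDisarm(True, vehicle_name="Drone1")
drone.takeoffAsync(vehicle_name="Drone1").join()

car = airsim.CarClient()
car.confirmConnection()
# The car stays under human control, so we only read its state.

drone_state = drone.getMultirotorState(vehicle_name="Drone1")
car_state = car.getCarState(vehicle_name="Car1")
print(drone_state.kinematics_estimated.position, car_state.speed)
```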

We worked with the simple 3D Blocks environment during training, but swapping in a different environment can easily be done as well.

Operational backbone

Before we dive into the different modules we implemented, let’s briefly describe the working scheme of the algorithm, as depicted in the block diagram below:

Roughly speaking, both vehicles land in the 3D grid-world; the drone captures a bird’s-eye image, plans a path to a given (random) goal, and starts flying towards it, constantly monitoring the relative state of the vehicle, which is controlled by a human driver trying to follow it along.

At this point you might be asking yourself: so who is learning what, and when? During the “game” the drone constantly needs to decide on its action (i.e. slow down, stop, increase its speed), and it gets rewarded according to the state it is in. At the end of each episode (or game), the drone’s decision maker (or policy) is trained on the experiences it gathered along the game, towards making better decisions in view of the overall success of the episode. Hang on, we will soon elaborate more on the algorithm.
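In code, this scheme boils down to a fairly standard interaction loop. The sketch below is a simplified outline under assumed placeholder names (env, policy, planner and their methods are illustrative, not our actual module names):

```python
import time

def run_episode(env, policy, planner, max_seconds=100):
    """Simplified outline of one episode; env/policy/planner are placeholders."""
    birdseye = env.capture_birdseye_image()           # drone's top-down snapshot
    path = planner.plan(birdseye, env.random_goal())  # plan towards a random goal
    state = env.reset(path)

    trajectory, start, done = [], time.time(), False
    while not done and time.time() - start < max_seconds:
        action = policy.sample_action(state)          # e.g. a discrete speed in 0..10
        next_state, reward, done = env.step(action)   # the human keeps driving the car
        trajectory.append((state, action, reward))
        state = next_state

    policy.update(trajectory)                         # actor-critic update after the game
    return trajectory
```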

Path Planning

A second necessity we encountered was the need for the drone to have some path or trajectory to fly along. While pre-setting trajectories for the sake of training was possible, we wanted to design a more general solution: just supply the drone with a goal and let it do the planning work, also accounting for the specific task (having a human follow you), as you will see below.

Path planning in general is a thoroughly researched topic in robotics, with algorithms ranging from simple shortest-path planners to complex value-based planners for dense environments. We chose to focus on two common planners, PRM and RRT, adding a slight augmentation to support an external (later learned) parameter controlling the smoothness of the planned path (thus making it, to our intuition, easier to follow along):

  1. Smooth RRT — a flavour of Rapidly-Exploring Random Trees with the addition of limiting the allowed turning angle of a path. The algorithm is similar to the original one, only when checking whether a new branch is feasible (i.e. in terms of obstacle avoidance) we also apply a maximum-turning-angle check: looking at the previous node, we calculate the turning angle between the two edges and only allow branches beneath the threshold (see the code sketch after this list). This hyper-parameter is flexible and is controlled/learned by the algorithm from the experience gained in previous episodes. While this algorithm does not necessarily derive an optimal path, it runs quickly in a sparse environment, so we were able to perform several runs, each generating a trajectory together with a cost, and then choose the best one.
Path found by the smooth-RRT algorithm. As can be seen, only branches inducing a turning angle of less than 50 degrees were allowed, thus a smooth path was derived.
  2. PRM with a BFS-like dynamic-programming planner, designed to find an optimal path from start to goal with respect to both the distance and the smoothness of the path. To do so, each node has to keep not only the shortest path to it but also the smoothness of traversing from each of its neighbours, so that if it is reached from a new node it can re-calculate which path would best continue from there. The cost weight of turning compared to distance is controllable and can be adjusted by the algorithm. While in theory the algorithm should derive an optimal path (in the sense described above and w.r.t. the graph the PRM generated), its complexity is sometimes not worth the time compared to the RRT algorithm, which converges rather quickly in less dense environments.
Path found by the smooth PRM-like algorithm. Note the path is only optimal w.r.t. the points it randomly generated (thus globally optimal only as the number of points tends to infinity).
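For illustration, the turning-angle check at the heart of the smooth-RRT variant can be sketched as follows (function and variable names are ours for the example, not the exact ones from our repository):

```python
import numpy as np

def turning_angle(prev_node, node, candidate):
    """Angle (in degrees) between the incoming edge and the candidate new edge."""
    v1 = np.asarray(node) - np.asarray(prev_node)
    v2 = np.asarray(candidate) - np.asarray(node)
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def branch_is_smooth(prev_node, node, candidate, max_turn_deg=50.0):
    """Accept the new branch only if the turn it induces is below the threshold.
    The usual RRT collision check is applied separately."""
    return turning_angle(prev_node, node, candidate) <= max_turn_deg
```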

Disclaimer — after testing both, we decided to continue with the RRT-based method for the implementation of the algorithm, mainly due to its sufficient results in the simplified environment and the shorter overall running time per episode. That said, if need be, both (or any other planning heuristic, for that matter) could be plugged in and used as the path-planning block.

And now for the Algorithm

We implemented a gym-like environment wrapper and an RL algorithm with an Actor-Critic scheme. Roughly, this means two different functions (neural networks in our case): one outputs a probability distribution over the possible actions (a discrete set of speeds from 0 to 10), and the second outputs the value of the specific state (the value of a state being a known RL term for how good the state is in terms of the final goal of the game and the ability to get there). The first is used to generate the next action, while the second is used to reduce the variance of the loss term in the learning process. We won’t go into all the details of the RL algorithm’s working mechanics, as it’s a known scheme and not our novel work; our part was implementing it so that it receives the state vector we extract from the simulator in real time and returns the action for the simulator to execute.

A lot of effort also went into the reward shaping of the algorithm, i.e., what the agent gets rewarded/penalised for, and moreover at what scale. Currently, the agent gets a time penalty of -1 every second (to generate a tendency to complete the path), a -50 penalty if there is no line-of-sight between the agents, penalties on the relative distance, speed and heading of the drone and car, and a milestone reward for reaching nodes in the planned trajectory graph (also to generate an incentive to complete the path). We played with the scale of the parameters, balancing the tendency to complete the path vs. “politeness”, and we think further work could be done on more rewarding heuristics to generate more optimal policies.
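As a rough illustration of the shaping described above, the per-step reward could look something like the sketch below. Only the -1 time penalty and -50 line-of-sight penalty are the values quoted above; the coefficients for the distance, speed and heading penalties, the milestone bonus, and the state attribute names are assumptions for the example.

```python
def step_reward(state, reached_milestone,
                w_dist=0.1, w_speed=0.1, w_heading=0.05, milestone_bonus=10.0):
    """Per-second reward sketch; the weights are illustrative, not the trained values."""
    reward = -1.0                                       # time penalty: keep moving
    if not state.line_of_sight:                         # car lost sight of the drone
        reward -= 50.0
    reward -= w_dist * state.relative_distance          # keep the car close
    reward -= w_speed * abs(state.relative_speed)       # match speeds
    reward -= w_heading * abs(state.relative_heading)   # keep roughly the same heading
    if reached_milestone:                               # passed a node of the planned path
        reward += milestone_bonus
    return reward
```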

A general Actor-Critic scheme, taken from Elmar Diederichs’ “Reinforcement Learning — A Technical Introduction”
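For readers less familiar with the scheme, a minimal actor-critic pair for this setting might look like the following PyTorch sketch (the state dimension, hidden sizes, and the 11-way discrete speed head are assumptions for illustration, not our exact architecture):

```python
import torch
import torch.nn as nn

N_SPEEDS = 11  # discrete speeds 0..10, as described above

class Actor(nn.Module):
    """Maps the state vector to a probability distribution over discrete speeds."""
    def __init__(self, state_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_SPEEDS),
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    """Maps the state vector to a scalar value estimate, used to reduce variance."""
    def __init__(self, state_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

# Usage: dist = actor(state); action = dist.sample(); value = critic(state)
```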

So did it work?

A single game at 5X speed

First of all, to train our agent we had approximately 100 training episodes (games) overall, each taking about 100 seconds (until either reaching the goal or timing out). The training was done by 3 different users, and the evaluation by 7 users, each playing 10 evaluation games. While this is not much in terms of learning algorithms, even with this modest amount of experience we were able to start seeing signs of learning, as we will show soon.

As you can see below, one of the first things the agent learned was to slow down when the relative distance increased (i.e. the car was far behind the drone). A second-order observation is the general speed reduction, most likely to satisfy the penalty/reward for relative speed. With the trained agent, a user with only 2 introduction games was able to complete 8 out of 10 games, an overall success rate of ~80% (before training, it was hard to even see the drone before it just flew off to the goal). Nevertheless, much more training is needed to actually optimise the algorithm, and it should train on a more diverse dataset (i.e. more users) to reduce the bias of us learning to play and to allow it to capture more complex insights about the human agent.

State-action plot of a single episode. After approximately 60 runs the drone was able to learn to slow down when the car was further away.

While much more data is needed to conclude that the algorithm is indeed working, and more specifically to test it with more new users, the results, as can be seen, do indicate that some learning is taking place.

Below is a graph of the average reward per episode during the learning phase. As can be seen, at around 40–50 episodes the algorithm picked up on the relative-distance penalty and started to slow down when the car was further away. That said, all the episodes were still finished within 100 seconds or less (thus it kept pursuing its goal reward).

Average reward over the last 90 episodes (games). After about 40 episodes the drone “learned” it’s better to wait for the car to reduce the relative-distance penalty.

Wrapping up

The main takeaways from this project are, first of all, the complexity of actually setting up an environment in which such algorithms can be built and trained, and more specifically the work with the external simulator and plugins we had to set up. Furthermore, gathering data where each episode has to be played together with a human (as opposed to an algorithm learning to play a video game, for example) is a great challenge for algorithms that need a massive amount of training data to converge; in some cases a reduction in complexity, or a breakdown into subproblems, is needed to make learning feasible at all. That said, we were excited to actually see the drone start to capture the (though trivial) constraints of human/car response time and the speed limitations, and we do believe that with further work and training data it could get even further.

We hope to soon invest some more time in stabilising the platform and replicating it, so we could have it train asynchronously and thus enlarge our training set. Moreover, we plan to explore the state vector’s structure and content, initially to include broader metrics of the state and later to possibly even feed it with just a camera input from the drone and have its neural network learn to extract the state features by itself (a romantic idea, though one requiring a massive amount of data). Furthermore, shaping the reward function could definitely yield improvement, both by scaling it appropriately and by finding more sophisticated mechanics to build on.

All the source code for the project can be found here along with some more implementation details if wanted.

Thanks!
