Driving is a social activity. Consider the impressive multi-agent social interactions in this scene:
Thousands of drivers safely maneuver their vehicles in a complex scene, while traveling in close proximity, with conflicting objectives and incomplete information about other drivers’ intentions. How do human drivers accomplish this feat?
Social prediction is essential for driving
Human drivers use their social intelligence to predict how others’ future motions will depend on interactions with themselves, the surrounding agents, and the scene context. By predicting nearby agents’ trajectories, drivers can proactively plan safe interactions, rather than reactively responding to unanticipated events, which can lead to unsafe behavior, such as hard braking or failure to execute critical maneuvers.
However, we can never predict with perfect certainty the trajectory another vehicle will execute. We are uncertain both about which maneuvers other vehicles will execute (will they yield?) and about how they will execute them (if they yield, will they decelerate rapidly or slowly?).
We’ve developed methods for coping with an uncertain future by making probabilistic predictions. These capture the space of different possible trajectories that others could follow, and how likely each possible trajectory is to occur. Safe driving then involves planning a path that minimizes the chance of future collisions under all the future contingencies that are predicted.
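To make the planning idea concrete, here is a minimal sketch of how weighted trajectory predictions could be used to compare candidate plans. It is an illustration under stated assumptions, not our planner: the function names, the 2-metre safety radius, and the toy trajectories and probabilities are all hypothetical.

```python
import math


def collision_probability(plan, predicted_futures, safety_radius=2.0):
    """Total probability of the predicted futures that conflict with `plan`.

    plan: list of (x, y) ego positions, one per time step.
    predicted_futures: list of (probability, trajectory) pairs; each
        trajectory is a list of (x, y) positions aligned in time with `plan`.
    safety_radius: hypothetical distance (metres) treated as a collision.
    """
    risk = 0.0
    for probability, trajectory in predicted_futures:
        conflicts = any(
            math.dist(ego, other) < safety_radius
            for ego, other in zip(plan, trajectory)
        )
        if conflicts:
            risk += probability
    return risk


def safest_plan(candidate_plans, predicted_futures, safety_radius=2.0):
    """Return the name of the candidate plan with the lowest collision risk."""
    return min(
        candidate_plans,
        key=lambda name: collision_probability(
            candidate_plans[name], predicted_futures, safety_radius
        ),
    )


# Toy scenario: the ego vehicle drives along y = 0 while a vehicle in the
# adjacent lane (y = 4) either stays put or cuts in ahead.
candidate_plans = {
    "keep_speed": [(0, 0), (10, 0), (20, 0), (30, 0)],
    "slow_down": [(0, 0), (6, 0), (12, 0), (18, 0)],
}
predicted_futures = [
    (0.7, [(20, 4), (24, 4), (28, 4), (32, 4)]),  # other vehicle stays in lane
    (0.3, [(20, 4), (24, 2), (28, 0), (31, 0)]),  # other vehicle cuts in
]

keep_speed_risk = collision_probability(
    candidate_plans["keep_speed"], predicted_futures
)
best = safest_plan(candidate_plans, predicted_futures)
```

In this toy case only the cut-in future conflicts with keeping speed, so that plan carries the cut-in's probability as risk, while slowing down avoids both predicted futures.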
Learning to predict
We’ve developed a neural network architecture that learns to make probabilistic predictions of others’ trajectories from large-scale data, without needing to encode any explicit prior domain knowledge. Our method naturally generalizes across environments, scenarios, and types of vehicles and agents (trucks, cars, buses, motorcycles, bicycles, pedestrians, etc.), given only the training data collected while driving.
Our approach, called Multi-Agent Tensor Fusion (MATF), combines the strengths of spatial- and agent-centric representations by aligning scene features and agent trajectory features within a Multi-Agent Tensor (MAT) representation, illustrated here. This MAT encoding naturally handles scenes with varying numbers of agents, and predicts the trajectories of all agents in the scene with computational complexity that is linear in the number of agents, via a convolutional operation over the MAT. Conditional generative adversarial network (GAN) training allows the MATF to learn to predict a distribution over trajectories which captures uncertainty in how the situation will unfold. The MATF learns to predict joint trajectories, which account for interactive behaviors between vehicles, such as yielding and collision avoidance.
A detailed illustration of the MATF architecture is shown here. The MATF architecture first encodes all the relevant information about the scene, and encodes all the relevant information about each agent by processing each agent’s past trajectory with a recurrent neural network. The network then aligns the scene and agent features spatially into a Multi-Agent Tensor, preserving all the local and non-local spatial relationships in the scenario. Next, Multi-Agent Tensor Fusion is performed using a learned fully-convolutional mapping to produce a fused Multi-Agent Tensor as the final encoding of the multi-agent driving scenario. The convolutional mapping is the same for every agent, captures the spatial relationships and interactions among all the agents, and applies to all agents in the scene simultaneously. The MATF method then learns to probabilistically decode the information from the fused multi-agent tensor to produce predicted trajectories that are sensitive to the scene features and the trajectories of surrounding agents.
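The spatial alignment and fusion steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the MATF implementation. The agent encodings are given directly rather than produced by a recurrent network. A fixed 3x3 averaging filter stands in for the learned fully-convolutional mapping. Merging agents that share a grid cell with an element-wise maximum is an assumption made for this sketch. All function names are hypothetical.

```python
def build_multi_agent_tensor(scene_features, agent_cells, agent_encodings):
    """Align agent encodings with scene features on a shared spatial grid.

    scene_features: H x W grid; each cell is a list of scene channels.
    agent_cells: list of (row, col) grid locations, one per agent.
    agent_encodings: list of equal-length feature vectors, one per agent.
    Returns an H x W grid whose cells hold scene channels followed by
    agent channels (zeros where no agent is present).
    """
    height, width = len(scene_features), len(scene_features[0])
    agent_dim = len(agent_encodings[0])
    merged = {}
    for cell, encoding in zip(agent_cells, agent_encodings):
        if cell in merged:
            # Assumption for this sketch: agents in the same cell are
            # merged with an element-wise maximum.
            merged[cell] = [max(a, b) for a, b in zip(merged[cell], encoding)]
        else:
            merged[cell] = list(encoding)
    empty = [0.0] * agent_dim
    return [
        [list(scene_features[r][c]) + merged.get((r, c), empty)
         for c in range(width)]
        for r in range(height)
    ]


def fuse(tensor):
    """Stand-in for the learned convolution: 3x3 average, zero padding."""
    height, width, channels = len(tensor), len(tensor[0]), len(tensor[0][0])
    fused = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        for k in range(channels):
                            fused[r][c][k] += tensor[rr][cc][k] / 9.0
    return fused


# Toy scene: a 3 x 4 grid with one scene channel marking a drivable lane
# along the middle row, and two agents in neighbouring cells.
scene = [[[1.0] if r == 1 else [0.0] for _ in range(4)] for r in range(3)]
agent_cells = [(1, 1), (1, 2)]
agent_encodings = [[1.0, 0.0], [0.0, 2.0]]

mat = build_multi_agent_tensor(scene, agent_cells, agent_encodings)
fused = fuse(mat)
# The fused feature at each agent's own cell is what a decoder would read.
fused_per_agent = [fused[r][c] for r, c in agent_cells]
```

After fusion, the feature read out at the first agent's cell has a non-zero value in the channel that only the second agent populated. In this toy version, that is how one shared spatial operation lets each agent's encoding reflect its neighbours, whatever the number of agents in the grid.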
We employ conditional generative adversarial network (GAN) training techniques to learn a probability distribution over trajectories, given the MATF encoding. GANs allow learning high-fidelity generative models that capture the distribution of the observed data. In the driving context, the modes of the distribution correspond to the different maneuvers a vehicle or pedestrian might execute, such as lane/path following versus lane/path changing. The distribution around each mode corresponds to the manner in which the maneuver is executed, e.g., quickly, slowly, aggressively, or cautiously. GANs naturally capture both kinds of variation. Importantly, our GAN algorithm trains the model to produce joint trajectories, which account for interactions between vehicles such as yielding and collision avoidance.
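To illustrate the two kinds of variation, the following toy sampler draws trajectories from a hand-written distribution with two modes (lane following and lane changing) and continuous variation in speed around each mode. It is not a GAN and not a trained model. The lane width, the speed distribution, and the 20% lane-change probability are made-up values chosen only to show the shape of such a distribution.

```python
import random

LANE_WIDTH = 3.5  # metres; illustrative value


def sample_trajectory(rng, steps=5, dt=1.0, p_lane_change=0.2):
    """Draw one future trajectory from a toy two-mode distribution.

    The discrete choice of maneuver plays the role of a mode; the sampled
    speed plays the role of variation in how the maneuver is executed.
    """
    maneuver = "lane_change" if rng.random() < p_lane_change else "lane_follow"
    speed = max(0.0, rng.gauss(15.0, 2.0))  # metres per second
    trajectory = []
    for step in range(1, steps + 1):
        x = speed * dt * step
        if maneuver == "lane_change":
            y = LANE_WIDTH * step / steps  # drift linearly into the next lane
        else:
            y = 0.0
        trajectory.append((x, y))
    return maneuver, trajectory


rng = random.Random(0)  # seeded so the sketch is repeatable
samples = [sample_trajectory(rng) for _ in range(1000)]
lane_change_fraction = (
    sum(maneuver == "lane_change" for maneuver, _ in samples) / len(samples)
)
```

Every sample ends in one of two lanes, while the distance travelled varies continuously within each group. A learned generative model has to recover both kinds of structure from data rather than having them written in by hand.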
We first applied our model to learn to predict vehicle trajectories given large-scale driving data collected by isee. The figure below shows five scenarios, with past trajectories shown in different colors for each vehicle, followed by 100 sampled future trajectories. Ground truth future trajectories are shown in black, and lane centers are shown in gray. In (a) a complex scenario involving five vehicles is shown; MATF accurately predicts the trajectory and velocity profile for all vehicles. In (b) MATF correctly predicts that the red vehicle will complete a lane change. In (c) MATF captures the uncertainty over whether the red vehicle will take the highway exit. In (d), as soon as the purple vehicle passes a highway exit, MATF predicts it will not take that exit. In (e), MATF fails to predict the precise ground truth trajectory of the red vehicle; however, the vehicle is predicted to initiate a lane change maneuver in a very small number of sampled trajectories, reflecting the low prior probability of spontaneous lane changes learned from our dataset.
Next, we applied our model to predicting the trajectories of pedestrians and several other types of agents in the Stanford Drone Dataset, a large-scale, state-of-the-art dataset containing trajectories of pedestrians, bicyclists, skateboarders, carts, cars, and buses navigating a university campus. In the figure below, blue lines show past trajectories, red lines show ground-truth future trajectories, and green lines show predicted trajectories. All the agent trajectories shown in this figure are predicted jointly via one forward pass through the network. The closer the green predicted trajectory is to the red ground-truth future trajectory, the more accurate the prediction. Our model predicts that (1) two agents entering the roundabout from the top will exit to the left; (2) one agent coming from the left on the pathway above the roundabout is turning left to move toward the top of the image; and (3) one agent is decelerating at the door of the building above and to the right of the roundabout. In one interesting failure case (4), an agent on the top-right of the roundabout is turning right to move toward the top of the image; the model predicts the turn, but not how sharp it will be.
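One common way to put a number on how close a predicted trajectory is to the ground truth is to measure displacement errors: the average distance between predicted and true positions over the horizon, and the distance at the final time step. The sketch below shows these two measures, plus a best-of-samples variant for models that output many sampled futures. This post does not state which metrics were used for the results above, so treat the choice of metrics and the function names as assumptions.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching time steps."""
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def final_displacement_error(predicted, ground_truth):
    """Euclidean distance between the final predicted and true positions."""
    return math.dist(predicted[-1], ground_truth[-1])


def best_of_samples_error(sampled_predictions, ground_truth):
    """Lowest average displacement error among several sampled futures."""
    return min(
        average_displacement_error(sample, ground_truth)
        for sample in sampled_predictions
    )


# Toy example: the true path runs straight along the x axis.
ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
drifting = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]  # veers away steadily
closer = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]    # off only at the end

ade = average_displacement_error(drifting, ground_truth)
fde = final_displacement_error(drifting, ground_truth)
best = best_of_samples_error([drifting, closer], ground_truth)
```

Here the drifting prediction is off by 0, 1, and 2 metres at the three time steps, so its average error is 1 metre and its final error is 2 metres. The best-of-samples score comes from the closer prediction, which is off by 1 metre at the last step only.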