CVPR’20: The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction

In this post, I will present our CVPR’20 paper for the multi-future trajectory prediction task [1]. [Dataset/Code/Model]

Person future trajectory prediction: can you predict where the person is going to go?

In this paper, we study the problem of multi-future trajectory prediction. As shown in the following example, the person is likely to walk in multiple directions.

The Forking Paths Dataset

In real-world videos, only one future trajectory is observed for any given scenario. (We can only see and experience a single universe.)

To enable quantitative evaluation of multi-future trajectory prediction, we create a trajectory dataset in a realistic simulation environment, where human annotators control the agents to produce multiple semantically plausible future paths for the same scenario.

First, we re-create the static scenes and dynamic trajectories from real-world videos in a 3D simulator.

Multiple human annotators observe the scenario for a period of time, and are then asked to control the agent to reach a destination in a first- or third-person view. The idea is that by reconstructing real-world scenarios in 3D simulation and asking human annotators to navigate the agents, we can record human behaviors that resemble those in the real world.

Human annotation interface.

Here is a visualization of the dataset:

The first benchmark for quantitative evaluation of models that predict multi-future trajectories.
Human annotators observe the scenario for a period of time (the yellow path) and then take control to navigate to the destination. There is a time limit of 10.4 seconds, and the annotation session restarts if the agent collides with others.

After annotating the multi-future trajectories, we record the scenarios from different camera views, and even under different weather and lighting conditions.

We have released the dataset, all the code, and the 3D assets here, including a detailed tutorial on using the simulator and creating the dataset.

We provide a powerful editing interface to easily create, edit, and play back scenarios in simulation.

The Multiverse Model

We propose a multi-decoder framework that predicts both coarse and fine locations of the person using scene semantic segmentation features.

The Multiverse Model for Multi-Future Trajectory Prediction
  • History Encoder computes representations from scene semantics
  • Coarse Location Decoder predicts multiple future grid-location sequences using beam search
  • Fine Location Decoder predicts exact future locations based on the grid predictions
  • Our model achieves state-of-the-art performance on both the single-future trajectory prediction experiment and the proposed multi-future trajectory prediction on the Forking Paths Dataset.
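To make the coarse decoder's role concrete, here is a minimal sketch of beam search over grid cells. The `step_logprobs` callback is a hypothetical stand-in for the decoder's per-step distribution over grid locations; this is an illustration of the general technique, not the paper's implementation.

```python
import numpy as np

def beam_search_grid(step_logprobs, num_steps, beam_width=3):
    """Sketch of beam search for coarse grid-location sequences.

    step_logprobs(seq) -> 1-D array of log-probabilities over grid cells,
    given the partial sequence `seq` (hypothetical decoder interface).
    Returns the `beam_width` best (sequence, cumulative log-prob) pairs.
    """
    beams = [([], 0.0)]  # (grid-cell sequence, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for seq, score in beams:
            logp = step_logprobs(seq)
            # Expand each beam with its most probable grid cells
            for cell in np.argsort(logp)[-beam_width:]:
                candidates.append((seq + [int(cell)], score + float(logp[cell])))
        # Keep only the best beam_width sequences overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

Keeping several beams instead of a single argmax is what lets the model emit multiple distinct coarse futures for one observed history.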
Single-Future Trajectory Prediction. The numbers are displacement errors (lower is better). See [1] for details.
Multi-Future Trajectory Prediction on the Forking Paths Dataset. The numbers are displacement errors (lower is better). See [1] for details.
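For readers unfamiliar with the displacement-error metrics in the tables, here is a short sketch of a common multi-future variant: for each ground-truth future, score the closest of the model's predicted trajectories. The function name and array layout are illustrative assumptions, not the benchmark's exact evaluation code.

```python
import numpy as np

def displacement_errors(pred, gt_futures):
    """Minimum average/final displacement error over multiple futures.

    pred:       (K, T, 2) array of K predicted trajectories over T steps
    gt_futures: (M, T, 2) array of M annotated ground-truth futures
    Returns (minADE, minFDE), each averaged over the M futures.
    """
    # Pairwise L2 distance at every time step: shape (M, K, T)
    diff = np.linalg.norm(gt_futures[:, None] - pred[None], axis=-1)
    ade = diff.mean(axis=-1)          # average displacement per (gt, pred) pair
    fde = diff[..., -1]               # final-step displacement per pair
    min_ade = ade.min(axis=1).mean()  # best prediction for each ground truth
    min_fde = fde.min(axis=1).mean()
    return min_ade, min_fde
```

Taking the minimum over predictions rewards a model for covering each plausible future with at least one of its output trajectories, which is the point of multi-future evaluation.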

Qualitative analysis with the popular Social-GAN [2] model:

Qualitative comparison. The left column is from the Social-GAN [2] model; on the right is our Multiverse model. The yellow trajectory is the observed trajectory, and the green ones are the multi-future ground truth. The yellow-orange heatmaps are the model outputs.

Now, back to the example at the beginning: did you get it right?

Here is the correct prediction.

Check out our Social-Distancing-Early-Forecasting system!

References:

[1] Liang, Junwei, Lu Jiang, Kevin Murphy, Ting Yu, and Alexander Hauptmann. “The garden of forking paths: Towards multi-future trajectory prediction.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. [Dataset/Code/Model]

[2] Gupta, Agrim, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. “Social gan: Socially acceptable trajectories with generative adversarial networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Ph.D. student at CMU doing Computer Vision and Language.
