Game of Modes: Diverse Trajectory Forecasting with Pushforward Distributions

Joey Hong
Dec 24, 2018 · 8 min read
I just took the first photo from the paper’s supplementary materials.

Hi! A while back, I interned at a self-driving car company, where I specifically did research on trajectory forecasting. As of now, it’s probably the biggest hurdle to realizing the technology, but the literature on it is relatively scarce. The research thus far has used either sequence prediction with RNNs, or maximum-entropy IRL, or if it’s really fancy, a combination of both.

While I was still an intern, I came across this R2P2 paper; aside from a workshop talk at CVPR 2018, the paper seems to have relatively flown under the radar, so I wanted to share its novel contributions here. In this article, I will try to accurately explain the novel concepts in the paper with my still nascent knowledge of imitation learning.

Why is trajectory forecasting so hard anyway?

That’s a good question. The goal of trajectory/motion forecasting is to take some representation of the environment and produce, for each relevant entity, a context-conditioned distribution of spatiotemporal trajectories of the entity’s future. Though that sounds easy enough for ML (I mean, it can even make random people dance now), there are elements that make it a prediction problem on steroids (or at least one on protein powder with a rigorous exercise regimen).

  1. Reasoning from scene context: Unlike other sequence prediction problems, trajectory forecasting requires much more than the past motion profile. There is a lot of static context in the road map, and dynamic context in other agents interacting with the entity of interest. A lot of recent papers (like here and here) focus on this problem, experimenting with mid to high-level representations of sensor data in some spatial parameterization, and feeding it into different recurrent and convolutional architectures.
  2. Multi-modality: Perhaps the more theoretically interesting problem is that there are often multiple plausible trajectories (an intersection comes to mind), and a model should be diverse enough to cover all such modes. Conversely, the model must also be precise enough to avoid generating bad trajectories.

Of the two problems, R2P2 primarily focuses on the latter. Historically, sequence learning is done via maximum likelihood estimation, or equivalently, minimizing the cross entropy loss,
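Written out in notation I'm assuming here (p for the data distribution over trajectories x, q_π for the learned distribution, φ for the scene context), that cross entropy would look roughly like:

```latex
H(p, q_\pi) = -\mathbb{E}_{x \sim p} \left[ \log q_\pi(x \mid \phi) \right]
```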

A quick look at the edge cases of this loss shows that it heavily penalizes a model for failing to cover some of the modes of the training distribution, but only lightly penalizes it for putting mass on trajectories the underlying distribution would never produce. If we flip the order of the two distributions in the cross entropy, we get the reverse behavior. This is best summarized in a figure in the paper:
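To make the asymmetry concrete, here is a tiny sketch with made-up discrete distributions over four bins (nothing here comes from the paper; the numbers are purely illustrative). One toy model sits on a single mode, the other spreads mass everywhere, and we score both under each direction of the cross entropy.

```python
import numpy as np

def cross_entropy(a, b):
    """H(a, b) = -sum_i a_i * log(b_i) for discrete distributions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(-np.sum(a * np.log(b)))

# Toy "data" distribution with two modes (bins 0 and 3).
p = np.array([0.49, 0.01, 0.01, 0.49])
# Model A: precise but not diverse, nearly all mass on one mode.
q_one_mode = np.array([0.97, 0.01, 0.01, 0.01])
# Model B: diverse but imprecise, mass spread over every bin.
q_broad = np.array([0.25, 0.25, 0.25, 0.25])

# Forward direction H(p, q): samples come from the data.
forward_one_mode = cross_entropy(p, q_one_mode)
forward_broad = cross_entropy(p, q_broad)
# Reverse direction H(q, p): samples come from the model.
reverse_one_mode = cross_entropy(q_one_mode, p)
reverse_broad = cross_entropy(q_broad, p)

print(f"H(p, q): one-mode={forward_one_mode:.3f}  broad={forward_broad:.3f}")
print(f"H(q, p): one-mode={reverse_one_mode:.3f}  broad={reverse_broad:.3f}")
```

The forward direction should come out worse for the one-mode model (it missed a mode), while the reverse direction should come out worse for the broad model (it generates implausible bins).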

So, the two cross-entropies serve complementary purposes, and the writers of the paper decided to combine them into one, symmetric loss.
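As best I can reconstruct it, the combined objective has the following shape. The symbols are my assumptions: q_π is the learned distribution, p̃ is a fixed approximation to the data distribution p (needed because we can't evaluate p itself), and β is a weight trading off the two terms.

```latex
\min_\pi \;
\underbrace{-\mathbb{E}_{x \sim p} \left[ \log q_\pi(x \mid \phi) \right]}_{H(p,\, q_\pi)}
\;+\; \beta \,
\underbrace{\left( -\mathbb{E}_{x \sim q_\pi} \left[ \log \tilde{p}(x \mid \phi) \right] \right)}_{H(q_\pi,\, \tilde{p})}
```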

Now, even for someone whose creative highs consist of adding an extra tree when following Bob Ross painting tutorials, I feel like arriving at a symmetric loss wasn’t a huge leap (similar stuff has been done with KL divergence). However, optimizing the expression via gradient descent doesn’t work directly, because the second term is an expectation over samples drawn from the learned distribution, and you can’t backpropagate through a sampling step as-is (there’s also the problem of finding an approximation to the data distribution, since we can’t evaluate its density). An uninspired way, and how I would have done it, is to use the REINFORCE method from RL, which is usable as long as the learned distribution is differentiable in its parameters, but the paper uses a cooler, less stupidly high-variance, way around it.
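For a feel of the variance gap, here is a toy comparison I put together (not from the paper): estimating the gradient of E[x²] with x ~ N(θ, 1), once with the REINFORCE (score-function) estimator and once by differentiating through a reparameterized sample x = θ + z. The objective and θ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5            # mean of the sampling distribution N(theta, 1)
n = 200_000            # number of Monte Carlo samples

z = rng.standard_normal(n)
x = theta + z          # reparameterized sample: x = g(z; theta)

f = x ** 2             # toy objective evaluated at each sample

# REINFORCE: f(x) * d/dtheta log N(x; theta, 1) = f(x) * (x - theta)
grad_reinforce = f * (x - theta)
# Pathwise: d/dtheta f(theta + z) = 2 * (theta + z)
grad_pathwise = 2.0 * x

true_grad = 2.0 * theta    # since E[x^2] = theta^2 + 1

print("true gradient:      ", true_grad)
print("REINFORCE mean/var: ", grad_reinforce.mean(), grad_reinforce.var())
print("pathwise  mean/var: ", grad_pathwise.mean(), grad_pathwise.var())
```

Both estimators are unbiased, but the per-sample variance of the score-function version should be many times larger in this setup.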

Pushforward distribution modeling

What the paper instead proposes is a simulator that maps noise from a simple distribution to forecasted trajectories. That way, the loss can be rewritten via change-of-variables. Specifically, they introduce a simulator,
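In symbols (my notation, so treat it as an assumption): g_π is the simulator, z is noise drawn from a simple base distribution q_0, and x is the resulting trajectory.

```latex
g_\pi(\,\cdot\,; \phi) : z \mapsto x,
\qquad z \sim q_0,
\qquad x = g_\pi(z; \phi) \sim q_\pi(\,\cdot \mid \phi)
```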

Now, the distribution of trajectories is determined by a simple noise distribution and the simulator (hence a pushforward of the noise). Here is another figure from the paper to illustrate this:

Pushing forward a base distribution to a trajectory distribution.

If the simulator is differentiable and invertible, then we can use the change-of-variables formula to derive,
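This is the standard change-of-variables identity for an invertible, differentiable map, written with assumed names: g_π for the simulator, q_0 for the base noise density, and J for the Jacobian of g_π with respect to the noise.

```latex
q_\pi(x \mid \phi)
= q_0\!\left( g_\pi^{-1}(x; \phi) \right)
\left| \det J_{g_\pi}\!\left( g_\pi^{-1}(x; \phi) \right) \right|^{-1}
```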

With some kind-of simple substitution we can rewrite the symmetric loss from earlier as,
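Here is my reconstruction of that substitution, using assumed symbols (q_0 for the base noise density, g_π for the simulator, p̃ for the approximate data density, β for the weight on the second term). The first term takes the log of the change-of-variables density; the second swaps sampling x from q_π for sampling z from q_0 and pushing it through g_π, which is what lets gradients flow.

```latex
\min_\pi \;
-\mathbb{E}_{x \sim p} \left[
  \log q_0\!\left( g_\pi^{-1}(x; \phi) \right)
  - \log \left| \det J_{g_\pi}\!\left( g_\pi^{-1}(x; \phi) \right) \right|
\right]
\;-\; \beta \, \mathbb{E}_{z \sim q_0} \left[
  \log \tilde{p}\!\left( g_\pi(z; \phi) \mid \phi \right)
\right]
```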

The next step is choosing an invertible and differentiable simulator; luckily, the trajectory forecasting task lends itself to some pretty glaring ones from stochastic dynamical systems. The writers chose a stochastic one-step policy,
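In notation I'm assuming (ψ_t bundles everything known before step t, and z_t is the t-th slice of the base noise), the policy has this general form, with μ acting as the deterministic "where to go next" term and σ scaling the noise:

```latex
x_t = \mu_t^\pi(\psi_t) + \sigma_t^\pi(\psi_t)\, z_t,
\qquad \psi_t = \left( x_{1:t-1},\, \phi \right)
```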

This gives us an iterative way of generating the next point along a trajectory using the past motion profile and context. As long as the stochastic term is invertible, and both terms are differentiable, a simulator using this policy will also be invertible and differentiable.
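A minimal NumPy sketch of that idea follows. `toy_policy` is a hypothetical hand-written stand-in for the learned network (a constant-velocity guess plus a fixed-form noise scale), and the 2-vector `context` is a placeholder for a real scene encoding. The point is only that rolling forward and inverting use the same policy calls.

```python
import numpy as np

def toy_policy(history, context):
    """Hypothetical stand-in for the learned policy: (mu, sigma) for the next 2D point."""
    prev, prev2 = history[-1], history[-2]
    mu = 2.0 * prev - prev2 + 0.1 * context      # constant-velocity guess nudged by context
    speed = np.linalg.norm(prev - prev2)
    # Lower-triangular with a positive diagonal, so sigma is always invertible.
    sigma = np.array([[0.2 + 0.1 * speed, 0.0],
                      [0.05, 0.3]])
    return mu, sigma

def simulate(z, past, context):
    """x = g(z): push noise z of shape (T, 2) forward through the one-step policy."""
    history = [past[-2], past[-1]]
    for z_t in z:
        mu, sigma = toy_policy(history, context)
        history.append(mu + sigma @ z_t)
    return np.array(history[2:])

def invert(x, past, context):
    """z = g^{-1}(x): recover the noise that would have produced trajectory x."""
    history = [past[-2], past[-1]]
    z = []
    for x_t in x:
        mu, sigma = toy_policy(history, context)
        z.append(np.linalg.solve(sigma, x_t - mu))
        history.append(x_t)                      # condition on the given point, not a sample
    return np.array(z)

rng = np.random.default_rng(0)
past = np.array([[0.0, 0.0], [1.0, 0.5]])        # two observed past positions
context = np.array([0.5, -0.2])                  # toy context vector
z = rng.standard_normal((5, 2))
x = simulate(z, past, context)
z_recovered = invert(x, past, context)
print(np.allclose(z, z_recovered))
```

Because each step only looks at earlier points, the inverse can be computed in a single forward pass over the trajectory.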

For example, if the noise distribution is the standard normal, then each one-step conditional distribution of the trajectory is also a normal,
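Assuming the one-step policy adds a mean term μ to a noise term scaled by σ (my notation, with ψ_t standing for the past points and context), an affine map of a standard normal gives:

```latex
q_\pi(x_t \mid \psi_t)
= \mathcal{N}\!\left( x_t;\;
  \mu = \mu_t^\pi(\psi_t),\;
  \Sigma = \sigma_t^\pi(\psi_t)\, \sigma_t^\pi(\psi_t)^\top
\right)
```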

Now, using a one-step policy like the one proposed makes computing values in our symmetric loss easy. Namely,
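My reconstruction of the two quantities the loss needs, assuming a policy of the form x_t = μ_t + σ_t z_t with ψ_t denoting the past points and context: the inverse of the simulator at each step, and the log-determinant of its Jacobian.

```latex
z_t = \left[ g_\pi^{-1}(x; \phi) \right]_t
    = \sigma_t^\pi(\psi_t)^{-1} \left( x_t - \mu_t^\pi(\psi_t) \right),
\qquad
\log \left| \det J_{g_\pi}(z) \right|
    = \sum_{t=1}^{T} \log \left| \det \sigma_t^\pi(\psi_t) \right|
```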

Trivial as a Steph Curry 3-pointer right? (I was told to include a sports reference to make myself appear cool.)

Well, for those of you who disagree (myself included because I slept through a lot of my linear algebra classes), I’ll show some of the worked-out math. It hinges on the observation that trajectory points don’t depend on points for later timesteps, which makes the Jacobian lower triangular,
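Sketching the structure under the assumption that x_t = μ_t + σ_t z_t, where μ_t and σ_t depend only on earlier points: the derivative of x_t with respect to its own noise z_t is just σ_t, and derivatives with respect to later noise are zero. Each entry below is a 2×2 block, since points are 2D.

```latex
J_{g_\pi}(z) = \frac{\partial x}{\partial z} =
\begin{bmatrix}
\sigma_1^\pi & 0 & \cdots & 0 \\
\frac{\partial x_2}{\partial z_1} & \sigma_2^\pi & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_T}{\partial z_1} & \frac{\partial x_T}{\partial z_2} & \cdots & \sigma_T^\pi
\end{bmatrix}
```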

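A special property of such matrices, which you can easily verify by row/column-expansion, is that their determinants are just the product of the diagonal entries (or, when the diagonal is made of 2×2 blocks as it is for 2D points, the product of those blocks’ determinants). This gives the result,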
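If you'd rather trust a computer than my linear algebra, here is a small numerical check. The `policy` function is a hypothetical toy (not the paper's network); the script builds the full Jacobian of a 4-step simulator by finite differences and compares its log-determinant against the sum of per-step log-determinants.

```python
import numpy as np

def policy(prev, prev2):
    """Hypothetical one-step policy: (mu, sigma) from the two previous 2D points only."""
    mu = 2.0 * prev - prev2
    speed = np.linalg.norm(prev - prev2)
    sigma = np.array([[0.2 + 0.1 * speed, 0.0],
                      [0.05, 0.3 + 0.05 * speed]])
    return mu, sigma

def simulate(z_flat, past):
    """Map flattened noise (2T,) to a flattened trajectory (2T,); also return each sigma_t."""
    history = [past[0], past[1]]
    sigmas = []
    for z_t in z_flat.reshape(-1, 2):
        mu, sigma = policy(history[-1], history[-2])
        sigmas.append(sigma)
        history.append(mu + sigma @ z_t)
    return np.concatenate(history[2:]), sigmas

def numerical_jacobian(z_flat, past, eps=1e-6):
    """Central finite differences: column j holds d x / d z_j."""
    n = z_flat.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        x_plus, _ = simulate(z_flat + step, past)
        x_minus, _ = simulate(z_flat - step, past)
        jac[:, j] = (x_plus - x_minus) / (2 * eps)
    return jac

rng = np.random.default_rng(1)
past = np.array([[0.0, 0.0], [1.0, 0.5]])
z = rng.standard_normal(8)                       # T = 4 steps, 2 coordinates each
_, sigmas = simulate(z, past)
jac = numerical_jacobian(z, past)

# log|det J| computed the expensive way, from the full 8x8 matrix.
logdet_full = np.linalg.slogdet(jac)[1]
# The cheap way: sum of log|det sigma_t| over timesteps.
logdet_blocks = sum(np.log(abs(np.linalg.det(s))) for s in sigmas)
# Largest entry above the 2x2 block diagonal (later noise vs. earlier points).
upper = max(np.abs(jac[2 * t:2 * t + 2, 2 * t + 2:]).max() for t in range(3))

print(logdet_full, logdet_blocks, upper)
```

The two log-determinants should agree up to finite-difference error, and everything above the block diagonal should be zero.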

Finally, we have to revisit the problem I promised I’d go over about 5 minutes ago: approximating the underlying data distribution, since we cannot evaluate its PDF directly. It’s important that this fixed approximation is decent, because it could add an unnecessarily high penalty if it severely underestimates the training distribution in some region.

One such method is assuming that the trajectory distribution factors over timesteps, i.e. parameterizing,
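In assumed notation (p̃ for the approximate data distribution, p̃_c for its per-point factor, T for the number of forecast steps), the factorization would read:

```latex
\tilde{p}(x \mid \phi) = \prod_{t=1}^{T} \tilde{p}_c(x_t \mid \phi)
```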

Then, for each timestep, we discretize the region of possible points into L possible locations. We essentially model the probability distribution for each timestep as some discrete grid map, which can be trained via logistic regression with L classes,
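A rough sketch of how such a grid prior could be evaluated, under simplifying assumptions of mine: a single grid of logits shared across timesteps, hand-set rather than learned, and hypothetical helper names. In a real model the logits would come out of a network conditioned on the scene, trained with the usual softmax cross entropy over the L cells.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over all L grid cells."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def cell_index(point, origin, cell_size, grid_shape):
    """Map a 2D point (x, y) to a flat cell index, clipping points that fall off the grid."""
    col, row = np.floor((point - origin) / cell_size).astype(int)
    row = np.clip(row, 0, grid_shape[0] - 1)
    col = np.clip(col, 0, grid_shape[1] - 1)
    return int(row * grid_shape[1] + col)

def trajectory_log_prior(traj, logits, origin, cell_size):
    """Sum of per-point log-probabilities, i.e. the log of the factorized prior."""
    log_probs = log_softmax(logits.ravel())
    return float(sum(log_probs[cell_index(pt, origin, cell_size, logits.shape)]
                     for pt in traj))

# Toy 4x4 grid (L = 16 cells): low logits everywhere except a vertical "road" in column 1.
logits = np.full((4, 4), -3.0)
logits[:, 1] = 2.0
origin = np.array([0.0, 0.0])
cell_size = 1.0

on_road = np.array([[1.5, 0.5], [1.5, 1.5], [1.5, 2.5]])
off_road = np.array([[3.5, 0.5], [3.5, 1.5], [3.5, 2.5]])

lp_on = trajectory_log_prior(on_road, logits, origin, cell_size)
lp_off = trajectory_log_prior(off_road, logits, origin, cell_size)
print(lp_on, lp_off)
```

The negative of this log-prior is what plays the role of a spatial cost: trajectories that stay on high-probability cells are cheap, and ones that wander off are expensive.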

The paper shows some examples of learned spatial cost functions. As you can see, it typically gives low-ish cost to all drivable surfaces, and high penalty to obstacles. There are definitely some questionable parts of the learned distributions, but at least its support (of low-cost regions) covers all the plausible trajectories.

Learned prior (white: high cost, black: low cost). The blue shows the expert demonstrated trajectory.

Congrats! This basically covers the novel components of the paper. All that’s left are implementation architecture details and experiments, which I will gloss over super briefly.


We have mentioned the context repeatedly without explicitly saying what it was; well, as you may expect, it’s just a concatenation of the past motion profile and a spatial grid representing a map of the scene (which looks like a top-down 2D projection of LIDAR data).

In the R2P2 RNN model (which is the best-performing model in the paper), the map is fed into a CNN to produce a feature encoding of the static context; the past history is then encoded in a GRU-RNN, which also uses the featurized map to iteratively forecast points of the trajectory.

The architectural details were relatively basic and intuitive, which I liked; I’m also basic and intuitive (vanilla is my favorite ice cream flavor, blue my favorite color, and black bear my favorite bear).

Another contribution of this paper is the compilation of a new dataset, which they call the CaliForecasting dataset. It’s larger, and supposed to contain more multi-modal examples than the KITTI dataset. They compare their R2P2 approach to some baseline generative models, one of which is a CVAE (which is a pretty cool application of variational inference).

Comparison of R2P2 and CVAE (a pretty popular generative model).

A CVAE gets its diversity from a stochastic latent encoding, which is sampled from a standard normal, but it’s fairly straightforward to see that R2P2 covers more plausible modes when multiple exist. This video also shows R2P2 working continuously, just to confirm that there isn’t a lot of flickering in generated trajectories over time.

There are other contributions in the paper that I skipped, such as applying their differentiable simulator to create a low-cost variant of GAIL (creatively called R2P2 GAIL). I just didn’t include it because I thought it was secondary to the main, proposed method, and honestly, don’t feel confident enough to explain GAIL without rife mathematical inaccuracies.

Final thoughts

Although this paper specifically focuses on trajectory forecasting, the novel approaches of using symmetric losses, and push-forward modeling instead of high-variance, bad-convergence methods from RL, should be applicable to many other types of data. My opinion, which holds no weight in the ML community, is that such methods could potentially be used for language generation models in NLP. Anyway, the mind kind-of races thinking about other potential avenues of application.

Don’t worry, mine was too.


  1. Nicholas Rhinehart, Kris M. Kitani, Paul Vernaza. R2P2: A Reparameterized Pushforward Policy for Diverse, Precise Generative Path Forecasting. The European Conference on Computer Vision (ECCV), 2018, pp. 772–788

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Written by

Joey Hong

My name almost rhymes with “reinforcement learning” | Caltech ‘19
