Baseline Walkthrough of the Vehicle Motion Prediction Task in the Shifts Challenge at NeurIPS 2021
Due to their significance in real-world deployment, distributional shifts have gained plenty of attention in recent years. This year at NeurIPS, we have a new challenge, the Shifts challenge, which investigates robustness and uncertainty quality on distributionally shifted datasets. This blog only targets the Vehicle Motion Prediction task in this challenge, and is a complement to the tutorial on the official Github page, with additional explanations of the datasets, training, and evaluation process.
Table of contents:
· 1. Overview of the task
· 2. Getting the datasets ready
· 3. Train the model
· 4. Getting prediction and uncertainty/confidence scores
· 5. Evaluate the model and submit the results
· 6. Directions for improvements
1. Overview of the task
The vehicle motion prediction task is among the most important in the autonomous driving domain. It involves predicting the distribution over possible future states of agents around the self-driving car at a number of moments in time. In a real-world deployment, vehicles often face distributional shifts when, for example, they begin operation at a new location or even a new route in an existing location. To ensure safety, it is crucial for the vehicle to transfer as much knowledge as possible from the old locations in order to perform well in new, unseen locations, and inform the human drivers when it is uncertain about the next actions. This corresponds to the robustness and uncertainty of predictions respectively. In the Shifts challenge, we evaluate both the prediction robustness and uncertainty under distributional shifts.
Notably, in most prior work, uncertainty estimation and robustness have been assessed separately. Robustness to distributional shift is typically assessed via metrics of predictive performance on a particular task, such as classification error rate. On the other hand, the quality of uncertainty estimates is often assessed via the ability to discriminate between an “in-domain” dataset that is matched to the training data and a shifted or “out-of-domain” (OOD) dataset based on measures of uncertainty. However, we believe that these two problems are two halves of a common whole. In autonomous driving, we need robustness to make correct predictions under distributional shifts, and the uncertainty to inform us when the shifts are too large for the model to make safe predictions. In this challenge, we evaluate robustness and uncertainty both separately and jointly. We will describe more details in the Evaluation section.
2. Getting the datasets ready
To get the datasets ready, we can follow the three steps below:
1. Download the repository and install the dataset API. In your terminal, run the following:
git clone https://github.com/yandex-research/shifts.git
cd shifts/sdc
pip install .
This will help you install the necessary packages (specified in requirements.txt) for the API to run.
2. Download the dataset (link) and unzip it in the repository (or anywhere that suits you). One example of the directory structure is:
shifts/
|--> sdc/
|--> data/
     |--> train_pb/
     |--> development_pb/
     |--> train_tags.txt
     |--> development_tags.txt
     |--> train_rendered/
     |--> development_rendered/
3. Now that we’ve set up the environment and datasets, we can start our Python file (or a Jupyter notebook). We first import dataset API functions we will need:
from ysdc_dataset_api.dataset import MotionPredictionDataset
from ysdc_dataset_api.features import FeatureRenderer
Then, to construct a dataset, we first need to define a feature renderer that transforms the raw data into feature maps, which can be used as input to a standard vision model. We can use the FeatureRenderer class from the API:
renderer_config = {
    'feature_map_params': {
        'rows': 400,
        'cols': 400,
        'resolution': 0.25,  # number of meters in one pixel
    },
    'renderers_groups': [
        {
            'time_grid_params': {
                'start': 0,
                'stop': 0,
                'step': 1,
            },
            'renderers': [
                {'vehicles': ['occupancy', 'velocity', 'acceleration', 'yaw']},
                {'pedestrians': ['occupancy', 'velocity']},
            ]
        },
        {
            'time_grid_params': {
                'start': 0,
                'stop': 0,
                'step': 1,
            },
            'renderers': [
                {
                    'road_graph': [
                        'crosswalk_occupancy',
                        'crosswalk_availability',
                        'lane_availability',
                        'lane_direction',
                        'lane_occupancy',
                        'lane_priority',
                        'lane_speed_limit',
                        'road_polygons',
                    ]
                }
            ]
        }
    ]
}

renderer = FeatureRenderer(renderer_config)
We then define the dataset as:
dataset = MotionPredictionDataset(
    dataset_path='/path/to/datasets/train_pb/',
    scene_tags_fpath='/path/to/datasets/train_tags.txt',
    feature_producer=renderer,
)
Alternatively, we can also use the provided pre-rendered features:
prerenderer_dataset = MotionPredictionDataset(
    dataset_path='/path/to/datasets/train_pb/',
    scene_tags_fpath='/path/to/datasets/train_tags.txt',
    prerendered_dataset_path='/path/to/datasets/train_rendered/',
)
Either way provides us with a dataset ready for training.
To understand the dataset better, we further describe the components of the dataset. First, each data point of the dataset describes a vehicle (“prediction request”):
- The input x is the (current and past) high-dimensional observations (features) of the scene that this vehicle is in. In the raw data, the features of the scene consist of 25 time steps (5 seconds).
- The corresponding ground truth prediction y is the (future) trajectory of this vehicle. The future trajectory also consists of 25 time steps (5 seconds sampled at 5 Hz), and y will be of shape 25×2 since the position at each time step is described by a 2D position vector.
The goal of the task is to predict the movement trajectory of vehicles at time T ∈ (0, 5] based on the information available for time T ∈ [−5, 0]. In the code above, we only render the features to start from and stop at time 0, so our feature map only has a single time frame that corresponds to time 0 (the current time). The features rendered by our feature renderer have 17 channels describing both HD map information and dynamic object states, and they are centered with respect to the prediction request vehicle. So in this code example, the features will be of shape 17×400×400.
Now, if we print out the keys of a data point, we would also see two other keys: track_id and scene_id. These two are the indices of the data point. The dataset consists of 600,000 scenes, and each scene contains multiple vehicles (tracks) and pedestrians (which are part of the features but don’t need predictions). Together, track_id and scene_id determine a unique vehicle whose future trajectories we are interested in predicting, i.e., the prediction request. A quick way to check these fields is sketched below.
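As a quick sanity check, we can pull one data item and inspect its keys and array shapes. This is only a sketch: the key names ('feature_maps', 'ground_truth_trajectory') mirror the fields used later in this blog and may differ depending on your renderer configuration.
# Sketch: inspect one data item from the (iterable) dataset.
# Key names below are assumptions based on the fields used elsewhere in this
# blog; adjust them if your configuration produces different names.
data_item = next(iter(dataset))

print(sorted(data_item.keys()))
print(data_item['feature_maps'].shape)             # expected (17, 400, 400) with the config above
print(data_item['ground_truth_trajectory'].shape)  # expected (25, 2)
print(data_item['scene_id'], data_item['track_id'])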
When constructing the dataset, we also need to specify scene_tags, which are used to split the dataset into different subsets (training, development, and evaluation sets; in- and out-of-domain sets). Specifically, we use two scene tags, location and precipitation, for dataset separation. The in-domain data only consist of scenes with no precipitation collected in Moscow. The detailed partition of the dataset is shown in the figure below. To get a specific subset, we can simply filter the whole dataset by the scene tags. We will show code examples in the evaluation section.
Finally, we provide a list of concepts to understand the dataset setup. These concepts are enough for us to carry out the task. For more details, we encourage readers to read the whitepaper [1].
- Scene. Each scene is 10 seconds long and is divided into 5 seconds of context features (feature map) and 5 seconds of ground truth targets for prediction, separated by the time T = 0. The goal of the task is to predict the movement trajectory of vehicles at time T ∈ (0, 5] based on the information available for time T ∈ [−5, 0]. In a single scene, there can be one or more prediction requests.
- Prediction request. Each data item in the dataset corresponds to a prediction request, i.e., a vehicle that requires prediction of future trajectories. The input x is the high-dimensional observations of the scene that this vehicle is in, often in the form of rendered features. In the training and development data, we also provide the ground truth future trajectories of the prediction request vehicle for training and tuning.
- Rendered features. To facilitate easy use of this dataset, we provide utilities to render the scene information as a feature map, which can be used as input to a standard vision model. We also provide pre-rendered features that can be used directly. At each time step, the feature map has 17 channels, which include information about the state of dynamic objects (i.e., vehicles, pedestrians) and an HD map. The state of a vehicle is described by its position, velocity, linear acceleration, and orientation (yaw, known up to ±π). The state of a pedestrian consists of a position vector and a velocity vector. Together, these provide the context for the model to predict future trajectories of the prediction request.
3. Train the model
After getting the datasets ready, we can start training the model. The baseline method adopts an ensemble approach, which requires training multiple “backbone” models as the ensemble members. The requirement for each ensemble member is that it should produce both a prediction and the uncertainty/likelihood of that prediction (see next section). In the baseline, we choose two classes of likelihood models: a simple behavioral cloning agent with a Gated Recurrent Unit decoder (BC) [2] and a Deep Imitative Model (DIM) [3] with an autoregressive flow decoder [4]. In both cases, we will have a model q(y|x) of the likelihood of a trajectory y in the context of features x.
We will use pre-trained BC models in this blog (here is the link to the pre-trained models). Readers can find the training details in the Github repository. Note that since any model that can produce uncertainty/likelihood estimates is valid for this challenge, we also encourage participants to train and try out their own models, for example, different backbone models for the ensemble, or variational methods that use multiple Monte Carlo samples to calculate the uncertainty/likelihood. The only requirement is that these models meet the inference-time requirements (see computational limitations in the FAQ). A minimal sketch of what training a single likelihood-based member could look like is shown below.
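To make the setup concrete, here is a rough sketch of training one likelihood-based ensemble member on the dataset defined earlier. It is not the repository’s training script: the log_prob method and the batch key names are assumptions for illustration, and the actual training code in the Github repository should be preferred.
import torch

# A minimal sketch (not the official training script). We assume the member
# exposes a log_prob(targets, features) method and that batches carry the
# 'feature_maps' and 'ground_truth_trajectory' keys used elsewhere in this blog.
def train_member(member, dataset, epochs=1, device='cuda:0'):
    loader = torch.utils.data.DataLoader(dataset, batch_size=64)
    optimizer = torch.optim.Adam(member.parameters(), lr=1e-4)
    member.to(device).train()
    for _ in range(epochs):
        for batch in loader:
            features = batch['feature_maps'].to(device)
            targets = batch['ground_truth_trajectory'].to(device)
            # BC and DIM members are likelihood models q(y|x), so we minimize
            # the negative log-likelihood of the ground-truth trajectory.
            loss = -member.log_prob(targets, features).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return member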
4. Getting prediction and uncertainty/confidence scores
During model training, we only care about the predictions of future trajectories. But during the evaluation stage, we need to produce uncertainty/confidence scores to tackle the distributional shifts. In the baseline, we adopt the framework of Robust Imitative Planning (RIP). Again, readers are encouraged to implement their own method to incorporate uncertainty in making robust predictions.
We now describe how the confidence/uncertainty scores are calculated under the RIP framework. There are two types of confidence scores that we are interested in: per-trajectory confidence scores and per-prediction request confidence scores. The per-trajectory confidence score decides the preference over predicted trajectories. The per-prediction request confidence score describes the model’s familiarity with the features of the prediction request and decides for which inputs we would ask the model to predict (instead of consulting human experts). We use the following steps to generate the two types of confidence scores (a small numerical sketch of the aggregation follows the list):
- Trajectory Generation (Prediction). Given an input x, each of the K ensemble members generates Q trajectories, for a total of G = K×Q trajectories.
- Trajectory Scoring. We score each of the G trajectories by computing a confidence score (i.e., log probability in the baseline) under each of the K trained models.
- Per-Trajectory Confidence Scores. We aggregate the G×K confidence scores into G scores by using a per-trajectory aggregation operator, e.g., averaging. Now for each of the G trajectories, we have a per-trajectory confidence score. We then select the top D trajectories with the highest confidence scores and apply a softmax to these D confidence scores (so that they sum to 1).
- Per-Prediction Request Confidence Score. We further aggregate the D top per-trajectory confidence scores to a single confidence score U using another aggregator, e.g., minimum or averaging. This is the per-prediction request confidence score.
- Report. Finally, given the input x, we report the top D trajectories as our predicted trajectories with D corresponding per-trajectory confidence scores, and a single per-prediction request confidence score U (in practice, we report -U as the per-prediction request uncertainty score, see the bottom of the code below).
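The following NumPy sketch illustrates the aggregation with averaging as the per-trajectory operator and minimum as the per-prediction request operator. The actual RIP implementation lives in the repository; the array shapes and function name here are only illustrative.
import numpy as np

def rip_aggregate(scores, d=5):
    """Sketch of RIP-style aggregation.

    Args:
        scores: array of shape (G, K); scores[g, k] is the log-probability of
            generated trajectory g under ensemble member k.
        d: number of trajectories to report.
    Returns:
        top_idx: indices of the D reported trajectories,
        per_traj_conf: softmax-normalized per-trajectory confidence scores,
        per_request_conf: single per-prediction request confidence score U.
    """
    per_traj = scores.mean(axis=1)               # aggregate over the K members -> (G,)
    top_idx = np.argsort(per_traj)[-d:][::-1]    # top D trajectories
    top_scores = per_traj[top_idx]
    exp = np.exp(top_scores - top_scores.max())  # softmax over the D scores
    per_traj_conf = exp / exp.sum()
    per_request_conf = top_scores.min()          # e.g., minimum as the aggregator
    return top_idx, per_traj_conf, per_request_conf

# The reported per-prediction request uncertainty is the negated confidence: -per_request_conf.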
This ensemble-based model object that produces both predictions and uncertainties can be defined and loaded as follows (the code can also be seen in the Jupyter notebook tutorial, and the pre-trained models can be downloaded here):
# Specifications of the model
import torch  # needed below for torch.hub and inference

from sdc.config import build_parser

parser = build_parser()
args = parser.parse_args('')

def ipynb_patch_args(args):
    args.dir_checkpoint = '/path/to/model_checkpoints'
    # The below configuration was our best performing in baseline experiments.
    # Backbone model details
    # Behavioral Cloning:
    # MobileNetv2 feature encoder, GRU decoder
    args.model_name = 'bc'
    args.model_dim_hidden = 512
    args.exp_device = 'cuda:0'
    # Used in scoring generated trajectories and obtaining
    # per-plan/per-scene confidence scores.
    args.rip_per_plan_algorithm = 'MA'
    args.rip_per_scene_algorithm = 'MA'
    # Number of ensemble members
    args.rip_k = 5
    # Data loading
    args.exp_batch_size = 512
    args.data_num_workers = 10
    args.data_prefetch_factor = 2
    # Cache loss metrics here
    args.dir_metrics = '/path/to/metrics'
    return args

c = ipynb_patch_args(args)

# Defining the model
from sdc.oatomobile.torch.baselines import init_rip
from sdc.oatomobile.torch.baselines.robust_imitative_planning import \
    load_rip_checkpoints
from sdc.metrics import SDCLoss
from typing import Mapping, Optional
class Model:
    def __init__(self, c):
        self.c = c
        # Initialize torch hub dir to cache MobileNetV2
        torch.hub.set_dir(f'{c.dir_checkpoint}/torch_hub')

    def load(self):
        model, self.full_model_name, _, _ = init_rip(c=self.c)
        checkpoint_dir = f'{c.dir_checkpoint}/{self.full_model_name}'
        self.model = load_rip_checkpoints(
            model=model, device=c.exp_device, k=c.rip_k,
            checkpoint_dir=checkpoint_dir)

    def predict(self, batch: Mapping[str, torch.Tensor],
                sdc_loss: Optional[SDCLoss] = None):
        """
        Args:
            batch: Mapping[str, torch.Tensor], with 'feature_maps' key/value
        Returns:
            Sequence of dicts. Each has the following structure:
            {
                predictions_list: Sequence[np.ndarray],
                plan_confidence_scores_list: Sequence[np.ndarray],
                pred_request_uncertainty_measure: float,
            }
        """
        self.model.eval()
        with torch.no_grad():
            predictions, plan_confidence_scores, pred_request_confidence_scores = (
                self.model(**batch))
        predictions = predictions.detach().cpu().numpy()
        plan_confidence_scores = plan_confidence_scores.detach().cpu().numpy()
        pred_request_confidence_scores = pred_request_confidence_scores.detach().cpu().numpy()
        if sdc_loss is not None:
            ground_truth = batch['ground_truth_trajectory'].detach().cpu().numpy()
            sdc_loss.cache_batch_losses(
                predictions_list=predictions,
                ground_truth_batch=ground_truth,
                plan_confidence_scores_list=plan_confidence_scores,
                pred_request_confidence_scores=pred_request_confidence_scores)
        return [
            {
                'predictions_list': predictions[i],
                'plan_confidence_scores_list': plan_confidence_scores[i],
                # Negate, as we need to provide an uncertainty for the
                # submission pb, not a confidence score. Uncertainty = -Confidence
                'pred_request_uncertainty_measure':
                    -(pred_request_confidence_scores[i])
            } for i in range(predictions.shape[0])]

# Initialize and load the model from checkpoints.
# On first run, this will fail and create a directory where checkpoints
# should be placed.
model = Model(c=c)
model.load()
Note here that the load function loads multiple ensemble members into the model, and the predict function calculates the predictions, per-trajectory confidence scores, and per-prediction request uncertainty scores. When we replace the baseline model with other models, we should make sure that the model can still output these quantities. (Update Sept 24: Note that for submission results, we need uncertainty instead of confidence scores for each prediction request. We’ve changed the relevant names of variables in the repository accordingly, for clarity. Please make sure your submitted results are in the correct form.)
In the next section, we describe how the predictions and confidence scores are evaluated.
5. Evaluate the model and submit the results
In this section, we first describe the steps to evaluate the model, then explain the metrics that matter in our leaderboard ranking. To evaluate the model, we first need to set up the validation datasets (filtering by scene tags):
def filter_moscow_no_precipitation_data(scene_tags_dict):
    if (scene_tags_dict['track'] == 'Moscow' and
            scene_tags_dict['precipitation'] == 'kNoPrecipitation'):
        return True
    else:
        return False

def filter_ood_validation_data(scene_tags_dict):
    if (scene_tags_dict['track'] in ['Skolkovo', 'Modiin', 'Innopolis'] and
            scene_tags_dict['precipitation'] in ['kNoPrecipitation', 'kRain', 'kSnow']):
        return True
    else:
        return False

moscow_validation_dataset = MotionPredictionDataset(
    dataset_path=validation_dataset_path,
    prerendered_dataset_path=prerendered_dataset_path,
    scene_tags_fpath=scene_tags_fpath,
    scene_tags_filter=filter_moscow_no_precipitation_data,
)
ood_validation_dataset = MotionPredictionDataset(
    dataset_path=validation_dataset_path,
    prerendered_dataset_path=prerendered_dataset_path,
    scene_tags_fpath=scene_tags_fpath,
    scene_tags_filter=filter_ood_validation_data,
)

dataloader_kwargs = {
    'batch_size': c.exp_batch_size,
    'num_workers': c.data_num_workers,
    'prefetch_factor': c.data_prefetch_factor,
    'pin_memory': True
}
moscow_validation_dataloader = torch.utils.data.DataLoader(
    moscow_validation_dataset,
    **dataloader_kwargs
)
ood_validation_dataloader = torch.utils.data.DataLoader(
    ood_validation_dataset,
    **dataloader_kwargs
)
To get the evaluation results and produce a submission protobuf, we can still use the Yandex dataset API:
from ysdc_dataset_api.evaluation import Submission, object_prediction_from_model_output, save_submission_proto
from sdc.oatomobile.torch.baselines import batch_transform
import tqdm.notebook as tqdm
from functools import partial

submission = Submission()
batch_cast = partial(
    batch_transform, device=c.exp_device, downsample_hw=None,
    data_use_prerendered=True)

for is_ood, dataloader in zip(
        [True, False],
        [ood_validation_dataloader, moscow_validation_dataloader]):
    for batch_id, batch in enumerate(tqdm.tqdm(dataloader)):
        batch = batch_cast(batch)
        # prediction_list, plan_conf_score_list, pred_request_conf_score
        batch_output = model.predict(batch)
        for i, data_item_output in enumerate(batch_output):
            proto = object_prediction_from_model_output(
                track_id=batch['track_id'][i],
                scene_id=batch['scene_id'][i],
                model_output=data_item_output,
                is_ood=is_ood)
            submission.predictions.append(proto)

save_submission_proto('dev_moscow_and_ood_submission.pb', submission=submission)
The final step creates a file ready for submission.
Since the ground truth for the development (validation) dataset is available, we can also calculate the metrics using the predictions. To do this, we need to cache our predictions and use the Yandex dataset API again:
from sdc.cache_metadata import load_dataset_key_to_arrs, construct_full_dev_sets
from ysdc_dataset_api.evaluation.metrics import compute_all_aggregator_metrics
# compute_dataset_results is provided by the repository's analysis utilities;
# import it from the corresponding module in your checkout.

# We first load in our predictions, ground truths, per-plan confidence scores,
# and request IDs for each dataset from a cached directory.
dataset_key_to_arrs = load_dataset_key_to_arrs(metadata_cache_dir=dir_metadata_cache)
# Add a field for the full__validation dataset.
dataset_key_to_arrs = construct_full_dev_sets(dataset_key_to_arrs)

(model_preds, plan_conf_scores, pred_req_conf_scores,
 request_ids, is_ood_arr) = (
    compute_dataset_results(
        k=5, d=5, plan_agg='MA', pred_req_agg='MA',
        dataset_key='full__validation',
        dataset_key_to_arrs_dict=dataset_key_to_arrs,
        n_pred_per_model=10,
        retention_column='weightedADE',
        return_preds_and_scores=True))

# Compute all metrics using our predictions.
metrics_dict = compute_all_aggregator_metrics(
    per_plan_confidences=plan_conf_scores,
    predictions=model_preds,
    ground_truth=dataset_key_to_arrs['full__validation']['gt_trajectories'],
    metric_name='weightedADE')
On the challenge leaderboard, we can see many metrics, but not all of them affect the leaderboard ranking. In this blog, we only describe the metrics that influence the ranking. For the other metrics, readers can check the whitepaper [1] for their definitions.
Before explaining the metrics in the leaderboard, we first introduce three key measures: ADE (the average displacement error), R-AUC (the area under the error-retention curve), and cNLL (updated on Sept. 17 as the new metric to replace ADE in the ranking).
- The average displacement error (ADE) measures the quality of a predicted trajectory y = (s_1, …, s_T) with respect to the ground truth trajectory y* = (s*_1, …, s*_T) as:
ADE(y) = (1/T) Σ_{t=1…T} ‖s_t − s*_t‖₂
Analogously, the final displacement error (FDE) measures the quality at the last timestep only:
FDE(y) = ‖s_T − s*_T‖₂
Given D trajectories y_1, …, y_D and their softmax-normalized per-trajectory confidence scores c_1, …, c_D, we can also calculate a per-trajectory confidence-aware ADE metric:
weightedADE_D = Σ_{d=1…D} c_d · ADE(y_d)
- R-AUC is the area under the error-retention curves. It’s a joint measure for both robustness and uncertainty. The curve shows the error over a dataset as a model’s predictions are replaced by ground-truth labels in order of decreasing uncertainty scores (in our case, per–prediction request uncertainty scores). Different choices of errors will result in different R-AUC metrics (e.g., the following figure shows R-AUC using weighted ADE as the error metric).
- cNLL (updated on Sept. 17 to replace ADE): While ADE is a highly intuitive metric to measure prediction performance, it has a conceptual limitation: the optimal solution under ADE is the weighted geometric median of the modes of the true distribution of trajectories (a mode-collapsing behavior). For example, at a T-junction, where trajectories can go either left or right, the optimal model under ADE is one that yields a trajectory going straight, which is clearly a fundamentally undesirable behavior. This can be avoided by a likelihood-based metric, the confidence-weighted negative log-likelihood:
cNLL = −log Σ_{d=1…D} c_d Π_{t=1…T} N(s*_t | s^d_t, I)
Under this metric, which assumes that each mode is modeled using a Normal distribution of fixed variance 1, an optimal model would place a Normal distribution over each mode and weight them appropriately.
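For intuition, here is a rough NumPy sketch of these quantities. It is not the official scoring code (which lives in ysdc_dataset_api.evaluation); the helper names are illustrative, and the cNLL constants follow the unit-variance Gaussian assumption stated above.
import numpy as np

def ade(pred, gt):
    # pred, gt: (T, 2) arrays of 2D positions.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def weighted_ade(preds, confs, gt):
    # preds: (D, T, 2); confs: (D,) softmax-normalized confidence scores.
    return sum(c * ade(p, gt) for c, p in zip(confs, preds))

def cnll(preds, confs, gt):
    # Confidence-weighted NLL of the ground truth under a mixture of
    # unit-variance Gaussians centered at the D predicted trajectories.
    sq_err = ((preds - gt[None]) ** 2).sum(axis=(1, 2))  # (D,)
    T = gt.shape[0]
    log_terms = np.log(confs) - 0.5 * sq_err - T * np.log(2.0 * np.pi)
    m = log_terms.max()
    return -(m + np.log(np.exp(log_terms - m).sum()))    # numerically stable -logsumexp

def r_auc(errors, uncertainties):
    # Error-retention curve: predictions are replaced by ground truth (error -> 0)
    # in order of decreasing uncertainty; the curve shows the average error over
    # the whole dataset at each retention fraction, and R-AUC is its area.
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(uncertainties)   # most certain predictions are retained first
    sorted_errors = errors[order]
    n = len(errors)
    fractions = np.arange(1, n + 1) / n
    retained_error = np.cumsum(sorted_errors) / n
    return np.trapz(retained_error, fractions)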
In the leaderboard, the only metric that affects the ranking is R-AUC CNLL, i.e., the area under the retention curve whose error is calculated as cNLL. Other metrics are also provided to help participants obtain greater insight.
6. Directions for improvements
This blog only covers the baseline method for the Vehicle Motion Prediction task in the Shifts challenge. There are at least three directions that can lead to interesting discoveries and improvements:
- Alternative (backbone) models and uncertainty estimators. As we mentioned, we only require the model to produce both predictions and the uncertainty/likelihood of the predictions. So besides training a density estimator (likelihood model), there are many other choices that may perform better, e.g., a discriminatively trained model using the Mahalanobis distance [5] as uncertainty. However, since the trajectory predictions are intrinsically time-series data, some adaptation is needed to apply other uncertainty/likelihood estimators to this specific task. Participants should also take into account the computational limitation, i.e., models must run on, at most, 1x V100 GPU with 32 GB of graphics memory and yield predictions within 200 ms per input sample (see FAQ); a simple latency check is sketched after this list.
- Alternative framework to incorporate robustness and uncertainty. In this blog, we only discuss RIP, which is an intuitively simple method to incorporate predictions and uncertainty. It’s worth thinking about the benefit of using other methods and possibly more information to select the valid trajectories for prediction.
- Alternative features. The inputs to the model, i.e., the rendered features, certainly play an important role in the performance. Currently, we only use a single time step (T=0). Alternatively, participants may consider using more time steps from the past to provide more context, e.g., in case some vehicle happened to be occluded at the current timestep. If a new model can take advantage of the additional information in the past features, it should have superior prediction performance.
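As a quick way to check the 200 ms per-sample budget, the following sketch times the RIP model defined above on a single batch. It assumes a CUDA device, reuses the batch_cast helper from the submission code, and is only a rough check rather than the organizers' official timing procedure.
import time
import torch

def measure_latency(model, dataloader, n_warmup=3, n_timed=10):
    # Take one full batch and time repeated predictions on it.
    batch = batch_cast(next(iter(dataloader)))
    outputs = model.predict(batch)   # warm-up call; also tells us the batch size
    batch_size = len(outputs)
    for _ in range(n_warmup):        # extra warm-up iterations (CUDA init, caches)
        model.predict(batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_timed):
        model.predict(batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f'~{1000.0 * elapsed / (n_timed * batch_size):.1f} ms per sample')

measure_latency(model, moscow_validation_dataloader)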
Of course, there can be many other creative ways for improvement. We welcome all readers interested in this challenge to contribute!
References
[1] A. Malinin, N. Band, A. Ganshin, G. Chesnokov, Y. Gal, M. J. F. Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova, I. Provilkov, V. Raina, V. Raina, D. Roginskiy, M. Shmatova, P. Tigas, and B. Yangel, “Shifts: A dataset of real distributional shift across multiple large-scale tasks,” 2021.
[2] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy, “End-to-end driving via conditional imitation learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 4693–4700.
[3] N. Rhinehart, R. McAllister, and S. Levine, “Deep imitative models for flexible inference, planning, and control,” CoRR, vol. abs/1810.06544, 2018.
[4] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” 2016.
[5] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Red Hook, NY, USA, pp. 7167–7177, Curran Associates Inc., 2018.