Will the pedestrian cross?

Anna Székely
17 min read · Jun 15, 2020


by Laura Liis Metshvat, Heidi Korp, Anna Székely, Andrzej Lippa

Introduction

The rising tide of robotization has multiplied the challenges of creating AI applications, and the responsibility resting on the shoulders of those who train the algorithms keeps growing. One of the most topical fields offering a wide and exciting area of research is self-driving cars. The general importance of the topic shows in the momentous discussion self-driving cars have sparked not only in AI, but also in philosophy and the social sciences. The source of this outstanding interest is a tension: algorithms can reduce the number of accidents caused by human error, yet in an unavoidable accident the algorithm has to decide whom to save and who will be the victim. Because of this, the precision and reliability of the algorithms that steer self-driving cars are, quite literally, matters of life and death.

Since most fatal accidents happen when pedestrians cross the road unexpectedly or inattentively [1], algorithms dealing with pedestrian path prediction are relevant not only in the field of self-driving cars, but can also be highly useful in alerting systems that draw the driver’s attention to a crossing pedestrian. Such algorithms are already in use, e.g. in the Collision Warning with Full Auto Brake and Pedestrian Detection system [1]–[3].

Determining whether a pedestrian is going to step onto the road and cross it is a highly challenging problem due to the dynamic nature of pedestrians (quick changes in direction and velocity); it is therefore not possible to make high-accuracy predictions earlier than the 1–2 seconds preceding the potential crossing. Furthermore, the task is complicated by the fact that as little as 30 cm can make the difference between the pedestrian being in the car’s path or not. An algorithm thus not only needs to produce a high-accuracy prediction in the shortest possible time, but its spatial precision also has to be very high [2]–[4].

Due to the challenging and influential nature of the problem discussed above, our aim is to improve on the current state-of-the-art solution using neural networks. First, we review previous works related to the topic and present the existing methods that have been used for pedestrian path prediction. In the later chapters we present the parameters we target for improvement and describe the model and dataset we use to tackle the problem.

Related works

Different models use different features of the pictures or videos to predict whether the pedestrian is going to cross. Most of these models measure the pedestrian’s distance from the vehicle, from the curbstone, or from the egolane (the lane in which the vehicle moves), and most also use the pedestrian’s velocity. A very common and widely used tool for predicting the pedestrian’s motion is the Kalman filter, which is applied in many disciplines, including physics, econometrics and engineering, to describe dynamic systems. The Kalman filter makes it possible to estimate a body’s position, velocity and acceleration, so it is widely applied to traffic detection problems [2], [3].
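To make the mechanics concrete, below is a minimal sketch in Python of a constant-velocity Kalman filter tracking a pedestrian’s lateral position; the time step, noise parameters and measurements are illustrative values of our own, not taken from the cited papers.

```python
# Minimal constant-velocity Kalman filter for a pedestrian's lateral position.
# State: [position, velocity]; we only observe noisy positions.
import numpy as np

dt = 0.1                                  # time step between frames (s)
F = np.array([[1, dt], [0, 1]])           # constant-velocity motion model
H = np.array([[1, 0]])                    # measurement model: position only
Q = 0.01 * np.eye(2)                      # process noise (illustrative)
R = np.array([[0.25]])                    # measurement noise (illustrative)

def kalman_step(x, P, z):
    # Predict: propagate state and covariance through the motion model.
    x_pred, P_pred = F @ x, F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement z.
    y = z - H @ x_pred                    # innovation
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    return x_pred + K @ y, (np.eye(2) - K @ H) @ P_pred

x, P = np.zeros((2, 1)), np.eye(2)        # initial state estimate and covariance
for z in [0.05, 0.12, 0.22, 0.35]:        # noisy lateral positions (m)
    x, P = kalman_step(x, P, np.array([[z]]))
print(x.ravel())                          # estimated [position, velocity]
```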

The Kalman filter proved so advantageous that multiple versions of it have been tried, e.g. the Extended Kalman Filter, or the interaction of different filters, each specialised in detecting a different feature of the motion (constant velocity, acceleration, turn, etc.). In the literature, this method is referred to as IMM, which stands for Interacting Multiple Models. The popularity of the Kalman filter also stems from its relatively low computational cost [3].

Beyond the mere motion of the pedestrian, various other features can be used to predict the upcoming actions. The usefulness of further features can be inferred from the fact that even a human’s ability to judge whether a pedestrian will cross decreases drastically if the pedestrian is not fully visible. In a dataset, this situation can be recreated by partially or fully masking the pedestrian in the picture, letting people see only a part of the body or just the pedestrian’s position, e.g. as a blank bounding box [2], [4]. A study from 2014 included the head orientation, which can indicate whether the pedestrian has seen the car and consequently alters the likelihoods of the possible events.

The interconnectedness of the head orientation, motion and lateral position features can be seen in Figure 1, where the latent variables that the model tries to estimate are in the non-shaded frames, while the observed (ground truth) variables are in the shaded frames. SV stands for “sees the vehicle”, HSV for “has seen the vehicle”, AC for “at the curb”, SC for “situation is critical” (when both the pedestrian and the vehicle continue to move with their previous velocity), and M for “movement” (standing or walking with different velocities); X is the prior belief about the position state, computed from the parameters learnt on the training data. The observed variable for SC is Dmin, the minimum distance between the pedestrian and the vehicle, and the head orientation (HO) is the observed variable for seeing the vehicle. The observed position (Y) is the ground truth for the latent position (X), and the distance to the curb (DTC) is the observed variable for “at the curb” (AC). As the graph shows, the predicted latent variables depend on each other sequentially, and the SV, SC and AC nodes also depend on their state in the previous time step. [5]

1. Figure: directed graph, unrolled for two time slices, source: Kooij et al. 2014

Another study from 2014 includes even more variables in order to make a more precise prediction. Bonnin and her co-authors [1] work with the following features (a sketch of how a few of them could be computed follows the list):

1. Lateral distance between the pedestrian and the collision point, where the pedestrian and ego-vehicle will intersect

2. Time for the pedestrian to reach the collision point

3. Time for the ego-vehicle to reach the collision point minus time for the pedestrian to reach the collision point (difference in time to reach the collision point)

4. Lateral distance between the pedestrian and the curbstone

5. Time for the pedestrian to reach the curbstone

6. Time for the pedestrian to reach the egolane

7. Moving direction as the angle between the pedestrian and the road (global orientation)

8. Whether the pedestrian is moving toward the road, parallel to it, or away from it (is facing the road)

9. Moving direction as the angle between the pedestrian and the ego vehicle (relative orientation)

10. Lateral distance between the vehicle and the egolane

11. Lateral distance between the pedestrian and the zebra crossing

12. Time for the pedestrian to reach the zebra crossing
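As mentioned above, here is a hedged sketch of how a few of these features could be computed from tracked quantities; the variable names and the simple distance/speed ratios are our illustration, not Bonnin’s exact formulation.

```python
# Illustrative computation of "time to reach" features from tracked values.
def time_to_reach(lateral_distance_m, lateral_speed_mps, eps=1e-6):
    """Time for an agent to cover a lateral distance at its current speed.
    Returns infinity when the agent is not moving toward the target."""
    if lateral_speed_mps <= eps:
        return float("inf")
    return lateral_distance_m / lateral_speed_mps

# Feature 5: 1.5 m from the curbstone, walking toward it at 1.2 m/s.
t_curb = time_to_reach(1.5, 1.2)        # ~1.25 s to reach the curbstone

# Feature 3: difference between the ego-vehicle's and the pedestrian's
# time to reach the collision point; values near zero signal danger.
t_vehicle = time_to_reach(20.0, 10.0)   # 2.0 s for the ego-vehicle
t_ped = time_to_reach(2.4, 1.2)         # 2.0 s for the pedestrian
dt_collision = t_vehicle - t_ped        # ~0 -> potential collision
```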

The main contribution of Bonnin’s model to the state of the art is that it distinguishes between zebra and non-zebra environments: since road crossing is much more probable in the proximity of zebra crossings than at other road sections, the prediction can be much more precise when the algorithm first detects the environment. The results show that distinguishing between environments can improve accuracy and extend the time horizon over which the model is able to give reliable predictions.

Keller and Gavrila also introduced novel approaches to outperform the basic Kalman-filter-based and IMM (Interacting Multiple Models) models. They worked with augmented visual features, such as optical or scene flow (see Figure 2), and tried to account for the prediction error that emerges from the motion of the ego-vehicle. They applied two new models and compared them with the Kalman filter and IMM. Both novel models are nonlinear, in contrast with the Kalman filters, and both use the augmented visual features. The first was a first-order model based on a Gaussian process dynamical model (GPDM), while the second was a higher-order model using probabilistic hierarchical trajectory matching (PHTM). The researchers found that both GPDM and PHTM outperformed the previous KF and IMM models, determining the pedestrian’s position accurately and giving precise estimations of the action over a greater time horizon. [2]

2. Figure: Optical flow and scene flow (source: Keller & Gavrila, 2014)

It is also worth mentioning that one of the most challenging points of pedestrian path prediction is training the models to detect and predict different motions (e.g. stopping and walking); thus, in numerous studies the researchers train separate models to capture the different motions: e.g. [2] trained one model on pictures where pedestrians are walking, while another model was trained on stopping pedestrians. Bonnin and her co-authors faced a similar problem, whose complexity also emerged from the high number of features (the twelve listed above). To be able to combine the features (which cover numerous motions and positions of pedestrians), the researchers used a learning classifier in the form of a single-layer perceptron. [1]
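As an illustration of this last step, the toy sketch below combines a handful of such features with a single-layer perceptron; the feature values and labels are invented for the example.

```python
# Combining heterogeneous features with a single-layer perceptron,
# in the spirit of Bonnin et al.; the toy data below is ours.
import numpy as np
from sklearn.linear_model import Perceptron

# Each row: [time_to_curb, time_to_egolane, facing_road, global_orientation]
X = np.array([
    [0.8, 1.5, 1, 0.9],   # heading for the road quickly
    [5.0, 9.0, 0, 0.1],   # walking parallel, far from the egolane
    [1.2, 2.0, 1, 0.8],
    [6.0, 8.5, 0, 0.2],
])
y = np.array([1, 0, 1, 0])  # 1 = will cross, 0 = will not

clf = Perceptron(max_iter=100, random_state=0).fit(X, y)
print(clf.predict([[1.0, 1.8, 1, 0.85]]))  # -> [1], predicted to cross
```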

JAAD dataset: video 0001. Source: http://jaad-explore.nvision2.eecs.yorku.ca

Datasets

JAAD dataset

In our project, we are experimenting with the Joint Attention in Autonomous Driving (JAAD) dataset introduced by Iuliia Kotseruba, Amir Rasouli and John K. Tsotsos. The aim of the dataset is to capture the behavioral variability of traffic participants and the joint attention that must occur between drivers and pedestrians, cyclists and other drivers. In the dataset, many different weather conditions, geographical locations, traffic and demographics of people are presented. The ground truth of the dataset contains information about the location of participants as bounding boxes, the physical conditions such as lighting and speed and the behavior of the parties involved. [6]

JAAD dataset: video 0007. Source: http://jaad-explore.nvision2.eecs.yorku.ca

This dataset is used in many works related to predicting pedestrian behavior, as it represents a variety of scenarios involving pedestrians and other drivers. Most of the data is collected in urban areas, where people wait at designated crossings. There are samples of people of different ages, carrying heavy objects or walking with children and pets. The data is described using bounding boxes and textual annotations. [6]

JAAD dataset: video 0024. Source: http://jaad-explore.nvision2.eecs.yorku.ca

PIE dataset

PIE stands for Pedestrian Intention Estimation. The researchers created the PIE dataset because they believed that model performance can be improved by predicting human intention instead of relying merely on motion data. The criticism of the motion-based approach is that the action already needs to have started for the outcome to be predictable: the pedestrian must already be moving toward the road before a motion-based model can make an accurate prediction. The developers of PIE therefore suggest combining the motion-based approach with an intention-based one. With this approach and the new PIE dataset, the researchers achieved 79% precision in 2019, outperforming the state of the art by 26% [7].

The dataset contains 10-minute-long videos taken from the vehicle. The annotation that accompanies the data contains the bounding boxes and the pedestrian’s actions (“walking”, “standing”, “looking”, “not looking”, “crossing”, “not crossing”). Furthermore, spatial information is also included (such as traffic lights, signs, zebra crossings and road boundaries), and accurate speed and heading orientation data is generated from the GPS data. The novelty of the PIE dataset compared to JAAD is also the accurate vehicle information, the spatial street-feature information (lights, signs, etc.) and the pedestrian intention data [7].

The ground truth for the pedestrian intention (does he or she want to cross?) was collected by showing the videos to human subjects, who had to decide at given time steps of the video whether the pedestrian wants to cross or not [7].

3. Figure: JAAD and PIE dataset features. Source: [7]

Solutions

Stacked with multilevel fusion GRU (SF-GRU)

After a lot of research and reading of papers, we came across a novel solution called stacked with multilevel fusion GRU (SF-GRU), which seemed to have the best performance, with a reported accuracy of 84%. We were also able to find the model’s code, which we investigated thoroughly. The plan was to use it as the base model; however, we were not able to get the model running, even after being in contact with the authors.

The history of agents’ movements is often used to predict their future actions. Although this dynamic information is very important, motion patterns alone are not sufficient to make sense of pedestrian behaviour. There are multiple environmental factors, such as signals and road structures, that can affect the behaviour of a pedestrian.

Architecture

This solution also takes into account visual observations of the pedestrian and their surroundings. The task is defined as a binary classification problem: predicting whether the pedestrian is going to cross, given the context observed up to a certain time.

The approach is relatively novel (published in 2019) and uses a stacked recurrent neural network (RNN) architecture. The data from different modalities is gradually fused in different layers. The arrangement of the different data matters and affects the resulting accuracy. [8]

4. Figure: The architecture of the proposed algorithm SF-GRU comprised of five GRUs each of which processes a concatenation of features of different modalities and the hidden states of the GRU in the previous level. The information is fused into the network gradually according to the complexity of the features. Each feature input consists of m sequential observations. From bottom to top layers features are fused as follows: pedestrian appearance c_p , surrounding context c_s , poses p, bounding boxes b and ego-vehicle speed s. + refers to concatenation operation. Source: [8]

The prediction relies on four sources of information: local context (visual features of the pedestrian and their surroundings), pedestrian pose, 2D bounding box locations and the speed of the ego-vehicle.

Local context: The pedestrian’s surroundings and appearance are used at every time step of the observation. The appearance is obtained by cropping the frame image to the 2D bounding box around the pedestrian. In order to capture the surroundings, the bounding box is widened into a square so that the width of the scaled bounding box matches its height. This results in a wider viewing angle around the pedestrian and may include the people around, the street, traffic signals, etc. In the cropped surroundings image, the pedestrian’s appearance is suppressed by changing the pixel values inside the original bounding box coordinates to neutral gray. Both of these crops are processed with a convolutional neural network that produces two feature vectors.
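A sketch of how such crops could be produced is shown below; the exact scaling and padding in the paper may differ (the implementation details are discussed later).

```python
# Producing the appearance crop and the grayed-out surroundings crop
# from a frame and a pedestrian bounding box; a sketch, not the paper's code.
import numpy as np

def local_context_crops(frame, box):
    """frame: HxWx3 uint8 image; box: (x1, y1, x2, y2) pedestrian box."""
    x1, y1, x2, y2 = box
    ped = frame[y1:y2, x1:x2].copy()             # appearance crop

    # Widen the box into a square whose side equals the box height.
    h, cx = y2 - y1, (x1 + x2) // 2
    sx1 = max(0, cx - h // 2)
    sx2 = min(frame.shape[1], cx + h // 2)
    surround = frame[y1:y2, sx1:sx2].copy()

    # Suppress the pedestrian's appearance with neutral gray.
    gx1, gx2 = max(0, x1 - sx1), max(0, x2 - sx1)
    surround[:, gx1:gx2] = 128
    return ped, surround
```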

Pedestrian pose: For each pedestrian, the coordinates of 18 body joints are generated using a pose network. Each coordinate corresponds to a point in 2D space. The coordinates are normalized and concatenated into a 36D feature vector.

2D bounding box locations: The bounding box coordinates are transformed into relative displacements from the initial position, forming a vector that can be interpreted as the pedestrian’s velocity at every time step.

Speed of the ego-vehicle: The ego-vehicle’s speed is recorded at each time step. [8]

5. Figure: Skeleton fitting is based on 18 keypoints, distinguishing left and right. We use the 9 keypoints highlighted with stars. Source: [9]
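The following sketch shows one plausible way to assemble these non-visual inputs per time step; the normalization scheme is our assumption.

```python
# Assembling the non-visual inputs: normalized pose keypoints,
# bounding-box displacement, and ego-vehicle speed per time step.
import numpy as np

def pose_vector(keypoints, frame_w, frame_h):
    """keypoints: (18, 2) joint (x, y) pixels -> normalized 36-D vector."""
    kp = np.asarray(keypoints, dtype=np.float32).copy()
    kp[:, 0] /= frame_w
    kp[:, 1] /= frame_h
    return kp.reshape(-1)                 # 36 values per time step

def displacement_sequence(boxes):
    """boxes: (T, 4) [x1, y1, x2, y2] per frame -> displacements from frame 0,
    which can be read as the pedestrian's velocity over the observation."""
    boxes = np.asarray(boxes, dtype=np.float32)
    return boxes - boxes[0]

# The ego-vehicle speed is simply a (T, 1) sequence read from the recordings.
```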

In order to jointly model the sequence data, gated recurrent units (GRUs) are used. GRUs are simpler than Long Short-Term Memory networks (LSTMs), and in this solution they achieve a similar performance.

RNNs are used because they are able to learn temporal dependencies in sequence data. Temporal depth has been shown to benefit tasks that apply single-layer RNNs to predict coordinates in space, such as trajectory prediction. Moreover, spatial depth, which can be increased by stacking multiple layers of RNNs on top of each other, improves sequential data modeling in complex tasks such as video sequence analysis.

The chosen approach uses a stacked RNN architecture that gradually fuses features at every level based on their complexity. The visual features, which benefit more from the spatial depth, are fed into the network at the bottom levels, and the dynamic features (speed, trajectories) at higher levels. [8]
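Below is a simplified Keras sketch of this gradual-fusion idea: five stacked GRUs, each consuming a new feature sequence concatenated with the hidden-state sequence of the level below, with the hyperparameters taken from the implementation details reported in the next subsection. It is an illustration of the architecture, not the authors’ code.

```python
# A simplified SF-GRU-style model: features are fused bottom-to-top
# in order of decreasing complexity, as in Figure 4.
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers

T, H = 15, 256                                     # 0.5 s observation, 256 hidden units
c_p = layers.Input((T, 512), name="ped_context")   # pedestrian appearance features
c_s = layers.Input((T, 512), name="surround")      # surrounding context features
p   = layers.Input((T, 36),  name="pose")          # 18 joints x (x, y)
b   = layers.Input((T, 4),   name="bbox_disp")     # bounding-box displacements
s   = layers.Input((T, 1),   name="ego_speed")     # ego-vehicle speed

def gru_seq(x):
    return layers.GRU(H, return_sequences=True)(x)

h = gru_seq(c_p)                                    # level 1: appearance
h = gru_seq(layers.Concatenate()([h, c_s]))         # level 2: + surround context
h = gru_seq(layers.Concatenate()([h, p]))           # level 3: + pose
h = gru_seq(layers.Concatenate()([h, b]))           # level 4: + displacement
h = layers.GRU(H)(layers.Concatenate()([h, s]))     # level 5: + speed, last state

out = layers.Dense(1, activation="sigmoid",
                   kernel_regularizer=regularizers.l2(1e-4))(h)
model = Model([c_p, c_s, p, b, s], out)
model.compile(optimizer=tf.keras.optimizers.Adam(5e-6),
              loss="binary_crossentropy", metrics=["accuracy"])
```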

Implementation

The model uses GRUs with 256 hidden units. In order to get the local context, the pedestrian samples are cropped (using the 2D bounding box annotations), resized to 224 pixels and padded with zeros to preserve the aspect ratio. For suppressing the pedestrians in the surrounding bounding box, neutral gray RGB(128, 128, 128) is used. The local-context images are first processed using a convolutional network for classification and detection (VGG16) pretrained on ImageNet, followed by a pooling layer that generates a feature vector of size 512. For the pedestrian poses, the pose estimation network used is pretrained on the COCO dataset; for each pedestrian, the network generates an 18-joint pose. [8]
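For the visual branch, the 512-D context features could be precomputed roughly as follows; this is a sketch using the ImageNet-pretrained VGG16 shipped with Keras, and the global-average pooling choice is our assumption.

```python
# Precomputing 512-D local-context features with a pretrained VGG16.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

backbone = VGG16(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3), pooling="avg")  # -> 512-D output

def context_feature(crop_224):
    """crop_224: (224, 224, 3) uint8 local-context crop -> (512,) feature."""
    x = preprocess_input(crop_224.astype(np.float32)[None])
    return backbone.predict(x, verbose=0)[0]
```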

The context and pose features described above are precomputed. The model is trained using the Adam optimizer with a learning rate of 5×10⁻⁶ for 60 epochs, with a batch size of 32 and L2 regularization of 0.0001. Additionally, the data is augmented by horizontally flipping the images and by sub-sampling the over-represented classes in order to equalize the numbers of crossing and non-crossing samples. [8]

Evaluation

Although JAAD is one of the best datasets used in the field of predicting pedestrian behaviour, it was not the most suitable one for this task: more samples were needed, together with more precise data about the vehicles and longer sequences for making long-term predictions. Therefore, the pedestrian intention estimation (PIE) dataset was used. [8]

The performance of the proposed SF-GRU model was compared to several other models (single-layer GRU, multi-stream GRU, hierarchical GRU, stacked GRU). All the evaluations were done on observation sequences with a duration of 0.5 s (15 frames). The samples were selected with a 2 s time to event (TTE), as this should be the minimum time for the pedestrian to make the crossing decision. [8]

The proposed SF-GRU model performs better than the other tested models (single-layer GRU, multi-stream GRU, hierarchical GRU, stacked GRU) on all the metrics except recall. The single-layer GRU achieves better recall, but at the cost of 6% lower precision. Moreover, the results also indicate that no performance improvement is achievable by simply adding layers to the network or by separating the processing of features of different modalities. [8]

6. Figure: The impact of different sources of information on the performance of SF-GRU. The feature types are as follows: C_p pedestrian context (appearance), C_s surround context, C_p+s full context, P pose, D displacement (center coordinates), B bounding box and S speed. Source: [8]

The performance of all the algorithms degrades when the observation lies further before the actual event. A longer observation time can provide more data, but it also adds more noise.

7. Figure: Feature fusion strategies and their impact on the performance of the proposed algorithms SF-GRU. The feature types are as follows: C_p pedestrian context (appearance), C_s surround context, P pose, B bounding box and S speed. Source: [8]

The table above indicates how different fusion strategies alter the performance: moving the simpler features (for example speed) to the higher levels of the stack improves the performance by up to 9% in accuracy, 10% in recall and more than 15% in precision. This is probably because the more complex features benefit more from the spatial depth of the network and are therefore fed in at the bottom levels, while the simpler features (speed, trajectory coordinates) enter at the higher levels. [8]

Conclusion

The SF-GRU model is a stacked RNN architecture that gradually, at different levels of processing, fuses together different features, such as vehicle dynamics, pedestrian appearance and surroundings. SF-GRU performs best among the compared RNN architectures, reaching its optimal performance when the more complex features are fed into the bottom layers and the simpler features into the higher layers of the network. [8]

Our solution

In our experiment, we are analyzing the JAAD dataset and creating our own model. The preprocessing of the dataset can be divided into two tasks: processing of the video files and processing of the annotations.

The authors of the JAAD dataset have written a short script for splitting the video clips into sets of images. The images corresponding to one video clip are saved into the folder with the same name as the video. Each image name contains the ID of the frame, which can be later associated with corresponding frame annotation.

The annotation files are in XML format and contain data about the bounding boxes, whether the pedestrian is crossing or not, and other related information, such as whether the pedestrian is occluded. We use the value of the “crossing” tag as the label and a few other fields as features. [6]
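A hedged sketch of this parsing step is shown below using the standard library; the tag and attribute names (“box”, “crossing”, “occluded”) are illustrative and should be checked against the actual JAAD annotation schema.

```python
# Pulling per-frame labels out of an annotation XML; element and attribute
# names are assumptions for illustration, not the verified JAAD schema.
import xml.etree.ElementTree as ET

def parse_annotations(xml_path):
    samples = []
    root = ET.parse(xml_path).getroot()
    for box in root.iter("box"):                  # one element per pedestrian box
        attrs = {a.get("name"): a.text for a in box.iter("attribute")}
        samples.append({
            "frame": int(box.get("frame", 0)),
            "crossing": attrs.get("crossing") == "crossing",
            "occluded": box.get("occluded") == "1",
        })
    return samples
```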

After preprocessing both the video clips and the annotations, we create one object per frame that contains the image frame and the related features. There is also a set of labels corresponding to the objects in the training data. We then use this data to train our model.

Results

The first model we created is very simple: it has four layers, consisting of one input layer and three dense layers. The first dense layer uses the softmax activation function and the following two use the ReLU activation function. We use binary cross-entropy as the loss function, the Adam optimizer and a hidden layer size of 32, and as metrics we use accuracy and binary accuracy.
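For reference, a minimal Keras sketch of this model as described is shown below; the number of input features is a placeholder, and we note in passing that a sigmoid output would be the more conventional choice for binary cross-entropy.

```python
# The simple dense model described above; n_features is a placeholder.
from tensorflow.keras import Sequential, layers

n_features = 8                               # per-frame annotation features (placeholder)

model = Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation="softmax"),  # first dense layer, as described
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="relu"),      # output; a sigmoid is more usual here
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", "binary_accuracy"])
```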

To train our model, we use images with corresponding annotations from 17 video clips, which makes 3960 images. The annotations consist of data about the pedestrian, such as whether they are occluded, their hand gestures, whether they are looking, walking or nodding, and their action and reaction. As labels, we use the value of the “cross” tag, which indicates whether the pedestrian in the image is crossing or not.

After training for 100 epochs, we get a loss of 2.5986, an accuracy of 0.0604 and a binary accuracy of 0.6770. The binary accuracy is so high compared to the accuracy because there are many more images of pedestrians who are not crossing than of those who are. However, the model’s ability to recognize a crossing pedestrian is relatively low.

Future improvements

There are many steps that could be taken to improve the model. First, we should add more data and divide it into training, validation and test sets. We could also use more layers in the model, particularly a GRU layer, as used in the existing model we have been thoroughly investigating. Other hyperparameters better suited to a two-class classification model could also be used. Also, the quality of the data could be improved by choosing different or additional tags, such as the locations of the pedestrians’ bounding boxes. Finally, it would help if there were an equal number of images of pedestrians crossing and not crossing; a sketch of such balancing follows below.
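The last point could be addressed by sub-sampling the majority class with a few lines of NumPy, as in the minimal sketch below (assuming binary labels in an array y).

```python
# Sub-sampling the majority class so both labels are equally represented.
import numpy as np

def balance_classes(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n = min(len(pos), len(neg))               # size of the smaller class
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```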

Conclusion

In this project, we created a simple model using the JAAD dataset to predict from camera images whether a pedestrian is going to cross. At first, we wanted to use an existing model that had better results than the other approaches we looked at. However, we were not able to get that model running as intended, and thus decided to understand the problem better by creating a simple model ourselves. The resulting model needs further improvement, but from the results we got a better understanding of the data and of how it should be used to make the model effective. Predicting whether a pedestrian is going to cross in an area with both marked and unmarked crossings is very important for the creation of autonomous vehicles. We hope that our work can be used in future work in a similar field.

References

[1] S. Bonnin, T. H. Weisswange, F. Kummert, and J. Schmuedderich, “Pedestrian crossing prediction using multiple context-based models,” 2014 17th IEEE Int. Conf. Intell. Transp. Syst. ITSC 2014, pp. 378–385, 2014.

[2] C. G. Keller and D. M. Gavrila, “Will the pedestrian cross? A study on pedestrian path prediction,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 2, pp. 494–506, 2014.

[3] N. Schneider and D. M. Gavrila, “Pedestrian path prediction with recursive Bayesian filters: A comparative study,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8142 LNCS, pp. 174–183, 2013.

[4] C. G. Keller, C. Hermes, and D. M. Gavrila, “Will the pedestrian cross? Probabilistic path prediction based on learned motion features,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 6835 LNCS, pp. 386–395, 2011.

[5] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, “Context-based pedestrian path prediction,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8694 LNCS, no. PART 6, pp. 618–633, 2014.

[6] I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Joint Attention in Autonomous Driving (JAAD),” pp. 1–10, 2016.

[7] A. Rasouli, I. Kotseruba, T. Kunic, and J. Tsotsos, “PIE: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2019-October, pp. 6261–6270, 2019.

[8] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs,” pp. 1–13, 2020.

[9] Z. Fang and A. M. López, “Is the Pedestrian going to Cross? Answering by 2D Pose Estimation,” IEEE Intell. Veh. Symp. Proc., vol. 2018-June, pp. 1271–1276, 2018.
