Travel Time Prediction using Deep Learning

Rajesh Hassija
Feb 3, 2021


Executive Summary

Travel time prediction is one of the most challenging problems in Intelligent Transportation Systems (ITS). Precise travel time information helps travelers, operators, and planners choose routes wisely and save travel time and costs. Operators, in particular, can make better and more efficient decisions based on reliable and timely predictions to manage resources and to forward information to higher decision-makers during events. Such a tool is invaluable when executives must make significant decisions, or notify government officials, about closures, diversions, the extent of delays, and clearing times during extreme or abnormal events. This, in turn, results in less traffic congestion, improved operational efficiency, and increased public confidence in the conditions on the road network.

In recent years, the rise of smart cities has led to a vast increase in available data. Additionally, advances in Big Data and Artificial Intelligence (AI) technologies offer enhanced features and functionality that have made Machine Learning (ML) adoption far easier. Cloud providers such as AWS and Azure offer scalable, on-demand infrastructure, making implementation and experimentation simpler and costs more reasonable. In the ITS domain, this modernization can positively affect transportation networks, cutting travel time, increasing efficiency, and reducing the environmental impact of vehicles.

During the Predictive Modeling POC, the team leveraged standard ML libraries to predict travel time. Historical data was ingested into a Microsoft Azure data lake, where it was cleaned, transformed, and prepped for machine learning training and testing. Since this was a POC, only two corridors, I-80 and I-495, were selected. These corridors have large variations and a mix of urban and rural travel time patterns. Various methods and techniques were used to train the models, and ultimately two models were chosen to predict travel time: a normal-condition model and an incident-condition model. After analyzing the predictions and accuracy, the results are extremely promising: 87% of the travel time test results fell within +/- 15 seconds of the actual travel times, and 76% of the travel time test data fell within +/- 10% of the actual travel times.

Project Brief

This POC on Predictive Modeling begins to determine whether, and how accurately, future travel conditions can be predicted from real-time transportation information for the Tri-State region. If successful, this modeling will help Transportation Operations staff make better and more efficient decisions, based on reliable and timely predictions, to manage resources and assets during day-to-day events and to forward the information to higher decision-makers during events. This tool will be invaluable when executives must make major decisions, or notify government officials, about closures, diversions, the extent of delays, and clearing times during extreme or abnormal events. Predictive modeling will also allow users to produce near real-time reports that can then be distributed to the public so that travelers can make informed decisions about their trips.

Approach Overview

This project emphasized the data preparation required for a machine learning project, the features that are essential for travel time prediction, and the feature enrichment needed to improve prediction accuracy. A major goal was to demonstrate that high performance and accuracy in travel time prediction can be attained with only two machine learning models, unlike approaches where separate models are created for each roadway. The first, supervised model, the "Ordinary or Normal model," predicts travel times for peak (high congestion) and non-peak hours. The second (unsupervised + supervised) model, known as the "Extra-Ordinary or Incident model," was developed to predict the impact on travel time when an incident occurs. An incident can be a traffic accident, a construction project on a roadway, or delays caused by bad weather.

Project Timeline

The project lasted approximately eight months, and the critical activities performed during the project were:

1. Business understanding: During Feb 2020, the core objectives and goals of the project were defined, and the project team was established.

2. Data Acquisition and Modeling: This phase ran from Mar to Sep 2020. During data acquisition, activities such as extracting data from the sources and massaging, cleaning, prepping, staging, splitting, and transforming the raw data were performed. In the data modeling phase, steps such as data engineering, training, and testing of the models were carried out.

3. Deployment and Customer Acceptance: This last phase was conducted from Oct to Dec 2020. During this time, the team developed various exploratory charts, operationalized the models, and conducted system validation and project closure activities.

During this POC, the data covered the I-80 and I-495 corridors, which comprised 87 and 46 TMCs (Traffic Message Channel segments) respectively, resulting in millions of transactions.

Scalable & Sustainable Architecture

The data preprocessing procedures and routines, training procedures and scripts, results, experiments, and post-processing routines are written in Python, NumPy, PySpark, and SQL. The Microsoft Azure cloud platform, along with a Databricks (Spark) subscription, was leveraged as the infrastructure for Big Data storage, computing, and exploration. GPUs were leveraged to speed up training: an NVIDIA Tesla K80 with 56 GB RAM and 6 cores was the main configuration used to run the experiments, and a GPU server with Python (Anaconda and PyTorch) was used to train the models to ensure speedier completion. On average, the normal model trains in 2 to 3 hours and the incident model in 1 to 2 hours. The data, from quarter 4 of 2019 and quarter 1 of 2020, was loaded in parallel using all cores through multiprocessing, and model training ran on the GPU. Without the GPU, training times went up 3 to 4-fold; using more GPUs sped up training further and additional GPUs could be added if required. The following diagram provides the logic of this process.
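As a rough illustration of this setup, the sketch below shows multi-worker data loading feeding a CUDA device in PyTorch; the tensor shapes, batch size, and worker count are illustrative assumptions rather than the project's exact configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Use the GPU when available (training reportedly ran 3-4x slower without one).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder windows: (samples, 7 time steps, ~20 features) and one target per window.
X = torch.randn(10_000, 7, 20)
y = torch.randn(10_000, 1)

# Multi-worker loading so the CPU cores keep the GPU fed.
loader = DataLoader(TensorDataset(X, y), batch_size=512, shuffle=True,
                    num_workers=6, pin_memory=True)

for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    # ... forward pass, loss, backward pass, optimizer step ...
```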

High-Level Overview

Over 20 important data attribute features were used for the normal model and almost 40 for the incident model. Because the incident model must make predictions from very little data, the goal was to provide the richest possible feature set for both models' training. The data is collected and sampled in two-minute epochs, so there are 720 travel time readings per day. The entire dataset was restructured into time series with seven steps per consecutive series. Predictions were done for Q4 2019 and Q1 2020. The normal model was trained on 60% of this data, with 20% used for validation and the remaining 20% for testing. The train, validation, and test datasets are randomly sampled with a random seed of 123 across all experiments, and the entire dataset is normalized to zero mean and unit variance. Parquet files were used throughout the data preprocessing steps because of their value for ETL processes. For both models, the goal is to regress from a set of features to a single value, making this a standard regression problem: given numerous features, predict the travel time for those features.
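The windowing, split, and normalization steps described above can be sketched as follows. The random input data, the assumption that column 0 holds the travel time, and the helper name make_windows are illustrative, not the project's actual code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_windows(values: np.ndarray, steps: int = 7):
    """Build 7-step windows and next-epoch targets from 2-minute readings.
    Assumes column 0 of `values` holds the travel time."""
    X = np.stack([values[i:i + steps] for i in range(len(values) - steps)])
    y = values[steps:, 0]
    return X, y

# Illustrative data: ~90 days of 2-minute epochs (720 per day) with 20 features.
values = np.random.rand(720 * 90, 20)
X, y = make_windows(values, steps=7)

# 60/20/20 train/validation/test split with the fixed random seed of 123.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=123)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=123)

# Normalize features to zero mean and unit variance.
scaler = StandardScaler().fit(X_train.reshape(len(X_train), -1))
X_train = scaler.transform(X_train.reshape(len(X_train), -1)).reshape(X_train.shape)
X_val = scaler.transform(X_val.reshape(len(X_val), -1)).reshape(X_val.shape)
X_test = scaler.transform(X_test.reshape(len(X_test), -1)).reshape(X_test.shape)
```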

The Need for Two Models

Two different models, the normal model and the incident model, were used because there were two distinct tasks. The first task was predicting travel times in normal, recurrent cases during peak hours, holidays, and weekends. This kind of prediction is feasible because abundant data exists to feed the model. On the other hand, it is very challenging to predict travel times for non-recurrent events, such as incidents caused by weather, construction, or accidents. Trying to predict both recurrent and non-recurrent events with a single model proved to be impossible, so the problem was split in two: one model, called the normal model, tackles recurrent events, and a separate model tackles non-recurrent events.

Detailed Overview of Both Models

The data used in this study covers the I-80 and I-495 corridors. These corridors span NY and NJ and contain a mix of free-flow and congested traffic data. Originally, the model was to be trained on one year of historical data, but it was soon realized that much of this data was irrelevant, as it is common for traffic patterns and roadways to change from quarter to quarter. Hence, it was decided to use the last quarter of 2019 to train the models. The data was provided in 2-minute intervals, resulting in close to 10 million records for both corridors; after converting it to time series, the data size grew to 70 million records.

All the data was ingested into the raw (stage) zone of the data lake, where it was cleaned, prepped, transformed, and stored in the curated (gold) zone. A 15-minute aggregation of the entire dataset was computed; outliers were then identified via z-score and replaced. Null data was deleted or substituted with the 15-minute aggregations computed previously. Incidents and events were identified and separated as inputs for the mixture model. Special business rules and transformations were applied to data around holidays, such as Labor Day, Memorial Day, Christmas, and other significant holidays.
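A minimal PySpark sketch of the outlier and null handling described above might look like the following; the paths, column names, and the |z| > 3 threshold are assumptions for illustration.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/raw/travel_times")   # hypothetical raw-zone path

# 15-minute bucket for each 2-minute reading.
df = df.withColumn(
    "bucket_15min",
    F.from_unixtime((F.unix_timestamp("timestamp") / 900).cast("long") * 900))

# Average travel time per TMC per 15-minute bucket.
agg = (df.groupBy("tmc", "bucket_15min")
         .agg(F.avg("travel_time").alias("tt_15min_avg")))

# z-score of each reading relative to its TMC.
w = Window.partitionBy("tmc")
df = (df.withColumn("tt_mean", F.avg("travel_time").over(w))
        .withColumn("tt_std", F.stddev("travel_time").over(w))
        .withColumn("z", (F.col("travel_time") - F.col("tt_mean")) / F.col("tt_std")))

# Replace outliers (|z| > 3 assumed) and nulls with the 15-minute aggregate.
curated = (df.join(agg, on=["tmc", "bucket_15min"], how="left")
             .withColumn("travel_time_clean",
                         F.when(F.col("travel_time").isNull() | (F.abs(F.col("z")) > 3),
                                F.col("tt_15min_avg"))
                          .otherwise(F.col("travel_time"))))

curated.write.mode("overwrite").parquet("/mnt/curated/travel_times")  # gold zone
```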

Identifying peak hours and tagging them as a feature was the biggest challenge, so "congestion loss" was developed as a new feature by comparing the real-time traffic data with historical averages. The traffic segments (link IDs) initially used for model training did not work because they were too short; these were later changed to longer roadway segments, the TMCs. Enriching the event start and end dates was an essential part of the feature engineering process, as an event could have been recorded in the IT systems much later than its actual occurrence. Additionally, an event is marked as closed as soon as the accident is cleared from the roadway (in the case of accident events), yet the impact of the accident can still be seen for several hours after the wreckage is cleared.
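A hedged pandas sketch of the congestion-loss idea, comparing real-time travel times against a historical average for the same TMC, weekday, and 2-minute epoch, is shown below; the column names and the 25% peak threshold are assumptions.

```python
import pandas as pd

def add_congestion_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compare real-time travel times with historical averages for the same
    TMC, weekday, and 2-minute epoch. Column names and the 25% peak-hour
    threshold are assumptions for illustration."""
    df = df.copy()
    df["dow"] = df["timestamp"].dt.dayofweek
    df["epoch"] = (df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute) // 2

    # Historical average travel time for the same TMC, weekday, and epoch.
    hist = (df.groupby(["tmc", "dow", "epoch"])["travel_time"]
              .mean()
              .rename("hist_avg_tt"))
    df = df.join(hist, on=["tmc", "dow", "epoch"])

    # "Congestion loss": how much slower traffic is than its historical norm.
    df["congestion_loss"] = (df["travel_time"] - df["hist_avg_tt"]).clip(lower=0)
    df["is_peak"] = df["congestion_loss"] > 0.25 * df["hist_avg_tt"]
    return df
```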

The key features used for normal model training were the TMC, the length of the TMC, the mean historical travel time, and rolling averages of the travel time. SHAP analysis (see the model feature analysis figure) was also done to analyze these features, as discussed in later sections. The year and week were never considered as features because the goal was to be able to retrain the model incrementally with as little data as possible, and redundant features that could affect model training and learning time were avoided. Mean travel time, historical congestion, rolling averages, interchange, day type, and holiday flags were some of the key features added to enhance and improve model performance.

This model expects the data to be provided in a time-step (series) format, meaning it must be given the past travel time records leading up to the current travel time. The team originally started with 4 time steps but settled on 7 time steps for training the models. So, all the models expect traffic pattern data as a time series of 7 steps, where each step is a 2-minute epoch reading of the travel time for a particular TMC.

The incident model receives two datasets as inputs: one with all the event features (accident, construction, and weather event data), and the other with travel time information in LSTM (time step) format. The event feature data was fed to an autoencoder model to learn the optimal event impact features and to gauge the duration of each particular event. The results of the autoencoder were fed to the LSTM to predict the travel time. The data fed to the model was in the same 7-time-step format used in the normal model; however, this data was exclusive to events and had the event attributes coupled with it during training. It was realized that stacking the autoencoder did not really improve the performance of the incident model, so the autoencoder was not included in the final implementation.
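For context, a minimal PyTorch sketch of an event-feature autoencoder of the kind described above is shown below. The layer widths and bottleneck size are assumptions, and, as noted, the autoencoder was not part of the final implementation.

```python
import torch.nn as nn

class EventAutoencoder(nn.Module):
    """Minimal sketch of an autoencoder over event features. The layer widths
    and 8-dimensional bottleneck are assumptions, not the project's actual
    configuration (the autoencoder was ultimately dropped)."""
    def __init__(self, n_event_features: int = 40, bottleneck: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_event_features, 32), nn.ReLU(),
            nn.Linear(32, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 32), nn.ReLU(),
            nn.Linear(32, n_event_features),
        )

    def forward(self, x):
        z = self.encoder(x)          # compressed event-impact representation
        return self.decoder(z), z    # reconstruction for training, z for the LSTM
```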

Normal Model

The normal model predicts travel times using roughly 20 features. Rolling averages, congestion, and the mean historical average travel time were believed to be the three most significant features, and SHAP values were used to determine the significance of the feature values.
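A hedged sketch of how SHAP values could be computed for a trained travel-time model follows; the use of KernelExplainer over flattened 7-step windows, the sample sizes, and the naming scheme are illustrative assumptions, not the project's exact analysis.

```python
import numpy as np
import shap
import torch

def shap_feature_importance(model: torch.nn.Module,
                            X_windows: np.ndarray,
                            feature_names: list) -> None:
    """Estimate SHAP values for a trained travel-time model over flattened
    7-step windows. Sample sizes and the flattening scheme are assumptions."""
    n, steps, n_feat = X_windows.shape
    X_flat = X_windows.reshape(n, steps * n_feat)

    def predict_fn(x_2d: np.ndarray) -> np.ndarray:
        x = torch.tensor(x_2d, dtype=torch.float32).reshape(-1, steps, n_feat)
        with torch.no_grad():
            return model(x).squeeze(-1).cpu().numpy()

    background = shap.sample(X_flat, 100)                  # small background set
    explainer = shap.KernelExplainer(predict_fn, background)
    shap_values = explainer.shap_values(X_flat[:200], nsamples=200)

    # One name per (feature, time step) column in the flattened windows.
    flat_names = [f"{name}_t{-(steps - 1 - s)}"
                  for s in range(steps) for name in feature_names]
    shap.summary_plot(shap_values, X_flat[:200], feature_names=flat_names)
```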

The architecture of the normal model consists of two sub-networks: a stacked deep LSTM network and a fully connected network sitting on top of the stacked LSTM network. The LSTM network is denoted by lstm in the architecture, and each linear layer in the fully connected network is denoted by fci, where i represents the ith layer. The lstm module consists of 10 LSTM layers stacked on top of each other, each with 64 hidden dimensions.
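A minimal PyTorch sketch of this architecture is below; the 10-layer, 64-hidden-unit LSTM and the 20 input features follow the text, while the widths of the fully connected layers are assumptions.

```python
import torch
import torch.nn as nn

class TravelTimeLSTM(nn.Module):
    """Stacked LSTM with a fully connected head regressing to a single
    travel-time value. The fc layer widths (fc1-fc3) are assumptions."""
    def __init__(self, n_features: int = 20, hidden: int = 64, lstm_layers: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=lstm_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 1)

    def forward(self, x):                 # x: (batch, 7 time steps, n_features)
        out, _ = self.lstm(x)
        last = out[:, -1, :]              # LSTM output at the final time step
        h = torch.relu(self.fc1(last))
        h = torch.relu(self.fc2(h))
        return self.fc3(h)

normal_model = TravelTimeLSTM(n_features=20)
```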

Incident Model

The architecture of the incident model mirrors that of the normal model: two sub-networks, a stacked deep LSTM network and a fully connected network sitting on top of the stacked LSTM network. The LSTM network is denoted by lstm, and each linear layer in the fully connected network is denoted by fci, where i represents the ith layer. The lstm module consists of 10 LSTM cells stacked on top of each other, each with 64 hidden dimensions. The only change from the normal model is that the number of input dimensions jumps from 20 to 40.
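Reusing the TravelTimeLSTM sketch above, the incident model would differ only in its input width:

```python
# Same sketch as above; the incident model only widens the input features.
incident_model = TravelTimeLSTM(n_features=40)
```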

Quantitative Results

For the quantitative results, the main metrics were the absolute difference between predicted and actual travel times (in seconds), the relative (percentage) difference, and the mean squared error (MSE).

With further analysis, it was concluded that the model delivered 95% of predictions within 15 seconds of the actual travel times for both corridors. The model appears to predict reasonably well for normal conditions. On I-495, performance near 34th Street was less accurate than on the remainder of the corridor, and on I-80, performance at the George Washington Bridge (upper and lower levels) was less accurate than on the rest of the corridor.

Similarly, using a different metric, 31% of the data was predicted within +/- 2% of the actual travel times, 64% within +/- 5%, and 83% within +/- 10% for both corridors.
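The accuracy figures quoted above can be reproduced with simple threshold checks; the sketch below assumes travel times in seconds and uses the thresholds mentioned in the text.

```python
import numpy as np

def prediction_accuracy(actual: np.ndarray, predicted: np.ndarray) -> dict:
    """Sketch of the accuracy metrics quoted above; thresholds follow the text."""
    diff = predicted - actual
    pct = np.abs(diff) / np.maximum(actual, 1e-9)        # relative error
    return {
        "mse": float(np.mean(diff ** 2)),
        "within_15s": float(np.mean(np.abs(diff) <= 15.0)),
        "within_2pct": float(np.mean(pct <= 0.02)),
        "within_5pct": float(np.mean(pct <= 0.05)),
        "within_10pct": float(np.mean(pct <= 0.10)),
    }
```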

For both roadways in this POC, only three months of incident data was available to begin calibrating the model. Since there are very few incidents, the data is sparse and its distributions are skewed, which makes comparisons difficult and a fair, reliable assessment impossible. This can be remedied by including more historical incident data in the model. Once additional data is consumed, further research is required to determine additional factors, and the weights of all the factors, to obtain reasonable results for incident conditions.

Qualitative results

A lot of exploratory analysis was done during the model training, data extraction, transformation, and feature engineering tasks. Sample visualizations are shown below:

Model comparison: A histogram, broken out by day, of the differences (actuals minus predictions) used to confirm which model performed better and yielded better results.

Days comparison: A histogram of the differences (actuals minus predictions) used to confirm which days of the week yielded better results. Day 1 (Monday) had the worst predictions of the week, while Thursday and Friday yielded the best results. The optimum scenario would be histograms for all days concentrated around the center, i.e., zero difference between actuals and predictions. The x-axis represents the difference in seconds between the predicted and actual travel times, while the y-axis is the number of test transactions meeting the criteria.

A scatter plot was used to review the predictions during weekdays and weekends for the two corridors, shown below. It might seem from these results that the I-80 corridor has more sporadic results and differences compared to the smoother results for I-495. However, the MSE of I-80 was much lower than that of I-495 because the number of I-80 test records is much higher than for I-495. The difference between the predicted and actual travel times on the y-axis is in seconds, the x-axis is the epoch code, and the blue dots represent weekend transactions while the orange dots represent weekday transactions.
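A hedged matplotlib sketch of these two visualizations, the error histogram and the weekday/weekend scatter, is below; the column names of the test-results frame are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_prediction_errors(df_test: pd.DataFrame) -> None:
    """Error histogram and weekday/weekend scatter for test results.
    Expects columns ['epoch_code', 'actual', 'predicted', 'is_weekend'] (assumed names)."""
    diff = df_test["actual"] - df_test["predicted"]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Histogram of errors; the ideal case is a tight spike around zero.
    ax1.hist(diff, bins=60)
    ax1.set_xlabel("actual - predicted (seconds)")
    ax1.set_ylabel("number of test records")

    # Errors over time, split into weekend and weekday points.
    for weekend, grp in df_test.groupby("is_weekend"):
        ax2.scatter(grp["epoch_code"], grp["actual"] - grp["predicted"],
                    s=4, label="weekend" if weekend else "weekday")
    ax2.set_xlabel("epoch code")
    ax2.set_ylabel("actual - predicted (seconds)")
    ax2.legend()
    plt.tight_layout()
    plt.show()
```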

Summary

The project proved to be successful, as the performance metrics were within the expected range. However, data science projects tend to be an evolution and a learning process, so some recommendations are:

a) Revisit the clean-up activities, outlier detection logic, and feature engineering activities within the data engineering domain

b) Conduct further data exploration and analysis, covering different time periods, TMCs, seasonality, locations, incidents, and pandemic impact

c) Perform more, and more rapid, experimentation with more epochs, more features, and different algorithms, such as ensemble techniques like XGBoost

d) Develop an unsupervised algorithm to predict traffic peak times, seasonality, and holiday impacts to feed (ingest) into the LSTM models

e) Review additional autoencoder design options

f) Develop better metrics for measuring performance improvement and benchmarking the various models

g) Operationalize the models by predicting various future time intervals, and experiment with multiple models for prediction

References

Rose Yu, Yaguang Li, Cyrus Shahabi, Ugur Demiryurek, and Yan Liu. Deep Learning: A Generic Approach for Extreme Condition Traffic Forecasting. Proceedings of the 2017 SIAM International Conference on Data Mining, 2017, pp. 777-785.

Harris, C.R., Millman, K.J., van der Walt, S.J., et al. Array programming with NumPy. Nature 585, 357-362 (2020). DOI: 10.1038/s41586-020-2649-2

Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.

Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation 9, 8 (November 15, 1997), 1735-1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735

X. Ma, Z. Tao, Y. Wang, H. Yu, and Y. Wang. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 54 (2015), pp. 187-197.

J. Van Lint, S. Hoogendoorn, and H. Van Zuylen. Freeway travel time prediction with state-space neural networks: modeling state-space dynamics with recurrent neural networks. Journal of the Transportation Research Board, (2002), pp. 30-39.
