On Time Every Time: Predicting Bus Arrival with Machine Learning

Joey Lim
9 min read · Jan 17, 2024

--

Narrative Highlights

This project develops a smart bus arrival prediction model in Python to correct inaccurately estimated times of arrival (ETAs) for buses in Malaysia’s public transport system, using state-of-the-art algorithms and real-time data. The study concentrates on algorithmic design, smooth integration with Malaysia’s public transport systems, and an assessment of the model’s effects on sustainability and urban transportation dynamics. Regulatory frameworks, non-bus transit options, and conventional infrastructure projects are all excluded. The project aims to fully understand and address the issues preventing bus arrival time estimates from being as accurate and reliable as possible, with a focus on Malaysia and potential global relevance.

Data Description

In this project, we leverage a confidential dataset obtained from a public transportation systems and services company to develop a machine learning model for predicting bus estimated times of arrival (ETAs). Due to the sensitive nature of the operational data involved, we are committed to maintaining the utmost confidentiality and ensuring the privacy of individuals. The dataset comprises details related to bus stops, bus registration numbers, driver IDs, operation dates, operation days, hour categories (categorised as either peak or non-peak hours), and the average time taken for each operation.

We would like to express our gratitude to the dataset provider for their collaboration and trust in sharing this valuable data for research purposes. Throughout this article, the insights gained from the machine learning model and the methodology employed are introduced, without disclosing specific details that could compromise the confidentiality of the dataset.

Data Preprocessing

Jupyter Notebook and Python are employed throughout the execution of this project. In addition to the standard data preprocessing steps, such as checking for null values, duplicates, and outliers, a distinctive aspect of our project lies in the creation of a new feature. We achieved this by calculating the difference between Arrival time and Departure time. This innovation arose from our observation that when the disparity between these two columns is less than one minute, the Average Time Taken is recorded as 0. To enhance accuracy, we opted to introduce a new computed column for this calculation:

Calculation for new feature

For instance, consider the example: “2022–01–31 18:20:59–2022–01–31 18:20:19.” The calculated `total_time_taken_per_trip` for this instance would be 0.67 minutes, capturing the minute-level granularity of the temporal duration. This newly created feature adds valuable information to the dataset, contributing to the model’s ability to capture temporal patterns and relationships, and serves as the target variable of this project.
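A minimal sketch of how such a feature might be derived with pandas. The real dataset is confidential, so the rows and column names here (`departure_time`, `arrival_time`) are illustrative stand-ins, not the actual schema:

```python
import pandas as pd

# Illustrative rows; the real data and column names are confidential.
df = pd.DataFrame({
    "departure_time": ["2022-01-31 18:20:19", "2022-01-31 07:00:00"],
    "arrival_time":   ["2022-01-31 18:20:59", "2022-01-31 07:12:30"],
})

# Parse timestamps, then express the trip duration in minutes.
df["departure_time"] = pd.to_datetime(df["departure_time"])
df["arrival_time"] = pd.to_datetime(df["arrival_time"])
df["total_time_taken_per_trip"] = (
    (df["arrival_time"] - df["departure_time"]).dt.total_seconds() / 60
).round(2)

print(df["total_time_taken_per_trip"].tolist())  # [0.67, 12.5]
```

The first row reproduces the 40-second example from the text, which rounds to 0.67 minutes rather than collapsing to 0.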

The examination of the results reveals that the dataset exhibits a clean and well-structured nature, as evidenced by the absence of both null values and duplicated data. This absence of missing values and duplicates is pivotal in ensuring the reliability and integrity of the dataset for subsequent analytical processes.

Since our project focuses on specific buses, we first identify the bus_registration_no with the highest number of rows. Selecting the bus with the most rows ensures a substantial amount of data for that particular bus, which contributes to more robust and reliable model training by allowing the algorithm to learn patterns and relationships within that specific context:

Check for the bus_registration_no with the most data
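In pandas this check is a one-liner; the registration numbers below are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for the confidential dataset.
df = pd.DataFrame({
    "bus_registration_no": ["ABC123", "XYZ789", "ABC123", "ABC123", "XYZ789"],
})

# value_counts() sorts descending, so idxmax() gives the busiest bus.
counts = df["bus_registration_no"].value_counts()
top_bus = counts.idxmax()
df_top = df[df["bus_registration_no"] == top_bus]

print(top_bus, len(df_top))  # ABC123 3
```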

Following this, our attention shifts to the bus with the highest number of records. As this bus appears unrelated to the bus_registration_no that exhibited extreme values upon examination, we can proceed without further consideration of that particular registration number.

To simplify our dataset and make modelling more efficient, we next select which features to keep and which to remove. Some variables have already been encoded, so no extra encoding steps are needed; we can drop the redundant originals and keep their encoded versions. In addition, we identify ‘Route_ID’ as a constant variable and opt to exclude it from our feature set, since its uniformity renders it redundant for predictive modelling purposes.

Following the feature selection, our next step involves assessing the correlation among the remaining variables in the dataset. To accomplish this, we utilise the corr() function to compute the correlation coefficients between pairs of features. Subsequently, a heatmap is generated to visually represent the correlation matrix.
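A sketch of this step on synthetic data (the column names mirror the dataset description but are assumptions). The heatmap call is left as a comment since rendering depends on the plotting backend:

```python
import numpy as np
import pandas as pd

# Synthetic numeric features resembling the described columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "hour_category": rng.integers(0, 2, 100),   # 0 = non-peak, 1 = peak
    "operation_day": rng.integers(0, 7, 100),
})
df["total_time_taken_per_trip"] = 5 + 3 * df["hour_category"] + rng.normal(0, 0.5, 100)

# Pairwise Pearson correlation coefficients between the features.
corr = df.corr()
print(corr.shape)

# The heatmap is a colour-coded view of this matrix, e.g.:
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap="coolwarm")
```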

Data Visualisation

Average Time Taken during Peak Hours (Weekdays vs Weekends)

The box plot above shows the average time taken during peak hours for both weekdays and weekends. From the boxplot, we can observe that the average time taken during peak hours on weekdays is generally higher than on weekends, with a wider range of times and more variability as indicated by the spread of the data points. The weekdays show a higher median time as well as several outliers, suggesting that there are often delays that are much longer than the typical times.
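The statistic behind the box plot's centre line is the per-group median; a sketch on made-up peak-hour trips (the numbers are illustrative, not the dataset's):

```python
import pandas as pd

# Synthetic peak-hour trips; real figures come from the confidential data.
peak = pd.DataFrame({
    "day_category": ["Weekday"] * 4 + ["Weekend"] * 4,
    "total_time_taken_per_trip": [12, 15, 18, 40, 8, 9, 10, 11],
})

# Median per group, the box plot's centre line; 40 would appear as an outlier.
medians = peak.groupby("day_category")["total_time_taken_per_trip"].median()
print(medians.to_dict())  # {'Weekday': 16.5, 'Weekend': 9.5}

# The plot itself:
# peak.boxplot(column="total_time_taken_per_trip", by="day_category")
```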

Distribution of Trips by Hour Category

The pie chart above displays the distribution of trips by hour category, divided into peak and non-peak hours. It shows a lower percentage of trips during peak hours compared to non-peak hours. A lower share of trips during peak hours suggests less demand during this period, when commuters may avoid crowded buses or switch to alternative modes of transportation. In contrast, a higher percentage during off-peak hours indicates more demand during times with less traffic and more flexible travel schedules. This could be due to a variety of factors, such as flexible work schedules or lower traffic during off-peak hours.
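The slices of such a chart are just normalised category counts; a sketch with an assumed 30/70 split rather than the article's actual figures:

```python
import pandas as pd

# Assumed split for illustration only.
df = pd.DataFrame({"hour_category": ["Peak"] * 3 + ["Non-Peak"] * 7})

# Share of trips per hour category, i.e. the pie slices.
shares = df["hour_category"].value_counts(normalize=True) * 100
print(shares.round(1).to_dict())  # {'Non-Peak': 70.0, 'Peak': 30.0}

# Rendering: shares.plot.pie(autopct="%1.1f%%")
```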

Average Time Taken by Day and Hour Categories

The heat map combines two factors, day (weekday/weekend) and hour (non-peak/peak), to show average trip times. The greater average trip times on weekdays suggest that passengers take longer trips to and from their places of employment or education. This could be caused by factors like heavy traffic during rush hours or longer routes through densely populated areas. By contrast, the reduced average travel durations on weekends imply that these days usually see speedier travel, often because fewer people travel for work or education, which reduces traffic and congestion.
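The grid behind such a heat map is a pivot table of mean trip times; a sketch with invented values:

```python
import pandas as pd

# Invented per-cell examples; the real averages are in the figure above.
trips = pd.DataFrame({
    "day_category":  ["Weekday", "Weekday", "Weekend", "Weekend"],
    "hour_category": ["Peak", "Non-Peak", "Peak", "Non-Peak"],
    "total_time_taken_per_trip": [20.0, 12.0, 11.0, 9.0],
})

# Mean trip time per (day, hour) cell — the values the heat map colours.
grid = trips.pivot_table(
    index="day_category",
    columns="hour_category",
    values="total_time_taken_per_trip",
    aggfunc="mean",
)
print(grid)

# Rendering: import seaborn as sns; sns.heatmap(grid, annot=True)
```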

Model Development

Given that our target variable of interest is “total_time_taken_per_trip,” the nature of our analysis leans towards regression modelling. Regression models are particularly well-suited for predicting a continuous numerical outcome, making them an appropriate choice for estimating the average time taken in our scenario. The objective of regression modelling is to establish a relationship between the selected predictor variables and the target variable, allowing us to make predictions about the average time taken based on the observed patterns in the dataset.

In the process of model development, the Holdout method was employed to partition the dataset into distinct training and testing sets. This partitioning involved separating the features (‘X’) and the target variable (‘y’), allocating 70% of the data for training the model and reserving the remaining 30% for evaluating its performance on unseen data.
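The Holdout split described above maps directly onto scikit-learn's `train_test_split`; the feature matrix here is a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # stand-in feature matrix
y = np.arange(10)                 # stand-in target

# 70/30 holdout split; a fixed seed keeps the partition reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```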

Besides, the dataset’s features underwent standardisation using the scikit-learn StandardScaler. This process involved centering the data around its mean and scaling it to have a standard deviation of 1. The training set was used to compute scaling parameters, which were then applied to the testing set for consistent feature representation. Standardisation is beneficial for models sensitive to feature scales, enhancing their performance.
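The key detail is that the scaler is fitted on the training set only and then reused on the test set; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5]])

# Compute mean and standard deviation from the training set only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply the same parameters to the test set to avoid leakage.
X_test_scaled = scaler.transform(X_test)

# Training features now have mean ~0 and standard deviation 1.
print(round(X_train_scaled.mean(), 6), round(X_train_scaled.std(), 6))
```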

In our analysis, we strategically opted for a diverse ensemble of regression models, including Random Forest, Decision Tree, XGBoost, AdaBoost, and Artificial Neural Network (ANN). These models collectively offer a broad spectrum of advantages, ranging from capturing non-linear relationships and interactions with Random Forest, Decision Tree, and XGBoost, to boosting the performance of weaker models with AdaBoost, and leveraging deep learning capabilities with ANN. This comprehensive approach allows us to uncover intricate patterns and complexities within our dataset, ensuring a thorough exploration of various modelling techniques and enhancing our ability to make accurate predictions.
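Because all of these regressors share the same fit/predict interface, they can be trained in one loop. The sketch below uses scikit-learn's built-in models on synthetic data; `xgboost.XGBRegressor` (a separate package) exposes the same interface and would slot into the same dictionary:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the scaled training set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 200)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "ANN": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=42),
}

# Fit each model on the same data and collect its predictions.
preds = {}
for name, model in models.items():
    model.fit(X, y)
    preds[name] = model.predict(X)

print(sorted(preds))
```

The hyperparameters shown are scikit-learn defaults (plus a small ANN layer), not the tuned settings used in the project.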

Model Evaluation

To assess how well our models perform, we use several metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²). MSE and MAE measure how much, on average, our predictions differ from the actual values. RMSE expresses the overall size of those differences in the same units as the target. R² shows the proportion of the variation in the average time taken that the models can explain. The performance metrics for the different regression models provide insights into their effectiveness in predicting the target variable:

The Random Forest model stands out with the lowest MSE (19.3037), indicating minimal average squared differences between predicted and actual values. Its superior MAE (1.6902) and RMSE (4.3936) further emphasise its accuracy and precision. The positive R² value (0.9210) signifies that the Random Forest model explains a notable proportion (92.1%) of the variance in the target variable, highlighting its effectiveness in capturing underlying patterns.
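All four metrics come straight from `sklearn.metrics` (RMSE being the square root of MSE); a sketch on invented predictions, not the project's actual outputs:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted trip times, in minutes.
y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 11.0, 9.0, 14.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
rmse = np.sqrt(mse)                        # back in the target's units
r2 = r2_score(y_true, y_pred)              # variance explained

print(mse, mae, round(rmse, 3), round(r2, 3))  # 0.75 0.75 0.866 0.857
```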

Limitation

First off, the accuracy of the prediction model depends on the completeness and quality of the available data; missing or inaccurate data may hamper its performance. The model may also not fully account for external factors that impact bus arrival timings, such as unanticipated traffic accidents, weather, or road closures. Due to their complexity, ensemble models such as Random Forest can be difficult for stakeholders to understand and analyse. Additionally, the deployment and scalability of real-time data gathering and processing systems may be impacted by resource limitations, covering both financial investment and technological infrastructure. Recognising and dealing with these constraints will be essential for the effective implementation and long-term viability of the suggested solutions.

Conclusion

In summary, this project has made significant strides in developing an innovative smart bus arrival prediction model with the potential to revolutionise Malaysia’s public transportation system. The identification of ETA errors underscored the imperative for enhancements, and the meticulous data preparation, utilisation of diverse regression models, and subsequent evaluation showcased the effectiveness of leveraging machine learning for improved bus arrival forecasts. Notably, the Random Forest model exhibited a high performance level, illustrating the capability of sophisticated algorithms to capture intricate patterns in the dataset. However, for sustained success, addressing issues like model complexity and data quality assurance is crucial. Our recommendations include continuous investments in data quality assurance, exploring partnerships with local authorities, ongoing model development, and enhancing public accessibility through user-friendly mobile applications or interfaces for precise bus arrival forecasts.


About the Authors:

Lim Joey, Mitraa Kolanthai, Dharumashan Bathiban and Melissa Faqihah

A cohort of students enrolled in the Bachelor of Applied Science in Data Analytics with Honours program at Universiti Malaysia Pahang Al-Sultan Abdullah (UMPSA), a prominent public technical university situated in Pahang, Malaysia.
