Train Line Reliability

Dillon Conroy
Aug 30, 2021


Public transportation is stressful: planning routes around the timetables of independently moving vehicles can make for long and tedious trips. I set out with the goal of helping to cut down this catastrophic inconvenience by predicting which specific train lines are particularly unreliable. The data set I used is the Toronto Subway Delay Data set on Kaggle, which covers January 2014 to June 2021.

Here are the first five observations of the raw data:

Setting up the data

Luckily, this data set arrives fairly clean, so very little preparation was needed. I felt the ‘Day’ column was a little vague, so I renamed it to ‘Day of Week’ and set the ‘Date’ column as the index.
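In code, that setup might look roughly like the sketch below; the file name and the original ‘Date’/‘Day’ column names are assumptions based on the Kaggle listing.

```python
import pandas as pd

# A minimal sketch of the setup step; the file name and the raw
# 'Date' / 'Day' column names are assumptions from the Kaggle listing.
df = pd.read_csv("toronto_subway_delays.csv", parse_dates=["Date"])

# Rename the vague 'Day' column and index the frame by date.
df = df.rename(columns={"Day": "Day of Week"})
df = df.set_index("Date")

print(df.head())
```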

After this the data set looked like:

Now the fun stuff begins

It was around this time that I really settled into the idea of predicting reliability, because it was time to choose my target: the line the observed train belonged to. I chose this because riders aren’t particularly interested in why a train is stopped, or how long previous trains have been delayed, but rather in which line is more likely to have a delay.
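A minimal sketch of that choice, continuing from the frame above and assuming the target column is named ‘Line’ in the raw data:

```python
from sklearn.preprocessing import LabelEncoder

# Drop any records that are missing a line label.
df = df.dropna(subset=["Line"])

# The target is simply which line each delayed train belongs to.
# 'Line' as the column name is an assumption about the raw schema.
le = LabelEncoder()
y = le.fit_transform(df["Line"])

# Everything else becomes a feature; casting to strings is a
# simplification so any missing values just become their own category.
X = df.drop(columns=["Line"]).astype(str)
```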

From here I found that the line most frequently logging delay codes was Yonge-University (YU), at approximately 46.9% of all observations, so always guessing YU is slightly worse than flipping a coin to decide whether YU is the line throwing a delay code in Toronto. That 46.9% is the baseline any model has to beat.
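That frequency can be pulled straight from the frame, continuing from the sketch above:

```python
# Share of delay records per line; the most frequent (YU) is the
# naive majority-class baseline.
line_freq = df["Line"].value_counts(normalize=True)
print(line_freq.head())
print(f"Majority-class baseline: {line_freq.max():.1%}")
```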

Building Models

I decided to try three different models: random forest, gradient boosting, and XGBoost. I chose these because I wanted the generalization that random forests bring to the table, but I also wanted boosting methods that could potentially correct the errors a random forest makes on its predictions.
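Here is a rough sketch of how the three candidates can be set up and compared, continuing from X, y, and le above. The default hyperparameters and the simple ordinal encoding are assumptions for illustration, not necessarily the exact settings in my notebook.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBClassifier

# Hold out a validation set, then fit each candidate behind the same
# simple ordinal encoding of the categorical features.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
}

fitted = {}
for name, model in models.items():
    pipe = make_pipeline(
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        model,
    )
    pipe.fit(X_train, y_train)
    fitted[name] = pipe
    print(f"{name}: {pipe.score(X_val, y_val):.3f} validation accuracy")
```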

The model that performed the best for me was XGBoost, although all three were very accurate; the worst, gradient boosting, still reached 97.4% accuracy on my validation data.

The following confusion matrix shows where the XGBoost model’s predictions land on the validation data:
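A plot along those lines can be generated from the XGBoost pipeline fitted in the sketch above, mapping the encoded labels back to readable line names:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix for the fitted XGBoost pipeline on the validation
# split, with the encoded labels mapped back to the line names.
ConfusionMatrixDisplay.from_estimator(
    fitted["XGBoost"],
    X_val,
    y_val,
    labels=np.arange(len(le.classes_)),
    display_labels=le.classes_,
    xticks_rotation=45,
)
plt.tight_layout()
plt.show()
```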

Drawing Conclusions

When my XGBoost model is wrong, it tends to confuse YU and Bloor-Danforth (BD) with one another, and it mispredicts YU more often than any other line. This is likely because the vast majority of observations belong to one of these two lines, which can cause issues with new observations that belong to neither, as shown by the number of incorrect predictions in the YU column that actually belonged to other lines. Maybe this is something I’ll address as I continue to learn.

The code notebook can be found on GitHub at this link.
