Imputation of passenger numbers in Ruter - Part 1

Jostein-André Nordmoen
Jun 14, 2022


Public transport in Ruter’s region

Ruter AS is the Public Transport Authority for Oslo county and part of Viken county in Norway. The region has a population of 1.1 million people, with 246 million boardings in 2021. Each day there are about 18 500 journeys by bus, tram, metro, etc., with about 500 000 possible stop events where passengers might board or alight a vehicle.

Handling missing passenger numbers by using imputation on collected data

Ruter collects a lot of passenger sensor data, called Automatic Passenger Count (APC). Almost every vehicle has APC sensors covering every door, counting the number of passengers boarding and alighting. Even though the coverage of the counts is quite good, we still have problems with missing data.

By the time the collected data are available in a “reporting table”, 20 percent of the APC data are missing. To compensate for this “missingness” we use mean imputation.

Imputation is a technique for filling in or replacing the “missing numbers” with numbers we can trust. There are many imputation methods, from very simple to quite complex. We are testing more complex models, but so far we use mean imputation, mostly because it is robust and because, on aggregated data, the mean is an unbiased estimator, which is important for reports built on aggregated data.
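To make the idea concrete, here is a minimal sketch of mean imputation in plain Python. The function name `impute_mean` and the example numbers are made up for illustration and are not taken from Ruter's pipeline. The point to notice is that filling the gaps with the mean of the observed values leaves the mean of the series unchanged.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to base a mean on, so the gaps stay as they are.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical boardings at one stop for the same departure on five workdays.
boardings = [12, None, 9, 15, None]
imputed = impute_mean(boardings)

# The mean of the imputed series equals the mean of the observed values.
print(imputed, mean(imputed))
```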

Matching sensor data to the journeys

For the APC data to make sense, they must be matched to the correct journey and stop. The APC timestamp is essential for a good match. Here is the ruleset:

1. Best quality, called a direct match. The APC data for the vehicle are matched, using the APC timestamp, to the journey’s time interval between arrival at and departure from a stop.

2. For the residual APC data, which were not matched in step 1, we widen the journey’s time interval at the stop by +/- 30 min.

3. To handle APC data at the boundary between one journey and the next, boardings at the last stop are moved to the next journey, and alightings at the first stop are moved to the previous journey.

4. When there is a NULL value for APC on a stop and dwell time = 0, we impute a zero value for boarding and alighting.

5. Journeys that start 15 minutes before scheduled departure time get a status = fail, meaning all APC values for this journey are imputed.
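Rules 1 and 2 above can be sketched roughly as follows. This is an illustration only: the function `match_apc_to_stop`, the dictionary keys, and the tie-break of picking the nearest stop visit inside the widened window are assumptions, not a description of the actual pipeline.

```python
from datetime import datetime, timedelta


def match_apc_to_stop(apc_time, stop_visits, widen=timedelta(minutes=30)):
    """Return (stop_visit, match_type) for an APC timestamp, or (None, None).

    stop_visits is a list of dicts with the hypothetical keys
    'journey', 'stop', 'arrival' and 'departure' for one vehicle.
    """
    # Rule 1: direct match, the timestamp lies between arrival and departure.
    for visit in stop_visits:
        if visit["arrival"] <= apc_time <= visit["departure"]:
            return visit, "direct"

    # Rule 2: widen each interval by +/- 30 minutes for the residual data.
    candidates = [
        v for v in stop_visits
        if v["arrival"] - widen <= apc_time <= v["departure"] + widen
    ]
    if not candidates:
        return None, None

    def distance(v):
        # Time gap between the timestamp and the original interval.
        if apc_time < v["arrival"]:
            return v["arrival"] - apc_time
        return apc_time - v["departure"]

    # Assumption: when several visits qualify, take the nearest one.
    return min(candidates, key=distance), "widened"


visits = [
    {"journey": "J1", "stop": "A",
     "arrival": datetime(2022, 6, 14, 8, 0, 0),
     "departure": datetime(2022, 6, 14, 8, 0, 40)},
    {"journey": "J1", "stop": "B",
     "arrival": datetime(2022, 6, 14, 8, 5, 0),
     "departure": datetime(2022, 6, 14, 8, 5, 30)},
]
visit, match_type = match_apc_to_stop(datetime(2022, 6, 14, 8, 0, 20), visits)
print(visit["stop"], match_type)
```

Keeping the match type alongside the match makes it possible to separate direct matches from widened ones later, for example when judging data quality.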

From this, we get a quite good dataset for the next steps. Now that the matching is done, it’s time to look at the data cleansing before we create the passenger numbers to impute.

How to create the mean numbers or the numbers to impute?

From the matched raw data mentioned above, we do a cleansing job where data with good quality are kept and data with bad quality are set to NULL.

The rules for getting the data for computing means for imputation are:

1) All journeys without APC data get NULL on all possible stops.

2) Whether the journey is complete, meaning the vehicle has completed the route plus/minus one stop. Complete journeys get status=ok, incomplete journeys get status=fail.

3) If at least one count has been registered on every door during the day, status=ok, else status=fail.

4) If the deviance between boardings and alightings is less than 15 % for the whole journey, status=ok, else status=fail.

5) Journeys with fewer than 200 boardings/alightings on a stop get status=ok; more than 200 gives status=fail.

6) Journeys where ‘total sum boardings’ + ‘total sum alightings’ = 0 get status=fail.

7) Stops on complete journeys which have NULLs get the number zero imputed.
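The cleansing rules above could be combined along these lines. This is a hedged sketch: the function `clean_journey` and its arguments are hypothetical, the deviance is assumed to be |boardings - alightings| divided by the larger of the two, a journey with no APC data is given status fail for simplicity, and a count of exactly 200 on a stop is treated as ok, since the rules do not spell these details out.

```python
def clean_journey(stop_counts, stops_visited, all_doors_counted,
                  max_deviance=0.15, max_per_stop=200):
    """Return (status, cleaned_counts) for one journey.

    stop_counts holds one (boardings, alightings) tuple per planned stop,
    or None where no APC data were matched to that stop.
    """
    nulls = [None] * len(stop_counts)
    observed = [c for c in stop_counts if c is not None]

    # No APC data at all: NULL on all possible stops.
    if not observed:
        return "fail", nulls

    boardings = sum(b for b, _ in observed)
    alightings = sum(a for _, a in observed)

    # A journey with zero passengers in total is not trusted.
    if boardings + alightings == 0:
        return "fail", nulls

    # Complete means the route was driven plus/minus one stop.
    complete = abs(len(stop_counts) - stops_visited) <= 1
    # Assumed definition of the deviance between boardings and alightings.
    deviance = abs(boardings - alightings) / max(boardings, alightings)
    plausible = all(b <= max_per_stop and a <= max_per_stop
                    for b, a in observed)

    if not (complete and all_doors_counted
            and deviance < max_deviance and plausible):
        # Bad quality data are set to NULL and imputed later.
        return "fail", nulls

    # Remaining NULL stops on a good, complete journey become zero.
    return "ok", [c if c is not None else (0, 0) for c in stop_counts]


status, cleaned = clean_journey(
    [(10, 0), None, (5, 8), (0, 7)], stops_visited=4, all_doors_counted=True)
print(status, cleaned)
```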

From here the data is ready to be used for computing the means to be imputed.

Rules for calculating the mean for APC

With the data ready for calculating the means, we use the following quality indicator/granularity for the time interval:

APC = 1; APC counts used as given by the sensors, best quality

APC = 2; APC mean calculated based on values from the same departure time from the days before

APC = 3; APC mean calculated based on values from the same hour interval from the days before

APC = 4; APC mean calculated based on values from the whole day. Indicating quite incomplete data.

APC = 5; APC mean cannot be calculated, no sensor data available.

Only APC numbers with status=ok are used in the calculation. We also apply a date interval: the model uses data from the last 50 days to compute the mean values, while controlling for the type of day, such as workday, Saturday, Sunday, and special dates. So we only use data from the same type of day to calculate the mean values.
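The fallback from one quality level to the next can be sketched like this. The function `imputed_boardings`, the record layout, and the example values are assumptions for illustration. The history is assumed to contain only status=ok records for the same line and stop as the target.

```python
from datetime import date, time, timedelta
from statistics import mean


def imputed_boardings(target, history, window_days=50):
    """Return (value, quality) where quality is the APC indicator 1 to 5.

    target:  dict with 'date', 'day_type', 'departure' and 'observed'
             (None when the sensor value is missing).
    history: list of dicts with 'date', 'day_type', 'departure', 'boardings'.
    """
    # APC = 1: the sensor value is used as given.
    if target["observed"] is not None:
        return target["observed"], 1

    # Only the last 50 days and the same type of day are considered.
    start = target["date"] - timedelta(days=window_days)
    pool = [h for h in history
            if start <= h["date"] < target["date"]
            and h["day_type"] == target["day_type"]]

    levels = [
        (2, lambda h: h["departure"] == target["departure"]),            # same departure time
        (3, lambda h: h["departure"].hour == target["departure"].hour),  # same hour interval
        (4, lambda h: True),                                             # whole day
    ]
    for quality, keep in levels:
        values = [h["boardings"] for h in pool if keep(h)]
        if values:
            return mean(values), quality

    # APC = 5: no usable sensor data, so no mean can be calculated.
    return None, 5


history = [
    {"date": date(2022, 6, 1), "day_type": "workday", "departure": time(8, 15), "boardings": 10},
    {"date": date(2022, 6, 2), "day_type": "workday", "departure": time(8, 15), "boardings": 14},
    {"date": date(2022, 6, 3), "day_type": "workday", "departure": time(8, 45), "boardings": 20},
    {"date": date(2022, 6, 4), "day_type": "saturday", "departure": time(8, 15), "boardings": 99},
]
target = {"date": date(2022, 6, 7), "day_type": "workday",
          "departure": time(8, 15), "observed": None}
value, quality = imputed_boardings(target, history)
print(value, quality)
```

Returning the quality indicator together with the value lets a report show how much of an aggregate rests on real counts and how much on coarser means.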

All in all, we impute passenger numbers for about 20% of our journeys. The imputation percentage is gradually shrinking as we deploy APC on more vehicles, remove bugs in the pipeline, and gain more experience.

To sum up

We run this matching and imputation pipeline to fill in the missing APC numbers and get a “complete” dataset. It’s a resource-consuming process, and for some routes with a large share of imputed values, the variance is artificially reduced, which makes the data less valid for statistical testing. Still, the benefits outweigh the drawbacks. The best remedy is to improve the quality of the sensor data and the matching pipeline: the less missing data, the better.
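The variance effect is easy to see in a toy example. The numbers below are made up. Every imputed value sits exactly on the mean, so it adds nothing to the spread, and the variance of the filled series shrinks in proportion to the share of real observations.

```python
from statistics import mean, pvariance

# Hypothetical observed boardings on three days; three more days are missing.
observed = [4, 10, 7]
missing_days = 3

# Mean imputation fills every gap with the mean of the observed values.
filled = observed + [mean(observed)] * missing_days

var_observed = pvariance(observed)  # spread of the real counts
var_filled = pvariance(filled)      # spread after imputation

# With half of the values imputed, the variance is cut in half.
print(var_observed, var_filled)
```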

This article was mostly about creating the foundation for doing imputation on our data. I plan to publish two more articles about imputation demonstrating some simple techniques in part 2, and some more advanced techniques in part 3.
