Predicting Ridepooling Searches using POIs

Mariam M.
door2door Engineering
8 min read · Mar 14, 2019
Existing Ridepooling Searches in the operating area in Berlin, grouped by roads they’re routed through

A very common question I think a lot of B2B companies get nowadays (regardless of industry) from clients who want to start doing business with them is: how do I know this will succeed?

In the Mobility Intelligence team at door2door, we try to answer that question for our potential Ridepooling partners. We do so by analysing and visualising a number of indicators (like Public Transport Coverage for example), and simulating possible Ridepooling scenarios based on existing demand data to get KPIs that help clients design their service.

But what if we don’t have demand data for a given region? What then?

We could try and simulate with random demand (and we do), but how much does that tell us, really? We needed something more solid. Something better than random to give us more confidence in our simulations.

Existing Demand Data

What we define as “demand data” is actually movement data. Where people want to move from/to and when. One of our demand data sources is the Searches from our existing Ridepooling services. It’s nothing too complex. Every Search consists of:

  • location_from: where the person wants to depart from.
  • location_to: where the person wants to go to.
  • timestamp: when the search happened.
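Represented as a data structure, a raw Search could look roughly like this. This is only an illustrative sketch: the field names mirror the list above, but the exact types (e.g. the (lat, lon) tuples) are assumptions, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Search:
    location_from: tuple[float, float]  # (lat, lon) the rider wants to depart from
    location_to: tuple[float, float]    # (lat, lon) the rider wants to travel to
    timestamp: datetime                 # when the search was made
```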

For the purpose of Insights, we also add a layer of aggregation to this raw data, so we can analyse it and draw conclusions from it more easily. We divide every region we have in Insights into hexagon cells, and all the hours of the week into 1-hour bins. So the aggregated data we end up with is:

  • cell_from: which cell the trip originated from.
  • cell_to: which cell the trip ends up going to.
  • time_id: the 1-hour bin in the week when the trip happened.
  • weight: how many trips, on average, happen from/to the same cells in the same 1-hour bin.
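For illustration, here is roughly what that aggregation could look like in pandas, assuming every search has already been assigned to a hexagon cell and an hour-of-week bin. The column names and the averaging-over-days step are illustrative assumptions, not our exact pipeline.

```python
import pandas as pd

# One row per raw Search, already mapped to hexagon cells and hour-of-week bins.
searches = pd.DataFrame({
    "cell_from": ["a1", "a1", "b2"],
    "cell_to":   ["b2", "b2", "a1"],
    "time_id":   [8, 8, 17],          # hour-of-week bin (0..167)
    "date":      ["2019-01-07", "2019-01-14", "2019-01-07"],
})

# Count searches per connection and 1-hour bin for every observed day...
daily_counts = (
    searches.groupby(["cell_from", "cell_to", "time_id", "date"])
            .size()
            .rename("count")
            .reset_index()
)

# ...then average over the observed days to get the "weight" of each connection.
weights = (
    daily_counts.groupby(["cell_from", "cell_to", "time_id"])["count"]
                .mean()
                .rename("weight")
                .reset_index()
)
print(weights)
```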

What we want to predict

The idea is to predict how people want to move within a region, in general. But more specifically, we want to predict how they move between Insights cells (and at which hours/weekdays). Therefore, we want to predict, in a completely new region where we don’t have any Search data yet:

For every combination of cell_from, cell_to, and time_id: how many trips will happen on average (weight)?
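Expressed as an interface, the model we are after behaves roughly like the function below. This is purely a placeholder sketch of the inputs and output; the real value comes from the trained model described in the next sections.

```python
def predict_weight(cell_from: str, cell_to: str, time_id: int) -> float:
    """Predicted average number of searches for this connection and hour bin.

    Placeholder only: the actual prediction comes from the trained
    regression model described below.
    """
    raise NotImplementedError
```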

How will we make those predictions?

The process to get to a prediction consists of three main steps:

  1. Define what a cell has or is that contributes to the decision of taking a trip to or from it (“Features”). This includes how many points of interest it has, how many people live there, etc.
  2. Teach the algorithm we’re writing to translate those features to a number of searches (weight) at a given time (“Training”).
  3. Based on the previous two steps, predict what the searches will look like in a new region.

Features

What can a given cell have or be to contribute to the decision of taking a trip to or from it? We boiled that down, for now, to certain POIs (points of interest) and population density. The set of features we use right now are:

  • Population density: number of inhabitants per square km (not taking into account demographic data like age distribution, etc).
  • Sustenance: number of restaurants/bars/etc that exist in a given cell.
  • Residential buildings: number of buildings that are marked as “residential” in a given cell.
  • Shops: number of shops in a given cell.
  • Offices: number of offices in a given cell.
  • Entertainment: number of entertainment POIs (cinemas/theatres/etc) in a given cell.

We get all of this data (except population data) from OpenStreetMap.

We take those features into account for both the cell_to and the cell_from.
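Counting the POI features per cell could look roughly like this, assuming each OpenStreetMap element has already been assigned to a hexagon cell and mapped to one of the categories above (both of those steps are assumptions in this sketch).

```python
import pandas as pd

# One row per OpenStreetMap POI, already assigned to a hexagon cell.
pois = pd.DataFrame({
    "cell_id":  ["a1", "a1", "b2", "b2", "b2"],
    "category": ["sustenance", "shops", "shops", "offices", "entertainment"],
})

# One row per cell, one column per feature, values = number of POIs in that cell.
features = pd.crosstab(pois["cell_id"], pois["category"])

# Population density comes from a separate (non-OSM) source and would be
# joined in here, e.g. features = features.join(population_density).
print(features)
```

In the training table, these columns then appear twice per connection: once for the cell_from and once for the cell_to.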

Training

Now that we can define what a connection means in terms of features, we have to teach the algorithm how to translate those features into a number of searches. We do this by:

  • Preparing a training data set based on Searches from our Ridepooling operations in Berlin.
  • Feeding this data set to a model, which will then map the features of the cells in Berlin to the number of Ridepooling Searches we have there and come up with a way to translate features into searches.

Preparing the (training) data set

We add two new features to every connection:

  • target_value: the weight of Ridepooling Searches in that connection, based on real data.
  • time_category: the interval of the day in which those Ridepooling Searches happened.

Wait. “Time Category”?

For the purpose of this prediction, we thought a 1-hour bin might be too granular. We wanted broader bins to hold that info, which is where time_category comes in. We divided the day into bigger bins (like “morning”, “afternoon”, “evening”, and “night”). Therefore, we will also predict in those bigger bins instead, and assume that within every time_category, trips are distributed evenly between individual hours.
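A minimal sketch of that binning, going from an hour-of-week time_id to a time_category. The exact boundaries below are assumptions; the post only names the categories.

```python
def time_category(time_id: int) -> str:
    """Map an hour-of-week bin (0..167) to a coarser time category."""
    hour = time_id % 24  # hour of day
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 23:
        return "evening"
    return "night"

assert time_category(8) == "morning"
```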

Scaling the data

To compare features across different regions more accurately, we scale the data so it’s relative to its region: instead of storing an absolute number for each cell feature, we store how much of that feature the cell has relative to the other cells in the same region. For example:

  • Berlin has 3.7 million people living in it. Therefore if a cell A has 3.7 million people, it has 100% of the population (=1 on the scale from 0 to 1). If it has half, then it’s a 0.5 on a scale from 0 to 1, and so on.
  • Duisburg has 500K people living in it. Therefore if a cell B has 500K people, it has 100% of the population (=1 on the scale from 0 to 1).
  • With scaling, cells A (Berlin) and B (Duisburg) are equal. They both represent a very high density (100% density) of the people in that area.

This also applies to other features, like, say, shops. 20 shops in Berlin might be comparable to only 5 in Duisburg because of how the shops are distributed in a region. This gives us the power to analyse/predict based on relative features, because every region is different.

Note: when we first predict a value for the searches, we get a value from 0 to 1. We then un-scale this value by multiplying it by the maximum number of Searches we could have in a connection, to get back a real/absolute value.
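A small sketch of both directions, scaling and un-scaling. Dividing by the region-wide total (so a cell that contains the whole region’s population or shops becomes 1.0) is our reading of the example above; the exact normalisation is an assumption.

```python
import pandas as pd

def scale_to_region(cells: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Make every feature relative to its region (0..1)."""
    scaled = cells.copy()
    scaled[feature_cols] = cells[feature_cols] / cells[feature_cols].sum()
    return scaled

def unscale_prediction(predicted: float, max_searches_per_connection: float) -> float:
    """Turn a predicted 0..1 value back into an absolute number of searches."""
    return predicted * max_searches_per_connection
```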

Finding the right model

Every type of model has its own algorithm (= way of translating features into values). There are many models out there that could be potentially used to solve this type of problem: predicting a numeric target value based on a set of features.

Since the target value is numeric, and can be any number from 0 to infinity, we use a general technique called Regression. The other general technique would be Classification, but that only applies to data sets where the target value comes from a fixed set of categories rather than a continuous range. Under Regression, there are several models we could use: Linear, Polynomial, and Random Forest are just three examples.
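In scikit-learn, those three candidate families could be set up roughly like this. X_train and y_train would be the scaled connection features and target weights; the specific hyperparameters here are illustrative, not the ones we actually used.

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

candidates = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# for name, model in candidates.items():
#     model.fit(X_train, y_train)
#     ...evaluate on the testing regions with the metrics described below
```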

We also tried Deep Learning using a Neural Network with different configurations, but that did not yield results we were satisfied with.

But how do we measure results, anyway?

We used a combination of different metrics to compare the models to each other (and to get a general sense of how each one performs on our data). Those metrics are applied to testing regions (outside Berlin) where we predict searches and also have real data, so we can check how accurate the predictions are:

  • Root Mean Square Error:
    For every prediction we make, we compare predicted_searches to ridepooling_searches (the actual searches). We calculate the squared difference (“Square Error”) = (predicted - actual)². We then take the mean of all the Square Errors, and the square root of that mean, to get an average error over the entire dataset. If we get, for example, an RMSE of 0.2, this means that on average our predicted searches are +/- 0.2 away from the actual searches (see the sketch after this list).
  • Value Distribution:
    We plot the actual and predicted values on a chart, where every point on the x-axis is a connection, and every value on the y-axis is the number of Searches. We then visually compare the distribution of values between the connections (predicted vs. actual).
  • Spatial Distribution:
    We plot the actual and predicted values to/from a certain cell on a (heat) map and visually compare the distributions. This helps us see if we predict demand in places like city centers, for example, or rural areas correctly or not.
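The RMSE calculation referenced above, written out as a small sketch on (already un-scaled) predicted and actual values:

```python
import numpy as np

def rmse(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Root Mean Square Error: average prediction error over the dataset."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

print(rmse(np.array([0.8, 1.2, 0.1]), np.array([1.0, 1.0, 0.0])))  # ~0.17
```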

Results

Based on all the results combined, we decided to go with Random Forest Regression. We then generated more in-depth insights of the results of that model by predicting demand in two regions: Duisburg and Munich, for which we already have real data.

After training the chosen model, we analysed how important each feature was in determining the number of actual searches (Ridepooling Searches) in the training region (Berlin):

Importance of every feature as determined by the model
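For reference, feature importances like the ones in the chart above can be read off a fitted Random Forest in scikit-learn roughly like this. The training data below is a synthetic placeholder just to make the snippet self-contained; the real model is trained on the scaled Berlin connections.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ["population_density", "sustenance", "residential_buildings",
                 "shops", "offices", "entertainment"]

# Synthetic placeholder data (200 connections), NOT real Ridepooling Searches.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, len(feature_names))), columns=feature_names)
y = rng.random(200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# How much each feature contributed to the model's splits, highest first.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```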

Great. Now we have successfully predicted Searches. We can use this everywhere and make absolute decisions with it, right?

Err..

Limitations

Since we learn from a training data set, we inherit its limitations. And since our training data set comes from the Ridepooling operations in Berlin, our limitations are:

  • The operating period: we will not be able to predict searches outside that operating period because we simply don’t know how people behave the rest of the day(s).
  • The features in the operating area: if different combinations of features yield different results, we won’t know. The best we can do is compare the new region to Berlin’s operating area and try to predict based on that.
  • The service itself: pricing, advertising, etc all affect how many people use a service to move. If different services have different models that might affect who uses their service, we are not able to detect that difference or take it into account.

Existing Research and Studies

In the scientific community, what we refer to as connections (trips from one cell to another) is usually called Origin-Destination (O-D) matrices. There already exists quite a bit of research on how to predict O-D matrices using machine learning. We’ve reviewed some of this research to decide on how to solve our problem. Here’s a quick overview of some of what we reviewed:

Conclusion

In a parallel universe, this would be a lot easier: just find someone who can see into the future, train them to gather data efficiently, and hire them to come back with accurate predictions. We’re stuck with Maths for now. But, while it does have its limitations, especially with a limited dataset, it certainly gives us a pretty good insight (ha!) into one of the many possible futures.
