Location privacy issues in car sharing services

Mattia Dimauro
Sep 19, 2015

For three months I monitored a car sharing service operating in my city, collecting roughly 200,000 routes made with it. I started purely for the sake of data acquisition and data visualization: collecting such data would let me build one of those fancy visualizations that illustrate people's mobility on a map. I did try to achieve that, and here you can see an animated example I made with this data, a few lines of JavaScript and the Google Maps API.

This sample represents around 3 hours of traffic in July 2015

After gathering and analysing this data, I started to ask myself whether there was other information, and other possibilities, lying within these moving dots.

Shifts per day of the week

Of course there was other information lying there. You can easily grasp the daily usage of the service, the coverage of a certain area, the average distance covered in a shift and so on. This bar chart shows that the service is clearly used more during the weekend. But I wanted to take a deeper look, especially to see whether some privacy flaw could be found.

The easiest attack I could think of was a kind of virtual car chase: you follow a car remotely. If you see somebody getting into one of these cars, you just wait for it to reappear on the map to discover where they were going. This is simple and easy to do. Let’s see if we can push this further.
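To make the "wait for it to reappear" idea concrete, here is a minimal sketch. It assumes (this is not stated above) that the service exposes periodic snapshots of the currently available cars as a mapping from a car identifier to a position, and that a car is hidden from the map while it is rented. The function name and the data layout are hypothetical.

```python
def trips_from_snapshots(snapshots):
    """Rebuild trips from a time-ordered list of (timestamp, {car_id: (lat, lon)}).

    Assumption: a rented car disappears from the snapshot and reappears,
    once parked, at a different position.
    """
    last_seen = {}  # car_id -> (timestamp, position) of its latest sighting
    trips = []
    for timestamp, cars in snapshots:
        for car_id, position in cars.items():
            if car_id in last_seen:
                prev_time, prev_position = last_seen[car_id]
                if position != prev_position:
                    # The car moved between two sightings: record one trip.
                    trips.append({
                        "car": car_id,
                        "start": prev_position,
                        "end": position,
                        "left_after": prev_time,
                        "seen_again": timestamp,
                    })
            last_seen[car_id] = (timestamp, position)
    return trips
```

Each trip only gives a time window (last sighting before the rental, first sighting after it), so its precision depends on how often the snapshots are taken.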

I started to wonder whether, by performing some cluster analysis on these shifts, recurring route patterns would emerge, and whether this could be used to build a predictive model and expose privacy issues around drivers’ behaviours and habits.

So I tried to divide these routes into clusters, where the paths in each cluster share some similar properties. In this case, I chose to group them by their start and end positions. That is, I wanted to see whether many routes left from one point of the city and headed to one particular other location. The intuition was that a repetitive pattern could be one person moving from point A to point B, let’s say from home to the office. This turned out to be an intuitive but incorrect assumption.
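As an illustration of clustering routes by start and end position, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not the code used for this article: the naive initialization from the first k routes and the use of Euclidean distance on raw latitude/longitude are simplifying assumptions.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster equal-length numeric vectors into k groups (Lloyd's algorithm)."""
    centroids = [list(p) for p in points[:k]]  # naive init: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Each route is (start_lat, start_lon, end_lat, end_lon).
routes = [
    (45.490, 9.140, 45.470, 9.120),
    (45.440, 9.200, 45.500, 9.250),
    (45.491, 9.141, 45.471, 9.121),
    (45.441, 9.201, 45.501, 9.251),
]
labels, centroids = kmeans(routes, k=2)
```

A real run would use a better initialization (random restarts or k-means++), since the result of this algorithm depends on the starting centroids.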

So this is what the route clustering produced.

Here I analysed 2000 routes and tried to divide them into 30 clusters. Each color represents a different cluster.

There were mainly three problems here that I didn’t think of at first:

  1. Deciding the number k of clusters
  2. Dynamically generating k colors is not a trivial task, especially if k is big and you want them to be fairly distinguishable from each other!
  3. The incorrect assumption I mentioned before, which I’m going to illustrate in a bit.

The number of clusters denotes how many groups of similar objects you think a larger set of objects can be divided into. A rule of thumb for determining the number of clusters in a data set is:

$$k \approx \sqrt{n/2}$$

with n as the number of objects (data points).

This is why I tried to divide a set of 2000 routes into 30 groups.
But we have to remember that we were trying to make one person’s movements emerge from these shifts. On a typical day roughly 2000 shifts are made with the car sharing service. Dividing them into 30 groups, where each group would represent one person’s movements, would imply that this person makes around 66 shifts per day, which is unlikely.
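The arithmetic behind these two numbers can be checked quickly. The sketch below assumes the common rule of thumb k ≈ √(n/2); the function name is only illustrative.

```python
import math

def rule_of_thumb_k(n):
    """Rule-of-thumb number of clusters for n data points: sqrt(n / 2)."""
    return math.sqrt(n / 2)

n_routes = 2000
k_estimate = rule_of_thumb_k(n_routes)   # about 31.6, rounded down to 30 in the text
routes_per_cluster = n_routes / 30       # about 66.7 routes in each of the 30 groups
```

With one cluster per person, 30 clusters would mean each person makes roughly 66 trips a day, which is why a much larger k is needed for this purpose.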

So I started trying different numbers of clusters, making assumptions like “A typical user should use this twice a day!” to decide the next k to use.

Techie part. Feel free to skip it if you’re not keen.

I also decided to include time as a fifth feature, to cluster together routes that are similar not only in starting and ending location but also in time. This didn’t bring the desired effect; it only made each request take longer and longer to process, since the running time of k-means grows with both k and d (each iteration of the standard algorithm costs on the order of n·k·d operations, while finding an exact optimum is exponential in k and d).

Dimensionality is now up to 5, and the feature vector looks something like this: [45.49343, 9.146723, 45.470757, 9.122398, 19801]. It includes the latitude and longitude of both the start and end positions, and the number of seconds elapsed from midnight (of that day) to the time of the rental. I did take normalization into consideration before training. Please let me know if you think I missed something here.
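A minimal sketch of how such a five-feature vector could be built and normalized follows. Min-max scaling is an assumption here (the text does not say which normalization was considered), and the function names are hypothetical.

```python
from datetime import time

def to_feature(start, end, rented_at):
    """Build [start_lat, start_lon, end_lat, end_lon, seconds_since_midnight]."""
    seconds = rented_at.hour * 3600 + rented_at.minute * 60 + rented_at.second
    return [start[0], start[1], end[0], end[1], seconds]

def min_max_scale(vectors):
    """Rescale every column to [0, 1] so that no single feature dominates."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

feature = to_feature((45.49343, 9.146723), (45.470757, 9.122398), time(5, 30, 1))
# feature == [45.49343, 9.146723, 45.470757, 9.122398, 19801]
```

Without scaling, the time feature (tens of thousands of seconds) would swamp coordinate differences of a few hundredths of a degree in any Euclidean distance. Note also that seconds since midnight is circular (23:59 is close to 00:01), which this simple encoding does not capture.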

Example of one cluster found

Trying a greater number of clusters brought some appreciable improvements, but some funny patterns emerged. Here you can see how the algorithm clusters together routes that are indeed similar, as they shift north/south and east/west to a similar extent, but that clearly are still not genuinely separated.

However, another consideration came to mind: the starting point of a shift is not really useful or relevant. The true information is held in the arrival point. This may sound like a zen koan, so let me explain. Typically, when you reserve a car, the car is not located right where you are: you need to walk to reach it. When you arrive at your destination, instead, you tend to park right where you are going, if possible.

So to find out a more accurate travel path, that is, to find out precisely the movements of an individual from point A to point B, we should look for another kind of pattern.

Consider this example:
Suppose there’s a ninja named Janin, and he is hired to find and follow a Man, to observe his habits and learn from him. Janin has been told that Man lives in a certain area of the city and uses a car sharing service to get around.
He needs to discover more precisely where Man lives and where he usually goes. So he decides to watch that area and trace all the cars leaving and entering the Search Area.

In particular, Janin aims to find a pair of cars such that one car travels to some area and the other returns from there to the Search Area. He sees a car leaving from point A and heading to point B. So what does Janin know for now? He still doesn’t know where Man came from, but he does know where he was going. Janin then decides to look for all the cars coming from near point B and heading for the original search area. He finds a car going from A1 to B1.

So, looking at these two routes, A-B and A1-B1, Janin can infer that Man was actually going from B1 to B (here represented with the green line).
This is because, as we said, the arrival point tells us more about the intention of the driver.
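Janin's reasoning can be sketched in a few lines of code. Everything below is an illustrative assumption: routes are dictionaries with "start" and "end" coordinates, distances use the haversine formula, and the 300 m "near B" threshold is arbitrary.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def infer_trips(routes, area_center, area_radius_m, near_m=300):
    """Pair each route leaving the search area (A -> B) with routes that start
    near B and come back into the area (A1 -> B1); report B1 -> B."""
    outbound = [r for r in routes
                if haversine_m(r["start"], area_center) <= area_radius_m]
    inbound = [r for r in routes
               if haversine_m(r["end"], area_center) <= area_radius_m]
    inferred = []
    for out in outbound:
        for back in inbound:
            if back is not out and haversine_m(back["start"], out["end"]) <= near_m:
                # Arrival points carry the intent: home is B1, destination is B.
                inferred.append((back["end"], out["end"]))
    return inferred
```

The function ignores the pick-up points A and A1 entirely and keeps only the two arrival points, which is exactly the asymmetry described above.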

Let’s check this scenario with some real data. For this purpose I set up an API that, given a specific target location, returns a list of all the cars coming to and going from that point (just like Janin the ninja was able to do!). You can try it here: just click a point on the map to see all the cars leaving and arriving in that area. (This tool needs to be refined.)
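The core of such a lookup could look like the sketch below. It is not the actual API behind the demo: the function name, the route layout and the default 500 m radius are assumptions made for illustration.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def routes_around(routes, target, radius_m=500):
    """Split the routes touching a target location into leaving and arriving."""
    leaving = [r for r in routes if haversine_m(r["start"], target) <= radius_m]
    arriving = [r for r in routes if haversine_m(r["end"], target) <= radius_m]
    return {"leaving": leaving, "arriving": arriving}
```

A linear scan like this is fine for a few thousand routes; for the full data set a spatial index would probably be needed to keep each click responsive.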

Shown here are all the cars leaving a search area, in red, and all the cars arriving in that area, in blue. In the zoomed proximity area, you can see there are many routes made from the (vaster) search area to this particular spot. I think this is a strong indication that somebody often goes from that area to this specific point.

In fact, I’ve cheated. This is me going from my parents’ house to my place over the last months.

This pattern emerges all the time:
cars going from one area to the exact same street (here in red), and cars around that street going back to the original area (blue).

Now you may be thinking: “Hey! You were talking about privacy here, and now you’re showing us where you live?!”

What I want to say is that with this system alone you will not be able to correlate a specific person using these services with locations. At most you can get aggregate data on people’s mobility. But if you add some extra prior information... If you know where somebody lives, with this system it becomes easier to guess where they are going. It is not shown here, but consider that you can have other information about the trip, for example the day and the hour when it happened. This can lead to a potential privacy leak.

Unfortunately, not much can be done to avoid this. Car sharing services may try to hide access to their cars’ positions in more ingenious ways. Instead of sharing a plain list of car positions in a JSON file, they could try to encrypt it in some way. Still, at some point they need to display the cars on a map so that their users can benefit from the service. This would just slow the process down a little.

I’ve done this with the Enjoy car sharing service, but it is easily replicable on other services as well (such as Car2Go, DriveNow and so on) as long as they provide a way to access their cars’ positions, which is a main feature of all of these services.

That said, all of this data can’t only be used in a harmful way; there’s also a positive side to it. Let’s now look at some of these nice features.

Possible useful features

Apart from this privacy issue, I think there are also some useful features that can be built from this data. They could be implemented by the company providing the service or by some third-party application, if an API is provided.

One possibility is to build a system that can predict the arrival of a car in a certain location. The scenario is: you want to book a car near your position, but there’s none. You then have to decide whether to start considering other ways of moving or to wait for a car to appear near you.
I’m a user of this service myself, and this scenario occurs all the time.

So sometimes I decide to wait and I’m rewarded with the appearance of a car near my position. Other times, though, I wait in vain for 10–20 minutes before starting to walk / use public transportation / go for an Uber, resulting in a fair waste of time.

It would be useful, then, if the service could tell you how likely it is that a car is about to pop up near your current location. This can be inferred from previously collected data. If it is known that a number of cars are usually found in a certain area at some point in time, that is an indication that a car may appear soon. Or not.
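A very rough version of that estimate is an empirical frequency: on what fraction of the observed days did at least one car arrive in the area during a given hour? The sketch below assumes historical arrivals have already been filtered to the area of interest and reduced to (day index, hour) pairs; both the layout and the function name are hypothetical.

```python
def arrival_probability(arrivals, days_observed, hour):
    """Fraction of observed days with at least one arrival during `hour`.

    arrivals: iterable of (day_index, hour_of_day) pairs for one area.
    """
    days_with_arrival = {day for day, h in arrivals if h == hour}
    return len(days_with_arrival) / days_observed

history = [(0, 8), (0, 8), (1, 8), (2, 17), (3, 8)]
chance_at_8 = arrival_probability(history, days_observed=4, hour=8)
```

This ignores weekday effects, weather and the cars currently on the move, so it is only a baseline, but it already turns "wait or walk?" into a number.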

Also, it would be nice to be able to automatically book a car when one is about to be parked near your location, and to be notified that your car is there. Hopefully we’ll see this happening soon. Seeing it before self-driving cars become a daily reality would be much appreciated.
