Aggregating trip data using k-anonymization

Morgan Herlocker
SharedStreets
Nov 6, 2019

Trip data is any data that connects the origin and destination of a person’s travel. It is generated in countless ways as we move through our day and interact with systems connected to the internet. For example, your phone’s GPS sensor generates trip data as you navigate with turn-by-turn directions, and a taxi or ride-share service generates it when it records your ride to calculate your receipt.

This data is generated in vast quantities every day, and organizations that collect it should handle it with caution, much as they would handle credit card data or social security numbers. This post walks through concrete strategies for handling trip data securely to maintain individual privacy while preserving the analytic power that makes it useful. Cities need data to manage their streets, and k-anonymization can keep individual information safe while maintaining high-quality data.

Why is trip data sensitive?

The trips that you take are unique to you. Researchers have found that it takes 12 points on a fingerprint to identify an individual, while it takes only 4 location points to uniquely identify 95% of the population.

Our movements through the world paint a picture of our behavior that reveals more information than we would want to be public. The doctors you visit reveal aspects of your medical history, the houses you visit reveal the nature of your relationships, and the recreational activities you attend paint a picture of your lifestyle choices. Riders of transit and mobility services have a reasonable expectation that this data will be kept private when it is collected for fares and billing.

In the following maps, we show how “raw” trip data can be queried to narrow down the bulk of data, revealing highly targeted connections. By narrowing the dataset by time and space, we can easily isolate many categories of sensitive behavior. For example, using the open NYC taxi dataset, it is possible to automatically identify sensitive trips to medical facilities using basic spatial queries. Researchers have previously used this dataset to highlight privacy issues by tracking celebrities, identifying Muslim drivers, and linking confidential trips to other religious populations.
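To make the idea of a "basic spatial query" concrete, here is a minimal sketch in Python. The trip records, field names, and bounding box are all invented for illustration and are not taken from the NYC taxi dataset; the point is only that a simple filter on place and time can isolate a single trip.

```python
from datetime import datetime

# Hypothetical raw trips (invented data): each has a drop-off time and (lon, lat).
trips = [
    {"id": 1, "dropoff_time": datetime(2019, 1, 7, 9, 5), "dropoff": (-73.9540, 40.7900)},
    {"id": 2, "dropoff_time": datetime(2019, 1, 7, 14, 30), "dropoff": (-73.9541, 40.7901)},
    {"id": 3, "dropoff_time": datetime(2019, 1, 7, 9, 10), "dropoff": (-73.9900, 40.7300)},
]

# Hypothetical bounding box around one building: (min_lon, min_lat, max_lon, max_lat).
FACILITY_BBOX = (-73.9545, 40.7895, -73.9535, 40.7905)


def in_bbox(point, bbox):
    """Return True if a (lon, lat) point falls inside the bounding box."""
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def trips_ending_in(trips, bbox, start, end):
    """Narrow the dataset by space (bbox) and time (start <= t < end)."""
    return [
        t for t in trips
        if in_bbox(t["dropoff"], bbox) and start <= t["dropoff_time"] < end
    ]


matches = trips_ending_in(
    trips, FACILITY_BBOX, datetime(2019, 1, 7, 9, 0), datetime(2019, 1, 7, 10, 0)
)
print([t["id"] for t in matches])
```

With raw records, nothing prevents the result from being a single identifiable trip, which is the risk that aggregation is meant to address.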

How aggregation helps protect privacy

To protect the safety of individuals, data that contains personally identifying information should be grouped whenever possible. The process of grouping data and generating statistical summaries for each group is called “aggregation”.

In the case of trip data, trips are typically grouped together when they took place around the same time and have a similar origin and destination. Trips can be grouped spatially by zip code, census block, or any other geographic polygon. Low-resolution datasets, such as trips grouped by month and by census tract, present very low privacy risk to users, but they may provide little analytic value. A more granular grouping, such as trips grouped by hour and by census block, provides significantly less protection of user data.
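As a rough sketch of this kind of grouping, the following Python counts trips per origin zone, destination zone, and time bucket. The zone labels, the sample trips, and the `time_bucket` and `aggregate` helpers are hypothetical names chosen for illustration, and the sketch assumes each trip has already been assigned to a zone.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trips (invented data): (origin_zone, destination_zone, start_time).
trips = [
    ("tract_A", "tract_B", datetime(2019, 11, 6, 8, 15)),
    ("tract_A", "tract_B", datetime(2019, 11, 6, 8, 40)),
    ("tract_A", "tract_B", datetime(2019, 11, 6, 17, 5)),
    ("tract_C", "tract_A", datetime(2019, 11, 7, 8, 20)),
]


def time_bucket(ts, resolution):
    """Truncate a timestamp to the chosen temporal resolution."""
    if resolution == "hour":
        return ts.strftime("%Y-%m-%d %H:00")
    if resolution == "day":
        return ts.strftime("%Y-%m-%d")
    if resolution == "month":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unknown resolution: {resolution}")


def aggregate(trips, resolution):
    """Count trips per (origin, destination, time bucket) group."""
    return Counter(
        (origin, dest, time_bucket(ts, resolution)) for origin, dest, ts in trips
    )


by_hour = aggregate(trips, "hour")    # fine-grained: many small groups
by_month = aggregate(trips, "month")  # coarse: fewer, larger groups
print(by_hour)
print(by_month)
```

Coarser buckets merge more trips into each group, which is why the monthly summary carries less privacy risk than the hourly one.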

High-resolution data reveals sensitive trip information, but low-resolution data lacks the analytic power we need to understand travel through our cities. Fortunately, there is a solution that offers the best of both options, providing privacy protection at high resolution.

Preserving privacy at high resolution

K-anonymization is a technique used to find a balance between data utility and user privacy. It is also easier to understand and implement than alternative techniques, making it a common choice for storing and sharing trip data.

One way to implement k-anonymization is to select a minimum privacy threshold, k, for the number of trips in a group, below which the group will be suppressed. For trip data that is sold commercially, it is common to see a k value of 5, meaning an origin-destination pair will be removed if fewer than 5 trips were recorded for it. This technique provides the same level of protection regardless of the spatial resolution chosen for aggregation.
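The suppression step can be sketched in a few lines of Python. The group counts below are invented, and `k_anonymize` is a hypothetical helper name; the sketch assumes trips have already been counted per origin-destination pair.

```python
from collections import Counter

K = 5  # minimum number of trips a group needs in order to be published


def k_anonymize(group_counts, k=K):
    """Keep only groups with at least k trips; smaller groups are suppressed."""
    return {group: n for group, n in group_counts.items() if n >= k}


# Hypothetical trip counts per (origin, destination) pair (invented data).
counts = Counter({
    ("zone_1", "zone_2"): 12,
    ("zone_1", "zone_3"): 5,
    ("zone_4", "zone_2"): 4,
    ("zone_5", "zone_6"): 1,
})

published = k_anonymize(counts)
suppressed_trips = sum(counts.values()) - sum(published.values())
print(published)
print(suppressed_trips)
```

Reporting how many trips were suppressed alongside the published groups gives a quick sense of how much utility was traded away at a given resolution.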

If the chosen resolution is too high, few groups will exceed the privacy threshold, incentivizing selection of a more appropriate granularity. This allows data publishers to try multiple parameters and select the groupings that best balance utility and privacy, or even automate granularity selection entirely. Here we demonstrate k-anonymized aggregations at multiple levels of granularity, while maintaining a constant minimum threshold of 5 trips per group.
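One possible way to automate granularity selection is to measure, for each candidate resolution, the share of trips that survive the threshold, and then pick the finest resolution that retains enough of them. The sketch below assumes a simplified one-dimensional corridor with invented positions; the `min_share` cutoff and the helper names are illustrative assumptions, not part of any published tool.

```python
from collections import Counter

K = 5  # constant minimum threshold of trips per group

# Hypothetical trips (invented data): (origin, destination) positions in meters
# along a single corridor.
trips = [(o, o + 1000) for o in (10, 30, 50, 70, 90, 110, 130, 150, 170, 190)]


def cell(position, cell_size):
    """Snap a position to the index of the grid cell that contains it."""
    return int(position // cell_size)


def retained_share(trips, cell_size, k=K):
    """Fraction of trips that remain after suppressing groups smaller than k."""
    counts = Counter((cell(o, cell_size), cell(d, cell_size)) for o, d in trips)
    kept = sum(n for n in counts.values() if n >= k)
    return kept / len(trips)


def pick_cell_size(trips, candidate_sizes, min_share=0.8, k=K):
    """Return the finest cell size that retains at least min_share of trips."""
    for size in sorted(candidate_sizes):
        if retained_share(trips, size, k) >= min_share:
            return size
    return None  # no candidate met the cutoff


for size in (50, 100, 200):
    print(size, retained_share(trips, size))
print(pick_cell_size(trips, [200, 100, 50]))
```

In this toy data, 50-meter cells split the trips into groups that are all below the threshold, so nothing survives, while 100-meter cells keep every trip.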

Implementation

The techniques described here are commonly used best practices in industries that handle sensitive trip data. Cities can apply them to the mobility data they handle or publish using a query similar to this example SQL statement:

SELECT street, hour, COUNT(*) AS k FROM trips GROUP BY street, hour HAVING COUNT(*) >= 5;
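A grouped count with a minimum threshold can be tried end to end using Python's built-in sqlite3 module. In this sketch the table layout, column names, and rows are invented for illustration. The threshold is written as a HAVING clause because the filter applies to the aggregated count rather than to individual rows, and it uses >= 5 so that only groups with fewer than 5 trips are dropped.

```python
import sqlite3

# In-memory database with a hypothetical trips table (invented schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (street TEXT, hour INTEGER)")
rows = [("Main St", 8)] * 6 + [("Main St", 9)] * 5 + [("Oak Ave", 8)] * 2
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# Count trips per street and hour, keeping only groups with at least 5 trips.
query = """
    SELECT street, hour, COUNT(*) AS k
    FROM trips
    GROUP BY street, hour
    HAVING COUNT(*) >= 5
    ORDER BY street, hour
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The two trips on the hypothetical "Oak Ave" fall below the threshold, so that group does not appear in the output.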

There are numerous open source tools that aggregate trip data using these privacy-preserving techniques, such as the SharedStreets mobility-metrics tool, a command-line interface for tracking fleets of scooters operating in a city. If you have questions about these techniques and how to apply them to datasets in your city, email us at info@sharedstreets.io
