Uber Releases Open Source Project for Differential Privacy

Uber Security
Jul 13, 2017

Katie Tezapsidis, Software Engineer, Privacy Engineering

Data analysis helps Uber continuously improve the user experience by preventing fraud, increasing efficiency, and providing important safety features for riders and drivers. Data gives our teams timely feedback about what we’re doing right and what needs improvement.

Uber is committed to protecting user privacy and we apply this principle throughout our business, including our internal data analytics. While Uber already has technical and administrative controls in place to limit who can access specific databases, we are adding additional protections governing how that data is used — even in authorized cases.

We are excited to give a first glimpse of our recent work on these additional protections with the release of a new open source tool, which we’ll introduce below.

Background: Differential Privacy

Differential privacy can provide high-accuracy results for the class of queries Uber commonly uses to identify statistical trends. Consequently, differential privacy allows us to calculate aggregations (averages, sums, counts, etc.) over groups of users or trips on the platform without exposing information that could be used to infer details about a specific user or trip.

Differential privacy is enforced by adding noise to a query’s result, but some queries are more sensitive to the data of a single individual than others. To account for this, the amount of noise added must be tuned to the sensitivity of the query, which is defined as the maximum change in the query’s output when an individual’s data is added to or removed from the database.
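To make that relationship concrete, here is a minimal sketch of the standard Laplace mechanism in Python. The function name and the example numbers are ours for illustration; they are not part of Uber's tool.

```python
import numpy as np

def laplace_mechanism(true_result: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private version of true_result.

    The noise scale is sensitivity / epsilon: queries that a single
    individual can change more (higher sensitivity) receive
    proportionally more noise.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_result + noise

# Example: a count query has sensitivity 1, since adding or removing
# one person changes a count by at most 1.
private_count = laplace_mechanism(true_result=182_304, sensitivity=1.0, epsilon=0.1)
```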

As part of their job, a data analyst at Uber might need to know the average trip distance in a particular city. A large city, like San Francisco, might have hundreds of thousands of trips with an average distance of 3.5 miles. If any individual trip is removed from the data, the average remains close to 3.5 miles. This query therefore has low sensitivity and requires less noise for each individual to remain anonymous within the crowd.

Conversely, the average trip distance in a smaller city with far fewer trips is more influenced by a single trip and may require more noise to provide the same degree of privacy. Differential privacy defines the precise amount of noise required given the sensitivity.
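As a back-of-the-envelope illustration (the figures below are made up, not Uber data), here is how much a single long trip can move an average in a large city versus a small one:

```python
def max_average_shift(n_trips: int, avg_miles: float, max_trip_miles: float) -> float:
    """Worst-case change in the average when one extreme trip is removed.

    Removing a single trip of length max_trip_miles from n_trips trips
    shifts the average by at most (max_trip_miles - avg_miles) / (n_trips - 1).
    """
    return abs(max_trip_miles - avg_miles) / (n_trips - 1)

# A large city: hundreds of thousands of trips; one 100-mile trip barely moves the average.
print(max_average_shift(n_trips=500_000, avg_miles=3.5, max_trip_miles=100))  # ~0.0002 miles

# A small city: the same trip moves the average by orders of magnitude more.
print(max_average_shift(n_trips=500, avg_miles=3.5, max_trip_miles=100))      # ~0.19 miles
```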

A major challenge for practical differential privacy is how to efficiently compute the sensitivity of a query. Existing methods lack sufficient support for the features used in Uber's queries, and many approaches require replacing the database with a custom runtime engine. Uber uses many different database engines, and replacing these databases is infeasible. Moreover, custom runtimes cannot meet Uber's demanding scalability and performance requirements.

Introducing Elastic Sensitivity

Today, we are excited to share a tool developed in collaboration with security researchers at the University of California, Berkeley to calculate Elastic Sensitivity for SQL queries. The tool is available now on GitHub. It is designed to integrate easily with existing data environments and support additional state-of-the-art differential privacy mechanisms, which we plan to share in the coming months.

Example reference system using Elastic Sensitivity analysis.

Elastic Sensitivity supports the majority of statistical queries written by Uber analysts and is compatible with our existing databases. It is also extremely efficient, calculating the sensitivity of a query in a few milliseconds even for large databases. This enables us to enforce differential privacy in real-time across our databases, with negligible performance overhead, while providing access for analysis and preserving the integrity of the raw data.
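Purely as an illustration of the data flow in a reference system like the one pictured above (and not the project's actual API), here is a hypothetical Python sketch. The callables `elastic_sensitivity` and `execute` stand in for the static analysis and the existing database; their names are ours.

```python
from typing import Callable
import numpy as np

def run_private_query(
    sql: str,
    epsilon: float,
    elastic_sensitivity: Callable[[str], float],  # static analysis of the query text
    execute: Callable[[str], float],              # the existing, unmodified database
) -> float:
    """Hypothetical reference flow around an Elastic Sensitivity analysis."""
    # 1. Statically analyze the SQL text to get its sensitivity (no data is touched).
    sensitivity = elastic_sensitivity(sql)

    # 2. Run the query against the existing database as-is.
    true_result = execute(sql)

    # 3. Return only a noisy result, with noise calibrated to the sensitivity.
    return true_result + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
```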

Scaling Differential Privacy at Uber

Now, with a simple RPC, Uber’s microservices can integrate seamlessly with our differential privacy stack. We’re also integrating differential privacy into our analytics pipeline, ensuring we can continue to improve our business with data-driven insights while using leading-edge privacy technology.
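The internal RPC interface is not described in this post, so the sketch below is only a guess at the shape such an integration could take; the endpoint, field names, and wrapper function are all hypothetical.

```python
import json
from urllib import request

def private_query_via_rpc(sql: str, epsilon: float, endpoint: str) -> float:
    """Hypothetical client wrapper: send a query, receive only a noised result."""
    payload = json.dumps({"sql": sql, "epsilon": epsilon}).encode("utf-8")
    req = request.Request(endpoint, data=payload, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]
```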

With today’s open-source release we hope to encourage wider adoption of strong privacy technology for protecting user data. More exciting announcements to come!

Update: A new release of this project embeds the differential privacy mechanism into the SQL query itself, before execution, so the query enforces differential privacy on its own output. Read more about this update here.
