Katie Tezapsidis, Software Engineer, Privacy Engineering
Data analysis helps Uber continuously improve the user experience by preventing fraud, increasing efficiency, and providing important safety features for riders and drivers. Data gives our teams timely feedback about what we’re doing right and what needs improvement.
Uber is committed to protecting user privacy, and we apply this principle throughout our business, including our internal data analytics. While Uber already has technical and administrative controls in place to limit who can access specific databases, we are adding further protections governing how that data is used, even in authorized cases.
We are excited to give a first glimpse of our recent work on these additional protections with the release of a new open source tool, which we’ll introduce below.
Background: Differential Privacy
Differential privacy is a formal definition of privacy and is widely recognized by industry experts as providing strong and robust privacy assurances for individuals. In short, differential privacy allows general statistical analysis without revealing information about a particular individual in the data. Results do not even reveal whether any individual appears in the data. For this reason, differential privacy provides an extra layer of protection against re-identification attacks as well as attacks using auxiliary data.
Differential privacy can provide highly accurate results for the class of queries Uber commonly uses to identify statistical trends. Consequently, it allows us to calculate aggregations (averages, sums, counts, etc.) over elements like groups of users or trips on the platform without exposing information that could be used to infer details about a specific user or trip.
Differential privacy is enforced by adding noise to a query’s result, but some queries are more sensitive to the data of a single individual than others. To account for this, the amount of noise added must be tuned to the sensitivity of the query, which is defined as the maximum change in the query’s output when an individual’s data is added to or removed from the database.
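To make the link between sensitivity and noise concrete, here is a minimal sketch for a counting query. The post does not say which noise distribution is used; the sketch assumes the Laplace mechanism, a standard choice in which the noise scale is the sensitivity divided by a privacy parameter epsilon. The function names and the epsilon value are illustrative assumptions, not part of Uber's tool.

```python
import random


def noise_scale(sensitivity, epsilon):
    """Laplace scale for a given sensitivity and privacy parameter epsilon."""
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution centered at zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a count with noise tuned to its sensitivity.

    Adding or removing one row changes a plain count by at most 1,
    hence the default sensitivity of 1.0 (an assumption that holds
    only when each individual contributes a single row).
    """
    return true_count + laplace_noise(noise_scale(sensitivity, epsilon), rng)


rng = random.Random(0)  # seeded so the sketch is repeatable
print(noisy_count(120_000, epsilon=0.5, rng=rng))
```

A query with higher sensitivity gets a proportionally larger noise scale under the same epsilon, which is the tuning described above.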
As part of their job, a data analyst at Uber might need to know the average trip distance in a particular city. A large city, like San Francisco, might have hundreds of thousands of trips with an average distance of 3.5 miles. If any individual trip is removed from the data, the average remains close to 3.5 miles. This query therefore has low sensitivity and requires less noise for each individual to remain anonymous within the crowd.
Conversely, the average trip distance in a smaller city with far fewer trips is more influenced by a single trip and may require more noise to provide the same degree of privacy. Differential privacy defines the precise amount of noise required given the sensitivity.
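The large-city and small-city comparison can be checked numerically. The sketch below uses synthetic trip distances (invented for illustration, not Uber data) and measures how far the average moves when any single trip is removed. This measures the effect on one particular dataset, which conveys the intuition only; it is not the worst-case sensitivity a real mechanism would have to bound.

```python
import random


def max_change_in_mean(distances):
    """Largest shift in the average caused by removing any one trip."""
    n = len(distances)
    total = sum(distances)
    mean = total / n
    # Mean of the remaining n - 1 trips after dropping trip d.
    return max(abs(mean - (total - d) / (n - 1)) for d in distances)


rng = random.Random(42)  # seeded so the sketch is repeatable

# Hypothetical cities: distances drawn uniformly around 3.5 miles.
large_city = [rng.uniform(0.5, 6.5) for _ in range(200_000)]
small_city = [rng.uniform(0.5, 6.5) for _ in range(50)]

print("large city:", max_change_in_mean(large_city))
print("small city:", max_change_in_mean(small_city))
```

With far fewer trips, a single trip moves the small city's average much more, so more noise is needed there for the same degree of privacy.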
A major challenge for practical differential privacy is how to efficiently compute the sensitivity of a query. Existing methods lack sufficient support for the features used in Uber’s queries and many approaches require replacing the database with a custom runtime engine. Uber uses many different database engines and replacing these databases is infeasible. Moreover, custom runtimes cannot meet Uber’s demanding scalability and performance requirements.
Introducing Elastic Sensitivity
To address these challenges we adopted Elastic Sensitivity, a technique developed by security researchers at the University of California, Berkeley for efficiently calculating the sensitivity of a query without requiring changes to the database. The full technical details of Elastic Sensitivity are described here.
Today, we are excited to share a tool developed in collaboration with these researchers to calculate Elastic Sensitivity for SQL queries. The tool is available now on GitHub. It is designed to integrate easily with existing data environments and support additional state-of-the-art differential privacy mechanisms, which we plan to share in the coming months.
Elastic Sensitivity supports the majority of statistical queries written by Uber analysts and is compatible with our existing databases. It is also extremely efficient, calculating the sensitivity of a query in a few milliseconds even for large databases. This enables us to enforce differential privacy in real time across our databases, with negligible performance overhead, while providing access for analysis and preserving the integrity of the raw data.
Scaling Differential Privacy at Uber
The Privacy Engineering team at Uber builds software that can enforce privacy best practices while integrating into the development workflow. This calls for a scalable solution that doesn’t require every engineer to be an expert in differential privacy. Therefore, we also built a lightweight service layer around this new library to provide differential privacy as a service to other teams at Uber.
Now, with a simple RPC, Uber’s microservices can integrate seamlessly with our differential privacy stack. We’re also integrating differential privacy into our analytics pipeline, ensuring we can continue to improve our business with data-driven insights while using leading-edge privacy technology.
With today’s open-source release we hope to encourage wider adoption of strong privacy technology for protecting user data. More exciting announcements to come!
Update: A new release of this project embeds the differential privacy mechanism into the SQL query, before execution, so the query enforces differential privacy on its own output. Read more about this update here.