Privacy in a Data-Driven World
Roxana Geambasu from Columbia University came to speak at the MIT security seminar, broadly about privacy in a data-driven world. What does she mean by this? Consider Gmail Ads, which serve ads based on information in your Gmail emails. It’s not just Gmail, though: data brokers can tell when you’re sick, tired, or depressed, and they sell this information. Similarly, Google Apps for Education used institutional emails to target ads in personal accounts, and credit companies are looking into using Facebook data to decide loans.
These are the consequences of a data-driven web: a complex and opaque ecosystem driven by massive collection and monetization of personal data. But who has what data? What is it used for? Are the uses good or bad for us? End-users, privacy watchdogs, and other outsiders are equally in the dark.
Her research is on building transparency tools that increase users’ awareness of, and society’s oversight over, how applications use personal data. She has built Sunlight [CCS ‘15], which reveals the causes of targeting; XRay [USENIX Sec ‘14], which reveals targeting through correlation; and Pebbles [OSDI ‘14], which reveals how mobile applications manage persistent data. Her talk focused mainly on Sunlight, and I refer you to the links for more information on the other papers.
She also builds development abstractions and tools that facilitate the construction of privacy-preserving applications. She has built CleanOS [OSDI ‘12], a privacy-mindful mobile operating system, and Pyramid, which minimizes data exposure in data-driven applications.
Now, I’ll provide an overview of Sunlight. For more details, I refer you to the paper.
Sunlight is a generic, broadly applicable system that detects when personal data is used for targeting and personalization. It reveals which inputs (e.g., emails) trigger which outputs (e.g., ads). The key idea is to correlate inputs with outputs based on observations from profiles with differentiated inputs. Sunlight is precise, scalable, and works with many services; they tested it on Gmail ads, ads on arbitrary websites, recommendations on Amazon and YouTube, and prices on travel websites.
In terms of design, Sunlight is generic and broadly applicable for targeting detection. It assumes that a small set of inputs is used to produce each output, and the goal is to discover the correct input combination. Its targeting predictions must be precise and statistically justifiable, while still detecting as many true predictions as possible. Finally, it scales in the number of inputs and outputs: the goal is to detect targeting of many outputs on many inputs with limited resources.
In the end, if during data collection they randomly assign their inputs independently of any other variable, Sunlight’s associations have a causal interpretation (not just a correlational one). However, Sunlight cannot explain how this targeting happens. Which player in the ecosystem is responsible? Is it a human intervention or an algorithmic decision?
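To make the correlation idea concrete, here is a minimal sketch of detecting targeting from profiles with randomly differentiated inputs. This is not Sunlight’s actual algorithm (the paper uses sparse regression and a more careful statistical pipeline); the email names, the stub “ad network,” the noise rate, and the Fisher-test detector are all my own illustrative assumptions.

```python
import math
import random

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test on a 2x2 table: the probability of
    seeing >= a input/ad co-occurrences under the hypergeometric null."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += math.comb(col1, k) * math.comb(n - col1, row1 - k) / math.comb(n, row1)
    return p

random.seed(0)
INPUTS = ["flight deals", "yoga retreat", "loan refinance", "cat food"]
TARGETED = "loan refinance"   # hypothetical: the ad keys on this one email
N_PROFILES = 80

# Step 1: randomly and independently assign each input email to each profile.
profiles = [{e: random.random() < 0.5 for e in INPUTS} for _ in range(N_PROFILES)]

# Step 2: observe outputs. This stub "ad network" shows the ad whenever the
# targeted email is present, plus a little background noise.
ads = [p[TARGETED] or random.random() < 0.05 for p in profiles]

# Step 3: correlate each input with the output, correcting for the fact
# that we run one hypothesis test per input (Bonferroni).
alpha = 0.05 / len(INPUTS)
flagged = []
for email in INPUTS:
    a = sum(p[email] and s for p, s in zip(profiles, ads))
    b = sum(p[email] and not s for p, s in zip(profiles, ads))
    c = sum(not p[email] and s for p, s in zip(profiles, ads))
    d = sum(not p[email] and not s for p, s in zip(profiles, ads))
    if fisher_one_sided(a, b, c, d) < alpha:
        flagged.append(email)

print("predicted targeting inputs:", flagged)
```

Because the input assignment is randomized and independent of everything else, a significant association in step 3 supports the causal reading described above, not merely a correlational one.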
They have a very detailed evaluation, which I won’t reproduce here; I refer you to the paper for that. However, she talked more about use cases for Sunlight. Specifically, they developed a GmailAdObservatory, a service for studying the targeting of Gmail ads on users’ emails, meant for researchers and journalists. The researcher supplies a set of emails. The GmailAdObservatory uses one set of Gmail accounts to send those emails to a separate set of Gmail accounts, and it then collects ads periodically. It uses Sunlight to detect targeting for each collected ad.
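The observatory workflow described above, assign emails, deliver them, then collect ads periodically, could be orchestrated roughly as in this sketch. Every function name here (`make_profiles`, `deliver`, `collect_ads`, `observe`) is a hypothetical placeholder of my own, not GmailAdObservatory’s real interface, and the ad collector is a stub rather than actual Gmail plumbing.

```python
import random

random.seed(7)

def make_profiles(emails, n_accounts):
    """Randomly assign each researcher-supplied email to each account."""
    return [{e: random.random() < 0.5 for e in emails} for _ in range(n_accounts)]

def deliver(profile):
    """Stub for sending a profile's assigned emails into its inbox."""
    return [e for e, present in profile.items() if present]

def collect_ads(inbox):
    """Stub ad collector: pretend the network targets the word 'mortgage'."""
    return ["mortgage-ad"] if any("mortgage" in e for e in inbox) else []

def observe(emails, n_accounts=40, rounds=3):
    """One observatory run: deliver emails, then collect ads periodically."""
    observations = []
    for profile in make_profiles(emails, n_accounts):
        inbox = deliver(profile)
        seen = set()
        for _ in range(rounds):          # periodic ad collection
            seen.update(collect_ads(inbox))
        observations.append((profile, seen))
    return observations

obs = observe(["mortgage rates", "concert tickets"])
```

Each `(profile, ads)` pair collected this way would then be handed to Sunlight’s detection stage to decide which emails each ad targets.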
Roxana and her group have done an interesting body of work on how to identify privacy issues in a data-driven web. I think this is an important body of work as users start releasing more data to applications. It’s important for users to know who is using their data and how it is being used.