Google Fuzzing: Is the privacy/utility tradeoff a false dichotomy?

Published in the CLTC Bulletin · Feb 19, 2021

By Nick Merrill & Jeremy Gordon

Search engines like DuckDuckGo aim to serve you results while protecting your privacy. But doesn’t Google just… work so much better?

Part of Google’s efficacy lies in its internal predictive model. Your queries feed Google’s algorithms, which in turn suggest search results you are more likely to find satisfying.

To be clear, this is a model of you, however incomplete.

For example, after learning your preference for vegan recipes, this model may choose to serve you results (and, of course, ads) about local, vegan-friendly restaurants. It may infer, over time, that you are vegan. Perhaps it will infer where you live, where you work, and what kind of person you are.

Every time you hit the “search” button, you’re giving up some privacy. You’re trading information about yourself for Google’s ability to recommend relevant results.

What if there were a way for you to negotiate with Google — to decide to what degree you want to trade your privacy? What if you could decide what the optimal tradeoff between privacy and accuracy is for you?

That question emerged from a conversation that the two of us — Nick Merrill and Jeremy Gordon, both researchers at the University of California, Berkeley — had about a research paper that Jeremy co-authored with UC Berkeley researchers Max Curran, John Chuang, and Coye Cheshire. (Jeremy is a Ph.D. student at the School of Information and a member of the BioSense lab.)

In the paper, “Covert Embodied Choice: Decision-Making and the Limits of Privacy Under Biometric Surveillance” (to be presented at the CHI 2021 conference), the authors share findings from a virtual-reality-based behavioral experiment studying how people respond to surveillance. In short, the researchers found that people are pretty bad at evading algorithms’ predictions, even when they’re explicitly motivated to act unpredictably.

Image from “Covert Embodied Choice: Decision-Making and the Limits of Privacy Under Biometric Surveillance”

Participants in the experiment were asked to pick up one of two cards on a virtual table. Simultaneously, a computer adversary was given access to a stream of biometric data: gaze (where the participant was looking), heart rate, arm and head motion, and more. The computer then used this data to predict the participant’s choice before they picked up the card. Gordon and his fellow researchers found that, while participants used various strategies, the adversary successfully predicted their decision over 80% of the time. Additionally, a significant portion of participants became more predictable when attempting to obfuscate their intent.

This finding made us wonder: perhaps people can act more unpredictably toward algorithms with the help of… yet another algorithm? We agreed that, by adding some random noise to their behaviors, participants in Jeremy’s study could have evaded the algorithm successfully.
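
To make that intuition concrete, here is a minimal sketch (ours, not the paper’s) of what “adding noise” could look like: commit to a card, then follow that commitment only with some probability. The function name and the 0.6 parameter are purely illustrative.

```python
import random

def noisy_choice(intended: str, decoy: str, p_follow: float = 0.6) -> str:
    """Pick the intended card with probability p_follow, otherwise the decoy.

    Even an adversary that reads intent perfectly can then predict the
    final pick no better than p_follow of the time.
    """
    return intended if random.random() < p_follow else decoy

# Simulate many trials: an adversary who always guesses the intended card
# is right only about p_follow of the time, well under the ~80% reported
# in the study.
trials = 100_000
hits = sum(noisy_choice("left card", "right card") == "left card" for _ in range(trials))
print(f"adversary accuracy ~ {hits / trials:.2f}")
```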

But this led us to a much bigger question: how might one evade prediction by an entity as powerful as Google?

One solution you may be familiar with is to use a virtual private network (VPN). A VPN prevents services like Google from knowing where exactly you are, making it harder for them to profile you. If you’ve ever used a VPN, you’re also aware of the price you pay: poor search results. If you’re connected to a VPN server in Stockholm and ask Google for a ramen restaurant near you, the results you see may well be from Stockholm.

So how do we preserve some privacy while getting slightly better search results? One simple approach for hindering Google’s ability to profile you is to throw some misleading queries into the mix: use a “fuzzer” to flood the zone with fake queries and hide your real query somewhere inside the stream. Plant your own needle in the haystack. (See one technology in this vein here).
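
In rough terms, a fuzzer of this kind might look like the sketch below. Everything here (the decoy list, the function name) is hypothetical and far simpler than real tools in this space.

```python
import random

# A hypothetical (and far too small) pool of decoy queries; a real fuzzer
# would need a much larger, more human-plausible set.
DECOY_QUERIES = [
    "weather tomorrow",
    "how tall is mount everest",
    "convert usd to eur",
    "best sci-fi movies 2020",
    "how to fix a leaky faucet",
]

def fuzz(real_query: str, n_decoys: int = 3) -> list[str]:
    """Hide one real query inside a shuffled stream of decoys."""
    stream = random.sample(DECOY_QUERIES, k=n_decoys) + [real_query]
    random.shuffle(stream)
    return stream

print(fuzz("vegan ramen near me"))
```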

The problem with this approach is that Google knows what “real queries” look like. They have lots of data on people querying Google, and they’re likely able to tell human-generated queries (which tend to correlate with a user’s other recent queries and context) from automated queries, which will likely look as disjoint and artificial as the system that created them. In fact, Google is probably the only entity capable of making that distinction at scale!

Another solution would be to share a Google account. The more people you share it with, the more “private” any individual user’s searches become. (Google knows that someone is curious about jellyfish but isn’t sure who.) Of course, the tradeoff is that your future results will be less accurate, but by controlling the size of the “sharing pool,” you could at least control the privacy/utility tradeoff.

A problem with this approach is that sharing Google accounts could cause some privacy leakage. People could inadvertently reveal sensitive data about themselves to Google and to the other people in the sharing pool.

The sharing pool approach is analogous to differential privacy, a method for finding patterns in a data set while obscuring each individual’s information. Differential privacy is also built on a tradeoff between accuracy and privacy: adding noise to a database makes it more private but less useful. Using a value called “epsilon,” developers can control this tradeoff. (Learn more about differential privacy here.)
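
As a concrete, if simplified, illustration of what epsilon buys you, here is the classic Laplace mechanism applied to a counting query. This is a textbook sketch, not anything specific to search.

```python
import numpy as np

def private_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query (sensitivity 1).

    Smaller epsilon means more noise: more privacy, less accuracy.
    """
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The same query under three different privacy budgets.
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: {private_count(1_000, eps):.1f}")
```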

If you’re designing a sharing pool, you’d ideally like to gather queries while protecting the privacy of everyone in the pool. This problem resembles differential privacy: the fuzzer needs to issue search queries that are realistic enough to garner good recommendations but random enough to prevent users in the pool from finding out what their friends are looking up. Perhaps an epsilon-like parameter can control that balance here, too.

Another way a privacy-preserving fuzzer could work would be to establish a decentralized network in which some percentage of queries are piped through someone else’s Google account, with results “onion-routed” back to the requester (that is, the results would pass through multiple hops, making it unreasonably difficult to trace who in the network made a particular query). An individual benefits when others’ queries are routed through their account, since those queries add a new source of “noise” to their profile, and when their own queries are routed through other accounts, since that grants them a degree of privacy.
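
The sketch below is a toy simulation of that idea, with made-up names throughout. It captures only the routing structure, not the per-hop encryption an actual onion-routed system would require.

```python
import random

class Peer:
    """A participant in the pool, standing in for one Google account."""

    def __init__(self, name: str):
        self.name = name

    def issue_query(self, query: str) -> str:
        # Placeholder for actually submitting the query from this account.
        return f"results for {query!r}, issued from {self.name}'s account"

def route_query(query: str, peers: list[Peer], hops: int = 3) -> str:
    """Relay a query through random peers; the final hop issues it.

    Each hop only knows its immediate neighbors, so the account that
    'owns' the query is decoupled from the person who asked it.
    """
    path = random.sample(peers, k=hops)
    results = path[-1].issue_query(query)
    # In a full protocol the results would travel back along the same path.
    return results

pool = [Peer(f"user{i}") for i in range(10)]
print(route_query("vegan ramen near me", pool))
```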

At this stage, the problem becomes the design of a routing scheme. This scheme would be further customized by each user’s privacy/utility preferences (our version of epsilon).
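
One very simple way to express that per-user preference, again purely as an illustration: each user sets what fraction of their queries gets routed through the pool rather than issued directly.

```python
import random
from dataclasses import dataclass

@dataclass
class PrivacyPreference:
    # Hypothetical per-user knob, playing the role epsilon plays in
    # differential privacy: the fraction of queries routed through
    # other accounts rather than issued directly.
    route_fraction: float  # 0.0 = full utility, 1.0 = maximum obfuscation

def should_route(pref: PrivacyPreference) -> bool:
    """Decide, per query, whether to route through the pool."""
    return random.random() < pref.route_fraction

alice = PrivacyPreference(route_fraction=0.7)
routed = sum(should_route(alice) for _ in range(10_000))
print(f"routed {routed / 10_000:.0%} of queries")  # roughly 70%
```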

Why is this approach important? Philosophically, we’re contesting who knows who is Googling what. What we’re really converging upon here is some way of sharing the power that comes from being a widely used recommender like Google.

If there were some distributed or consensus way of querying an oracle like Google in a privacy-preserving way, what would that mean?

Consider it this way: a common assumption is that Google’s market power (and valuation) is a function of its ability to deliver a service that needs to be centralized. If users could get good-enough utility from Google without revealing enough about themselves for effective ad targeting, it would call Google’s business model seriously into question. Would advertisers still pay?

If built out, an approach like this could demonstrate that effective privacy is possible and expose the privacy-for-personalization tradeoff as a false dichotomy. The implications would go well beyond Google to include an entire class of internet businesses whose valuation is tied to the promise of extracted user information.

Ultimately, we want to see users gain greater leverage over their data. They should not have to give up any more privacy than necessary. Where regulation may help with data stewardship, perhaps user-facing tools will play a critical role in realigning incentives — and keeping companies honest.

Nick Merrill is Director of the Daylight Lab at the UC Berkeley Center for Long-Term Cybersecurity (daylight.berkeley.edu).