How Mad Libs Helped Solve Differential Privacy

Introduction

Think back to your childhood, when, at least once, you filled out a Mad Libs story. These word games brought endless joy and let us conjure wacky scenarios and meanings by swapping out just a few words.

An example of a Mad Libs game. Image adapted from https://www.readbrightly.com/mad-libs-printables-activities/

Aside from reliably filling the room with laughs on any occasion, from birthday parties to talk show episodes, the Mad Libs paradigm has percolated into the field of Natural Language Processing (NLP). In recent years, researchers have adopted this simple fill-in-the-blank idea to tackle one of the most perplexing issues in data analytics: the privacy of customer-supplied data. Before diving into the methodology, let me offer a bit more motivation: data collection, even when intended for social good, can lead to harrowing results, and you should be concerned.

The year was 2010, and Netflix was facing one of its largest legal challenges on the heels of the Netflix Prize, a movie recommender algorithm-improvement contest designed to boost customer retention. The dataset, consisting of over 100 million ratings of 17,770 movies by 480,189 users, was released with privacy in mind, but quickly took flak from privacy experts. Two researchers at The University of Texas at Austin were able to identify specific users in the Netflix Prize dataset and even deduced additional information about them, such as their apparent political preferences. The anonymization and de-identification performed on the dataset proved insufficient, and after the contest had run for a few years, its planned sequel was canceled.

The dawn of the $1 million Netflix Prize coding contest. Image adapted from https://www.thrillist.com/entertainment/nation/the-netflix-prize

When grappling with data this abundant yet sensitive, scrubbing identifiers once is not enough; you have to lather, rinse, repeat, and even then the job may not be done. As Netflix showed us, the consequences of forgoing customer privacy can exceed what even the brightest engineers anticipate. In the rest of this post, I will introduce a framework known as differential privacy, which addresses this issue, and show how scientists at Amazon have applied it to text data.

Differential Privacy

Differential privacy (DP) starts from a key observation: no matter how much we anonymize and scrub identifying features from data, attackers can combine it with supplementary information to re-identify users. The best we can do is reduce the likelihood that the data we release can be linked back to any individual in the source data. Under this framework, we obscure identities by adding random noise to the data, although this comes at the expense of accuracy. Fortunately, DP defines a quantitative tradeoff between privacy and accuracy (utility).

Consider two datasets: one with your information, and one without it. Adding random noise throughout the records obscures the difference between the two datasets. In other words, the outputs of statistical or aggregation queries on the two datasets become nearly indistinguishable, and DP makes this a mathematical guarantee. One caveat, however: what if these datasets are really small? You can add more noise, but then your analyses lose all their utility. Your best bet is simply to collect more data.
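To make this concrete, here is a minimal sketch of the classic Laplace mechanism. This is my own illustration, not code from the Amazon papers: a counting query is answered on two neighboring datasets, with noise calibrated to a privacy parameter epsilon so that the two answers are hard to tell apart.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with Laplace noise calibrated to epsilon.

    A count changes by at most 1 when one person's record is added or
    removed (sensitivity = 1), so Laplace noise with scale 1 / epsilon
    is enough to mask any single individual's contribution.
    """
    true_count = sum(predicate(x) for x in data)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Two neighboring datasets: identical except for one person's record.
with_me = [25, 31, 42, 58, 29]   # ages, including mine (29)
without_me = [25, 31, 42, 58]

epsilon = 0.5                    # smaller epsilon = more noise = stronger privacy
query = lambda age: age < 30     # "how many people are under 30?"

print(laplace_count(with_me, query, epsilon))
print(laplace_count(without_me, query, epsilon))
# The noisy answers overlap heavily, so an observer cannot reliably
# tell which dataset (with or without my record) produced them.
```

Notice the tradeoff the last paragraph describes: shrink epsilon and the answers become safer but noisier; shrink the dataset and the same noise drowns out the signal entirely.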

Two variants of DP that are important to distinguish are curator DP, wherein privacy is defined “with respect to aggregate statistics and ML models” (that is, the results), and local DP (LDP), where privacy is defined “with respect to the data points contributed by each individual”.
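For intuition about the local flavor, here is a short sketch of randomized response, a textbook local-DP technique (my example, not one from the Amazon work): each individual flips coins and sometimes lies before their answer ever leaves their hands, so even the data collector never sees the raw value, yet the aggregate is still recoverable.

```python
import random

def randomized_response(truth: bool) -> bool:
    """Report a yes/no answer with plausible deniability (local DP).

    With probability 1/2 tell the truth; otherwise answer with a fresh
    coin flip. Any single reported answer could have come from either
    true value.
    """
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

# Each user perturbs their own answer before sharing it.
true_answers = [True] * 300 + [False] * 700
reports = [randomized_response(a) for a in true_answers]

# Debias the aggregate: E[reported yes rate] = 0.25 + 0.5 * true rate.
reported_rate = sum(reports) / len(reports)
estimated_true_rate = (reported_rate - 0.25) / 0.5
print(round(estimated_true_rate, 3))  # close to the true rate of 0.3
```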

Now, Back to Mad Libs

Fast-forward to 2020. Researchers at Amazon Science proposed a technique that, much like Mad Libs, swaps in new words and phrases for pre-existing ones that could reveal sensitive information about the customers who supplied the data. Analysis is then performed on the rewritten text.

As mentioned before, in order for DP to be satisfied, it must be (almost) equally probable that a given statistic was derived from either of the two datasets. The allowed gap between the probabilities is controlled by a privacy parameter, denoted ε > 0, which must be prescribed by a domain expert.
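Formally, this is the standard ε-DP guarantee (stated here in its usual textbook form rather than quoted from the Amazon papers): for any two datasets D and D′ differing in a single record, and any set S of possible outputs, a randomized mechanism M must satisfy

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

A smaller ε forces the two probabilities closer together (more privacy, more noise), while a larger ε lets them diverge (less privacy, more utility).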

How the Privacy Preserving Mechanism Works
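At a high level (this is my paraphrase of the 2020 approach, not the authors' own code), each sensitive word is mapped to its point in a word-embedding space, random noise calibrated to the privacy parameter is added to that point, and the word is replaced by whichever vocabulary word sits nearest the perturbed point. Common, generic words tend to survive the process, while rare, revealing ones get swapped for plausible neighbors: exactly the Mad Libs effect. The sketch below uses plain Gaussian noise and a tiny hypothetical vocabulary for illustration; the published mechanism calibrates a multivariate noise distribution far more carefully.

```python
import numpy as np

def madlib_replace(word, embeddings, epsilon, rng=None):
    """Replace `word` with a nearby word after perturbing its embedding.

    embeddings: dict mapping each vocabulary word to a 1-D numpy vector.
    epsilon:    privacy parameter; smaller values add more noise, so the
                replacement drifts further from the original meaning.
    """
    if rng is None:
        rng = np.random.default_rng()

    vec = embeddings[word]
    noisy_vec = vec + rng.normal(scale=1.0 / epsilon, size=vec.shape)

    # Project back to the vocabulary: pick the word closest to the
    # perturbed point, which may or may not be the original word.
    def distance(candidate):
        return np.linalg.norm(embeddings[candidate] - noisy_vec)

    return min(embeddings, key=distance)

# Toy 2-D "embedding space" for demonstration purposes only.
toy_embeddings = {
    "seattle":  np.array([1.0, 1.0]),
    "portland": np.array([1.2, 0.9]),
    "city":     np.array([0.5, 0.5]),
    "pizza":    np.array([-2.0, 3.0]),
}

print(madlib_replace("seattle", toy_embeddings, epsilon=2.0))
# Often "seattle" or "portland": semantically close, but no longer a
# reliable signal of the exact word the customer actually wrote.
```

Because the noise is random, an attacker who sees "portland" in the released text cannot tell whether the customer wrote "portland", "seattle", or something else nearby, which is precisely the indistinguishability that DP asks for.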

How was this improved?

In a follow-up paper, Feyisetan et al. extended their methods from ordinary Euclidean space to hyperbolic space. For me, the simplest way to tell the two apart is by curvature: a Euclidean plane is flat, so you can walk in a straight line forever and never return to where you started; a sphere has constant positive curvature, so walking along the equator eventually brings you back; hyperbolic space sits at the other extreme, with constant negative curvature, so there is ever more room the farther out you go. Just like in Euclidean embeddings, distance serves as a notion of semantic similarity, but the curvature injects a hierarchy into the embeddings: a word's distance from the center signals its level of specificity. In the figure below, words get more general the closer you look to the center of the embedding.

The WordNet noun taxonomy projected to a two-dimensional hyperbolic space. Image adapted from https://mnick.github.io/project/geometric-representation-learning/

This hierarchy allowed the authors to replace specific, private words with relevant but more general ones, making user-specific data even harder to trace.
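To give a feel for what "distance" means here, below is a small sketch of the hyperbolic distance function, written under the common assumption that the embeddings live in the Poincaré-ball model (as in the WordNet figure above); the points and their norms are toy values of my own choosing.

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    Distances blow up near the boundary, which gives the space its
    tree-like, hierarchy-friendly geometry.
    """
    sq_diff = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq_diff / denom)

# Toy points: smaller norm ~ more general concept, larger norm ~ more specific.
entity = np.array([0.05, 0.0])    # near the center: very general
dog    = np.array([0.70, 0.10])   # farther out: more specific
poodle = np.array([0.80, 0.12])   # near the boundary: very specific

print(poincare_distance(dog, poodle))     # small: closely related words
print(poincare_distance(entity, poodle))  # larger: far apart in the hierarchy
```

Perturbing a point so that it drifts toward the center therefore tends to land on a more general word, which is what makes "replace the specific with the general" fall naturally out of this geometry.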

Written by Camille Dunning
