# Part II: The risk of re-identification of wearable data is real. (And well known.)

This is Part II in the series. In Part I we looked at how the privacy of NHANES participants is immediately impacted by the findings in the article of Na et al.

In the Na et al. study, the motivating scenario is that of two organizations, an Accountable Care Organization (ACO), and an employer, being authorized access by the same individual to their wearable device data in different contexts: a weight loss intervention and an employer wellness program, respectively. The ACO may later on share anonymous wearable data with the employer linked with additional individual health information (the reason, not articulated in the paper, may be e.g., in the attempt to create more outcome-oriented benefits for the employee). At that point, a malicious employer may be enabled to re-identify the wearable data from the ACO and ultimately learn from the linked health information sensitive data about their employees.

If this sounds like a contrived example, it’s not too far-fetched. Protocols such as Open Authorization (OAuth) make it really easy for a user to share a copy of the data with different external entities, each of which may request different additional linked data, relevant to their specific context. What is not expected is that those entities may later on engage in sharing of anonymized data with each other, either directly or through additional third parties. The breach of privacy resulting is particularly hard to prevent because it may happen **anywhere in the chain of custody **of the data, at any point in time in the future. However, unlike much of the press on Na et al. article seems to suggest, such risk is far from being news.

In fact, almost 5 years ago in May 2014, the United States President’s Council of Advisors on Science and Technology reported that de-identification “[…] remains somewhat useful as an added safeguard, but it is not robust against near‐term future re‐ identification methods.” and deemed it not “[…] a useful basis for policy”.

Re-identification attacks have a **long history** in Computer Security literature including:

- the 20-year-old famous paper on de-anonymization of a Massachusetts hospital discharge database by joining it with with a public voter database.
- the celebrated re-identifying Netflix movie preferences in the Netflix Prize Dataset in 2008(eventually causing Netflix to pull the plug on the competition),
- the incident of “anonymized” runner tracks released from Strava accurately revealing perimeter of US military bases around the world.

Image and video data is not immune either (see our work on de-anonymizing pixelated video here for examples and references).

Even the specific context of Internet of Things (IoT) and, even more precisely, wearable device data, has seen significant research with issues identified (e.g., “Privacy Disclosure from Wearable Devices”, Yan et al. 2015) as well as proposed solutions (e.g., “A Framework for Personal Data Protection in the loT”, Torre et al. 2016).

The bottom line, which is fairly well understood in the Computer Security community is that re-identification is easy to do when **high-dimensional** data is involved.

# A rule of thumb for re-identification risk of wearable data: 6 days of step counts are enough to uniquely identify you among 100M other people.

Let's now try to quantify the risk incurred by Ada, a Fitbit user who shares her physical activity data with two organizations engaging in anonymized data exchanges, to be uniquely identified by one organization in the data received from the other. We'll see that, somewhat counter-intuitively, very little data, less than a week of daily step counts, is sufficient to be able to uniquely re-identify Ada among millions of other users.

Suppose Ada is sharing her Fitbit data via OAuth with organizations **A** and **B**. Using the OAuth token received, **A** and **B** can then later on contact the Fitbit API on behalf of Ada to download her daily step counts for a given period, which may look like this:

So, both **A** and **B**, will know the number of steps taken by Ada in a given day, for every day of the year Ada has worn her Fitbit.

Later on, organizations **A** and **B** engage in a data partnership deal, and **B** comes into possession of the dataset of all Fitbit users sharing data with **A**. **A** has taken good care on removing all personal identifiers (names, email addresses, etc.) and has provides **B** only with a large list of time series, one for each user, of the user’s daily step counts. **A** may believe in good faith to have successfully de-identified their Fitbit dataset before handing it over to **B**.

Now, let’s assume that an employee of **B **with access to the data goes rogue and wants to know how many of **B’**s users have also provided data to **A**. Note that this information is not directly inferable from the dataset provided by **A**, since all personal identifiers have been removed.

However, one can already intuitively guess that if **B** sees among **A**’s users a list of the kind “2011–04–27:5490; 2011–04–28:2344; 2011–04–29:2779; 2011–04–30:9196; 2011–05–01:15828; 2011–05–02:1945; 2011–05–03:366”, it can be pretty sure that it’s looking at the same user in the Figure above.

But *how* sure?

Here is where I think the existing literature is short of providing a simple rule of thumb about “how much is enough to be at risk”, so I’ll attempt to give one below.

Let’s assume for now that **B** is considering a specific user **X** with 5,490 steps on 2011–04–27 and is asking whether it’s the same as a given user **Y** of **A, **who also shows the same number of steps for the same day. What’s the probability that it happens by chance? (i.e., the match is a false positive?) To answer that question accurately we’d need to know the daily step count distribution of **X** and **Y**, but a rough lower bound can be derived by considering a slight different question: *what’s the probability that **Y**'s step count for 2011–04–27 ends with the same last 2 digits as **X**'s?* This question may seem a bit contrived, but it will help simplifying the math later on.

Even if some users walk a lot more than others on average, meaning they’ll be more likely to have daily step counts in the 8k ball-park, vs. others will seem more in the 2k ball-park, we can safely assume that any user’s daily step count modulo 100 (another way of saying “the last 2 digits”) are *uniformly distributed on the interval [00…99], ***i.e., all pairs of digits in [00..99] have the same chance to be the last 2 digits**. Even in the case of users nudged to target goals such as 10k steps/day, we must consider that 100 steps roughly equal 1 minute of brisk walk — a very small amount in the course of a day. The modulo=100 trick also allows us to safely assume that figures from user **X** and **Y are statistically independent from each other. **Even if **X** and **Y** step counts are influenced by similar factors (e.g., they're running buddies, partners, or subject to the same weather because they live in close proximity) there's no reasons to believe the last two digits would be as well.

So, if everyone independently from anyone else has the same chance of having their step counts for a given day ending in any of the 100 possible 2-digit pairs, given **X**’s time series, the chance that **Y**’s last 2 digits for day 2011–04–27 match **X**’s by pure chance is 1/100. The same applies for a match to happen on the last 2 digits for another day, say 2011–04–28. Now, if for any 2 days for the same user the last-2-digit of their step counts are statistically independent (which we can again safely assume so for behavioral reasons similarly to what outlined above), then the probability that user **X** and **Y** match on the last 2 digits of step counts on **k** distinct days is 1/100^**k**. Even with **k** = 6, that’s already 1 in 1 Trillion. If **A’**s dataset transferred to **B** contains 100M users, then **B** has to test to match **X** to 100M possible **Y**s, so the probability that it will find a false match for **X** is still less than 100M/1T, or **one in ten thousands**. That in turns means that if **B** does find a match for **X** over the 100M users provided by **A**, it can be 99.99% sure that it’s looking at **X **(or someone with the same data as **X**, see a full discussion of this nuance here).** **This ultimately means **B** will be able to assign to **X** the additional sensitive data provided by **A** linked to **Y, **thus breaking** X**’s privacy. Ada has been re-identified. In summary:

By knowing your daily step counts for just 6 distinct days, an adversary can re-identify you with 99.99% certainty among 100M other users' daily step counts.

Note that the above estimate of 1/100 probability of a false positive for a given day is **very conservative**. By working out the empirical distributions of step counts one can get much sharper estimates, thus requiring even fewer days for the same false positive rate.

Note also that removing day-level dates (a technique reminiscent of HIPAA safe harbor de-identification guidance) isn't sufficient to mitigate the risk. If, for instance, we only know that matches are in the same month, even if we don’t know whether days matched are the same, we’d only require to increase **k** from 6 to **k**’=11 matching days to set the chance of false positive lower than 1 in 3 trillions. (The results is adjusted by taking into account the number of possible ordered ways one can select 11 days over a 30-day month.)

Now, you may think that the matching procedure proposed above is very brittle as it assumes **exact matching** of the last two digits of daily step counts, (which is nonetheless a reasonable assumption as data returned by the same API should consistently be the same) yet the same reasoning can be applied even assuming approximate matching. If, for example, we call a match for a day if the last 2 digit are at most **ϵ **=10 apart, then the probability of a false positive becomes (roughly) 1/100/**ϵ** = 1/10. That means that instead of **k**=6 matching distinct days we need **k’**=12 days to have again 1/10^**k’ **= 1 in 1 Trillion probability of false positive.

In other words, a factor 10 reduction in accuracy tolerance on the single-day match results in only a 2x increase in the length of the (approximate) matching sequence keeping the false positive rate constant. To bring this an extreme: even reducing accuracy down to a binary piece of information: two days match if and only if their last 2 digit are either both ≤50, or both >50, then still only 34 distinct matching days will be sufficient to have probability of false positive to less than 1 in 10B. In summary, this is bad news for anyone who thinks that** small perturbations** in the data mitigate the risk of re-identification. Approximate matching can take care of small perturbations with only a relatively small increase in the required matching sequence length **k**.

# The curse of dimensionality is a curse for privacy, too.

The seemingly surprising facts outlined in the previous section are actually just examples of counterintuitive behavior of high-dimensional data. As the number of matching days **k** increase, (*k** can be seen as a “dimension” of our space*) the chances of a false match decrease by (multiplicative) factor of **C** for each (additive) increment of **k**, so even if **C** is low (e.g., in the case of approximate matching), it still compounds quickly. This fact is thoroughly discussed in the context of statistical re-identification here (Lemma 1) and here. It can be seen as yet another instance of the curse of dimensionality: when the dimensionality increases, the volume of the space increases so fast that the available data become **sparse**, so the probability that 2 data points match (even approximately) “by chance” in a very sparse space quickly becomes vanishingly small. The curse of dimensionality **hurts both ways**: when trying to identify *similar* traits in *different* items (AI applications) and when trying to *prevent* the *same *items to be identified as *similar* (statistical de-identification).

Continuously collected data is inherently high-dimensional. When every day is a “dimension”, a year worth of data becomes a point in a 365-dimensional space. Now we have devices that collect data, e.g., accelerometers, at the minute, second, or sub-second resolution. At those resolutions, just over 2 years of data sampling the binary information of a person being ‘moving’ vs.’still’ 100 times per second (100Hz frequency) has the same (uncompressed) size of a human genome.

It’s hard to think of anything more uniquely identifying than that.

Effective de-identification techniques for time-series data must be aggressive, including substantial aggregation and perturbation. In some restricted cases, when the kind of questions a third party will ask on the received data are know in advance, several techniques (the most practical one is differential privacy) can provide **provable guarantees** of de-identification even in the presence of **unlimited auxiliary information***. *This means guarantees that are **future-proof**. (Note that on the contrary, *un-principled* aggregation may not be enough, as the Strava incident taught us.) In the even more restricted case when data is used solely to train a machine learning model on a specific task known in advance (e.g., predict mortality from EHR) then parties can effectively perform distributed model training without sharing data across all the parties (e.g., see this recent work presented at NeurIPS18/ML4H for a health-care centric applications and overview).

However, such techniques generally require to strike a tradeoff between **privacy and utility **(i.e., the amount of useful information that can be later extracted from analyzing the data). Roughly speaking, the more privacy-preserving the use of the data needs to be by the receiving party, the more such party needs to specify in advance **a limited set of questions they will ask on the data**. (For a theoretical quantification of privacy/utility tradeoff, see this landmark paper.) One notable example where this assumption is unfeasible is for **data collected for research purposes**. In that case, data is collected based only on a very high-level research hypothesis, and research partners won’t know the specific queries to perform on the data only until after "**exploring"** the data itself. The possibility to retain such “free exploration” utility of the data fundamentally clashes with the ability to effectively preserving privacy it in high dimensions.

In summary, effective de-identification preserving utility for generic “research purposes” is hopeless for time-series data. Intuitively, that makes perfect sense: as we’re striving to build models that are more and more personalized (e.g., for precision medicine) we should expect the data we feed to those models to be more and more specific to each individual.

This fact is fairly well known in the Computer Security community and elucidated, among many other valuable ideas, in the excellent 2015 paper by Narayanan et al. “A Precautionary Approach to Big Data Privacy.” (See blog post summarizing main points here).

In their paper, the authors argue for a “precautionary approach” that goes beyond statistical de-identification considerations, and looks the probability of privacy violation due to re-identification as an end-to-end risk management problem, arguing that **data-release guidelines**, much more than de-identification methods themselves, would play an important role in mitigating the risk. (At a high level, this is in line with what the Privacy Rule prescribes in the Expert Determination route for release of de-identified data).

Narayanan et al. notably recommend that “release of ad hoc de-identified data to the entire public **should require justification; it should not be the default behavior**”, advocating instead for a system implementing “various **gatekeeping functions**, such as demanding proof of academic or peer reviewed standing, requiring ethical training, and designing and overseeing the security of the system.”

They also recommend **being transparent to the final users** (in the case of research, the participants) of the risk of re-identification, which will never be zero: “This transparency should include informing people about re-identification risks stemming from data collected about them.”

Narayanan et al.’s paper should be a recommended reading for everyone working in the data sharing space. The principles outlined there were so **clearly articulated** already back in 2015 that it’s hard to fathom that people still remain surprised when high-dimensional datasets are re-identified.

Proceed to Part III for a discussion on how AI/ML can be used for accelerating, not enabling, re-identification of time-series data.