Can AI expose sensitive health information of large swaths of research participants by mining their fitness tracker data?
The recent study "Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning" (Na et al., JAMA Network Open, December 21 2018) has created quite a clickbait buzz in mainstream news and on social media. A few examples:
Anonymous patient data may not be as private as previously thought
(Reuters Health) - For years, researchers have been studying medical conditions using huge swaths of patient data…
Study says hackers could match 'de-identified' health info to patients
New JAMA Network Open study was able to use artificial intelligence to successfully match 'de-identified' Fitbit data…
- Aggregated Data From Wearables Don’t Fully Conceal Individuals’ Identities, MedPage Today, December 21 2018
- Our healthcare data is not private anymore: Study reveals that machine learning can be used to re-identify individuals from physical activity data, Packt Hub, December 24 2018
- Advancement of artificial intelligence opens health data privacy to attack, Berkeley News, December 21 2018
- Advances in AI threaten health data privacy: Study, The WEEK, December 26 2018
- Artificial intelligence advances threaten privacy of health data, Science Daily, January 3 2019
- AI in wearables and mobiles threatening privacy of health data: Report, Gadgets Now, January 5 2019
- Advances in artificial intelligence threaten privacy of health data, Market Business News, January 6 2019
- How private health info can be identified through fitness tracker data, MedCity News, January 31 2019
In the midst of recent events underscoring the need to rethink privacy in the era of big data (see this recent Harvard Business Review commentary for an overview), it is no surprise that such a topic got a lot of attention from mainstream media. However, many readers, and a number of the reports themselves, have been confused about how to interpret the findings of the Na et al. study.
At Evidation, we do research on data similar to that discussed in the Na et al. article, including the NHANES dataset itself. Based on that experience, in this post series I'll try to clarify the implications of the Na et al. study for:
- Privacy implications for participants in NHANES study: Keep calm, no breach of privacy occurred.
- Privacy implications for wearable devices data and other continuously collected time-series data: Breaches of privacy will likely occur.
- The role of Artificial Intelligence and Machine Learning in re-identification attacks: AI is not the culprit here, big (high-dimensional) data is.
Any feedback is appreciated. Feel free to leave a comment on the posts or reach out via Twitter @calimagna. More information on my background and research can be found here.
Part I: NHANES participants' privacy is safe. (For now.)
Let’s start with some background on the NHANES dataset used in the Na et al. study. The study looked at wearable data collected in the context of the National Health and Nutrition Examination Survey (NHANES), run by the National Center for Health Statistics of the US Centers for Disease Control and Prevention (CDC).
In the 2003–2004 and 2005–2006 waves, NHANES also collected Physical Activity Monitor (PAM) data, consisting of 1-minute-resolution intensity data measured over 7 days for 9601 adults and 5030 children. An Actigraph GT3X-plus monitor was used. (See the PAM data acquisition protocol here.) (For those interested in Health Data Science, my team at Evidation, in collaboration with Professor Alex Frank at UCSB, has put together a repository with some ETL scripts and explorations for the NHANES 2005 PAM dataset.)
Then the authors do the following:
- Split the NHANES data into datasets A and B. Dataset A contains PAM and demographics data for all participants for Monday through Wednesday; dataset B contains the same for Thursday through Friday. In other words, A and B fully overlap in terms of participants, while being completely disjoint in terms of the time intervals covered.
- Aggregate the minute-level PAM data in both datasets A and B to 20-minute time resolution.
- Show that the authors' best algorithm can match a participant's record in dataset A with the correct record in dataset B in over 90% of cases for adults (over 85% for children), looking only at the participants' 20-minute aggregate physical activity and demographic data.
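To make the procedure concrete, here is a minimal sketch of the split-aggregate-match pipeline on simulated activity data. The feature choice (average daily profile) and the nearest-neighbor matcher are illustrative assumptions of mine, not the authors' actual algorithms, which are more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mins_per_day = 50, 1440

# Simulate minute-level activity counts: each participant has a
# characteristic daily intensity profile plus day-to-day noise.
profiles = rng.gamma(2.0, 50.0, size=(n, mins_per_day))
week = profiles[:, None, :] + rng.normal(0, 20, size=(n, 7, mins_per_day))

# Split into non-time-overlapping datasets, as in the paper:
# A = Monday-Wednesday, B = the remaining days of the week.
A_days, B_days = week[:, :3], week[:, 3:]

def featurize(days, bin_mins=20):
    """Average daily profile, aggregated to 20-minute bins."""
    daily = days.mean(axis=1)                       # (n, 1440)
    return daily.reshape(n, -1, bin_mins).mean(-1)  # (n, 72)

fa, fb = featurize(A_days), featurize(B_days)

# Match each record in A to its nearest neighbor in B (Euclidean distance).
dists = ((fa[:, None, :] - fb[None, :, :]) ** 2).sum(-1)
match = dists.argmin(axis=1)
accuracy = (match == np.arange(n)).mean()
print(f"re-identification accuracy: {accuracy:.0%}")
```

Even this naive matcher re-identifies most of the simulated participants, because a person's average activity profile is far more stable across days than the day-to-day noise around it.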
In summary, aggregated physical activity and demographic data were sufficient to match participants in dataset A with their records in dataset B.
The main point that seems to have been missed by many is that the re-identification performed is fully contained within the anonymized NHANES dataset. In other words, the algorithm could reliably tell that NHANES Participant 0001337 in dataset A and Participant 0001337 in dataset B were the same person, not that Participant 0001337 is Ada Lovelace, 49, from Santa Barbara, CA.
It is also worth noting, as discussed in a follow-up editorial written in response to the article, that most of the re-identification accuracy really comes from the demographics data, not the activity data. Quoting the response: "From Table 3 [first column] in the [Na et al] article, we see that adult re-identification accuracy on demographic characteristics alone is more than 80% for both cohorts (2003–2004 and 2005–2006), whereas the accuracy from activity data alone is less than 7%."
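The arithmetic behind this is worth spelling out. Demographics are identical in datasets A and B, so a matcher that sees only demographics must guess uniformly within each group of participants sharing the same demographic tuple; its expected accuracy is then simply (number of distinct tuples) / (number of participants). The toy simulation below (hypothetical demographic fields, not the NHANES ones) shows how accuracy jumps once demographics are fine-grained enough to make most tuples unique:

```python
import random
from collections import Counter

random.seed(0)
N = 5000

def expected_match_accuracy(tuples):
    """Within a class of k identical tuples, a random one-to-one assignment
    gets 1 record right in expectation, so expected accuracy is
    (#distinct tuples) / N."""
    return len(Counter(tuples)) / len(tuples)

# Coarse demographics: age bracket and sex only.
coarse = [(random.randrange(6), random.randrange(2)) for _ in range(N)]

# Fine demographics (toy values): exact age, sex, height, weight.
fine = [(random.randrange(85), random.randrange(2),
         random.randrange(120), random.randrange(300)) for _ in range(N)]

print(f"coarse demographics: {expected_match_accuracy(coarse):.1%}")
print(f"fine demographics:   {expected_match_accuracy(fine):.1%}")
```

With only a dozen possible coarse tuples shared by 5000 people, demographics-only matching is near-useless; with fine-grained tuples, almost every participant is unique and matching is near-perfect. This is the same k-anonymity intuition behind the editorial's 80% figure.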
In summary, it was shown that the successful re-identification is confined within the NHANES dataset, and that it could have been achieved, though with lower accuracy, even without using activity data. No information about NHANES participants that wasn't already publicly available was disclosed by the work of Na et al.
We may be relieved that the privacy of thousands of NHANES research participants wasn't breached by the findings of the study. However, as we'll see in Part II of this series, that statement is only true here and now. Re-identification is an inherent risk of this kind of data, and the risk is not confined within the boundaries of the specific dataset. In this sense, Na et al.'s procedure of splitting the NHANES dataset into two non-time-overlapping datasets obscures the real issue with this kind of data: if an attacker does have a time-overlapping dataset, then with fine-grained time-series data the matching procedure becomes very simple and yields a near-certain re-identification.
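A quick sketch shows why time overlap changes everything (hypothetical numbers; the key assumption is that the attacker's side-channel data covers the same clock time as the released data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mins = 1000, 60  # 1000 participants, one shared hour of minute-level data

# Released "de-identified" minute-level activity counts.
released = rng.poisson(30, size=(n, mins)).astype(float)

# The attacker's side channel (e.g. data the target shared elsewhere)
# covers the SAME time window, with only small measurement noise.
target = 123
observed = released[target] + rng.normal(0, 1, size=mins)

# With overlapping timestamps, matching reduces to a nearest-neighbor
# lookup on the raw series -- no machine learning required.
guess = ((released - observed) ** 2).sum(axis=1).argmin()
print(guess == target)  # True with overwhelming probability
```

One noisy hour of overlapping minute-level data is enough to pick one person out of a thousand, because the chance of two people producing near-identical minute-by-minute traces over the same window is vanishingly small.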
The risk of such time-overlapping datasets becoming available to unintended third parties is one we will increasingly have to guard against in the future. In this sense, the privacy of the NHANES participants, like that of any research participant in a publicly released dataset, is increasingly at risk as adversaries acquire more data and computational resources. For example, if Ada had been in the NHANES PAM cohort and had boasted on Facebook about her activity achievements, read off her Actigraph during the study, then Ada's boss, who is friends with her on Facebook, could potentially use Na et al.'s method to re-identify her and learn about her mental health, sexual behavior, and alcohol use, all also collected as part of the NHANES study.
With the FDA's recent push on using Real-World Evidence to support regulatory decisions (including the release of an app for collecting digital data in studies), and the recent partnership between Fitbit and the NIH to let Fitbit users link their accounts to the All of Us precision medicine research project, it becomes increasingly likely that data collected in research studies will come directly from sources, such as consumer apps and wearable devices, whose data can also be shared by the participant(/consumer) through other channels, e.g., social networks. For example, participants wearing a Fitbit during a study can share their daily step achievements on Twitter, or participants in diabetes studies wearing continuous glucose monitoring (CGM) devices may post a picture of their CGM time chart on Instagram (an example covered in the forward-thinking Tidepool EULA). If the study data is then made publicly available by the researchers, anyone can link the data posted on social networks with that in the study, but unlike in the Na et al. paper, this time connecting it to the true (online) identity of the participants.
These risks may seem only theoretical, but as we'll see in Part II, such attacks are far from impractical from a computational perspective. The risks also increase over time as more data is generated. Datasets already released that are not at risk of re-identification today may quickly lend themselves to new unintended inferences enabled by new data becoming available at any point in the future.
But how bad is bad? Continue to Part II to learn how to quantify the risk of re-identification, and how new privacy frameworks around data sharing and future-proof privacy-protecting methods are being developed to mitigate these issues.