Part III: AI/ML are not necessary for wearables data re-identification. (But they may help.)

Luca Foschini
5 min read · Feb 5, 2019


This is a three-part post series on re-identification of wearable data. In Part I we looked at whether the privacy of NHANES participants is immediately impacted by the findings in the article by Na et al. In Part II we saw that the risk of re-identification is intrinsic to the high dimensionality of the data. In this part, we look at the role of AI/ML in providing new re-identification methods.

Given what we learned in Part II, the specific re-identification method used for high-dimensional data such as time series is, in a sense, only a practical detail, with more advanced techniques such as Artificial Intelligence and Machine Learning only accelerating, not directly enabling, a potential breach of privacy. Yet even such an acceleration may be worth ascribing to AI/ML if they were a necessary condition for it. The exact matching between daily step-count time series of the kind discussed in Part II can be implemented in a few lines of code, yet as we saw in Part I, much of the press around the Na et al. paper cited "Advances in AI" as the cause of the new threat to privacy. So, how crucial is the role played by AI/ML in the Na et al. NHANES re-identification study?
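To make the "few lines of code" claim concrete, here is a minimal sketch of the exact-matching attack from Part II. The data layout is hypothetical: `steps_A` and `steps_B` are pandas DataFrames indexed by pseudonymous user ID, with one column per day of step counts over the same days.

```python
# Minimal sketch of exact matching between two step-count datasets covering the same days.
# steps_A and steps_B are hypothetical DataFrames: one row per user, one column per day.
import pandas as pd

def exact_match(steps_A: pd.DataFrame, steps_B: pd.DataFrame) -> dict:
    """Map each user ID in A to the single user ID in B with identical daily step counts."""
    # Index B's users by their full sequence of daily step counts.
    key_to_b_ids = {}
    for b_id, row in steps_B.iterrows():
        key_to_b_ids.setdefault(tuple(row), []).append(b_id)

    matches = {}
    for a_id, row in steps_A.iterrows():
        candidates = key_to_b_ids.get(tuple(row), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches[a_id] = candidates[0]
    return matches
```

No learning is involved: uniqueness of the step-count sequence alone does the work.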

In their defense, Na et al. are solving a more difficult problem in which participants are re-identified across two datasets that do not overlap in time. (Recall that dataset A in the study contains only Monday-through-Wednesday data, and dataset B only Thursday-through-Friday data.) In an analogy with our example from Part II, organization B would get in possession of A's data for the same users as B, but each user's days of data would differ between A and B. Na et al. show that even in this more general case re-identification can be performed for a large fraction of users. Though, as mentioned in Part I, I believe this generalization isn't motivated by any practical consideration and doesn't increase the scope of the risk, it does require approximate matching strategies able to learn user behavior that generalizes across days, thus perhaps justifying summoning AI/ML methods.

However, even in the more general case of Na et al., the problem remains one of time-series classification (TSC), a multi-decade-old field of research in the area of data mining (the uncool, now-forgotten uncle of AI). See http://www.timeseriesclassification.com/ for an overview of the field, including benchmark datasets.

Precisely because TSC has seen so many "heavy duty", complex methods proposed over time, often without justification for the added complexity, in their review paper on TSC, Eamonn et al. (echoed by Lipton et al. in "Troubling Trends in Machine Learning Scholarship") recognize the importance of justifying new methods by comparing them against strong baselines recognized in the field. In the TSC community, those baselines are 1-NN (Nearest Neighbor) matching on ED (Euclidean Distance) and on DTW (Dynamic Time Warping, a way of allowing "elastic" transformations of a query time series before attempting the match). In other words, if one wants to claim that AI/ML are a necessary condition for a scientific advance, strong evidence should be provided that established methods for solving the same problem fall short. It would have been nice to see the performance of any of these "pre-AI" baseline methods reported by Na et al. in their study.
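For reference, a bare-bones version of those baselines fits in a screenful of NumPy. This is only a sketch under illustrative assumptions: `X_a` and `X_b` are arrays of shape (n_users, n_days) of daily step counts, where row i of each array belongs to the same (unknown) person.

```python
# Sketch of the standard TSC baselines: 1-NN matching under Euclidean distance (ED)
# and Dynamic Time Warping (DTW). Data shapes are illustrative, not from Na et al.
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic-programming DTW distance."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def one_nn_accuracy(X_a, X_b, dist=euclidean):
    """Fraction of rows in X_a whose nearest neighbor in X_b is the true counterpart."""
    hits = 0
    for i, x in enumerate(X_a):
        nearest = min(range(len(X_b)), key=lambda j: dist(x, X_b[j]))
        hits += (nearest == i)
    return hits / len(X_a)
```

Swapping `dist=euclidean` for `dist=dtw` gives the second baseline; any proposed AI/ML method would need to clearly beat both to justify its complexity.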

Nonetheless, I do believe there is an interesting connection to explore between self-supervised learning (SSL), a recent technique in machine learning, and time-series re-identification. Specifically, following our example in Part II, I think B could train individualized SSL models (e.g., autoencoders) to accurately reconstruct the individual time series of its users. B could then use the same models, trained on B's time series, to reconstruct the time series in the A dataset, and call a user in A a "match" (i.e., successfully re-identified) when their time series can be accurately reconstructed by a model trained solely on B's data. The intuition is that in order to accurately reconstruct B's time series, a model has to learn individualized patterns of B's users, so that a model trained on individual X's time series should perform significantly better at reconstructing other parts of X's time series (including days not present in B) than a model trained on any other individual Y ≠ X. This idea is similar to what we do in our NeurIPS18/ML4H paper here, though in the completely different context of building individualized models of heart rate response from minute-level step data.
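A rough sketch of what that could look like is below. Everything here is an assumption for illustration: the per-user autoencoder architecture, the weekly window length, the hyperparameters, and helper names such as `train_user_autoencoder` and `match_a_user` are all hypothetical, not the method of Na et al. or of our ML4H paper.

```python
# Illustrative sketch: per-user autoencoders trained on B's step-count windows,
# then used to score (via reconstruction error) which B user best explains an A user.
import torch
import torch.nn as nn

class StepAutoencoder(nn.Module):
    """Tiny dense autoencoder over fixed-length windows of step counts."""
    def __init__(self, window: int = 7, hidden: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(window, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, window)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_user_autoencoder(windows: torch.Tensor, epochs: int = 200) -> StepAutoencoder:
    """Fit one autoencoder to a single B user's windows, shape (n_windows, window)."""
    model = StepAutoencoder(window=windows.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(windows), windows)
        loss.backward()
        opt.step()
    return model

def reconstruction_error(model: StepAutoencoder, windows: torch.Tensor) -> float:
    with torch.no_grad():
        return nn.functional.mse_loss(model(windows), windows).item()

def match_a_user(a_windows: torch.Tensor, b_models: dict) -> str:
    """Assign an A user to the B user whose model reconstructs A's windows best."""
    return min(b_models, key=lambda b_id: reconstruction_error(b_models[b_id], a_windows))
```

In practice one would also threshold the best reconstruction error before declaring a "match", so that users present only in A are not force-assigned to someone in B.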

If you’re interested in such a project don’t hesitate to contact me — I’m hiring summer interns! :)

Conclusion

In this three-part post series (see Part I, Part II) I have looked at the privacy of wearable data, following the press around the recent study "Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning" (Na et al., JAMA Network Open, December 21, 2018).

The Na et al. study has the merit of having rekindled the discussion on data privacy and on the weaknesses of de-identification techniques for high-dimensional data.

Though the privacy of NHANES research participants has not been compromised by the findings of the Na et al. study, the risk of re-identification of time-series datasets remains significant. This risk is driven much less by advances in AI/ML methods than by a lack of general awareness of the intrinsic difficulty of preserving privacy in high-dimensional datasets.

Several valuable guidelines for data governance have been proposed in the computer security arena to mitigate the problem, some of which are discussed in Part II. Those guidelines stress the urgent need to stop treating de-identification as a sufficient privacy protection on its own and to instead consider risk mitigation at every step of the data chain of custody, starting with being transparent with end users about the perils of re-identification of the data being collected from them.

Acknowledgments: I'd like to thank the Data Team at Evidation, Tom Quisel, Brian Bot & Larsson Omberg at Sage Bionetworks, Professor Yu-Xiang Wang at UCSB, and Andy Coravos at Elektra Labs for their insightful comments and suggestions.
