How PEL anonymized patient data

Monika Obrocka
Patient Engagement Lab
5 min read · Oct 7, 2019
Photo by Blake Lisk on Unsplash

Authors: Monika Obrocka, Nathan Begbie, Themba Gqaza, Eli Grant, Charles Copley

Since the publication of the `Optimising mHealth helpdesk responsiveness in South Africa` paper in BMJ Global Health, the Patient Engagement Lab team has been working on language understanding and interpretation in a low-resource language setting. To further this work we had to bridge the gap between university research and its real-life application. When we discussed the data we had collected with academics, we realised that these data are incredibly rare and would be extremely valuable to other programmes in ‘resource-poor language’ settings. There are substantial opportunities for win-win collaborations with universities in which we operationalise research-grade work. Establishing such collaborations comes with numerous challenges, but also substantial benefits. In this blog we provide guidelines on how we did it so that other organisations can take these into consideration. In a future blog we plan to detail the substantial outputs that came from this collaboration.

Given that we work within South Africa, we took our lead from the Protection of Personal Information Act (POPI). Both POPI and the better-known General Data Protection Regulation (GDPR) take a ‘consent-driven’ approach to data sharing: users own their identifiable data, which means identifiable user data can be shared only if the user has given permission to share it. Consent to share personalised data is often difficult to obtain, particularly in programmes operating at scale. The logistics involved in ensuring that people really understand what they are consenting to are prohibitively complicated and expensive; typical research studies, for example, may require a 10–15 minute verbal explanation of the risks involved. In the absence of explicit user permission or consent, we can only share anonymised data.

Correctly anonymised data minimises the risk of re-identifying individuals, but when one dives into the implementation of anonymisation, the concept immediately becomes complex, far more complex than the US Health Insurance Portability and Accountability Act (HIPAA) would lead you to believe. Simply removing the prescribed metadata does not guarantee anonymity: someone trying to identify you may be able to use contextual information (for example, your participation in a programme) to do so. For some context, Latanya Sweeney [1] showed that:

87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides. And even at the county level, {county, gender, date of birth} are likely to uniquely identify 18% of the U.S. population. In general, few characteristics are needed to uniquely identify a person.

Given the above, one needs to exercise a great deal of caution when considering sharing personal data. The question then remains: how does one institute an anonymisation process? And, more importantly, is there a cost-effective way for organisations to do this and thereby enable research collaborations?

The GDPR does not provide technical guidance on how to implement anonymisation. However, a very useful guide from the UK’s Information Commissioner’s Office does offer practical direction in the form of the ‘motivated intruder’ test:

The ‘motivated intruder’ is taken to be a person who starts without any prior knowledge but who wishes to identify the individual from whose personal data the anonymised data has been derived. This test is meant to assess whether the motivated intruder would be successful.

The approach assumes that the ‘motivated intruder’ is reasonably competent, has access to resources such as the internet, libraries, and all public documents, and would employ investigative techniques such as making inquiries of people who may have additional knowledge of the identity of the data subject or advertising for anyone with information to come forward. The ‘motivated intruder’ is not assumed to have any specialist knowledge such as computer hacking skills, or to have access to specialist equipment or to resort to criminality such as burglary, to gain access to data that is kept securely.

Certain data are clearly more identifying than others, so sharing certain variables is more likely to fail the motivated intruder test than sharing others. For example, knowing a first name is identifying; knowing both the first and last name is more identifying still. A field of statistics called disclosure control divides the variables in a dataset into two groups: identifying and non-identifying. These categories are split further, as illustrated in Figure 1. In this post we will focus on the left branch of Figure 1, namely the direct and quasi-identifiers.

Figure 1. Classification of variables in a dataset: identifying variables (direct and quasi-identifiers) versus non-identifying variables.

Direct identifier variables can either be removed from the dataset prior to release or replaced with a one-way hash combined with a random number (a salt). A bare MD5 hash is not sufficient on its own: an identifier such as a phone number has a small enough keyspace that the hashes can simply be brute-forced.
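
As an illustration, a minimal SQL sketch of the salted-hash approach (assuming PostgreSQL with the pgcrypto extension and a hypothetical registrations table whose msisdn column is the direct identifier; the other column names are equally hypothetical) might look like this:

```sql
-- Hash the direct identifier together with a long random salt that is
-- generated once, kept secret, and never released with the data.
-- Requires: CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT
    encode(digest(msisdn::text || 'LONG_RANDOM_SECRET_SALT', 'sha256'), 'hex') AS hashed_id,
    province,
    age_band
FROM registrations;
```

As long as the salt remains secret, the hashed identifier still allows records belonging to the same person to be linked to one another without exposing the phone number itself.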

In addition, we can apply the technique of k-anonymisation [3, 4] to ensure that any combination of quasi-identifier values appearing in the dataset is shared by at least k records, thereby reducing the chances of re-identification. This can be achieved either through generalisation or redaction. To improve the utility of the data, one can also transform it to remove absolute quasi-identifiers. For example, in a dataset containing event-timestamped data (e.g. a pregnancy with an expected delivery date and dates indicating attendance at a clinic), the dates of clinic attendance (e.g. 2019–03–25) can be replaced by time intervals relative to the expected delivery date. There will be more instances of a given time interval than of an absolute date, thereby improving the k-anonymity of the data.
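
As a sketch of that transformation (again with hypothetical PostgreSQL table and column names), the absolute clinic dates could be replaced by week offsets relative to the expected delivery date:

```sql
-- visit_date and expected_delivery_date are DATE columns; subtracting two
-- dates in PostgreSQL yields an integer number of days, so dividing by 7
-- gives whole weeks. The absolute dates are never included in the release.
SELECT
    hashed_id,
    (expected_delivery_date - visit_date) / 7 AS weeks_before_delivery
FROM clinic_visits;
```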

Guidelines [4] for applying k-anonymisation do not prescribe a fixed value for k; the difficulty and risk of re-identification are typically hard to calculate and depend on the quasi-identifiers available, externally available datasets, and so on. Having said that, typical values for k are between 5 and 15 for healthcare data shared with a small group of researchers. We used a value of 50 to be very conservative when sharing with our research partners at the University of Stellenbosch and Duke University.
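
As an illustration of how such a threshold can be checked (a sketch with hypothetical table and quasi-identifier column names, not our exact pipeline), a grouped count flags any combination of quasi-identifiers that falls below k = 50:

```sql
-- Any combination of quasi-identifier values shared by fewer than 50 records
-- violates k-anonymity at k = 50 and needs further generalisation or redaction.
SELECT province, age_band, language_code, COUNT(*) AS group_size
FROM anonymised_registrations
GROUP BY province, age_band, language_code
HAVING COUNT(*) < 50;
```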

Hopefully this blog has provided some food for thought in this arena. We are aiming to publish another blog detailing our collaboration model, which we also think is worth sharing. We do not currently think that sharing our data publicly is feasible, but we are exploring other means of anonymising the data, including differential privacy.

Please let us know your thoughts!

[1] L. Sweeney, Simple Demographics Often Identify People Uniquely. Data Privacy Working Paper 3, Carnegie Mellon University, Pittsburgh, 2000.

[2] A. Narayanan and V. Shmatikov, Robust De-anonymization of Large Sparse Datasets. Proceedings of the IEEE Symposium on Security and Privacy, pp. 111–125, 2008. doi:10.1109/SP.2008.33

[3] P. Samarati and L. Sweeney, Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression. 1998.

[4] K. El Emam and F. K. Dankar, Protecting Privacy Using k-Anonymity. Journal of the American Medical Informatics Association (JAMIA), 15(5), 627–637, 2008. doi:10.1197/jamia.M2716
