In my previous post, I touched on a single but very serious risk that could bring the future of data-driven insights to its knees — unconsented data. We already have more data than we can process, and the rate of data generation is only increasing, so it is easy for data to be used without anyone noticing — especially because data is used en masse. Very rarely is anyone looking at your individual record for the purpose of analysis. Still, when data is found to have been used unethically, the reaction can be disproportionate — probably because people react nervously to learning that things have been going on without their awareness. The proverbial baby gets thrown out with the bathwater.
In the discussion about consented versus unconsented data, I have heard “blanket consent” cited as a valid form of consent that justifies third-party access to data. This is nonsense on so many levels, and while proponents argue that it has been legally sufficient in the past, I invite you to ask Facebook whether legally solid blanket consents are holding much water these days. As far as this writer (and most people in the conversation) is concerned, blanket consent is a snake oil salesman’s way through this issue. Third-party use of data (also known as secondary use) that is transacted in this day and age under the cover of blanket consent is, to the thinking person, effectively unconsented, because we now have the technology to obtain individual consent at scale.
Data can reasonably be assumed to be consented by implication (if not explicitly given) when it is generated by the owner for its primary use. Even for primary-use data, risks exist for the data custodian (the holder of the data, not necessarily the owner). These risks increase when data is used for secondary purposes, and balloon when the secondary use is unconsented.
For example, you tell your doctor about a personal mental health issue for the purpose of clinical care. The doctor becomes the custodian of the data that you consented to give them. If your data is de-identified and sold, rented, or given to an R&D organisation (under a legal blanket consent form that you signed when registering as a patient), the data is now being used for secondary purposes by a third party — the R&D organisation — which has also now become a custodian of your data, albeit de-identified. De-identified data is data that has been stripped of identifying information (such as name, date of birth, street name, etc.), but still contains enough information to support meaningful analysis (postcode, gender, year of birth, etc.).
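To make the distinction concrete, here is a minimal sketch of that de-identification step — a toy example, with entirely hypothetical field names and a fabricated record, not any real organisation’s process. Direct identifiers are dropped; quasi-identifiers useful for analysis are kept.

```python
# Toy de-identification: strip direct identifiers, keep quasi-identifiers.
# All field names and the sample record below are hypothetical.

DIRECT_IDENTIFIERS = {"name", "date_of_birth", "street"}

def de_identify(records):
    """Remove direct identifiers; retain analytically useful fields
    such as postcode, gender, and year of birth."""
    return [
        {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
        for rec in records
    ]

patients = [
    {"name": "A. Smith", "date_of_birth": "1984-03-07", "street": "1 High St",
     "postcode": "3000", "gender": "F", "year_of_birth": 1984,
     "diagnosis": "anxiety"},
]

print(de_identify(patients))
```

Note that what survives — postcode, gender, year of birth — is exactly what makes the data useful for research, and, as we will see, what makes it linkable.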
Continuing the example, let’s say that the R&D organisation is a university with a limited budget that is making great strides in understanding environmental factors contributing to mental health. A disgruntled computer science student at the same university hacks into the IT system, accesses and copies the data, and deletes the R&D organisation’s records. Your private information is now held by a malicious player who stole it from an organisation that you never consented to having it in the first place.
Lucky the data was de-identified, right? Wrong. With the increasing number of publicly available datasets, ever fewer data points are required to link individual records across datasets into a more detailed record (including filling in the gaps created by de-identification). There are examples all over the world, but the most notable case of de-identified data being re-identified came in December 2016, when a team from the University of Melbourne (Dr Chris Culnane, Dr Benjamin Rubinstein, and Dr Vanessa Teague) did just that. As reported by ZDNet.com:
“The dataset containing historic longitudinal medical billing records of one-tenth of all Australians, approximately 2.9 million people, has been found to be re-identifiable by a team from the University of Melbourne, with information such as child births and professional sportspeople undergoing surgery to fix injuries often made public.”
It is important to note that the team at the University of Melbourne were friendly actors. They are data scientists whose very expertise is to see what can be done. What they found to be possible is alarming, and as soon as it was discovered, they notified the Australian Digital Health Agency (ADHA — the government agency administering Australia’s My Health Record, MHR), which immediately shut off secondary use. In this writer’s view, both the University of Melbourne and the ADHA acted appropriately. The Australian government has since introduced legislation to make the re-identification of MHR data illegal. While well-intended, it is unlikely that this will be effective, as it is hard to imagine that people who would maliciously re-identify de-identified data would be deterred by new laws. Pirates don’t only sail where the law permits.
Also important to note is that with the increasing volume of publicly available data, re-identification isn’t an issue of probability, but inevitability.
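The mechanics of such a linkage attack are simple enough to sketch in a few lines. The following is purely illustrative — the datasets are fabricated, and real attacks (like the University of Melbourne work) are far more sophisticated — but it shows the principle: join the “de-identified” data to a public dataset on the quasi-identifiers they share, and wherever that combination is unique, the record is re-identified.

```python
# Toy linkage (re-identification) attack: join a de-identified dataset
# to a public one on shared quasi-identifiers. All data is fabricated.

def link(deidentified, public, keys=("postcode", "gender", "year_of_birth")):
    """Re-attach names wherever the quasi-identifier combination
    matches exactly one record in the public dataset."""
    index = {}
    for rec in public:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    matches = []
    for rec in deidentified:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # unique match -> re-identified
            matches.append({**rec, "name": candidates[0]["name"]})
    return matches

health = [{"postcode": "3000", "gender": "F", "year_of_birth": 1984,
           "diagnosis": "anxiety"}]
public_roll = [{"name": "A. Smith", "postcode": "3000", "gender": "F",
                "year_of_birth": 1984}]

print(link(health, public_roll))
```

As more public datasets accumulate, more quasi-identifier combinations become unique, which is why re-identification trends toward inevitability rather than probability.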
Back to the example. The R&D systems are hacked, data is stolen, and patients’ personal data is re-identified. Public relations, reputational, legal, and ethical nightmares ensue. The greatest right to outrage belongs to the data owners, and remedial steps are taken. But whatever steps are necessary to recover, the progress made to better the mental health of thousands will also be severely hampered, if not lost. Not because data was stolen, but because data was stolen from a custodian whom the owners never really consented to have it in the first place.
The Cambridge Analytica scandal is that Facebook (the data custodian) allowed the data of millions of users to be used without their permission for a purpose they did not agree to. But what if you could get the users to agree to its use? What if the users had consented? Would it be scandalous at all?
Consented data recognises the data owner and protects the data custodian. Consented data means that the owner assumes the responsibility for each data custodian’s possession (and the owner may even have received some portion of the reward for the data transaction), while the primary-use custodian — and all other consented custodians — are only at risk to the degree to which they failed to take reasonable steps to protect the data. The doctor continues to treat patients, the R&D organisation continues to work toward societal improvement, and the data pirate is, hopefully, prosecuted under law.
Data is powerful. Data has potential. Data must be used — but for it to be used in the long term for the greater good, it must be de-risked. It must be consented.