Ethical AI Beyond Just Talking: Part Two
How to minimize risks surrounding algorithmic privacy
Any artificial intelligence (AI) or machine learning (ML) application that relies on data about individuals creates privacy concerns. Even when good-faith attempts are made to summarize and anonymize records, the risk of privacy violations is real.
In part one of this series on ethical AI, we discussed the importance of monitoring and addressing unintentional bias in machine learning models. Today, we’re unpacking another key aspect of ethics in the context of AI/ML: algorithmic privacy.
Read on to learn how even the most well-meaning companies can experience privacy issues, and how to establish solid countermeasures to minimize associated risks.
Why do serious privacy breaches happen even in well-meaning organizations?
As noted in The Ethical Algorithm: The Science of Socially Aware Algorithm Design, a surprisingly small number of personal, idiosyncratic facts may be enough to uniquely identify us among the billions of people in the world (or at least among those appearing in a large database). These facts include things like when we watched a particular movie, the last handful of items we purchased on Amazon, or a photo of us arriving at a specific address.
Imagine a government agency that has records summarizing the hospital visits of every state employee. The agency decides to share an anonymized version of these records with another agency to support a research initiative involving ML algorithms that need individual records to “train” a regression model. Agency employees preparing the shared dataset may believe that by removing all record identifiers such as names, addresses, and social security numbers, they’ve eliminated the risk of exposing personal medical records.
This assumption was famously proven incorrect in 1997 when Latanya Sweeney — then an MIT graduate student and now a professor at Harvard — identified the medical records of Massachusetts Governor William Weld from anonymized information publicly available in a state insurance database. Sweeney only had to combine the governor’s anonymized medical records with the voter rolls for the city of Cambridge, Massachusetts, to make the identification.
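To make the mechanics of this kind of linkage attack concrete, the sketch below joins an “anonymized” dataset to a public one on shared quasi-identifiers. The file names and column names are hypothetical, and the snippet assumes the pandas library; it illustrates the general attack pattern, not a reconstruction of Sweeney’s actual analysis.

```python
import pandas as pd

# Hypothetical file names and columns, for illustration only.
# "Anonymized" hospital records: names removed, quasi-identifiers kept.
hospital = pd.read_csv("anonymized_hospital_records.csv")  # zip_code, birth_date, sex, diagnosis
# Public voter roll: names listed alongside the same quasi-identifiers.
voters = pd.read_csv("voter_roll.csv")                     # name, zip_code, birth_date, sex

# Joining on the shared quasi-identifiers re-attaches names to diagnoses
# wherever the combination (zip_code, birth_date, sex) is unique.
linked = hospital.merge(voters, on=["zip_code", "birth_date", "sex"])
print(linked[["name", "diagnosis"]].head())
```

Whenever that combination of attributes is unique in both datasets, the join quietly turns an “anonymous” medical record back into a named one.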
How can you minimize risks of privacy violation in the context of AI/ML applications?
First, don’t let the decision-makers in your organization underestimate the risks of sharing “anonymized data” about individuals, even with the most trusted partners. Examples of privacy breaches like the ones described above can be a powerful way to communicate to your stakeholders the inherent risks of traditional data privacy approaches based on anonymization.
Second, think through the specific risks you want to mitigate by keeping data private. Private data can be used in harmful ways, such as determining which medical treatments a person is entitled to, or allowing a stalker to locate a victim.
Third, identify the approaches that can be used to mitigate these risks, such as differential privacy. Using differential privacy, algorithms inject a certain amount of “noise” into records before the data is shared. This makes it much harder for anyone to guess which individuals are included in the dataset, as well as which pieces of information are real or fake.
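As a minimal sketch of the noise-injection idea, the snippet below adds Laplace noise to each record before it is shared, in the spirit of a local differential privacy mechanism. The epsilon value, sensitivity, and example data are purely illustrative; real deployments calibrate the noise to the data’s sensitivity and an agreed privacy budget.

```python
import numpy as np

def privatize_records(values, epsilon=1.0, sensitivity=1.0, rng=None):
    """Add Laplace noise to each record before it is shared. The noise scale
    (sensitivity / epsilon) is illustrative; real systems calibrate it to the
    data's sensitivity and an agreed privacy budget."""
    rng = rng or np.random.default_rng()
    values = np.asarray(values, dtype=float)
    return values + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=values.shape)

# Illustrative only: number of hospital visits per employee in the past year.
visits = [0, 2, 1, 0, 5, 1]
print(privatize_records(visits, epsilon=0.5))
```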
A common example of differential privacy is the randomized poll. Consider a survey that asks undergraduate students whether they’ve ever cheated on an exam. To protect individual students from any repercussions of a “yes” response, a known fraction of the responses is randomly flipped from “yes” to “no” before they’re recorded, and vice versa. Now each student has plausible deniability: even if their individual response is exposed in a privacy breach, they can credibly argue that their answer was flipped by the algorithm. The modified data contains “noise,” and the individual responses are subject to significant error. However, because we know exactly how those errors were introduced, we can work backward and remove them to learn important facts about the aggregate data, such as the percentage of students who have cheated at some point.
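This flip-then-correct routine can be sketched in a few lines of Python. The flip probability, sample size, and true cheating rate below are made up for illustration; the point is that a known randomization rule lets us recover an accurate aggregate even though no individual answer can be trusted.

```python
import numpy as np

def randomized_response(true_answers, p_keep=0.75, rng=None):
    """Keep each answer with probability p_keep; flip it otherwise."""
    rng = rng or np.random.default_rng()
    flip = rng.random(len(true_answers)) > p_keep
    return np.where(flip, ~np.asarray(true_answers, dtype=bool), true_answers)

def estimate_true_rate(noisy_answers, p_keep=0.75):
    """Work backward from the noisy 'yes' rate:
    observed = p_keep * true + (1 - p_keep) * (1 - true)."""
    observed = np.mean(noisy_answers)
    return (observed - (1 - p_keep)) / (2 * p_keep - 1)

# Illustrative only: simulate 10,000 students, 30% of whom have cheated.
rng = np.random.default_rng(0)
truth = rng.random(10_000) < 0.30
noisy = randomized_response(truth, p_keep=0.75, rng=rng)
print(f"Observed 'yes' rate: {noisy.mean():.3f}")
print(f"Recovered estimate:  {estimate_true_rate(noisy):.3f}")  # close to 0.30
```

Any single student’s recorded answer may well be wrong, yet the recovered aggregate lands very close to the true 30 percent.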
That said, because of the introduced noise, data from a randomized poll may lose its utility for algorithmic decision-making. For example, this could happen when the goal is to predict outcomes at the individual level rather than producing aggregates or percentages as in the example above.
Consider the following scenario: a nonprofit provides financial help for low-income students seeking to pursue graduate studies. To maximize the social impact of its grants, it wants to select the individuals most likely to successfully complete the graduate program. A data scientist is hired to build a supervised ML model that predicts which new applicants are likely to graduate.
Imagine that the answer to the question, “Have you ever cheated on an exam?” has been shown to be highly predictive of student success or failure. Moreover, a “yes” answer may increase or reduce the likelihood of success depending on other factors, such as age and whether the student was also working while attending college. Because of its predictive value, the question would be useful to include in a candidate questionnaire. And because there is no way to know whether a “yes” answer will count as a positive or negative factor in predicting whether the student will graduate, candidates won’t have an incentive to lie.
Unfortunately, if the randomized poll technique is used to protect the privacy of individual applicants, an algorithm won’t be able to find the appropriate patterns and variable interactions to make accurate predictions for new applicants due to the deliberate noise (the answers switched from “yes” to “no,” and vice-versa).
However, differential privacy techniques can be applied to protect applicant privacy while preserving enough information to train a model. Consider the fictitious example in the table below:
On the left is a fabricated record containing the candidate’s details. The right side shows the result of a scrub system that suppresses identifying information. Here, the privacy protection causes some degradation in data quality in exchange for a decreased risk of a privacy breach.
Of course, for an anonymization process like the one shown above to be effective, it needs to be customized for the specific business problem. For example, if the nonprofit rarely receives applications from South America and the “origin” field is populated as described, it might still be possible for someone to re-identify this candidate. To mitigate the risk of re-identification, careful methods of substituting and removing information may need to be combined into a solution that works for a specific business use case.
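A scrub step like the one described above might look roughly like the sketch below. The field names, age bands, and region mapping are hypothetical, and a real system would tune how aggressively each field is suppressed or generalized for the specific business use case.

```python
# Hypothetical lookup table mapping countries to broad regions.
REGION_BY_COUNTRY = {"Brazil": "Americas", "Chile": "Americas", "France": "Europe"}

def scrub_record(record: dict) -> dict:
    """Suppress direct identifiers and generalize quasi-identifiers.
    Field names and category boundaries are illustrative only."""
    scrubbed = dict(record)
    for field in ("name", "address", "ssn"):      # drop direct identifiers outright
        scrubbed.pop(field, None)
    age = scrubbed.pop("age", None)               # replace exact age with a coarse band
    if age is not None:
        scrubbed["age_band"] = "under 25" if age < 25 else "25-40" if age <= 40 else "over 40"
    origin = scrubbed.pop("origin", None)         # map country of origin to a broad region
    if origin is not None:
        scrubbed["region"] = REGION_BY_COUNTRY.get(origin, "other")
    return scrubbed

print(scrub_record({"name": "A. Applicant", "age": 27, "origin": "Chile", "gpa": 3.6}))
# {'gpa': 3.6, 'age_band': '25-40', 'region': 'Americas'}
```

The key design choice is matching the breadth of each generalized category to how rare the underlying value is: a lone applicant from an uncommon country needs a broader bucket than one from a well-represented region.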
From creating the problem to helping solve it
The proliferation of big data, ML models, and AI solutions across industries contributes to an increased risk of privacy breaches. That said, the same technology can also be used to protect individuals from the harms associated with the disclosure or re-identification of sensitive information.
For instance, AI software can be used to achieve differential privacy and create synthetic data with the same statistical properties as real-world records. In medicine, it can be used to create a “parallel universe” of medical records for fictitious cancer patients, generated so that the patterns in the data, such as the relationships between age, symptoms, diagnoses, and reactions to medication, mirror those found in real-world patient records. This enables a model to learn from the synthetic data and predict patient outcomes or make customized treatment recommendations.
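In its simplest form, a synthetic-data generator fits a statistical model to the real records and then samples new, fictitious records from that model. The toy sketch below fits a multivariate Gaussian to a few numeric features; production tools use far richer generative models and add formal privacy guarantees, neither of which this illustration provides.

```python
import numpy as np

def fit_and_sample(real_data, n_synthetic, rng=None):
    """Toy synthetic-data generator: fit a multivariate Gaussian to the real
    records and sample new ones with the same means and correlations."""
    rng = rng or np.random.default_rng()
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_synthetic)

# Illustrative only: three numeric features per patient (e.g., age, dosage, lab value).
rng = np.random.default_rng(1)
real = rng.normal(loc=[55, 20, 4.2], scale=[12, 5, 0.8], size=(500, 3))
synthetic = fit_and_sample(real, n_synthetic=500, rng=rng)

print(np.corrcoef(real, rowvar=False).round(2))       # correlations in the real data
print(np.corrcoef(synthetic, rowvar=False).round(2))  # similar structure in the synthetic data
```

Comparing the correlation matrices of the real and synthetic datasets shows that the aggregate structure survives even though no synthetic row corresponds to a real patient.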
To go from well-meaning to well-acting, conscientious organizations using AI/ML solutions must consider the privacy risks involved and ensure that the appropriate countermeasures are in place, from anonymization auditing and differential privacy to scrub systems and synthetic data.
Slalom is a global consulting firm focused on strategy, technology, and business transformation. Learn more and reach out today.