Responsible AI/ML: privacy, bias and more in the age of the algorithm

Aleatha Parker-Wood
AI/ML at Symantec
Mar 6, 2019

AI is a powerful technology — and anything that creates or redistributes power needs ethical attention. — Dr. Shannon Vallor

Millions of hours have been spent writing about the dangers of the brave new world we find ourselves in. Will we find ourselves in thrall to our robot overlords? Displaced from our jobs and cast aside? The monsters loom large in our minds, but much as in Frankenstein, the true monster was with us from the beginning of the story. The most realistic dangers of AI/ML arise not from the jobs displaced (there will be many), or from sentient machines (expert consensus is that there will not be any, any time soon, not as we understand sentience). Rather, they come from the human beings already among us who collect and store our data and train the algorithms on it.

Would you trust this man with your sensitive data? Frankenstein (1931)

AI ethicist Dr. Shannon Vallor has called AI/ML a dark mirror of society. How data is collected, how it is used, and how value is extracted from it, are all reflections of how our society already works, for better or for worse. Facebook’s ads do not — cannot — learn from or target impoverished populations without access to the internet, and Amazon’s famously biased resume filter was only as unbiased as the predominantly male resumes fed to it. In machine learning, you can predict the future, but only if it looks exactly like the past or evolves from it smoothly and predictably. AI/ML, at least in its current state of advancement, cannot model truly disruptive phenomena like social change.

However, the fact that AI/ML replicates existing systems doesn’t mean we should undersell its dangers. AI/ML is an accelerator and amplifier of those existing processes, and that means we can create new problems faster and with more impact than ever before. AI/ML’s hunger for data has led to huge accumulations of sensitive information about every human or device with internet access (and many without), and that data poses an attractive nuisance to everyone who can gain access to it through power, money, or theft. And AI/ML’s ability to automate and accelerate decision making means that we can make biased, bad decisions at scale as never before.

Data is often described as the new oil. However, it’s more like a barren plot of land, purchased in hopes of striking oil someday. Very few companies in the “big data” rush have a clear idea of when or how they can monetize their data. But it is clear to everyone that entities with very little data are at a disadvantage. That simple dynamic has led to huge stockpiles of data that pose tempting targets for attackers while holding unclear immediate value for their collectors. One of the first things that companies can do to protect themselves from ethical issues around data is to ask whether they have a clear idea of what they will do with it, and whether there is any actual value to be gained. Data that is not collected cannot be stolen or misused, and a good data science consultant early on is generally cheaper than creating and maintaining huge storage pools of useless bits.

Even if a company has a clear plan to monetize its data, ethical concerns will arise. It needs to be aware of how that data can harm its users in the wrong hands, and it also needs to ensure its users are not surprised by the ways it collects and uses data. As the old saw goes, “never do anything [with data] you wouldn’t want to see reported on the front page of the New York Times.”

There are a number of ways to mitigate risk around data collection. The first and simplest is not to collect it. If it must be collected, follow data handling best practices. Use encryption where appropriate. Don’t keep data longer than you need it. Audit all accesses, and the more sensitive the data, the more carefully it should be audited. Consider using algorithmic privacy techniques such as differential privacy or sketching, which allow you to collect aggregate statistics without exposing the underlying data.
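To make the differential privacy idea concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query, using only NumPy. The per-user flags and the epsilon value are illustrative placeholders, not a production recommendation.

```python
import numpy as np

def dp_count(flags, epsilon=0.5):
    """Release a count with Laplace noise calibrated for epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one person's record
    changes the true count by at most 1, so the noise scale is 1 / epsilon.
    """
    true_count = int(np.sum(flags))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many users enabled a sensitive feature, without exposing any one user.
opted_in = np.random.binomial(1, 0.3, size=10_000)  # hypothetical per-user 0/1 flags
print(dp_count(opted_in, epsilon=0.5))
```

The key property is that the noise is calibrated to how much any single person can change the answer, so the released aggregate reveals very little about any individual record.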

Spend some time doing detailed threat modeling for your data, just as you would with any other security scenario. How could the Russian mafia use this? An abusive ex-spouse with admin credentials? A modern-day Third Reich? A foreign government? How would you prevent that abuse?

Once someone has made thoughtful decisions about what they will collect and how it will be stored, and has gotten user consent, they face an even heavier ethical problem. What is the bias in the data? How do the choices of algorithms lead to bias amplification? There are very few truly random samples in this world. Almost all data is biased from the moment it is sampled, and some of those biases have far-reaching consequences. For example, as previously noted, Facebook’s algorithms are based only on the pool of Facebook users. Data fed to word embedding algorithms can reflect the implicit sexism of open-source corpora and of our own society. IoT devices oversample affluent early adopters. Most of the original facial recognition data sets were collected from college students, who are WEIRD (Western, educated, and from industrialized, rich, and democratic countries), not to mention young and racially unrepresentative of the world as a whole. Algorithms trained on these data sets can struggle to recognize women, the elderly, people of color, and other minorities. To add fuel to the problem, the primary developers of AI/ML are also WEIRD, and they can readily overlook potential problems that come from outside their own experience.
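You can probe the word-embedding problem yourself in a few lines, assuming gensim and its pretrained “glove-wiki-gigaword-100” vectors (a sizable one-time download via gensim’s downloader) are available:

```python
# Analogy probe against pretrained GloVe vectors, loaded through gensim's downloader.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use

# "man is to doctor as woman is to ...?"
for word, score in vectors.most_similar(positive=["doctor", "woman"],
                                         negative=["man"], topn=3):
    print(word, round(score, 3))
```

Results vary with the corpus and the model, but embeddings trained on web and news text tend to rank stereotyped occupations near the top of probes like this one, which is exactly the bias being inherited from the training data rather than invented by the algorithm.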

An algorithm needs to work fairly for all the users of its system, present and potential. How does undersampling racial diversity, economic diversity, or gender impact the choices an algorithm makes for those populations? What are the social consequences? Even if the original algorithm began as just a toy, toy algorithms have a way of finding themselves reused throughout industry. Facial recognition data sets created for benchmarking in academia are now being used to train image recognition algorithms for photo tagging, facial unlocking for phones, and surveillance cameras. There is no such thing as an unimportant bias once the paper is published and the code is on the internet.
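One practical habit that follows from this: never stop at a single aggregate accuracy number. Below is a minimal sketch of disaggregated evaluation with pandas; the predictions and subgroup labels are hypothetical placeholders standing in for whatever demographic or cohort information you can responsibly use.

```python
import pandas as pd

# Hypothetical evaluation results: true labels, model predictions, and a subgroup label.
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "group":  ["A", "A", "B", "B", "A", "B", "B", "A"],
})

# Accuracy computed per subgroup rather than as one overall number.
results["correct"] = results["y_true"] == results["y_pred"]
per_group = results.groupby("group")["correct"].mean()
print(per_group)
```

A large gap between groups does not tell you why the model fails, but it flags where undersampling or label bias deserves a closer look before the system ships.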

Ethics is not a tidy checklist. There is no ISO standard for it. Responsible AI/ML requires compassion and broad, systemic thinking about all of your users, not just the ones who are like you, and about how your choices will affect people’s lives. However, behaving responsibly towards users is not only the right thing to do; it is also the business-savvy thing to do. Unethical handling of data and bad algorithms are legally and reputationally problematic. (Just look at Facebook’s stock after the Cambridge Analytica scandal, or consider the consequences of a racially biased algorithm in a regulated industry such as lending.) Responsible AI/ML means considering the downstream effects of data handling and algorithmic choices, and the results benefit everyone, companies and users alike.
