Technotrash Talk

Humans are Doing the Hard Work that Algorithms Can’t. It’s Time for a Code of Ethics

Jerry Waller
Sep 23, 2019


Following the latest privacy scandals involving the reliance of digital voice assistants on human contractors, a lot of the outrage comes from technocrats. Anyone with a realistic understanding of how data is analyzed knows algorithms aren’t capable of comprehending how chaotic the data — in this case voice data — really is. Time and again, it is people — often poorly paid contractors — who train the algorithms so they can make sense of what’s often “messy” data.

Given that human element, we need a code of ethics for the companies and people who work with this raw and potentially sensitive data. Such a code would create an environment that is respectful of user privacy, honest about the current limitations of AI, and more transparent about who is actually doing the hard work.

I teach workshops on cleaning data, the step in data analysis in which raw (or only minimally pre-processed) data is standardized before being passed along to the next part of the data analysis pipeline. One of the fundamental skills I try to impart is understanding the structure of a dataset: what it looks like before it is standardized, and what it should look like after.

Algorithms are really good at the latter, but terrible at the former. They don’t know the semantic or literal differences between “July 4th, 2019” and “2019-07-04” (both valid ways of representing the same date) until a human standardizes one of them into a format the algorithm has been programmed to expect.
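To make that concrete, here is a minimal Python sketch of the kind of date normalization a person has to spell out in advance. The list of accepted formats and the helper name normalize_date are my own illustration, not any particular standard; anything the rules can’t match gets handed back for human review.

```python
import re
from datetime import datetime

# Formats a human has decided to accept; everything else needs review.
KNOWN_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y"]

def normalize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if unrecognized."""
    # Drop ordinal suffixes ("4th" -> "4") so strptime can parse them.
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)", r"\1", value.strip())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # the algorithm gives up; a human decides what it means

for raw in ["July 4th, 2019", "2019-07-04", "7/4/2019", "next Thursday"]:
    print(raw, "->", normalize_date(raw))
```

Every branch of that sketch encodes a human judgment: which layouts count as valid dates, and what happens when none of them fit.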

As computer science professor Jeffrey Heer noted in a 2014 New York Times article, “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.” In other words, data requires human intervention before it can even begin to be understood by a computer.

The pharmaceutical industry, for example, requires clinical trials data to be standardized before it reaches the desks of those who write the programs to analyze it. The collection of clinical trials data, with forms for filling in basic information like age, weight and blood pressure, is fraught with data entry mistakes.

That’s why the industry created an infrastructure that protects both the integrity of clinical trials data as well as the anonymity of the individuals participating in the studies.

This type of institutional oversight is lacking within the corporations creating and monetizing voice assistants. At a fundamental level, voice itself is raw, unprocessed data. And while computers (and the algorithms that they run) have gotten quite good at deciphering distinct, clearly pronounced words spoken in specific languages, they struggle with dialects and slang, and they certainly don’t understand context. It should be obvious from the outset, then, that humans have been listening to the recordings all along.

Even invoking a digital assistant is prone to error. I’ve had Siri activate when I’ve spoken my own name. In my American English dialect, “Jerry” and “Siri” can sound the same.

Are they identical? No, but that is not the point. The point is that they are similar enough to trigger accidental activation, and data collected during such incidents should be deleted immediately, unless the user has agreed to have it collected. As influential tech writer and blogger John Gruber wrote in a recent post, “Having Apple contractors listen to random conversations or audio is the nightmare scenario for an always-listening voice assistant.”

Users should have a say in whether their data is used by those responsible for analyzing the samples and training the algorithms.

Along with clear and constant opt-in privacy settings, those assigned to analyze these recordings should be bound by a code of ethics. As part of adhering to this code, they should be treated and paid as professionals, not as Mechanical Turks at the bottom of the Silicon Valley caste system.

Mathematician Hannah Fry of University College London has already declared the need for an equivalent of the Hippocratic oath for mathematicians and computer scientists. I think she is unequivocally correct in this declaration and its implications for the big tech corporations.

Even the American Library Association’s code of ethics makes the reasonable expectation of privacy explicit. “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted,” the code states.

A precedent for standardizing “messy” data with professional human interaction already exists in the pharmaceutical research industry. Librarians and doctors each have a defining code of ethics inclusive of the rights of the individuals they serve.

Surely it is not a stretch to ask the corporations so invested in their customers’ data — and who have shaken up the world in such remarkable ways — to commit to similar standards of conduct and transparency.


Jerry Waller

Digital Collections and Systems Librarian at Elon University