A Dataset is a Worldview

On subjective data, why datasets should expire, & data sabotage.

Hannah Davis
Towards Data Science

--

This is a slightly expanded version of a talk given at the Library of Congress in September 2019.

A Dataset is a Worldview

My name is Hannah Davis and I’m a research artist, generative musician, and data scientist working with machine learning. Most of my work deals with ideas around emotional data, algorithmic composition, and dataset creation.

One of my bigger projects was an algorithm called TransProse, which programmatically translated books into music. It worked by finding the underlying emotional content and using that to generate music pieces with the same emotional tone.

http://musicfromtext.com/

To do this, I used several resources, including what is essentially a dataset of words that had been tagged with various emotions.

And I worked with this dataset for a long time before really deeply examining it, and when I did, I found this:

The word ‘childbirth’ next to ten emotions, each followed by a zero.

The word ‘childbirth’ had been tagged as being completely unemotional.

That was interesting to me. How had this happened? Who were the people who had tagged this? I could imagine a scenario in which any one of those emotions might be checked, but not one in which none of them were.

I didn’t get those answers, but I did learn that each word had been tagged by only three people. This is common in dataset creation, since labeling gets expensive quickly — even three people per word can turn into thousands of dollars.

But because a machine learning model learns the boundaries of its world from its input data, just three people determined how any model using that dataset would interpret whether ‘childbirth’ is emotional.
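To make the mechanism concrete, here is a minimal sketch of how this can happen. The annotations and the majority-vote rule below are invented for illustration, not the lexicon’s actual pipeline; the point is that with only three annotators, honest disagreement can collapse into “no emotion at all”:

```python
from collections import Counter

EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy",
            "sadness", "surprise", "trust", "negative", "positive"]

# Hypothetical raw annotations: each annotator ticks the emotions they
# associate with the word. Three people, three different worldviews.
annotations = {
    "childbirth": [
        {"joy", "anticipation"},   # annotator 1
        {"fear", "negative"},      # annotator 2
        set(),                     # annotator 3 ticked nothing
    ],
}

def aggregate(word, threshold=2):
    """Keep an emotion only if at least `threshold` annotators ticked it."""
    counts = Counter(e for ticks in annotations[word] for e in ticks)
    return {e: int(counts[e] >= threshold) for e in EMOTIONS}

print(aggregate("childbirth"))
# Every emotion comes out 0: no two annotators agreed, so the aggregated
# dataset records 'childbirth' as carrying no emotion at all.
```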

This led to a perspective that has informed all of my work since: a dataset is a worldview. It encompasses the worldview of the people who scrape and collect the data, whether they’re researchers, artists, or companies. It encompasses the worldview of the labelers, whether they labeled the data manually, unknowingly, or through a third party service like Mechanical Turk, which comes with its own demographic biases. It encompasses the worldview of the inherent taxonomies created by the organizers, which in many cases are corporations whose motives are directly incompatible with a high quality of life.

Datasets Should Expire

In addition to the specific biases of their creators, datasets also encode the general cultural values of the time at which they are created. Outdated societal worldviews are the reason that certain Disney movies are locked up in a vault — because our society has changed and grown, and ideas in those movies (namely, racist and misogynistic ideas) are no longer appropriate.

This kind of expiration does not happen for datasets. Take, for example, the Labeled Faces in the Wild dataset, which was created in 2008. It was one of the first datasets for unconstrained facial recognition and consists of roughly 13,000 images of mostly celebrities scraped from the internet. But because of the moment in time at which it was collected, images of just three people make up about 7% of this dataset that was meant to be generalizable: George W. Bush, Colin Powell, and Tony Blair.
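That kind of concentration is easy to measure for yourself. The sketch below assumes the dataset’s usual on-disk layout of one folder per person (the path is a placeholder):

```python
from collections import Counter
from pathlib import Path

def identity_share(root="lfw", top_n=3):
    """Count images per identity and report the share held by the top few."""
    counts = Counter(
        person.name
        for person in Path(root).iterdir() if person.is_dir()
        for _ in person.glob("*.jpg")
    )
    total = sum(counts.values())
    top = counts.most_common(top_n)
    return top, sum(n for _, n in top) / total

# top, share = identity_share("path/to/lfw")
# print(top, f"-- {share:.1%} of all images")
```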

This type of historical bias exists across subject matter. The great and genuinely amazing dataset ImageNet, which transformed the whole field of machine learning, was created in 2009 and is still used today. If you search for ‘cell phone’ in the original dataset, it returns a selection of flip phones, since this was before smartphone use was common.

Machine learning datasets are treated as objective. They’re treated as ground truth by both the machine learning algorithms and the creators. And datasets are hard, time-consuming, and expensive to make, so once a dataset is created, it is often in use for a long time. But there is no reason to be held to the past’s values when we as a society are moving forward; similarly, there is no reason to hold future society to our current conditions. Our datasets can and should have expiration dates. Consider the kind of data that would be collected and the way it would be labeled under Trump’s administration; is there any reason we would want that to inform our future?
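One lightweight way to act on this is to attach an explicit expiry to the dataset itself, and check it at load time. The metadata fields below are invented for illustration; the point is that the shelf life gets recorded and enforced rather than assumed:

```python
import warnings
from datetime import date

# Hypothetical metadata record stored alongside the data files.
metadata = {
    "name": "emotion-lexicon-snapshot",
    "collected_on": "2019-09-01",
    "expires_on": "2024-09-01",     # after this date, re-collect or re-label
    "labelers": 3,
    "labeling_protocol": "crowdsourced, majority vote",
}

def check_expiry(meta):
    """Warn (or refuse to load) when a dataset is past its expiration date."""
    expires = date.fromisoformat(meta["expires_on"])
    if date.today() > expires:
        warnings.warn(
            f"Dataset '{meta['name']}' expired on {expires}; "
            "its labels reflect the worldview of an earlier moment."
        )

check_expiry(metadata)
```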

“The AI Did It”

Many of these issues are problems within classification and taxonomies more broadly. But this is more important and scary in relation to machine learning for two primary reasons: first, machine learning datasets influence models that go into production in the real world — often quickly and without any examination — and have substantial and immediate impact in our daily lives. They affect everything from our search results to job prospects to credit scores, invisibly and with no accountability, no process to appeal and no option for redress.

Second: machine learning researchers, journalists, governments, corporations, and other relevant parties have perpetuated this narrative of AI as a black box, as completely uninterpretable and most importantly, responsible for itself. This allows all of these issues of biases and worldviews to be written off as “just the AI.” It is an exceptionally convenient narrative for those who are using machine learning for harm.

Subjective Data

One idea I am advocating for, to counter some of these issues, is the idea of “subjective data.” This means embracing the fact that a dataset has a worldview, and being explicit about what that worldview contains. It also means the opportunity to do better — to create our own ideal or experimental worldviews and transcend the ones we’ve been forced to inherit. We can create a dataset with data and labels that model an ideal or healthier society, not just mirror how it is today. We can create unique and experimental taxonomies and labels, and leave harmful ones out. We can make it so that datasets actively support those fighting for a better future, rather than becoming another obstacle to battle along the way.

This is already happening in the field of machine learning art (see below), but I believe it’s relevant to other fields as well; I believe we all have more agency over our datasets than we think. We should consider how to include nuance as much as possible — for instance, by making room to include every piece of input data, rather than truncating it to fit a few categories. We should prioritize practices that move us away from a singular worldview. We should not let people’s unique life circumstances be judged by algorithmic averages, with real-world consequences. We should consider whose worldview is being imposed on us.

Data Sabotage

I want to end on a note that will make all of us who care about data and archiving a bit uneasy, but that is important to think about in this day and age: the idea of intentional data sabotage. Many of you know this story already, but René Carmille was an early example; during World War II, he was a secret French Resistance fighter who ran the National Statistics Service under the Nazis in Vichy France. He was in charge of converting census data to punch cards, and he decided to remove column 11, the column that indicated religion. He saved many people’s lives solely by deciding that certain data should not be collected about them.

In present times, the gravity of our decisions around data is similar. Data about a person’s race, gender, mental health, immigration status, and other classifications is likely to be used for harm, where it hasn’t been already. For those of you for whom this is relevant, it is important to talk about this with your team (or with yourself) ahead of time and decide at what point you would take action, so that you can recognize the moment and act when it comes.
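A small sketch of what that decision can look like in practice: the fields you have decided never to collect are named up front (the field names here are hypothetical) and dropped before a record is ever stored, so the choice lives in code rather than being left to the moment:

```python
# Fields we have decided, in advance, never to collect or store.
# The field names here are hypothetical.
SENSITIVE_FIELDS = {"race", "religion", "immigration_status", "mental_health"}

def minimize(record: dict) -> dict:
    """Drop fields that should not be collected at all."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

incoming = {"name": "A. Person", "zip_code": "10001", "religion": "redacted"}
print(minimize(incoming))   # {'name': 'A. Person', 'zip_code': '10001'}
```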

In the wrong hands, or even in the perfectly-well-intentioned-but-thoughtless hands, classification is violence. Without a rigorous contemplation of this idea in relation to your own machine learning work, it is easy to accidentally cause harm through something as seemingly simple as collecting and labeling data.

A few examples (of many!) of machine learning art projects that use and/or explore subjective datasets:

The Laughing Room, Jonny Sun and Hannah Davis

Classification.01, Mimi Onuoha

Feminist Data Set, Caroline Sinders

Animal Classifier, Shinseungback Kimyonghun

Dataset of Sephora reviews that mention ‘crying’, Connie Ye

This is the Problem, the Solution, the Past, and the Future, Sebastian Schmieg

Thanks to Mykola Bilokonsky, sarah k hallacher, and Luis Daniel for feedback!
