A Good Dataset is Hard to Find

Data used for social good still comes from real people.

Christina Barta
Impact Labs
5 min read · Nov 21, 2019


Photo by Chris Barbalis on Unsplash

DeepMind, an AI lab in London, has been doing lifesaving medical work by speeding up patient diagnosis, preventing blindness with better predictive techniques, and, most recently, spotting breast cancer more accurately than doctors. The algorithms behind this work are fuelled by the healthcare data of NHS patients, some of the most intimate data that exists. Now that DeepMind’s health team has merged with Google Health, there are concerns that the data will be shared too widely — and without patient consent.

De-biasing efforts need data too. The Center for Data Innovation (CDI) recently covered IBM’s efforts to reduce algorithmic bias by releasing a dataset of faces with a variety of skin tones, ages, and genders. The data was created from publicly shared images on Flickr, and many people were upset that their photos could be used for facial-recognition technology without their consent. But IBM did not do anything illegal or unprecedented. The images were posted under a Creative Commons license, which allows content to be reposted, repurposed, or adapted with few restrictions. CDI writes that under current copyright law, “any data a human can access, a computer can also access.” IBM is not unusual in sourcing its images from social media — the vast majority of existing datasets are created with publicly shared images, videos, and text.

The potential for AI models to do social good is tremendous. But the privacy and psychological safety of digital citizens is not something we can afford to lose. How should we weigh the interests of users who create data against the interests of people who could benefit from using it?

For now, that balance is mostly determined by companies themselves.

Microsoft’s legal team has authored three user agreements with a variety of permission and privacy levels.

The first is what I’ll call the open agreement — a data free-for-all. It’s modelled after the Creative Commons license that IBM used to collect Flickr images. If a photographer shares their images with IBM under this open agreement, not only could their photos be used to train an AI model, but they could be reposted all over the internet.

The second is a computational agreement — data shared can only be used for AI training purposes. In this agreement, a photographer could share their work to train an AI model, but retain the right to prohibit a company from putting their photos on a public website.

The third is an open-source agreement — data can be used to train AI models as long as the models are open source. Here, the thought is that user data makes the model better, and the open-source model makes society better.
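
To make the distinctions concrete, here is a minimal sketch, in Python, of how a data pipeline might encode these three permission levels as a simple license check. The agreement names and the may_use function are my own illustration, not language from Microsoft’s actual agreements.

from enum import Enum, auto

class DataAgreement(Enum):
    OPEN = auto()           # free-for-all: training, reposting, repurposing
    COMPUTATIONAL = auto()  # machine use only: training allowed, public display is not
    OPEN_SOURCE = auto()    # training allowed only if the resulting model is open source

def may_use(agreement: DataAgreement, purpose: str, model_is_open_source: bool = False) -> bool:
    """Return True if `purpose` ('train' or 'publish') is permitted under `agreement`."""
    if agreement is DataAgreement.OPEN:
        return True
    if agreement is DataAgreement.COMPUTATIONAL:
        return purpose == "train"
    if agreement is DataAgreement.OPEN_SOURCE:
        return purpose == "train" and model_is_open_source
    return False

# A photo shared under the computational agreement can help train a model...
assert may_use(DataAgreement.COMPUTATIONAL, "train")
# ...but it cannot be republished on a public website.
assert not may_use(DataAgreement.COMPUTATIONAL, "publish")

The interesting twist is the third branch: permission depends not only on what the data is used for, but on what happens to the model afterward.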

The Flickr users’ complaints seem to reveal the hope that AI models would be prevented from using photos that are available to human eyes. The latter two of Microsoft’s three legal agreements outline the inverse of that hope. They allow machines to use data that humans can’t see. This might provide the type of privacy that NHS patients would expect — their data could be used to build AI tools that help them heal, but remain securely out of public sight.

Jack Clark, OpenAI’s Policy Director, hinted in a recent edition of Import AI that he values datasets built with consent. He applauded Facebook’s AI Red Team for the way it created its new dataset for the Deepfake Detection Challenge. In “a rare turn for an AI project,” Clark says, “Facebook seems to have acted ethically here” — paid, consenting actors were used to create the images and deepfakes in the set.

Images from Facebook’s Deepfake Detection Challenge dataset, created with the consent of paid actors.

This sounds like a promising turn, but hiring actors for a massive dataset takes vast resources and makes it difficult to scale up. Facebook’s actor-based preview is made up of 5,000 videos, while a YouTube-sourced Kinetics dataset from DeepMind is 130x larger, with 650,000 clips of activities like walking, climbing trees, and even making slime. In the coming months, Facebook plans to expand its dataset to “tens of thousands” of videos — still a fraction of the largest sets available.

Even with a lot of data available, authenticity is still a challenge. Actors certainly can’t be used to supply medical data, financial records, or social media habits to build programs that try to understand human activity as it actually exists.

Companies like Facebook are already grappling with hard questions, like how “to further the state-of-the-art or academic understanding on important social issues” while respecting individuals’ “fundamental rights and freedoms.” It might not be for lack of trying that privacy norms have remained contentious and blurry — the right balance is far from obvious.

If participation in datasets is completely optional, the people who choose to contribute might not be representative of the population at large, leading to the very algorithmic bias that responsible data scientists are trying to eliminate. But there is a long way to go before all people — especially marginalized groups — feel safe sharing their data. A first step for companies and governments is to give users good reason to believe that their data will not be used against their interests — and recourse, if it is.

Perhaps we should proceed by letting representatives decide how the costs and benefits of open data stack up. Use of NHS data by DeepMind’s health team might more properly be decided by Parliament, if not voters themselves. Some voters will oppose medical research fuelled by their data. What “fundamental rights” do those citizens have?

For now, the biggest mistake is to claim that the balance between giving individuals absolute control over the data they create and allowing data to be used for good is easily decided. Data used for social good often has a human cost, and it is in our interest to determine when that cost is worth paying.
