The “Dataset Nutrition Label Project” Tackles Dataset Health and Standards

Berkman Klein Center
Jan 29, 2019

by Hilary Ross with Nicole West Bassoff

We use algorithms to make decisions every day, from finding the least trafficked route, to browsing the news, to making hiring decisions at work. As algorithmic decision-making becomes more pervasive, there is a lot of important work to be done to ensure that algorithms are developed with attention to accuracy, bias, and fairness. Increasingly, journalists and academics have been investigating and exposing the bias in algorithmic outputs, but less attention has been paid to the bias in the data that’s used to train those algorithms.

The Dataset Nutrition Label Project (DNLP), which was created during the 2018 Assembly program hosted by the Berkman Klein Center and MIT Media Lab, seeks to tackle this blind spot in our understanding of the health and quality of data.

The project’s premise is simple. The integrity of a machine learning model is fundamentally predicated on the data used to train it — as the saying goes, “garbage in, garbage out.” Instead of waiting to assess models after they’ve been created, the DNLP aims to make it easier to quickly assess the viability and fitness of a dataset before it is used to train a model, by giving the dataset a “nutrition” label.

In 2018, the DNLP team developed quantitative and qualitative dataset health measures. Now, the team is working to package those measures into an easy-to-use “dataset nutrition label.” Check out their first prototype label here, built on ProPublica’s Dollars for Docs dataset. The team also wrote a white paper explaining their framework and the concept of a dataset nutrition label.

Since last year’s Assembly program, the project has grown and evolved. We spoke to four of the project’s current team members — Kasia Chmielinski, project lead; Sarah Newman, researcher and strategist; Josh Joseph, AI researcher; and Matt Taylor, data scientist and workshop facilitator — to learn more about how Assembly brought them together and what they’re working on now. The interview has been edited for clarity.

An example of a dataset nutrition label

Starting From a Point of Interdisciplinary Collaboration

Assembly gathers a small cohort of technologists, managers, policymakers, and other professionals to confront emerging problems related to the ethics and governance of artificial intelligence. The four-month program begins with a two-week intensive ideation process and short course, during which participants begin to form project teams. This is followed by a twelve-week collaborative development period, when teams build their projects.

Last year was Assembly’s second iteration, with a nineteen-person cohort. By the end of the program, the group had created six projects, including the DNLP.

During our interview, the DNLP team spoke about how the Assembly program brought them together and encouraged interdisciplinary collaboration.

JOSH JOSEPH: As an AI engineer, I really like building stuff. To be honest, before the program, I hadn’t thought that hard about a lot of the ethics, policy, governance, and law related to AI. Assembly was a way to think more deeply about important questions like “what do we mean by bias?”, and at the same time, to actually work on a project and build something with people who aren’t all engineers. As an engineer, I got a lot out of being challenged in that way.

KASIA CHMIELINSKI: Agreed. In the ethical tech conversation, it can often feel like there are people who are building technology, and then there are people who are writing papers about the implications of that technology. It’s rare for them to be able to come together to collaborate. Assembly was an opportunity to think about ethics and implement ideas across these disciplines. Our group is really diverse. We’re thinking about art and media, learning, product management, and engineering. And that’s reflected in the outputs of our project: a prototype, but also a paper, and now we’re also speaking regularly across domains. I’m really glad for the opportunity to have these conversations across the industry.

SARAH NEWMAN: Assembly brings together people with very different backgrounds, and the program encourages and facilitates collaboration, which makes for really unique outputs. We came up with common language, and were generous with each other, and designed projects that were stronger than they would have been otherwise, because of our varied perspectives and approaches to solving problems.

It’s one thing to go to an event or a conference where there are people coming from different fields or sectors. You meet, schmooze, talk about ideas. That’s great. But there’s something very different about actually working with people on a team: going through the ups and downs, the tensions, the successes, really being in the process of working together. The connection becomes so much deeper. So, one of the big benefits of Assembly for our project was this collaboration across sectors.

MATT TAYLOR: Newman mentioned how Assembly facilitated collaboration. Thinking about the overall experience, the grounding sessions that we did in the first two weeks were key. There are two specific moments that were emblematic for me.

First, towards the end of the first day, we established ground rules and guidelines for how we wanted to be with each other, facilitated by two of our fellow Assemblers, Newman and David Colby Reed. That’s something that I don’t often see in more technical or academic spaces. I think it’s a valuable practice. It was helpful for allowing us to be in dialogue with each other.

Second, we did a “k-means clustering” activity, led by fellow Assembler Gretchen Greene, which was an embodied version of how the k-means clustering algorithm works. We all physically acted out the steps of the algorithm. We could all participate, coming at it from technical, policy, and art perspectives. It was another example of how everyone brought their expertise to help the group create some shared language to tackle these projects.
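For readers unfamiliar with it, k-means alternates between two steps, which the group acted out physically: assign each point to its nearest cluster center, then move each center to the mean of the points assigned to it. Here is a minimal Python sketch of the standard algorithm — illustrative only, not code from the program:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal k-means: alternate between assigning points to their
    nearest centroid and moving each centroid to its cluster's mean."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (a centroid with no assigned points stays where it is).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Example: cluster 200 random 2-D points into 3 groups.
pts = np.random.default_rng(1).normal(size=(200, 2))
labels, centers = kmeans(pts, k=3)
```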

A number of the Assembly 2018 cohort, including some members of the Dataset Nutrition Label Project. From L to R: Kathy Pham, Jonnie Penn, Matthew Taylor, Sarah Newman, Sarah Holland, Kasia Chmielinski, and Ahmed Hosny

Scoping and Developing The Dataset Nutrition Label Project

During the Assembly program, the cohort spends the first two weeks dividing into project teams and developing project ideas. Over the following twelve weeks, each team works together to build out their projects. The teams are supported by a group of expert advisors, practitioners, and academics who provide feedback on ideas and outputs. We asked the DNLP team to tell us a bit about how their project was scoped and developed.

KASIA: I was the team product manager. After we had an idea — building standards around datasets — we sat down to figure out what we could actually do in four months. Which is a very short amount of time! For a while, we were thinking that we could either create a prototype of a label or write a policy paper. Instead, we decided to do both. We realized that we had all the skill sets that we needed. To me, that was a really strong moment at the beginning of the project, which was possible because Assembly brought all these diverse talents to the table.

Our idea also requires us to constantly be talking to people outside of our project, to figure out what the standards should be, and to have access to relevant datasets.

NEWMAN: We chose to create a “nutrition label,” as opposed to other potential outputs, for a few reasons. First, it’s familiar, accessible, modular, and legible, and it translates across many mediums.

Second, a dataset nutrition label can act as an educational tool, to show that the outputs of algorithms are coming from somewhere: the training data. We hope the dataset nutrition label idea spurs broader conversation. We want to inspire people to look at every dataset that will be used to build a model and ask “What are the contents of this dataset? Is this the right dataset to build this model?” We believe the existence of nutrition labels on datasets will encourage broader interrogation of dataset contents and fit.

KASIA: On the technical side, our label framework is modular. We don’t use the exact same information for every dataset; instead, we use the same label framework, which data scientists can run their data through. As we were building our prototype on ProPublica’s Dollars for Docs dataset, we wanted to try a module based on probabilistic computing. Through Assembly, we connected with the probabilistic computing group at MIT. We were able to leverage their tool, BayesDB, which allows us to compare the data in the prototype label to other, similar datasets, to see where biases creep in.
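To make the idea of a modular label concrete, here is a hypothetical sketch of what such a framework might look like in Python. The module names and interfaces below are illustrative assumptions, not the project’s actual code, and the BayesDB-backed comparison module is omitted:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

import pandas as pd

# A "module" is any function that inspects a dataset and returns the
# findings for one panel of the label. (Hypothetical interface.)
LabelModule = Callable[[pd.DataFrame], Dict[str, Any]]

def missingness_module(df: pd.DataFrame) -> Dict[str, Any]:
    """Quantitative panel: fraction of missing values per column."""
    return {"missing_fraction": df.isna().mean().round(3).to_dict()}

def provenance_module(df: pd.DataFrame) -> Dict[str, Any]:
    """Qualitative panel: metadata supplied by a human curator."""
    return {
        "source": "ProPublica Dollars for Docs",  # example entry
        "collection_method": "public records",    # example entry
    }

@dataclass
class NutritionLabel:
    """Runs each registered module over a dataset and assembles the
    results into a single label, one panel per module."""
    modules: Dict[str, LabelModule] = field(default_factory=dict)

    def generate(self, df: pd.DataFrame) -> Dict[str, Any]:
        return {name: module(df) for name, module in self.modules.items()}

label = NutritionLabel(modules={
    "missingness": missingness_module,
    "provenance": provenance_module,
})
# report = label.generate(pd.read_csv("dollars_for_docs.csv"))  # hypothetical file
```

In a design like this, each module contributes one panel of the label, so panels can be added or dropped depending on what a given dataset supports.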

JOSH: The BayesDB connection is great, because they have a lot of really nice tools for finding issues in your data. We might’ve been able to build similar tools, but it would’ve taken us much longer. Instead, through the collaboration with BayesDB, we were able to do far more than we would’ve otherwise.

NEWMAN: During Assembly, you feel like you’re in a community that has the pulse of what’s going on related to the ethics and governance of AI. The program’s immediate and wider circles bring value by supporting projects, putting you in touch with people, acting as a gut check, and making sure you’re not reinventing the wheel.

What’s Next for the Dataset Nutrition Label Project?

KASIA: Our project continues to be volunteer-led and volunteer-driven. Last fall, we got together to plan what we want to do in 2019. During the fall, I also had the opportunity to further develop the project as a Mozilla Open Leaders Fellow. The goal for this year is to have more conversations in this space, to push the prototype forward technically, and to get our story out there.

This January and February, we’ll be speaking at CPDP (Computers, Privacy, and Data Protection) in Brussels and SXSW in Austin. We’ll also be running workshops and working with collaborators at MIT to further the technical capacity of the project.

MATT: I’m especially interested in using the dataset nutrition label as a vehicle for involving more people in this conversation. So, we’re thinking about changing behavior, and also changing the dynamics of the conversation. The question is not just who are the communities who may not be part of the conversation — but who are the subjects of bias who should be part of the conversation?

Want to learn more about the Dataset Nutrition Label Project?

  • CPDP, Wed. Jan 30: Catch Kasia and two other Assembly 2018 alums, Sarah Holland and Jonnie Penn, speaking at CPDP on “Leveraging ‘Nutrition Labels’ and Other Tools for More Responsible AI”. Details are here.
  • SXSW, Mar. 11: Catch Kasia and Sarah Holland at SXSW on “Bias In, Bias Out”. Details here.

The third iteration of Assembly starts on March 11, 2019. Keep an eye out for more exciting projects as they are developed during the program!

Berkman Klein Center Collection

Insights from the Berkman Klein community about how technology affects our lives (Opinions expressed reflect the beliefs of individual authors and not the Berkman Klein Center as an institution.)

Written by Berkman Klein Center

The Berkman Klein Center for Internet & Society at Harvard University was founded to explore cyberspace, share in its study, and help pioneer its development.
