The Generation, Evaluation, and Metrics (GEM) Project: Improving dataset transparency with the Data Cards Playbook

People + AI Research @ Google
13 min read · Jul 28, 2023

By Mahima Pushkarna and Andrew Zaldivar, with Sebastian Gehrmann

Abstract:

The Data Cards Playbook, developed by teams across Google and in collaboration with external partners, is a collection of participatory activities and resources to help dataset creators adopt a people-centric approach to structured documentation known as Data Cards. The cross-industry Generation, Evaluation, and Metrics (GEM) Project used four activities (“Dimensions Evaluation,” “Lens Voting,” “Lens Brainstorm,” and “Questions With Optics”) from the Data Cards Playbook to tackle the challenges of an open data project, in which contributors from across the world come together as a community to establish benchmarks to advance machine learning domains.

In applying the Playbook activities, the GEM team discovered opportunities for improvement. The Dimensions Evaluation revealed shortcomings in existing documentation. The Lens Voting and Lens Brainstorm helped determine the importance of different themes to capture in documentation, which were then turned into easy-to-answer questions using Questions with Optics. Together, these activities helped the team update their documentation schema at scale, to document 41 heterogeneous natural language generation datasets featuring 56 languages. As a side effect, the Data Cards Playbook also helped identify technical requirements for a front-end interface to capture documentation, with which the GEM team increased the information available about their datasets while reducing the time required to complete documentation by one-third.

Introduction

Generation, Evaluation, and Metrics (GEM) is a community-driven, participatory effort to improve the evaluation of Natural Language Generation (NLG) systems, which generate natural language to perform tasks such as writing descriptions, summarizing a document or conversation, and predicting a plausible explanation. The GEM benchmark environment includes a set of resources and tools to help AI researchers understand NLG systems' performance.

NLG describes a broad class of problems in which a model is tasked with generating fluent and accurate natural language that fulfills some underlying communicative goal. NLG systems have become top-of-mind for many people, with large language models (LLMs) like ChatGPT, PaLM 2, and Llama 2 receiving widespread attention and public discussion. The tasks that many NLG systems perform are based on structured inputs, such as online documents or articles scraped from the web. For example, summarization datasets such as MLSum are a popular type of NLG dataset; they contain documents that are crowdsourced, found on the web, artificially generated, or created using some combination of these approaches. Models trained on such datasets are often tasked with generating a single-sentence summary that captures the key points of a longer article.

Regardless of how datasets are collected, it's important that they are documented, that claims about a particular model's capabilities are accurate, and that any limitations or caveats are identified. Transparency in documentation is one way of improving a dataset's usefulness without changing the dataset itself. The GEM benchmark environment gives a view into developing standards for evaluating and documenting NLG datasets, models, and tasks with automated metrics and human annotations. It uses Data Cards to report on the datasets available in the repository and Model Cards to evaluate models on tasks.

Common natural language dataset challenges

Every dataset has biases and embodies tradeoffs. As such, metrics computed on a dataset will inherit some of these aspects, which should be acknowledged in documentation. In GEM, several datasets have been altered through changes to existing test splits or the addition of new ones, which in turn affects how model performance is reported. For example, one test dataset could contain subsets defined by personal attributes of the people represented in its examples, so the performance of any model trained on that dataset would also need to be reported on those subsets of the test set.
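
To make subset-level reporting concrete, here is a minimal sketch, assuming per-example scores in a pandas DataFrame with a hypothetical "subgroup" column standing in for whichever personal attribute the test split is divided by. This is illustrative, not GEM's actual evaluation code:

```python
import pandas as pd

# Hypothetical per-example evaluation results; "subgroup" stands in for
# whichever personal attribute the test split is divided by.
results = pd.DataFrame({
    "subgroup": ["A", "A", "B", "B", "B"],
    "score": [0.72, 0.80, 0.55, 0.61, 0.58],
})

# Report the metric overall *and* per subset of the test set, so differences
# between subgroups stay visible rather than being averaged away.
print("overall:", results["score"].mean())
print(results.groupby("subgroup")["score"].mean())
```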

In more complex cases, model performance on a dataset is reported with flawed metrics, which can misrepresent a model's ability to perform a task even when its outputs appear favorably human-like. A general challenge for NLG is that many metrics compare the generated text to some set of reference texts. Historically, this has been measured by methods like ROUGE, which count the number of words generated by the model that appear in the reference set. A common drawback of this overlap approach is that a model might generate text that isn't meaningful, but nonetheless matches words in the reference set.
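
As a rough illustration of why pure overlap can mislead, here is a minimal sketch of a simplified unigram-overlap score in the spirit of ROUGE (not the official implementation). A jumble of reference words can score at least as high as a reasonable summary:

```python
def unigram_overlap(generated: str, reference: str) -> float:
    """Fraction of generated words that also appear in the reference text
    (a simplified overlap score in the spirit of ROUGE, not the official metric)."""
    ref_words = set(reference.lower().split())
    gen_words = generated.lower().split()
    if not gen_words:
        return 0.0
    return sum(w in ref_words for w in gen_words) / len(gen_words)

reference = "the council approved the new housing budget"
summary = "the new housing budget was approved by the council"
word_salad = "approved approved the the council council budget housing new"

print(unigram_overlap(summary, reference))     # ~0.78: a reasonable summary
print(unigram_overlap(word_salad, reference))  # 1.00: meaningless, yet scores higher
```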

Flawed metrics are not the only challenge. The quality of datasets as measurement instruments is a concern. Datasets used for the evaluation of model performance are typically not tested for whether they are the best instrument for this evaluation.

It's also necessary for a diversity of perspectives to be represented in these datasets so that they are reliable tools, because perceptions of “safety” vary across demographically diverse sets of raters. “Safety” evaluation datasets can be used to measure performance against different demographic perspectives.

Another common challenge is that the reference text itself may be of poor quality or erroneous, often not representing the real-world constraints needed for evaluating model performance. For example, the surface realization dataset covers a variety of domains typically found in news, blogs, and forums; it is intended for reconstructing natural language and supports 11 different languages. Models trained on it may be useful for translating articles, but may perform worse on a specialized domain that is less likely to appear in day-to-day media coverage. Approaches such as semantic similarity could also lead to deceptive performance results, because a semantically similar result may not be the desired outcome. For toxicity datasets, for example, semantic similarity ends up capturing the topic of the text much more than its toxicity.

Many times, a dataset may cease to be useful for its originally intended use case, but could be helpful for other use cases, such as adversarial testing. Responsibly stress-testing models might even require an outdated, artificial, or poorly created dataset that the community may believe should be deprecated. If such a dataset is included in GEM, the GEM project clearly documents why, and what the recommended purpose is. GEM also discloses that the dataset’s data collection process may render it unsuitable for alternative use cases.

Reflecting on GEM v1: Lessons learned

[Image: A Data Card for a multilingual dataset designed to support a combination of translation and data-to-text verbalization. The card opens with a short summary of the dataset, code to load it, and access links, followed by a set of fields under the “Quick use” section; the “Dataset Overview” section is collapsed, and the “Dataset in GEM” section is cut off, indicating that more content is available in the Data Card.]
Above: A Data Card from the GEMv2 (after the Data Cards Playbook workshop) repository, describing a multilingual dataset designed to support a combination of translation and data-to-text verbalization. Models can use this dataset to first generate a summary of a document with statistical tables in them, and then translate them into a different language.

GEM organizes an annual workshop at ACL conferences in which participants analyze datasets in the benchmark environment. The first version of GEM contained 13 datasets that the GEM community was deeply familiar with. This made managing the datasets and their documentation a reasonable task, in which a large change could be implemented in a couple of hours. That's in part because the Data Card template first used by GEM described aspects of the datasets that GEM organizers were interested in, but not necessarily those relevant to the broader NLG community.

The GEM Data Card template was also accompanied by a long Markdown guide, which provided brief descriptions of each question. This guide, however, catered to more experienced individuals who understood metrics and were familiar with issues in generation evaluation. Because of this, people who weren't as familiar with NLG found it difficult to use.

The GEM organizers also found that most people working in GEM — more than 98% — don’t actually work on model and dataset documentation in their research.

Some documentation needs, such as “Give a one-sentence description of the dataset,” can be relatively straightforward to address. But needs related to the social impact of a dataset, appropriate use cases, and even questions like “What is the licensing status?”, “What are allowed activities (with the dataset)?”, and “What is the communicative goal?” require background knowledge that participants in the GEM project did not necessarily have. Learning to answer these questions from a long, single-page guide, and then filling out the Data Card elsewhere, created a lot of overhead for dataset creators.

While GEMv1 featured a collection of 13 curated datasets, GEMv2 expanded to 41 datasets spanning 56 languages, contributed by a global community. At this new scale, GEM organizers wanted to ensure that the GEMv2 Data Card template provided a more inclusive experience to span different languages and welcome contributions from a broader range of cultural and educational backgrounds.

The Data Cards Playbook Workshop

To incorporate these changes, GEM organizers decided to revisit their Data Card template using the Data Cards Playbook. The goals were to create a well-structured Data Card template that would leave no doubt about the advantages of proper documentation, while also making it easier for users of GEM to get started with a data loader and use the datasets to evaluate trained models. This required a new documentation schema that could fulfill many requirements that a Markdown file alone could not. First, the project needed an interactive front-end interface in which users could submit outputs and visualize metrics quickly. Second, questions in the schema needed to be structured so that novice and intermediate contributors could progressively write documentation regardless of their familiarity with NLG. Finally, with a growing volume of datasets, sufficiently useful documentation was necessary for all datasets, aimed at readers with different motivations, whether looking at existing issues or examining a single datapoint.

With a clearer understanding of the goals and outcomes expected from the creation of Data Cards, a core group from GEM worked with the Data Cards Playbook team to workshop four activities: a Dimensions Evaluation to evaluate the strengths and weaknesses of GEM’s existing template; a Lens Voting to help prioritize what GEM wanted to document in their Data Card template; a Lens Brainstorm to capture the documentation needs of the broader community; and finally, a Questions with Optics framing to add more structure to the Data Card schema.

The Data Cards Playbook team ran these four activities with organizers from GEM in two separate workshop sessions that were a few weeks apart, giving enough time between sessions for asynchronous updates and deeper reflection. As a result of the workshop, the GEM team created an interactive interface for dataset creators to add and document their datasets. Most of GEM's Data Cards were produced using this interface during a two-week “datathon” by approximately 40 contributors, after which GEM worked with the Data Cards Playbook design team to co-develop a front-end interface so users of the GEM project could navigate and explore these Data Cards.

Learnings from the workshop

Evaluating existing documentation with Dimensions

[Image: A worksheet for evaluating the utility of the template by reviewing rationale, evidence, and action items, which lets organizers see how a dimension of the Data Card can be improved. The dimension is rated on a scale from poor to outstanding, alongside a proficiency rating that identifies the data fluency and domain expertise needed to come to an informed conclusion about that dimension of the Data Card.]
Above: Evaluating the utility (the degree to which a Data Card template details how to use the dataset) of the GEMv1 Data Card template helped the GEM team articulate what was already working well, and identify opportunities for improvement.

The GEM team evaluated their existing template and identified several opportunities to improve their existing documentation schema, with the needs of the data card reader (also known as “the agent” in Data Cards Playbook terms) in mind:

  • Adding more specificity to the Data Card as the diversity of datasets increases. For example, specifying which datasets target factuality (how faithful the information is to its sources) or fluency (the level of expertise needed to understand the information conveyed in the Data Card).
  • Allowing datasets to be represented by their own metrics. This included the ability to automate and define metrics and provide explanations for why metrics were used.
  • Explicitly mentioning the intended contexts for using the dataset, and specifying which use cases may carry risks.
  • Describing examples of how this data has been used in models.
  • Creating more structure around quality, quality assurances and documenting the collaborative analysis of datasets.

The team discussed their role in the broader documentation process, the difference between Data and Model Cards, and approaches to communicating the broader impact of these datasets on society to downstream users. One of these approaches included adding more specificity to the section about bias, fairness, and personally identifiable information (PII) in GEMv2 by introducing known biases and limitations with examples of different ways of framing limitations. Given the open nature of the project, risks associated with a dataset, such as privacy and security, are documented as and when they apply to individual datasets.

As a result, the team chose to include questions that discuss advantages and disadvantages of one dataset over others and provide instructions to specify when information was not available or inapplicable in GEMv2.

Another significant shift in GEMv2 was that none of the included datasets were curated, nor were any subjective judgments made about their “goodness.” Instead of editors, the GEM team were now responsible “maintainers” of transparent datasets. Dataset creators were more deeply involved, tasked with providing their dataset under a permissive open license, a data loader (a script for users to easily utilize the dataset when benchmarking models), and a Data Card that documented the dataset's benefits and flaws. To further improve the accountability of the datasets, the GEM team added a new question to their schema: “Have there been previous analyses done that identified bias or risk associated with the data set?” They asked creators to provide links to appropriate research papers and explain them in their answers.

Identifying what’s next with “Lenses” and “Scopes”

The discussions stemming from the dimensions assessment became the basis for the Lens Voting activity. In the Data Cards Playbook, a Lens is a statement that articulates what agents (a.k.a. Data Card readers) want to learn from a Data Card, describing the information needed to complete tasks and make decisions about the dataset and its use. In the lens voting activity, the GEM team discussed and voted on a large set of common themes that encapsulate what different stakeholders might want to know about their datasets.

[Image: The results of the Lens Voting activity. Fifteen minutes are spent reviewing and voting on the lenses the group should expand on, and another fifteen minutes are spent copying over the lenses that will be broken down into Scopes in the Data Cards.]
Above: Designing for various potential readers of cards requires non-traditional approaches to design. In the Lens Voting activity, GEM organizers identified themes around data representation, data annotators, and crowd workers that helped shape GEMv2 in a way that improves readability and usefulness for those in the larger NLG community.

The GEM team voted on themes that they wanted to include or refine in their new GEMv2 documentation collection interface, including questions about representations of artificial and real people, the languages of annotators or data sources, and the geographic distribution of crowd workers. However, some decisions required tradeoffs. Initially, the team considered including a section in which basic terms of art could be explained. Instead, they assumed GEMv2 users would have a basic level of familiarity with popular NLG metrics, and focused on explaining the more complex or bespoke metrics used.

Another concept closely related to Lenses in the Data Cards Playbook is the Scope. Scopes are questions that people ask in sequence to gain a better and more relevant understanding of a topic under a Lens, which helps them make decisions about the dataset. These questions are framed at different levels of abstraction and communicate specifics about a dataset. Lenses and Scopes form agent information journeys, which are similar to user journeys for documentation. In the Data Cards Playbook, the Questions with Optics activity provides a structured approach to deconstructing Lenses into Scopes, in keeping with a user-centered approach.

Meaningful Improvements in GEMv2

The newly added structure made it easier to maintain parity and navigate between Data Cards in both the backend and frontend of GEMv2. For instance, if the GEM team needed to add a new question to one Data Card, they could now simply add a field that would then be available across all Data Cards without having to modify each Data Card individually. The GEMv2 pipeline was made much simpler, capable of preserving compatibility between documents even after changes were made.

Another feature in GEMv2 that was inspired by conversations during the workshop was the ability to import existing Data Cards onto the GEMv2 platform. This now allows the GEM community, and anyone else interested in having their dataset featured in GEMv2, to download or adjust Data Cards in a less tedious and more scalable way. That's because the template was now in a JSON format with a unique key for each question, which could then be further processed by tools like Data Cards Playbook Labs, which renders a given Data Card as an interactive web document, or converted directly into Markdown.
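
As a rough sketch of what that processing might look like, the snippet below converts a hypothetical JSON Data Card, keyed by section and question, into Markdown. The layout and key names are illustrative assumptions, not GEM's actual schema or the Playbook Labs code:

```python
import json

# A hypothetical Data Card in JSON form: each question has a unique key,
# grouped into sections. This layout is illustrative, not GEM's actual schema.
data_card_json = """
{
  "overview": {
    "dataset-summary": "A multilingual data-to-text dataset.",
    "languages": "English, German, Russian"
  },
  "considerations": {
    "known-limitations": "Summaries may omit figures from statistical tables."
  }
}
"""

def to_markdown(card: dict) -> str:
    """Render the keyed JSON structure as a simple Markdown document."""
    lines = []
    for section, questions in card.items():
        lines.append(f"## {section.replace('-', ' ').title()}")
        for key, answer in questions.items():
            lines.append(f"**{key.replace('-', ' ').title()}**\n\n{answer}\n")
    return "\n".join(lines)

print(to_markdown(json.loads(data_card_json)))
```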

[Image: A Data Card generated from JSON, which in turn is produced by the Data Card Input Form shown below. The card has an overview with relevant links and points of contact, notes whether the dataset is multilingual and, if so, which languages are used, and covers the intended use as well as the schema of the data.]
[Image: The Data Collection Interface guides users through the GEM Data Card Input Form via a series of prompts and questions; it produces the JSON shown above.]
Above: The front end of the Data Card is generated from JSON (top), which is produced by the Data Collection Interface (bottom).

But perhaps the most important feature in GEMv2 was that the GEM team was able to simultaneously improve the back-end experience for Data Card creators and front-end experience for Data Card readers. The instructions were improved to provide more detail for each question, and the level of detail increased depending on the needs of the Data Card creator. The open-source Data Collection Interface always displays a short explanation for each question, and users can hover on questions to read longer explanations as necessary. This structured walkthrough made it easier for contributors to fill out targeted questions that could be skipped if answers were not available. Participants no longer had to write one long Markdown file. Similarly, these short explanations are available in the front-end, helping make the Data Card more readable, while providing readers with access to more context when they need it.

The process was further streamlined by not having to sift through a large document with redundancies and complexities for all questions in the Data Card template. The sequential nature of Scopes made it easy for the GEM team to define conditional questions based on prior answers, ensuring that Data Card creators would always see the questions most relevant to their datasets. In the front end, this ensured that sections within a Data Card were always coherent, and that readers were aware of relevant information that was missing.
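
A minimal sketch of how such conditional questions might be expressed is shown below. The "show_if" field and the question keys are hypothetical, not the actual schema used by the GEM Data Collection Interface:

```python
# Hypothetical question definitions: "show_if" makes a question conditional
# on a prior answer, so creators only see questions relevant to their dataset.
QUESTIONS = [
    {"key": "is-multilingual", "prompt": "Is the dataset multilingual?"},
    {"key": "languages", "prompt": "Which languages are covered?",
     "show_if": {"is-multilingual": "yes"}},
    {"key": "has-pii", "prompt": "Does the dataset contain PII?"},
]

def visible_questions(answers: dict) -> list:
    """Return the prompts a Data Card creator should currently see."""
    shown = []
    for q in QUESTIONS:
        condition = q.get("show_if", {})
        if all(answers.get(k) == v for k, v in condition.items()):
            shown.append(q["prompt"])
    return shown

print(visible_questions({"is-multilingual": "no"}))   # language question hidden
print(visible_questions({"is-multilingual": "yes"}))  # language question shown
```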

With the help of the Playbook, all of these improvements led to not just an improved user experience, but also an increase in the number of questions answered by dataset owners who submitted their datasets and accompanying Data Cards to GEMv2. That's because the questions in GEMv2 were more targeted, even if the quality of the answers did not necessarily change. Adding new datasets to GEM with the improved documentation became as easy as writing in a document, with all the added benefits of Data Cards. Notably, GEMv2 was able to achieve a fair amount of coverage in a third of the time, and for three times as many Data Cards as in the first version.

The opportunity ahead for the Data Cards Playbook

The Data Cards Playbook can bring together stakeholders from across a dataset's lifecycle to critically analyze and improve their AI dataset documentation. This user-centered approach to transparency in AI datasets can help organizations identify and implement changes that reduce the complexity of dataset documentation at scale, while informing more responsible decision-making.

We plan to continue working to validate and improve activities and guidance in the Data Cards Playbook. If you’ve adapted, implemented, or have feedback for this guidance, we’d love to hear from you at https://github.com/pair-code/datacardsplaybook.

Acknowledgements:

The Playbook workshop was co-facilitated by Mahima Pushkarna and Sebastian Gehrmann. We would like to thank everyone who attended the workshop: Yacine Jernite, Angelina McMillan-Major, Nishant Subramani, Juan Diego Rodriguez, Jasmijn Bastings, and Pawan Sasanka Ammanamanchi, as well as others who provided asynchronous input. The collection tool was implemented by Yacine Jernite; Vivian Tsai and Mahima Pushkarna developed Playbook Labs, which was implemented by Sebastian Gehrmann to display Data Cards in the GEM benchmark environment. We would also like to thank the dataset creators who created Data Cards for their datasets in the GEM benchmark environment, without whom this work would not be possible.
