Jury: Evaluating performance of NLG models

Devrim Çavuşoğlu · Published in Codable · Jul 28, 2021

TL;DR

Jury is an evaluation package for NLG systems. It lets you compute many metrics in one go, runs the evaluation metrics concurrently, and supports evaluation with multiple predictions per sample.

Jury uses the datasets package for metrics, and thus supports any metric that the datasets package provides. The default evaluation metrics are BLEU, METEOR and ROUGE-L. As of today, 28+ metrics are available in the “datasets” package; to see all supported metrics, see datasets/metrics. On top of that, custom metrics are easy to create by inheriting from datasets.Metric and implementing its abstract functions.

Repository: https://github.com/obss/jury

Introduction

Assessing “how good” a model is is a common task in machine learning, needed to measure quality, compare models, select models, etc. In tasks where the end result is a numeric output, evaluation is fairly straightforward; it is less obvious, yet no less important, to evaluate the performance of a model whose end result is natural language (text).

The evaluation of Natural Language Generation (NLG) outputs has been studied for many years, with many authors proposing metrics to automate the evaluation process. The aim of automated NLG metrics is to evaluate the quality of generated outputs in terms of goodness and consistency at the linguistic and intelligibility levels: semantics, lexical choice, grammatical correctness, fluency, diversity, etc. There is also a school of thought that regards human assessment as a must in the evaluation of NLG outputs. Lately, many automated metrics that correlate well with human evaluations have been adopted for NLG tasks: they are widely used in machine translation (MT), as well as in automatic summarization (AS), question generation (QG), image captioning (IC), etc.

Currently, numerous evaluation metrics have been proposed for NLG tasks. I will not dive into the details of individual metrics, but for those who wish to take a look at the NLG metrics proposed to date, I recommend the survey paper on NLG evaluation metrics [1], which also gives brief descriptions of how the metrics work.

Foreword

Before developing this package, we adopted and used Maluuba’s NLGEval package for our internal projects. NLGEval’s API is well designed and can be used with just a few lines of code. However, NLGEval ships with a fixed set of metrics and has several drawbacks when it comes to creating a custom metric. The development of Jury started from the need for a comprehensive evaluation tool where the user can try many metrics with ease. We tried to preserve the ease of use of NLGEval’s API, but extended our package in terms of the capabilities of the evaluation process and the easy creation of user-defined metrics.

Package

We present Jury, a compound package for the evaluation of NLG systems. It was initially developed for internal use and is now shipped as an open-source package. We had several reasons for creating it:

i) Evaluating with multiple NLG metrics in one go.

ii) Unifying the structure of evaluation metrics in terms of input and output.

iii) Making evaluation available with multiple predictions.

In Jury, we adopted metrics from Hugging Face’s datasets package. This has several advantages:

  • Promoting open-source contribution to the “datasets” package. If users implement a metric for their own use, the implementation can be contributed to datasets.metrics and made available to others as well.
  • The “datasets” package has a unified structure for metrics, which inherit from the Metric class, so anyone can create their own evaluation metric with ease (see the sketch after this list).
  • The “datasets” package is actively maintained by Hugging Face, and its contribution rate is high. It would be unnecessary and odd to write our own metric definitions.
  • Any metric available in the “datasets” package is seamlessly available in Jury.
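
To give a feel for what this looks like, here is a minimal sketch of a hypothetical ExactMatch metric built on the datasets.Metric interface; the metric and its fields are illustrative and not part of Jury or the “datasets” package.

import datasets

class ExactMatch(datasets.Metric):
    """Hypothetical metric: fraction of predictions identical to their reference."""

    def _info(self):
        # Describe the metric and the expected input features.
        return datasets.MetricInfo(
            description="Exact match ratio between predictions and references.",
            citation="",
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
        )

    def _compute(self, predictions, references):
        # Score a batch: one prediction per reference.
        matches = sum(p == r for p, r in zip(predictions, references))
        return {"exact_match": matches / len(predictions)}

Once defined, such a class exposes the same compute interface as the built-in metrics, which is what makes custom metrics usable from Jury as well.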

So, what does Jury bring that does not currently exist elsewhere?

1. Concurrency

We brought concurrency to the evaluation loop so that each metric runs in parallel. The package uses multiprocessing, so with a single parameter you can make your evaluation run concurrently, which reduces run time significantly.
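
As a rough sketch of how this looks in use (the metric names and the run_concurrent flag follow the project README of the time and may differ between Jury versions):

from jury import Jury

predictions = ["the cat sat on the mat"]
references = ["the cat is sitting on the mat"]

# A single flag switches each metric's computation to its own process.
scorer = Jury(metrics=["bleu", "meteor", "rouge"], run_concurrent=True)
scores = scorer.evaluate(predictions, references)
print(scores)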

2. Multiple Predictions

Yes, I know this may sound odd. The “datasets” package only supports one-to-one comparison, i.e. one reference per prediction (except for a few metrics such as BLEU and sacreBLEU). NLGEval, on the other hand, allows multiple references per prediction.

As far as we know, there is currently no software that supports multiple predictions. Although the current implementations in the “datasets” package support one reference per prediction (except BLEU and its derivations), during the development of a model you may want to generate multiple predictions per sample and see how your model performs. Thus, Jury allows you to pass multiple predictions and multiple references together with a reduce function of your preference.

How does this reduce function work on multiple predictions? Since datasets.metrics does not support them directly, we altered the structure of the inputs and the computation a bit. Below is a GIF summarizing the process using BLEU-2 (bigrams).

Evaluating multiple predictions with reduce function (max).

The reduce function can be any aggregation function applied to the scores obtained from the multiple predictions; the default is “max”. Out of the box, Jury accepts any NumPy function name, so you can just pass a string to the reduce_fn parameter, but you can use your own custom aggregation function as well.
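
Here is a hedged sketch of what a script along the lines of multiple_predictions.py might look like; the inputs are placeholders and the exact method and parameter names may vary across Jury versions.

# multiple_predictions.py (sketch)
from jury import Jury

# One sample with two candidate predictions and a single reference.
predictions = [
    ["the cat is on the mat", "there is a cat on the mat"],
]
references = [
    "the cat is playing on the mat",
]

# Default metrics (BLEU, METEOR, ROUGE-L); reduce_fn aggregates the scores of the
# multiple predictions: any NumPy function name ("max", "mean", ...) or a callable.
scorer = Jury()
scores = scorer.evaluate(predictions, references, reduce_fn="max")
print(scores)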

The output will be a dictionary.

$ python multiple_predictions.py
{
  "empty_predictions": 0,
  "total_items": 1,
  "BLEU": 0.5946035575013605,
  "Meteor": 0.8757427021441488,
  "rougeL": 0.9333333333333333
}

References

[1] Sai, A. B., Mohankumar, A. K., & Khapra, M. M. (2020). A survey of evaluation metrics used for NLG systems. arXiv preprint arXiv:2008.12009.
