A bleu meteor in a rouge text world (Image by author)

Monitoring the Invisible Ink: How To Measure Text-Based Generative AI Models

Monitoring text-based generative models using performance metrics such as BLEU, ROUGE, and METEOR scores as well as prediction embeddings

Hakan Tekgul
Mar 17, 2023

In recent years, text-based generative AI models have been making significant strides in natural language processing tasks such as language translation, text summarization, and dialogue generation. These models are capable of generating text that is often indistinguishable from human-generated text, making them increasingly popular in various industries, including customer service, content generation, and data analysis. While these models can be incredibly powerful and useful, they can also produce unexpected or even harmful output, making it critical to monitor them closely.

For example, consider a chatbot that is designed to help customers with their queries. If the model is not monitored, it could generate inappropriate or unhelpful responses, damaging the reputation of the company that deployed it. Therefore, it is essential to monitor these models’ performance regularly to ensure that they are producing accurate and unbiased results. In this article, we will take a deep dive into how to monitor text-based generative models using performance metrics such as BLEU, ROUGE, and METEOR scores, as well as prediction embeddings.

Monitoring Generative Models with Reference Text

In order to evaluate the performance of machine-generated text, a reference text or ground truth is used for comparison. This reference text is what the generative model is ideally expected to produce, and it is usually collected from human domain experts. When reference text is available for the outputs a model generates, several metrics can be used to compute performance. Let’s try to understand the different types of performance metrics for generative models with real-life examples in Python.

BLEU (Bilingual Evaluation Understudy) Score

BLEU is a precision-focused metric that measures the n-gram overlap between the generated text and the reference text. The score also applies a brevity penalty when the machine-generated text is too short compared to the reference text. It is generally used to measure machine translation performance. The score ranges from 0 to 1, with higher scores indicating greater similarity between the generated text and the reference text.

Image by author

The following code demonstrates how to calculate the BLEU score using the NLTK library in Python:
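Below is a minimal sketch using NLTK’s sentence_bleu; the reference and candidate sentences are illustrative placeholders, and a smoothing function is added so that short sentences with no higher-order n-gram overlap do not score zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference(s) and candidate (placeholders for real data)
reference = [["the", "cat", "is", "sitting", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "the", "mat"]

# Smoothing avoids zero scores when a higher-order n-gram has no overlap
smoothie = SmoothingFunction().method1
bleu = sentence_bleu(reference, candidate, smoothing_function=smoothie)
print(f"BLEU score: {bleu:.4f}")
```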

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score

ROUGE is a metric that measures the overlap between the generated text and the reference text in terms of recall. ROUGE comes in three main types: ROUGE-N, the most prevalent form, which measures n-gram overlap; ROUGE-L, which identifies the longest common subsequence; and ROUGE-S, which focuses on skip-grams. ROUGE-N is the most frequently used type, with the following formula:

ROUGE-N formula (Image by author)

The following code demonstrates how to calculate the rouge-2 score in Python:
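One option is a minimal sketch using the rouge_score package (installed with pip install rouge-score); the reference and candidate strings below are illustrative placeholders.

```python
from rouge_score import rouge_scorer

reference = "the cat is sitting on the mat"
candidate = "the cat sits on the mat"

# ROUGE-2 measures bigram overlap; use_stemmer reduces words to their stems
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge2"])  # Score(precision=..., recall=..., fmeasure=...)
```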

The main difference between the two is that BLEU is precision-focused, whereas ROUGE focuses on recall.

METEOR (Metric for Evaluation of Translation with Explicit Ordering) Score

METEOR is a metric that measures the quality of generated text based on the alignment between the generated text and the reference text. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. You can check out the algorithm behind METEOR here.

The following code demonstrates how to calculate the METEOR score using the NLTK library in Python:
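A minimal sketch using NLTK’s meteor_score is shown below; note that recent versions of NLTK expect pre-tokenized input and require the WordNet data to be downloaded, and the example strings are placeholders.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet for synonym matching
nltk.download("omw-1.4", quiet=True)   # WordNet data used by newer NLTK versions

reference = "the cat is sitting on the mat"
candidate = "the cat sits on the mat"

# Recent NLTK versions expect pre-tokenized input (lists of tokens)
score = meteor_score([reference.split()], candidate.split())
print(f"METEOR score: {score:.4f}")
```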

The METEOR metric was designed to fix some of the problems found in the more popular BLEU and ROUGE metrics, and to correlate better with human judgment at the sentence or segment level.

BERT Score

One main disadvantage of metrics such as BLEU or ROUGE is that they score generated text based on exact matches. Exact matches might be important for use cases like machine translation; however, for generative AI models that try to create text that is meaningfully similar to the corpus data, exact-match metrics might not be very accurate.

Hence, instead of exact matches, BERTScore is focused on the similarity between reference and generated text by using contextual embeddings. The main idea behind contextual embeddings is to understand the meaning behind the reference and candidate text respectively and then compare those meanings.

The following code demonstrates how to calculate the BERT score using the bert_score library in Python:
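A minimal sketch using the bert_score package (installed with pip install bert-score) is shown below; the candidate and reference strings are illustrative placeholders, and the first call downloads a pretrained model.

```python
from bert_score import score

candidates = ["the cat sits on the mat"]
references = ["the cat is sitting on the mat"]

# Returns precision, recall, and F1 tensors, one value per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```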

Monitoring Generative Models without Reference Text

When generative models are generating text without any reference, it can be challenging to monitor them since performance metrics such as ROUGE or METEOR cannot be computed. However, just like non-generative models, proxy metrics such as drift can be used to monitor generative models. In this case, since the models’ outputs are text, text embeddings can be leveraged to track the change in predictions. These embeddings provide a representation of the model’s output and can be used to compare different outputs to identify changes in the model’s behavior over time.

Specifically, the Euclidean distance between prediction embeddings can be computed in order to track model change over time. However, simply tracking model drift might not be enough to improve model performance and make sure model behavior is consistent. As an additional step, the computed prediction embeddings can be visualized in a lower-dimensional space, where predictions inside similar clusters would suggest similar semantic meaning. If there are any outlier points inside the lower-dimensional space, those points can be analyzed and might even be used for re-training purposes. Finally, with embedding visualizations, machine learning engineers can also block certain clusters of words so that generative models are not biased.

To demonstrate the use of prediction embeddings, let’s consider an example of a language model trained on a dataset of news articles. Suppose we have a model that produces an article about politics, and we want to compare its output to another article produced by the same model six months later.

First, we can use the transformers library to tokenize the two articles:
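A minimal sketch is shown below, assuming a generic pretrained checkpoint such as bert-base-uncased and two short placeholder article strings (the original example may have used a different model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Placeholder articles standing in for two outputs generated six months apart
article_now = "The government announced a new budget plan on Tuesday..."
article_later = "Lawmakers passed a revised spending bill this week..."

# Tokenize both articles into PyTorch tensors, truncating to the model's max length
inputs_now = tokenizer(article_now, return_tensors="pt", truncation=True)
inputs_later = tokenizer(article_later, return_tensors="pt", truncation=True)
```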

Next, we can generate the embeddings for the two articles using the model’s output:
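Continuing the sketch above, we can run both tokenized articles through the same model to obtain its hidden states, which we pool into embeddings in the next step (again assuming the bert-base-uncased checkpoint):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# No gradients are needed for inference-only embedding extraction
with torch.no_grad():
    outputs_now = model(**inputs_now)
    outputs_later = model(**inputs_later)
```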

Finally, we can use the mean of the model’s last hidden state as the embedding for each article. We can then calculate the Euclidean distance between the two embeddings to compare the two articles:
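Continuing the same sketch, mean-pooling the last hidden state over the token dimension gives one embedding vector per article, and the distance between the two vectors can then be computed:

```python
# One embedding vector per article via mean pooling over tokens
embedding_now = outputs_now.last_hidden_state.mean(dim=1).squeeze()
embedding_later = outputs_later.last_hidden_state.mean(dim=1).squeeze()

# Euclidean (L2) distance between the two prediction embeddings
euclidean_distance = torch.dist(embedding_now, embedding_later, p=2).item()
print(f"Euclidean distance: {euclidean_distance:.4f}")
```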

Now, we can compute the Euclidean distance over all of our prediction embeddings over time and see whether our model’s behavior is changing. An ML observability platform like Arize (full disclosure: I work for Arize) can automatically generate embeddings from your text generation models and compute Euclidean distances over time.

Image by author

You also might find it useful to look at a UMAP visualization of your embeddings in a lower-dimensional space to find clusters with similar semantic meaning. If you would like to learn more about how observability for generative AI models works in real life, you can check out an example from a generative AI company here.
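As a rough sketch, the umap-learn package can project a collection of prediction embeddings down to two dimensions; the embeddings array below is a random placeholder standing in for real prediction embeddings collected over time:

```python
import numpy as np
import umap  # pip install umap-learn

# Placeholder: in practice, stack your real prediction embeddings here
embeddings = np.random.rand(500, 768)

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
coords = reducer.fit_transform(embeddings)  # shape (500, 2), ready to plot or cluster
```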

Next Steps of Generative Model Monitoring

Even though traditional metrics such as BLEU, ROUGE, and METEOR show promise for monitoring model performance, using large language models (LLMs) such as BERT for evaluation is expected to become more common in the next few years.

Using LLMs to evaluate LLMs on complex tasks is an emerging area of research that aims to enhance the performance of language models. The use of LLMs for evaluation can be advantageous since they can capture complex patterns and dependencies within large datasets that traditional evaluation metrics may overlook. Additionally, LLMs can be trained on a wide range of tasks, which can aid in the evaluation of other LLMs across multiple domains. As an example, LangChain provides some chains/prompts to evaluate question answering tasks by using LLMs.
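As a rough illustration, LangChain’s QAEvalChain can grade predicted answers against reference answers with an LLM; the exact API and output format vary by LangChain version, and the example question, answer, and prediction below are placeholders:

```python
from langchain.evaluation.qa import QAEvalChain
from langchain.llms import OpenAI

# Placeholder reference examples and model predictions
examples = [{"query": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"}]
predictions = [{"result": "The novel was written by Jane Austen."}]

llm = OpenAI(temperature=0)  # requires an OpenAI API key in the environment
eval_chain = QAEvalChain.from_llm(llm)

# The LLM grades each prediction against the reference answer
graded = eval_chain.evaluate(examples, predictions)
print(graded)  # e.g. a CORRECT / INCORRECT verdict per example, depending on version
```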

Conclusion

Monitoring of text-based generative AI models is a crucial process that ensures their performance and fairness over time. Using performance metrics such as BLEU, ROUGE, and METEOR scores, we can evaluate the quality of the model’s output and track changes in its behavior. Additionally, prediction embeddings are a valuable tool for identifying and monitoring embedding drift, which can help improve the model’s accuracy and fairness. However, there are limitations to generative AI model monitoring, and additional measures such as diverse training data and regular retraining may be necessary to ensure model performance. Overall, by incorporating monitoring techniques and best practices, we can ensure the continued success of text-based generative AI models in a variety of applications, from chatbots to content generation and beyond.


Hakan Tekgul

Data scientist and ML engineer. Currently at Arize AI (formerly Tazi AI). MS in Computer Engineering from Georgia Tech.