Text Generation in Customer Service (Part 2)

Renhao Cui
Published in Emplifi
Apr 14, 2023

Before you read

In Part 1 of this article, we introduced our retrieval-based response letter generation solution for customer service. In Part 2, we continue this topic and show the evaluation of our solution on real industry data, using both standard NLP evaluation metrics and human evaluation. The full test dataset consists of 1690 samples, of which 910 were randomly selected for human evaluation.

Standard Evaluations

In this section, we use standard NLP evaluation metrics to report the performance of the three modules of our framework: 1. Retrieval, 2. Core Response Generation, and 3. Paraphrasing of templates. We also look into several other aspects to better understand the model's performance.

Retrieval Performance

A higher similarity between the retrieved and reference responses is a prerequisite for better generation and hence indicates better retrieval. Our analysis finds that for 46% of test queries, our retrieval model fetches at least one reference-like (similarity > 0.9) historical response within the top-10 candidates of each retrieval. Furthermore, our manual evaluation of randomly sampled retrievals finds 49% of the retrieved responses suitable for generation and an additional 20% somewhat relevant. With meaningful thresholds, there are enough retrieved historical cases that can be used to generate better responses.
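
To make the check concrete, below is a minimal sketch of the reference-like candidate test. We assume retrieval by query-to-query similarity with a sentence-transformers bi-encoder, which may differ from the retrieval model we actually use; the checkpoint name, data fields, and helper function are illustrative assumptions.

```python
# A minimal sketch: does any top-k retrieved historical response look like the reference?
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not the production one

def has_reference_like_candidate(query, reference, history, top_k=10, threshold=0.9):
    """True if any of the top-k retrieved historical responses has cosine
    similarity > threshold to the reference response."""
    hist_queries = [h["query"] for h in history]
    hist_responses = [h["response"] for h in history]

    # Retrieve the top-k historical cases by query similarity.
    q_emb = encoder.encode(query, convert_to_tensor=True)
    hq_emb = encoder.encode(hist_queries, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, hq_emb, top_k=top_k)[0]

    # Check whether any retrieved response is reference-like.
    ref_emb = encoder.encode(reference, convert_to_tensor=True)
    cand_emb = encoder.encode([hist_responses[h["corpus_id"]] for h in hits],
                              convert_to_tensor=True)
    return bool((util.cos_sim(ref_emb, cand_emb) > threshold).any())
```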

Generation Quality (Standard Scores)

The automatic evaluation of all methods is conducted on the full test set with the optimal hyper-parameter settings, and the results are shown in Figure 1.

Figure 1: Test set results of the proposed response generation model on the DC and TT datasets. Automatic scoring metrics are: Average Sentence Similarity (S), BLEU-4 (B), NIST (N), METEOR (M), ROUGE-L (R) and CIDEr (C).

For the Retrieve Only method, we consider only the fetched historical response (without refinement) as the hypothesis. For the baseline, RetRef, and Hybrid methods, we consider hypotheses produced with the (top-k, top-p with temperature) decoding setup, as its corpus-level score was better than other combinations. In the ranking-enforced versions of RetRef and Hybrid, instead of a new generation, for each query we pick the hypothesis with the highest rank score for evaluation.
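
As an illustration, the snippet below sketches the (top-k, top-p with temperature) sampling setup using the Hugging Face transformers generate API. The checkpoint name, prompt format, and exact sampling values are assumptions, not necessarily the settings used in our experiments.

```python
# A minimal sketch of sampling several candidate hypotheses from a GPT-2 model.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # stand-in for the fine-tuned checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Customer query: ... Retrieved response: ... Agent response:"  # assumed input format
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # sampling rather than greedy/beam decoding
    top_k=50,                # keep only the 50 most likely tokens
    top_p=0.9,               # nucleus sampling over 90% of probability mass
    temperature=0.8,         # soften/sharpen the token distribution
    max_new_tokens=128,
    num_return_sequences=5,  # several candidates for a ranker to choose from
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```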

To assess our model, we use commonly adopted metrics: Average Sentence Similarity, BLEU, NIST, METEOR, ROUGE-L and CIDEr. The Average Sentence Similarity score measures the semantic similarity between the reference response and the hypothesis. We use sentence-BERT [1], a trained Siamese BERT network, to encode a reference and a hypothesis and then calculate the cosine similarity of the resulting embeddings. The final similarity score is the mean value over the test set.
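
The Average Sentence Similarity metric can be computed as sketched below with the sentence-transformers library; the specific checkpoint is an assumption, and [1] describes the original sentence-BERT models.

```python
# A minimal sketch of the Average Sentence Similarity metric over a test set.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def average_sentence_similarity(references, hypotheses):
    """Mean cosine similarity between embeddings of each aligned
    (reference, hypothesis) pair over the test set."""
    ref_emb = sbert.encode(references, convert_to_tensor=True)
    hyp_emb = sbert.encode(hypotheses, convert_to_tensor=True)
    sims = util.cos_sim(ref_emb, hyp_emb).diagonal()  # aligned pair similarities
    return sims.mean().item()

# Illustrative usage on a toy test set:
refs = ["Thanks for reaching out. A replacement is on its way."]
hyps = ["Thank you for contacting us; we have shipped a replacement."]
print(average_sentence_similarity(refs, hyps))
```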

For both datasets, our proposed retrieval-based response generation model (RetRef+Rank) outperforms all other baselines on all metrics (Figure 1). Specifically, it achieves an average improvement across metrics of 16% on the DC dataset and 7% on the TT dataset over the fine-tuned GPT-2 baseline without retrieval. Understandably, knowledge retrieval plays a key role in this improvement. On the other hand, without refinement, the Retrieve Only approach yields the worst scores. The Hybrid version can switch between the baseline and RetRef depending on the availability of suitable retrieved responses, and it is evaluated including such test cases. Nevertheless, it outperforms the baseline model by a significant margin across all metrics and datasets.

To measure the extent to which the model incorporates retrieved knowledge in generation, following previous work, we measure the word overlap between generated and retrieved responses. The result shows that our RetRef+Rank model retained more than 70% of the words from the retrieved information in 51% and 57% of the test generations for the DC and TT datasets, respectively. This is a clear improvement over the baseline and the basic RetRef model, which show such overlap less frequently.
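
A minimal sketch of such a word-overlap check is shown below; the simple regex tokenization and helper name are illustrative, while the 70% retention threshold follows the text.

```python
# A minimal sketch of measuring how much retrieved knowledge a generation retains.
import re

def retained_word_fraction(generated: str, retrieved: str) -> float:
    """Fraction of unique words in the retrieved response that also appear
    in the generated response (simple lowercase word tokenization)."""
    gen_words = set(re.findall(r"\w+", generated.lower()))
    ret_words = set(re.findall(r"\w+", retrieved.lower()))
    if not ret_words:
        return 0.0
    return len(ret_words & gen_words) / len(ret_words)

# Illustrative usage: a generation "retains" the retrieved knowledge
# if more than 70% of the retrieved words reappear in it.
retrieved = "We are sorry for the delay and have issued a full refund."
generated = "We are sorry for the delay; a full refund has been issued to your account."
retains_knowledge = retained_word_fraction(generated, retrieved) > 0.7
```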

Qualitative Analysis

We also conduct a manual inspection to assess the relevance and informativeness of a small set of randomly selected hypotheses, rated by three domain experts. Relevance measures whether a generated response addresses the corresponding product and reason, whereas informativeness checks its information consistency with respect to the reference response (both scored out of 5). The results show that responses produced by our RetRef+Rank model yield roughly a 9% higher relevance score (4.05 for DC, 4.49 for TT) and a 12% better informativeness score (3.75 for DC, 4.24 for TT) than the baseline model on both datasets.

Ablation Study

Apart from the inclusion of retrieved knowledge, two other notable contributors to the performance of the framework are data augmentation and response ranking. Our experiments reveal that creating additional training instances with multiple candidate responses increases the automatic scores by 12% in BLEU-4, 6% in CIDEr, and roughly 2% in the other metrics. The role of ranking is also evident from the significant rise of the RetRef+Rank and Hybrid+Rank model scores over their base versions, as shown in Figure 1. This can be attributed to the ranker's policy of penalizing irrelevant generations while favoring those that integrate quality retrievals.

Examples and Discussion

Table 1: Sample response generation using our RetRef+Rank model

Table 1 provides a randomly selected generation. It suggests that our model's response is aligned with the customer letter type: the customer inquiry is addressed with clarification, appreciation and information. In addition, having historical knowledge, our model is not only capable of producing an informed response but also refines it according to the query. A few limitations of our model include its inability to verify time-sensitive historical information and to handle multiple questions within a single message. Additionally, the way it offers a coupon or commits to a follow-up with a customer poses a risk. To resolve these issues, a risk or confidence measuring system could be introduced that seeks human inspection before dispatching a risky response. We leave this as future work.

Paraphrasing Outcome

We obtained 2826 unique new ending templates that were not among the original input ending templates. After sorting them in ascending order of their perplexity and their lexical and semantic similarity, we conduct a manual evaluation. It shows that 90% of the first 100 candidates are good to use, whereas this fraction drops to 36% for the last 100 candidates.
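
For the perplexity part of this sorting, a minimal sketch is shown below using an off-the-shelf GPT-2 language model from Hugging Face transformers; the checkpoint and candidate templates are illustrative, and the lexical and semantic similarity terms are omitted here.

```python
# A minimal sketch of perplexity scoring for paraphrased ending templates.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

candidates = [
    "Please let us know if we can help with anything else.",
    "Feel free of reach out us anytime you needing.",
]
ranked = sorted(candidates, key=perplexity)  # lower perplexity = more fluent
```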

Human Evaluations

Automatic evaluation is not enough to measure the true quality of a text generation system, so we also conduct a human evaluation on 910 randomly selected test samples using trained raters. For each case, a human rater is shown the customer query, the generated response (hypothesis), the retrieved query-response pair, and the reference response. The raters were then asked to answer the following four multiple-choice questions:

Table 2: Question and options settings for our human evaluation task

For each question, there are four choices ordered from bad to good. The first two questions (a and b) evaluate the quality of the generated response (hypothesis) from two perspectives: 1. relevance of the response without knowing the actual information, and 2. informativeness of the response with access to the true information. The last two questions (c and d) assess the performance and impact of the retrieval system, respectively. Each case is rated by 3 different raters and the final score is the average of their ratings.

Figure 2: Distribution of the options selected by the annotators

Based on Fleiss' Kappa, the raters' agreement was fair for questions a) and c), moderate for question b), and good for question d). In around 75% of cases for both questions c) and d), the raters chose 'option 4'. This indicates that, for those cases, the raters found the retrieved query very similar to the corresponding customer query and the generated responses aligned with the retrieved information.
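
For reference, Fleiss' Kappa for one question can be computed as sketched below with statsmodels; the ratings matrix is illustrative dummy data, with one row per case and one column per rater.

```python
# A minimal sketch of Fleiss' Kappa for a single question.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = cases, columns = the three raters, values = the option chosen (1-4); dummy data
ratings = [
    [4, 4, 3],
    [2, 3, 2],
    [4, 4, 4],
    [1, 2, 1],
]

table, _ = aggregate_raters(ratings)  # per-case counts of each selected option
kappa = fleiss_kappa(table)           # agreement beyond chance
print(f"Fleiss' Kappa: {kappa:.3f}")
```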

The evaluation also shows that for around 59% of the cases, the raters found the generated response plausible or relevant without seeing the reference. With respect to the reference, the raters considered the model's response at least equivalent to the reference in 61% of cases.

Table 3: Comparison of distributions between copy and non-copy cases

In 75% of cases, the model's generated responses are partial or full copies of the retrieved response. Of these cases, 39% of the generated responses lexically match the corresponding reference responses. According to the human evaluation, in 65% of these cases the copy-driven generated responses are equivalent or preferable to the reference response. These human-supported cases include almost all of the machine-evaluated matched cases (38% out of the aforementioned 39%). Furthermore, the human and machine labels of whether a generated response is equivalent to the reference response co-occur in 71.98% of all copy-based generations.

On the other hand, in around 25% of cases the generated responses are lexically different from (not copied from) the corresponding retrieved responses. 18% of such generated responses lexically match the reference response, whereas the human evaluation considers 54% of them equivalent or preferable to the reference response. The human and machine evaluations of these generated responses match in 80.99% of cases.

Identification of Applicable Cases

While in many cases our model is able to produce plausible responses, there are some cases where it struggles. Based on the human evaluation, we want to identify cases in which our model is likely to generate human-favored responses. Our manual observation suggests that there are three scenarios in which the model struggles:

  1. The retrieved query is not a good match to the current query.
  2. The retrieved response is only loosely connected to the current query.
  3. The current query is complex or infrequent (i.e., rarely seen during training); the same holds for complex, less generic retrieved responses.

In the above scenarios, the model tends to craft responses instead of following the retrieved responses, and in the process it often generates generic responses that it was frequently exposed to during training for the corresponding reason code. On the other hand, in the absence of these three scenarios, the model tends to follow the retrieved response, which typically results in a human-favored response.

Therefore, we want to capture the above three scenarios with the following measures:

  1. Retrieval score.
  2. An entailment model trained between query and response.
  3. Query length as a proxy for complexity: complexity has been observed in longer queries, so query length can serve as a feature to measure query complexity. Sometimes simple queries from an infrequent reason code also make the model hallucinate. For retrieved responses, a larger length does not prevent generating a faithful response if the response is a generic one.

Table 4: Relation between the generation quality and query-response similarities

Based on the previous three observations about when the model typically succeeds, we check whether the data supports them. In Table 4, generated response quality is defined as the sum of the raters' ratings and the semantic and lexical similarity between the reference and the generated response. Each row gives the correlation and p-value (level of significance) between the item in the first column and the generated response quality. We notice that the correlation of the retrieval score (query to retrieved-query similarity) is highly significant (p-value << 0.05) in determining the model's generation quality. The other significant correlation, between generation quality and retrieved response length, indicates that the model is typically right when it copies long, generic retrieved responses.

The query to retrieved-response similarity is not exactly entailment, and its correlation is not significant either. Finally, query length is used as a heuristic for complexity, but its correlation with response quality is also not strong.
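
The correlation analysis behind Table 4 can be reproduced with a standard significance-tested correlation, as sketched below; we assume a Pearson correlation here (the text does not name the specific coefficient), and both arrays are illustrative placeholders rather than our actual measurements.

```python
# A minimal sketch of correlating one feature with generated response quality.
from scipy.stats import pearsonr

retrieval_scores = [0.92, 0.71, 0.88, 0.65, 0.97, 0.59]  # query vs. retrieved-query similarity
response_quality = [4.3, 2.8, 4.0, 2.5, 4.7, 2.2]        # combined raters' + similarity score

corr, p_val = pearsonr(retrieval_scores, response_quality)
print(f"correlation={corr:.3f}, p-value={p_val:.4f}")
```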

To further improve the predictability of when the model will be useful, we apply three entailment models to the human-rated dataset. A cross-encoder model is used to predict the human rating, i.e., whether the entailment of a query and retrieved-response pair, or of a query and hypothesis pair, indicates a good final output.
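
As an illustration of the cross-encoder idea, the sketch below scores both pair types with a publicly available NLI cross-encoder; the checkpoint, example texts, and label interpretation are assumptions, and this is not the BERT-based cross-encoder evaluated in Table 5.

```python
# A minimal sketch of entailment scoring for query-retrieved-response and
# query-hypothesis pairs with a pretrained NLI cross-encoder.
from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")  # assumed public checkpoint

customer_query = "My order arrived damaged. Can I get a replacement?"          # illustrative
retrieved_response = "We are sorry to hear that and will ship a replacement."  # illustrative
hypothesis = "We apologize for the damaged item and will send you a new one."  # illustrative

pairs = [
    (customer_query, retrieved_response),  # query / retrieved-response pair
    (customer_query, hypothesis),          # query / hypothesis pair
]

# The model returns class scores (contradiction / entailment / neutral per its
# model card); the entailment probability can serve as a usefulness signal.
scores = nli_model.predict(pairs, apply_softmax=True)
```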

We compared the results of the three cross-encoder models on both pair types.

Table 5: Performance of entailment models on query-retrieved response and query-hypothesis pairs

Table 5 lists the accuracies of the three entailment models. The results show that our BERT-based cross-encoder improves the accuracy by around 100% over the other two models, indicating that we can effectively estimate the generation's usefulness in around 70% of cases.

Conclusion

We propose a neural response generation framework to reduce human labor in a customer care setting. Considering a real-world scenario where a structured knowledge base is scarce, our framework extracts knowledge from historical cases and uses it for informative response generation. Our evaluation shows the efficacy of the ranking system and provides evidence for the applicability of the framework in real-life business operations. We plan to equip the framework with a response validation module for further improvement.

Reference

[1] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).
