Text Generation in Customer Service (Part 1)

Renhao Cui
Emplifi
Jan 27, 2023

Before you read

The article will be easier to follow if you have some understanding of generative models in the text domain. It would be even better if you are familiar with Transformer or GPT models. For more technical details, take a look at our NAACL paper Retrieval Based Response Letter Generation For a Customer Care Setting.

Background

Most modern businesses operate a customer care service to support customers with product information and to address their concerns. Letter-like communication (such as emails) serves as a major channel for such customer relationship management. These exchanges are initiated by requests from customers and then responded to by the organization within the same channel. This support service plays a vital role in ensuring a good customer experience and is a key factor in developing goodwill.

Figure 1: Example of a customer-agent email exchange

Traditionally, a trained human acts as a customer care representative or agent and provides customer support through live conversation or via email exchanges (Figure 1). Doing this manually at scale demands an enormous human effort, as the task involves understanding each customer’s query and then replying appropriately. Given a massive volume of customer queries, the process becomes very time-consuming.

Automating the response generation process can go a long way towards solving this problem. That said, while the rule-based systems that exist today can handle very specific requirements, they often struggle to capture the linguistic complexity of real-world communications. To address these issues with advanced technology, we explore an AI-driven approach that is robust, efficient and pragmatic with respect to available resources.

Use cases

The major use cases covered in our task focus on the automatic generation of response letters in customer service. Typically, this step is handled by human agents with the help of rule-based or template-aided systems. In contrast, we plan to build a system that can generate appropriate, consistent, and diverse responses.

The benefits of such a system are:

  1. Reduction of human labor and increase of support volume
  2. Diversity in agent responses, giving customers the sense of a personalized reply
  3. Query analytics that help us understand customers’ inconveniences and expectations
  4. Consistency across customer service cases

Targets and goals

A fully automatic generation system is always desirable; however, there are certain concerns when applying one to a real customer-facing process. That’s why we decided to give human agents the option to adopt the final text generated by the system. By having human agents verify the generations and make any necessary modifications, we can ensure improved accuracy and a better customer experience.

Challenges

Recent advances in language modeling have shown significant success in generating fluent text. In goal-oriented response generation systems, user utterances are usually fact-finding queries. To learn to understand such queries, existing solutions use datasets annotated with slot-value pairs (e.g. ‘Find a park near area51’, destination: ‘park’, close_to: ‘area51’). The corresponding framework first maps the values to slots and then uses the slot-value pairs to retrieve facts from a knowledge base to form a reply.
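For illustration, a slot-value annotation for such a fact-finding query might look like the following sketch (the slot names are illustrative, not taken from any specific dataset):

```python
# Illustrative slot-value annotation for a fact-finding query in a
# task-oriented dialogue dataset (slot names are hypothetical).
annotated_query = {
    "utterance": "Find a park near area51",
    "slots": {
        "destination": "park",
        "close_to": "area51",
    },
}

# A typical framework would map these slot-value pairs to a knowledge-base
# lookup (e.g. find entities of type "park" located near "area51") and build
# the reply from the retrieved facts.
```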

Unfortunately, in reality, both the slot-value annotation and structured knowledge base are often unavailable and difficult to manage. Besides, in a customer care setting, a user prompt may not be limited to just an inquiry about facts, but may also include a complaint, suggestion, compliment, request, etc.

In this work, we factor in these challenges and present a response generation framework that automatically produces and ranks response letters addressing customers’ queries or feedback, and which doesn’t require fine-grained annotation.

Our Approaches

Explore the data

In the experiments, we use two proprietary datasets from the restaurant and adhesive tape industries, naming them DC and TT. They consist of 8,448 and 14,938 unique email exchanges respectively between customers and agents. Each exchange contains a case ID, product and reason code of the service, as well as the customer query letter and the human agent’s response letter. The reason code stands for the type of customer query. A simple example of a customer service case in the dataset is listed below.

Customer Query: I just signed up for X-org this morning and have not received my coupon for free pancakes yet. When will I receive it?

Agent Response: After signing up, it may take up to 24 hours to receive your initial offer.

Product Code: SERVICE RELATED, Reason Code: GC — PROMO — I
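As a rough sketch, one such case can be represented as a simple record; the field names below are our own and may not match the proprietary schema:

```python
# One customer-service case, represented as a plain dict.
# Field names are illustrative; the proprietary DC/TT schemas may differ.
case = {
    "case_id": "DC-000123",            # hypothetical identifier
    "product_code": "SERVICE RELATED",
    "reason_code": "GC — PROMO — I",   # type of customer query
    "customer_query": (
        "I just signed up for X-org this morning and have not received "
        "my coupon for free pancakes yet. When will I receive it?"
    ),
    "agent_response": (
        "After signing up, it may take up to 24 hours to receive your "
        "initial offer."
    ),
}
```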

Data Preparation

We mask specific personally identifiable or proprietary information elements such as names, email addresses, phone numbers, prices, franchise names, and dates in our dataset with corresponding generic tokens (“X-email”, “X-phone”, etc.). This serves two purposes. First, this anonymization protects the privacy of the customers and the organization. Second, it forces the model to learn from and generate generic tags rather than be distracted by noise in the form of irrelevant details such as specific names and values.
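A minimal sketch of this masking step is shown below; the patterns are simplified (the real pipeline also masks names, prices, franchise names, and dates), and the regular expressions are our own:

```python
import re

# Simplified delexicalization: replace personally identifiable or proprietary
# details with generic tokens before training.
MASKS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "X-email"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "X-phone"),
]

def delexicalize(text: str) -> str:
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return text

print(delexicalize("Reach me at jane.doe@example.com or 555-123-4567."))
# -> "Reach me at X-email or X-phone."
```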

Note that a typical response may be split into two parts: the Core and the Template.

Core responses directly answer or otherwise address the customer’s query and essentially have a causal relationship with it. These responses involve a diverse set of products and reasons.

Template responses express gratitude and provide contact information for further communication. These remarks are independent of the nature of the customer letter, and hence they have very little diversity across the dataset. We postulate that expansion of this set will add useful diversity to the response letter.

Given this categorization (of Core vs. Template), we split each response in our dataset into these two parts with a view to treating them separately. To locate the splitting position, we take the start of the latest sentence that contains frequently used template phrases (frequent bi-grams or tri-grams), as sketched below. After this separation, each dataset is randomly divided into training (~60%), validation (~20%) and test (~20%) sets.
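Here is a simplified sketch of that splitting heuristic, assuming a hand-picked list of template phrases (in practice these would be mined from the corpus as frequent n-grams):

```python
# Split a response into (core, template) at the start of the latest sentence
# containing a frequent template phrase. Sentence splitting here is naive.
TEMPLATE_PHRASES = {"thank you for", "please contact", "do not hesitate"}

def split_response(response: str):
    sentences = response.split(". ")
    split_idx = len(sentences)  # default: no template part found
    for i, sent in enumerate(sentences):
        if any(phrase in sent.lower() for phrase in TEMPLATE_PHRASES):
            split_idx = i       # keep updating to capture the latest match
    core = ". ".join(sentences[:split_idx])
    template = ". ".join(sentences[split_idx:])
    return core, template
```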

Design the workflow

We next present an end-to-end framework for generating context-aware formal responses to customer letters. As discussed earlier, a formal response letter consists of a core response and closing remarks. To handle these tasks separately, we split our framework into 2 modules: (1) Core Response Generator, and (2) Closing Remarks Diversifier.

A conditional text generation model is trained for the core response generator. For closing remarks, we create additional templates from the existing ones, by fine-tuning a separate language model for diversifying the responses through the use of paraphrasing. Following this, a closing remark is selected from the pool of expanded ending templates and concatenated to the generated core response to form a formal response letter.
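At a high level, the two modules compose as sketched below; both components are placeholders for the models described in the following sections:

```python
import random

# End-to-end composition: generate the core conditioned on the query, then
# append a closing remark drawn from the expanded template pool.
def generate_letter(query: str, core_generator, closing_remarks: list[str]) -> str:
    core = core_generator(query)              # conditional core response generation
    closing = random.choice(closing_remarks)  # sampled from the diversified templates
    return f"{core}\n\n{closing}"
```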

Core Response Generation

In this section, we first describe a potential baseline approach for response generation and then put forward the retrieval-guided response generation framework. In both approaches, we build on the pre-trained causal language model GPT-2 [1]. GPT-2 is largely built upon the decoder block of the original transformer architecture [2] and employs a stack of masked self-attention layers in which each token can attend to tokens from the past (left context) but is blocked from upcoming (right context) ones. The auto-regressive nature of the model makes it a natural choice for conditional text generation. Furthermore, because it has been pre-trained on a massive web-text dataset, the language model is known to produce fluent responses when given a prompt.
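As a concrete starting point, the backbone can be loaded with the Hugging Face transformers library (one possible toolkit; the article does not prescribe one):

```python
# Load the pre-trained GPT-2 backbone.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 is auto-regressive: each token attends only to the left context,
# which is what makes it suitable for conditional generation from a prompt.
```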

Figure 2: Core response generation steps

Baseline generation models

We approached our goal of producing a response for a customer query as a conditional text generation task and hence fine-tuned the pre-trained GPT-2 model on our data. Given a customer query letter, we want to generate a response that is conditioned on the query.
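A minimal sketch of how such a conditional training instance can be serialized into a single sequence; the special tokens are our own choice, not mandated by the article:

```python
# Serialize a (query, response) pair so the response is conditioned on the query.
QUERY, SEP, EOS = "<|query|>", "<|response|>", "<|endoftext|>"

def build_training_text(query: str, response: str) -> str:
    return f"{QUERY} {query} {SEP} {response} {EOS}"

# At inference time the model is prompted with "<|query|> ... <|response|>"
# and decodes the continuation as the generated reply.
```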

History Guided Generation

In the aforementioned approach, the model lacks access to factual information while responding to a query and tends to make up a safe or hallucinated reply. For instance, in response to a customer’s question regarding a restaurant’s service availability, the baseline model is seen to generate “don’t know” or “open” although the dataset indicates its closure.

To address this issue, in this task, a Retrieve and Refine (RetRef) [3] mechanism is employed. The idea is to retrieve valid responses for similar queries used in the recent past and utilize those responses, in addition to the query, to generate a refined (coherent) response.

We split the whole task into three steps: 1. Knowledge Retrieval, 2. Response Generation and 3. Response Ranking. The framework is depicted in Figure 3 and detailed in the following subsections.

Figure 3: Retrieval-guided core response generation framework

1. Search for similar historical cases

Given a current customer query (q𝒸), knowledge can be extracted from an agent’s past response (rₚ) to a similar query (qₚ). To find such cases, we assign a candidate score c = sim(q𝒸, qₚ) + bleu(q𝒸, qₚ) to each historical query-response pair, where bleu is the BLEU-1 score between the two queries and sim(q𝒸, qₚ) = cos(E(q𝒸), E(qₚ)) is the cosine similarity between the embeddings of the current and the past query, respectively.
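A sketch of this candidate score, using sentence-transformers for the query embeddings and NLTK for BLEU-1 (the library choices are ours; the article does not name specific tools):

```python
# Candidate score c = sim(q_c, q_p) + bleu(q_c, q_p).
from sentence_transformers import SentenceTransformer, util
from nltk.translate.bleu_score import sentence_bleu

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def candidate_score(current_query: str, past_query: str) -> float:
    emb_c, emb_p = encoder.encode([current_query, past_query], convert_to_tensor=True)
    sim = util.cos_sim(emb_c, emb_p).item()
    bleu1 = sentence_bleu([past_query.split()], current_query.split(),
                          weights=(1, 0, 0, 0))
    return sim + bleu1
```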

Figure 4: Retrieval steps

The embeddings are obtained using a sentence-transformer and can be pre-computed to make retrieval fast. During training, the similarity between the corresponding responses, sim(r𝒸, rₚ), is also added with a weight 𝛾 < 1 to ensure that the model finds a relation between them and transfers knowledge. This response similarity is not used during testing, however, as the reference response is unknown at that point. Furthermore, for training, we choose all the responses from candidate pairs that have c > 𝜏, where 𝜏 is a hyperparameter. Having more potential candidate responses essentially augments the training instances; we explain their use in the next section.
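Continuing the previous sketch (it reuses `encoder` and `candidate_score` defined there), the training-time scoring and candidate selection could look like this; the 𝛾 and 𝜏 values are purely illustrative:

```python
# Training-time variant: response similarity is added with weight gamma (< 1),
# and only candidates scoring above tau are kept.
GAMMA, TAU = 0.5, 1.0   # illustrative hyperparameter values

def training_candidate_score(q_c, q_p, r_c, r_p) -> float:
    score = candidate_score(q_c, q_p)
    emb_rc, emb_rp = encoder.encode([r_c, r_p], convert_to_tensor=True)
    return score + GAMMA * util.cos_sim(emb_rc, emb_rp).item()

def select_candidates(q_c, r_c, history):
    """history: iterable of (q_p, r_p) pairs; returns retrieved responses."""
    return [r_p for q_p, r_p in history
            if training_candidate_score(q_c, q_p, r_c, r_p) > TAU]
```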

2. Generate the response using historical cases as the reference

We adopt the same GPT-2 used in our baseline model for retrieval-based training and generation. However, as input, a retrieved response is prepended to the current query. The objective is to teach the model to generate the reference response utilizing the retrieved knowledge in addition to the query.

Since only one retrieved response is used at a time, having more than one above the threshold (c>𝜏) allows us to create more training instances with the same query-reference response pair. Conversely, in the absence of suitable retrievals, the reference itself is used as a retrieved response to make the model mirror the fetched response. This technique essentially works like teacher forcing and is intended to avoid ignoring the retrievals. However, in the event of such a scenario during testing, we resort to the baseline model for generation and call this mix-model setting a hybrid generation.

Moving on to prompt formation, the input has three segments: the retrieved response from the agent, the current query from the customer, followed by the reference response from the agent. Even though these segments are separated by special tokens, the model has little notion of which party authored a given token. To address this, we add a segment embedding to the token embedding as well. Just as positional embeddings help the model understand the relative position of tokens, a segment embedding identifying the corresponding author is likely to add more meaning for the model.
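Conceptually, the input embedding becomes the sum of token, positional, and segment (author) embeddings. The sketch below illustrates the idea with stand-alone embedding tables rather than the actual GPT-2 internals:

```python
import torch
import torch.nn as nn

# Segment ids: 0 = retrieved agent response, 1 = customer query,
# 2 = reference agent response. Sizes mirror GPT-2 small but are illustrative.
VOCAB_SIZE, HIDDEN, MAX_LEN, N_SEGMENTS = 50257, 768, 1024, 3

token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
pos_emb = nn.Embedding(MAX_LEN, HIDDEN)
seg_emb = nn.Embedding(N_SEGMENTS, HIDDEN)

def embed(input_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    return token_emb(input_ids) + pos_emb(positions) + seg_emb(segment_ids)
```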

3. Response Ranking

Although transformer-based models produce better contextual representation by capturing long-term dependencies of input text, much depends on the decoding technique at the inference stage when it comes to generating context-aware, informative responses. Unfortunately, even with state-of-the-art decoding mechanisms, neural text generation is known to suffer from blandness or inconsistency. Hence, we generate multiple responses using different sampling methods (e.g. top-k, nucleus, etc.) and employ a ranking mechanism to measure the context-awareness of generated responses (by evaluating them as hypotheses of the source query).
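For example, with the Hugging Face generate API, several candidates can be produced per query by varying the decoding strategy (the k, p, and beam values below are illustrative):

```python
# Produce multiple candidate responses per prompt by switching decoding strategies.
def generate_candidates(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    candidates = []
    for kwargs in ({"do_sample": True, "top_k": 50},       # top-k sampling
                   {"do_sample": True, "top_p": 0.9},       # nucleus sampling
                   {"num_beams": 5, "do_sample": False}):   # beam search
        output = model.generate(**inputs, max_new_tokens=128, **kwargs)
        candidates.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return candidates
```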

Such a hypothesis indicates stronger correspondence when its probability of producing the query, i.e. p(query|hypothesis), is higher and the corresponding loss is low. However, after we built this “backward” model, we compared the forward model loss (loss(h, q𝒸)) with the inverse model loss for each query (q𝒸)-hypothesis (h) pair and (interestingly) found a very high correlation. Thus we were able to avoid training an additional inverse model and performed the ranking using the forward loss only.
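A sketch of how the forward loss loss(h, q𝒸) can be computed with the fine-tuned model, scoring only the hypothesis tokens (the input layout assumes the “query, then response” serialization sketched earlier):

```python
import torch

# Negative log-likelihood of the hypothesis h given the query q_c; lower is better.
def forward_loss(model, tokenizer, query: str, hypothesis: str) -> float:
    query_ids = tokenizer(query + tokenizer.eos_token, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([query_ids, hyp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : query_ids.size(1)] = -100   # ignore the query tokens in the loss
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return loss.item()
```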

The final rank score of a hypothesis is computed using the formula:

Figure 5: Explanation of the response ranking formula

Here, sim(h, q𝒸) indicates the similarity between the query and the hypothesis. The term sim(q𝒸, qₚ)*sim(h, rₚ) accounts for the correspondence between the retrieved response (rₚ) and the hypothesis (h), weighted by the query similarity (sim(q𝒸, qₚ)). The rationale behind this product is that a generation that retains knowledge from a good retrieval is likely to be a better response. Consequently, a higher rank score is expected to indicate better hypothesis quality.

Hybrid Model

Not all queries have a usable similar query in the past, which makes it difficult for the model to handle every situation using a past response. Instead, we can use a hybrid model: if a historical case is found with sufficient confidence (similarity score), we use the retrieval-guided model; otherwise, we fall back to the baseline model for generation.
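In pseudocode-like form, the hybrid setting reduces to a simple fallback; the threshold is a tunable hyperparameter, and `retrieve`, `retrieval_model`, and `baseline_model` are placeholders for the components sketched above:

```python
# Hybrid generation: retrieval-guided when a confident historical match exists,
# baseline otherwise.
CONFIDENCE_THRESHOLD = 1.0   # illustrative value

def hybrid_generate(query, retrieve, retrieval_model, baseline_model):
    past_response, score = retrieve(query)   # best historical candidate and its score
    if past_response is not None and score > CONFIDENCE_THRESHOLD:
        return retrieval_model(query, past_response)
    return baseline_model(query)
```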

Template generation

Figure 6: Paraphrasing Model for closing remarks

One way we can expand the set of ending templates is by paraphrasing them to produce new ones. To this end, a GPT-2 model is fine-tuned on ParaNMT-50M [4], a dataset of more than 50 million English-English sentential paraphrase pairs. We then use the trained paraphrasing model to paraphrase the currently available ending templates. From the paraphrased candidates, we first filter out those with a grammatical acceptability score of less than 0.9 according to a BERT-based classification model trained on the CoLA dataset [5]. To ensure diversity without sacrificing semantics, we further keep only those with BLEU-1 < 0.7 and cosine similarity > 0.8 relative to the original. Finally, we check whether a paraphrased version retains all the delexicalized tokens such as email address, phone number, etc.
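A sketch of this filtering stage; the classifier checkpoint, sentence encoder, label convention, and delexicalized-token pattern are illustrative stand-ins for the models described above:

```python
import re
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
from nltk.translate.bleu_score import sentence_bleu

# Illustrative model choices; any CoLA-tuned classifier / sentence encoder would do.
acceptability = pipeline("text-classification", model="textattack/bert-base-uncased-CoLA")
encoder = SentenceTransformer("all-MiniLM-L6-v2")
DELEX_TOKENS = re.compile(r"X-(email|phone|name|date|price)")

def keep_paraphrase(original: str, paraphrase: str) -> bool:
    # 1. Grammatical acceptability >= 0.9 (label convention varies by checkpoint;
    #    LABEL_1 is assumed to mean "acceptable" here).
    result = acceptability(paraphrase)[0]
    if result["label"] != "LABEL_1" or result["score"] < 0.9:
        return False
    # 2. Diversity: BLEU-1 < 0.7 against the original template.
    if sentence_bleu([original.split()], paraphrase.split(),
                     weights=(1, 0, 0, 0)) >= 0.7:
        return False
    # 3. Semantic closeness: cosine similarity > 0.8.
    emb_o, emb_p = encoder.encode([original, paraphrase], convert_to_tensor=True)
    if util.cos_sim(emb_o, emb_p).item() <= 0.8:
        return False
    # 4. All delexicalized tokens must be preserved.
    return set(DELEX_TOKENS.findall(original)) <= set(DELEX_TOKENS.findall(paraphrase))
```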

Coming up next

In the second part of this article, we will focus on the evaluation of the solution in a real-world industry customer service setting. We will discuss results on both standard evaluation metrics and human evaluations.

References

[1] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[3] Weston, J., Dinan, E., & Miller, A. H. (2018). Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776.

[4] Wieting, J., & Gimpel, K. (2018). ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of ACL, pages 451–462.

[5] Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625–641.
