A first attempt at making sense of the LLM hype
Authors: Norbert Maager & Marian Tietz (Otto Group data.works GmbH)
tl;dr: In our quest to answer customer service mails automatically, we explore smaller and larger LLMs as well as more classical approaches and compare them on two real-world datasets (customer support mail and chat data). We conclude that tf-idf / logistic regression is still a solid solution, but that LLMs are competitive, especially as baseline models when only limited data is available.
We’re part of a team of data scientists working at the Otto Group, the biggest online retailer of European origin (among other things), and our objective is, on paper, an easy one: we want to shed some light on the hype cycle currently surrounding Large Language Models. To this end we explore single-shot and few-shot retrieval as well as single-shot and few-shot classification with prompting. We compare the results to in-production models as fairly as possible, with unavoidable biases that we will try to disclose as clearly as possible.
We are setting out to implement automatic e-mail response generation, as this is something that is currently handled by a simple affirmation response (i.e. “thank you for your message, we’ll get back to you”). The data is non-trivial, covering a good variety of topics ranging from shipping delays and payment complications to returns. It is therefore a natural language (reasoning) task that we have data for and that can be built in varying degrees of complexity. To make it useful to the customer as well as the customer service agent, and to limit the scope, we simply offer the customer a top-k selection of answers on their first inquiry, give them the ability to provide feedback (“my case is resolved!”) and then, if still necessary, let the conversation between humans unfold naturally without further intervention.
The e-mail task can be seen as analogous to user intent classification in chat bots (“where’s my parcel?” → shipment.parcel.ask_status), for which we, coincidentally, have a lot of labelled data, because that is what we do in our daily business. Therefore, as a first step, we approximate the e-mail task with intent classification to get a feel for which models perform best. Then we move on to the e-mail data. Of course, we also test the performance of directly generating an e-mail response using LLMs.
Limitations
There are several limitations that explain some of the shortcomings in our experiments. Firstly, we were limited by time: there were more tasks than the ones we report, and the dynamics of the field meant that we needed to run several preliminary experiments before committing to bigger ones. Secondly, it is unclear to us how companies like OpenAI will use the data that gets sent to their APIs. Therefore, particular care was taken not to send sensitive (i.e., secret or Personally Identifiable Information (PII)) data to these companies and to limit the exposure of domain-specific data.
What data do we have
We have a considerable number of utterances, a multiple of ten thousand, labelled with their respective (customer) intent. To give you an example, the customer utterance “where’s my parcel?” might be labelled with shipment.parcel.ask_status. We also have a list of FAQ data, i.e., answers mapped to questions as you would find them on a website such as otto.de.
In accordance with our data policy, we choose a randomly sampled subset of the intent data (“reviewed”, 381 samples), which we split into two parts using a stratified random split (80% “train” and 20% “test”). Manually matching the questions in the FAQ data set allows us to use the intersection of the intent and FAQ data sets (“train FAQ”). This maps real customer utterances to FAQ questions, answers and, implicitly, intent labels, as seen in Table 1.
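For the split itself we rely on standard tooling. Below is a minimal sketch using scikit-learn, assuming the “reviewed” samples live in a CSV with “utterance” and “intent” columns; the file name and column names are illustrative, not our actual setup.

```python
# Sketch of the stratified 80/20 split of the "reviewed" intent data.
import pandas as pd
from sklearn.model_selection import train_test_split

reviewed = pd.read_csv("reviewed_intents.csv")  # the 381 reviewed samples (illustrative path)

train_df, test_df = train_test_split(
    reviewed,
    test_size=0.2,                   # 20% "test"
    stratify=reviewed["intent"],     # keep the intent distribution in both splits
    random_state=42,
)
```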
Furthermore, we have access to around 80 thousand mails containing customer inquiries that were sent to the OTTO customer service. To generate answers for the customer mails, we use a total of 85 FAQs stemming from three FAQ pages on the otto.de website (shop FAQs, marketplace FAQs & OTTO payments FAQs).
For evaluation of our approaches, it is critical to have a mapping assigning customer inquiries to answers from our FAQ set. To reduce the manual labelling effort, we pre-label a set of 500 randomly sampled mails using one of our FAQ classification approaches. After manually labelling and cleaning the data of any PII, we are left with 229 of the initial 500 mails that can be mapped to FAQs.
Bias warning: even though we reviewed every label manually, we may introduce a bias here that tips the scales toward the FAQ classification approach that helped pre-label the data. In the results section we will point out this bias once again for the model it concerns.
Experiments
Intent Data Experiment
Since we have chat intent data, we can easily set up single-shot and few-shot classification tasks using a retrieval pipeline (as shown in Figure 5). The training process consists of building the index from FAQ embeddings and their respective intents (and hand-mapped answers, for FAQ retrieval). The experiment is considered single-shot when only the FAQ questions and their answers are in the index, and few-shot if additional samples from the intent data set, mapped to answers, are added to the index. At prediction time, user utterances (for which we know the correct intent) are embedded and queried/ranked against the index. This way we can measure the accuracy of the retrieved intents and, in consequence, how well the embeddings encode the user query.
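To make the pipeline concrete, here is a minimal sketch using sentence-transformers. The model id, example questions and intent labels are purely illustrative; the single-shot index holds only FAQ questions, and the few-shot variant appends labelled chat utterances.

```python
# Sketch of the embedding index and top-k intent retrieval.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Single-shot: index only FAQ questions and their intents ...
index_texts = ["Where is my parcel?", "How do I return an item?"]
index_intents = ["shipment.parcel.ask_status", "return.item.ask_howto"]
# ... few-shot: append labelled chat utterances mapped to the same intents.
index_texts += ["my package still hasn't arrived"]
index_intents += ["shipment.parcel.ask_status"]

index_embeddings = model.encode(index_texts, convert_to_tensor=True)

def predict_top_k_intents(utterance, k=3):
    """Embed the utterance and rank it against the index by cosine similarity."""
    query_embedding = model.encode(utterance, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, index_embeddings, top_k=k)[0]
    return [index_intents[hit["corpus_id"]] for hit in hits]

print(predict_top_k_intents("where is my order?"))
```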
As a comparison we re-train our baseline intent classification model, a logistic regression with tf-idf features, to match the data available to the LLMs. If the LLM is evaluated in the single-shot regime, the baseline model only has the question/intent pairs from the FAQs available; in the few-shot regime, the baseline model is trained on the same additional data (a minimal sketch of this baseline follows the list below). With the data we have, this yields two meta-experiments:
- Use the “train FAQ” data split for training, the “reviewed” data split for evaluation (single-shot)
- Use the “train FAQ” data split and the “train” data split for training, use the “test” split for evaluation (few-shot)
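As announced above, here is a minimal sketch of the refit baseline, assuming parallel lists of training texts and intent labels restricted to the data the compared LLM also sees; the example texts and hyperparameters are illustrative.

```python
# Sketch of the tf-idf + logistic regression baseline ("BYT refit").
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word uni- and bigrams (illustrative choice)
    LogisticRegression(max_iter=1000),
)

train_texts = ["Where is my parcel?", "How do I return an item?"]        # e.g. "train FAQ"
train_intents = ["shipment.parcel.ask_status", "return.item.ask_howto"]

baseline.fit(train_texts, train_intents)
scores = baseline.predict_proba(["my package hasn't arrived"])           # ranked for top-k evaluation
```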
We use top-k accuracy as our evaluation metric to allow for some room for interpretation (we’re dealing with natural language data, after all). Every experiment is compared to a version of the baseline model trained on all available intent data, to give us a sense of what is possible for a model with more data.
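For clarity, top-k accuracy here means: a sample counts as correct if the true intent appears among the k highest-ranked predictions. scikit-learn offers the same metric for score matrices via top_k_accuracy_score; the small sketch below just spells the definition out.

```python
# Sketch of top-k accuracy over ranked intent predictions.
import numpy as np

def top_k_accuracy(y_true, ranked_intents, k=3):
    """y_true: true intents; ranked_intents: per-sample intent lists, best first."""
    hits = [truth in ranked[:k] for truth, ranked in zip(y_true, ranked_intents)]
    return float(np.mean(hits))

print(top_k_accuracy(["a", "b"], [["a", "c", "d"], ["c", "d", "e"]], k=3))  # 0.5
```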
We select models that yield good performance on German queries, since we’re mostly dealing with German data, and that were easily available at the time: miniLM and gBERT DPR. The latter is chosen to account for the fact that the retrieved text and the retrieval query may not align well, and Dense Passage Retrieval (DPR) models might have an edge here.
Mail Data Experiments
In a second experiment we repeat the intent classification experiment with the labelled mail data, matching customer mails to FAQs instead of chat utterances to FAQs. As baseline models we use the tf-idf chat intent models from the previous experiment, i.e. one tf-idf model trained on all production chat data (“BYT cheat”) and one that is refit to the data available to the respective LLM it is compared to (“BYT refit”). We predict the top-k matching intents for the mail text and, using our mapping (see Table 1), match these predicted intents to FAQs (Figure 6a).
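The intent-to-FAQ translation step is just a lookup. Below is a minimal sketch of it; the intent names and FAQ ids are illustrative, the real mapping comes from Table 1.

```python
# Sketch of mapping top-k intent predictions to FAQ answers (Figure 6a).
intent_to_faq = {
    "shipment.parcel.ask_status": "faq_22",   # "Where is my parcel?"
    "payment.invoice.ask_copy": "faq_05",     # "How do I get a copy of my invoice?"
}

def intents_to_faqs(top_k_intents):
    """Translate the model's top-k intent predictions into FAQ answers."""
    return [intent_to_faq[intent] for intent in top_k_intents if intent in intent_to_faq]

print(intents_to_faqs(["shipment.parcel.ask_status", "unknown.intent"]))  # ['faq_22']
```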
As the first approach incorporating LLMs, we set up a single-shot retrieval pipeline with solely the FAQ embeddings in the index. These are compared to embeddings of incoming customer mails (Figure 6b).
Secondly, we set up a few-shot retrieval pipeline, augmenting the previous index with embeddings from our chat intent data. This is possible because we can associate chats with corresponding FAQs as well, as is done in the intent classification experiment. We argue that this is helpful because there is at least a partial overlap between mail and chat data, so it should improve retrieval performance (Figure 6c).
In our third approach we use an LLM in a generative way to find matching FAQs. Here we create a prompt containing the numbered FAQs and the customer mail, requesting the model to return the numbers of the top-k most relevant FAQs. The prompt looks essentially like this:
Here is a list of numbered FAQ answers:
{list of FAQ answers with ids}
Please return the top 3 FAQs that best fit the following customer inquiry.
Examples:
Customer inquiry: Where is my parcel?
FAQ id: 01, 22, 05
Customer inquiry: {customer mail}
FAQ id:
By limiting the LLM outputs to FAQ ids we can control the possible outputs of the LLM to avoid hallucinations.
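For illustration, a minimal sketch of this generative classification follows, using the completions API of the openai package as it existed at the time (pre-1.0). The example FAQ texts, mail, and the parsing logic are assumptions, not our exact production setup.

```python
# Sketch of generative FAQ classification: prompt the model for FAQ ids, then
# keep only ids that actually exist to filter out hallucinated answers.
import re
import openai

numbered_faqs = "01: You can track your parcel via ...\n02: Returns are free ..."
customer_mail = "Hello, my order still hasn't arrived. Where is it?"
known_faq_ids = {"01", "02"}

prompt = (
    "Here is a list of numbered FAQ answers:\n"
    f"{numbered_faqs}\n\n"
    "Please return the top 3 FAQs that best fit the following customer inquiry.\n\n"
    "Examples:\n"
    "Customer inquiry: Where is my parcel?\n"
    "FAQ id: 01, 22, 05\n\n"
    f"Customer inquiry: {customer_mail}\n"
    "FAQ id:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=16,
    temperature=0,
)
text = response["choices"][0]["text"]
predicted_ids = [i for i in re.findall(r"\d+", text) if i in known_faq_ids][:3]
```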
We also tried using an LLM to freely answer customer mail inquiries (Figure 6d), i.e., just prompting the LLM to write an answer to a mail. This, however, had two major drawbacks. Firstly, automated evaluation of freely generated LLM answers is highly non-trivial and still an open field of research. For example, classic scores such as ROUGE and BLEU (and their variants) were massively unhelpful, so we would be left to tediously evaluate answers by hand. Secondly, and much more critically in practical terms, model hallucinations make it problematic to let LLMs communicate with customers directly. For example, in an answer to a mail the model suggested calling the customer hotline and handed out the number of the hotline of a completely different organization, namely the corona hotline of the British NHS, which, we assure you, is not associated with our company. As avoiding these hallucinations is non-trivial but critical, and the other approaches looked promising, we did not pursue this specific evaluation further.
Results
Intent Classification
Intent retrieval/classification results will always be accompanied by the “BYT cheat” reference model that has access to the full corpus of chat training data. In general, it can be said that in the low-data regime it might very well be worthwhile to employ an LLM to get things started. Especially gBERT DPR performs quite well in comparison to all other approaches when only presented with the FAQ dataset as knowledge base, possibly because it was specifically pretrained on German data. Interestingly, the OpenAI Davinci embeddings (text-davinci-003) perform especially badly in this setup. Unsurprisingly, the tf-idf-based baseline model “BYT refit” performs the worst of all models in this test, since there was no prior knowledge to source from.
The picture changes drastically when more data is available in the few-shot setting. Both miniLM and gBERT DPR do not seem to generalize well over the new samples: whatever the embeddings represent, it glosses over whatever variance the new data brings to the table. The OpenAI embeddings, as well as the baseline tf-idf model, pick up the nuances of the new samples rather well and perform comparably. It seems that, at least for this task, the sweet spot for using LLMs sits somewhere between “almost no data” and “a little bit of data”, since in the few-shot setting the tf-idf model performs just as well as the OpenAI models with significantly fewer parameters and is easily trainable.
To ascertain how the OpenAI Davinci model scales in terms of training data, we plot the top-3 accuracy as a function of additional training samples in the few-shot setting and find that it develops logarithmically. It might be coincidental, but OpenAI itself states that for fine-tuning, doubling the number of training samples tends to yield a roughly linear increase in performance, i.e. O(2^n) samples are needed for n units of improvement, which matches the logarithmic curve we observe.
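One simple way to sanity-check such a trend is to fit accuracy as a * ln(n) + b over the number of additional few-shot samples n, as sketched below. The numbers are placeholders, not our actual measurements.

```python
# Sketch of fitting a logarithmic scaling curve to top-3 accuracy.
import numpy as np

n_samples = np.array([10, 50, 100, 200, 300])           # placeholder values
top3_accuracy = np.array([0.55, 0.63, 0.67, 0.70, 0.72])  # placeholder values

a, b = np.polyfit(np.log(n_samples), top3_accuracy, deg=1)
print(f"top-3 accuracy ≈ {a:.3f} * ln(n) + {b:.3f}")
```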
Mail answering
In the mail data experiment the single-shot approach of the Davinci model performs poorly and is unsurprisingly outperformed by its few-shot counterpart, which utilizes chat utterances in addition to FAQs. It is quite noteworthy that the generative classification approach predicting FAQ indices outperforms the few-shot retrieval approach, even though the generative approach is only exposed to a handful of synthetic customer inquiries in its prompt, while the few-shot approach has 381 supporting embeddings of real customer chat utterances available. This suggests that a significant part of an LLM’s prowess is lost when omitting its generative layers. The tf-idf-based intent model “BYT cheat” (as in Figure 7) outperforms the LLM-based approaches, admittedly not quite as dominantly as in the previous experiment. As this is the model that helped pre-label the mail data, there might be a bias, as mentioned before, that favors this model. With less labelled data (“BYT refit”), the LLMs performed better. In terms of efficiency the results are inconclusive: the tf-idf model “BYT cheat” uses vastly more labelled training data but is not that much more effective, while the refit variant is orders of magnitude smaller in terms of parameters (and thus generally easier to handle) than the LLMs while performing only a little worse.
Conclusions
Recapitulating the things we’ve tried, what can we say about the then-recommended open-source LLMs and the hype around ChatGPT?
Let’s start with our opinion that these models are unlikely to remove the need for labelled data. This analysis was immensely helpful for determining the right tools for the job, and it would have been near impossible without ground-truth labels for the task. It is true that tools such as ChatGPT will make label acquisition easier in the future, but they are tools, not a replacement you can blindly trust to generate your gold-standard references.
Speaking of trust, we found that the generation is, unsurprisingly, not very trustworthy. Hallucinations, especially around numeric facts such as telephone/IBAN numbers, are common. While there will certainly be advancements that reduce hallucinations, it is worthwhile to decompose your task into smaller ones instead of striving directly for the end-to-end solution. For example, we found it immensely valuable to be able to classify intents with just a few examples, basically alleviating the cold-start problem for new intents. In the future it might be viable to remove the need for intent classification completely and base the decision on what to tell the customer solely on retrieval-based approaches, as already highlighted in architectures such as StarSpace. A non-technical solution would be, of course, to have such an end-to-end approach with a human in the loop, which brings its very own host of problems.
One overarching problem, especially with the open-source models, but also with mixed-language prompting with ChatGPT, was their limited capability to understand German texts. Many of the available models performed so poorly on German texts that they never made it into the comparison (even big ones), and this is certainly a case for future work and investigation.
A lot has happened since we started working on this project and since we finished this blog post (~5 months is a long time!). A lot of new models have surfaced, and we’re eager to run new benchmarks to further explore potential strengths (and weaknesses) of these LLMs beyond being very good baseline models. We hope that you took value from our results and that we could shed some light on the actual usefulness of ChatGPT, even though it was measured on a narrow use case.
Special thanks to Alex Grimm and Sarah Gritzka for their outstanding work. Without you this would not have been possible!