The role of context in conversational agents — an “intent” point of view? | Part 1

Petr Lorenc
PromethistAI
Sep 5, 2023

Today’s conversational agents rely on an understanding of human language. They can either follow predefined conversational rules, generate responses based on patterns learned from data (the approach behind generative Large Language Models, or LLMs), or combine these two approaches.

Agents working with a predefined set of rules are easy to manage, and their responses are safe because they are usually human-supervised. However, they are less robust and struggle under the “open-world” assumption: users can always say something the rules never anticipated. On top of that, designing these rules effectively requires substantial human effort, domain expertise, and experience with designing conversational agents in general. Because it is a multidisciplinary field, you also need proper tools that let non-technical contributors (usually with linguistic or psychological backgrounds) design the technical aspects of a conversation easily.

On the other hand, generative Large Language Models (LLMs) are robust and capable of responding to arbitrary input, operating effectively under the open-world assumption. However, they offer limited controllability, which can be partially managed through prompt engineering (see this introduction) or fine-tuning (see this course). Additionally, their output usually requires some control, e.g. a profanity filter to handle vulgarisms, fact-checking to eliminate hallucinations, and so on. These “guardrails” can be implemented through various libraries. To mitigate these problems, you also need to find open-source models (with available weights and licenses that permit commercial use), because while a proprietary LLM can be more powerful, it often comes with constrained controllability.
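To make the idea of an output guardrail concrete, here is a minimal sketch of a word-list profanity filter applied to an LLM reply. The banned-word list and the fallback message are purely illustrative; real systems typically rely on maintained lexicons, dedicated moderation models, or guardrail libraries.

import re

# Illustrative banned-word list; a real deployment would use a maintained
# profanity lexicon or a moderation model instead of a tiny hand-written set.
BANNED_WORDS = {"badword1", "badword2"}

FALLBACK_REPLY = "I'm sorry, I can't answer that. Could we talk about something else?"

def apply_profanity_guardrail(llm_reply: str) -> str:
    """Return the LLM reply only if it passes a simple word-list check."""
    tokens = re.findall(r"[a-z']+", llm_reply.lower())
    if any(token in BANNED_WORDS for token in tokens):
        return FALLBACK_REPLY
    return llm_reply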

Given the advantages and disadvantages of each approach, systems combining the two have started to emerge: a rule-based component guides the conversation towards predefined goals, and an LLM steps in when users provide unexpected inputs.
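A minimal sketch of such a hybrid setup, with a toy rule table and a placeholder standing in for the LLM call (all names here are hypothetical, not any particular framework’s API):

# Hypothetical rule table: (context, user utterance) -> scripted reply.
RULES = {
    ("ask_genre", "comedy"): "Great, I'll recommend a few comedies.",
    ("ask_genre", "horror"): "Horror it is. Do you prefer classics or new releases?",
}

def llm_reply(utterance: str, context: str) -> str:
    # Placeholder for a call to a generative model (e.g. an API request).
    return f"(LLM fallback) Let's get back to the topic: {context}"

def hybrid_response(utterance: str, context: str) -> str:
    """Use the scripted rule when one matches; otherwise fall back to the LLM."""
    scripted = RULES.get((context, utterance.strip().lower()))
    return scripted if scripted is not None else llm_reply(utterance, context)

print(hybrid_response("comedy", "ask_genre"))          # scripted, human-authored path
print(hybrid_response("I love my dog", "ask_genre"))   # unexpected input -> LLM fallback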

Firstly, these agents try to classify the input based on what the user wants to do, known as their “intent”. An intent is a group of similar sentences that express the same thing in different ways. For example, the “agree” intent could include phrases like “yes” or “of course”. However, context is a very important element here. If a user expresses the “agree” intent in response to a yes/no question, it is a valid answer. But if they express agreement in response to a question like “What genre is your favorite?”, it constitutes an unexpected response, referred to as an Out-of-Domain (OOD) intent.
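As a toy illustration (the phrase lists and the question-to-intent mapping are made up for this example), context decides whether the very same utterance is in-domain or OOD:

# Hypothetical mapping from the agent's last question to the intents
# that are valid ("in-domain") as an answer to it.
EXPECTED_INTENTS = {
    "Do you want to continue?": {"agree", "disagree"},
    "What genre is your favorite?": {"inform_genre"},
}

AGREE_PHRASES = {"yes", "of course", "sure"}

def detect_intent(utterance: str) -> str:
    # Toy intent classifier: exact phrase lookup only.
    return "agree" if utterance.strip().lower() in AGREE_PHRASES else "unknown"

def is_out_of_domain(bot_question: str, user_utterance: str) -> bool:
    """An intent is OOD when it is not expected in the current context."""
    intent = detect_intent(user_utterance)
    return intent not in EXPECTED_INTENTS.get(bot_question, set())

print(is_out_of_domain("Do you want to continue?", "of course"))      # False: valid here
print(is_out_of_domain("What genre is your favorite?", "of course"))  # True: OOD here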

How to detect these OOD intents remains an open research challenge, but the area is developing quickly. For instance, we found that fine-tuning the embedding model (a model that produces vector representations of natural text) with a metric-learning constraint and an adaptive decision boundary can significantly enhance OOD detection. We tested this on standardized OOD-detection datasets (CLINC150 and BANKING77) and achieved state-of-the-art results (feel free to check out our code on GitHub). We also found that these commonly used datasets deviate from the conversational nature one would expect in a virtual agent or persona. They split intents into an unrealistically high number of classes (closer to question answering than dialogue) and evaluate OOD detection without taking context into account, so the OOD examples only need to differ from a large set of classes and are easy to detect. We call them contextless datasets. It is possible to subdivide CLINC150 into several domains, but then detecting OOD becomes even easier, as the OOD examples are substantially different from the in-domain data, and we still lack any notion of context.
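For illustration, here is a heavily simplified sketch of embedding-based OOD detection: an off-the-shelf sentence encoder, intent centroids, and a single fixed threshold stand in for the fine-tuned encoder and the per-class adaptive decision boundaries used in our work.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

intent_examples = {
    "agree": ["yes", "of course", "sure, why not"],
    "disagree": ["no", "not really", "I don't think so"],
}
OOD_THRESHOLD = 0.5  # illustrative value; in practice tuned on validation data

# Represent each intent by the mean embedding (centroid) of its examples.
centroids = {
    name: encoder.encode(examples, convert_to_tensor=True).mean(dim=0)
    for name, examples in intent_examples.items()
}

def classify(utterance: str) -> str:
    """Return the closest intent, or 'OOD' when nothing is similar enough."""
    emb = encoder.encode(utterance, convert_to_tensor=True)
    scores = {name: util.cos_sim(emb, c).item() for name, c in centroids.items()}
    best_intent = max(scores, key=scores.get)
    return best_intent if scores[best_intent] >= OOD_THRESHOLD else "OOD"

print(classify("yeah, absolutely"))        # likely "agree"
print(classify("my cat is called Felix"))  # likely "OOD"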

With that in mind, we started to create our own dataset that includes the context of the conversation (limited to the previous turn for simplicity, with plans to expand). Since every conversational agent should respond to common commands, we also included a set of shared intents, so-called global intents, which are applicable across various contexts. Global intents include, for example, “controlling” intents such as repeat or stop.

Our datapoint looks like the JSON structure shown in the prompt below.

The dataset was created by combining human effort with a generative LLM. Humans designed a few conversational trees (similar to what was shown above), while an LLM (specifically GPT-4) was used to generate the OOD texts. We found that GPT-4 did not produce fully human-like texts (in word distribution, length, and so on), but it was good enough, so we left further prompt engineering for future versions. After automatic generation, we filtered out the examples that collided with in-domain intent examples (a sketch of this filtering step follows the prompt below). The prompt we used can be seen here:

{
  "bot_response": [
    "Do you usually eat out, or at home?"
  ],
  "user_response": [
    {
      "train": [
        "I don't like eating in restaurants", ...
      ]
    },
    {
      "train": [
        "I prefer eating inside cause I like cooking", ...
      ]
    }
  ],
  "out_of_domain": []
}

Can you rewrite the "out_of_domain" field to be more aligned with the context in "bot_response" while still being out-of-domain with respect to "user_response"? First describe "bot_response", then the written "user_response", and after that start to generate "out_of_domain" in JSON format. Please generate more diverse sentence beginnings, as would happen in a conversation. Also, follow the length and style of the given examples.
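The collision filter mentioned above can be sketched roughly as follows; the sentence encoder and the similarity threshold are illustrative choices, and the exact criterion used in our pipeline may differ.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

in_domain = [
    "I don't like eating in restaurants",
    "I prefer eating inside cause I like cooking",
]
generated_ood_candidates = [
    "I usually cook at home",                     # may collide with in-domain examples
    "Do you know any good book about cooking?",
]
COLLISION_THRESHOLD = 0.8  # illustrative value

in_domain_emb = encoder.encode(in_domain, convert_to_tensor=True)

def keep_candidate(candidate: str) -> bool:
    """Keep a generated OOD example only if it is far from every in-domain example."""
    emb = encoder.encode(candidate, convert_to_tensor=True)
    max_sim = util.cos_sim(emb, in_domain_emb).max().item()
    return max_sim < COLLISION_THRESHOLD

filtered_ood = [c for c in generated_ood_candidates if keep_candidate(c)]
print(filtered_ood)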

We have thus created a dataset of 25 dialogues. We will discuss the specific measurements and comparisons in a future post, but here we highlight the model used for this task. Leveraging the context of the conversation, we can view OOD detection as a form of Next Sentence Prediction (NSP), popularized by the original BERT paper. The idea is simple: if the following sentence is in-domain, it should yield a high Next Sentence Prediction probability. We therefore trained the original BERT model (retaining the NSP task) on several conversational datasets (such as Daily Dialog and Commonsense), taking the dialogue context paired with the actual following sentence as a positive pair and paired with a randomly selected sentence as a negative pair. We wanted to train a general model, as it would be infeasible to train a specific model for each dialogue context, i.e. for each dialogue tree. (Note: with Adapters or LoRA techniques, it should be possible to train multiple models for specific contexts, and we plan to explore this in the future.)
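At inference time, NSP scoring can be used roughly as sketched below. The example uses the public bert-base-uncased checkpoint for brevity; our actual model is the same architecture retrained on the conversational datasets mentioned above.

import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def nsp_probability(context: str, user_utterance: str) -> float:
    """Probability that `user_utterance` is a plausible continuation of `context`."""
    inputs = tokenizer(context, user_utterance, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # index 0 = "is next", index 1 = "is random"
    return torch.softmax(logits, dim=-1)[0, 0].item()

context = "What genre is your favorite?"
print(nsp_probability(context, "I really love horror movies"))  # on-topic reply
print(nsp_probability(context, "My flight leaves at noon"))      # context-mismatched reply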

This model can be used for context-aware intents (as they are coupled with context), but it won’t work for shared intents (as they are valid in multiple contexts). We therefore propose a two-level classification: first, the NSP BERT model detects context-aware OOD; if the input is OOD with respect to the context, we then look at the cosine similarity with the shared intents (an older but still actively researched approach). The main disadvantage of this approach is choosing the right threshold. However, when you only have a limited number of training examples, you can use cross-validation to find the best value. As stated above, we will discuss the specific measurements and comparisons in future posts.
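Putting the two levels together might look roughly like this, reusing the nsp_probability helper from the previous sketch. The global-intent examples and both thresholds are illustrative; in practice the thresholds would be tuned by cross-validation as described above.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

GLOBAL_INTENTS = {               # shared ("global") intents, valid in any context
    "repeat": ["can you repeat that", "say that again please"],
    "stop": ["stop", "I want to end this conversation"],
}
global_centroids = {
    name: encoder.encode(examples, convert_to_tensor=True).mean(dim=0)
    for name, examples in GLOBAL_INTENTS.items()
}

NSP_THRESHOLD = 0.5     # below this, the utterance is OOD with respect to the context
GLOBAL_THRESHOLD = 0.6  # below this, it is not a global intent either

def two_level_decision(context: str, utterance: str) -> str:
    # Level 1: is the utterance a plausible continuation of the context?
    if nsp_probability(context, utterance) >= NSP_THRESHOLD:
        return "context-aware intent"
    # Level 2: otherwise, is it one of the shared (global) intents?
    emb = encoder.encode(utterance, convert_to_tensor=True)
    scores = {n: util.cos_sim(emb, c).item() for n, c in global_centroids.items()}
    best = max(scores, key=scores.get)
    return f"global intent: {best}" if scores[best] >= GLOBAL_THRESHOLD else "OOD"

print(two_level_decision("What genre is your favorite?", "comedy, definitely"))
print(two_level_decision("What genre is your favorite?", "please say that again"))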

Stay tuned!

To learn more about Digital Personas based on conversational AI and how they’re developed at PromethistAI, visit our website.


Petr Lorenc
PromethistAI

I am a machine learning specialist who is currently pursuing a Ph.D. in Natural Language Processing and Understanding.