Designing Conversational Agents for Multi-party Interactions

How does one extra person impact a conversation?

Angus Addlesee
Published in TDS Archive
Apr 6, 2023

This is an abridgement of a paper published at IWSDS 2023. I co-authored it with the Heriot-Watt University SPRING team detailed below. If you would like to cite anything discussed in this article, please cite the paper titled “Data Collection for Multi-party Task-based Dialogue in Social Robotics”:

Harvard:
Addlesee, A., Sieińska, W., Gunson, N., Garcia, D.H., Dondrup, C., Lemon, O., 2023. Data Collection for Multi-party Task-based Dialogue in Social Robotics. Proceedings of the 13th International Workshop on Spoken Dialogue Systems Technology (IWSDS).
BibTeX:
@inproceedings{addlesee2023data,
  title     = {Data Collection for Multi-party Task-based Dialogue in Social Robotics},
  author    = {Addlesee, Angus and Siei{\'n}ska, Weronika and Gunson, Nancie and Garcia, Daniel Hern{\'a}ndez and Dondrup, Christian and Lemon, Oliver},
  booktitle = {Proceedings of the 13th International Workshop on Spoken Dialogue Systems Technology (IWSDS)},
  year      = {2023}
}

Consider an interaction with a conversational agent, maybe with Siri on your phone, Amazon Alexa in your home, or a virtual customer service agent on a website. These interactions are ‘dyadic’, that is, they involve just one person (yourself) and one agent. This is typical for all voice assistants and chat-based agents.

Conversational agents are designed for one-to-one interactions.

Both of these photos show two-party (dyadic) interactions, one with a phone voice assistant instead of a person.

We are social creatures, however, and people naturally handle conversations with more than one other person. Consider coffee with a few friends, family dinner conversations, or even work meetings with the entire team. These are called ‘multi-party’ conversations, and today’s conversational agents are not designed for this type of interaction.

Do conversational agents ever need to work in a multi-party setting? Do multi-party conversations add new challenges? What steps do we need to take to make progress? I will answer these three questions in this article.

Multi-party interactions. Consider how these conversations differ from the dyadic ones above.

Conversational agents in a multi-party setting

Today’s systems have been designed to handle dyadic interactions for a good reason, of course: that is typically how we interact with them. Siri does not passively listen to your conversation with a friend and interject where needed - it listens to your single request when activated. Google Assistant and Alexa are very similar, listening for their wake-word and a single utterance. Arguably, these voice assistants could benefit from multi-party understanding in a family home, for example, but this is not pressing.

Conversational agents are being embedded within virtual agents and social robots in public spaces like museums, airports, shopping centres, and hospitals. People go to these locations with family members, friends, and carers - so these agents must be able to handle multi-party interaction.

I will use many examples in this article to illustrate points, so it is best to set the context. Let’s imagine a robot assistant in a hospital memory clinic waiting room. The robot is called ARI, and patients come to their appointments with a companion. The pair may need directions, coffee, hospital information, or just some entertainment. This is coincidentally the setting of the EU SPRING project, and all examples will fit this setting.

The ARI robot in a multi-party setting. Copyright PAL Robotics

Are Multi-party Conversations so Different?

Considering the hospital setting above, we have only one additional person in an interaction. So instead of the robot only interacting with the patient, the robot must interact with both the patient and companion together. Does this change a conversation? Spoiler: Yes, a lot!

Speaker Recognition

In a two-person dyadic interaction, the agent does not need to identify the speaker. It is trivial: the speaker is the only other person in the conversation. Alexa has a neat feature that only allows me to purchase items if it recognises the voice as mine - but this is not the type of speaker recognition that I mean. It does not matter whether an utterance is tagged as speaker 1, speaker 2, or speaker xyz - the response from conversational agents will be the same.

Identifying the speaker is critical to understanding a multi-party conversation, however. Let’s imagine that the patient and companion want to play a quiz with the robot. The robot asks “What is the capital of Germany?” and the following interaction takes place:

1) I think it is Berlin.
2) Or Munich.
3) Yes, Munich.

Without speaker recognition, we cannot determine whether these two people have agreed or not. There are multiple possibilities; let’s look at two (where P = Patient and C = Companion):

1) P: I think it is Berlin.
2) P: Or Munich.
3) C: Yes, Munich.

In this PPC case, the patient and companion have come to an agreement that the answer is Munich. The robot could then let them know that they are incorrect, tell them the right answer, and continue with the next question. Alternatively:

1) P: I think it is Berlin.
2) C: Or Munich.
3) C: Yes, Munich.

In this PCC case, the patient proposed the correct answer, and the companion suggested a second, incorrect option. The companion then reaffirmed their certainty, but importantly, the patient has not agreed. If the robot took Munich as the final answer in this case, the patient would be very frustrated, as they proposed the correct answer but were then ignored.

Hopefully this example is clear. In the PPC (or PCP) case, agreement is reached and continuing the quiz is the correct action. In the PCC case, the robot should stay silent and wait for the patient’s response.

The conversational agent can only know which action is correct if the speakers are recognised. This is not true in dyadic interactions.
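To make this concrete, here is a minimal Python sketch (not the actual SPRING pipeline) of how a quiz manager could use speaker labels to decide whether to accept an answer. The Turn structure and the rule that agreement means the final candidate has been endorsed by more than one distinct speaker are illustrative assumptions, not the paper’s method:

from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str    # e.g. "patient" or "companion"
    candidate: str  # the answer mentioned in the turn, e.g. "Munich"

def agreed_answer(turns):
    """Return the final answer if both parties have endorsed it, else None
    (meaning the robot should stay silent and wait)."""
    if not turns:
        return None
    final = turns[-1].candidate
    endorsers = {t.speaker for t in turns if t.candidate == final}
    return final if len(endorsers) > 1 else None

# PPC case: the patient proposes Munich and the companion confirms it.
ppc = [Turn("patient", "Berlin"), Turn("patient", "Munich"), Turn("companion", "Munich")]
# PCC case: only the companion has ever said Munich.
pcc = [Turn("patient", "Berlin"), Turn("companion", "Munich"), Turn("companion", "Munich")]

print(agreed_answer(ppc))  # "Munich" -> agreement, continue the quiz
print(agreed_answer(pcc))  # None -> wait for the patient

Note that without the speaker field, the PPC and PCC sequences would be identical lists - which is exactly why speaker recognition is needed here.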

Addressee Recognition

Similar to speaker recognition, figuring out who is being spoken to is trivial in a dyadic interaction. The speaker is obviously speaking to the second person/agent. Again, however, this is not the case in a multi-party interaction. The speaker may be addressing one individual, the other individual, or both of them together. To illustrate this, consider (where R = Robot and P = Patient):

1) P: What is my appointment about?
2) R: For your privacy, I don't know that, sorry.
3) P: What's my appointment about?
4) R: For your privacy, I don't know that, sorry.
5) P: Stop!

In this example, the patient originally addressed the robot in turn 1, and the robot responded correctly. In turn 3, however, the patient turned to their companion and repeated the same question. As the robot has no addressee recognition capability, it repeated the same response, frustrating the patient.
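As a rough sketch of how an agent could avoid this, it can gate its responses on the output of an addressee classifier. How that classifier works (head pose, gaze, lexical cues, and so on) is beyond this article; the function below simply assumes such a component exists and returns a label with a confidence score:

def should_respond(predicted_addressee: str, confidence: float,
                   threshold: float = 0.7) -> bool:
    """Only answer when the robot itself is confidently the addressee."""
    return predicted_addressee == "robot" and confidence >= threshold

# Turn 1: the patient faces the robot while asking -> the robot answers.
print(should_respond("robot", 0.92))      # True
# Turn 3: the same words, but the patient has turned to their companion.
print(should_respond("companion", 0.88))  # False -> the robot stays silent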

The ARI robot being tested for SPRING at Heriot-Watt University

Response Selection or Generation

It is difficult to decide what a virtual agent should say in response to a user, and this is true for both dyadic and multi-party conversations (unlike the previous two tasks). It is additionally challenging in a multi-party setting because the agent also has to decide whom to address. The robot’s response will differ depending on whether it is addressing an individual or everyone, for example. Once again, I will illustrate this:

1) P: I would like a coffee.
2) C: and I desperately need the toilet.
3) R: ???

The robot may decide to address the patient first, as they asked for assistance before the companion. Alternatively, it may prioritise the companion due to urgency. What the robot says next depends on whom it decides to address.
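One simple, purely illustrative way to encode that trade-off is a scoring rule over the open requests. The urgency values below are assumed to come from some upstream language-understanding component; they are placeholders, not something described in the paper:

from dataclasses import dataclass

@dataclass
class Request:
    speaker: str   # "patient" or "companion"
    intent: str    # e.g. "get_coffee", "find_toilet"
    urgency: int   # assumed 0-10 score from an upstream NLU component

def choose_addressee(requests):
    """Address the most urgent request first; break ties by who asked first."""
    best_index, best = max(enumerate(requests),
                           key=lambda pair: (pair[1].urgency, -pair[0]))
    return best.speaker

turns = [Request("patient", "get_coffee", urgency=3),
         Request("companion", "find_toilet", urgency=8)]
print(choose_addressee(turns))  # "companion" -> the robot addresses them first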

The three tasks above (speaker recognition, addressee recognition, and response selection/generation) are collectively known in the literature as “Who says what to whom?”. This is what current research focuses on. In our paper, we highlight two other important tasks for multi-party agents.

Dialogue State Tracking

It is important to understand what the point of each user’s utterance is in the context of a conversation. Dialogue State Tracking (DST) aims to do exactly this, with popular challenges and datasets like DSTC and MultiWOZ. Many research institutes and companies allocate resources to this task, but all of these datasets are dyadic. Once again, DST differs in a multi-party context.

Current DST models can output that a user is requesting certain information, affirming something, providing information, etc… but they cannot detect agreement between users or determine that one user fulfilled another’s request, as that does not happen in dyadic interactions. For example:

1) P: Where is the lift?
2) P: It is to the left of the reception.

This would never occur in a dyadic interaction. It makes no sense for a person to ask a question and then instantly answer it themselves. Utterance 2 could be said by the companion, however, and the robot would have to track that the companion provided the requested information to the patient.
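A toy state tracker for this setting might look like the sketch below: it records open information requests and marks them as fulfilled when any other participant supplies the answer - something a dyadic tracker never needs to represent. The slot names are invented for illustration:

class MultiPartyState:
    """Minimal multi-party dialogue state: open requests per speaker."""

    def __init__(self):
        self.open_requests = []  # list of (requesting_speaker, slot)

    def user_requests(self, speaker, slot):
        self.open_requests.append((speaker, slot))

    def user_informs(self, speaker, slot):
        # A request is satisfied when someone *else* provides the slot value.
        self.open_requests = [(s, sl) for (s, sl) in self.open_requests
                              if not (sl == slot and s != speaker)]

    def robot_should_answer(self, slot):
        return any(sl == slot for _, sl in self.open_requests)

state = MultiPartyState()
state.user_requests("patient", "lift_location")    # "Where is the lift?"
state.user_informs("companion", "lift_location")   # "It is to the left of the reception."
print(state.robot_should_answer("lift_location"))  # False: the companion already answered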

Goal Tracking

Finally, tracking people’s goals is also more difficult in a multi-party setting. Similar to the DST difference, people can satisfy each other’s goals, which does not happen in a dyadic interaction. The agent must be able to determine whether a user’s goal has been accurately satisfied, so that it does not repeat what the person answering just said. If that person’s answer was incorrect, however, the robot still needs to respond, as the goal is not complete.

Another major goal tracking difference is something that we humans are very good at - determining when people have a shared goal. If two people enter a cafe to order coffee, the barista will interact differently depending on whether it is two separate people ordering coffee, or two people ordering coffees together.

Two people ordering coffee

In the above image, the two people may be ordering separately or together. In the latter case, the barista may say “are you paying together?”, but this would be odd if the two people do not know each other.

People can indicate shared goals very explicitly (e.g. “We would like…”, “my son needs…”, or “me too”). But we hypothesise that goals are also shared when people finish each other’s sentences (split utterances). For example:

1) P: Where is the cafe?
2) C: because we are very hungry.

From the above two utterances, we can assume that the two people have a shared goal. This does not occur in dyadic interactions.
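A very rough heuristic for this hypothesis might look like the following sketch. The explicit-marker list and the “continuation starts with a connective” test are placeholder assumptions; a real system would use a trained split-utterance detector:

import re

SHARED_MARKERS = re.compile(r"\b(we|us|our|me too)\b", re.IGNORECASE)
CONNECTIVES = {"because", "and", "so", "since"}

def looks_like_shared_goal(turn_a, speaker_a, turn_b, speaker_b):
    """Guess whether two turns from different speakers express a shared goal."""
    if speaker_a == speaker_b:
        return False
    explicit = bool(SHARED_MARKERS.search(turn_a + " " + turn_b))
    words = turn_b.strip().split()
    continuation = bool(words) and words[0].lower() in CONNECTIVES
    return explicit or continuation

print(looks_like_shared_goal("Where is the cafe?", "patient",
                             "because we are very hungry", "companion"))  # True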

I hope I have convinced you that multi-party interactions are very different to dyadic ones, and they contain a number of extra challenges that must be solved if we are to have naturally interactive agents in public spaces.

How do we Progress?

As the majority of research in this field has focused on dyadic interactions, suitable data is very limited, and none exists with DST or goal tracking annotations. In order to train systems to do the above tasks in a multi-party setting, we must collect data. We - the SPRING project - are collecting multi-party conversations in a hospital memory clinic with the ARI robot.

I presented this paper at IWSDS 2023 in LA, showing the ARI robot we are using for data collection.

People visiting the hospital memory clinic with their companions are given role-play scenarios with varying goals. In order to collect conversations covering the varied challenges described above, we designed six conditions. The third round of data collection is ongoing as I write this!

I provide more detail in the paper, but the six conditions are as follows:

Helpful Companion

The patient is given a goal, but the companion is simply told to assist the patient. Together, they have to interact with the robot to complete this goal (e.g. get a coffee, find out when the cafe closes, etc…).

This condition is what we expect a typical interaction to be like. The patient may need something, but the companion has no goal of their own.

Shared Goals

As discussed in the goal tracking section above, sometimes people have shared goals. In this case, both the patient and companion are given the same goal, so they both may want lunch for example. This is only slightly different from the “helpful companion” condition, but initial observations suggest that more split utterances occur in this condition.

Reluctant Patient

People visiting the hospital may be too shy (or even too apprehensive) to talk to the robot directly. In this case, the patient has a goal, but they do not talk directly with the robot. The companion must therefore act as an intermediary between the patient and robot to complete the patient’s goal.

Different Goals

People do not always have the same goal. For example, the patient may want a coffee while the companion needs the bathroom. In this condition, the patient and companion are given separate goals.

Missing Info

As alluded to earlier, the robot cannot always answer questions due to privacy. The robot cannot ethically disclose why the patient is visiting the hospital, for example. Additionally, the robot cannot run face recognition on patients to identify them. In this condition, the companion is given some missing information that the robot cannot know. The companion must provide this information to the robot in order to achieve the patient’s goal. As this is easier to show than to explain, here is an example:

1) P: Where is my appointment?
2) R: Sorry, I don't have access to that information.
3) C: It's with Dr Smith
4) R: Dr Smith is in room 17.

As you can see, the patient’s goal is to find their appointment location. The companion is given extra info about which doctor the appointment is with.

Disagreement

Finally, multi-party interactions can involve disagreements. If the robot provides the location of the coffee machine, the companion may disagree. In this case, the patient is given a goal and the companion is given extra information that contradicts the robot. The robot should be able to identify the conflict, re-supply the correct information, and reassure the participants that the information is correct.
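For reference, the six conditions can be summarised as a simple configuration of who holds which goal and what twist is added. The goal wording and field names below are my own illustrative shorthand rather than the exact scenarios given to participants:

CONDITIONS = {
    "helpful_companion": {"patient_goal": "get a coffee", "companion_goal": None,
                          "twist": "companion only assists"},
    "shared_goals":      {"patient_goal": "get lunch", "companion_goal": "get lunch",
                          "twist": "expect more split utterances"},
    "reluctant_patient": {"patient_goal": "find the cafe", "companion_goal": None,
                          "twist": "companion speaks to the robot on the patient's behalf"},
    "different_goals":   {"patient_goal": "get a coffee", "companion_goal": "find the toilet",
                          "twist": "two separate requests to juggle"},
    "missing_info":      {"patient_goal": "find the appointment room", "companion_goal": None,
                          "twist": "companion holds details the robot cannot know"},
    "disagreement":      {"patient_goal": "find the coffee machine", "companion_goal": None,
                          "twist": "companion's information contradicts the robot"},
}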

You can find the full paper with citations here, and you can reach me on Medium, on Twitter, or on LinkedIn.

Written by Angus Addlesee, Applied Scientist at Amazon AGI (Alexa) with a PhD in Artificial Intelligence. Contact details at http://addlesee.co.uk/