The Future of Voice Assistants: What Are the Early Research Trends?
Five Years of PhD Discussions at YRRSDS and SIGdial
If you don’t want to read the whole article, you can skim read the best bits. They are formatted like these two sentences to highlight them throughout.
New areas of interest crop up in conversational AI, while others fade from discussion. To analyse these early research trends, I decided to explore the proceedings and roundtable discussion notes from YRRSDS 2018, 2019, 2020, and 2021, through to the most recent edition in 2022 (5th and 6th September).
What is YRRSDS?
The Young Researchers’ Roundtable on Spoken Dialogue Systems (YRRSDS) is an annual open forum for early-career spoken dialogue researchers to discuss current issues in the field, share tools, and stimulate new ideas.
I first attended YRRSDS in 2019 and enjoyed it so much that I volunteered to organise YRRSDS 2021 and 2022. It is a brilliant place to meet like-minded people, inspire each other, and make new friends.
All attendees submit a two-page position paper outlining their research, topics of interest, their suggested discussion areas, and where they think spoken dialogue system (SDS) research will be in the medium-term future.
If you want to read more about attending YRRSDS, scroll to the bottom. There are also a lot of photos from YRRSDS and some from SIGdial.
Timeless Trends
I will start with the hot topics that have persisted over the last five years. Even though these topics have remained unshakeable, the contents of the discussions have transformed over time (numbers in brackets are the years a topic was discussed) — let’s explore and start with the big one.
Note: the opinions throughout are collective viewpoints from YRRSDS participants and not necessarily my own.
Ethics and Privacy (18, 19, 20, 21, 22)
By far the biggest ethical discussion a few years ago was data privacy, for two main reasons. The first was GDPR coming into force in mid-2018. Universities and their ethics boards had to be very careful to ensure they understood and followed it to avoid legal ramifications. This put a lot of pressure on early-stage researchers in 2018 and 2019 who did not really know how to work with it. The second reason was the lack of cost-effective on-device processing or secure cloud solutions - blocking research.
Data privacy is by no means ‘solved’, but the GDPR awareness boom and industry pressures did improve the issue. Cloud platforms provide more security options and on-device processing is much cheaper. Sadly, documentaries like The Social Dilemma and The Great Hack come out, cause a slight bump in awareness, and then fade from discussion without real behaviour change. It seems that society is willing to trade privacy for convenience, and this also reduces the drive to consider and discuss it.
We also started to consider whether we should make SDSs more human-like. One participant shared a story of a relative who would phone the old “speaking clock” (people would phone 123 and an automated voice would read out the current time). Even though the clock voice sounded robotic by today’s standards, this person thought it was a human on the other end. They would thank the clock, apologise for bothering them, and say things like “I have to go now”. More recently, shop staff thought Google Duplex was a human - is this ok, even though they never explicitly asked whether they were speaking to a machine?
We assume users are tech savvy enough to distinguish our system from a human being. The above examples show this is not always the case, and this issue becomes more problematic as voice assistants are designed for older adults and applied in the healthcare domain.
This year’s hottest topic was bias. We are all biased, and our individual biases are aggregated into cultural/societal biases within our language models. Once revealed, we can mitigate for these problematic biases in downstream tasks - but revealing them is tough! It is especially difficult as we do not even recognise most of our biases ourselves, and societal norms change over time.
The media likes to pick up on ‘controversies’ caused by biases in our voice assistants, and the creators get a lot of bad press. We discussed that this is not entirely fair, as these systems only reflect our society - and in fact, they might be one of our best tools to shine a light on today’s prejudices.
The primary solution discussed was transparency. We must highlight:
- How was the data collected? (In-person or online? Did people require a laptop or smartphone? Where was the collection advertised? Is it scraped from an online forum that attracts certain user groups?).
- Who was the data collected from? (Demographics like gender, age, wealth of the area they live in, and country of birth - verbal feedback differs between countries, and students are often international).
- Who annotated the data? (Students in a particular uni? Was it a company? What are the demographics of the annotators?).
- Who is in your team? (Is there an age, class, or gender bias? Will that be an issue in your domain? How do you know either way? Maybe you exclude certain disabled people simply because accessibility was not considered?).
It is not possible to have a perfectly unbiased dataset, so transparency is absolutely critical to reveal biases and enable further science. Universities and large organisations are hampering SDS research when it comes to collecting demographic data. Universities should stop focusing on blocking critical research, and start supporting the safe dissemination of collected data (e.g. in the CHILDES corpus).
With dialogue challenges to win and papers to publish, ethical discussions sometimes get put on hold in order to progress faster on specific metrics. Hopefully, with the inclusion of ethics statements in paper submissions and more ethics-conscious researchers signing up to review, this publish-or-perish culture will gradually change to include ethical considerations.
Multimodality (18, 19, 20, 21, 22)
Five years ago, “multimodality” discussions involved voice plus gesture and/or gaze information. In 2019 there was a short discussion about not only monitoring the user’s gaze but generating gaze as well (with the Furhat robot in that specific conversation), but this changed dramatically. By 2020 we were discussing understanding and generating with:
- Speech
- Gestures
- Emotion
- Gaze
- Paralinguistic Cues
- Prosody
- Eye colour (generation only, e.g. red robot eyes to display anger)
And later this list expanded to include:
- Text
- Facial cues (nodding, furrowing brow when confused, etc…)
- The environment (e.g. object detection)
- More gestures (generating with hands and full body with robots)
- Touch (buttons, 2-finger swipes, etc…)
With this explosion in discussed modalities, various problems arose, such as how to handle all of this input data - do we just concatenate it all? Maybe weight a certain modality to be more important? There is a lot of work on multimodal fusion because it isn’t so simple. Each modality has its own complexity and set of challenges. For example, speech contains disfluencies, silence is important, facial cues are culturally dependent, children produce different paralinguistic cues, ASR errors still occur, as do object detection errors, etc… Then generation requires a huge amount of data that is not available in all languages and cultures.
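To make the fusion question a little more concrete, here is a minimal sketch of weighted late fusion in PyTorch, assuming each modality (speech, gaze, gesture, …) has already been encoded into a fixed-size vector. The class, parameter names, and dimensions are purely illustrative and are not taken from any system discussed at the roundtable.

```python
import torch
import torch.nn as nn

class WeightedLateFusion(nn.Module):
    """Fuse per-modality embeddings with learned importance weights.

    Hypothetical sketch: assumes each modality has already been encoded
    into a fixed-size vector of dimension `embed_dim`.
    """
    def __init__(self, num_modalities: int, embed_dim: int):
        super().__init__()
        # One learnable scalar per modality, softmax-normalised at fusion time.
        self.modality_logits = nn.Parameter(torch.zeros(num_modalities))
        self.project = nn.Linear(embed_dim, embed_dim)

    def forward(self, modality_embeddings: torch.Tensor) -> torch.Tensor:
        # modality_embeddings: (batch, num_modalities, embed_dim)
        weights = torch.softmax(self.modality_logits, dim=0)               # (num_modalities,)
        fused = (weights.unsqueeze(-1) * modality_embeddings).sum(dim=1)   # (batch, embed_dim)
        return self.project(fused)

# Usage: three modalities (e.g. speech, gaze, gesture), 256-dim embeddings.
fusion = WeightedLateFusion(num_modalities=3, embed_dim=256)
fused = fusion(torch.randn(8, 3, 256))  # -> (8, 256)
```

A single learned scalar per modality is the simplest possible weighting; real fusion work often conditions the weights on the input itself, but the trade-off (extra parameters and compute per added modality) is the same one raised in the discussion.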
We don’t need a ‘super-modal’ SDS for every domain application. The additional computational cost of adding a modality is often not justified. Embodied agents and tasks like emotion recognition obviously benefit from multiple modalities (and see a huge boost in task performance when they are added), but consider whether it is worth it in your domain.
Evaluation (18, 19, 20, 21, 22)
The evaluation discussions a few years ago centred around open-domain SDSs, as they are extremely difficult to evaluate automatically. Without a constraining task, subjective metrics were used (perceived naturalness, relevance, did it elicit positive emotions, etc…). The only objective metric discussed was engagement in terms of number of turns. This combination of metrics was likely inspired by the 2017 Alexa Prize Social Bot Challenge. Automatic metrics are discussed more now as task-based SDSs rise in popularity (see the “Application Domains” section).
The ConvAI challenge showed that progress towards higher scores on these automatic metrics will not necessarily result in SDSs that satisfy the user. The winning system was not preferred by humans, but these human evaluations are expensive and time-consuming. Additionally, we could not even agree if human judgement is a good gold standard!
Human annotators tend not to agree with one another, and do not know what the system can actually do, making results hard to reproduce. People are also swayed by external factors that may be useful for a final evaluation before deploying a finished system, but not for a single experiment. People are influenced by politeness, factual correctness, ethical responses (to e.g. “should I sell my stocks”), empathy, enthusiasm, gaze, facial expressions, etc...
An embodied agent might affect the user’s opinion further (the robot may be small and cute, or large and authoritative). This subjectivity is chaotic, so maybe a group of weighted automatic evaluation metrics could be selected to better reflect real-world performance? These could include metrics like USR, lexical diversity, and the quality of intermediate hypotheses.
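As a toy illustration of that idea, here is a minimal sketch of combining several automatic metrics into one weighted score. The distinct-token diversity metric and the weights below are placeholders I have chosen for the example; a learned reference-free metric like USR could be plugged in as just another callable.

```python
from typing import Callable, Dict, List

def lexical_diversity(responses: List[str]) -> float:
    """Distinct-1: ratio of unique tokens to total tokens across responses."""
    tokens = [tok for response in responses for tok in response.split()]
    return len(set(tokens)) / max(len(tokens), 1)

def combined_score(responses: List[str],
                   metrics: Dict[str, Callable[[List[str]], float]],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of automatic metrics, each assumed to be scaled to [0, 1]."""
    return sum(weights[name] * metric(responses) for name, metric in metrics.items())

# Hypothetical weighting for illustration only.
metrics = {"lexical_diversity": lexical_diversity}
weights = {"lexical_diversity": 1.0}
print(combined_score(["book a table for two", "sure, what time?"], metrics, weights))
```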
Trends On the Rise
With the behemoth discussions out of the way: while reading through the YRRSDS proceedings, I noticed some topics that have gained popularity in early-stage SDS research recently. These are not topics that are brand new to the field, but topics that are currently being talked about more consistently:
Application Domains (21, 22)
More early-stage researchers are now working to improve SDSs in very particular settings. As customer service chatbots and open-domain voice assistants are applied more in industry, research discussions have involved more varied applications - often embodied. For example: SDSs to assist older adults in retirement homes, interact with autonomous underwater vehicles, assist disabled people, coach non-native speakers, advise on mental health issues, teach children social skills, psychotherapy, etc…
There are huge potential benefits in these domains. People will be less anxious about ‘looking stupid’ in front of their parents, teachers, or peers when interacting with an SDS to learn something or ask about things that worry them. Loneliness could be partially reduced, and people with visual impairments could ask questions about their environment (e.g. while cooking).
There are of course many risks discussed too. Psychotherapy is extremely sensitive and potentially harmful if the agent provides poor advice.
Huge datasets exist for evaluating general customer service / open-domain SDSs, but these do not exist in abundance for every specific domain. This is especially true for sensitive and private applications like the ones being researched today. The danger is that researchers will use general datasets to measure improvements in models to be deployed in a risky setting.
Data Collection (21, 22)
The wider range of more specific and often more sensitive domains that early-stage SDS researchers work in causes one huge challenge - data collection due to limited existing resources. See the “Ethics and Privacy” section above for the related ethical approval, bias, and privacy challenges.
People behave differently in different domains. For example, vocabulary changes as people use finance-related terms when discussing insurance, and certain user groups produce speech that differs from the general population (e.g. children or people with cognitive impairments). The large general datasets are therefore unsuitable for evaluating an SDS’s use in a lot of domain-specific research. Most participants at YRRSDS 2022 were building their own resources!
Consent was a debated topic. There are obvious concerns around getting consent from vulnerable user groups (e.g. people with cognitive impairments), but some researchers cannot get the data they need precisely because of consent. Some topics like offence detection and abuse mitigation require people to interact with the SDS very openly. The act of getting consent itself modifies user behaviour, as people know their interactions will be captured - they will be less angry, less abusive, and will refrain from swearing or making requests that stem from gender bias (e.g. “call me daddy”). Ethics boards do not usually allow you to ‘trick’ or ‘mislead’ people, so how can we collect this data from consenting adults?
The increase in multimodal research adds further data collection complications. Data in each of the several modalities must be annotated independently and then aligned very accurately, which is difficult to do (one reason why clapperboards are used in movie production). Anonymisation is also more difficult, and simple techniques like blurring faces often make the collected data unusable (e.g. for emotion detection, gaze tracking, use of facial cues, etc…).
Datasets are extremely difficult to obtain for multilingual low-resource language uses, multiparty SDSs, incremental systems, and the problematic application domains discussed above. Sometimes they even exist, but the creators did not get consent or approval to share the data with other researchers… this is critically important in general, but especially within sensitive domains. When collecting data, make sure to include data sharing in the ethics approval and consent forms (you can share data securely through TalkBank, DementiaBank, etc…). Often dataset curation takes years, so it is a shame to use it for just one or two papers.
The discussions often lead to potential paths that avoid data collection due to the hurdles, time, and costs involved. Data efficiency, augmentation, and bootstrapping are increasingly popular topics for this reason (I predict these will become roundtable discussions of their own soon).
In order to measure our progress in certain domains, data collection must be valued more, planned for more, and accounted for in a project’s budget.
Interdisciplinary Collaboration (21, 22)
As pressures build to collect data with more specific user groups, collaboration becomes one key to a project’s success. Many YRRSDS 2022 participants were originally trained in another field and then transitioned to work on SDSs. Many others reported that they worked directly with advisers from other disciplines in their teams and labs.
Networking was discussed as an important skill while doing a PhD today, as it makes finding potential collaborators much easier. Improving your own visibility even increases the chance of someone reaching out to you! Channels like Twitter, LinkedIn, podcasts, and writing on Medium were suggested (hint hint, feel free to follow me). You can also try attending conferences in another field; this worked for me personally.
There are clear benefits to collaboration. Other researchers can, for example, shine a light on any outdated techniques you are using - but it is not all golden. There may be clashes over funding goals, and ‘experts’ have to accept that they do not know everything on every topic… Collaboration with industry introduces new challenges too. Businesses want to focus on the implementation of a stable product rather than a publication, while academics often value evaluation metric optimisation over business metrics like latency and cost, etc…
Context and Knowledge Representation (21, 22)
A growing number of YRRSDS participants found that simple representations of dialogue (like slot-filling) did not sufficiently represent their real-world applications. More hierarchical and temporal representation structures were discussed, but these complex structures are of course much harder to learn.
Knowledge graphs are booming in research and industry, especially to represent world knowledge (resources like Wikidata and DBpedia were discussed). The challenge is mapping ‘what was said’ to a graph’s ontology - particularly for task-oriented applications where vocabularies differ.
Ingesting and grounding internal knowledge is a very different challenge however. We can learn about user behaviours, learn from their stories (a bit like a Black Mirror episode), or learn rules about the world from interactions with the user (this was inspired by a keynote talk by Chris Howes earlier in the day at YRRSDS 2022 - see below).
Dialogue contexts can get quite enormous, particularly in multimodal or multiparty domains. Our models currently take the entire context as input and we just let attention handle it all. In order to reduce computational cost without sacrificing performance, the YRRSDS discussion pointed out that scene-graphs are used in computer vision to ‘summarise’ a model’s input. This could be possible with language, but it is more difficult to determine what is ‘relevant’.
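Purely to illustrate the analogy, here is a toy sketch of scene-graph-style context pruning, assuming the dialogue history has already been converted into (subject, relation, object) triples. The triples and the relevance rule (keep only triples that mention an entity from the current user turn) are made up for the example; deciding what counts as ‘relevant’ is exactly the hard part the discussion identified.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def prune_context_graph(context_triples: List[Triple],
                        current_turn_entities: Set[str]) -> List[Triple]:
    """Keep only context triples that mention an entity from the current turn,
    instead of feeding the entire dialogue history to the model."""
    return [t for t in context_triples
            if t[0] in current_turn_entities or t[2] in current_turn_entities]

history = [("user", "prefers", "window seat"),
           ("meeting", "scheduled_on", "Monday"),
           ("jacket", "located_on", "left chair")]
print(prune_context_graph(history, {"jacket"}))  # [('jacket', 'located_on', 'left chair')]
```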
Clarification and Repair (21, 22)
Clarifications are used to establish common ground, for example:
- A: “Want to meet next Monday?”
- B: “On the 21st right?”
- A: “No, that’s this Monday, let’s meet on the 28th”
But they are also used to elicit additional information, for example:
- A: “Can you grab the red jacket?”
- B: “The one on the left?”
- A: “Oh sorry, yes”
Traditionally SDSs would repeatedly clarify their understanding with the user to avoid completing the wrong action. This gets frustrating however as it essentially parrots what the user said back to them every few turns. More intelligent methods of detecting when clarification or repair is required have been discussed at the last couple of YRRSDSs. This would allow our SDSs to initiate clarification in a much more natural and fluid manner.
These include:
- Semantically parsing the user’s utterance with two or three slightly tweaked models. When a portion of the parse structures differ, our system could initiate clarification about that particular section of the meaning representation (see the sketch after this list).
- Training our semantic parsing models to output an ‘unknown’ tag if they identify that the utterance is underspecified (inspired by robotics).
- Using facial cues like brow-furrows to indicate that the user is confused and repair is needed.
- Using knowledge graphs in conjunction with an SDS’s understanding. If the understanding does not match Wikidata’s knowledge or ontology (e.g. “When is Paris’ birthday?”), clarification is needed.
- Checking with an internally built knowledge graph if a user requests something unusual (e.g. requests to be woken up 2 hours earlier than usual - best to ask and make sure).
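Here is a minimal sketch of the first strategy above: comparing slot-value parses from several slightly tweaked parsers and asking about only the slots they disagree on. The flat slot-value representation and the function name are my own illustrative assumptions, not from any participant’s system.

```python
from typing import Dict, List, Optional

def clarification_targets(parses: List[Dict[str, str]]) -> Optional[List[str]]:
    """Compare slot -> value parses from several slightly tweaked parsers.

    Slots on which the parsers disagree (or which some parsers missed) are
    returned as candidates for a targeted clarification question; None means
    the parses agree and no clarification is needed.
    """
    all_slots = {slot for parse in parses for slot in parse}
    disputed = [slot for slot in all_slots
                if len({parse.get(slot) for parse in parses}) > 1]
    return disputed or None

# Two parsers disagree on the colour slot -> clarify "colour", not the whole utterance.
parses = [{"action": "fetch", "object": "jacket", "colour": "red"},
          {"action": "fetch", "object": "jacket", "colour": "dark red"}]
print(clarification_targets(parses))  # ['colour']
```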
People do alter their speech when an SDS is misunderstanding them. They clean disfluencies from their speech and soften their accents in certain areas (like Scotland). They are not very patient otherwise though, getting very frustrated very quickly. People usually clarify three times with other humans before politely giving up or just pretending to understand. This patience with other humans is likely due to the fact that people accept shared responsibility for the confusion. When interacting with SDSs, however, people put the full blame on the system - it is stupid for not understanding. Further work on clarification and repair strategies should ease this frustration and improve trustworthiness.
Linguistic Theories (22)
I think this is one of the most surprising trends on the rise amongst early-stage SDS researchers, but we had an entire roundtable on linguistic theories at YRRSDS 2022. Why is this?
Communicating ‘meaning’ correctly and effectively is hard - really hard! There are entire PhDs on the different meanings communicated by different laughter types, for example (see work by Vladislav Maraev or Chiara Mazzocconi). Our speech is so nuanced, as is our visual communication, and then we also speak disfluently and make mistakes. Google Duplex showed us how generating these phenomena makes the output of an SDS sound even more realistic. With this proven benefit and the increase in multimodal and healthcare domains (where people speak more disfluently), we see an increased interest in researching ‘where’ and ‘why’ humans use particular phenomena.
It was noted that linguistic annotation schemes are incredibly complex (like ISO 24617-2). These schemes cover everything we want as SDS researchers, but also a lot more. There was a lengthy discussion around whether we should create a slightly simpler version for our field.
Large Language Models (20, 21, 22)
While it was maybe a surprise to see linguistic theories above, it is definitely no surprise to see large language models (LLMs) here. And of course, we predominantly discussed transformer models as they continue to dominate tasks from semantic parsing to NLG and speech synthesis.
Transformers are of course brilliant, but they are not perfect when applied to dialogue. They often struggle to exploit conversation history effectively or to perceive contextual changes - but there is a larger problem. While these models approximate what a good dialogue looks like, they do not help the user reach their goals!
Transformers also do not appear to generalise well overall for SDSs. Dialogue data is biased in terms of domain, so LLMs work well for booking things through conversation - but that doesn’t necessarily transfer to handle conversations about managing anxiety. For example, a laugh’s meaning can be very different in various contexts. Was the system’s response funny? Is the user being sarcastic? Or are they nervous laughing while talking about their worries?
As mentioned earlier, people are very sensitive to system mistakes. This is detrimental to whether the user is happy using the SDS, and whether they trust it. Trust is completely wiped out if the SDS hallucinates to generate factually incorrect responses. This is critical in more sensitive domains like healthcare where factually incorrect responses could harm the user (for example, voice assistants designed for visually impaired users). More controllable rule-based systems are still preferable in these domains - at least for response generation.
Personalisation, Personality, and User Experience (20, 21, 22)
Over the years we have increasingly wanted to personalise our SDSs to each user, and to develop the SDS’s own personality, in order to improve user experience.
As discussed above, large language models are trained on general datasets, which makes them hard to personalise. For example, would such a model know that ‘on the first floor’ means something different in the US and the UK? Similarly, conversations about insurance are very different in certain cultures, and an SDS apologising for a mistake could be ‘admitting guilt’, risking legal issues in the US. Personalisation at this level also requires a lot of private user data.
It was reinforced that, as personalisation and personality improve, it is critical for SDSs in the future (if not now) to identify themselves as non-human. Pet embodiments could help here as they are non-human, genderless, and perceived to be just as capable as human-embodied agents.
Explainable AI (21, 22)
As our models become more complex, our ability to describe why they made a certain decision becomes more difficult. Researchers trust rigorous evaluations even if the model cannot ‘explain’ itself. For example, computer vision models outperform trained doctors trying to detect early signs of cancer in scans. However, the doctor can explain why they think you do or do not have cancer, and they can talk you through their reasoning.
Decision makers in the medical, legal, and financial fields do not like putting their money and trust in some black box that they do not understand. They prefer to rely on trained humans - even if they do make more mistakes. This has predominantly driven Explainable AI research.
What exactly do we want to explain though? And how do we know if a model has explained itself correctly? If we don’t even know whether the explanation is correct, how would the decision maker? Does it even matter? Really good object detection models can run without any interpretation or explanation - are there huge consequences if the model explains itself incorrectly?
Our conclusion was short and simple: If a model is contributing to real-world decisions or having a societal impact, explainable AI techniques should be applied.
Trends on the Decline
Finally, I identified a couple of topics that were major discussion points a few years ago, but that have been absent in recent years:
Engagement (19)
Obviously researchers today still want their systems to be engaging, but the discussions have shifted to focus on personalisation and personality. Engagement was discussed thoroughly in 2019, with a surprising inclusion of both open-domain and task-based systems.
Older systems relied on more rule-based approaches that caused repetitive interactions and even conversation loops. As a user talked more with the system over time (for example, an assistant at work or in their home), they would quickly hear every templated response. The ‘cool’ new voice assistants became stale and repetitive, failing to sustain the user’s attention or continued use. This spurred on the discussions around engagement.
Today’s systems often use neural response generation models that can talk with emotion and nuance, and can phrase sentences in countless ways. Hence, this particular engagement problem is no longer a huge issue. We now worry more about the factual correctness of these models, and giving them personalities to fit the specific domain (e.g. an empathetic therapist).
We also do not worry as much about engagement in the same way when developing task-based systems. In fact, long-term engagement is improved if the interaction can remain short - allowing the user to complete their goal efficiently.
Mirroring and Mimicking the User (18, 19)
Finally, older SDSs were more robotic in their interaction, both in regards to the flow of the conversation and the synthetic voice itself. People don’t like robotic voices, mispronunciations, or dialogues that statically follow a set path. This led to research into SDSs mimicking and mirroring users.
Mimicking was used to try and increase the user’s empathy towards the system. People were also working on mirroring the user’s vocabulary and accent to build trust… Today we would deem this ethically concerning (as well as difficult to evaluate), and this was brought up in 2019. Researchers asked what the final goal was here - was it to build false trust and ‘bond’ with the user? Why? Presumably to sell something or persuade the user.
Do you want to attend YRRSDS 2023?
If you have enjoyed reading about these discussions, I recommend keeping an eye out for YRRSDS’s next call for papers (it is usually co-located with SIGdial). You can follow the YRRSDS Twitter account if you want.
What happens at YRRSDS though? Here is some further info:
People submit 2-page position papers describing their work, their interests, suggested discussion topics, and their thoughts about the future of SDSs. Everyone that attends gets to present a poster and attend the roundtable discussions (which are on popular topics in the position papers).
The organisers are all early-stage SDS researchers themselves! Here is the 2022 team (who all submitted and presented work with everyone else):
We usually have a few invited keynote speakers too. Their talks often inspire some of the roundtable discussion directions. This year we had three brilliant speakers and two industry talks!
Finally, YRRSDS is co-located with SIGdial, so once we have made friends and had exciting discussions about our field - we can enjoy the conference!