How We Use Speech Functions to Build Scenario-Driven Skills with DeepPavlov

Ksenia Petukhova
Published in DeepPavlov
Dec 5, 2022

Authors: Ksenia Petukhova, Veronika Smilga

Introduction

One of the problems a chatbot has to solve every time it receives a user request is selecting the most appropriate response. But dialog is rather unpredictable: the same request can be answered in very different ways. How can we foresee all possible cases when developing a dialog scenario? And how can we know what is better to say in each specific situation?

There are three main approaches to dialog management:

  • Handcrafted approach. Developers define all possible states of the system and a set of rules for transitioning between those states, so a simple dialog system can be represented as a finite-state automaton (see the sketch after this list). Such dialog systems can be made more flexible by adding a data model that keeps track of slots. This method is called frame-based, and it enables a system to manage a dialog in a less rigid order.
  • Probabilistic approach. The system learns rules from actual conversations. For instance, a system can learn appropriate answers from a corpus of user-system utterance pairs; this is called an example-based system. When such a system receives a user utterance, it matches it against the corpus examples and returns an answer. In a more advanced form, neural networks are used to predict the closest corpus example for utterances that have no exact match in the corpus.
  • Hybrid approach. This approach combines the rule-based and data-driven approaches. It is convenient when the system uses external services to solve certain tasks: the system can use rules for those tasks and a data-driven approach in all other cases.
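
For illustration, here is a minimal sketch of such a finite-state dialog manager in Python; all states, phrases, and rules below are invented:

# A handcrafted dialog manager as a finite-state automaton: states are
# nodes, and rules map (state, condition on user input) to the next
# state. All states and phrases here are invented for illustration.
STATES = {
    "greet": "Hi! Do you like reading?",
    "ask_genre": "Great! What genre do you prefer?",
    "goodbye": "Okay, maybe next time. Bye!",
}

TRANSITIONS = {
    ("greet", "yes"): "ask_genre",
    ("greet", "no"): "goodbye",
}

def next_state(state: str, user_utterance: str) -> str:
    # A trivial handcrafted "classifier": keyword matching.
    key = "yes" if "yes" in user_utterance.lower() else "no"
    return TRANSITIONS.get((state, key), "goodbye")

state = "greet"
print(STATES[state])                 # Hi! Do you like reading?
state = next_state(state, "Yes, I do!")
print(STATES[state])                 # Great! What genre do you prefer?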

But the thing is, all these methods have significant weaknesses. The handcrafted approach requires a lot of effort and human resources to devise scenarios and create custom rules for classifying user utterances. The probabilistic approach doesn’t consider the context of an utterance, and the dialog is managed on a step-by-step basis.

People, however, behave quite differently. In a human-human conversation, dialog is managed with consideration of the context and of our plans and goals. At every dialog turn we perform some kind of action; these acts are called speech acts or dialog acts. For example, when we thank someone, we perform the “acknowledgment” dialog act, because by saying “thank you” we express our attitude towards our interlocutor concerning their action. The concept of speech acts was first suggested by Wittgenstein, then worked out by Austin and reinterpreted by Searle. It allows us to model interaction within individual turns. Two dialog turns can form an adjacency pair, in which the first turn causes the second one: for instance, a question is normally followed by an answer, and a greeting by a greeting. But dialog is not just a set of adjacency pairs, since utterances can be connected to each other through more complex relationships.

According to the Speech Functions (SF) theory, each dialog turn can be classified depending on its role in the context. Thus, a dialog can be represented as a sequence of Speech Functions, where each next SF logically follows from the previous ones. We decided to adapt SF theory to dialog management to make it more advanced. In this article we describe how we use Speech Functions to make dialogs with Dream Socialbot smoother and more natural. Namely, we describe a skill that was built to discuss books, and how we improved it with SFs.

DFF Book Skill

Our DFF Book Skill handles typical book questions and recommends books according to the user’s preferences. It was built using the Dialog Flow Framework (DFF), an open-source Python software stack for developing chatbots, also designed by DeepPavlov. DFF enables skill writers to code skills using a domain-specific language (DSL). In DFF the dialog is represented as a graph whose nodes correspond to bot responses; for each node, developers can specify the text of the bot response, the processing functions needed (e.g. slot-filling), and the conditions for transitioning to other nodes. An example of nodes can be seen below.

"user_liked": {
RESPONSE: loc_rsp.append_question(
initial="I see you love it." "It is so wonderful that you read the books you love. "
),
PROCESSING: {"set_confidence": int_prs.set_confidence(SUPER_CONFIDENCE)},
TRANSITIONS: {
("bible_flow", "bible_start"): cnd.true(),
"denied_information": int_cnd.is_no_vars,
},
},
"user_disliked": {
RESPONSE: loc_rsp.append_question(initial="It's OK. Maybe some other books will fit you better. "),
PROCESSING: {"set_confidence": int_prs.set_confidence(SUPER_CONFIDENCE)},
TRANSITIONS: {},
},
"offer_best": {
RESPONSE: loc_rsp.append_unused(
initial="You have a great taste in books! "
"I also adore books by {cur_book_author}, "
"especially {cur_author_best}. ",
phrases=loc_rsp.ASK_ABOUT_OFFERED_BOOK,
),
PROCESSING: {
"get_book": loc_prs.get_book,
"get_author": loc_prs.get_author,
"get_book_by_author": loc_prs.get_book_by_author,
"execute_response": loc_prs.execute_response,
"fill_responses_by_slots": int_prs.fill_responses_by_slots(),
},
TRANSITIONS: loc_cnd.has_read_transitions,
},

As you can see from the code above, each node has a RESPONSE field, in which the text of the bot utterance is specified; a PROCESSING field with special functions, such as the ones that extract the author of the mentioned book and their most popular book; and a TRANSITIONS field, where the next nodes and the conditions for transitioning to them are specified.

An illustrative schema of a small part of the DFF Book Skill scenario is represented below.

And here is an example conversation with the DFF Book Skill:

Overall, the skill is quite limited and ignores the user if they show even a little proactivity. Such a skill can only work well for users who are strictly willing to follow the scenario we built. And that is why we need Speech Functions.

Speech Functions

Theory

In their work, Eggins and Slade introduced their vision of the connection between individual dialog turns and the cross-turn discourse structure patterns specific to spoken language. This connection is a higher-level abstraction, in the sense that it links single turns at the discourse level, enabling an interactive and sequential conversational experience. At the level of turns, Eggins and Slade extended Halliday’s concept of Speech Functions, an alternative to dialog and speech acts. At a higher level, they introduced the concept of Discourse Moves, which are directly connected to the Speech Functions.

Speech Functions are defined through the utterance’s role in discourse and express the pragmatic goals of speakers. According to Eggins and Slade, Speech Functions make up the so-called Discourse Moves. There are three types of them:

  • opening moves: used to start a dialog or a new dialog topic;
  • sustaining moves: used to develop the current topic;
  • reacting moves: denote the change of the speaker; there are two subtypes of reacting moves:
      • responses, leading to the completion of the topic;
      • rejoinders, prolonging the discussion.

Thus, for example, the Speech Function “Open.Initiate.Demand.Fact” stands for demanding factual information at the beginning of a dialog or a new dialog topic. Such multi-layered tags represent topic organization and development, turn management, information type, social relationships, and the speaker’s intentions.
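
Because each tag is a dot-separated path through the taxonomy, the layers can be read off programmatically. A minimal sketch (the helper function is ours, not part of any DeepPavlov API):

# Speech Function labels are dot-separated paths through the taxonomy,
# so the top-level Discourse Move is simply the first component.
def discourse_move(label: str) -> str:
    return label.split(".")[0]  # "Open", "Sustain", or "React"

for sf in [
    "Open.Initiate.Demand.Fact",
    "Sustain.Continue.Prolong.Extend",
    "React.Rejoinder.Support.Track.Clarify",
]:
    print(f"{sf} -> {discourse_move(sf)} move")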

Therefore, the dialog that we have seen above would be annotated this way:

It’s important to note that we annotate only the last sentence of each utterance, since we believe it to be the most significant one for the development of the conversation. Analyzing the SFs of this dialog, we can say that the bot is way too proactive, since in every utterance it asks questions (“React.Rejoinder.Support.Track.Clarify”), while the user shows no proactivity and just answers the bot’s questions. However, in real life the structure of a dialog is far more complicated, and we need to remember that when developing chatbots.
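
For illustration, such an annotation can be represented as a sequence of (speaker, utterance, SF) triples; the utterances below are invented, and the labels come from the taxonomy discussed above:

# A dialog annotated with Speech Functions. Only the last sentence of
# each utterance is annotated, as described above; utterances invented.
annotated_dialog = [
    ("bot", "Do you like reading?", "Open.Demand.Fact"),
    ("user", "Yes, I love it.", "React.Respond.Support.Reply.Affirm"),
    ("bot", "What book did you read recently?", "React.Rejoinder.Support.Track.Clarify"),
]

for speaker, utterance, sf in annotated_dialog:
    print(f"{speaker:>4}: {utterance}  [{sf}]")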

How we started to use Speech Functions

When we started thinking about how to collect a dataset of annotated dialogs and develop annotation guidelines, we realized that the taxonomy proposed by Eggins and Slade wasn’t strict enough to simply hand over to assessors. That is why we modified it and developed the original versions of the Speech Function Classifier (SFC) and the Speech Function Predictor (SFP). The Speech Function Classifier determines the SF of an utterance, and then the Speech Function Predictor yields the probabilities of the SFs that can follow the one given by the classifier.
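
To give a sense of how the two models chain together, here is a minimal sketch; classify_sf and predict_next_sf are toy stand-ins for the real SFC and SFP services, and all labels and probabilities below are invented for illustration:

from typing import Dict

def classify_sf(utterance: str) -> str:
    # Speech Function Classifier (toy stand-in): label one utterance.
    return "Open.Demand.Fact" if utterance.endswith("?") else "Sustain.Continue.Prolong.Extend"

def predict_next_sf(sf: str) -> Dict[str, float]:
    # Speech Function Predictor (toy stand-in): probabilities of the
    # SFs that may follow the given one.
    table = {
        "Open.Demand.Fact": {
            "React.Respond.Support.Reply.Affirm": 0.5,
            "React.Respond.Confront.Reply.Disagree": 0.2,
            "React.Rejoinder.Support.Track.Clarify": 0.1,
        },
    }
    return table.get(sf, {})

sf = classify_sf("Do you like reading?")
print(sf, "->", predict_next_sf(sf))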

We then ran an experiment with real users. Namely, we asked people to talk to the bot with only the book skill enabled, and then we analyzed user answers and how the user experience could be improved if we used the Speech Function Predictor to recommend possible user and bot utterances. In the course of this experiment, we:

  1. collected book-related dialogs;
  2. classified user utterances with the Speech Function Classifier;
  3. ran the Speech Function Predictor;
  4. computed the number of times SFP predicted user utterance classes correctly (a sketch of this step follows the list);
  5. analyzed the results.
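
Concretely, step 4 amounts to checking whether the SF that the classifier assigned to the user’s actual reply appears among the predictor’s candidates. A minimal sketch, with invented dialog data:

# Step 4 of the experiment: how often did the SFP candidates contain
# the SF actually assigned to the user's next utterance?
# The data below is invented for illustration.
dialogs = [
    # (SF candidates predicted for the user turn, SF the classifier assigned)
    (["React.Respond.Support.Reply.Affirm",
      "React.Respond.Confront.Reply.Disagree"],
     "React.Respond.Support.Reply.Affirm"),
    (["React.Respond.Support.Reply.Affirm"],
     "React.Rejoinder.Support.Track.Clarify"),
]

hits = sum(actual in candidates for candidates, actual in dialogs)
print(f"SFP accuracy: {hits / len(dialogs):.0%}")  # 50%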

As a result, we created a scheme that shows the current conditions for transitioning to the next nodes, the transitions predicted by SFP, the number of user utterances that correspond to the current conditions, and the number of utterances that would correspond to the predicted conditions. A part of the schema can be seen below.

We can see that 6 users answered negatively (“int_cnd.is_no_vars” among the current conditions and “React.Respond.Confront.Reply.Disagree” among the predicted ones), and 22 users answered in a different way and fell under the default condition (“cnd.true()”). But if we used SFP, we could distinguish at least 5 new conditions: “React.Respond.Support.Reply.Affirm” (a positive answer), to which 14 answers would correspond; “React.Rejoinder.Support.Track.Clarify” (a question), to which 2 answers would correspond; etc. The distribution of conditions in DFF Book Skill can be seen below.

Thus, the current conditions are primitive (ignorance, yes/no, custom). If we used SF conditions for transitions, we could cover all current conditions and add many new ones, and most user utterances would be classified according to the new conditions. The image below shows the distribution of utterances corresponding to the current conditions and to the possible SF conditions.

To sum up, we confirmed that the use of Speech Functions can be very beneficial when building scenario-driven skills.

Later, the Speech Function Classifier and Speech Function Predictor became part of a discourse-driven recommendation system proposed by Daniel Kornev and Lidia Ostyakova. This system would enable developers to consider the most likely dialog scenarios when developing skills and therefore design more natural dialogs (Kuznetsov et al. 2021). It was then integrated into DD-IDDE, a discourse-driven integrated dialog development environment built on top of Draw.io as a Visual Studio Code extension.

DD-IDDE enables developers to create DFF skills using a customized Draw.io interface (more information can be found here):

How we actually rebuilt DFF Book Skill using Speech Functions

As we have demonstrated above, the initial DFF Book Skill had limited capabilities and provided generic responses, taking only the bare minimum of the user’s previous answers into consideration. We decided to modify DFF Book Skill by introducing Speech Functions as conditions for transitioning between the nodes of the dialog, instead of a simplistic system of yes- and no-intents. The full list of Speech Functions available for use when building custom DFF skills with DD-IDDE can be found here.

Based on the predictions made by the Speech Function Predictor, we introduced several new transitions and nodes to allow for a greater variety of the bot’s reactions to user utterances. As an example, let us consider the starting node. Initially, there were two intent-based transitions, one for a negative answer and one for all other answers (considered positive):

"book_start": {
RESPONSE: loc_rsp.append_unused("", [loc_rsp.START_PHRASE]),
PROCESSING: {
"set_confidence": int_prs.set_confidence(SUPER_CONFIDENCE),
"set_flag": loc_prs.set_flag("book_skill_active", True),
"execute_response": loc_prs.execute_response,
},
TRANSITIONS: {
("books_general", "dislikes_reading", 2): int_cnd.is_no_vars,
("books_general", "likes_reading", 2): cnd.true(),
},
},

Introducing SF-based transitions instead of the intent-based ones allowed for a greater variety of conditions. Now the system can also handle cases in which the user does not give any definite (positive or negative) answer to the question. If the user continues to discuss some other topic (indicated by the Sustain.Continue.Prolong.Extend, Sustain.Continue.Prolong.Enhance, or Sustain.Continue.Prolong.Elaborate functions), the bot changes the topic to discussing the Bible. If the user asks the bot a question (indicated by the React.Rejoinder.Support.Track.Clarify, React.Rejoinder.Support.Track.Check, or React.Rejoinder.Support.Challenge.Rebound functions), the bot provides a generic answer. What is more, the system is now able to recognize a wider variety of negative answers, as it takes into consideration both intent and SF classification results (the React.Respond.Confront.Reply.Disagree, React.Respond.Support.Reply.Disavow, and React.Rejoinder.Confront.Challenge.Counter functions for negative answers):

"book_start": {
RESPONSE: loc_rsp.append_unused("", [loc_rsp.START_PHRASE]),
PROCESSING: {
"set_confidence": int_prs.set_confidence(SUPER_CONFIDENCE),
"set_flag": loc_prs.set_flag("book_skill_active", True),
"execute_response": loc_prs.execute_response,
},
TRANSITIONS: {
"change_subject": cnd.any(
[
dm_cnd.is_sf("Sustain.Continue.Prolong.Extend"),
dm_cnd.is_sf("Sustain.Continue.Prolong.Enhance"),
dm_cnd.is_sf("Sustain.Continue.Prolong.Elaborate")
]
),
"bot_answer": cnd.any(
[
dm_cnd.is_sf("React.Rejoinder.Support.Track.Clarify"),
dm_cnd.is_sf("React.Respond.Support.Track.Check"),
dm_cnd.is_sf("React.Rejoinder.Support.Challenge.Rebound")
]
),
"dislikes_reading": cnd.any(
[
dm_cnd.is_sf("React.Respond.Confront.Reply.Disagree"),
dm_cnd.is_sf("React.Respond.Support.Reply.Disavow"),
dm_cnd.is_sf("React.Rejoinder.Confront.Challenge.Counter"),
int_cnd.is_no_vars
]
),
"likes_reading": cnd.true(),
},
MISC: {"speech_functions": ["Open.Demand.Fact"]},
},

Thus, we have demonstrated a possible way to modify and expand chatbot skills using additional information about discourse, namely the Speech Functions taxonomy developed by Eggins and Slade. Such an approach allows us to enrich the possible dialog scenarios and to cover the most significant types of possible user responses without having to introduce custom conditions.

Remarks

Since the described approach is experimental, we didn’t deploy it to production. But if you want to try this skill out, you can build it on your machine:

1. Clone the Dream repository:

git clone https://github.com/deeppavlov/dream.git

2. Go to the cloned repository:

cd dream

3. Switch to the feat/sf-bookskill-new branch:

git checkout feat/sf-bookskill-new

4. Run the SF Dream distribution:

docker-compose -f docker-compose.yml -f assistant_dists/dream_sfc/docker-compose.override.yml -f assistant_dists/dream_sfc/dev.yml -f assistant_dists/dream_sfc/proxy.yml up --build

5. Let’s chat! In a separate terminal tab, run:

docker-compose exec agent python -m deeppavlov_agent.run -pl assistant_dists/dream_sfc/pipeline_conf.json
