#2 Voice Bot Components

Published in

SOGEDES tech savvy

6 min read · Apr 20, 2023

Hey there! Welcome back to our series, How to build a voice bot for your business. Today, we’re going to take a look at the technology behind a Voice Bot and review all the key components necessary to build one. You will also learn about the open source options, as well as the established technologies available on the market.

Three main Voicebot components: Speech to Text (STT or ASR), Natural Language Processing (NLP) and Text to Speech (TTS)

My name is Bruno, AI Engineer at Sogedes. We are in the second blog of this series where I show how to implement a voice bot with real examples and implementation details. If you are new here, don’t forget to read the first episode where I explained the main concepts of a Voice Bot project.

So… a Voice Bot has three main components: Speech Recognition, Natural Language Processing (NLP) and Text to Speech. In the last few years, there has been a boom in the AI field, driven by the Transformer neural network architecture, which significantly improved the performance of these algorithms. We’ve seen Speech Recognition and NLP components deliver incredible results, and Voice Bots are now a hot topic because we can build more robust and reliable applications.

Speech Recognition, often referred to as Speech to Text, is the component responsible for understanding the user’s speech by converting audio data to text data. Text is much easier to process than an audio signal, which is why we convert the audio to text before trying to understand the meaning of the speech. In my opinion, this is the most critical Voice Bot component because it captures the user input. If you capture it incorrectly, the next components can’t work properly.

You can look for mature Speech Recognition services, like Google, Microsoft, IBM, AWS, Deepgram and so on. Their quality is generally higher and they have the data advantage, as their models have been trained on huge amounts of audio. But, of course, that comes with a higher price tag. The good news is that AI is becoming more democratic every day and we have open source systems that perform really well, like VOSK, Scribosermo and Wav2Vec. Regardless of the technology you choose, you need to make sure the speech recognizer works together with a good silence detector, which decides when the user has finished speaking, so the Voice Bot response times are fast enough. There is also a blog post where we explain more about this topic.
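To make the silence detector idea more concrete, here is a minimal sketch of an energy-based end-of-speech detector. It is only an illustration under simple assumptions: audio arrives as frames of float samples between -1 and 1, and the names `rms`, `end_of_speech_index`, `threshold` and `min_silent_frames` are hypothetical, not part of any of the libraries mentioned above. Production systems usually rely on a trained voice activity detector instead of a fixed energy threshold.

```python
import math
from typing import List, Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def end_of_speech_index(
    frames: List[Sequence[float]],
    threshold: float = 0.02,      # assumed energy level separating speech from silence
    min_silent_frames: int = 3,   # assumed number of quiet frames that ends an utterance
) -> Optional[int]:
    """Return the index of the first frame of the trailing silence, or None.

    Silence before the user starts speaking is ignored, so the bot does not
    cut the turn before anything was said.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i - min_silent_frames + 1
    return None


if __name__ == "__main__":
    quiet = [0.0, 0.0, 0.0, 0.0]
    loud = [0.5, -0.5, 0.5, -0.5]
    frames = [quiet] * 2 + [loud] * 3 + [quiet] * 4
    print(end_of_speech_index(frames))
```

A smaller `min_silent_frames` makes the bot answer faster but increases the risk of interrupting users who pause in the middle of a sentence, so this value is usually tuned per use case.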

As I said before, we need to understand the meaning of the user’s speech. For that, we have the NLP module that processes the transcribed text. There are many tasks that an NLP algorithm can perform, such as text generation, summarization, classification and so on. In the context of Voice Bots, especially Closed Domain Voice Bots developed with Rule-based frameworks, we are mainly interested in two tasks: text classification and entity extraction. The first one analyzes the whole text and assigns it a user intent. Depending on the detected intent, there are predefined responses to continue the conversation flow.

The second task is responsible for extracting key information from the text, like person names, locations, numbers etc. With these extracted entities, you can perform data validation, store information in databases or do whatever you need with them. The good news is that there are many open source and free frameworks that not only handle text classification and entity extraction, but are specialized in building conversations. RASA, Microsoft Bot Framework, Botpress and Wit.ai are some examples and you can find the full list here. Of course you could start from scratch with the BERT family of models on Huggingface to build text and entity classifiers, but if you choose a more mature framework like RASA, you will speed up your project development.
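As a rough illustration of entity extraction, the sketch below pulls two kinds of entities out of a transcribed sentence with regular expressions from the Python standard library. The entity names (`order_id`, `time`) and the patterns are assumptions made up for this example. Frameworks like RASA provide their own, more capable extractors.

```python
import re
from typing import Dict, List

# Hypothetical entity patterns, chosen only for this example.
ENTITY_PATTERNS = {
    "order_id": re.compile(r"\border (?:number )?(\d{4,})\b", re.IGNORECASE),
    "time": re.compile(r"\b(\d{1,2}:\d{2})\b"),
}


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Return every match of each entity pattern found in the text."""
    entities: Dict[str, List[str]] = {}
    for name, pattern in ENTITY_PATTERNS.items():
        values = [match.group(1) for match in pattern.finditer(text)]
        if values:
            entities[name] = values
    return entities


if __name__ == "__main__":
    print(extract_entities("My order number 12345 should arrive at 14:30"))
```

Once extracted, the values can be validated (for example, checking that the order exists) before the bot continues the conversation.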

There are also many Conversational AI providers that are very easy to use, like Google Dialogflow and Microsoft LUIS. How you design the conversation flow may vary depending on the framework or provider you choose, but the idea is the same: you define many training phrases for each intent of your dialog and a predefined set of responses. Besides that, you also define the entities that you want to extract, and in many cases you can choose between simple REGEX and Machine Learning based models. We will understand these details better in the next blog, where we’ll check Google Dialogflow together.
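To show the idea of training phrases and predefined responses, here is a toy intent classifier. Note that it swaps the Machine Learning model for a much simpler technique, word overlap (Jaccard similarity) between the user text and each training phrase, so it is only a sketch of the concept and not how Dialogflow, LUIS or RASA work internally. The intents, phrases, responses and the `min_score` threshold are all made up for the example.

```python
import re
from typing import Set

# Hypothetical intents with a few training phrases each.
TRAINING_PHRASES = {
    "check_order": ["where is my order", "track my package", "order status"],
    "talk_to_human": ["talk to a human", "speak with an agent", "I want a real person"],
}

# One predefined response per intent, plus a fallback.
RESPONSES = {
    "check_order": "Sure, can you tell me your order number?",
    "talk_to_human": "No problem, I will transfer you to a colleague.",
    "unrecognized": "Sorry, I did not understand that. Could you rephrase?",
}

FALLBACK_INTENT = "unrecognized"


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_intent(text: str, min_score: float = 0.3) -> str:
    """Pick the intent whose training phrase overlaps most with the text."""
    words = tokens(text)
    best_intent, best_score = FALLBACK_INTENT, 0.0
    for intent, phrases in TRAINING_PHRASES.items():
        for phrase in phrases:
            score = jaccard(words, tokens(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= min_score else FALLBACK_INTENT


if __name__ == "__main__":
    intent = classify_intent("where is my order please")
    print(intent, "->", RESPONSES[intent])
```

Anything that scores below `min_score` falls back to the unrecognized intent, which is the hook for handling the cases where the bot does not understand the user.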

Besides the aforementioned Conversational AI frameworks, the Generative AI space is progressing at light speed and will gradually change traditional Rule-based chatbots, as their flows get integrated with Large Language Models (LLMs), like ChatGPT, enabling more powerful conversations. As an example, you can see how RASA is developing intentless bots using LLMs, or how you can integrate knowledge bases with LLMs for Q&A bots.

The last component is Text to Speech. After getting our response from the NLP module, we need to say it to the user so we continue the conversation flow. Again, you can choose from many mature services, like Google and Microsoft, or you can check Open Source projects, like MaryTTS, Kaldi and some of the most recent models in the Huggingface hub.

Besides all the components we just talked about, we must not forget that Voice Bot applications are usually used in phone calls, so we are still missing one important technology: the Telephony Gateway. It allows a phone call to reach the Voice Bot so the user can speak with it. This telephony gateway can be implemented in various ways and is often provided by a Contact Center or a Communication Platform, such as Twilio, Genesys and our SogedesX product. In addition, there are open source solutions that are very flexible and let you build anything you want in telephony, like Asterisk or FreePBX.

Usually, the workflow is simple: the user calls the Voice Bot number and the phone gateway creates a dedicated channel for this call. At this point, the logic is very similar to IVR systems, where you program different conversation paths based on the digit entered by the user on the phone. In Asterisk, for example, this logic can be programmed with the Dialplan scripting.
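For comparison with what comes next, the classic IVR logic boils down to a lookup from the pressed digit to a conversation path. The sketch below shows that idea in Python rather than in Asterisk Dialplan syntax, and the menu entries and names (`IVR_MENU`, `route_digit`) are invented for the example.

```python
# Hypothetical IVR menu: pressed digit -> conversation path.
IVR_MENU = {
    "1": "sales",
    "2": "support",
    "0": "operator",
}


def route_digit(digit: str, menu: dict = IVR_MENU, default: str = "repeat_menu") -> str:
    """Return the path for the pressed digit, or replay the menu if unknown."""
    return menu.get(digit, default)


if __name__ == "__main__":
    print(route_digit("2"))
    print(route_digit("9"))
```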

But now, instead of processing which digit the user pressed during the call, we can build more natural conversations by connecting the phone call to the Voice Bot. When the call starts, the bot begins the conversation using the Text to Speech component with a default welcome message. Then, the bot waits for a response using the Speech Recognition component and finally passes the transcribed text to the NLP algorithm, which analyzes the text and extracts the intent and entities of the user’s speech. Each intent has a predefined response that is used by the Text to Speech module, closing the loop. The conversation then continues until the end of the dialog or until some expected event is triggered.
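The loop described above can be sketched in a few lines. This is a simplified illustration, not a real integration: `speak`, `listen` and `understand` are hypothetical callables standing in for the Text to Speech, Speech Recognition and NLP components, and the demo replaces them with fakes so the flow can be run without any audio.

```python
from typing import Callable, Dict, List, Tuple


def run_dialog(
    speak: Callable[[str], None],                   # stand-in for Text to Speech
    listen: Callable[[], str],                      # stand-in for Speech Recognition
    understand: Callable[[str], Tuple[str, dict]],  # stand-in for NLP: (intent, entities)
    responses: Dict[str, str],
    welcome: str = "Hello! How can I help you?",
    end_intent: str = "goodbye",
    max_turns: int = 10,
) -> None:
    """Welcome the caller, then repeat listen -> understand -> speak."""
    speak(welcome)
    for _ in range(max_turns):
        text = listen()
        intent, entities = understand(text)
        # Entities would be validated or stored here; this sketch ignores them.
        speak(responses.get(intent, "Sorry, I did not understand that."))
        if intent == end_intent:
            break


def demo() -> List[str]:
    """Run one scripted call with fake components and return what the bot said."""
    user_turns = iter(["where is my order", "bye"])
    spoken: List[str] = []

    def fake_understand(text: str) -> Tuple[str, dict]:
        return ("goodbye" if "bye" in text else "check_order", {})

    run_dialog(
        speak=spoken.append,
        listen=lambda: next(user_turns),
        understand=fake_understand,
        responses={"check_order": "Your order is on its way.", "goodbye": "Goodbye!"},
    )
    return spoken


if __name__ == "__main__":
    for line in demo():
        print(line)
```

Passing the three components in as functions keeps the loop independent of the provider, so a cloud service could be swapped for an open source model without touching the dialog logic.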

If you remember the good design practices we talked about in the first episode, we need to handle the cases where the Bot does not understand the user’s speech. For example, we can transfer the call to a human when the bot detects two Unrecognized Intents in a row. You may also want to recognize when the user asks to speak with a human, a very important scenario for improving the customer experience.
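A minimal sketch of this handoff rule could look like the class below. The intent names (`unrecognized`, `talk_to_human`) and the class itself are assumptions for the example. Most conversational frameworks offer their own fallback mechanism that you would configure instead.

```python
class HandoffPolicy:
    """Decide when the call should be transferred to a human agent."""

    def __init__(
        self,
        max_consecutive_failures: int = 2,
        fallback_intent: str = "unrecognized",  # hypothetical intent name
        human_intent: str = "talk_to_human",    # hypothetical intent name
    ) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.fallback_intent = fallback_intent
        self.human_intent = human_intent
        self.failures = 0

    def should_transfer(self, intent: str) -> bool:
        """Update the failure counter and return True if a human should take over."""
        if intent == self.human_intent:
            return True  # the user explicitly asked for a human
        if intent == self.fallback_intent:
            self.failures += 1
        else:
            self.failures = 0  # any understood intent resets the streak
        return self.failures >= self.max_consecutive_failures


if __name__ == "__main__":
    policy = HandoffPolicy()
    for intent in ["unrecognized", "check_order", "unrecognized", "unrecognized"]:
        print(intent, policy.should_transfer(intent))
```

Resetting the counter after every understood intent means only failures in a row trigger the transfer, which matches the rule above.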

Nice! We covered all the main components and how they interact with each other. Of course, we’ve only seen an introduction and we could do a more detailed blog for each of them. Let us know in the comments if you are interested in a deeper technical analysis.

However, if you don’t have the time or resources to create your own Voice Bot project, here at Sogedes we offer the whole package as a Service, so you don’t have to worry about anything. We’ll help you design the best use case, take care of the technical implementation and make sure your customers are happy with it. Let us know your use case and we will be very happy to take a look at it.

In the next episode, let’s get our hands on a real project using Google Dialogflow, a tool that is easy to get started with and also has a generous free tier. I’ll share with you how to create a simple Voice Bot and how to set up a phone number for testing it. I’ll also share a more complex case that we created for our website, where a Voice Bot sells itself. See you soon!

Note: if you wanna try a Voice Bot demo we have available, you can call: +49 621 92109138

If you have any technical questions, just leave a comment or reach me on LinkedIn (https://www.linkedin.com/in/brunofcarvalho1996/). Thanks!

Written by Bruno Fernandes Carvalho