Real-Time Google Speech Recognition on Asterisk with a Python EAGI Script
When we think about voice technologies, Asterisk is surely a name that comes to mind. It is a great free and open-source framework for building communications applications. And when you build voice applications, speech recognition is key for many use cases, like Voice Bots, Virtual Assistants, phone call transcription and user sentiment analysis. In this blog I want to share how you can integrate Asterisk with Google Speech Recognition, so you can implement robust real-time applications.
Non-Real-Time Applications
If we are talking about transcription or sentiment analysis, maybe your use case does not require real-time speech recognition, since you can easily post-process this information based on the phone call recording. For this, you can find useful information here:
- https://github.com/zaf/asterisk-speech-recog
- https://gitlab.com/ppittavi/asterisk_asr_google/-/tree/main/volumes/agi-bin
Both links use the same approach: an AGI script records the user's speech into a temporary file until Asterisk detects silence, then sends the file to Google Cloud for transcription.
If AGI is a new term for you, it stands for Asterisk Gateway Interface. AGI provides an interface between the Asterisk dialplan and an external program that wants to manipulate a channel in the dialplan. This external program can be written in any language, like the Perl and Python examples in the links above. In short, it is just a script in your preferred language that is triggered by the Asterisk dialplan and can execute Asterisk commands, like setting a variable, recording the call or playing an audio file. Besides Asterisk commands, you can do anything you want, like importing third-party libraries.
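For illustration, here is a minimal sketch of the raw AGI protocol in Python (a hypothetical example, not the code from the links above): Asterisk sends its environment on stdin and accepts commands on stdout.

```python
#!/usr/bin/env python3
# Minimal AGI sketch (hypothetical example): read the AGI environment
# from stdin, then set a dialplan variable.
import sys

# Asterisk sends "agi_*: value" headers, terminated by a blank line.
env = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        break
    key, _, value = line.partition(": ")
    env[key] = value

def agi_command(cmd):
    """Send one AGI command on stdout and return Asterisk's reply."""
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()
    return sys.stdin.readline().strip()  # e.g. "200 result=1"

# Example: store the caller ID in a dialplan variable.
agi_command('SET VARIABLE "CALLER" "{}"'.format(env.get("agi_callerid", "")))
```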
Real-Time Applications
But what if you want a real-time application like a Voice Bot, where response time is critical? The Voice Bot needs to detect quickly when the user stops speaking and give a proper response. You can imagine that recording the user's speech, creating a temporary file, reading the file and sending the data to Google is not an optimal solution: with this approach the Voice Bot's response time is around 3.5 s, and Asterisk's silence detection is not accurate enough. How can we improve this?
We need to stream the phone call audio to Google in real time using the gRPC protocol. You can check some examples in the Google documentation. But an AGI script does not provide the phone call audio, so we need an EAGI script, which stands for Enhanced AGI. EAGI provides the phone call audio on OS file descriptor 3, so you can read the audio chunks and send them to Google in real time.
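To give an idea, a minimal read loop could look like this (the chunk size is an assumption; EAGI audio is typically 8 kHz, 16-bit signed linear):

```python
import os

# EAGI exposes the call audio on file descriptor 3; with 8 kHz, 16-bit
# mono audio, 320 bytes correspond to roughly 20 ms.
EAGI_AUDIO_FD = 3
CHUNK_SIZE = 320

def read_audio():
    while True:
        chunk = os.read(EAGI_AUDIO_FD, CHUNK_SIZE)
        if not chunk:
            break
        yield chunk  # each chunk can be streamed to Google over gRPC
```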
How to use
To access the GitHub repository, check here.
The EAGI script is implemented in Python and you execute it from the dialplan with Asterisk's EAGI application.
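A minimal dialplan sketch (the extension number and script argument are assumptions; check readme.md for the exact usage):

```
exten => 100,1,Answer()
 same => n,EAGI(streaming-asr-google.eagi,en-US)
 same => n,Verbose(1,User said: ${TRANSCRIPT})
 same => n,Hangup()
```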
Before running these commands, make sure streaming-asr-google.eagi is in the /var/lib/asterisk/agi-bin/ folder and check that it is executable (just run 'chmod +x streaming-asr-google.eagi'). You also need to put your Google credentials JSON file in the same folder under the filename "google-cloud-credentials.json".
As you can see, streaming-asr-google.eagi expects some parameters as input arguments (check readme.md for more info) and returns the transcript in the variable "TRANSCRIPT". The script uses single_utterance mode to detect when the speech is over, and it can also detect if the user has been silent for several seconds. If that happens, the "TRANSCRIPT" variable is set to "_SILENCE_".
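Extending the dialplan sketch above, you could branch on that marker like this (the priority label and prompt are assumptions):

```
exten => 100,1,Answer()
 same => n,EAGI(streaming-asr-google.eagi,en-US)
 same => n,GotoIf($["${TRANSCRIPT}" = "_SILENCE_"]?silence)
 same => n,Verbose(1,User said: ${TRANSCRIPT})
 same => n,Hangup()
 same => n(silence),Playback(vm-sorry)
 same => n,Hangup()
```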
You can also extend this script to other scenarios, like processing interim results. For that, check the recognition configuration in the Google documentation.
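As a sketch, handling interim results could look like this (it assumes interim_results=True is set in the StreamingRecognitionConfig and that responses is the iterator you will see later in this post):

```python
# Sketch: print interim hypotheses while waiting for the final result.
def handle_responses(responses):
    for response in responses:
        for result in response.results:
            transcript = result.alternatives[0].transcript
            if result.is_final:
                print("final:", transcript)
            else:
                print("interim:", transcript)
```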
How it works
Let's go step by step through the script. First we import some libraries, start the AGI connector and read the command line arguments that come as input when we execute the EAGI script. You can also set the DEBUG flag to True if you want to see all the logs.
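A sketch of that setup (variable names are assumptions, not the repository's exact code):

```python
#!/usr/bin/env python3
import os
import sys

from google.cloud import speech

DEBUG = False  # set to True to see all the logs

def log(msg):
    # stdout carries AGI commands, so debug output goes to stderr.
    if DEBUG:
        sys.stderr.write(msg + "\n")

# Drain the AGI environment headers that Asterisk sends on stdin
# (the same handshake as in the AGI sketch above).
agi_env = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        break
    key, _, value = line.partition(": ")
    agi_env[key] = value

# Command line arguments come from the dialplan, e.g. the language code.
language_code = sys.argv[1] if len(sys.argv) > 1 else "en-US"

# Point the Google client at the credentials file next to the script.
os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS",
    "/var/lib/asterisk/agi-bin/google-cloud-credentials.json",
)
```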
Now we need to define the Audio Stream Generator, a class that yields audio chunks. Notice that this class is special because we use it as a context manager with Python's with statement. So there is an __enter__ method that opens file descriptor 3 and an __exit__ method that closes it.
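A sketch of such a class (the names are assumptions; the repository's implementation may differ):

```python
import os

class AudioStreamGenerator:
    """Context manager that yields audio chunks from the EAGI audio fd."""

    EAGI_AUDIO_FD = 3
    CHUNK_SIZE = 320  # roughly 20 ms of 8 kHz 16-bit mono audio

    def __enter__(self):
        # Open file descriptor 3 as an unbuffered binary stream.
        self.stream = os.fdopen(self.EAGI_AUDIO_FD, "rb", buffering=0)
        self.closed = False
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.closed = True
        self.stream.close()

    def generator(self):
        while not self.closed:
            chunk = self.stream.read(self.CHUNK_SIZE)
            if not chunk:
                return
            yield chunk
```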
The gRPC protocol creates a bidirectional stream, so our script sends the audio chunks as requests and receives the real-time responses from Google in parallel. While iterating over the responses, we need to parse each one and detect the END_OF_SINGLE_UTTERANCE event (which means the user has stopped speaking). The Parser class below checks this and interprets the Google response.
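A sketch of that parser, using the google-cloud-speech Python client (the class layout is an assumption):

```python
from google.cloud import speech

class Parser:
    """Tracks the transcript and flags END_OF_SINGLE_UTTERANCE."""

    def __init__(self):
        self.transcript = ""
        self.utterance_ended = False

    def parse(self, response):
        # Google signals that the user stopped speaking with this event.
        if (response.speech_event_type ==
                speech.StreamingRecognizeResponse.SpeechEventType.END_OF_SINGLE_UTTERANCE):
            self.utterance_ended = True
        # Keep the latest final transcript, if any.
        for result in response.results:
            if result.is_final and result.alternatives:
                self.transcript = result.alternatives[0].transcript
```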
Awesome, we have all the components to run the script effectively. The next lines of code initialize the Google Speech Recognition model and start the Audio Stream Generator. Every time the stream yields an audio chunk, a new gRPC request is yielded, and a new response is produced whenever Google sends one back. More insights about Python generators here.
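Putting it together, here is a sketch of that flow, reusing the pieces from the earlier sketches (exact names and configuration are assumptions):

```python
# Configure the streaming recognizer for Asterisk's 8 kHz linear audio.
client = speech.SpeechClient()
streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code=language_code,
    ),
    single_utterance=True,
)

parser = Parser()
with AudioStreamGenerator() as stream:
    # Each audio chunk from the stream becomes one gRPC request ...
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in stream.generator()
    )
    # ... and Google pushes responses back over the same connection.
    for response in client.streaming_recognize(streaming_config, requests):
        parser.parse(response)
        if parser.transcript:
            break  # we have a final transcript

# Hand the result back to the dialplan; fall back to "_SILENCE_" if
# nothing was recognized (the real script uses a silence timer).
agi_command('SET VARIABLE "TRANSCRIPT" "{}"'.format(parser.transcript or "_SILENCE_"))
```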
Conclusion
Now that you have a better understanding of EAGI scripts and real-time speech recognition, you can start building any use case, like Voice Bots or Virtual Assistants!
Here at SOGEDES we get the most out of these technologies to improve the user experience. Our Voice Bots are optimized to handle multiple concurrent calls and have the best response time to give the user a "human feeling" when talking with a robot. If you are new to the Voice Bot world or want one implemented in your company without worrying about anything, don't forget to reach out to us here! We take over the whole process and run everything as a service for you.
If you have any technical questions, just leave a comment or reach me on LinkedIn (https://www.linkedin.com/in/brunofcarvalho1996/). Thanks!
Extra
In the same repository, there is a Python AGI script that integrates with Google Text-to-Speech. Check it out!
References
https://github.com/ederwander/Asterisk-Google-Speech-Recognition