Illustration created by Yufei McLaughlin

The future of meetings is here (part 2): how Machine Learning is applied in speech analytics and voicebots

Yufei M
ringcentral-ux
8 min read · Mar 30, 2020


In the previous article, The future of meetings is here (part 1), we talked about how Machine Learning (ML) is applied in RTC optimization and computer vision. In this article, I want to summarize how ML is applied in speech analytics and voicebots, and its influence on the future of video conferencing.

As we discussed in part 1, processing real-time video is hard because computer vision has a huge amount of data to work through. As a result, many of today's ML use cases in RTC (Real-Time Communication) lean toward voice and text-to-speech scenarios.

To implement AI in RTC, big cloud vendors such as IBM Cloud, AWS, Azure, and Google Cloud have been aggressively competing with one another on speech and language interaction. But if you dive deeper, you will find that most of their solutions only touch the surface.

However, many specialty vendors start from basic speech detection and then go deeper into different areas and specializations. These include:

(1) speech analytics: a) meeting productivity, b) paralinguistic and sentiment analysis, c) sales and call center optimization

(2) voice-enabled virtual assistants: a) meeting voicebots, b) call center voicebots

Speech analytics

Illustration created by Yufei McLaughlin

The state of speech recognition and understanding is also accelerating the arrival of the future of meetings. That future is about ideas, meaning, intent, and context.

However, understanding and transcribing speech in real-world, noisy environments has never been an easy task. With real-time deep learning tools, speech recognition and analysis has reached a revolutionary stage. These developments have enabled advances in real-time and batched transcription for NLU (natural-language understanding), applied to interactive RTC streams. Here are the three business use cases that benefit from AI-driven speech analytics:

(1) Meeting productivity

In a meeting, a lot happens and a lot of important information is shared and discussed. Once the meeting ends, that information is lost; rich enterprise data is lost. To provide value for enterprises, many RTC vendors specialize in transforming this unstructured data into structured data by turning speech into actionable, archivable, and searchable records. ML-powered real-time speech analytics brings huge value both during and after a meeting.

In a meeting, we often bring our laptops or notebooks to take important notes. But once we open the laptops, messages come in from everywhere; we get distracted and can't focus on the people and the discussion. With ML-based speech analysis, automatic note-taking and live transcription give us the assurance that we can always come back and find what was discussed. During the meeting, we can keep our focus on the discussion and its outcomes: less time managing the meeting, more time on the content itself, and never a missed minute. For meeting facilitators, automated note-taking eliminates manual work so they can focus on leading more productive, engaged meetings. And even if we do get distracted, real-time meeting highlights let us quickly catch up on what we missed.
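For a feel of how live transcription might be wired up, here is a minimal sketch using the open-source Vosk recognizer. The file name and model choice are placeholders, and real products run this on live audio streams rather than files:

```python
# A minimal live-transcription sketch using the open-source Vosk recognizer.
# "meeting.wav" is a placeholder for a 16 kHz mono PCM recording.
import json
import wave

from vosk import KaldiRecognizer, Model

wf = wave.open("meeting.wav", "rb")
model = Model(lang="en-us")                      # fetches a small English model if needed
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    chunk = wf.readframes(4000)                  # feed audio in small chunks,
    if len(chunk) == 0:                          # as a live stream would arrive
        break
    if rec.AcceptWaveform(chunk):                # True when a phrase is finalized
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])     # flush the last partial phrase
```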

After a meeting, people often don't have time to go through the entire recording or notes. ML doesn't just summarize the meeting into a transcript view; it also extracts important data and gives us recommended highlights. We can also interact with it by marking what matters to us, and it will learn and become more intelligent. ML-driven speech analytics transforms meetings from aimless to actionable: automatic action items and post-meeting follow-ups boost productivity and efficiency. These are key contributions that ML and speech analysis bring to (once considered "future") meetings.
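To make "actionable" concrete, here is a toy heuristic, not any vendor's actual method, that scans a transcript for sentences that look like commitments:

```python
import re

# Toy heuristic for pulling action-item candidates out of a transcript.
# Real products use trained NLU models; this only illustrates the idea.
ACTION_PATTERNS = [
    r"\b(I|we|you|\w+)\s+will\s+\w+",        # "Dana will send the deck"
    r"\baction item\b",
    r"\b(need to|needs to|let's|follow up)\b",
]

def extract_action_items(transcript: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", transcript)
    return [
        s for s in sentences
        if any(re.search(p, s, re.IGNORECASE) for p in ACTION_PATTERNS)
    ]

notes = ("Great discussion everyone. Dana will send the revised deck by Friday. "
         "We need to follow up with legal on the contract.")
print(extract_action_items(notes))
# ['Dana will send the revised deck by Friday.',
#  'We need to follow up with legal on the contract.']
```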

(2) Paralinguistic and sentiment analysis

There are many business situations where a meeting could benefit from ML-powered paralinguistic and sentiment analysis. For example, when interviewing a candidate, people would love to know whether he or she is truly interested in and passionate about the position. Or, during a sales pitch, it helps a salesperson to know whether a key stakeholder is interested and, if not, which part of the conversation threw him or her off.

A lot of the time, speech is more than speech; it contains much more than the words. How something is said can change the meaning of what is said. Take sarcasm: a person can say the same words in different tones and convey very different meanings. To really understand someone's speech, we also need signals such as pitch, tone, pace, timing, voice stress, and inflection. Understanding paralinguistics is critical to understanding meaning. When ML can learn paralinguistics, it has accurate information about what is really going on in the speech, and it can predict things such as whether a candidate is a good hire or a prospect is a promising sales lead.
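As a sketch of what those signals look like in code, the snippet below pulls a few basic paralinguistic features from a recording with the librosa audio library. The file name is a placeholder, and real systems use far richer feature sets:

```python
import librosa
import numpy as np

# Extract a few basic paralinguistic features from a recording.
# "call.wav" is a placeholder for a speech recording.
y, sr = librosa.load("call.wav", sr=16000)

# Pitch contour (fundamental frequency) via the pYIN tracker.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Loudness proxy: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)[0]

print("mean pitch (Hz):", np.nanmean(f0))        # overall tone
print("pitch variation:", np.nanstd(f0))         # monotone vs. animated
print("mean energy:", rms.mean())                # soft vs. loud delivery
print("voiced ratio:", np.mean(voiced_flag))     # speech vs. silence/pauses
```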

Some RTC vendors have leveraged paralinguistics for improved speech analytics and sentiment analysis. Instead of working only on the text and providing transcription, they offer paralinguistic analysis: looking beyond the words and interpreting tone and sentiment from the actual speech waveforms. In an intense meeting or a sales pitch, deeply analyzing a participant could be game-changing for an enterprise: knowing whether a decision-maker became calm or frustrated after a call, whether his or her sentiment changed between the beginning and the end of the call, or even predicting his or her words or emotions.

These RTC vendors extract knowledge and meaning, and make predictions, from audio conversations. They have been training their automatic speech analysis systems to recognize the states and traits of a speaker independently of who is speaking, what is said, the spoken language, and the speaker's cultural background. Deep learning of a speaker's emotion from pace, silence, over-talk, strange gaps, changing pitch, energy, and voice dynamism has also led to significant improvement, and to the promise of the "future" of meetings.
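A hypothetical version of such a training setup might look like the following. The features, labels, and data here are purely illustrative stand-ins, not any vendor's pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative only: each row is one call segment described by paralinguistic
# features (e.g., mean pitch, pitch variation, energy, pause ratio, speech
# rate), labeled by human annotators. Random data stands in for a real corpus,
# so accuracy here is chance-level; real systems use deep models and far more data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # stand-in feature vectors
y = rng.choice(["calm", "frustrated", "engaged"], size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))
segment = rng.normal(size=(1, 5))                # one new call segment
print("predicted state:", clf.predict(segment)[0])
```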

(3) Sales and call center optimization

As we discussed above, ML-powered speech analytics is no longer just about text; it is about the answer people actually care about: meaning. NLU (natural-language understanding) now provides solutions that get directly to meaning. For example, it can provide speech-to-angry, not just speech-to-text. In a call center, when AI can predict anger, the business can make a difference: it can take angry customers and turn them into happy customers. This is what people care about, the power of meaning. It tells us whether users are inclined to use our product and whether they are happy with our product or service.
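Here is a rough sketch of the text half of "speech-to-angry": run each conversational turn through an off-the-shelf sentiment classifier and flag hot moments. This uses the Hugging Face transformers pipeline with its default English sentiment model; a production system would use a purpose-built emotion model and combine it with acoustic cues:

```python
from transformers import pipeline

# Sketch: flag "hot" moments in a call transcript with an off-the-shelf
# sentiment classifier (downloads a default English model on first use).
classifier = pipeline("sentiment-analysis")

turns = [
    "Thanks for taking my call.",
    "I have been waiting three weeks and nobody has called me back!",
    "Okay, that solution works for me.",
]

for turn in turns:
    result = classifier(turn)[0]                 # {"label": ..., "score": ...}
    if result["label"] == "NEGATIVE" and result["score"] > 0.9:
        print("escalation risk:", turn)
```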

In a real-time sales presentation, such AI-powered prediction can now provide real-time sales coaching based on what participants say during the meeting. It can coach a salesperson to say the right thing, follow the optimal script, improve sales interactions, and better handle potential leads.

The same kind of optimization shows up in call centers as real-time agent coaching. In traditional call centers, the lack of insight means mishandled calls (sent to voicemail, transferred to the wrong department, or even dropped) can cause a substantial loss of sales and revenue. With AI able to detect complex events and automatically evaluate every call to understand the true voice of the customer, agents can recognize in real time that someone is trying to schedule an appointment at a dealership, or that a prospect is interested in purchasing a vehicle.
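A toy version of that event detection, again illustrative rather than how any vendor actually does it, could be a small pattern matcher over the live transcript:

```python
import re

# Toy matcher for "complex events" in a live call transcript. Production
# systems use trained intent models; the patterns here are purely illustrative.
EVENTS = {
    "schedule_appointment": r"\b(schedule|book|set up)\b.*\b(appointment|visit)\b",
    "purchase_intent": r"\b(interested in|looking to|want to)\b.*\b(buy|purchase|vehicle)\b",
}

def detect_events(utterance: str) -> list[str]:
    return [name for name, pattern in EVENTS.items()
            if re.search(pattern, utterance, re.IGNORECASE)]

print(detect_events("Hi, I'd like to schedule an appointment for a test drive."))
# ['schedule_appointment']
print(detect_events("I'm interested in purchasing a vehicle next month."))
# ['purchase_intent']
```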

Voice-enabled virtual assistants

Illustration created by Yufei McLaughlin

Voice-enabled virtual assistants have become the biggest thing in the RTC domain. The major cloud vendors have commoditized voicebot technology, which has surpassed traditional conversational Interactive Voice Response (IVR). In RTC, these new cognitive IVR voicebots serve as virtual assistants not only in meetings but also in call centers.

(1) Meeting voicebots

AI-driven virtual assistants are now starting to make their way into the workplace. In RTC, voicebots play different roles before, during, and after meetings.

Before the meeting, bots can automatically schedule meetings based on the invitees' calendar availability, identify a participant list based on expertise or role, and distribute the right documents and resources in advance.

During the meeting, bots can automatically identify participants and present additional relevant information, such as articles, documents, images, or videos from the web. Any participant can also simply voice a command to turn on live transcription or toggle the microphone or camera, keeping his or her focus on the meeting at hand (a toy sketch of this kind of command handling appears below).

After the meeting, as discussed in the "Meeting productivity" section earlier, bots can not only assign tasks and summarize the important topics but also follow up on assigned tasks and deadlines.

Bots with a conversational interface have enhanced unified communications before and after meetings, and we will see them play more and more roles during meetings.
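As a sketch of the in-meeting piece mentioned above, a voicebot ultimately maps a recognized utterance to a meeting-control action. The Meeting class and command phrases below are hypothetical stand-ins for a real conferencing API:

```python
# Sketch of mapping recognized speech to meeting controls. The Meeting class
# and its methods are hypothetical stand-ins for a real conferencing API.
class Meeting:
    def start_transcription(self): print("live transcription on")
    def mute(self):                print("microphone muted")
    def camera_on(self):           print("camera on")

COMMANDS = {
    "turn on transcription": Meeting.start_transcription,
    "mute my microphone":    Meeting.mute,
    "turn on my camera":     Meeting.camera_on,
}

def handle_utterance(meeting: Meeting, utterance: str) -> None:
    for phrase, action in COMMANDS.items():
        if phrase in utterance.lower():
            action(meeting)                      # dispatch the matched command
            return
    print("sorry, I didn't catch that")

handle_utterance(Meeting(), "Hey assistant, turn on transcription please")
# live transcription on
```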

(2) Call center voicebots

Call center voicebots are automated customer representatives that interact with humans using natural language in real time. Live agents are traditionally very expensive, and voicebots are replacing them as call centers modernize.

Voicebot technology originally comes from text-based chatbots in browsers. We are now seeing more and more call center voicebots, though the technology is not fully mature yet. The hardest part is integrating conversational AI with established telephony environments: connecting a bot to a VoIP network.

However, with advances in speech recognition and speech synthesis, new ML approaches have continued to improve voicebot technology: better recognition accuracy and computer speech that sounds more like a human. In addition, a couple of vendors, such as Amazon and Google, have brought bot and speech technologies together.
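As an example of the speech-synthesis half, here is a minimal text-to-speech sketch using Amazon Polly's synthesize_speech call via boto3. It assumes AWS credentials are already configured; the voice and prompt text are placeholders:

```python
import boto3

# Minimal text-to-speech sketch with Amazon Polly (assumes AWS credentials
# are configured in the environment). Voice and prompt text are placeholders.
polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Thanks for calling. How can I help you today?",
    OutputFormat="mp3",
    VoiceId="Joanna",
)

with open("prompt.mp3", "wb") as f:
    f.write(response["AudioStream"].read())      # save the synthesized prompt
```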

As speech, tone, and sentiment analysis improve, voicebots will keep becoming more mature and efficient.

Conclusion

Machine learning has empowered RTC vendors to create new value for all types of businesses. It has improved workers' productivity, reduced frustration throughout the meeting lifecycle, and created better business outcomes.

There is still a lot of room for RTC vendors to explore a bigger picture of the "future" of meetings, but for us as end users, the "future" is already happening. ML has been providing more insights and handling more of the mundane but important tasks for us, saving our time and taking less of our energy and effort to get things done. With further development, and a bit of imagination, the possibilities are endless.

What would be the endgame? Perhaps using AI to synthesize and provide meaning out of the video streams, and to get the most out of what needs to be seen — behind the scene, the face, and the speech.

Ready for the future? One, two, three, smile!
