Build Real-Time Phone Conversation Analytics

Phong Vu
RingCentral Developers
10 min read · Mar 25, 2020

What is Real-Time Conversation Analytics?

In contrast to post-call conversation analytics, which provides insights after the fact, real-time conversation analytics surfaces those insights while the call is still in progress.

In this blog, I will walk through the essential steps to build a Web app that can analyze call conversations in real-time to assist an agent. Once we’re finished, we’ll have an app which will:

  1. Transcribe phone conversations in real-time and display the text on the screen.
  2. Translate the transcripts from English to Spanish and display the translated text on the screen.
  3. Allow an agent to record the customer and agent tracks separately and save them to audio files.
  4. Display comprehensive analytics such as sentiment and emotion scores on the screen.

In order to build the demo app, you will need to have a RingCentral developer account and an IBM Watson developer account.

I assume that you already know how to set up a RingCentral sandbox account, such as adding user extensions and assigning a phone number to an extension so that you can make a direct phone call to that user extension. I also assume that you are familiar with getting IBM Watson IAM API keys for accessing Watson AI services.

We need to create a RingCentral application and get the app’s client id and client secret, which will be used in our demo app. We will keep the demo app implementation as simple as possible by choosing password flow authentication and selecting the following permissions for the demo app:

Call Control, ReadAccounts, WebhookSubscriptions

The associated demo application is built using the Express Web application framework, React JS and Node JS. Thus, for convenience, we will use the RingCentral JS SDK and the IBM Watson Node JS SDK to access their services.
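Below is a minimal sketch of the RingCentral SDK setup with password flow login. The environment variable names are just placeholders for this walkthrough, not taken from the project.

```javascript
const { SDK } = require('@ringcentral/sdk');

const rcsdk = new SDK({
  server: process.env.RC_SERVER_URL,        // e.g. the sandbox server URL
  clientId: process.env.RC_CLIENT_ID,
  clientSecret: process.env.RC_CLIENT_SECRET
});
const platform = rcsdk.platform();

async function login() {
  // Log in with the supervisor extension's credentials (extension 119 in this demo)
  await platform.login({
    username: process.env.RC_USERNAME,
    extension: process.env.RC_EXTENSION,
    password: process.env.RC_PASSWORD
  });
}
```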

Note: The code snippets shown in this article are shorter and just for illustration of essential parts. They may not work directly with copy/paste. I recommend you download the entire project from here.

Real-Time Phone Conversations Analytics Development

Call monitoring setup

In order to supervise a call, we need to set up a call monitoring group in our sandbox account. We will log in to our RingCentral sandbox account and browse to the “Call Monitoring” view, then click the “+ New Call Monitoring” button to create a new call monitoring group. Give the group a name and follow the onscreen instructions to set it up.

Setup a Call Monitoring Group

Click the “Can Monitor” tab and select a user (or users) from the list as a supervisor.

Select a Supervisor Extension

Then click the “Can be Monitored” tab and select a user (or users) from the list as an agent, whose call conversation can be monitored.

Select a Monitored Agent

In this demo, I selected the user extension 119 as a supervisor and the user extension 120 as the monitored agent. This means that I will use the username and password of the user extension 119 to log in to the demo app, and I will be able to supervise the user extension 120. You can choose different user extensions in your sandbox account for your own demo.

That’s all for the call monitoring group setup. Alternatively, you can create a call monitoring group and add members to the group programmatically using the Create and Update APIs.
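For reference, here is a hedged sketch of the programmatic route, using the platform instance from the SDK setup above. The endpoint paths and payload fields should be double-checked against the Call Monitoring Groups API reference, and the extension ids are placeholders.

```javascript
// Sketch: create a call monitoring group and assign a supervisor and an agent.
async function createMonitoringGroup() {
  // Create the group
  const resp = await platform.post('/restapi/v1.0/account/~/call-monitoring-groups', {
    name: 'Demo Monitoring Group'
  });
  const group = await resp.json();

  // Assign members: 'Monitoring' for the supervisor, 'Monitored' for the agent
  await platform.post(`/restapi/v1.0/account/~/call-monitoring-groups/${group.id}/bulk-assign`, {
    addedExtensions: [
      { id: '111111111', permissions: ['Monitoring'] },  // placeholder supervisor extension id
      { id: '222222222', permissions: ['Monitored'] }    // placeholder agent extension id
    ]
  });
}
```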

Detect active calls and call information

To supervise a call in real-time, and ideally from the beginning of a conversation, we need to detect an active call as soon as an incoming call to the monitored agent is connected. To do this, we implement webhook notifications and register for telephony session notifications for the monitored agent.

The code snippet below shows how to read the extension ids of the monitored agents who belong to a supervisor identified by name (“Paco Supervisor”), and keep the ids in a list named agentsList.

We also detect and keep the supervisor’s extension id just for the purpose of having a unique id when saving necessary information to a database.
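A hedged sketch of that idea: look up the supervisor’s extension id by name, then walk the call monitoring groups and collect the ids of members with the ‘Monitored’ permission. The exact shape of the member records should be verified against the API reference.

```javascript
let supervisorId = '';
const agentsList = [];

async function readAgentsList() {
  // Find the supervisor's extension id by name
  let resp = await platform.get('/restapi/v1.0/account/~/extension', { perPage: 1000 });
  let json = await resp.json();
  const supervisor = json.records.find(ext => ext.name === 'Paco Supervisor');
  supervisorId = supervisor.id.toString();

  // Walk the call monitoring groups and collect the monitored agents
  resp = await platform.get('/restapi/v1.0/account/~/call-monitoring-groups');
  json = await resp.json();
  for (const group of json.records) {
    const membersResp = await platform.get(
      `/restapi/v1.0/account/~/call-monitoring-groups/${group.id}/members`);
    const members = (await membersResp.json()).records;
    // Only consider groups this supervisor can monitor
    if (!members.some(m => m.id === supervisorId && m.permissions.includes('Monitoring'))) {
      continue;
    }
    for (const m of members) {
      if (m.permissions.includes('Monitored')) {
        agentsList.push(m.id);
      }
    }
  }
}
```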

Then we subscribe for telephony session notifications for the monitored agent as follows:
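Here is a sketch of the subscription request, assuming the app exposes a public webhook address (for example via ngrok); the /webhookcallback route name is illustrative.

```javascript
async function subscribeForTelephonyNotifications() {
  // One event filter per monitored agent extension
  const eventFilters = agentsList.map(
    id => `/restapi/v1.0/account/~/extension/${id}/telephony/sessions`);

  const resp = await platform.post('/restapi/v1.0/subscription', {
    eventFilters,
    deliveryMode: {
      transportType: 'WebHook',
      address: process.env.WEBHOOK_ADDRESS + '/webhookcallback'
    },
    expiresIn: 3600 * 24 * 7   // renew or re-create the subscription before it expires
  });
  console.log('Subscription id:', (await resp.json()).id);
}
```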

After subscribing for notifications successfully, our app should receive notifications when the telephony status of the monitored agent is changed. The notifications will come via the webhook callback URL as shown below:
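A sketch of the Express endpoint behind that callback URL, assuming JSON body parsing is enabled. RingCentral first sends a Validation-Token header that must be echoed back; later requests carry the telephony session event in the JSON body.

```javascript
app.post('/webhookcallback', (req, res) => {
  // Subscription handshake: echo the validation token back
  if (req.headers['validation-token']) {
    res.setHeader('Validation-Token', req.headers['validation-token']);
    res.statusCode = 200;
    res.end();
    return;
  }
  // Telephony session event
  if (req.body && req.body.body) {
    handleTelephonyEvent(req.body.body);   // sketched in the next snippet
  }
  res.statusCode = 200;
  res.end();
});
```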

When an incoming call is answered, we iterate through the agentsList to identify which agent has accepted the call. Once we find the agent in the list, we read the call information as shown in the code below:
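The following is a hedged sketch of that logic, following the telephony session event payload; handleTelephonyEvent() and the way the customer and agent roles are told apart are illustrative.

```javascript
async function handleTelephonyEvent(event) {
  for (const party of event.parties) {
    // Only react when the answering party is one of our monitored agents
    if (party.status.code !== 'Answered') continue;
    if (!agentsList.includes(party.extensionId)) continue;

    const telephonySessionId = event.telephonySessionId;

    // Read the call session to get the partyId of every participant
    const resp = await platform.get(
      `/restapi/v1.0/account/~/telephony/sessions/${telephonySessionId}`);
    const session = await resp.json();

    for (const p of session.parties) {
      // One supervision request per party => separate customer and agent audio streams
      const speakerId = p.extensionId === party.extensionId ? 'agent' : 'customer';
      await submitSuperviseRequest(telephonySessionId, p.id, party.extensionId, speakerId);
    }
  }
}
```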

As you can see, we parse the response to get the telephonySessionId and the partyId of each party participating in the call. It’s worth mentioning that the telephony session id identifies “this” call, and the party id identifies which party (customer or agent) is participating in the call. After getting each partyId, we call the submitSuperviseRequest() function, where we send a request for a supervision session. The reason we send the supervision request twice, with different partyIds, is that we want to get separate audio streams, one from the customer and another from the agent.
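A hedged sketch of submitSuperviseRequest() follows. The exact request body fields (mode, supervisorDeviceId, agentExtensionId) should be verified against the Supervise Call Party API reference, and getSupervisorDeviceId() and rememberSupervisedParty() are hypothetical helpers (the softphone setup that produces the device id is covered a bit further below).

```javascript
async function submitSuperviseRequest(telephonySessionId, partyId, agentExtensionId, speakerId) {
  const deviceId = await getSupervisorDeviceId();   // hypothetical helper, see the softphone setup below

  await platform.post(
    `/restapi/v1.0/account/~/telephony/sessions/${telephonySessionId}/parties/${partyId}/supervise`,
    {
      mode: 'Listen',                  // listen-only supervision
      supervisorDeviceId: deviceId,    // the softphone device that will receive the audio stream
      agentExtensionId: agentExtensionId
    });

  // Remember which partyId this request was for, so the incoming SIP INVITE can be
  // matched to the right channel (see the channel data model below).
  rememberSupervisedParty(partyId, speakerId);
}
```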

Note: If you want to get a mixed audio stream, you can call the supervise API without the party/[partyId] in the URL path.

By sending the supervise request above, we are asking the RingCentral server to make a call to a phone device, identified by the deviceId, and stream the audio of the call to a SIP device so that we can listen to the audio. But where can we get the device id? Let’s look for it.

If a supervisor is a human, who is supposed to listen to the call on a phone device, then the deviceId can be one of the device ids retrieved from the supervisor’s extension device list using the extension device list API. However, in our use case, we want the audio streams to be streamed directly to our app so that we can analyze the conversation using AI solutions. That is why we need to implement a “softphone” which will receive the audio streams.

It is fairly complicated to implement a softphone from scratch using SIP over WebSocket. Thankfully, our engineer Tyler Liu created the RingCentral Softphone SDK, which makes it very easy to implement a softphone engine for our app.
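A hedged sketch of the softphone setup is shown below. The way the device id is exposed after registration (softphone.device.id here) is illustrative, so check the Softphone SDK’s README for the exact property, and saveDeviceId() is a hypothetical helper.

```javascript
const Softphone = require('ringcentral-softphone').default;

let softphone;

async function startSoftphoneEngine() {
  softphone = new Softphone(rcsdk);   // rcsdk is the RingCentral SDK instance from earlier
  await softphone.register();         // provisions and registers a SIP device

  // The SIP provisioning response includes the device; its id is what the supervise API needs
  const deviceId = softphone.device.id;   // illustrative property name
  await saveDeviceId(deviceId);           // hypothetical helper persisting it to the database
}
```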

As you can see, with just a few lines of code, we have set up our softphone and got the deviceId we were looking for, as discussed earlier. Remember that the rcsdk parameter is an instance of the RingCentral SDK we used earlier. We can save the deviceId in a database so that we can retrieve it when we call the supervise API.

Note: We must start the softphone engine before we subscribe for the telephony session notifications to make sure that when there is an incoming call, we already have the deviceId.

Let’s see how we use the softphone object to accept a SIP invite and to answer a SIP call.

Every time we submit a supervise request, we will get a SIP invite message. As discussed earlier, we submitted the supervise request twice, once with the customer’s call partyId and once with the agent’s call partyId. Thus, we will receive two SIP invite messages. How do we detect which invite is for the customer’s audio channel and which invite is for the agent’s audio channel, so that we can create resources to handle each audio stream separately?

Let’s first define the data model to keep necessary information about a channel.
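A minimal sketch of such a data model, reflecting the fields described in the next paragraphs; the names are illustrative.

```javascript
class Channel {
  constructor(partyId, speakerId) {
    this.partyId = partyId;       // the call party (customer or agent) this channel belongs to
    this.speakerId = speakerId;   // e.g. 'customer' or 'agent', used when merging transcripts
    this.callId = '';             // SIP Call-ID, filled in when the INVITE arrives
    this.watsonEngine = null;     // Watson engine instance, created when the INVITE is answered
  }
}

const channels = [];

// Called right after each supervise request is submitted
function rememberSupervisedParty(partyId, speakerId) {
  channels.push(new Channel(partyId, speakerId));
}
```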

After we submit a supervise request, we save the partyId of that request and other metadata in a channel object, and we add the channel object to the channels list.

When we receive a SIP invite message, we parse the SIP message’s header and extract the party id. Then we compare the party id with the party ids we saved in the channels list to identify a channel and create necessary resources for that channel.

Each SIP call is identified by a call id which is included in the SIP headers. We need to extract the call id and save it into the channel data object. This is needed for identifying a channel to reset the resources for that channel when the call hangs up. We also create the Watson engine object (we will look into the Watson engine implementation shortly), then answer the SIP call.
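A hedged sketch of the INVITE handler follows. It assumes the INVITE headers expose the call party id (for example via a p-rc-api-ids header) and the standard Call-ID, and that the Softphone SDK delivers a parsed SIP message and offers an answer() method; verify the header names and the SDK API against a captured INVITE and the SDK docs.

```javascript
softphone.on('INVITE', async (sipMessage) => {
  // e.g. "p-rc-api-ids: party-id=p-xxx;session-id=s-xxx" (header name assumed)
  const apiIds = sipMessage.headers['p-rc-api-ids'] || '';
  const partyIdMatch = apiIds.match(/party-id=([^;]+)/);
  const partyId = partyIdMatch ? partyIdMatch[1] : '';

  // Match the INVITE to the channel we created when submitting the supervise request
  const channel = channels.find(c => c.partyId === partyId);
  if (!channel) return;

  // Keep the SIP Call-ID so the channel can be reset when this call hangs up
  channel.callId = sipMessage.headers['Call-Id'];

  // One Watson engine per audio stream (customer / agent)
  channel.watsonEngine = new WatsonEngine(channel.speakerId);

  await softphone.answer(sipMessage);
});
```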

After answering the SIP call, we will receive the audio buffer via the softphone callback ‘track’. Within the callback function, we create an audio sink and start reading the audio data as shown below:
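Here is a hedged sketch of that callback, using the nonstandard RTCAudioSink from the wrtc package to pull raw PCM from the WebRTC track. The packet-skipping and buffering follow the tricks described in the next paragraph; the thresholds and the currentChannel reference (whatever bookkeeping ties the track back to its channel) are illustrative.

```javascript
const { nonstandard } = require('wrtc');

softphone.on('track', (trackEvent) => {
  // Depending on the SDK version, the callback may receive the RTCTrackEvent or the track itself
  const track = trackEvent.track || trackEvent;
  const audioSink = new nonstandard.RTCAudioSink(track);

  let packetCount = 0;
  let audioBuffer = Buffer.alloc(0);
  let watsonSocketCreated = false;

  audioSink.ondata = (data) => {
    packetCount++;
    // The first packets may report a wrong sample rate, so discard a few of them
    if (packetCount < 5) return;

    if (!watsonSocketCreated) {
      // Create the Watson socket once we know the real sample rate
      currentChannel.watsonEngine.createWatsonSocket(data.sampleRate);
      watsonSocketCreated = true;
    }

    // data.samples is an Int16Array of PCM samples; accumulate ~32k before sending
    audioBuffer = Buffer.concat([audioBuffer, Buffer.from(data.samples.buffer)]);
    if (audioBuffer.length >= 32000) {
      currentChannel.watsonEngine.transcribe(audioBuffer);
      audioBuffer = Buffer.alloc(0);
    }
  };
});
```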

There are a few tricks in the implementation above. Because we want to use IBM Watson’s real-time transcription service, we need to create a Watson socket and set the sample rate of the audio we receive from the track. For some reason, the first audio data packets we receive do not have the correct audio sample rate. That’s why we discard a few audio packets before we pass the sample rate data.sampleRate to the Watson engine. The size of each audio data packet is normally too small (10ms to 40ms of audio), and it is not efficient to feed Watson’s real-time transcription with such small packets. That’s why we create a data buffer and concatenate those small packets into a bigger one (32k to 64k) before we feed it to the Watson socket for transcription.

In this demo, we want to transcribe the conversations and translate the text from English to Spanish. We also use Natural Language Processing (NLP) technology to analyze the sentiment and emotion of the conversations. Let’s have a look at the Watson engine implementation.
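Below is a hedged sketch of the Watson engine’s socket setup: exchange the Speech to Text API key for an IAM access token and build the recognize WebSocket URI. WATSON_STT_WS_URL stands for the service URL from your Speech to Text credentials with https:// replaced by wss://; the class and method names are illustrative.

```javascript
const { IamTokenManager } = require('ibm-watson/auth');

class WatsonEngine {
  constructor(speakerId) {
    this.speakerId = speakerId;
    this.transcript = '';
    this.ws = null;
  }

  async createWatsonSocket(sampleRate) {
    // Exchange the Speech to Text API key for an IAM access token
    const tokenManager = new IamTokenManager({ apikey: process.env.WATSON_STT_APIKEY });
    const accessToken = await tokenManager.getToken();

    // Pick a model that matches your audio bandwidth and language
    const wsUri = `${process.env.WATSON_STT_WS_URL}/v1/recognize` +
      `?access_token=${accessToken}&model=en-US_BroadbandModel`;

    this.openWatsonSocket(wsUri, sampleRate);   // sketched in the next snippet
  }
}
```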

To use Watson real-time speech recognition, we need a Speech-to-Text API key, and we use the key to get an access token. Once we have the access token, we can create a WebSocket URI as shown in the code above. If you want to transcribe languages other than English, you can check the other language models supported by IBM Watson and replace the language model with your selected one.

Then we create a WebSocket object with the Watson WebSocket URI created earlier. We define a configs data object, use the sample rate of our audio data to set the ‘content-type’, and set other features accordingly. Finally, we send the configs after the WebSocket is opened successfully.
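A hedged sketch of that part of the Watson engine (a continuation of the class above), using the ws package; the feature flags mirror what is described here.

```javascript
const WebSocket = require('ws');

class WatsonEngine {
  // ...constructor and createWatsonSocket() as sketched above...

  openWatsonSocket(wsUri, sampleRate) {
    this.ws = new WebSocket(wsUri);

    const configs = {
      'action': 'start',
      'content-type': `audio/l16;rate=${sampleRate}`,  // raw 16-bit linear PCM at the track's rate
      'interim_results': true,                         // receive partial transcripts while speaking
      'inactivity_timeout': -1                         // keep the session open through silence
    };

    // Send the start message once the socket is open, then listen for transcripts
    this.ws.on('open', () => this.ws.send(JSON.stringify(configs)));
    this.ws.on('message', (message) => this.onWatsonMessage(message));  // sketched below
  }
}
```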

When the audio buffer is ready, we call the transcribe() function with the audio buffer, which sends it to the Watson service. The transcript text is returned to the callback function shown below, where we parse the response to get the transcript.
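A hedged sketch of transcribe() and the message callback, continuing the WatsonEngine class; analyzeTranscript() stands in for the translation and NLP steps covered next, and mergingChannels() is the dialogue-merging function mentioned below.

```javascript
class WatsonEngine {
  // ...continued from the previous sketches...

  // Called from the audio sink with an accumulated PCM buffer
  transcribe(audioBuffer) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioBuffer);   // binary frames carry the audio itself
    }
  }

  // Watson replies with JSON messages containing interim and final results
  onWatsonMessage(message) {
    const response = JSON.parse(message);
    if (!response.results || response.results.length === 0) return;

    const result = response.results[0];
    this.transcript = result.alternatives[0].transcript;

    if (result.final) {
      // Final text: run translation and NLP analysis (sketched further below)
      this.analyzeTranscript(this.transcript);
    }
    // Merge interim and final pieces from both speakers into one dialogue
    mergingChannels(this.speakerId, this.transcript);
  }
}
```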

Because we set the ‘interim_results’ feature to true in the configs, we will receive interim transcripts. It’s worth noting that an interim transcript arrives sooner but may not be accurate, and the text may change in the final result. That’s why we check whether the transcript status is final: if it is, we analyze the transcript; otherwise, we just merge the text and display it. To create a dialogue that mixes interim and final transcripts from both the customer’s and the agent’s speech, we need to implement an algorithm to join the transcripts properly. I will let you explore the detailed implementation of the mergingChannels(thisClass.speakerId, thisClass.transcript) function in the index.js file by yourself. Let’s move on to discuss using Watson NLP to analyze the transcript.

To use Watson Language Translator and Natural Language Understanding services, we need to get the API key for each service. Then we use the IBM Watson JS SDK with those API keys to access the services as shown below:
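A sketch of instantiating both services with the ibm-watson Node JS SDK; the environment variable names are illustrative and the version dates are examples of valid ones.

```javascript
const LanguageTranslatorV3 = require('ibm-watson/language-translator/v3');
const NaturalLanguageUnderstandingV1 = require('ibm-watson/natural-language-understanding/v1');
const { IamAuthenticator } = require('ibm-watson/auth');

const languageTranslator = new LanguageTranslatorV3({
  version: '2018-05-01',
  authenticator: new IamAuthenticator({ apikey: process.env.WATSON_TRANSLATOR_APIKEY }),
  serviceUrl: process.env.WATSON_TRANSLATOR_URL
});

const nlu = new NaturalLanguageUnderstandingV1({
  version: '2019-07-12',
  authenticator: new IamAuthenticator({ apikey: process.env.WATSON_NLU_APIKEY }),
  serviceUrl: process.env.WATSON_NLU_URL
});
```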

To translate a transcript, we specify the translate API parameters with the transcript and the language model. In this demo, we want to translate from English to Spanish, but you can choose other language models if you want to.
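For example, a minimal translate call might look like this; the modelId 'en-es' selects the English-to-Spanish model.

```javascript
async function translateTranscript(transcript) {
  const response = await languageTranslator.translate({
    text: [transcript],   // the API accepts an array of strings
    modelId: 'en-es'      // English to Spanish; swap for another supported model
  });
  // One translation is returned per input string
  return response.result.translations[0].translation;
}
```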

To analyze the sentiment and emotion of a transcript, we specify the analyze API parameters with the transcript and the features set. In this demo, we let Watson detect keywords and analyze sentiment and emotion based on those keywords.
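A minimal sketch of such an analyze call, asking Watson to detect keywords and score sentiment and emotion for each of them.

```javascript
async function analyzeSentimentAndEmotion(transcript) {
  const response = await nlu.analyze({
    text: transcript,
    features: {
      keywords: {
        sentiment: true,   // sentiment score per detected keyword
        emotion: true,     // emotion scores (joy, anger, sadness, fear, disgust)
        limit: 5
      }
    }
  });
  return response.result.keywords;
}
```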

Remember that the data analytics we use in this project is just for demonstration; it may not solve a real-world problem. But once we have the transcript, we can apply any AI solution to analyze it and assist agents, for example with compliance monitoring or guidance that helps agents answer questions accurately and concisely.

To push the transcript and other metadata from the server app to the client app, we fire a transcriptUpdate event with the data object.
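Here is a hedged sketch, assuming Socket.IO is used for the server-to-client channel; the transcriptUpdate event name comes from the project, but the payload fields are illustrative.

```javascript
function pushTranscriptUpdate(io, channel, data) {
  io.emit('transcriptUpdate', {
    speakerId: channel.speakerId,    // 'customer' or 'agent'
    transcript: data.transcript,     // merged dialogue text
    translation: data.translation,   // Spanish translation
    keywords: data.keywords          // keyword sentiment and emotion scores
  });
}
```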

We use React JS to implement the app UI. You can explore the client side code to see how the transcripts and analytics scores are rendered if you want to. The source code is stored under the client subfolder.

Those are all the essential steps to build the demo app from my Github repo. To learn more details, I recommend cloning the demo project, having a closer look at the source code, and trying to run it on your local machine with your own app settings.

Run the demo on a local machine

Follow the instructions in the README file to setup the project’s environment and run the demo.

Congratulations! Now you should be able to build and further develop this project with more features if you want to. For example, you could improve call quality by monitoring calls for talk speed, talk time, and other variables indicative of tone and mood.

Finally, I hope you enjoyed reading this and will find the information useful.

To learn even more about other features we have, make sure to visit our developer site, and if you’re ever stuck, make sure to go to our developer forum.

Want to stay up to date and in the know about new APIs and features? Join our Game Changer Program and earn great rewards for building your skills and learning more about RingCentral!
