Introducing the RingCentral Artificial Intelligence API

Published in RingCentral Developers, Jan 26, 2024

RingCentral’s Audio and Video AI API makes it quick and easy to transcribe, diarize, detect emotions, summarize, and even create action items from most audio and video formats. This means that whether or not your audio or video file originated from RingCentral, you can gain useful insights while increasing overall business efficiency.

For example, imagine you have a large number of audio files to process, with more produced every day. This can occur in a call center where “calls are recorded for training purposes” or in a university or college where lectures are broadcast and the audio is recorded. Another example could be an international conference, business negotiations, or summit meetings with multiple contributors. Having a human transcribe this content would be slow and expensive. The API can take the audio file and transcribe it into text while also determining who is talking and attributing each passage to its speaker. Knowing who said what, in a timely fashion, is invaluable.

But converting speech to text is just the first step. AI can also summarize a conversation for you. So instead of spending the full hour listening to the conversation again or trying to read an hour’s worth of dialog, how about just reading a brief two- to three-sentence paragraph? Or, if you need more detail, a summary you can read in 5 minutes? The AI API can deliver both the brief and the summary from any audio you provide, and it can produce an even better brief and summary from any RingCentral media file you create.

Now let’s take the brief one step further. Most CRM (Customer Relationship Management) integrations ask the sales agent to summarize the call details as a “disposition” of the call. What was discussed during the call? Is there potential for a sale with the customer? What questions did the customer have, and what action items are outstanding? The AI API can identify these details, and you can build an integration with your favorite CRM so that the AI API does the summarization work automatically and your sales team can focus on selling.

But what if context matters? A financial planner has many conversations with clients every day, and clients expect the planner to know their financial plans each time they talk. Most conversations cover savings, investing, stock purchases, and fund management. But how does the planner keep track of what was said and recommended to whom? The advice given to a 30-something up-and-coming executive would be totally different from the advice given to a 57-year-old who is planning to retire in a few years.

You can use the AI API to identify key words that were used during recorded conversations. When the AI API transcribes a conversation, it can not only identify key words but also link each key word to the point in the transcript where it was spoken, which helps establish context. For example, did you recommend buying Tesla stock to one client and advise another client to sell it? By jumping straight to that point in the conversation, the planner can see the context and determine that perhaps the conversation was just about considering investing in EVs and not specifically about buying or selling at that moment. This can help prevent misunderstandings and incorrect actions.
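As a rough illustration of linking a key word to its place in a transcript, the PHP sketch below searches a list of timed utterances for a key word. The utterance structure (speaker, start, text) and the function name findKeywordMentions are assumptions made for this example, not the AI API’s actual response format.

```php
<?php
// Hypothetical transcript shape: one entry per utterance, with a speaker label,
// a start time in seconds, and the spoken text. The real API response may differ.
$utterances = [
    ['speaker' => 'Advisor', 'start' => 12.4,  'text' => 'Let us review your savings plan first.'],
    ['speaker' => 'Client',  'start' => 95.0,  'text' => 'I have been reading about Tesla and other EV makers.'],
    ['speaker' => 'Advisor', 'start' => 101.7, 'text' => 'We can consider EVs, but I would not buy Tesla stock today.'],
];

// Returns every utterance whose text contains the key word (case-insensitive).
function findKeywordMentions(array $utterances, string $keyword): array
{
    $mentions = [];
    foreach ($utterances as $utterance) {
        if (stripos($utterance['text'], $keyword) !== false) {
            $mentions[] = $utterance;
        }
    }
    return $mentions;
}

// Print each mention with its timestamp so a reviewer can jump to that point.
foreach (findKeywordMentions($utterances, 'tesla') as $mention) {
    printf("%6.1fs  %s: %s\n", $mention['start'], $mention['speaker'], $mention['text']);
}
```

Keeping the start time alongside each match is what lets a reviewer jump to the surrounding discussion instead of rereading the whole transcript.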

RingCentral’s AI API can convert speech to text from source video or audio files. AI, when used responsibly (see the movie Terminator 2 for an example of misuse!), can be invaluable for saving time and effort on repetitive or mundane tasks. With AI, sales teams can quickly review longer calls or meetings to ensure they have a good understanding of their prospect’s needs, and even auto-generate the questions and action items they need to follow up on. Meetings, events, and training sessions can be transcribed or summarized so that attendees can easily recall important details. Businesses with compliance needs can capture call recordings and identify potential issues to ensure proper training and help mitigate business risk.

These use cases are merely the tip of the iceberg, enabling businesses to gain greater insights affordably and in minutes. RingCentral’s AI API also has built-in speaker detection, so it can identify who is talking and when. It also comes with emotional analysis, which can help the user better understand how team members are doing during a conversation or support call. Are stress levels rising in the conversation? Are stronger words being used repeatedly? These can all be indications that more training is needed to better manage customer interactions, or that more therapy is needed if the recording is of a mental health support group. The source information can also be summarized by the AI API to help quickly identify potential customer or product issues, and even to help identify market or sales opportunities.

If the audio comes from a customer support call, you can use the emotional analysis aspect of the API to identify the negative or positive words and phrases that are used repeatedly within the conversation. For example, a caller may repeatedly say they are having a problem with a product or cannot make something work. The analysis of that call can help find resolutions for more than just the one caller; it may even help identify a flaw in a product that could trigger a safety recall and make the product in question safer for the consumer.

In addition, the transcript of a positive or negative support call can be collected in its entirety and sent to the whole support team for mutual education. Frustrated customers can be identified more easily for faster escalation, and their calls can be used as examples in de-escalation training. Your company can even use the transcripts, summarization, and emotion analysis to identify broader issues, allowing it to be proactive in notifying customers and helping them solve a potential issue before it occurs.

Another great use of the analysis and transcription features of the RingCentral AI API is creating summaries from audio and video files. These summaries let everyone quickly recall what was discussed and which action items are needed, so that everyone is on the same page and your next meeting has a successful agenda.

Additionally, from a single audio file of a recorded conversation, the AI can create long and short text summaries and take note of the key topics of the conversation based on how often words are repeated. The emphasis and volume placed on words, and the energy and pace with which they are voiced, are also taken into account when identifying key words. Other factors are considered as well, such as the talk-to-listen ratio and the number of questions asked that contain a key word. For those with high demands on their time, the AI’s ability to produce summaries and highlights is especially welcome. Being able to quickly get a summary and understand the gist of a series of meetings can get anyone up to speed on a subject or task and well prepared for the next meeting.
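To make two of these factors concrete, here is a minimal PHP sketch that computes a talk-to-listen ratio and counts questions containing a key word. It works on a hand-written list of diarized segments. The segment structure and both function names are assumptions for illustration only; this is not how the AI API computes its own scores.

```php
<?php
// Hypothetical diarized segments: speaker label, start and end in seconds, and text.
$segments = [
    ['speaker' => 'Agent',    'start' => 0.0,  'end' => 30.0,  'text' => 'How can I help you today?'],
    ['speaker' => 'Customer', 'start' => 30.0, 'end' => 90.0,  'text' => 'My order never arrived.'],
    ['speaker' => 'Agent',    'start' => 90.0, 'end' => 120.0, 'text' => 'Can you confirm the order number?'],
];

// Time the given speaker spent talking divided by time everyone else spent talking.
function talkToListenRatio(array $segments, string $speaker): float
{
    $talk = 0.0;
    $listen = 0.0;
    foreach ($segments as $segment) {
        $duration = $segment['end'] - $segment['start'];
        if ($segment['speaker'] === $speaker) {
            $talk += $duration;
        } else {
            $listen += $duration;
        }
    }
    // Avoid division by zero when nobody else spoke.
    return $listen > 0 ? $talk / $listen : INF;
}

// Counts segments that look like a question and mention the key word.
function countQuestionsWithKeyword(array $segments, string $keyword): int
{
    $count = 0;
    foreach ($segments as $segment) {
        $isQuestion = strpos($segment['text'], '?') !== false;
        if ($isQuestion && stripos($segment['text'], $keyword) !== false) {
            $count++;
        }
    }
    return $count;
}

printf("Agent talk-to-listen ratio: %.2f\n", talkToListenRatio($segments, 'Agent'));
printf("Questions mentioning 'order': %d\n", countQuestionsWithKeyword($segments, 'order'));
```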

Figure 1 — Features of RingCentral AI API as shown in a recorded video call.

In Figure 1 you can see some of these valuable features in action while reviewing a recorded RingCentral Video meeting. [1] shows the meeting brief, [2] shows keywords that were used during the call, [3] provides a longer summary of the meeting than the brief does, and [4] shows who is talking and when, so that you can jump to a specific timestamp in the recording, if desired, to hear exactly what someone was saying at that point in the discussion. Also, just above area [1] is a toolbar where you can see the full transcript of the meeting and the transcript highlights (text), and even add some notes of your own if needed.

With all this power built into the AI API, you may be wondering how difficult it is to get started. Here is a simple code sample in PHP that shows just how easy it is to connect to the API and begin using it.

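// $platform is assumed to be an authenticated platform object from the RingCentral PHP SDK.
// WEBHOOK_URL and CONTENT_URI are constants you define for your own environment.
$platform->post('/ai/audio/v1/async/speech-to-text?webhook=' . WEBHOOK_URL, array(
    'contentUri' => CONTENT_URI,
    'encoding' => 'Wav',
    'languageCode' => 'en-US',
    'source' => 'RingCentral',
    'audioType' => 'Meeting',
    'enablePunctuation' => true,
    'enableSpeakerDiarization' => false
));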
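Because the request above is asynchronous, the result is delivered later to the webhook URL you supplied. The sketch below shows one way a PHP webhook script might pull the transcript out of the delivered JSON. The payload shape (a top-level status field and a response.transcript field) and the function name extractTranscript are assumptions for illustration, so check them against the API reference before relying on them.

```php
<?php
// Assumed payload shape (illustrative only):
// {"status": "Success", "response": {"transcript": "..."}}
function extractTranscript(string $rawBody): ?string
{
    $payload = json_decode($rawBody, true);
    if (!is_array($payload)) {
        return null; // body was not valid JSON
    }
    if (($payload['status'] ?? null) !== 'Success') {
        return null; // job failed or is in an unexpected state
    }
    $transcript = $payload['response']['transcript'] ?? null;
    return is_string($transcript) ? $transcript : null;
}

// In a real webhook script the body would come from file_get_contents('php://input').
$sample = '{"status":"Success","response":{"transcript":"Hello and welcome to the meeting."}}';
echo extractTranscript($sample), PHP_EOL;
```

Returning null for anything unexpected keeps the webhook script from acting on a failed or malformed delivery.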

The API currently supports: AAC, MP3, MP4, MPA, MOV, Mulaw, PCM, WAV, and WMV.

As with most of RingCentral’s APIs, the AI API does not have to work in isolation. It can also be used in conjunction with other APIs. For example, when the AI API creates a meeting summary and an agenda for a follow-up meeting, that material can be sent to every team member or group in the RingCentral Team Messaging app with the help of the Team Messaging API.
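A minimal sketch of that hand-off might look like the following, which formats a summary and its action items into the body of a chat post. The function name buildSummaryPost, the message layout, and the CHAT_ID constant are hypothetical. The endpoint path in the closing comment is given from memory of the Team Messaging API and should be verified against the current reference.

```php
<?php
// Builds the body for a Team Messaging post from a summary and optional action items.
function buildSummaryPost(string $summary, array $actionItems): array
{
    $lines = ['Meeting summary:', $summary];
    if (count($actionItems) > 0) {
        $lines[] = '';
        $lines[] = 'Action items:';
        foreach ($actionItems as $item) {
            $lines[] = '- ' . $item;
        }
    }
    return ['text' => implode("\n", $lines)];
}

$body = buildSummaryPost(
    'The team agreed to ship the beta next month.',
    ['Send the revised timeline', 'Book the follow-up meeting']
);
echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;

// With an authenticated $platform object from the RingCentral PHP SDK, the post could
// then be sent along these lines (CHAT_ID is a hypothetical constant you define):
// $platform->post('/team-messaging/v1/chats/' . CHAT_ID . '/posts', $body);
```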

With all of the possibilities of AI, combined with RingCentral’s suite of communication APIs, your business has the ability to drive greater efficiencies, while creating great employee and customer experiences, regardless of your industry.

Visit our Audio and Video AI page to learn more.

Written by Pbmacintyre