Transcription & Summarization for Video Calling Apps: Elevate your Conversations

Kostiantyn Dudkin
QuickBlox Engineering
12 min read · Nov 13, 2023
[Image: video consultation app with AI-powered functionality]

In an era dominated by virtual communication, video calling apps have become indispensable tools for connecting people across the globe. While the visual aspect of these calls has significantly improved, the need for accurate and efficient communication remains a constant challenge. Enter the realm of artificial intelligence, where cutting-edge transcription and summarization functionalities are reshaping the landscape of video calling.

In this article, we will provide an overview of how AI-powered transcription and summary creation works. We will also showcase how this functionality has been implemented in the Q-Consultation Lite application. Q-Consultation Lite is open-source software made available by QuickBlox. It offers video calling, virtual private meeting rooms, in-app chat, and an array of admin features. Most recently, the open-source code has been integrated with OpenAI so that it can efficiently transcribe audio recordings and automatically generate textual descriptions of conversation content.

Transcription Using AI

AI plays a significant role in the process of audio data transcription, providing several key advantages:

Automation: AI contributes to the automation of the transcription process, greatly increasing its speed and efficiency. This is especially valuable when dealing with large volumes of audio data, such as webinar recordings, interviews, or conferences.

Scalability: AI can simultaneously process multiple audio files without sacrificing quality or efficiency. This makes the transcription process more scalable and convenient for companies and organizations.

Advanced Processing Capabilities: AI can offer additional features during transcription, such as speech recognition, identification of specific accents, and even determining the speaker’s mood or emotions.

Indexing and Search: Text transcription makes audio data more accessible for searching and indexing. AI can automatically generate keywords and metadata, simplifying the process of finding the necessary information in large audio archives.

The Value of Transcription, Summarization, & Action Points in a Business Context

The Q-Consultation video calling platform, with its private virtual meeting rooms and in-app chat, has become an invaluable tool for a variety of industries seeking secure and efficient communication. The recent addition of AI transcription and summary functionality has elevated the platform’s capabilities, providing users with enhanced communication experiences.

In addition to transcribing the content of recorded video consultations, AI functionality in Q-Consultation also provides “summaries” of the transcription, a condensed overview of key discussion points. These summaries facilitate a quick understanding of the main ideas without the need to listen to the entire audio recording or read the entire transcript.

Moreover, “action points” are also generated from the transcript. These are specific steps or actions highlighted during the discussion, such as booking a follow-up appointment, taking medication, or sending a particular document. This makes it easy to keep track of the actions and commitments made by participants in the meeting.

Let’s explore how different industries can benefit from these new features:

Healthcare professionals using Q-Consultation can conduct virtual patient consultations securely. The AI transcription functionality ensures accurate and detailed records of medical discussions, aiding in compliance with regulatory requirements. Summarization features assist in quickly reviewing patient histories and treatment plans.

Businesses using Q-Consultation for virtual meetings can benefit from AI transcription to document discussions and decisions. Summarization features assist in creating concise records, making it easier for participants to recall key points and ensuring more efficient decision-making processes.

E-Learning platforms that integrate Q-Consultation can employ AI transcription to create accurate transcripts of lectures. Summarization features can help distill key concepts, making study materials more accessible. This benefits students with diverse learning needs.

Organizations with customer service departments using Q-Consultation for virtual support can leverage AI transcription to document customer interactions. Summarization features aid in identifying common customer issues, enhancing training programs, and improving overall service quality. It also provides a useful audit trail should a customer raise complaints.

Creating Text From Audio Data

To implement AI transcription in Q-Consultation, we use OpenAI's Whisper technology and rely on a method of text generation based on audio data. In this method, a model uses audio input, such as speech, to generate related textual output.

OpenAI has developed the Whisper model, a versatile automatic speech recognition system. It has been trained on a vast and diverse dataset of audio and serves as a multitask model capable of performing multilingual speech recognition, speech translation, and language identification. The Whisper large-v2 model is currently available through the OpenAI API under the name “whisper-1.” A system of this kind can be used for automatic audio transcription, creating textual descriptions of audio files, generating video subtitles, and much more.

OpenAI’s speech transcription system relies on the analysis of audio data using the Whisper model. This innovative technology facilitates the generation of high-quality text based on audio content.

The operational process of the OpenAI speech transcription system:

  • Data Preparation: The initial step involves preparing audio data, which may include speech recordings, dictations, interviews, or other audio files that require transcription. It’s important to note that OpenAI supports a wide range of audio file formats such as flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm, providing flexibility when working with various sound sources.
  • Feature Extraction: Audio data can be transformed into spectrograms or other numerical representations that the model can process. These representations contain information about sound frequencies and intensity at different points in time.
  • Data Input to the Model: The transformed audio data is fed into the Whisper model, which has been trained on a large volume of audio recordings and their transcripts. The model analyzes the audio features and begins generating the corresponding text.
  • Evaluation and Correction: The generated text undergoes evaluation and correction to ensure its accuracy and alignment with the audio data. This may involve automatic checks for similarity with the original data and manual editing.
  • Output of Results: The final textual output can be provided in the form of a transcription, audio description, subtitles, or another desired format.

Using OpenAI GPT to Create Summaries & Action Points

In addition to transcription, Q-Consultation utilizes OpenAI’s GPT model to create concise summaries and action points from the transcribed text, significantly enhancing its comprehensibility and utility.

GPT stands for “Generative Pre-trained Transformer,” and it refers to a family of language models developed by OpenAI. The models are based on transformer architecture, a type of neural network architecture that has proven highly effective for natural language processing tasks. The “pre-trained” aspect of GPT indicates that the models are trained on a vast amount of diverse text data before being fine-tuned for specific tasks.

The GPT models are designed to generate coherent and contextually relevant human-like text. They have been widely acclaimed for their ability to understand and generate language in a way that captures context, semantics, and syntactic structures.

GPT models are especially well-equipped to deal with summarization tasks. Using context and knowledge gained from training, they are able to generate concise and informative summaries and to identify key ideas and action points.

The interaction between the Whisper model and GPT enables the efficient processing of audio data, and the generated text is transformed into readable and informative content. If necessary, multiple iterations can be performed to refine and improve the transcription and text generation results, including the summary and action points. This method combines advanced speech recognition and natural language processing technologies, contributing to more effective handling of audio materials and enhancing their functionality.

Implementing AI Transcription in Q-Consultation

Implementing AI transcription into Q-Consultation is a relatively straightforward process. In the following section, we will outline the key steps.

Step 1: Generating an API Key and Installing the Library

To begin, obtain your API key from OpenAI, which will allow you to interact with their API. Once you have obtained the key, install the corresponding library provided by OpenAI to facilitate the interaction.

To implement the AI transcription feature, we will be using Node.js version 18. (It is important to note that the functionality provided below may work on other versions of Node.js, but it has not been tested.)

First, you need to install the OpenAI Node API Library:
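A minimal setup sketch, assuming the openai npm package (version 4 or later, which exposes the openai.audio and openai.chat namespaces used below) and a Unix-like shell; Q-Consultation's own configuration files may wire the key in differently:

# Install the OpenAI Node API Library
npm install openai

# The client created in the next step reads the key from this environment variable
export OPENAI_API_KEY="your-openai-api-key"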

Next, create an instance of the OpenAI object using the API key that you generated earlier.

import OpenAI from 'openai'

export default new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})

To begin, we will need to create a transcription of the audio file with timestamps for each segment. To do this, we will create a function called createTranscriptionWithTime responsible for this process. It will accept the audio file in the following format:

interface MultipartFile {
  buffer: Buffer
  filename: string
  encoding: string
  mimetype: string
}

By default, the methods of the OpenAI class cannot work with such files, but this library includes a toFile function that allows you to convert the data to the required format:

const file = await toFile(audio.buffer, audio.filename, {
  type: audio.mimetype,
})

After this, you can create transcriptions of the audio recording using the openai.audio.transcriptions.create method from the OpenAI Node API Library.

This method accepts the following parameters to send in the request to OpenAI:

  • `file` (required parameter) — The file in one of the following formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm, up to 25 MB in size.
  • `model` (required parameter) — The ID of the model used for transcription. Currently, only the “whisper-1” model is available.
  • `prompt` (optional parameter) — Additional text to guide the model’s style or continue from a previous audio segment. The prompt should be in English.
  • `response_format` (optional parameter) — The format of the transcription output. The default is JSON but can be one of the following options: json, text, srt, verbose_json, or vtt.
  • `temperature` (optional parameter) — A parameter that affects how random or predictable the output generated by the artificial intelligence model will be. This parameter takes values from 0 to 1.

Considering that in Q-Consultation, the transcription needs to be time-stamped, the default response_format is not suitable. Instead we will use the srt format, which includes the start and end times for each segment.

const transcription = await openAIApi.audio.transcriptions.create({
  file,
  model: 'whisper-1',
  response_format: 'srt',
})

Because we specified response_format: ‘srt’, the result of the function will be a string, but the library specifies a different type, and we need to explicitly override it.

const transcriptionText = transcription as unknown as string

To extract timestamps and corresponding text from each segment of the transcription in SRT format, the following regular expression will be required:

const srtRegex = /^([\d:]+),\d+ --> ([\d:]+),\d+\s+(.*)$/gm

In the end, we parse the received text in SRT format using the regular expression and obtain an array of objects that include the start and end times of each segment, as well as its content.

return Array.from(transcriptionText.matchAll(srtRegex)).reduce<
  Array<{ start: string; end: string; text: string }>
>((res, item) => {
  const [, start, end, text] = item

  return [...res, { start, end, text }]
}, [])
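To make the parsing concrete, here is a hypothetical SRT fragment (the dialogue is invented purely for illustration) and the array that the code above produces from it:

// Hypothetical SRT fragment returned by the API (dialogue invented for illustration):
//
//   1
//   00:00:00,000 --> 00:00:04,500
//   Hello, doctor. I have had a headache since Monday.
//
//   2
//   00:00:04,500 --> 00:00:09,000
//   I see. Have you taken any medication for it?

// Running the parsing code above on that text yields:
//
//   [
//     { start: '00:00:00', end: '00:00:04', text: 'Hello, doctor. I have had a headache since Monday.' },
//     { start: '00:00:04', end: '00:00:09', text: 'I see. Have you taken any medication for it?' },
//   ]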

Listing for createTranscriptionWithTime:

import { toFile } from 'openai'

// Note: openAIApi is the OpenAI instance created and exported in the previous step.

interface MultipartFile {
  buffer: Buffer
  filename: string
  encoding: string
  mimetype: string
}

const createTranscriptionWithTime = async (audio: MultipartFile) => {
  // Convert the uploaded file into the format expected by the OpenAI library
  const file = await toFile(audio.buffer, audio.filename, {
    type: audio.mimetype,
  })

  // Request the transcription in SRT format so that every segment is time-stamped
  const transcription = await openAIApi.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    response_format: 'srt',
  })

  // With response_format: 'srt' the API returns a plain string
  const transcriptionText = transcription as unknown as string
  const srtRegex = /^([\d:]+),\d+ --> ([\d:]+),\d+\s+(.*)$/gm

  // Parse the SRT text into { start, end, text } segments
  return Array.from(transcriptionText.matchAll(srtRegex)).reduce<
    Array<{ start: string; end: string; text: string }>
  >((res, item) => {
    const [, start, end, text] = item

    return [...res, { start, end, text }]
  }, [])
}
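As a quick usage sketch (the file name and path are assumptions for illustration; in Q-Consultation the recording arrives as an upload), any audio buffer can be wrapped in the MultipartFile shape and passed to the function:

import fs from 'node:fs/promises'

// Hypothetical example: transcribe a local recording and print its timed segments.
const printTranscription = async () => {
  const buffer = await fs.readFile('./consultation.mp3') // assumed sample file

  const segments = await createTranscriptionWithTime({
    buffer,
    filename: 'consultation.mp3',
    encoding: '7bit', // placeholder value; this field is not used by the function
    mimetype: 'audio/mpeg',
  })

  for (const { start, end, text } of segments) {
    console.log(`[${start} - ${end}] ${text}`)
  }
}

printTranscription().catch(console.error)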

Creating Summaries & Action Points

For further audio processing, the function createAudioDialogAnalytics is applied, which also takes an audio file:

interface MultipartFile {
  buffer: Buffer
  filename: string
  encoding: string
  mimetype: string
}

First, we need to obtain the transcription of the audio file. To do this, we use the previously written function createTranscriptionWithTime.

const transcription = await createTranscriptionWithTime(audio)

Then, only the text is extracted from the transcription array and concatenated into a single long string.

const transcriptionText = transcription.map(({ text }) => text).join(' ')

Now we can analyze the obtained text and create a summary and action points. For this purpose, a request is sent to the GPT model using openAIApi.chat.completions.create, passing the appropriate configuration.

We will use this method with the following parameters:

model (required parameter): The ID of the model to be used. See the OpenAI model documentation for detailed information on which models are suitable for use with the chat API.

messages (required parameter): A list of messages from which OpenAI creates responses. Each message includes the following fields:

  • role (required parameter): The role of the message can be system, user, assistant, or function.
  • content (required parameter): The content of the message. The content field is mandatory for all messages and can be null for function calls in assistant messages.

temperature (optional parameter): A parameter that affects how random or predictable the output generated by the artificial intelligence model will be. This parameter takes values from 0 to 1.

The openAIApi.chat.completions.create method will need to be used twice: once for creating a summary and once for creating action points. Therefore, let’s consolidate the common parameters into a variable:

const chatCompletionConfig = {
  model: 'gpt-3.5-turbo',
  temperature: 0,
}

Next, let’s create two arrays of messages that are used to create the summary and actions:

const messagesForSummary: ChatCompletionMessageParam[] = [
  {
    role: 'user',
    content: 'Generate summary in English from this dialog',
  },
  { role: 'user', content: transcriptionText },
]

const messagesForActions: ChatCompletionMessageParam[] = [
  {
    role: 'system',
    content:
      'If you don\'t have enough information to make an action points, display the message "There is no sufficient information to generate an action points"',
  },
  {
    role: 'user',
    content: 'Generate action points in English that the consultant said to do from my dialog. Display only list without title.',
  },
  { role: 'user', content: `My dialog:\n${transcriptionText}` },
]

Now you can call the openAIApi.chat.completions.create method with the parameters defined earlier. A simple regular expression, textRegex (defined in the full listing below), is first used to check that the transcription contains any text at all; if it does not, both requests are skipped:

const [summaryRes, actionsRes] = textRegex.test(transcriptionText)
  ? await Promise.all([
      openAIApi.chat.completions.create({
        messages: messagesForSummary,
        ...chatCompletionConfig,
      }),
      openAIApi.chat.completions.create({
        messages: messagesForActions,
        ...chatCompletionConfig,
      }),
    ])
  : []

From the results of the function execution, extract the values for the summary and actions:

const summary =
  summaryRes?.choices?.[0]?.message?.content ||
  'There is no sufficient information to generate a summary'
const actions =
  actionsRes?.choices?.[0]?.message?.content ||
  'There is no sufficient information to generate an action points'

Now, with all the necessary data, return an object that includes the transcription array, summary, and the list of actions.

return {
  transcription,
  summary,
  actions,
}

Listing for createAudioDialogAnalytics:

// Message type for the chat API in openai v4 (adjust the import path to match your version of the openai package)
import type { ChatCompletionMessageParam } from 'openai/resources/chat'

// Note: openAIApi (the OpenAI client) and createTranscriptionWithTime are defined earlier in this article.

interface MultipartFile {
  buffer: Buffer
  filename: string
  encoding: string
  mimetype: string
}

export const createAudioDialogAnalytics = async (audio: MultipartFile) => {
  // Transcribe the audio into time-stamped segments
  const transcription = await createTranscriptionWithTime(audio)
  // Join the segment texts into a single string for the GPT model
  const transcriptionText = transcription.map(({ text }) => text).join(' ')

  const chatCompletionConfig = {
    model: 'gpt-3.5-turbo',
    temperature: 0,
  }

  const messagesForSummary: ChatCompletionMessageParam[] = [
    {
      role: 'user',
      content: 'Generate summary in English from this dialog',
    },
    { role: 'user', content: transcriptionText },
  ]

  const messagesForActions: ChatCompletionMessageParam[] = [
    {
      role: 'system',
      content:
        'If you don\'t have enough information to make an action points, display the message "There is no sufficient information to generate an action points"',
    },
    {
      role: 'user',
      content: 'Generate action points in English that the consultant said to do from my dialog. Display only list without title.',
    },
    { role: 'user', content: `My dialog:\n${transcriptionText}` },
  ]

  // Check that the transcription contains at least one letter or digit before calling the chat API
  const textRegex = /[\p{L}\p{N}]+/gu

  const [summaryRes, actionsRes] = textRegex.test(transcriptionText)
    ? await Promise.all([
        openAIApi.chat.completions.create({
          messages: messagesForSummary,
          ...chatCompletionConfig,
        }),
        openAIApi.chat.completions.create({
          messages: messagesForActions,
          ...chatCompletionConfig,
        }),
      ])
    : []

  const summary =
    summaryRes?.choices?.[0]?.message?.content ||
    'There is no sufficient information to generate a summary'
  const actions =
    actionsRes?.choices?.[0]?.message?.content ||
    'There is no sufficient information to generate an action points'

  return {
    transcription,
    summary,
    actions,
  }
}
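As a brief usage sketch, the analysis result can be combined into a readable report. The helper below is hypothetical (it is not part of Q-Consultation); it only assumes the MultipartFile interface and the createAudioDialogAnalytics function defined above:

// Hypothetical helper: turn the analysis of a recording into a readable text report.
// `audio` is assumed to already be in the MultipartFile shape (for example, a parsed multipart upload).
const buildCallReport = async (audio: MultipartFile): Promise<string> => {
  const { transcription, summary, actions } = await createAudioDialogAnalytics(audio)

  // Render each transcription segment with its time range
  const timedLines = transcription
    .map(({ start, end, text }) => `[${start} - ${end}] ${text}`)
    .join('\n')

  return `Summary:\n${summary}\n\nAction points:\n${actions}\n\nTranscript:\n${timedLines}`
}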

Q-Consultation already provides a full implementation of all the necessary steps. You just need to adjust a few settings and add your OpenAI API key to the appropriate configuration file. This significantly simplifies the process, saves you from having to learn and write the code from scratch, and speeds up your work with OpenAI.

The transcription functionality in Q-Consultation works as follows: you simply record the conversation during the video call. Once the call has ended, you can open the recorded video and find the automatically generated transcription in the left column.

This integrated approach significantly simplifies obtaining a ready transcription from your videos or audio recordings. You don’t need to worry about the complexities of data processing and analysis — Q-Consultation takes care of this work, allowing you to focus on more important tasks.

Conclusion

AI has significantly transformed and improved approaches to transcription and data processing. Thanks to speech recognition and natural language processing technologies, AI can quickly and accurately transcribe audio and video recordings, facilitating access to and analysis of information. This is of great importance in various fields, including medicine, law, academic research, and more.

Join the QuickBlox Developer Discord Community to share ideas and seek support for the AI-enhanced chat applications you are building.
