Need for Speed: How Transcribe Turbocharged Whisper for Real-Time Conversations

Chow Keng Ji
AI Practice GovTech
9 min read · Feb 5, 2024

Co-written with Jeffrey Lee, on behalf of the Transcribe team.

Transcribe is a speech-to-text platform available to government officers. Over the past year, we turbocharged its transcription services. This blog post dives into the production-grade engineering work that enhanced Transcribe’s ability to handle conversations in real time. We’ll share how we adapted Whisper for live transcription, modified VBx, a speaker diarisation scheme, to separate voices on the fly, and optimised the speech-to-text pipeline for seamless, near-instant transcription while taking care of the end-user experience.

Introduction

Transcribe is a speech-to-text platform available to government officers. Previously, Transcribe supported two models for speech transcription: Google’s speech transcription models via their API, and a self-hosted model developed by the Institute for Infocomm Research under A*STAR.

In September 2022, OpenAI released the Whisper model for speech transcription, which we found to outperform our existing models. This prompted us to begin work on hosting the Whisper model for batch transcription within Transcribe’s infrastructure on the Government Commercial Cloud (GCC), and fine-tuning a version of Whisper on locally accented speech. In May 2023, we made both the original and fine-tuned Whisper models available for batch transcription.

Following this, we shifted our efforts to adapting Whisper for live transcription. As Whisper was designed for batch transcription (i.e. non-real-time transcription of audio files) rather than live transcription, this required some degree of experimentation. In August 2023, we successfully released the Whisper live transcription feature, running on models self-hosted on GCC.

In addition to transcription, Transcribe also supports speaker diarisation, which is the task of separating the audio into segments associated with different speakers. Speaker diarisation is performed by our self-hosted model based on the VBx algorithm. Originally, Transcribe offered speaker diarisation only for batch transcription, not for live transcription. However, following the integration of Whisper for live transcription, we modified the VBx algorithm to operate in real time. This enhancement, which brings real-time speaker diarisation to live transcription, was rolled out in December 2023.

How Live Transcription Works

Live transcription is more complex than batch transcription because audio streams must be processed in real time. One of the primary challenges is transcribing audio swiftly enough to ensure timely delivery to the client or front-end. At the same time, transcription models tend to be more accurate when they have access to more of the upcoming audio before finalising their transcriptions. Balancing immediate processing against sufficient context adds to the complexity of live transcription.

Hence, during a live transcription, the latest part of the transcription is often in a non-final or interim state, which means that it can continually be revised by the transcription model as it receives more audio to improve its transcription. While the earlier segments which have already been finalised are sent once to the front-end, the most recent segments which are non-final or interim have to be continuously sent to the front-end.
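
To make this concrete, the sketch below shows one possible shape for the messages pushed to the front-end; the field names and values are illustrative, not Transcribe’s actual payload format.

```python
# Hypothetical message shapes pushed to the front-end (e.g. over a websocket).
# Finalised segments are sent once; the interim tail is re-sent, possibly
# revised, every time the model processes more audio.
final_message = {
    "type": "final",
    "segments": [{"start": 0.0, "end": 4.2, "text": "Good morning, everyone."}],
}

interim_message = {
    "type": "interim",
    "segments": [{"start": 4.2, "end": 6.1, "text": "Today we will be dis"}],  # may still change
}
```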

Whisper live algorithm

Whisper, in its original form, is not suited for real-time transcription and was not developed with this application in mind. The standard Whisper model processes audio in segments of 30 seconds each. However, for live transcription, it’s essential to transcribe shorter segments of audio as they are received. To adapt Whisper for this purpose, we needed to conduct some initial experimentation to assess the feasibility of the concept. This involved measuring the time it takes to transcribe each audio chunk. For instance, in a scenario where a 2-second audio chunk is received, the goal would be to transcribe it within 1 second and then relay the output to the client, so that the model would be ready to process the subsequent audio chunk when it arrives.

To reduce this latency, we made use of faster-whisper, a reimplementation of the model using CTranslate2, which greatly speeds up inference through several performance optimisation techniques. With the Whisper small model running on GPU, we were able to process each chunk in under 2s, and hence update the transcript every 2s. The latency increases if we use the Whisper large model, giving rise to one of the speed-accuracy trade-offs we face.
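
As an illustration, here is a minimal way to invoke faster-whisper on a single audio chunk. The model size, compute type and decoding options shown are placeholders rather than Transcribe’s production settings.

```python
from faster_whisper import WhisperModel

# Load the small model on GPU with half-precision weights (illustrative choices).
model = WhisperModel("small", device="cuda", compute_type="float16")

# Transcribe one audio chunk; segments is a generator of timestamped results.
segments, info = model.transcribe("chunk.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```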

During our experimentation, we also discovered issues arising from Whisper’s processing. When processing audio chunks shorter than 30s, Whisper pads each short chunk with trailing zeros to a length of 30s before transcribing it. There are two issues with this. Firstly, these zeros separate consecutive audio chunks, causing Whisper to interpret each chunk as standalone rather than continuous, breaking up words that span multiple chunks. Secondly, while Whisper is known to hallucinate and generate repetitive text due to its sequence-to-sequence architecture, we found that this behaviour occurs even more often when processing such padded chunks.

Thus, in our solution, we re-transcribe short audio chunks in the interim, and finalise the transcription only when the audio chunk has reached 30s long. This gets around the zero-padding issue and ensures that the audio chunks are continuous, without any padded zeros in between. This, however, does not address the issue of repetitive text in the interim. While Whisper has existing heuristics to reduce such text repetitions, we added a few of our own to further mitigate the issue.
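
The sketch below captures this idea under some simplifying assumptions: audio arrives as 16 kHz float32 chunks, and a faster-whisper model object is passed in. It is not Transcribe’s actual implementation.

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_WINDOW_S = 30  # Whisper's native context length

class RollingTranscriber:
    def __init__(self, model):
        self.model = model  # e.g. a faster_whisper.WhisperModel
        self.buffer = np.zeros(0, dtype=np.float32)

    def add_chunk(self, chunk: np.ndarray) -> dict:
        # Append the new audio and re-transcribe the unpadded buffer so far.
        self.buffer = np.concatenate([self.buffer, chunk])
        segments, _ = self.model.transcribe(self.buffer)
        text = "".join(s.text for s in segments)

        if len(self.buffer) >= MAX_WINDOW_S * SAMPLE_RATE:
            # The window is full: finalise its transcript and start a fresh
            # buffer, so no padded zeros ever sit between real audio.
            self.buffer = np.zeros(0, dtype=np.float32)
            return {"final": True, "text": text}
        # Otherwise the transcript is interim and may be revised next time.
        return {"final": False, "text": text}
```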

Supporting a seamless live transcription session

Further complications arose from the need to ensure the transcription can continue automatically when the user’s client temporarily disconnects. To support this, we implemented a reconnection mechanism: the intermediate states are cached so that the transcription can resume smoothly once the user reconnects.
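
One simple way to realise this (a sketch only; the key names, grace period and choice of store are assumptions) is to cache the session state in a shared store keyed by session ID, with a time-to-live that gives the client a window to reconnect:

```python
import json
import redis

r = redis.Redis()
RECONNECT_GRACE_S = 300  # hypothetical grace period for the client to reconnect

def save_session_state(session_id: str, state: dict) -> None:
    # Cache intermediate state (buffered audio offsets, interim transcript,
    # speaker labels, ...) so the session can resume after a reconnect.
    r.set(f"live:session:{session_id}", json.dumps(state), ex=RECONNECT_GRACE_S)

def restore_session_state(session_id: str) -> dict | None:
    raw = r.get(f"live:session:{session_id}")
    return json.loads(raw) if raw else None
```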

In addition, we also introduced some modifications to the front-end to improve the end-user experience. Because the entire interim transcript is continuously sent to the front-end, the sudden appearance of several new words at a time can degrade the readability of the transcript in real time. We therefore introduced some logic to stream out the most recent, previously unseen words in quick succession, improving readability and making the live transcription appear more seamless.
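
Conceptually, the front-end only needs the words appended (or revised) since the last interim update, which it can then reveal word by word. A simplified version of that diff, not the actual front-end code:

```python
def unseen_words(previous_interim: str, current_interim: str) -> list[str]:
    """Return the trailing words of the new interim transcript that differ
    from the previous one (a simplified sketch, splitting on whitespace)."""
    prev, curr = previous_interim.split(), current_interim.split()
    i = 0
    while i < len(prev) and i < len(curr) and prev[i] == curr[i]:
        i += 1
    return curr[i:]  # new or revised words to stream out one at a time

# unseen_words("we are testing", "we are testing the live system")
# -> ["the", "live", "system"]
```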

Live Diarisation (VBx) algorithm

The goal of diarisation is to identify and separate out distinct speakers present in an audio file. Live diarisation ultimately assigns labels, or speaker IDs, to each audio chunk during live transcription. Current diarisation methods first convert audio chunks into embeddings before clustering the embeddings. The clustering step can be done using traditional clustering methods such as agglomerative hierarchical clustering or spectral clustering, or deep-learning-based end-to-end models which are able to distinguish speakers in overlapped speech.
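
For intuition, the toy sketch below clusters randomly generated embeddings with agglomerative hierarchical clustering; the embedding dimension and distance threshold are arbitrary, and VBx itself goes further by refining an initial AHC clustering with a Bayesian HMM.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row stands in for a speaker embedding (e.g. an x-vector) extracted
# from a short window of audio.
embeddings = np.random.randn(20, 256)

# Distance-threshold AHC, so the number of speakers need not be known upfront.
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.7, metric="cosine", linkage="average"
)
speaker_ids = clusterer.fit_predict(embeddings)
print(speaker_ids)  # e.g. [0 0 1 0 2 ...]
```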

To implement live diarisation, we explored three algorithms: VBx, UIS-RNN, and Diart. From our evaluation, we found that VBx produces the best accuracy amongst the three models for our purposes. However, unlike UIS-RNN and Diart which are designed for streaming purposes, VBx is an offline method, and needs to be modified for live diarisation.

We perform embedding generation on each incoming audio chunk as it arrives. The original method clusters all the embeddings generated from the entire audio file, which are not yet available while live transcription is ongoing. Instead, with each incoming audio chunk, we cluster the embeddings from the current chunk together with a selected subset of embeddings generated and clustered previously. The selected embeddings already have labels assigned to them, and these labels can differ from those assigned during the current clustering step. Hence, the new labels need to be remapped to the old labels for consistency throughout the live diarisation.
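
A simplified way to do the remapping is a majority vote over the re-clustered subset of old embeddings, as sketched below; the actual scheme may differ (e.g. using an optimal assignment instead).

```python
from collections import Counter

def remap_labels(current_labels: list[int], previous_labels: list[int]) -> dict[int, int]:
    """For the reused embeddings only: current_labels are the labels from this
    clustering step, previous_labels the labels they carried before,
    position-aligned. Map each new label to the old label its members
    carried most often."""
    mapping = {}
    for new in set(current_labels):
        olds = [o for n, o in zip(current_labels, previous_labels) if n == new]
        mapping[new] = Counter(olds).most_common(1)[0][0]
    return mapping

# The current step labelled the reused embeddings [0, 0, 1, 1], but they were
# previously [2, 2, 0, 0]; the mapping {0: 2, 1: 0} restores consistent IDs.
print(remap_labels([0, 0, 1, 1], [2, 2, 0, 0]))
```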

For better clustering accuracy, we should ideally select more embeddings in each clustering step, but selecting more embeddings increases the latency. This exemplifies yet another trade-off between speed and accuracy which we needed to deal with.

Integrating the models

To integrate VBx live into our existing system architecture, we needed to introduce a new microservice, live-postprocessor, which sits between the Whisper live and VBx live engines and the main live transcription service. Its job is to forward audio from the live transcription service to the two models and merge the output segments they return: VBx live returns segments with start and end times and a speaker label, while Whisper live returns transcribed segments.

A key challenge in implementing the live-postprocessor service involved working out the logic for merging segments as the start and end times of the segments returned by Whisper live and VBx live may not be aligned, and the segments are also finalised at different rates. In a nutshell, the algorithm re-segments the Whisper live output based on the speaker segment timings returned by VBx live, taking care to label segments as final only after both the Whisper and VBx live output is finalised.
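
A stripped-down version of this merge assigns each transcribed segment the speaker whose diarised segment overlaps it most, and marks it final only when both inputs are final; the real logic additionally splits transcribed segments at speaker boundaries.

```python
def overlap(a_start, a_end, b_start, b_end) -> float:
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def merge(transcript_segments, speaker_segments):
    # transcript_segments: [{"start", "end", "text", "final"}, ...] from Whisper live
    # speaker_segments:    [{"start", "end", "speaker", "final"}, ...] from VBx live
    merged = []
    for t in transcript_segments:
        best = max(
            speaker_segments,
            key=lambda s: overlap(t["start"], t["end"], s["start"], s["end"]),
            default=None,
        )
        merged.append({
            **t,
            "speaker": best["speaker"] if best else None,
            "final": t["final"] and bool(best and best["final"]),
        })
    return merged
```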

Because VBx needs to gather enough speaker embeddings to be accurate, we needed to extend the duration of audio kept in the interim (non-final) state. To reduce latency in cases where one service is too slow, we also forcefully finalise the output to ensure that the interim portion of the transcript does not grow too large, since the interim needs to be continuously sent to the front-end.

Scaling the service

Another complication in implementing live transcription and diarisation was load balancing: distributing incoming live transcription jobs amongst multiple Whisper live/VBx live replicas. For batched jobs, the usual approach is to send each job to a shared queue; individual workers subscribe to the queue and pick up a job when they become available. However, this approach does not work for live transcription, where a persistent connection to each transcription engine needs to be maintained.

The solution we arrived at was to have the workers publish their own availability to a shared worker pool (using Redis). The main live transcription service is then responsible for initiating a connection to the most available worker based on the information in the worker pool.
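
In outline, the pattern looks like the sketch below; the key name is illustrative, and a production version would also need heartbeats so that stale workers expire from the pool.

```python
import redis

r = redis.Redis()
POOL_KEY = "live-workers"  # hypothetical key for the shared worker pool

# On each worker: periodically publish how loaded it is
# (lower score = more available).
def publish_availability(worker_id: str, active_sessions: int) -> None:
    r.zadd(POOL_KEY, {worker_id: active_sessions})

# On the main live transcription service: pick the least-loaded worker and
# open a persistent connection to it for the new session.
def pick_worker() -> str | None:
    least_loaded = r.zrange(POOL_KEY, 0, 0)
    return least_loaded[0].decode() if least_loaded else None
```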

Load testing and round-2 performance optimisation

The final step was self-hosting the Whisper and VBx models on GPUs. This involved choosing the most suitable type of GPU instance from various options to strike a balance between computing needs, performance standards and cost-effectiveness. To decide on the number of GPU instances required, we also conducted load testing to estimate the number of concurrent live transcription connections that a single GPU instance could support. For live transcription and diarisation, load testing involved measuring the time taken to process each audio chunk as the number of concurrent connections increased. As mentioned earlier, the key criterion is that processing keeps up with the incoming audio so that unprocessed audio does not build up over time.
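
That criterion can be expressed as a real-time factor: the processing time of a chunk divided by its duration must stay below 1 as concurrency grows. A sketch of the measurement, where the processing function and chunk format are placeholders:

```python
import time

def real_time_factor(process_chunk, chunk, chunk_duration_s: float) -> float:
    """RTF < 1 means a chunk is processed faster than it plays back, so
    unprocessed audio will not accumulate over time."""
    start = time.perf_counter()
    process_chunk(chunk)
    return (time.perf_counter() - start) / chunk_duration_s

# Repeat at increasing numbers of concurrent connections; the supported
# concurrency per GPU is roughly the highest level at which RTF stays below 1.
```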

During the load testing process, we also performed iterative tuning of the algorithm parameters (e.g. chunk size, number of embeddings) to ensure performance, sometimes at a potential cost to model accuracy. The key trade-off is between model accuracy and performance, which is also related to cost, since increasing the maximum number of concurrent connections routed to a single GPU reduces the total number of GPU instances we need to host.

Conclusion

The technical hurdles around live transcription and diarisation presented opportunities for our team to elevate our engineering capabilities to address the challenges involved in developing a real-world solution. Throughout the process of adapting Whisper for live transcription and VBx for live speaker diarisation, we worked in close collaboration to investigate various creative solutions before settling on one which balanced the demands of performance, cost and user experience.

As of January 2024, Transcribe has more than 1,200 monthly active users. The adoption rate was significantly boosted by the availability of Whisper live transcription and diarisation in the past few months. Moving forward, the Transcribe team will deploy Whisper Large for better transcription accuracy in 2024.
