deMISTify

Monthly articles published by UTMIST’s technical writers on topics in the field of machine learning and artificial intelligence.

Whisper


From virtual assistants to voice-controlled devices, Automatic Speech Recognition (ASR) has enabled more efficient and natural interactions with machines. However, the accuracy of ASR systems is still far from perfect, especially for languages with limited resources. Enter Whisper, an open-source ASR system that demonstrates a potentially game-changing approach to the scarcity of large datasets in speech recognition.

Whisper is an automatic speech recognition system developed by OpenAI that uses deep neural networks (DNNs) to transcribe speech into text. Specifically, Whisper’s architecture is built on the classic transformer model: layers of encoder and decoder blocks leveraging the attention mechanism [4]. In essence, the model takes in 30-second chunks of audio, converts each chunk to a log-Mel spectrogram, processes the chunks sequentially, and outputs the transcribed text.
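To make this concrete, here is a minimal sketch using OpenAI’s open-source whisper Python package (the model size and file name are illustrative):

```python
import whisper

model = whisper.load_model("base")  # model size is illustrative

# transcribe() resamples the audio to 16 kHz, converts it to a
# log-Mel spectrogram, and decodes text one 30-second window at a time
result = model.transcribe("audio.mp3")  # file name is illustrative
print(result["text"])
```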

What sets Whisper apart from other ASR systems is its open-source nature, which allows developers and researchers to use OpenAI’s pretrained models in their own projects. Moreover, Whisper’s API makes it easy for businesses to integrate the technology into their existing apps, services, products, and tools. The language-learning app Speak, for example, uses the Whisper API to power its AI speaking companion [1]. Alternatively, users can deploy optimized local implementations of the model.

Whisper’s API call released by OpenAI [2]
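The call itself is short; here is a hedged sketch, assuming the openai Python client with an API key set in the environment (the file name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a local audio file to the hosted whisper-1 model
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```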

The system is also multilingual: it can recognize and transcribe speech in nearly 100 languages [3], making it a valuable tool for businesses operating in multiple countries. Whisper is also a multitask system: a single model is trained not only to transcribe, but also to translate speech into English, identify the spoken language, and detect voice activity, with the desired task specified by special tokens fed to the decoder [4]. This multitask training helps address the lack of large labeled speech datasets, since the model can leverage data from related tasks to improve its recognition accuracy.
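Both of these capabilities are exposed in the open-source package; a short sketch, again with an illustrative file name:

```python
import whisper

model = whisper.load_model("base")

# Load 30 seconds of audio and compute the log-Mel spectrogram
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Task 1: identify the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Task 2: translate the speech directly into English text
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```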

But what really makes Whisper stand out is the scale of its training data. According to OpenAI, Whisper was trained on over 680,000 hours of multilingual and multitask audio data [1].

Training Process

The system is trained on a large corpus of speech data, which teaches it to recognize and transcribe speech. This corpus consists of hundreds of thousands of hours of audio drawn from the web, spanning a wide range of accents, dialects, and languages [1]. The diversity inherent in the dataset helps prevent overfitting without resorting to data augmentation or heavy regularization [4].

Limitations

However, like any other ASR system, Whisper is imperfect and has limitations. For example, its accuracy is not uniform across languages and dialects, so it may not be the best choice for languages with limited resources. As the developers continue to expand the training data, however, it is hoped that Whisper will become more accurate for all languages [4].

Another limitation concerns streaming. Because the authors note that Whisper operates on fixed 30-second chunks of audio [4], continuous transcription of streamed input is not possible with the vanilla architecture.
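To make the constraint concrete, here is a minimal sketch (assuming the open-source whisper package) of the naive workaround: buffering input into 30-second windows and transcribing each one independently. This loses cross-window context, so it is not true streaming:

```python
import numpy as np
import whisper

model = whisper.load_model("base")

SAMPLE_RATE = 16000        # Whisper expects 16 kHz mono audio
WINDOW = 30 * SAMPLE_RATE  # the model's fixed 30-second input size

def transcribe_stream(chunks):
    """Naively transcribe an iterable of float32 audio arrays."""
    buffer = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        buffer = np.concatenate([buffer, chunk])
        while len(buffer) >= WINDOW:
            window, buffer = buffer[:WINDOW], buffer[WINDOW:]
            # Each window is transcribed in isolation, so words that
            # straddle a boundary can be cut off or duplicated.
            yield model.transcribe(window)["text"]
```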

Conclusion

Overall, Whisper’s large-scale training and use of deep neural networks make it a game-changing open-source ASR system that has the potential to revolutionize the field of speech recognition.

Works Cited

  1. Hawkins, A. (2023, March 1). OpenAI debuts Whisper API for text-to-speech transcription and translation. TechCrunch. https://techcrunch.com/2023/03/01/openai-debuts-whisper-api-for-text-to-speech-transcription-and-translation/
  2. OpenAI. (2023, March 1). Introducing ChatGPT and Whisper APIs. Retrieved March 27, 2023, from https://openai.com/blog/introducing-chatgpt-and-whisper-apis/
  3. SuperAnnotate. (2022, February 8). OpenAI Whisper: Automatic Speech Recognition System. Retrieved March 27, 2023, from https://www.superannotate.com/blog/openai-whisper-automatic-speech-recognition-system
  4. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv:2212.04356.

Written by Zoey Zhang

Computer Science, Statistics and Art History @ UofT