OpenAI Whisper : Robust Speech Recognition

Vishal Rajput
Published in AIGuys
5 min read · Oct 19, 2022

Once again OpenAI has killed it: their latest speech recognition model has shaken the foundations of the field, beating state-of-the-art speech recognition systems by leaps and bounds. The model is close to human-level at recognizing speech, even in extremely noisy conditions.

Before we delve deeper into Whisper, let’s look at its capabilities. The examples below show Whisper converting heavily accented speech to text.

Data preparation

So, without wasting any time, let’s get into the Whisper paper. Before Whisper, speech recognition research followed two main pathways: unsupervised and supervised pre-training. In the unsupervised setting, where roughly 1,000,000 hours of audio data is available, pre-trained audio encoders learn high-quality representations of speech. But because they are trained without labels, they lack an equally performant decoder to map those representations to usable outputs, so a fine-tuning stage is needed before they can actually perform a task such as speech recognition.

On the other hand, speech recognition systems pre-trained in a supervised fashion across many datasets and domains exhibit higher robustness and generalize much more…
