Transcribe audio from video with Azure Cognitive Services
While we’ve made great strides in analysing structured and semi-structured data, unstructured data still lags behind. Of course, this third type presents challenges not found in the other two, but we are steadily finding more options and tools for the job.
One of those tools is, as expected, Artificial Intelligence. Microsoft Azure has some interesting options for this job, namely Cognitive Services. As per their description,
“Cognitive Services brings AI within reach of every developer — without requiring machine-learning expertise”.
Today I bring you a guide on how this promise is realised in practice, by showing you how to transcribe speech from audio using the Speech service of Cognitive Services. To follow along you will need:
- An Azure subscription
- A Cognitive Services or Speech resource in Azure
- Python (to glue everything together)
- Azure Speech SDK for Python
- MoviePy library to separate audio and video
If you are starting with a video, then the first step is to extract the audio. This is a straightforward operation in MoviePy.
While .wav files are larger than .mp3, Cognitive Services expects WAV audio, so that is the format we extract.
Set up authentication
Assuming you already have either a Cognitive Services or Speech resource in your Azure subscription, the next Python snippet takes care of authentication.
Note that Speech is part of the Cognitive Services offering, which is why either resource type works interchangeably here. A dedicated Speech resource is the better fit when, as in our case, you only need that one service.
In your resource, navigate to the Keys and Endpoint menu (on the left pane of the resource blade) and copy KEY 1 and the location.
And this is the code snippet to handle authentication.
Transcribe the audio
Finally, we arrive at the script to transcribe the audio using Azure services.
The script opens with the authentication snippet from earlier, with some added flair. Aside from the imports, note the audio_config, which sets the .wav file as our source of audio.
Next come a flag to keep the script running, and the list that will collect all transcriptions. You see, transcriptions are not produced word by word as they are understood from the audio; rather, they arrive in chunks of a couple of phrases (called an utterance in the documentation). We’ll come back to this topic near the end of the script.
The stop_cb and recognised functions, albeit simple, are essential to the operation. The former handles reaching the end of the transcription and the latter handles each recognition/transcription received. Both are connected to the recognizer’s events right after their definition.
Then we set up the transcription per se, as continuous recognition: instead of transcribing a single utterance, the service continues until we tell it to stop. The while loop that follows is where we use the done flag to keep the script running. If at some point stop_cb has been called, done has been changed to True and we can take it as being done with transcriptions.
Finally, we dump the list of transcription strings to a Pickle file. Of course there are other output options, but this way we preserve the last state of the list and can load it in a separate script for further analysis.
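As a self-contained illustration of that round trip (with a made-up sample list standing in for real transcription output), dumping and loading with pickle looks like this:

```python
import pickle

# A made-up sample standing in for real transcription output
sample = ["First utterance of the audio.", "Second utterance."]

# Persist the list exactly as the transcription script would
with open("transcriptions.pkl", "wb") as f:
    pickle.dump(sample, f)

# ...and in a separate analysis script, load it back
with open("transcriptions.pkl", "rb") as f:
    transcriptions = pickle.load(f)

print(transcriptions[0])  # → First utterance of the audio.
```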