Published in Nerd For Tech

Transcribe audio from video with Azure Cognitive Services

While we’ve made great strides in analysing structured and semi-structured data, unstructured data still lags behind. Of course, this third type presents challenges not found in the other two, but we are steadily finding more options and tools for the job.

One of those tools is, as expected, Artificial Intelligence. Microsoft Azure has some interesting options for this job, namely Cognitive Services. As per their description,

“Cognitive Services brings AI within reach of every developer — without requiring machine-learning expertise”.

Today I bring you a guide on how this description is realized in practice by showing you how to transcribe speech from audio using the Speech service of Cognitive Services.


Prerequisites

  • An Azure subscription
  • A Cognitive Services or Speech resource in Azure
  • Python (to glue everything together)
  • Azure Speech SDK for Python
  • MoviePy library to separate audio and video
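The two Python dependencies can be installed with pip. The package names below are the ones published on PyPI (`azure-cognitiveservices-speech` is the Speech SDK):

```shell
pip install azure-cognitiveservices-speech moviepy
```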

Extract audio

If you are starting with a video, then the first step is to extract the audio. This is a straightforward operation in MoviePy.

While .wav files are larger than .mp3, Cognitive Services requires this format.
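The extraction itself fits in a few lines. This is a minimal sketch, assuming an input file named `video.mp4` and the classic `moviepy.editor` import path (newer MoviePy releases expose the same class as `from moviepy import VideoFileClip`):

```python
from moviepy.editor import VideoFileClip

# Load the video and write its audio track out as a .wav file
clip = VideoFileClip("video.mp4")
clip.audio.write_audiofile("audio.wav")
clip.close()
```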

Set up authentication

Assuming you already have either a Cognitive Services or Speech resource in your Azure subscription, we can take care of the authentication part in the following Python snippet.

Please note Speech is part of the Cognitive Services offerings, which is why the two resource types can be used interchangeably here. A dedicated Speech resource works just as well, since this one service is all we need.

In your resource, navigate to the Keys and Endpoint menu (on the resource blade/left pane) and copy KEY 1 and location.

And this is the code snippet to handle authentication.
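The embedded snippet was lost from the page, so here is a sketch of the authentication step using the Speech SDK; the key and region values are placeholders you replace with KEY 1 and the location copied above:

```python
import azure.cognitiveservices.speech as speechsdk

# Values from the Keys and Endpoint page of your resource
speech_key = "paste-key-1-here"
service_region = "paste-location-here"  # e.g. "westeurope"

# This config object authenticates every call to the Speech service
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
```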

Transcribe the audio

Finally, we arrive at the script to transcribe the audio using Azure services.

Before the script, let me just acknowledge this solution is heavily based on code snippets from the official documentation and on code I found in a GitHub thread.
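The original embedded gist did not survive the page, so below is a sketch reassembled from the official Speech SDK continuous-recognition samples; the line numbers discussed in the walkthrough line up with this version. The filenames (`audio.wav`, `transcriptions.pkl`) and the key/region placeholders are assumptions:

```python
import time
import pickle

import azure.cognitiveservices.speech as speechsdk

# Authentication (use your own KEY 1 and location)
speech_key = "paste-key-1-here"
service_region = "paste-location-here"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Use the extracted .wav file as the audio source
audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")

# Flag to keep the script running
done = False
# List to collect each transcribed utterance
transcriptions = []

def stop_cb(evt):
    """Signal that continuous recognition has reached its end."""
    speech_recognizer.stop_continuous_recognition()
    global done
    done = True

def recognised(evt):
    """Store each recognised utterance (a chunk of a few phrases)."""
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        transcriptions.append(evt.result.text)

speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config, audio_config=audio_config
)

# Optionally log partial results as they are produced
speech_recognizer.recognizing.connect(
    lambda evt: print(f"RECOGNIZING: {evt.result.text}")
)
speech_recognizer.recognized.connect(recognised)

# Stop when the session ends...
speech_recognizer.session_stopped.connect(stop_cb)

speech_recognizer.canceled.connect(stop_cb)  # ...or on cancellation

# Transcribe continuously until stop_cb fires
speech_recognizer.start_continuous_recognition()

while not done:
    time.sleep(0.5)

# Preserve the final list of utterances for later analysis
with open("transcriptions.pkl", "wb") as f:
    pickle.dump(transcriptions, f)
```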

The first 12 lines are the authentication snippet from earlier, with some added flair. Aside from the imports, note audio_config, which sets the .wav file as our source of audio.

Lines 14–17 declare a flag to keep the script running, and the list that will contain all transcriptions. You see, the transcriptions are not produced word by word as they are understood from the audio; rather, they are received in chunks of a couple of phrases each (called an utterance in the documentation). We’ll come back to this topic near the end of the script.

The stop_cb and recognised functions, albeit simple, are essential to the operation. The former handles reaching the end of the transcription and the latter handles each recognition/transcription received. See lines 38, 41 and 43 for the assignment of these functions.

On line 46 we set up the transcription per se, as continuous recognition. Instead of transcribing a single utterance, it continues the transcription until we tell the service to stop. That while loop on line 48 is where we use the done flag to keep the script running. If at some point stop_cb has been called, then done has been changed to True and we can take it as being done with transcriptions.

Finally, on line 52 we dump the list of string transcriptions to a Pickle file. Of course there are other output options, but this way we preserve the last state of the list and can load it in a separate script for further analysis.
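Loading the results back in that separate script only takes a small helper. This is a sketch; the function name and path are illustrative, not part of the original article:

```python
import pickle

def load_transcriptions(path):
    """Load the pickled list of utterances saved by the transcription script."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Example: join the utterances into one block of text for analysis
# full_text = " ".join(load_transcriptions("transcriptions.pkl"))
```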



José Fernando Costa

I write about data science to help other people who might come across the same problems