Logic Apps — Large Audio Speech to Text Batch Transcription

Abhishek
5 min read · Nov 27, 2018


Introduction

There are multiple write-ups available on the web describing how to use Microsoft Cognitive Services for speech-to-text conversion, either by using the Cognitive Services API or speech recognition. Both approaches work nicely for real-time speech-to-text detection with media files of limited size. You can learn more about these transcription options at

https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-to-text

In this article we will describe another way to transcribe large audio files into text by using Logic Apps (we will show how to use Azure Functions in the next section) and the Batch Transcription feature of Microsoft Cognitive Services. If you are new to Batch Transcription: the Batch Transcription API provides an asynchronous way to convert large audio files into their relevant textual format.

In the real world, call centres run various styles of audio batch transcription applications to analyse customer intent and compute a customer satisfaction index. With recent advances in the technology, this process has also gained a lot of popularity among enterprises looking for the best way to improve interactions between internal staff and customers. If you want to learn more about batch transcription, read the documentation here: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription

Solution Design and Implementation

The current version of the Cognitive Services Batch Transcription API depends on blob storage to read the audio files and write their transcripts. The overall architecture of this solution is shown below. You can implement a similar architecture if you are working on a real-life application.

As a solution requirement we have created a blob storage account and a container within it to hold the audio files. If you are thinking of replicating this process, create a v2 storage account as per the Microsoft documentation:

https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=portal
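As an illustration, the snippet below is a minimal sketch of dropping an audio file into that container with the azure-storage-blob Python SDK (v12); the connection string, container name and file name are placeholders, not values from this walkthrough.

```python
# Minimal sketch: upload an audio file into the container the logic app watches.
# Assumes the azure-storage-blob v12 SDK; names below are placeholders.
from azure.storage.blob import BlobServiceClient

connection_string = "<storage-account-connection-string>"  # portal > storage account > Access keys
container_name = "audio-input"                             # assumed container name

blob_service = BlobServiceClient.from_connection_string(connection_string)
container = blob_service.get_container_client(container_name)

with open("call-recording.wav", "rb") as audio:
    # Creating this blob is what fires the Event Grid trigger described below.
    container.upload_blob(name="call-recording.wav", data=audio, overwrite=True)
```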

Next to this we have created a resource for the Cognitive Services Speech service in the Azure portal. To do this, log in to the Azure portal and search for the Speech resource. Once you find the Speech resource blade, click Create, populate the required artefacts, and then click Finish.

Note: Copy the Speech to Text Cognitive Services API key and the region (location) in which you created your Cognitive Services resource.

In the next step, create a blank logic app and set the trigger to the Event Grid event "When a blob is created". The blob-created event notifies the logic app that a new audio file has been uploaded to the blob container, at which point the logic app instantiates its workflow and gets the required result from the Batch Transcription API.
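To make the trigger concrete, a Microsoft.Storage.BlobCreated event delivered by Event Grid looks roughly like the sketch below (all values are illustrative); the workflow mainly needs the blob URL from the event data.

```python
# Illustrative shape of the blob-created event the logic app receives.
blob_created_event = {
    "topic": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
    "subject": "/blobServices/default/containers/audio-input/blobs/call-recording.wav",
    "eventType": "Microsoft.Storage.BlobCreated",
    "eventTime": "2018-11-27T10:15:00Z",
    "data": {
        "api": "PutBlob",
        "contentType": "audio/wav",
        "blobType": "BlockBlob",
        "url": "https://<account>.blob.core.windows.net/audio-input/call-recording.wav",
    },
}

# The blob URL is what the workflow passes on to the Batch Transcription API.
recording_url = blob_created_event["data"]["url"]
```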

We can find the Open API definition for the batch transcription API, along with the other API specifications, within cris at https://westus.cris.ai/swagger/ui/index#!/Custom32Speech32transcriptions58/CreateTranscription/swagger

Here westus is the region of the Speech cognitive resource within my resource group. In the next section, complete the Logic Apps workflow as shown in the image below.

In the Compose action we construct the JSON payload as per the cris (Batch Transcription) Open API definition.

Once the payload is constructed with the blob URI, we send the request to the Batch Transcription API, which first responds with a 202 status code and sends back a 200 only once the audio file has been converted into its relevant text. In this action you need to replace the Cognitive Services API key with your own resource key.
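As a rough Python equivalent of the Compose and HTTP actions, the sketch below builds the payload and posts it to the v2.0 batch transcription endpoint; the field names follow the cris Open API definition as I understand it and may differ in later API versions.

```python
import requests

region = "westus"                                 # region of the Speech resource
subscription_key = "<speech-resource-api-key>"    # key copied from the portal
endpoint = f"https://{region}.cris.ai/api/speechtotext/v2.0/transcriptions"

# Payload mirroring what the Compose action builds from the blob URI
# (a SAS token may be needed if the container is private).
payload = {
    "recordingsUrl": "https://<account>.blob.core.windows.net/audio-input/call-recording.wav",
    "name": "call-recording transcription",
    "description": "Batch transcription started from the logic app",
    "locale": "en-US",
    "properties": {"AddWordLevelTimestamps": "True"},
}

response = requests.post(
    endpoint,
    json=payload,
    headers={"Ocp-Apim-Subscription-Key": subscription_key},
)

# The service accepts the job with 202; the Location header points at the
# transcription resource that the workflow polls afterwards.
transcription_url = response.headers["Location"]
```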

After the transcription delay, the Logic Apps action "post async message to batch transcription api" receives a 200 status code from the cris transcription API, and in the next action the logic app polls the transcription endpoint to get the information about the transcript files produced for the audio.
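A hedged sketch of that poll-until-done step, assuming the v2.0 transcription resource exposes a status field and resultsUrls as described in the cris API documentation:

```python
import time
import requests

subscription_key = "<speech-resource-api-key>"
transcription_url = "<Location header returned when the transcription was created>"

# Poll the transcription until the service reports a terminal state,
# mirroring the delay-and-check behaviour of the logic app.
while True:
    transcription = requests.get(
        transcription_url,
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
    ).json()

    if transcription["status"] in ("Succeeded", "Failed"):
        break
    time.sleep(30)  # roughly the "delay" action in the workflow

# On success the transcription object carries the paths of the transcript
# files (resultsUrls), one entry per audio channel.
results_urls = transcription.get("resultsUrls", {})
```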

The response from the Batch Transcription API is an array of audio-to-text transformations. We need to iterate through each audio file path and get the relevant content out of the converted files.

In the action "get base64 audio content from storage", the logic app performs a GET operation on the transcript path returned by the cris API.

The action returns a base64-encoded text script. You can simply use the base64ToString() function to convert the incoming base64 message into the relevant string. The next actions are for storage purposes; you can use either Cosmos DB or any storage account to store the transcript text file against the audio.
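Sketched in Python, this iterate-fetch-decode part of the workflow looks roughly like the following; the resultsUrls entry and the sample transcript content are illustrative, and base64.b64decode plays the role of the base64ToString() expression.

```python
import base64
import requests

# resultsUrls taken from the polled transcription; channel key and URL are illustrative.
results_urls = {"channel_0": "https://<results-store>/<transcription-id>_channel0.json"}

transcripts = {}
for channel, transcript_url in results_urls.items():
    # "get base64 audio content from storage": fetch the converted transcript file.
    # A plain HTTP GET returns the JSON text directly; the Logic Apps storage
    # connector hands the same content back as a base64 string.
    transcripts[channel] = requests.get(transcript_url).text

# Equivalent of the workflow's base64ToString() expression, turning the
# connector's base64 payload back into readable text.
sample_base64 = "eyJBdWRpb0ZpbGVSZXN1bHRzIjogW119"  # '{"AudioFileResults": []}'
decoded = base64.b64decode(sample_base64).decode("utf-8")
```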

The last action is to delete the text transcript from the cris transcription storage account by passing the id of the transcript, which you can find in the iteration message.
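A minimal sketch of that clean-up call, assuming the same v2.0 endpoint and a transcription id taken from the iteration:

```python
import requests

region = "westus"
subscription_key = "<speech-resource-api-key>"
transcription_id = "<id found in the iteration message>"

# Remove the finished transcription (and its stored results) from the batch
# transcription service once the text has been saved to our own storage.
requests.delete(
    f"https://{region}.cris.ai/api/speechtotext/v2.0/transcriptions/{transcription_id}",
    headers={"Ocp-Apim-Subscription-Key": subscription_key},
)
```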

We have found this way of transcribing audio to text much more efficient and cost-effective. Instead of creating multiple layers of AI application, the cris API acts as a single endpoint that automates a pretty interesting AI algorithm. After analysis we have found that the v2 version of the cris API has a success rate of 75 to 88 percent, which might increase in the next version of the API.


Abhishek

Integration Consultant | Author | Microsoft Azure MVP | TechNet Wiki | Proud Son