Unified Cloud Transcription Framework

With the rise of machine learning and cloud computing, every major cloud provider now offers its own transcription (speech-to-text) service. Giants like Google, Amazon, Microsoft and IBM support a good number of languages that can be transcribed from a given audio input. It is hard to decide which one works best for which language without testing them all extensively. But if we can integrate all of the services easily, we can leave that choice to the user.

Each of the cloud transcription services has a different API, but the way we deal with them is mostly the same. This insight allows us to come up with a unified framework for integrating them all, and it is especially true for the common use case of transcribing long audio clips. Let’s walk through what we need to handle, in general, for all the services; the steps below sum it up.

Audio format

We have seen that all the major services accept WAV, and 16-bit, single-channel, 22050 Hz WAV files have pretty good quality. So if we convert our input material to this format, we can feed it to any transcription service.

Transcoding means converting media from one format to another. In this case, we should convert the input media, which may contain video (H.264, MPEG-2, etc.) and/or compressed audio (MP3, AAC, etc.), into the WAV format described above. For transcoding, we can use FFmpeg or some other tool/service, e.g., AWS MediaConvert.
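As a minimal sketch (assuming FFmpeg is installed and on the PATH, and shelling out to it from Node; the file paths are placeholders), the conversion could look like this:

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Convert any input media (video or compressed audio) into
// 16-bit, single-channel, 22050 Hz WAV for transcription.
async function transcodeToWav(inputPath: string, outputPath: string): Promise<void> {
  await execFileAsync("ffmpeg", [
    "-y",                   // overwrite the output file if it exists
    "-i", inputPath,        // input media (MP4, MP3, AAC, ...)
    "-vn",                  // drop any video stream
    "-acodec", "pcm_s16le", // 16-bit PCM audio
    "-ac", "1",             // single channel (mono)
    "-ar", "22050",         // 22050 Hz sample rate
    outputPath,
  ]);
}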

Media location

Some services expect an HTTP URL to the input file, which we can provide by creating a presigned URL for a file in our own bucket. Some expect the file in their own storage (e.g., Google), in which case we need to upload it there before starting the job. Some services accept the audio content posted directly with the API request (e.g., IBM).
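For example, a presigned URL for a file sitting in our own S3 bucket can be generated with the AWS SDK for JavaScript (a sketch; the bucket and key are placeholders):

import S3 from "aws-sdk/clients/s3";

const s3 = new S3();

// Generate a temporary HTTP URL that a transcription service can use
// to download the WAV file from our own S3 bucket.
async function getPresignedUrl(bucket: string, key: string): Promise<string> {
  return s3.getSignedUrlPromise("getObject", {
    Bucket: bucket,
    Key: key,
    Expires: 3600, // validity of the URL, in seconds
  });
}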

Start the job

Each service exposes an API, callable via an SDK or plain HTTP, to start a transcription job. This API usually takes:

  • the media (presigned URL, path to file in a bucket, or posted with the API)
  • language code (usually in the form <language>-<region>, e.g. en-US)
  • other configurations: sample rate of the audio, channel count (if multichannel is supported), whether to enable speaker identification and confidence, output bucket path if needed, etc.

The API returns a job ID that we can later use to query the status or the results.

For example, the AWS SDK has TranscribeService.startTranscriptionJob, Google has speech.v1p1beta1.SpeechClient.longRunningRecognize in its @google-cloud/speech package, and IBM has regional endpoints like https://gateway-lon.watsonplatform.net/speech-to-text/api/v1/recognitions.
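To make this concrete, here is a sketch of starting a job with the AWS SDK for JavaScript; the job name, media URI and language code are placeholders we would supply ourselves:

import TranscribeService from "aws-sdk/clients/transcribeservice";

const transcribe = new TranscribeService();

// Start an asynchronous transcription job. The job name acts as the job ID
// that we store and poll with later.
async function startAwsTranscriptionJob(
  jobName: string,
  mediaFileUri: string, // e.g. an S3 URI like "s3://my-bucket/audio.wav"
  languageCode: string, // e.g. "en-US"
): Promise<string> {
  await transcribe
    .startTranscriptionJob({
      TranscriptionJobName: jobName,
      LanguageCode: languageCode,
      MediaFormat: "wav",
      Media: { MediaFileUri: mediaFileUri },
    })
    .promise();
  return jobName;
}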

Polling

Most providers do not support specifying a callback URL to notify us when a job has completed or failed, so we need a polling cycle of our own that periodically checks on the ongoing jobs. Of course, we also need to maintain a map of these running jobs on our end, keyed by job ID, along with other details.
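A simplified polling loop might look like the sketch below; the job registry and the checkStatus/onComplete callbacks are our own hypothetical pieces, not a provider API:

// Our own registry of running jobs, keyed by job ID.
interface RunningJob {
  provider: "aws" | "google" | "ibm";
  jobId: string;
}

type JobStatus = "IN_PROGRESS" | "COMPLETED" | "FAILED";

const runningJobs = new Map<string, RunningJob>();

// Poll every 30 seconds; checkStatus and onComplete are supplied by the
// provider-specific integrations.
function startPolling(
  checkStatus: (job: RunningJob) => Promise<JobStatus>,
  onComplete: (job: RunningJob) => Promise<void>,
  intervalMs = 30_000,
) {
  return setInterval(async () => {
    for (const [id, job] of runningJobs) {
      const status = await checkStatus(job);
      if (status === "COMPLETED") {
        await onComplete(job);
        runningJobs.delete(id);
      } else if (status === "FAILED") {
        runningJobs.delete(id); // we would also record the failure somewhere
      }
    }
  }, intervalMs);
}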

Get job status

There is usually an API that lets us get the status of a transcription job: complete, in progress, or failed. Some services even report a progress percentage.
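With AWS, for instance, a status check is a single call (a sketch; the job name is whatever we used when starting the job):

import TranscribeService from "aws-sdk/clients/transcribeservice";

const transcribe = new TranscribeService();

// Returns the job status, e.g. "QUEUED", "IN_PROGRESS", "COMPLETED" or "FAILED".
async function getAwsJobStatus(jobName: string): Promise<string | undefined> {
  const result = await transcribe
    .getTranscriptionJob({ TranscriptionJobName: jobName })
    .promise();
  return result.TranscriptionJob?.TranscriptionJobStatus;
}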

Fetch job result

The actual output of a successful job can be retrieved from the service with a separate API call, or, for some providers (Google and IBM), it comes back as part of the same call that returns the job status. AWS writes the output to a bucket that we then read from.
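For the AWS case, a sketch of fetching the result could look like this, assuming the default behaviour where the completed job exposes a presigned TranscriptFileUri (with our own output bucket configured, we would read the object from S3 instead):

import TranscribeService from "aws-sdk/clients/transcribeservice";

const transcribe = new TranscribeService();

// Download and parse the JSON output of a completed AWS Transcribe job
// (uses the global fetch available in Node 18+).
async function fetchAwsTranscript(jobName: string): Promise<unknown> {
  const { TranscriptionJob } = await transcribe
    .getTranscriptionJob({ TranscriptionJobName: jobName })
    .promise();
  const uri = TranscriptionJob?.Transcript?.TranscriptFileUri;
  if (!uri) {
    throw new Error(`No transcript available yet for job ${jobName}`);
  }
  const response = await fetch(uri);
  return response.json();
}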

Process/convert the output

The results returned by the services vary in structure and format, though most of them carry the same information: an array of words, each with a start time, an end time, and a confidence score. We need a step that converts each service's format into our own internal format, which the rest of our system can rely on.

To illustrate how the formats differ from service to service, here are sample extracts from three providers; a sketch of such a converter follows them.

Google

[
  {
    "startTime": "5s",
    "endTime": "5.400s",
    "word": "what",
    "confidence": 0.9416568
  },
  {
    "startTime": "5.400s",
    "endTime": "5.500s",
    "word": "is",
    "confidence": 0.92425334
  },
  {
    "startTime": "5.500s",
    "endTime": "5.800s",
    "word": "really",
    "confidence": 0.82757676
  },
]

AWS

[
  {
    "start_time": "39.3",
    "end_time": "39.88",
    "alternatives": [
      {
        "confidence": "1.0",
        "content": "Creatively"
      }
    ],
    "type": "pronunciation"
  },
  {
    "start_time": "39.88",
    "end_time": "40.26",
    "alternatives": [
      {
        "confidence": "1.0",
        "content": "speaking"
      }
    ],
    "type": "pronunciation"
  },
  {
    "alternatives": [
      {
        "confidence": null,
        "content": ","
      }
    ],
    "type": "punctuation"
  },

IBM

[
  ["and", 0.19, 0.32],
  ["I", 0.32, 0.39],
  ["want", 0.39, 0.69],
]
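Based on these extracts, a converter into a hypothetical internal shape (text, start/end in seconds, confidence) could be sketched like this; the field handling mirrors only what the samples above show:

// Our hypothetical internal word format.
interface Word {
  text: string;
  startSec: number;
  endSec: number;
  confidence: number | null;
}

// Google: times come as strings like "5.400s", confidence as a number.
function fromGoogle(
  items: { startTime: string; endTime: string; word: string; confidence: number }[],
): Word[] {
  return items.map((w) => ({
    text: w.word,
    startSec: parseFloat(w.startTime), // parseFloat ignores the trailing "s"
    endSec: parseFloat(w.endTime),
    confidence: w.confidence,
  }));
}

// AWS: times and confidence come as strings; punctuation items have no timings,
// so we keep only the "pronunciation" items here.
function fromAws(
  items: {
    start_time?: string;
    end_time?: string;
    alternatives: { confidence: string | null; content: string }[];
    type: string;
  }[],
): Word[] {
  return items
    .filter((w) => w.type === "pronunciation")
    .map((w) => {
      const alt = w.alternatives[0];
      return {
        text: alt.content,
        startSec: parseFloat(w.start_time as string),
        endSec: parseFloat(w.end_time as string),
        confidence: alt.confidence !== null ? parseFloat(alt.confidence) : null,
      };
    });
}

// IBM: plain [word, start, end] triples; no per-word confidence in this extract.
function fromIbm(items: [string, number, number][]): Word[] {
  return items.map(([text, startSec, endSec]) => ({
    text,
    startSec,
    endSec,
    confidence: null,
  }));
}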

Clean up

This step covers any housekeeping that is necessary. For example, we may need to delete the audio file we uploaded to the provider's bucket so that it does not consume storage unnecessarily.
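For instance, if we uploaded the WAV file to a Google Cloud Storage bucket for Google's transcriber, we could delete it afterwards (a sketch using @google-cloud/storage; the bucket and object names are placeholders):

import { Storage } from "@google-cloud/storage";

const storage = new Storage();

// Delete the WAV file we uploaded to a Google Cloud Storage bucket
// once the transcription job no longer needs it.
async function cleanupGcsUpload(bucketName: string, objectName: string): Promise<void> {
  await storage.bucket(bucketName).file(objectName).delete();
}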

Conclusion

Having a unified framework really simplifies adding more transcription providers. At Craftsmen, we were able to integrate all of these services successfully by following this approach.
