How to Set Up AWS Transcribe to Batch Transcribe Hundreds of Audio Files

Suzy Shanel
Attest Product & Technology
Mar 2, 2023

At Attest, we provide a platform for our clients to design and create surveys. Survey respondents are sourced using ‘panel aggregators’, and we then transform respondent answers into actionable consumer insights. In the data science team, we use machine learning to help deliver these insights. We recently set up a pipeline to transcribe a few hundred audio files so that they could be analysed using our in-house Natural Language Processing (NLP) models.

Why use AWS Transcribe?

AWS Transcribe is Amazon’s speech-to-text service. Transcribe uses Automatic Speech Recognition (ASR) and NLP models to convert audio or video files into text. It currently supports 37 languages, as well as additional capabilities including punctuation, profanity detection and the option to include custom vocabulary tables and language models. Transcribe also supports multiple audio and video file formats.

Transcribing individual files through the AWS management console is relatively straightforward, but what if you want to transcribe multiple files at once? This article describes the steps to set up batch transcription for AWS Transcribe, without having to install any additional packages like boto3.

Prerequisites:

  • Read/write privileges for AWS S3.
  • Permissions to use the Transcribe service enabled for your AWS account.
  • AWS account access through your CLI set up. More info on how to do this can be found here (1).
  • Python 3 installed (Python 3.9.2 was used here) and access to an IDE.
  • Audio or video files in a format supported by AWS Transcribe. Supported file types can be found here (2). If your files are in an unsupported format you can use ffmpeg (3) to convert them, as shown in the example after this list.
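
For example, ffmpeg can convert a file to a supported format with a single command. The file names below are hypothetical, with WMA standing in for an unsupported format:

ffmpeg -i file-1.wma file-1.mp3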

Step 1: Create input and output S3 buckets

First, create the S3 bucket that will hold the input audio/video files. AWS Transcribe only reads input from, and writes output to, S3 buckets. In the management console, search for S3 and then click on ‘Create bucket’.

Figure 1: Create S3 bucket how-to

Name your input bucket, select your region and any other optional settings you require, then scroll down and click ‘Create bucket’.

Repeat this to create the bucket that will store the output transcriptions.
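
If you prefer the CLI, the same buckets can be created with the aws s3 mb command, using the example bucket names that appear later in this article:

aws s3 mb s3://input-bucket --region <YOUR-REGION>
aws s3 mb s3://output-bucket --region <YOUR-REGION>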

Step 2: Log into AWS through your CLI

Ensure you are logged in to your AWS account in your CLI then run the command aws s3 ls. You should see a list of all of your available buckets in S3, including the ones made in Step 1.
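
The output should look something like the lines below (the creation dates and bucket names are illustrative):

2023-03-02 09:14:21 input-bucket
2023-03-02 09:15:02 output-bucket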

Step 3: Upload files to the S3 bucket

In this example, we have the following 5 files saved in a local directory.

Figure 2: Audio/video files saved locally

Assuming you have the files stored locally, the easiest way to upload them to S3 is to simply select all the files you wish to transcribe in your local folder and then drag and drop them into the input bucket.
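
Alternatively, the whole folder can be uploaded from the CLI in one command, using the same placeholders as the rest of this article:

aws s3 cp <PATH-TO-LOCAL-AUDIO-FILES> s3://<PATH-TO-S3-BUCKET>/ --recursive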

Check that you can see these files in the bucket through the CLI by running:

aws s3 ls s3://<PATH-TO-S3-BUCKET>

for example: aws s3 ls s3://input-bucket

Step 4: Create transcription jobs

Next, save the script below locally as a Python file called batch-transcribe.py.

import os
from os import listdir
from os.path import isfile, join

# create a list of all file names in the specified local folder
file_names = [f for f in listdir('<PATH-TO-LOCAL-AUDIO-FILES>')
              if isfile(join('<PATH-TO-LOCAL-AUDIO-FILES>', f))]

# batch identifier to ensure transcription job names are unique
identifier = "_trial_1"

for x in file_names:
    # AWS Transcribe create-job request with the desired settings:
    # region eg. eu-west-1; media URI eg. s3://input-bucket/{file_name};
    # output bucket eg. output-bucket; the language code used here was en-GB
    command = ("aws transcribe start-transcription-job "
               "--region <YOUR-REGION> "
               "--transcription-job-name {job_name} "
               "--media MediaFileUri=s3://<PATH-TO-S3-BUCKET>/{file_name} "
               "--output-bucket-name <OUTPUT-BUCKET-NAME> "
               "--language-code <LANGUAGE-CODE> > /dev/null").format(
                   job_name=x + identifier, file_name=x)

    os.system(command)

This script loops through the local file names and creates a transcription job for each one. It gets the list of file names from the folder specified in <PATH-TO-LOCAL-AUDIO-FILES> (in this case the audio-files folder shown in Step 3) and points each job at the file with the same name in the S3 input bucket. If you don’t want to transcribe all the files in the folder, you can simply pass in a list of the desired file names as the file_names variable instead, as shown below.
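
For example, to transcribe only two specific files, you could replace the listdir call with a hard-coded list (the file names here are illustrative):

file_names = ["file-1.mp4", "file-2.mp4"]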

Each transcription request passed to AWS Transcribe must have a unique job name otherwise AWS will return an error. To prevent this from happening and allow the files to be transcribed multiple times (for example to test different Transcribe settings), a unique identifier (eg. “_trial_1”) is added to the end of each file name.

More information on the optional settings for AWS Transcribe can be found here (4). Note that you must specify only the output bucket name, not a path within the bucket. For example, the command below will return an error:

--output-bucket-name my-buckets/testing/output-buckets \\

Finally, in the CLI, navigate to the directory where the batch-transcribe.py file is saved and then run the script using the command below:

python batch-transcribe.py

Step 5: Check transcription jobs

Next, in the AWS management console, search for AWS Transcribe and then click on Transcription Jobs. You should see a job for each of the files, with a status of either pending or complete. Once all the jobs are complete, move on to Step 6.
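
You can also check on the jobs from the CLI. For example, the command below lists only the jobs that are still running (the --status flag also accepts QUEUED, FAILED and COMPLETED):

aws transcribe list-transcription-jobs --status IN_PROGRESS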

Figure 3: Transcription jobs completed/ in progress

Step 6: Batch download the transcribed files from the output S3 bucket

In the output S3 bucket you should see a JSON file corresponding to each transcription job.

Figure 4: Transcription JSON files in output bucket

Save the script below locally as a Python file called batch-download.py.

import os
from os import listdir
from os.path import isfile, join

# create a list of all file names in the specified local folder
file_names = [f for f in listdir('<PATH-TO-LOCAL-AUDIO-FILES>')
              if isfile(join('<PATH-TO-LOCAL-AUDIO-FILES>', f))]

# batch identifier to select which transcription jobs to download
identifier = "_trial_1"

for x in file_names:
    # AWS request to download the transcription JSON from the output bucket,
    # saving it locally under the original file name plus a .json extension
    command = ("aws s3api get-object --bucket <OUTPUT-BUCKET-NAME> "
               "--key {job_name}.json {file_name}.json > /dev/null").format(
                   job_name=x + identifier, file_name=x)

    os.system(command)

Note that here we have opted to remove the identifier from the end of the file name when downloading, to preserve the original name. We have also kept the original file type extension in the name, meaning the downloaded files will end in .mp4.json. This keeps a record of the file format that was used in the transcription. If you don’t want this, you can edit the script above to strip the .mp4 extension from the file name before it is used as the download target, as sketched below. Alternatively, you can download the file under the job name (which will include the identifier) instead of reverting to the original file name.
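
As a minimal sketch, os.path.splitext can strip the extension before the name is substituted into the download command (the variable names match the script above):

# e.g. turns 'file-1.mp4' into 'file-1', so the download is saved as 'file-1.json'
base_name = os.path.splitext(x)[0]
command = ("aws s3api get-object --bucket <OUTPUT-BUCKET-NAME> "
           "--key {job_name}.json {file_name}.json > /dev/null").format(
               job_name=x + identifier, file_name=base_name)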

In the CLI, navigate to the directory where the batch-download.py file is saved and then run the script as follows:

python batch-download.py

Step 7: Create a data frame from the JSON files

The output JSON files will have the following format:

{"jobName”:”file-1.mp4_trial_1”,
"accountId":"<your account id>",
"results":{"transcripts":[{"transcript":"here's an example"
}
],
"items":[{
"{
"start_time": "1.05",
"end_time": "1.29",
"alternatives": [
{
"confidence": "0.9961",
"content":"here's"}],"type": "pronunciation"
},
{
"start_time": "1.3",
"end_time": "1.71",
"alternatives": [
{
"confidence": "1.0",
"content":"an"}],"type": "pronunciation"
},
{
"start_time": "1.71",
"end_time": "1.88",
"alternatives": [
{
"confidence": "1.0",
"content":"example"}],"type": "pronunciation"
}]},
"status": "COMPLETED"
}

The function below extracts just the raw transcription from each JSON file and collects them into a data frame ready to be analysed. <PATH-TO-JSON-FILES> should be set to the path of the local directory containing your downloaded JSON files.

import json
from os import listdir
from os.path import isfile, join

import pandas as pd


def extract_transcriptions():

    # create a list of all file names in the specified folder
    file_names = [f for f in listdir('<PATH-TO-JSON-FILES>')
                  if isfile(join('<PATH-TO-JSON-FILES>', f))]

    # create an empty list to append transcriptions to
    transcriptions = []

    for x in file_names:

        # read in the JSON file
        with open(f'<PATH-TO-JSON-FILES>/{x}') as f:
            t = json.load(f)

        # select only the complete raw transcript from the JSON file and append
        transcriptions.append(t['results']['transcripts'][0]['transcript'])

    # return a data frame of transcriptions ready to be analysed
    return pd.DataFrame(transcriptions, columns=['raw_transcript'])
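
A minimal usage example, assuming the JSON files from Step 6 have been downloaded to <PATH-TO-JSON-FILES>:

df = extract_transcriptions()
print(df.head())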

Summary

This article has provided a brief introduction to AWS Transcribe. It then described the steps to batch convert audio/video files from speech to text, download the transcribed files, and extract the raw transcriptions into a pandas data frame ready for you to analyse with some NLP models!

We hope you found this article useful. Stay tuned for the next one on how to repeat these steps using Google Speech-to-Text and a comparison with AWS Transcribe!

References

  1. AWS CLI configuration guide: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
  2. AWS Transcribe supported file types: https://docs.aws.amazon.com/transcribe/latest/dg/how-input.html
  3. Documentation on ffmpeg for file conversion: https://ffmpeg.org/
  4. AWS Transcribe job settings guide: https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-cli.html
