If you’re like me, you might think it would be cool to generate a transcription of an online meeting now and then. Maybe you are conducting an interview or simply had a great conversation with a co-worker you’d like to reference again.
Whatever the case, there’s a zero percent chance that I’m paying for a premium Zoom account or coughing up multiple dollars per minute to have a meeting transcribed. Not when I’ve spent the last half-decade of my life learning how to use the AWS platform!
Please tell me there’s a better way?
The better way is called the Amazon Transcribe service, and opportune for our purposes is its newly added feature that separates speakers in multi-person audio. It costs just $0.04 per minute to do so (after a 60-minutes-per-month free tier).
Let’s give it a whirl, shall we?
We could perform the requisite setup manually in the AWS console, by uploading a wav file to S3 and creating the Transcribe job there. Or we could automate the whole charade by launching a fancy pre-created CloudFormation stack in our AWS account.
We’re going to take a happy middle approach, where most things are automated but not everything needs to be. We’ll start by crafting a transcribe.py Python script that drives the transcription process by performing four main steps:
- Upload an audio file (e.g., mp3 or wav) to S3.
- Create and kick off an AWS transcription job via the boto3 client.
- Poll the transcription job for completion, and finally…
- Parse the output JSON file into a CSV and download locally.
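Sketched with boto3, the first three steps look roughly like this. The bucket name, job name, and polling interval are placeholders of my choosing, and error handling is omitted:

```python
import time

BUCKET = "my-transcribe-bucket"  # assumed to already exist (we create it in the console)


def job_settings(num_speakers):
    """Settings dict that turns on Transcribe's speaker separation."""
    return {"ShowSpeakerLabels": True, "MaxSpeakerLabels": int(num_speakers)}


def run_transcription(path, fmt="mp3", num_speakers=2):
    import boto3  # imported lazily so job_settings stays usable without AWS credentials

    s3 = boto3.client("s3")
    transcribe = boto3.client("transcribe")

    key = path.split("/")[-1]
    s3.upload_file(path, BUCKET, key)  # 1. upload the audio file

    job_name = f"transcribe-{int(time.time())}"
    transcribe.start_transcription_job(  # 2. create and kick off the job
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{BUCKET}/{key}"},
        MediaFormat=fmt,
        LanguageCode="en-US",
        Settings=job_settings(num_speakers),
    )

    while True:  # 3. poll until the job finishes
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if job["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            break
        time.sleep(10)

    # 4. the output JSON lives at this URI, ready to be parsed into a CSV
    return job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
```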
The full process looks like the following:
All we have to do before running the script is make sure the S3 bucket we upload the audio file to exists. We will create our bucket in the console.
Everything is in place! Before running the transcribe.py script (shared in full at the end), note that I organized it to take three input parameters, namely:
- The path of the audio file to be transcribed.
- The format of the file.
- The number of speakers present in the audio.
We will use the argparse package to manage the inputs in a human-friendly way:
import argparse

aparser = argparse.ArgumentParser()
aparser.add_argument("-i", "--input-file", required=True)
aparser.add_argument("-f", "--file-format", default='mp3')
aparser.add_argument("-n", "--num-speakers", type=int, default=2)
args = aparser.parse_args()
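As a quick sanity check, parse_args can be fed an explicit argument list instead of reading sys.argv, which makes it easy to confirm the defaults kick in (the file name here is just a placeholder):

```python
import argparse

aparser = argparse.ArgumentParser()
aparser.add_argument("-i", "--input-file", required=True)
aparser.add_argument("-f", "--file-format", default="mp3")
aparser.add_argument("-n", "--num-speakers", type=int, default=2)

# Only the required input file is given, so the other two fall back to defaults.
args = aparser.parse_args(["-i", "clip.mp3"])
print(args.input_file, args.file_format, args.num_speakers)  # clip.mp3 mp3 2
```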
Once we know what the values should be, we are ready to run the script! For example,
python transcribe.py -i jack_and_rose_first_argue.mp3 -f mp3 -n 2
To test AWS Transcribe out, I thought it would be fun to use a popular movie scene as the sample audio.
After going down a deep YouTube rabbit hole, I settled on this lovely scene from Titanic, where Jack prods Rose on her feelings for Cal, and Rose discovers Jack’s artistic talent.
To use the clip, I converted it into an mp3 file and downloaded it locally onto my laptop.
With everything in place, let’s see what results we get!
After a couple of minutes, the transcribe.py script finished and a tidy CSV file magically appeared on my laptop as the output. All that was left to do was read it into a Pandas DataFrame and see how it looked.
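The parsing step deserves a closer look. Transcribe’s output JSON contains a flat list of word items plus a speaker_labels block of time-stamped segments, and the script has to stitch the two together. Here is a minimal sketch of that join; the structure below is simplified from the real output, and speaker_rows is a helper name of my choosing:

```python
def speaker_rows(results):
    """Join word items to speaker segments by start time -> (speaker, text) rows."""
    # Map each pronounced word's start_time to the speaker label of its segment.
    speaker_at = {}
    for seg in results["speaker_labels"]["segments"]:
        for item in seg["items"]:
            speaker_at[item["start_time"]] = seg["speaker_label"]

    rows, current, words = [], None, []
    for item in results["items"]:
        if item["type"] != "pronunciation":
            continue  # punctuation items carry no timestamps, so skip them
        spk = speaker_at.get(item["start_time"], current)
        if spk != current and words:  # speaker changed: flush the running utterance
            rows.append((current, " ".join(words)))
            words = []
        current = spk
        words.append(item["alternatives"][0]["content"])
    if words:
        rows.append((current, " ".join(words)))
    return rows


# Tiny hand-made example in the same (simplified) shape as Transcribe output:
results = {
    "speaker_labels": {"segments": [
        {"speaker_label": "spk_0", "items": [{"start_time": "0.0"}, {"start_time": "0.5"}]},
        {"speaker_label": "spk_1", "items": [{"start_time": "1.2"}]},
    ]},
    "items": [
        {"type": "pronunciation", "start_time": "0.0", "alternatives": [{"content": "Do"}]},
        {"type": "pronunciation", "start_time": "0.5", "alternatives": [{"content": "you"}]},
        {"type": "pronunciation", "start_time": "1.2", "alternatives": [{"content": "Pardon"}]},
    ],
}
print(speaker_rows(results))  # [('spk_0', 'Do you'), ('spk_1', 'Pardon')]
```

Each row can then be written out with the csv module or loaded straight into a DataFrame.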
Pretty good! The full text (not split by speaker) looks like:
Jack. I want to thank you for what he did. Not just for for pulling me back, but for your discretion. You're welcome. Look, I know what you must be thinking. Poor little rich girl. What does she know about misery? No, no, it's not what I was thinking. What I was thinking was what could have happened to this girl to make her think she had no way out. Yes. Well, I you It was everything. It was my whole world and all the people in it. And the inertia of my life plunging ahead in me, powerless to stop it. God, look at that. He would have gone straight to the bottom. 500 invitations have gone out. All of Philadelphia society will be there. And all the while I feel I'm standing in the middle of a crowded room screaming at the top of my lungs and no one even looks up. Do you love him? Pardon me, Love. You're being very rude. You shouldn't be asking me this. Well, it's a simple question. Do you love the guy or not? This is not a suitable conversation. Why can't you just answer the question? This is absurd. you don't know me, and I don't know you. And we are not having this conversation at all. You are rude and uncouth and presumptuous, and I am leaving now. Jack. Mr. Dawson, it's been a pleasure. I sought you out to thank you. And now I have thanked you and insulted. Well, you deserve it, right? Right. I thought you were leaving. I am. You are so annoying. Ha ha. Wait. I don't have to leave. Things is my part of the ship. You leave. Oh, well, well, well, now who's being rude? What is the stupid thing you're carrying around? So, what are you, an artist or something? He's a rather good. Yeah, very good, actually.
The text results are pleasantly accurate. Not something you could copy verbatim into an article, but it’s an amazing head start compared to transcribing a recording yourself from scratch.
Unfortunately, it did struggle a bit with identifying the different speakers correctly. After watching the clip again, I can see how maybe this wasn’t the best choice of scene since they do sound kind of similar, especially in the low-quality format produced by a converted YouTube video.
From personal experience, the service has had greater success correctly identifying each speaker in Zoom recordings, though I still haven’t tried it on a meeting with more than two people.
Overall, I’m happy with the functionality I was able to produce in a few hours of Thanksgiving weekend tinkering.
In truth, the reason I’m interested in this capability is I’d like to start supplanting my OC articles here on Medium with interviews of other data professionals. With this functionality figured out, I feel ready to reach out to people in my network and see if they are interested in sharing their thoughts and knowledge about working in data.
Oh, and if you are curious about the full transcribe.py script, I’ve included it below for your perusing pleasure!
Note: The code snippet above draws from this excellent AWS tutorial on the Amazon Transcribe service.
For more interesting content like this, follow me on Medium :)