Speech recognition is expensive because it requires large servers, specialized hardware (e.g. GPU), and dedicated ops teams to maintain the infrastructure. Serverless computing (e.g. AWS Lambda) can solve these problems. Alas, existing ASR engines are too hefty to run within a Lambda function.
We at Picovoice recently released a speech-to-text engine called Leopard. It takes only 20 MB of storage and runs even on tiny embedded processors like the Raspberry Pi. Why not run it on AWS Lambda to make speech recognition much more affordable? In this article, I show how to run Leopard on AWS Lambda for less than $0.01 per hour of audio, while cloud providers charge well above $1 per hour.
Accuracy
Leopard’s accuracy is competitive with Amazon Transcribe and Google Speech-to-Text. The benchmark code and data are available here.
Architecture Overview
Pre-requisites
- Register for an AWS account. We will use it to access the AWS Lambda and API Gateway services via the console.
- Sign up for Picovoice Console to get your AccessKey for free. We will need an AccessKey to use Leopard Speech-to-Text engine.
- Download Postman. We will use this tool for testing.
Setting up Lambda
1. In the AWS Console, search for Lambda and go to the Functions tab.
2. Press Create function, then set the function name and set the runtime to Python 3.9.
3. Download the zip file from the GitHub repository, which contains the Lambda handler (source) and the packaged pvleopard module.
The Lambda handler does the following:
- Gets and parses the data from the request.
- Saves the audio data into a temporary file.
- Transcribes and cleans up the temporary file.
- Returns the transcription.
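The steps above can be sketched in a simplified handler. This is not the packaged source from the repository: the environment variable name `ACCESS_KEY` and the body handling are assumptions, and for brevity this sketch decodes the raw request body rather than parsing the `audio_file` multipart field the way the real handler does.

```python
import base64
import json
import os
import tempfile


def decode_body(event):
    """API Gateway base64-encodes binary payloads under proxy integration."""
    body = event["body"]
    return base64.b64decode(body) if event.get("isBase64Encoded") else body.encode()


def handler(event, context):
    import pvleopard  # packaged inside the deployment zip

    # 1. Get and parse the audio data from the request.
    audio = decode_body(event)

    # 2. Save the audio data into a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio)
        path = f.name

    # 3. Transcribe, then clean up the temporary file.
    leopard = pvleopard.create(access_key=os.environ["ACCESS_KEY"])
    try:
        transcript = leopard.process_file(path)[0]  # recent pvleopard versions return (transcript, words)
    finally:
        leopard.delete()
        os.remove(path)

    # 4. Return the transcription.
    return {"statusCode": 200, "body": json.dumps({"transcript": transcript})}
```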
4. Press Upload from > .zip file and upload the zip file.
5. Once the function is uploaded, go to the Configuration > General configuration tab and press Edit.
6. Set the memory limit to 512 MB and the timeout to 30 seconds, then press Save.
7. Go to the Environment variables tab and press Edit.
8. Add the AccessKey you obtained from Picovoice Console and press Save.
9. Press Copy ARN (the button in the top-right corner) and save your function ARN; we will need it when setting up API Gateway. You can also come back and copy it later.
Setting up API Gateway
1. In the AWS Console, go to API Gateway and find the REST API section. Press Build to create a new REST API.
2. Select REST as the protocol, choose New API, and give your API a name. Press Create API.
3. Press Actions > Create Method and select POST as the method. Select Lambda Function as the integration type, tick Use Lambda Proxy integration, and paste the Lambda function ARN you saved earlier.
4. Once created, go to the Settings tab.
5. Scroll to the bottom. Under Binary Media Types, press Add Binary Media Type, add multipart/form-data, and press Save Changes.
6. Go back to the Resources tab. Press Actions > Deploy API. Set [New Stage] as the deployment stage, give it a stage name, and press Deploy.
You will be redirected to the Stages tab, where your API's invoke URL is shown. Now we can test our REST API.
Testing the API
We will use Postman to send an audio file and get the transcription.
1. Copy the invoke URL from API Gateway into the request URL field and set the request type to POST.
2. In the Body tab, set the data type to form-data. Set the key name to audio_file, set the type to File, press Select Files, and choose the audio file you want to transcribe. (Leopard supports the following audio formats: FLAC, MP3, Ogg, Opus, Vorbis, WAV, and WebM.)
3. Once the request goes through, the transcription will be shown in the response pane.
If you would like more details on a specific API request and Lambda invocation, go to CloudWatch > Log groups in the AWS Console.
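If you prefer a script over Postman, the same multipart request can be sent from Python's standard library. This is a sketch, not code from the repository; the invoke URL is a placeholder you must replace with your own, and the `audio_file` field name matches the one configured above.

```python
import json
import urllib.request
import uuid

# Placeholder: replace with your API's invoke URL from the Stages page.
API_URL = "https://YOUR_API_ID.execute-api.us-west-2.amazonaws.com/prod"


def build_multipart(field_name, filename, payload):
    """Encode one file as a multipart/form-data body, as Postman does."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path):
    """POST an audio file to the API and return the parsed JSON response."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("audio_file", path, f.read())
    req = urllib.request.Request(API_URL, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```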
Cost Estimation
Let's estimate how much it costs to process one hour of audio with this architecture. For simplicity, assume uniform 15-second files, so 240 API calls correspond to one hour of audio. Prices are taken from the API Gateway and Lambda pricing pages for the us-west-2 region.
API Gateway Cost
240 requests * $3.50/million = $0.00084 / hour
Lambda Request Cost
240 requests * $0.20/million = $0.00005 / hour
Lambda Compute Cost
A 15-second audio file takes around 3 seconds (3,000 ms) to process. With 512 MB of memory, the total compute cost is:
240 requests * 3 seconds = 720 seconds
0.5 GB * 720 seconds = 360 GB-seconds
360 GB-seconds * $0.0000166667 = $0.006 / hour
Total cost per hour: $0.00084 + $0.00005 + $0.006 ≈ $0.00689
Comparison with other Speech-to-Text services
We compare costs with Google Cloud STT and AWS Transcribe under the same setup as before.
Google Cloud STT: 240 files * $0.006 = $1.44 / hour
AWS Transcribe: (240 files * 15 seconds) = 60 minutes * $0.024/minute = $1.44 / hour
In this setup, serverless speech-to-text is noticeably cheaper than the existing managed services.
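The arithmetic above can be reproduced in a few lines of Python (prices as quoted above for us-west-2):

```python
# Cost per hour of audio, assuming 240 requests of 15-second clips.
requests_per_hour = 240

api_gateway = requests_per_hour * 3.5 / 1_000_000            # $3.50 per million requests
lambda_requests = requests_per_hour * 0.2 / 1_000_000        # $0.20 per million requests
lambda_compute = 0.5 * (requests_per_hour * 3) * 0.0000166667  # 512 MB for ~3 s each

serverless_total = api_gateway + lambda_requests + lambda_compute
print(f"Serverless: ${serverless_total:.5f} / hour")         # ≈ $0.00689

google_stt = requests_per_hour * 0.006                       # $0.006 per 15-second file
aws_transcribe = (requests_per_hour * 15 / 60) * 0.024       # $0.024 per minute
print(f"Google STT: ${google_stt:.2f}, AWS Transcribe: ${aws_transcribe:.2f}")
```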
Limitations & Future Explorations
API Gateway has a 29-second timeout and a 10 MB payload size limit; Lambda has a 15-minute timeout and a 6 MB payload size limit.
Hence, the audio file must be smaller than 4.5 MB (base64 encoding inflates the payload by about a third) and shorter than roughly 2 minutes so the transcription completes before the 29-second API Gateway timeout.
In this article, we focused on setting up Leopard for short audio transcription, but there are ways to go past these limitations:
- To transcribe audio files larger than 4.5MB, use S3 to get a pre-signed upload URL, upload your audio file, fetch it in Lambda, and process it.
- To transcribe audio files longer than 2 minutes (longer processing time), invoke Lambda asynchronously and poll for the result once the transcription finishes.
Conclusion
We have seen how to set up Leopard and integrate it with Lambda and API Gateway. Keep in mind that this setup is meant for transcribing smaller audio files. Since this is starter code, you can always modify it, re-zip it, and upload the zip file again to update your function.
Take a look at Leopard GitHub Repository or Leopard Docs Page to learn more about Leopard.