Speech recognition is expensive because it requires large servers, specialized hardware (e.g. GPU), and dedicated ops teams to maintain the infrastructure. Serverless computing (e.g. AWS Lambda) can solve these problems. Alas, existing ASR engines are too hefty to run within a Lambda function.
We at Picovoice recently released a speech-to-text engine called Leopard. It takes only 20 MB of storage and runs even on tiny embedded processors like the Raspberry Pi. Why not run it on AWS Lambda to make speech recognition much more affordable? In this article, I show how to run Leopard on AWS Lambda for less than $0.01 per hour of audio, while cloud providers charge well above $1 per hour.
Accuracy
Leopard’s accuracy is competitive with Amazon Transcribe and Google Speech-to-Text. The benchmark code and data are available here.
Architecture Overview
Pre-requisites
- Register for an AWS account. We will use it to access the AWS Lambda and API Gateway services via the console.
- Sign up for Picovoice Console to get your AccessKey for free. We will need an AccessKey to use Leopard Speech-to-Text engine.
- Download Postman. We will use this tool for testing.
Setting up Lambda
1. In the AWS Console, search for Lambda and go to the Functions tab.
2. Press Create function, then set the function name and set the runtime to Python 3.9.
3. Download the zip file from the GitHub repository, which contains the Lambda handler (source) and the packaged pvleopard module.
The Lambda handler does the following:
- Gets and parses the data from the request.
- Saves the audio data into a temporary file.
- Transcribes and cleans up the temporary file.
- Returns the transcription.
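The steps above can be sketched in a simplified handler. This is not the packaged source from the repository: the environment variable name `ACCESS_KEY` and the body handling are assumptions, and for brevity this sketch decodes the raw request body rather than parsing the `audio_file` multipart field the way the real handler does.

```python
import base64
import json
import os
import tempfile


def decode_body(event):
    """API Gateway base64-encodes binary payloads under proxy integration."""
    body = event["body"]
    return base64.b64decode(body) if event.get("isBase64Encoded") else body.encode()


def handler(event, context):
    import pvleopard  # packaged inside the deployment zip

    # 1. Get and parse the audio data from the request.
    audio = decode_body(event)

    # 2. Save the audio data into a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio)
        path = f.name

    # 3. Transcribe, then clean up the temporary file.
    leopard = pvleopard.create(access_key=os.environ["ACCESS_KEY"])
    try:
        transcript = leopard.process_file(path)[0]  # recent pvleopard versions return (transcript, words)
    finally:
        leopard.delete()
        os.remove(path)

    # 4. Return the transcription.
    return {"statusCode": 200, "body": json.dumps({"transcript": transcript})}
```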
4. Press Upload from > .zip file and upload the zip file.
5. Once the function is uploaded, go to the Configuration > General configuration tab and press Edit.
6. Set the memory limit to 512 MB and the timeout to 30 seconds, then press Save.
7. Go to the Environment variables tab and press Edit.
8. Add the AccessKey you obtained from Picovoice Console and press Save.
9. Press Copy ARN (the button in the top-right corner) and save your function ARN; we will need it when setting up API Gateway. You can also come back and copy it later.
Setting up API Gateway
1. In the AWS Console, go to API Gateway and find the REST API section. Press Build to create a new REST API.
2. Select REST as the protocol, choose New API, and give your API a name. Press Create API.
3. Press Actions > Create Method and select POST as the method. Select Lambda Function as the integration type, tick Use Lambda Proxy integration, and paste the Lambda function ARN you saved earlier.
4. Once created, go to the Settings tab.
5. Scroll to the bottom. Under Binary Media Types, press Add Binary Media Type, add multipart/form-data, and press Save Changes.
6. Go back to the Resources tab. Press Actions > Deploy API. Set [New Stage] as the deployment stage, give it a stage name, and press Deploy.
You will be redirected to the Stages tab, where your API's invoke URL is shown. Now we can test our REST API.
Testing the API
We will use Postman to send an audio file and get the transcription.
1. Copy the invoke URL from API Gateway into the request URL field and set the request type to POST.
2. In the Body tab, set the data type to form-data. Set the key name to audio_file, set the type to File, press Select Files, and choose the audio file you want to transcribe. (Leopard supports the following audio formats: FLAC, MP3, Ogg, Opus, Vorbis, WAV, and WebM.)
3. Once the request goes through, the transcription will be shown in the response pane.
If you would like more details on a specific API request and Lambda invocation, go to CloudWatch > Log groups in the AWS Console.
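If you prefer a script over Postman, the same multipart request can be sent from Python's standard library. This is a sketch, not code from the repository; the invoke URL is a placeholder you must replace with your own, and the `audio_file` field name matches the one configured above.

```python
import json
import urllib.request
import uuid

# Placeholder: replace with your API's invoke URL from the Stages page.
API_URL = "https://YOUR_API_ID.execute-api.us-west-2.amazonaws.com/prod"


def build_multipart(field_name, filename, payload):
    """Encode one file as a multipart/form-data body, as Postman does."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path):
    """POST an audio file to the API and return the parsed JSON response."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("audio_file", path, f.read())
    req = urllib.request.Request(API_URL, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```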
Cost Estimation
Let's estimate how much it costs to process one hour of audio with this architecture. For simplicity, assume uniform 15-second files, so 240 API calls correspond to one hour of audio. Prices are taken from the API Gateway and Lambda pricing pages for the us-west-2 region.
API Gateway Cost
240 requests * $3.50/million = $0.00084 / hour
Lambda Request Cost
240 requests * $0.20/million = $0.00005 / hour
Lambda Compute Cost
A 15-second audio file takes around 3 seconds (3,000 ms) to process. With 512 MB of memory, the total compute cost is:
240 requests * 3 seconds = 720 seconds
0.5 GB * 720 seconds = 360 GB-seconds
360 GB-seconds * $0.0000166667 = $0.006 / hour
Total cost per hour: $0.00084 + $0.00005 + $0.006 ≈ $0.00689
Comparison with other Speech-to-Text services
We compare costs with Google Cloud STT and AWS Transcribe under the same setup as before.
Google Cloud STT: 240 files * $0.006 = $1.44 / hour
AWS Transcribe: (240 files * 15 seconds) = 60 minutes * $0.024/minute = $1.44 / hour
In this setup, serverless speech-to-text is noticeably cheaper than the existing managed services.
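The arithmetic above can be reproduced in a few lines of Python (prices as quoted above for us-west-2):

```python
# Cost per hour of audio, assuming 240 requests of 15-second clips.
requests_per_hour = 240

api_gateway = requests_per_hour * 3.5 / 1_000_000            # $3.50 per million requests
lambda_requests = requests_per_hour * 0.2 / 1_000_000        # $0.20 per million requests
lambda_compute = 0.5 * (requests_per_hour * 3) * 0.0000166667  # 512 MB for ~3 s each

serverless_total = api_gateway + lambda_requests + lambda_compute
print(f"Serverless: ${serverless_total:.5f} / hour")         # ≈ $0.00689

google_stt = requests_per_hour * 0.006                       # $0.006 per 15-second file
aws_transcribe = (requests_per_hour * 15 / 60) * 0.024       # $0.024 per minute
print(f"Google STT: ${google_stt:.2f}, AWS Transcribe: ${aws_transcribe:.2f}")
```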
Limitations & Future Explorations
API Gateway has a 29-second timeout and a 10 MB payload size limit; Lambda has a 15-minute timeout and a 6 MB payload size limit.
Hence, the audio file must be smaller than 4.5 MB (base64 encoding inflates the payload by about a third) and shorter than roughly 2 minutes so the transcription completes before the 29-second API Gateway timeout.
In this article, we focused on setting up Leopard for short audio transcription, but there are ways to go past these limitations:
- To transcribe audio files larger than 4.5MB, use S3 to get a pre-signed upload URL, upload your audio file, fetch it in Lambda, and process it.
- To transcribe audio files longer than 2 minutes (longer processing time), invoke Lambda asynchronously and poll for the result once the transcription finishes.
Conclusion
We have seen how to set up Leopard and integrate it with Lambda and API Gateway. Keep in mind that this setup is meant for transcribing smaller audio files. Since this is starter code, you can always modify it, re-zip it, and upload the zip file again to update your function.
Take a look at Leopard GitHub Repository or Leopard Docs Page to learn more about Leopard.