Implementing a lightweight transcription microservice with gRPC

Mohammadreza Rostam
Published in Picovoice
3 min read · Apr 29, 2022

Microservices have become a popular pattern in web application architecture, and for good reasons: building your app as a set of microservices can improve efficiency, portability, flexibility, and scalability. There are already a couple of fantastic articles on creating microservices using gRPC and Go (this and this). Instead of repeating them, I would like to build on them here and show you how to create a lightweight speech-to-text (STT) microservice.

An ideal STT microservice should be snappy and small so that it can be created and destroyed on demand at a low cost. Picovoice recently released its offline speech-to-text engine, Leopard. It is fast and has a small memory footprint [Speech-to-Text 2.0]. At the same time, its accuracy is comparable to that of cloud-based services according to its open-source benchmark. For this project, you only need a Picovoice account and the allowances of its Free Plan. No credit card is required.

Comparison of accuracy of Picovoice Speech-to-Text against leading cloud-based alternatives

To have a working gRPC microservice, three components are essential:

  1. A .proto file to define the gRPC services and messages
  2. A server to process the submitted audio and return the transcription
  3. A client to talk to the server

.proto file:

Let’s start by defining the interface in proto format.

To keep everything simple, only one RPC method (GetTranscriptionFile) is defined in the proto file. Since gRPC limits incoming messages to 4 MB by default, the method is declared as a client-side streaming RPC. This way, files are sent to the server in chunks of bytes instead of as one large message.
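To make the size argument concrete, here is a small back-of-the-envelope sketch in Go. It assumes the 1 MB chunk size that the client in this article uses; `chunkCount` and `defaultMaxRecvBytes` are names made up for this illustration and are not part of the demo or of gRPC's API.

```go
package main

import "fmt"

// defaultMaxRecvBytes mirrors gRPC's default 4 MB cap on one incoming message.
// The constant name is hypothetical; only the value comes from gRPC.
const defaultMaxRecvBytes = 4 * 1024 * 1024

// chunkCount returns how many stream messages are needed to send fileSize
// bytes when each message carries at most chunkSize bytes.
func chunkCount(fileSize, chunkSize int64) int64 {
	if fileSize <= 0 || chunkSize <= 0 {
		return 0
	}
	return (fileSize + chunkSize - 1) / chunkSize // ceiling division
}

func main() {
	const chunkSize = 1 * 1024 * 1024   // 1 MB per message
	fileSize := int64(10*1024*1024 + 1) // an audio file just over 10 MB

	fmt.Println("fits in one message:", fileSize <= defaultMaxRecvBytes) // false
	fmt.Println("messages needed:", chunkCount(fileSize, chunkSize))     // 11
}
```

Each 1 MB message stays well below the default limit, so the server does not need a larger receive limit no matter how long the recording is.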

The transcriptResponse message, which is sent back to the client, contains both the inferred transcription and a status code that tells the caller whether the call was successful.

Before we can import and use this service method in other files, we need to compile the .proto file with protoc. Since both the server and the client are written in Go, protoc is used to generate Go code.

Client:

On the client side, we first need to open a connection to the server and then create a client for the LeopardService service.

Inside the runsTranscriptFile function, the audio file is read in chunks of 1 MB and sent to the server, with a timeout of 10 seconds applied to the call. Finally, the stream is closed, and the server response is received by calling the CloseAndRecv function.
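The chunking loop can be sketched without any gRPC dependency. In the snippet below, `chunkSender`, `sendInChunks`, `recorder`, and `chunkSizes` are hypothetical names invented for this illustration; `chunkSender` only stands in for the `Send` step of the generated client stream and is not the demo's actual code. The 10-second timeout is modeled with a standard `context.WithTimeout`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// chunkSender is a hypothetical stand-in for the generated client stream;
// only the Send step is modeled here.
type chunkSender interface {
	Send(chunk []byte) error
}

// sendInChunks reads r in pieces of at most chunkSize bytes and forwards each
// piece to the stream, stopping early if ctx is cancelled or has timed out.
func sendInChunks(ctx context.Context, r io.Reader, s chunkSender, chunkSize int) error {
	if chunkSize <= 0 {
		return errors.New("chunkSize must be positive")
	}
	buf := make([]byte, chunkSize)
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			// Copy the bytes so the receiver never sees the reused buffer.
			chunk := append([]byte(nil), buf[:n]...)
			if sendErr := s.Send(chunk); sendErr != nil {
				return sendErr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // the whole input has been sent
		}
		if err != nil {
			return err
		}
	}
}

// recorder is an in-memory chunkSender used to exercise sendInChunks.
type recorder struct{ chunks [][]byte }

func (r *recorder) Send(chunk []byte) error {
	r.chunks = append(r.chunks, chunk)
	return nil
}

// chunkSizes streams s under a 10-second timeout and reports the size of
// every chunk that was sent.
func chunkSizes(s string, chunkSize int) ([]int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	rec := &recorder{}
	if err := sendInChunks(ctx, strings.NewReader(s), rec, chunkSize); err != nil {
		return nil, err
	}
	sizes := make([]int, len(rec.chunks))
	for i, c := range rec.chunks {
		sizes[i] = len(c)
	}
	return sizes, nil
}

func main() {
	sizes, err := chunkSizes("0123456789", 4)
	fmt.Println(sizes, err) // [4 4 2] <nil>
}
```

With real gRPC, the context carrying the timeout is passed when the stream is opened, and the reader would be the opened audio file with a chunk size of 1 MB rather than the 4 bytes used here for readability.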

Server:

On the server side, a gRPC server instance is created and the service is registered to answer LeopardService calls.

Once a transcription request is received, the server starts an instance of the Leopard engine and keeps reading the incoming bytes until EOF is received. After that, the bytes are stored as a temporary file, which is then passed to the Leopard engine. Finally, the transcription is sent back to the client along with a status code.
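The receive-until-EOF step might look roughly like the sketch below. It uses only the standard library: `chunkReceiver`, `spoolToTempFile`, `sliceStream`, and `spoolStrings` are hypothetical names for this illustration, and `chunkReceiver` merely imitates the `Recv` step of a server stream. The call into the Leopard engine is deliberately left out; the returned path is what would be handed to it.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// chunkReceiver is a hypothetical stand-in for the generated server stream:
// Recv returns the next chunk, or io.EOF once the client has closed its side.
type chunkReceiver interface {
	Recv() ([]byte, error)
}

// spoolToTempFile drains the stream into a temporary file and returns its
// path. The caller is responsible for removing the file afterwards.
func spoolToTempFile(s chunkReceiver) (string, error) {
	f, err := os.CreateTemp("", "audio-*")
	if err != nil {
		return "", err
	}
	for {
		chunk, err := s.Recv()
		if errors.Is(err, io.EOF) {
			break // the client finished sending
		}
		if err == nil {
			_, err = f.Write(chunk)
		}
		if err != nil {
			f.Close()
			os.Remove(f.Name())
			return "", err
		}
	}
	if err := f.Close(); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	return f.Name(), nil
}

// sliceStream replays a fixed list of chunks, then reports io.EOF.
type sliceStream struct{ chunks [][]byte }

func (s *sliceStream) Recv() ([]byte, error) {
	if len(s.chunks) == 0 {
		return nil, io.EOF
	}
	c := s.chunks[0]
	s.chunks = s.chunks[1:]
	return c, nil
}

// spoolStrings spools the given parts to disk and returns the file contents
// read back, so the round trip can be checked.
func spoolStrings(parts ...string) (string, error) {
	st := &sliceStream{}
	for _, p := range parts {
		st.chunks = append(st.chunks, []byte(p))
	}
	path, err := spoolToTempFile(st)
	if err != nil {
		return "", err
	}
	defer os.Remove(path) // the engine would read the file before this point

	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	got, err := spoolStrings("RIFF", "....", "data")
	fmt.Printf("%q %v\n", got, err)
}
```

Writing each chunk to disk as it arrives keeps the server's memory use bounded by the chunk size rather than by the length of the recording.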

We could also have sent the audio in raw (PCM) format and fed it directly to the Leopard engine without storing it on the drive, but there are two caveats. First, more preprocessing work would have to be done on the client side to decode the audio file. Second, the amount of data to be transferred in raw format is significantly larger than for compressed formats like MP3 or OGG.
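A quick calculation shows the scale of the second caveat. The numbers below are illustrative assumptions, not measurements from the demo: raw audio is taken to be 16 kHz, 16-bit, mono PCM, and the compressed file is taken to be a constant 64 kbit/s stream. The function names are made up for this sketch.

```go
package main

import "fmt"

// pcmBytes returns the size of raw PCM audio for the given duration.
func pcmBytes(seconds, sampleRate, bytesPerSample, channels int) int {
	return seconds * sampleRate * bytesPerSample * channels
}

// compressedBytes returns the approximate size of a constant-bitrate stream.
func compressedBytes(seconds, bitsPerSecond int) int {
	return seconds * bitsPerSecond / 8
}

func main() {
	// Assumed formats: 16 kHz, 16-bit, mono PCM versus a 64 kbit/s stream.
	raw := pcmBytes(60, 16000, 2, 1)
	compressed := compressedBytes(60, 64000)

	fmt.Println("raw PCM bytes per minute:   ", raw)            // 1920000
	fmt.Println("compressed bytes per minute:", compressed)     // 480000
	fmt.Println("ratio:                      ", raw/compressed) // 4
}
```

Under these assumptions, one minute of raw audio is about four times larger than its compressed counterpart, and the gap widens at lower bitrates.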

That’s all! Now you just need to run the server and client with proper input arguments.

The source code of this demo can be found in the Leopard GitHub repository. It can serve as a template or a starting point and be tailored to meet individual requirements. For instance, if you want to transcribe an audio file that is already on the internet, you can modify the demo to send a simple URL to the server instead of the entire file. Please refer to the readme.md to run the demo and see it in practice. To learn more about the Leopard Go APIs, take a look at the Leopard Docs page.
