How we use the Transcribe service at Preply

Artem Shananin
Preply Engineering Blog
4 min read · Mar 17, 2021

This fall Preply had a big release: we launched a new student homepage and new exercises for homework. I’d like to tell you about the Speaking module, an exercise that emulates a conversation between two people: the student listens to a speaker (an audio file is played) and then responds by speaking into a microphone.

My name is Artem Shananin and I’m a Software Engineer at Preply.com. Let’s get started!

Our team used the following approach to verify whether a student’s response is correct:

  1. Convert the audio to text (via the Amazon Transcribe service).
  2. Check that the transcribed text matches the expected answer (a simplified sketch of this check is shown below).
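
For step 2 we do not need anything sophisticated. The sketch below shows one simplified way to do the check; it is illustrative rather than our exact production logic, and normalize and matchesExpected are made-up names:

```typescript
// Illustrative sketch of the text-matching step, not our exact production logic.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, "") // drop punctuation, keep letters, digits, apostrophes
    .replace(/\s+/g, " ")              // collapse repeated whitespace
    .trim();
}

function matchesExpected(transcript: string, expected: string): boolean {
  return normalize(transcript) === normalize(expected);
}

// matchesExpected("Hello, how are you?", "hello how are you") -> true
```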

What is the Transcribe service?

Amazon Transcribe uses machine learning algorithms to convert audio files or audio streams into text. You can use this service for creating subtitles for video, enabling rich search over audio and video archives, or creating a speaking exercise (as we did at Preply 🙂).

With the help of Amazon Transcribe you can:

  • Work with multiple audio formats.
  • Transcribe streaming audio in real time.
  • Create custom vocabulary (a list of specific words that you want Amazon Transcribe to know in your audio input).
  • Identify different speakers in a conversation.

More details here.

The Transcribe service offers two approaches:

  1. Creating a background job that processes an audio file. It takes at least a minute to process even a small file.
  2. Streaming Transcription, which lets you send an audio stream and receive a stream of text back in real time. The processing latency is only a few seconds.

We value our customers and their time, so having them wait a minute was not an option; we went with the second option, Streaming Transcription.
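
For comparison, option 1 boils down to a single API call that kicks off a background job against a file in S3. A rough sketch with the AWS SDK for JavaScript v3 could look like this (the job name and S3 URI below are made up):

```typescript
import {
  TranscribeClient,
  StartTranscriptionJobCommand,
} from "@aws-sdk/client-transcribe";

const client = new TranscribeClient({ region: "us-east-1" });

// Start a background transcription job for an audio file already uploaded to S3.
// The result has to be polled with GetTranscriptionJob and arrives a minute or more later.
await client.send(
  new StartTranscriptionJobCommand({
    TranscriptionJobName: "speaking-exercise-demo", // hypothetical job name
    LanguageCode: "en-US",
    MediaFormat: "mp3",
    Media: { MediaFileUri: "s3://example-bucket/student-answer.mp3" }, // hypothetical file
  })
);
```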

How do we connect to the Transcribe service? In most situations, an access_key and access_secret are used for authentication, but we cannot store them on the Frontend, because the code is visible to users and the credentials could be compromised. That is why we decided to use a pre-signed URL. It is created on the Backend and contains user info and a cryptographically signed payload. Additionally, it is valid only for a limited amount of time, to protect against replay attacks.

Amazon uses Signature Version 4 to sign the request. More info can be found in Signing AWS API Requests.
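
Below is a rough sketch of how such a URL can be built on a Node backend with the Signature Version 4 helpers from the AWS SDK v3. This is not our exact production code; package names can differ between SDK versions, and the endpoint and query parameters follow the public Transcribe streaming documentation:

```typescript
import { SignatureV4 } from "@aws-sdk/signature-v4";
import { HttpRequest } from "@aws-sdk/protocol-http";
import { Sha256 } from "@aws-crypto/sha256-js";
import { formatUrl } from "@aws-sdk/util-format-url";

const region = "us-east-1";
const hostname = `transcribestreaming.${region}.amazonaws.com`;

const signer = new SignatureV4({
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
  region,
  service: "transcribe",
  sha256: Sha256,
});

// Describe the WebSocket request we want to pre-sign.
const request = new HttpRequest({
  protocol: "wss:",
  hostname,
  port: 8443,
  path: "/stream-transcription-websocket",
  headers: { host: `${hostname}:8443` },
  query: {
    "language-code": "en-US",
    "media-encoding": "pcm",
    "sample-rate": "16000",
  },
});

// Sign the request into query parameters; the URL stays valid for five minutes.
const signedRequest = await signer.presign(request, { expiresIn: 300 });
const presignedUrl = formatUrl(signedRequest); // this is what the Frontend receives
```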

The resulting pre-signed URL contains the following parameters:

  • language-code — The language code for the input audio.
  • media-encoding — The encoding used for the input audio. The only valid value is pcm.
  • sample-rate — The sample rate of the input audio, in Hertz.
  • X-Amz-Credential — A string separated by slashes (“/”) formed by concatenating your access key ID and your credential scope components. The credential scope includes the date in YYYYMMDD format, the AWS Region, the service name, and a special termination string (aws4_request).
  • X-Amz-Expires — The length of time, in seconds, until the credentials expire.
  • X-Amz-Signature — The Signature Version 4 signature that you generated for the request.
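
Put together, the URL the Frontend receives looks roughly like this (the key ID and signature below are placeholders, and the line breaks are only for readability):

```
wss://transcribestreaming.us-east-1.amazonaws.com:8443/stream-transcription-websocket
    ?language-code=en-US
    &media-encoding=pcm
    &sample-rate=16000
    &X-Amz-Algorithm=AWS4-HMAC-SHA256
    &X-Amz-Credential=AKIAEXAMPLE%2F20210317%2Fus-east-1%2Ftranscribe%2Faws4_request
    &X-Amz-Date=20210317T120000Z
    &X-Amz-Expires=300
    &X-Amz-SignedHeaders=host
    &X-Amz-Signature=<signature>
```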

More information is here.

We connect to the Transcribe service via WebSocket using the pre-signed URL, and the client starts sending a sequence of audio frames, each encoded with event stream encoding.

The transcription is returned to the application in a stream of transcription events.

Transcribe splits the incoming audio stream based on natural speech segments, such as a change in speaker or a pause in the audio. The transcription is returned progressively to the application, with each response containing more transcribed speech, until the entire segment is transcribed.
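
The browser-side sketch below follows the pattern from AWS’s public WebSocket demo. It assumes the @aws-sdk/eventstream-marshaller and @aws-sdk/util-utf8-node helper packages (newer SDK versions ship the same functionality under different names), and getPresignedUrl plus the microphone chunk source are hypothetical stand-ins for our own code:

```typescript
import { EventStreamMarshaller } from "@aws-sdk/eventstream-marshaller";
import { toUtf8, fromUtf8 } from "@aws-sdk/util-utf8-node";

// Hypothetical call to our Backend that returns the pre-signed wss:// URL.
declare function getPresignedUrl(): Promise<string>;

const eventStream = new EventStreamMarshaller(toUtf8, fromUtf8);

const socket = new WebSocket(await getPresignedUrl());
socket.binaryType = "arraybuffer";

// Wrap a chunk of raw PCM audio into an AudioEvent frame (event stream encoding).
function toAudioEventFrame(pcmChunk: Uint8Array): Uint8Array {
  return eventStream.marshall({
    headers: {
      ":message-type": { type: "string", value: "event" },
      ":event-type": { type: "string", value: "AudioEvent" },
      ":content-type": { type: "string", value: "application/octet-stream" },
    },
    body: pcmChunk,
  });
}

socket.onopen = () => {
  // In the real exercise the PCM chunks come from the microphone; each chunk is
  // marshalled and sent as soon as it is available, e.g.:
  // socket.send(toAudioEventFrame(pcmChunk));
};

// Every incoming message is one transcription event; partial results keep
// arriving for a segment until IsPartial becomes false.
socket.onmessage = (message: MessageEvent<ArrayBuffer>) => {
  const event = eventStream.unmarshall(new Uint8Array(message.data));
  if (event.headers[":message-type"].value !== "event") return;
  const body = JSON.parse(toUtf8(event.body));
  const result = body.Transcript?.Results?.[0];
  if (result) {
    console.log(result.IsPartial ? "partial:" : "final:", result.Alternatives[0].Transcript);
  }
};
```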

The illustrative example below shows how this looks: each line is a partial-result output for an audio segment being streamed, with the last line being the final result:
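
```
Hello
Hello how
Hello, how are
Hello, how are you
Hello, how are you?
```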

Even though the Transcribe service is quite accurate, it is not perfect and we still run into mistakes, which is an interesting challenge for our dev team to resolve.

In the next iterations we plan to:

  • Improve recognition of names and numbers;
  • Let students retry only the incorrectly recognized words instead of repeating the whole sentence.

We are big believers in 1:1 in-person learning, but at the same time practicing pronunciation is extremely helpful for language learners and we wanted to enable them to do it even outside the lessons.

This feature was one of the first baby steps for us in exploring the possibilities of AI-assisted learning. We enjoyed working with Amazon Transcribe, and our customers loved the feature: more than 4,500 dialogs took place in January 2021.

Do not hesitate to sign up for Preply and try it yourself.
