Create an Amazon Transcribe web app with AWS pre-signed URL

Jan Cibulka
7 min read · Oct 1, 2023


We will create a full-stack web app that transcribes audio from the user’s microphone.

When the recording starts, the front-end (vanilla JavaScript) asks the back-end (NestJS) to create a pre-signed WebSocket URL for AWS, through which the audio can be streamed to Amazon Transcribe Streaming.

Creating an AWS pre-signed URL with NestJS

Let’s first create the back-end.

On GitHub, you can find my complete front-end and back-end repositories for this project.

This guide assumes you have a basic knowledge of NestJS. If you need a refresher, read my NestJS Crash Course article.

We will start a new NestJS project:

$ npm i -g @nestjs/cli
$ nest new amazon-transcribe

We will create a new module with a service and a controller for generating a URL for the AWS WebSocket API:

$ nest generate module aws-signature
$ nest generate service aws-signature
$ nest generate controller aws-signature

The env variables

Create the .env file in the root directory of your project. You can read my previous article How to get AWS access keys if you need help obtaining your credentials. Fill the file with the relevant values:

AWS_REGION=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY_ID=

For NestJS to be able to read the .env file, we need to update the ./src/app.module.ts file:

import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { AwsSignatureModule } from './aws-signature/aws-signature.module';

@Module({
  imports: [ConfigModule.forRoot(), AwsSignatureModule],
})
export class AppModule {}

While we're at it, let's update the ./src/main.ts file as well, to enable CORS:

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  app.enableCors();
  await app.listen(8080);
}
bootstrap();

Now that the basic setup is complete, let's continue with creating the pre-signed URL for Amazon Transcribe Streaming.

The NestJS module

No changes should be necessary for the generated module ./src/aws-signature/aws-signature.module.ts. It should look like this:

import { Module } from '@nestjs/common';
import { AwsSignatureController } from './aws-signature.controller';
import { AwsSignatureService } from './aws-signature.service';

@Module({
  controllers: [AwsSignatureController],
  providers: [AwsSignatureService],
})
export class AwsSignatureModule {}

The NestJS controller

The controller ./src/aws-signature/aws-signature.controller.ts will handle an HTTP GET request to create a pre-signed URL for AWS Transcribe Streaming service. Copy the following script to the file. I will explain it below.

import { Controller, Get } from '@nestjs/common';
import { createHash } from 'crypto';
import { AwsSignatureService } from './aws-signature.service';

@Controller('aws-signature')
export class AwsSignatureController {
  constructor(private readonly awsSignatureService: AwsSignatureService) {}

  @Get()
  async transcribe(): Promise<string> {
    const awsEndpoint =
      'transcribestreaming.' + process.env.AWS_REGION + '.amazonaws.com:8443';

    return this.awsSignatureService.createPresignedURL(
      'GET',
      awsEndpoint,
      '/stream-transcription-websocket',
      'transcribe',
      createHash('sha256').update('', 'utf8').digest('hex'),
      {
        key: process.env.AWS_ACCESS_KEY_ID,
        secret: process.env.AWS_SECRET_ACCESS_KEY_ID,
        region: process.env.AWS_REGION,
        protocol: 'wss',
        expires: 300,
        query: 'language-code=en-US&media-encoding=pcm&sample-rate=44100',
      },
    );
  }
}

The @Controller('aws-signature') decorator sets up a controller with a base path of “/aws-signature”. Any requests that match this path will be routed to this controller.

The awsEndpoint variable is constructed as per the official docs; with AWS_REGION=us-east-1, for example, it resolves to transcribestreaming.us-east-1.amazonaws.com:8443.

This controller will handle GET requests by returning the pre-signed URL for AWS Transcribe Streaming generated by the AwsSignatureService service (which we will create in the next part of this article).

The controller calls the createPresignedURL() method of the awsSignatureService to generate the pre-signed URL. This method takes several parameters:

  • The HTTP method to use for the request (in this case, “GET”).
  • The endpoint for the AWS service.
  • The path to the resource to be accessed (“/stream-transcription-websocket”).
  • The service name (“transcribe”).
  • The hash of the payload (which is an empty string in this case).
  • An options object that includes the AWS access key ID, secret access key, region, protocol, expiration time, and any additional query parameters.

Finally, the pre-signed URL is returned from the transcribe() method, which will be sent back to the client as a response to the GET request.
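For reference, the resulting URL has roughly the following shape. The values here are illustrative, not real credentials; the signature itself is computed by the service (and the credential path is URL-encoded in practice):

wss://transcribestreaming.us-east-1.amazonaws.com:8443/stream-transcription-websocket
  ?language-code=en-US&media-encoding=pcm&sample-rate=44100
  &X-Amz-Algorithm=AWS4-HMAC-SHA256
  &X-Amz-Credential=AKIA.../20231001/us-east-1/transcribe/aws4_request
  &X-Amz-Date=20231001T120000Z
  &X-Amz-Expires=300
  &X-Amz-SignedHeaders=host
  &X-Amz-Signature=...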

The NestJS service

The code of the service is long and consists mostly of low-level AWS Signature Version 4 (SigV4) signing logic. I built it using the sparse information from the official docs and by digging through dozens of outdated materials online. The result works well, but there is no need to go through it here line by line.

Here is the finished service in my repo.
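If you would rather not maintain hand-rolled signing code, the AWS SDK ships a generic SignatureV4 helper that can produce an equivalent URL. The following is only a minimal sketch of that alternative, not the code from my repo, and it assumes the @aws-sdk/signature-v4, @aws-sdk/protocol-http, @aws-sdk/util-format-url, and @aws-crypto/sha256-js packages:

import { SignatureV4 } from '@aws-sdk/signature-v4';
import { HttpRequest } from '@aws-sdk/protocol-http';
import { formatUrl } from '@aws-sdk/util-format-url';
import { Sha256 } from '@aws-crypto/sha256-js';

// Sketch: pre-sign the Transcribe Streaming WebSocket request with the SDK helper
export const createPresignedUrlWithSdk = async (): Promise<string> => {
  const region = process.env.AWS_REGION ?? '';
  const hostname = `transcribestreaming.${region}.amazonaws.com`;

  const signer = new SignatureV4({
    credentials: {
      accessKeyId: process.env.AWS_ACCESS_KEY_ID ?? '',
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY_ID ?? '',
    },
    region,
    service: 'transcribe',
    sha256: Sha256,
  });

  // The same request the hand-written service signs: host, path, and query
  const request = new HttpRequest({
    protocol: 'wss:',
    hostname,
    port: 8443,
    path: '/stream-transcription-websocket',
    method: 'GET',
    headers: { host: `${hostname}:8443` },
    query: {
      'language-code': 'en-US',
      'media-encoding': 'pcm',
      'sample-rate': '44100',
    },
  });

  // presign() embeds the signature in the query string rather than in headers
  const signed = await signer.presign(request, { expiresIn: 300 });
  return formatUrl(signed);
};

Either way, once the back-end is running you can sanity-check the endpoint from a terminal:

$ curl http://localhost:8080/aws-signature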

The front-end

We will write the front-end in vanilla JavaScript.

On GitHub, you can find my complete front-end and back-end repositories for this project.

Create a new project or use something like CodeSandbox.

Install the necessary packages:

$ npm install microphone-stream @aws-sdk/eventstream-marshaller @aws-sdk/util-utf8-node axios

Import the required libraries and modules at the top of your code:

import MicrophoneStream from "microphone-stream";
import { EventStreamMarshaller } from "@aws-sdk/eventstream-marshaller";
import { fromUtf8, toUtf8 } from "@aws-sdk/util-utf8-node";
import axios from "axios";

Define constants and variables that will be used throughout the code (don’t forget to update the backendUrl if necessary):

const backendUrl = "http://localhost:8080/aws-signature"; // the back-end endpoint that returns the pre-signed WebSocket URL

const SAMPLE_RATE = 44100; // the sample rate of the audio stream
let inputSampleRate = undefined; // the actual sample rate of the audio stream
let sampleRate = SAMPLE_RATE; // the sample rate to use for the audio stream
let microphoneStream = undefined; // the audio stream
const eventStreamMarshaller = new EventStreamMarshaller(toUtf8, fromUtf8); // an object for marshalling and unmarshalling audio data

let socket; // the WebSocket connection to the endpoint
let transcript = ""; // the current transcription text

Now it’s time to define helper functions to convert and process audio data.

Convert audio data to PCM-encoded binary data:

export const pcmEncode = (input) => {
  var offset = 0;
  var buffer = new ArrayBuffer(input.length * 2);
  var view = new DataView(buffer);
  for (var i = 0; i < input.length; i++, offset += 2) {
    var s = Math.max(-1, Math.min(1, input[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
};
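A quick sanity check of what the encoder produces: each float sample in [-1, 1] becomes a little-endian signed 16-bit integer.

const view = new DataView(pcmEncode(new Float32Array([-1, 0, 1])));
console.log(view.getInt16(0, true), view.getInt16(2, true), view.getInt16(4, true));
// -32768 0 32767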

Downsample the audio data to a lower sample rate:

export const downsampleBuffer = (
  buffer,
  inputSampleRate = SAMPLE_RATE,
  outputSampleRate = 16000
) => {
  if (outputSampleRate === inputSampleRate) {
    return buffer;
  }

  var sampleRateRatio = inputSampleRate / outputSampleRate;
  var newLength = Math.round(buffer.length / sampleRateRatio);
  var result = new Float32Array(newLength);
  var offsetResult = 0;
  var offsetBuffer = 0;

  while (offsetResult < result.length) {
    var nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);

    var accum = 0,
      count = 0;

    for (var i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
      accum += buffer[i];
      count++;
    }

    result[offsetResult] = accum / count;
    offsetResult++;
    offsetBuffer = nextOffsetBuffer;
  }

  return result;
};
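For example, feeding it one second of 44.1 kHz audio and asking for 16 kHz shrinks the buffer by the ratio of the two rates:

const oneSecond = new Float32Array(44100); // one second of audio at 44.1 kHz
const downsampled = downsampleBuffer(oneSecond, 44100, 16000);
console.log(downsampled.length); // 16000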

Convert audio data to a format that can be sent over the WebSocket:

const getAudioEventMessage = (buffer) => {
  return {
    headers: {
      ":message-type": {
        type: "string",
        value: "event",
      },
      ":event-type": {
        type: "string",
        value: "AudioEvent",
      },
    },
    body: buffer,
  };
};

const convertAudioToBinaryMessage = (audioChunk) => {
  let raw = MicrophoneStream.toRaw(audioChunk);

  if (raw == null) return;

  let downsampledBuffer = downsampleBuffer(raw, inputSampleRate, sampleRate);
  let pcmEncodedBuffer = pcmEncode(downsampledBuffer);

  let audioEventMessage = getAudioEventMessage(Buffer.from(pcmEncodedBuffer));

  let binary = eventStreamMarshaller.marshall(audioEventMessage);

  return binary;
};

Create a microphone stream instance:

const createMicrophoneStream = async () => {
  microphoneStream = new MicrophoneStream();
  microphoneStream.on("format", (data) => {
    inputSampleRate = data.sampleRate;
  });
  microphoneStream.setStream(
    await window.navigator.mediaDevices.getUserMedia({
      video: false,
      audio: true,
    })
  );
};

Start recording

Finally, only the functions for starting and stopping the recording are left. This is the startRecording function:

export const startRecording = async (callback) => {
  if (microphoneStream) {
    stopRecording();
  }

  const { data: presignedUrl } = await axios.get(backendUrl);

  socket = new WebSocket(presignedUrl);
  socket.binaryType = "arraybuffer";
  transcript = "";

  socket.onopen = function () {
    if (socket.readyState === socket.OPEN) {
      microphoneStream.on("data", function (rawAudioChunk) {
        let binary = convertAudioToBinaryMessage(rawAudioChunk);
        if (binary) socket.send(binary); // skip chunks that produced no audio data
      });
    }
  };

  socket.onmessage = function (message) {
    let messageWrapper = eventStreamMarshaller.unmarshall(Buffer.from(message.data));
    let messageBody = JSON.parse(String.fromCharCode.apply(String, messageWrapper.body));
    if (messageWrapper.headers[":message-type"].value === "event") {
      let results = messageBody.Transcript?.Results;
      if (results?.length && !results[0]?.IsPartial) {
        const newTranscript = results[0].Alternatives[0].Transcript;
        console.log(newTranscript);
        callback(newTranscript + " ");
      }
    }
  };

  socket.onerror = function (error) {
    console.log("WebSocket connection error. Try again.", error);
  };

  createMicrophoneStream();
};

The function first checks if a microphoneStream exists and if so, it calls stopRecording() to stop the current recording.

Then, the function makes a GET request to backendUrl to get a presignedUrl. This URL is used to establish a WebSocket connection directly with Amazon Transcribe.

The socket binary type is set to arraybuffer, and the transcript variable is initialized as an empty string.

The function sets an onopen listener on the socket that listens for when the socket connection is open. When the connection is open, it listens for data from the microphoneStream and sends it to the server in binary format.

The function also sets an onmessage listener on the socket to receive messages from the server. The received message is unmarshalled, and if the message type is an event, the function extracts the transcript from the message and calls the callback function with the new transcript.

If an error occurs in the WebSocket connection, the function logs the error to the console. Finally, the function calls createMicrophoneStream() to create a new microphone stream.
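One detail the code above does not handle: Transcribe reports service errors (for example, an expired URL) over the same socket as exception events. A small extension inside the onmessage handler could surface them; this sketch assumes the same messageWrapper and messageBody variables as above:

// Sketch: surfacing Transcribe exception events inside socket.onmessage
if (messageWrapper.headers[":message-type"].value === "exception") {
  // Exception events carry a human-readable Message in the body
  console.error("Transcribe error:", messageBody.Message);
}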

Stop recording

The stopRecording function is responsible for stopping the audio recording process so that it doesn't continue running in the background and consuming unnecessary resources.

export const stopRecording = () => {
  if (microphoneStream) {
    console.log("Recording stopped");
    microphoneStream.stop();
    microphoneStream.destroy();
    microphoneStream = undefined;
  }
};

The function first checks if microphoneStream exists. If it does, it logs a message saying "Recording stopped". Then it stops the audio recording by calling the stop() method on microphoneStream. After that, it destroys the microphone stream by calling the destroy() method, and sets microphoneStream to undefined. This ensures that the microphone stream is completely stopped and removed from memory.
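Note that stopRecording does not touch the WebSocket. Per the Transcribe streaming docs, a session is ended by sending an empty audio chunk, so a possible extension (my own addition, not part of the repo) would be:

// Sketch: signal end-of-stream to Transcribe and close the socket
if (socket && socket.readyState === socket.OPEN) {
  const emptyFrame = eventStreamMarshaller.marshall(getAudioEventMessage(Buffer.from([])));
  socket.send(emptyFrame); // an empty audio event tells Transcribe we're done
  socket.close();
}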

User interface

Here is a simple user interface that can make use of the JavaScript we have built (the bare module imports assume a bundler, such as the one CodeSandbox provides):

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>Speech-to-Text Demo</title>
  </head>
  <body>
    <button id="start-button">Start Recording</button>
    <button id="stop-button">Stop Recording</button>
    <div id="transcript"></div>
    <script type="module">
      import { startRecording, stopRecording } from "./speech-to-text.js";

      const startButton = document.querySelector("#start-button");
      const stopButton = document.querySelector("#stop-button");
      const transcriptDiv = document.querySelector("#transcript");

      startButton.addEventListener("click", () => {
        startRecording((newTranscript) => {
          transcriptDiv.innerText += newTranscript;
        });
      });

      stopButton.addEventListener("click", () => {
        stopRecording();
      });
    </script>
  </body>
</html>

Conclusion

This article showed how to create an Amazon Transcribe web app using a pre-signed AWS WebSocket URL.

On GitHub, you can find my complete front-end and back-end repositories for this project.

If you found it helpful, please click the clap 👏 button. Also, feel free to comment! I’d be happy to help :)
