Generating transcripts for an m3u8 video stream using Google Cloud Speech-to-Text
m3u8 is the playlist format used by HTTP Live Streaming (HLS), one of the most popular methods for live streaming. An m3u8 file is simply a list of URLs pointing to small .ts (MPEG transport stream) files, and these .ts files contain the actual video. For a live stream, the m3u8 file keeps being updated with new URLs, so the viewer sees a continuous stream.
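To make the "list of URLs" idea concrete, here is a minimal sketch that pulls the segment URLs out of a playlist's text. The playlist contents, the example.com base URL and the parseSegmentUrls helper are all made up for illustration; a real player does much more (tag parsing, periodic reloads, variant playlists).

```javascript
// Minimal sketch: extract segment URLs from the text of an m3u8 media playlist.
// parseSegmentUrls is a hypothetical helper, not part of any library.
function parseSegmentUrls(playlistText, playlistUrl) {
  return playlistText
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Lines starting with '#' are tags or comments; the remaining lines are segment URIs.
    .filter((line) => line.length > 0 && !line.startsWith('#'))
    // Relative URIs are resolved against the URL of the playlist itself.
    .map((uri) => new URL(uri, playlistUrl).href);
}

// Made-up playlist text, shaped like a small live media playlist.
const samplePlaylist = [
  '#EXTM3U',
  '#EXT-X-VERSION:3',
  '#EXT-X-TARGETDURATION:6',
  '#EXT-X-MEDIA-SEQUENCE:120',
  '#EXTINF:6.0,',
  'segment120.ts',
  '#EXTINF:6.0,',
  'segment121.ts',
].join('\n');

const segmentUrls = parseSegmentUrls(samplePlaylist, 'https://example.com/live/stream.m3u8');
console.log(segmentUrls);
```

In this post we never parse the playlist ourselves, because FFmpeg reads the m3u8 URL directly and follows the segments for us.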
Closed captioning / subtitling is becoming an important aspect of video streaming. Captions appear onscreen simultaneously with the audio and video and follow the same timing. They exist within the video player and, generally speaking, can’t be referenced outside of the video.
A transcript is the same word-for-word content as captions, but presented in a separate document, whether it is a text file, word processing document, PDF, or web page.
In this post, we are going to learn how to generate transcripts for an m3u8 video stream using the Google Cloud Speech-to-Text API.
Given an audio file, the Google Cloud Speech-to-Text API can give us the transcript along with a timestamp for each word.
The Speech-to-Text API provides 3 options for different transcription tasks:
1] Synchronous request (recognize): for short audio files (less than ~1 minute)
2] Asynchronous request (longRunningRecognize): for longer audio files (longer than 1 minute)
3] Streaming request (streamingRecognize): lets you stream audio to Cloud Speech-to-Text and receive speech recognition results in real time as the audio is processed.
To transcribe a live m3u8 stream (which is effectively endless), streamingRecognize is the best option.
Streaming recognition works like this:
1) First, you need to establish a gRPC connection.
2) Then, in the first request, you need to send the config object.
3) In subsequent requests, you need to send the actual data in the form of audio chunks.
Remember that the audio chunks you send must match the config object. For example, if the encoding in the config object is FLAC, then the audio chunks must be FLAC encoded.
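The ordering of the requests can be sketched as follows. This is only an illustration of the message order under my assumptions: buildStreamingRequests is a hypothetical helper, the chunks are stand-in bytes rather than real FLAC audio, and the field names streamingConfig and audioContent are the camelCase forms of the fields in the API's streaming request message.

```javascript
// Sketch of the message order on a streaming recognition call:
// exactly one config message first, then audio-only messages.
// buildStreamingRequests is a hypothetical helper, not a Google library function.
function* buildStreamingRequests(streamingConfig, audioChunks) {
  yield { streamingConfig };        // first request: configuration only
  for (const chunk of audioChunks) {
    yield { audioContent: chunk };  // later requests: audio bytes only
  }
}

const streamingConfig = {
  config: { encoding: 'FLAC', sampleRateHertz: 48000, languageCode: 'en-US' },
  interimResults: true,
};

// Stand-in bytes; real chunks would be FLAC audio matching the config above.
const fakeChunks = [Buffer.from([1, 2, 3]), Buffer.from([4, 5])];

const requests = [...buildStreamingRequests(streamingConfig, fakeChunks)];
console.log(requests.map((request) => Object.keys(request)[0]));
```

With the Node.js client used in this post you do not build these messages by hand: as far as I can tell, you pass the config to streamingRecognize and then write raw audio chunks to the returned stream, and the client takes care of the ordering.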
Before moving on to Google Speech-to-Text, let's first get audio chunks out of the m3u8 stream. For this purpose, we are going to use FFmpeg, a great tool for processing multimedia. For Node.js, the fluent-ffmpeg package is available, which is a wrapper on top of FFmpeg.
Node.js code:
// imports
const ffmpeg = require('fluent-ffmpeg');
const { Transform } = require('stream');

// pass-through stream: logs the size of each audio chunk and forwards it unchanged
const dest = new Transform({
  transform: (chunk, enc, next) => {
    console.log('chunk coming', chunk.length);
    next(null, chunk);
  }
}).on('data', (data) => {}); // no-op consumer keeps the stream flowing

// sample m3u8 stream
const livestream_endpoint = 'https://fuel-streaming-prod01.fuelmedia.io/v1/sem/40f15d3b-689f-48a5-9c61-a4f9583ed619.m3u8';

const command = ffmpeg(livestream_endpoint)
  .on('start', () => {
    console.log('ffmpeg : processing Started');
  })
  .on('progress', (progress) => {
    console.log('ffmpeg : Processing: ' + progress.targetSize + ' KB converted');
  })
  .on('end', () => {
    console.log('ffmpeg : Processing finished !');
  })
  .on('error', (err) => {
    console.log('ffmpeg : ffmpeg error :' + err.message);
  })
  .format('flac')
  .audioCodec('flac')
  .output(dest);

command.run();
The above code will produce audio chunks in FLAC format from the m3u8 video stream. The chunks are written to dest, which is a Transform stream (both readable and writable).
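To see the pass-through pattern on its own, without FFmpeg, here is a small sketch that uses only Node's built-in stream module. The createByteCounter helper and its bytesSeen property are names I made up for this example, and the two buffers stand in for the audio chunks.

```javascript
const { Readable, Transform } = require('stream');

// createByteCounter is a hypothetical helper used only for this sketch.
// It returns a Transform that inspects each chunk and forwards it unchanged.
function createByteCounter() {
  const counter = new Transform({
    transform(chunk, encoding, callback) {
      this.bytesSeen += chunk.length; // observe the chunk
      callback(null, chunk);          // forward it unchanged
    },
  });
  counter.bytesSeen = 0;
  return counter;
}

const counter = createByteCounter();

// Stand-in source: two small buffers instead of FFmpeg output.
Readable.from([Buffer.from('abc'), Buffer.from('defgh')])
  .pipe(counter)
  .on('data', () => {}) // consume the output so the stream keeps flowing
  .on('end', () => console.log('bytes seen:', counter.bytesSeen));
```

The transform callback is the natural place to hand each chunk to another consumer, which is exactly where the audio gets forwarded to the speech API.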
Now that we have the audio chunks, we have to pass them to the Google Cloud Speech-to-Text API so that it can give us the transcripts.
Nodejs Code :-
Note: These samples require that you have set up gcloud and have created and activated a service account. For information about setting up gcloud, and also creating and activating a service account, see Speech To Text Quickstart.
// imports
const ffmpeg = require('fluent-ffmpeg');
const { Transform } = require('stream');
const fs = require('fs');

// Google Speech-to-Text
const speech = require('@google-cloud/speech');

// creates a speech client
const speechClient = new speech.v1p1beta1.SpeechClient();

let recognizeStream, timeout;
const streamingLimit = 210000; // 3.5 minutes

const configurations = {
  config: {
    encoding: 'FLAC',
    sampleRateHertz: 48000,
    languageCode: 'en-US',
    model: 'default',
    audioChannelCount: 2,
    enableWordTimeOffsets: true,
  },
  interimResults: true,
};

function startStream() {
  console.log('started recognition stream');
  recognizeStream = speechClient
    .streamingRecognize(configurations)
    .on('error', (err) => {
      console.log(err);
    })
    .on('data', speechCallback);
  // restarting stream every 3.5 mins for infinite streaming
  timeout = setTimeout(restartStream, streamingLimit);
}

function speechCallback(stream) {
  // skip responses that carry no transcript
  const result = stream.results && stream.results[0];
  if (!result || !result.alternatives || !result.alternatives[0]) {
    return;
  }
  const stdoutText = result.alternatives[0].transcript;
  if (result.isFinal) {
    console.log('Final Result : ', stdoutText);
    // one final result per line
    fs.appendFile('transcripts.txt', stdoutText + '\n', (err) => {
      if (err) console.log(err);
    });
  } else {
    console.log('Interim Result : ', stdoutText);
  }
}

function restartStream() {
  clearTimeout(timeout);
  if (recognizeStream) {
    recognizeStream.removeListener('data', speechCallback);
    recognizeStream.destroy();
    recognizeStream = null;
  }
  // open a fresh recognition stream so transcription continues
  startStream();
}

// pass-through stream: forwards each audio chunk to the recognition stream
const dest = new Transform({
  transform: (chunk, enc, next) => {
    if (recognizeStream) {
      recognizeStream.write(chunk);
    }
    console.log('chunk coming', chunk.length);
    next(null, chunk);
  }
}).on('data', (data) => {}); // no-op consumer keeps the stream flowing

// sample m3u8 stream
const livestream_endpoint = 'https://fuel-streaming-prod01.fuelmedia.io/v1/sem/40f15d3b-689f-48a5-9c61-a4f9583ed619.m3u8';

const command = ffmpeg(livestream_endpoint)
  .on('start', () => {
    startStream();
    console.log('ffmpeg : processing Started');
  })
  .on('progress', (progress) => {
    console.log('ffmpeg : Processing: ' + progress.targetSize + ' KB converted');
  })
  .on('end', () => {
    console.log('ffmpeg : Processing finished !');
  })
  .on('error', (err) => {
    console.log('ffmpeg : ffmpeg error :' + err.message);
  })
  .format('flac')
  .audioCodec('flac')
  .output(dest);

command.run();
The above code will generate transcripts for the m3u8 video stream in the transcripts.txt file.
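Because the config sets enableWordTimeOffsets to true, each final result should also carry per-word timings next to the transcript. The sketch below shows one way to turn them into readable lines. The durationToSeconds and formatWordTimings helpers are hypothetical, and sampleResult is a hand-written object shaped like a final result, not real API output.

```javascript
// Convert a duration-like object ({ seconds, nanos }) into fractional seconds.
// `seconds` may arrive as a number, a string or a Long-like object, so it is coerced.
function durationToSeconds(duration) {
  if (!duration) return 0;
  return Number(duration.seconds || 0) + (duration.nanos || 0) / 1e9;
}

// Build one "start-end word" line per recognized word of the top alternative.
function formatWordTimings(result) {
  const top = result.alternatives && result.alternatives[0];
  const words = (top && top.words) || [];
  return words.map((w) => {
    const start = durationToSeconds(w.startTime).toFixed(2);
    const end = durationToSeconds(w.endTime).toFixed(2);
    return `${start}-${end} ${w.word}`;
  });
}

// Hand-written sample shaped like a final streaming result (not real API output).
const sampleResult = {
  isFinal: true,
  alternatives: [{
    transcript: 'hello world',
    words: [
      { word: 'hello', startTime: { seconds: '0', nanos: 200000000 }, endTime: { seconds: '0', nanos: 700000000 } },
      { word: 'world', startTime: { seconds: '1', nanos: 0 }, endTime: { seconds: '1', nanos: 500000000 } },
    ],
  }],
};

console.log(formatWordTimings(sampleResult));
```

Calling formatWordTimings on the final result inside speechCallback would give a word-level timeline, which is the raw material for subtitles.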
You can read the following posts for further extension:
- Getting closed captions for the m3u8 video stream using Google Speech-to-Text: for getting proper subtitles (which involve timestamps)
- How to get rid of the "Long time elapsed without audio" and "Resource exhausted" errors in Google Speech-to-Text on an m3u8 video stream
