Convert speech from an audio file to text using Google Speech API

Published in cod3, Jan 24, 2018

The backstory

I had to transcribe messages recorded from an iPhone in the m4a format, with a duration of 30 seconds to a couple of minutes, to text. After trying several APIs, I found the one from Google, without much surprise, to be the most accurate.

If you own an Android phone, you can expect about the same accuracy as when you use the “OK Google” application to send an SMS message: something like 80% hit, 20% gibberish. If you have ever used the YouTube auto-caption feature, you know it could be worse.

This created the following issues for the project:

The expected input format for the Google Speech API is LINEAR16 PCM (.wav), not m4a.
Audio files that last more than 1 minute must be uploaded to Google Storage; you can’t send them to the Google Speech API directly.

The tasks

Let’s split the problem into simple tasks:

Create a Google Cloud account
Convert the audio file to LINEAR16
Upload the converted file to Google Storage
Send the uploaded file to the Speech API
Display the transcription in the console

Create a Google Cloud account

I won’t go into details for this: just go to https://cloud.google.com/ and follow the steps for signing up. You will have to create a service account for your application, so follow the instructions at Creating a Service Account.

As instructed, save the JSON file generated with your credentials locally, and create a copy at the root of your project (as credentials.json for example).
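Before wiring the credentials into the SDK, it can help to check that the copied file is a usable service account key. Here is a minimal sketch, assuming the usual key file fields (type, project_id, private_key, client_email); the function names are hypothetical and not part of any Google SDK.

```javascript
'use strict';
const fs = require('fs');

// Fields expected in a service account key file (assumption: standard key layout).
const REQUIRED_FIELDS = ['type', 'project_id', 'private_key', 'client_email'];

// Returns the required fields that are missing from a parsed key object.
function missingCredentialFields(credentials) {
    return REQUIRED_FIELDS.filter(field => !credentials || !credentials[field]);
}

// Reads a key file and throws if a required field is absent.
function loadCredentials(filePath) {
    const credentials = JSON.parse(fs.readFileSync(filePath, 'utf8'));
    const missing = missingCredentialFields(credentials);
    if (missing.length > 0) {
        throw new Error(`Credentials file is missing: ${missing.join(', ')}`);
    }
    return credentials;
}
```

Calling loadCredentials('./credentials.json') at startup fails early with a readable message instead of a later authentication error.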

Convert the audio file to LINEAR16

The Google Speech API documentation specifies that the expected input format is LINEAR16, but never really explains concretely how to convert an existing file to this format.

My knowledge about audio formats is really limited, but after a little while I found this conversation and more specifically, this comment by droidha…@gmail.com:

And here is an example for those that might be using ffmpeg to get audio from video sources or just using ffmpeg for conversion
ffmpeg foo.mp4 -f s16le -acodec pcm_s16le -vn -ac 1 -ar 16k foo.raw
the point to note is that its is s16le (little endian (intel) byte ordering) and this also down mixes to mono with -ac 1 and the -ar (audio rate) is 16k

Edit (2019–09–18): It seems that the ffmpeg command is a bit off, see the answer below if you want to try it from the command line:

I believe your ffmpeg command is a bit off as it is missing the -i and should be:

ffmpeg -i foo.mp4 -f s16le -acodec pcm_s16le -vn -ac 1 -ar 16k foo.raw

Though I’m still not sure if the last line really means something or was auto-generated by a Twitter bot, I downloaded the FFmpeg binary for my platform, tried the command and it worked!
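The flags in that command fix the sample layout (16-bit samples, one channel, 16,000 samples per second), so the duration of a raw file follows directly from its size. The sketch below uses that arithmetic to tell whether a converted file exceeds the 1-minute limit mentioned earlier; the function names are made up for illustration.

```javascript
const SAMPLE_RATE_HZ = 16000; // -ar 16k
const BYTES_PER_SAMPLE = 2;   // s16le: signed 16-bit samples
const CHANNELS = 1;           // -ac 1

// Duration in seconds of a headerless LINEAR16 file of the given size.
function rawDurationSeconds(byteLength) {
    return byteLength / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS);
}

// True when the audio lasts more than 1 minute and has to go through Google Storage.
function needsCloudStorage(byteLength) {
    return rawDurationSeconds(byteLength) > 60;
}
```

In practice the size would come from something like fs.statSync('foo.raw').size. At these settings one second of audio takes 32,000 bytes, so a one-minute message is a little under 2 MB.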

So I had to code a node.js version of this command. I found a node module that wraps FFmpeg (fluent-ffmpeg). It’s pretty well done but requires a specific installation of FFmpeg.

After copying and tweaking bits from similar projects, I came up with the following code:

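'use strict';

const ffmpeg = require('fluent-ffmpeg');
const mime = require('mime');
const fs = require('fs');

// Converts an audio file to raw LINEAR16 (16-bit little-endian PCM, mono, 16 kHz).
// Resolves with the output path once FFmpeg has finished.
module.exports = (filePathIn, filePathOut) => new Promise((resolve, reject) => {
    // Errors thrown synchronously in the executor reject the Promise.
    if (!filePathIn || !filePathOut) {
        throw new Error('You must specify a path for both input and output files.');
    }
    if (!fs.existsSync(filePathIn)) {
        throw new Error('Input file must exist.');
    }
    // mime.lookup() is the mime 1.x API; mime 2 renamed it to getType().
    const mimeType = mime.lookup(filePathIn) || '';
    if (mimeType.indexOf('audio') === -1) {
        throw new Error('File must have audio mime.');
    }
    ffmpeg()
        .input(filePathIn)
        .outputOptions([
            '-f s16le',
            '-acodec pcm_s16le',
            '-vn',
            '-ac 1',
            '-ar 16k',
            '-map_metadata -1'
        ])
        // Register the handlers before starting so that a failure rejects the Promise.
        .on('error', reject)
        .on('end', () => resolve(filePathOut))
        .save(filePathOut);
});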
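For comparison, the same conversion can be sketched without the wrapper module by spawning the ffmpeg binary directly. This is only an illustration, not what the linear16 package does: it assumes ffmpeg is available on the PATH, and the function names are hypothetical.

```javascript
const { spawn } = require('child_process');

// Builds the argument list for the corrected command quoted earlier in the post.
function buildFfmpegArgs(filePathIn, filePathOut) {
    return [
        '-i', filePathIn,
        '-f', 's16le',
        '-acodec', 'pcm_s16le',
        '-vn',
        '-ac', '1',
        '-ar', '16k',
        filePathOut
    ];
}

// Runs ffmpeg (assumed to be on the PATH) and resolves with the output path.
function convertWithSpawn(filePathIn, filePathOut) {
    return new Promise((resolve, reject) => {
        // Output is ignored so a chatty ffmpeg cannot fill an unread pipe.
        const child = spawn('ffmpeg', buildFfmpegArgs(filePathIn, filePathOut), { stdio: 'ignore' });
        child.on('error', reject); // e.g. the ffmpeg binary was not found
        child.on('close', code => {
            if (code === 0) {
                resolve(filePathOut);
            } else {
                reject(new Error(`ffmpeg exited with code ${code}`));
            }
        });
    });
}
```

Passing each flag and its value as separate array items avoids shell quoting issues with file names that contain spaces. Add '-y' to the list if an existing output file should be overwritten.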

Since I crave geek-cred, I published the source on Github and a package on npm:

npm i --save linear16

Upload the converted file to Google Storage

First, you will have to create a “bucket”, a place where your files will be uploaded. You can do so through the SDK, but I did it manually. It’s very intuitive, but here’s a tutorial if you need some help.

The Google node SDK is pretty easy to use: authenticate using your project id and your credentials file, select the bucket where you want to upload your files, and specify the path of the file to upload:

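// Factory-style initialization, as used by the SDK version current when this post was written.
const gcs = require('@google-cloud/storage')({
    projectId: 'your-projectid-12345',
    keyFilename: './credentials.json'
});

const bucket = gcs.bucket('your-bucket-name');

// Uploads a local file to the bucket and resolves with the stored file object.
module.exports = filePath => new Promise((resolve, reject) =>
    bucket.upload(filePath, function (err, file) {
        if (err) {
            reject(err);
        } else {
            resolve(file);
        }
    })
);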

Don’t forget to replace “your-projectid-12345” and “your-bucket-name” in this script.

Once uploaded, the files are accessible throughout the Google API ecosystem via a special path, following this pattern:

gs://your-bucket-name/your-file-name.ext
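Since this path is plain string concatenation, a tiny helper keeps the pattern in one place. This is a sketch with a hypothetical function name; it is not part of the Google SDK.

```javascript
// Builds the gs:// path of an uploaded file from its bucket and file name.
function toGsUri(bucketName, fileName) {
    if (!bucketName || !fileName) {
        throw new Error('Both a bucket name and a file name are required.');
    }
    return `gs://${bucketName}/${fileName}`;
}

console.log(toGsUri('your-bucket-name', 'your-file-name.ext'));
```

The file name would typically be the name property of the file object returned by the upload step.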

Send the uploaded file to the Google Speech API

The Speech API is also pretty easy to use: we have to specify the input format and call the startRecognition method. A pretty interesting feature of the Speech API is that it recognizes a wide range of languages.

const speechClient = require('@google-cloud/speech')({
    projectId: 'your-projectid-12345',
    keyFilename: './credentials.json'
});

const options = {
    'languageCode': 'en-US',
    // Must match the rate used for the conversion (-ar 16k, i.e. 16000 Hz).
    'sampleRate': 16000,
    'encoding': 'LINEAR16'
};

// Starts a long-running recognition on a gs:// file and resolves with the results.
module.exports = fileName =>
    new Promise((resolve, reject) => {
        speechClient.startRecognition(fileName, options, function (err, operation) {
            if (err) {
                return reject(err);
            }
            operation
                .on('error', function (err) {
                    return reject(err);
                })
                .on('complete', function (results) {
                    return resolve(results);
                });
        });
    });

Don’t forget to replace “your-projectid-12345” in this script.

Since we’re dealing with an asynchronous request, the use of Promises is pretty fitting. Once the transcription is ready, the Promise will resolve with the resulting text.
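Wrapping a callback API in a Promise by hand works fine, but for plain (err, result) callbacks Node 8 and later offer util.promisify as a shortcut. The sketch below demonstrates it on fakeUpload, a made-up stand-in for a callback-style SDK method, so that it runs without any Google dependency.

```javascript
const { promisify } = require('util');

// Hypothetical stand-in for a callback-style method such as bucket.upload.
function fakeUpload(filePath, callback) {
    setImmediate(() => {
        if (!filePath) {
            callback(new Error('No file path given.'));
        } else {
            callback(null, { name: filePath });
        }
    });
}

// promisify turns the (err, result) callback into a returned Promise.
const uploadAsync = promisify(fakeUpload);

uploadAsync('output.wav').then(file => console.log(`Stored ${file.name}`));
```

With a real SDK object the method has to be bound first, for example promisify(bucket.upload.bind(bucket)). This shortcut only covers single-callback calls; the event-based operation returned by startRecognition still needs the manual wrapper.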

Display the transcription in the console

At first I just displayed the text in the console, but since the transcription takes a while (up to 1 minute for a minute-long file), I wanted a visual indicator that everything was working correctly.

The simplest node module I found for this is the Spinner from clui, which is an awesome pack of command-line UI components.

The result looks something like this:

I also used chalk to add some colours to the output.
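One detail worth handling is stopping the spinner when a step fails, otherwise it keeps animating over the error message. Here is a small sketch of a helper that does this for any object exposing start() and stop(), which is the part of the clui Spinner interface used in this post; the helper name withSpinner is hypothetical.

```javascript
// Runs a task while a spinner is shown, and always stops the spinner afterwards.
// `spinner` is any object with start() and stop(), e.g. a clui Spinner.
function withSpinner(spinner, task) {
    spinner.start();
    return Promise.resolve()
        .then(task)
        .then(
            result => {
                spinner.stop();
                return result;
            },
            err => {
                spinner.stop();
                throw err; // Keep the rejection so callers can still handle it.
            }
        );
}
```

With clui this could be called as withSpinner(new Spinner('Transcribing...'), () => cloudSpeech(uri)), where cloudSpeech and uri stand for the transcriber module and the gs:// path from the previous sections.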

Putting it all together

I saved each component as a module:

I published the LINEAR16 converter as the linear16 package on npm, so I just have to call require('linear16') to use it.
I saved the Storage uploader as libs/cloud-storage.js and the Speech transcriber as libs/cloud-speech.js.
And I created an index.js main file to orchestrate all the operations.

I think the result is pretty simple to follow, and it works!

'use strict';

const linear16 = require('linear16');
const Spinner = require('clui').Spinner;
const cloudStore = require('./libs/cloud-storage');
const cloudSpeech = require('./libs/cloud-speech');
const path = require('path');
const chalk = require('chalk');

try {
    const countdown = new Spinner(`Starting...`);
    countdown.start();

    const params = {
        input: './input/input.m4a',
        output: './output/output.wav'
    };

    // Each step resolves with the input of the next one.
    Promise.resolve(params)
        .then(paths => {
            countdown.message(`Converting ${path.basename(paths.input)} to ${path.basename(paths.output)}...`);
            return linear16(paths.input, paths.output);
        })
        .then(wavFile => {
            countdown.message(`Storing ${path.basename(wavFile)}...`);
            return cloudStore(wavFile);
        })
        .then(storageFile => {
            countdown.message(`Transcribing ${storageFile.name}...`);
            // 'messages-audio' is my bucket name; replace it with yours.
            return cloudSpeech('gs://messages-audio/' + storageFile.name);
        })
        .then(transcription => {
            countdown.stop();
            console.log(chalk.green(transcription));
        })
        .catch(err => {
            // Stop the spinner so it does not keep running over the error output.
            countdown.stop();
            console.error(err);
        });
} catch (err) {
    console.log(chalk.red(err.message));
    console.error(err);
}

I hope you enjoyed this post. Don’t hesitate to contact me on Twitter if you have any questions or comments!

Written by Julien