Tutorial: Asynchronous Speech Recognition in Python

A (fairly) simple technique for using Google’s kinda-sorta-really confusing Speech Recognition API

Let’s face it: it’s hard to compete with Google’s machine learning models. The company has so much data that it blows the competition out of the water as far as accuracy and quality are concerned. Unfortunately, Google hasn’t done the best job of providing easily digestible and up-to-date documentation for its APIs, making it tricky for beginner and intermediate programmers to get started.

I ran into this problem recently when trying to use its Speech Recognition API to transcribe around 1,200 news broadcasts. Because Google has recently changed its cloud API, many of the examples I found around the web were not very helpful. Even when I updated the cloud SDK, I still ran into problems simply trying to run their sample code.

One alternative was to use the SpeechRecognition library, but as far as I can tell, it only works for synchronous requests, which are limited to one minute in duration. Perhaps a better programmer could have found a solution while still using Google’s Python API, but you’re stuck with me. :)

To complete this tutorial, you’ll need the following tools:

  • Python 2.7
  • A Google account
  • Sox
  • A wav file (download here)

Activating Cloud Services

Go to the Google Cloud homepage and sign up for a free trial. You’ll get $300 in free credit simply for signing up.

1) Create a “project” to store your credentials and billing information

2) Enable Google Speech API and follow the prompt to activate billing. Don’t worry — you won’t be charged until you upgrade to a paid account.

3) Create an API key and store it for later.

4) Create a cloud storage “bucket”. This is where we will host the files we want to transcribe. Unfortunately, you have to host files on Google Storage to use the asynchronous service.

Installing Sox

The next step is to install Sox on our machine. Sox is an easy-to-use command-line utility for manipulating audio files.

If you’re on a Mac, you can use Homebrew to install Sox and its dependencies by running the following in the terminal:

brew install sox

If you’re using Ubuntu, you can run the following from the terminal:

sudo apt-get install libasound2-plugins libasound2-python libsox-fmt-all
sudo apt-get install sox

Converting Audio to Mono

Now that we have Sox installed, we can start setting up our Python script. Because Google’s Speech Recognition API only accepts single-channel audio, we’ll probably need to use Sox to convert our file. You can check the channel count by inspecting the file’s properties on your machine.

If your file is already mono, you can skip this step. If not, we can easily convert it from Python using Sox.

1) Import the subprocess library so we can call the sox executable from Python.

import subprocess

2) Build and run the Sox command to write a new file with only one channel.

filename = "test/test.wav"
newfilename = "test/newtest.wav"
command = ['sox', filename, '-c', '1', newfilename]
subprocess.call(command)  # unlike Popen, call waits for sox to finish

3) Verify that the new file converted correctly.
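One way to verify the conversion without leaving Python is to read the new file’s header with the standard-library wave module. The helper name below is my own sketch, not part of the tutorial’s code:

```python
import wave

def channel_count(path):
    """Return the number of audio channels recorded in a .wav file's header."""
    w = wave.open(path, "rb")
    try:
        return w.getnchannels()
    finally:
        w.close()

# After the sox conversion, the new file should report a single channel:
# channel_count("test/newtest.wav") == 1
```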

Uploading the Converted File

With our audio converted to mono, all we need to do is upload the new audio to Google Storage and we can get to work on our Python script. While this is something you’ll probably want to do programmatically using the Google Storage API module, it’s a bit beyond the scope of this tutorial.

Instead, we’ll just use the web-based GUI we were using earlier. You’ll want to check the “share publicly” option when uploading. Keep in mind that the file will be available to the entire world until you remove it from Google Storage or change its permissions.

1) Import the requests library for making the request to Google’s API and the json library to parse the response.

import requests
import json

2) Define the URL we’ll use when making the request. You’ll need to fill in the blanks with the API key you made earlier.

url = "https://speech.googleapis.com/v1/speech:longrunningrecognize?key=YOURAPIKEYHERE"

3) Create the parameters for our JSON request. These are just a few of the possible parameters we can specify. You can check out others here.

payload = {
    "config": {
        "encoding": "LINEAR16",
        "sample_rate_hertz": 48000,
        "language_code": "en-US"
    },
    "audio": {
        "uri": "gs://BUCKETNAMEHERE/FILENAMEHERE.wav"
    }
}
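Note that sample_rate_hertz needs to match the actual sample rate of the uploaded file. If you’re not sure what that is, you can read it straight from the wav header with the standard-library wave module before building the payload. The helper below is my own sketch, not part of Google’s API:

```python
import wave

def wav_sample_rate(path):
    """Return the sample rate (in Hz) recorded in a .wav file's header."""
    w = wave.open(path, "rb")
    try:
        return w.getframerate()
    finally:
        w.close()

# Then use the real rate when building the request config, e.g.:
# payload["config"]["sample_rate_hertz"] = wav_sample_rate("test/newtest.wav")
```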

4) Send the request and save the response.

r = requests.post(url, data=json.dumps(payload))

The response includes an operation name (a long numeric string) that we will use to access our transcription results. The output will look something like this:

Out[1]: {u'name': u'5284939751989548804'}

We can save the token like this:

json_resp = r.json()
token_resp = json_resp['name']

5) Retrieve results (you’ll want to wait a few seconds).

url = "https://speech.googleapis.com/v1/operations/" + str(token_resp) + "?key=YOURAPIKEYHERE"
content_response = requests.get(url)
content_json = content_response.json()
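Rather than guessing how long to wait, you can poll the operation until its JSON reports that it’s done, then join the transcripts from the response. A long-running recognize operation’s response holds a list of results, each with ranked alternatives; the two helpers below are my own sketch of that pattern, not code from Google’s client libraries:

```python
import time

def poll_operation(fetch, interval=5, max_tries=60):
    """Call fetch() (e.g. lambda: requests.get(url).json()) until the
    operation JSON reports done, then return it."""
    for _ in range(max_tries):
        op = fetch()
        if op.get("done"):
            return op
        time.sleep(interval)
    raise RuntimeError("transcription did not finish in time")

def extract_transcript(op):
    """Join the top-ranked alternative of each result into one string."""
    results = op.get("response", {}).get("results", [])
    return " ".join(r["alternatives"][0]["transcript"]
                    for r in results if r.get("alternatives"))

# Usage sketch with the requests-based URL from above:
# op = poll_operation(lambda: requests.get(url).json())
# print(extract_transcript(op))
```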

Once the operation has finished, the response JSON will contain a list of results, each holding one or more alternatives with a transcript and a confidence score.

Voilà!

You’re all set to build a speech-recognition pipeline for your own projects. If you have any questions (or more likely suggestions), don’t hesitate to comment.