Auto-Transcribe: Google Speech API Time Offsets in Python

Soham Sil
Towards Data Science
6 min read · Mar 11, 2018

Transcribing audio files or speech is vital for many companies around the world. As we know, the old-school approach of having humans listen and transcribe can introduce serious errors and eats up a lot of your (human) resources. It takes painstaking attention to transcribe every recorded word, and sometimes you have to deal with multiple audio files.

‘What a drag’ is exactly what Shikamaru would say if he were given the job of transcribing, and this is where the Google Speech API and its latest addition, time offsets (timestamps), come to the rescue for us Shikamarus.

What is Google Speech API?

  • Applies powerful neural network models to convert speech to text
  • Recognises more than 110 languages and variants
  • Returns text results in real time
  • Handles noisy audio successfully
  • Supports any device that can send a REST or gRPC request
  • Includes time offset values (timestamps) for the beginning and end of each word spoken in the recognised audio

Steps to set up Google Cloud and a Python 3 environment

Sign Up for a Free Tier Account in Google Cloud
An account is required to get an API key, and you can sign up for a free tier plan (365 days).

Google Cloud Platform Dashboard

Generate an API Key in Google Cloud
Follow these steps to generate an API key:

  • Sign-in to Google Cloud Console
  • Go to APIs and Services
  • Click on Credentials
  • Click on Create Credentials
  • Select Service Account Key
  • Select New service account in Service Account
  • Enter Service account name
  • Select Role as Project > Owner
  • Leave JSON option selected
  • Click on Create
  • The generated API key file (JSON) will be downloaded to your computer
  • Save the file and rename it to api-key.json
JSON file saved to computer

Install required Python modules

  • Install Google Speech package
    pip3 install -U google-cloud-speech
  • Install Google API package
    pip3 install -U google-api-python-client

Convert Audio

The Google Speech API supports a number of different encodings. The following table lists supported audio codecs:

Google Audio Codecs Table

All encodings support only 1-channel (mono) audio, and the audio should be transmitted using a lossless encoding (FLAC or LINEAR16). Audacity has
been working well for me, and with its simple UI it’s very easy to convert
your audio files to mono FLAC.

Click on Stereo to Mono using Audacity, to learn how to convert audio files.

Click on Audio Encoding for more information.
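
Before uploading, it can help to confirm that a file really is mono. The snippet below is a minimal sketch using Python’s standard wave module; it only reads uncompressed WAV (LINEAR16) files, not FLAC, and the helper names channel_count and is_mono are made up for this illustration.

```python
import wave


def channel_count(wav_path):
    """Return the number of audio channels in a PCM WAV file."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnchannels()


def is_mono(wav_path):
    """True when the WAV file has exactly one channel."""
    return channel_count(wav_path) == 1
```

If is_mono() returns False, convert the file to mono (for example in Audacity) before sending it to the API.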

Code for Python

In this Python script, we will be using Google Speech API’s latest addition,
time offsets, to include time offset values (timestamps) for the beginning
and end of each word spoken in the recognised audio.

A time offset value represents the amount of time that has elapsed from the
beginning of the audio, in increments of 100ms.
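
To make the 100 ms granularity concrete, here is a small sketch that turns an offset given in seconds into a readable MM:SS.t timestamp, rounded to the nearest tenth of a second. The function name format_offset is hypothetical and is not part of the Speech API.

```python
def format_offset(seconds):
    """Format an offset in seconds as MM:SS.t, rounded to the nearest 100 ms."""
    tenths = round(seconds * 10)          # whole 100 ms steps
    minutes, rem = divmod(tenths, 600)    # 600 tenths of a second per minute
    return f"{minutes:02d}:{rem // 10:02d}.{rem % 10}"


print(format_offset(1.3))   # 00:01.3
print(format_offset(75))    # 01:15.0
```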

Click on Transcribe with time offsets in Python for full code.

Let’s start by importing the necessary libraries and loading
the Speech API credentials from the api-key.json we saved earlier.
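
One common way to hand the key file to the Google client libraries is the GOOGLE_APPLICATION_CREDENTIALS environment variable, which they read when a client is created. The sketch below assumes api-key.json sits in the working directory; the helper name use_service_account_key is made up for this example and may differ from what the original script does.

```python
import os
from pathlib import Path


def use_service_account_key(key_path="api-key.json"):
    """Point the Google client libraries at a service account key file."""
    resolved = Path(key_path).expanduser().resolve()
    # Google client libraries look up this variable to find credentials.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(resolved)
    return resolved


key_file = use_service_account_key("api-key.json")
```

Call this once at the top of the script, before any Speech client is created.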

Transcribe audio file from local storage

Here, we will define transcribe_file_with_word_time_offsets(). It takes the audio file and the language of the audio as parameters and prints the recognised
words with their time offset values (timestamps).

Import the necessary google.cloud libraries and authenticate with
Google Cloud by creating a SpeechClient().

Next, we read the audio file and pass it to RecognitionAudio(), which
stores the audio data in `audio`, to be decoded according to the encoding specified in RecognitionConfig().

Setting the parameter enable_word_time_offsets to True in RecognitionConfig() tells the API to return the time offset values (timestamps) for each word.
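
The steps above can be sketched roughly as follows. This is not the original script: it assumes a recent google-cloud-speech release (2.x or later), where RecognitionAudio and RecognitionConfig live directly on the speech module, whereas older releases exposed them through separate types and enums modules. It also assumes a FLAC input file and returns the response instead of printing it.

```python
def transcribe_file_with_word_time_offsets(speech_file, language):
    """Send a local FLAC file for recognition and return the API response."""
    # Imported here so the module loads even where the package is missing.
    from google.cloud import speech  # pip3 install -U google-cloud-speech

    client = speech.SpeechClient()   # picks up the service account credentials

    # Read the raw bytes of the audio file.
    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code=language,
        enable_word_time_offsets=True,  # ask for per-word timestamps
    )
    return client.recognize(config=config, audio=audio)
```

A call such as transcribe_file_with_word_time_offsets("Sample.flac", "en-US") needs valid credentials and network access.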

response contains the message returned to the client by speech.recognize(),
if successful. The result consists of zero or more sequential messages.

Note: confidence in the output is an estimate of how likely it is that
the recognised words are correct. The value ranges from 0.0 to 1.0,
from low to high confidence.

The value of confidence: 0.93 shows that the Google Speech API has done a
very good job of recognising the words. Now we iterate through the results and print the words along with their time offset values (timestamps).

Transcribe audio file from Google Cloud Storage

The only difference in this section of the code is that we pass the
Google Cloud Storage URI of the audio file as gsc_uri, the first parameter of the
method transcribe_file_with_word_time_offsets(); the rest works the same.

The words get printed along with their time offset values (timestamps) as
output.
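
As a rough sketch of the Cloud Storage variant: instead of reading bytes from disk, the audio is referenced by its gs:// URI. The function name transcribe_gcs_with_word_time_offsets and the up-front URI check are assumptions made for this illustration, and the code assumes google-cloud-speech 2.x or later and a FLAC file.

```python
def transcribe_gcs_with_word_time_offsets(gcs_uri, language):
    """Recognise a FLAC file stored in Google Cloud Storage."""
    # Fail early on anything that is not a Cloud Storage URI.
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {gcs_uri!r}")

    from google.cloud import speech  # pip3 install -U google-cloud-speech

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)  # reference, not raw bytes
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code=language,
        enable_word_time_offsets=True,
    )
    return client.recognize(config=config, audio=audio)
```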

Time to call the __main__ function
The argparse library is used to parse the parameters passed on the
command line during execution.

We use an if-else statement to call the appropriate method for locally stored and cloud-stored audio files.
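
The command-line handling might look something like the sketch below. The -s flag and the positional path match the invocation used in this article, while the function names build_parser and main, the default language, and the returned strings are placeholders. In the real script each branch would call the matching transcribe function instead of returning a description.

```python
import argparse


def build_parser():
    """Parser for: script.py -s LANGUAGE PATH"""
    parser = argparse.ArgumentParser(
        description="Transcribe audio with word time offsets.")
    parser.add_argument("-s", dest="language", default="en-US",
                        help="language code of the audio, e.g. en-US")
    parser.add_argument("path",
                        help="local audio file or gs:// URI")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.path.startswith("gs://"):
        # Cloud-stored audio: hand the URI to the Cloud Storage variant.
        return f"cloud: {args.path} ({args.language})"
    else:
        # Locally stored audio: read the file from disk.
        return f"local: {args.path} ({args.language})"


print(main(["-s", "en-US", "Sample.flac"]))
```

Passing argv explicitly keeps main() easy to try out; with argv=None, argparse falls back to the real command-line arguments.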

Invocation in Terminal

Specify the language with -s "en-US", followed by the path of the audio file:
python3 transcribe_time_offsets_with_language_change.py -s "en-US" Sample.flac

For Google Cloud Storage, pass gs://cloud-samples-tests/speech/Sample.flac as the path:
python3 transcribe_time_offsets_with_language_change.py -s "en-US" gs://cloud-samples-tests/speech/Sample.flac

Invocation in Cloud Shell

Click on Google Cloud Shell to learn basic operations in Cloud Shell.

Install virtualenv globally in Cloud Shell:
pip install --upgrade virtualenv

After installing virtualenv, use the --python flag to tell virtualenv which
Python version to use:
virtualenv --python python3 env

Next, we need to activate the virtualenv. This tells the shell to use the virtualenv’s
path for Python:
source env/bin/activate

Cloud Shell is now ready to execute our Python program using the commands
mentioned under Invocation in Terminal.

Output

Congratulations! Here’s your transcribed data along with time offset values (timestamps) for each word.

Speech recognition can be improved by changing the parameters of the
configuration in the file.

Click on Cloud Speech API to learn more about synchronous, asynchronous and
streaming recognition, and how to improve or change the models.
