Auto-Transcribe: Google Speech API Time Offsets in Python
Transcribing audio files or speech is vital for many companies around the world, and as we know, the old-school technique of having humans listen and transcribe is error-prone and eats up a lot of your resources (humans). It requires painstaking attention to transcribe every word that's being recorded, and sometimes you have to deal with multiple audio files.
'What a drag' is exactly what Shikamaru would say if he were given the job of transcribing, and here's where the Google Speech API and its latest addition, Time Offsets (timestamps), comes to the rescue for us Shikamarus.
What is Google Speech API?
- Apply powerful neural network models to convert speech to text
- Recognises more than 110 languages and variants
- Text results in Real-Time
- Successful noise handling
- Supports devices which can send a REST or gRPC request
- API includes time offset values (timestamps) for the beginning and end of each word spoken in the recognised audio
Steps to setup Google Cloud and Python3 environment
Sign Up for a Free Tier Account in Google Cloud
An account is required to get an API key, and you can sign up for a free tier plan (365 days).
Generate an API Key in Google Cloud
Follow these steps to generate an API key:
- Sign-in to Google Cloud Console
- Go to APIs and Services
- Click on Credentials
- Click on Create Credentials
- Select Service Account Key
- Select New service account in Service Account
- Enter Service account name
- Select Role as Project > Owner
- Leave JSON option selected
- Click on Create
- Save generated API key file
- The API key file (JSON) will be downloaded to your computer
- Rename the downloaded file to api-key.json
Install required Python modules
- Install Google Speech package
pip3 install -U google-cloud-speech
- Install Google API package
pip3 install -U google-api-python-client
Convert Audio
The Google Speech API supports a number of different audio encodings; the Audio Encoding link below lists the supported codecs.
All encodings support only single-channel (mono) audio, and for best results the audio should be transmitted using a lossless encoding (FLAC or LINEAR16). Audacity has been working well for me; with its simple UI it's very easy to convert your audio files to mono FLAC. If you'd rather script the conversion, see the sketch after the links below.
Click on Stereo to Mono using Audacity to learn how to convert audio files.
Click on Audio Encoding for more information.
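As a scripted alternative to Audacity (not something from the original post), the third-party soundfile package can do the same stereo-to-mono FLAC conversion; the file names below are placeholders:

```python
import soundfile as sf  # pip3 install soundfile

# Placeholder file names; point these at your own audio.
data, sample_rate = sf.read("input_stereo.wav")

# Average the channels down to mono if the file isn't mono already.
if data.ndim > 1:
    data = data.mean(axis=1)

sf.write("Sample.flac", data, sample_rate, format="FLAC")
```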
Code for Python
In this Python script, we will be using the Google Speech API's latest addition, Time Offsets, and include time offset values (timestamps) for the beginning and end of each word spoken in the recognised audio.
A time offset value represents the amount of time that has elapsed from the
beginning of the audio, in increments of 100ms.
Click on Transcribe with time offsets in Python for full code.
Let's start by importing the necessary libraries and creating the Speech API credentials from the api-key.json file we saved earlier.
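The post's code lives in embedded gists that are not reproduced here, so the snippets that follow are minimal sketches of what the prose describes, written against a 2.x release of google-cloud-speech (older 1.x releases use speech.types/speech.enums instead); any file or parameter names not mentioned in the text are assumptions. First, the imports and credentials:

```python
import argparse

from google.cloud import speech
from google.oauth2 import service_account

# Build Speech API credentials from the api-key.json we saved earlier
# (assumed to sit in the same directory as this script).
credentials = service_account.Credentials.from_service_account_file("api-key.json")
```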
Transcribe audio file from local storage
Here, we will define transcribe_file_with_word_time_offsets(). It takes the audio file and the language of the audio as parameters and prints the recognised words with their time offset values (timestamps); a sketch of the full function appears after this walkthrough.
Import the necessary google.cloud libraries and verify the credentials with Google Cloud by creating a SpeechClient().
Next we read the audio file and pass it to RecognitionAudio(), which stores the audio data in `audio`, to be decoded as per the encoding specified in RecognitionConfig().
Setting the parameter enable_word_time_offsets to True tells RecognitionConfig() to record the time offset values (timestamps) for each word.
response contains the message returned to the client by the recognize() call, if successful. The result is returned as zero or more sequential messages, each carrying a transcript, a confidence score and the word-level timestamps.
Note: confidence in the output shows the accuracy of speech
recognition. The value is from 0.0 to 1.0, for low to high
confidence, respectively.
A confidence of 0.93 shows the Google Speech API has done a very good job of recognising the words. Now we iterate through the results and print the words along with their time offset values (timestamps), as in the sketch below.
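Putting the steps above together, a sketch of the full function (building on the imports and credentials above; in 1.x releases of the client the offsets expose .seconds/.nanos instead of timedeltas):

```python
def transcribe_file_with_word_time_offsets(speech_file, language):
    """Transcribe a local audio file and print word-level time offsets."""
    # Verify credentials with Google Cloud and create the client.
    client = speech.SpeechClient(credentials=credentials)

    # Read the raw audio bytes from local storage.
    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code=language,
        enable_word_time_offsets=True,  # record timestamps for every word
    )

    response = client.recognize(config=config, audio=audio)

    # Each result holds one or more alternatives; the first is the most likely.
    for result in response.results:
        alternative = result.alternatives[0]
        print("Transcript: {}".format(alternative.transcript))
        print("Confidence: {}".format(alternative.confidence))

        # start_time / end_time arrive as datetime.timedelta objects.
        for word_info in alternative.words:
            print("Word: {}, start: {:.2f}s, end: {:.2f}s".format(
                word_info.word,
                word_info.start_time.total_seconds(),
                word_info.end_time.total_seconds(),
            ))
```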
Transcribe audio file from Google Cloud Storage
The only difference in this section of the code is that we pass the Google Cloud Storage URI of the audio file as gcs_uri, the first parameter of the transcribe method, and the rest works the same.
The words get printed along with their time offset values (timestamps) as
output.
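Here is a sketch of the Cloud Storage variant; the name transcribe_gcs_with_word_time_offsets() is assumed for illustration (the original script may fold both cases into one function), and note that the synchronous recognize() call only accepts Cloud Storage audio up to roughly a minute long, beyond which long_running_recognize() is needed:

```python
def transcribe_gcs_with_word_time_offsets(gcs_uri, language):
    """Transcribe an audio file stored in Google Cloud Storage."""
    client = speech.SpeechClient(credentials=credentials)

    # Instead of raw bytes, point RecognitionAudio at the gs:// URI.
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code=language,
        enable_word_time_offsets=True,
    )

    response = client.recognize(config=config, audio=audio)

    for result in response.results:
        alternative = result.alternatives[0]
        print("Transcript: {}".format(alternative.transcript))
        for word_info in alternative.words:
            print("Word: {}, start: {:.2f}s, end: {:.2f}s".format(
                word_info.word,
                word_info.start_time.total_seconds(),
                word_info.end_time.total_seconds(),
            ))
```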
Time to call the __main__ function. The argparse library is used to parse the parameters passed on the command line during execution.
We use an if/else block to call the appropriate method for locally stored and cloud-stored audio files, as sketched below.
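A sketch of the entry point, assuming the two functions and the argparse import from the snippets above; the -s flag matches the invocation shown in the next section:

```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Transcribe audio with word-level time offsets.")
    parser.add_argument("path", help="Local path or gs:// URI of the audio file")
    parser.add_argument("-s", "--language", default="en-US",
                        help="Language code of the audio, e.g. en-US")
    args = parser.parse_args()

    # Cloud-stored audio is addressed with a gs:// URI; anything else is local.
    if args.path.startswith("gs://"):
        transcribe_gcs_with_word_time_offsets(args.path, args.language)
    else:
        transcribe_file_with_word_time_offsets(args.path, args.language)
```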
Invocation in Terminal
Specify the language with the -s "en-US" flag and pass the path of the audio file to execute the script:

python3 transcribe_time_offsets_with_language_change.py -s "en-US" Sample.flac
For audio stored in Google Cloud Storage, pass gs://cloud-samples-tests/speech/Sample.flac as the path:

python3 transcribe_time_offsets_with_language_change.py -s "en-US" gs://cloud-samples-tests/speech/Sample.flac
Invocation in Cloud Shell
Click on Google Cloud Shell to learn basic operations in Cloud Shell.
Install virtualenv globally in Cloud Shell:

pip install --upgrade virtualenv
After installing virtualenv, use the --python flag to tell virtualenv which Python version to use:

virtualenv --python python3 env
Next, we need to activate the virtualenv. This tells the shell to use the virtualenv's path for Python:

source env/bin/activate
Cloud Shell is now ready to execute our Python program using the commands mentioned under Invocation in Terminal.
Output
Congratulations! Here's your transcribed data along with time offset values (timestamps) for each word.
We can improve the recognition further by changing the parameters of the configuration; a few illustrative options are sketched below.
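For example (building on the import above; the specific values here are assumptions, not from the original post), RecognitionConfig also accepts a model hint and phrase hints alongside the language code:

```python
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    language_code="en-GB",      # switch the language or regional variant
    model="video",              # hint at the kind of audio being transcribed
    speech_contexts=[speech.SpeechContext(phrases=["Shikamaru"])],  # phrase hints
    enable_word_time_offsets=True,
)
```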
Click on Cloud Speech API to learn more about Synchronous, Asynchronous and Streaming recognition, and how to improve or change the models.