Transcribe Japanese using Go and Machine Learning APIs

One of the hardest skills to develop when learning Japanese is solid listening comprehension. A great way to practice is to listen to the news in Japanese. For that, there’s nothing better than the NHK Radio News, or rajio nyūsu ラジオニュース. The Japanese in this program adheres to NHK’s crystal-clear standards, and the site lets listeners adjust the audio speed between regular, slow, and fast.

As much as adjusting the playback speed may help, a language learner is bound to encounter unfamiliar words. While identifying unknown words to look up on one’s own is an equally important skill, a transcription of the audio makes a big difference in the speed of learning. The only problem is that NHK does not provide transcriptions. So, let’s create them ourselves!

To do that, we’re going to use Google Cloud Platform’s Speech-to-Text API and its Go API client. There are three steps.

Download an audio sample

The Speech-to-Text API makes a distinction between short and long audio files with a runtime below or above about one minute. If the audio file is longer than a minute, we need to use the asynchronous API for long running recognition. For this example, we are going to use the synchronous API. The API client also supports remote URIs, but to keep things simple we will upload an audio sample ourselves.
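
As a rough sketch of that decision, the helper below picks a method name from a clip's runtime. The one-minute cutoff is taken from the description above and should be treated as approximate; recognizeMethod is a hypothetical helper, not part of the client library, although Recognize and LongRunningRecognize are the names of the two client methods.

```go
package main

import (
	"fmt"
	"time"
)

// syncLimit is the approximate cutoff described above (an assumption,
// check the current API limits before relying on it).
const syncLimit = time.Minute

// recognizeMethod is a hypothetical helper that names the client method
// suited to a clip of the given runtime.
func recognizeMethod(runtime time.Duration) string {
	if runtime > syncLimit {
		return "LongRunningRecognize" // asynchronous, for long audio
	}
	return "Recognize" // synchronous, for short audio
}

func main() {
	for _, d := range []time.Duration{30 * time.Second, 5 * time.Minute} {
		fmt.Printf("%v -> %s\n", d, recognizeMethod(d))
	}
}
```

Since our sample will be thirty seconds long, it falls on the synchronous side.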

The easiest way to get a copy of an audio file is to download it directly from the radio program’s homepage. Note, a link to download the episode isn’t visible on the page. Instead, you will need to use your browser’s developer tools to inspect the audio player’s HTML after starting a particular episode.

Then, find the corresponding audio element whose src attribute points to an MP3 file of the episode. Download that file.
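
If you would rather not dig through the developer tools by hand, the same lookup can be sketched in Go. This is only an illustration: the HTML snippet and URL below are made up, the real page's markup may differ, and findAudioURL is a hypothetical helper. A regular expression is enough for a one-off lookup, though an HTML parser would be sturdier.

```go
package main

import (
	"fmt"
	"regexp"
)

// audioSrc captures the src attribute of an <audio> tag when it points
// to an MP3 file. It assumes a double-quoted attribute value.
var audioSrc = regexp.MustCompile(`(?i)<audio[^>]*\ssrc="([^"]+\.mp3[^"]*)"`)

// findAudioURL returns the first MP3 URL found in an audio element.
func findAudioURL(html string) (string, bool) {
	m := audioSrc.FindStringSubmatch(html)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Made-up markup standing in for the inspected player HTML.
	page := `<div><audio id="player" src="https://example.com/news/episode.mp3"></audio></div>`
	if url, ok := findAudioURL(page); ok {
		fmt.Println("found:", url)
	} else {
		fmt.Println("no audio element with an MP3 source")
	}
}
```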

Process the sample for best results

Now that we have a Japanese audio sample, we need to do some minor processing to ensure best results. As the documentation on best practices notes, for optimal results we need to use a FLAC encoding and a sampling rate of 16,000 Hz. In addition, I have found it helps to submit samples mixed to mono rather than to stereo. The ffmpeg CLI is invaluable for this kind of work.

First, we can view the various bits of metadata about a file with the following command:

ffprobe -v error -show_format -show_streams input.mp3
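
If we add -of json to that command, ffprobe prints the same metadata as JSON, which is easy to check from Go. The sketch below parses such output and reports whether the file already has the sample rate and channel count we are after. The JSON sample is illustrative rather than taken from a real episode, and firstAudioStream is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// ffprobeOutput holds the few fields we care about. Note that ffprobe
// reports sample_rate as a string and channels as a number.
type ffprobeOutput struct {
	Streams []struct {
		CodecName  string `json:"codec_name"`
		CodecType  string `json:"codec_type"`
		SampleRate string `json:"sample_rate"`
		Channels   int    `json:"channels"`
	} `json:"streams"`
}

type audioInfo struct {
	Codec      string
	SampleRate int
	Channels   int
}

// firstAudioStream extracts details of the first audio stream.
func firstAudioStream(raw []byte) (audioInfo, error) {
	var out ffprobeOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return audioInfo{}, err
	}
	for _, s := range out.Streams {
		if s.CodecType != "audio" {
			continue
		}
		rate, err := strconv.Atoi(s.SampleRate)
		if err != nil {
			return audioInfo{}, fmt.Errorf("unexpected sample_rate %q: %w", s.SampleRate, err)
		}
		return audioInfo{Codec: s.CodecName, SampleRate: rate, Channels: s.Channels}, nil
	}
	return audioInfo{}, errors.New("no audio stream found")
}

func main() {
	// Illustrative values only; real output depends on the file.
	sample := []byte(`{"streams":[{"codec_name":"mp3","codec_type":"audio","sample_rate":"48000","channels":2}]}`)
	info, err := firstAudioStream(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("codec=%s rate=%d channels=%d\n", info.Codec, info.SampleRate, info.Channels)
	if info.SampleRate != 16000 || info.Channels != 1 {
		fmt.Println("needs resampling and/or a mono downmix")
	}
}
```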

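Next, we need to convert from MP3 to FLAC, resampling to 16,000 Hz so that the file matches the sample rate we will declare in the request:

ffmpeg -i input.mp3 -ar 16000 output.flac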

To keep our sample under a minute, we will trim the audio sample to a runtime of thirty seconds. For your particular clip, you will need to find a good start time and a good stop time. In the clip I am using, I want to start at 10 seconds and stop 30 seconds later:

ffmpeg -ss 10 -t 30 -i input.flac output.flac

The broadcast episodes are already mono, but in case you want to remix a stereo sample to mono, the command is also simple:

ffmpeg -i input.flac -ac 1 output-mono.flac
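
The separate steps above can also be folded into a single ffmpeg invocation. Here is a minimal sketch that builds the argument list in Go; ffmpegArgs is a hypothetical helper and the file names are placeholders. Placing -ss and -t before -i applies them to the input, while -ac 1 and -ar 16000 set the output to mono at 16,000 Hz.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ffmpegArgs builds arguments that trim, downmix, and resample a clip
// in one pass. The output encoding is inferred from the file extension.
func ffmpegArgs(in, out string, startSec, lengthSec int) []string {
	return []string{
		"-ss", strconv.Itoa(startSec), // seek to the start time
		"-t", strconv.Itoa(lengthSec), // keep this many seconds
		"-i", in,
		"-ac", "1", // mix down to mono
		"-ar", "16000", // resample to 16,000 Hz
		out,
	}
}

func main() {
	args := ffmpegArgs("input.mp3", "nhk-radio-news.flac", 10, 30)
	fmt.Println("ffmpeg " + strings.Join(args, " "))
	// To run it for real, pass args to exec.Command("ffmpeg", args...)
	// from the os/exec package and check the returned error.
}
```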

Now with the processing done, we are ready to write some code and submit the sample to the Speech-to-Text API.

Send the sample to the Speech-to-Text API

Let’s write some Go.

First, we will download a copy of the client libraries for Go:

go get -u cloud.google.com/go/...

Then, in a directory within our GOPATH, we create a main.go with the following code. We first create a new speech client.

Note, in the code below, we make no explicit reference to API credentials. Instead, the client will read the GOOGLE_APPLICATION_CREDENTIALS variable from the environment to locate the service-account.json file which provides authentication information. For more details on the various ways to set up authentication, see the documentation. Also, there are a number of helpful examples in GoDoc.
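
Because the credentials are resolved implicitly, a missing or mistyped path only surfaces once the client is used. As an optional sanity check, and not something the client requires, we could verify the variable ourselves first. checkCredentials below is a hypothetical helper built only on the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// checkCredentials reports whether path looks like a usable
// credentials file. It only checks that the file exists; it does not
// validate the contents.
func checkCredentials(path string) error {
	if path == "" {
		return errors.New("GOOGLE_APPLICATION_CREDENTIALS is not set")
	}
	if _, err := os.Stat(path); err != nil {
		return fmt.Errorf("credentials file is not accessible: %w", err)
	}
	return nil
}

func main() {
	path := os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")
	if err := checkCredentials(path); err != nil {
		fmt.Println("authentication is not configured:", err)
		return
	}
	fmt.Println("using credentials from", path)
}
```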

// ~/go/src/github.com/gobuildit/gobuildit/transcribe/main.go
package main

import (
    "context"
    "fmt"
    "io/ioutil"
    "log"

    speech "cloud.google.com/go/speech/apiv1"
    speechpb "google.golang.org/genproto/googleapis/cloud/speech/v1"
)

func main() {
    ctx := context.Background()

    // Create the Speech-to-Text client. Credentials are read from
    // the environment, as described above.
    client, err := speech.NewClient(ctx)
    if err != nil {
        log.Fatalf("failed to create client: %v", err)
    }
    defer client.Close()
    // ...
}

Next, we read the audio sample into memory:

// ...
data, err := ioutil.ReadFile("nhk-radio-news.flac")
if err != nil {
    log.Fatalf("failed to read file: %v", err)
}
// ...

And now, we are ready to create the API request and send it along:

// ...
resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
    Config: &speechpb.RecognitionConfig{
        Encoding:        speechpb.RecognitionConfig_FLAC,
        SampleRateHertz: int32(16000),
        LanguageCode:    "ja-JP",
    },
    Audio: &speechpb.RecognitionAudio{
        AudioSource: &speechpb.RecognitionAudio_Content{
            Content: data,
        },
    },
})
if err != nil {
    log.Fatalf("failed to recognize: %v", err)
}
// ...

In the code above, we configure the RecognizeRequest with details about the audio sample: its encoding, sample rate, and language. RecognitionConfig has further fields worth knowing about that we don’t use here, such as MaxAlternatives and EnableWordTimeOffsets.

Finally, provided our request succeeds, we print out the results:

// ...
for _, result := range resp.Results {
    for _, alt := range result.Alternatives {
        fmt.Printf("\"%v\" (confidence=%.3f)\n", alt.Transcript, alt.Confidence)
    }
}
// end of main

It’s possible the results will include a number of Alternatives, so we print them all along with the Confidence value, which is the API’s estimate of how likely the transcript is to be correct. Note, the Alternatives are ordered by accuracy, with those of highest confidence coming first. See the API documentation for more.
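
If we only want a single transcript per result, we can keep the most confident alternative. The sketch below uses a small stand-in type, alternative, that mirrors just the Transcript and Confidence fields read above so that it runs without the client library; the sample data is made up. Given the ordering described above, taking the first element would normally do, so the scan is purely defensive.

```go
package main

import "fmt"

// alternative is a stand-in for the API's alternative type, keeping
// only the two fields we print in the tutorial.
type alternative struct {
	Transcript string
	Confidence float32
}

// best returns the alternative with the highest confidence, or false
// if the slice is empty.
func best(alts []alternative) (alternative, bool) {
	if len(alts) == 0 {
		return alternative{}, false
	}
	top := alts[0]
	for _, a := range alts[1:] {
		if a.Confidence > top.Confidence {
			top = a
		}
	}
	return top, true
}

func main() {
	// Made-up alternatives for illustration.
	alts := []alternative{
		{Transcript: "ではニュースです", Confidence: 0.93},
		{Transcript: "ではニュースで", Confidence: 0.71},
	}
	if top, ok := best(alts); ok {
		fmt.Printf("%q (confidence=%.3f)\n", top.Transcript, top.Confidence)
	}
}
```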

Now, we’re ready to transcribe our audio sample. Note, the following command may take a moment to complete.

$ GOOGLE_APPLICATION_CREDENTIALS=service-account.json go run main.go
ではニュースです # “And now for the news.” (full transcription result omitted)

And look at that! We have a transcript of our audio sample!

Possible Next Steps

Given how easy it is to take audio and convert it into a transcript, one could easily imagine extending this code into a real-time captioning service. Given a steady input of audio, perhaps split into sub-minute spans, we could submit those samples to an automated processing step. From there, the processed audio could be sent to the Speech-to-Text API to produce captions.
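
To make the idea of sub-minute spans concrete, here is a minimal sketch that computes the start time and length of each span for a recording of a given runtime. The 55-second span length is an arbitrary choice that stays under a minute, and splitSpans is a hypothetical helper. Each span's Start and Length could then feed the -ss and -t options of ffmpeg.

```go
package main

import (
	"fmt"
	"time"
)

// span describes one slice of a longer recording.
type span struct {
	Start  time.Duration
	Length time.Duration
}

// splitSpans divides total into consecutive spans of at most max.
// The final span holds whatever remains.
func splitSpans(total, max time.Duration) []span {
	if total <= 0 || max <= 0 {
		return nil
	}
	var out []span
	for start := time.Duration(0); start < total; start += max {
		length := max
		if remaining := total - start; remaining < length {
			length = remaining
		}
		out = append(out, span{Start: start, Length: length})
	}
	return out
}

func main() {
	for _, s := range splitSpans(150*time.Second, 55*time.Second) {
		fmt.Printf("start=%v length=%v\n", s.Start, s.Length)
	}
}
```

A real captioning pipeline would likely want to cut at pauses rather than at fixed offsets, since a fixed cut can split a word in two.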

Of course, it’s almost certainly not a match for already existing systems, but as machine learning continues to improve in accuracy and speed, it’s also not hard to see how systems like live captioning become that much easier to create. For now, though, we have a powerful tool to aid the language learning process.

Further Reading