Analyzing Japanese with Natural Language Processing and Go

Eno Compton
8 min read · Jun 19, 2018


One of the most exciting things about learning a foreign language is developing a proficiency for reading it. At first, sentences in the foreign language are intimidating blocks of text, totally opaque and meaningless. With lots of practice, single words start to make sense here and there. A basic understanding of the language’s syntax transforms the process.

Let’s use the Google Cloud Natural Language API to parse a long and seemingly complicated Japanese sentence and see how an understanding of the syntax helps us assemble the sentence’s meaning.

To start, here is a sentence completely unedited from a recent article in the Japanese newspaper Yomiuri Shimbun.

安倍首相は30日、野党党首との党首討論で、トランプ米政権が検討している輸入車への高関税措置について、「同盟国の日本に高関税を課すのは極めて理解しがたく、受け入れることはできない」と述べ、反対する意向を明言した。

Start from Translation

Since we don’t know what the text is saying, we might as well translate it first with the Translation API. We start with what we know: English and our favorite programming language Go.

The translation API is pleasantly easy to use.

package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/translate"
	xlang "golang.org/x/text/language"
)

const text = "安倍首相は30日..." // full text omitted.

func main() {
	ctx := context.Background()
	tc, err := translate.NewClient(ctx)
	if err != nil {
		log.Fatalf("failed to create translate client: %s", err)
	}
	ts, err := tc.Translate(ctx, []string{text}, xlang.English, nil)
	if err != nil {
		log.Fatalf("failed to translate text: %s", err)
	}
	if len(ts) != 1 {
		log.Fatalf("unexpected count. want 1, got %d", len(ts))
	}
	fmt.Printf("Translation: %s\n", ts[0].Text)
}

After creating a translate client, we pass a slice of the strings we want to translate (which we’ve hard-coded) and the target language English. We add error checking to avoid any surprises and finally print out the Text property of the first Translation. This is likely the simplest use of the translate client. There are a number of additional features worth knowing about. See the GoDoc for more.

There are a number of ways to authenticate against the API. For the example here, I have created a private key associated with a service account, downloaded it into the same directory as the main.go file, and am invoking our Go program with an environment variable like so:

$ GOOGLE_APPLICATION_CREDENTIALS=service-account.json go run main.go

The translation API does a decent job, too, and provides this translation (which I have refrained from editing except for replacing the HTML character entity &quot; with actual quotation marks):

On the 30th, Prime Minister Shinzo Abe said in a party discussion with opposition party leader, “About high tariff measures on imported vehicles examined by the Trumpe administration,” It is extremely difficult to understand imposing high tariffs on Japan of allies, We can not accept it, “he declared the intention to object.
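The entity replacement mentioned above does not have to be done by hand. As a minimal sketch, the standard library's html.UnescapeString decodes entities such as &quot; in a string. The helper name cleanTranslation and the shortened sample string are made up for illustration and are not part of the original program. It may also be possible to avoid the entities altogether by passing &translate.Options{Format: translate.Text} instead of nil to Translate, but check the GoDoc before relying on that.

```go
package main

import (
	"fmt"
	"html"
)

// cleanTranslation is a hypothetical helper that decodes HTML character
// entities (for example &quot; and &#39;) left in translated text.
func cleanTranslation(s string) string {
	return html.UnescapeString(s)
}

func main() {
	// A shortened, made-up fragment standing in for raw API output.
	raw := "&quot;We can not accept it,&quot; he declared."
	fmt.Println(cleanTranslation(raw))
}
```

In the translation program, the same call could wrap ts[0].Text just before printing.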

The sentence describes Prime Minister Abe’s negative reaction to the Trump administration’s consideration of imposing high tariffs on imports. Anyone reading the news is probably aware of similar policy discussions already. So, we understand what the sentence says.

Developing an understanding of what a sentence means is only half the fun. Understanding how a sentence communicates its meaning is just as fascinating. So let’s use the Cloud Natural Language API to look at how our sentence is constructed.

A Quick Intro to Japanese Grammar

Before we analyze syntax, let’s talk briefly about Japanese grammar.

Unlike English, which uses the word order SVO, i.e., Subject-Verb-Object, Japanese uses SOV, or Subject-Object-Verb. In other words, the object comes before the verb. And yes, that means that sometimes you have to wait until the end of a sentence to know what action occurred for a given object!

Two other points are worth mentioning. First, adverbs precede verbs. Whereas English permits adverbs on either side of the verb, e.g., “they quickly ate” vs “they ate quickly,” Japanese grammar requires an adverb come before the verb. Second, any modification of a noun, e.g., “the red hat” or “the red hat I wore every day until I lost it,” comes before the noun. For example, to say “the red hat I wore every day until I lost it,” you might say this:

naku shita made mainichi kabutteita akai bōshi
なくしたまで毎日かぶっていた赤い帽子

If we ignore the details, the important thing to understand is that the word for “hat,” i.e., bōshi 帽子, comes at the tail end after the modifier (“that I wore every day until I lost it”).

If Japanese grammar seems endlessly fascinating (it is!), a great place to learn lots more is Tae Kim’s Guide to Learning Japanese.

Analyzing the Syntax

Now we are ready to write some code to analyze the syntax of our sentence.

package main

import (
	"context"
	"fmt"
	"log"

	language "cloud.google.com/go/language/apiv1"
	langpb "google.golang.org/genproto/googleapis/cloud/language/v1"
)

const text = "安倍首相は30日..." // full text omitted.

func main() {
	ctx := context.Background()
	lc, err := language.NewClient(ctx)
	if err != nil {
		log.Fatalf("failed to create language client: %s", err)
	}
	resp, err := lc.AnalyzeSyntax(ctx, buildSyntaxRequest(text))
	if err != nil {
		log.Fatalf("failed to analyze syntax: %s", err)
	}
	for _, t := range resp.Tokens {
		fmt.Printf("%s\t\t=> %s\n", t.Text.Content, t.PartOfSpeech)
	}
}

As with the translate client above, we create a context.Context and then create a client for the Cloud Natural Language API. Again, we will invoke this program with the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to our service account private key for authentication. There are a number of interesting features in the language package. So definitely read through the GoDoc. To start, we will use only AnalyzeSyntax.

Because the request type is a little verbose, I have extracted its creation into a separate function called buildSyntaxRequest. That code looks like this:

func buildSyntaxRequest(text string) *langpb.AnalyzeSyntaxRequest {
	return &langpb.AnalyzeSyntaxRequest{
		Document: &langpb.Document{
			Type: langpb.Document_PLAIN_TEXT,
			Source: &langpb.Document_Content{
				Content: text,
			},
		},
		EncodingType: langpb.EncodingType_UTF8,
	}
}

For our purposes, we care only about the text in the request. Nonetheless, the request supports a number of additional features, so it’s worth looking at the GoDoc for details.

First, find the subject of the sentence

The response from AnalyzeSyntax has a number of tokens, each of which includes a part of speech. Looking at just the tail end of the printed response, we have:

反対      => tag:NOUN proper:NOT_PROPER
する      => tag:VERB form:ADNOMIAL proper:NOT_PROPER
意向      => tag:NOUN proper:NOT_PROPER
を        => tag:PRT case:ACCUSATIVE proper:NOT_PROPER
明言      => tag:NOUN proper:NOT_PROPER
し        => tag:VERB form:GERUND proper:NOT_PROPER
た        => tag:VERB form:FINAL_ENDING proper:NOT_PROPER tense:PAST
。        => tag:PUNCT proper:NOT_PROPER

Even when we’re just printing the part of speech, we get lots of information here! In particular, we have tags for nouns, verbs, particles (i.e., PRT), and punctuation (i.e., PUNCT). The full list of tags is documented here.

Let’s take another look at our sentence.

安倍首相は30日、野党党首との党首討論で、トランプ米政権が検討している輸入車への高関税措置について、「同盟国の日本に高関税を課すのは極めて理解しがたく、受け入れることはできない」と述べ、反対する意向を明言した。

Somewhere in there are a subject, an object, and a verb. Aside from telling us the parts of speech, the API has parsed word boundaries for us, itself a difficult task for a student of Japanese. So, if we look at the head of the response, we see a proper noun followed by a noun:

安倍      => tag:NOUN proper:PROPER
首相      => tag:NOUN proper:NOT_PROPER
は        => tag:PRT proper:NOT_PROPER
30        => tag:NUM proper:NOT_PROPER
日        => tag:AFFIX proper:NOT_PROPER
、        => tag:PUNCT proper:NOT_PROPER

Knowing that modifiers precede the nouns they modify, we might guess that the first two nouns are “Prime Minister Abe,” or following the words as they appear in the original, “Abe Prime Minister.” And we would be right! It doesn’t hurt to use Google Translate to double check such things, too. We’ve found our subject!
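We found the subject by eye, but the same step can be sketched in code. The heuristic below is an addition to the walkthrough, not something the API provides: it assumes the sentence opens with its topic and that the particle は (which marks the topic in Japanese) closes it. The token struct and the topicBefore helper are hypothetical, and the tokens are hard-coded from the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical, simplified stand-in for the API's token type.
type token struct {
	Text string
	Tag  string
}

// topicBefore joins the tokens that precede the first particle whose text
// equals marker. It reports false when no such particle exists.
func topicBefore(tokens []token, marker string) (string, bool) {
	for i, t := range tokens {
		if t.Tag == "PRT" && t.Text == marker {
			var b strings.Builder
			for _, prev := range tokens[:i] {
				b.WriteString(prev.Text)
			}
			return b.String(), true
		}
	}
	return "", false
}

// head holds the first six tokens of the sentence, copied from the output above.
var head = []token{
	{"安倍", "NOUN"}, {"首相", "NOUN"}, {"は", "PRT"},
	{"30", "NUM"}, {"日", "AFFIX"}, {"、", "PUNCT"},
}

func main() {
	if topic, ok := topicBefore(head, "は"); ok {
		fmt.Println(topic) // prints 安倍首相
	}
}
```

This only works when the topic sits at the very start of the sentence, as it does here. A sentence that opens with something else would need a smarter approach.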

Now, let’s find the verb

To make the task easier, let’s eliminate what we know cannot be the verb. We know adverbs come before verbs. Can we eliminate any adverbs or adverbial phrases? Yes, we can!

Looking at the API response, we see a couple of adverbial particles that precede a punctuation mark:

野党      => tag:NOUN proper:NOT_PROPER
党首      => tag:NOUN proper:NOT_PROPER
と        => tag:PRT case:ADVERBIAL proper:NOT_PROPER
の        => tag:PRT case:GENITIVE proper:NOT_PROPER
党首      => tag:NOUN proper:NOT_PROPER
討論      => tag:NOUN proper:NOT_PROPER
で        => tag:PRT case:ADVERBIAL proper:NOT_PROPER // <---
、        => tag:PUNCT proper:NOT_PROPER

And:

トランプ  => tag:NOUN proper:PROPER
米        => tag:NOUN proper:PROPER
政権      => tag:NOUN proper:NOT_PROPER
が        => tag:PRT case:NOMINATIVE proper:NOT_PROPER
検討      => tag:NOUN proper:NOT_PROPER
し        => tag:VERB form:GERUND proper:NOT_PROPER
て        => tag:PRT proper:NOT_PROPER
いる      => tag:VERB form:ADNOMIAL proper:NOT_PROPER
輸入      => tag:NOUN proper:NOT_PROPER
車        => tag:AFFIX proper:NOT_PROPER
へ        => tag:PRT case:ADVERBIAL proper:NOT_PROPER
の        => tag:PRT case:GENITIVE proper:NOT_PROPER
高関税    => tag:NOUN proper:NOT_PROPER
措置      => tag:NOUN proper:NOT_PROPER
について  => tag:PRT case:ADVERBIAL proper:NOT_PROPER // <---
、        => tag:PUNCT proper:NOT_PROPER

Let’s remove these two phrases since they cannot be the verb we are looking for. If we know what Japanese quote marks look like, i.e., 「 and 」, we may ignore the quoted statement, as well.

安倍首相は30日、... と述べ、反対する意向を明言した。

So what was once a long sentence is now reduced down to “Prime Minister Abe… something, something, something.”
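The elimination step can be sketched in code, too. The idea: split the tokens into segments at each Japanese comma (、) and drop any segment whose last word is an adverbial particle. The token struct and both helpers are hypothetical. To avoid inventing API output, the data covers only tokens printed in this post (the head of the sentence, the first adverbial phrase, and the tail), so the quoted statement and と述べ are simply absent rather than removed by the code.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical, simplified stand-in for the API's token type.
type token struct {
	Text string
	Tag  string // part-of-speech tag, e.g. NOUN or PRT
	Case string // grammatical case; empty when the output shows none
}

// endsAdverbial reports whether the last non-punctuation token of seg
// is a particle with the ADVERBIAL case.
func endsAdverbial(seg []token) bool {
	for i := len(seg) - 1; i >= 0; i-- {
		if seg[i].Tag == "PUNCT" {
			continue
		}
		return seg[i].Tag == "PRT" && seg[i].Case == "ADVERBIAL"
	}
	return false
}

// dropAdverbial splits tokens into comma-delimited segments and keeps
// only the segments that do not end in an adverbial particle.
func dropAdverbial(tokens []token) string {
	var out strings.Builder
	var seg []token
	flush := func() {
		if !endsAdverbial(seg) {
			for _, t := range seg {
				out.WriteString(t.Text)
			}
		}
		seg = seg[:0]
	}
	for _, t := range tokens {
		seg = append(seg, t)
		if t.Text == "、" {
			flush()
		}
	}
	flush() // handle the final segment, which ends with 。 rather than 、
	return out.String()
}

// sample holds a subset of the sentence's tokens, copied from this post.
var sample = []token{
	{"安倍", "NOUN", ""}, {"首相", "NOUN", ""}, {"は", "PRT", ""},
	{"30", "NUM", ""}, {"日", "AFFIX", ""}, {"、", "PUNCT", ""},
	{"野党", "NOUN", ""}, {"党首", "NOUN", ""}, {"と", "PRT", "ADVERBIAL"},
	{"の", "PRT", "GENITIVE"}, {"党首", "NOUN", ""}, {"討論", "NOUN", ""},
	{"で", "PRT", "ADVERBIAL"}, {"、", "PUNCT", ""},
	{"反対", "NOUN", ""}, {"する", "VERB", ""}, {"意向", "NOUN", ""},
	{"を", "PRT", "ACCUSATIVE"}, {"明言", "NOUN", ""}, {"し", "VERB", ""},
	{"た", "VERB", ""}, {"。", "PUNCT", ""},
}

func main() {
	fmt.Println(dropAdverbial(sample)) // prints 安倍首相は30日、反対する意向を明言した。
}
```

Splitting on commas is a rough proxy for phrase boundaries. It happens to fit this sentence and should not be taken as a general rule.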

We see that the character following “30” is an “affix,” which is a grammatical term for something that attaches to another word. Both a prefix and a suffix are types of affixes. Since we know this piece of news describes an event which occurred on the “30th,” we can probably ignore that part of the sentence. After all, it’s not the verb.

Since it does not come at the end of the sentence, we may also eliminate the word which precedes a comma and seems to mark the end of the quoted text.

と        => tag:PRT case:COMPLEMENTIVE proper:NOT_PROPER
述べ      => tag:VERB form:GERUND proper:NOT_PROPER

That leaves us with:

安倍首相は ... 反対する意向を明言した。

Now, we’re getting closer to finding that verb. Let’s look at the very end of the API response:

意向      => tag:NOUN proper:NOT_PROPER
を        => tag:PRT case:ACCUSATIVE proper:NOT_PROPER
明言      => tag:NOUN proper:NOT_PROPER
し        => tag:VERB form:GERUND proper:NOT_PROPER
た        => tag:VERB form:FINAL_ENDING proper:NOT_PROPER tense:PAST

Identifying the accusative particle o を, i.e., the particle which marks the direct object of the verb, is especially helpful. Before the particle comes the object, ikō 意向, which Google Translate tells us means “intention.” And after it comes the verb meigen shita 明言した, which means “made a statement.”

And now we have the subject, object, and verb, and have a basic understanding of how the meaning is assembled! “Prime Minister Abe made a statement of intention.”

That may seem like a lot of work for a trifling result. If it’s any consolation, know that Japanese is one of the hardest languages for English speakers to learn.

Wrapping Up

While it’s probably not the place to start for an early-stage language learner, the Cloud Natural Language API provides some powerful tools for understanding Japanese, Chinese, and Korean, to name just a few of the supported languages. Beyond just syntax, the API supports sentiment analysis (“what is the sentiment of the words?”), entity analysis (“what proper nouns are found?”), and even entity sentiment analysis (“what is the sentiment for each of the proper nouns?”). See here for more details.

In the case of Japanese, parsing a sentence into words and their parts of speech can be tremendously difficult for a student. After all, unlike English, Japanese does not use spaces to indicate word boundaries. The API does the hard work of identifying word boundaries for us. Granted, for the sentence we analyzed, there is a lot more meaning and nuance that we skipped over. Nonetheless, with a little understanding of Japanese grammar and lots of help from the API, we managed to identify the sentence’s core grammatical structure, its meaning, and how it conveys that meaning. That’s pretty exciting!

Finally, the program we wrote in this post is more of a one-off invocation of the API to demonstrate how easy the API is to use and how useful it can be. I hope a reader will be inspired to build more sophisticated and involved applications, and have some fun while doing it!
