Translate a Book with Google Cloud Translation API

I had an autobiography from a family member, written in another language, that I had wanted to translate for a while. You know, one of those projects that sits on your list for a long time and that you just can't seem to complete. Finally the day came to get this done and get hands-on with the Translation API.

Upon initial testing I noticed that the Translation API has a limit of 5,000 characters per request. My book happens to be over 500,000 characters and over 1,000 lines of text.

$ wc -c book.txt
513632 book.txt
$ wc -l book.txt
1029 book.txt

I knew I would have to split the text and send multiple requests to the Translation API. At first I thought I would cut the file into 5,000-character pieces and send those, but then I realized it would make more sense to split the requests at each new line or ‘.’, call the Translation API, store the response, and loop through. So that's what we did. It's only around 30 lines of Python without the progress meter and comments, and it's pretty easy to understand. Find the code on my GitHub.
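To make the approach concrete, here is a minimal sketch of the split-and-loop idea. It is not the script from my GitHub: the helper names (split_line, translate_text, make_google_translator) are made up for illustration, and the Google call assumes the google-cloud-translate package is installed and credentials are already configured.

```python
MAX_CHARS = 5000  # per-request character limit mentioned above


def split_line(line, limit=MAX_CHARS):
    """Split one line into pieces of at most `limit` chars, cutting after a '.' when possible."""
    pieces = []
    while len(line) > limit:
        # Cut just after the last period that still fits within the limit.
        cut = line.rfind(".", 0, limit) + 1
        if cut == 0:
            cut = limit  # no period in range, so fall back to a hard cut
        pieces.append(line[:cut])
        line = line[cut:]
    pieces.append(line)
    return pieces


def translate_text(text, translate_piece):
    """Translate text line by line; `translate_piece` is any function str -> str."""
    out_lines = []
    for line in text.splitlines():
        if not line.strip():
            out_lines.append(line)  # keep blank lines without spending a request
            continue
        translated = [translate_piece(p) for p in split_line(line)]
        out_lines.append("".join(translated))
    return "\n".join(out_lines)


def make_google_translator(target_language="en"):
    """Hypothetical helper: wraps the google-cloud-translate v2 client (assumed installed)."""
    from google.cloud import translate_v2 as translate

    client = translate.Client()

    def translate_piece(piece):
        result = client.translate(piece, target_language=target_language, format_="text")
        return result["translatedText"]

    return translate_piece
```

With credentials in place, usage would look roughly like translate_text(open("book.txt", encoding="utf-8").read(), make_google_translator("en")). Passing the translator in as a function also makes it easy to try the splitting logic with a stand-in such as str.upper before spending any API quota.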

This script has only been tested with Spanish and Hebrew, and it worked well for both.

Did you know Google has been doing translation services for over 10 years?
  • My 500,000 char / 1029 line text file took about 5 minutes to translate (Hebrew)
  • “La mentira en el Quijote / The lie in Don Quixote” 35,000 char / 46 lines took 5 seconds (Spanish)

The cost for the Translation API is $20 per 1,000,000 characters. For my testing this week I am at the following cost:

  • Translate NMT Characters 352,354.00 Count $7.05
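If you want to estimate the bill before running a whole book through, the arithmetic is simple. A quick sketch, assuming the flat $20 per 1,000,000 characters rate quoted above (the function name is just for illustration, and any free tier or pricing change is ignored):

```python
PRICE_PER_MILLION_CHARS = 20.00  # USD, the rate quoted above


def translation_cost(char_count, price_per_million=PRICE_PER_MILLION_CHARS):
    """Estimated cost in USD for translating `char_count` characters."""
    return char_count * price_per_million / 1_000_000


print(f"${translation_cost(352_354):.2f}")  # my testing so far -> $7.05
print(f"${translation_cost(513_632):.2f}")  # the full book -> $10.27
```

The first figure matches the billing line above, and the second is what a single full pass over my 513,632-character book should cost at that rate.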

As long as you send text through the Translation API line by line, breaking long lines at periods, you should be able to preserve most of the formatting of the source text file. Some manual post-processing corrections may be required (see below). The output language is set to English but can be changed by modifying target_language in the script.

Post-translation cleanup

Thanks for reading and have fun with Google Cloud Translation API!