How to OCR & translate large amounts of unstructured data - on a budget

Jan Degener
12 min read · May 14, 2022


Imagine you are working for a company that just acquired a subsidiary somewhere abroad. And because someone didn’t read the right memo, or the memo right, they just found out the subsidiary in question had another subsidiary. But that second subsidiary got liquidated the year prior. But nobody took care of the paperwork.

And now there is this folder full of contracts, memos (which nobody read), PDFs, Excel files and Word documents that might contain toxic information and mean doom for you and your company. But nobody really knows. Oh, and nobody speaks the language these documents are written in, because that business was located even further abroad than the business your company just acquired. Oh, and this needs to be solved by the board meeting next month.

“You work with data, right? Here, have 37GB of data, make it read- and searchable by next month. In English! aaaand we don’t have any money to outsource or buy any help. K thx!” — A boss

Totally hypothetical case of course. But let’s see how to solve that. Oh and let’s just assume the hypothetical language we are talking about is French.

And we are working on Windows… and our language of choice is Python! And here we go…

1 Clearly define the goal

Sounds trivial, I know. But do not underestimate how important this is — with every project, by the way. The information given by the boss above is concrete enough to give us a goal to work towards.

In this case there needs to be a solution that allows us to search for keywords in English, find them within all the documents and only hand those documents to a professional translator if need be.

Great! That means for our goal: a rough translation to text files that retain the information about which file the text originated from.

1.1 Hire 100 interns to do the work

Just kidding. Interns are reserved for those important errands when you crave a nice cold beverage but don’t want to leave your seat.

2 Have a look at the data

Get a feeling for what you are facing, like the data types and the actual file contents. Generally speaking, if you have PDF files: open them! And see if they are all roughly the same size or not. If not, open some of the largest and some of the smallest. Where does the difference in size come from?

In our hypothetical case it turns out we are dealing with a lot of varying data types:

  • searchable PDFs (i.e. PDF files that have text that can be easily extracted)
  • Image PDFs (i.e. these contain images of text or other documents and need to be OCRed first)
  • searchable Word .doc / .docx files (i.e. word files that really contain text as intended)
  • Image Word .doc / .docx files (same as Image PDFs, but in Word format)
  • searchable Excel .xls / .xlsx files
  • A bunch of jpg, png and tiff files
  • a very large .pst file

All files are either loosely present or part of a folder / subfolder structure. So what we are gonna do is convert them to *.txt files and add them, in their original language, to a “to_translate” folder. Afterwards we take all these *.txt files and translate them to English.

3 Extracting Text

This is gonna be a long part explaining how to handle each format and some of their respective caveats. Jump to 4 for the translation part.

All of the following methods return text, which I collected in an outer function that handles saving each output to a *.txt file.

The very large .pst file

.pst files are actually “Outlook files”. And if you have ever worked with data in proprietary MS formats of any kind, you know this can be a huge hassle (since Guido van Rossum joined Microsoft I have at least a tiny bit of hope this will improve in the future).

But that means you have to handle these files first. Because they are emails, and emails might contain attachments, you first have to add these potential PDFs, Word files etc. to the other unstructured files. And retain the information about which email they originated from.

I haven’t found a nice and easy way to handle *.pst directly from Python (there are some solutions which I did not get to work for this project), so instead I opted for using Outlook to import the *.pst, select all files and drag them to a folder in Windows, which basically converts each email into a *.msg file.

Sounds simple enough. BUT if you’re not using Outlook as your daily email driver and just want to fire it up for this, it requires a mail account to log into. Seriously, the program won’t let you do anything without connecting to a mail account first. Luckily there is a way to circumvent this: start it from the Windows Run dialog (Win + R) and type Outlook.exe /PIM xxx

This starts Outlook even if you do not have, or do not want to link, an account. It will create an Outlook profile with the name xxx but without any linked account.

Once you got that handled go to File -> Open&Export -> Import/Export -> Import from another program or file -> Outlook Data File (.pst). If you are using Outlook for other stuff, be careful and create a new folder, because the .pst file contains the entire folder structure of its creator. So if they had an “Inbox” and you have an “Inbox”, they might merge. You probably do not want that!

Now you should have one or several new folders from the *.pst file in Outlook. Select all emails and drag & drop them into a folder in Windows Explorer. This will create *.msg files for each message. These can be handled nicely with extract-msg in Python.

And here is a simple code snippet showing how to extract text from the mail. I did not get to play around with this much because I was more interested in the attachments though:
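A minimal sketch of what that can look like with extract-msg (the function name is my own):

import extract_msg

def msg_to_text(msg_path):
    # open the .msg file and collect the fields we care about
    msg = extract_msg.Message(msg_path)
    parts = [msg.subject or "", str(msg.date or ""), msg.sender or "", msg.body or ""]
    return "\n".join(parts)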

Before doing any more processing though, better add a unique ID to each *.msg file to make it easier to identify later and to link the attachments to individual messages (otherwise the *.msg name will be the same as the mail’s subject line).

Adding unique or random IDs also helps with duplicate email names (Windows might raise an error when you select all the emails from Outlook and drag them to a folder to be converted to *.msg because of duplicate names. In that case, just export in several chunks: add the unique IDs, then export more chunks, add unique IDs…).
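One possible way to do that renaming, as a rough sketch (the function name and the 8-character ID length are my own choices):

import os
import uuid

def add_unique_ids(msg_folder):
    # prepend a short random ID to every *.msg file so duplicate subject
    # lines no longer clash and attachments can be linked back to their mail
    for name in os.listdir(msg_folder):
        if name.lower().endswith(".msg"):
            uid = uuid.uuid4().hex[:8]
            os.rename(os.path.join(msg_folder, name),
                      os.path.join(msg_folder, f"{uid}_{name}"))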

And here’s the function that handles the attachments of each *.msg file in “in_fol” and saves them to “out_fol”, renaming it with the same suffix created above:
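Roughly sketched, assuming the unique ID was prepended to the *.msg file name as above, it can look like this (embedded .msg attachments would need extra handling):

import os
import extract_msg

def extract_attachments(in_fol, out_fol):
    # walk over every *.msg file and dump its attachments to out_fol,
    # reusing the mail's unique ID as a prefix so they stay linked
    for name in os.listdir(in_fol):
        if not name.lower().endswith(".msg"):
            continue
        uid = name.split("_")[0]  # the unique ID added earlier
        msg = extract_msg.Message(os.path.join(in_fol, name))
        for att in msg.attachments:
            att_name = att.longFilename or att.shortFilename or "attachment"
            with open(os.path.join(out_fol, f"{uid}_{att_name}"), "wb") as f:
                f.write(att.data)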

Excellent. Let’s assume all attachments were extracted and added to the pile of other PDFs, Word files etc., and all text from the emails was saved to .txt files and added to the “to_translate” folder.

The searchable PDF files

These are quite nicely and easily handled by pymupdf. Note that for historical reasons the package keeps its old module name… so to use it you actually have to do “import fitz” in Python. But then it’s easy as pie:
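A minimal sketch of the extraction (the function name is mine):

import fitz  # this is how pymupdf is imported

def pdf_to_text(pdf_path):
    # concatenate the text of every page of a searchable PDF
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text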

The image PDF files

This is of course a bit more tricky, but here, and in all cases where normal text extraction might fail, we use PyTesseract as a fallback to OCR text from an image.

First we use pymupdf again to create an image of each page in the PDF (set it to 300 dpi, otherwise it will create quite pixelated images that will not work well with pytesseract). Then pass the image to pytesseract (be sure to switch to the correct language). Save the returned text to a *.txt file and voilà.
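Roughly, and assuming the French language pack (fra) is installed for tesseract, a sketch of that looks like this:

import io
import fitz
import pytesseract
from PIL import Image

def pdf_ocr_to_text(pdf_path, lang="fra"):
    # render each page at roughly 300 dpi and OCR the resulting image
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        pages.append(pytesseract.image_to_string(img, lang=lang))
    doc.close()
    return "\n".join(pages)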

The searchable *.doc / *.docx files

The best way to handle these that I found is through the pywin32 bindings. Of course you will need to have Word installed, otherwise this does not work.

So, easy cakewalk as with the PDFs? Not quite — Word might crash if there are non-standard characters in the filename that your current OS installation does not support. So in these cases a workaround with a temporary file with a new name is needed.

After that we can start our cakewalk and “simply” load the file and save it as text. Okay, it’s not entirely straightforward, because you have to use the obscure method SaveAs2 and good luck finding this site that tells you that you need to specify the number 7 as FileFormat when saving. But after that it’s a cakewalk.

Oh okay, it’s sometimes not clear what encoding was used in the original file and what is now used in the exported text file. So handle both possible French options (there are of course other French encodings as well, but for my case these two were fine). But then it’s a cakewalk.

Extracting text from a Word file. Note that the package pendulum is strictly not needed here… I just much prefer working with pendulum for everything time and date related in Python.
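The core of it boils down to something like the following sketch (the temp-file workaround for odd file names is left out for brevity, and the two encodings are just guesses for typical French cases — adjust them for your data):

import os
import win32com.client as win32

def doc_to_text(doc_path, tmp_txt="word_export.txt"):
    # let the installed Word do the conversion: open the document,
    # then SaveAs2 with FileFormat=7 (encoded text), as mentioned above
    word = win32.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(os.path.abspath(doc_path))
    doc.SaveAs2(os.path.abspath(tmp_txt), FileFormat=7)
    doc.Close(False)
    word.Quit()
    # the exported encoding is not always obvious; try the French candidates
    for enc in ("cp1252", "latin-1"):
        try:
            with open(tmp_txt, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    return ""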

The image Word file

These are files where someone simply pasted scanned images of documents to a page. Sounds horrible I know, but it’s almost easier to get data from than from the “correct” Word files.

All you have to do is load the Word file, save it to PDF (remember this site? You need code 17 this time ;)) and then return to the method where you extracted text from an image PDF.
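Sketched out, with the same caveats as the Word-to-text function above:

import os
import win32com.client as win32

def doc_to_pdf(doc_path, pdf_path):
    # have Word export the document as PDF (FileFormat 17), then the
    # image-PDF OCR routine from above can take over
    word = win32.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(os.path.abspath(doc_path))
    doc.SaveAs2(os.path.abspath(pdf_path), FileFormat=17)
    doc.Close(False)
    word.Quit()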

One thing you have to decide though is when to use this method, and I mean: when do you assume that the Word file does not contain any text but only images? For my use case it worked pretty well to switch to this method if the file size was >1MB and I got <50 words back from the “searchable Word document” method. Not foolproof! But worked well enough…

Image files

This is PyTesseract all over again. As problems might arise with poorly formatted file names, the following code contains the temp-file workaround as well. The oem / psm flags are part of the PyTesseract options that worked well in this case (here is a nice little discussion about them).
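A rough sketch of the idea; note the --oem / --psm values below are placeholders, not necessarily the flags that worked best here:

import os
import shutil
import tempfile
import pytesseract
from PIL import Image

def image_to_text(img_path, lang="fra"):
    # copy to a plain temp name first, since exotic characters in the
    # original file name can cause trouble
    suffix = os.path.splitext(img_path)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    shutil.copy(img_path, tmp_path)
    try:
        img = Image.open(tmp_path)
        # placeholder flags: tune oem/psm for your own images
        return pytesseract.image_to_string(img, lang=lang, config="--oem 3 --psm 6")
    finally:
        os.remove(tmp_path)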

A thing I initially forgot about but that was still handled nicely by PyTesseract is the fact that tiff files can have multiple pages. Probably don’t take the word of a random internet stranger on it and test it again for yourself though! Just a heads up.

Excel files

Being familiar with pandas I chose to outsource the Excel problem to pandas, save the output to csv and catch that output for further processing. This preserves the tabular shape of the data while allowing me to treat it like any other text file. The only things to look out for are to catch all of the sheets and to handle both xls and xlsx. I used xlrd for the former and openpyxl for the latter:
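In a nutshell (the function name is my own):

import pandas as pd

def excel_to_text(xls_path):
    # xlrd reads the old .xls format, openpyxl the newer .xlsx;
    # sheet_name=None returns every sheet as a DataFrame
    engine = "xlrd" if xls_path.lower().endswith(".xls") else "openpyxl"
    sheets = pd.read_excel(xls_path, sheet_name=None, engine=engine)
    # dump each sheet as CSV text so the tabular shape survives
    return "\n\n".join(df.to_csv(index=False) for df in sheets.values())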

And as mentioned above, these are all encapsulated in a larger function that determines which file to direct to which extractor, then saves each output to a text file.
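Stitched together with the sketches above (all helper names are my own), that outer function can look roughly like this:

import os

def convert_file(path, out_fol):
    # route each file to the matching extractor and save the result as *.txt;
    # the >1MB / <50 words heuristic decides when a Word file is treated as images
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        text = pdf_to_text(path) or pdf_ocr_to_text(path)
    elif ext in (".doc", ".docx"):
        text = doc_to_text(path)
        if os.path.getsize(path) > 1_000_000 and len(text.split()) < 50:
            doc_to_pdf(path, path + ".pdf")
            text = pdf_ocr_to_text(path + ".pdf")
    elif ext in (".xls", ".xlsx"):
        text = excel_to_text(path)
    elif ext in (".jpg", ".png", ".tif", ".tiff"):
        text = image_to_text(path)
    else:
        return
    out_name = os.path.basename(path) + ".txt"
    with open(os.path.join(out_fol, out_name), "w", encoding="utf-8") as f:
        f.write(text)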

4 Translating the text

Okay, now we have converted our original 37GB of random files into 7 GB of nice simple text files. Now comes the hard part: translating all of it.

Online Resources

To be honest, my first idea was to simply pour all of it into Google Translate, DeepL or a similar solution — fully anticipating some restrictions and IP blockings etc. (honestly, just don’t do that for this amount of data!)

The biggest obstacle was the character limit of each site. If I recall correctly there was a limit of merely 5000 characters per request. Combine that with walls that quickly shoot up once you try your third or fourth larger request in a row (and from what I could gather, the measures even went beyond simple IP blocking). Google Translate, Bing Translate, Reverso and even Yandex almost immediately blocked early requests, and some, I think, allowed even fewer characters per request. So just don’t try!

For 7GB of text data a limit of 5000 characters is… well, nothing. I started to write code that would save text in dictionaries and check if a translation already existed before sending it again (and similar hacky solutions), so the same text wouldn’t have to be translated over and over. But even that didn’t help. If there had been any kind of budget I honestly would have simply bought a DeepL subscription and used their API. So if you have one: just do that! Easy to implement and it works incredibly well.

Local Solution — libretranslate.com

I realized rather early that even if the online solution had worked, I would have been wasting computing resources by sending all of these 5000-character chunks to a web server, waiting 3–5 seconds idle for a response, and not leveraging any of my own CPU/GPU in the meantime.

Luckily I had read about libretranslate before (it’s a free and open source machine translation API) and thought it probably couldn’t hurt to quickly fire it up and try the API. However, “quickly” turned into a day trying to figure out how to get it to run on Windows (I’m not complaining, it just is what it is with most open source. More often than not the average Windows end user wasn’t the target group). Missing required packages, missing compilation environments (solved by the great Gohlke binaries) and a general lack of Windows support took longer than expected.

But at the end of the day I got it to run… only for libretranslate to start downloading my specified language packages on startup… then crash for some server reason, with me unable to get it to restart the download. That’s all I wanted: redo the language download that gets triggered automatically after installation. Couldn’t figure it out. It took me hours before I gave up (it depends on some other packages that do the actual translation work… so figuring out what does what and how to get these sub-parts running was where I gave up on that path).

I really don’t want to complain when free work of other people does not work for my special use case — so just take it as my brain was not smart enough to figure it out by itself. Your experience with your own head may vary.

In the end the working fix was to simply follow libretranslate’s own recommendation to run it through Docker for Windows! BUT to get this running, you have to have the Windows Subsystem for Linux enabled as well as virtualization in your BIOS. Which, it turned out, is named “SVM Mode” on my board, which took me another hour to figure out ;)

BUT, once all of that worked, it worked like a charm. In docker you simply fire up an instance with

docker run -ti --rm -p 5002:5002 libretranslate/libretranslate --load-only fr,en --port 5002

which launches a libretranslate webserver that listens on port 5002 and only downloads the French and English language models when called like this (instead of downloading all languages, which increases the chance of crashing). I think the docs refer to port 5000, but it’s your machine. Do whatever you want! You even get a nice GUI when you visit 127.0.0.1:5002 from your browser. Then all you have to do is use Python’s requests package to send text chunks to this local “webserver” and wait for it to return:

import requests

port = 5002
base_url = f"http://localhost:{port}/translate"
headers = {"Content-Type": "application/json"}
j = {"q": chunk,          # the chunk of French text to translate
     "source": "fr",
     "target": "en",
     "format": "text"
     }
response = requests.post(base_url, json=j, headers=headers)
translated = response.json()["translatedText"]

Works like a charm! You can even fire up several Docker instances on other ports to your heart’s content until your CPU/GPU is fully swamped with translation tasks, and then fire up corresponding Python instances to serve data to these endpoints. Beautiful! You can’t imagine how happy I was when this (hypothetical!) thing finally worked.

I think I had 7 threads working in parallel for a week in the end to get all of it translated. The one thing I noticed though was that you shouldn’t feed it chunks of text that are too large at a time, or the translation slows down significantly. It is better to build in a little logic that splits the text into smaller chunks of at most 10,000 characters (split at a natural breakpoint) and then reassembles them afterwards. But then it purred like a well-fed translation cat.
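A little sketch of that chunking logic, using line breaks as the “natural breakpoint”:

def split_into_chunks(text, max_chars=10_000):
    # keep every chunk below max_chars, splitting only at line breaks
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

# translate each chunk separately, then glue the results back together:
# translated = "".join(translate(c) for c in split_into_chunks(text))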

Summary

If you have random documents you want or have to translate:

  • Use pywin32 for MS Office documents to extract text (need to have Office installed)
  • Use pytesseract to OCR text from images
  • Use libretranslate to host a local translation server that you can feed all of your text
  • Never hear of your project again because something else came up and everyone forgot they really really needed to check those files
