Going paperless with OCRmyPDF

Ivan “elgris” Kirichenko
2 min read · Oct 27, 2018


Adult life is about many things, and one of them is the huge pile of paper you have to deal with: contracts, receipts, notifications, your documents, and the letters from your granny that you keep around just in case. That pile weighs A LOT, but you never notice it until you move to another house.

I moved recently, and I felt the pain. Then I tried to find a notification from my insurance company, and it took me several hours to sift through that pile of paper cr^W. I felt the pain again. All right, calm down. Take your flamethrower and prepare to join the happy community of those who went paperless.

Step 1: Transform that pile of paper into a pile of PDF files

Any scanner will do, even a DIY one. Seriously, I’m going to build one, but for this task I borrowed a scanner with a paper feed.

Step 2: Make that pile searchable

There are a lot of tools that can help. Many of them cost money. Some of them keep YOUR docs in THEIR cloud, so the NSA/FSB/Anonymous hackers can also enjoy reading your Granny’s letters. Not good.

There are also quite big Document Management Systems like Mayan or Paperless. But using additional storage just to keep an index of PDFs? C’mon, it’s 2018 out there, and modern operating systems can search quickly through searchable PDFs on your hard drive.

I need a solution that is a) free, b) simple to use, c) runs locally, and d) just makes PDFs searchable without forcing me to use an additional database for indexing. And here it is: OCRmyPDF. Everything is neatly packed into a Docker image. Yay!!! Let’s ask the tool to process all our PDFs and make them searchable.

# The output folder must exist before the first run
mkdir -p out
for filename in ./*.pdf; do
  docker run --rm -v "$(pwd):/tmp" jbarlow83/ocrmypdf -l deu "/tmp/${filename#./}" "/tmp/out/${filename#./}"
done

Here -l deu sets the OCR language of the input PDFs: German (DEUtsch), in my case.
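If you rerun the loop after every new scan, you probably don't want to OCR the whole archive again. Here is a minimal sketch of the same loop wrapped in a function that skips files already present in the output folder. The function name ocr_all and the variables OCR_LANG, OUT_DIR and DOCKER are my own inventions for illustration, not part of OCRmyPDF or Docker; the docker command itself is the one used in the loop.

```shell
#!/bin/sh
# Hypothetical helper, not part of OCRmyPDF: OCR every PDF in the current
# folder, skipping files that already have a result in the output folder.

OCR_LANG="${OCR_LANG:-deu}"   # assumption: language code passed to -l
OUT_DIR="${OUT_DIR:-out}"     # assumption: output folder, relative to the current one
DOCKER="${DOCKER:-docker}"    # assumption: set DOCKER=echo for a dry run

ocr_all() {
  mkdir -p "$OUT_DIR"
  for f in ./*.pdf; do
    # If no PDF matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    name="${f#./}"
    if [ -e "$OUT_DIR/$name" ]; then
      echo "skip $name"
      continue
    fi
    # Same invocation as in the loop: mount the current folder at /tmp.
    $DOCKER run --rm -v "$(pwd):/tmp" jbarlow83/ocrmypdf \
      -l "$OCR_LANG" "/tmp/$name" "/tmp/$OUT_DIR/$name"
  done
}
```

Source the file and call ocr_all from the folder with your scans. With DOCKER=echo set first, it only prints the commands it would run, which is a cheap way to check the paths before burning any CPU time.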

And that’s it, folks! I’ve burned that huge pile of paper and I’m happy about it! Now every time I receive a new letter, I do the same: I scan it to a PDF and make it searchable with OCRmyPDF.
