Configuring Tesseract OCR in Heroku-16

Pravesh Koirala
Jan 31, 2018


Tesseract is an OCR engine sponsored by Google. It is open-source and its binaries are readily available for most platforms, so it is a popular go-to library when an app needs OCR functionality. Setting up tesseract is hassle-free in most popular development environments, but I couldn’t find a simple walk-through for setting it up on heroku as of this writing. The top Google search results point to custom heroku buildpacks created by other developers, but for some reason none of those worked for me. If you are running into problems with those buildpacks as well, this might be worth a shot.

Note: This has only been tested on heroku-16, not on cedar-14.

Steps

Add the heroku-buildpack-apt buildpack using the command:

heroku buildpacks:add --index 1 https://github.com/heroku/heroku-buildpack-apt

Create a file named Aptfile in your app directory and paste the following:

tesseract-ocr
tesseract-ocr-eng

Note that tesseract-ocr-eng is the English language data for tesseract. If you want to enable languages other than English, add the corresponding package on its own line (e.g. tesseract-ocr-spa for Spanish).
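
As a small illustration, an Aptfile for English plus Spanish could be written from the shell like this. Spanish is only the example language from the note; swap in whichever language packages you actually need.

```shell
# Write an Aptfile that installs tesseract with English and Spanish data.
# Run this from the root of your app directory (it overwrites any existing Aptfile).
cat > Aptfile <<'EOF'
tesseract-ocr
tesseract-ocr-eng
tesseract-ocr-spa
EOF
```

Commit the Aptfile along with the rest of your app so the buildpack picks it up on the next deploy.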

Next you’ll have to set a heroku config variable named TESSDATA_PREFIX. This is the path to the data downloaded by the tesseract-ocr-eng package.

heroku config:set TESSDATA_PREFIX=/app/.apt/usr/share/tesseract-ocr/tessdata
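
To sanity-check that path, you could confirm from a heroku shell (heroku run bash) that the directory really contains the language data. This is only a minimal sketch: has_lang is a hypothetical helper name, and it assumes the data files follow tesseract's usual <lang>.traineddata naming.

```shell
# has_lang DIR [LANG]: succeed if DIR contains LANG.traineddata (default: eng).
# Hypothetical helper for illustration, not part of tesseract or heroku.
has_lang() {
  dir="$1"
  lang="${2:-eng}"
  [ -f "$dir/$lang.traineddata" ]
}

# Example: check the directory used for TESSDATA_PREFIX.
if has_lang /app/.apt/usr/share/tesseract-ocr/tessdata eng; then
  echo "eng data found"
else
  echo "eng data missing"
fi
```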

If for some reason this does not work for you, you’ll have to find the path of the tessdata directory yourself. To do this, open a heroku shell with heroku run bash and run the following command:

find -iname tessdata

Then you can change your TESSDATA_PREFIX config accordingly.
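
If you'd rather have the shell print the exact command for you, the lookup can be wrapped in a small function. This is a rough sketch, not a tested recipe: find_tessdata is a hypothetical helper name, and it simply takes the first directory called tessdata that it finds under the starting directory.

```shell
# find_tessdata [ROOT]: print the first directory named "tessdata"
# (case-insensitive) under ROOT, defaulting to the current directory.
# Hypothetical helper for illustration.
find_tessdata() {
  root="${1:-$PWD}"
  find "$root" -type d -iname tessdata 2>/dev/null | head -n 1
}

# Inside `heroku run bash`: print the config command to run from your own machine.
path="$(find_tessdata)"
if [ -n "$path" ]; then
  echo "heroku config:set TESSDATA_PREFIX=$path"
fi
```

Because the search starts from $PWD by default, the printed path is absolute and can be pasted straight into the config command.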

Hope this helps out!
