How to do OCR in Ruby on OSX

I found this story and tried to run it, but ran into a couple of problems. To save you time, here is my revised version.

A few words about what Tesseract is. Tesseract is an OCR library that was originally developed by Hewlett-Packard and released as open source in 2005. Its development is currently supported by Google. Tesseract is considered to be one of the most accurate OCR engines available.

There is a Ruby gem, tesseract-ocr, which is basically a wrapper around the Tesseract API. The original article recommended installing Tesseract through Homebrew (to get the most recent version). However, I found that the most recent version of Tesseract is not supported by the tesseract-ocr gem. So I removed the most recent version (3.04 at the time of writing) and installed version 3.02.02_3 instead.

To install Tesseract 3.02.02_3, run this:

$ brew install https://raw.githubusercontent.com/Homebrew/homebrew/8ba134eda537d2cee7daa7ebdd9f728389d9c53e/Library/Formula/tesseract.rb

If you have already installed the most recent version by mistake, uninstall it first and then run the same command:

$ brew uninstall tesseract
$ brew install https://raw.githubusercontent.com/Homebrew/homebrew/8ba134eda537d2cee7daa7ebdd9f728389d9c53e/Library/Formula/tesseract.rb
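To confirm which version actually ended up installed, the output of `tesseract --version` can be checked from Ruby. This is only a rough sketch: the helper names are made up for illustration, and treating everything newer than 3.02.02 as unsupported is my assumption (the only data points above are that 3.02.02 worked and 3.04 did not).

```ruby
# Newest Tesseract release that worked with the gem in this post.
# Treating every later release as unsupported is an assumption.
MAX_SUPPORTED = Gem::Version.new('3.02.02')

# Pull the version number out of text such as "tesseract 3.02.02".
# Returns nil when no version number is found.
def tesseract_version(version_output)
  match = version_output.match(/tesseract\s+(\d+(?:\.\d+)*)/i)
  match && Gem::Version.new(match[1])
end

# True when a version was found and it is not newer than MAX_SUPPORTED.
def supported?(version_output)
  version = tesseract_version(version_output)
  !version.nil? && version <= MAX_SUPPORTED
end

# Possible usage (needs Tesseract on the PATH):
#   require 'open3'
#   output, _status = Open3.capture2e('tesseract', '--version')
#   warn 'This Tesseract may not work with the gem' unless supported?(output)
```

Open3.capture2e is used in the usage comment because it collects both stdout and stderr, so it does not matter which stream the version is printed on.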

When Tesseract is installed, create a Gemfile:

# Gemfile
source 'https://rubygems.org'
gem 'tesseract-ocr'

And the Ruby file (ocr-tesseract.rb):

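# encoding: utf-8
require 'tesseract'

# Usage: ruby ocr-tesseract.rb IMAGE LANGUAGE
engine = Tesseract::Engine.new do |config|
  config.language  = ARGV[1] # language code, e.g. eng
  config.blacklist = '|'     # characters Tesseract should never output
end

# Split the recognized text into lines and drop the empty ones.
def clean(text)
  text.split(/\n/).reject(&:empty?)
end

puts clean(engine.text_for(ARGV.first))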

Then run bundle to install the gem:

$ bundle

By default, Tesseract supports only English. Many more languages are available, and you can install one by downloading its trained data file:

$ cd /usr/local/share/tessdata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/pol.traineddata

You can replace pol.traineddata with the file for the language of your choice, for example nor.traineddata for Norwegian. The list of available files can be found in the tesseract-ocr/tessdata repository on GitHub.
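To see which languages are available locally after downloading, you can list the .traineddata files in the tessdata directory. A minimal sketch, assuming the Homebrew path used above (the helper name is my own, not part of the gem):

```ruby
# List the language codes that have a .traineddata file in a tessdata
# directory. The default is the Homebrew path used above; pass another
# directory if your installation differs.
def installed_languages(tessdata_dir = '/usr/local/share/tessdata')
  Dir.glob(File.join(tessdata_dir, '*.traineddata'))
     .map { |path| File.basename(path, '.traineddata') }
     .sort
end

puts installed_languages
```

Each name it prints is a value you can pass to the script as the language parameter.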

After you have added the languages you need, you can run the Ruby script:

$ ruby ocr-tesseract.rb file-to-recognize.png eng

The first parameter is the input file that you want to recognize, and the second is the language.
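The script above reads both parameters straight from ARGV without checking them. If you want friendlier behavior, something like the following could go in front of it. This is just a sketch: parse_args is a hypothetical helper, and defaulting to eng when no language is given is my own choice.

```ruby
# Hypothetical helper: split the command-line arguments into an image path
# and a language code, falling back to English when no language is given.
def parse_args(argv, default_language = 'eng')
  path, language = argv
  raise ArgumentError, 'usage: ruby ocr-tesseract.rb IMAGE [LANGUAGE]' if path.nil?
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)

  [path, language || default_language]
end

# Possible usage inside ocr-tesseract.rb:
#   path, language = parse_args(ARGV)
#   engine = Tesseract::Engine.new { |config| config.language = language }
#   puts engine.text_for(path)
```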

A couple more general comments about Tesseract:

  1. It doesn’t seem to detect the orientation of the image. So before sending an image for processing, you have to rotate it upright. Otherwise the text won’t be recognized.
  2. Another recommendation from Wikipedia:
    Tesseract’s output will be very poor quality if the input images are not preprocessed to suit it: Images (especially screenshots or photos) must be scaled up such that the text x-height is at least 20 pixels, any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be high-pass filtered, or Tesseract’s binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters.
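The rotation and scaling advice can be turned into a small preprocessing step. The sketch below assumes ImageMagick's convert tool is installed (it is not part of the setup above) and that you measured the x-height of the text yourself. The helper names are made up for illustration. It only builds the command, so you can inspect it before running it.

```ruby
# Minimum x-height (in pixels) recommended in the quote above.
TARGET_X_HEIGHT = 20.0

# Percentage by which to enlarge an image whose text has the given x-height.
# Never shrinks: images that already meet the target stay at 100.
def resize_percent(x_height_px)
  raise ArgumentError, 'x-height must be positive' unless x_height_px.positive?

  ([TARGET_X_HEIGHT / x_height_px, 1.0].max * 100).ceil
end

# Build (but do not run) an ImageMagick command that rotates and enlarges
# the image. Assumes the `convert` tool is available on the PATH.
def preprocess_command(input, output, x_height_px:, rotate_degrees: 0)
  ['convert', input,
   '-rotate', rotate_degrees.to_s,
   '-resize', "#{resize_percent(x_height_px)}%",
   output]
end

# Possible usage: text with an 8 pixel x-height, scanned sideways.
#   system(*preprocess_command('scan.png', 'scan-ready.png',
#                              x_height_px: 8, rotate_degrees: 90))
```

This covers only rotation and scaling. The brightness filtering and border removal mentioned in the quote are not handled here.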

That’s basically all.