Cleaning Up Scanned Documents with Open Source Tools

As more and more Malaysian government information is taken offline under the current government, an increasing amount of work is needed to scan and digitize documents. In the current digital landscape of Malaysia, documents that are not available online may as well be inaccessible to the public. Sifting through hard copies of large amounts of information is also not really a feasible proposition for researchers. Digital formats allow the public and researchers to quickly search and categorize hundreds of thousands of pages of documents.

The source documents will not always be nicely scanned, OCR’ed and in PDF format. More often than not, we can expect photos of text taken with camera phones too. These images need to be cleaned up somewhat before we can make them available on platforms such as Parliamentary Documents. A broader platform for archived Malaysian government documents, based on this same platform, is in the works.

Example of skewed text from scanned parliamentary documents

The Tools

ImageMagick is a useful utility for manipulating images, converting them to different formats, or splitting them up.

Splitting PDF pages into images

Scanned documents often arrive as PDFs, usually without OCR, and need to be split into individual page images before processing.

convert -density 600 file.pdf -trim -quality 100 page-%04d.jpg
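
When there is more than one scanned PDF to process, the same command can be wrapped in a small loop-friendly function. The following is only a sketch under a few assumptions: the function names `page_pattern` and `split_pdf`, the output directory argument and the `DRY_RUN` switch are made up for illustration and are not part of ImageMagick.

```shell
#!/bin/sh
# Sketch: split one PDF into per-page JPEGs named after the source file.
# page_pattern, split_pdf and DRY_RUN are illustrative names, not ImageMagick features.

# page_pattern FILE.pdf OUTDIR  ->  prints OUTDIR/FILE-%04d.jpg
page_pattern() {
    base=$(basename "$1" .pdf)
    printf '%s/%s-%%04d.jpg\n' "$2" "$base"
}

# split_pdf FILE.pdf OUTDIR
# With DRY_RUN=1 the command is printed instead of executed.
split_pdf() {
    pattern=$(page_pattern "$1" "$2")
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "convert -density 600 $1 -trim -quality 100 $pattern"
    else
        mkdir -p "$2"
        # -density comes before the input so that it controls how the PDF is rasterised
        convert -density 600 "$1" -trim -quality 100 "$pattern"
    fi
}
```

A typical call would be `for f in *.pdf; do split_pdf "$f" pages; done`. Naming the pages after the source file keeps the images from different documents apart, and the zero-padded page number keeps them in page order when they are listed later.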

Deskewing

Once the PDF pages are split into images, you need to deskew them, detect the content, split pages (if they were scanned as a two-page book spread) and finally output them nicely formatted with margins.

The brilliant tool ScanTailor will do this all automatically for single or multiple pages.

Deskewed image after running ScanTailor

Putting it all together

Using ImageMagick we can now put it all back together again, nicely deskewed and formatted.

convert *.tif output.pdf

Create PDF with OCR Text with pdfsandwich

Tesseract is a command line OCR tool that supports multiple languages. pdfsandwich converts PDFs into images for tesseract to process, then merges the recognized text back into a PDF, so that users can search it and copy and paste text from it.

The example below handles mixed Malay and English text, which is common in Malaysian government documents.

pdfsandwich -lang msa+eng -grayfilter input.pdf
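
For a whole directory of cleaned-up PDFs, the command above can be run in a loop. This is a minimal sketch, assuming the PDFs all live in one directory; `ocr_output` and `ocr_all` are hypothetical helper names. It writes each result next to its source with an `_ocr.pdf` suffix (the same naming pdfsandwich uses by default when `-o` is not given) and skips files that have already been processed.

```shell
#!/bin/sh
# Sketch: OCR every PDF in a directory with pdfsandwich.
# ocr_output and ocr_all are illustrative helper names.

# ocr_output FILE.pdf  ->  prints FILE_ocr.pdf
ocr_output() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# ocr_all DIR
ocr_all() {
    for pdf in "$1"/*.pdf; do
        # The glob stays unexpanded when the directory has no PDFs
        [ -e "$pdf" ] || continue
        # Do not OCR our own output files again
        case "$pdf" in
            *_ocr.pdf) continue ;;
        esac
        out=$(ocr_output "$pdf")
        # Skip documents that were already processed in an earlier run
        if [ -e "$out" ]; then
            continue
        fi
        pdfsandwich -lang msa+eng -grayfilter -o "$out" "$pdf"
    done
}
```

Running `ocr_all scans` would then process everything in a `scans` directory. OCR at high resolution is slow, so skipping finished files makes it cheap to re-run the loop after adding new documents.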

pdftk

pdftk is another useful command line tool, used to merge, split and fix PDF documents. When the Malaysian parliamentary document splitter script fails because there is not enough data to parse, tools like pdftk help us quickly split and rejoin wrongly split PDFs.

The following command, for example, extracts just page 6 from the PDF as an individual PDF file.

pdftk input.pdf cat 6 output soalan-3.pdf
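
The same `cat` operation also accepts page ranges and several input files, which covers the "split and rejoin" case. The sketch below wraps both in small functions; `extract_pages` and `join_pdfs` are hypothetical names, and rejecting a reversed range is a choice made for this example rather than a pdftk requirement.

```shell
#!/bin/sh
# Sketch: thin wrappers around pdftk's cat operation.
# extract_pages and join_pdfs are illustrative names.

# extract_pages INPUT FIRST LAST OUTPUT
extract_pages() {
    # A reversed range is treated here as a likely typo
    if [ "$2" -gt "$3" ]; then
        echo "extract_pages: first page $2 is after last page $3" >&2
        return 1
    fi
    range=$2
    # Use the FIRST-LAST form only when more than one page is wanted
    [ "$2" -eq "$3" ] || range="$2-$3"
    pdftk "$1" cat "$range" output "$4"
}

# join_pdfs OUTPUT INPUT...
# Concatenates the inputs in the order given.
join_pdfs() {
    out=$1
    shift
    pdftk "$@" cat output "$out"
}
```

For example, `extract_pages input.pdf 6 8 soalan-3.pdf` would pull pages 6 to 8 into one file, and `join_pdfs fixed.pdf part-1.pdf part-2.pdf` would stitch two wrongly split pieces back together.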

The final result

The skewed document is now readable and searchable by both people and computers, with accurate OCR text, on pardocs.sinarproject.org