Cleaning Up Scanned Documents with Open Source Tools

Khairil Yusof
Jan 11, 2017 · 4 min read

As more and more Malaysian government information goes off-line under the current government, there is an increasing amount of work needed to scan and digitize documents. In the current digital landscape of Malaysia, documents that are not available on-line may as well be inaccessible to the public. Sifting through hard copies of large amounts of information is also not really a feasible proposition for researchers. Digital formats allow the public and researchers to quickly search and categorize hundreds of thousands of pages of documents.

The source documents will not always be nicely scanned, OCR’ed and in PDF format. More often than not, we can expect them to be photos of text taken with camera phones too. These images need to be cleaned up somewhat before we can make them available on platforms such as Parliamentary Documents. A broader platform for archived Malaysian government documents, based on this same platform, is in the works.

Update (2017): The Malaysian Government Documents Archives mentioned above has since been developed and now hosts thousands of searchable government reports and other documents.

Example of skewed text from scanned parliamentary documents

The Tools

ImageMagick is a useful utility for manipulating images, converting them to different formats, or splitting them up.

Splitting PDF pages into images

convert -verbose -density 300 file.pdf -quality 100 -trim page-%04d.jpg

When dealing with very large documents, this command may fail. In that case, or when we want to make use of all CPU cores to convert the PDF pages to images, we can use the command line tool GNU Parallel.

The ImageMagick convert command takes file.pdf[n], where n is a zero-based page number, to convert just one page, or a range of pages. With the pdfinfo command we can find out how many pages there are and then use parallel to process all pages concurrently.

For an 80-page document:

parallel convert -density 300 document.pdf[{}] -quality 100 -trim pages-%04d.jpg ::: {0..79}
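
The two steps above (count the pages with pdfinfo, then hand the page numbers to parallel) can be combined in a small script. This is a minimal sketch rather than a tested recipe: the function names page_count, page_jobs and split_pdf are made up for illustration, it assumes pdfinfo, GNU Parallel and ImageMagick are installed, and it names each output file explicitly instead of relying on %04d, so that every job is guaranteed to write to a distinct file.

```shell
#!/usr/bin/env bash
# Sketch: split a PDF into one JPEG per page, using all CPU cores.
# Function names are illustrative and not part of any tool.

# Read pdfinfo output on stdin and print the number after "Pages:".
page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Print one line per page: "<zero-based index> <output file name>".
page_jobs() {
  local total="$1" i=0
  while [ "$i" -lt "$total" ]; do
    printf '%d pages-%04d.jpg\n' "$i" "$i"
    i=$((i + 1))
  done
}

# Convert every page of the PDF given as $1, one parallel job per page.
# Assumes the file name contains no spaces or shell metacharacters.
split_pdf() {
  local pdf="$1" total
  total=$(pdfinfo "$pdf" | page_count)
  [ -n "$total" ] || return 1
  # --colsep splits each input line into {1} (page index) and {2} (file name).
  page_jobs "$total" |
    parallel --colsep ' ' convert -density 300 "${pdf}[{1}]" -quality 100 -trim '{2}'
}
```

With these definitions loaded, split_pdf document.pdf should behave like the 80-page command above without the page count having to be typed in by hand.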

Note
On some Linux distributions, you will need to enable ImageMagick operations for PDF by changing a line in /etc/ImageMagick-6/policy.xml

Enable read/write by finding the PDF policy line and editing it as below:

<policy domain="coder" rights="read|write" pattern="PDF" />
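
The edit can also be scripted. The sketch below assumes the distribution ships the PDF entry with some other rights value (commonly rights="none"). The function name enable_pdf_policy is hypothetical, and it prints the modified file to standard output rather than editing in place, so the result can be reviewed before it replaces the real policy file.

```shell
#!/usr/bin/env bash
# Sketch: print a copy of an ImageMagick policy file with read/write enabled
# for the PDF coder. enable_pdf_policy is an illustrative name.
enable_pdf_policy() {
  # Only lines that mention pattern="PDF" are changed; others pass through.
  sed -E '/pattern="PDF"/ s/rights="[^"]*"/rights="read|write"/' "$1"
}
```

One way to use it would be to write the output to a temporary file, compare it with the original using diff, and then copy it into place with root privileges.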

Deskewing

The brilliant tool ScanTailor will do all of this automatically for single or multiple pages.

Deskewed image after running ScanTailor
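
ScanTailor is an interactive tool. For batches where a graphical step is impractical, one possible alternative (not what was used for the images above) is ImageMagick's own -deskew option. The sketch below assumes a 40% threshold, the value the ImageMagick documentation suggests as a starting point. The function names and the -deskewed.tif suffix are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: deskew page images with ImageMagick instead of ScanTailor.

# Map an input image name to an output name, e.g. a.jpg -> a-deskewed.tif.
deskewed_name() {
  printf '%s-deskewed.tif\n' "${1%.*}"
}

# Deskew every image passed as an argument, stopping at the first failure.
deskew_all() {
  local img
  for img in "$@"; do
    convert "$img" -deskew 40% "$(deskewed_name "$img")" || return 1
  done
}
```

A call such as deskew_all pages-*.jpg would process a whole batch. Expect cruder results than ScanTailor, which also handles margins and content selection.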

Putting it all together

convert *.tif output.pdf

As with splitting the PDF into images, the combined PDF may be too big and this command might fail, so we may need to convert each image into a PDF separately and then combine them all into one PDF again.

ls *.tif | parallel convert {} {.}.pdf

Then join all the separate single-page PDFs into one with the pdftk command:

pdftk *.pdf cat output document.pdf
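
One thing to watch with pdftk *.pdf is that the glob picks up every PDF in the directory, including an earlier document.pdf. A minimal sketch of the same per-page approach that works in a temporary directory follows. images_to_pdf is a hypothetical helper name, and it uses a plain loop instead of parallel to keep the example short.

```shell
#!/usr/bin/env bash
# Sketch: build one PDF from page images, converting each page separately.
# Usage: images_to_pdf OUTPUT.pdf IMAGE...
images_to_pdf() {
  local out="$1" workdir page img status n=0
  shift
  workdir=$(mktemp -d) || return 1
  for img in "$@"; do
    # Zero-padded names keep the glob below in the order the images were given.
    page=$(printf '%s/page-%04d.pdf' "$workdir" "$n")
    if ! convert "$img" "$page"; then
      rm -rf "$workdir"
      return 1
    fi
    n=$((n + 1))
  done
  # Only the freshly created single-page PDFs are joined.
  pdftk "$workdir"/page-*.pdf cat output "$out"
  status=$?
  rm -rf "$workdir"
  return "$status"
}
```

For the ScanTailor output this could be called as images_to_pdf document.pdf *.tif.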

Create PDF with OCR Text with pdfsandwich

The example below is for mixed Malay and English text, which is common in Malaysian government documents.

pdfsandwich -lang msa+eng -rgb input.pdf

Notes
The -rgb option preserves the colour of the original images. Switch to -gray for black and white documents.
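
pdfsandwich hands the OCR work to Tesseract, so the msa and eng language data must be installed for the command above to work. The sketch below checks this first using tesseract --list-langs. The function names langs_available and ocr_pdf are illustrative, and the package that provides the Malay data varies by distribution.

```shell
#!/usr/bin/env bash
# Sketch: verify Tesseract language data before running pdfsandwich.

# Succeed only if every language in a "+"-joined spec (e.g. msa+eng)
# appears as a whole line in the language list read from stdin.
langs_available() {
  local spec="$1" installed lang
  installed=$(cat)
  for lang in $(printf '%s' "$spec" | tr '+' ' '); do
    printf '%s\n' "$installed" | grep -qxF "$lang" || return 1
  done
}

# OCR a PDF, defaulting to mixed Malay and English.
ocr_pdf() {
  local pdf="$1" langs="${2:-msa+eng}"
  # Older Tesseract versions print the list on stderr, hence 2>&1.
  if ! tesseract --list-langs 2>&1 | langs_available "$langs"; then
    echo "missing Tesseract language data for: $langs" >&2
    return 1
  fi
  pdfsandwich -lang "$langs" -rgb "$pdf"
}
```

Calling ocr_pdf input.pdf then either runs the same pdfsandwich command as above or stops early with a message naming the requested languages.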

pdftk

For example, the following command extracts just page 6 from the PDF as an individual PDF file.

pdftk input.pdf cat 6 output soalan-3.pdf
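
The cat operation also accepts page ranges such as 6-10. As a small sketch (the helper names page_range and extract_pages are made up here), the extraction can be wrapped so that a single page and a range are handled the same way:

```shell
#!/usr/bin/env bash
# Sketch: extract one page or a page range with pdftk.

# Print "N" for a single page or "FIRST-LAST" for a range.
page_range() {
  local first="$1" last="${2:-$1}"
  if [ "$first" -eq "$last" ]; then
    printf '%s\n' "$first"
  else
    printf '%s-%s\n' "$first" "$last"
  fi
}

# Usage: extract_pages INPUT.pdf OUTPUT.pdf FIRST [LAST]
extract_pages() {
  local src="$1" dest="$2" first="$3" last="${4:-$3}"
  pdftk "$src" cat "$(page_range "$first" "$last")" output "$dest"
}
```

Here extract_pages input.pdf soalan-3.pdf 6 runs the same command as above, while adding a fourth argument of 10 would extract pages 6 to 10 instead.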

The final result

Image for post
