Better OCR for Newspapers

John Scancella
Aug 20, 2020


Disclaimers: I do not currently work for the Library of Congress (though I have in the past). All work on this initiative has been done on my own time.

Now that I have that out of the way, let’s get to the good stuff!

When you think of the world’s best collections, you probably think of museums like the Smithsonian or the Louvre, but there are also extraordinary collections available for free from the U.S. Library of Congress (LOC). One such LOC collection under long-term development is “Chronicling America” (known internally as “Chronam”), a website providing searchable access to digital information about historic American newspapers published from 1690 to the present. During a four-year stint as an Information Technology Specialist at the Library of Congress, I worked to deploy Chronam to cloud infrastructure. One of the notable features of Chronam is that it provides Optical Character Recognition (OCR) text for the high-resolution image of each newspaper page; however, there is one big problem with it: the OCR text is often unreadable. For example, https://chroniclingamerica.loc.gov/lccn/sn83030213/1842-04-22/ed-1/seq-1/ocr/ is just gibberish.

Some of the reasons why the OCR is so illegible are:

  • The text is grouped in columns
  • The text appears in many different sizes
  • Because the pages are old, a lot of the ink has faded or worn off
  • There are lines delineating columns and paragraphs
  • There are small images
  • The OCR was generated using an old version of Tesseract

These things all cause OCR programs such as Tesseract to produce text that’s difficult to read: the OCR reads straight across the page, so each column of text gets mixed in with the others.

Some of the oldest pages in the collection have some of the worst OCR results. One example is the New-York tribune. [volume] (New-York [NY]) 1841–1842 collection, which I will use to show a way to get better OCR results.

If we could crop the images into columns, then OCR would produce much better results, results that might make the difference in someone’s finding that needle in a haystack. With that in mind, I set out to do it in an automated way; doing it manually would take far too long and be too expensive. This is a work in progress, and I am not a Python developer by nature, so pull requests for improvement are welcome (https://github.com/jscancella/NYTribuneOCRExperiments).

The main idea is simple: use OpenCV and its many powerful transformations to modify the image and find the columns. The key method we need is findContours(), which finds white areas in a black-and-white image and returns them as collections of coordinates. Some of you have already spotted the first problem: the photos aren’t black and white, and the text isn’t white. Here again, we can use OpenCV to transform the image before calling findContours(). However, that still isn’t good enough, as findContours() would just find all the individual bits of text and lines, when all we care about is the columns. So what if we make the text bleed together, but only along the Y-axis? That would merge each column into one shape, and findContours() would find those big blobs. Then we could crop those areas of the image and feed them to Tesseract. And that is exactly what my first attempt does.

First we convert the image to grayscale and invert it, so that the text becomes white, which OpenCV expects for all of its methods.

inverted image

Next we apply a close operation in the Y-axis to take care of any little holes or disconnects in the image. We apply it only in the Y-axis because we want to keep the white from expanding in the X-axis and ruining our chances of properly finding the columns.

Applied close operation

Because front pages carry the name of the newspaper (and, in the case of this paper, most pages have a line at the top), we need to remove those elements; otherwise they would cause the columns to merge into one giant blob, and findContours() would just draw a box around the entire page! So we cut off the top of the page by running findContours() and checking the first 1000 rows of pixels for long lines. Any that are found are colored black, which OpenCV will then ignore in subsequent transformations.

Removed the title and long horizontal lines

Next we do a dilation, and this is where the real magic happens. OpenCV has many functions that calculate the value of a pixel from the values of the pixels around it. Dilation works like this: for each pixel we look at a grid of 9 by 9 pixels around it, and if any pixel in that grid is white, we make the pixel in the middle white. This has the effect of stretching the white text in the Y-axis. We do this a number of times, and hopefully all the white pixels (which were originally the column text) merge into one big blob, as we see here:

Dilate in the Y axis

One last step before we start finding the blobs is to remove any noise in the image (little specks of white) that might be from dust or grit in the original image.

Remove any noise

And now we can finally run findContours(), which gives us the white blobs that are our columns; drawing bounding boxes around them shows how well they match up with the original image’s columns:

Green boxes around the found columns

Now it’s just a simple matter of creating cropped images from those bounding boxes and running Tesseract on each one. Here is a sample of the result:

SIX DAYS LATER FROM EUROPE.

Arrival of the Great Western.

The steamer Great Western arrived here at
half-past ¢ o’clock on Sat@rday morning, bringing
dates up tothe Léchinst. She !eft Bristol oa the
afternoun of the 16th, thus muking ler passage
in the remarkably short tine of thirteen days and
a half.

She brings 56 passengers and an average cargo,
chieily comnposed of dry goods.

The most interesting intelligence by her is that
of the sudden death of the Duke of Orleans, heir
apparent to the throne of France, caused by leap-
ing from his carriage, the horses of which were
running away. His son, the Count de Peris, is
but four years old, and the age of Louis Philippe,
who is in his sixty-niath year, forbids the hope
that he can survive til his grandson attains lis
majoricy. Should he not, a Kkegency must be ap-
peinted, and this may lead to confusion, if not an-
archy. We annex the particulars of the unfortu
nate event:

From Galiynani’s Messenger.

The details of the calamicy ere as follows : —
Yesterday (July 13) ut i2 o’clock, the Duke of Or-
leans was to leave Paris for St. Omer, where he
was to inspect several regiments intended for the
cerps of operation on the Marne. His equipages
were ordetcd, and his attendants in readiness.
Every preparation was made at the Pavilion Mar- |
san for the journey, atter which his Royal High-
ness Was to join the Duchess of Orleans at Piom-
bieres. At 11 the Prince got into a carriage, in-
tending to go to Neuilly to take leave of the King
and Queen and the royal family. This carriage
was a four wheeled cabriolet, or caleche, drawn by
two horses a la demi-Daumont — that is, driven by
postillion. It was the conveyance usually tawen
by the Prince when going short distances round
Paris. He was quite alone, not haviug suflered
one of lis officers to accompany lim. On arriving
near the Porte Maillet, the horse rude by the pust-
illion took fright and broke into a gallop. The
carriuge was svon carried with great velocity up
the Chemin de la Reyolte. The Prince seeing that
the postillion was unable (o master the horses, put
his foot ou the step, whichis very near the ground,
aid jumped down on the read, when about half
way along the road which runs direct from the
Porte Maillot. The Prince touched the read with
both feet, but the impulse was so great that he
staggered, and fell with his head on the pave-
ment.

Is it perfect? By no means. But it is substantially better than what running Tesseract over the whole page produces. And while it doesn’t work perfectly on every page, the vast majority of the time it is good enough. One can imagine this being used to help generate training data for a neural network that would produce even better OCR.

I would like to thank the National Digital Newspaper Program (NDNP) group and the Library of Congress for their hard work and tireless drive for excellence. If this sort of thing interests you, please take a look at the Library of Congress API, which exposes many of their unique and historic collections for automated processing, at https://labs.loc.gov/lc-for-robots/
