DATA STORIES | OCR CORRECTION | KNIME ANALYTICS PLATFORM

TroveKleaner: A low-code experiment in OCR error correction

Easy to use, intuitive, and completely code-free

Angus Veitch
Low Code for Data Science

--

KNIME: a gateway to computational social science and digital humanities

I discovered KNIME by chance when I started my PhD in 2014. This discovery changed the course of my PhD and my career. Well, who knows: perhaps I would have eventually learned how to do things like text processing, topic modelling and named entity extraction in R or Python. But with no previous programming experience, I did not feel ready to take the plunge into those platforms. KNIME gave me the opportunity to learn a new skillset while still having time to think and write about what the results actually meant in the context of media studies and social science, which was the subject of my PhD research.

KNIME is still my go-to tool for data analysis of all kinds, textual and otherwise. I’ve used it not only to analyse contemporary text data from news and social media, but also to analyse historical texts. In fact, I think the accessibility of KNIME makes it the perfect tool for scholars in the field known as the digital humanities, where computational methods are being applied to the study of history, literature and art.

Mining and mapping historical texts

My own experiments in the digital humanities have focused on historical Australian newspapers that are freely accessible in an online database called Trove. I have developed methods to combine the thematic and geographic information contained in these historical texts so that I can map the relationships between words and places. This has been a very complex and challenging task, and I have used KNIME every step of the way.

First, I used KNIME to obtain the newspaper data from the Trove API. In the process, I created the Trove KnewsGetter workflow, which you can download from the KNIME Hub. I then used KNIME to clean the text, identify place names and keywords, assign geographic coordinates, calculate the statistical associations between the words and places, and prepare the results for use in Google Earth and Google Maps.
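For readers curious about what happens under the hood, the request that the KnewsGetter builds node by node looks something like the following Python sketch. The endpoint, parameters and response structure shown here follow version 2 of the Trove API and are illustrative only; check the Trove API documentation for the current details.

```python
import requests

# Illustrative sketch only: the endpoint, parameters and response structure
# follow version 2 of the Trove API, which may change over time.
API_KEY = "YOUR_TROVE_API_KEY"  # obtainable from the National Library of Australia
BASE_URL = "https://api.trove.nla.gov.au/v2/result"

params = {
    "key": API_KEY,
    "zone": "newspaper",       # search the digitised newspaper zone
    "q": "flood Brisbane",     # an example query
    "include": "articletext",  # request the OCR-derived full text
    "encoding": "json",
    "n": 20,                   # records per request
}

response = requests.get(BASE_URL, params=params)
response.raise_for_status()

# Drill down to the list of newspaper articles in the JSON response.
articles = response.json()["response"]["zone"][0]["records"]["article"]
for article in articles:
    print(article.get("date", "????"), "-", article.get("heading", "(no heading)"))
```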

The TroveKleaner: an experiment in OCR error correction

When I say that I used KNIME to ‘clean’ historical newspaper texts, I don’t just mean stripping out punctuation and stopwords, although I did that as well. I also took on the challenge of correcting some of the many spelling errors that result from glitches in the optical character recognition (OCR) process that converted the scanned texts on Trove into machine-readable plain text. Some of the original texts in Trove are difficult to read even for the human eye, so it is no surprise that machines have struggled! The example below shows a scanned article next to the OCR-derived text, with the OCR errors shown in red.

Figure 1. An excerpt from the OCR-derived text from a newspaper article in Trove (right) and the corresponding scanned image (left). OCR errors are coloured red.

I used some rather creative and experimental methods to correct these OCR errors.

To correct ‘content words’ (that is, everything except for ‘stopwords’ like the or that), I extracted ‘topics’ from the texts using KNIME’s Topic Extractor (Parallel LDA) node and then used string-matching and term-frequency criteria to identify likely errors and their corrections. A high-level view of the steps I used to do this is shown below in Figure 2, while an example of the identified corrections can be seen in Figure 3. I explain the logic in more detail in this blog post.

Figure 2. These nodes within the TroveKleaner identify and apply corrections to content-words. They do this by running a topic model and searching the outputs for pairs of terms that appear to contain an error and its correct form.
Figure 3. The 63 highest scoring content-word corrections identified by the TroveKleaner from a sample of 20,000 newspaper articles published in The Brisbane Courier between 1890 and 1894.
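To make the idea concrete, here is a rough Python sketch of the pairing step: within a topic’s top terms, a rare term that closely resembles a much more frequent term is flagged as a likely OCR error. The thresholds and the similarity measure are illustrative stand-ins, not the TroveKleaner’s exact criteria.

```python
from difflib import SequenceMatcher

# Simplified sketch of the content-word correction logic: within a topic's
# top terms, pair each rare term with a much more frequent, visually similar
# term and treat the rare one as a likely OCR error.

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def find_corrections(topic_terms: dict[str, int],
                     min_similarity: float = 0.8,
                     min_freq_ratio: float = 10.0) -> list[tuple[str, str]]:
    """topic_terms maps each top term of one topic to its corpus frequency."""
    corrections = []
    terms = sorted(topic_terms, key=topic_terms.get, reverse=True)
    for frequent in terms:
        for rare in terms:
            if frequent == rare:
                continue
            freq_ratio = topic_terms[frequent] / max(topic_terms[rare], 1)
            if (freq_ratio >= min_freq_ratio
                    and similarity(frequent, rare) >= min_similarity):
                corrections.append((rare, frequent))  # (error, correction)
    return corrections

# Example: four terms that clustered together in one extracted topic.
topic = {"parliament": 4200, "parliamcnt": 35, "government": 9800, "govcrnment": 60}
print(find_corrections(topic))
# -> [('govcrnment', 'government'), ('parliamcnt', 'parliament')]
```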

To correct stopwords, I first identified common stopword errors (which conveniently clustered together in the extracted topics) and then analysed n-grams (in this case bigrams: pairs of sequential words) to work out which valid words appeared in the same grammatical contexts as the errors. Some examples of the resulting corrections are shown below in Figure 4, and a code sketch of the idea follows. Note that not all of these are truly stopwords (or ‘stop phrases’). Some of them are content words that have been split into two, such as govern ment and com pany. Errors like these would never be corrected if we hadn’t thrown n-grams into the mix.

Figure 4. A sample of ‘stopword’ corrections generated after n-grams have been tagged.
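In code, the core of the n-gram trick might look something like the sketch below, which counts the valid words that occupy the same (left, right) context as a known error. The overlap count used for scoring is purely illustrative, and the real workflow does more than this, including rejoining split words like govern ment, which this sketch does not attempt.

```python
from collections import Counter

# Sketch of the n-gram idea: given a known error token, find the valid
# words that occur between the same neighbouring words.

def context_candidates(tokens: list[str], error: str,
                       vocabulary: set[str]) -> Counter:
    """Count valid words that share a (left, right) context with `error`."""
    # Collect the (left, right) neighbour pairs in which the error appears.
    error_contexts = {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1) if tokens[i] == error
    }
    # Count the valid words appearing in any of those same contexts.
    candidates = Counter()
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        if word in vocabulary and (tokens[i - 1], tokens[i + 1]) in error_contexts:
            candidates[word] += 1
    return candidates

tokens = "tho cat sat on tho mat and the dog sat on the mat too".split()
vocab = {"the", "a", "cat", "dog", "sat", "on", "mat", "and", "too"}
print(context_candidates(tokens, "tho", vocab).most_common(1))
# -> [('the', 1)]: 'the' fills the same on-_-mat slot as the error 'tho'
```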

Choose your own error-correction adventure

Neither of these methods was ever going to be perfect or comprehensive, but they worked well enough to make the experiment worthwhile. And well enough, I think, to make the methods worth sharing. So I cleaned up and annotated my workflow to produce the TroveKleaner. (I do my best to include a K in the name of all my KNIME workflows!) As shown in the ‘homescreen’ view below, the TroveKleaner workflow contains several separate components, which can be run in an iterative, choose-your-own-adventure fashion.

Figure 5. The ‘homescreen’ of the TroveKleaner workflow.

Each time you run one of the error correction processes, the TroveKleaner finds new corrections. Some of these might emerge from previous corrections or new n-grams, while others will just be new products of the probabilistic topic modelling algorithm. So the idea is to run the TroveKleaner in an iterative fashion, checking the error correction statistics as you go. The longer you are willing to persevere, the more corrections you will squeeze out of it. Before you know it, you may find that you’ve spent several hours playing the weirdest, nerdiest adventure game ever.

Is the adventure worthwhile? That depends on your expectations. In my tests, the TroveKleaner made tens of thousands of unique corrections over 11 iterations, which works out to dozens of corrections per article on average. This can only be a good thing. However, the average percentage of English dictionary terms in the collection (a very rough metric of OCR quality) only increased from about 71% to 73%, meaning that a lot of errors were still left behind.

Figure 6. Left: the number of new and unique corrections discovered in each iteration of content-word corrections. Right: The number of actual corrections made per document in each iteration. Also shown is the minimum number of documents in which a term must appear to be included in the process.
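For reference, the ‘very rough metric’ mentioned above is easy to compute: tokenise each document and count the share of tokens found in an English word list. A minimal sketch follows; the tiny dictionary in it is a stand-in, and in practice you would load a full lexicon.

```python
import re

# Rough OCR-quality metric: the percentage of a document's tokens
# that appear in an English word list.

def dictionary_percentage(text: str, dictionary: set[str]) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in dictionary)
    return 100 * known / len(tokens)

dictionary = {"the", "cat", "sat", "on", "mat"}  # stand-in word list
print(f"{dictionary_percentage('Tho cat sat on the mat', dictionary):.1f}%")
# -> 83.3% (5 of 6 tokens are dictionary words)
```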

Figure 7 shows the corrections made in a single article. In this case, more errors were left behind than were corrected.

Figure 7. After several iterations of corrections, the TroveKleaner fixed only a handful of the errors in this article (left: the original scan; middle: OCR errors in red; right: TroveKleaner’s corrections highlighted).

These results suggest that many of the errors in Trove (and presumably similar collections of digitised texts) are rarities of the kind that the TroveKleaner’s statistically driven methods are not well-equipped to detect. In other words, the TroveKleaner will not get you all the way from OC-aargh! to OC-aahh. But it will help. Even if individual documents do not look much better, there is something to be said for knowing that the dataset as a whole is healthier.

As I’ve already mentioned, the TroveKleaner is a highly experimental approach to doing what it does, and in truth, it may be more valuable as an experiment than as a practical tool. I’m sure there are more sophisticated and effective methods available for correcting OCR errors — but then again, I don’t know how easy they are to use, or what sort of computing resources they require. You can use the TroveKleaner with no coding knowledge whatsoever.

The next adventure: guided analytics?

As I mentioned above, the TroveKleaner is designed to be run in an iterative fashion. You run a process, check the results, then run it again, perhaps with different settings. And then repeat, until you are either satisfied with the results or too bored to stay awake.

This iterative process contrasts with the linear structure of many data analytics pipelines, where you start with some input, apply a process (perhaps some data cleaning and then a machine learning model), and walk away with the output. Or perhaps you want to tweak the process to improve the result, in which case you go right back to the beginning and commence the linear process again. Indeed, the very idea of a pipeline suggests a linear, left-to-right approach. This is how all KNIME workflows are built, and how they are usually experienced when they are implemented as guided analytics tools or data apps via the WebPortal.

Or at least, this was the case until very recently. The introduction of the Refresh Button Widget node in version 4.4 of the KNIME Analytics Platform has paved the way for more iterative and dynamic approaches to data analysis in KNIME. As demonstrated in the KNIME Blog post covering eight data app designs with the new refresh button, this humble node (which is so simple that it doesn’t even accept any configuration) enables KNIME developers to build something more like workcycles than mere workflows.

In the coming weeks, I hope to reimagine the TroveKleaner as a guided analytics tool with the help of the Refresh Button Widget node. This will be no simple feat, as the TroveKleaner is a complex beast that may push the limits of what a guided analytics tool can be. But the challenge is too interesting for me to resist. If I succeed — or perhaps even if I don’t — I’ll do my best to share the experience in a future post.

TroveKleaner on the KNIME Hub

If you want to try the TroveKleaner for yourself, you can download it from the KNIME Hub. Included in the workflow is a sample dataset containing 3,000 articles from the Brisbane Courier. This sample is big enough to demonstrate the workflow, but keep in mind that the TroveKleaner will work much better if you feed it larger datasets. And of course, the texts do not need to come from Trove! They can come from anywhere, as long as they are packaged into the format that the TroveKleaner expects.

If you do want to try out the TroveKleaner, be sure to see the more detailed information in the post that I originally wrote about it on my own blog. And don’t hesitate to get in touch to tell me what isn’t working, or to ask for a better explanation of how to use it!

--

Angus Veitch
Researcher at Swinburne University, Melbourne, and analytics consultant at Forest Grove Technology.