It’s Time for a Gutcheck

Josh Dobbs
DH Tools for Beginners
8 min read · Jan 8, 2017

So, you’re a literary scholar interested in becoming a digital humanist. Or, maybe you just want to get your feet wet and tinker with the literary aspects of digital humanities, just to see what all the fuss is about. Whatever your reasoning is, once you begin, you find out very quickly that you need texts to work with! Whether it is just one to test different digital humanities tools on, or many for text mining purposes, you need texts, and you need them as plain text files.

There are many places that you can get plain text files of older texts online, such as the Internet Archive and HathiTrust. These offer wonderful PDF renderings of older texts that have fallen out of copyright, and thus can be distributed online without fear of copyright violation. However, more often than not, their plain text files leave a lot to be desired:

Image 1. A clip from the Internet Archive’s plain text version of James Hogg’s “The Queen’s Wake.”

Most of these files appear to have been run through Optical Character Recognition (OCR) software and posted online immediately, with no proofreading. Many contain so many errors that running DH tools on them is pointless.

Project Gutenberg and Gutcheck

Does this mean that you are doomed to typing up all the text files that you need by hand? Or are you equally doomed to using OCR software to convert PDFs or hardcopy books to plain text files, then spending days poring over these files and comparing them to the original documents to verify that the spelling and punctuation match the sources exactly? No! This is in large part thanks to Project Gutenberg.

Image 2. A description from Project Gutenberg’s homepage.

Project Gutenberg provides usable and properly formatted plain text files for a wide range of materials that can easily be used by digital humanists. These have been vetted by volunteers, who have cleaned up the OCR-produced plain text files of a wide variety of publications. To help their volunteers in this endeavor, Project Gutenberg created a rather handy tool called Gutcheck!

Well, a volunteer at Project Gutenberg did anyway…

Image 3. The anonymous creator of Gutcheck describes both Project Gutenberg and Gutcheck on its homepage.

Simply put, Gutcheck scans the plain text document and creates a list of all the potential OCR and human errors.
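If you are curious what “scanning for potential errors” might look like under the hood, here is a toy sketch in Python. To be clear, this is not Gutcheck’s code, and the three checks below are my own hypothetical examples; Gutcheck’s real rules are far more extensive. The sketch only illustrates the general idea: walk through the text line by line, test each line against a few patterns, and report the line and column of anything suspicious.

```python
import re

# Hypothetical checks for illustration only; these are NOT Gutcheck's actual rules.
CHECKS = [
    ("digit inside a word", re.compile(r"[A-Za-z]\d[A-Za-z]")),
    ("space before punctuation", re.compile(r"\s[,;:!?]")),
    ("repeated word", re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)),
]


def scan_text(text):
    """Return (line_number, column, description) for each suspicious spot."""
    issues = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for description, pattern in CHECKS:
            for match in pattern.finditer(line):
                # Columns are reported 1-based, the way a human would count them.
                issues.append((line_number, match.start() + 1, description))
    return issues


# A made-up sample with three deliberate OCR-style mistakes on the second line.
sample = "The Queen's Wake\nThe the bard sang of w1nter ,\nand all was still."
for issue in scan_text(sample):
    print(issue)
```

Like Gutcheck, a scanner of this kind only flags candidates. Deciding whether each one is a genuine error or a quirk of the original is still up to you.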

Beyond Gutenberg

Of course, Gutcheck need not be limited to Project Gutenberg documents only. That is merely its origin. The creator of Gutcheck foresaw wider use for the tool.

Image 4. From Gutcheck’s homepage.

Imagine, if you will, a group of college students bent on preserving an old school magazine that is deteriorating. They scan the original documents and, through OCR software, create plain text versions of this old magazine. Before each individual article can be loaded into a database, say on Omeka, they must first clean up the OCR-created text. This would usually require comparing the OCR-created text to the original copy, possibly for hours on end, and then having someone else double-check the edits. Time-consuming indeed. However, with Gutcheck, most of the work is done for them. All they need to do is make the corrections that Gutcheck highlights for them! What could be easier?

Downloading Gutcheck

Okay. In reality, it wasn’t that easy for me. I followed the links provided on Gutcheck’s website and on Project Gutenberg, but downloading Gutcheck quickly became a nightmare. This is because the provided links lead to old DOS software:

Image 5. After downloading the version of Gutcheck linked to in the websites, you find this plain text document explaining how to use it. I spent far too long ineptly trying to get this to apply to Windows 8.1.

Sure, if you already know computers like the back of your hand, this may be no problem for you. But, if you’re like me and only use your computer as a glorified word processor, then this may be of little help to you…

Have no fear. Over the years, as the programming needs of digital humanists have changed, another programmer using the pseudonym Thundergnat has been developing Guiguts, a non-DOS program for Windows, Mac, and Linux that incorporates Gutcheck. All you need to do is find the version of Guiguts that works for your operating system. I needed a version compatible with Windows 8, which I found through a simple Google search:

Image 6. Googling the Guiguts for my computer.

Here, you can find a list of all of the possible iterations of Guiguts. However, this list doesn’t really tell you which version is best for which operating system.

Installing Gutcheck

Now that Guiguts is safely in the Downloads folder of your computer, simply extract all of the files into the location of your choosing:

Image 7. Find the zipped Guiguts folder in your Downloads and extract all the files.

For simplicity’s sake, I have been extracting it to the folder containing the plain text document I am working with. Once the files are safely extracted, Guiguts really becomes a simple tool to use. In your extracted files, look for run_guiguts and click on it:

Image 8. Click run_guiguts to bring up the program.

Running Gutcheck

Image 9. Guiguts itself, in all its anticlimactic glory!

Guiguts is now its own window. No messy DOS to work with. No need to apply the DOS program to a specific file. No need to really understand the technical aspects of the programming that governs what your computer is doing! You’re quite welcome to simply assume that it’s magic.

In Guiguts, open the text file that you want to run Gutcheck on using the “open file” icon in the bottom left (Image 9-A).

Now that the document is loaded into Guiguts, hit the Gutcheck button (Image 9-B).

For simplicity’s sake, you can split the screen between the Guiguts window with the original text and the list of issues provided by Gutcheck.

Image 10. A plain text version of James Hogg’s “Justified Sinner” loaded into Guiguts (left) side by side with the list of potential errors provided by Gutcheck.

As you click on each Gutcheck issue, the corresponding passage is highlighted in Guiguts. Now you don’t need to count lines and columns to find the errors.

Image 11. By clicking on the error in the Gutcheck list (right), the computer automatically highlights the error in the plain text document (left).

Addendums

Simple, right? Well, sort of. There are a few addendums I should mention.

First, the more errors there are in the text and the larger the file, the harder it becomes for a modest laptop to process all of the text. Mine has 6 GB of RAM and a 2.7 GHz CPU. When I ran an already edited version of James Hogg’s The Private Memoirs and Confessions of a Justified Sinner through Gutcheck, it found only 80 issues, and my laptop handled the processing without trouble. Most of the errors were punctuation and spelling issues that, although they were technically incorrect, were faithful to the original publication. Quirky author. Yet, when I ran an unedited version of James Hogg’s The Queen’s Wake (a 300+ page epic poem) through Gutcheck, it found almost 13,000 potential errors:

Image 12. Gutcheck’s list of errors for the untidied version of James Hogg’s “The Queen’s Wake.”

I could only scan through the first 20 errors before my laptop froze. For big text files, I recommend finding a more powerful computer. If you cannot do that, then the next best thing is to get smaller, bite-sized chunks of your plain text document for Gutcheck to more easily process. Or, you can clean up the document yourself the old-fashioned way, going line by line to check it against the original, and then use Gutcheck as a follow-up.
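If you go the bite-sized route, you do not have to cut the file up by hand. Below is a minimal Python sketch of one way to do it, assuming your file is UTF-8 encoded and that paragraphs (or stanzas) are separated by blank lines. The function names, the 50,000-character limit, and the "_part01" naming scheme are all my own inventions for illustration; nothing here comes from Guiguts or Gutcheck.

```python
from pathlib import Path


def split_into_chunks(text, max_chars=50_000):
    """Split text into pieces of roughly max_chars, breaking only at blank lines.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and size + len(paragraph) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph) + 2  # +2 for the blank-line separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def write_chunks(source_path, max_chars=50_000):
    """Write the chunks next to the source file as name_part01.txt, name_part02.txt, ..."""
    source = Path(source_path)
    text = source.read_text(encoding="utf-8")  # assumption: the file is UTF-8
    written = []
    for index, chunk in enumerate(split_into_chunks(text, max_chars), start=1):
        target = source.with_name(f"{source.stem}_part{index:02d}{source.suffix}")
        target.write_text(chunk, encoding="utf-8")
        written.append(target)
    return written


# Example usage (hypothetical file name):
# write_chunks("queens_wake.txt", max_chars=50_000)
```

Each numbered part can then be opened in Guiguts and checked on its own. Because the split only happens at blank lines, pasting the corrected parts back together with a blank line between them should restore the original layout.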

Image 13. A typical complaint concerning Guiguts’s spellchecker.

Second, there are other features to Guiguts beyond Gutcheck, including a spellchecker. However, I could not get the spellchecker to work, and this is the most common complaint I saw concerning the tool package. Still, industrious digital humanists have plenty of other tools to explore within the Guiguts package.

Conclusion

Although Project Gutenberg provides a wide array of usable texts for literary-minded digital humanists, the vast majority of files available on Project Gutenberg are from older texts, simply due to copyright regulations. The easiest way to do digital humanities work without paying hefty prices to publish texts online is to use books that are no longer protected by copyright. Not only are most of Project Gutenberg’s texts old, but they are also limited in scope and range. Project Gutenberg’s selection is chosen by volunteers at their discretion, so the availability of texts is sometimes constrained. Are you looking for a specific work by a specific author? There is a good chance that, unless it is a major canonical work by a major canonical author, you will not find it here. Then there is the breadth of the authors’ works too. In my case, I am currently studying James Hogg, a minor Scottish author with over 42 major works credited to him. However, only 5 are represented on Project Gutenberg. What if I wanted to use a DH tool to compare all of his works? Or to look for word trends throughout his collected works? Well, 5 is still better than none, but it hardly represents the whole.

What to do…what to do…

I guess we are back to cleaning up all those files by ourselves to create a collection of texts from which to work. At least we have Gutcheck to speed up the process. But why be selfish in this endeavor? You can submit your cleaned-up plain text files to Project Gutenberg and aid future literary-minded digital humanists! You’re doing the work anyway, so why not leave a legacy on which others can build?
