The ABC’s of Project Gutenberg

Project Gutenberg

Josh Dobbs
DH Tools for Beginners
Nov 1, 2016

If you’ve spent any amount of time around the field of literary studies, you’ve most likely heard of Project Gutenberg. Perhaps, like me, you are a college student with mounting debt looking for a free e-book to use for class. Maybe you’re a professional digital humanist, using plain text documents as the cornerstone of your research. Whatever the reason, you’re probably familiar with the basic concept: an internet-accessible, digital repository of some of the greatest works in literary history.

The Introduction on Project Gutenberg’s website.

However, if you’ve spent any amount of time exploring Project Gutenberg, you’ve probably noticed that it’s missing some of the works you’re interested in. After all, 50,000 titles is a relatively small drop in the complete literary history of humanity! I, myself, am researching an author named James Hogg for my thesis. Deemed by many outside of his home country, Scotland, to be a minor canonical author, he is sorely underrepresented on Project Gutenberg. James Hogg is credited with 25 works, all of which worked together to make him a fairly famous figure in Scotland, and around the world, during the early 1800s:

The list of James Hogg’s works, provided by Project Gutenberg.

However, in my research, I quickly realized that Project Gutenberg housed only three and a half of Hogg’s works:

The screenshot of the works available on Project Gutenberg by James Hogg, including a few that appear to be compilations to which he was added.

Contributing to Project Gutenberg

Be not dismayed. Project Gutenberg continues to grow every day through new contributions, all of which are created by volunteers. Project Gutenberg’s site has a page for potential volunteers.

Actually, anyone can contribute to Project Gutenberg! So, is the work you’re interested in missing from Project Gutenberg? Then maybe you are just the person to ensure that it is digitally archived, so that future generations can access it and so that today’s digital humanists have a plain text version to incorporate into their research.

At first glance, this prospect is very intimidating. There’s a Gutenberg How-To wiki, copyright information, information and usage pages, uploading instructions, links to the homepage of a required tool, and constant links to an old DOS version of that tool. And that’s just scratching the surface. As complicated as this web of links seems at first, contributing to Project Gutenberg is really as easy as A, B, C.

Project Gutenberg: Step A

The first step is fairly basic. You need to determine if there are any copyright restrictions on the work you want to submit. As a general rule, anything published before 1920 should have fallen out of copyright by now. There are, of course, exceptions to every rule, so it is a good idea to familiarize yourself with the copyright restrictions.

To ensure that your submission does not infringe on any existing copyright restriction, you need to submit a clearance form:

A screenshot of Project Gutenberg’s copyright clearance form.

You will need to create an account (nothing hard there) and fill out the clearance form with the information from the title pages of the work you wish to submit. Finally, you will need PNG, JPEG or GIF copies of the title page and verso page to attach to your clearance form.

Project Gutenberg: Step B

This next step is the hardest and most time-consuming. You need to create an edited plain text version of the work you wish to submit. You might consider scanning the entire book yourself using Optical Character Recognition (OCR) software and equipment. This is time-consuming work, and you still have to edit the text once you’ve scanned it all!

So, to save time, you might consider looking at online repositories such as HathiTrust and the Internet Archive. These offer PDFs of millions of scanned books, predominantly from university libraries, all of which have also been converted to plain text via OCR software. As convenient as these archives are for accessing large numbers of documents and texts, the plain text versions of these works leave a lot to be desired:

On the left is a copy of the plain text version of James Hogg’s The Queen’s Wake found on the Internet Archive. On the right is the properly edited and formatted version of the same text.

Even so, starting from one of these OCR texts lets you skip the scanning and jump straight into the many hours of editing ahead. The work I’m currently submitting is a 350+ page epic poem by James Hogg entitled The Queen’s Wake. Properly editing the text and correcting formatting issues to ensure that it is as close to the originally published version as possible has taken me over 50 hours.
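Some of that editing is mechanical enough to hand to a short script before the careful reading begins. Here is a minimal Python sketch of such a first pass. It is my own assumption about what counts as safe to automate, not a Project Gutenberg tool or requirement, and the function name first_pass_cleanup is made up for illustration. It trims trailing spaces, drops lines that hold nothing but a page number, and collapses long runs of blank lines.

```python
import re


def first_pass_cleanup(raw: str) -> str:
    """Apply a few conservative, mechanical fixes to raw OCR text.

    Assumption: a line made up only of 1 to 4 digits is a stray page
    number. If your text uses bare numbers for stanzas or sections,
    remove that rule, or you will delete real content.
    """
    cleaned = []
    for line in raw.splitlines():
        line = line.rstrip()  # trailing spaces are never meaningful
        if re.fullmatch(r"\s*\d{1,4}", line):
            continue  # skip a line that is only a page number
        cleaned.append(line)

    text = "\n".join(cleaned)
    # Collapse two or more consecutive blank lines into a single one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the file with exactly one newline.
    return text.strip("\n") + "\n"
```

A script like this only clears away noise. Misread letters, broken line breaks and lost indentation still have to be checked by eye against the page scans.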

For the more technologically savvy (I admit to a significant ineptitude), there are editing tools available. One that is required by Project Gutenberg is Gutcheck. I have already written a detailed tutorial concerning Gutcheck here.

Gutcheck is a tool that will show you most of the potential grammar, punctuation and formatting errors in the text you are editing. In The Queen’s Wake, Gutcheck found nearly 13,000 problems, which was too much for my simple laptop to handle. If you have a computer with the proper processing power, you can begin right away with Gutcheck. Otherwise, you might want to consider running smaller chunks of the overall text through it one at a time, or, as I did, edit it and then use Gutcheck as a final editing tool.
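If you go the smaller-chunks route, you don’t have to cut the file up by hand. Below is a rough Python sketch of one way to do it. The function name split_into_chunks and the default of 500 lines per chunk are my own assumptions, so adjust them to whatever your machine handles comfortably. The one deliberate choice is that it only breaks at a blank line, so a stanza or paragraph is never split between two chunks.

```python
def split_into_chunks(text: str, max_lines: int = 500) -> list[str]:
    """Split text into chunks of roughly max_lines lines each.

    A chunk is closed only at a blank line, so it may run slightly
    longer than max_lines in order to keep a stanza or paragraph whole.
    """
    chunks = []
    current = []
    for line in text.splitlines():
        current.append(line)
        # Close the chunk once it is long enough AND we are on a blank line.
        if len(current) >= max_lines and line.strip() == "":
            chunks.append("\n".join(current) + "\n")
            current = []
    if current:
        # Whatever is left over becomes the final, shorter chunk.
        chunks.append("\n".join(current) + "\n")
    return chunks
```

Save each chunk to its own plain text file, run Gutcheck on the files one at a time, and make the corrections in your master copy so you end up with a single file to submit.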

Project Gutenberg: Step C

This is perhaps the easiest step of all. Roughly two weeks after you’ve submitted your copyright clearance form, you should get an e-mail from Project Gutenberg giving you either the problems with your request and suggestions for resubmission, or the official clearance to proceed. Hopefully it’s the latter.

A screenshot of my acceptance letter from Project Gutenberg with the clearance key.

With that acceptance, Project Gutenberg will send you a clearance key, a long number followed by the author’s last name.

Follow the directions in the e-mail to upload your edited plain text file, and you have officially submitted to Project Gutenberg! Not only have you made this text available for digital humanities researchers, but you have created a lasting legacy. Thanks to you, there is now a readable version of your chosen text that can be easily accessed for generations to come.
