RoboEdit: AI-assisted, Human-in-the-Loop Copyediting

How OCR combined with AI can significantly speed up the full-text digitization of books and manuscripts

6 min read6 days ago

For more information, visit https://roboedit.app

In early 2023, as AI went mainstream and chat completion APIs started to come online, I immediately realized that creating eBooks from PDF scans would become much, much easier. Not only would it remove much of the drudgery of copyediting, it would free up human editors to focus more on the actual content and formatting of manuscripts.

For this reason I created RoboEdit — an AI-powered, human-in-the-loop (HITL) proofreading and manuscript creation tool to empower publishers, scholars, writers, authors, or anyone who wants to create and edit books for future publication.

The Problem: OCR Scannos

Extracting text from PDFs is nothing new. Really good OCR (optical character recognition) technology has been around since the 2000s. Tesseract — one of the most popular OCR libraries, has been open-source and readily available to use in software development since 2015.

However, anyone who has used Tesseract or other OCR technology long enough knows that these tools have limitations. Final results can often be mixed, especially with low-quality PDFs that may contains ink smudges, blurred lines, or a variety of other hazards that necessitate a clean up after the initial OCR scan.

Websites like archive.org have made digitized books and manuscripts readily available, but it has also resulted in vast troves of low-quality OCR text stored in library databases across the world. Try reading the plain text version of virtually any book on this website and you will see what I mean. Take this example, a very low quality scan of a page of Alexei Tolstoy’s 1927 novel The Garin Death Ray:

A long, Buave Rolls Royce with a mahogany^paoelled 
body glided noiselessly up to the hotel entrance. The com> 
misslonaire, his chain rattling, hurried to the revolving 
doors. 

The hrsl to enter was a man of medium stature with 
a pale, sallow face, a short, trimmed black heard, and a 
fleshy nose with distended nostrils. He wore a long, sack* 
like coat and a bowler hat tilted over his eyes.

There are hundreds of thousands, if not millions, of OCR scans like this one that require significant manual labor to transform a PDF from a digital facsimile to a full text digitization.¹ The passage above contains multiple OCR-introduced typos — commonly referred to as scannos. OCR scans also have line breaks that make the final text disjointed and in need of post-processing.

The Current Solution: Human Copyeditors

Preparing PDF scans for publication and correcting OCR-introduced errors has long been the task of digital copyeditors. Correcting scannos is one of many important tasks the copyeditor performs to ensure a published work is formatted correctly.

When preparing a text for publication, there is no substitution for a human editor, and humans remain central to delivering the final product. In case this wasn’t clear:

Human editors have always been and remain indispensable for the written word.

This is especially true when it comes to the preparation of textual content such as literature, scholarly articles, memoirs, or non-fiction, work that is ultimately meant to be read and pondered by other humans.

While keeping humans-in-the-loop is key to any digitization process, advances in AI mean that we can remove much of the tedium of copyediting. AI tooling allows the copyeditor to focus more on the actual content and structure of the final document. Rather than spending x number of hours removing extra spaces or fixing em dashes, an editor can spend more time creating detailed annotations, or identifying areas to improve the format of their future book.

The Smarter Solution: AI-powered OCR Post-Processing and Human-in-the-Loop User Interfaces

A brief demo highlighting the RoboEdit Smart OCR Scanner

RoboEdit’s solution to the problem of OCR post-processing involves two phases. After uploading a PDF that serves as source material for a future eBook or print manuscript:

Each page is scanned via OCR, with RoboEdit calling Google’s Document AI to perform the actual OCR scan. Users also have the option to use Tesseract (directly in the browser client) for faster but less accurate results.
The results of the single-page OCR scan are fed into the Anthropic’s Claude Messages API and sent back to the client.

Users can scan either single pages or an entire book (a succession of single page scans due to rate limiting and token input limitations) with this method. The editor can then view a diff to ensure that the results are accurate.

Claude was able to accurately correct text so jumbled that even I (the human editor) had difficulty parsing the original OCR result.

RoboEdit also has a built-in Human-in-the-Loop Editor that allows the user to review paragraph and apply spot edits as needed — either manually, or with another (much less token-intensive) call to Claude. Editors can quickly pull out a drawer with the original PDF to consult the source text. The user can also view a final “proof” of the chapter to get an idea of what the chapter will look like in print form.

The RoboEdit Chapter and Paragraph Editor, and Proof Viewer

After the book has been copyedited, the user can export the manuscript as a .docx file, which can then be used to create an eBook with a variety of excellent eBook creators out there. Kindle Create, Vellum, or even Google Docs, which can export files as an .epub3 come to mind.

Conclusion

RoboEdit is above all meant to empower editors to create books faster. On a technical level, RoboEdit supercharges copyeditors to transform a written work from a digital facsimile (i.e. a PDF scan) into a full-text digitization that can be published, indexed, and more readily available in the data banks of human knowledge.

If you are interested, please feel free to sign up and try it out at https://roboedit.app. Send me an email at sean@roboedit.app if you have any questions or suggestions for this tool.

¹ There are existing tools to help with this problem, VietSource being one of my favorites, but like many problems AI can now help to solve, it often requires more touchpoints than necessary.

Additional Context and Acknowledgements

I started work on RoboEdit in February 2023. As a hobby, I digitize forgotten books in the public domain. Often times, the only digital representation of these books on the internet is a PDF scan, which I find less accessible than full-text digitization. As a former Slavic studies scholar with a background in academic publishing, I am especially interested in reviving English translations of Russian literature that have rarely appeared in print.

The original incarnation of RoboEdit was a simple Python script to help me copyedit large blocks of text. I eventually added a GUI interface built with React and Firebase to help me see each paragraph before and after. I use a variety of React editor components (AceEditor, MDEditor) and textareas to make manual edits or send the paragraph to AI to perform spot edits.

This past summer, I added in RoboEdit’s OCR scanner feature. I was greatly inspired by Simon Willison’s blog post, where he demonstrated how easy it is to create an OCR scanner in browser clients these days. A huge thank you for Simon for his enthusiasm for new technology and prolific blog output.