PRImA’s Aletheia — Ground Truth & Softalk Magazine

Although most folks are happy with full access to the complete run of the Apple edition of Softalk magazine now that Timlynn and I have funded its “ingestion” into the Internet Archive, the Softalk Apple Project is far from over. In fact, getting the collection into the Archive was just step one — the first important step in the Citizen History aspect of our two interrelated projects.

Before the Internet, Facebook, Twitter, email, and all the other social digital media we use to communicate, there were newspapers, magazines, and other print media. At the dawn of the microcomputer and digital revolutions, Softalk magazine chronicled the people, products, companies, events, and importantly, everyday users during these transformative years. The forty-eight monthly issues published between September 1980 and August 1984 are a unique and valuable historical record. The Softalk Apple Project, together with the FactMiners.org development community, is working to preserve, explore, and extend the legacy of this incredible cultural resource.

We’re now forging strategic collaborations with world-class researchers and research centers to turn the Softalk Apple Project digital collection into a unique and valuable reference resource for broad applications in the Digital Humanities and Cognitive Computing domains. This applied research dimension of our Softalk preservation activity is the Citizen Science agenda we are pursuing at FactMiners.org.

One particularly exciting collaboration is shaping up with the VERY “deep weeds” researchers at the PRImA Research Center at the University of Salford in Manchester, England. PRImA is an acronym for “Pattern Recognition and Image Analysis,” and PRImA is the premier research center working on the most vexing technical challenges of document structure and layout recognition as well as OCR (text recognition) of printed and handwritten documents.

PRImA’s Aletheia — Our “Get-going” Fact-mining Tool

The vision for the “FactMiners entrepreneurial community ecosystem” is to create a LAM-based (Libraries, Archives, and Museums) social-gaming platform that engages the public in the “serious fun” of creating incredibly rich and detailed semantic models of the structure and content of historically important digital collections of magazines, newspapers, and other mostly-serial documents. The “gameplay” within the FactMiners experience is, in effect, a subset of what is known as “ground truth production” within the #DigitalHumanities research community. These “deep weeds” researchers are focused on the challenges of improving OCR (optical character recognition) and HHDR (handwritten historical document recognition). This intersection of interests — PRImA’s OCR/HHDR ground truth production (and storage) and FactMiners’ “fact-mining” — is the basis for my excitement upon discovering the capabilities of Aletheia.

To date, the Center’s research has focused on (and made AMAZING progress in) within-page layout recognition via fine-grained page segmentation techniques. To test recognition algorithms, PRImA has created a “ground truth” tool called Aletheia. For the hard-core OCR crowd, “ground truth” is page segmentation meticulously done by a human page segmentation expert and validated as a “correct,” perfect solution. The “ground-truth edition” of a page is, in effect, used as an “answer key” to measure the accuracy of, and the differences between, the results of layout and text recognition algorithms.
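To make the “answer key” idea concrete, here is a minimal Python sketch (with made-up rectangular regions, not actual Aletheia output) of one common way a detected region is scored against its ground-truth counterpart: intersection-over-union, where a score near 1.0 means a near-perfect match.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical example: a ground-truth region (from the human expert)
# versus a layout-recognition algorithm's detected region for the same area.
ground_truth = (100, 100, 400, 300)
detected     = (110, 105, 395, 310)
score = iou(ground_truth, detected)  # close to 1.0 = near-perfect match
```

Real segmentation evaluation (including PRImA’s own evaluation tools) uses far more sophisticated, polygon-aware measures, but the “compare against a human-validated perfect solution” principle is the same.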

The great thing is that FactMiners can use Aletheia as a “proof of concept” tool to begin creating the FactMiners’ Fact Cloud “edition” of the Softalk Apple collection! A FactMiners’ Fact Cloud is just our name for a “machine-readable copy” of each issue of Softalk. That is, a Fact Cloud is all the “facts mined” and stored in a graph database such that ALL the information a human can gain by reading the magazine will also be accessible as #SmartData via the Fact Cloud. Our approach to #SmartData is via a “self-descriptive” graph database, that is, a graph database that includes a metamodel subgraph explaining the data’s structure and processes for access and use. These ideas are explored further in my #MCN2014 presentation, “‘Where Facts Live’ — The GraphGist Edition.”
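The “self-descriptive graph” idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern — the node types, property names, and query are hypothetical, not the actual FactMiners schema or a real graph database:

```python
# A toy "self-descriptive" graph: data nodes plus a metamodel subgraph
# that describes the node types. All names here are illustrative.
nodes = {
    # -- metamodel subgraph: describes the structure of the data --
    "meta:Article": {"kind": "NodeType", "properties": ["title", "page"]},
    "meta:Author":  {"kind": "NodeType", "properties": ["name"]},
    # -- data subgraph: facts mined from one issue --
    "article:1": {"type": "meta:Article", "title": "Apple Orchards", "page": 12},
    "author:1":  {"type": "meta:Author", "name": "J. Doe"},
}
edges = [
    ("author:1", "WROTE", "article:1"),          # a mined fact
    ("meta:Author", "MAY_WRITE", "meta:Article"),  # a metamodel rule
]

def instances_of(node_type):
    """Use the metamodel's type labels to find all data nodes of a type."""
    return [nid for nid, props in nodes.items()
            if props.get("type") == node_type]
```

In a real deployment this would live in a property-graph database (the GraphGist mentioned above uses Neo4j), where the metamodel subgraph travels with the data so any consumer can discover the schema by querying the graph itself.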

Toward Automatic Recognition of Whole-Issue Magazine Structure

To help me understand what a Ground Truth Edition of Softalk magazine would look like, PRImA’s Research Fellow Christian Clausner was kind enough to lay in a “region deep” level of page segmentation using the Aletheia ground truth tool. I was then able to use this amazing desktop application to add metadata at the “logical segment” (Group) level, providing whole-issue magazine structure context to this page’s segmentation. This particular page — page one of Vol. 1 No. 1, Sept. 1980 — is the kind of “hint rich” page that carries all kinds of information about whole-issue document structure. At FactMiners.org, we’re working on whole-issue magazine structure recognition technologies to find and use this information to inform subsequent individual page segmentation processes.

The picture I have included with this post is an example of detailed page segmentation of the key “structure revealing” page of Softalk’s first issue (September 1980). Page 1 is “hint rich” with information revealing the meta-structure of this issue of the magazine. I’m calling this a meta-structure because magazines share a common but not required structure, with flexible composition rules and guidelines for putting these pieces together. In this case, page one has the Table of Contents, the Advertiser’s Index, the masthead, and previews of next month’s content.

Aletheia provides flexible region (page segment) creation, including the all-important page-segment-respecting OCR feature. Bulk OCR simply produces an unstructured “text soup” in an unseen layer of, for example, an image-based PDF file. While full-text searches can be done on such bulk OCR data, the actual and all-important structure of the magazine is nowhere to be found. This is a central challenge of FactMiners’ fact-mining — typed-structure-respecting text recognition… our Holy Grail of tool requirements.
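The difference between “text soup” and segment-respecting output is easy to show with hypothetical data (the region-type names below are illustrative, not a formal vocabulary):

```python
# Bulk OCR: one undifferentiated "text soup" for the whole page.
bulk_ocr = "Softalk September 1980 Contents Apple Orchards 12 Marketalk 18"

# Segment-respecting OCR: each text run is tagged with the typed region
# it came from, preserving the page's structure. Labels are hypothetical.
structured_ocr = [
    {"region": "page-header",       "text": "Softalk September 1980"},
    {"region": "table-of-contents", "text": "Contents"},
    {"region": "toc-entry",         "text": "Apple Orchards 12"},
    {"region": "toc-entry",         "text": "Marketalk 18"},
]

# With structure preserved, we can ask structural questions that
# bulk OCR simply cannot answer:
toc_entries = [r["text"] for r in structured_ocr if r["region"] == "toc-entry"]
```

A full-text search finds “Marketalk” in both representations, but only the structured one can tell you that “Marketalk 18” is a table-of-contents entry rather than body text.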

The only way to create a FactMiners’ Fact Cloud will be through a page-segment modeling and respecting tool… and to date, nothing I have seen comes close to filling this requirement the way Aletheia does.

FactMiners’ “Visual Language” of Magazine Design

One of the #CognitiveComputing agenda items we’re working on at FactMiners is whole-issue commercial magazine layout recognition. Our goal is to develop a specialized #SmartProgram (whether NLP-algorithmic or neural net based, etc., is TBD) that — given a set of images of the pages of a magazine — finds the key pages that contain the telltale specification of the overall structure of the magazine. These key page elements, our “usual suspects,” will be the issue’s table of contents, list of advertisers, editor’s introduction, and other page-segments that reveal the whole-issue structure of the magazine. Finding and interpreting whole-issue magazine structure is a recognition process that spans individual pages and will be done in an iterative fashion over the collection of all page-images in an issue.

Whole-issue magazine structure recognition will require a “Sudoku-like” iterative solution, much like the elimination-of-unknowns strategy used to solve a Sudoku puzzle. Using clues from an issue’s “hint pages” — such as the Table of Contents, the Advertiser’s Index, and text continuation references — our recognition algorithm or neural net will identify the page placement of “known” parts, and as these hints are processed, the role of the remaining page segments becomes clearer. While doing this at the page level, the #SmartProgram will be building up a whole-issue document structure as its iterative recognition reveals such features as the front-of-book, feature-well, and back-of-book structure of most commercial magazines.
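The elimination-of-unknowns idea can be sketched as a toy constraint-propagation loop. Everything below — the page numbers, the hint values, and the single adjacency rule — is a hypothetical simplification of what a real recognizer would do:

```python
# Sketch of "elimination of unknowns" over a 48-page issue. Hints come
# from the Table of Contents and Advertiser's Index; an iterative pass
# then classifies remaining pages. All data and labels are hypothetical.
toc_hints = {12: "article", 18: "article"}             # from the TOC
ad_hints = {5: "advertisement", 40: "advertisement"}   # from the Ad Index

pages = {n: "unknown" for n in range(1, 49)}
pages.update(toc_hints)
pages.update(ad_hints)
pages[1] = "contents"  # the hint page itself

# Elimination pass: an unknown page that immediately follows article
# material is provisionally treated as an article continuation. Repeat
# until no page's role changes (the "Sudoku" convergence).
changed = True
while changed:
    changed = False
    for n, role in list(pages.items()):
        prior = pages.get(n - 1)
        if role == "unknown" and prior in ("article", "article-continuation"):
            pages[n] = "article-continuation"
            changed = True
```

A real #SmartProgram would use many more constraints (continuation references, running heads, ad/editorial visual cues), but the shape is the same: each resolved segment narrows the possibilities for its neighbors.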

PRImA’s Magazine Layout Dataset consists of PAGE-format “ground truth” files and their respective source images for over 800 pages of typical magazines. Ground truth in this context is a human-created “perfect solution” for page segmentation in advance of image extraction and text recognition (OCR). Each of these pages is detailed down to the fine-grained level of text baselines and word and glyph boundaries.

To this end, we’ll be creating a computer-vision reference dataset of the “visual language” of magazine layout design. As we create the FactMiners’ Fact Cloud, an intermediary step will be to create PAGE-format files (PRImA’s XML spec, similar to ALTO) that include a “whole page layout” description at the page level of the XML-based PAGE document structure hierarchy. This means that the Softalk Apple collection will contribute the first 9,300+ images of a dataset that can be used to teach deep-learning algorithms, neural nets, etc., to recognize commercial magazine whole-issue structure. Our “semi-ground-truth” dataset — as we’ll only go down to typed page segments, and not down to baselines, word and glyph boundaries, etc. — will be a natural complement to the 880+ pages in the PRImA Magazine Layout Dataset.
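For a rough feel of what a PAGE-format file contains, here is a deliberately simplified Python sketch that emits a PAGE-style XML fragment with one typed text region. The real PAGE schema is much stricter (required metadata, validated region types, reading order, and more — see PRImA’s published spec), and the filename and coordinates here are invented:

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative PAGE-style XML. Not schema-valid PAGE output --
# the real spec requires Metadata elements and validates types strictly.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
ET.register_namespace("", NS)

root = ET.Element(f"{{{NS}}}PcGts")
page = ET.SubElement(root, f"{{{NS}}}Page", {
    "imageFilename": "softalk-1980-09-p001.png",  # hypothetical scan name
    "imageWidth": "2480",
    "imageHeight": "3508",
})
# One typed region: the kind of "typed page segment" our semi-ground-truth
# dataset would record, without drilling down to baselines or glyphs.
region = ET.SubElement(page, f"{{{NS}}}TextRegion",
                       {"id": "r1", "type": "heading"})
ET.SubElement(region, f"{{{NS}}}Coords",
              {"points": "100,100 1200,100 1200,300 100,300"})

xml_bytes = ET.tostring(root, encoding="utf-8")
```

The key point for FactMiners is that each region carries a type and a polygon, so downstream tools (and learners) see structure, not just text.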

I am VERY excited to have learned about the PRImA Research Center, its wonderful people, and the amazing tools they have created. I look forward to the progress that can be made through this evolving collaboration.