PRImA’s Aletheia — Ground Truth & Softalk Magazine
Although most folks are happy with full access to the complete run of the Apple edition of Softalk magazine now that Timlynn and I have funded its “ingestion” into the Internet Archive, the Softalk Apple Project is far from over. In fact, getting the collection into the Archive was just step one — the first important step in the Citizen History aspect of our two inter-related projects.
We’re now moving forward by forging strategic collaborations with world-class researchers and research centers to turn the Softalk Apple Project digital collection into a unique and valuable reference resource for broad applications in the Digital Humanities and Cognitive Computing domains. This applied research dimension of our Softalk preservation activity is the Citizen Science agenda we are pursuing at FactMiners.org.
One particularly exciting collaboration is shaping up with the VERY “deep weeds” researchers at the PRImA Research Center at the University of Salford in Manchester, England. PRImA is an acronym for “Pattern Recognition and Image Analysis,” and it is the premier research center working on the most vexing technical challenges of document structure and layout recognition as well as OCR (text recognition) of print and hand-written documents.
PRImA’s Aletheia — Our “Get-going” Fact-mining Tool
To date, the Center’s research has focused on (and made AMAZING progress in) within-page layout recognition via fine-grained page segmenting techniques. In order to test recognition algorithms, PRImA has created a “ground truth” tool called Aletheia. For the hard-core OCR crowd, “ground truth” is a page segmentation meticulously done by a human page segmentation expert and validated as a “correct” example of a perfect solution. The “ground-truth edition” of a page is, in effect, used as an “answer key” to measure the accuracy of, and the differences between, the results of layout and text recognition algorithms.
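To make the “answer key” idea concrete, here is a minimal Python sketch. It is our own simplification, not PRImA’s evaluation method: regions are reduced to rectangular bounding boxes (real ground truth uses polygons), and the region names, coordinates, and the 0.5 overlap threshold are all made up for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Region = Tuple[str, Box]          # (region type, bounding box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    width = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def region_recall(truth: List[Region], predicted: List[Region],
                  threshold: float = 0.5) -> float:
    """Fraction of ground-truth regions matched by a same-type prediction."""
    if not truth:
        return 0.0
    found = sum(
        1
        for t_type, t_box in truth
        if any(p_type == t_type and iou(t_box, p_box) >= threshold
               for p_type, p_box in predicted)
    )
    return found / len(truth)


# Hypothetical ground truth (the human-made "answer key") and algorithm output.
truth = [("table_of_contents", (0, 0, 100, 200)), ("masthead", (120, 0, 200, 80))]
predicted = [("table_of_contents", (0, 0, 100, 180)), ("masthead", (150, 40, 200, 80))]

print(region_recall(truth, predicted))
```

In this toy case the table of contents is found with a generous overlap, while the masthead prediction covers too little of the true region to count as a match.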
The great thing is that FactMiners can use Aletheia as a “proof of concept” tool to begin creating the FactMiners’ Fact Cloud “edition” of the Softalk Apple collection! A FactMiners’ Fact Cloud is just our name for a “machine-readable copy” of each issue of Softalk. That is, a Fact Cloud is all the “facts mined” and stored in a graph database such that ALL the information that a human can gain by reading the magazine will also be accessible as #SmartData via the Fact Cloud. Our approach to #SmartData is via a “self-descriptive” graph database, that is, a graph database that includes a metamodel subgraph explaining the data’s structure and processes for access and use. These ideas are explored further in my #MCN2014 presentation, “Where Facts Live” (The GraphGist Edition).
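As a rough illustration of what “self-descriptive” could mean, the Python sketch below keeps a tiny metamodel next to the data it describes and uses it to check the data. This is only a toy under our own assumptions: plain dictionaries stand in for a real graph database, and every node type, relationship name, and record here is a hypothetical placeholder rather than the actual Fact Cloud schema.

```python
# Metamodel: which relationship types may connect which node types.
metamodel = {
    ("Issue", "CONTAINS", "Article"),
    ("Article", "WRITTEN_BY", "Person"),
}

# Data: placeholder nodes and edges (hypothetical, not real Softalk facts).
nodes = {
    "issue-1": {"type": "Issue"},
    "article-1": {"type": "Article"},
    "person-1": {"type": "Person"},
}
edges = [
    ("issue-1", "CONTAINS", "article-1"),
    ("article-1", "WRITTEN_BY", "person-1"),
]


def undescribed_edges(nodes, edges, metamodel):
    """Return the data edges that the metamodel does not account for."""
    return [
        (source, rel, target)
        for source, rel, target in edges
        if (nodes[source]["type"], rel, nodes[target]["type"]) not in metamodel
    ]


def relationships_from(node_type, metamodel):
    """Ask the metamodel what can be reached from a given node type."""
    return sorted((rel, target) for source, rel, target in metamodel
                  if source == node_type)


print(undescribed_edges(nodes, edges, metamodel))
print(relationships_from("Issue", metamodel))
```

The point of the design is that a program that has never seen the data can first read the metamodel to learn what kinds of things and links exist, and only then query the facts themselves.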
Toward Automatic Recognition of Whole-Issue Magazine Structure
The picture I have included with this post is an example of detailed page-segmentation of the key “structure revealing” page of Softalk’s first issue (September 1980). Page 1 is “hint rich” with information revealing the meta-structure of this issue of the magazine. I’m calling this a meta-structure as magazines have a common but not required structure with flexible composition rules/guidelines about putting these pieces together. In this case, page one has the Table of Contents, the Advertiser’s Index, the masthead, and Previews of next month’s content.
Aletheia provides flexible region (page segment) creation, including the all-important page-segment-respecting OCR feature. Bulk OCR simply produces an unstructured “text soup” in an unseen layer of, for example, an image-based PDF file. While full-text searches can be done on such bulk OCR data, the actual and all-important structure of the magazine is nowhere to be found. This is a central challenge of FactMiners’ fact-mining: typed-structure-respecting text recognition, the Holy Grail of our tool requirements.
The only way to create a FactMiners’ Fact Cloud will be through a page-segment modeling and respecting tool… and to date, nothing I have seen comes close to filling this requirement the way Aletheia does.
FactMiners’ “Visual Language” of Magazine Design
One of the #CognitiveComputing agenda items we’re working on at FactMiners is whole-issue commercial magazine layout recognition. Our goal is to develop a specialized #SmartProgram (whether NLP-algorithmic or neural net based, etc., is TBD) that — given a set of images of the pages of a magazine — finds the key pages that contain the telltale specification of the overall structure of the magazine. These key page elements, our “usual suspects,” will be the issue’s table of contents, list of advertisers, editor’s introduction, and other page-segments that reveal the whole-issue structure of the magazine. Finding and interpreting whole-issue magazine structure is a recognition process that spans individual pages and will be done in an iterative fashion over the collection of all page-images in an issue.
The whole-issue structure-recognizing challenge will require a “Sudoku-like” iterative solution. That is, by finding the key “hint page(s)” — primarily the table of contents and list of advertisers, etc. — our recognition algorithm or neural net will use a process of elimination to find and identify as many page segments as possible. While doing this at the page level, the #SmartProgram will be building up a whole-issue document structure as its iterative recognition reveals such features as the front-of-book, feature-well, and back-of-book structure of most commercial magazines.
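The “Sudoku-like” process of elimination can be sketched in a few lines of Python. This is only a toy illustration of the idea, not our eventual algorithm: each page starts with a set of candidate roles, and whenever a page is pinned down to a role that can occur only once per issue, that role is struck from every other page. The page numbers, role names, and the choice of which roles are unique are all assumptions made for the example.

```python
from typing import Dict, Set


def eliminate(candidates: Dict[int, Set[str]],
              unique_roles: Set[str]) -> Dict[int, Set[str]]:
    """Repeat the elimination step until no candidate set changes."""
    pages = {page: set(roles) for page, roles in candidates.items()}
    changed = True
    while changed:
        changed = False
        for page, roles in pages.items():
            if len(roles) != 1:
                continue                      # page not solved yet
            (role,) = roles
            if role not in unique_roles:
                continue                      # role may repeat, nothing to strike
            for other, other_roles in pages.items():
                if other != page and role in other_roles and len(other_roles) > 1:
                    other_roles.discard(role)
                    changed = True
    return pages


# Hypothetical starting guesses for a four-page toy "issue".
candidates = {
    1: {"table_of_contents"},                      # the hint page we found
    2: {"table_of_contents", "advertiser_index"},
    3: {"advertiser_index", "advertisement"},
    4: {"advertisement", "feature"},
}
unique_roles = {"table_of_contents", "advertiser_index"}

solved = eliminate(candidates, unique_roles)
for page in sorted(solved):
    print(page, sorted(solved[page]))
```

Here, solving page 1 settles page 2, which in turn settles page 3. Page 4 stays ambiguous, which is where further hints (for example, page numbers read from the table of contents) would come in.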
To this end, we’ll be creating a computer-vision reference dataset of the “visual language” of magazine layout design. As we create the FactMiners’ Fact Cloud, an intermediary step will be to create PAGE-format files (PRImA’s XML spec similar to ALTO) that include a “whole page layout” description at the page level of the XML-based PAGE document structure hierarchy. This means that the Softalk Apple collection will be the first 9,300+ images of a dataset that can be used to teach deep learning algorithms and neural nets, etc. to recognize commercial magazine whole-issue structure. Our “semi-ground-truth” dataset — as we’ll only go down to typed page segments, and not down to baseline, word, and glyph boundaries, etc. — will be a natural complement to the 880+ pages in the PRImA Magazine Layout Dataset.
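For readers who have not seen a PAGE file, the Python sketch below reads typed regions out of a small PAGE-style fragment using only the standard library. Treat it as a hedged illustration: the fragment is hand-written and heavily simplified, the file name, region ids, and coordinates are invented, and we assume the 2013-07-15 version of the schema namespace (other versions exist and differ in detail). It does not show the “whole page layout” description mentioned above.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; real files may declare a different namespace.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hand-written, simplified sample (all values are made up).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page-001.png" imageWidth="1000" imageHeight="1400">
    <TextRegion id="r1" type="heading">
      <Coords points="50,40 950,40 950,120 50,120"/>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <Coords points="50,150 480,150 480,900 50,900"/>
    </TextRegion>
  </Page>
</PcGts>
"""


def read_page(xml_text):
    """Return the image file name and a list of typed regions with outlines."""
    root = ET.fromstring(xml_text.strip())
    page = root.find("pc:Page", NS)
    regions = []
    for region in page.findall("pc:TextRegion", NS):
        coords = region.find("pc:Coords", NS)
        points = [tuple(int(v) for v in pair.split(","))
                  for pair in coords.get("points").split()]
        regions.append({"id": region.get("id"),
                        "type": region.get("type"),
                        "points": points})
    return page.get("imageFilename"), regions


image_name, regions = read_page(SAMPLE)
print(image_name)
for region in regions:
    print(region["id"], region["type"], region["points"])
```

Each region carries a type and an outline polygon, which is exactly the kind of “typed page segment” information our semi-ground-truth dataset would record.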
I am VERY excited to learn about the PRImA Research Center, its wonderful people, and the amazing tools they have created. I look forward to the progress that can be made through this evolving collaboration.