

PRImA’s Aletheia — Ground Truth & the Internet Archive
Like The Black Adder’s Baldrick, I have a cunning plan… a plan to open the Internet Archive’s digital document collections to a new level of fine-grained document structure and text analysis, supporting both serious eResearch scholarship and “seriously fun” social gameplay that invites the playful public to explore the cultural heritage stored in those collections.
This is also a cunning plan that helps me and Timlynn fulfill our commitment to #PayItForward during our Bonus Round of Life following our successful (so far) battles with cancer.


My purpose here is not to dwell on the life-interrupted path of how we got here. I need only summarize that Timlynn and I have chosen to spend our new Lease on Life by envisioning and nurturing two inter-related projects: The Softalk Apple Project, our Citizen History project, and FactMiners.org, the Citizen Science project doing applied research to open Softalk magazine beyond the limits of basic human reading and interpretation.
If asked to summarize our FactMiners’ applied research in three hashtags, they would be #SmartData, #CognitiveComputing, and #DigitalHumanities.
Why Magazines?
It is my personal history as a serial entrepreneur at the dawn of the microcomputer revolution, including a stint writing and working for Softalk Publishing, that grounds my deep appreciation for the cultural heritage value of this 48-issue, 9,304-page publication.
To be a worthy post-cancer Bonus Round #PayItForward activity, Timlynn and I hope that the Softalk Apple Project will create the ultimate on-line digital archive to preserve and to celebrate the cultural treasure that is Softalk magazine. As we envisioned the software design and platform required to deliver on this promise, my undergraduate majors in Journalism and Psychology helped me to see the value of FactMiners’ applied research as a #DigitalHumanities contribution to the domain of the History of Journalism and Mass Communications in addition to its obvious focus on the History of Science and Technology.


Magazines, particularly commercial magazines, are unique print-based “mirrors of our soul” — topical, timely, and very intimately reflecting our interests, attitudes, desires, aspirations… the whole range of who we are, what we do, and how we think. As serial publications, they capture time-series data that shows the ebb and flow of these social interests, attitudes, desires, and the like.
The challenge in opening serial publications, like newspapers and magazines, to historical eResearch is that most mainstream cultural heritage digitization workflows are not sufficient to unlock the full “artifact value” of these complex documents, especially if we are to begin factoring in machine-readability along with human-readability.
A Question of Granularity
There is a “granularity mismatch” between basic digitization output and the structure of the documents that these digital copies represent. Typical cultural heritage digitization is most often a high-volume, highly efficient workflow: human “scanners” shoot high-resolution images of the print publication, followed by a highly-automated computer workflow that performs bulk, non-human-corrected OCR (optical character recognition).
Automated OCR adds a hidden, searchable “text layer” to the digitized document. These image-based digitized documents, most frequently in the PDF format, pair images of the publication’s pages with “full text search” that links the hidden OCR’ed text layer to the on-screen image of the searched words. These word-to-word-image linkages give the appearance that the digitized document is a text document (and in a way it is), but only to the extent that it needs to be for most applications.
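If you want to see that hidden layer for yourself, here is a minimal sketch, assuming the Python pypdf library and a hypothetical filename; any PDF text-extraction tool would make the same point.

```python
# A minimal sketch, assuming the pypdf library and a hypothetical
# filename: pull the hidden OCR'ed text layer out of an image-based PDF.
from pypdf import PdfReader

reader = PdfReader("softalk_issue.pdf")  # hypothetical digitized issue
print(reader.pages[0].extract_text())    # the "text soup" behind page 1's image
```

Notice that what comes back is an undifferentiated stream of words: fine for full-text search, but blind to the structure of the page, which is exactly the granularity problem at issue here.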
One Size Does Not Fit All
How faithfully the output of the basic cultural heritage digitization workflow reflects the intended meaning and content of a source publication depends greatly on the structural complexity of the document being bulk-processed, say, during the amazingly affordable ingestion of magazine collections into the Internet Archive.
Typical books or monographs are relatively “lightweight” document structures, consisting mostly of some front-matter and table of contents followed by a sequence of pages, first to last. An author usually provides headings or chapter breaks to help organize the presentation. Bulk ingestion of such documents has achieved phenomenal fidelity and processing efficiency.


But the structure of a magazine, especially a commercial magazine with paid advertisements, is byzantine: issue-to-issue serialization, within-issue text-block continuations, single pages with multiple, distinctly different content structures on them… all kinds of added complexity beyond that of a basic book or monograph.
The hidden “text soup” layer of searchable text provided by a typical cultural heritage digitization ingestion process simply does not contain the essential magazine document meta-structure information required to interpret the full meaning of the content of the digitized magazine.
A book’s author speaks with a single voice. A magazine “speaks” like a cacophonous room full of people, each competing for your attention, each with their own credibility, motivations, and expertise.
Semi-Ground-Truth!?
To unlock the full value of newspapers and magazines to eResearch and “serious fun” social gameplay, we need digitization workflows that can accommodate page-segment-respecting OCR (text recognition), together with user-defined structure-defining metatagging of these logical segments. And in the specific case of the Softalk Apple Project, we need to be able to perform this “post-basic-ingestion” additional workflow within the repository structure and access provided by the Internet Archive, the home of the “basic edition” of the Softalk Apple Project collection.


In an earlier series of FactMiners’ Musings, I related the story of how my search for Kindred Spirits and the knowledge needed to create the definitive Softalk magazine digital archive had taken me to my first museum informatics conferences.
At the time, I was surprised when the folks most interested in my applied research agenda tended to be medieval scholars and folks working on the transcription of handwritten historical documents.
As I became familiar with the people and projects in the historic document transcription community, I found great examples of peer-to-peer learning/sharing communities. While exploring the Transkribus project out of the University of Innsbruck, I found fascinating references that led me to the PRImA Research Center at the University of Salford, Manchester, England (UK).
PRImA is the acronym for “Pattern Recognition and Image Analysis.” The PRImA Research Center is focused on the deepest, deep weeds of optical character recognition, text-mining, layout recognition, and image analysis. PRImA is so well-respected for its expertise and deep focus of applied research that it serves as the de facto authority among a consortium of academic and private-sector researchers (such as the ABBYY developers) who run the annual layout analysis and recognition competitions held at ICDAR, the International Conference on Document Analysis and Recognition.


Upon my first visit to the PRImA website, I learned that “ground truth” for the OCR community is a human-created and human-validated “perfect solution” to the complete page-segment-based layout analysis of an image of a source document page. For rigorous OCR research applications, ground truth drives down to the level of text baseline identification, word and glyph boundaries, font metrics, etc.
What dropped my jaw and put a huge smile on my face was the discovery that the PRImA folks have created, and make freely available, the Aletheia Ground-Truth tool. This Windows-based client application includes the all-important “page-segment-respecting” OCR capability as well as user-definable meta-tagging of logical page segments (one or more ordered, typed text and/or image regions that make up a logical segment on a page). This was the first tool of its kind I had seen with all the key features needed to begin prototyping-while-building the FactMiners’ “Fact Cloud edition” of Softalk magazine.


Through an enthusiasm-building series of communications with PRImA Research Fellow Christian Clausner, we developed a shared understanding of the difference between “full ground truth” — as required for PRImA’s focus on performance evaluation of automated layout analysis by OCR algorithms — and the less-rigorous requirement of “semi-ground-truth” needed for “fact-mining” Softalk magazine. I’ve described this intersection of our domains of interest in “PRImA’s Aletheia — Ground Truth & Softalk Magazine.”
Next-Gen Aletheia & the Internet Archive
As I explored the current implementation of Aletheia, I wanted to identify the specific features needed to tailor this remarkable application for service as an ideal FactMiners’ prototyping tool. During this exploration, Christian and I identified a subset of to-do requirements that would allow a next-generation Aletheia tool to work directly with the standard file-set of document collections in the Internet Archive.
This, essentially, is the “cunning plan” I alluded to in the theme-setting cartoon at the top of this article.
For eResearch projects working with Internet Archive collections, we’ve identified a “low impact” means to provide support for both standard ground-truth production (OCR research) and page-segment-respecting OCR with user-configurable region type-tagging (e.g., FactMiners’ “fact-mining”).


We identified this opportunity to open Aletheia to Internet Archive collection support when Christian explained to me that Aletheia was able to import the XML-based “job file” produced by the ABBYY FineReader Engine. I knew from my visit to the Midwest Regional Scanning Center of the Internet Archive that the Archive used the FineReader Engine in its custom digitization workflow, producing high-resolution page images along with bulk OCR for automated text retrieval.
Closer inspection of the stock output files of issues within the Softalk Apple Project collection at the Archive revealed that each issue includes a file whose name ends in “_abbyy” and carries the “gz” filename extension indicating its compressed state. After decompressing the file and restoring its original “xml” filename extension, I was able to open and explore the stock ABBYY XML export file. With great anticipation, I tried to open the file with Aletheia… and it failed!?
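For anyone who wants to repeat that decompress-and-explore step, here is a minimal sketch in Python; the filename is hypothetical, and the script simply reuses whatever schema namespace the file declares.

```python
# A minimal sketch: decompress an Internet Archive "_abbyy.gz" file and
# list its typed page regions. The filename is hypothetical.
import gzip
import xml.etree.ElementTree as ET

with gzip.open("softalk_issue_abbyy.gz", "rb") as f:
    root = ET.parse(f).getroot()

# Reuse whatever namespace the file declares (assumes a namespaced document).
ns = root.tag.split("}")[0] + "}"

for page_num, page in enumerate(root.iter(ns + "page"), start=1):
    for block in page.iter(ns + "block"):
        # Each block is a typed region (Text, Picture, Table, ...) with a
        # pixel-coordinate bounding box.
        print(page_num, block.get("blockType"),
              block.get("l"), block.get("t"), block.get("r"), block.get("b"))
```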
After a few note exchanges, we determined that the current Aletheia was written in support of ABBYY FineReader XML Schema Version 10 and the Internet Archive version is… wait for it… wait for it… Version 6!? This is not surprising, actually, as Brewster’s Army has been bulk digitizing at scale for many years, and scaled workflows that work tend to be upgraded only as necessary.
A quick inspection of the Version 6 and Version 10 schemas confirmed the significant differences between these generations of the ABBYY XML file that describes the page regions and other metadata of a FineReader-processed document. Fortunately, if we’re to look at the “cup half full” side of this issue, the Version 6 ABBYY schema is very “bare-bones” and will lend itself to relatively easy import into Aletheia.
- So the first “to-do” for next-gen Aletheia to support document collections on the Internet Archive is to incorporate importing of ABBYY FineReader XML export files based on Version 6 of the ABBYY XML schema.
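Purely as illustration, such an importer could key off the XML namespace the file declares. The two URIs below are my assumptions based on FineReader 6 and 10 exports, and the handler names are hypothetical stubs, not Aletheia internals.

```python
# Illustrative only: dispatch on the declared ABBYY schema namespace.
FINEREADER_6 = "http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml"
FINEREADER_10 = "http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"

def import_finereader6(path):   # the new, "bare-bones" Version 6 path
    raise NotImplementedError

def import_finereader10(path):  # the path current Aletheia already handles
    raise NotImplementedError

def pick_importer(root_tag: str):
    """Choose a schema-specific handler from the root element's namespace."""
    ns = root_tag.split("}")[0].lstrip("{")
    if ns == FINEREADER_6:
        return import_finereader6
    if ns == FINEREADER_10:
        return import_finereader10
    raise ValueError("Unrecognized ABBYY schema: " + ns)
```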
Aletheia’s ground-truth storage file format is the PAGE format developed and supported by the PRImA Research Center. Because OCR-research ground truth focuses on single-page layout interpretation, the current implementation of Aletheia expects the ABBYY FineReader XML “job file” to be a single-page document. If a multi-page ABBYY XML file is imported, only the first page in the file is read.
The focus of FactMiners’ “fact-mining” encompasses not just the whole-issue meta-structure of the magazine but the longitudinal dimension of serial publication. To reveal the evolution of the fledgling microcomputer industry and the dawn of the digital age, we not only need fine-grained information extracted from the magazine; that data needs to reflect its time-series dimension as well.
- Given the whole-issue focus of “semi-ground-truth” applications, the next-gen Aletheia import of ABBYY FineReader XML files will appropriately process a multi-page import file by generating a collection of PAGE files as output.
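To make that to-do concrete, here is a minimal sketch of the page-splitting half of the idea. The filenames are hypothetical, the namespace URI is the FineReader 6 assumption from above, and a real next-gen Aletheia would of course emit PAGE-format files rather than per-page ABBYY XML.

```python
# A minimal sketch: split a multi-page ABBYY XML file into one file per
# page, so each page can be handled on its own.
import copy
import xml.etree.ElementTree as ET

NS_URI = "http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml"
ET.register_namespace("", NS_URI)  # keep the default namespace on output

root = ET.parse("softalk_issue_abbyy.xml").getroot()  # already decompressed

for i, page in enumerate(root.iter("{%s}page" % NS_URI), start=1):
    doc = ET.Element(root.tag, root.attrib)  # fresh single-page wrapper
    doc.append(copy.deepcopy(page))
    ET.ElementTree(doc).write("page_%04d_abbyy.xml" % i,
                              encoding="utf-8", xml_declaration=True)
```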
Importing multi-page ABBYY FineReader XML files is not the extent of next-gen Aletheia’s support of magazines, newspapers and other serial publications.
- Next-gen Aletheia will support eResearch of magazine, newspaper, and other serialized publications by using PRESSoo to model and organize its collection of specific-issue PAGE files.


PRESSoo is an ISSN-supported, domain-specific extension of FRBRoo, the bibliographic extension of the #cidocCRM, the CIDOC Conceptual Reference Model developed by the museum documentation community. FactMiners has adopted the #cidocCRM as the primary Reference Model for the metamodel subgraph of a Fact Cloud. (Follow this link for more on #cidocCRM at FactMiners.)
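As a back-of-the-napkin illustration of the organizing idea (the names below are hypothetical stand-ins, not actual PRESSoo or CIDOC-CRM classes), each issue’s PAGE files simply hang off a dated point on the serial’s timeline:

```python
# Illustration only, with hypothetical stand-in names rather than actual
# PRESSoo/CIDOC-CRM classes: each issue's PAGE files hang off a dated
# point on the serial publication's timeline.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IssuePages:
    label: str       # e.g., "Softalk Vol. 1 No. 1"
    published: date  # the issue's place in the time series
    page_files: list = field(default_factory=list)  # paths to PAGE XML files

serial_run = [
    IssuePages("Softalk Vol. 1 No. 2", date(1980, 10, 1),
               ["v1n02/page_0001.xml"]),
    IssuePages("Softalk Vol. 1 No. 1", date(1980, 9, 1),
               ["v1n01/page_0001.xml", "v1n01/page_0002.xml"]),
]
serial_run.sort(key=lambda issue: issue.published)  # recover longitudinal order
```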
As Christian and I discussed these and related possible new features for a hypothetical next-generation Aletheia, we agreed that a plug-in architecture would allow next-gen Aletheia to hit MVP (Minimum Viable Product) functionality for the less-rigorous “semi-ground-truth” applications, like FactMiners, ahead of delivering the full functionality of an updated “full ground-truth” configuration.
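To sketch what such a plug-in seam might look like (every name below is hypothetical; none of it is Aletheia’s actual internals), a “semi-ground-truth” importer could ship first against a simple interface:

```python
# Hypothetical plug-in seam; none of these names exist in Aletheia.
from abc import ABC, abstractmethod

class ImporterPlugin(ABC):
    @abstractmethod
    def can_import(self, path: str) -> bool:
        """Cheap check (file extension, root namespace) before a full parse."""

    @abstractmethod
    def import_document(self, path: str) -> list:
        """Parse the file and return one region list per page."""

class SemiGroundTruthImporter(ImporterPlugin):
    """Ships first: whole-issue, less-rigorous 'semi-ground-truth' import."""

    def can_import(self, path: str) -> bool:
        return path.endswith("_abbyy.xml")

    def import_document(self, path: str) -> list:
        return []  # page-region extraction would go here
```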
As we envisioned the possibilities for a next-gen Aletheia, Christian and I also identified other areas where the current implementation is showing its age, or where obvious “next step” features were begging to be added. The UX (user interface/experience) frameworks used to build the current Aletheia are old and incapable of accommodating today’s ultra-resolution displays. Certain strategic places within the current user interface allow the adoption of user-contributed idioms for project-specific meta-tagging, but there is no programmatic means to ease or validate such user-adopted extensions.
Viewing the tool as covering both “standard” ground truth and “semi” ground truth, Christian and I were able to identify a step-wise evolution of a new generation of the Aletheia tool that could open an exciting new avenue for Citizen Scholarship (a niche within eResearch) based on “deep dives” into existing document collections within the Internet Archive.
Of course, PRImA is a research center within a university environment, so its activities are largely circumscribed by the ebb and flow of its project-based funding. Christian would love nothing better than to cloister himself for some quality time to do the development work we’ve identified. But he won’t be able to commit to this activity, especially at a scale that will produce rapid results, without project-specific funding.
Wolf Child Seeking Fellowship
Although it is in no way comparable to the famous dream of Dr. King, I, too, have a dream… a Post-Cancer, I’m-Not-Done-Yet-Reaper! Dream.
My dream is to start and nurture an Innovation Incubator focused on #SmartData and #CognitiveComputing designs for #CitizenScience and #CitizenHistory applications within the #DigitalHumanities.


This incubator will function as a strategic Network Enabler within its associated Entrepreneurial Community Ecosystem. The incubator’s initial “stable” of incubating projects and associated technologies will be FactMiners.org and The Softalk Apple Project.
It does not matter whether this incubator evolves as the focus of an academic research fellowship or is privately funded as a lightweight boutique engineering lab with investors participating in licensing or investment opportunities related to technologies or new ventures spun out of the incubator. The only requirement, whether for grant funding of a fellowship or angel investment in a start-up, will be embracing Open Source as a software-licensing and business-model constraint on “offspring” sprung from the incubator.
The Path Ahead for the Innovation Incubator
Absolutely the “fast path” first step for the incubator’s FactMiners project is to do proof-of-concept and tool-MVP level work on the #SmartData design ideas underlying the FactMiners’ social-game platform. The means to achieving this goal is to find funding for the FactMiners/PRImA collaboration to co-develop the next-generation Aletheia Ground-Truth tool along the lines described above. The incubator, whether a grant-funded fellowship or an angel-invested start-up, will need sufficient funding to enable PRImA’s participation in the proposed collaboration in addition to sustaining and expanding FactMiners’ own participation in the effort.


Concurrent with this next-gen Aletheia development project, the incubator will launch a crowdsourced project to build the “Visual Language of Magazine Design,” a Ground-Truth Image Dataset as briefly described in this earlier article. This project will pay particular attention to the document collections stored on the Internet Archive as these will be most easily ingested by the proposed next-gen Aletheia tool.
With progress on both the next-gen Aletheia tool and a growing ground-truth image dataset focused on the “visual language of magazine design,” the incubator will be in a position to pursue the #CognitiveComputing dimension of FactMiners’ applied research. The first expression of this focus will be to develop whole-issue, complex-document “meta-structure” recognition technologies specific to commercial magazine analysis, technologies essential to moving digital cultural heritage resources from human-read-only to everybody-everymachine-readable-computable resources.
The second phase of our #CognitiveComputing initiative will focus on tapping the “virtuous circle” of the “Rainman/Sherlock” collaboration made possible by the #SmartData feedback loop between NLP (natural language processing) text-mining algorithms, and the semantic and document structural insights reflected in the Fact Cloud’s metamodel subgraph.
The Wish List of the Usual Suspects
Of the two potential paths for realization of the proposed Innovation Incubator, I prefer a grant-funded Research Fellowship. I would also greatly appreciate the unique personal-growth opportunity to “learn-while-doing” that a research fellowship would provide. I also believe that the quality of what I accomplish will be greater if immersed in an energizing #DigitalHumanities research ecosystem.


To be literally in the “belly of the beast,” I would be doubly-excited if this fellowship would be hosted by the Internet Archive. An IA fellowship would enable a sustained visit or visits to the Archive’s San Francisco headquarters. During these visits, I would have an energizing, immersive personal learning experience while, in turn, providing thought leadership and project demonstrations of “next generation” Citizen Scholarship, eResearch, and cultural heritage social-gaming. With family and friends in the area, this would be an exciting #PayItForward #BonusRound opportunity.
If a fellowship at the Internet Archive were not feasible, there are some “personal favorite” #DigitalHumanities research hotbeds where Timlynn and I would have access to family and friends during a temporary relocation or sustained visits for a fellowship.


As a Baltimorean with an interest in collaborating with J-school researchers at Washington & Lee, my undergraduate alma mater in Virginia, I would be especially keen to secure a research fellowship at the Maryland Institute for Technology in the Humanities (@UMD_MITH) at the University of Maryland. From everything I can gather through on-line research, there are many intersections of research interests and institutional values that give me a sense that MITH would be a great fit for my proposed applied research and related activity. While I am certainly interested in MITH’s leadership participation in such activities as the Crowd Consortium and HILT, it is MITH’s Vintage Computers initiative that would be an ideal home for my proposed fellowship.
As a “best possible #PayItForward #BonusRound dream scenario,” my imagined ideal fellowship would be a “shared-host” arrangement, with an opportunity to learn and work at both the Internet Archive and MITH through an institutional collaboration organized under the MITH Vintage Computers initiative. :-)
To this “In Search of a Research Fellowship” end, I will “cold tweet” Brewster Kahle a link to this article, and tweet it as well to MITH’s Director Neil Fraistat and Associate Directors Matthew Kirschenbaum (the Vintage Computers guy :-) ) and Trevor Muñoz.
Of course, if a fellowship at @UMD_MITH is not possible, there is also the “close to home” route of exploring the potential of a grant-funded fellowship through the #DigitalHumanities research centers at the University of Iowa and Grinnell College. This might be a timely fit given the recently Mellon-funded “Digital Bridges for Humanistic Inquiry” initiative, an innovative collaboration between a public research university and a private liberal arts college.
If one or more angel investors wanted to go the boutique engineering lab start-up route, the location would almost certainly be in the Boulder, Colorado area. Family access and the vibrant Boulder tech startup ecosystem are big attractions. Realistically, under a start-up studio model, my and Timlynn’s roles would be limited to an energetic start-up phase, evolving pretty quickly to thought leadership and mentoring roles. We are not in a position to do intense daily operational or strategic business management. This said, there are many creative ways to build and run a lightweight organization today.
But seriously… I get goosebumps thinking about the potential of an Internet Archive and MITH Vintage Computers co-hosting scenario…
Chasing My #PayItForward #BonusRound Dream
My next steps will be to bring this “In Search of a Fellowship…” proposal to the attention of folks who may be in a position to help me realize this #PayItForward Dream. A few of these folks are known and trusted colleagues. In other cases, as mentioned above about “cold tweeting” Brewster and the MITH folks, I will simply reach out through public self-introduction.
If you are in a position to offer such a research fellowship, or to fund such a fellowship at your or another institution, or if you just have words-of-wisdom to offer, please do not hesitate to “marginalia” me. Leave a note through this article’s public or private commenting features provided by the Medium platform. Or reach out through Twitter at @Jim_Salmons.


In just a couple of days, Timlynn and I will be attending #HILT2015 as part of our next @DPLA #SmartTrip. We’ll be in Indianapolis to take Mia Ridge and Ben Brumfield’s “Crowdsourcing Cultural Heritage” course. This week-long event will be a wonderful opportunity to explore the possibilities of securing a research fellowship while learning as much as possible about the next phase of our design focus for the FactMiners’ social-game and eResearch platform.
Wish us luck! :-) :-) I will certainly update this article with any news of our progress. In the meantime, you can keep an eye on our HILT-related activity by monitoring the #HILT2015 and #SmartTrip hashtags, and by following our Softrek2 publication here on Medium.
Happy-Healthy Vibes,
-: Jim Salmons :-
23 July 2015
Cedar Rapids, IA