Ground Truth & the Knight Prototype Fund
There is a colloquial saying that if at first you don’t succeed, try, try again. So it is that I have been given the opportunity to reflect on these wise words in what is now the third article in the “Ground Truth &…” series of FactMiners’ Musings at Medium.com.
Well, long story short, the first two in this now-growing series of articles led to a flurry of exciting and productive activity that I’ll compress here… Based on interest in our research, my wife and project partner, Timlynn Babitsky, and I were invited to attend the 2015 edition of the Internet Archive’s Library Leaders Forum. The Good Folks at the Archive — being a non-profit and not having funding lying around for applied research fellowships — also recommended that we apply for project funding through the currently-running News Challenge of the Knight Foundation, which we did.
Alas, like nearly a thousand of our fellow entrants, our project, “Turn Text Soup into Smart Data in Newspaper & Magazine Archives,” was not selected as one of only forty-five projects to move forward to the News Challenge semi-finals. Which is what brings me to reflect on the popular wisdom that Chumbawamba so tunefully recommends… that when you get knocked down, you get back up again… to try, try again.
In the “Thanks but no thanks…” email from the Knight Foundation informing us that our project would not go forward in the News Challenge, we were encouraged to join any of our fellow entrants who cared to apply for admission to the current cohort of individuals and organizations that would be given a flat $35,000 USD award and project management training to kick-start their proof-of-concept projects through the Knight Prototype Fund program! :D
And that brings us to the Life Lesson part of being a determined, independent Citizen Scientist…
Funding grassroots Citizen Science projects won’t be easy, but it is possible. Especially if we remember to pick ourselves up, dust ourselves off, and try, try again…
Here’s our most recent shot… and it’s a doozy! For our Prototype Fund entry we have focused on the “Crowdsourcing Ground-Truth” component from our News Challenge entry, and we have significantly strengthened our core project team to include Laura Mandell, Principal Investigator of eMOP (the Early Modern OCR Project) at Texas A&M’s IDHMC, and Jude Coelho, Process Engineering Manager at the Internet Archive.
Unlike the public and transparent process of the News Challenge, the submissions to the Knight Prototype Fund program do not have a public-facing URL where we can point you to our entry. So to keep our various collaborators informed of our activity and to provide an “actual sample” of a Prototype Fund submission for our fellow Citizen Scientists thinking about applying for the next cohort of Prototype Fund program participants, I have reproduced our current entry below.
Our Entry to the Knight Prototype Fund
The following is a (snazzily formatted) copy of our current submission to the Knight Prototype Fund round of funding. The initial submission form is short and focused with strict word-count limits on each question’s answer.
In addition to the web-based entry form text-fields, entries are allowed to include three “visual media” files as part of your submission. Our file attachments were created as PDF documents and are included here via links to uploads of these submitted documents on Slideshare.net.
Turning Text Soup into Smart Data in Newspaper and Magazine Archives: Step One. Crowdsourcing Ground Truth
Describe what you will make. * (150 words)
FactMiners will develop a custom version of the Open Source Zooniverse Citizen Science crowdsourcing platform that will:
- add workflow and classification features supporting generation of “Ground Truth” editions of newspaper and magazine pages that are used as ‘answer keys’ by OCR and AI researchers training ‘robots’ (AKA smart programs) to read and understand magazines and newspapers
- support direct access for Ground Truth mapping of high-resolution document page image files in the collections of the Internet Archive
- add data conversion and export of crowdsourced Ground Truth data files in the PAGE format compatible with the desktop edition of PRImA’s Aletheia Ground Truth tool, and support the export of these files directly back to collections at the Internet Archive
- use this platform for the prototype implementation of the “Table Of Contents (TOC) Pattern Reference Library” to be established as a Special Purpose Research Collection at the Internet Archive
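To make the export step in the list above more concrete, here is a minimal sketch of turning volunteer-drawn rectangles into a simplified PAGE-style XML document. Everything in it is an assumption for illustration: the function name, the input dictionary layout, the region types, and the file name are hypothetical, the namespace string is one published PAGE schema version as best I recall it, and the output is not validated against the official PAGE schema or tested with Aletheia.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for one PAGE schema version; verify before any real use.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def regions_to_page_xml(image_filename, width, height, regions):
    """Serialize rectangular regions as a simplified PAGE-style XML string.

    `regions` is a hypothetical crowdsourcing result: a list of dicts with
    pixel coordinates (x, y, w, h) and a region type label.
    """
    root = ET.Element("PcGts", {"xmlns": PAGE_NS})
    page = ET.SubElement(root, "Page", {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })
    for index, r in enumerate(regions):
        x, y, w, h = r["x"], r["y"], r["w"], r["h"]
        # Rectangle corners, clockwise from the top-left.
        corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
        region = ET.SubElement(
            page, "TextRegion", {"id": f"r{index}", "type": r["type"]}
        )
        points = " ".join(f"{px},{py}" for px, py in corners)
        ET.SubElement(region, "Coords", {"points": points})
    return ET.tostring(root, encoding="unicode")


# Illustrative input only; not real project data.
xml_text = regions_to_page_xml(
    "example_page.jpg", 2000, 3000,
    [{"x": 10, "y": 20, "w": 100, "h": 50, "type": "heading"}],
)
print(xml_text)
```

A real exporter would also need the metadata block, reading order, and text content that the full PAGE format defines, and would handle non-rectangular outlines.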
What problem are you trying to solve? * (200 words)
The current state of the art in document OCR (Optical Character Recognition) is a “one size fits all” bulk digitization process whose output can most easily be described as “text soup”: a “bag of words” with little to no information about the complex whole-issue document structure of the source material. This lack of document structure information matters less in basic chapter-organized books and monographs. But it is particularly critical if we want to extract the rich cultural and historic time-series data locked in the content of serial magazines and newspapers.
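The “text soup” problem can be illustrated with a small sketch. The page below is invented for illustration (two made-up regions, one review and one advertisement), and the function and field names are hypothetical. Once the regions are flattened into a bag of words, the counts survive but the link between a word and the article it came from is gone.

```python
from collections import Counter

# Hypothetical page content: two regions with structure (article type, title, body).
page = [
    {"article": "Review", "title": "Choplifter reviewed",
     "body": "A fast arcade rescue game"},
    {"article": "Ad", "title": "Buy a disk drive",
     "body": "A fast drive for less"},
]


def text_soup(regions):
    """Flatten all regions into one bag of words, discarding structure."""
    words = []
    for region in regions:
        words += region["title"].lower().split()
        words += region["body"].lower().split()
    return Counter(words)


def words_by_article(regions):
    """Keep the structure: one bag of words per article type."""
    return {
        r["article"]: Counter((r["title"] + " " + r["body"]).lower().split())
        for r in regions
    }


soup = text_soup(page)
structured = words_by_article(page)

# The soup knows "fast" appears twice, but not that one use is in an ad.
print(soup["fast"])
print(structured["Ad"]["fast"], structured["Review"]["fast"])
```

The structured version is what a Ground Truth edition makes possible: the same words, but still attached to the part of the issue they belong to.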
Originally submitted as part of a Knight News Challenge entry, our proposed prototype is the first step and a key foundation for the applied research collaboration of FactMiners and the PRImA Research Center. The goal of this initiative is the development of the next generation of document layout and text recognition that can turn “Text Soup” into “Smart Data.”
Please see the Appendix of the Project Summary document attached to this submission for copies of the letters of support from our collaborators at PRImA, Texas A&M’s eMOP (Early Modern OCR Project), and the Internet Archive.
Who do you intend to impact with the project and how do you understand their needs? * (200 words)
Our hope is that the fruits of our applied research will contribute to the “next leap forward” in cultural heritage digitization that, as our project title suggests, will “Turn Text Soup into Smart Data.” As such, our aspirations as Citizen Scientists are to contribute to Science in the domains of Digital Humanities and Cognitive Computing.
Our strategy is focused on developing next-generation “drop-in” workflow services that provide “deep-learning” features that can be called upon on-demand within the standard digitization process/service of the Internet Archive. These software-based “magazine/newspaper savvy” OCR/layout micro-service tasks will be integrated with the “human-assist” of a crowdsourced “Ground Truth and Fact-Mining” community/service which is the focus of the prototype to be funded by this entry.
FactMiners’ Jim and Timlynn were among the invited attendees at the Internet Archive’s recent Library Leaders Forum. The invitation was based on the Archive’s interest in the FactMiners/PRImA research agenda. Through hands-on workshops and feedback sessions on proposed Archive strategies, we have gained deep insight into the Archive’s digitization processes together with many excellent collaborative relationships with key Archive staff.
Please list team members and their qualifications. * (400 words)
- Jim Salmons, FactMiners. Former Executive Consultant, Object Technology Practice, IBM. Serial entrepreneur in tech-related startups/services since 1980s. Co-founder of Sohodojo, Open Source R&D lab. Experience: software design/development, executable business models, metamodeling, graph databases. Cedar Rapids, Iowa USA
- Timlynn Babitsky, FactMiners. Experience: Tech Mentoring as Senior Consultant, Object Technology Practice, IBM and as Director of Education Services, Knowledge Systems Corporation. Community Development and Grants Management as Director of North American Rural Futures Institute and co-founder of Sohodojo. Cedar Rapids, Iowa USA
- Apostolos Antonacopoulos, Director & Founder, Pattern Recognition and Image Analysis (PRImA, http://www.primaresearch.org) Research Center, School of Computing, Science & Engineering, University of Salford. As Founder of PRImA and with extensive leadership roles and accomplishments in the OCR, document analysis, and layout recognition research domains, Apostolos is widely recognized as one of the top thought leaders and researchers in these fields. U. of Salford, Manchester UK.
- Christian Clausner, Research Fellow, PRImA. In addition to his wide-ranging and relevant research interests, he is the core designer/developer of the Aletheia Ground Truth and WebAletheia tools. U. of Salford, Manchester UK.
- Laura Mandell, Director of IDHMC (the Initiative for Digital Humanities, Media, and Culture) and Professor, Department of English, Texas A&M. She is Principal Investigator of the eMOP (Early Modern OCR Project), an eResearch consortium including PRImA, developing software that will allow all scholars to deep-code documents for data-mining, and improving OCR software for early modern and 18th-c. texts via high performance and cluster computing. College Station, Texas USA
- Jude Coelho, Process Engineering Manager at the Internet Archive where his duties include designing new processes and software tools to increase efficiency and productivity in the Archive’s book scanning operations. San Francisco, CA USA.
While these core team members will envision and lead the Prototype Fund project, the project’s eventual leaders and contributors are yet to be determined. The project funding will be used to run a two-session, two-location (Manchester, UK and College Station, Texas USA) “hackathon” event to design and develop the proposed Zooniverse crowdsourcing Ground Truth platform. A Call For Participation will invite public participation by Open Source developers and eScholars, with strong participation anticipated from Digital Humanities and Computer Science students of the host universities, Salford and Texas A&M.
What progress, if any, have you made on this project? * (200 words)
- 2013: Jim and Timlynn begin post-cancer #PayItForward Bonus Round activity vowing to “Do something amazing!” with our collection of Softalk magazines; Jim’s metamodel subgraph #SmartData design wins a 1st place in Neo4j’s Domain Modeling GraphGist Challenge; Neo4j community member points us to the museum informatics domain and the #cidocCRM, the Conceptual Reference Model for Museums.
- 2014: FactMiners demo at “Museums & the Web 2014”; Semifinalist Ashoka/LEGO ‘Re-imagine Learning’ Challenge; Jim develops PLN (Personal Learning Network) of experts in #cidocCRM and #TEI; peer-reviewed paper & demo at #MCN2014; peer-reviewed “FactMiners’ Fact Cloud and Witmore’s Text as Massively Addressable Objects” in “CODE|WORDS: Technology & Theory in Museums”
- 2015: Collaboration w/ PRImA begins with “skunkworks” to tweak Aletheia to read Internet Archive ABBYY-XML metadata files; DPLA (Digital Public Library of America) Community Reps for Iowa; FactMiners’ Zooniverse prototype, “Teach Robots to Read Magazines,” developed at #HILT2015 “Crowdsourcing Cultural Heritage” workshop; Invited attendees at 2015 Internet Archive “Library Leaders Forum”.
Acceptable file types: pdf, jpg, gif, mp4, png, mov, xls, ppt, pptx, avi, mpg. Optional: Add images or other visual material to support your project.
Select up to 3 files to attach.
These are the three files we attached to our submission. (We are particularly pleased and proud to include the letters of support from our PRImA and eMOP partners submitted as an Appendix to the third document.) Also note, if you are on iOS and using the Medium app, the Slideshare.net links are not displaying. You will find these three documents here, here, and here.
Which of the following best describes your organization? *
Nonprofit startup <== selected
[NOTE: Timlynn and I are currently independent Citizen Scientists submitting to the Prototype Fund as individuals. However, as noted in our submission, it is our intent, if funded, to create a non-profit company/foundation, FactMiners.org, to serve as the legal entity that supports the organization and funding of the FactMiners’ Open Source developers community and its projects.]
Location (City, State)
Cedar Rapids, Iowa USA
Over 18 (checkbox)
CHECKED… Way over! :D :D
How did you hear about the Prototype Fund?
Wendy Hanamura, Director of Partnerships at the Internet Archive, recommended that FactMiners and PRImA apply for both the Knight News Challenge and the Knight Prototype Fund as a means to energize broad interest in the applied research agenda of “Turning Text Soup into Smart Data” and as a means to initiate an active research collaboration with the Archive.
Enter email for updates
Agree to T&Cs (checkbox)
And that’s it! We submitted the above entry within the deadline this past Monday to be considered for funding in the current round of the Knight Prototype Fund program.
We are elated to be working with the Internet Archive and our world-class collaborators at PRImA and eMOP at the University of Salford and Texas A&M, respectively. We are sure that the project proposed in our Prototype Fund submission will be full of fun, learning, and serious contributions to the domains of the Digital Humanities and Cognitive Computing.
Here’s hoping that the next entry in the “Ground Truth &…” series of FactMiners’ Musings on Medium.com will be the tale of how we implemented the “Crowdsourcing Ground Truth” project in our on-going commitment to turn Text Soup into Smart Data.
-: Jim Salmons :-
18 November 2015
FactMiners, Cedar Rapids, Iowa USA