The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)
With generous funding from The Andrew W. Mellon Foundation, OpenITI AOCP will create a new digital text production pipeline for Persian and Arabic texts.
In June 2019 The Andrew W. Mellon Foundation generously awarded the University of Maryland, College Park (UMD) an $800,000 grant for the Open Islamicate Texts Initiative’s Arabic-script Optical Character Recognition Catalyst Project (OpenITI AOCP).
The project is led by Matthew Thomas Miller (Roshan Institute for Persian Studies at UMD), Maxim Romanov (University of Vienna), Sarah Bowen Savant (Aga Khan University), David Smith (Northeastern University), and Raffaele Viglianti (Maryland Institute for Technology in the Humanities at UMD). SHARIAsource, a project of the Program in Islamic Law (PIL) at Harvard Law School (both led by Intisar Rabb), provided significant support for the initial technical infrastructure upon which this project will build (i.e., CorpusBuilder 1.0). SHARIAsource will also play a leading role in the technical development portion of OpenITI AOCP.
OpenITI AOCP will catalyze the digitization of the Persian and Arabic written traditions by addressing the central technical and organizational impediments stymying the development of improved OCR for Arabic-script languages. Through a unique interdisciplinary collaboration between humanities scholars, computer scientists, developers, library scientists, and digital humanists, OpenITI AOCP will forge CorpusBuilder 1.0 — an OCR pipeline and post-correction interface — into a user-friendly digital text production pipeline with a wide range of new OCR enhancements and expanded text export functionality. The project will also include a series of workshops, a full corpus development pilot, and a Persian and Arabic typeface inventory, all of which will inform the development of the technical components in important ways.
Background and Project Details
The textual production of the diverse premodern Islamicate cultures stretching from modern Spain to South Asia is one of the most prolific in human history, making it an ideal candidate for digitally enhanced methods of search and analysis. Yet, our ability to bring these diverse literary traditions into the digital realm has been repeatedly frustrated by the underperformance of the currently available OCR solutions for Arabic-script languages. On top of this poor performance, several of them are also prohibitively expensive for academic users and offer limited trainability. For this reason, by late 2016 OpenITI’s work increasingly began to focus on the development of improved open-source OCR for Persian and Arabic through a collaboration with Benjamin Kiessling, then a computer scientist at Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities (now at Université Paris Sciences et Lettres). This collaboration led to a series of studies on Kiessling’s new OCR engine, Kraken, which demonstrated that its neural network-based approach to OCR (now implemented in Tesseract 4’s most recent release too) could routinely achieve accuracy rates on Persian and Arabic print books greater than 97% and, at times, even greater than 98%.
While these recent advancements in the accuracy of open-source OCR engines such as Kraken and Tesseract certainly mark an important milestone in creating digital sources for Persian, Arabic, and other Islamicate languages, these tools still remain inaccessible and thus functionally useless for the vast majority of scholars, librarians, and other interested users. There is no user-friendly digital infrastructure to provide access to these communities and no organizational apparatus to support and organize large-scale OCR projects/collaborations.
OpenITI AOCP will address these issues by:
- Transforming CorpusBuilder 1.0 (developed in collaboration with and funded by SHARIAsource at Harvard Law School) into CorpusBuilder 2.0 — a genuine digital text production pipeline — by integrating standards-compliant text conversion/export functionality and the latest advancements in machine learning, computer vision, and natural language processing into CorpusBuilder (for more information on new OCR components, see following section; for more information on CorpusBuilder 1.0, see its project page);
- Executing a corpus development pilot that will produce a typeface inventory of late nineteenth and twentieth-century Persian and Arabic printing, training data/OCR models for the top twenty Persian and Arabic print typefaces, and two hundred newly OCR’d texts (ten high priority works selected from among books in each of the top twenty typefaces) in Text Encoding Initiative (TEI) XML format;
- Fostering the development of a network of allied OCR projects and Arabic-script OCR users through regular teleconferences, Slack groups, and two experts workshops focused on soliciting interdisciplinary feedback and coordinating efforts on technical developments with other OCR projects;
- Producing a white paper on best practices for Arabic-script text digitization (workflow, technical method, etc.) and a comprehensive five-year plan focused on the next steps for Persian and Arabic corpus development and the improvement of Arabic-script OCR.
OpenITI AOCP will establish the digital and organizational infrastructure for the production of Persian and Arabic corpora and Islamicate Digital Humanities more broadly. It will increasingly open up the “great unread” of twelve hundred years of Persian and Arabic cultural production to rapidly proliferating and increasingly sophisticated “distant reading” methods that until now have remained only partially usable in the study of Islamicate cultures. Researchers, students, and libraries will not only benefit from the newly digitized Persian and Arabic texts produced in the corpus pilot, but will also be able to OCR their own texts with the new Arabic-script digital text production pipeline, CorpusBuilder 2.0. These tools will enable students and scholars alike to leverage digital methods of analysis such as text reuse, topic modeling, and stylometric analysis in the study of the Persian and Arabic texts on which they are conducting research (not just on the limited number that currently exist, in a variety of non-standards-compliant formats, in open digital repositories).
Lastly, the development of Arabic-script OCR technology that is highly accurate, user-friendly, and open-source will be an important technological step towards providing access to texts for both visually-impaired native speakers and students of Arabic-script languages who require text-to-speech screen readers to access written materials.
OpenITI AOCP Deliverables
Broken down by workstreams, the primary outcomes of AOCP will be:
1. Technical Workstream Deliverables:
1.1) Systems for:
- creating ground truth by aligning dirty OCR with existing editions;
- improving layout analysis;
- automating offline OCR retraining;
- unsupervised confidence measures for automatic transcription;
- unsupervised model selection, combination, and adaptation;
- producing accurate OCR with significant mixtures of languages.
1.2) Publication(s), number to be determined, on tests of the aforementioned OCR systems;
1.3) New Arabic-script OCR models, reducing average character error rates from over 20% to under 3% on the ten most important Persian and ten most important Arabic typefaces (a base of training data that will significantly reduce post-correction time for large swaths of the Persian and Arabic textual tradition and help advance the field of Arabic-script OCR generally);
1.4) A Python library that converts OpenITI mARkdown exports from CorpusBuilder into standard open formats, e.g. TEI XML;
1.5) All technical products produced in the technical workstream will be integrated with CorpusBuilder 1.0 — forming the digital text production pipeline, CorpusBuilder 2.0 — and released separately. The CorpusBuilder 2.0 interface and basic training materials will also be translated into Persian and Arabic.
2. Humanities/Corpus Development Workstream Deliverables:
2.1) a Persian and Arabic-script typeface inventory and bibliographic database (including identification of typeface families) of the most important typefaces in Persian and Arabic print history and a companion article describing the inventory’s methodology, structure, and related research;
2.2) one hundred newly digitized Persian works in TEI XML and one hundred newly digitized Arabic works in TEI XML (both Persian and Arabic works will be made freely available through both GitHub and Perseus Digital Library’s new Scaife Viewer);
2.3) a reproducible digital text production workflow (detailed in the white paper).
3. Community Building Deliverables:
3.1) A white paper, published on Humanities Commons or in a relevant open-access journal, detailing best practices for Arabic-script digitization (including text encoding standards for Persian and Arabic);
3.2) An interdisciplinary experts workshop focused on OCR research and OCR projects;
3.3) An interdisciplinary experts workshop focused on evaluation of CorpusBuilder 2.0 beta, digital capacity building in Islamicate Studies, and involving Islamicate Studies scholars in planning the future of Islamicate corpus development;
3.4) Training modules, published on the OpenITI website and GitHub page, on “How to” produce digital editions of texts in CorpusBuilder 2.0;
3.5) Networks of allied OCR researchers, projects and interested users, which will meet biannually during the grant period and discuss projects over Slack;
3.6) a comprehensive five-year plan for Islamicate corpus development and the improvement of Arabic-script OCR (including expansion into Urdu and Ottoman Turkish).
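Deliverable 1.3 above is stated in terms of character error rate (CER). As a rough illustration of how such a figure is commonly computed, the following Python sketch divides the Levenshtein edit distance between a ground-truth transcription and the OCR output by the length of the ground truth. It is a minimal sketch, not the project's own evaluation code: the function names are hypothetical, the NFC normalization step is an assumption, and it counts Unicode code points rather than grapheme clusters, which matters for vocalized Arabic-script text.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn one string into the other."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: edit distance divided by reference length."""
    # Assumption: both strings are NFC-normalized first so that equivalent
    # encodings of the same character are not counted as errors.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ground_truth = "كتاب"  # four characters
    ocr_output = "كتب"     # the OCR output dropped one character
    print(character_error_rate(ground_truth, ocr_output))
```

Under this definition, a CER of 0.03 corresponds to roughly 97% character accuracy when accuracy is taken as one minus the CER.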
Full Project Team
Dr. Matthew Thomas Miller, Assistant Professor of Persian Literature and Digital Humanities at the Roshan Institute for Persian Studies, University of Maryland, College Park — Principal Investigator.
Dr. Maxim Romanov, Universitätsassistent für Digital Humanities at Universität Wien — Area Specialist Co-Principal Investigator.
Dr. Sarah Bowen Savant, Professor of Islamic History at Aga Khan University-Institute for the Study of Muslim Civilisations — Area Specialist Co-Principal Investigator.
Dr. David Smith, Associate Professor in the College of Computer and Information Sciences at Northeastern University — Computer Science Co-Principal Investigator.
Dr. Raffaele Viglianti, Research Programmer at the Maryland Institute for Technology in the Humanities, University of Maryland, College Park — Digital Humanities Co-Principal Investigator.
Masoumeh Seydi, Digital Lead of the Knowledge, Information Technology, and the Arabic Book (KITAB) Project at Aga Khan University, Institute for the Study of Muslim Civilisations — education and OpenITI mARkdown specialist.
Guy Burak, Middle Eastern and Islamic Studies Librarian, New York University — library project lead.
Jonathan Parkes Allen, Mellon Islamicate Digital Humanities Postdoctoral Fellow.
TBD Mellon Computer Science Postdoctoral Fellow, based at Northeastern University.
Asad Zaman, Mellon Islamicate Digital Humanities Graduate Fellow — year one.
TBD Computer Science Graduate Fellow, based at Northeastern University. (Interested in applying for this position? Please go here.)
Kamil Ciemniewski, End Point Developer — Technology Integration Specialist.
Bria Parker, Head, Discovery & Metadata Services, UMD Libraries — metadata librarian.
Sharon Tai, Deputy Editor, SHARIAsource at Harvard Law School — Project Manager for Technical Development.
Janny Peng, Assistant Director for Finance and Administration at the School of Languages, Literatures, and Cultures, University of Maryland, College Park — AOCP Financial Director.
Samar Ata, Special Projects Coordinator at the Roshan Institute for Persian Studies, University of Maryland, College Park — Project Coordinator.
OpenITI AOCP will benefit from the guidance of a team of senior advisors who will provide feedback on the development of the project and its integration with ongoing Digital Humanities and Islamicate Studies initiatives. This team includes:
- Bridget Almas, Software Architect, The Alpheios Project
- Dale J. Correa, President, Middle East Librarians Association & Middle Eastern Studies Librarian, University of Texas-Austin
- Gregory Crane, Alexander von Humboldt Professor of Digital Humanities, Alexander von Humboldt-Lehrstuhl für Digital Humanities, Institut für Informatik, Leipzig University; Professor of Classics, Tufts University
- Ahmet T. Karamustafa, Professor of History, University of Maryland, College Park
- Fatemeh Keshavarz, Roshan Institute Chair in Persian Studies and Director, School of Languages, Literatures, and Cultures, University of Maryland, College Park
- Intisar A. Rabb, Professor of Law and History, Harvard University; Director, Program in Islamic Law; Editor-in-Chief, SHARIAsource; Faculty Affiliate, Institute for Quantitative Social Science at Harvard
For the latest news and updates on OpenITI AOCP, please follow OpenITI, SHARIAsource, and the project's co-PIs (Matthew Thomas Miller, Maxim Romanov, Sarah Bowen Savant, David Smith, and Raffaele Viglianti) on Twitter.
Benjamin Kiessling, Matthew Thomas Miller, Maxim G. Romanov, and Sarah Bowen Savant. “Important new developments in Arabographic optical character recognition (OCR).” Al-ʿUṣūr al-Wusṭā 25 (2017): 1–17, http://islamichistorycommons.org/mem/wp-content/uploads/sites/55/2017/11/UW-25-Savant-et-al.pdf. OpenITI has also completed a new OCR accuracy study on the al-Abhath Arabic academic journal, which was commissioned by JSTOR as a part of its NEH-funded study of the feasibility of digitizing Arabic-language journals. More information on JSTOR’s project can be found here: https://about.jstor.org/news/jstor-receives-50000-neh-grant/. These results have not yet been published, but the full article has been submitted for publication and is available for review upon request. Lastly, the Persian OCR accuracy studies have not yet been prepared for publication, but the full CER reports for these tests can be viewed at OpenITI’s GitHub repository: https://github.com/OpenITI/OCR_GS_Data/tree/master/fas. More on Kraken’s technical details can be found here: https://github.com/mittagessen/kraken.