The Open Islamicate Texts Initiative Arabic-script OCR Catalyst Project (OpenITI AOCP)

  • Transforming CorpusBuilder 1.0 (developed in collaboration with and funded by SHARIAsource at Harvard Law School) into CorpusBuilder 2.0, a genuine digital text production pipeline, by integrating standards-compliant text conversion/export functionality and the latest advancements in machine learning, computer vision, and natural language processing (for more information on the new OCR components, see the following section; for more information on CorpusBuilder 1.0, see its project page);
  • Executing a corpus development pilot that will produce a typeface inventory of late nineteenth- and twentieth-century Persian and Arabic printing, training data and OCR models for the top twenty Persian and Arabic print typefaces, and two hundred newly OCR’d texts (ten high-priority works selected from among books in each of the top twenty typefaces) in Text Encoding Initiative (TEI) XML format;
  • Fostering the development of a network of allied OCR projects and Arabic-script OCR users through regular teleconferences, Slack groups, and two expert workshops focused on soliciting interdisciplinary feedback and coordinating technical development efforts with other OCR projects;
  • Producing a white paper on best practices for Arabic-script text digitization (workflow, technical method, etc.) and a comprehensive five-year plan focused on the next steps for Persian and Arabic corpus development and the improvement of Arabic-script OCR.
  1. Technical Workstream Deliverables:
  • creating ground truth by aligning dirty OCR with existing editions;
  • improving layout analysis;
  • automating offline OCR retraining;
  • developing unsupervised confidence measures for automatic transcription;
  • developing unsupervised methods for model selection, combination, and adaptation;
  • producing accurate OCR for texts with significant mixtures of languages.
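
To make the first deliverable above more concrete, here is a minimal sketch of how noisy ("dirty") OCR lines might be aligned with the text of an existing edition to yield line-level ground truth. It is an illustration only, not the project's actual method: the function name `project_line_breaks` is hypothetical, the alignment relies on Python's standard `difflib` module rather than any OCR-specific tool, and it assumes the OCR output and the edition cover the same passage in the same order.

```python
from difflib import SequenceMatcher


def project_line_breaks(ocr_lines, edition):
    """Split a trusted edition into segments that match noisy OCR lines.

    Hypothetical helper: returns a list of (ocr_line, edition_segment) pairs.
    """
    ocr = " ".join(ocr_lines)
    matcher = SequenceMatcher(None, ocr, edition, autojunk=False)

    # offset_map[i] is the edition offset aligned with OCR character offset i.
    offset_map = [0] * (len(ocr) + 1)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            # Matching runs map one-to-one; substituted or OCR-only characters
            # collapse onto the start of the corresponding edition span.
            offset_map[i] = j1 + (i - i1) if tag == "equal" else j1
    offset_map[len(ocr)] = len(edition)

    pairs = []
    start = 0
    for line in ocr_lines:
        end = start + len(line)
        segment = edition[offset_map[start]:offset_map[end]].strip()
        pairs.append((line, segment))
        start = end + 1  # skip the space used to join the OCR lines
    return pairs


if __name__ == "__main__":
    noisy = ["the quick brwn fox", "jumps ovr the lazy dog"]
    clean = "the quick brown fox jumps over the lazy dog"
    for ocr_line, truth in project_line_breaks(noisy, clean):
        print(repr(ocr_line), "->", repr(truth))
```

Each resulting pair links an OCR line (and therefore its line image) to a corrected transcription, which is the form of training data that OCR retraining typically needs. The example uses Latin-script strings for readability, but the same character-level logic applies to any Unicode text, including Arabic and Persian.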
  • Bridget Almas, Software Architect, The Alpheios Project
  • Dale J. Correa, President, Middle East Librarians Association & Middle Eastern Studies Librarian, University of Texas-Austin
  • Gregory Crane, Alexander von Humboldt Professor of Digital Humanities, Alexander von Humboldt-Lehrstuhl für Digital Humanities Institut für Informatik, UL; Professor of Classics, Tufts University
  • Ahmet T. Karamustafa, Professor of History, University of Maryland, College Park
  • Fatemeh Keshavarz, Roshan Institute Chair in Persian Studies and Director, School of Languages, Literatures, and Cultures, University of Maryland, College Park
  • Intisar A. Rabb, Professor of Law and History, Harvard University; Director, Program in Islamic Law; Editor-in-Chief, SHARIAsource; Faculty Affiliate, Institute for Quantitative Social Science at Harvard




The Open Islamicate Texts Initiative (OpenITI) is a multi-institutional effort to build the digital infrastructure for Islamicate Digital Humanities.
