Digital Humanities spaCy Workshop

  • Many scholars in the digital humanities use TEI markup to enrich texts with information. This can include lexical features as well as specific identifiers for people, places, and organizations. Our workshop filled a need for scripts that convert between TEI/XML and spaCy’s preferred JSONL format for training data and pattern files. David Lassner wrote an excellent standoff converter, which converts TEI documents to plain text while preserving the information in the markup. Future work with these scripts will simplify the task of converting TEI to the formats needed for spaCy seed patterns and training data. They will make it equally simple to write data back to TEI.
  • Several of the workshop participants work with languages for which there is no existing language model of any kind. While the spaCy documentation on how to add a language is quite good, further efforts can be made to explain the process and make the creation of custom language models more accessible to DH scholars. Current work at Haverford to train a language model for Zapotec could provide an effective example and starting point.
  • The creation and editing of TEI require significant domain expertise as well as familiarity with XML and XML editors such as Oxygen. The workshop materials contain scripts to add annotations with Prodigy and save them as TEI markup. For simple tags, this is not a problem. However, we typically need to add more than a bare <person> tag: we also need a person id and other tag attributes.
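
To make the standoff idea from the first point concrete, here is a minimal sketch using only the Python standard library. It is not David Lassner's converter and not the workshop's scripts. The function name `tei_to_standoff`, the `TAG_TO_LABEL` mapping, the sample sentence, and the record shape (a `text` field plus a list of `spans` with character offsets) are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical mapping from TEI element names to entity labels.
TAG_TO_LABEL = {"persName": "PERSON", "placeName": "PLACE", "orgName": "ORG"}

# Made-up sample, not taken from the workshop data.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<persName ref="#p1">Ada Lovelace</persName> visited '
    "<placeName>London</placeName>.</p>"
)


def tei_to_standoff(xml_string):
    """Return (plain_text, spans); spans carry character offsets into plain_text."""
    root = ET.fromstring(xml_string)
    parts = []  # text fragments in document order
    spans = []  # standoff annotations
    pos = 0     # running character offset into the plain text

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text follows the child but belongs to the current element.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        local_name = elem.tag.replace(TEI_NS, "")
        if local_name in TAG_TO_LABEL:
            spans.append({
                "start": start,
                "end": pos,
                "label": TAG_TO_LABEL[local_name],
                # Keep the attributes so ids and references are not lost.
                "attrs": dict(elem.attrib),
            })

    walk(root)
    return "".join(parts), spans


text, spans = tei_to_standoff(SAMPLE)
# One JSON object per line gives a JSONL record.
print(json.dumps({"text": text, "spans": spans}, ensure_ascii=False))
```

Because the markup is stored as offsets rather than inline tags, the plain text can be passed to a pipeline unchanged, and the attributes stay attached to each span for the return trip.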

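Writing annotations back to TEI with ids and other attributes, as the last point describes, could look roughly like the sketch below. It is an illustration under stated assumptions, not the workshop's Prodigy scripts: the function name `spans_to_tei`, the `LABEL_TO_TAG` mapping, the element names, and the span dictionaries with an `attrs` field are hypothetical, and overlapping spans are deliberately rejected rather than handled.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mapping from entity labels to TEI element names.
LABEL_TO_TAG = {"PERSON": "persName", "PLACE": "placeName", "ORG": "orgName"}


def spans_to_tei(text, spans):
    """Wrap each character span of text in a TEI element with its attributes."""
    out = []
    cursor = 0  # end offset of the last span written
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end = span["start"], span["end"]
        if start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        tag = LABEL_TO_TAG[span["label"]]
        # quoteattr escapes the value and adds the surrounding quotes.
        attrs = "".join(
            f" {name}={quoteattr(value)}"
            for name, value in sorted(span.get("attrs", {}).items())
        )
        out.append(escape(text[cursor:start]))      # text before the span
        out.append(f"<{tag}{attrs}>")
        out.append(escape(text[start:end]))         # annotated text
        out.append(f"</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))               # text after the last span
    return "".join(out)


example_text = "Ada Lovelace visited London."
example_spans = [
    {"start": 0, "end": 12, "label": "PERSON", "attrs": {"ref": "#p1"}},
    {"start": 21, "end": 27, "label": "PLACE"},
]
print(spans_to_tei(example_text, example_spans))
```

The attributes are the part that a plain label cannot carry, so in practice they would have to be collected during annotation, for example from a lookup of known person ids.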