Digital Humanities spaCy Workshop
On July 8th, a group of 28 scholars met in Utrecht for a DH2019 workshop on spaCy, a fast and accessible natural language processing (NLP) library that integrates modern machine learning technology. This was an opportunity to learn about spaCy and to fill the gaps between community needs and technical possibilities. This post will articulate the insights from the workshop while they are still fresh. In doing so, it will build on the dialogue between digital humanities (DH) scholars and the spaCy developer community.
The workshop began with an introductory session taught by Seth Bernstein from the Higher School of Economics in Moscow. Like similar NLP libraries, such as Stanford NLP or the Natural Language Toolkit (NLTK), spaCy allows computers to identify features of human language such as parts of speech, types of words, and entities (such as places, people, emotions). Without NLP, a text is just a sequence of characters to a computer. spaCy transforms plain text into document, span, and token objects. The document object contains the original text as well as lists of the component sentences, entities, and tokens. Each token, in turn, has more than 60 attributes, including the root form of the word (lemma), the entity type, part of speech, language and a host of other relevant information. A span has similar capabilities for patterns and sections of text. With just three lines of code, the text is now brimming with features for the computational analysis of text and language.
from spacy.lang.en import English
nlp = English()
doc = nlp('On July 8th, a group of 28 scholars met in Utrecht for a DH2019 workshop on spaCy... ')
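One caveat is worth spelling out. The three lines above create a blank English pipeline, which tokenizes the text but has no trained components. Lexical attributes work right away, while part-of-speech tags and entities generally require one of the pre-trained models. The following is a minimal sketch of what is available at this stage; the shortened example sentence is ours and is used only for illustration.

```python
from spacy.lang.en import English

# A blank English pipeline: tokenizer only, no tagger or entity recognizer.
nlp = English()
doc = nlp("A group of 28 scholars met in Utrecht.")

# Lexical attributes are computed from the token text alone.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

# Tokens that look like numbers.
numbers = [token.text for token in doc if token.like_num]

# A span is a slice of the document.
span = doc[3:5]
print(span.text)

# Without a trained entity recognizer, no entities are set.
print(len(doc.ents))
```

Loading a pre-trained model with spacy.load() instead of English() would fill in the attributes that depend on statistical components.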
As a library, spaCy favors simplicity, quick experimentation and customization. A variety of pre-trained models are available out of the box for English, German, and seven other languages. The developers of spaCy have created a very useful app to visualize and assess whether the pre-trained models are right for your named entity recognition tasks. A simple built-in tool called displaCy makes it easy to see similar results for part-of-speech tagging, dependencies and other results from your models.
Given that the language model defines what is legible to the computer in our texts, a key need voiced by the workshop participants was the ability to shape what the model is capable of “seeing.” From project to project, we are interested in different elements of language and content. No model will ever accommodate every possible research project, so what we need most is a tool that can adapt to our current research objectives. spaCy meets that need. With the set_extension() method, which is available on the document, span, and token classes, we can add new attributes to those objects. In his section of the workshop, David Lassner (TU Berlin) demonstrated how to add custom attributes at the document, span, and token levels. Equally important, David showed how to import tags and attributes from a TEI document into spaCy using a standoff converter that he created for the workshop. TEI is an excellent format for storing information about a text. However, XML can be hard for non-experts to work with. With the relevant tags loaded into spaCy, we can quickly prepare the documents for computational analysis.
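As a rough illustration of custom attributes at all three levels, here is a minimal sketch. The attribute names (source, is_early_modern, tei_tag), the file name, and the example sentence are our own assumptions for this post and are not taken from the workshop notebooks.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span, Token

# Register custom attributes. force=True allows re-running the cell
# without an error about an already registered extension.
Doc.set_extension("source", default=None, force=True)
Token.set_extension("is_early_modern", default=False, force=True)
Span.set_extension("tei_tag", default=None, force=True)

nlp = English()
doc = nlp("The kings of Sweveland and Denmarke met.")

# Document level: record where the text came from (hypothetical file name).
doc._.source = "example.xml"

# Token level: flag early modern spellings from a small hand-made list.
early_modern_spellings = {"Sweveland", "Denmarke"}
for token in doc:
    if token.text in early_modern_spellings:
        token._.is_early_modern = True

# Span level: attach a TEI-style tag name to a slice of the document.
span = doc[3:4]
span._.tei_tag = "placeName"

print(doc._.source, span.text, span._.tei_tag)
```

Custom attributes live under the `._.` namespace, which keeps them separate from spaCy's built-in attributes.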
A common need among DH scholars is the ability to load an existing TEI document into an NLP language model, to add new information automatically using a model (or models) and then save that information back to the TEI file. When we’re working with thousands of documents, it can be impractical to mark up every document by hand. Even an initial round of automated markup can do work that would take a single person years, and it makes the documents available for initial experimentation. Inevitably, we learn from the first iteration and make adjustments. Automated tagging makes it possible to experiment and evaluate different approaches to a given research question.
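To make the idea of standoff conversion concrete, here is a minimal sketch using only Python's standard library. It is not David's converter. The function name tei_to_standoff and the sample fragment are our own, and the sketch ignores the TEI namespace, which in a real document would appear as a prefix on every tag name.

```python
import xml.etree.ElementTree as ET


def tei_to_standoff(xml_string):
    """Return (plain_text, annotations) for a small XML fragment.

    Each annotation records the character offsets of one element
    in the plain text, plus its tag name and attributes.
    """
    root = ET.fromstring(xml_string)
    chunks = []
    annotations = []
    offset = 0

    def walk(elem):
        nonlocal offset
        start = offset
        if elem.text:
            chunks.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            # Text that follows a child still belongs to the parent.
            if child.tail:
                chunks.append(child.tail)
                offset += len(child.tail)
        annotations.append(
            {"start": start, "end": offset, "tag": elem.tag, "attrib": dict(elem.attrib)}
        )

    walk(root)
    return "".join(chunks), annotations


sample = '<p>The king of <placeName ref="#dk">Denmarke</placeName> sent wolle.</p>'
text, annotations = tei_to_standoff(sample)
places = [a for a in annotations if a["tag"] == "placeName"]

print(text)
for a in places:
    print(a["start"], a["end"], text[a["start"]:a["end"]], a["attrib"])
```

Character offsets like these can be turned into spaCy spans with doc.char_span(start, end), which returns None when the offsets do not line up with token boundaries. Writing data back to TEI is the reverse operation: insert tags into the plain text at the recorded offsets.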
Most pre-trained language models are trained on texts that are fundamentally unlike the literary and historical documents that we study in the digital humanities. A model trained on 21st-century Wikipedia articles, for example, is not able to accurately identify place names in early modern documents. Sweveland and Denmarke are incorrectly identified as organizations rather than places. We need a simple way to teach the model to identify those elements of speech and content that are relevant to our research. spaCy offers two possible approaches to this problem. Using a text from the Perseus project, we created a list of 2000 early modern place names. We then converted that list into training data. We can load an existing model, train it on our new category and then save the new model to disk.
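Converting a list of names into pattern records might look roughly like the sketch below. The helper names, the GPE label, and the three sample names are assumptions made for illustration. Splitting on whitespace is only an approximation of spaCy's tokenizer and will not handle names that contain punctuation.

```python
import json


def make_patterns(place_names, label="GPE"):
    """Build one token-based pattern record per place name."""
    patterns = []
    for name in place_names:
        # One dict per token; "lower" matches case-insensitively.
        token_patterns = [{"lower": word.lower()} for word in name.split()]
        patterns.append({"label": label, "pattern": token_patterns})
    return patterns


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


places = ["Sweveland", "Denmarke", "English Realme"]
patterns = make_patterns(places)
print(to_jsonl(patterns))
```

To the best of our knowledge, records of this shape, saved one per line, are what spaCy's rule-based entity matching and Prodigy's pattern files accept. Check the current documentation before relying on the exact keys.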
The next section was taught by Andrew Janco from Haverford College. We used Prodigy, an annotation tool for machine teaching created by the spaCy team. It facilitates active interaction between a human annotator and the model. A person can manually show the model examples of an element of interest. Seed terms offer additional examples of what we would like the model to learn. During training, Prodigy will sort the model’s results and ask for input from the user. If it’s only 50% certain, for example, that “English Realme” is a place, it will ask for a yes or no answer from the researcher. Is “English wolle [wool]” also a place? No, it is not. What unfolds is a dialogue between the scholar and the model. It’s fascinating to watch as the model asks questions and learns (or fails to learn). For anyone who is frustrated by the mysterious “black box” of machine learning, Prodigy offers a unique opportunity to engage in dialogue and teach the model aspects of the text that are relevant to your research. With only 2000 seed patterns taken from Perseus and 200 annotations, we were able to train a model on early modern English place names in about 20 minutes. The model is able to correctly identify places that were not in the training data, in texts that it has never seen.
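The idea of asking about the least certain examples first can be sketched in a few lines of plain Python. This is not Prodigy's implementation. The function name is ours and the scores are invented for illustration, not output from the workshop model.

```python
def sort_by_uncertainty(scored_examples):
    """Order (text, score) pairs so that scores closest to 0.5 come first."""
    return sorted(scored_examples, key=lambda item: abs(item[1] - 0.5))


# Invented scores: the model's estimated probability that each phrase is a place.
candidates = [
    ("Denmarke", 0.93),
    ("English Realme", 0.50),
    ("English wolle", 0.41),
    ("the", 0.02),
]

queue = sort_by_uncertainty(candidates)
for text, score in queue:
    print(f"Is '{text}' a place? (model score: {score:.2f})")
```

The annotator sees the cases the model is least sure about first, so each yes or no answer carries as much new information as possible.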
During the discussion section of the workshop, we were joined by Matthew Honnibal and Ines Montani, the creators and main developers of spaCy and Prodigy. This offered an opportunity for academic researchers to share their work and interests with the library’s maintainers. While DH scholars prefer open-source software, we rarely participate in the community discussions that establish needs and priorities for future development work. Over the course of the workshop and our discussions, several needs emerged that can serve as a guide for future work.
- Many scholars in the digital humanities use TEI markup to enrich texts with information. This can include lexical features as well as specific identifiers for people, places, and organizations. Our workshop filled a need for scripts to convert between TEI/XML and spaCy’s preferred JSONL format for training data and pattern files. David Lassner wrote an excellent standoff converter, which converts TEI documents to plain text while preserving the information in the markup. Future work with these scripts will simplify the task of converting TEI to the formats needed for spaCy seed patterns and training data. They will make it equally simple to write data back to TEI.
- Several of the workshop participants work with languages for which there is no existing language model of any kind. While the spaCy documentation on how to add a language is quite good, further efforts can be made to explain the process and make the creation of custom language models more accessible to DH scholars. Current work at Haverford to train a language model for Zapotec could provide an effective example and starting point.
- The creation and editing of TEI require significant domain expertise as well as familiarity with XML and XML editors such as Oxygen. The workshop materials contain scripts to add annotations with Prodigy and save them as TEI markup. For simple tags, this is not a problem. However, we typically need to add more than just <person>: we also need a person id and other tag attributes.
<person sex="intersex" role="god" age="immortal">
Further work is needed to find ways to meet that need with a simple annotation tool that is accessible to students and faculty alike. Brat, INCEpTION and similar tools offer a potential alternative but have proven difficult for the workshop participants to learn and use in their work. One approach would be to use spaCy’s named entity linking feature to draw information from a knowledge base to enrich our TEI tags. For example, a student would highlight Hermaphroditos in the text and the sex, role and Greek name would be loaded automatically from DBpedia.
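The enrichment step can be sketched with a plain dictionary standing in for the knowledge base. This does not use spaCy's entity linking or query DBpedia. The dictionary, its contents, and the function name are hypothetical and simply mirror the example tag above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a knowledge base lookup.
KNOWLEDGE_BASE = {
    "Hermaphroditos": {"sex": "intersex", "role": "god", "age": "immortal"},
}


def tag_person(name, knowledge_base):
    """Wrap a highlighted name in a <person> element with attributes from the KB."""
    attributes = dict(knowledge_base.get(name, {}))
    element = ET.Element("person", attrib=attributes)
    element.text = name
    return ET.tostring(element, encoding="unicode")


markup = tag_person("Hermaphroditos", KNOWLEDGE_BASE)
unknown = tag_person("Salmacis", KNOWLEDGE_BASE)
print(markup)
print(unknown)
```

A name that is missing from the knowledge base still gets a bare tag, so the annotator can fill in the attributes by hand later.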
The DH2019 spaCy workshop established and partially filled the need for simple tools and clear instructions to work with TEI and spaCy. We can process a text with customized models and save the results back to TEI. It is possible for individual researchers and small teams to customize and train models that address their specific research goals. These processes can be further simplified and added to a package. Imagine just typing `pip install spacy-tei` and being able to load TEI documents for analysis with spaCy and Prodigy.
There is also a demand for simple instructions and a workflow to create models for new languages. Over the next year, we should work to add new language models for less-common languages and assess the feasibility of a follow-up workshop at DH2020 in Ottawa.
David, Seth and I would like to thank everyone who attended the workshop. Our notebooks and materials can be found here. We hope that we’ve given you the tools and knowledge that you’ll need to make use of spaCy in your work. Please keep in touch. Many thanks to Matt and Ines for joining us. We’d love to continue the conversation!