Annotation as a Pivot in DH Experiments

Nils Reiter
May 27, 2015


Nils Reiter and Jonas Kuhn
Stuttgart University

This is our contribution to the blog carnival in preparation for the #dhiha6 conference. We try to explain how we think an experimental methodology within (text-oriented) digital humanities could look and work, and what benefits could be expected from it.

Typically, higher-level research questions in text-oriented DH projects are (or should be) broken down into specific subquestions, some of which ask for the presence or absence of textual categories in a corpus. The categories can be key words (e.g., the use of emotionally charged words by a certain author), topics (e.g., the appearance and use of the concept of “nation” in newspapers), complex patterns (e.g., time expressions in novels or location references in historical documents), categories specific to a humanities discipline (e.g., focalisation in narrative texts) or even relations between textual categories (e.g., social networks of characters/historical persons). In all of these cases, one would investigate the occurrences of the category or categories and compare them with expectations.

In theory, this can be done purely manually, by (close) reading the corpus and collecting/counting instances. In practice, however, a tool that predicts the category would be preferable, as it can be applied to larger quantities of text and ensures consistency, i.e., the application of the same criteria over a long period of time.

We would like to put forward the idea that there are two underlying kinds of experiments that can (and need to) be conducted in order to tackle questions of the types described above. We would also like to argue that these experiments are steps forward in their own right, even when they do not directly answer one of those questions.

The two kinds of experiments are annotation and automatisation or prediction experiments. Annotation experiments revolve around manually annotating texts with the categories of interest, while in automatisation experiments we build some predictive model for the categorisation task and try to annotate the categories automatically.

Annotation Experiments

Annotating texts with categories predicted from a certain theory is far from trivial. Examples given in the theoretical literature are usually clear-cut and unambiguous, but “real” text is not. There is a gap between a theory and annotated texts. Bridging this gap can be done in an experiment-driven fashion.

The general workflow we envision is an annotation cycle:

  1. Start with an existing theory (or create a new one)
  2. Define categories, develop tests, give examples for annotators
  3. Let annotators do their job (in parallel)
  4. Evaluate the annotations, change definitions and categories accordingly, go back to 2.
  5. After a number of iterations: Reflect on the relation between theory and annotation rules, incorporate insights from annotation into theory

To put it in experiment terms: We hypothesise that certain definitions, tests and annotator instructions are suitable to annotate reliably (i.e., objectively describe the categories of the theory) and do the annotations in order to verify this hypothesis.

Seeing annotations as an experimental endeavour requires being able to evaluate how well the annotation went. This could be done in a number of ways, from asking the annotator(s) directly for their intuitions, to letting them fill in a questionnaire, to quantitatively measuring inter- or intra-annotator agreement (IAA). The underlying goal in all cases is that annotations are reproducible, i.e., a different person annotates similarly in the same setting, or the same person annotates similarly at a different time. The annotations should depend on how the categories are defined and, as far as possible, not on uncontrolled factors.
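To make the quantitative option concrete, here is a minimal sketch of one common agreement measure, Cohen's kappa, for two annotators. The choice of kappa, the function name and the focalisation labels are our assumptions for illustration only; the data is made up and other agreement measures may be more appropriate for a given annotation task.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical focalisation labels for six sentences (made-up data).
annotator_1 = ["internal", "external", "internal", "zero", "external", "internal"]
annotator_2 = ["internal", "external", "zero", "zero", "external", "external"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A value of 1 means perfect agreement and 0 means agreement at chance level; tracking such a score across iterations of the annotation cycle is one way to see whether revised definitions actually help.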

Automatisation Experiments

Independent of the concrete implementation strategy (rule-based, supervised or unsupervised machine learning, …), at some point a certain number of annotated test items is required. Test items are manual annotations of text, and thus text instances for which we already know what the program we write should predict. If we make changes to our program (adding new rules, adding features, changing the algorithm), we can run it against the test set and check whether the changes decrease or increase the performance.

Similarly to the workflow above, this is also a cycle:

  1. Annotated dataset (resulting from, e.g., an annotation experiment)
  2. Implement a program (that incorporates hypotheses on how to detect the category automatically)
  3. Evaluate the program against the data set, make an error analysis
  4. Make changes to the program based on the evaluation, go back to 3.

The experiments in this workflow are conducted in steps 2 and 3. We hypothesise that a certain algorithm/feature set/rule set works well in predicting the category. We apply the program to a data set to verify or falsify this hypothesis.
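One pass through this cycle can be sketched as follows. This is a toy illustration under our own assumptions: the sentences, the two rule sets and all names are hypothetical, and real time-expression detection is considerably harder than matching a few patterns.

```python
import re

# Hypothetical test items: (sentence, contains a time expression?)
TEST_ITEMS = [
    ("She arrived in 1848.", True),
    ("He left on Monday.", True),
    ("The house was empty.", False),
    ("They met again in March.", True),
    ("Room 12 was locked.", False),
    ("It happened last year.", True),
]

YEAR = re.compile(r"\b\d{4}\b")
NAMED_DAY_OR_MONTH = re.compile(
    r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|"
    r"January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\b"
)


def detect_v1(sentence):
    # Hypothesis 1: a four-digit number signals a time expression.
    return YEAR.search(sentence) is not None


def detect_v2(sentence):
    # Hypothesis 2: names of weekdays and months are signals as well.
    return detect_v1(sentence) or NAMED_DAY_OR_MONTH.search(sentence) is not None


def accuracy(detector, items):
    # Share of test items for which the prediction matches the annotation.
    return sum(detector(text) == gold for text, gold in items) / len(items)


def errors(detector, items):
    # Error analysis: the sentences the detector gets wrong.
    return [text for text, gold in items if detector(text) != gold]


for name, detector in [("v1", detect_v1), ("v2", detect_v2)]:
    print(name, round(accuracy(detector, TEST_ITEMS), 2), errors(detector, TEST_ITEMS))
```

The remaining errors of the second version would then motivate the next change to the program, for instance a rule for relative expressions such as "last year".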

As with the manual annotation experiments, we need to measure how well the program performs, and we can draw on the established evaluation methodology from NLP and machine learning (e.g., precision, recall, F-score). In principle, we could also just count errors manually.
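For readers unfamiliar with these measures, the following sketch computes them for a set of predicted annotations against a set of manual ones. Representing annotations as exactly matching (start, end) character offsets is our simplifying assumption, and the spans are made up; real evaluations often also have to decide how to treat partial matches.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for exact matches between two sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: how many of the predicted annotations are correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how many of the manual annotations were found?
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical location references as (start, end) character offsets.
gold_spans = {(0, 6), (15, 22), (40, 47), (60, 66)}
predicted_spans = {(0, 6), (15, 22), (30, 35)}

p, r, f = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision = {p:.2f}, recall = {r:.2f}, F1 = {f:.2f}")
```

Here two of the three predictions are correct and two of the four manual annotations are found, so precision is 2/3 and recall is 1/2.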

This cycle becomes even more powerful if we do not assume that implementation and evaluation are done by the same researcher(s), but are instead distributed across several research groups. In so-called shared tasks, one group is responsible for the evaluation, and the other groups submit the annotations predicted by their different implementations. This leads to results that are directly comparable, and it has fostered enormous progress within NLP in the past.


Once some annotation experiments and subsequent automatisation experiments have taken place, the prediction quality of the programs has (hopefully) reached a certain level on the annotated data. If, for instance, a program reaches a prediction quality of 90%, it is not perfect but can still be helpful for the big data questions mentioned at the beginning.

This may seem like a purely pragmatic approach tailored towards automatisation, but the benefit of applying a humanities theory to text (by annotating it) should not be underestimated. Within computational linguistics, big annotation projects have led to substantial improvements of, e.g., grammatical theory, because sentences had to be analysed that were not expected and would never appear in a grammarian's rule book. Occasionally, category distinctions that are extremely hard to predict automatically have been reconsidered, with positive effects on the overall network of interconnected layers of systematic description. In a similar way, humanities theories of texts could benefit from annotation efforts directly, even when no automatisation is involved.

We would like to point out that work in these directions is no low-hanging fruit. In our experience, even annotating texts according to relatively simple humanities theories is quite difficult. Building systems that predict complex humanities categories is even more difficult, given the holistic approach humanities theories often take. From our point of view, establishing publication opportunities (e.g., in shared tasks) for experiments as sketched above would promote the field of DH substantially.