Reviewing Sorting Phase Data: Formal or Informal Style?

Apr 2, 2019

By Emily Esten, Judaica Digital Humanities Coordinator at the University of Pennsylvania

Scribes of the Cairo Geniza — Help researchers prepare ancient documents for transcription!

To celebrate our volunteers’ hard work and review the data produced in the sorting phase, we’re sharing a series of blog posts that answer some questions about this project. Part 1 reviews the question of whether a subject was in Hebrew or Arabic script. This part reviews the question of whether a subject was written in formal or informal style. Part 3 looks at visual characteristics on the fragments. Part 4 reviews classification tags from the talk boards.

What do formal and informal mean?

In Geniza scholarship, there are two types of fragments: literary and documentary. Literary fragments refer to the religious texts one would normally find in a geniza — scrolls of the Torah, Prophets, and writings. Literary fragments would typically be written in a formal style. On the other hand, documentary fragments include more about the day-to-day life of the Jewish community — like court documents, legal writings, and correspondence. These fragments would typically be written in an informal style.

This distinction isn’t an exact science: sometimes a literary fragment would be written in informal script, or a documentary fragment in formal script. But it is a helpful starting point for researchers identifying what types of fragments are part of the Cairo Geniza. For example, team members at the e-Lijah Lab and Centre for Interdisciplinary Research at the University of Haifa are interested in literary fragments for their research into the homilies in Midrash. Identifying formal-style fragments helps them find those texts more quickly.

For the transcription phase, sorting into formal and informal styles also helps us give volunteers more choices about how they would like to participate. Formal style, in which the script is neat and carefully written, may be easier to read for someone with little or no expertise in the script. Formal style scripts will be available in the “Easy Hebrew” or “Easy Arabic” transcription workflows. On the other hand, informal style, in which the script isn’t neat and appears hastily written, may be more difficult to read for volunteers who aren’t familiar with Hebrew or Arabic. Informal style scripts will be available in the “Challenging Hebrew” or “Challenging Arabic” transcription workflows in the future, but are not currently available.

How many subjects were sorted formal or informal style?

Screenshot of the sorting interface: Is the Hebrew script written in a formal or informal style? Answers include “Formal” and “Informal”.

In Phase 1, volunteers sorted 40,109 subjects from the Cairo Geniza. In the first question, we retired subjects that were classified as being outside the scope of this project (such as blank, damaged, or illegible subjects). As a result, this post discusses the remaining 37,046 subjects that are included in the transcription phase. (All percentages will still be given out of the total project.) Initially, there were three options for volunteers when sorting a subject: Formal, Informal, or Both. Starting in June 2018, volunteers could only choose Formal or Informal, a decision we made to streamline the process for transcription. (If both script styles are present, it would be a challenging fragment!)

If a volunteer sorted a subject as something other than Hebrew or Arabic for the first question, they were not given the chance to answer this question. Most fragments were retired after being classified 5 times, but that does not necessarily mean each one had 5 script style classifications. As such, we do not have as extensive classification data for this question.

Chart showing initial overview for formal or informal style

6,072 subjects (15.1%) were classified as formal style, which means every volunteer who saw the subject sorted it as formal style.

6,432 subjects (16%) were classified as informal style, which means every volunteer who saw the subject sorted it as informal style.

17 subjects (<1%) were classified as both, which means every volunteer who saw the subject sorted it as having both formal and informal style scripts. Working with our content specialists, we decided that fragments that contain both formal and informal style script would be included in the Challenging transcription workflows.

This doesn’t mean that the subjects definitely are written in formal or informal style — it just means that based on the set of instructions given, volunteers identified the style as such.

What did disagreement look like?

24,525 subjects (61%) were contested, meaning volunteers disagreed whether the fragment was written in formal or informal style. At least 1 volunteer sorted it as formal, and at least 1 volunteer sorted it as informal.

Of those subjects, 18,821 (46.9%) were challenging, which means that not only were these subjects contested, but volunteers only ever sorted the fragment as formal or informal style. Volunteers never sorted the subject as having both styles.
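The unanimous and contested categories described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual pipeline: the function name `agreement_category` and the vote labels are hypothetical, and the sketch assumes that any subject whose volunteers did not all give the same answer counts as contested.

```python
def agreement_category(votes: list[str]) -> str:
    """Label a subject from its script style votes.

    `votes` holds one entry per volunteer answer, e.g. "formal",
    "informal", or "both" (hypothetical labels for illustration).
    """
    if not votes:
        raise ValueError("subject has no script style classifications")
    if len(set(votes)) == 1:
        # Every volunteer who answered gave the same answer.
        return f"unanimous {votes[0]}"
    # Assumption: any mix of answers counts as contested.
    return "contested"


print(agreement_category(["formal"] * 5))
print(agreement_category(["formal", "informal", "informal"]))
```

The first call describes a subject every volunteer sorted as formal; the second describes a subject with at least one formal and one informal vote.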

We calculated consensus by finding the average between the script styles: we assigned a value of 0 to formal style and a value of 1 to informal style, added up the values of the script style classifications for each subject, and divided the sum by the total number of script style classifications for that subject. A higher value indicated the subject was more likely informal style; a lower value indicated the subject was more likely formal style.

For example, subject 11598520 (ENA 2898, JTS Library) was classified as formal 3 times and informal 5 times, for a total of 8 classifications.

Subject 11598520: ENA 2898, Library of the Jewish Theological Seminary

(3*0) = 0
(5*1) = 5
5/8 = .625
A score of .625 means that volunteers leaned towards classifying the fragment as informal. Because the score is closer to the center (.5) than to a value of 1, there was high disagreement over the fragment’s script style.
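The same calculation can be written as a short Python sketch. The function name `consensus_score` and its parameters are made up for illustration; the only parts taken from the post are the values of 0 for formal, 1 for informal, and the vote counts for subject 11598520.

```python
def consensus_score(formal_votes: int, informal_votes: int) -> float:
    """Average of the script style votes, with formal = 0 and informal = 1."""
    total = formal_votes + informal_votes
    if total == 0:
        raise ValueError("subject has no script style classifications")
    # Formal votes contribute 0 each, informal votes contribute 1 each.
    return (formal_votes * 0 + informal_votes * 1) / total


# Subject 11598520: 3 formal votes, 5 informal votes.
score = consensus_score(formal_votes=3, informal_votes=5)
print(score)  # 0.625
```

A score of 0 would mean every vote was formal, a score of 1 would mean every vote was informal, and values near .5 indicate the most disagreement.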

Chart for contested subjects for formal and informal style

Of the contested subjects, 9,743 subjects (24%) were classified as more likely formal script, and will be available in the Easy transcription workflows.

Of the contested subjects, 10,492 subjects (26%) were classified as more likely informal script, and will be available in the Challenging transcription workflows.

Of the contested subjects, volunteers were split 50/50 on how to classify 1,780 subjects (4%). This means half of the volunteers chose formal and half of the volunteers chose informal. As discussed above, this can happen because only volunteers who sorted a subject as Hebrew or Arabic in the first question were asked about script style, so a subject could end up with an even number of script style classifications. In the transcription phase, 835 of these subjects (2%) will be available in the Easy transcription workflows, and 945 of these subjects (2%) will be available in the Challenging transcription workflows.

Zooniverse works on consensus, and subjects were retired once they reached the retirement classification count (usually between 5–7). If two categories were tied when the subject was retired, subjects were moved to the Challenging transcription workflows.
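As a rough sketch of how a consensus score could map onto a transcription track, the Python below uses a cut-off of .5. That cut-off and the function name `transcription_track` are assumptions made for illustration, since the post does not state an exact threshold. Ties go to Challenging, following the rule described above, although the counts in this post show that some tied subjects ended up in Easy, so the real process involved more than this sketch captures. The Hebrew or Arabic part of each workflow comes from the first question and is left out here.

```python
def transcription_track(score: float) -> str:
    """Map a consensus score (0 = formal, 1 = informal) to a track."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Assumed cut-off: below .5 leans formal, so the subject is Easy.
    if score < 0.5:
        return "Easy"
    # Scores above .5 lean informal; an exact tie (.5) is also sent
    # to Challenging under the tie rule described in the post.
    return "Challenging"


print(transcription_track(0.625))  # Challenging
print(transcription_track(0.25))   # Easy
```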

Overall, 17,493 subjects (43.6%) were sorted as formal style. In the transcription phase, these subjects will be available in the Easy transcription workflows.

19,553 subjects (48.6%) were sorted as informal style. In the transcription phase, these subjects will be available in the Challenging transcription workflows.

We know volunteers struggled to identify differences between formal and informal style. Unlike the question of script, in which there is a definitive answer, formal and informal are subjective categories and titles we created specifically for this project. We didn’t know if people would be able to identify or see these differences.

We still get many questions from volunteers about how best to sort into formal and informal categories, and we are working to improve our Field Guide to assist volunteers in making their classifications. The results suggest that volunteers were able to identify some kind of difference between formal and informal using the information provided. However, the results have not yet been evaluated by Geniza or paleography scholars, nor has their accuracy been evaluated in regard to their effectiveness at sorting fragments into ‘easy’ or ‘difficult’ categories. These are both concerns we hope to review in the transcription phase.

What does this mean for the transcription phase?

This means we are starting the transcription phase with the following transcription workflows:

17,031 subjects (42.4%) are classified as Easy Hebrew, and are currently available for transcription.

460 subjects (1.1%) are classified as Easy Arabic, and are currently available for transcription.

At a future date, we plan to launch Challenging tracks for Hebrew and Arabic.

18,516 subjects (46%) from Phase 1 are classified as Challenging Hebrew; they are still part of the project, but are not currently available.

1,397 subjects (3.4%) from Phase 1 are classified as Challenging Arabic; they are still part of the project, but are not currently available.

And as a reminder from Part 1, 3,063 subjects (7.6%) from the sorting phase were found to be out of scope, and have been retired from the project.

Written by Judaica DH at the Penn Libraries