Experiments in computational palaeography

The site of the “Institut de recherche et d’histoire des textes” hosts pages about the CLaMM competition for the automatic classification of medieval scripts. Works submitted to past editions appear to be highly sophisticated AI projects, achieving amazing accuracy levels in the different classification tasks.

Since the database used for the competition can also be downloaded, I used it for a few simple experiments. The database consists of a set of 2000 scans from as many manuscripts, each one classified by script type and date of production.

For my first experiments, I have extracted a small subset of the images: 5 for each of the 12 script classes, i.e. a total of 60 images. Here is the list of the 12 classes, with the addition in square brackets of the date ranges of the samples I used (detailed descriptions can be found here):

1. Caroline [XI Century]
2. Cursiva [XIV-XV Century]
3. Half-Uncial [X-XI Century]
4. Humanistic [XVI Century]
5. Humanistic Cursive [XV-XVI Century]
6. Hybrida [XV-XVI Century]
7. Praegothica [XII Century]
8. Semihybrida [late XV Century]
9. Semitextualis [XIII-XIV Century]
10. Southern Textualis (Rotunda) [XIII-XVI Century]
11. Textualis [XIII-XV Century]
12. Uncial [IX Century and before]

I have added the following script classes from two specific manuscripts that I was interested in comparing with the samples from the CLaMM database:

13. Voynich manuscript (Beinecke ms 408) Latin script. The Voynich manuscript is an early XV Century book written in an unreadable and unique alphabet. It contains a few short marginal annotations that appear to date to the same time as the unreadable text: these annotations appear to be written in Latin and German and are sometimes intermixed with words in the unreadable Voynichese alphabet. For this comparison, I have considered the longer Latin-alphabet text that appears on the last page of the manuscript, f116v.

14. Voynichese script. Samples of text from different pages of the unreadable text in Beinecke ms 408.

15. Giovanni Fontana Latin script. BSB Cod.icon. 242, “Bellicorum instrumentorum liber cum figuris” is a cipher manuscript written in Venice in 1420–1430. The manuscript contains paragraphs in plain Latin that were used to build this collection of samples.

16. Giovanni Fontana cipher. Samples from the enciphered main body of BSB Cod.icon. 242.

For each of the 16 script classes, 5 samples were selected, each containing three lines of text. The exception is class 13, since the Latin text in the Voynich manuscript is barely enough to build two samples. I added a third sample by digitally removing some noise from the original image.

I experimented with a number of basic digital processing techniques, trying to find easy-to-interpret measures that could help analyse the different scripts. In the process, I realized that some of the classes cannot be distinguished by my simplistic approach, so I grouped them together into three larger super-classes:

  • Classical scripts: Uncial (12), Half-Uncial (03) and Humanistic-Caroline (04) are represented by yellow and orange symbols in the following diagrams. Caroline (01) is plotted in red and is close to this group.
  • Gothic scripts: Textualis (11), Semitextualis (09) and Southern-Textualis (10) are represented by green symbols.
  • Cursive scripts: Cursiva (02), Hybrida (06) and Semihybrida (08) are represented by blue symbols.

Praegothica (07, orange triangle) is loosely correlated with Classical scripts. Humanistic-cursive (05, gray) is even more loosely correlated with Cursive scripts. Data from the Voynich ms are plotted in pink. Data from Fontana’s manuscript are plotted in black. In both cases, I have used diamonds for the Latin scripts and circles for the invented alphabets.

Blackness
The first measure I found useful counts the black pixels in an adaptively thresholded image. It essentially measures how much of the page surface is covered in dark ink. Scripts with thick characters and little space between lines will result in a higher value for this measure.
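
The idea of counting dark pixels after adaptive thresholding can be sketched as follows. This is only an illustration under my own assumptions, not the code used for the plots: the function names, the mean-based threshold, the window radius and the offset are all hypothetical choices, and the image is assumed to be a 2-D grayscale NumPy array with dark ink on a light background.

```python
import numpy as np

def local_mean(gray, radius):
    """Mean of the (2*radius+1)-square neighbourhood of each pixel (integral image)."""
    size = 2 * radius + 1
    padded = np.pad(gray.astype(np.float64), radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    total = (integral[size:size + h, size:size + w]
             - integral[:h, size:size + w]
             - integral[size:size + h, :w]
             + integral[:h, :w])
    return total / (size * size)

def ink_mask(gray, radius=15, offset=10):
    """True where a pixel is darker than its local mean by more than `offset`.

    `radius` and `offset` are assumed values, not taken from the experiments.
    """
    return gray < local_mean(gray, radius) - offset

def blackness(gray, radius=15, offset=10):
    """Fraction of the image surface classified as ink."""
    return float(ink_mask(gray, radius, offset).mean())
```

Comparing each pixel with its local neighbourhood rather than with a single global threshold makes the measure less sensitive to uneven lighting or stained parchment, which is presumably why an adaptive threshold is a natural choice for manuscript scans.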

Scripts with low (left) and high (right) “blackness”

Line Separation
The second measure is based on the variance of blackness in a number of horizontal “windows” cutting through the image. Scripts in which lines of text are separated by a wide empty space will result in a high measure. Factors that can cause a lower measure are:

  • the presence of elongated ascenders and descenders (in particular if they include loops)
  • irregularities in the horizontality of lines (in particular when no ruling lines were used)
  • abbreviation symbols that appear as superscripts

Scripts with low (left) and high (right) “line separation”
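
A minimal sketch of a variance-based line-separation measure, assuming the page has already been reduced to a boolean ink mask (True = ink). The number of windows and the absence of any normalization are my assumptions for the sake of the example; the actual settings are not stated here.

```python
import numpy as np

def line_separation(ink, n_windows=40):
    """Variance of the ink fraction across horizontal bands of the image.

    `ink` is a 2-D boolean array; `n_windows` is an assumed parameter.
    """
    bands = np.array_split(ink.astype(np.float64), n_windows, axis=0)
    band_blackness = np.array([band.mean() for band in bands])
    return float(band_blackness.var())
```

When bands falling on text lines are much darker than bands falling between lines, the variance is high. Long ascenders and descenders, sloping lines and superscript abbreviations all put ink into the gaps, which evens out the band values and lowers the variance.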

The following diagram presents results based on the two measures described above. I manually added grey lines to highlight the areas occupied by different script classes.
Classical scripts (yellow / orange) have high Line-Separation and low Blackness. The two Voynich alphabets (pink) and the Latin Fontana alphabet (black diamonds) have low Line-Separation and low Blackness; humanistic-cursive scripts (grey triangles) are scattered all over, with a single sample (marked 05c) appearing at the centre of the Voynich-Fontana cluster.
Cursive (blue) and Gothic (green) scripts are mixed together. Fontana’s cipher alphabet is in an intermediate position between the Voynich manuscript and Cursive scripts.

Forward Diagonality
I have used morphological “erosion” to measure the amount of strokes directed along a bottom-left to top-right diagonal, like the central stroke of Z. High diagonality typically corresponds to cursive scripts with a systematic slant. Scripts with no slant that avoid straight strokes in favour of bows and loops can also produce relatively high diagonality (this too is typical of cursive scripts). Majuscule scripts, in which characters are based on a square box made of horizontal and vertical lines, result in lower diagonality values.
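
One way to turn erosion into a diagonality score is sketched below. It is an illustration under assumptions of mine: the structuring element is a one-pixel-wide diagonal segment of a hypothetical length, and the score is the share of ink pixels that survive the erosion. The input is a boolean ink mask (True = ink).

```python
import numpy as np

def erode_along(ink, offsets):
    """Binary erosion: keep a pixel only if every (row, col) offset from it is ink."""
    h, w = ink.shape
    reach = max(max(abs(dr), abs(dc)) for dr, dc in offsets)
    padded = np.pad(ink.astype(bool), reach, mode="constant", constant_values=False)
    out = np.ones((h, w), dtype=bool)
    for dr, dc in offsets:
        out &= padded[reach + dr:reach + dr + h, reach + dc:reach + dc + w]
    return out

def forward_diagonality(ink, length=7):
    """Share of ink pixels lying at the centre of a bottom-left to top-right run.

    `length` (odd) is an assumed size for the diagonal structuring element.
    """
    half = length // 2
    # Rows grow downwards, so "up and to the right" is (row - i, col + i).
    offsets = [(-i, i) for i in range(-half, half + 1)]
    total = int(ink.sum())
    if total == 0:
        return 0.0
    return int(erode_along(ink, offsets).sum()) / total
```

A stroke shaped like “/” keeps most of its pixels after this erosion, while horizontal, vertical and “\” strokes thinner than the element are wiped out, so the ratio grows with the amount of forward-leaning ink.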

Scripts with low (left) and high (right) “forward diagonality”

The Blackness / Diagonality diagram can be split into three areas.

All four alphabets from the Voynich and Fontana manuscripts are characterized by low Blackness and average to high Diagonality. Fontana’s cipher has exceptionally high Diagonality values, due to the presence of several characters that include one or two diagonal segments with an inclination of about 45 degrees. In this diagram, more samples from the Humanistic-Cursive set (05, grey triangles) appear next to the Voynich-Fontana samples.

Caroline scripts (01, red) appear at the centre of the plot, with both average Blackness and average Diagonality. The bottom of the central area, with low Blackness and low Diagonality, is occupied by the Classical scripts; Cursive scripts occupy the upper part of the central area, with relatively high Blackness and Diagonality.
Diagonality allows for the recognition of Gothic scripts: they occupy the bottom right area of the diagram, with high Blackness and low Diagonality.

The analysis presented here is based on too small a set of samples and too simple measures to provide fully reliable results. Yet I think that some possible implications are worth considering:
1) For both the Voynich and the Fontana manuscripts, Latin and invented alphabets in the same source appear to be comparable. This provides some evidence in favour of the idea that the marginalia on the last page of the Voynich manuscript belong to the same environment, if not to the same hand, as the main body of the text. Voynichese appears to be closer to the last-page Latin script than Fontana’s cipher is to the Latin script in Fontana’s manuscript.
2) Among the samples examined, the Latin script that turned out to be the closest match for the Voynich marginal script is the Latin alphabet in Fontana’s manuscript. I did not expect this to happen, but in retrospect the result makes perfect sense. I have extracted from the CLaMM database a set of only 60 samples, representing several centuries of writing in a large part of Europe. Fontana’s script has the advantage of belonging to exactly the same time-frame as the Voynich manuscript. Also, the origin of the Voynich manuscript is generally believed to be either Northern Italy or Germany, and Veneto (where Fontana lived) is the part of Italy that historically had the closest link with German-speaking regions.

Top to bottom: Voynichese, Voynich Latin, Fontana BSB Cod.icon 242, BNF Lat.7271

3) Among the CLaMM samples, the class that exhibits the highest correlation with the two Voynich scripts is Humanistic-Cursive. This is also clear if one examines the plot of the centroids (averages) of the different classes.

The Humanistic-Cursive script is described as characterized by an “italic slant” that appears to be absent in the best match for the Voynich and Fontana manuscripts: BNF Lat.7271, Ferrara, 1458 (labelled 05c in the plots above). The absence of slant is one of the features that make Lat.7271 a decent visual match for the Voynich script and even more so for Fontana’s Latin, but there are also several differences. In particular, the simple measures considered here do not take into account the shapes of individual glyphs: in this respect, large differences with the Voynich Latin characters are evident. But I think it is encouraging that a 1458 manuscript from Ferrara was identified as a good parallel for a work written in Venice around 1425: character shapes also appear to be similar in Lat.7271 and Fontana’s BSB Cod.icon. 242.

An experiment on the whole dataset
I tried running a test on the whole 2000-image database, using three more properties in addition to those described above:

  • Skeleton: measures the total length of the strokes; it correlates with blackness, but it is not affected by the thickness of the script.
  • Squareness: like diagonality, it is based on morphological erosion. In this case, the sum of the horizontal and vertical components is considered.
  • Compression: the maximum blackness in a thin horizontal window cutting through the image. This measure is high for scripts in which adjacent characters touch each other.
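
Two of these properties can be sketched with plain NumPy on a boolean ink mask (True = ink). The run length, the window height and the normalization by total ink are assumptions of mine made for illustration. Skeleton is left out because it needs a thinning step that does not fit in a short example.

```python
import numpy as np

def run_survivors(ink, length, axis):
    """Count ink pixels centred on an unbroken run of `length` ink pixels along `axis`.

    Equivalent to binary erosion with a straight vertical (axis=0) or
    horizontal (axis=1) structuring element of odd `length`.
    """
    half = length // 2
    pad = [(0, 0), (0, 0)]
    pad[axis] = (half, half)
    padded = np.pad(ink.astype(bool), pad, mode="constant", constant_values=False)
    n = ink.shape[axis]
    out = np.ones(ink.shape, dtype=bool)
    for shift in range(2 * half + 1):
        window = [slice(None), slice(None)]
        window[axis] = slice(shift, shift + n)
        out &= padded[tuple(window)]
    return int(out.sum())

def squareness(ink, length=7):
    """Sum of the vertical and horizontal erosion survivors, relative to total ink."""
    total = int(ink.sum())
    if total == 0:
        return 0.0
    vertical = run_survivors(ink, length, axis=0)
    horizontal = run_survivors(ink, length, axis=1)
    return (vertical + horizontal) / total

def compression(ink, window_height=3):
    """Maximum ink fraction found in any horizontal window `window_height` rows tall."""
    row_blackness = ink.astype(np.float64).mean(axis=1)
    kernel = np.ones(window_height) / window_height
    return float(np.convolve(row_blackness, kernel, mode="valid").max())
```

With a thin window, compression approaches 1 only when some band of the text is almost solid ink from margin to margin, which happens when neighbouring letters touch.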

For each sample, a 6-dimensional vector was computed, and the Euclidean distance from the centroids corresponding to the two Voynich scripts was used to select the best matches.
These are the best three results with respect to the Voynich Latin script.

IRHT_P_007273 Humanistic Cursive; 1476-1500; BNF Lat.7357
IRHT_P_007248 Hybrida; 1451-1475; BNF Lat.7135
IRHT_P_008827 Hybrida; 1476-1500; BNF Smith-Lesouëf 70

And the best matches for the unreadable Voynichese alphabet:

IRHT_P_006823 Hybrida; 1501-1525; BNF Lat.4215
IRHT_P_006142 Cursiva; 1451-1475; Bayeux, Bibl. du Chapitre 161
IRHT_P_009017 Hybrida; 1501-1525; Paris, Arsenal 1178
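
The ranking by Euclidean distance from a class centroid can be sketched as below. The function names and the toy identifiers in the usage note are hypothetical. Whether the six measures were rescaled before computing distances is not stated, so this sketch uses the raw vectors and leaves any scaling to the caller.

```python
import numpy as np

def class_centroid(vectors):
    """Average feature vector of the samples belonging to one script class."""
    return np.asarray(vectors, dtype=np.float64).mean(axis=0)

def best_matches(centroid, candidates, k=3):
    """Return the k candidates closest to `centroid` as (id, distance) pairs.

    `candidates` maps an image identifier to its feature vector.
    """
    ids = list(candidates)
    matrix = np.array([candidates[i] for i in ids], dtype=np.float64)
    distances = np.linalg.norm(matrix - centroid, axis=1)
    order = np.argsort(distances)[:k]
    return [(ids[i], float(distances[i])) for i in order]
```

For example, with made-up two-dimensional vectors, `best_matches(class_centroid([[0, 0], [2, 0]]), {"a": [1, 0], "b": [4, 4], "c": [1, 2]})` ranks "a" first, then "c", then "b". If the measures have very different ranges, the one with the largest spread dominates the distance, so some form of rescaling is probably worth considering.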

All results appear to be too late to be associated with the Voynich manuscript. The Voynich Latin script superficially resembles some of the results, but the character shapes are quite different: in particular, these late scripts have no loops on the ascenders, while the Voynich Latin script has loops on several characters (‘l’, ‘b’ and ‘h’), often with a peculiar triangular shape.