Voynichese Part of Speech tagging: an update

Marco Ponzi
Published in ViridisGreen, Feb 8, 2021

TLDR: The unsupervised POS-tagging algorithm developed by Alexander Clark highlights significant grammatical patterns in a simple English text. Processing of Voynich Quire 20 highlights the structure of lines, rather than that of sentences. If one pre-processes the text in such a way that line effects are largely removed, very little structure is still detected. Yet it seems promising that the patterns that are found in Quire 20 are largely comparable with those found in Quire 13.

This post is an update of the Voynich Part Of Speech (POS) tagging research I began exploring in February 2019. Please refer to that post for an introduction to the tools discussed here. The results of these experiments are not particularly exciting, but I decided to write them down in order to have something to use as a basis for further research and to hopefully collect suggestions.

I have used the Zandbergen-Landini transliteration ZL_ivtff_1r.txt, ignoring uncertain spaces.

Unlike in the previous post, I have now tried Alexander Clark’s variant of Hermann Ney’s POS inference algorithm. Clark’s variant is biased “so that it will put words that are morphologically similar in the same cluster”. The effect is that both members of word pairs like these/those, hiding/hitting, hiding/hidden will typically be assigned to the same cluster. With respect to the VMS, this means that I am making the assumption that Voynichese words that are similar to each other are related. In general, I would also say that I am assuming that Voynichese words correspond to words in some language, though this search for patterns in the text is largely independent of this assumption. If Voynichese words do correspond to words in some language, there are several reasons why similar Voynichese words can be expected to behave similarly:

  • different inflections of a single root (e.g. man / men);
  • particles attached to words (e.g. the Latin -que);
  • arbitrary spelling variation on the part of the scribe (e.g. color / colour);
  • different abbreviations (e.g. communem, 9unem, communē in Latin);
  • inconsistent spacing (e.g. “a word” sometimes written as “aword”).

As a first experiment, I introduced the pseudo-word “nlnl” which corresponds to line breaks (New Line). In the 2019 post, I proposed a diagram representation of frequent sequences in the tags produced by the POS-tagging algorithm. I used graphviz to write a script that generates such diagrams on the basis of the POS-tagged text.
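
A minimal sketch of how such a pseudo-word could be inserted, assuming the transliteration has already been reduced to one string of space-separated words per manuscript line. The function name and the choice to put the marker at the start of each line (as in the tagged example lines shown later) are illustrative assumptions, not the actual pre-processing script.

```python
def add_line_markers(lines, marker="nlnl"):
    """Return a single token stream with `marker` inserted before each line."""
    tokens = []
    for line in lines:
        words = line.split()
        if not words:  # skip empty lines instead of emitting a bare marker
            continue
        tokens.append(marker)
        tokens.extend(words)
    return tokens


lines = ["ysheor aiin char okaiin", "saiin chol qotain qokain"]
print(" ".join(add_line_markers(lines)))
```

Because the marker is an ordinary token, the tagger treats line breaks like any other word and is free to assign them a class of their own.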

This graph represents the frequent sequences among six POS classes induced on Voynich Quire 20 (the “starred-recipes” section at the end of the ms). Each POS class is represented by the numeric ID generated by Clark’s software and the most frequent word types belonging to that class. The entry point to the “grammar” is highlighted in yellow. The 14 most frequent POS-class sequences are represented as edges, with thickness and colour corresponding to absolute frequency.
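
The edge selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and the width scaling are hypothetical, only edge thickness (not colour) is encoded, and the output is plain Graphviz DOT text rather than the result of the original script.

```python
from collections import Counter


def tag_bigram_dot(tags, top_n=14, max_width=6.0):
    """Build Graphviz DOT text for the top_n most frequent tag bigrams."""
    edges = Counter(zip(tags, tags[1:])).most_common(top_n)
    lines = ["digraph pos {"]
    if edges:
        heaviest = edges[0][1]  # count of the most frequent bigram
        for (src, dst), count in edges:
            # Scale pen width with absolute frequency, never below 1.0.
            width = max(1.0, max_width * count / heaviest)
            lines.append(
                f'  "CL{src}" -> "CL{dst}" [penwidth={width:.1f}, label="{count}"];'
            )
    lines.append("}")
    return "\n".join(lines)


# Tags of one of the Quire 20 lines quoted later in the post.
tags = [0, 2, 3, 1, 1, 5, 5, 5, 3, 5, 5, 4]
print(tag_bigram_dot(tags, top_n=5))
```

The resulting text can be saved to a file and rendered with any Graphviz front end.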

The following table lists the 10 most frequent word types in each class; for each word-type, the number of tokens is listed.

CL 0:nlnl           CL 1:qokaiin        CL 2:daiin
tokens:1281         tokens:2399         tokens:1092
types:168           types:718           types:538
ratio:7.6           ratio:3.3           ratio:2.0
hapax:150           hapax:499           hapax:425
 1 nlnl      1085    1 qokaiin    118    1 daiin      109
 2 ral          7    2 qokain      99    2 dain        42
 3 chdor        7    3 okaiin      95    3 saiin       39
 4 chs          3    4 otaiin      75    4 dar         34
 5 cheolor      3    5 okain       66    5 sain        24
 6 rorol        2    6 otain       50    6 dair        22
 7 qoteor       2    7 otar        50    7 sar         16
 8 qoteol       2    8 otal        43    8 ycheey      15
 9 qoteeol      2    9 raiin       40    9 yteedy      12
10 qotchd       2   10 qotaiin     39   10 ykeey       12

CL 3:aiin           CL 4:dal            CL 5:chedy
tokens:1630         tokens:840          tokens:3933
types:355           types:538           types:1073
ratio:4.5           ratio:1.5           ratio:3.6
hapax:241           hapax:429           hapax:721
 1 aiin       122    1 dal         18    1 chedy      175
 2 ar         106    2 am          18    2 qokeey     146
 3 chey       104    3 otam        16    3 qokeedy    124
 4 al          84    4 qokam       12    4 shedy      106
 5 ol          81    5 lol         10    5 okeey       91
 6 shey        68    6 aral         9    6 qokedy      58
 7 cheey       67    7 otor         8    7 oteey       59
 8 sheey       44    8 opchdy       8    8 oteedy      56
 9 or          40    9 okam         8    9 otedy       56
10 ain         35   10 dam          8   10 chol        51
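
For reference, the figures in the header of each column can be derived from the tagged text as in the sketch below. It assumes the tagged text is available as (class, word) pairs; "ratio" is taken to be tokens divided by types, which agrees with the values in the table (e.g. 1281 / 168 ≈ 7.6), and "hapax" is the number of word types that occur exactly once in the class. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def class_stats(tagged):
    """tagged: iterable of (class_id, word) pairs -> per-class statistics."""
    by_class = defaultdict(Counter)
    for cls, word in tagged:
        by_class[cls][word] += 1

    stats = {}
    for cls, counts in by_class.items():
        tokens = sum(counts.values())
        types = len(counts)
        stats[cls] = {
            "tokens": tokens,
            "types": types,
            "ratio": round(tokens / types, 1),  # tokens per word type
            "hapax": sum(1 for c in counts.values() if c == 1),
            "top": counts.most_common(10),      # ten most frequent types
        }
    return stats


tagged = [(0, "nlnl"), (2, "daiin"), (3, "aiin"),
          (0, "nlnl"), (2, "saiin"), (2, "daiin")]
for cls, s in sorted(class_stats(tagged).items()):
    print(cls, s)
```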

This graph is easy to describe: the algorithm captured the peculiar line effects that are so typical of Voynichese.
After a newline (CL0) specific words tend to occur. The most frequent of them (daiin) also occurs in other positions, but words like ‘saiin’ or ‘sar’ have a strong preference for the line-initial position. These words are grouped together in CL2.
Similarly, the character EVA:m has a strong preference for the line-final position. CL4 represents those words that tend to appear immediately before a newline (CL0).
The remaining 3 classes CL1, CL3 and CL5 form a tightly connected triangle that basically represents all “normal words” that have no preference for appearing at line boundaries.

These are a couple of examples of POS-tagged Quire 20 lines:

0:nlnl 2:ysheor 3:aiin 1:char 1:okaiin 5:qokeechy 5:checkhy 5:qokeod 3:ar 5:qokeo 5:lkeo 4:lchorom 
<f106r.5,+P0> ysheor aiin char okaiin qokeechy checkhy qokeod ar qokeo lkeo lchorom
0:nlnl 2:saiin 5:chol 1:qotain 1:qokain 3:chl 3:lr 1:chdain 5:qoteey 5:rcheey 3:r 3:ar 4:rodam
<f115v.15,+P0> saiin chol qotain qokain chl lr chdain qoteey rcheey r ar rodam

I used ideas put forward by other researchers to try and remove the supposed alterations that possibly cause these prominent line effects. The goal is to examine paragraph structure and hopefully to get some glimpse of actual POS categories and grammar. I first checked that significant grammatical structures can be detected on a simple text in a known language.

ENGLISH GENESIS

I started by analysing paragraph structure in the first 10,000 words in King James Genesis. This text is of course much more accessible than the VMS, not only because it can be read, but because it has a limited vocabulary and many repeating word sequences.
I ignored line breaks and I introduced instead a paragraph break (“npnp”: New Paragraph) based on punctuation.
This graph illustrates the results I get for 8 POS-classes and the 14 most frequent edges:

The top 10 words in each class:

CL 0:he             CL 1:earth          CL 2:years          CL 3:and
tokens:1664         tokens:1062         tokens:945          tokens:2609
types:207           types:218           types:214           types:80
ratio:8.0           ratio:4.8           ratio:4.4           ratio:32.6
hapax:97            hapax:95            hapax:103           hapax:31
 1 he         141    1 earth       98    1 years       65    1 and       1098
 2 god        102    2 lord        81    2 sons        47    2 of         434
 3 i          100    3 name        34    3 man         45    3 in         159
 4 it          91    4 land        33    4 hundred     44    4 was        107
 5 him         73    5 waters      32    5 day         32    5 thou        81
 6 thee        65    6 wife        31    6 days        30    6 is          77
 7 they        60    7 seed        27    7 thing       20    7 were        66
 8 abram       59    8 ark         26    8 daughters   20    8 after       66
 9 them        51    9 king        23    9 heaven      17    9 for         63
10 noah        40   10 face        23   10 good        17   10 from        59

CL 4:to             CL 5:the            CL 6:npnp           CL 7:every
tokens:1494         tokens:1257         tokens:1200         tokens:491
types:190           types:22            types:123           types:97
ratio:7.8           ratio:57.1          ratio:9.7           ratio:5.0
hapax:69            hapax:6             hapax:59            hapax:44
 1 to         126    1 the        850    1 npnp       542    1 every       69
 2 unto       117    2 his        129    2 that       136    2 begat       64
 3 shall       87    3 a           93    3 which       48    3 an          29
 4 said        87    4 thy         65    4 lived       34    4 two         21
 5 be          85    5 their       36    5 went        29    5 three       15
 6 all         75    6 s           33    6 also        25    6 shem        13
 7 with        63    7 your        11    7 as          24    7 great       14
 8 will        50    8 thine        7    8 out         22    8 five        14
 9 not         42    9 whose        6    9 old         16    9 nine        12
10 had         37   10 no           6   10 up          14   10 thirty      11

The classification is obviously not perfect, but it does represent some basic notions. For instance, compare CL1 and CL5: they contain roughly the same number of tokens, but the numbers of word types differ by almost an order of magnitude. Classes with a low number of tokens per type (“ratio”), like CL1, correspond to content words (nouns), while classes with a high ratio, like CL5, correspond to function words. CL7 is also interesting in that it appears to group together most quantifiers (but then words like “begat” and “shem” were misclassified).

The most apparent structure in the graph appears to be the triangle that has been highlighted in green. Paragraph break “npnp” (in yellow) is directly connected to the triangle, since most of the sentences begin with the word “and”. This 3,5,1 structure obviously corresponds to sequences of noun phrases and prepositional phrases like the following:

6:npnp 3:and 5:the 1:evening 3:and 5:the 1:morning
6:npnp 3:thus 5:the 1:heavens 3:and 5:the 1:earth
6:npnp 3:in 5:the 1:sweat 3:of 5:thy 1:face
6:npnp 3:and 5:the 1:door 3:of 5:the 1:ark
6:npnp 3:and 5:the 1:beginning 3:of 5:his 1:kingdom
6:npnp 3:and 5:the 1:border 3:of 5:the 1:canaanites
6:npnp 3:and 5:the 1:angel 3:of 5:the 1:lord

The following are the most frequent POS-class sequences of length 5 in the tagged 10k-word Genesis text. As can be seen, they are all based on the 3,5,1 triangle.

_5_1_3_5_1_ 129
_3_5_1_3_5_ 113
_6_3_5_1_3_ 88
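
Counts of this kind can be obtained by sliding a window of five tags over the tagged text. The sketch below assumes the tags are available as a flat list of class IDs; the function name and the underscore-delimited output format (mimicking the listing above) are illustrative choices.

```python
from collections import Counter


def top_tag_sequences(tags, n=5, k=3):
    """Return the k most frequent length-n tag sequences with their counts."""
    grams = Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return [("_" + "_".join(map(str, g)) + "_", c) for g, c in grams.most_common(k)]


# Toy input: one paragraph break followed by three noun-phrase triples.
tags = [6, 3, 5, 1, 3, 5, 1, 3, 5, 1]
for seq, count in top_tag_sequences(tags, n=5, k=2):
    print(seq, count)
```

Note that the windows overlap, so a long run of the 3,5,1 cycle contributes to several of the listed sequences at once.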

VOYNICH MANUSCRIPT (VMS) QUIRE 20

Since there is no punctuation, in the case of the Voynich manuscript we don’t know where sentence boundaries are. What can be easily done is analysing each paragraph as a separate sequence. As I mentioned above, I applied a set of tentative transformations aimed at reducing the impact of line effects:

  • Word-initial EVA:s,t,p,y were removed (with the exception of sh-, which is unchanged). In Quire 20, these characters are much more frequent line-initially than elsewhere. For instance, 27 of 39 occurrences of saiin appear at the beginning of lines. See also Emma May Smith’s discussion of linestart transformation. Word-initial p- is also connected with Grove words (appearing as the first words of paragraphs).
  • The gallows characters EVA:p and EVA:f, which concentrate in the first line of paragraphs, were replaced with EVA:k and EVA:t respectively.
  • The two symbols EVA:m and EVA:g, which are strongly correlated with the line-final position, are assumed to be abbreviations. They were replaced with EVA:iin and EVA:dy respectively. These replacements were based on the characters that typically precede m/g and what follows those characters in non-line-final positions (e.g. the most frequent -m words in Quire 20 are am and otam, while the most frequent a- word is ‘aiin’ and the most frequent ot- word is otaiin).

Overall, these transformations affect 6% of the words.
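
The three transformations can be sketched as a single word-level function. Several details are assumptions on my part rather than a description of the actual script: the order in which the rules are applied, the fact that single-character words are left untouched, and the fact that m and g are replaced wherever they occur in a word.

```python
def reduce_line_effects(word):
    """Apply the three tentative transformations to a single EVA word."""
    # 1. Drop a word-initial s, t, p or y, but keep the bench "sh" intact.
    #    Assumption: single-character words are left alone.
    if len(word) > 1 and word[0] in "stpy" and not word.startswith("sh"):
        word = word[1:]
    # 2. Replace the gallows p and f with k and t respectively.
    word = word.replace("p", "k").replace("f", "t")
    # 3. Expand the supposed abbreviations m and g.
    word = word.replace("m", "iin").replace("g", "dy")
    return word


line = "pol shedy qopchedy otam saiin"
print(" ".join(reduce_line_effects(w) for w in line.split()))
```

With these rules, pol becomes ol, saiin becomes aiin and otam becomes otaiin, while shedy is unchanged.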

The following is the resulting graph for Quire 20 (8 classes, 14 edges). The classes that have been coloured are discussed below.

Examining the details of the 8 classes makes clear that CL2 and CL5 include considerably fewer tokens than the other classes: this is the reason why none of the 14 most frequent edges connects them.

CL 0:okeey          CL 1:ol             CL 2:chedaiin       CL 3:chedy
tokens:1366         tokens:1287         tokens:429          tokens:1813
types:538           types:397           types:216           types:414
ratio:2.5           ratio:3.2           ratio:1.9           ratio:4.3
hapax:405           hapax:272           hapax:167           hapax:281
 1 okeey       93    1 ol          92    1 chedaiin    30    1 chedy      201
 2 oteey       59    2 al          90    2 chedar      25    2 shedy      113
 3 oteedy      57    3 cheol       59    3 chdaiin     15    3 chey       110
 4 otedy       56    4 chol        54    4 chedain     14    4 cheey       83
 5 okeedy      45    5 otar        51    5 shedain     11    5 shey        79
 6 otchedy     41    6 otal        46    6 chdar       10    6 sheey       48
 7 okedy       32    7 okal        39    7 shear        9    7 chdy        43
 8 oty         26    8 char        28    8 cheodaiin    9    8 cheody      40
 9 otchey      21    9 chedal      26    9 chaiin       9    9 chckhy      39
10 olkeedy     19   10 shol        21   10 chdor        8   10 cheedy      30

CL 4:qokaiin        CL 5:daiin          CL 6:npnp           CL 7:qokeey
tokens:1239         tokens:678          tokens:1883         tokens:1674
types:318           types:288           types:350           types:415
ratio:3.8           ratio:2.3           ratio:5.3           ratio:4.0
hapax:218           hapax:237           hapax:232           hapax:254
 1 qokaiin    130    1 daiin      119    1 npnp       280    1 qokeey     146
 2 qokain      99    2 okain       68    2 aiin       195    2 qokeedy    124
 3 qotaiin     52    3 dain        42    3 ar         133    3 qokedy      58
 4 raiin       48    4 l           26    4 okaiin     103    4 lchedy      48
 5 lkaiin      42    5 y           17    5 otaiin      99    5 qokey       36
 6 qokal       38    6 chodaiin    14    6 ain         62    6 qotchedy    35
 7 qokar       36    7 chy          9    7 otain       51    7 qokchedy    33
 8 dar         35    8 cheodain     8    8 or          47    8 lkeey       34
 9 qotain      34    9 ched         8    9 okar        40    9 qoteedy     30
10 kaiin       31   10 ral          7   10 air         38   10 lkeedy      30

The following are the most frequent POS-class sequences of length 5 in the tagged text:
_3_7_7_7_7_ 14
_3_7_0_7_7_ 11
_7_7_7_7_7_ 9

Though the length of the text is nearly identical to that of the fragment from the English Genesis, the most frequent sequences in the tagged English text are about 10 times more frequent than those in the VMS. The results also make clear that, while frequent sequences in Genesis are due to the alternation between some of the classes, in the VMS they are due to the consecutive occurrence of words belonging to CL7. These figures show that the results for the VMS are poor and do not highlight any frequent grammatical structures.

As discussed in the previous post, it is also confirmed that Voynich words belonging to the same class tend to be much more similar to each other than in the English text. In these plots, the green line represents the average Levenshtein ratio (a similarity measure in the 0..1 range) between the 160 most frequent words in the text. The red bars correspond to the similarity between the 20 most frequent words assigned to each class.

Higher word similarity in VMS POS classes
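
A measure of this kind can be sketched as follows. The normalisation used here (one minus the edit distance divided by the length of the longer word) is an assumption: it is one common way to map the Levenshtein distance to the 0..1 range, and it is not necessarily the exact ratio used for the plots.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Similarity in the 0..1 range: 1 for identical words."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def mean_similarity(words):
    """Average pairwise similarity; needs at least two words."""
    pairs = list(combinations(words, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


print(mean_similarity(["chedy", "shedy", "chey", "cheey"]))
print(mean_similarity(["and", "of", "in", "was"]))
```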

VOYNICH MANUSCRIPT (VMS) QUIRE 13

The only promising result of this exploration is that applying the same transformations and analysis method to Quire 13 produces results that are somewhat comparable with those for Quire 20.
These are the two graphs presented side by side.

Similar structures detected in VMS Quire 13 and Quire 20

The top 10 words in the 8 classes for Q13:

CL 0:shey           CL 1:qokedy         CL 2:shedy          CL 3:dal
tokens:786          tokens:1144         tokens:1182         tokens:415
types:178           types:217           types:176           types:168
ratio:4.4           ratio:5.2           ratio:6.7           ratio:2.4
hapax:118           hapax:160           hapax:99            hapax:121
 1 shey        92    1 qokedy     151    1 shedy      225    1 dal         51
 2 chey        83    2 qokeedy    146    2 chedy      208    2 dol         22
 3 chckhy      40    3 qokeey      83    3 otedy       49    3 lchey       19
 4 cheey       39    4 qol         79    4 okedy       41    4 edy         19
 5 sheey       32    5 lchedy      56    5 dy          40    5 lol         17
 6 shckhy      28    6 qoky        54    6 okeedy      39    6 rol         14
 7 oly         28    7 qotedy      43    7 sheedy      36    7 ldy          8
 8 chcthy      26    8 qokey       42    8 olshedy     24    8 dor          8
 9 olkeey      22    9 qoteedy     38    9 lshedy      23    9 ral          7
10 sheckhy     20   10 qoty        24   10 cheedy      24   10 kar          7

CL 4:qokain         CL 5:olchedy        CL 6:npnp           CL 7:ol
tokens:933          tokens:530          tokens:649          tokens:789
types:194           types:181           types:212           types:145
ratio:4.8           ratio:2.9           ratio:3.0           ratio:5.4
hapax:144           hapax:126           hapax:152           hapax:99
 1 qokain     161    1 olchedy     38    1 npnp        84    1 ol         191
 2 qokal      102    2 al          36    2 or          57    2 aiin        68
 3 qokaiin     91    3 sheol       25    3 dar         56    3 okain       44
 4 daiin       79    4 olkedy      24    4 cheol       30    4 ar          35
 5 dain        47    5 olor        17    5 checkhy     20    5 okaiin      32
 6 qokar       47    6 shol        16    6 okar        19    6 olkain      29
 7 ain         43    7 sheky       15    7 oty         17    7 otain       22
 8 qotain      21    8 olshey      15    8 checthy     17    8 okal        23
 9 raiin       15    9 oldy        13    9 chol        13    9 qokol       20
10 otaiin      15   10 olky        11   10 sheor       11   10 otar        19

There is a good correspondence among the blue, pink and grey classes in the two quires. There are also differences: most notably, ‘chey’ and ‘shey’ belong to the pink class in Q20, but in Q13 they are assigned to a different class, CL0 (rather than CL2 with ‘chedy’ and ‘shedy’). I have not gone into the details of these differences yet. One could see them as a manifestation of the difference between section-dialects: i.e. the fact that Currier languages A and B have smaller variations and could be seen as a continuum of contiguous and slightly different dialects rather than two clearly separated languages. I hope to explore this aspect in future posts.
Things are more tangled for the yellow and green classes. For instance ‘aiin’ and ‘ar’ belong with the paragraph separator npnp in Q20 but are in the green class in Q13.

Since words belonging to each class tend to be rather homogeneous, it is possible to define regular expressions that approximate the classification produced by Clark’s algorithm.

A: corresponds to yellow and green classes (excluding the special symbol ‘npnp’).
Words starting with one of {o,a,c,s} and ending with one of {l,r,n}.
[^ ]* matches any sequence of non-space characters
[oacs][^ ]*[lrn]

B: corresponds to pink classes, but also including the similar words that were assigned to CL0 in Q13.
Words starting with a bench {ch,sh} and ending with -y.
[cs]h[^ ]*y
Note that for both quires the classes highlighted in pink have a relatively high number of tokens per type: they are the best candidates as function-word classes.

C: corresponds to blue classes: words starting with qo- and ending with -y.
qo[^ ]*y

D: corresponds to grey classes: words starting with qo- and ending with something different from -y.
qo[^ ]*[^y]
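
The four expressions can be applied word by word as in the sketch below. Matching each expression against the whole word (rather than searching inside it) is my assumption about how they are meant to be used; the dictionary and function names are illustrative.

```python
import re

# The four regular expressions defined above, keyed by their letter.
PATTERNS = {
    "A": re.compile(r"[oacs][^ ]*[lrn]"),
    "B": re.compile(r"[cs]h[^ ]*y"),
    "C": re.compile(r"qo[^ ]*y"),
    "D": re.compile(r"qo[^ ]*[^y]"),
}


def classify(word):
    """Return the letter of the first expression matching the whole word."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(word):
            return label
    return None  # the word belongs to none of the four families


# First words of f77r.38, with the initial p- of "pol" already stripped.
print([classify(w) for w in "ol shedy qoeedy qokaiin".split()])
```

Under whole-word matching the four families do not overlap: A and B differ in their final character, while C and D are the only ones starting with q.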

The following table presents counts of various five-word sequences based on the regular expressions defined above. Actual counts are compared with the average counts in 100 random permutations of each text.

A,B,C,C and A,B,C,D sequences in the VMS vs randomly scrambled text
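
A comparison of this kind can be sketched as a simple shuffling baseline. The sketch assumes each word has already been mapped to a class letter (or None) and that scrambling means shuffling the whole sequence of words; how paragraph breaks were handled in the actual experiment is not modelled here, and the function names and the fixed seed are illustrative.

```python
import random


def count_pattern(labels, pattern):
    """Count (possibly overlapping) occurrences of pattern in labels."""
    n = len(pattern)
    target = tuple(pattern)
    return sum(1 for i in range(len(labels) - n + 1)
               if tuple(labels[i:i + n]) == target)


def scrambled_baseline(labels, pattern, runs=100, seed=0):
    """Average count of pattern over `runs` random permutations of labels."""
    rng = random.Random(seed)   # fixed seed so the baseline is reproducible
    shuffled = list(labels)     # work on a copy, leave the input untouched
    total = 0
    for _ in range(runs):
        rng.shuffle(shuffled)
        total += count_pattern(shuffled, pattern)
    return total / runs


labels = list("ABCDXXABCCXDAB")
for pattern in ("ABCC", "ABCD"):
    print(pattern, count_pattern(labels, pattern), scrambled_baseline(labels, pattern))
```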

The fact that the numbers are so small confirms that these experiments have not been particularly successful. The numbers are significantly higher than those observed in the scrambled files, but this is not surprising, since these patterns are largely based on the well-known preference for -y words to be followed by q- words. Still, it seems interesting that four paragraphs start with occurrences of this pattern; this particular count is 40 times higher than what is observed in the scrambled files. But the two quires include a total of 364 paragraphs, so the pattern is only observed in about 1% of them.
The following are the details of the four paragraph-start matches:

A,B,C,D:
<f77r.38,+P0> <%>pol shedy qoeedy qokaiin chcphey qol ltaiin shedy qol
A,B,C,C:
<f78r.3,@P0> <%>tshedor shedy qopchedy qokedy dyqokol oky
<f79r.13,+P0> <%>pshorol shckhy qotshdy qokaldy opchedy qotar oraiinol
<f115v.29,+P0> <%>polor sheedy qoteedy qokechy lralylshey sheot shedy chteey lky raram

The following occurrence of the non-paragraph-initial pattern could be compared with the last line above:

A,A,B,C,C,D
<f79v.9,+P0> sar.ol.sheey.qokeedy.qokechey.qol<$>

In all five examples above, the first word has been stripped of its first character, since (as mentioned above) that character is one of those that appear to be the effect of a line-initial transformation. For instance, pol is treated as ol and matches the regular expression for A.

Finally, this is an example of the A,B,C,D pattern occurring across lines:

<f80r.21,+P0> shedy.qokey.shckhey.qotar.chckhy.otol.teol.sheol.qotal.oltain.chcthy
<f80r.22,+P0> qokeedy.qol.shecthy.qokalkeol.qoky.qokal.shedy.sal.olkain.shey.qokl

FURTHER RESEARCH

Though the results presented here are extremely limited, there are several lines of investigation that could take this research a little further.
First, I haven’t really explored one of the options listed at the end of the 2019 post:

Performing a more detailed analysis of the composition of each class. Here I focussed on the 20 most frequent words of each class, but looking at all words could provide different insights.

By examining the list of words associated with each class, it seems clear that certain small variations in words result in words that belong to the same class. For instance, for both quires all the words belonging to the o[tk]ee*dy pattern are classified together:

okedy,okeedy,okeeedy,otedy,oteedy,oteeedy

Words following the [sc]he*d*y pattern (e.g. chdy, chedy, cheedy, cheeey, cheey, chey, chy and their sh- counterparts) are classified together in Q20 only. In Q13, -dy words are in CL2, while -ey words are in CL0.
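
The two word families mentioned above can be checked mechanically. In this small sketch the patterns are the ones quoted in the text, matched against whole words; the dictionary and function names are hypothetical.

```python
import re

# Word families quoted in the text, keyed by the pattern itself.
FAMILIES = {
    "o[tk]ee*dy": re.compile(r"o[tk]ee*dy"),
    "[sc]he*d*y": re.compile(r"[sc]he*d*y"),
}


def family(word):
    """Return the name of the family the whole word belongs to, if any."""
    for name, pattern in FAMILIES.items():
        if pattern.fullmatch(word):
            return name
    return None


for w in "okedy oteeedy chdy cheeey shey qokedy".split():
    print(w, family(w))
```

Grouping the members of each tagged class by family in this way makes it easy to see which spelling variations the tagger treats as equivalent.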

This kind of analysis could point to word differences that appear to be irrelevant for word behaviour, either because they represent different inflections of the same word or because they are arbitrary spelling variations. These could be removed by further pre-processing transformations, while distinctions that could be significant (like possibly -edy/-ey) would be preserved.

o- vs qo-

I would like to look into the details of qo- vs o-. As discussed by Emma May Smith, these two families of words are clearly related, but the nature of this relationship is unclear. Clark’s algorithm appears to assign o- and qo- words to different classes. This could be largely due to the fact that after -y words the qo- variant appears to be preferred, but it is likely that there is more to this. One could also explore the idea that q- is semantically irrelevant: all occurrences of q- could be removed and we could see if this allows the detection of more patterns.

Uncertain spaces

The Zandbergen-Landini transliteration distinguishes between certain and uncertain spaces. About 8% of spaces are marked as uncertain. Since each space affects both of the words it separates, the impact on the text is considerable. Here I have ignored uncertain spaces. I would be interested in defining a system to accept an uncertain space if it makes the text more regular (e.g. if it results in a pair of words that appear elsewhere in the ms).
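
One very simple version of such a system is sketched below. It is only an illustration of the idea, not something I have run on the manuscript: it assumes that certain spaces are written as "." and uncertain spaces as "," in the transliteration, builds a vocabulary from words delimited by certain spaces only, and accepts the uncertain spaces of a chunk when all the resulting words are attested in that vocabulary. The function names are hypothetical.

```python
from collections import Counter


def build_vocab(lines, certain=".", uncertain=","):
    """Count the words that are delimited by certain spaces only."""
    vocab = Counter()
    for line in lines:
        for chunk in line.split(certain):
            if chunk and uncertain not in chunk:
                vocab[chunk] += 1
    return vocab


def resolve_uncertain(chunk, vocab, uncertain=","):
    """Split at uncertain spaces only if every resulting word is attested."""
    parts = chunk.split(uncertain)
    if len(parts) > 1 and all(vocab[p] > 0 for p in parts):
        return parts
    return ["".join(parts)]  # otherwise read the chunk as a single word


lines = ["ol.shedy.qokain", "ol,shedy.da,iin"]
vocab = build_vocab(lines)
for line in lines:
    words = [w for chunk in line.split(".") for w in resolve_uncertain(chunk, vocab)]
    print(" ".join(words))
```

A more careful rule would probably also compare the frequency of the joined form with that of the separate words.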

Different transliterations

Clark’s similarity-based algorithm might give different results using different transliteration systems. Here I used EVA-encoded words. In the future, I could run experiments with the Currier and/or CUVA systems.

Mapping dialects

Carefully comparing the results for Q13 and Q20 could allow some insight into a problem that Nick Pelling put forward: mapping Currier A into Currier B (or vice versa). Both Q13 and Q20 belong to Currier B, but the dialects corresponding to the two quires are significantly different. Transforming these two sections into something more homogeneous could be a small step forward in this area of research. Also, the sum of Q13 and Q20 would result in about 16,000 words of text, and the longer the text, the better the results from statistical algorithms like Clark’s.

Labels

Quire 13 includes several labels, some of which also appear in the text. It should be possible to search for specific patterns in paragraph text that involve labels or words very similar to labels.

Grammar Inference
I have been experimenting with the JMotif Java implementation of the Re-Pair grammar inference algorithm (Larsson, N.J.; Moffat, A., Offline dictionary-based compression, 1999). I was able to get promising results for the simple English Genesis text. If I manage to make some progress on the much harder Voynichese Q13+Q20, I might be able to try inferring an actual grammar and generating actual parse trees.
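
To give an idea of how Re-Pair works, here is a naive sketch of its core loop: the most frequent pair of adjacent symbols is repeatedly replaced by a new rule until no pair occurs at least twice. This is not the JMotif implementation and it ignores the efficient data structures of the original algorithm; the rule names (R0, R1, ...) and the `min_count` parameter are illustrative.

```python
from collections import Counter


def repair(tokens, min_count=2):
    """Naive Re-Pair: returns the compressed sequence and the rule table."""
    seq = list(tokens)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        name = f"R{len(rules)}"      # new non-terminal for the pair (a, b)
        rules[name] = (a, b)
        out, i = [], 0
        while i < len(seq):          # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


seq, rules = repair("and the earth and the sea".split())
print(seq)
print(rules)
```

Each rule rewrites one non-terminal into two symbols, so expanding the rules recursively yields a binary parse tree over the original words.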
