Preliminary thoughts on Voynichese Part of Speech tagging

Marco Ponzi
Published in ViridisGreen
Feb 1, 2019

Software tools for unsupervised Part of Speech (POS) tagging have been around for several years. The algorithms mostly look at which words tend to appear next to the same words: on a large enough corpus, such behaviour correlates with the grammatical role of each word.

The application of such techniques to Voynichese (the unknown language of the Voynich manuscript, Beinecke ms 408) has been mentioned in the thesis of Gianluca Bosi (Bologna University) and I am sure others must have experimented with this approach.

The simple experiments discussed here are based on software developed for Alexander Clark’s 2003 paper Combining Distributional and Morphological Information for Part of Speech Induction.
Clark discusses a number of ideas that could help deal with rare words (one of the problems with Voynichese), but here I will only focus on an older algorithm re-implemented by Clark (originally described in Hermann Ney, Ute Essen, and Reinhard Kneser, 1994, On structuring probabilistic dependencies in stochastic language modelling).
I am not sure I understand much of the theory: I use the software as a black box. I think the algorithm works on bigrams, i.e. pairs of consecutive words: words are assigned to classes in such a way that the probability of guessing the next word given the current word is maximized. There will be classes that tend to appear consecutively and others that rarely or never follow each other. The number of word classes is one of the parameters of the algorithm.
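As a very rough illustration of the idea (not Clark's or Ney's actual implementation), a greedy "exchange"-style clustering can be sketched in a few lines. The names `exchange_cluster` and `class_bigram_loglik` are my own, and the objective is a deliberately simplified class-bigram likelihood:

```python
from collections import Counter
import math

def class_bigram_loglik(words, assign):
    """Score an assignment of words to classes by a simplified
    class-bigram likelihood: p(w_i | w_{i-1}) is approximated by
    p(class_i | class_{i-1}) * p(w_i | class_i)."""
    cseq = [assign[w] for w in words]
    bigrams = Counter(zip(cseq, cseq[1:]))
    cfreq = Counter(cseq)
    wfreq = Counter(words)
    ll = 0.0
    for (c1, c2), n in bigrams.items():
        ll += n * math.log(n / cfreq[c1])          # class transition term
    for w, n in wfreq.items():
        ll += n * math.log(n / cfreq[assign[w]])   # word emission term
    return ll

def exchange_cluster(words, n_classes, iters=5):
    """Greedily move each word type to the class that maximizes the score."""
    vocab = sorted(set(words))
    assign = {w: i % n_classes for i, w in enumerate(vocab)}
    for _ in range(iters):
        for w in vocab:
            assign[w] = max(range(n_classes),
                            key=lambda c: class_bigram_loglik(words, {**assign, w: c}))
    return assign
```

On a text as short as a single sentence the result is not very stable; a real implementation adds smoothing, a more careful search strategy, and the morphological information Clark discusses.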

A simple sentence

As an example, this is a three-class tagging of a simple sentence:

0:it 2:ascends 1:from 0:the 2:earth 1:to 0:the 2:heaven 1:and 1:again 0:it 2:descends 1:to 0:the 2:earth

  • The two most frequent words (“it” and “the”) are assigned to class:0
  • The words that precede those in class:0 are assigned to class:1
  • The words that follow those in class:0 are assigned to class:2

Basically, the detected structure is 0,2,1,0,2,1,0,2,1…
Even this simple example is not totally regular and in one case a class:1 word is followed by a second class:1 word (“1:and 1:again”). But class:0 is always followed by class:2, and class:2 is always followed by class:1.

This example also shows how such a system can be useful for Part of Speech identification:

  • The two prepositions (“to” and “from”) are tagged as class:1, together with two other “function words” (“and” and “again”)
  • All verbs and nouns (ascends, descends, heaven, earth) are tagged as class:2

Of course, the variability in all natural languages is so high that a meaningful tagging can only be produced on the basis of a very large corpus.

English vs Voynichese

In order to avoid difficulties with the differences between Currier languages and to have as uniform a text as possible, I focused on a single section of the Voynich manuscript: Quire 20.
As a benchmark, I used a portion of Genesis from the King James Bible, with a similar number of words (about 10,500).

In both cases, I used only 5 POS classes. Such a number is obviously too small to correctly represent all the different part-of-speech categories, but it makes analysis and discussion easier. I fed the whole text to the algorithm, without any punctuation or sentence-boundary marks.

These are the most frequent 20 words for each of the classes for the English text:

C:0          C:1          C:2          C:3          C:4
tokens:1860  tokens:2159  tokens:2095  tokens:2048  tokens:2294
types:131    types:84     types:122    types:399    types:439
ratio:14.1   ratio:25.7   ratio:17.1   ratio:5.1    ratio:5.2
hapax:67     hapax:31     hapax:41     hapax:190    hapax:189
and 1117 the 866 of 441 it 93 earth 99
that 141 he 146 in 163 be 87 said 93
upon 58 his 131 to 131 him 79 lord 87
which 52 i 109 unto 121 all 76 years 65
lived 34 god 102 was 109 thee 67 will 53
forth 27 a 98 shall 95 abram 59 sons 47
into 24 thou 82 is 79 them 53 man 45
came 24 every 69 after 69 not 43 hundred 44
as 24 thy 68 were 67 noah 41 had 37
out 22 they 62 for 66 me 37 name 34
on 21 their 36 with 65 her 37 wife 33
but 21 my 35 begat 64 there 31 land 33
up 15 s 33 from 61 went 30 waters 32
saying 13 she 30 shalt 34 also 26 day 32
then 12 an 29 when 25 you 20 have 30
because 12 two 21 made 25 this 20 days 30
at 12 cain 18 called 23 one 20 seed 27
wives 10 three 15 make 22 old 20 ark 26
therefore 9 five 14 behold 22 daughters 20 flesh 25
where 7 nine 12 let 21 eat 19 son 23

class:0 mostly conjunctions and adverbs
class:1 determiners and numbers (but also subject pronouns)
class:2 12 verbs + 8 prepositions
class:3 contains several object pronouns (him,thee,them,me,her), but it is quite mixed
class:4 16 nouns + 4 verbs (3 of which are auxiliary); even if they do not appear among the most frequent words, several adjectives are also assigned to this class

“Hapax” is the number of hapax legomena (i.e. words which occur only once in the whole text). This value is obviously anti-correlated with the tokens/types ratio: classes with a high ratio contain fewer word types, each repeated more often, and so tend to have fewer hapax legomena. The number of tokens per class is roughly constant. It could be that function words concentrate in classes with a high tokens/types ratio and a low number of hapax legomena.
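The figures in the header rows of the table can be reproduced with a few lines. Here `class_stats` is a name of my choosing, assuming the tagger's output is available as a word list plus a word-to-class dictionary:

```python
from collections import Counter

def class_stats(words, assign, cls):
    """Token count, type count, tokens/types ratio and hapax count
    for one word class, mirroring the table rows above."""
    toks = [w for w in words if assign.get(w) == cls]
    freq = Counter(toks)
    tokens, types = len(toks), len(freq)
    hapax = sum(1 for n in freq.values() if n == 1)
    return {"tokens": tokens, "types": types,
            "ratio": round(tokens / types, 1), "hapax": hapax}
```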

The most frequent sequences of two consecutive classes are (the numbers correspond to occurrences of the sequence in the tagged text):
1_4 1828
2_3 1010
3_0 917
2_1 868
4_2 862
4_0 804
0_1 801
3_2 500
They can be represented by the following graph:

Some sequences that match those illustrated in the graph:
1:the 4:days 2:of 3:enos 2:were 1:nine 4:hundred 0:and 1:five 4:years
1:the 4:bow 2:shall 3:be 2:in 1:the 4:cloud
1:a 4:dove 2:from 3:him 2:to 3:see 2:if 1:the 4:waters 2:were
0:and 1:the 4:lord 2:plagued 3:pharaoh 0:and 1:his 4:house
0:that 1:his 4:brother 2:was 3:taken
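Given the tagged text as a list of (class, word) pairs, sequence counts like those above reduce to a bigram count over the class column; a minimal sketch:

```python
from collections import Counter

def class_pair_counts(tagged):
    """Count each pair of consecutive classes in a tagged text,
    given as a list of (class, word) pairs."""
    classes = [c for c, _ in tagged]
    return Counter(f"{a}_{b}" for a, b in zip(classes, classes[1:]))

# The first Genesis fragment above.
tagged = [(1, "the"), (4, "days"), (2, "of"), (3, "enos"), (2, "were"),
          (1, "nine"), (4, "hundred"), (0, "and"), (1, "five"), (4, "years")]
print(class_pair_counts(tagged)["1_4"])  # 3
```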

Of course, many more sequences appear a significant number of times. It is also evident that word classes are not clear-cut. Yet the results illustrate how this software can detect something relevant, at least for the most frequent words, even with a relatively short text.

These are the most frequent 20 words for each of the five classes for Voynich Quire 20 (using the EVA transcription by Zandbergen and Landini, including uncertain spaces):

C:0          C:1          C:2          C:3          C:4
tokens:1817  tokens:2585  tokens:1692  tokens:1893  tokens:3125
types:613    types:618    types:550    types:787    types:507
ratio:2.9    ratio:4.1    ratio:3.0    ratio:2.4    ratio:6.1
hapax:465    hapax:396    hapax:407    hapax:623    hapax:203
aiin 208 chedy 200 qokeey 159 ar 149 qokaiin 121
al 138 chey 125 qokeedy 136 or 72 daiin 121
y 122 shedy 116 qokedy 61 otar 59 l 111
ol 113 okeey 97 oteedy 57 otain 52 qokain 100
okain 68 shey 82 lchedy 52 r 49 okaiin 97
ain 58 cheey 78 qokey 40 cheo 39 otaiin 77
dain 48 oteey 63 lkeey 34 s 32 chol 63
air 35 otedy 60 okedy 32 char 32 o 55
am 27 cheol 55 qoteedy 31 qotar 29 otal 52
sain 24 sheey 49 lkeedy 30 dair 27 dar 44
a 21 okeedy 46 qoky 29 lkar 22 qokar 43
aiiin 20 chdy 38 qol 27 sar 21 okal 43
cheeo 18 chckhy 38 qotedy 25 ch 21 raiin 41
qokol 13 keedy 34 qoteey 22 lor 20 qokal 41
shody 11 sheedy 26 qoty 19 kar 19 okar 41
oteol 9 sheol 25 qotal 18 otair 18 qotaiin 40
ral 8 cheedy 24 qokeeey 17 chor 16 lkaiin 39
okol 8 shol 22 lkedy 17 chear 15 kaiin 36
oiin 8 keey 21 otey 15 tar 14 saiin 35
cheeody 8 chedaiin 21 okeeey 15 aiir 14 qotain 34

class:0 aiin and variants; mostly very short words
class:1 initial bench (ch/sh) -y or -l ending; 13 words start with a bench ending with -l or -y (vs a total of 3 in the top 20 words of the other four classes)
class:2 [ql]..[yl]: 16 words start with q- or l- and end with -y or -l (vs a total of 2 in the top 20 words of the other four classes)
class:3 final -r (16 words vs a total of 4 in the top 20 words of the other 4 classes)
class:4 -aiin -ain -al as suffixes i.e. attached to preceding characters (14 words vs 7 in the top 20 words of the other 4 classes)
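These rough patterns can be checked mechanically with a regular expression. As an illustration (the word list is simply the class:2 column from the table, and `count_matches` is a name of my choosing), the 16/20 figure for class:2 is reproduced below:

```python
import re

def count_matches(words, pattern):
    """Count the words that fully match a morphological pattern."""
    return sum(1 for w in words if re.fullmatch(pattern, w))

# Top 20 words of class:2 from the table above.
class2 = ["qokeey", "qokeedy", "qokedy", "oteedy", "lchedy", "qokey",
          "lkeey", "okedy", "qoteedy", "lkeedy", "qoky", "qol",
          "qotedy", "qoteey", "qoty", "qotal", "qokeeey", "lkedy",
          "otey", "okeeey"]

# Starts with q- or l- and ends with -y or -l.
print(count_matches(class2, r"[ql].*[yl]"))  # 16
```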

I thought that some of the classes might correspond to line-initial / line-final or paragraph-initial / paragraph-final words, which are known to be peculiar. But this is not the case:

    l.init. l.final p.init. p.final
C:0 248 217 57 32
C:1 202 203 33 55
C:2 144 138 54 28
C:3 218 251 58 52
C:4 272 275 26 61

The most frequent sequences in the tagged text:
1_4 1147
3_0 1135
4_1 1006
1_2 945
0_1 797
4_3 746
4_4 630
2_4 540

These frequent sequences can be represented by this graph:

The following are a few fragments that illustrate word sequences compatible with the most frequent sequences:

<f104v.13> 4:lkaiin 3:cheetar 0:aiin 1:cheitaiin
<f105v.11> 4:okaiin 3:os 0:aiin 1:chckhodu 2:qoteedy
<f106r.40> 3:ar 0:aiin 1:sheey 4:lkaiin 1:sheedy
<f111v.32> 1:chey 4:tain 3:chkar 0:alkar 1:chey 2:qol
<f112v.18> 4:saiin 3:or 0:aiin 1:chey 2:qokeedy
<f112v.46> 1:cheky 4:chokain 3:char 0:am 1:chey 4:kain
<f113v.47> 4:lkaiin 4:tair 1:shey 4:qotain 3:ar 0:akal 1:shey

Again, this is highly simplified: words in each class do not entirely fall into the morphological patterns I described, and, though those listed above are the most frequent consecutive occurrences of word classes, all other combinations occur, even if some of them are extremely rare.

Discussion

In the output for Voynichese, words in each class are morphologically similar. This can be verified quantitatively with a similarity function such as the Levenshtein ratio (1 for identical words, 0 for totally different words). The average similarity over the set of 100 words formed by the 20 most frequent words of all 5 classes can be taken as a benchmark. The values of this average are:

English Genesis   0.195
Voynich Quire 20  0.266

Possibly due to the well-known rigidity of Voynich morphology, the average similarity is considerably higher in Voynichese than in English.
If we compare the average similarity among the 20 most frequent words within each class, we can see that in English the values are close to the benchmark. In Voynichese, on the contrary, only one of the classes (c:0) has a value close to the overall average. All the other classes have higher values, with the similarity ratio of c:2 being more than double the average value.
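For reference, the similarity computation can be sketched as follows. Note that the exact normalization of the “Levenshtein ratio” varies between libraries; the formula below (1 minus the edit distance divided by the length of the longer word) is just one common choice:

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1 for identical words, 0 for totally different words."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def avg_pairwise_similarity(words):
    """Average similarity over all unordered pairs of positions."""
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```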

Similarity between the top 20 words in each class

There are several non-mutually exclusive reasons that could explain this similarity:

  • Similar words correspond to morphological variants of a unique root. E.g. in many languages the singular and plural of a word are different but quite similar words.
  • Similar words correspond to accidental spelling differences. In medieval manuscripts, it is not uncommon to see the same word written differently.
  • It is well known that the prefixes and suffixes of Voynichese words depend on the suffix and prefix of the preceding word. According to the Transformation Theory by Emma May Smith, this could be the effect of phonological adaptations. This phenomenon could be so pervasive that it misleads the algorithm into classifying morphology rather than part of speech.

It is also worth noting that “loop sequences” with repeated occurrences of the same class are much more frequent in Voynichese than in English.


     English  Voynich
0_0       37      119
1_1       18      187
2_2       75      423
3_3      212      393
4_4      190      630
TOT      532     1752

Only a small part of these 1752 “loop sequences” in Voynichese is due to exact reduplication (like “daiin daiin”): 158 cases of exact reduplication occur in Quire 20. These involve 92 words, which are spread across all 5 classes:
c0 8
c1 18
c2 22
c3 16
c4 28
My superficial impression is that the high number of loops is a symptom of a poor classification; grammatically, one can expect some word classes never to appear consecutively. We can observe this in the English example, where c0 and c1 (correlating with conjunctions and articles respectively) have particularly rare consecutive occurrences of members of the same class. Also, one can expect reduplication to be restricted to only a few part-of-speech classes (the fact that reduplicating words are assigned to all classes is surprising).
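Exact reduplications (like “daiin daiin”) can be extracted with a one-liner; `exact_reduplications` is a hypothetical helper, not part of any tool mentioned above:

```python
def exact_reduplications(words):
    """Return the first word of every pair of consecutive identical words."""
    return [w1 for w1, w2 in zip(words, words[1:]) if w1 == w2]
```

Note that a triple repetition counts as two overlapping pairs; looking each repeated word up in the class assignment then gives per-class totals like those listed above.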

Further research

This line of investigation promises a number of potentially interesting experiments. Here are some ideas:

  • Performing a more detailed analysis of the composition of each class. Here I focussed on the 20 most frequent words of each class, but looking at all words could provide different insights.
  • Using a higher number of classes.
  • Introducing a “sentence-start” marker. This is an option mentioned by Clark. Quire 20 is divided into a number of paragraphs: a precious piece of linguistic information. Of course, one can expect Grove words (the peculiar Voynichese words that appear in the first position of most paragraphs) to be grouped into a single class.
  • One could introduce a “line marker” as well. We know that in Voynichese a line of text appears to be “a functional unit”, with other peculiar words (different from Grove words) appearing in the first and last position of most lines.
  • Clark’s software allows setting a threshold that forces all words with fewer than that number of occurrences into a fixed class; 5 is suggested as a reasonable value. One could experiment with different values for this threshold (which I did not use in the taggings discussed here).
  • Clark has developed a variant of Ney’s algorithm that favours grouping similar words into the same class. We see this already happening “spontaneously” with Voynichese, but it should happen even more with Clark’s variant.
