Preliminary thoughts on Voynichese Part of Speech tagging

Marco Ponzi
Published in ViridisGreen
Feb 1, 2019

Software tools for unsupervised Part of Speech (POS) tagging have been around for several years. The algorithms mostly look at which words tend to appear next to the same words: on a large enough corpus, such behaviour correlates with the grammatical role of each word.

The application of such techniques to Voynichese (the unknown language of the Voynich manuscript, Beinecke ms 408) has been mentioned in the thesis of Gianluca Bosi (Bologna University) and I am sure others must have experimented with this approach.

The simple experiments discussed here are based on software developed for Alexander Clark’s 2003 paper Combining Distributional and Morphological Information for Part of Speech Induction.
Clark discusses a number of ideas that could help deal with rare words (one of the problems with Voynichese), but here I will only focus on an older algorithm re-implemented by Clark (originally described in Hermann Ney, Ute Essen, and Reinhard Kneser, 1994, On structuring probabilistic dependencies in stochastic language modelling).
I am not sure I understand much of the theory: I use the software as a black box. I think the algorithm works on bigrams, i.e. pairs of consecutive words: words are assigned to classes in such a way that the probability of guessing the next word given the current word is maximized. There will be classes that tend to appear consecutively and others that rarely or never follow each other. The number of word classes is one of the parameters of the algorithm.
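As a very rough illustration of the idea (not Clark's or Ney's actual implementation), a greedy "exchange"-style clustering can be sketched in a few lines. The names `exchange_cluster` and `class_bigram_loglik` are my own, and the objective is a deliberately simplified class-bigram likelihood:

```python
from collections import Counter
import math

def class_bigram_loglik(words, assign):
    """Score an assignment of words to classes by a simplified
    class-bigram likelihood: p(w_i | w_{i-1}) is approximated by
    p(class_i | class_{i-1}) * p(w_i | class_i)."""
    cseq = [assign[w] for w in words]
    bigrams = Counter(zip(cseq, cseq[1:]))
    cfreq = Counter(cseq)
    wfreq = Counter(words)
    ll = 0.0
    for (c1, c2), n in bigrams.items():
        ll += n * math.log(n / cfreq[c1])          # class transition term
    for w, n in wfreq.items():
        ll += n * math.log(n / cfreq[assign[w]])   # word emission term
    return ll

def exchange_cluster(words, n_classes, iters=5):
    """Greedily move each word type to the class that maximizes the score."""
    vocab = sorted(set(words))
    assign = {w: i % n_classes for i, w in enumerate(vocab)}
    for _ in range(iters):
        for w in vocab:
            assign[w] = max(range(n_classes),
                            key=lambda c: class_bigram_loglik(words, {**assign, w: c}))
    return assign
```

On a text as short as a single sentence the result is not very stable; a real implementation adds smoothing, a more careful search strategy, and the morphological information Clark discusses.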

A simple sentence

As an example, this is a three-class tagging of a simple sentence:

0:it 2:ascends 1:from 0:the 2:earth 1:to 0:the 2:heaven 1:and 1:again 0:it 2:descends 1:to 0:the 2:earth

  • The two most frequent words (“it” and “the”) are assigned to class:0
  • The words that precede those in class:0 are assigned to class:1
  • The words that follow those in class:0 are assigned to class:2

Basically, the detected structure is 0,2,1,0,2,1,0,2,1…
Even this simple example is not totally regular and in one case a class:1 word is followed by a second class:1 word (“1:and 1:again”). But class:0 is always followed by class:2, and class:2 is always followed by class:1.

This example also shows how such a system can be useful for Part of Speech identification:

  • The two prepositions (“to” and “from”) are tagged as class:1, together with two other “function words” (“and” and “again”)
  • All verbs and nouns (ascends, descends, heaven, earth) are tagged as class:2

Of course, the variability in all natural languages is so high that a meaningful tagging can only be produced on the basis of a very large corpus.

English vs Voynichese

In order to avoid difficulties with the differences between Currier languages and to have as uniform a text as possible, I focused on a single section of the Voynich manuscript: Quire 20.
As a benchmark, I used a portion of Genesis from the King James Bible, with a similar number of words (about 10,500).

In both cases, I used only 5 POS classes. Such a number is obviously too small to correctly represent all the different part-of-speech categories, but it makes analysis and discussion easier. I fed the whole text to the algorithm, without any punctuation or sentence-boundary marks.

These are the most frequent 20 words for each of the classes for the English text:

C:0          C:1          C:2          C:3          C:4
tokens:1860  tokens:2159  tokens:2095  tokens:2048  tokens:2294
types:131    types:84     types:122    types:399    types:439
ratio:14.1   ratio:25.7   ratio:17.1   ratio:5.1    ratio:5.2
hapax:67     hapax:31     hapax:41     hapax:190    hapax:189
and 1117 the 866 of 441 it 93 earth 99
that 141 he 146 in 163 be 87 said 93
upon 58 his 131 to 131 him 79 lord 87
which 52 i 109 unto 121 all 76 years 65
lived 34 god 102 was 109 thee 67 will 53
forth 27 a 98 shall 95 abram 59 sons 47
into 24 thou 82 is 79 them 53 man 45
came 24 every 69 after 69 not 43 hundred 44
as 24 thy 68 were 67 noah 41 had 37
out 22 they 62 for 66 me 37 name 34
on 21 their 36 with 65 her 37 wife 33
but 21 my 35 begat 64 there 31 land 33
up 15 s 33 from 61 went 30 waters 32
saying 13 she 30 shalt 34 also 26 day 32
then 12 an 29 when 25 you 20 have 30
because 12 two 21 made 25 this 20 days 30
at 12 cain 18 called 23 one 20 seed 27
wives 10 three 15 make 22 old 20 ark 26
therefore 9 five 14 behold 22 daughters 20 flesh 25
where 7 nine 12 let 21 eat 19 son 23

class:0 mostly conjunctions and adverbs
class:1 determiners and numbers (but also subject pronouns)
class:2 12 verbs + 8 prepositions
class:3 contains several object pronouns (him,thee,them,me,her), but it is quite mixed
class:4 16 nouns + 4 verbs (3 of which are auxiliary); even if they do not appear among the most frequent words, several adjectives are also assigned to this class

“Hapax” is the number of hapax legomena (i.e. words which occur only once in the whole text). This value is obviously anti-correlated with the tokens/types ratio: classes with a high ratio contain fewer word types, each repeated more often, and so tend to have fewer hapax legomena. The number of tokens per class is roughly constant. It could be that function words concentrate in classes with a high tokens/types ratio and a low number of hapax legomena.
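The figures in the header rows of the table can be reproduced with a few lines. Here `class_stats` is a name of my choosing, assuming the tagger's output is available as a word list plus a word-to-class dictionary:

```python
from collections import Counter

def class_stats(words, assign, cls):
    """Token count, type count, tokens/types ratio and hapax count
    for one word class, mirroring the table rows above."""
    toks = [w for w in words if assign.get(w) == cls]
    freq = Counter(toks)
    tokens, types = len(toks), len(freq)
    hapax = sum(1 for n in freq.values() if n == 1)
    return {"tokens": tokens, "types": types,
            "ratio": round(tokens / types, 1), "hapax": hapax}
```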

The most frequent sequences of two consecutive classes are (the numbers correspond to occurrences of the sequence in the tagged text):
1_4 1828
2_3 1010
3_0 917
2_1 868
4_2 862
4_0 804
0_1 801
3_2 500
They can be represented by the following graph:

Some sequences that match those illustrated in the graph:
1:the 4:days 2:of 3:enos 2:were 1:nine 4:hundred 0:and 1:five 4:years
1:the 4:bow 2:shall 3:be 2:in 1:the 4:cloud
1:a 4:dove 2:from 3:him 2:to 3:see 2:if 1:the 4:waters 2:were
0:and 1:the 4:lord 2:plagued 3:pharaoh 0:and 1:his 4:house
0:that 1:his 4:brother 2:was 3:taken
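Given the tagged text as a list of (class, word) pairs, sequence counts like those above reduce to a bigram count over the class column; a minimal sketch:

```python
from collections import Counter

def class_pair_counts(tagged):
    """Count each pair of consecutive classes in a tagged text,
    given as a list of (class, word) pairs."""
    classes = [c for c, _ in tagged]
    return Counter(f"{a}_{b}" for a, b in zip(classes, classes[1:]))

# The first Genesis fragment above.
tagged = [(1, "the"), (4, "days"), (2, "of"), (3, "enos"), (2, "were"),
          (1, "nine"), (4, "hundred"), (0, "and"), (1, "five"), (4, "years")]
print(class_pair_counts(tagged)["1_4"])  # 3
```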

Of course, many more sequences appear a significant number of times. It is also evident that word classes are not clear-cut. Yet the results illustrate how this software can detect something relevant, at least for the most frequent words, even with a relatively short text.

These are the most frequent 20 words for each of the five classes for Voynich Quire 20 (using the EVA transcription by Zandbergen and Landini, including uncertain spaces):

C:0          C:1          C:2          C:3          C:4
tokens:1817  tokens:2585  tokens:1692  tokens:1893  tokens:3125
types:613    types:618    types:550    types:787    types:507
ratio:2.9    ratio:4.1    ratio:3.0    ratio:2.4    ratio:6.1
hapax:465    hapax:396    hapax:407    hapax:623    hapax:203
aiin 208 chedy 200 qokeey 159 ar 149 qokaiin 121
al 138 chey 125 qokeedy 136 or 72 daiin 121
y 122 shedy 116 qokedy 61 otar 59 l 111
ol 113 okeey 97 oteedy 57 otain 52 qokain 100
okain 68 shey 82 lchedy 52 r 49 okaiin 97
ain 58 cheey 78 qokey 40 cheo 39 otaiin 77
dain 48 oteey 63 lkeey 34 s 32 chol 63
air 35 otedy 60 okedy 32 char 32 o 55
am 27 cheol 55 qoteedy 31 qotar 29 otal 52
sain 24 sheey 49 lkeedy 30 dair 27 dar 44
a 21 okeedy 46 qoky 29 lkar 22 qokar 43
aiiin 20 chdy 38 qol 27 sar 21 okal 43
cheeo 18 chckhy 38 qotedy 25 ch 21 raiin 41
qokol 13 keedy 34 qoteey 22 lor 20 qokal 41
shody 11 sheedy 26 qoty 19 kar 19 okar 41
oteol 9 sheol 25 qotal 18 otair 18 qotaiin 40
ral 8 cheedy 24 qokeeey 17 chor 16 lkaiin 39
okol 8 shol 22 lkedy 17 chear 15 kaiin 36
oiin 8 keey 21 otey 15 tar 14 saiin 35
cheeody 8 chedaiin 21 okeeey 15 aiir 14 qotain 34

class:0 aiin and variants; mostly very short words
class:1 initial bench (ch/sh) -y or -l ending; 13 words start with a bench ending with -l or -y (vs a total of 3 in the top 20 words of the other four classes)
class:2 [ql]..[yl]: 16 words start with q- or l- and end with -y or -l (vs a total of 2 in the top 20 words of the other four classes)
class:3 final -r (16 words vs a total of 4 in the top 20 words of the other 4 classes)
class:4 -aiin -ain -al as suffixes i.e. attached to preceding characters (14 words vs 7 in the top 20 words of the other 4 classes)
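These rough patterns can be checked mechanically with a regular expression. As an illustration (the word list is simply the class:2 column from the table, and `count_matches` is a name of my choosing), the 16/20 figure for class:2 is reproduced below:

```python
import re

def count_matches(words, pattern):
    """Count the words that fully match a morphological pattern."""
    return sum(1 for w in words if re.fullmatch(pattern, w))

# Top 20 words of class:2 from the table above.
class2 = ["qokeey", "qokeedy", "qokedy", "oteedy", "lchedy", "qokey",
          "lkeey", "okedy", "qoteedy", "lkeedy", "qoky", "qol",
          "qotedy", "qoteey", "qoty", "qotal", "qokeeey", "lkedy",
          "otey", "okeeey"]

# Starts with q- or l- and ends with -y or -l.
print(count_matches(class2, r"[ql].*[yl]"))  # 16
```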

I thought that some of the classes might correspond to line-initial / line-final or paragraph-initial / paragraph-final words, which are known to be peculiar. But this is not the case:

    l.init. l.final p.init. p.final
C:0 248 217 57 32
C:1 202 203 33 55
C:2 144 138 54 28
C:3 218 251 58 52
C:4 272 275 26 61

The most frequent sequences in the tagged text:
1_4 1147
3_0 1135
4_1 1006
1_2 945
0_1 797
4_3 746
4_4 630
2_4 540

These frequent sequences can be represented by this graph:

The following are a few fragments that illustrate word sequences compatible with the most frequent sequences:

<f104v.13> 4:lkaiin 3:cheetar 0:aiin 1:cheitaiin
<f105v.11> 4:okaiin 3:os 0:aiin 1:chckhodu 2:qoteedy
<f106r.40> 3:ar 0:aiin 1:sheey 4:lkaiin 1:sheedy
<f111v.32> 1:chey 4:tain 3:chkar 0:alkar 1:chey 2:qol
<f112v.18> 4:saiin 3:or 0:aiin 1:chey 2:qokeedy
<f112v.46> 1:cheky 4:chokain 3:char 0:am 1:chey 4:kain
<f113v.47> 4:lkaiin 4:tair 1:shey 4:qotain 3:ar 0:akal 1:shey

Again, this is highly simplified: words in each class do not entirely fall into the morphological patterns I described, and, though those listed above are the most frequent consecutive occurrences of word classes, all other combinations occur, even if some of them are extremely rare.

Discussion

In the output for Voynichese, words in each class are morphologically similar. This can be verified quantitatively with a similarity function such as the Levenshtein ratio (1 for identical words, 0 for totally different words). The average similarity over the set of 100 words formed by the 20 most frequent words of all 5 classes can be taken as a benchmark. The values of this average are:

English Genesis   0.195
Voynich Quire 20  0.266

Possibly due to the well-known rigidity of Voynich morphology, the average similarity is considerably higher in Voynichese than in English.
If we compare the average similarity among the 20 most frequent words within each class, we can see that in English the values are close to the benchmark. In Voynichese, on the contrary, only one of the classes (c:0) has a value close to the overall average. All the other classes have higher values, with the similarity ratio of c:2 being more than double the average value.
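For reference, the similarity computation can be sketched as follows. Note that the exact normalization of the “Levenshtein ratio” varies between libraries; the formula below (1 minus the edit distance divided by the length of the longer word) is just one common choice:

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1 for identical words, 0 for totally different words."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def avg_pairwise_similarity(words):
    """Average similarity over all unordered pairs of positions."""
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```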

Similarity between the top 20 words in each class

There are several non-mutually exclusive reasons that could explain this similarity:

  • Similar words correspond to morphological variants of a unique root. E.g. in many languages the singular and plural of a word are different but quite similar words.
  • Similar words correspond to accidental spelling differences. In medieval manuscripts, it is not uncommon to see the same word written differently.
  • It is well known that the prefixes and suffixes of Voynichese words depend on the suffix and prefix of the preceding word. According to the Transformation Theory by Emma May Smith, this could be the effect of phonological adaptations. This phenomenon could be so pervasive that it misleads the algorithm into classifying morphology rather than part of speech.

It is also worth noting that “loop sequences” with repeated occurrences of the same class are much more frequent in Voynichese than in English.


     English  Voynich
0_0       37      119
1_1       18      187
2_2       75      423
3_3      212      393
4_4      190      630
TOT      532     1752

Only a small part of these 1752 “loop sequences” in Voynichese is due to exact reduplication (like “daiin daiin”): 158 cases of exact reduplication occur in Quire 20. These involve 92 words, which are spread across all 5 classes:
c0 8
c1 18
c2 22
c3 16
c4 28
My superficial impression is that the high number of loops is a symptom of a poor classification; grammatically, one can expect some word classes never to appear consecutively. We can observe this in the English example, where c0 and c1 (correlating with conjunctions and articles respectively) have particularly rare consecutive occurrences of members of the same class. Also, one can expect reduplication to be restricted to only a few part-of-speech classes (the fact that reduplicating words are assigned to all classes is surprising).
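Exact reduplications (like “daiin daiin”) can be extracted with a one-liner; `exact_reduplications` is a hypothetical helper, not part of any tool mentioned above:

```python
def exact_reduplications(words):
    """Return the first word of every pair of consecutive identical words."""
    return [w1 for w1, w2 in zip(words, words[1:]) if w1 == w2]
```

Note that a triple repetition counts as two overlapping pairs; looking each repeated word up in the class assignment then gives per-class totals like those listed above.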

Further research

This line of investigation promises a number of potentially interesting experiments. Here are some ideas:

  • Performing a more detailed analysis of the composition of each class. Here I focussed on the 20 most frequent words of each class, but looking at all words could provide different insights.
  • Using a higher number of classes.
  • Introducing a “sentence-start” marker. This is an option mentioned by Clark. Quire 20 is divided into a number of paragraphs: a precious piece of linguistic information. Of course, one can expect Grove words (the peculiar Voynichese words that appear in the first position of most paragraphs) to be grouped into a single class.
  • One could introduce a “line marker” as well. We know that in Voynichese a line of text appears to be “a functional unit”, with other peculiar words (different from Grove words) appearing in the first and last position of most lines.
  • Clark’s software allows setting a threshold that forces all words with fewer than that number of occurrences into a fixed class; 5 is suggested as a reasonable value. One could experiment with different values for this threshold (which I did not use in the taggings discussed here).
  • Clark has developed a variant of Ney’s algorithm that favours grouping similar words into the same class. We see this already happening “spontaneously” with Voynichese, but it should happen even more with Clark’s variant.
