Two Voynich word-models

Marco Ponzi
ViridisGreen
Jul 18, 2019

The Voynich manuscript (VMS, Beinecke MS 408) is written in a unique alphabet, and nobody has been able to extract any meaning from it. Words in the manuscript are highly structured, and some people believe that this regularity could be related to lexical morphology. In 1991, Jacques Guy analysed the way in which the “circle” characters (o, a, y in the EVA transcription system) alternate with the other characters, and came to the conclusion that the “circles” behave like vowels in a phonetic writing system. Whatever its cause, word structure is one of the phenomena that clearly prove that Voynichese is anything but random.

The subject of this post is the comparison between two models of Voynichese words:

  • Jorge Stolfi’s crust-mantle-core context-free grammar;
  • a context-free grammar based on Emma May Smith’s syllable ranking model.

Since I have taken liberties with both models, anyone interested in the details should refer to the web pages of the two authors (linked below). Other word models have been proposed in the past; some of them are mentioned by Stolfi.

Code, grammars and input files used for these experiments can be found here: https://github.com/marcoponzi/vms_word_models

Jorge Stolfi’s model
Stolfi discusses his model in this June 2000 web page. He classifies most characters into three types (core, mantle and crust) that can only appear in a specific order. “Circle” characters can appear more freely in different parts of words. In this model, both core and mantle can occur symmetrically in two distinct sections of words, so the overall structure is actually made of five parts:
crust-prefix, mantle-prefix, core, mantle-suffix, crust-suffix

As Stolfi writes:

words are parsed into three major nested “layers” -crust, mantle, and core- each composed from a specific subset of the Voynichese alphabet:
core: t p k f cth cph ckh cfh
mantle: ch sh ee
crust: d l r s n x i m g [and ‘q’; ‘g’ and ‘x’ are not included in the actual grammar]


suppose we assign “densities” 1, 2, and 3 to the three main letters types above, and ignore the remaining letters. The paradigm then says that the density profile of a normal word is a single unimodal hill, without any internal minimum.
In other words, as we move away from any maximum-density letter in the word, in either direction, the density can only decrease (or remain constant). The possible density profiles (ignoring repeated digits) are
1 2 3
12 21 13 31 23 32
121 123 131 132 231 232 321
1231 1232 1321 2321
12321

As can be seen, the “core” is made of the so-called “gallows” (EVA:t, p, k,f) and “benched gallows” characters (EVA: cth,cph,ckh, cfh); “benches” (EVA: ch, sh) and “e-sequences” form the “mantle”; all other characters belong to the “crust”.
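The density idea above can be sketched in a few lines of Python (my own illustration, not Stolfi’s code): tokenize a word into EVA glyphs by greedy longest match, map each glyph to its layer density, and check that the profile never rises again after falling.

```python
# Densities as in Stolfi's description: 3 = core, 2 = mantle, 1 = crust.
# Circles (o, a, y) and q are ignored, as in the quoted passage.
DENSITY = {
    **dict.fromkeys(["cth", "cph", "ckh", "cfh", "t", "p", "k", "f"], 3),
    **dict.fromkeys(["ch", "sh", "ee"], 2),
    **dict.fromkeys("dlrsnximg", 1),
}

def profile(word):
    # Greedy longest-match tokenization: multi-character glyphs first.
    out, i = [], 0
    while i < len(word):
        for g in ("cth", "cph", "ckh", "cfh", "ch", "sh", "ee"):
            if word.startswith(g, i):
                out.append(DENSITY[g])
                i += len(g)
                break
        else:
            d = DENSITY.get(word[i])
            if d is not None:
                out.append(d)
            i += 1
    return out

def unimodal(densities):
    # A valid profile is a single hill: once the density has decreased,
    # it may never increase again.
    falling = False
    for a, b in zip(densities, densities[1:]):
        if b < a:
            falling = True
        elif b > a and falling:
            return False
    return True
```

For instance, `profile("qokedy")` gives `[3, 1]`, a valid hill, while a profile such as `[3, 1, 3]` has an internal minimum and is rejected.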

Reformulation of Jorge Stolfi’s model
Stolfi provides a formal definition of his model as a context-free grammar. He also links scripts for processing the grammar, and word lists to be used as input. I was unable to execute Stolfi’s scripts, but it was not difficult to semi-automatically translate his grammar into the format required by the Python parsing library Lark.
The input file also had to be minimally pre-processed, in order to restore EVA:ee, which had been replaced with “bh” (Stolfi’s grammar expects the usual “ee”). Also, the total number of tokens in the input file is not perfectly identical to the number of tokens listed in the grammar file (35027 vs 35128). But the success rate produced by Lark is extremely close to (in fact slightly better than) the one listed in the grammar file (33803/35027 = 96.5% vs 33827/35128 = 96.3%).
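For readers unfamiliar with Lark, the flavour of such a translation can be conveyed by a drastically simplified sketch of the five-layer structure (the rule names are mine; this is illustrative only, not Stolfi’s actual grammar):

```
// Illustrative only: crust / mantle / core / mantle / crust,
// with optional circles interleaved in the crust layers.
word: crust mantle core? mantle crust
core: GALLOWS | BENCHED_GALLOWS
mantle: (BENCH | "ee")*
crust: (CIRCLE? CONSONANT)* CIRCLE?
GALLOWS: "t" | "p" | "k" | "f"
BENCHED_GALLOWS: "cth" | "cph" | "ckh" | "cfh"
BENCH: "ch" | "sh"
CIRCLE: "o" | "a" | "y"
CONSONANT: "q" | "d" | "l" | "r" | "s" | "n" | "i" | "m"
```

The real grammar is far larger, since it distinguishes prefix and suffix variants of each layer and lists many attested sub-patterns.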

Stolfi’s grammar file includes frequencies that are used as weights for the rules:

The text is contaminated by sampling, transcription, and possibly scribal errors, amounting to a few percent of the text tokens, which is probably the rate of many rare but valid word patterns. Thus, a purely qualitative model would have to either exclude too many valid patterns, or allow too many bogus ones. By listing the rule frequencies, we can be more liberal in the grammar, and list many patterns that are only marginally attested in the data, while clearly marking them as such.

The experiments discussed here focus on the binary classification of words as valid or invalid: the grammar I used does not include frequencies, which in any case I wouldn’t know how to represent in Lark. My preliminary results showed that Stolfi’s original grammar is indeed too “liberal”, accepting a large number of non-Voynichese words. So I removed low-frequency rules until I obtained results with an overall quality comparable to that of Emma May Smith’s model.

Emma May Smith’s model
Emma presented her model in a recent series of blog posts. In particular:

  • A New Word Structure
  • Matching [o]
  • Body Rank Order: refinement to initial [y]

Her model is based on syllables, so it is important to understand how syllables are defined in this context.

You can think of a syllable as a string or bundle of sounds uttered close together: a vowel (or vowel-like sound) and optional consonants pronounced before or after it. When speaking about the components of a syllable the vowel is called the nucleus, the consonants before it the onset, and the consonants after it the coda.

We divide up a word by first finding all the occurrences of [o] and [y]. There are as many syllables as occurrences of these glyphs. Each syllable consists of the [o] or [y], all the glyphs before it until another [o] or [y] or start of the word is reached, and all the glyphs after if it is the final [o] or [y] in the word.

Emma considers the characters [a] and [y] as equivalent: whenever she mentions [y] one should be aware that she is also referring to [a]. In addition to this, benches, benched-gallows and e-sequences that are not followed by a “circle” are also considered as the right-side boundary of a vowel. Emma describes this feature as y-deletion: in addition to [y] and [a] there is a third “manifestation” of [y] that corresponds to syllables ending without an explicit “circle”.

Let us take an example: [qokedar] (8 tokens). First we restore the missing [y] after [e] and change [a] to [y], giving: [qokeydyr]. Then we apply the syllable division process outlined above: there are three occurrences of [o] and [y] in the word, giving the syllables [qo] + [key] + [dyr].
It should be noted that [qo] and [key] are syllables as bodies only and only [dyr] has a coda. Indeed, the process of syllable-finding makes it impossible for there to be a coda except at the end of words, as glyphs are always assigned to a following syllable where possible.
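Emma’s syllable-division procedure can be sketched in Python. This is my simplified reading of it: the regular expression that restores a deleted [y] only looks at benches, benched gallows and e-sequences, ignoring the finer contextual conditions that Emma discusses.

```python
import re

# Benches, benched gallows and e-sequences not followed by a circle
# mark the end of a syllable body: restore the "deleted y" there.
# (A simplification of Emma May Smith's y-deletion rule.)
DELETED_Y = re.compile(r"((?:ckh|cth|cph|cfh|ch|sh)e*|e+)(?![oaye])")

def normalize(word):
    word = DELETED_Y.sub(r"\1y", word)
    return word.replace("a", "y")  # [a] is treated as equivalent to [y]

def syllabify(word):
    # One syllable per circle: each [o]/[y] takes the glyphs before it,
    # and the last one also takes any trailing glyphs (the coda).
    word = normalize(word)
    idx = [i for i, c in enumerate(word) if c in "oy"]
    if not idx:
        return [word]
    starts = [0] + [i + 1 for i in idx[:-1]]
    sylls = [word[a:b + 1] for a, b in zip(starts, idx)]
    sylls[-1] += word[idx[-1] + 1:]
    return sylls
```

Applied to Emma’s example, `syllabify("qokedar")` returns `["qo", "key", "dyr"]`, with the restored [y] written explicitly.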


The core rules of the model are the following:

1. Each syllable body has a Rank of 1, 2, or 3.
2. Each word has a number of slots equal to its number of syllables.
3. From left to right the Rank of each syllable body must increase (or stay constant).**
[** A later post specifies that syllable Rank cannot stay constant: it must strictly increase]

The ranks for all syllable bodies which occur in five or more word types are as below:
Rank 1: cheeo, cheey, cheo, chey, cho, chy, qo, o, y, sheey, sheo, shey, sho, shy, so
Rank 2: ckhey, ckhy, cthey, ctho, cthy, do, lo, ro, sy, eey, kchey, kcho, kchy, keeey, keeo, keey, keo, key, ko, kshey, kshy, ky, lchey, lchy, lkeey, lky, lshey, pchey, pcho, pchy, po, py, tcheo, tchey, tcho, tchy, teeo, teey, teo, tey, to, ty
Rank 3: dy, ldy, ly, ry

This model defines possible words by using two constraints that work together:

  • only a limited number of syllable types is accepted;
  • if a word is made of more than a single syllable, syllables must appear in the order defined by their rank.

An additional rule is the special treatment of words starting with ‘ych’ or ‘ysh’:

the text of the Voynich manuscript must be “normalized” by removing the [y] from the start of words beginning [ych, ysh].

For example, “ychey” should be treated as a single syllable (this pattern is the only allowed possibility for syllables with more than a single “circle”).
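The two constraints can be sketched together in a few lines of Python, using a small excerpt of the rank table above (my own illustration, not Emma’s code):

```python
# A small excerpt of the syllable-body rank table quoted above.
RANKS = {
    "qo": 1, "o": 1, "chey": 1, "shey": 1, "sho": 1,
    "key": 2, "to": 2, "ko": 2, "cthy": 2, "pcho": 2,
    "dy": 3, "ly": 3, "ry": 3, "ldy": 3,
}

def word_valid(bodies):
    # A word is accepted if every syllable body has a known rank and
    # the ranks strictly increase from left to right.
    ranks = [RANKS.get(b) for b in bodies]
    return None not in ranks and all(a < b for a, b in zip(ranks, ranks[1:]))
```

For instance, `word_valid(["qo", "key", "dy"])` accepts “qokedy”, while `word_valid(["key", "qo"])` rejects the reversed order.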

Reformulation of Emma May Smith’s model
In the following discussion, I will use a model that derives from Emma’s syllable ranks, but with significant differences.

Emma’s special treatment of y+bench gave me the idea of introducing Rank0, i.e. splitting Emma’s Rank 1 into two distinct ranks:
Rank0: so, qo, o, a, y
Rank1: syllables consisting of a bench character, an optional e-sequence, and an optional “circle” (a deleted-y is assumed in case there is no circle)

This makes words like “qochey” and “ycheey” acceptable (as Rank0+Rank1).

Syllables “do, lo, ro, sy” are assigned to Rank3 instead of Rank2. A result of this update is that Rank2 syllables are now required to contain either a gallows (benched or not) or a bench preceded by another “non-circle” character (e.g. lchey). In this way, the resulting grammar is also more easily comparable with Stolfi’s, with Rank2 playing a role similar to Stolfi’s Core+MantleSuffix.

Syllable “eey” was also moved from Rank 2 to Rank 3, in order to account for words like “toees” (2:to, 3:eeys), “loeey” (2:lo, 3:eey), “qotoeey” (0:qo, 2:to, 3:eey).

In order to accept words like “dalor” (Rank3+Rank3) or “dalary” (Rank3+Rank3+Rank3), I allowed Rank3 syllables to appear as many as three times consecutively (an exception to the rule that only allows strictly increasing syllable ranks). Also in this case, the resulting model is closer to Stolfi’s model, with Rank3 behaving similarly to the OR.OR structure defining the “Crust” layer.
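The relaxed ordering rule can be sketched as a check on a sequence of ranks (my formulation of the exception, not part of Emma’s original model):

```python
def ranks_valid(rank_seq):
    # Ranks (0-3) must strictly increase from left to right, except
    # that Rank3 may repeat up to three times consecutively
    # (covering e.g. "dalor" = 3+3 and "dalary" = 3+3+3).
    threes, prev = 0, -1
    for r in rank_seq:
        if r == 3 and prev == 3:
            threes += 1
            if threes >= 3:  # a fourth consecutive Rank3
                return False
        elif r > prev:
            threes = 0
        else:
            return False
        prev = r
    return True
```

So `[3, 3, 3]` (as in “dalary”) is accepted, while a fourth consecutive Rank3, or any non-increasing pair of other ranks, is rejected.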

The model was encoded as a Lark grammar, so that it could be more easily compared with Stolfi’s model. The symbol ‘U’, mapping to the empty string, represents the occurrences of a deleted-y. My treatment of this feature is simplified with respect to Emma’s analysis. In particular, I rarely take into account the characters that follow ‘U’ (while the possibility of actual y-deletion depends both on the left-side and right-side context): an accurate representation of y-deletion in a context-free grammar is certainly possible, but for this post I preferred to focus on simplicity rather than precision.

Overall, my interventions on Emma’s model were much more extensive than on Stolfi’s. I may have introduced errors due to both misunderstanding Emma’s ideas and inappropriately altering / simplifying them.

Data-sets
All input files were filtered by removing characters that are not included in the grammars I considered, in particular: b,g,j,u,v,w,x,z.

I used two Voynichese input files:

  • Stolfi’s list of words (with ‘bh’ restored back to ‘ee’);
  • Takeshi Takahashi’s EVA transcription.

I created two comparison files containing non-Voynichese words:

  • an excerpt from the King James Bible;
  • a version of Takahashi’s transcription where characters were randomly moved around.

All words that appear in one of the two lists of Voynichese words (Takahashi’s and Stolfi’s) were removed from the non-Voynichese files, so that the two sets of Voynichese and non-Voynichese words are entirely disjoint.
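The scrambled control file can be generated along these lines (a sketch: the post does not specify the exact scrambling procedure, so the per-token shuffle used here is an assumption):

```python
import random

def scramble(word, rng=None):
    # Shuffle the characters of a single token, keeping its length
    # and glyph inventory intact (assumed procedure, see lead-in).
    chars = list(word)
    (rng or random).shuffle(chars)
    return "".join(chars)

def make_control(tokens, voynichese_vocab):
    # Drop any scrambled token that happens to reproduce a real
    # Voynichese word, so the two word sets stay disjoint.
    return [w for w in (scramble(t) for t in tokens)
            if w not in voynichese_vocab]
```

The same vocabulary filter was applied to the Bible excerpt, so that neither comparison file contains any attested Voynichese word.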

I ran a second set of experiments where, for each of the four input files, only the 1000 most frequent word types were considered. While in the first experiment tokens were weighted by their number of occurrences, in this second experiment each word type was processed as having the same weight.

Results
As shown by the following table, the overall results for the two grammars are similar, both when considering all word tokens and when only considering the most frequent word types. The two models score error rates close to 9%, i.e. success rates are about 91%. The highest error rates are caused by the processing of the quasi-Voynichese file generated by scrambling characters in Takahashi’s transcription.

Structure of Voynichese words
Lark can generate images of the parse trees it produces. These images will be used in the following discussion of a few sample words.

If you want to enlarge a parse-tree image, open it in a new tab and change the width that appears in the url (e.g. change “/max/700/” to “/max/900/”).

qokedar
Let’s start with the word that Emma used as an example in the passage quoted above.
Stolfi’s parsing of the word exemplifies how his grammar models the alternation of “circles” and “non-circles” by means of the “OR” structure: i.e. a circle is connected to the following non-circle. Accordingly, “ok” and “ar” are joined together. In Voynichese, non-circles can follow each other while circles almost never do: Stolfi modelled this by making the circle in OR optional. “dar” is then seen as two consecutive occurrences of OR: “d”+”ar”.
Since the word has no “benches”, both mantle layers are empty.
In Emma’s model, the word is split into three syllable bodies (“qo”, “ke[y]”, “da”) and a coda (“r”). The second body includes a “deleted-y” that, as I said above, I represent by the symbol “u”. The coda should be interpreted as linked to the last syllable, though in creating the grammar I found it more convenient to represent it as following the list of syllable bodies.

qokedy / qokeody
This is a much more frequent word than qokedar (272 vs 8 occurrences, in Takahashi’s transcription).
In Stolfi’s parsing, the second OR that corresponded to “ar” must now be dropped, since a non-circle is mandatory. It is replaced by a structure intended to model circle-endings (opt-o-final).
In Emma’s model, the parsing is identical to that of qokedar, except for the absence of a coda.

Comparison with a third similar word (“qokeody”) illustrates how y-deletion (represented by ‘u’) works in the case of “qokedy”.

ochepchody
This is one of the longest words that is accepted by both models. In the syllable-based model, it includes all the possible syllable ranks (Rank0 + Rank1 + Rank2 + Rank3). According to the revised model that allows repeated occurrences of Rank3, “ochepchodaly” and “ochepchodalaly” would also be acceptable.
Stolfi’s grammar parses “ochepchody” as having an empty crust-prefix: “qodalchepchody” would also be accepted, as well as “qodalchepchodaly” (with a second OR in the crust-suffix).

dairal / daral
Both models reject this word. Its anomaly is the presence of “i” in a mid-word position. Stolfi’s model expects “i” to appear in the “final” section; Emma’s expects it in the “coda”.
On the other hand, “daral” is accepted by both models. In the syllable-based model, this is made possible by the extension that allows multiple consecutive occurrences of Rank3 syllables.

raraiin / rorol / dydy / totol
While the reduplication of words is frequent in Voynichese, syllable-reduplication isn’t. Yet the phenomenon is not absent, as these words exemplify. In the syllable-based model, the first three words can be accepted if one allows the repetition of Rank3 syllables. Some words (e.g. raraiin) are also accepted by Stolfi’s model, thanks to the OR.OR sequences included in the grammar. The parsing of “raraiin” is very similar to that of “daral”.
Both models allow at most one gallows per word: both reject “totol”.

polchedy / olchedy
This is a “Grove-word”, i.e. a normal word that appears to be transformed by a prefixed “gallows” when appearing at the start of a paragraph. Both models expect a bench to immediately follow a gallows character, but here they are separated by “ol”: both reject “polchedy”.
By contrast, the supposed original word “olchedy” is parsable.

English words
Stolfi’s grammar generates many more false positives on English tokens than the syllable-based grammar (2956 vs 910). The difference is actually due to a small number of frequent word types misclassified by Stolfi’s grammar. I think this is again an effect of how OR.OR works, with the possibility of omitting the circles, so that the actual pattern can be R.R. This matches the consonant bigrams that are frequent in English. Some of the words accepted by Stolfi’s model but rejected by Emma’s are: “ark”, “refrain”, “stars”.

Some English words accepted by both models: “altar”, “shekel”, “archer”. The parsing of “archer” can be compared with that of “olchedy”.

Conclusions
The two models discussed here are quite effective in discriminating between Voynichese and non-Voynichese words. My re-formulation of Emma May Smith’s model makes it closer to Stolfi’s: in particular, the bench-gallows-bench structure that Stolfi represents as mantle-core-mantle is closely paralleled by Rank1 and Rank2 syllables.
The differences between the two models are mostly related to [drls], i.e. those parts of words that Stolfi labels as “crust” and parses according to OR.OR.OR segments. Again, by introducing the possibility of repeating Rank3 syllables, I made the syllable-based model closer to Stolfi’s (with Rank3 repetition being roughly equivalent to RO.RO.RO).
While Stolfi’s model underlines the symmetrical position (with respect to gallows) of the various non-circle characters, Emma’s model makes clear that syllable positions appear to be governed by the asymmetrical principle represented by the “ranking” system.
As for the implications of word structure, when presenting his model, Stolfi writes:

The nature and complexity of the paradigm, and its fairly uniform fit over all sections of the manuscript (including the labels on illustrations), are further evidence that the text has significant contents of some sort. Moreover, the paradigm imposes severe constraints on possible “decipherment” theories. In particular, it seems highly unlikely that the text is a Vigenère-style cipher, or was generated by a random process, or is a simple transliteration of an Indo-European language. On the other hand, the paradigm seems compatible with a codebook-based cipher (like Kircher’s universal language), an invented language with systematic lexicon (like Dalgarno’s), or a non-European language with largely monosyllabic words.

Of the three options proposed by Stolfi, I believe that an artificial language and a non-European language are entirely possible. I consider a codebook-based cipher less likely, since there are phenomena (e.g. the alternation of circle and non-circle characters, glyph-combinations across word breaks) that support the idea of the phonetic nature of Voynichese. It has sometimes been suggested that Voynichese could be something similar to Farfallino or Javanais: these are basic examples of “phonetic ciphers” and I think that something along those lines (but more complex) might be an option.

I wonder if something like glossolalia could also be a possible explanation. Both Javanais and glossolalia are mentioned in this 2003 message by Jacques Guy. Like natural languages, this pseudo-linguistic phenomenon originates in speech rather than writing, so it must certainly be pronounceable: this could account for the alternation of circle and non-circle characters studied by Guy. But until now I have been unable to find a transcribed corpus large enough to verify whether the rigid word structure of Voynichese is compatible with glossolalia.
