Neural-Network transliteration of the Codex Seraphinianus

Marco Ponzi
Published in ViridisGreen · 15 min read · Aug 26, 2023

A while ago, I applied the neural-network software for handwritten text recognition discussed by Harald Scheidl to the Voynich manuscript.

Having found suitable online scans of the Codex Seraphinianus (Luigi Serafini, 1981), I applied the same approach to that book. This was in part inspired by a Reddit post by u/maimonidies.

In the following, all page numbers refer to pages in the archive.org PDF from the link above. The scans do not include Serafini’s original numbers that appeared at the bottom of each page. Serafini always stated that the text in the book is meaningless, e.g.: “The writing of the Codex is asemic, i.e. meaningless. It evolved over time, as can be observed by carefully flipping through the text. To me, it looks asemic, and I certainly haven’t encoded any hidden message in it. But perhaps, in the future, it will be discovered that there was an alien spacecraft orbiting Earth in the mid-Seventies, transmitting psycho-messages at a very high frequency. Thus, I might have unconsciously transcribed the images and alphabet of a remote extragalactic civilization.” (translated from this interview by Fabio Pariante, 2022).

EXTRACTION OF WORD IMAGES

Pages from the PDF were converted to images with a 300 DPI resolution.
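
As a minimal sketch of this step (the exact tool chain is not important; here I assume the pdf2image wrapper around the poppler utilities):

```python
from pdf2image import convert_from_path

# Render each PDF page as a 300 DPI grayscale image.
pages = convert_from_path("codex_seraphinianus.pdf", dpi=300, grayscale=True)
for i, page in enumerate(pages, start=1):
    page.save(f"page_{i:03d}.png")
```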

The text in the Codex is written in regular, well-separated horizontal lines, and words are separated by wide spaces. I therefore treated as text areas those regions of the pages that show such a horizontal pattern. This method only detects paragraph text, so isolated words like image labels and most titles were ignored. 1189 text areas were extracted.

This image shows the 8 text areas extracted from page 106. In this case, bold text was not recognized and was skipped.
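
The following is only a rough sketch of this kind of detection, not the exact parameters used: binarize the page, smear the ink horizontally so that the words of a line merge into a single blob, and keep blobs that are much wider than they are tall (grouping nearby blobs into text areas is omitted here).

```python
import cv2

page = cv2.imread("page_106.png", cv2.IMREAD_GRAYSCALE)
_, ink = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Wide horizontal closing: the glyphs of one line merge into a line-shaped blob.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60, 3))
smeared = cv2.morphologyEx(ink, cv2.MORPH_CLOSE, kernel)

n, labels, stats, _ = cv2.connectedComponentsWithStats(smeared)
line_blobs = [s for s in stats[1:]
              if s[cv2.CC_STAT_WIDTH] > 5 * s[cv2.CC_STAT_HEIGHT]]
print(len(line_blobs), "line-shaped blobs found on this page")
```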

I then picked lines as the darker horizontal stripes of each area. Words were separated using the OpenCV connectedComponentsWithStats function. This step produced 44966 images, some of which actually contain a couple of distinct words. About 200 of these turned out to be duplicates and were manually removed.
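
In outline, the segmentation looks like the sketch below (file names, thresholds and kernel sizes are illustrative): lines are the darker stripes of the row-ink profile of a text area, and word boxes are the connected components of each stripe after a small horizontal closing.

```python
import cv2
import numpy as np

area = cv2.imread("area_106_3.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file
_, ink = cv2.threshold(area, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Lines = contiguous runs of rows whose ink density is high enough.
row_density = ink.sum(axis=1) / 255
dark_rows = np.flatnonzero(row_density > 0.25 * row_density.max())
splits = np.flatnonzero(np.diff(dark_rows) > 1) + 1
stripes = [(g[0], g[-1] + 1) for g in np.split(dark_rows, splits) if len(g)]

# Words = connected components within each line stripe.
word_boxes = []
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (7, 1))
for top, bottom in stripes:
    strip = cv2.morphologyEx(ink[top:bottom], cv2.MORPH_CLOSE, kernel)
    _, _, stats, _ = cv2.connectedComponentsWithStats(strip)
    for x, y, w, h, a in stats[1:]:          # skip the background component
        if a > 30:                            # ignore specks
            word_boxes.append((x, top + y, w, h))
print(len(word_boxes), "word boxes extracted from this area")
```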

Each word image was preprocessed by applying a local adaptive threshold (used to enhance the ink / background contrast) and a fixed slant correction of 8.5 degrees.
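
A sketch of this preprocessing step (the block size, the constant and the shear convention are illustrative, not the exact values used):

```python
import cv2
import numpy as np

def preprocess(word_img, slant_deg=8.5):
    # Local adaptive threshold to enhance the ink / background contrast.
    binary = cv2.adaptiveThreshold(word_img, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 25, 15)
    h, w = binary.shape
    shear = np.tan(np.radians(slant_deg))
    # Fixed shear that removes the assumed slant; extra width avoids clipping.
    M = np.float32([[1, -shear, shear * h], [0, 1, 0]])
    return cv2.warpAffine(binary, M, (w + int(shear * h), h),
                          flags=cv2.INTER_NEAREST, borderValue=255)

word = cv2.imread("word_00001.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file
cv2.imwrite("word_00001_pre.png", preprocess(word))
```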

The following figure shows a couple of word images before and after preprocessing.

TRANSLITERATION ALPHABET

Melka and Stanley (“Performance of Seraphinian in reference to some statistical tests”, 2012) wrote: “The term transcription is formally avoided since it mostly undertones mapping from a sound system to script.” I follow their example and use the term “transliteration” instead.

They manually transliterated three paragraphs from the Codex, for a total of about 200 words. They often mapped sequences of repeated “minims” into different glyphs: the small text they analysed required 40 different glyphs.

Since Serafini’s text looks complex and I needed to be able to pick a sufficient number of samples for each glyph in the transliteration alphabet, I decided on a somewhat simplistic approach. This probably misses some important aspects, but it turned out to be manageable for Neural Network training. In particular:

1. all punctuation and diacritics were ignored (as Melka and Stanley did)

2. small bits of the words were treated as individual glyphs, breaking complex combinations into smaller parts.

The following table shows the 33 ASCII characters used to label the training samples. ‘?’ was used for all rare glyphs that could not be matched with any other glyph in the table. In addition to these, the space character was also used.

Several of the glyphs mapped to uppercase characters tend to appear word-initially, in particular: A, C, D, E, M, P, R, S, W, X. Glyphs that tend to be word-final are ‘b’, ‘d’, ‘j’ and ‘n’; ‘n’ could be decomposed as ‘id’, but since it is extremely frequent and the curl often overlaps vertically with the ‘i’, I chose to handle it as an individual glyph. The CS glyphs shown here were also used as a rudimentary “font” to render transliteration output. In order to do so, they had to be adapted to a horizontal arrangement; this required a considerable distortion for L in particular, which tends to appear diagonally oriented from bottom-left to top-right. This issue only affects the pseudo-font and had no impact on the Neural Network training and performance.

Z corresponds to one of the symbols that represent numbers: the other digits were not examined, but the presence of Z should be enough to highlight many numeric sequences occurring in the text. The base-21 numeric system used in the book was decoded by Allan C. Wechsler (1987) and Ivan A. Derzhanski (2004).

A major difficulty was that several glyphs show all possible intermediate shapes, e.g. ‘i’ transitions into ‘e’ and ‘r’ into ‘o’, with progressively clearer loops; ‘rg’ (horizontally developing) turns into G (sloping downward) and then into B (vertical); ‘o’ turns into ‘g’ with loops that progressively elongate from the base-line into clear descenders. Boundaries are often hard to set, and a properly developed Neural Network can probably be more consistent in classification than I am.

The three words in the image above show the transliteration of the most common glyphs.

TRAINING AND VALIDATION

About 440 images (usually containing a single word) were manually labelled. 10% of the samples were set aside for validation: the Neural Network only used this validation set to check whether optimization on the other samples also produced better results on these “unseen” samples. This helps avoid the “overfitting” problem, in which a Neural Network learns its training set perfectly but becomes unable to process anything slightly different from it.
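
The split itself is straightforward; something along these lines (the file name and the tab-separated “image<TAB>label” format are hypothetical):

```python
import random

random.seed(0)
with open("labelled_words.tsv", encoding="utf8") as f:
    samples = [line.rstrip("\n").split("\t") for line in f]
random.shuffle(samples)

# Hold out 10% of the labelled samples for validation.
cut = len(samples) // 10
validation, training = samples[:cut], samples[cut:]
print(len(training), "training samples,", len(validation), "validation samples")
```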

Since the transliteration system results in a high number of output characters per word, I used Scheidl’s “line-mode” option, originally designed to parse whole lines of text.

At the end of the training, 78/431 (18%) of validation characters were incorrectly classified. Only 12 of the 44 validation samples were correctly parsed.

  • 3 of the 32 incorrectly parsed samples were due to a single ‘e’ being read as ‘i’. As discussed above, distinguishing between these two glyphs is tricky because the script forms a continuum, ranging from ‘e’s with a clear loop to cases where the loop is very small or barely visible.
  • 4 errors were due to a single confusion between two other glyphs, in particular: l/L, R/s, o/r, M/B.
  • 7 other errors were due to a single missing or added glyph. E.g. expected: ‘Rrgiin’ / output: ‘Rrgiiin’.
  • 6 error samples resulted in an edit distance of 2 (e.g. two glyphs being confused with two other glyphs).
  • The remaining 12 samples caused a higher number of errors (from 3 to 8).

It can be said that roughly 1/4 of the words are correctly parsed, 1/4 have seriously flawed readings, and the remaining half have minor defects.

The worst validation result was ‘Mi_MOrgiifrgeeffeergO_ctg’ (p.292) that was read as ‘Bi?_BOrgiirgieffiirgO_cyg’. The recognized sequence is still visually close to the input, with possibly the missing central ‘f’ being the most obvious of the 8 differences.

‘Mi MOrgiifrgeeffeergO ctg’ — the most poorly handled validation-set sample (8 errors)

The longest of the 12 correctly transliterated validation samples is ‘EleeeeBrrgiiiLb’ (p.87)

‘EleeeeBrrgiiiLb’ — the longest correctly recognized sample in the validation set

Concatenating all the expected labels into a single text, and the actual validation-set output into a second text, shows that the neural-network output has slightly lower values for both character entropy and character conditional entropy (computed with the Entropy.java class that user nablator shared on the voynich.ninja forum). The validation set is small (fewer than 500 characters), so it is not certain that this measure is meaningful; for what it’s worth, it seems to suggest that the entropy of the actual text is higher than what the neural-network transliteration shows.
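
For reference, the two measures are the standard character entropy h1 and the conditional entropy h2 of a character given the previous one; the following Python sketch (not nablator’s Entropy.java itself, and with placeholder file names) computes them via H(X2|X1) = H(X1,X2) − H(X1).

```python
from collections import Counter
from math import log2
from pathlib import Path

def entropy(counter):
    n = sum(counter.values())
    return -sum(c / n * log2(c / n) for c in counter.values())

def h1_h2(text):
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter(a for a, _ in bigrams.elements())
    return entropy(Counter(text)), entropy(bigrams) - entropy(firsts)

expected = Path("validation_expected.txt").read_text(encoding="utf8")
actual = Path("validation_output.txt").read_text(encoding="utf8")
print("expected:", h1_h2(expected), "actual:", h1_h2(actual))
```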

TRANSLITERATION OF THE CODEX

The trained neural network was applied to the word boxes extracted from all detected text. This roughly corresponds to 80% of the whole text (labels, titles and undetected paragraphs were skipped).

The transliteration file can be downloaded from github. Lines starting with ‘#’ mark page and text-area boundaries; in a few cases, ‘#’ was also used to manually “comment-out” duplicated transliteration text. Spaces separate the output of distinct word boxes; ‘.’ was used to mark spaces that the Neural Network detected inside individual word boxes. In the following, ‘.’s are treated as spaces: the total number of transliterated words is then 47227. In a relatively small number of cases, ‘id’ was output by the neural network; in the following discussion, these cases were replaced with ‘n’.
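
A sketch of how the file can be read back following these conventions (the local file name is a placeholder, and the ‘id’ normalisation is applied here as a plain substring replacement):

```python
words = []
with open("cs_transliteration.txt", encoding="utf8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):   # page/area markers, comments
            continue
        # '.' marks a space detected inside a word box; treat it as a space.
        for token in line.replace(".", " ").split():
            words.append(token.replace("id", "n"))
print(len(words), "transliterated words")
```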

Considering the high error rate discussed above, this transliteration cannot of course be regarded as reliable. It can still be used for computing sufficiently robust statistics and extracting features for subsequent human checking.

The image above shows the first two lines of the “stag tree” paragraph from p.91. The top shows the original CS text with Melka and Stanley’s transliteration. The bottom shows the Neural-Network output together with the synthetic recreation of the text.

The first line of the NN transliteration contains one clear error (‘i’ for ‘O’ in the second word). In the second line, the second glyph ‘o’ was read as ‘r’ and a spurious ‘e’ was added to the first word; in the fourth word, the initial ‘M’ was read as ‘B’ and an ‘u’ was added. In total, 5 errors in 3 different words appear in the two lines. It should be noted that Melka and Stanley’s transliteration also has some dubious points: e.g. the final glyphs of the first and second words are mapped to the same character, though only the first has a loop; and both lines’ transcriptions start with ‘d’, though here too the glyphs look different.

The longest sequence that the neural network detected in two different passages is ‘Srrgiin Srrgiin Srrgiiiffrrgiin’.

‘Srrgiin Srrgiin Srrgiiiffrrgiin’ p.206, right column, 4th line from the bottom; p.216, right column, fourth line from the top.

The longest word detected to occur more than once is ‘SrrgiiiGrrgiiiffrrgiin’. Actually, the occurrence at p.194 has a single ‘i’ before the final -n.

‘SrrgiiiGrrgiiiffrrgin’ p.194, left column, second line from the bottom; ‘SrrgiiiGrrgiiiffrrgiin’ p.206, right column, fifth line from the bottom.

The two occurrences of the slightly shorter ‘prrrgOeiiiffrrgiiLj’ look more consistent and could indeed be regarded as two instances of a single word type.

‘prrrgOeiiiffrrgiiLj’ — p.63, left column, second complete line from the bottom; p.75, left column, second line from the bottom.

The original edition of the CS was printed in two volumes, with p.179 in the PDF being the first page of the second volume. The following table shows character statistics for the neural-network transliteration of the whole text and of each of the two volumes (V1, 21641 transliterated words, and V2, 25586 words).

The greatest differences between the two volumes are highlighted in green (higher frequency) and magenta (lower frequency). The shift between ‘i’ and ‘e’ results in a similar total for the two glyphs in V1 and V2; since the two glyphs form a continuum, this possibly just marks a minimal drift in the handwriting. On the other hand, ‘E’ and ‘S’ are quite different initial glyphs, and there is no reason to expect a systematic confusion between the two. Plotting the number of words starting with E/S in pages containing at least 200 words clearly shows a sharp change in perfect correspondence with the V1/V2 boundary.

Frequency of word-initial E- / S- in pages with at least 200 words

The following table shows the top 20 most frequent words with a length of at least 3 glyphs in Volume 1 and 2.

In order to get a further idea of the accuracy of the Neural Network output, the following are collections of 18 random samples for two words from V1 (Eliiifrrj) and V2 (Srrgiiin). It’s easy to spot errors, e.g. ‘e’ read ‘l’ at the top and bottom of the Eliiifrrj image; miscounting i-sequences for both words. But the comparison confirms that the system does capture significant differences.

Randomly selected samples of words recognized as ‘Eliiifrrj’ and ‘Srrgiiin’

I highlighted ‘E’ and ‘S’ in two fragments from V1 (p.47) and V2 (p.206). This exemplifies the change in distribution between the two volumes.

Occurrences of E- and S- in two passages from Volume1 (p.47) and Volume2 (p.206)

COMPARISON WITH OTHER TEXTS

The CS transliteration was compared with four natural language texts and with the Voynich manuscript.

The language texts used are:

EN “Alice in Wonderland”

FR “Albertine Disparue” Chapter 2 (a passage appears in CS p.212)

IT “Divina Commedia”

LA “De Bello Gallico”

Punctuation was removed and both a mixed case and a lowercase-only version (prefixed “l”) were processed.

For the Voynich text, the Zandbergen-Landini ZL_ivtff_2b.txt transliteration was used (ignoring uncertain spaces). Only paragraph text was considered. Currier A and B were processed separately and the text was encoded both as the original EVA and as CUVA (conversion table here). Voynich samples are marked V; E and C define the encoding; A and B the Currier subsets.

Samples for CS are marked S. The two volumes (V1 and V2) were processed separately. In addition to the output of the neural network, the text was processed to produce a more compact output:

Ss (single): all repeating characters are reduced to a single occurrence. “Eliiifrrj Srrgiiin” is rendered as “Elifrj Srgin”

Sb (bigram / trigram). Some frequent bigrams and trigrams were replaced with a single character:

  • iii->w,
  • rrg->Q,
  • ii->v,
  • rg->q,
  • ff->F,
  • rj->J,
  • El->V,
  • in->m,
  • ee->U,
  • LB->K,
  • Lj->k.

For instance: “Eliiifrrj Srrgiiin” -> “VwfrJ SQwn”
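
Both encodings are easy to reproduce; the following sketch applies the substitutions in the order listed above, which matches the example:

```python
import re

SUBS = [("iii", "w"), ("rrg", "Q"), ("ii", "v"), ("rg", "q"), ("ff", "F"),
        ("rj", "J"), ("El", "V"), ("in", "m"), ("ee", "U"), ("LB", "K"),
        ("Lj", "k")]

def encode_single(word):
    # collapse any run of a repeated character to a single occurrence
    return re.sub(r"(.)\1+", r"\1", word)

def encode_bigram(word):
    # replace frequent bigrams/trigrams with single characters
    for old, new in SUBS:
        word = word.replace(old, new)
    return word

print(encode_single("Eliiifrrj"), encode_single("Srrgiiin"))  # Elifrj Srgin
print(encode_bigram("Eliiifrrj"), encode_bigram("Srrgiiin"))  # VwfrJ SQwn
```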

A few scatter plots are discussed below. Of course, the values plotted for the Codex Seraphinianus must always be taken with a big grain of salt.

The plot above shows alphabet size (X) vs the frequency of the most frequent character (Y). The raw CS transliteration results in a very high frequency for ‘i’, but this is entirely corrected by both the “bigram” and “single” encodings.

Mean and standard deviation of word lengths

The mean and standard deviation of word length are also easily adjusted to come close to the examined natural languages. This is not the case for the Voynich manuscript, where both very short and very long words are rare, resulting in a particularly low standard deviation.

Character entropy and character conditional entropy (computed on the first 54000 characters of each text)

On the other hand, character entropy and character conditional entropy behave similarly to the Voynich manuscript. Even when several frequent bigrams and trigrams are replaced with single characters, samples from the CS remain far from natural language samples. Similar plots were published by Koen Gheuens who performed an extensive set of experiments with different substitutions in the context of Voynichese. As discussed above, it is possible that the neural network artificially reduces both h1 and h2, but my impression is that this effect is not very large.

MATTR (moving average type-token ratio) was also imported by Koen Gheuens into the field of Voynich research. It was further explored in this paper by Luke Lindemann.

The plot shows MATTR computed on a window of 5 words (X) vs a window of 200 words (Y). Y values are mostly comparable with those of Voynichese and inflected languages like Latin and Italian. X values are lower for Voynichese than for the Latin languages, and lower still for the CS. This means that the same word often tends to repeat within 5 words of a previous occurrence.
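
MATTR is simply the mean type-token ratio over a sliding window of fixed size; a minimal (unoptimised) sketch:

```python
def mattr(words, window):
    # type-token ratio = distinct words / window size, averaged over all windows
    if len(words) <= window:
        return len(set(words)) / len(words)
    ratios = [len(set(words[i:i + window])) / window
              for i in range(len(words) - window + 1)]
    return sum(ratios) / len(ratios)

# e.g. mattr(words, 5) for the X axis and mattr(words, 200) for the Y axis
```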

Perfect reduplication (X); partial reduplication (Y)

The plot above shows the frequency of perfect reduplication (X) and partial reduplication (Y) as a percentage of the number of words in the text. Partial reduplication is here defined as an edit distance of 1 between two consecutive words that are at least 3 characters long. Perfect reduplication is the immediate repetition of the same word (words consisting of a single character were ignored). All language samples cluster near the origin, with values very close to zero for both measures. CS Volume1 shows reduplication at roughly half the rate observed in the Voynich manuscript. Volume2 has values close to the Voynich manuscript, or higher if words are processed by compressing consecutive repetitions of the same character. Partial reduplication is close to 1%: very high when compared with language texts, but lower than the values for the Voynich manuscript. Examples are provided below.
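
Both measures can be computed with a plain Levenshtein distance over consecutive word pairs; a sketch following the definitions above:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def reduplication_rates(words):
    pairs = list(zip(words, words[1:]))
    # perfect: immediate repetition, ignoring single-character words
    perfect = sum(a == b and len(a) > 1 for a, b in pairs)
    # partial: edit distance 1 between consecutive words of length >= 3
    partial = sum(len(a) >= 3 and len(b) >= 3 and edit_distance(a, b) == 1
                  for a, b in pairs)
    return 100 * perfect / len(words), 100 * partial / len(words)
```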

MATTR5 values for the CS are lower than those for the Voynich manuscript, but reduplication rates are comparable, so the difference must be due to some other factor. I generated another plot to check the X.a.X pattern, where two identical word tokens are separated by a third word token.

X shows the total percentage of the X.a.X pattern; Y shows the same measure only considering X with a length of 3 or more. It seems clear that the high MATTR5 values for the CS are due to very short words. Many of these will be OCR errors, or irrelevant sequences like “Z word Z”, “? word ?”. I think it likely that MATTR values for the CS transliteration are underestimated.

This final plot shows the frequency of the most frequent line-initial character (X) compared with the frequency of that same character as word-initial in any line position (Y). The Y value is 1 if the character tends to be word-initial and line-initial with the same frequency; a stronger preference for line-initial positions results in values greater than 1. Lowercase samples for language texts have values close to 1, with the exception of IT (Dante), which (being poetry) is affected by grammar: each verse tends to be a complete sentence. Mixed-case samples have values smaller than 1, since uppercase letters often appear at the beginning of lines, making e.g. ‘t-’ in English more likely to be word-initial mid-line than line-initially, where it is penalised by the many occurrences of ‘T-’.
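
As I read it, the measure behind the plot is the ratio between a character’s line-initial frequency and its overall word-initial frequency; a sketch under that assumption:

```python
from collections import Counter

def line_initial_preference(lines):
    # `lines` is assumed to be a list of lines, each a list of word strings
    lines = [line for line in lines if line]
    line_initials = Counter(line[0][0] for line in lines)
    word_initials = Counter(word[0] for line in lines for word in line)
    char, count = line_initials.most_common(1)[0]
    x = count / len(lines)                                   # line-initial frequency
    y = x / (word_initials[char] / sum(word_initials.values()))  # preference ratio
    return char, x, y
```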

For Voynich A, EVA ‘o’ appears to be “neutral”, with the same preference for the word-initial position independently of also being line-initial. On the other hand, ‘D’ in the CUVA-encoded B sample is more than twice as frequent line-initially as it is in general, showing a strong preference to occur there. CS Volume1 shows E as the most frequent line-initial character, with a similar frequency word-initially in general. Volume 2 instead shows that the prefix ‘r’ / ‘rrg’ (encoded ‘Q’ in Sb) is 3 times more frequent line-initially than word-initially in general.

In the following detail from the left column of p.347, I highlighted word-initial occurrences of ‘rrg’. The preference for the line initial position is noticeable.

Page 347: word-initial ‘rrg’ concentrates at the start of lines

CONSECUTIVELY REPEATED WORDS (REDUPLICATION)

As discussed above, the consecutive repetition of identical words appears to be frequent in the CS.

The transliteration file contains 10 sequences of words that are consecutively repeated 3 or more times. These are the corresponding details from pages 115, 133, 245, 256, 262, 275, 275, 314, 319, 325.

Words repeated three or more times as detected by the neural network

The 7th sample (p.275) is clearly a hallucination due to insufficient training on punctuation samples. All the other samples appear to be valid. Samples 3 (p.245) and 8 (p.314) could be sequences of five consecutive repetitions of the words Eliiin and Srrj respectively. If one considers the difference between ‘e’ and ‘i’, the first and last tokens in the p.245 sequence are clearly different, so in this case the number of repetitions is lower.

The total number of repeated words at least two characters long (not counting those that include ‘Z’ and therefore appear to be numbers) is 259: 63 in Volume1 and 196 in Volume2. Some of these will certainly be spurious, but it is also certain that the phenomenon of reduplication is frequent in the Codex.

The longest repeating words detected are:

p.194 right L17 SrrgiiiBrrgiin SrrgiiiBrrgiin

p.21 right L16 SrrgiiiBrrgiin SrrgiiiBrrgiin

p.28 left L17 EleeeBrrgiiLb EleeeBrrgiiLb

p.345 left L15 SrrgOiiirrgin SrrgOiiirrgin

Repetition of ‘SrrgiiiBrrgiin’ (top two images), ‘EleeeBrrgiiLb’ and ‘SrrgOiiirrgin’

CONCLUSIONS

The Neural Network system developed by Harald Scheidl appears to be perfectly fit for processing the Codex Seraphinianus. The results presented here were produced with only a few days of work on an old laptop: I have no doubt that more work with more computing power would produce better results. Possibly it would be enough to increase the size of the training set by manually labelling more words, but it’s likely that a revision of the transliteration alphabet would also be beneficial.

However unreliable, the transliteration discussed here was enough to highlight some properties shared with the Voynich manuscript but not with language texts:

  • the existence of two distinct sections with different glyph frequencies;
  • high rates of both perfect and partial reduplication;
  • line effects (in particular the preference for certain glyphs to appear line-initially).
