Neural-Network Handwritten Text Recognition for the Voynich manuscript

Marco Ponzi
Published in ViridisGreen · Jan 31, 2023

The subject of this post is a supervised OCR system for the automatic transliteration of handwritten text in the Voynich manuscript (Beinecke MS 408, an undeciphered 15th-century manuscript written in a unique alphabet).

These experiments are based on the neural-network software that Harald Scheidl discussed in his 2018 online article “Build a Handwritten Text Recognition System using TensorFlow”. Before continuing, I suggest you read Scheidl’s article, which is clearly presented and illustrated.

I used the newest code version of Scheidl’s Python software: the only update needed to make it work was installing TensorFlow 1.15.0 instead of 1.3.
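For reference, a quick way to confirm that the intended version is active (nothing here is specific to Scheidl’s code):

import tensorflow as tf
# Scheidl's code uses the TensorFlow 1.x API; 1.15.0 is the version that worked here
print(tf.__version__)  # expected output: 1.15.0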

Dataset creation

Scheidl’s discussion is based on the IAM dataset of handwritten text. In order to experiment with Voynichese, I needed a new dataset with the same structure but with samples extracted from the Voynich manuscript. The best way to do this would probably be to use the word rectangles from the XML files underlying voynichese.com, but I was unable to work out which images they apply to (the coordinates do not seem to be consistent with the Voynich scans at voynichese.com). I therefore wrote a few custom scripts to extract labelled samples from the Beinecke scans of the manuscript.

The first step is manually removing all non-word details from the page. This step is not strictly necessary, but it significantly improves the results.

Text from f46r

The second step was implemented in Python / OpenCV. The image of a page is processed to extract boxes for individual words. First, horizontal regions corresponding to individual lines are identified. I allow for a couple of degrees of rotation and for a slight distortion of the line: barely enough to make this process reliable, given how irregular the text lines in the Voynich manuscript are.

Detected line positions
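The general idea can be sketched with a horizontal projection profile. This is a minimal illustration, not my actual script: it ignores the rotation and distortion handling mentioned above, and the file name and threshold are assumptions.

import cv2
import numpy as np

img = cv2.imread("f46r_clean.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
ink = 255 - img                        # dark ink becomes high values
profile = ink.sum(axis=1)              # total ink per pixel row
inked = profile > 0.1 * profile.max()  # illustrative threshold
bands, start = [], None                # group consecutive inked rows into line bands
for y, on in enumerate(inked):
    if on and start is None:
        start = y
    elif not on and start is not None:
        bands.append((start, y))
        start = None
if start is not None:
    bands.append((start, len(inked)))
print(bands)                           # (top, bottom) pixel rows of detected lines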

Line images are processed individually to identify areas containing writing. This is done simply by blurring the line image and applying an adaptive threshold. Every pixel blob that is large enough is enclosed in a box. Boxes drawn in white were rejected as noise (typically because they are too small or nested inside larger boxes). Purple boxes are considered to contain writing. Sets of purple boxes that are close to each other are merged into larger green boxes, each hopefully corresponding to something between part of a word and a couple of words.

Word boxes in part of f46r, line 4
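A sketch of this pipeline with illustrative OpenCV parameters (blur size, threshold block size, area and gap thresholds are all assumptions, as is the file name):

import cv2

line = cv2.imread("f46r_line4.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
blurred = cv2.GaussianBlur(line, (15, 15), 0)
binary = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY_INV, 51, 10)
# OpenCV 4 return signature: (contours, hierarchy)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]   # (x, y, w, h) per blob
boxes = [b for b in boxes if b[2] * b[3] > 100]   # reject small blobs as noise
boxes.sort(key=lambda b: b[0])
merged = [list(boxes[0])] if boxes else []        # merge close boxes into "green" boxes
for x, y, w, h in boxes[1:]:
    px, py, pw, ph = merged[-1]
    if x - (px + pw) < 12:                        # illustrative gap threshold
        merged[-1] = [px, min(py, y), max(px + pw, x + w) - px,
                      max(py + ph, y + h) - min(py, y)]
    else:
        merged.append([x, y, w, h])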

Another Python script compares the widths of the green rectangles in a line with the lengths of EVA words for that page in the Zandbergen-Landini transliteration. Various combinations of merging image boxes and text words are tried, looking for a way to make box widths and EVA lengths consistently match. When the process is successful (this happens for about 50% of the lines) the boxes are labelled with the corresponding EVA text and they are written as individual images to be used to train the neural network. Uncertain spaces are counted as characters when matching boxes and text, but they are not included in the labels; on the other hand, spaces are included in the labels of boxes that include more than one word. For instance, three green boxes are merged to form a single final box labelled “chckhy.sheky”.

Text box labelled ‘chckhy.sheky’
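The core of the matching heuristic could look roughly like the following. This is a simplified greedy version, while the actual script tries various merge combinations; px_per_char and tol are invented values.

def match_line(box_widths, eva_words, px_per_char=14.0, tol=0.25):
    # Align green-box widths with the expected pixel lengths of EVA words.
    labels, i = [], 0
    for width in box_widths:
        joined, matched = "", False
        while i < len(eva_words) and not matched:
            candidate = joined + ("." if joined else "") + eva_words[i]
            expected = len(candidate) * px_per_char
            if abs(width - expected) <= tol * expected:
                joined, i, matched = candidate, i + 1, True
            elif expected < width:   # box wider than the word: merge the next word
                joined, i = candidate, i + 1
            else:
                return None          # no consistent match for this line
        if not matched:
            return None
        labels.append(joined)        # multi-word boxes keep the '.' separator
    return labels if i == len(eva_words) else None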

These are the nine training samples produced for the line we are discussing:

The matching process described above often results in false positives (boxes that only include part of a word): these were manually removed, but some likely still appear in the training set. In general, the quality of the dataset is far from ideal, with boxes that could be more accurately cropped and words that form a significant angle with the box border; apparently, the neural network was able to cope with this level of noise.

I used a bash script to augment the training set. Each sample is a black-and-white image normalized to the 0–255 range; for each sample, several synthetic samples are created by applying small alterations (a Python sketch of comparable operations follows the list):

  • rotating the image
  • adding a border
  • applying a slant distortion
  • changing the distribution of grey levels (gamma correction)
  • stretching the image horizontally and vertically
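In Python/OpenCV, comparable alterations might look like this (my script was bash-based; all function choices and parameter ranges below are assumptions):

import cv2
import numpy as np

def augment(img):
    h, w = img.shape
    angle = np.random.uniform(-2, 2)                       # small rotation
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    out = cv2.warpAffine(img, M, (w, h), borderValue=255)
    out = cv2.copyMakeBorder(out, 4, 4, 4, 4,              # add a border
                             cv2.BORDER_CONSTANT, value=255)
    shear = np.random.uniform(-0.1, 0.1)                   # slant distortion
    S = np.float32([[1, shear, 0], [0, 1, 0]])
    out = cv2.warpAffine(out, S, (out.shape[1], out.shape[0]), borderValue=255)
    gamma = np.random.uniform(0.7, 1.4)                    # gamma correction
    out = (255 * (out / 255.0) ** gamma).astype(np.uint8)
    fx, fy = np.random.uniform(0.9, 1.1, size=2)           # stretching
    return cv2.resize(out, None, fx=fx, fy=fy)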

These are the resulting samples for the first word in the line:

Two more samples are created by writing the label with the “eva hand 1.ttf” font and processing the result to remove some details. Background noise is added and some of the distortion methods described above are applied to these two samples too:
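A sketch of this synthetic-sample step using PIL; the font file is the one named above, while the size, geometry and noise level are assumptions:

from PIL import Image, ImageDraw, ImageFont
import numpy as np

font = ImageFont.truetype("eva hand 1.ttf", 48)
img = Image.new("L", (300, 80), color=255)                  # white background
ImageDraw.Draw(img).text((10, 10), "daiin", font=font, fill=0)
arr = np.array(img, dtype=np.int16)
arr += np.random.normal(0, 12, arr.shape).astype(np.int16)  # background noise
arr = np.clip(arr, 0, 255).astype(np.uint8)
Image.fromarray(arr).save("synthetic_daiin.png")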

780 text boxes were extracted from the manuscript. The samples contained a total of 867 word tokens. Each box resulted in 7 training samples, for a total of 5460 samples.

The samples were extracted from the following 22 pages:

f5r f5v f10r f10v f11r f11v f20r f20v f44r f44v (Scribe 1)

f31v f33r f34v f43v f46r f80v f81v f85r1 (Scribe 2)

f104r f104v f106r f106v (Scribe 3)

I organized the samples in a directory hierarchy similar to that of the IAM dataset and added an index file “words.txt”, also modelled on the IAM format.
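For reference, each line of an IAM-style “words.txt” carries a sample id, a segmentation flag, a grey level, the bounding box (x, y, w, h), a tag, and the transcription. A hypothetical Voynich entry (all field values below are invented for illustration) could look like:

f46r-04-02 ok 128 412 633 157 48 XX chckhy.sheky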

In the labels, I replaced benches and benched gallows with single uppercase characters (C, S and F, K, P, T), though F is actually not represented in the dataset.

The following 30 characters are included in the dataset labels:

. ‘?@CKPSTacdefghijklmnopqrstxy

Here ‘.’ corresponds to a space between words and was actually encoded with the space character ‘ ’.

‘?’ corresponds to unreadable characters and ‘@’ was used to represent all high-ASCII characters: these two symbols occur in a total of only 5 dataset words.

Since benches and benched-gallows are encoded as single uppercase characters, there are not many remaining occurrences of ‘c’ and ‘h’.

Training and performance

I ran Scheidl’s neural network software on the dataset described above. The training process splits dataset words into two sets:

  • 95% of the words are fed to the neural network;
  • 5% of the words are used to validate the network’s performance at each epoch. Training stops when performance on the validation set does not improve for a user-specified number of epochs (30, in this case).

The 5% validation set is needed to avoid over-fitting, i.e. ending up with a neural network that handles its training set perfectly but is somehow “fixated” on it and unable to recognize samples it didn’t “see” before.
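Schematically, the stopping rule amounts to the following (a sketch, not Scheidl’s actual loop; the three callables stand in for what the training code actually does):

def train_with_early_stopping(train_epoch, evaluate_validation, save_model, patience=30):
    best_cer, since_best, epoch = float("inf"), 0, 0
    while since_best < patience:
        epoch += 1
        train_epoch()                # one pass over the 95% training split
        cer = evaluate_validation()  # character error rate on the 5% split
        if cer < best_cer:
            best_cer, since_best = cer, 0
            save_model()             # keep the best-performing snapshot
        else:
            since_best += 1
    return best_cer, epoch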

Training took about 5 hours on the CPU of my old laptop. Results at the end of training (124 epochs) were:

Character error rate: 7.63%.

Word accuracy: 69.23%.

The following are the results of running the neural network on lines that were not included in the training dataset. I manually extracted images for each word; for each line, the normalized grey-scale images that were fed to the software are shown.

Scribe 1, f8v, line 15

NN output: “or”:0.979 “Col”:0.998 “Caas”:0.394 “Ceky”:0.995 “Cor”:0.997 “Ceain”:0.994 “Cal”:0.666 “Ceeky”:0.988 “Cor”:0.996 “y”:0.991

ZL transliteration:

or.chol.chan.ch{ck}y.<->chor.cheain.char.cheeky.chor.ry

Here 4 words out of 10 are not correctly classified: ‘chaas’ for the expected ‘chan’, ‘cheky’ for ‘ch{ck}y’, ‘chal’ for ‘char’, and ‘y’ for ‘ry’. It should be noted that, because benches and benched gallows were replaced in the training labels, ‘c’ and ‘h’ only appeared outside such combinations, so the error on the fourth word is somewhat expected. The serious error on the third word at least comes with a relatively low confidence score (0.394); the chal/char error has only a marginally lower score than the rest. The error on the last word, ‘ry’, is further discussed below.

Scribe 4 (not included in the training set), f68r2, line 4

NN output: “Col”:0.991 “Cesy”:0.325 “oram”:0.905 “qoeKy”:0.948 “qoKeol”:0.996 “qokeol”:0.964 “Ceool”:0.667 “dColal”:0.417

ZL transliteration: chor.chesy.or[ee:a]d.qoeeckhy.qockheol.qokeol.cheoal.dcholal

Here 50% of the words contain errors (4 out of 8). This poor performance is partially explained by the scribe (whom the NN did not have a chance to see during training) and by the presence of some relatively bizarre characters (the ‘ee’ sequences in words 3 and 4).

Scribe 2, f82r, line 21

NN output: “daiin”:0.999 “Ceoky”:0.947 “lkedy”:0.996 “sol Sey”:0.639 “daiin”:0.999 “Cey”:0.998 “qol”:0.997 “Cedy”:0.997 “qokeey”:0.998 “dal”:0.947

ZL transliteration: daiin.cheoky.lkedy.salshey.daiin.chey.qol.chedy.qokeey.dal

Here the only error is ‘salshey’, where a space is detected mid-word and the ‘a’ is misread as an ‘o’.

My overall impression is that the ~70% word accuracy reported at the end of training is probably optimistic for the whole text, even if accurate boxes for single words are provided. But considering the limited extent and the poor quality of the training set, I think that the system is performing very well. A better training set would improve performance, though it’s difficult to predict the extent of such an improvement.

Neural-network output matrices

In the following, I visually show the output of the neural network, with the values of the whole matrix normalized to the 0–1 range.

Columns in the matrix are referred to as “time steps” (since writing occurs left to right in this case, the columns actually represent information that was encoded by the scribe in that temporal order).

Each row in the matrix corresponds to a single character in the target alphabet. For readability, I do not show 6 rows corresponding to characters that only rarely occur in the dataset. The number of columns in the output depends on the geometry of the image: wider images result in more columns. Scheidl’s software is configured for a maximum of 32 time-steps.

These are the matrices for the two occurrences of ‘daiin’ in f82r, line 21. The top row in a matrix corresponds to the space character that sometimes appears in my dataset when an image includes more than one word. The bottom row is a ‘blank’ character whose high values mark time-steps where the neural network could not detect any character; this appears as a 31st character in the alphabet.

The identified word and its corresponding probability are computed from this matrix according to a method called Connectionist Temporal Classification (CTC). During this process, the highest-scoring character is collected for each column (time-step). For instance, for the image on the left, this step results in:

-----daaiii-iii-n

where ‘-’ represents the bottom row ‘blank’.

Now all consecutively repeated characters are removed, resulting in:

-dai-i-n

Finally, the blanks are removed, giving ‘daiin’.
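The best-path step can be sketched in a few lines of Python, assuming the matrix is laid out as described above (rows = characters with the blank last, columns = time-steps):

import numpy as np

def best_path_decode(mat, alphabet):
    blank = len(alphabet)                 # the blank occupies the extra bottom row
    best = np.argmax(mat, axis=0)         # highest-scoring character per column
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:  # drop repeats first, then drop blanks
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

For the left-hand matrix above, this reproduces the reading ‘daiin’.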

The probability of an individual decoding is computed as the product of the probabilities for each time-step. There are many paths in a single matrix that result in the same word, e.g.:

--ddddaaiii-iii-n
ddd-aai--iiiii--n
dai----------in--
--d--a--i--i--n--

The probability for the word is the sum of the probabilities for all such possible paths. As Scheidl discusses in another post, CTC makes it possible to train the NN on images of whole words that are labelled only with the word itself: the positions of individual characters are learned by the network during training.
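For a toy matrix, the sum over paths can be verified by brute force (exponential in the number of time-steps, so purely illustrative; practical decoders use the CTC forward algorithm instead):

import itertools
import numpy as np

def word_probability(mat, alphabet, word):
    # Sum the probability of every path that collapses to `word`.
    blank = len(alphabet)
    n_chars, n_steps = mat.shape
    total = 0.0
    for path in itertools.product(range(n_chars), repeat=n_steps):
        out, prev = [], None
        for idx in path:
            if idx != prev and idx != blank:
                out.append(alphabet[idx])
            prev = idx
        if "".join(out) == word:
            p = 1.0
            for t, idx in enumerate(path):
                p *= mat[idx, t]   # product over time-steps, as above
            total += p
    return total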

A side effect of this method is that the matrix and the input image are not exactly aligned. For instance, the last character of the word often activates the rightmost column of the matrix, even if the image contains some space near its right-side border.

The following two plots show the matrices for f68r2.line4 ‘qoeeckhy’ (misread as ‘qoeKy’ = ‘qoeckhy’, with a single ‘e’) and f82r.line21 ‘qokeey’ (which was correctly identified). The first plot clearly shows that the problem is with the anomalous ‘e’s: this causes a lower probability, as well as the more serious problem of a single cell in the ‘e’ row being the highest value for its column, hence a single ‘e’ instead of ‘ee’. It is possible that the noisy background also contributes to the problem. The matrix on the right shows two clear ‘e’ spikes separated by a darker cell corresponding to an occurrence of the bottom-row ‘blank’, allowing for the correct reading ‘ee’.

The following matrices illustrate another error, where the last word in f8v.line15 is read as ‘y’ instead of ‘ry’.

Here the problem is likely due to the different contrast level between the two letters, with ‘r’ being blotched and quite dark and ‘y’ being faded and barely readable. If the image is processed by applying a local adaptive threshold (right-side plot) the word is correctly identified. It seems strange that the most clearly readable character is not “seen” by the NN, which focuses on the faded ‘y’ instead. Anyway, this example possibly shows that min-max normalization is not always a sufficient preprocessing step for both the training samples and the words submitted for identification.
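The adaptive-threshold fix can be sketched as follows (parameter values and the file name are illustrative):

import cv2

word = cv2.imread("f8v_ry.png", cv2.IMREAD_GRAYSCALE)      # hypothetical file name
norm = cv2.normalize(word, None, 0, 255, cv2.NORM_MINMAX)  # global min-max: 'r' stays dark, 'y' faint
fixed = cv2.adaptiveThreshold(word, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                              cv2.THRESH_BINARY, 31, 15)   # local threshold recovers both letters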

This is the processing of the word “thronu[m]” from the St. Gall Latin manuscript CSG 942. The image on the left shows that the system fits the word to something that is close to an acceptable Voynichese word (EVA: dyeoaiin), but the resulting probability is rather low (20%). The image on the right was edited according to my guess of how the NN interpreted the word, and this considerably increases the probability of the recognized word (60%).

The previous example illustrates another interesting feature of this system: during training, the neural network also learns about the structure of Voynichese words, so that it can use context to identify the correct character. This can also be shown by running the system on random input: the recognized word has a very low confidence (2%), but it is almost plausible according to Voynichese morphology (EVA: chdody).

Conclusions

The technology discussed by Scheidl is promising for the creation of a full machine transcription of the Voynich manuscript. Of course, rather than simply applying what he developed for a rather different use-case, custom software could yield better results (Scheidl provides many suggestions about ways to improve the performance of the NN for different scenarios). A robust extraction of words from page images should be implemented, e.g. by using the voynichese.com XML files or, with more effort, by more neural-network technology. It is quite possible that line-level or even paragraph-level processing could also be successfully tackled.
