Using Python and Conditional Random Fields for Latin word segmentation
Word segmentation is a common task for texts in languages like Chinese and Japanese, where words are not separated by spaces or other delimiters. It is less well known that in the Roman empire, the Latin alphabet was likewise written without any delimiting characters for several centuries (scriptio continua). Creating a script for the segmentation of Latin text is therefore a worthwhile task and could be of use to historians. One possible scenario: OCR techniques automatically turn images into text written in scriptio continua, and a second module then segments the text, making the task of translation a little less tedious.
In this article, a CRF (Conditional Random Field) will be trained to segment Latin text. Using only very basic features and easily accessible training data, we are going to achieve a segmentation accuracy of nearly 98%.
What are Conditional Random Fields?
Conditional random fields are a machine learning model for sequence labelling. Unlike more widely known algorithms such as SVMs, CRFs do not label objects in isolation. Instead, they exploit the fact that in many tasks the order of the observed objects influences their meaning: when making sense of text, we don't look at each word on its own but also at how the words relate to each other.
CRFs are related to Hidden Markov Models. HMMs, however, are a generative model, while CRFs are discriminative. CRFs cannot be used to generate artificial object sequences, but their classification performance is in most cases better than that of HMMs. If you are interested in the theoretical aspects of CRFs, you should take a look at this tutorial.
Getting the training data
The Latin Library contains a huge collection of freely accessible Latin texts. Let’s make use of it.
We take the content of the first 49 author pages, although we will not even need this much data later on.
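The scraping step could look roughly like the sketch below. The base URL, index page name and the helper names `fetch` and `hrefs` are assumptions for illustration; a simple regex is enough for the Latin Library's plain HTML.

```python
import re
import urllib.request

BASE = "http://www.thelatinlibrary.com/"  # assumed base URL

def fetch(url):
    # Download one page; decode permissively, since the site's encoding varies.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def hrefs(html):
    # All link targets on a page; a regex suffices for this plain HTML.
    return re.findall(r'href="([^"]+)"', html)

# With network access, the author pages would be collected like this:
# author_links = hrefs(fetch(BASE + "indices.html"))[:49]  # index name assumed
# author_pages = {link: fetch(BASE + link) for link in author_links}

sample = '<a href="caesar.html">Caesar</a><a href="cicero.html">Cicero</a>'
print(hrefs(sample))  # → ['caesar.html', 'cicero.html']
```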
Next, create a list of all links pointing to Latin texts. The Latin Library uses a consistent format that makes these links easy to find: each of them contains the name of the text's author.
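Filtering for text links might then be sketched as follows; `text_links` is a hypothetical helper name and the sample HTML is made up for illustration.

```python
import re

def text_links(author_html, author):
    # The Latin Library convention: links to actual texts embed the
    # author's name, e.g. "caesar/gall1.shtml" on Caesar's page.
    links = re.findall(r'href="([^"]+)"', author_html)
    return [link for link in links if author in link]

sample = ('<a href="caesar/gall1.shtml">Liber I</a>'
          '<a href="index.html">Home</a>')
print(text_links(sample, "caesar"))  # → ['caesar/gall1.shtml']
```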
It would be a lot of text if we added every available book to our training data. Let's just take the first 200 book pages.
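A minimal sketch of collecting the book pages and stripping their markup; the tag-stripping regex is crude but adequate for these simple pages, and the `fetch` helper and link list are assumed from the previous steps.

```python
import re

def strip_html(page):
    # Replace tags with spaces, then collapse whitespace; crude but
    # adequate for the Latin Library's simple markup.
    text = re.sub(r"<[^>]+>", " ", page)
    return re.sub(r"\s+", " ", text).strip()

# With the link list from the previous step (needs network access):
# book_pages = [strip_html(fetch(BASE + link)) for link in all_text_links[:200]]

sample = "<p>Gallia est <b>omnis</b> divisa in partes tres.</p>"
print(strip_html(sample))  # → 'Gallia est omnis divisa in partes tres.'
```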
We will train our CRF on individual sentences. Thus, we next split the complete text into sentences and remove very short “sentences”. Sometimes names are abbreviated (like M. for Marcus), which could cause some confusion, but we will ignore that for now.
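Sentence splitting can be sketched with a simple regex; the threshold of five words is an arbitrary choice for this illustration.

```python
import re

def split_sentences(text, min_words=5):
    # Split on sentence-final punctuation; drop very short fragments,
    # which are mostly abbreviations (e.g. "M." for Marcus) and headings.
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if len(p.split()) >= min_words]

text = ("Gallia est omnis divisa in partes tres. M. Tullius. "
        "Quousque tandem abutere Catilina patientia nostra.")
print(split_sentences(text))
# → ['Gallia est omnis divisa in partes tres',
#    'Quousque tandem abutere Catilina patientia nostra']
```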
Our CRF will not be able to see space characters; instead, it is going to learn where to insert them. We now transform the sentences so that every sentence is represented by its individual characters. Each character is labelled “0” if it doesn’t start a new word and “1” if it does.
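The transformation might look like this; `to_chars_and_labels` is a hypothetical name, and the labels are strings because python-crfsuite expects string labels.

```python
def to_chars_and_labels(sentence):
    # Remove spaces; label each remaining character "1" if it starts
    # a word, "0" otherwise.
    chars, labels = [], []
    for word in sentence.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("1" if i == 0 else "0")
    return chars, labels

chars, labels = to_chars_and_labels("arma virumque cano")
print("".join(chars))   # → 'armavirumquecano'
print("".join(labels))  # → '1000100000001000'
```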
Creating features and training a CRF
Finally, we can create some features. We will keep them really simple and only use character n-grams.
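A minimal n-gram feature extractor could look as follows; the feature names and the choice of n up to 4 are assumptions for illustration. python-crfsuite accepts a plain list of feature strings per position.

```python
def char_features(chars, i, max_n=4):
    # n-grams ending at and starting at position i; feature names are
    # arbitrary strings, which pycrfsuite treats as binary indicators.
    feats = ["bias", "char=" + chars[i]]
    for n in range(2, max_n + 1):
        if i >= n - 1:
            feats.append(f"prev{n}=" + "".join(chars[i - n + 1 : i + 1]))
        if i + n <= len(chars):
            feats.append(f"next{n}=" + "".join(chars[i : i + n]))
    return feats

def sentence_features(chars):
    return [char_features(chars, i) for i in range(len(chars))]

print(char_features(list("armavirumque"), 3))
# → ['bias', 'char=a', 'prev2=ma', 'next2=av', 'prev3=rma', 'next3=avi',
#    'prev4=arma', 'next4=avir']
```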
Create some training and testing data:
Training took about 15 minutes on my MacBook. As you can see, I applied the trained CRF to some basic Latin sentences I remembered from my school years, and the trained model seems to work quite well at first glance. Let’s see what the measures say:
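Given a predicted label sequence, the spaces can be re-inserted like this; the label string below is hand-written for illustration, not actual model output.

```python
def segment(chars, labels):
    # Insert a space before every character labelled "1", except the first.
    out = []
    for ch, lab in zip(chars, labels):
        if lab == "1" and out:
            out.append(" ")
        out.append(ch)
    return "".join(out)

print(segment("armavirumquecano", "1000100000001000"))
# → 'arma virumque cano'
```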
The model achieves an accuracy of 97.6% and an F-score of about 92.5%, all without any parameter optimisation or feature engineering. With some tuning, it should easily be possible to improve this model further.
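The measures can be computed by hand; this sketch scores character-level accuracy plus precision, recall and F1 on the "1" (word-start) class, a common convention for segmentation tasks. The gold/predicted strings are toy values for illustration.

```python
def scores(gold, pred):
    # Character-level accuracy plus P/R/F1 on the "1" (word-start) class.
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    tp = sum(g == p == "1" for g, p in zip(gold, pred))
    precision = tp / pred.count("1")
    recall = tp / gold.count("1")
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

acc, f1 = scores("10001000", "10000000")
print(acc, round(f1, 3))  # → 0.875 0.667
```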
CRFs have been successfully applied to many NLP tasks such as Named Entity Recognition, POS tagging and word segmentation. They can also be applied to computer vision tasks like image segmentation. Sadly, CRFs are not part of any of the more common data science packages in Python. However, with python-crfsuite you can quickly get started with sequence labelling tasks in the data scientist’s favourite language.