Using Python and Conditional Random Fields for Latin word segmentation

Please note: All the code described in this article can be found on my GitHub page. Some of the feature ideas are taken from a closely related Stanford report.

Word segmentation is a very common task for texts in languages like Chinese and Japanese, where words are not separated by spaces or other delimiters. It is less well known, however, that in the Roman Empire Latin was also written without any delimiting characters for several centuries (scriptio