As human society and technology evolve, language continuously adapts to reflect changes in culture and communication styles. Since the modernization of Japan in the 1900s, the Japanese have, by and large, lost the ability to read kuzushiji, the script used between the 9th and 20th centuries. Over 3 million books, literary works and drawings have been preserved, many of them from the Edo period. Only a small fraction have been translated, since few remaining scholars are able to read these texts.
Since kuzushiji was not standardized, words were often written in many styles and formats. There are likely well over 5000 unique characters in the script, which highlights the difficulty of translating it. With rapid advancements in machine learning, specifically in image recognition, a number of researchers and machine learning practitioners have built algorithms to help historians identify text and digitize content, unlocking the history hidden in these documents.
What is machine learning?
It is the process of training a computer system to execute a specific task without a human spelling out the instructions. Training is achieved by exposing the system to many examples, typically millions, with a built-in reward system that encourages it to maximize successful attempts to predict the outcome. By the end of the training process, the system has come up with its own set of instructions, based on statistical inference, that allow it to accomplish the task.
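To make the idea of learning from examples concrete, here is a minimal sketch using a perceptron, one of the simplest trainable models. It is not the method used for kuzushiji; the toy data, function names and settings are all assumptions chosen for illustration. Real systems use far larger models and far more examples, but the loop is the same in spirit: predict, compare with the known answer, adjust.

```python
# Toy training set (an assumption for illustration): two input features
# and the expected label, here the logical AND of the two inputs.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(examples, epochs=20, learning_rate=1.0):
    """Nudge the weights after every wrong prediction (perceptron rule)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            # error is 0 when the prediction is right, +1 or -1 when wrong
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(examples)
predictions = [predict(weights, bias, features) for features, _ in examples]
print(predictions)
```

Nobody wrote a rule for AND here; the weights that reproduce it were found by repeatedly correcting mistakes on the examples.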
So, how are researchers able to digitize text with machine learning? The process of developing a machine learning solution starts with the data. The National Institute of Japanese Literature (NIJL) created and released a kuzushiji dataset, curated by the Center for Open Data in the Humanities (CODH), containing 1 million images of cropped handwritten characters and their equivalents in kanji (a script used in modern Japanese).
I’ll walk through one method of achieving this, which is not unlike how people process images and text. The scope is limited to detecting and recording each character found on a page.
For a machine to translate a whole page of kuzushiji text to modern-day kanji, some preprocessing is needed to first identify whether a marking or object in the image is a kuzushiji character or not. This is similar to when we quickly scan an image with our eyes, figure out whether text is present and, if so, which language it is in. Machine learning systems solve the problem of detecting text on a page by training on examples of the script and inferring its visual characteristics. This process is referred to as object detection, and there are multiple ways to achieve it.
The image below is based on an object detection method that estimates the likely center point of an object, which in this case is a handwritten kuzushiji character. It records where the object is located and goes on from there to separate it from the rest of the image (see CenterNet).
The heat-map on the right shows areas of high intensity and low intensity. The less intense (darker) areas have a lower probability of being a kuzushiji character and are ignored. The higher-probability (lighter) areas are considered centers. They are separated out and passed on to the next step: character recognition.
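The step from heat-map to centers can be sketched as follows. This is a simplified illustration in the spirit of center-point detection, not CenterNet's actual code: the function name, the threshold of 0.5 and the tiny heat-map are all assumptions. A cell counts as a center if its probability clears the threshold and is strictly higher than all of its immediate neighbours.

```python
def find_centers(heatmap, threshold=0.5):
    """Return (row, col, score) for every local peak at or above threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    centers = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue  # dark area: unlikely to be a character, ignore it
            # Collect the up-to-8 surrounding cells, staying inside the grid.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            # Strict comparison: flat plateaus of equal values are skipped.
            if all(score > n for n in neighbours):
                centers.append((r, c, score))
    return centers


# Hypothetical 5x5 heat-map with two bright spots.
heatmap = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.3, 0.8, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.1],
]
centers = find_centers(heatmap)
print(centers)
```

In a full pipeline each center would then be used to crop a small region of the page, which is what gets handed to the character recognizer.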
Optical character recognition (OCR) is a machine learning method for classifying text. It is akin to someone who is learning to read, or reading an unfamiliar script for the first time: most will translate each character one at a time rather than relying on context such as word shape or surrounding letters and words. Since OCR systems and object detection systems both solve image-based problems, they belong to the branch of machine learning known as computer vision, which today relies heavily on deep learning and is often referred to as Artificial Intelligence (AI). The technology is quite complex, but follows the same learning principles (see ResNet). After the classifier has learned the script, in this case after it has been trained on the kuzushiji character images, it is ready to predict characters from whole pages of literature, diagrams and art. The system processes each character segment and returns a list of possible kanji matches, much like a search engine ranking, with the highest-scoring match listed at the top along with its score.
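The ranked list of matches can be illustrated with a short sketch. It assumes the classifier outputs one raw score per candidate character and that those scores are turned into probabilities with a softmax, which is common practice but is an assumption here, not something confirmed for any specific kuzushiji system. The candidate characters and their scores are made up.

```python
import math


def rank_candidates(raw_scores, k=3):
    """Convert raw scores to probabilities (softmax) and return the top k."""
    highest = max(raw_scores.values())
    # Subtracting the highest score first keeps exp() numerically stable.
    exps = {label: math.exp(s - highest) for label, s in raw_scores.items()}
    total = sum(exps.values())
    probabilities = {label: e / total for label, e in exps.items()}
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical raw scores for one cropped character image.
raw_scores = {"日": 2.0, "目": 1.0, "白": 0.5, "月": -1.0}
top_matches = rank_candidates(raw_scores)
for label, probability in top_matches:
    print(f"{label}: {probability:.2f}")
```

Keeping the runners-up, not only the top match, is useful for a human reviewer: when the top score is low, the correct character is often among the next few candidates.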
The last step is to stitch the outputs together: the original historical image is annotated with each recognized character in its correct position, and the characters are recorded digitally in reading order.
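Putting characters in reading order can be sketched as below. The sketch assumes the page is written in vertical columns read top to bottom, with columns running right to left, and that characters in the same column have roughly the same x position. The layout assumption, the field names and the tolerance value are all illustrative; real pages with notes, illustrations or slanted columns need more care.

```python
def reading_order(detections, column_tolerance=20):
    """Sort detections into columns (right to left), then top to bottom."""
    columns = []
    # Walk from the right edge of the page towards the left.
    for det in sorted(detections, key=lambda d: d["x"], reverse=True):
        # Same column if x is close to the first character of the current one.
        if columns and abs(columns[-1][0]["x"] - det["x"]) <= column_tolerance:
            columns[-1].append(det)
        else:
            columns.append([det])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda d: d["y"]))
    return ordered


# Hypothetical detections: recognized character plus its center in pixels.
detections = [
    {"char": "月", "x": 102, "y": 200},
    {"char": "日", "x": 300, "y": 50},
    {"char": "本", "x": 295, "y": 120},
    {"char": "山", "x": 100, "y": 60},
    {"char": "川", "x": 298, "y": 190},
]
text = "".join(d["char"] for d in reading_order(detections))
print(text)
```

The same positions can be reused to draw each recognized character next to its original on the page image.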
Machine learning image recognition methods are of course imperfect and error-prone, but the technology offers huge time savings when it comes to processing and digitizing image collections. Ultimately, machine learning is best used to assist and supplement human expertise and skills. This is especially true for specialty domains where few experts are able to confidently carry out work such as translating and deriving meaning from classical scripts. I can see this technology benefiting the community of Japanese historians and researchers most by helping them prioritize the works of art and literature that should be explored for historical significance.