Starting with Japanese literature
AI research relies on a set of common tasks in order to compare the models built by different teams. In the same way that athletes compete under standard conditions set by their sports’ governing bodies, AI researchers around the world have spent the last twenty years training their programs to classify the 70,000 handwritten digits in MNIST, a dataset built before neural networks were able to read the handwritten numbers on bank checks. It was a difficult task when it was introduced in 1998, but now, in the words of Mila PhD student Alex Lamb, it is “done to death.” Because so many programs can solve it with greater than 99% accuracy, it is no longer useful for showing whether a new program advances the state of the art. As a result, researchers have started creating harder spinoff tasks with the same standard conditions, such as EMNIST (a mixture of upper- and lower-case letters along with digits) and Fashion-MNIST (pictures of clothing items, to be classified as shoes, shirts, etc.). Alex wants to add another criterion to these spinoffs: instead of just making new versions of MNIST which are harder to solve, why can’t we make ones which are useful outside of our own research community?
Throughout Japan’s long period of isolation before Emperor Meiji’s rapid campaign to modernize the country in the late 1800s, Japanese books were produced through a wood-block printing process with a form of cursive writing called kuzushiji. The National Institute of Japanese Literature (NIJL) has set the goal of digitizing at least 300,000 of the nearly two million pre-modern books registered in the national catalog to make sure their content will survive as the paper copies degrade, but getting good photographs of all the pages is only the first part of the problem. Today, the kuzushiji writing system is intelligible only to a small fraction of the Japanese population with advanced degrees in classical literature. In 2016, researchers at Kyoto University and Osaka University launched KuLA, a mobile app using the NIJL’s database to teach people kuzushiji with flashcards and a cute cartoon mascot. But translating kuzushiji is more than just a matter of memorizing and classifying the characters, because some characters can be rendered in more than one way, and some can represent different sounds in different contexts.
For machine learning researchers, kuzushiji also presents an interesting problem because of the character distribution’s “heavy tail”: there are over 4,000 different characters in kuzushiji, as a result of three different alphabets blending throughout the Middle Ages, and many of those characters occur very infrequently. Reading kuzushiji without mistakes would require breakthroughs in “few-shot learning,” the computer scientists’ term for classifying a new type of data correctly after seeing it only a handful of times, which is an open problem in AI and a frequent topic for workshops at major conferences.
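To make the “heavy tail” concrete, here is a minimal sketch in Python that counts how many character classes in a labeled collection have only a handful of examples. The function names and the toy character counts are invented for illustration; they are not taken from the kuzushiji data.

```python
from collections import Counter


def rare_classes(labels, max_examples):
    """Return the classes that appear at most max_examples times, sorted."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n <= max_examples)


def tail_share(labels, max_examples):
    """Return the fraction of distinct classes that are rare."""
    counts = Counter(labels)
    rare = sum(1 for n in counts.values() if n <= max_examples)
    return rare / len(counts)


# Toy corpus with made-up frequencies: a few common characters, a few rare ones.
toy_labels = ["の"] * 50 + ["に"] * 30 + ["し"] * 15 + ["鶴"] * 3 + ["麓"] * 1

print(rare_classes(toy_labels, max_examples=5))
print(tail_share(toy_labels, max_examples=5))
```

In a real heavy-tailed dataset the rare classes can make up most of the vocabulary while contributing very few of the images, which is why a classifier can score well overall and still fail on them.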
Beyond the NIJL’s quest to preserve kuzushiji books, the general idea of saving endangered languages by attracting machine learning researchers to use them as benchmark tasks holds promise for other organizations like UNESCO, which maintains an interactive atlas of endangered languages, the National Geographic Society, which supports the Living Tongues Institute for Endangered Languages, and the US Library of Congress, which collaborates with Wikitongues. The first step is to build a dataset of the language’s characters with the same format as MNIST (70,000 black-and-white images from 10 categories, 28x28 pixels, with the characters all centered and scaled to the same size) which can easily be dropped into existing machine learning pipelines. The next step is to post it where researchers will see it. The kuzushiji version of MNIST, called KMNIST, developed by Alex and his colleagues at the Center for Open Data in the Humanities (CODH) with additional help from Kazuaki Yamamoto at the NIJL, Mikel Bober-Irizar at the Royal Grammar School, Guildford, and David Ha at Google AI, is now included on PyTorch’s official list of supported datasets, alongside MNIST, EMNIST, and Fashion-MNIST.
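The formatting step described above, cropping each character and placing it centered at a common scale in a 28x28 frame, can be sketched in a few lines of plain Python. This is an illustrative assumption about how such preprocessing might look, not the procedure the KMNIST authors used; the function names, the 20-pixel inner box, and the nearest-neighbor resampling are all choices made for this example.

```python
def bounding_box(img):
    """Return (top, left, bottom, right) of the non-zero pixels.

    bottom and right are exclusive. img is a list of rows of integers.
    """
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows:
        raise ValueError("image contains no ink")
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def to_mnist_frame(img, size=28, box=20):
    """Crop a glyph, scale it to fit a box x box square, center it in size x size."""
    top, left, bottom, right = bounding_box(img)
    h, w = bottom - top, right - left
    scale = box / max(h, w)          # keep the aspect ratio
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    frame = [[0] * size for _ in range(size)]
    off_r = (size - new_h) // 2      # offsets that center the glyph
    off_c = (size - new_w) // 2
    for r in range(new_h):
        for c in range(new_w):
            # Nearest-neighbor lookup back into the cropped source region.
            src_r = top + min(h - 1, int(r / scale))
            src_c = left + min(w - 1, int(c / scale))
            frame[off_r + r][off_c + c] = img[src_r][src_c]
    return frame


# Hypothetical scan: a 40x20 block of ink somewhere on a 100x60 page.
page = [[0] * 60 for _ in range(100)]
for r in range(10, 50):
    for c in range(5, 25):
        page[r][c] = 255

frame = to_mnist_frame(page)
print(len(frame), len(frame[0]))
```

Once every character is in this shape, code written for MNIST can read the new dataset without modification, which is the point of keeping the format identical.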
Alex admits that machine learning systems which can only read the 10 types of characters included in KMNIST would be of little value to literature scholars, but he calls this task “a gateway drug,” expressing the hope that models (and researchers) trained on KMNIST would be competent to move on to the other datasets his team has assembled. These include Kuzushiji-49, which contains the 49 most common characters, and Kuzushiji-Kanji, which contains 3,832 rare characters and stands as a credible replacement for the popular Omniglot dataset, which was introduced for few-shot learning in 2015 and is beginning to suffer from the same overuse as MNIST. The final step is to read raw pages of these pre-modern books, which brings the added problems of distinguishing text from illustration and moving between the columns of text in the proper order.
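Few-shot benchmarks of the kind mentioned here are commonly evaluated in “episodes”: a small number of classes is drawn, the model sees a few labeled examples of each, and it is then tested on held-out examples of the same classes. The sketch below shows one way such an episode could be sampled. The function name, its parameters, and the placeholder data are assumptions for illustration and do not describe any particular dataset’s official protocol.

```python
import random
from collections import defaultdict


def sample_episode(examples, n_way, k_shot, n_query, seed=None):
    """Draw one few-shot episode from a list of (item, label) pairs.

    Returns (support, query): k_shot labeled examples per class to learn
    from, and n_query held-out examples per class to be classified.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append(item)

    # Only classes with enough examples for both sets can take part.
    eligible = sorted(
        lab for lab, items in by_label.items() if len(items) >= k_shot + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = [], []
    for label in rng.sample(eligible, n_way):
        picked = rng.sample(by_label[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Placeholder data: six classes with four "images" each, plus one class
# (label 99) that has a single example and is therefore skipped.
data = [(f"img_{lab}_{i}", lab) for lab in range(6) for i in range(4)]
data.append(("img_99_0", 99))

support, query = sample_episode(data, n_way=3, k_shot=1, n_query=2, seed=0)
print(len(support), len(query))
```

Note that a sampler like this excludes the very rarest classes, since a class needs at least k_shot + n_query examples to appear in an episode at all.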
To goose the research community and hasten its progress up this ladder of tasks, interested parties could use a platform like Kaggle or InnoCentive to launch a competition with a reward for the first team to build a system which performs above a certain threshold. Alex’s colleagues at CODH are interested in doing this for kuzushiji books but are unable to comment further for now.
Meanwhile, for AI researchers testing approaches to the basic problems of pattern recognition, it doesn’t matter what the content of their MNIST-spinoff tasks is, as long as their format is standard and their data is public so that other teams can compare their results. Using those degrees of freedom to benefit other fields seems like the civic thing to do.