NLP for Low-Resource Settings
Natural language processing (NLP) is a field of artificial intelligence that aims to establish human-like communication with computers. Although the field can boast significant successes, computers still struggle to comprehend many facets of language, such as pragmatics, that are difficult to characterize formally. Moreover, most of this success has been achieved for popular languages like English and other languages with text corpora of hundreds of millions of words. Yet these amount to only about 20 of the approximately 7,000 languages in the world. The majority of human languages are in dire need of tools and resources to overcome the resource barrier so that NLP can deliver more widespread benefits. These are called low-resource languages: languages lacking large monolingual or parallel corpora and/or manually crafted linguistic resources sufficient for building statistical NLP applications.
Why Is It Important?
It might seem that a dozen languages is enough to get by in the world, so why bother with minority or endangered languages? In fact, building NLP applications for such languages can both strengthen ties across the world and preserve its diversity.
- Preservation. An obvious task for NLP is to process and document languages that lack a writing system, many of which are falling out of use, before they are gone forever.
- Educational applications. Sometimes even languages that seemed doomed to extinction come back to thrive: the revivals of Hebrew and Gaelic are well known, and NLP techniques can speed such revivals up drastically.
- Knowledge expansion. Much of the world’s knowledge is not in text: corpora record what people said, but not what they meant, how they understood things, or what they did in response to the language. New developments in NLP might give insights into the connection between word and meaning, not through pure statistics but by comparing more diverse languages.
- Monitoring demographic and political processes. People speaking minority languages are usually hidden from our sight, but when we consider that Africa alone has a population of over 1.2 billion people, it becomes clear how important it is to reach them.
- Emergency response. Natural disasters strike all of humanity alike. Extending our warning and prevention networks with messages understandable to more people will save lives.
Approaches to Low-Resource NLP
Common NLP models require large amounts of training data and/or sophisticated language-specific engineering. However, this amount of data is unavailable for most languages, and in many cases no linguistically trained speaker can be found to build a language model, even for languages spoken by millions of people, such as the whole Bantu family.
There are two main approaches to NLP in the low-resource setting, where the amount of data and the knowledge of the language are insufficient for traditional methods: 1) the traditional approach, which focuses on collecting more data for a single language or a variety of languages; and 2) approaches that apply transfer learning.
The first approach starts with a data collection phase to compile text or speech in the language, or languages, of interest, and usually results in an NLP tool, such as a POS tagger or a machine translation engine. A number of languages have been covered by single-language projects, including Welsh, Punjabi, Luo and Quechua. These approaches yield useful results but require extensive preparatory work in data collection and processing, typically with aid from an expert. The major drawback, however, is that the results are not directly applicable to other languages: every language added requires a new corpus.
At the same time, expanding corpora for low-resource languages is a valuable undertaking in itself. One of the most notable examples of such work is the Crúbadán project led by Kevin Scannell. By crafting Web search queries designed to return pages in specific low-resource languages, the project has built corpora for over 2,000 languages.
Other examples of the many-language approach include the Human Language Project, which describes a common format for annotated text corpora, and the Leipzig Corpora Collection, which has built corpora for 124 languages and offers statistics about each of them, such as word frequencies and contexts.
Unsupervised learning, which does not depend on manually labelled data, is another promising approach to NLP for low-resource languages. It covers unsupervised feature induction, such as Brown clustering and word vector methods, as well as unsupervised POS tagging and unsupervised dependency parsing. Brown clustering groups a vocabulary into word classes to derive “lexical representations”, based on the intuition that similar words have similar distributions of words to their immediate left and right. Word embeddings, or word vectors, are the cornerstone of many NLP approaches. While they used to require extensive datasets, recent research shows that zero entries in the co-occurrence matrix also provide valuable information. Jiang et al. (2018), for instance, designed a positive-unlabelled learning approach to factorize the co-occurrence matrix and validated it in four different languages.
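The co-occurrence intuition behind these methods can be sketched in a few lines: build a word–word co-occurrence matrix from raw, unlabelled text and factorize it to obtain dense word vectors. This is a minimal illustration using a truncated SVD on raw counts, not the positive-unlabelled factorization of Jiang et al.; the toy corpus, window size and dimensions are invented for the example.

```python
import numpy as np

def cooccurrence_vectors(tokens, window=2, dim=2):
    """Build a word-word co-occurrence matrix from a token list and
    factorize it with truncated SVD to obtain dense word vectors."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[index[w], index[tokens[j]]] += 1
    # Truncated SVD: keep the top `dim` singular directions as embeddings.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    vectors = u[:, :dim] * s[:dim]
    return vocab, vectors

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab, vecs = cooccurrence_vectors(corpus)

def sim(a, b):
    """Cosine similarity between two word vectors."""
    va, vb = vecs[vocab.index(a)], vecs[vocab.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Words with similar contexts ("cat"/"dog", "mat"/"rug") tend to end up close together in the resulting space; real systems replace raw counts with a reweighted matrix (e.g. PMI) and far larger corpora.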
Cross-Lingual Transfer Learning
Language models and transfer learning have recently become cornerstones of NLP. The central idea underlying transfer learning is that there are commonalities between languages that can be exploited to build, for example, a language model for one language from a model for another. Cross-lingual transfer learning refers to the transfer of resources and models from resource-rich source languages to resource-poor target languages on several levels:
Transfer of annotations (such as POS tags, syntactic or semantic features) via cross-lingual bridges (e.g., word or phrase alignments). Training such models usually requires linguistic knowledge of, and resources about, the relation between the source and target languages. Recent developments, though, offer techniques that do not require ancillary resources such as parallel corpora. In Kim et al. (2017), for instance, a cross-lingual model uses a common BLSTM that enables knowledge transfer from other languages, and private BLSTMs for language-specific representations, without exploiting any linguistic knowledge about the relation between the source and target languages. The model is trained with language-adversarial training and bidirectional language modelling to represent language-general information while preserving information about the specific target language.
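Annotation projection, the simplest form of annotation transfer, can be illustrated as follows. The tags and alignment links below are toy data; in practice the links would come from a word aligner run over a parallel corpus.

```python
# A minimal sketch of annotation projection: POS tags from a tagged
# high-resource sentence are copied onto a low-resource sentence through
# word alignments. The alignment pairs here are hypothetical toy data.

def project_tags(source_tags, alignments, target_len, default="X"):
    """Transfer POS tags across (source_idx, target_idx) alignment links.
    Unaligned target words receive a placeholder `default` tag."""
    target_tags = [default] * target_len
    for src_i, tgt_i in alignments:
        target_tags[tgt_i] = source_tags[src_i]
    return target_tags

# English source "the cat sleeps", projected onto a 3-word target
# sentence whose language drops the article.
source_tags = ["DET", "NOUN", "VERB"]
alignments = [(1, 0), (2, 1)]   # hypothetical alignment links
print(project_tags(source_tags, alignments, target_len=3))
# → ['NOUN', 'VERB', 'X']
```

Real systems add heuristics for one-to-many and conflicting links, and often train a target-side tagger on the projected tags rather than using them directly.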
Transfer of models refers to training a model in a resource-rich language and applying it to a resource-poor language via zero-shot or one-shot learning. Zero-shot learning trains a model in one domain and assumes it generalizes more or less out of the box to a low-resource domain. One-shot learning is a similar approach that uses a very limited number of examples from the low-resource domain to adapt a model trained in the rich-resource domain. This approach is particularly popular in machine translation, where the weights learned for a rich-resource language pair are transferred to low-resource pairs. An example is the model of Zoph et al. (2016): a “parent” model is trained on a high-resource language pair (French to English), and some of the trained weights are reused as the initialization for a “child” model, which is then further trained on a specific low-resource language pair (Hausa, Turkish and Uzbek into English). A similar approach was explored by Nguyen and Chiang (2017), where the parent language pair is also low-resource but related to the child language pair.
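The parent–child weight transfer can be sketched schematically. The weight names, shapes and the choice of which parameters to reuse are illustrative assumptions; Zoph et al. discuss several variants of which weights to transfer, freeze or fine-tune.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Parent" model trained on a high-resource pair (e.g. French→English):
# weights stored in a flat dict, as in most frameworks' state dicts.
parent = {
    "src_embed": rng.normal(size=(1000, 64)),   # source-language vocabulary
    "encoder":   rng.normal(size=(64, 64)),
    "decoder":   rng.normal(size=(64, 64)),
    "tgt_embed": rng.normal(size=(1000, 64)),   # English vocabulary (shared)
}

def init_child(parent, child_src_vocab, dim=64):
    """Initialize a child model for a low-resource source language:
    reuse every parent weight except the source embeddings, which are
    re-initialized for the new vocabulary and trained from scratch."""
    child = dict(parent)                 # shallow copy: shares parent arrays
    child["src_embed"] = rng.normal(size=(child_src_vocab, dim))
    return child

child = init_child(parent, child_src_vocab=300)   # e.g. a small Uzbek vocab
```

After initialization the child is fine-tuned on the small low-resource parallel corpus; the shared target side and recurrent weights carry over what the parent learned about English and about translation in general.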
Joint multilingual, or “polyglot”, learning converts data in all languages to a shared representation (e.g., phones or multilingual word vectors) and trains a single model on a mix of datasets in all languages, to enable parameter sharing where possible. This approach is closely related to recent efforts to train a cross-lingual Transformer language model on the 100 most popular languages, and to cross-lingual sentence embeddings. The latter approach learns joint multilingual sentence representations for 93 languages, belonging to more than 30 language families and written in 28 scripts. Using a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on parallel corpora, it allows a classifier to be learned on top of the resulting sentence embeddings using English annotated data only, and then transferred to any of the 93 languages without modification.
In conclusion, the real reason many specialists work on NLP problems is to build systems that break down barriers. Given the potential impact for mankind, building systems for low-resource languages is one of the most important areas to work on. There are already many promising approaches to low-data settings, which may involve low-resource languages, dialects, sociolects and domains; but despite the pursuit of linguistic universals, there is still no universal solution that covers all the languages of the world.