Low-resource language: what does it mean?

Felix Laumann
Published in NeuralSpace
Jun 10, 2022 · 3 min read

To build Natural Language Processing (NLP) solutions for any language, the most vital thing one needs is data in that language. More than 7,000 languages are spoken across the world, but only about 20 of them have text corpora of hundreds of millions of words. English is by far the language with the largest amount of data, followed by Chinese and Spanish. Other languages with large datasets are mostly Western European languages, along with Japanese.

On the other hand, the majority of the languages spoken in Asia and Africa lack the training data that is required to build accurate state-of-the-art NLP systems. These languages are called low-resource languages.

Technically speaking, a language is considered low-resource whenever it lacks large monolingual or parallel corpora and/or the manually crafted linguistic resources needed to build statistical NLP applications.

So you may ask yourself why a language like Hindi, spoken by more than 500 million people, is a low-resource language, while a Western European language like French, spoken by only about 100 million people, is a high-resource language. To answer that, you need to dig deeper into how datasets are actually created. State-of-the-art language models need gigabytes, if not terabytes, of data, which can easily correspond to billions of written sentences, so manually creating data just for these models is infeasible. Not only would it take years to create such datasets; hardly any organization would pay hundreds of employees to do nothing but write sentences into a document all day.
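As a rough sanity check on that scale, here is a back-of-envelope estimate in Python. The average sentence length and bytes-per-word figures are assumptions chosen for illustration, not measurements:

```python
# Back-of-envelope: how many sentences fit in a terabyte of raw text?
# Assumed (illustrative) figures: ~20 words per sentence, ~6 bytes per
# word including the trailing space, i.e. ~120 bytes per sentence.
BYTES_PER_SENTENCE = 20 * 6
CORPUS_BYTES = 1_000_000_000_000  # 1 TB of raw text

sentences = CORPUS_BYTES // BYTES_PER_SENTENCE
print(f"~{sentences / 1e9:.1f} billion sentences")  # ~8.3 billion
```

Even under these rough assumptions, a single terabyte of text corresponds to billions of sentences, which is why manual dataset creation at this scale is out of the question.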

The solution is that nearly all large datasets are scraped from the Internet, often from social media networks like Facebook or Twitter, where billions of comments and posts are written by people from all over the world. This data is essentially free to use for training language models. But here comes the caveat: in which languages do people write on the Internet? How many tweets are written in English compared to Hindi? And thinking historically, when was the first user-generated content uploaded to the Internet, and where? The answers to most of these questions are “English” and “USA”, and that’s exactly why we speak of languages with low resources (read: small datasets) and languages with high resources (read: large datasets).
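To make this concrete, here is a minimal sketch of how a scraping pipeline might bucket text by language before building a corpus. It uses the open-source langdetect library purely for illustration (a production pipeline would typically use something like fastText’s language-identification model), and the sample posts are invented:

```python
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect's results deterministic

# Invented sample of "scraped" posts; real pipelines see billions.
posts = [
    "Just landed in New York, what a view!",
    "Can't wait for the weekend.",
    "आज का मौसम बहुत अच्छा है।",           # Hindi
    "Le nouveau film était incroyable.",   # French
    "Another day, another coffee.",
]

# Bucket each post by its detected language (ISO 639-1 code).
counts = Counter(detect(post) for post in posts)
print(counts)  # e.g. Counter({'en': 3, 'hi': 1, 'fr': 1})
```

Run this over a real crawl and the English bucket dwarfs nearly every other one, so the resulting training corpora inherit exactly that imbalance.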

Current NLP solutions focus mainly on a few high-resource languages, although there are about 3 billion low-resource language speakers, mainly in Asia and Africa. Such a large portion of the world’s population is still underserved by NLP systems because of the various challenges developers face when building NLP systems for low-resource languages. To learn more, check out this article on Challenges in using NLP for low-resource languages and how NeuralSpace solves them.

Join the NeuralSpace Slack Community to connect with us. Also, receive updates and discuss topics in NLP for low-resource languages with fellow developers and researchers.
