Tips & tricks for doing NLP in a language in which you are not a native or proficient speaker

Please keep in mind that the following tips and tricks are for native or proficient English speakers.

A robot reads a book written by a human. Image by Francis Ray from Pixabay.

Natural language processing (NLP) is a branch of linguistics and artificial intelligence that helps machines understand, interpret, and manipulate human language.

Here, we present two applications of NLP: text classification and text translation.

There are around 7,000 languages spoken around the world, but English dominates in NLP.

Language resource distribution of Joshi et al. (2020). The size and color of a circle represent the number of languages and speakers, respectively, in each category. Colors (on the VIBGYOR spectrum; Violet–Indigo–Blue–Green–Yellow–Orange–Red) represent the total speaker population size from low (violet) to high (red). English, which belongs to class 5, is the most popular language for NLP.

Languages in categories 5 and 4 lie at a sweet spot: they have large amounts of both labeled and unlabeled data available and are well studied in the NLP literature. On the other hand, NLP has largely neglected languages in the other groups. You can find many NLP examples and tutorials in English, but it is challenging to do NLP in other languages.

So, what if your NLP task involves a language in which you are not a native or proficient speaker?

Let’s go through the following example. The task is text classification in German. German is also one of the popular languages for NLP (class 5 in the plot above), but still not as popular as English. You can find the dataset here.

For simplicity and fast execution, I chose a small dataset with 1006 total records (503 health-related and 503 not health-related).
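If you want to follow along, a minimal loading sketch could look like this (the file name is hypothetical; we assume a CSV with the 'text' and 'health_related' columns used throughout this article):

import pandas as pd

# Load the dataset; the file name here is hypothetical
df = pd.read_csv('german_health_related.csv')

print(df.shape)                             # expect (1006, 2)
print(df['health_related'].value_counts())  # expect 503 of each label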

We need to understand our data and prepare for some Machine Learning (ML) or Deep Learning (DL) model.

To do this, we will follow several steps:

  1. Check data quality;
  2. Data preprocessing.

Check data quality

We can see that this dataset has received positive reviews about its quality.

Let’s see what this data tells us: pick a few sentences at random, translate them, and see what the data whispers to us!

Since we concluded that English is the most popular language for NLP and German is also popular, let us assume Google Translate works well for English-German translation. Text translation is an NLP task, and Google Translate is powered by deep learning. That is why we can instantly translate words, phrases, and web pages between English and over 100 other languages.

We will translate the content from German to English and then back from English to German:

1. Manually using Google Translate,

2. Using a library for translation.

Google Translate — Examples

Example of the sentence labeled as health-related. German to English translation.

Notice that the chosen sentence is labeled correctly. But is it translated correctly? Let us translate the English sentence back to German:

Example of the sentence labeled as health-related. English to German translation.

We do not get the original German sentence back. Does this mean that the translation is not good enough, or that different sentences can have the same meaning?

German to English translation.

Voilà! The same meaning can be expressed in different ways, and that is what we must be aware of.

The reverse also holds: one word can have different meanings.

For example, the word “root” in English as a noun can mean (see the WordNet lookup after this list):

  1. the part of a plant which attaches it to the ground or to a support;
  2. the basic cause, source, or origin of something;
  3. in mathematics, it is a solution to an equation, usually expressed as a number or an algebraic formula.
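One quick way to explore such polysemy yourself is a WordNet lookup through nltk (a sketch, not something this article's dataset requires):

import nltk
from nltk.corpus import wordnet

# Download WordNet on first use
nltk.download('wordnet')

# List the noun senses of "root": the plant part, the origin/source,
# and the mathematical solution of an equation all appear among them
for synset in wordnet.synsets('root', pos=wordnet.NOUN):
    print(synset.name(), '->', synset.definition())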

This leads us to the next thing we should care about: the context of the given text. Recognizing the context of a text is still a hot topic in NLP.

Different subjects, topics, cultures, and scientific and industrial fields have different vocabularies and phrases. Therefore, it is important to know what kind of data you have and where it comes from. Is it daily news or a gaming forum? Do people use everyday language or the language that lawyers or doctors speak?

Currently, we are dealing with social-media-like data. That means there are different types of people; we can expect everyday language, slang, short sentences, misspelled words, missing punctuation, etc. If the training and validation data are social-media-like, then the test data should also be social-media-like.

It is also important to understand what the target variable represents and who the users are.

For example, hospitals may be more interested in finding health-related sentences that mention diseases, while pharmaceutical companies may want to find the most popular drugs on social media.

In our case, we need to classify each sentence as health-related (health_related = 1) or not health-related (health_related = 0).

Bonus tip: Google Translate offers a list of alternative translations when you click on the first translation.

English to German translation. Google Translate offers a list of alternative translations.

Library for text translation

Manual translation can take a long time. Therefore, we can use a library for text translation.

To see how we can translate the entire dataset or some part of it, we will use google-trans-new, a free and unlimited Python API for Google Translate. You can also choose another library or programming language for text translation, or even for language detection.
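For language detection specifically, one option (my assumption; any detector will do) is the langdetect package:

from langdetect import detect

# langdetect guesses the language code of a given text
print(detect('Heute ist ein schöner Tag'))  # -> 'de'
print(detect('Today is a beautiful day'))   # -> 'en'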

  • Note that sometimes you need to normalize or clean the text before translation, which is part of the next section: data preprocessing.
Raw dataset.

We need to do text encoding and normalization:

from unicodedata import normalize

for i, row in df.iterrows():
    text = row['text']
    # Text encoding and homographic normalization: decompose accented
    # characters, then drop everything that does not map to ASCII
    text = normalize('NFKD', text).encode('ascii', 'ignore')
    text = text.decode('UTF-8', 'ignore')
    # Put the cleaned text back into the dataframe
    df.at[i, 'text'] = text
Normalized dataset ready for text translation.
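Keep in mind what this ASCII folding does to German letters: the NFKD decomposition strips the umlaut dots, and characters with no ASCII decomposition, like "ß", are dropped entirely. A quick check:

from unicodedata import normalize

sample = 'Straße, Käse'
cleaned = normalize('NFKD', sample).encode('ascii', 'ignore').decode('UTF-8', 'ignore')
print(cleaned)  # -> 'Strae, Kase'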

Text translation to English:

from google_trans_new import google_translator

# Translate each row's German text to English and store it in a new column
translator = google_translator()
df['text_en'] = df['text'].apply(translator.translate, lang_tgt='en')
The observed dataset with an additional column with an English text translation.
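A quick spot check of the new column helps verify the translations (a sketch; any random row will do):

# Compare the original German text with its English translation
sample = df.sample(1, random_state=42).iloc[0]
print('DE:', sample['text'])
print('EN:', sample['text_en'])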

Now, you can have a bigger picture of what your data contains.

Data preprocessing

Universal steps for data preprocessing:

  1. Tokenization,
  2. Cleaning,
  3. Normalization,
  4. Lemmatization,
  5. Stemming.

You can find more about each of these steps here.

Stop words are commonly used words, such as “the”, “a”, “an”, and “in” in English. They usually do not carry any importance for ML and DL models, so we can remove them. Pay attention that the stop-word library you use actually contains stop words for the language of your interest.
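For example, nltk ships stop-word lists for many languages, including German:

from nltk.corpus import stopwords

german_stops = set(stopwords.words('german'))
print(len(german_stops))         # roughly 230 words
print(sorted(german_stops)[:5])  # e.g. 'aber', 'alle', 'allem', ...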

Besides stop words, you can also examine the most frequent words in the text for both the health_related = 1 and health_related = 0 labels. Words that appear frequently under both labels should also not be crucial for the models.
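A simple frequency count per label, for instance with collections.Counter, is enough for this kind of check (a sketch over the raw text column):

from collections import Counter

# Most frequent words per label; words that top both lists carry
# little signal for separating the classes
for label in (0, 1):
    words = ' '.join(df.loc[df['health_related'] == label, 'text']).lower().split()
    print(label, Counter(words).most_common(10))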

Lemmatization and stemming also differ between languages. For some languages, lemmatization and/or stemming are not good practice in the text preprocessing step.
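If you do decide to stem, make sure the stemmer is built for your language; nltk's SnowballStemmer, for example, has a German variant (the output below is indicative):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('german')
print(stemmer.stem('Krankheiten'))  # expected: 'krankheit'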

So, what do the mentioned steps look like on our data?

  • Note that we have already done the normalization part, so it is not shown below. Since the sentences are short, we will skip the lemmatization and stemming part. Also, we will focus only on the German text.

We will use the nltk library, since it contains stop words for German, along with regular expression operations (re) for string matching:

import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download the required nltk resources on first use
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words("german"))
table = str.maketrans('', '', string.punctuation)

def clean_text(text, for_embedding=False):
    # lowercase text
    text = text.lower()

    # keep only ASCII + European chars and whitespace, no digits
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    text = re.sub(RE_ASCII, " ", text)

    # tokenize words
    tokens = word_tokenize(text)

    # remove punctuation from each token
    tokens = [w.translate(table) for w in tokens]

    # remove short tokens
    tokens = [word for word in tokens if len(word) > 1]

    # remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)

df["text_cleaned"] = df["text"].map(lambda x: clean_text(x, for_embedding=False) if isinstance(x, str) else x)
The observed dataset with an additional column that contains the preprocessed text.
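As a usage example, here is what clean_text does to a made-up German sentence: the stop words "ich" and "habe" are removed, and the punctuation and emoji disappear:

# A made-up example sentence, not from the dataset
print(clean_text('Ich habe starke Kopfschmerzen!!! 😩 #krank'))
# -> 'starke kopfschmerzen krank'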

To be more confident about data preprocessing in another language, try to find NLP literature for the language of interest, or at least for its language family.

There are also examples and reviews of libraries that support preprocessing in the language you need, so you can check those out as well.

For the models, the test, validation, and training data must have a similar context, topics, source, and language, especially if the ML and DL models are trained on specific datasets (gaming forums, genetics research, etc.).

For DL models, you also need to make sure that the word embeddings are prepared for the language of interest.
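For German, one option is the pretrained fastText vectors, which gensim can load (a sketch, assuming you have downloaded the official German vectors file cc.de.300.vec from fasttext.cc):

from gensim.models import KeyedVectors

# Load pretrained German fastText vectors (word2vec text format)
de_vectors = KeyedVectors.load_word2vec_format('cc.de.300.vec', binary=False)

# Nearest neighbors of a German word as a sanity check
print(de_vectors.most_similar('krank', topn=5))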

Conclusion

  • English is the most popular and universal language for NLP; to understand data better, we can translate the given text to English.
  • There are universal steps for data preprocessing, but be sure to adapt each of them to the rules of the language of interest.
  • It is important to know what kind of data you have, the source of data, and your NLP task with it.
  • The test, validation, and training data for ML and DL models must have a similar context, topics, source, and language. For DL models, word embeddings have to be prepared for the language of interest.

Useful tools used in this article:

1. Google Translate and google-trans-new for text translation;

2. Natural Language Toolkit nltk for NLP;

3. Regular expression operations (re) for string matching.

Other useful tools:

1. googletrans for text translation;

NLP tools:

1. TextBlob;

2. CoreNLP;

3. Gensim;

4. spaCy;

5. polyglot;

6. scikit-learn.

Keep on NLP, and do not be afraid to state:

“NLP — No Language is a Problem”! ;-)

Thanks a lot to Edin Hamzic, who reviewed this article.

References

Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. https://arxiv.org/abs/2004.09095
