Why is the linguistic context important while working on textual data?

15 min read · Sep 3, 2023

Not long ago I shared with you the importance of data cleaning in delivering accurate, meaningful insights. One of the prerequisites that I highlighted back then was getting a real understanding of your data. Not only reviewing the ‘meat’ but also arming yourself with relevant domain knowledge (whether your own or that of an expert who supports you) will come in handy.

Today let’s zoom in on one of the disciplines that is very close to my heart and is strongly linked with data analytics: linguistics. It focuses on analyzing languages from multiple angles, starting from semantics and phonology, through syntax, and ending with socio- and psycholinguistics. Languages and their form are impacted by a variety of factors, from social group adoptions to the conventionalization of processes (changes in meaning, speech assimilation, etc.), hence if we are to use text data as a source of our analysis, addressing all the linguistic nuances might not be smooth sailing. Text analysis usually requires going beyond a purely binary analysis and enhancing your steps with higher and more advanced cognitive engagement.

To give you a grasp of what to watch out for, I will share some of the cultural and linguistic nuances that I find particularly important for understanding the context and accurately evaluating sentiment and in-depth meaning. In fact, you do not need to build the most advanced mBERT model to find this information useful. Whenever you are working with text data, having a grasp of cultural nuances will help you analyze the data more accurately and efficiently. Plus, I invite you to zoom in on languages that you already know and think about them in the context that I will share in a moment. You might discover something new.

1. Language register

Language registers refer to the level of formality used in a language. Different contexts and situations call for different language registers. In other words, the language that you use in professional settings is not necessarily the language you use among your friends.

Here is a brief explanation of the five main language registers with examples:

Static (or frozen): Language that remains unchanging, often found in historical documents or religious texts. For example, the language in the Bible or a national anthem is static because it is passed down unchanged through generations, with limited to no influence of any factors.
Formal (or regulated): Language that is used in professional or official contexts. It avoids slang, colloquialisms, and informal expressions. For example, academic papers, official documents, and business correspondence often use formal language.
Consultative (or professional): A standard form of communication. People speak in a slightly formal manner, but it’s not as formal as official communication. Examples include teacher-student discussions, doctor-patient conversations, and meetings.
Casual (or group): Informal language used among friends and people of the same age group or social standing. It includes colloquialisms, slang, and regional expressions. Examples include text messages between friends, social media posts, and casual chit-chats.
Intimate (or personal): Language used with people with whom one has a close relationship, such as family members or close friends. It may include private jokes, nicknames, and non-verbal cues that are understood only by the people involved.

So far so good. The complexity here lies in the fact that each language has its own morphological, phonological, and syntactical rules that influence the way these registers are used. For example, in English, formal language often involves using more complex sentence structures and avoiding contractions (e.g., using “cannot” instead of “can’t”). In contrast, in Japanese, formal language strongly relies on using different verb forms and honorifics.

To handle language registers properly in text analysis, we need to consider the specific linguistic characteristics of the language being analyzed. Stopwords, which are common words that are often removed in text analysis (e.g., “the,” “a,” “and” in English), may vary in importance depending on the register. For example, in a formal register, certain words that are considered stopwords in a casual register might be necessary to maintain the formality of the text. Therefore, the key is to be mindful of the stopwords being removed and to consider the context in which the text is written.
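As a minimal sketch of that idea in Python, the snippet below subtracts register-specific exceptions from a stopword set before filtering. The stopword list and the `KEEP_BY_REGISTER` mapping are hand-picked for illustration and are assumptions, not taken from any library.

```python
# Illustrative stopword list; a real project would start from a library list.
BASE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "shall", "not", "may"}

# Hypothetical per-register exceptions: words we do NOT want to drop.
KEEP_BY_REGISTER = {
    "formal": {"shall", "may", "not"},  # modal verbs and negation carry meaning
    "casual": {"not"},                  # keep negation, e.g. for sentiment
}

def remove_stopwords(text, register):
    """Drop stopwords, except those marked as meaningful for the register."""
    stopwords = BASE_STOPWORDS - KEEP_BY_REGISTER.get(register, set())
    return [tok for tok in text.lower().split() if tok not in stopwords]

sentence = "The tenant shall not sublet the premises"
formal_tokens = remove_stopwords(sentence, "formal")
casual_tokens = remove_stopwords(sentence, "casual")
print(formal_tokens)
print(casual_tokens)
```

With these toy lists, "shall" survives in the formal register but is dropped in the casual one, which is the kind of difference worth checking before you remove stopwords wholesale.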

What’s more, vocabulary normalization, which involves converting different forms of a word into a standard form (e.g., converting “am,” “are,” “is” to “be” in English), may also need to be handled differently for different registers. For example, picture a situation with a casual register. Here it might be appropriate to normalize “gonna” to “going to”. In a formal register, however, “gonna” should not appear at all. Therefore, to handle different registers appropriately, we might need to create a mapping for vocabulary normalization.
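Such a mapping could look roughly like the sketch below. The `CASUAL_TO_STANDARD` dictionary is a hypothetical, deliberately tiny example; a real mapping would be built from your own data.

```python
import re

# Hypothetical mapping of casual forms to their standard equivalents.
CASUAL_TO_STANDARD = {
    "gonna": "going to",
    "wanna": "want to",
    "gotta": "got to",
}

def normalize_casual(text, mapping=CASUAL_TO_STANDARD):
    """Replace whole-word casual forms with their standard form, ignoring case."""
    alternatives = "|".join(re.escape(word) for word in mapping)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

normalized = normalize_casual("I'm gonna call you, I just wanna check one thing")
print(normalized)
```

The word boundaries (`\b`) keep the replacement from touching longer words that merely contain one of the casual forms.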

2. Cultural context

References to historical events, folklore, literature, or pop culture can carry specific meanings in a particular culture but may not be understood or may have different associations in another culture. We do not need to reach the language level to see how awareness of the context impacts perception. Have you ever tried to show your grandparents memes that make you crack up? Or have you (if you are a Star Wars fan) ever used ‘May the Force be with you’ in front of somebody who has never watched Star Wars (like me)?

To give you some other examples, the phrase “Beware the Ides of March” is a reference to Shakespeare’s play “Julius Caesar” and is often used in English to express a warning. However, this reference may be lost on someone who is not familiar with Shakespeare or the historical events his plays are based on. Similarly, the phrase “Catch-22”, derived from the title of Joseph Heller’s novel, is used in English to describe a no-win situation or a dilemma that is impossible to resolve. Yet, this phrase may not be understood in the same way by someone unfamiliar with the book or the concept it represents.

Another example can be found in the use of mythological references. In Western literature, references to the Greek myth of Icarus, who flew too close to the sun and fell to his death, are often used to symbolize the dangers of overambition. However, a similar reference in another culture might involve a completely different set of characters and moral lessons. For instance, in Chinese culture, the myth of Icarus might be replaced by the story of Houyi the Archer, who saved the world by shooting down nine of ten suns.

Dealing with texts rich in cultural contexts may involve creating custom translation dictionaries that include common cultural references and their equivalents in the target culture, or developing models that are trained on culturally-specific text data. Additionally, it may be necessary to provide additional context or explanation for certain references, to ensure that the meaning is not lost in translation (for example by including a footnote explaining the original reference).
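A very simple version of such a dictionary is sketched below: a glossary lookup that returns explanatory notes for the references it finds. The glossary structure and the `annotate_references` helper are assumptions made for illustration; the entries paraphrase the examples above.

```python
# Hypothetical glossary; the explanations paraphrase the examples above.
CULTURAL_REFERENCES = {
    "ides of march": "Reference to Shakespeare's 'Julius Caesar'; used as a warning.",
    "catch-22": "From Joseph Heller's novel; a no-win situation.",
    "icarus": "Greek myth; symbolizes the dangers of overambition.",
}

def annotate_references(text, glossary=CULTURAL_REFERENCES):
    """Return (reference, explanation) pairs for every glossary entry found in the text."""
    lowered = text.lower()
    return [(ref, note) for ref, note in glossary.items() if ref in lowered]

notes = annotate_references("Applying for the permit was a real Catch-22.")
for ref, note in notes:
    print(f"[{ref}] {note}")
```

The returned notes could then be attached as footnotes or passed along as extra context to a translator or a downstream model.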

3. Morphological complexity

© Sadeniemi, Markus & Kettunen, Kimmo & Lindh-Knuutila, Tiina & Honkela, Timo. (2008). Complexity of European Union Languages: A comparative approach. Journal of Quantitative Linguistics. 15. 185–211. 10.1080/09296170801961843.

Morphology is the study of the structure and formation of words, and languages vary significantly in their morphological complexity. For instance, some languages, like English or Italian, have relatively simple morphological rules, while others, like Turkish or Finnish, have highly complex rules for word formation (which is why they are usually considered the most difficult languages to learn). Hence breaking down expressions into meaningful chunks (morphemes) might bring its own challenges.

For a more formalized approach, we differentiate:

- agglutinative languages

In agglutinative languages, such as Turkish, words are formed by stringing together morphemes, each with a single grammatical or semantic meaning. For example, in Turkish, the word “evlerimden” can be broken down into the morphemes “ev” (house), “-ler” (plural suffix), “-im” (my), and “-den” (from), which together mean “from my houses”. If we are using a tokenizer that is not trained to handle this level of morphological complexity, it might incorrectly break down the word into subparts that do not carry any meaning.

- inflectional languages

In inflectional languages like Russian, words change their form to express different grammatical characteristics such as case, gender, etc. For instance, in Russian, the word for book, “книга” (kniga), changes to “книги” (knigi) in the genitive case, and “книгу” (knigu) in the accusative case. Lemmatization (reduction of a word to its base or root form) is especially challenging in inflectional languages as the same word can appear in many different forms.

- analytic languages

Analytic languages like English and Chinese use word order and auxiliary words to express grammatical relationships. For instance, in English, the future tense is formed by adding the auxiliary word “will” before the verb, as in “will do”. While it might seem more straightforward from a morphological perspective compared to agglutinative and inflectional languages, it introduces its own challenges for tasks such as part-of-speech tagging.

- polysynthetic languages

Polysynthetic languages like Inuktitut tend to take morphological complexity to an extreme level, combining multiple morphemes into a single word to express what in other languages would be a complete sentence (I would love to learn how to pronounce these words!). For example, in Inuktitut, the word “tusaatsiarunnanngittualuujunga” means “I cannot hear very well”. Developing NLP tools for polysynthetic languages is particularly challenging due to the scarcity of annotated data and a very high level of morphological complexity.

This morphological variation impacts the pre-processing tasks such as tokenization, lemmatization, and stemming, which are crucial for the subsequent stages of NLP analysis. For example, a tokenizer that works well for English might not be suitable for a language with agglutinative morphology, where words are often formed by combining multiple morphemes.
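To make the Turkish example from above concrete, here is a toy sketch that splits “evlerimden” by greedily stripping known suffixes from the right. The suffix list is an assumption that covers only this example; it ignores vowel harmony and ambiguity, so it is no substitute for a real morphological analyzer.

```python
# Toy suffix inventory: only what this example needs, not real Turkish morphology.
SUFFIXES = ["ler", "lar", "den", "dan", "im"]

def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Greedily strip known suffixes from the right; return the stem plus suffixes in order."""
    morphemes = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            # Only strip if at least min_stem characters remain as the stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                morphemes.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + morphemes

print(segment("evlerimden"))
```

A whitespace tokenizer would return the whole word as a single token, losing the plural, the possessive, and the case information that the individual morphemes carry.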

4. High context vs. low context cultures

Due to the broad-brush cultural differences between societies, we differentiate low-context and high-context cultures. Now, an important note: although some cultures sit close to the extremes, this is not an absolute classification. We are talking about a spectrum on which each culture is positioned relative to the others. If you are curious about how the cultures relate to one another, check out Erin Meyer’s book ‘The Culture Map’.

High-context cultures are those in which communication relies on implicit, indirect cues. A lot of the meaning in communication is derived from the context, the speaker’s tone of voice, facial expressions, body language, or even the person’s status or role. These cultures often value group cohesion, interpersonal relationships, and tradition. Examples of high-context cultures include Japanese, Arabic, and Korean cultures.

Low-context cultures, on the other hand, rely on explicit, direct communication. Words are used to express thoughts, ideas, and emotions as directly as possible, and there is less reliance on non-verbal cues to infer meaning. These cultures often value individualism, directness, and clarity. Examples of low-context cultures include German, American, and Scandinavian cultures.

Understanding these differences is crucial for effective intercultural communication and is also a significant consideration in the field of NLP and AI. For instance, developing chatbots or translation services that can understand and appropriately respond to communication from different cultural contexts is a significant challenge.

5. Idioms and Phrases

Many languages have idioms or phrases that carry a specific meaning in their culture but may not be directly translatable or understood in another culture. For example, the English idiom “cutting off your nose to spite your face” is a way of describing a situation where someone acts out of anger or spite to harm someone else but ends up harming themselves instead. A direct translation of the idiom into German would be “Schneide deine Nase ab, um dein Gesicht zu ärgern”, while a German speaker would actually say “sich ins eigene Fleisch schneiden” (to cut into one’s own flesh) instead.

Another example of an idiom that doesn’t translate directly is the English idiom “a piece of cake,” which means something is very easy. A direct translation into Spanish, “un pedazo de pastel,” does not carry the same meaning for Spanish speakers. Instead, the equivalent idiom in Spanish is “pan comido,” which literally translates to “eaten bread.”
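One way to protect idioms from literal translation is to look them up before the text reaches a translation step. The sketch below uses the two examples from this section; the table layout and the `find_idioms` helper are hypothetical.

```python
# Idiom equivalents taken from the examples above; the table structure is hypothetical.
IDIOM_EQUIVALENTS = {
    ("a piece of cake", "es"): "pan comido",
    ("cutting off your nose to spite your face", "de"): "sich ins eigene Fleisch schneiden",
}

def find_idioms(text, target_lang, table=IDIOM_EQUIVALENTS):
    """Return {idiom: equivalent} for idioms in the text that have an entry for target_lang."""
    lowered = text.lower()
    return {
        idiom: equivalent
        for (idiom, lang), equivalent in table.items()
        if lang == target_lang and idiom in lowered
    }

matches = find_idioms("The exam was a piece of cake.", "es")
print(matches)
```

Plain substring matching will miss inflected variants (“cut off his nose…”), so a production version would need lemmatization or fuzzy matching on top.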

6. Contextual Meanings

Words can have different meanings based on the cultural context in which they are used. For example, the word “freedom” in the United States is often associated with positive concepts such as liberty, independence, and the right to express oneself, given the country’s history. However, in a country like South Africa, which has a history of apartheid, “freedom” might be associated more with the struggle to put an end to racial segregation and oppression, and the ongoing efforts to address the legacies of that period. Similarly, the word “democracy” might be viewed positively in countries like Sweden, with a long history of stable democratic governance, but in countries like Iraq (where ‘democracy’ has been associated with political instability), it may carry a more negative or skeptical connotation.

7. Named Entities

Proper nouns, such as names of people, organizations, or places often carry cultural significance. For example, the name “Washington” might be associated with the United States’ capital or a past president in an American context, but it might not carry the same associations in other cultures.

Similarly, the name “Nelson Mandela” will evoke a strong association with the anti-apartheid movement and the presidency in South Africa, but in other cultures, this association may be weaker, or Mandela might primarily be associated with other aspects of his life or work. These examples highlight the importance of considering the cultural context when analyzing named entities in text data. Named Entity Recognition (NER) models need to be trained on diverse datasets that include examples from various cultures to accurately identify and process named entities in a global context. Moreover, the associations and sentiments attached to named entities might need to be analyzed and interpreted differently based on the cultural context of the text or the audience. Therefore, it is crucial to incorporate cultural knowledge into the NER process and the analysis of the identified entities.

8. Sentiment Analysis

The sentiment associated with certain words or phrases can vary dramatically across cultures. For instance, the color white is usually associated with purity and peace in Western cultures. Brides traditionally wear white at weddings, and it is the color worn by doctors and nurses, symbolizing cleanliness and safety. In some Eastern cultures though, white is the color associated with mourning and death. For example, in Chinese culture, white is the color of mourning and is traditionally worn at funerals. Similarly, in Indian culture, widows usually wear white saris as a symbol of mourning.

Similarly, the number 13 is considered unlucky in many Western cultures, to the point where some buildings either skip the 13th floor or label it as 12B or 14A. However, in some Asian cultures, the number 13 is considered lucky. On the flip side, the number 4 is considered very unlucky in Chinese culture, as it sounds similar to the word for ‘death’, while in Western cultures it does not carry the same negative connotation.

These differences in sentiment and association can have significant implications for NLP tasks such as sentiment analysis, where the goal is to determine the emotional tone or attitude expressed in the text. Therefore, it is important to take into account cultural differences in sentiment and association when developing and deploying NLP models and to incorporate this knowledge into the pre-processing and analysis steps. This may involve creating custom sentiment dictionaries for each target culture or using culturally specific sentiment analysis models.
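A custom sentiment dictionary per culture can start out as simply as the sketch below. The scores are made-up values in [-1, 1], loosely based on the examples in this section, and the culture labels are illustrative assumptions.

```python
# Hypothetical scores in [-1, 1], loosely based on the examples above.
LEXICONS = {
    "western": {"white": 0.5, "13": -0.5},
    "chinese": {"white": -0.5, "4": -0.8},
}

def score(text, culture, lexicons=LEXICONS):
    """Sum the lexicon scores of the tokens; unknown tokens count as neutral (0)."""
    lexicon = lexicons[culture]
    tokens = text.lower().replace(".", " ").split()
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

sentence = "She wore white."
print(score(sentence, "western"))
print(score(sentence, "chinese"))
```

The same sentence gets opposite scores depending on which lexicon is applied, which is exactly why the target culture has to be a parameter of the analysis rather than an afterthought.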

9. Honorifics

Honorifics, or expressions of respect, introduce significant complexities in NLP data pre-processing, turning it into a challenge. For instance, in Japanese, the use of “-san” after a person’s name is a common honorific that shows respect, similar to Mr. or Mrs. in English. In Spanish, “Don” or “Doña” is used before a first name to show respect, usually for much older or respected individuals for the first contact (otherwise it is kind of corny). In Filipino, adding “Po” in a sentence is a way to show respect, especially to elders.

The context and cultural norms associated with honorifics vary widely between languages, which calls for highly specialized algorithms to accurately identify and interpret them. For example, in Japanese, it is common to use the person’s last name followed by “-san” unless you are very familiar with them, while in Sweden, it is common to use first names even in professional settings. Similarly, in Filipino culture, it is considered polite to use “Po” even when speaking to strangers, while in other cultures, this level of formality might not be necessary.

These differences in the usage and significance of honorifics across cultures can affect various NLP tasks, such as machine translation, named entity recognition, and sentiment analysis. For example, a machine translation model might incorrectly translate the Japanese “Tanaka-san” into “Mr. Tanaka” in English, even though this level of formality might not be appropriate in the given context.
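One way to avoid baking a fixed rendering into the pipeline is to separate the honorific from the name first and decide later how to handle it. The sketch below does this for romanized Japanese with a regular expression; the suffix list is an illustrative subset and the pattern assumes hyphenated, capitalized names.

```python
import re

# Illustrative subset of romanized Japanese honorific suffixes.
HONORIFIC_SUFFIXES = ("san", "sama", "kun", "chan", "sensei")

# Assumes names are capitalized and joined to the honorific with a hyphen.
PATTERN = re.compile(r"\b([A-Z][a-z]+)-(" + "|".join(HONORIFIC_SUFFIXES) + r")\b")

def split_honorifics(text):
    """Return (name, honorific) pairs found in romanized text."""
    return PATTERN.findall(text)

pairs = split_honorifics("Tanaka-san met Suzuki-sensei yesterday.")
print(pairs)
```

Keeping the name and the honorific as separate fields lets a later step choose between “Mr. Tanaka”, plain “Tanaka”, or leaving “Tanaka-san” untouched, depending on the context.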

How can these nuances be handled in a text analysis project?

1. Customized Pre-processing: Customize the pre-processing steps for each language you are working with. For instance, tokenization, stemming, and lemmatization should be language-specific. Python’s Natural Language Toolkit (NLTK) and the spaCy library offer language-specific tokenizers and lemmatizers. You may use spaCy’s French language model for tokenizing French text, as it will be more aware of the nuances of the French language compared to a generic tokenizer. For Turkish, I found Zemberek, a morphological analyzer (it is a Java library, but accessible in Python, for example through jpype), which will list the morphemes of a given word.

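import jpype

# Start the JVM with the Zemberek jar on the classpath
# (assumes zemberek-full.jar is in the working directory).
jvm_path = jpype.getDefaultJVMPath()
class_path = "-Djava.class.path=zemberek-full.jar"
jpype.startJVM(jvm_path, class_path, "-ea")

# Load the analyzer class and build it with the default Turkish lexicon.
TurkishMorphology = jpype.JClass('zemberek.morphology.TurkishMorphology')
morphology = TurkishMorphology.createWithDefaults()

# Analyze a single word; each result is one possible morphological reading.
word = "okula"
results = morphology.analyze(word)

for result in results:
    print(result)

jpype.shutdownJVM()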

2. Cultural and Linguistic Knowledge: Incorporate cultural and linguistic knowledge into your NLP models. For instance, if you know that a particular language uses honorifics, you can create a custom list of these and ensure they are handled correctly during pre-processing. For example, in Python, you can create a list of common honorifics in a language and then check if any words in your text match these before processing. In the example below, if the token is in the list of honorifics, it adds _honorific to the token.

def preprocess_honorifics(text, honorifics):
    # Split on whitespace so that "Dr." and "Mr." stay intact as tokens.
    tokens = text.split()
    processed_tokens = []

    for token in tokens:
        if token in honorifics:
            # Tag honorifics so that later steps can treat them specially.
            processed_token = token + '_honorific'
        else:
            processed_token = token.lower()
        processed_tokens.append(processed_token)

    processed_text = ' '.join(processed_tokens)
    return processed_text

text = "Dr. Smith and Mrs. Jones are attending a meeting with Mr. Kim."
honorifics = ["Dr.", "Mrs.", "Mr."]
processed_text = preprocess_honorifics(text, honorifics)
print(processed_text)

3. Language-specific Named Entity Recognition (NER): Using language-specific Named Entity Recognition models will help you identify and process named entities accurately. For instance, the spaCy library supports 70+ languages and provides pre-trained pipelines with NER for a subset of them. Aside from installing spaCy, for English, you will also need to download a model such as en_core_web_sm.

import spacy
nlp = spacy.load('en_core_web_sm')

text = "Barack Obama was born in Hawaii."
doc = nlp(text)

entities = [(ent.text, ent.label_) for ent in doc.ents]

for entity in entities:
    print(f"{entity[0]} ({entity[1]})")

4. Context-aware Sentiment Analysis: Use context-aware sentiment analysis models that take into account the cultural and linguistic nuances of the language. You can train a sentiment analysis model on a culturally specific dataset or fine-tune a pre-trained model using a dataset that includes cultural nuances.

5. Use of Translation Services: When necessary, employ translation services to convert text into a language that your NLP model can process more accurately. However, be mindful that translation can sometimes alter the meaning of the text.

6. Regular Expressions: Use regular expressions in Python to accurately identify and handle culture-specific text patterns, like the date format in the example below.

import re

def convert_us_to_uk_date(date_str):
    # MM/DD/YYYY -> DD/MM/YYYY: swap the first two captured groups.
    us_date_pattern = r'(\d{2})/(\d{2})/(\d{4})'
    return re.sub(us_date_pattern, r'\2/\1/\3', date_str)

us_date_str = '05/30/2023'
uk_date_str = convert_us_to_uk_date(us_date_str)

print('US Date:', us_date_str)
print('UK Date:', uk_date_str)

You can use a similar approach to handle other culturally specific text patterns, for example, number formats, currency formats, and address formats.
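For number formats, a sketch along the same lines could look like this. It assumes the continental European convention (a dot as the thousands separator, a comma as the decimal separator) and only matches numbers that have a decimal part; the function name is hypothetical.

```python
import re

def convert_eu_to_us_number(number_str):
    """Convert '1.234,56' style numbers to '1,234.56' by swapping the separators."""
    # 1-3 digits, optional groups of '.ddd', then a comma and the decimal digits.
    eu_number_pattern = r'\b\d{1,3}(?:\.\d{3})*,\d+\b'
    swap = str.maketrans({'.': ',', ',': '.'})
    return re.sub(eu_number_pattern, lambda m: m.group(0).translate(swap), number_str)

eu_text = 'Total: 1.234,56 EUR'
us_text = convert_eu_to_us_number(eu_text)

print('EU format:', eu_text)
print('US format:', us_text)
```

Requiring the decimal part is a deliberate restriction: a bare “1.234” is ambiguous without knowing the locale of the source text.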

— — — —

The world of languages is filled with hidden nuances that may not be immediately obvious but can greatly impact your analysis (one way or another). Language is not just a communication tool, but a lens through which we interpret the world, hence a nuanced understanding of linguistic and cultural contexts can lead to more insightful analyses and better-informed decisions.

I hope you find some of these tips useful. Stay tuned!

— — — —

If you feel like discussing this further, drop me a line at:

So much fun to talk to other data geeks!

Written by Iza Stań