Deep Learning Challenges for Code-Mixed Conversational Text

Devanshi Bhatt
Published in Spectrum Labs
6 min read · May 5, 2020

The growing use of the Internet and online social platforms has enabled a massive exchange of thoughts and experiences virtually. According to Internet World Stats, 58.8% of the world's total population uses the internet, across many regions and languages. However, the many boons of the internet have come with a price. While the internet has become a massive churning pot of ideas, it has also become a medium for spreading toxicity. 72% of Americans are active on social media, and 53% have been personally subjected to harassing behavior online. In fact, per UNICEF, more than a third of young people in 30 countries have reported being a victim of online bullying. It thus becomes important to analyze the content on these conversational platforms in order to recognize and respond to the toxic behaviors that arise on them, and because users come from around the world, that analysis must work across many languages.

In order to identify toxicity in text data, we can use text classification, which combines Natural Language Processing (NLP) with machine learning or deep learning approaches. The typical workflow of a text classification task looks like this:

Text Classification workflow
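To make the workflow concrete, here is a minimal sketch using scikit-learn with classic frequency-based features. The tiny labeled dataset is a made-up placeholder, not real training data.

# A minimal sketch of the text classification workflow: preprocess, vectorize, train, predict.
# Uses scikit-learn; the toy dataset and labels below are made-up placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "you are awesome",           # non-toxic
    "thanks for the help",       # non-toxic
    "nobody likes you, leave",   # toxic
    "you are a complete idiot",  # toxic
]
train_labels = [0, 0, 1, 1]      # 0 = non-toxic, 1 = toxic

classifier = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),  # vectorization
    ("model", LogisticRegression()),                                 # classification
])
classifier.fit(train_texts, train_labels)

print(classifier.predict(["thanks, you are awesome", "leave, you idiot"]))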

Traditional machine learning algorithms rely on methods like count vectorization or TF-IDF vectorization, which take into account the frequency of words in a given corpus. These features can provide some signal around common words and phrases; however, they are unable to capture the semantics that come from the context of those words. More recent work has focused on pre-trained word embeddings like word2vec, GloVe, and ELMo, which statistically quantify the relative semantics between words by embedding them into high-dimensional vectors. Deep learning approaches then provide the tools to leverage these high-dimensional vectors for solving our text classification problems.
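For instance, a pre-trained embedding can be loaded and queried in a few lines. The sketch below assumes the gensim package and the "glove-wiki-gigaword-100" vectors from its downloader; any comparable pre-trained English embedding would do.

# Sketch: loading pre-trained English word vectors and inspecting their semantics.
# Assumes gensim and its downloadable "glove-wiki-gigaword-100" vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

print(glove["love"].shape)                    # (100,) -- one dense vector per word
print(glove.most_similar("love", topn=3))     # nearby words in the vector space
print(glove.similarity("love", "hate"))       # relative semantics as a similarity score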

These word vectorization approaches work well when the text is written in the native script of the language being used. However, when people communicate online by text, they may use not only different languages but also different scripts. For example, a person who knows both Hindi and English could, depending on the platform they're using, chat in one of three ways:

1. English language using English characters

2. Hindi language using Hindi characters

3. Hindi language using English characters

In our experience working with text from a variety of platforms and verticals, we come across the first two categories of text data on a daily basis. However, the third category of text is not very common in the public NLP domain, which makes it particularly interesting but challenging to analyze. In the above example, this category consists of Hindi spelled phonetically using English characters. You could use a pre-trained word embedding to vectorize text that belongs to the first two categories, but how would you vectorize text from the third category when there isn't sufficient data to pre-train an embedding? Let's understand the problem better by considering three versions of the same message:

English message in English script: I love you

  • Word tokenization: I, love, you
  • Vectorization: Each token is assigned a vector based on the n-dimensional pre-trained English word embedding.

Hindi message in Hindi script: मै तुमसे प्यार करती हूँ

  • Word tokenization: मै, तुमसे, प्यार, करती, हूँ
  • Vectorization: Each token is assigned a vector based on the n-dimensional pre-trained Hindi word embedding.

Hindi message in English script: mei tumse pyar karti hoon

  • Word tokenization: mei, tumse, pyar, karti, hoon
  • Vectorization: How do you vectorize this?
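One way to see the problem is to check each token against the vocabulary of an English embedding. This is a sketch using the same assumed GloVe vectors as above; exactly which romanized-Hindi tokens come back as out-of-vocabulary depends on the particular embedding, but they will generally have no meaningful English vector.

# Sketch: checking which tokens of each message have a vector in an English embedding.
# Uses the same assumed GloVe vectors as the earlier snippet.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

messages = {
    "English in English script": "i love you",
    "Hindi in English script": "mei tumse pyar karti hoon",
}

for label, text in messages.items():
    tokens = text.split()
    missing = [t for t in tokens if t not in glove]  # tokens with no English vector
    print(label, "-> out-of-vocabulary tokens:", missing)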

Here is where the challenge of vectorization for our target text begins.

  1. Open Source Word Embedding — There are publicly available pre-trained word embeddings for 100+ languages from Facebook's fastText library, as well as pre-trained multilingual models like Google's multilingual BERT, that can be used to train models on code-mixed text. However, no such pre-trained embedding or model exists for our target text.
  2. Creating a Word Embedding — In order to analyze such text, it becomes important to use a word embedding, and if there isn't one available, why not create one? Again, pre-training a word embedding requires a large number of samples to be effective. Alternatively, a word embedding could be built via single-word translations, but this would be a lengthy procedure of mapping each word in Hindi (in this case) to its corresponding English translations, which could then be used to vectorize the text.
  3. One-to-many Mapping — In order to create such a word embedding, an important issue to consider is the one-to-many mapping of words. In the above example, every word in the English-characters-Hindi sentence can be written in more than one way, for example, mei == (mai, mein, main) OR tumse == (tumko, tujse, tujhse, teko). This means that when mapping from English to English-characters-Hindi (OR Hindi to English-characters-Hindi), all possible spellings of each word would have to be included; a minimal sketch of such a mapping appears after this list.
  4. Availability of Data — The biggest challenge, however, is the requirement of data that can help in building such a word embedding with a one-to-many mapping. If there is enough data available, then problems 1 and 2 can be solved. However, the kind of language used in the above example is only prevalent in chat/conversational settings, where the language of communication is informal. Such conversational data is not available in open sources like Wikipedia, newspapers, or blogs, where this style of writing is not used.
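To make points 2 and 3 concrete, here is a small, entirely hypothetical sketch of what a translation-based, one-to-many mapping could look like. The variant dictionary covers only the example sentence, and the tiny stand-in vectors are placeholders for a real pre-trained English embedding.

# Hypothetical sketch: map romanized-Hindi spelling variants to an English word
# whose pre-trained vector already exists, then reuse that vector.
# A real resource would need linguists and far more coverage than this.
variant_to_english = {
    "mei": "i", "mai": "i", "mein": "i", "main": "i",
    "tumse": "you", "tumko": "you", "tujse": "you", "tujhse": "you", "teko": "you",
    "pyar": "love", "pyaar": "love",
}

def vectorize_token(token, english_vectors):
    """Vectorize a romanized-Hindi token by routing it through English."""
    english_word = variant_to_english.get(token)
    if english_word is None:
        return None                       # unmapped token: still out-of-vocabulary
    return english_vectors[english_word]  # reuse the pre-trained English vector

# Tiny stand-in for a real pre-trained English embedding.
toy_english_vectors = {"i": [0.1, 0.2], "you": [0.3, 0.4], "love": [0.5, 0.6]}
print(vectorize_token("tujhse", toy_english_vectors))  # vector for "you"
print(vectorize_token("hoon", toy_english_vectors))    # None: variant not mapped yet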

The challenges stated above are not limited to a Hindi conversation in English characters. The same issue can (and does!) arise for conversational data between users who communicate in languages that don't use the Western Latin alphabet (e.g., Hindi or Arabic) or that use a non-Latin writing system (e.g., Japanese, Chinese, Korean).

Open Question — What is the best approach to analyze code-mixed conversational data?

One potential solution, as mentioned above, would be to create a word embedding which can vectorize the tokens in text data. However, as there are very limited sources to get the data required for this task, it would need a team of linguists to come together and create such an embedding by exhausting the vocabularies of the target languages.

Another solution is to use the googletrans library, which does a decent job of detecting the source language of a piece of text and translating it into a specified target language. This library can be used at the preprocessing stage, before vectorizing the input text tokens. However, language detection introduces its own errors, as does the translation process. For a text classification task, particularly with conversational text, errors in the translated text can have a significant impact on accuracy, since the semantics of the sentence may be lost in translation. Translating a few spelling variants of the example above with googletrans illustrates the problem.
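Here is a minimal sketch of that preprocessing step. It assumes the synchronous googletrans Translator API; the spelling variants are the ones discussed earlier, and the actual detections and translations you get back will vary with the library version and the translation backend.

# Sketch: detecting and translating romanized-Hindi spelling variants with googletrans.
# Assumes the synchronous googletrans Translator API; outputs vary by version and backend.
from googletrans import Translator

translator = Translator()
variants = [
    "mei tumse pyar karti hoon",
    "main tumse pyaar karti hu",
    "mai tujhse pyar karti hun",
]

for text in variants:
    detected = translator.detect(text)                   # language detection (can be wrong)
    translated = translator.translate(text, dest="en")   # translate to English
    print(text, "->", detected.lang, "->", translated.text)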

All of these variants mean the same thing; however, the style of writing affects the translation, which, in turn, would affect the performance of a model trained on this data.

A third approach is to use character-level embeddings, which means vectorizing by character rather than by word or token. While this can help reduce the amount of data needed compared to pre-trained word embeddings, care must still be taken with the complexities of the one-to-many problem: the more variation there is in the spelling of a word, the more data the model needs to learn that word.
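As a rough illustration of what character-level vectorization looks like in practice, here is a small sketch using PyTorch. The character vocabulary, sequence length, and embedding size are arbitrary illustrative choices, and a real model would feed these character vectors into something like a char-CNN or LSTM.

# Sketch: character-level vectorization with PyTorch. Each message becomes a
# sequence of character indices that a downstream model can embed.
import torch
import torch.nn as nn

chars = "abcdefghijklmnopqrstuvwxyz '"                    # toy character vocabulary
char_to_index = {c: i + 1 for i, c in enumerate(chars)}   # index 0 is reserved for padding

def encode(text, max_len=40):
    ids = [char_to_index.get(c, 0) for c in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))               # pad to a fixed length

# Two romanized spellings of the same sentence share most of their characters,
# so a character model can treat "pyar" and "pyaar" as near-identical inputs.
batch = torch.tensor([
    encode("mei tumse pyar karti hoon"),
    encode("main tumse pyaar karti hu"),
])

char_embedding = nn.Embedding(num_embeddings=len(chars) + 1, embedding_dim=16, padding_idx=0)
print(char_embedding(batch).shape)                        # torch.Size([2, 40, 16])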

At Spectrum Labs, this is the kind of problem we tackle as we help platforms address the rising toxicity they see. We are also encouraged by related work such as Deep Learning Technique for Sentiment Analysis of Hindi-English Code-Mixed Text using Late Fusion of Character and Word Features by Siddhartha Mukherjee and Curriculum Learning Strategies for Hindi-English Codemixed Sentiment Analysis by Anirudh Dahiya et al., and it motivates us as we seek to make the internet a safer place for all its users, no matter their language (or languages) of choice. These are challenges not only for us but for any NLP solution that seeks to help a global audience. Please comment below if you've seen or conducted related work; in future posts, we'll share what we learn and the progress we make.
