Improving language detection for chat (Discord)

Jourdelune
4 min read · Dec 31, 2022


Detecting the language of messages on online chat platforms is often a challenge. On Discord, for example, users send very short messages, sometimes only a couple of words (e.g. “aahh yes!”).

The language of such messages is very hard to identify, and detectors often get it wrong. This is a particular problem for me: I run a Discord translation bot, and wrong detections trigger unnecessary translations that spam the chat. So I started working on improving language detection for Discord messages, trying several different approaches.

At first, I simply used two different fastText models: if both detected the same language, I considered the detection reliable. This posed two problems. First, on short texts the two detectors rarely agreed. Second, even when they did agree, the detection was sometimes wrong.

So I chose another approach. I tested many open-source detectors and, unfortunately, none was reliable enough to decide on the language alone. Taken as a group, however, their results often did contain the right language, though rarely in first place. Here is an example to illustrate this:

As we can see, in this fictitious example the majority language is indeed English, but if we stopped at the two languages with the highest individual probability, the answer would be either Spanish or Danish.

So I turned to the article “A Language Detection System for Short Chats in Mobile Games” to look at the different approaches I could use, and here’s how I decided to organize my detector.

I use a generic fastText detector (lid.176.ftz) together with Lingua: for each language, I add up the probabilities returned by the detectors and keep the language with the highest total. I tested this approach and got a correct identification rate of about 80% on my Discord message dataset. Not bad… but not enough.
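To make the summing step concrete, here is a minimal sketch. It is not the code of my package: the function name combine_scores and the probabilities are made up for illustration, and each detector is assumed to return a mapping from language code to probability.

```python
from collections import defaultdict


def combine_scores(per_detector_scores):
    """Sum each language's probability across detectors; return (best, totals)."""
    totals = defaultdict(float)
    for scores in per_detector_scores:
        for lang, prob in scores.items():
            totals[lang] += prob
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Made-up outputs: neither detector ranks English first on its own.
fasttext_scores = {"es": 0.40, "en": 0.35, "fr": 0.25}
lingua_scores = {"da": 0.45, "en": 0.40, "de": 0.15}

best, totals = combine_scores([fasttext_scores, lingua_scores])
print(best)  # en
```

English wins here because it is the only language that both detectors give a meaningful probability to, even though each detector prefers another language.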

Reading the messages that the detectors got wrong, I realized that some of them, either too short or made of expressions like “xDDDD”, were always misdetected. So I added two other systems to my detector.

As suggested in the paper, a dictionary-based system is very effective for very short texts, a single word for example. So I extracted the most used words in my datasets and built a dictionary for each language. I also established a priority order between languages, according to the number of people speaking each one on Discord. If a text had the same probability for two different languages (like “I”, which can be considered both an English and a French expression), then the higher-priority language, here English, would be chosen. Finally, I trained a fastText model on Discord messages.
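The dictionary lookup with a priority tie-break could look roughly like this. It is only a sketch under assumptions: the word lists, the PRIORITY order and the function name dictionary_detect are hypothetical, and the score is simply the share of tokens found in each language's word list.

```python
# Hypothetical word lists; real ones would come from the most frequent
# words of each language in the dataset.
WORDS = {
    "en": {"i", "yes", "hi", "the"},
    "fr": {"i", "oui", "salut", "le"},
}
# Hypothetical priority: earlier means more speakers on Discord.
PRIORITY = ["en", "fr"]


def dictionary_detect(text):
    """Return the language whose word list covers the most tokens, or None."""
    tokens = text.lower().split()
    if not tokens:
        return None
    hits = {
        lang: sum(tok in vocab for tok in tokens) / len(tokens)
        for lang, vocab in WORDS.items()
    }
    best_score = max(hits.values())
    if best_score == 0:
        return None  # no known word at all
    # On a tie, the first language in PRIORITY wins.
    for lang in PRIORITY:
        if hits[lang] == best_score:
            return lang
    return None


print(dictionary_detect("I"))    # en (tie with fr, broken by priority)
print(dictionary_detect("oui"))  # fr
```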

I also added a pre-processing step that removes punctuation, special characters and emojis, and collapses excessively repeated letters (turning “ahhhhhhhh” into “ah”) to get better results.
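A pre-processing step of this kind can be sketched with a few regular expressions. The threshold is an assumption on my part for this example: only runs of three or more identical letters are collapsed, so that normal double letters (as in “hello”) survive. The function name preprocess is also just illustrative.

```python
import re


def preprocess(text):
    """Strip punctuation/symbols/emojis and collapse long letter runs."""
    # Replace anything that is not a letter, digit or whitespace
    # (this covers punctuation, most symbols and emojis, and "_").
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of 3+ identical letters into a single letter.
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("ahhhhhhhh!!! 😂"))  # ah
print(preprocess("xDDDD"))            # xD
print(preprocess("hello, world"))     # hello world
```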

By combining these four approaches (Lingua, the fastText model lid.176, a fastText model trained on Discord messages, and dictionary detection), then summing the probabilities for each language and taking the highest, I was able to create a language detection system for Discord messages that is quite accurate: 90% correct detections, and 97% if I exclude the messages whose detection reliability is too low.
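The difference between the two figures comes from which messages are counted. Here is a small sketch of that calculation; the sample predictions are invented and do not come from my dataset, and the tuple layout (predicted language, reliable flag, true language) is an assumption for the example.

```python
def accuracy(predictions, reliable_only=False):
    """predictions: list of (predicted_lang, reliable, true_lang) tuples."""
    kept = [p for p in predictions if p[1] or not reliable_only]
    if not kept:
        return 0.0
    correct = sum(pred == truth for pred, _, truth in kept)
    return correct / len(kept)


# Invented sample: two detections are flagged as unreliable.
sample = [
    ("en", True, "en"),
    ("fr", True, "fr"),
    ("es", False, "en"),  # wrong, but flagged unreliable
    ("en", True, "en"),
    ("de", False, "de"),  # right, but flagged unreliable
]

overall = accuracy(sample)                         # 4 correct out of 5
on_reliable = accuracy(sample, reliable_only=True)  # 3 correct out of 3
print(overall, on_reliable)  # 0.8 1.0
```

The trade-off is that the higher figure only applies to the messages the detector is willing to commit to; the unreliable ones are left untranslated.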

I made it into a usable Python package; here is the GitHub repo: https://github.com/Interaction-Bot/LanguageDetection.git.

You can also test it by running pip install ShortLanguageDetection and then this simple code.

import ShortLanguageDetection

detection = ShortLanguageDetection.Detector() # pass reliable_min=0.5 as an argument for fewer wrong detections.
print(detection.detect('hi'))
# ('en', True)

There are, however, some disadvantages to this system. The first is that very short texts will tend to be detected as one of the languages commonly used on Discord. That is the purpose of the task, but it can cause problems if the actual language is not among them. This can be mitigated by increasing the number of languages in the dictionary as well as in the fastText detector trained on Discord messages. A second possible improvement would be to normalize the probabilities: for example, the language aa (Afar) is only detected by the fastText lid.176 model, so its probability should not be divided by the total number of detectors, but only by the number of detectors able to detect it.
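As a sketch of that normalization idea (not something the package does today), each language's summed probability can be divided by the number of detectors that support it. The detector names, the SUPPORTED coverage table and the scores below are all hypothetical.

```python
# Hypothetical coverage: which languages each detector can output.
SUPPORTED = {
    "fasttext_lid176": {"en", "fr", "aa"},
    "lingua": {"en", "fr"},
    "discord_fasttext": {"en", "fr"},
}


def normalized_scores(per_detector_scores):
    """Average each language only over the detectors able to detect it."""
    totals = {}
    for scores in per_detector_scores.values():
        for lang, prob in scores.items():
            totals[lang] = totals.get(lang, 0.0) + prob
    result = {}
    for lang, total in totals.items():
        capable = sum(lang in langs for langs in SUPPORTED.values())
        result[lang] = total / max(1, capable)  # guard against unknown languages
    return result


# Made-up scores: only lid.176 can ever vote for Afar.
scores = {
    "fasttext_lid176": {"aa": 0.6, "en": 0.4},
    "lingua": {"en": 0.5, "fr": 0.5},
    "discord_fasttext": {"en": 0.3, "fr": 0.7},
}
norm = normalized_scores(scores)
print(max(norm, key=norm.get))  # aa
```

With a plain division by three detectors, Afar would drop to 0.2 and lose to English and French; dividing by the single capable detector keeps it at 0.6.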

Don’t hesitate to contribute on GitHub. I hope this was useful to you.
