NLP Language Detection (Identification): Word Level or Sentence Level?
Numerous open-source language detection (identification) libraries are available and work reasonably well; examples include polyglot, spaCy, and NLTK. However, if you experiment with these libraries, you may find that although they perform better than average, they are not as accurate or as intelligent as a human reader, particularly on text that mixes multiple languages (code-mixing), such as Malaysian Rojak language.
Several approaches can help achieve high language detection accuracy on text data, such as:
- Using multiple language detectors on the same text: run each detector, store every detected language in a list, and take the mode (the most frequently detected language).
- Building a language detection system that focuses only on the languages that are needed. This requires a lot of training data (a corpus) that should cover about 90% of the text written in a given language, so it takes a lot of time and computing power.
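The first approach (collect one prediction per detector and take the mode) can be sketched as below. This is only a minimal illustration: `detector_a`, `detector_b`, and `detector_c` are hypothetical stand-ins that return fixed language codes, and in practice each would wrap a call to a real library.

```python
from collections import Counter


def vote_language(text, detectors):
    """Run every detector on the same text and return the mode of the votes."""
    votes = [detect(text) for detect in detectors]
    # most_common(1) returns the label with the highest count; on a tie,
    # the label that was seen first wins.
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes


# Hypothetical detectors (assumptions for illustration only).
def detector_a(text):
    return "ms"


def detector_b(text):
    return "en"


def detector_c(text):
    return "ms"


winner, votes = vote_language(
    "Saya suka makan nasi lemak",
    [detector_a, detector_b, detector_c],
)
print(winner, votes)
```

Using an odd number of detectors reduces the chance of a tie, although a tie is still possible when the detectors disagree on more than two languages.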
That raises a question: for each approach (an existing library or a customized language detection system), should it detect the language of each word or the language of the sentence as a whole? The direct answer is “it depends on your usage and application”.
Word Level Language Detection
Word-level language detection suits applications such as dictionary lookup (determining which language a word belongs to using the appropriate data) or translation of a single word. But there is a problem: a single word may exist in more than one language. How are we (the machine) supposed to know which language a word belongs to? It needs to know the surrounding sentence (the words before or after the word) to work out the overall language, right?
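The ambiguity described above can be shown with a small dictionary lookup. The lexicons here are tiny, hand-made assumptions (a real system would load full word lists), and `candidate_languages` is a hypothetical helper name. The word "air" is used because it exists in both English and Malay, where it means "water".

```python
# Toy lexicons, assumed for illustration only.
LEXICONS = {
    "en": {"i", "want", "to", "drink", "air", "cold"},
    "ms": {"saya", "nak", "minum", "air", "makan", "sejuk"},
}


def candidate_languages(word, lexicons=LEXICONS):
    """Return every language whose lexicon contains the word."""
    token = word.lower()
    return sorted(lang for lang, vocab in lexicons.items() if token in vocab)


print(candidate_languages("minum"))  # found in one lexicon only
print(candidate_languages("air"))    # found in both lexicons, so ambiguous
print(candidate_languages("xyz"))    # found in none
```

A word that appears in a single lexicon can be labelled directly, while a word with several candidates can only be resolved by looking at its neighbours.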
Sentence Level Language Detection
Sentence-level language detection seeks to identify a sentence’s overall language. This is difficult when a sentence contains multiple languages (code-switching and code-mixing). A sentence may contain more than one language, yet a detector that assigns a single label per sentence ignores the possibility that other languages are being used.
Current Opinion
Currently, I prefer a language detection system that can work at both the word level and the sentence level. At the word level, it should list the languages a word could belong to. At the sentence level, it should list the languages detected in the sentence, then identify which language occurs most often or the position where a new language begins.
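One possible shape for such a combined system is sketched below. It is not a finished design: the lexicons are toy assumptions, `analyse_sentence` is a hypothetical name, tokens are split on whitespace only, and ambiguous or unknown words simply inherit the label of the previous unambiguous word.

```python
from collections import Counter

# Toy lexicons, assumed for illustration only.
LEXICONS = {
    "en": {"i", "want", "to", "eat", "so", "good", "air"},
    "ms": {"saya", "nak", "makan", "nasi", "lemak", "sedap", "air"},
}


def word_candidates(token):
    """Word level: list every language whose lexicon contains the token."""
    return sorted(lang for lang, vocab in LEXICONS.items() if token in vocab)


def analyse_sentence(sentence):
    """Sentence level: languages present, dominant language, switch positions."""
    tokens = sentence.lower().split()
    candidates = [word_candidates(token) for token in tokens]

    # Only unambiguous words count toward the dominant language.
    counts = Counter(c[0] for c in candidates if len(c) == 1)
    dominant = counts.most_common(1)[0][0] if counts else None

    # Ambiguous or unknown words inherit the previous unambiguous label.
    labels = []
    previous = None
    for c in candidates:
        if len(c) == 1:
            previous = c[0]
        labels.append(previous)

    # A switch is a token index where the label differs from the one before it.
    switch_positions = [
        i for i in range(1, len(labels))
        if labels[i - 1] is not None and labels[i] != labels[i - 1]
    ]
    return {
        "candidates": candidates,
        "languages": sorted(counts),
        "dominant": dominant,
        "labels": labels,
        "switch_positions": switch_positions,
    }


result = analyse_sentence("Saya nak makan nasi lemak so good")
print(result["languages"], result["dominant"], result["switch_positions"])
```

Inheriting the previous label is the simplest way to use context; a more careful version could also look at the words that follow.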
This topic is the subject of ongoing research. This article will be updated and amended as soon as possible.
Thanks for reading.