An Exploration of Short Text Language Identification

Michael Chen
Attenchen to Detail
4 min read · Feb 16, 2021

Recently I took an interest in identifying the language of a snippet of text. Let’s say I get a text message that says “Bienvenue!” If I have seen this word before, I already have a mental map from the word to its language and meaning. The problem is that people typically understand one language, maybe two or three if they’re gifted. If I were to build a computer to do this the same way, it would need a huge “mental” map of all possible words to their respective languages and meanings. While this is plausible, I decided to apply machine learning to the task instead.

Job to be Done

Let’s start with what the task is. To narrow down the scope of this to a long weekend exploration, I decided to focus on just six languages written in the Latin alphabet: English, French, Dutch, Italian, Portuguese, and Spanish. All text is in Unicode. For value proposition’s sake, let’s say I work for Apple and my task is to quickly identify what language a user is typing in and automatically switch to that keyboard. This seems like a multinomial classification task. The input is a sentence and the output is the language used.

Representation Matters

Next, let’s think about what the representation of a sentence should be in order to best capture its language. The bag of words approach to representing text is always the first thing that comes to mind for NLP tasks. It’s simple, intuitive, and often works really well out of the box. I could use all possible words for all the languages involved and then convert each sentence into a large count dictionary, but that feels like too much. Instead, using just the top 200 to 300 most frequent words from each language seems like a leaner approach. I like to think of this vocabulary size as one data hyperparameter. Let’s not remove stop words, as that could introduce unwanted bias. After all, stop words in one language rarely appear in another, so they are exactly the kind of linguistic trait a model can use. Whether to keep them could be considered another data hyperparameter. Lastly, I can try limiting the length of the sentences to a minimum of 50 characters and a maximum of 200 characters. This is another design choice, made to speed up processing and standardize the data a bit, so it’s up to the ML engineer what they want to pick.
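As a rough sketch of the bag of words idea (not the code from the repository), the snippet below builds a vocabulary from the most frequent words of each language and turns a sentence into a count vector. The toy corpus, the function names, and the tiny `top_k` are all assumptions made for illustration; a real run would use something like the 200 to 300 words per language mentioned above.

```python
import re
from collections import Counter

# Hypothetical toy corpus: language -> sample sentences (illustration only).
CORPUS = {
    "english": ["the cat is on the table", "the dog and the cat are friends"],
    "french": ["le chat est sur la table", "le chien et le chat sont amis"],
}

def tokenize(sentence):
    """Lowercase and keep runs of Unicode letters; stop words are not removed."""
    return re.findall(r"[^\W\d_]+", sentence.lower())

def build_vocabulary(corpus, top_k):
    """Union of the top_k most frequent words of each language, in sorted order."""
    vocab = set()
    for sentences in corpus.values():
        counts = Counter(word for s in sentences for word in tokenize(s))
        vocab.update(word for word, _ in counts.most_common(top_k))
    return sorted(vocab)

def bag_of_words(sentence, vocab):
    """Count vector over vocab; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]

def clip_length(sentence, min_chars=50, max_chars=200):
    """Return None for sentences that are too short, else truncate to max_chars."""
    if len(sentence) < min_chars:
        return None
    return sentence[:max_chars]

vocab = build_vocabulary(CORPUS, top_k=3)
print(vocab)
print(bag_of_words("le chat et the cat", vocab))
```

Here `top_k`, `min_chars`, and `max_chars` are the data hyperparameters from the paragraph above, exposed as plain function arguments so they are easy to sweep.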

A twist on the bag of words approach would be to use a bag of characters approach instead. Same concept, different implementation. Not only do words carry information on what language a sentence is in, so do the characters and short character sequences. I can thus convert each sentence into a bag of character n-grams (n would be yet another data hyperparameter!).
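To make the character n-gram idea concrete, here is a minimal sketch using only the standard library. The function names are hypothetical, and lowercasing the text is an assumption rather than something the post specifies.

```python
from collections import Counter

def char_ngrams(sentence, n):
    """Slide a window of n characters over the lowercased sentence."""
    text = sentence.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_ngrams(sentence, n=2):
    """Count how often each character n-gram occurs in the sentence."""
    return Counter(char_ngrams(sentence, n))

print(char_ngrams("Bienvenue", 3))
print(bag_of_ngrams("Bienvenue", 2).most_common(3))
```

With n = 1 this reduces to a plain bag of characters; larger n captures letter combinations that tend to be typical of one language.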

Finally, another representation technique is quantizing each character into a one-hot vector whose length is the number of characters considered. The sentence is then represented by a sequence of quantized characters (vectors).
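A small sketch of this character quantization follows. The alphabet here is an assumption chosen for illustration (lowercase ASCII letters, a space, and a handful of accented letters), and characters outside it are mapped to an all-zero vector, which is just one possible convention.

```python
import string

# Assumed alphabet for illustration; a real one would cover every character kept.
ALPHABET = string.ascii_lowercase + " àâçéèêëîïñóôùûü"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot(ch):
    """Vector of len(ALPHABET) zeros with a single 1 at the character's index."""
    vec = [0] * len(ALPHABET)
    idx = CHAR_TO_INDEX.get(ch)
    if idx is not None:  # unknown characters stay all-zero
        vec[idx] = 1
    return vec

def encode_sentence(sentence, max_len=200):
    """Sequence of one-hot vectors, one per character, truncated to max_len."""
    return [one_hot(ch) for ch in sentence.lower()[:max_len]]

encoded = encode_sentence("Hi!")
print(len(encoded), len(encoded[0]))
```

Unlike the bag representations, this one keeps the order of the characters, which is what a sequence model needs.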

Modeling the Relationship

Now, let’s think about how to model the relationship between a sentence (as a bag of words) and the language it’s in. I can take a basic probabilistic approach like Naive Bayes to classify sentences: how often have I seen this word in Spanish sentences compared to sentences in all languages? I could also try to separate instances using methods like logistic regression or KNN. With KNN, for example, the question becomes: how similar is this unknown bag of words to the bags of words that I know are Spanish?
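The Naive Bayes question above can be sketched in a few lines. This is a hand-rolled multinomial Naive Bayes with add-one smoothing, written for illustration only. The class name, the whitespace tokenization, and the four training sentences are assumptions, and words never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesLanguageID:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.word_counts = defaultdict(Counter)  # language -> word -> count
        self.doc_counts = Counter(labels)        # language -> number of sentences
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior: how common is this language in the training data?
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in sentence.lower().split():
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                # smoothed log likelihood of the word given the language
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data, for illustration only.
model = NaiveBayesLanguageID().fit(
    ["the cat is here", "where is the dog", "el gato está aquí", "dónde está el perro"],
    ["english", "english", "spanish", "spanish"],
)
print(model.predict("the dog is here"))
print(model.predict("el perro está aquí"))
```

Working in log space avoids multiplying many small probabilities together, which would underflow on longer sentences.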

Separately I can also take the neural network approach. I say separately because explainability is always a challenge with neural networks. They’re not as intuitively explainable as the traditional methods mentioned above. Anyway, I can plug the representation of a sentence into a relatively shallow neural net of say 3 to 4 layers and it will do just fine for this use case. I can even add a dropout layer to spice things up.
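To show the shape of such a network, here is a forward pass for a small fully connected net with dropout, written in plain Python. It is only a sketch: the layer sizes, the initialization, and the dropout rate are assumptions, the weights are random and untrained, and in practice a framework would handle both this and the training loop.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def softmax(x):
    shift = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed, simple initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, rate=0.5, training=False, rng=None):
    """Hidden layers use ReLU plus dropout; the last layer outputs probabilities."""
    rng = rng or random.Random(0)
    for weights, biases in layers[:-1]:
        x = dropout(relu(dense(x, weights, biases)), rate, training, rng)
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))

rng = random.Random(42)
sizes = [300, 64, 32, 6]  # bag-of-words input, two hidden layers, six languages
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
features = [rng.random() for _ in range(sizes[0])]  # stand-in for a real feature vector
probs = forward(features, layers)
print(probs)
```

Dropout is only active when `training=True`; at prediction time every unit is kept, which is why the kept activations are rescaled during training.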

I can even go a step further and use recurrent neural networks to model not just the character distributions but also the sequential nature of the characters in each sentence. To combat the problem of vanishing gradients, I employ LSTM units.

Evaluating the Relationship

Just like in dating, sometimes you want to see if the relationship is the right fit. By the way, happy Valentine’s Day & Lunar New Year & Presidents’ Day! What a weekend.

So how do I want to evaluate my model? Were my design choices about the data the right ones? Since this is a multinomial classification problem, I will take a look at the confusion matrix for the test predictions.

[Figure: confusion matrix. Caption: Looks like this model needs some more tweaking!]
[Figure: confusion matrix. Caption: Looks like this model did a good job!]
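For reference, a confusion matrix like the ones described above can be computed with a few lines of Python. The labels and predictions below are made up purely to show the mechanics, and the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true languages, columns are predicted languages."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

def per_class_recall(matrix, labels):
    """Fraction of each true language that was predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall

# Made-up labels and predictions, for illustration only.
labels = ["en", "es", "pt"]
y_true = ["en", "en", "es", "es", "pt", "pt"]
y_pred = ["en", "en", "es", "pt", "pt", "es"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(per_class_recall(matrix, labels))
```

Off-diagonal cells show which pairs of languages get confused with each other, which a single accuracy number would hide.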

For an in-depth look at metrics for multi-class classification, this can serve as a good guide.

Thanks for reading! If you’re curious, the full code repository for my exploration can be found here.

Fun fact: this kind of classification can also be applied to identifying programming languages.

Have fun hacking and learning~

Michael Chen
Attenchen to Detail

ML@ROBLOX — Trying to make some sense in a hectic world