Multi-Language Documents Identification

Dima Shulga
HiredScore Engineering
Feb 13, 2020

When dealing with global natural language data, the first step of the pipeline is usually language identification. We need to know the language of a document for many reasons: to check whether we support the language at all, to gather statistics on how many documents we have in each language, and to make algorithmic decisions such as which type of model to use or whether to translate the document.

The documents we deal with often contain more than one language. For example, a resume may be written in Hebrew while some position titles or other technical terms are written in English. A tool that can detect this is desirable.

Here we introduce our method for dealing with these types of documents. We call the model seqtolang because, given a sequence of words, it assigns a language to each word. Given a document with more than one language, we want to tell not only which languages are present but also where they appear.

Implementation Details

The model is a sequence-to-sequence model using recurrent layers (LSTM) on top of word vectors. Inspired by this paper, we build each word vector by summing the embeddings of the word's character n-grams:

$$w_i = \sum_{g \in G_i} z_g$$

where $w_i$ is the word vector of the $i$-th word in the document, $G_i$ is the set of n-grams of that word, and $z_g$ is the embedding of n-gram $g$. We add the special characters < and > at the beginning and end of each word to mark where the word starts and ends. For example, with n=3, the word english is broken into:

<en, eng, ngl, gli, lis, ish, sh>
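The n-gram construction can be sketched in a few lines of plain Python. This is a minimal illustration, not the seqtolang code: the function names, the toy embedding size, and the randomly initialized lookup table are all assumptions made for the example (in a real model the n-gram embeddings are learned).

```python
import random


def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in < and > markers."""
    marked = f"<{word}>"
    if len(marked) < n:
        # Word too short to form a full n-gram: keep it as a single unit.
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def word_vector(word, table, dim=4, n=3):
    """Sum the embeddings of the word's n-grams.

    `table` is a hypothetical lookup dict; unseen n-grams get a random
    vector here purely for illustration.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word, n):
        emb = table.setdefault(gram, [random.uniform(-1, 1) for _ in range(dim)])
        vec = [v + e for v, e in zip(vec, emb)]
    return vec


print(char_ngrams("english"))
# ['<en', 'eng', 'ngl', 'gli', 'lis', 'ish', 'sh>']
```

Because words that share n-grams share embedding terms, related word forms end up with related vectors, and words never seen in training still get a vector.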

The word vectors are then passed to a bidirectional LSTM layer to extract contextual information for each word. The LSTM output at each word position is passed to a fully connected layer with a softmax activation for the final per-word classification.
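A rough sketch of this architecture, assuming PyTorch, might look as follows. The class name, layer sizes, and the way n-grams are mapped to integer ids are hypothetical choices for the example and are not taken from the released model.

```python
import torch
import torch.nn as nn


class WordTagger(nn.Module):
    """Summed n-gram embeddings -> bidirectional LSTM -> per-word softmax."""

    def __init__(self, num_ngrams, num_langs, emb_dim=32, hidden=64):
        super().__init__()
        # mode="sum" adds up the n-gram embeddings belonging to each word.
        self.ngram_emb = nn.EmbeddingBag(num_ngrams, emb_dim, mode="sum")
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_langs)

    def forward(self, ngram_ids, offsets):
        # ngram_ids: 1-D tensor with the n-gram ids of every word in one document
        # offsets:   start index of each word's n-grams inside ngram_ids
        words = self.ngram_emb(ngram_ids, offsets)   # (num_words, emb_dim)
        ctx, _ = self.lstm(words.unsqueeze(0))       # (1, num_words, 2 * hidden)
        logits = self.out(ctx.squeeze(0))            # (num_words, num_langs)
        return torch.softmax(logits, dim=-1)


# Toy document with three words made of 3, 2 and 4 n-grams (ids are arbitrary).
model = WordTagger(num_ngrams=1000, num_langs=36)
ngram_ids = torch.tensor([1, 2, 3, 10, 11, 20, 21, 22, 23])
offsets = torch.tensor([0, 3, 5])
probs = model(ngram_ids, offsets)
print(probs.shape)  # torch.Size([3, 36])
```

Each row of `probs` is a probability distribution over languages for one word, so the most likely language per word is `probs.argmax(dim=-1)`.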

To aggregate the language probabilities for the whole document, we compute the mean of the softmax scores across all words in the document.
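As a small illustration of this aggregation step, the sketch below averages per-word softmax rows into one score per language. The function name, the language codes, and the probability values are made up for the example.

```python
def document_language_scores(word_probs, languages):
    """Average per-word softmax rows into one score per language."""
    if not word_probs:
        return {}
    n = len(word_probs)
    return {
        lang: sum(row[j] for row in word_probs) / n
        for j, lang in enumerate(languages)
    }


# Illustrative values: four words, two candidate languages.
languages = ["heb", "eng"]
word_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.6, 0.4],
]
scores = document_language_scores(word_probs, languages)
print(scores)  # roughly 0.6 for "heb" and 0.4 for "eng"
```

Since every row sums to 1, the averaged scores also sum to 1 and can be read as the approximate share of each language in the document.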

The model was trained on 36 languages from the Tatoeba dataset and evaluated on the WiLI-2018 dataset, reaching an accuracy of 98% on the supported languages.

Both the code and the model are available on GitHub.

Interested in this type of work? We’re always looking for talented people to join our team!
