Analysis of Translations Does Not Mean Support for the Language!

I’ve been wanting to write this post for a while now. Motivated by Meta Brown’s post On Translation and Text Analytics, I’ve decided to write down my own take on the same subject.

While most people recognize that building a good text-driven engine, be it free-text search or sentiment analysis, is challenging, few appreciate just how complex it really is. Traditional software engineers often assume that the software is the center of attention: once the software is built, pumping in a new set of data for another language is trivial. This could not be further from the truth.

Challenges of Multi-language Algorithms

There is no doubt among Natural Language Processing (NLP) scientists and computational linguists that English is the most studied language in the field. The main driving force behind this is the availability of digital, labelled data. The problem arises when we need to extend a proven algorithm to support another language, whether for academic or commercial reasons. Various operations baked into the algorithm are simply not applicable to non-English languages. For example, in most European languages, terms change gender depending on the specific context. These grammar rules need to be factored in to achieve the same quality of result as the original algorithm.
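To make this concrete, here is a quick sketch using NLTK’s Snowball stemmers (my own illustration; no particular tool is implied above). Feeding French words through English stemming rules quietly produces the wrong stems, while the French stemmer encodes French-specific morphology:

```python
# A toy illustration (assumes nltk is installed): the "stemming" step baked
# into an English pipeline breaks quietly on French input.
from nltk.stem.snowball import SnowballStemmer

english_stemmer = SnowballStemmer("english")
french_stemmer = SnowballStemmer("french")

words = ["chanteuses", "heureusement", "nationales"]  # inflected French forms

for word in words:
    print(word,
          "| English rules:", english_stemmer.stem(word),
          "| French rules:", french_stemmer.stem(word))

# The English rules apply the wrong suffix tables, so inflected forms of the
# same French lemma never collapse together the way they would in English.
```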

When statistical solutions were introduced, the need for language-specific algorithms seemed to be mitigated. By treating terms as semantically-free event occurrences, the algorithm can be data-driven and remain language-agnostic. In reality, however, the multi-language problem did not go away at all; it was simply pushed out of the algorithmic layer and into the data domain.
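Here is roughly what that looks like in practice. A bag-of-words vectorizer such as scikit-learn’s CountVectorizer (my example, purely illustrative) counts tokens without caring what language they belong to, yet the tokenization it depends on assumes whitespace-delimited words and falls apart on unsegmented languages like Japanese:

```python
# A quick sketch (assumes scikit-learn is installed). The counting step is
# language-agnostic, but the tokenization it relies on is not.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

english_docs = ["the cat sat on the mat", "the dog sat on the rug"]
japanese_docs = ["猫がマットの上に座った", "犬がラグの上に座った"]

print(vectorizer.fit(english_docs).get_feature_names_out())
# -> individual word tokens, as expected

print(vectorizer.fit(japanese_docs).get_feature_names_out())
# -> each whole sentence comes back as a single "token", because the default
#    tokenizer expects whitespace between words; Japanese needs its own
#    segmenter before any "language-agnostic" counting can happen.
```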

Data, Data, Data

At the heart of every statistics-driven engine, which includes most big data and NLP systems today, is data. If the data is of poor quality, the results cannot be trusted. This is the same reason why, as data scientists, we spend more time observing and cleaning the data than implementing the actual analysis engine.

Many engineers tend to forget that fact when they are tasked with supporting new languages in their text-driven systems. “We already perfected English. How hard can it be for French/German/Japanese/etc.?” Perhaps it’s because the software industry has been predominantly English-speaking that this basic lesson is easy to forget. Most marketers have learned it many times over: you cannot take automated translation as gospel; in fact, you often can’t trust manual translation either.

The way most automated translation systems work is by observing co-occurrences of terms in the pre- and post-translation texts. Given a large enough corpus (body of documents or articles), statistically accurate translations emerge. These translations are probabilistic rather than deterministic, however, which means there will always be a margin of error greater than zero.
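A toy version of that co-occurrence idea, using a made-up three-sentence parallel corpus, looks something like this (not a real translation model, just the shape of the estimation):

```python
# A toy illustration of the co-occurrence idea (not a production MT model):
# count how often each source term appears alongside each target term in
# aligned sentence pairs, then normalise the counts into rough probabilities.
from collections import Counter, defaultdict

parallel_corpus = [  # (English, French) sentence pairs -- hypothetical data
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

cooccurrence = defaultdict(Counter)
for english, french in parallel_corpus:
    for source_term in english.split():
        for target_term in french.split():
            cooccurrence[source_term][target_term] += 1

for source_term, targets in cooccurrence.items():
    total = sum(targets.values())
    best, count = targets.most_common(1)[0]
    print(f"{source_term!r} -> {best!r} (p ~ {count / total:.2f})")

# Even on clean data the estimates are probabilities, never certainties:
# "the" co-occurs with both "la" and "maison", so some probability mass
# always lands on the wrong translation.
```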

Therefore, when translated text is used as training data for an NLP algorithm, the translation’s margin of error is baked into the engine. Even worse is using translated text as the query input for an English-trained system. Because the English engine does not account for the probability of mistranslated terms, it trusts every input term as the truth, so each mistranslated term in the query compounds the error margin of the final result.
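A back-of-the-envelope calculation shows how quickly that compounds. Assuming, purely for illustration, a 5% chance of mistranslating any single query term:

```python
# A back-of-the-envelope illustration (hypothetical numbers): if each query
# term is mistranslated independently with probability p, the chance that the
# whole query reaches the engine intact shrinks quickly with query length.
per_term_error = 0.05  # assumed 5% mistranslation rate per term

for query_length in (1, 3, 5, 10):
    p_query_intact = (1 - per_term_error) ** query_length
    print(f"{query_length:2d} terms: P(all terms correct) = {p_query_intact:.2f}")

# At 10 terms, P(all terms correct) is roughly 0.60 -- the engine, which
# assumes every term is correct, is reasoning from flawed input about 40%
# of the time.
```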

So What’s the Solution?

Like most commercial and domain-specific machine learning engines, text analysis systems should evolve toward a hybrid technology. Inspired by Bayesian statistics, such a system would use language-specific knowledge as its basis (the prior, in Bayesian terms) and train itself on data in the target language. Yes, meeting both criteria is time-consuming. Ask your boss whether your brand’s reputation is worth the extra time and money to protect.
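What might such a hybrid look like? Here is one very rough sketch (my own, with hypothetical names and numbers): a hand-curated, language-specific sentiment lexicon acts as the prior, and labelled data in the target language updates the per-term scores.

```python
# A very rough sketch of a "hybrid" sentiment scorer (hypothetical names and
# numbers): a hand-built, language-specific lexicon acts as the prior, and
# labelled data in the target language updates the per-term weights.
from collections import defaultdict

# Language-specific knowledge: a tiny hand-curated French sentiment lexicon.
lexicon_prior = {"excellent": 1.0, "horrible": -1.0, "décevante": -0.8}

# In-language training data (label: +1 positive, -1 negative) -- hypothetical.
training_data = [
    ("service excellent mais livraison décevante", -1),
    ("produit excellent", 1),
]

prior_weight = 2.0                       # how strongly we trust the lexicon
counts = defaultdict(lambda: [0.0, 0])   # term -> [sum of labels, occurrences]

for text, label in training_data:
    for term in text.split():
        counts[term][0] += label
        counts[term][1] += 1

def term_score(term):
    """Blend the lexicon prior with the data-driven estimate for one term."""
    prior = lexicon_prior.get(term, 0.0)
    label_sum, n = counts.get(term, [0.0, 0])
    return (prior_weight * prior + label_sum) / (prior_weight + n)

print(term_score("excellent"))   # lexicon keeps it positive despite mixed labels
print(term_score("livraison"))   # no lexicon entry -> driven by the data alone
```

The details are beside the point; what matters is the shape: language-specific knowledge supplies the starting assumptions, and in-language data refines them, rather than relying on either one alone.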


Originally published at www.whosbacon.com.