How to detect non-English language words and remove them from your keyword insights

Learn how to improve business insights by addressing one of the most common problems faced in digital advertising.

Sharan Biradar
MiQ Tech and Analytics
6 min read · May 23, 2019


Great insights lead to business value

These days, organizations have so many opportunities to gather insights for their business, whether it be from research, surveys, customer feedback, sales, advertising data, or beyond. It’s up to the person who creates these insights to make sure that they follow the two ‘I’s: easy to interpret and intuitive to understand.

Interpretability depends largely on the communication skills of the insights creator, while intuition comes from business experience. As a novice in the analytics field almost four years ago, I used to think I was an expert at creating insights. It was only after I watched the head of my team present his insights that I understood there was more to learn. What made his insights not just better, but distinctive? They were simple, easy to understand, and structured in a way that helped the reader make better decisions. More importantly, they were easy to interpret and intuitive to understand.

One common insight used to improve business value in digital advertising is based on keywords, and how they can be targeted online to reach more potential customers. While business is usually done in English in countries like the US, UK, Australia, and Canada, people in these countries still browse websites in their native languages, which leads to non-English words appearing in keyword-based insights. From a targeting point of view, this is fine, but from an interpretation and understanding point of view, the presence of non-English words can create confusion and distrust in the insights for businesses that work in predominantly English-speaking, writing, and searching countries. That's why it's important to detect non-English words and either remove them or, even better, translate them.

In this article, we are going to learn how to automatically detect non-English words using Python and come up with an algorithm to remove them.

Deriving better insight from keywords

Here's how we do it. The first thing to explore is the script each language is written in. Every language falls into one of two categories: Latin-script languages, like English, French, and German, and non-Latin-script languages, like Chinese, Hindi, and Japanese.

  • English — “This is an example sentence”
  • Chinese — “这是一个例句”
  • Hindi — “यह एक उदाहरण वाक्य है”
  • Japanese — “これは例文です”
  • French — “Ceci est un exemple de phrase”
  • German — “Dies ist ein Beispielsatz”

The first step in tackling the problem is to separate Latin-script words from non-Latin-script words. A simple regex is enough to filter out words written in non-Latin scripts.
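For illustration, here is a minimal Python sketch of such a filter. The exact Unicode ranges are an assumption; widen or narrow them to suit your keywords:

    import re

    # Words made up entirely of Basic Latin letters plus the Latin-1 and
    # Latin Extended ranges count as Latin script; everything else
    # (Chinese, Hindi, Japanese, ...) is treated as non-Latin.
    LATIN_WORD = re.compile(r"^[A-Za-z\u00C0-\u024F]+$")

    def is_latin_script(word):
        return bool(LATIN_WORD.match(word))

    print(is_latin_script("Beispielsatz"))  # True  (German, Latin script)
    print(is_latin_script("这是一个例句"))  # False (Chinese)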

Phase two is more difficult because we need to distinguish words that exist in English from words that exist in, for instance, Italian. One way to do that is with dictionaries: if a word is present in an English dictionary, it is an English word. However, for languages with complex word structures or many morphological variants, a dictionary alone isn't enough; it would need to contain every possible word and variant of the language.
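A toy example of why a plain word list falls short (the mini-dictionary here is hypothetical):

    # A bare word list only knows the exact strings it contains, so
    # inflected forms of perfectly good English words fall through.
    english_words = {"run", "sentence", "example"}

    print("run" in english_words)      # True
    print("running" in english_words)  # False, even though it's English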

What we need is something that handles complex and compound words automatically and knows the base form of each word. Spell checkers do exactly this, and they are present almost everywhere: Word, LibreOffice, browsers. We can use them to detect non-English words. In this article we will use one such spell checker, Hunspell, because it is open source and widely used by other software such as LibreOffice, OpenOffice, Mozilla Firefox, Chrome, and macOS.

So, let’s begin.

Let's install the prerequisites and packages first. I am doing this on an AWS CentOS machine. If you are on Ubuntu, use apt-get instead. If you are running on a local Linux machine, there is no need to run the symbolic-link step.

Shell script for installing the Hunspell package and language dictionaries
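The original gist is not embedded here; below is a minimal sketch of the steps it describes, assuming CentOS package names, the pyhunspell Python bindings, and a library path that may differ on your image:

    # Install the Hunspell engine, its headers, and language dictionaries
    # (package names may vary by distribution and repository).
    sudo yum install -y hunspell hunspell-devel hunspell-en
    sudo yum install -y hunspell-fr hunspell-de

    # The Python bindings link against libhunspell; on some AWS CentOS
    # images the versioned library needs an unversioned symlink first.
    sudo ln -s /usr/lib64/libhunspell-1.3.so.0 /usr/lib64/libhunspell.so

    # Python bindings (pyhunspell)
    pip install hunspell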

Next, here is the Python script for detecting and flagging non-English words:

Python script for word detection and flagging
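The gist itself is not embedded here; below is a minimal, self-contained sketch of the same idea, assuming pyhunspell and the standard CentOS dictionary paths. The file and column names are illustrative, not from the original:

    import csv
    import re

    import hunspell

    # Latin-script filter from earlier.
    LATIN_WORD = re.compile(r"^[A-Za-z\u00C0-\u024F]+$")

    # Path assumption: CentOS installs dictionaries under /usr/share/myspell;
    # other distributions often use /usr/share/hunspell instead.
    checker = hunspell.HunSpell("/usr/share/myspell/en_US.dic",
                                "/usr/share/myspell/en_US.aff")

    def flag_word(word):
        """Return True when the word is NOT recognised as English."""
        if not LATIN_WORD.match(word):
            return True                 # non-Latin script, flag immediately
        return not checker.spell(word)  # Hunspell resolves affixes/compounds

    with open("keywords.csv", newline="", encoding="utf-8") as fin, \
         open("keywords_flagged.csv", "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames + ["Flag"])
        writer.writeheader()
        for row in reader:
            row["Flag"] = flag_word(row["Keyword"])
            writer.writerow(row)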

So your sample input file looks like this:

Input Data
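The original screenshot is not reproduced here; an illustrative keywords.csv with a single Keyword column might look like this:

    Keyword
    marketing
    insights
    Beispielsatz
    这是一个例句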

And the output from the code would look like this:

Output Data
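Again illustrative, matching the input above; Flag is True for words not recognised as English:

    Keyword,Flag
    marketing,False
    insights,False
    Beispielsatz,True
    这是一个例句,True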

Now, you can use the ‘Flag’ column to safely filter for your intended language words.
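If you load the flagged file with pandas, for instance, the filter is a one-liner (column names as in the sketch above):

    import pandas as pd

    df = pd.read_csv("keywords_flagged.csv")
    english_only = df[~df["Flag"]]  # drop rows flagged as non-English
    english_only.to_csv("keywords_english.csv", index=False)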

Knowing the pros and cons

So we have learnt that insights need the two 'I's to be consumed: they should be easy to interpret as well as intuitive to understand. To be easy to interpret, they should be written in simple language. To be intuitive to understand, they need to make business sense. We looked at one industry example, digital marketing, and dove deeper into one problem area of its insights: keyword insights, which are notorious for showing non-English language words. We explored the common patterns present in most languages and used them to come up with a practical solution: a combination of regular expressions and spell checkers to identify non-English words and either remove them or flag them for translation. But it is also important to be mindful of the pros and cons of this solution:

Pros:

  • Does a splendid job of detecting non-Latin-script words, and a very good job of detecting Latin-script words
  • The solution can be incorporated for other use-cases where non-English language detection is required
  • No need to worry about converting words to their base form for detection; the solution takes care of it automatically
  • The solution also ensures that proper nouns, like names of people and places, are left mostly untouched
  • Can easily be integrated with Hive, Spark or others for automated data jobs
  • It uses an open-source spell checker, so the community actively contributes to the language dictionaries
  • Can be used to detect almost all prominent languages
  • An ASCII-encoding filter only helps with non-Latin-script words, because encoding replaces their characters with special characters; Latin-script non-English words pass through unchanged since their letters have ASCII codes. This solution is therefore better than an ASCII-based filter for non-English detection
  • White-listed words can be added to the dictionaries to ensure they are left untouched the next time

Cons:

  • Highly dependent on the presence of language dictionaries in a spell checker
  • You need to be mindful of word casing; for example, a lowercased proper noun may not match its dictionary entry
  • Incorrect installation of a language pack can sometimes go unnoticed. If words of a specific language are still being flagged as True, the corresponding language pack was probably installed incorrectly and needs to be reinstalled
  • You must remember to save your input data in UTF-8 or Unicode format to preserve the characters present
  • Informal, slang words are hard to detect unless those words are added to the spell checker
  • If you want to evaluate sentences, they first need to be broken down into unigrams
  • Not useful on its own for filtering specialized vocabulary, such as words from research or literary text

Steps to try to improve the above solution:

  • Modify the dictionary, or create your own with proper morphology lookup, to detect words used in a specific context, e.g. medical journals, scientific literature, and so on
  • Combine the above solution with word lists from Google Books, Wikipedia, and news articles to increase the diversity of detection
  • Use an ensemble of spell checkers to improve detection

What other ways can you think of to improve this? Please share them in the comments below.
