Thoughts on NLP 🤔

Tow · Published in AMPOS Developers · 6 min read · Mar 10, 2018


People have long tried to communicate with machines in a more natural way, because if computers could understand human languages, our lives would be more convenient.

The field that pursues this goal is Natural Language Processing (NLP). It is one of the most active fields in computer science and relates to other fields such as Data Science, A.I., and Linguistics. The most popular NLP approach is to build a language-understanding model with the help of Machine Learning techniques.

There are many challenges obstructing the development of NLP. The most important problem that makes NLP hard is that language is ambiguous at every level, and each language also has its own syntax and structure.

For example,

1. Segmentation “Unionized”

Union + ized or Un + ionized
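
To make the ambiguity concrete, here is a tiny Python sketch (with a made-up mini-lexicon) that enumerates every way to split a word into known pieces:

```python
# Enumerate every way to split a string into pieces from a known
# vocabulary. The mini-lexicon here is made up for the illustration.
LEXICON = {"un", "union", "ionized", "ized"}

def segmentations(s, prefix=()):
    """Yield every split of s whose pieces are all in LEXICON."""
    if not s:
        yield prefix
        return
    for i in range(1, len(s) + 1):
        if s[:i] in LEXICON:
            yield from segmentations(s[i:], prefix + (s[:i],))

for seg in segmentations("unionized"):
    print(" + ".join(seg))
# un + ionized
# union + ized
```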

2. Homophone

Same sound
Different spelling
Different meaning

I scream / Ice cream
To / Two / Too
Bear / Bare
See / Sea
Right / Write

3. Homograph

Different sound
Same spelling
Different meaning

Minute
- Time (minit)
- Small (mīˈn(y)o͞ot)

4. Homonym

Same sound
Same spelling
Different meaning

Close
- The supermarket that is close to my house will close at 5 pm.
Ring
- The phone near the ring was ringing for half an hour.
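
If we want a machine to pick the right sense, one classic baseline is the Lesk algorithm; NLTK ships an implementation. A minimal sketch, assuming NLTK is installed with its WordNet data downloaded (the chosen senses are only approximate):

```python
# Word-sense disambiguation with NLTK's implementation of the Lesk
# algorithm. Requires: pip install nltk, then nltk.download("wordnet").
from nltk.wsd import lesk

near_ctx = "The supermarket that is close to my house".split()
shut_ctx = "The supermarket will close at 5 pm".split()

# lesk() picks the WordNet synset whose dictionary gloss overlaps the
# context words the most; it is a rough baseline, so the chosen senses
# may not always match human intuition.
print(lesk(near_ctx, "close"))
print(lesk(shut_ctx, "close"))
```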

5. Number

12 (twelve or one-two)
Time 10.30
- Half past ten
- Ten thirty
Year
- 1999: Nineteen ninety-nine
- 2000: Two thousand
Some languages don’t have zero
Some languages don’t use base-10
- Nimbia, base-12
- French, mix of base-10 and base-20
- Traditional Welsh, base-20 with a pivot at 15
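
A quick way to see these competing readings in code is the third-party num2words library (this sketch assumes it is installed; outputs in comments are indicative):

```python
# Competing readings of the same digits, via the third-party
# num2words library (pip install num2words).
from num2words import num2words

print(num2words(1999))             # one thousand, nine hundred and ninety-nine
print(num2words(1999, to="year"))  # nineteen ninety-nine
print(num2words(2000, to="year"))  # two thousand

# French mixes base-10 and base-20:
print(num2words(80, lang="fr"))    # quatre-vingts ("four twenties")
print(num2words(99, lang="fr"))    # quatre-vingt-dix-neuf
```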

6. Inflection

Make | Makes | Made | Making
In some languages of Sri Lanka, a single word can have 216 inflected forms
Thai has no word inflection at all
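
For a language like English, mapping inflected forms back to a common lemma is a standard preprocessing step. A minimal sketch using NLTK's WordNet lemmatizer (assuming NLTK is installed and its WordNet data downloaded):

```python
# Collapse English inflected forms to one lemma with NLTK's WordNet
# lemmatizer. Requires: pip install nltk, then nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for form in ["make", "makes", "made", "making"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
# All four verb forms reduce to the single lemma "make".
```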

7. Tone

Kai kai kai kai (Khır k̄hāy k̄hị̀ kị̀)
= ใครขายไข่ไก่
In Thai, this sentence means "Who sells the chicken eggs?", four nearly identical syllables distinguished only by tone.
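
A toy mapping makes the problem visible: once tone marks are dropped in romanization, several distinct Thai words collapse into nearly the same string (meanings come from the sentence above; the tone labels are approximate):

```python
# Toy illustration, not a real transliteration library: without tone,
# distinct Thai words become nearly indistinguishable in romanization.
WORDS = {
    "ใคร": ("khrai", "mid tone", "who"),
    "ขาย": ("khai", "rising tone", "to sell"),
    "ไข่": ("khai", "low tone", "egg"),
    "ไก่": ("kai", "low tone", "chicken"),
}
for thai, (sound, tone, meaning) in WORDS.items():
    print(f"{thai}: {sound} ({tone}) -> {meaning}")
```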

8. Order

What is your name?
In Thai: "ชื่อของคุณคืออะไร" = Your name is what?

9. RTL / LTR

English is read from left to right
Arabic is read from right to left
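
Text direction is at least easy to detect programmatically, because Unicode assigns every character a bidirectional category. A small sketch using Python's standard unicodedata module:

```python
# Spot right-to-left text using Unicode bidirectional categories,
# with nothing beyond the standard library.
import unicodedata

def contains_rtl(text):
    """True if any character belongs to an RTL category
    ("R" = Hebrew etc., "AL" = Arabic letter)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

print(contains_rtl("Hello"))   # False
print(contains_rtl("مرحبا"))   # True
```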

10. Grammar

"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo." is grammatically correct sentence in American English.

Teaching machines to interact the way humans do requires a wide range of data. Good data can only be created by skillful humans, linguists in most cases. However, some dialects, tribal languages, or languages spoken by small groups, such as the whistled language Sylbo, might not have any linguists, or their linguists might not have enough expertise. For this reason, NLP development in some languages is lagging behind.

Another point to consider: because languages keep evolving, linguists play a critical role in developing NLP. If linguists lack knowledge of a language's history, they could make mistakes. Moreover, some languages have no linguists at all. For those languages we can't rely on linguists to develop NLP, so should we find other eligible native speakers to work in place of linguists, or is it possible to train native speakers to help generate and check data?

Advantages:
More data can be generated in a shorter time. For translation, the words chosen by non-linguist translators can be easier to understand (more commonly used) and less formal.

Disadvantages:
More people working on a task means more management overhead. The data might be inconsistent, since non-linguist translators might speak different dialects or come from different backgrounds. Drawing conclusions from inconsistent data is harder without linguists, so data quality can be expected to be lower than when only linguists are involved.

In a field like Machine Translation, when a language pair has little or no data, translation between that pair can suffer from context loss.

To translate a sentence into another language without a direct source-to-destination model, one common workaround is to first translate it into English and then translate the English into the destination language. This approach can lose context, because some concepts do not exist in English. Accordingly, the translated result might be inaccurate or ambiguous.
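
Here is a minimal sketch of that pivoting idea, using a hypothetical translate(text, src, dst) placeholder rather than any real API:

```python
# Pivot-translation sketch. translate() is a hypothetical placeholder
# for any MT backend, not a real API.
def translate(text, src, dst):
    raise NotImplementedError("plug in a real MT system here")

def pivot_translate(text, src, dst, pivot="en"):
    """Translate src -> pivot -> dst when no direct src->dst model exists.

    Anything the pivot language cannot express (politeness registers,
    grammatical gender, evidential markers, ...) is lost in the first
    step and cannot be recovered in the second.
    """
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, dst)

# e.g. pivot_translate(sentence, src="jv", dst="tr")  # Javanese -> Turkish via English
```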

Using English as the intermediate language for translation between two other languages is popular only because English is an international language with the largest number of speakers. In fact, this can pose a problem, since English might not be the best language to capture the contexts and concepts of other languages. The best way to prevent context loss is to find a person who is native in both languages, for example a native speaker of both Javanese and Turkish to translate between Javanese and Turkish. Nevertheless, finding a speaker who is native in both languages is extremely hard. It also means that Javanese-and-Turkish translation data is rare, so creating a good model for this language pair is almost impossible.

When we take a holistic view of languages, not every problem occurs everywhere. As for syntax and structure, even though some languages share no features, like Xhosa and Japanese, others share many, like Thai and Lao. Grouping similar languages together could benefit the learning process. For example, grouping Japanese and Korean could result in better translation quality than grouping Japanese and Portuguese.

Collaboration between groups of similar languages

helps: the resulting models could benefit from the similarity between the languages. They could be better than models created from a group of not-so-similar languages, and better than models created for single languages.

Besides this, linguists who are experts in different languages will have an opportunity to learn from each other and from other languages.

In more technical terms, languages with similar structures can share data, and a multilingual model can be created using that shared data. With this approach, less data is required per language.

In addition, one model can be shared by many languages. As a consequence, programmers will have less work to do generating models.

However, grouping languages is not trivial. If languages are grouped wrongly, the grouping might not help the processes and models at all. Even worse, it could slow down the whole pipeline. So there is a risk to bear.

Google Translate

is probably the most powerful and most used translation tool at the moment.

But if we look back ten years, we could tell immediately, just by reading it, whether an article had been translated by Google Translate.

Nowadays, Google has improved Google Translate with an artificial neural network that learns by itself, called Google Neural Machine Translation (GNMT). This system increases the quality of translation, making it faster, more accurate, and more fluent.

To overcome issues such as context loss, Google created an approach called Google's Multilingual Neural Machine Translation, an extension of GNMT. It is interesting because it blends abundant data from multiple language pairs into a single model.

The outcome is one model capable of translating between many language pairs, rather than one language pair per model. This enables zero-shot translation: translating directly from source to destination even for pairs the model never saw during training.
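
The mechanism behind this is surprisingly simple: an artificial token prepended to the source sentence tells the single shared model which language to produce. A sketch of that data preparation (the <2xx> token format follows the paper's convention; the sentences are illustrative, not real training data):

```python
# Multilingual data preparation: prepend an artificial token that
# tells one shared model which language to output.
def add_target_token(src_sentence, target_lang):
    return f"<2{target_lang}> {src_sentence}"

training_pairs = [
    (add_target_token("How are you?", "ja"), "お元気ですか"),
    (add_target_token("お元気ですか", "en"), "How are you?"),
    (add_target_token("How are you?", "ko"), "잘 지내세요?"),
]

# At inference time, "<2ko> お元気ですか" asks for Japanese -> Korean,
# a pair the model may never have seen directly: zero-shot translation.
```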

Since the model works, it shows that blending data from languages that share commonalities can help us overcome such a critical issue.
