Natural language processing (NLP) — neural machine translation and other use cases

SeniorQuant · Jan 13, 2022

Natural language processing (NLP) has become one of the most important areas of data science and machine learning.

NLP is concerned with the interaction between machines and human language, with a particular focus on:

  • understanding
  • generation
  • analysis of human languages

NLP has a long history; early NLP systems mostly relied on sets of hand-crafted rules.

An example of such a rule-based approach is word stemming.
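
As an illustration, the classic Porter stemmer is essentially a fixed cascade of suffix-stripping rules. A minimal sketch using NLTK's implementation (the choice of library is ours, for illustration):

```python
# Porter stemming: a rule-based reduction of words to their stems.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "studies", "connection"]:
    print(word, "->", stemmer.stem(word))
# running -> run, studies -> studi, connection -> connect
```

Note that the output is a stem, not necessarily a dictionary word (“studi”), which is exactly what a purely rule-based method produces.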

However, since the 1990s, rule-based methods have gradually been replaced by statistical approaches utilizing machine learning models.

Neural Machine Translation

Today, NLP is behind a large number of use cases in everyday life. One of them is neural machine translation (NMT), whose output quality has improved dramatically.

An important breakthrough in the NMT area was the paper from Google researchers, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (https://arxiv.org/abs/1609.08144), which achieved significant improvements on translation benchmarks.

Nowadays, tools like Google Translate, DeepL Pro and others provide high-quality translations.

If one needs to translate large amounts of text (e.g. for generating a training data set), two projects we have used in the past are Opus-MT and Facebook’s M2M models, both discussed below.

Another useful NMT library is https://github.com/UKPLab/EasyNMT. One feature that is particularly helpful when translating long texts is its ability to chunk text into smaller units.

With Opus-MT, the input length is limited to 512 tokens, whereas for M2M models (from Facebook research) it is limited to 1024 tokens. This is a consequence of the fixed maximum sequence length of the underlying transformer models.

EasyNMT automatically splits longer documents into sentences, so that they too can be translated.
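
For illustration, here is a minimal sketch of translating a document with EasyNMT; the model choice and the example text are our own assumptions:

```python
# Minimal sketch of document translation with EasyNMT (pip install easynmt).
from easynmt import EasyNMT

model = EasyNMT("opus-mt")  # downloads the Helsinki-NLP Opus-MT model on first use

document = "NLP has become one of the most important fields in data science. It powers translation, summarization and many other use cases."
# translate() performs sentence splitting internally, so documents longer
# than the model's input limit are handled automatically
print(model.translate(document, target_lang="de"))
```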

Topic Modelling

Another use case for natural language processing is topic modelling: discovering latent topics in a corpus of documents.

We have used Latent Dirichlet Allocation (LDA) for these types of projects in the past; it is implemented in scikit-learn:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
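
A toy sketch of topic discovery with scikit-learn’s LDA; the corpus and the number of topics are illustrative assumptions:

```python
# Fit LDA on a tiny corpus and print the top terms per discovered topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks and bonds traded higher on the exchange",
    "the goalkeeper saved a penalty in the final",
    "investors bought shares after the earnings report",
    "the team won the championship match",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # LDA works on token counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {top_terms}")
```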

Another great library for topic modelling is gensim: https://github.com/RaRe-Technologies/gensim.

Text summarization

Text summarization is an NLP technique that produces a summary of a text while retaining its main informational value. It is useful when dealing with a large corpus and looking for a quick way to get the key information of each document, without having to read the documents in their entirety.
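
One way to try this out is the Hugging Face transformers summarization pipeline; this is our own illustrative choice (the original names no library), and the default model is left to the library:

```python
# Abstractive summarization sketch via the transformers pipeline.
from transformers import pipeline

summarizer = pipeline("summarization")
article = (
    "Neural machine translation has improved dramatically since 2016, "
    "when Google published its GNMT paper. Modern tools such as Google "
    "Translate and DeepL now provide high quality translations."
)
result = summarizer(article, max_length=30, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```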

Sentiment analysis has found many use cases, from opinion mining to helping companies extract insights from automatically sentiment-classified texts that users write about their products and services. Sentiment analysis can also form the basis of derived products; e.g. a crypto fear and greed index can be derived from the sentiment of individual tweets. In our experience, ensembling different machine learning models can improve the F1 score and accuracy of a sentiment analysis model.
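
A minimal sketch of what such an ensemble could look like, using scikit-learn’s soft-voting classifier over a toy data set (the texts, labels and model choices are illustrative assumptions):

```python
# Ensemble two sentiment classifiers by averaging their predicted probabilities.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works perfectly",
    "terrible support, very disappointed",
    "love the new features",
    "the app keeps crashing, awful",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression()), ("nb", MultinomialNB())],
        voting="soft",  # average predicted class probabilities
    ),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["fantastic update, very stable"]))
```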

The next NLP method is named entity recognition (NER), a form of information extraction that tries to find and classify named entities in unstructured texts. Named entities include persons’ names, organizations, places, expressions of time, and others.
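
A short NER sketch using spaCy, one popular option (our illustrative choice; the small English model must be downloaded first with `python -m spacy download en_core_web_sm`):

```python
# Find and classify named entities in a sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google was founded by Larry Page and Sergey Brin in 1998 in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Google ORG, Larry Page PERSON, 1998 DATE, California GPE
```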

Product categorization is a subfield of NLP and machine learning whose main task is assigning one or more taxonomy paths to products. The taxonomy system, or list of categories, can be a standard one from IAB, Google or Facebook, or it can be custom developed. You can learn more about taxonomy systems here.
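
One simple way to frame this is as supervised text classification over product titles; in this sketch the titles and the tiny taxonomy are illustrative assumptions:

```python
# Product categorization as text classification: map titles to taxonomy paths.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "mens running shoes size 10",
    "stainless steel frying pan 28cm",
    "wireless gaming mouse rgb",
    "leather ankle boots women",
]
paths = [
    "Apparel > Footwear",
    "Home > Kitchen",
    "Electronics > Accessories",
    "Apparel > Footwear",
]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(titles, paths)
print(clf.predict(["trail hiking shoes"]))  # expected: Apparel > Footwear
```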

Automated question answering is a subfield of NLP concerned with developing models that can automatically answer questions asked by humans.
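
For illustration, extractive question answering can be tried with the transformers question-answering pipeline (our own library choice, not named in the original):

```python
# Extract an answer span from a context passage.
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="Who published the GNMT paper?",
    context="Google researchers published the GNMT paper in 2016, achieving "
            "significant improvements on translation benchmarks.",
)
print(result["answer"])  # e.g. "Google researchers"
```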

Speech recognition is a multi-disciplinary field of technology that allows machines to transform speech into text.
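
A sketch of speech-to-text using the transformers ASR pipeline; the library choice and the audio file are illustrative assumptions:

```python
# Transcribe an audio file with the automatic-speech-recognition pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")
result = asr("speech_sample.wav")  # placeholder path; ffmpeg decodes the audio
print(result["text"])
```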

Offline URL database: NLP analysis can be used for website categorization of large numbers of domains, usually in the tens of millions, stored in the form of a paid or free URL categorization database.

OCR (optical character recognition): OCR has become an important part of many enterprise initiatives and is one of the first steps for companies that want to bring their printed documents to the cloud. OCR is usually employed via an online OCR API service, but large companies have on-premise implementations as well.
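
A minimal OCR sketch with pytesseract, one common open-source option (the image path is a placeholder, and the Tesseract binary must be installed separately):

```python
# Extract text from a scanned image with Tesseract OCR.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_invoice.png"))
print(text)
```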

Analysing technology usage: by collecting the usage of technologies and their features across millions of websites, one can find insights about any technology.

Let us consider websites using jQuery:

The top 10 technologies most often used together with jQuery are:

Next, let us look at the IAB taxonomy Tier 1 categories of websites using jQuery:

The distribution is pretty even for a mature technology like jQuery.

We can do a more in-depth analysis by looking at Tier 2, with 441 categories:

At this level, jQuery is most used on eCommerce websites, on a relative basis, when compared to the general population of websites.

How about the distribution with respect to the popularity of domains:

Again, relative usage is close to 1, except for websites among the top 100k domains, where it is slightly lower.

Next, let us consider domain age as a factor:

It seems jQuery is used a bit less on very young domains, with an age below 3 years.

Natural language processing is generally a difficult problem in artificial intelligence and data science. The main reason is the many ambiguities of human languages, including lexical ambiguity (the ambiguity of single words), syntactic ambiguity, attachment ambiguity, and others. For example, in “I saw the man with the telescope”, the phrase “with the telescope” can attach either to the seeing or to the man.

Text pre-processing

An important part of natural language processing tasks is pre-processing of texts.

Machine learning algorithms require their inputs to be in a numerical format.

When our data items are raw texts, the first step is thus to transform them into some kind of numerical form.

To prepare texts for this, we usually apply several steps, where the specific approaches depend somewhat on the type of machine learning problem we are dealing with.

Generally, the most common pre-processing methods applied to texts are (a minimal sketch follows the list):

  • removal of punctuations,
  • removal of special characters,
  • removal of stop words,
  • lowercasing,
  • removal of frequent words,
  • removal of rare words,
  • removing numbers,
  • removal of emojis/emoticons or their conversion to words (if present, e.g. in tweets),
  • removal of URLs,
  • expanding contractions (e.g. “aren’t” to “are not”),
  • removing misspelled words,
  • stemming or reducing words to their word stem or root form,
  • normalization, e.g. “aaaand” to “and”, often used when dealing with social media texts,
  • lemmatization.
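
A minimal sketch implementing a few of the steps above; the stop word list here is a small illustrative subset rather than a full one:

```python
# Apply a handful of common text pre-processing steps.
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "for"}

def preprocess(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"https?://\S+", "", text)              # remove URLs
    text = re.sub(r"\d+", "", text)                       # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]         # stop words
    return " ".join(tokens)

print(preprocess("Check https://example.com for the 10 BEST tips!"))
# -> "check best tips"
```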

If the text resides on websites, then it is useful to first employ article extraction.

Text pre-processing, however, should not be applied indiscriminately to all types of ML tasks. If we are employing deep neural networks, which can learn complex patterns in texts, it can sometimes be better to skip stemming, lemmatization and stop word removal during pre-processing.
