5 NLP Tasks for Similarity Search

Get a grip on the Natural Language Processing landscape! Start your NLP journey with this Periodic Table of 80+ NLP tasks

Rob van Zoest
Jan 21 · 6 min read
Periodic Table of Natural Language Processing Tasks by www.innerdoc.com and created with the Periodic Table Creator

Russian chemist Dmitri Mendeleev published the first Periodic Table in 1869. Now it’s time for the NLP tasks to be organized in the Periodic Table style!

The variation and structure of NLP tasks are endless. Still, you can think of building NLP pipelines from standard NLP tasks and dividing those tasks into groups. But what do these tasks entail?

More than 80 frequently used NLP tasks are explained!

Group 12: Similarity

WordNet is a lexical database from Princeton. It contains nouns, verbs, adjectives and adverbs, grouped into sets of synonyms (synsets) with short descriptions, and connected through relations such as hyponymy, meronymy and hypernymy. Each synset describes a concept and is interlinked with other synsets by means of conceptual-semantic and lexical relations.

WordNet results for ‘school’: synset (semantic) relations and word (lexical) relations for the noun, verb and adjective senses (source)

Similarly structured knowledge bases are ConceptNet and VerbNet. These focus on the major languages, but there are also initiatives for smaller languages, such as the Open Dutch WordNet. Applications include word sense disambiguation, text classification, finding similar terms, lexical simplification, etc.
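As a minimal sketch (assuming NLTK is installed and its WordNet corpus has been downloaded), you can explore synsets and their relations like this:

from nltk.corpus import wordnet as wn  # requires: pip install nltk, nltk.download("wordnet")

# Print the first few senses of 'school' with their lemmas and related synsets.
for synset in wn.synsets("school")[:3]:
    print(synset.name(), "-", synset.definition())
    print("  lemmas:   ", [lemma.name() for lemma in synset.lemmas()])
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
    print("  hyponyms: ", [h.name() for h in synset.hyponyms()[:3]])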

Distance Measures show how similar words are to each other. There is lexical (string) word similarity and semantic word similarity. Lexical similarity means that sheep and ship are more similar than sheep and lamb, because the meaning of the words is ignored. It can be calculated with the Levenshtein distance, as used by the RapidFuzz library. Semantic similarity measures the meaning of the words, so sheep and lamb are more similar than sheep and ship. It can be calculated by measuring the cosine distance between word vectors.

Lexical vs Semantic similarity (source)
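A small sketch of both measures, using RapidFuzz for the edit distance and a toy cosine similarity on made-up vectors (real word vectors would come from a pretrained embedding model):

# Lexical (string) similarity via the Levenshtein edit distance.
# Assumes: pip install rapidfuzz numpy
from rapidfuzz.distance import Levenshtein
import numpy as np

print(Levenshtein.distance("sheep", "ship"))  # small edit distance -> similar strings
print(Levenshtein.distance("sheep", "lamb"))  # larger edit distance

# Semantic similarity as cosine similarity between word vectors.
# The vectors below are invented for illustration only; in practice they come
# from a pretrained model (word2vec, GloVe, fastText, ...).
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sheep = np.array([0.8, 0.1, 0.3])
lamb  = np.array([0.7, 0.2, 0.4])
ship  = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(sheep, lamb))  # high -> semantically similar
print(cosine_similarity(sheep, ship))  # low  -> semantically different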

The degree of similarity between the semantic representations of two documents can be estimated with different feature-extraction techniques. Some examples (a small TF-IDF sketch follows the list):

  • The statistical techniques BM25 (Best Matching 25) and TF-IDF (Term Frequency * Inverse Document Frequency), which are the default and former default similarity algorithms in Elasticsearch and Lucene.
  • Latent Semantic Analysis (LSA/LSI) for vectorization of documents. It is often assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. LSA therefore applies a truncated singular value decomposition (closely related to principal component analysis) to the document-term vector space and only keeps the directions that contain the most variance.
  • Latent Dirichlet allocation (LDA) which is a probabilistic method.
  • Doc2Vec (aka paragraph2vec, aka sentence embeddings), a neural network method that extends the word2vec algorithm to learn continuous representations for larger blocks of text in an unsupervised way.
  • USE (Universal Sentence Encoder), which encodes text into high-dimensional vectors. It has pretrained models for English, but also a multilingual model.
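A minimal TF-IDF sketch with scikit-learn (my own choice of library for illustration; the article itself names Elasticsearch and Lucene as implementations), scoring pairwise document similarity:

# TF-IDF document similarity.
# Assumes: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The sheep grazes in the meadow next to the farm.",
    "A lamb is a young sheep kept on the farm.",
    "The ship left the harbour at dawn.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # sparse matrix: documents x terms
similarities = cosine_similarity(tfidf)      # pairwise cosine similarities

print(similarities.round(2))  # documents 0 and 1 should score higher than 0 and 2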

Distributed / Static Word Representations, also called Word Vectors or Word Embeddings, are multi-dimensional meaning representations of a word, reduced to N dimensions. The technique has received a lot of attention since 2013, when Google published the word2vec algorithm. Still, several challenges remained.

Original word embeddings have one vector per word. A vector typically has 300 or 512 dimensions, and a large model covers around 500k words. This results in embeddings that can grow beyond 500 MB and have to be loaded into memory. To reduce this load you can use fewer dimensions, which makes the vectors less distinctive; remove the vectors for infrequent words, although these might be the most interesting ones; or map multiple words onto one vector (pruning in spaCy), which makes those words 100% similar to each other.
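One way to trade coverage for memory is to load only the most frequent vectors, sketched here with gensim's KeyedVectors (the file name and the exact limit are assumptions for illustration):

# Loading a subset of a word2vec-format model to reduce the memory footprint.
# Assumes: pip install gensim, and a word2vec-format file on disk
# (the file name below is a placeholder for whatever model you downloaded).
from gensim.models import KeyedVectors

# Keep only the 100,000 most frequent words instead of the full vocabulary.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=100_000
)

print(len(vectors.key_to_index))              # number of words actually loaded
print(vectors.most_similar("sheep", topn=3))  # nearest neighbours by cosine similarity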

Out-of-vocabulary words are the word-embedding problem of encountering words for which no vector exists. Subword approaches try to solve this unknown-word problem by assuming that you can reconstruct a word’s meaning from its parts.

Lexical ambiguity or polysemy is another problem. A word in a word embedding has no context, so the vector for the word ‘bank’ is trained on the semantic context of the ‘river’ bank, but also of the ‘financial’ bank. Sense2vec solves this context-sensitivity partly by taking meta-information into account. The model is trained on keys like ‘duck|NOUN’ and ‘duck|VERB’, or ‘Obama|PERSON’ and ‘Obama|ORG’ (e.g. the Obama administration), so that it can distinguish senses by their meta-tag (although this does not help for e.g. ‘foot’ as a body part versus a unit of length, since both are nouns). Nowadays the ambiguity problem is solved by attention-based contextualized word representations.
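A rough sketch of querying a pretrained sense2vec model, following the sense2vec library’s documented usage (the on-disk path is a placeholder, and treat the details as an approximation):

# Tag-aware vectors with sense2vec: keys look like "word|TAG".
# Assumes: pip install sense2vec, and a pretrained model unpacked on disk.
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")  # placeholder path

for query in ("duck|NOUN", "duck|VERB"):
    if query in s2v:
        # The noun and verb senses have different neighbours.
        print(query, "->", s2v.most_similar(query, n=3))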

A triggering feature (in the early days) of word embeddings was that they contain semantic relations if the training corpus reflects them. An example: ‘Paris’ is to ‘France’ as ‘London’ is to […]. The embedding can respond with ‘England’. However, it’s not always accurate, and deep learning models are nowadays a better alternative for finding these relations.

Semantic relations in word embeddings (source)
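A small analogy sketch using gensim’s downloader (the glove-wiki-gigaword-100 model is my own choice here; any pretrained KeyedVectors would do):

# Word-analogy queries on pretrained vectors: Paris is to France as London is to ...?
# Assumes: pip install gensim, internet access for the first download.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")  # small pretrained GloVe vectors (lowercased vocabulary)

# vector('france') - vector('paris') + vector('london') ~= vector(?)
result = model.most_similar(positive=["france", "london"], negative=["paris"], topn=3)
print(result)  # 'england' or 'britain' typically ranks near the top, but not always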

The best-known word embedding models are:

  • Word2Vec is the word vector algorithm created by Tomáš Mikolov at Google that started the word-embedding wave. A well-known implementation is available in Gensim.
  • GloVe algorithm is created by Stanford.
  • fastText algorithm is created by Facebook and is a subword embedding where each word is represented as a bag of character n-grams. This means that out-of-vocabulary words can be composed from multiple subwords (see the sketch after this list), and it keeps the embedding smaller. Trained word vectors for 157 languages are available to download.
  • BPEmb is also a subword embedding algorithm. Subwords are based on Byte-Pair Encoding (BPE) which is a specific type of subword tokenization. BPEmb has trained models for 275 languages.
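A minimal sketch of the subword idea with gensim’s FastText implementation (the toy corpus and the unseen word are made up), showing that an out-of-vocabulary word still gets a vector composed from its character n-grams:

# Training a tiny fastText model to show out-of-vocabulary handling via subwords.
# Assumes: pip install gensim
from gensim.models import FastText

corpus = [
    ["the", "sheep", "grazes", "in", "the", "meadow"],
    ["a", "lamb", "is", "a", "young", "sheep"],
    ["the", "ship", "left", "the", "harbour"],
]

model = FastText(sentences=corpus, vector_size=32, window=3, min_count=1, epochs=50)

print("sheepdog" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["sheepdog"][:5])             # still gets a vector, built from character n-grams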

Contextualized / Dynamic Word Representations can be seen as incorporating context into word embeddings and are the ‘upgrade’ of static word representations. Contextualized embeddings can be found in models like BERT, ELMo, and GPT-2.

Pretrained Language Models that take context into consideration (source)

ELMo (Embeddings from Language Models, by AllenNLP) was a response to the polysemy problem: an LSTM-based model that takes context into consideration, so that the same word gets different representations based on its context.

BERT (Bidirectional Encoder Representations from Transformers by Google) was a follow-up that considered the context from both the left and the right sides of each word. It was universal, because no domain-specific dataset was needed. It was also generalizable, because a pre-trained BERT model can be fine-tuned easily for various downstream NLP tasks.
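As a hedged sketch with the Hugging Face transformers library (the bert-base-uncased checkpoint and the example sentences are my own choices), comparing the contextual vectors of ‘bank’ in different contexts:

# Contextual embeddings: the same word gets different vectors in different contexts.
# Assumes: pip install transformers torch, internet access for the first download.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the last-layer hidden state of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[position]

river  = bank_vector("He sat down on the bank of the river.")
money  = bank_vector("She deposited the money at the bank.")
money2 = bank_vector("The bank approved the loan application.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0).item())   # lower: different senses of 'bank'
print(cos(money, money2, dim=0).item())  # higher: same financial sense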

GPT (Generative Pretrained Transformer, by OpenAI) also emphasized the importance of the Transformer architecture, which is simpler, trains faster and allows more parallelization than an LSTM-based model. It is also able to learn complex patterns in the data by using the Attention mechanism: an added layer that lets a model focus on what is important in a long input sequence.
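A minimal sketch of the attention idea (scaled dot-product self-attention, single head, no masking; the toy tensors are made up for illustration):

# Scaled dot-product attention in a few lines.
import torch

def attention(query, key, value):
    scores = query @ key.transpose(-2, -1) / key.shape[-1] ** 0.5  # similarity of each query to each key
    weights = torch.softmax(scores, dim=-1)                        # how much to focus on each position
    return weights @ value                                         # weighted sum of the values

seq_len, dim = 5, 16
x = torch.randn(seq_len, dim)    # token representations of a toy sequence
print(attention(x, x, x).shape)  # torch.Size([5, 16]) -- self-attention output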

For a technical summary of the (20+) available model types, see this Transformers Summary from Hugging Face.

I have tried to make the Periodic Table of NLP tasks as complete as possible. It is therefore more of a long-read than a self-contained blog article. I split the 80+ task descriptions into the groups of the Periodic Table.

You can find the other group-articles here!

The set-up and composition of the Periodic Table are subjective. The division of tasks and categories could have been done in many other ways. I appreciate your feedback and new ideas in the form below. I tried to write a clear and short description for each task. I omitted the deeper details, but provided links to extra information where possible. If you have improvements, you can add them below or contact me on LinkedIn.

Please drop me a message if you have any additions!

Download the Periodic table of NLP tasks here!

Create your own customized Periodic Table here!
