Russian chemist Dmitri Mendeleev published the first Periodic Table in 1869. Now it’s time for the NLP tasks to be organized in the Periodic Table style!
The variety of NLP tasks is endless. Still, you can think about building NLP pipelines from standard NLP tasks and dividing those tasks into groups. But what do these tasks entail?
More than 80 frequently used NLP tasks are explained!
Group 12: Similarity
58. WordNet Synsets
WordNet is a lexical database from Princeton. It consists of nouns, verbs, adjectives and adverbs that are grouped into sets of synonyms (synsets), each with a description. Every synset describes a concept and is interlinked with other synsets by means of conceptual-semantic and lexical relations such as hypernymy, hyponymy and meronymy.
Similar structured knowledge bases are ConceptNet or VerbNet. These focus on the major languages, but there are also initiatives for smaller languages like the Open Dutch WordNet. Applications can be tasks for word sense disambiguation, classification of texts, finding similar terms, lexical simplification, etc.
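WordNet itself is accessible in Python through NLTK (after downloading the corpus). The core idea of a synset with hypernym links can be sketched with a toy hand-built structure; the synset IDs below mirror WordNet's naming style, but the data is hard-coded for illustration:

```python
# Toy illustration of WordNet-style synsets: each synset groups synonym
# lemmas and links to a more general concept (its hypernym).
synsets = {
    "dog.n.01": {"lemmas": ["dog", "domestic_dog", "Canis_familiaris"],
                 "hypernym": "canine.n.02"},
    "canine.n.02": {"lemmas": ["canine", "canid"], "hypernym": "carnivore.n.01"},
    "carnivore.n.01": {"lemmas": ["carnivore"], "hypernym": None},
}

def hypernym_chain(synset_id):
    """Walk the hypernym links up to the most general concept."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = synsets[synset_id]["hypernym"]
    return chain

print(hypernym_chain("dog.n.01"))
# -> ['dog.n.01', 'canine.n.02', 'carnivore.n.01']
```

Walking such hypernym chains is what powers tasks like finding a common ancestor concept for two terms.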
59. Distance Measures
Distance measures show how similar two words are. There is syntactic word similarity and semantic word similarity. Syntactic similarity means that sheep and ship are more similar than sheep and lamb, because meaning is ignored; it can be calculated with the Levenshtein distance, as used by the RapidFuzz library. Semantic similarity compares the meaning of words, so sheep and lamb are more similar than sheep and ship; it can be calculated by measuring the cosine distance between word vectors.
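Both flavours can be sketched in a few lines. The Levenshtein function below is a plain dynamic-programming version (RapidFuzz provides a much faster implementation), and the cosine similarity uses tiny hand-made vectors instead of real word embeddings:

```python
import math

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))

# Syntactic: 'sheep' and 'ship' need fewer edits than 'sheep' and 'lamb'.
print(levenshtein("sheep", "ship"))   # -> 2
print(levenshtein("sheep", "lamb"))   # -> 5

# Semantic: with (toy) meaning vectors, 'sheep' is closer to 'lamb'.
vectors = {"sheep": [0.9, 0.8, 0.1], "lamb": [0.85, 0.75, 0.2], "ship": [0.1, 0.2, 0.9]}
print(cosine_similarity(vectors["sheep"], vectors["lamb"]))  # ~0.99
print(cosine_similarity(vectors["sheep"], vectors["ship"]))  # ~0.30
```

Note how the two measures disagree on purpose: edit distance ranks ship closest to sheep, the vectors rank lamb closest.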
60. Document Similarity
Estimating the degree of similarity between the semantic representations of two documents can be done with different feature-extraction techniques. Some examples:
- The statistical techniques BM25 (Best Matching 25) and TF-IDF (Term Frequency * Inverse Document Frequency), which are the default and the former default similarity algorithms in Elasticsearch and Lucene.
- Latent Semantic Analysis (LSA/LSI) for vectorization of documents. It is often assumed that the underlying semantic space of a corpus is of a lower dimensionality than the number of unique tokens. Therefore, LSA applies principal component analysis to the vector space and only keeps the directions that contain the most variance.
- Latent Dirichlet allocation (LDA) which is a probabilistic method.
- Doc2Vec (aka paragraph2vec, aka sentence embeddings), a neural network method that modifies the word2vec algorithm for unsupervised learning of continuous representations of larger blocks of text.
- USE (Universal Sentence Encoder), which encodes text into high-dimensional vectors. It has pretrained models for English, but also a multilingual model.
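A minimal TF-IDF plus cosine-similarity pipeline might look like the sketch below. It is written in plain Python for transparency; in practice you would use e.g. scikit-learn's TfidfVectorizer or Gensim, and the three example documents are made up:

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stock markets fell sharply today"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def idf(term):
    """Inverse document frequency, with +1 so common terms keep some weight."""
    df = sum(term in doc for doc in tokenized)
    return math.log(len(docs) / df) + 1.0

def tfidf_vector(doc):
    """One weight per vocabulary term: term frequency times IDF."""
    counts = Counter(doc)
    return [counts[w] / len(doc) * idf(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

vecs = [tfidf_vector(d) for d in tokenized]
print(cosine(vecs[0], vecs[1]))  # cat/dog sentences share many terms -> high
print(cosine(vecs[0], vecs[2]))  # no shared terms -> 0.0
```

Because TF-IDF vectors only overlap on shared terms, documents with no words in common score exactly zero; that is the gap the semantic methods in the list (LSA, Doc2Vec, USE) try to close.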
61. Distributed Word Representations
Distributed / Static Word Representations, also known as word vectors or word embeddings, are dense meaning representations of a word, reduced to a fixed number of N dimensions. The technique has received a lot of attention since 2013, when Google published the word2vec algorithm. Still, several challenges remained.
Original word embeddings have one vector per word. A vector typically has 300 or 512 dimensions, and a large model covers some 500k words. The resulting embeddings can grow beyond 500 MB and have to be loaded into memory. To reduce this load you can use fewer dimensions, which makes the vectors less distinctive; remove vectors for infrequent words, although these might be the most interesting ones; or map multiple words onto one vector (pruning in spaCy), which makes those words 100% similar to each other.
Encountering out-of-vocabulary words is the word embedding problem of having words for which no vector exists. Subword approaches try to solve the unknown word problem by assuming that you can reconstruct a word’s meaning from its parts.
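The subword idea can be illustrated with fastText-style character n-grams: a word is wrapped in boundary markers and split into overlapping pieces, so an unseen word still shares pieces with known words. A minimal sketch:

```python
def char_ngrams(word, n=3):
    """fastText-style character n-grams, with '<' and '>' marking word boundaries."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
# -> ['<wh', 'whe', 'her', 'ere', 're>']
```

An out-of-vocabulary word's vector can then be approximated by averaging the vectors of its n-grams, most of which will also occur in known words.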
Lexical ambiguity or polysemy is another problem. A word in a word embedding has no context, so the vector for the word ‘bank’ is trained on the semantic context of the ‘river’ bank, but also of the ‘financial’ bank. Sense2vec partly solves this context-sensitivity by taking meta-information into account. The model is trained on words like ‘duck|NOUN’ and ‘duck|VERB’ or ‘Obama|PERSON’ and ‘Obama|ORG’ (e.g. the Obama administration) to be more distinctive on the meta-information tag (but how about ‘foot’: body part vs. unit of length?). Nowadays the ambiguity problem is solved by attention-based contextualized word representations.
A triggering feature (in the early days) of word embeddings was that they capture semantic relations if the training corpus reflects them. An example: ‘Paris’ is to ‘France’ as ‘London’ is to […]. The embedding can respond with ‘England’. However, this is not always accurate, and deep learning models are nowadays a better alternative for finding such relations.
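With real embeddings this analogy query is typically answered via vector arithmetic (Gensim exposes it as most_similar with positive and negative word lists). The arithmetic itself can be sketched with hand-crafted toy vectors, built here so that the capital-to-country offset is consistent:

```python
import math

# Toy 3-d vectors, hand-crafted so that country = capital + [0, 1, 0].
vectors = {
    "Paris":   [1.0, 0.0, 5.0],
    "France":  [1.0, 1.0, 5.0],
    "London":  [4.0, 0.0, 2.0],
    "England": [4.0, 1.0, 2.0],
    "banana":  [0.0, 9.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("Paris", "France", "London"))  # -> England
```

In a real 300-dimensional embedding trained on a large corpus the offsets are only approximately consistent, which is exactly why the answers are not always accurate.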
The best-known word embedding models are:
- Word2Vec is the first word vector algorithm, created by Tomáš Mikolov at Google. A well-known implementation is in Gensim.
- The GloVe algorithm was created at Stanford.
- The fastText algorithm was created by Facebook. It is a subword embedding where each word is represented as a bag of character n-grams, which means that out-of-vocabulary words can be composed from multiple subwords. This also keeps the embedding smaller. Trained word vectors for 157 languages are available for download.
- BPEmb is also a subword embedding algorithm. Subwords are based on Byte-Pair Encoding (BPE) which is a specific type of subword tokenization. BPEmb has trained models for 275 languages.
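The core of BPE is simple: start from single characters and repeatedly merge the most frequent adjacent symbol pair in the corpus. The sketch below follows the textbook formulation on a made-up toy corpus:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules on a toy corpus of whole words."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe(corpus, 3))
# -> [('e', 's'), ('es', 't'), ('est', '</w>')]
```

After a few thousand merges on a real corpus, frequent words become single tokens while rare and unknown words decompose into learned subwords.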
62. Contextualized Word Representations
Contextualized / Dynamic Word Representations can be seen as incorporating context into word embeddings and is the ‘upgrade’ of Static Word Representations. Contextualized Embeddings can be found in models like BERT, ELMo, and GPT-2.
ELMo (Embeddings from Language Models, by AllenNLP) was the response to the polysemy problem: an LSTM-based model that takes context into consideration, so that the same word can have different meanings based on its context.
BERT (Bidirectional Encoder Representations from Transformers by Google) was a follow-up that considered the context from both the left and the right sides of each word. It was universal, because no domain-specific dataset was needed. It was also generalizable, because a pre-trained BERT model can be fine-tuned easily for various downstream NLP tasks.
GPT (Generative Pretrained Transformer by OpenAI) also emphasized the importance of the Transformer framework, which has a simpler architecture, trains faster and allows more parallelization than an LSTM-based model. It is also able to learn complex patterns in the data by using the Attention mechanism. Attention is an added layer that lets a model focus on what’s important in a long input sequence.
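At its core, the Attention mechanism is a weighted average: each query scores all keys, the scores are softmax-normalised into weights, and the values are mixed accordingly. A minimal single-head, scaled dot-product sketch in plain Python (real models use tensor libraries and learned projection matrices):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain nested lists."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score the query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # how much each position matters for this query
        # Output is the weights-weighted mix of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# One query attending over three positions.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(q, k, v))  # a convex mix of the rows of v, leaning towards v[0] and v[2]
```

Because the weights sum to one, the output always stays inside the span of the value vectors; the model learns which positions deserve the weight.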
For a technical summary of the (20+) available model types see the Transformers Summary from Hugging Face.
ABOUT THIS POST
I have tried to make the Periodic Table of NLP tasks as complete as possible. It is therefore more a long read than a self-contained blog article. I split the 80+ task descriptions into the groups of the Periodic Table.
The set-up and composition of the Periodic Table is subjective. The division of tasks and categories could have been done in multiple other ways. I appreciate your feedback and new ideas in the form below. I tried to write a clear and short description for each task. I omitted the deeper details, but provided links to extra information where possible. If you have improvements, you can add them below or contact me on LinkedIn.
Please drop me a message if you have any additions!