Brief review of word embedding families (2019)

Ajit Rajasekharan
Published in Analytics Vidhya · 4 min read · Mar 23, 2019


Word embeddings are essentially vector representations of words, typically learnt by an unsupervised model fed with large amounts of text as input (e.g. Wikipedia, scientific papers, news articles, etc.).

These representations capture syntactic and semantic similarity between words, among other properties. They are hence very useful for representing words in downstream NLP tasks such as POS tagging, NER, etc.
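As a quick illustration of these similarity properties, here is a minimal sketch (not from the article) that loads pretrained GloVe vectors through gensim's downloader and inspects nearest neighbours. The dataset name "glove-wiki-gigaword-100" is one of gensim's hosted pretrained sets, assumed here for convenience.

```python
# Minimal sketch: load pretrained word vectors and probe their similarity structure.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # KeyedVectors: one vector per word

# Each word maps to a single 100-dimensional vector.
print(vectors["king"].shape)                   # (100,)

# Cosine similarity reflects semantic relatedness.
print(vectors.similarity("cat", "dog"))

# Nearest neighbours in the embedding space.
print(vectors.most_similar("king", topn=5))
```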

We will examine three families of word embeddings below:

  • Attention (Transformer) based. Embeddings generated by BERT, which has produced state-of-the-art results to date on downstream tasks like NER, Q&A, and classification. BERT takes the order of words in a sentence into account, but it is based on an attention mechanism rather than on sequence models like ELMo, described next. A short sketch of extracting BERT embeddings follows this list.
  • RNN family based. Sequence models such as ELMo that produce word embeddings. ELMo uses stacked bidirectional LSTMs to generate word embeddings whose properties differ depending on the layer that generates them.
  • Bag of words based. The original word-order-independent models like word2vec and GloVe.
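To make the first family concrete, here is a hedged sketch (assuming the Hugging Face transformers library and PyTorch, neither of which the article prescribes) of pulling per-token embeddings out of the standard "bert-base-uncased" checkpoint:

```python
# Minimal sketch: per-token embeddings from a pretrained BERT model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per (sub)word token, computed with self-attention
# over the whole sentence, so word order and context both matter.
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)  # (num_tokens, 768)
```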

The main difference between the word embeddings of word2vec, GloVe, ELMo, and BERT is that:

  • word2vec and GloVe word embeddings are context independent: these models output just one vector (embedding) for each word.
  • That is, there is a single numeric representation of a word (which we call its embedding/vector) regardless of where it occurs in a sentence, as illustrated in the sketch below.
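The contrast can be seen directly with a contextual model. The following hedged sketch (again assuming transformers and PyTorch) shows BERT producing different vectors for "bank" in two sentences, whereas a word2vec/GloVe lookup table would return the identical vector in both cases. The helper function name is illustrative only.

```python
# Contextual vs. context-independent: BERT's "bank" vector depends on the sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return BERT's embedding for the token 'bank' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[tokens.index("bank")]

v_money = bank_vector("She deposited cash at the bank.")
v_river = bank_vector("They had a picnic on the river bank.")

# Cosine similarity well below 1.0: the two "bank" embeddings differ by context.
print(torch.cosine_similarity(v_money, v_river, dim=0).item())
```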

