Word Embeddings in Natural Language Processing and Tabular Data

Ijaz Khan
Published in unpack · Jan 18, 2021

In the context of machine learning, an embedding maps high-dimensional, discrete inputs into a low-dimensional continuous vector space. In other words, it is a way to represent discrete variables as dense, continuous vectors. Embeddings help ML models work more efficiently and speed up deep neural networks compared with older methods such as one-hot encoding.
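To make the difference concrete, here is a minimal sketch (with a tiny, made-up vocabulary and random numbers, used only to illustrate shapes) contrasting a one-hot vector with a dense embedding vector:

```python
import numpy as np

# Toy vocabulary of 5 words (hypothetical example).
vocab = ["cat", "dog", "kitten", "puppy", "football"]

# One-hot encoding: each word becomes a sparse vector as long as the vocabulary.
one_hot = np.eye(len(vocab))
print(one_hot[vocab.index("cat")])        # [1. 0. 0. 0. 0.]  -- mostly zeros

# Embedding: each word maps to a short, dense vector (here 3 dimensions).
# Real embeddings are learned; random values are used here only to show the shape.
embedding_matrix = np.random.rand(len(vocab), 3)
print(embedding_matrix[vocab.index("cat")])   # e.g. [0.41 0.07 0.83]
```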

Usage of embeddings in ML tasks: Embeddings are most often used in natural language processing (NLP), where the data is almost always discrete in nature, for example question answering, reading comprehension, and machine translation. However, data scientists also use embeddings on tabular data, because tabular data does not always contain only numbers but also words; in that setting they are called embeddings for categorical variables. In this article, I will discuss basic word embedding techniques in NLP and how they apply to tabular data.

Word Embeddings in NLP

A word embedding is a parametrized function that maps words to d-dimensional vectors, f : D → ℝᵈ, where D is a dictionary of words. In addition, it allows words with similar meanings to have similar representations. Below are the two most commonly used word embedding approaches.

  1. Embedding Layer
    This kind of word embedding is learned jointly with a deep neural model on a particular NLP task, such as text classification or language modeling.
    The process starts by preprocessing the text from the document, and each word is one-hot encoded. The embedding space is defined as part of the model; for example, it can have 50, 100, or 300 dimensions (see the sketch after this list).
  2. Transfer Learning and Pre-trained Embeddings
    Pre-trained word embeddings are embeddings learned on one task and reused to solve another, similar task.
    These embeddings are a form of transfer learning in NLP: they are trained on large datasets, saved, and then reused for solving other tasks.
Source: [3]
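As a rough illustration of the first approach, here is a minimal Keras sketch of an embedding layer trained jointly with a small text classifier; the vocabulary size, embedding size, and classifier head are assumptions made for this example, not values from the article:

```python
import tensorflow as tf

vocab_size = 10000   # number of distinct tokens after preprocessing (assumed)
embed_dim = 50       # size of the embedding space; 100 or 300 are also common

# The Embedding layer's weights are learned together with the rest of the model
# while training on the downstream task (here, a binary text classifier).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=5)  # token ids come from your own preprocessing
```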

The two most popular pre-trained word embeddings are:

  • Google’s Word2Vec
  • Stanford’s GloVe

Google’s Word2Vec:
One of the most popular pre-trained word embeddings, Word2Vec is trained on about 100 billion words of Google News. Knowledge discovery and recommendation engines are some of the common use cases of Word2Vec, but it is also applied to many different text classification tasks.

Word2Vec has a simple architecture: a feed-forward neural network with just one hidden layer.
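If you want to experiment with these vectors, one way (assuming you use the gensim library and its downloader) looks roughly like this; the download is large and cached after the first run:

```python
import gensim.downloader as api

# Load the pre-trained Google News vectors (300 dimensions).
wv = api.load("word2vec-google-news-300")

print(wv["cricket"].shape)                  # (300,) -- each word is a 300-d vector
print(wv.most_similar("cricket", topn=3))   # nearest neighbours in the embedding space
print(wv.similarity("cricket", "football")) # cosine similarity between two words
```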

The embeddings are learned using two different approaches:

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram model

The two approaches are the inverse of each other: the former predicts the target word given its neighboring words, whereas the latter predicts the neighboring words given the target word.

Source: [3]
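To see the two objectives side by side, here is a small sketch using gensim’s Word2Vec, where the sg flag switches between CBOW and skip-gram; the toy sentences and hyperparameters are chosen only for illustration:

```python
from gensim.models import Word2Vec

sentences = [["i", "love", "football"],
             ["i", "love", "cricket"],
             ["i", "play", "cricket"]]

# sg=0 -> CBOW: predict the target word from its neighbouring words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the neighbouring words from the target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cricket"][:5])              # first few dimensions of the learned vector
print(skipgram.wv.most_similar("cricket"))
```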

Stanford’s GloVe
GloVe embeddings are built on the idea of deriving relationships between words from global co-occurrence statistics. But how?

The easiest way is to look at a co-occurrence matrix. A co-occurrence matrix shows us how frequently a particular pair of words occurs together: each entry is the count of the corresponding pair of words appearing together in the corpus.

For instance, if we have the text corpus “I love football, I love cricket and I play cricket”, below is the co-occurrence matrix of this corpus:

Co-occurrence matrix for the example corpus. Source: [3]
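A simple sketch of counting co-occurrences in plain Python is shown below; note that the exact counts depend on the assumed context window and tokenization, so they may differ slightly from the published matrix:

```python
from collections import defaultdict

corpus = "i love football i love cricket and i play cricket".split()
window = 2   # assumed context window size

cooc = defaultdict(int)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(word, corpus[j])] += 1

print(cooc[("play", "cricket")])   # how often "play" and "cricket" occur together
print(cooc[("love", "cricket")])
```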

Let’s compute the probabilities for a pair of words; for instance, we focus on the word “cricket”:

p(cricket | play) = 1
p(cricket | love) = 0.5
Computing the ratio of probabilities:
p(cricket | play) / p(cricket | love) = 2

A ratio greater than 1 means that “play” is more relevant to “cricket” than “love” is. In the same way, a ratio close to 1 would mean that “love” and “play” are about equally relevant to the word “cricket”.

This shows how GloVe derives relationships between words using simple co-occurrence statistics.
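Continuing the sketch above, the ratio can be computed directly from those counts, using the simple definition of probability implied here (pair count divided by how often the focus word occurs):

```python
# Occurrence count of each word in the corpus from the previous sketch.
word_counts = {w: corpus.count(w) for w in set(corpus)}

p_cricket_given_play = cooc[("play", "cricket")] / word_counts["play"]   # 1 / 1 = 1.0
p_cricket_given_love = cooc[("love", "cricket")] / word_counts["love"]   # 1 / 2 = 0.5

print(p_cricket_given_play / p_cricket_given_love)   # ratio = 2.0
```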

Embeddings and Tabular Data

Deep learning has focused mostly on computer vision and natural language processing tasks, while less attention has been paid to its use on tabular data. However, tabular data is the most common kind of data used in industry.

Categorical Variables and Embeddings

Nowadays, data scientists are creating embeddings for the categorical variables of tabular data, because tabular data does not always come as simple numbers but also contains words, to which deep learning can be applied for better results. Using this approach, relationships between categories can be captured. For example, zip codes that are geographically near each other can show similar patterns, and zip codes with a similar socio-economic status may also share similar patterns. Another example is days of the week: Saturday and Sunday may behave almost identically, while Friday acts like an average of weekdays and weekends.
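As a rough sketch, a categorical column such as day of the week can be given its own small embedding table; the embedding size below follows one commonly used rule of thumb and is an assumption, not something prescribed by the article:

```python
import numpy as np
import tensorflow as tf

# Hypothetical categorical column: day of the week, encoded as integers 0..6.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# One common heuristic for the embedding size: min(50, (cardinality + 1) // 2).
embed_dim = min(50, (len(days) + 1) // 2)   # 4 dimensions here

day_embedding = tf.keras.layers.Embedding(input_dim=len(days), output_dim=embed_dim)

# Look up the (initially random, later learned) vectors for Saturday and Sunday.
vectors = day_embedding(np.array([days.index("Sat"), days.index("Sun")]))
print(vectors.shape)   # (2, 4)
```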

Inspiration from Word Embeddings

Embeddings are used to capture these multi-dimensional relationships between categorical variables. The inspiration comes from pre-trained word embeddings such as GloVe and Word2Vec.

This is the same idea as is used with word embeddings such as Word2Vec. A set of 3-dimensional embedding vectors might look like:

kitten [0.0, 0.9, 1.0]
cat [0.0, 0.3, 1.0]
puppy [0.8, 1.0, 0.0]
dog [0.9, 0.2, 0.0]

Source: [2]

In the example above, the third dimension is high for both kitten and cat, capturing something related to being a cat, while the second dimension is high for kitten and puppy, capturing youthfulness. This is how semantic relationships such as cat-ness and youthfulness are captured, and plotting these vectors shows the embedding space, with similar objects close to each other.
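A quick way to check this intuition is to measure distances between the toy vectors above; the sketch below simply uses Euclidean distance:

```python
import numpy as np

vectors = {
    "kitten": np.array([0.0, 0.9, 1.0]),
    "cat":    np.array([0.0, 0.3, 1.0]),
    "puppy":  np.array([0.8, 1.0, 0.0]),
    "dog":    np.array([0.9, 0.2, 0.0]),
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return np.linalg.norm(vectors[a] - vectors[b])

print(distance("kitten", "cat"))    # ~0.60 -- closest pair: both are cats
print(distance("kitten", "puppy"))  # ~1.28 -- closer than to "dog": both are young
print(distance("kitten", "dog"))    # ~1.52 -- furthest apart
```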

In practice, the neural network learns the most suitable representation for each category during training, and each dimension of that representation can carry multiple meanings. Machine learning models can capture rich relationships using these distributed representations.
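To illustrate how such representations are learned, here is a minimal (hypothetical) Keras sketch in which a categorical column passes through an embedding layer whose weights are trained jointly with the rest of the network:

```python
import tensorflow as tf

n_categories = 7   # e.g. day of the week
n_numeric = 3      # other numeric columns in the table (assumed)

cat_in = tf.keras.Input(shape=(1,), dtype="int32", name="day_of_week")
num_in = tf.keras.Input(shape=(n_numeric,), name="numeric_features")

# The embedding weights are ordinary trainable parameters: they are updated
# by backpropagation along with the rest of the network.
emb = tf.keras.layers.Embedding(n_categories, 4)(cat_in)
emb = tf.keras.layers.Flatten()(emb)

x = tf.keras.layers.Concatenate()([emb, num_in])
x = tf.keras.layers.Dense(32, activation="relu")(x)
out = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model([cat_in, num_in], out)
model.compile(optimizer="adam", loss="mse")
# model.fit([day_ids, numeric_matrix], targets, epochs=10)
```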

References:

  1. https://machinelearningmastery.com/what-are-word-embeddings
  2. https://www.fast.ai/2018/04/29/categorical-embeddings/
  3. https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp

I hope it was helpful ;)
