Understanding Word Embeddings with Keras

Henry Wu
6 min read · Jan 30, 2024


In this post, we will cover word embeddings, an approach in NLP for representing text as real-valued vectors that capture complex relationships between words and phrases. We will explore the motivation behind word embeddings and then demonstrate how to embed text using the Embedding Layer in Keras.

Photo by Patrick Tomasso on Unsplash

Background

In machine learning, models typically work with numerical data. This means we need ways to convert text into numerical formats. Generally, this process starts with breaking down texts into individual tokens, usually words, each of which is then numerically encoded. Traditional approaches include:

  • One-hot Encoding: This method represents each word in the vocabulary with a binary vector. All elements of this vector are zeroes, except for a ‘1’ at the index representing the word. This approach leads to sparse, high-dimensional vectors, which are inefficient and fail to capture the relationships between words.
  • Unique Number Encoding: This method assigns a unique number to each word in the vocabulary. This approach avoids the sparse vectors and high dimensionality of one-hot encoding. However, the numbers assigned are arbitrary and, like one-hot encoding, do not capture the relationships between words. A minimal sketch of both approaches follows this list.
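
To make these two approaches concrete, here is a minimal sketch using a toy three-word vocabulary chosen purely for illustration:

import numpy as np

vocabulary = ["cat", "dog", "car"]

# Unique number encoding: each word gets an arbitrary integer index.
word_to_index = {word: i for i, word in enumerate(vocabulary)}
print(word_to_index)  # {'cat': 0, 'dog': 1, 'car': 2}

# One-hot encoding: a sparse binary vector per word, all zeros except a single 1.
one_hot = np.eye(len(vocabulary), dtype=int)
for word, i in word_to_index.items():
    print(word, one_hot[i])  # e.g. cat [1 0 0]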

The limitations of traditional encoding methods have led to word embeddings, a more sophisticated approach to representing text numerically, which we will look at next.

What are Word Embeddings?

Word embeddings are an approach for numerically representing words and sentences. This approach associates each word in a vocabulary with a dense vector of real values. Within this vector space, similar words have similar encodings, meaning that words with similar meanings tend to be close to each other. The length of the vector is a parameter we can specify, and the values in the vector are learned and tuned during training to maximize the log-likelihood of the training data.
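
To illustrate what "close to each other" means, here is a small sketch using made-up 3-dimensional vectors (real embeddings are learned from data and typically have far more dimensions). Cosine similarity is a common way to measure how close two embeddings are:

import numpy as np

# Hypothetical 3-dimensional embeddings, chosen only for illustration.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.3]),
    "dog": np.array([0.8, 0.2, 0.4]),
    "car": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # ~0.98, very similar
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # ~0.36, less similar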

The main goal of word embeddings is to address the curse of dimensionality inherent in language modeling. They do this by learning a distributed representation of words in a lower-dimensional space. This means that, after learning, each training sentence can inform the model about a combinatorial number of other sentences.

There are various word embedding techniques, such as Word2Vec, GloVe, and fastText, each suited to different tasks; TF-IDF, by contrast, produces sparse count-based weights rather than dense embeddings. In this discussion, we will use the Embedding Layer in Keras to demonstrate how to map each word into a dense vector of real values.

Embedding Layer in Keras

The Embedding Layer in Keras is designed to map positive integer inputs of a fixed range into dense vectors of fixed size. Before we can use this layer, our text has to be preprocessed, which includes tokenization and indexing. For simplicity, we will use the TextVectorization Layer to preprocess the text.
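
The code below assumes TensorFlow 2.x, where both layers are available under tf.keras.layers (in older releases, TextVectorization lived under tf.keras.layers.experimental.preprocessing):

from tensorflow.keras.layers import TextVectorization, Embedding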

Text Data

Here we define a sample corpus with three sentences:

corpus = [
    "This is the an example sentence.",
    "This post covers word embedding.",
    "This section talks about embedding layer in Keras.",
]

TextVectorization

max_features = 20  # Maximum vocabulary size.
max_len = 8        # Sequence length to pad the outputs to.

vectorize_layer = TextVectorization(
    max_tokens=max_features,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    output_mode='int',
    output_sequence_length=max_len,
)

# Build a vocabulary of all string tokens seen in the corpus.
vectorize_layer.adapt(corpus)

print("vocabulary size: ", vectorize_layer.vocabulary_size())
print(vectorize_layer.get_vocabulary())

Let’s look at the two main parameters:

  • max_tokens: This sets the upper limit for the vocabulary size. In our example, it’s set to 20. This number is chosen arbitrarily and is not critical in our case since our total vocabulary size is only 18.
  • output_sequence_length: This determines the length to which the output sequences will be padded or truncated. We’ve set this to 8, which matches the length of our longest sentence in terms of token count.

output:

vocabulary size:  18
['', '[UNK]', 'this', 'embedding', 'word', 'the', 'talks', 'sentence',
'section', 'post', 'layer', 'keras', 'is', 'in', 'example', 'covers',
'an', 'about']

This layer first adapts to the corpus to create a vocabulary. In our case, the vocabulary size is 18, including special tokens for out-of-vocabulary (OOV) words and padding. This layer splits each sentence into tokens, and then represents each token by an integer, based on its index in the vocabulary. Lastly, it pads the list of tokens to our desired length of 8.
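
As a quick sanity check, we can vectorize a sentence containing a word the layer never saw during adapt(). Given the vocabulary above, the unseen word maps to the OOV index 1 and the rest of the sequence is padded with 0s:

print(vectorize_layer(["Keras loves word embedding"]))
# Expected, given the vocabulary above: [[11  1  4  3  0  0  0  0]]
#   'keras' -> 11, 'loves' -> 1 ([UNK]), 'word' -> 4, 'embedding' -> 3, padding -> 0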

Here is how each sentence is preprocessed:

for sentence in corpus:
    encoding = vectorize_layer([sentence])
    print(f"'{sentence}' -> {encoding[0]}")

output:

'This is the an example sentence.' -> [ 2 12  5 16 14  7  0  0]
'This post covers word embedding.' -> [ 2 9 15 4 3 0 0 0]
'This section talks about embedding layer in Keras.' -> [ 2 8 6 17 3 10 13 11]

The output shows how each sentence is transformed into integer indices. For example, the word ‘This’ is encoded as 2, and padding is represented by 0s. It’s important to note that this numerical representation is not capable of capturing the relationships between words.

Next, we will explain how the Embedding Layer works with these integer indices.

Embedding Layer

vocab_size = vectorize_layer.vocabulary_size() # 18
embedding_dim = 4

embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len)

The Embedding Layer takes the integer indices and maps them to dense vectors of real values. It has three main parameters:

  • input_dim: vocabulary size
  • output_dim: dimension of the embedding vector
  • input_length: length of input sequences

In our case, we are passing lists of 8 indices (input_length) into the Embedding Layer to map each of the 18 unique vocabulary words (input_dim) to a 4-dimensional embedding vector (output_dim).
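
We can verify this on the integer indices of our first sentence. The layer acts as a lookup, so a (batch, sequence) input of indices becomes a (batch, sequence, 4) tensor of embedding vectors:

import tensorflow as tf

# Indices of 'This is the an example sentence.' from the TextVectorization step.
sample = tf.constant([[2, 12, 5, 16, 14, 7, 0, 0]])
print(embedding(sample).shape)  # (1, 8, 4)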

After defining the embedding layer, we can use it to transform the text:

print(embedding(vectorize_layer(corpus)))

output:

[[[-0.04588735  0.04821293  0.02707684  0.03532112]
[-0.01999545 -0.04968796 -0.02757902 0.00817414]
[ 0.02031758 -0.01175624 0.01233627 0.01491675]
[-0.01573961 -0.03532983 -0.02621815 -0.04042824]
[ 0.02531271 -0.01728104 -0.00472096 -0.04027735]
[-0.04076502 -0.04600718 -0.03657167 -0.04884919]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]]

[[-0.04588735 0.04821293 0.02707684 0.03532112]
[-0.02176586 -0.03552277 0.0177248 -0.01874503]
[ 0.04428699 0.03702401 -0.038103 0.01531515]
[ 0.01737973 -0.01626869 -0.03483053 0.00736971]
[-0.02679001 -0.00531564 0.02538358 -0.00660758]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]]

[[-0.04588735 0.04821293 0.02707684 0.03532112]
[-0.01086154 -0.00346394 0.01886177 -0.02077385]
[-0.04637978 0.01801422 0.00417098 -0.0366376 ]
[ 0.03626612 -0.03860765 0.00117645 -0.00586362]
[-0.02679001 -0.00531564 0.02538358 -0.00660758]
[ 0.00779638 -0.04695494 -0.02802014 -0.01979377]
[-0.03570173 0.04125914 -0.04764031 -0.03406505]
[ 0.00056888 0.01333905 -0.04374031 0.03733016]]]

The output shows that each sentence from the corpus is transformed into an (8x4) matrix. Each integer index is now represented by a 4-dimensional embedding vector.
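
The overall shape of the result confirms this:

embedded = embedding(vectorize_layer(corpus))
print(embedded.shape)  # (3, 8, 4): 3 sentences, 8 tokens each, 4-dimensional vectors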

Let’s see how the first sentence from our corpus is transformed:

'This is the an example sentence.'
-> [ 2 12 5 16 14 7 0 0]
->[[-0.04588735 0.04821293 0.02707684 0.03532112]
[-0.01999545 -0.04968796 -0.02757902 0.00817414]
[ 0.02031758 -0.01175624 0.01233627 0.01491675]
[-0.01573961 -0.03532983 -0.02621815 -0.04042824]
[ 0.02531271 -0.01728104 -0.00472096 -0.04027735]
[-0.04076502 -0.04600718 -0.03657167 -0.04884919]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]
[-0.00248233 -0.04897319 -0.01218885 0.02799994]]

The sentence is first converted into integer indices through the TextVectorization Layer. The integer indices are then mapped to 4-dimensional vectors through the Embedding Layer. For example, ‘2’ is mapped to [-0.04588735 0.04821293 0.02707684 0.03532112].

Embedding Layer Weights

Let’s take a look at the weights and their shape:

embedding_weights = embedding.get_weights()[0]

print("Shape of the embedding weights:", embedding_weights.shape)
print("Embedding weights:\n", embedding_weights)

output:

Shape of the embedding weights: (18, 4)
Embedding weights:
[[-0.00248233 -0.04897319 -0.01218885 0.02799994]
[ 0.0224599 -0.02493274 -0.01421466 -0.00826609]
[-0.04588735 0.04821293 0.02707684 0.03532112]
[-0.02679001 -0.00531564 0.02538358 -0.00660758]
[ 0.01737973 -0.01626869 -0.03483053 0.00736971]
[ 0.02031758 -0.01175624 0.01233627 0.01491675]
[-0.04637978 0.01801422 0.00417098 -0.0366376 ]
[-0.04076502 -0.04600718 -0.03657167 -0.04884919]
[-0.01086154 -0.00346394 0.01886177 -0.02077385]
[-0.02176586 -0.03552277 0.0177248 -0.01874503]
[ 0.00779638 -0.04695494 -0.02802014 -0.01979377]
[ 0.00056888 0.01333905 -0.04374031 0.03733016]
[-0.01999545 -0.04968796 -0.02757902 0.00817414]
[-0.03570173 0.04125914 -0.04764031 -0.03406505]
[ 0.02531271 -0.01728104 -0.00472096 -0.04027735]
[ 0.04428699 0.03702401 -0.038103 0.01531515]
[-0.01573961 -0.03532983 -0.02621815 -0.04042824]
[ 0.03626612 -0.03860765 0.00117645 -0.00586362]]

As shown in the output, the Embedding Layer weights form a lookup table of shape (18, 4). Each row in this table corresponds to one of the 18 words in our vocabulary, and each word is represented by a 4-dimensional vector.
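
We can confirm the lookup behaviour directly: the embedding of index 2 (‘this’) is exactly row 2 of the weight matrix:

import numpy as np
import tensorflow as tf

row_from_layer = embedding(tf.constant([2])).numpy()[0]
print(np.allclose(row_from_layer, embedding_weights[2]))  # True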

Currently, these weights are randomly initialized and carry no meaningful information. They will be adjusted during model training, which is what allows the model to learn a meaningful representation for each word and capture semantic relationships and context.
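
As a rough sketch of what that training looks like in practice, the embedding layer is usually placed right after the text preprocessing step and updated by backpropagation like any other layer. The binary labels below are made up purely for illustration:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

labels = tf.constant([0, 1, 1])  # dummy labels, one per sentence

model = Sequential([
    vectorize_layer,                                  # text -> integer indices
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    GlobalAveragePooling1D(),                         # average the 8 token vectors
    Dense(1, activation="sigmoid"),                   # toy binary classifier
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(tf.constant(corpus), labels, epochs=5, verbose=0)

# After fitting, the embedding weights are no longer random.
print(model.layers[1].get_weights()[0].shape)  # (18, 4)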
