Hands-on NLP Deep Learning Model Preparation in TensorFlow 2.X
This is a tutorial to walk through the NLP model preparation pipeline: tokenization, sequence padding, word embeddings, and Embedding layer setups.
Intro: why I wrote this post
Many state-of-the-art results in NLP problems are achieved by using DL (deep learning), and you probably want to use deep learning to solve NLP problems as well. While there are a lot of materials discussing how to choose and train the “best” neural network architecture, such as an RNN, selecting and configuring a suitable neural network is just one part of solving a practical NLP problem. The other important part, which is often underestimated, is model preparation. NLP tasks usually require special data treatment in the model preparation stage. In other words, there are a lot of things to do before we can throw the data into the neural networks to train. Unfortunately, there are not many tutorials giving detailed guidance on model preparation.
Besides, the packages or APIs that support the state-of-the-art NLP theories and algorithms have usually been released only recently and are updated at a rapid pace. (e.g., TensorFlow was first released in 2015, PyTorch in 2016, and spaCy in 2015.) To achieve better performance, many times you might have to integrate several packages in your deep learning pipeline while preventing them from clashing with each other.
That’s why I decided to write this article to give you a detailed tutorial.
- I will walk you through the model preparation pipeline, from tokenizing the raw data to configuring the TensorFlow Embedding layer, so that your neural networks are ready for training.
- The example code will help you to have a solid understanding of the model preparation steps.
- In the tutorial, I will choose popular packages and APIs that specialize in NLP and advise on default parameter settings, to make sure you have a good start on the NLP deep learning journey.
What to expect in this article
- We will walk through the NLP model preparation pipeline using TensorFlow 2.X and spaCy. The four main steps in the pipeline are tokenization, padding, word embeddings, and Embedding layer setup.
- The motivation (why we need this) and intuition (how it works) will be introduced, so don’t worry if you are new to NLP or deep learning.
- I will mention some common issues during model preparation and potential solutions.
- There is a notebook you can play with, available on Colab and Github. While we are using a toy dataset in the example (a small piece of the IMDB movie review dataset), the code can be applied to larger and more practical datasets.
Without further ado, let’s start with the first step.
Tokenization
What is tokenization?
In NLP, tokenization means breaking the raw text into small units (a.k.a. tokens). A token can be a sentence, a phrase, or a word. Each token has a unique token-id. The purpose of tokenization is that we can use those tokens (or token-ids) to represent the original text. Here is an illustration.
Tokenization usually includes two stages:
Stage 1: create a token dictionary. In this stage,
- Select token candidates (usually words) by first separating the raw text into sentences, then breaking the sentences down into words.
- Certain preprocessing is usually involved, e.g., lowercasing, punctuation removal, etc.
- Note that tokens should be unique and assigned to different token-ids, e.g., ‘car’ and ‘cars’ are different tokens, as are ‘CAR’ and ‘car’. The chosen tokens and the associated token-ids form the token dictionary.
Stage 2: text representation. In this stage,
- Represent the original text with the tokens (or the associated token-ids) by referring to the token dictionary.
- Sometimes only some of the tokens are used for text representation (e.g., only the most frequent tokens are selected); thus, the final tokenized sequence will only include such chosen tokens. (A minimal sketch of these two stages follows this list.)
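To make the two stages concrete, here is a minimal hand-rolled sketch in plain Python (the toy corpus is made up for illustration; no NLP package involved):
# stage 1: create a token dictionary (with simple lowercasing and punctuation stripping)
corpus = ["The car is fast.", "The cars are slow."]
words = [w.strip(".,").lower() for sentence in corpus for w in sentence.split()]
token_dict = {}
for w in words:
    if w not in token_dict:
        token_dict[w] = len(token_dict) + 1  # reserve 0 for empty/padding tokens
# (unlike Keras' Tokenizer, this sketch assigns ids by first appearance, not by frequency)

# stage 2: represent the original text with token-ids
sequences = [[token_dict[w.strip(".,").lower()] for w in sentence.split()] for sentence in corpus]
print(token_dict)  # {'the': 1, 'car': 2, 'is': 3, 'fast': 4, 'cars': 5, 'are': 6, 'slow': 7}
print(sequences)   # [[1, 2, 3, 4], [1, 5, 6, 7]]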
In TensorFlow
We will take a piece of IMDB movie review dataset to demonstrate the pipeline.
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
tokenizer = Tokenizer()
tokenizer.fit_on_texts(raw_text)
train_sequences = tokenizer.texts_to_sequences(raw_text) #Converting text to a vector of word indexes
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print('1st token-id sequence', train_sequences[0])
>>Found 212 unique tokens.
>>1st token-id sequence [21, 4, 2, 12, 22, 23, 50, 51, 13, 2, 52, 53, 54, 24, 6, 2, 55, 56, 57, 7, 2, 58, 59, 4, 25, 60]
Now, let’s take a look at what we get from the tokenization step.
a token dictionary
# display the token dictionary (from most frequent to rarest)
# these are the 2 useful attributes, (get_config will show the rest)
print(tokenizer.word_index)
print(tokenizer.word_counts)
# tokenizer.get_config()
>>{'the': 1, 'a': 2, 'and': 3, 'of': 4, 'to': 5, 'with': 6, 'is': 7, 'this': 8, 'by': 9, 'his': 10, 'movie': 11, 'man': 12, 'for': 13, ...
>>OrderedDict([('story', 2), ('of', 8), ('a', 11), ('man', 3), ('who', 2), ('has', 2), ('unnatural', 1), ('feelings', 1), ('for', 3), ('pig', 1), ('br', 1), ('starts', 1), ('out', 2), ('with', 6), ...
Explanation:
- The tokenizer counts the occurrences of each word (token) and ranks the tokens by their counts. e.g., ‘the’ is the most frequent token in the corpus, so it ranks no. 1 and is associated with token-id “1”. This ranking is stored in a dictionary, and we can use the tokenizer.word_index attribute to review it.
- We can use tokenizer.word_counts to check the counts associated with each token.
Important note: when using the TensorFlow Tokenizer, the 0 token-id is reserved for empty tokens, i.e., real token-ids start at 1.
token-id sequences
# compare the number of tokens and tokens after cut-off
train_sequences = tokenizer.texts_to_sequences(raw_text) #Converting text to a vector of word indexes
# print(len(text_to_word_sequence(raw_text[0])), len(train_sequences[0]))
print(raw_text[0])
print(text_to_word_sequence(raw_text[0]))
print()
tokenizer.num_words = None # take all the tokens
print(tokenizer.texts_to_sequences(raw_text)[0])
tokenizer.num_words = 50 # take the top 50-1 tokens
print(tokenizer.texts_to_sequences(raw_text)[0])
>>Story of a man who has unnatural feelings for a pig. <br> Starts out with a opening scene that is a terrific example of absurd comedy.
>>['story', 'of', 'a', 'man', 'who', 'has', 'unnatural', 'feelings', 'for', 'a', 'pig', 'br', 'starts', 'out', 'with', 'a', 'opening', 'scene', 'that', 'is', 'a', 'terrific', 'example', 'of', 'absurd', 'comedy']
>>[21, 4, 2, 12, 22, 23, 50, 51, 13, 2, 52, 53, 54, 24, 6, 2, 55, 56, 57, 7, 2, 58, 59, 4, 25, 60]
>>[21, 4, 2, 12, 22, 23, 13, 2, 24, 6, 2, 7, 2, 4, 25]
Explanation:
- We use train_sequences = tokenizer.texts_to_sequences(raw_text) to convert text to a vector of word indexes/ids. The converted sequences will be fed into the next step of the pipeline.
- When there are too many tokens, storage and computation can be expensive. We can use the num_words parameter to determine how many tokens are used to represent the text. In the example, we set num_words=50, which means we take the top 50-1=49 tokens. In other words, tokens like “unnatural: 50” and “feelings: 51” will not appear in the final tokenized sequence.
- By default, num_words=None, which means all the tokens are taken.
- Tips: you can set num_words at any time without re-fitting the tokenizer.
NOTES: There is no simple answer to what the num_words value should be. But here is my suggestion: to build the pipeline, you can start with a relatively small number, say, num_words=10,000, and come back to modify it after further analysis. (I found this Stack Overflow post shares some insightful ideas on how to choose the num_words value. Also, check the documentation of Tokenizer for other parameter settings.)
An issue: OOV
Let’s take a look at a common issue in tokenization that is very harmful to both deep learning and traditional ML, and how we can deal with it. Consider the following example: tokenizing the sequence [‘Storys of a woman…’].
test_sequence = ['Storys of a woman...']
print(test_sequence)
print(text_to_word_sequence(test_sequence[0]))
print(tokenizer.texts_to_sequences(test_sequence))
>>['Storys of a woman...']
>>['storys', 'of', 'a', 'woman']
>>[[4, 2]]
Since the corpus used for training doesn’t contain the words “storys” or “woman”, these words are not included in the token dictionary either. This is the out-of-vocabulary (OOV) issue. While OOV is hard to avoid, there are some solutions to mitigate the problem:
- A rule of thumb is to train on a relatively big corpus, so that the dictionary created covers more words and fewer words are treated as new words and cast away.
- Set the parameter oov_token= to capture the OOV phenomenon (see the sketch after this list). Note that this method only notifies you that OOV has happened somewhere; it will not solve the OOV problem. Check the Keras documentation for more details.
- Perform text preprocessing before tokenization. e.g., ‘storys’ can be spelling-corrected or singularized to ‘story’, which is included in the token dictionary. There are NLP packages that offer more robust algorithms for tokenization and preprocessing. Some good options for tokenization are spaCy and Gensim.
- Adopt (and fine-tune) a pre-trained tokenizer (e.g., from the transformers ecosystem), such as Hugging Face’s PreTrainedTokenizer.
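As a quick illustration of the oov_token option mentioned above (a minimal sketch; the token name '<OOV>' is just a common convention, not a requirement):
from tensorflow.keras.preprocessing.text import Tokenizer
# re-fit a tokenizer that maps unseen words to a dedicated OOV token
tokenizer_oov = Tokenizer(oov_token='<OOV>')
tokenizer_oov.fit_on_texts(raw_text)
# '<OOV>' receives token-id 1; unseen words like 'storys' and 'woman' now map to it instead of being dropped
print(tokenizer_oov.texts_to_sequences(['Storys of a woman...']))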
Short discussion: a tough start?
The idea of tokenization might seem very simple, but sooner or later, you will realize tokenization can be much more complicated than it seems in this example. The complexity mainly comes from the various preprocessing methods. Some common preprocessing practices are lowercasing, punctuation removal, word singularization, stemming, and lemmatization. Besides, there are optional preprocessing steps, such as text normalization (e.g., digits to text, expanding abbreviations), language identification, and code-mixing and translation; as well as advanced preprocessing, like part-of-speech tagging (a.k.a. POS tagging), parsing, and coreference resolution. Depending on which preprocessing steps are taken, the tokens can differ, and thus so can the tokenized texts.
Don’t worry if you don’t know all these confusing names above. Indeed, it is very overwhelming to determine which preprocessing method(s) to include in the NLP pipeline. For instance, it is not an easy decision which tokens to include in the text representation. Integrating a large number of token candidates is expensive in storage and computation. And it is not very clear which tokens are more important: the most frequent words like “the” and “a” are not very informative for text representation, and that’s why we need to handle stop words in preprocessing.
Though arguably, we have good news here: deep learning requires relatively less preprocessing than conventional machine learning algorithms. The reason is that deep neural networks can take advantage of their architecture for the feature extraction that conventional ML models perform in the preprocessing and feature-engineering stages. So, here we can keep the tokenization step simple and come back later if more preprocessing and/or postprocessing is desired.
Tokenization wrap-up
While most deep learning tutorials still use a list or np.array to store the data, I find it more controllable and scalable to use a DataFrame (e.g., Pandas or PySpark) to do the work. This step is optional, but I recommend it. Here is the example code.
# store in dataframe
df_text = pd.DataFrame({'raw_text': raw_text})
df_text.head()
# update df_text
df_text['train_sequence'] = df_text.raw_text.apply(lambda x: tokenizer.texts_to_sequences([x])[0])
df_text.head()
>> raw_text train_sequence
0 Story of a man who has unnatural feelings for ... [21, 4, 2, 12, 22, 23, 13, 2, 24, 6, 2, 7, 2, ...
1 A formal orchestra audience is turned into an ... [2, 26, 7, 27, 14, 9, 1, 4, 28]
2 Unfortunately it stays absurd the WHOLE time w... [15, 25, 1, 29, 6, 15, 30]
3 Even those from the era should be turned off. ... [1, 16, 17, 27, 30, 1, 5, 2]
4 On a technical level it's better than you migh... [31, 2, 28, 6, 32, 9, 33]
That’s what you need to know about tokenization. Let’s move on to the next step: padding.
Padding
Most (if not all) neural networks require the input sequence data to have the same length, and that’s why we need padding: to truncate or pad sequences (normally padded with 0s) into the same length. Here is an illustration of padding.
Let’s look at the following example code to perform padding in TensorFlow.
from tensorflow.keras.preprocessing.sequence import pad_sequences

# before padding
print(type(train_sequences))
train_sequences
>> <class 'list'>
>> [[21, 4, 2, 12, 22, 23, 13, 2, 24, 6, 2, 7, 2, 4, 25],
[2, 26, 7, 27, 14, 9, 1, 4, 28],
[15, 25, 1, 29, 6, 15, 30],
[1, 16, 17, 27, 30, 1, 5, 2],
[31, 2, 28, 6, 32, 9, 33],
...

MAX_SEQUENCE_LENGTH = 10 # length of the sequence
trainvalid_data_pre = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH,
padding='pre',
truncating='pre',)
trainvalid_data_pre
>>array([[23, 13, 2, 24, 6, 2, 7, 2, 4, 25],
[ 0, 2, 26, 7, 27, 14, 9, 1, 4, 28],
[ 0, 0, 0, 15, 25, 1, 29, 6, 15, 30],
[ 0, 0, 1, 16, 17, 27, 30, 1, 5, 2],
[ 0, 0, 0, 31, 2, 28, 6, 32, 9, 33],
...
Explanation:
- Before padding, the token-represented sequences have different lengths; after padding, they are all the same length.
- The parameter “maxlen” defines the length of the padded sequences. When a tokenized sequence is longer than “maxlen”, it is truncated; when it is shorter than “maxlen”, it is padded with “0”s.
- The positions at which to truncate and pad the sequence are determined by “truncating=” and “padding=”, respectively.
Discussion and tips
Pre or post?
By default, the pad_sequences parameters are set to padding=’pre’ and truncating=’pre’. However, according to the TensorFlow documentation, it is recommended to use ‘post’ padding when working with RNN layers. (It is also suggested that, in English, the most important information tends to appear at the beginning, so truncating or padding at the end can better preserve the original text.) Here is the example code.
MAX_SEQUENCE_LENGTH = 10
trainvalid_data_post = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH,
padding='post',
truncating='post',)
trainvalid_data_post
>>array([[21, 4, 2, 12, 22, 23, 13, 2, 24, 6],
[ 2, 26, 7, 27, 14, 9, 1, 4, 28, 0],
[15, 25, 1, 29, 6, 15, 30, 0, 0, 0],
[ 1, 16, 17, 27, 30, 1, 5, 2, 0, 0],
[31, 2, 28, 6, 32, 9, 33, 0, 0, 0],
...
About maxlen.
Another question is what the maxlen value should be. The trade-off: a larger maxlen produces sequences that keep more information but take more storage space and are more computationally expensive, while a smaller maxlen saves storage space but loses information.
- At the pipeline-building stage, we can choose the mean or median as maxlen. This works well when the lengths of the sequences do not vary too much.
- If the lengths of the sequences vary over a big range, then it is a case-by-case decision, and some trial and error is desirable. e.g., for an RNN architecture, we can choose a maxlen value towards the higher end (i.e., a large maxlen) and utilize masking (we will see masking later) to mitigate storage and computation waste. Note that padding sequences with 0s will introduce noise into the model if not handled properly, so it is not a good idea to use a very large maxlen value. If you are not sure what NN architecture to use, better to stick with the mean or median of the unpadded sequences. (A percentile-based alternative is sketched after the stats code below.)
Since we store the token sequence data in a data frame, getting sequence length stats is very straightforward. Here is the example code:
# check sequence_length stats
df_text.train_sequence.apply(lambda x: len(x))
print('sequence_length mean: ', df_text.train_sequence.apply(lambda x: len(x)).mean())
print('sequence_length median: ', df_text.train_sequence.apply(lambda x: len(x)).median())
>> sequence_length mean:  9.222222222222221
>> sequence_length median: 8.5
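If you prefer something between the median and the maximum, a percentile-based choice is another option. A minimal sketch using numpy (the 90th percentile here is an arbitrary illustrative choice):
import numpy as np
# choose maxlen so that roughly 90% of the sequences fit without truncation
seq_lengths = df_text.train_sequence.apply(len)
print('90th percentile length:', int(np.percentile(seq_lengths, 90)))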
Sequence padding should be a piece of cake. Let’s move on to the next step: preparing the word2vec word embeddings.
Word2vec word embeddings
Intuition
Word embeddings build the bridge between human understanding of language and that of a machine. They are essential for many NLP problems. And you might have heard the names “word2vec”, “GloVe”, and “FastText”.
Don’t worry if you are not familiar with word embeddings. I will give a brief introduction to word embeddings that should provide enough intuition, and then show how to apply word embeddings in TensorFlow.
First, let’s understand some key concepts:
Embedding: For the set of words in a corpus, an embedding is a mapping from a vector space based on a distributional representation to a vector space based on a distributed representation.
Vector semantics: This refers to the set of NLP methods that aim to learn the word representations based on the distributional properties of words in a large corpus.
Let’s see some solid examples using spaCy’s pre-trained embedding models.
import spacy
# if first use, download en_core_web_sm
nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")
# nlp_lg = spacy.load("en_core_web_lg")

doc = nlp_sm("elephant")
print(doc.vector.size)
doc.vector
>>96
>>array([ 1.5506991 , -1.0745661 , 1.9747349 , -1.0160941 , 0.90996253,
-0.73704714, 1.465313 , 0.806101 , -4.716807 , 3.5754416 ,
1.0041305 , -0.86196965, -1.4205945 , -0.9292773 , 2.1418033 ,
0.84281194, 1.4268254 , 2.9627366 , -0.9015219 , 2.846716 ,
1.1348789 , -0.1520077 , -0.15381837, -0.6398335 , 0.36527258,
...
Explanations:
- Use spaCy (a well-known NLP package) to embed the word “elephant” into a 96-dimensional vector.
- Depending on which model is loaded, the vectors will have different dimensionality. (e.g., the dimensions of “en_core_web_sm”, “en_core_web_md”, and “en_core_web_lg” are 96, 300, and 300, respectively.)
Now the word “elephant” has been represented by a vector, so what? Don’t look away. Some magic is about to happen.🧙🏼♂️
Since we can represent words using vectors, we can calculate the similarity (or distance) between words. Consider the following code.
# demo1
word1 = "elephant"; word2 = "big"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )
word1 = "mouse"; word2 = "big"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )
word1 = "mouse"; word2 = "small"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )>>similariy elephant-big: 0.3589780131997766
>>similariy mouse-big: 0.17815787869074504
>>similariy mouse-small: 0.32656001719452826# demo2
word1 = "elephant"; word2 = "rock"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )
word1 = "mouse"; word2 = "elephant"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )
word1 = "mouse"; word2 = "rock"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )
word1 = "mouse"; word2 = "pebble"
print("similariy {}-{}: {}".format(word1, word2, nlp_md(word1).similarity(nlp_md(word2))) )>>similariy elephant-rock: 0.23465476998562218
>>similariy mouse-elephant: 0.3079661539409069
>>similariy mouse-rock: 0.11835070985447328
>>similariy mouse-pebble: 0.18301520085660278
Comments:
- In demo1: “elephant” is more similar to “big” than “mouse” is to “big”, while “mouse” is more similar to “small” than “elephant” is to “small”. This matches our common sense about the usual sizes of an elephant and a mouse.
- In demo2: “elephant” is less similar to “rock” than it is to “mouse”; similarly, “mouse” is less similar to “rock” than it is to “elephant”. This can probably be explained by the fact that both “elephant” and “mouse” are animals, while a “rock” is not alive.
- The vectors in demo2 capture not only the concept of liveness but also the concept of size: the word “rock” normally describes an object whose size is closer to an elephant’s than to a mouse’s, so “rock” is more similar to “elephant” than to “mouse”. Similarly, “pebble” usually describes something smaller than a “rock”; thus the similarity between “pebble” and “mouse” is greater than that between “rock” and “mouse”.
- Note that the similarity between words might not always match the one in your head. One reason is that similarity is just a metric (i.e., a scalar) indicating the relationship between two vectors; much information is lost when the high-dimensional vectors are collapsed into a scalar (the sketch below shows what .similarity() actually computes). Also, one word can have several meanings. e.g., the word “bank” can be related either to finance or to rivers; without context, it is hard to say which kind of bank we are talking about. After all, language is a concept open to interpretation.
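Under the hood, spaCy’s .similarity() is (by default) simply the cosine similarity between the two vectors. A minimal sketch reproducing it with numpy, assuming nlp_md is loaded as above:
import numpy as np
# cosine similarity between two word vectors, reproducing .similarity()
v1 = nlp_md("elephant").vector
v2 = nlp_md("big").vector
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # ~0.359, matching the value above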
Don’t fall in the rabbit hole
Word2vec is very powerful, and it is a fairly recent concept (word2vec was created and published in 2013). There is so much more to talk about, things like:
- You may wonder how the values are assigned in the vectors. What is Skip-gram? What is CBOW?
- There are other word embedding models, like “GloVe”, and “FastText”. What is the difference? Which one(s) should we use?
Word embedding is a very exciting topic, but don’t get stuck here. For readers who are new to word embeddings, the most important thing is to understand
- What word embeddings do: convert words to vectors.
- Why we need these embedding vectors: so that a machine can do amazing things; calculating the similarity between words is one of them, but there is definitely more.
- OOV is still a problem for word embeddings. Consider the following code:
print(nlp_md("elephan")[0].has_vector)
print(nlp_md("elephan")[0].is_oov)
nlp_md("elephan").vector>>False
>>True
>>array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
....
Since the word “elephan” does not exist in the spaCy “en_core_web_md” model we loaded earlier, spaCy returns a zero-vector. Again, treating OOV is not a trivial task. But we can use either .has_vector or .is_oov to detect the OOV phenomenon.
Hopefully, you have a pretty good understanding of word embedding now. Let’s come back to the main track and see how we can apply word embeddings in the pipeline.
Adopt a pre-trained word embeddings model
Pre-trained word embeddings are embeddings learned in one task that are reused for solving another, similar task. Using a pre-trained word embedding model saves us the trouble of training one from scratch. Also, the fact that the pre-trained embedding vectors are generated from a large dataset usually leads to better generalization ability.
Applying a pre-trained word embedding model is a bit like looking words up in a dictionary, and we have seen such a process earlier using spaCy (e.g., input the word “elephant” and spaCy returns an embedding vector). At the end of this step, we will create an “embedding matrix” with an embedding vector associated with each token. (The embedding matrix is what TensorFlow will use to connect a token sequence with the word embedding representation.)
Here is the code.
# import numpy as np
# import pandas as pd
# nlp_sm = spacy.load("en_core_web_sm")
# MAX_NUM_TOKENS = 50
df_index_word = pd.Series(tokenizer.index_word)
# df_index_word
df_index_word_valid = df_index_word[:MAX_NUM_TOKENS-1]
df_index_word_valid = pd.Series(["place_holder"]).append(df_index_word_valid)
df_index_word_valid = df_index_word_valid.reset_index()
# df_index_word_valid.head()
df_index_word_valid.columns = ['token_id', 'token']
# df_index_word_valid.head()
df_index_word_valid['word2vec'] = df_index_word_valid.token.apply(lambda x: nlp_sm(x).vector)
df_index_word_valid['is_oov'] = df_index_word_valid.token.apply(lambda x: nlp_sm(x)[0].is_oov)
df_index_word_valid.at[0, "word2vec"] = np.zeros_like(df_index_word_valid.at[0, "word2vec"])
print(df_index_word_valid.head())
>>
token_id token word2vec is_oov
0 0 NAN [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... True
1 1 the [-1.3546131, -1.7212939, 1.7044731, -2.8054314... True
2 2 a [-1.9769197, -1.5778058, 0.116705105, -2.46210... True
3 3 and [-2.8375597, 0.8632377, -0.69991976, -0.508865... True
4 4 of [-2.7217283, -2.1163979, -0.88265955, -0.72048... True

# Embedding matrix
embedding_matrix = np.array([vec for vec in df_index_word_valid.word2vec.values])
embedding_matrix[1:3]
print(embedding_matrix.shape)
>>(50, 96)
Explanation:
- We first used spaCy to find the embedding vector associated with each token (stored in a data frame). With some data wrangling, we created an embedding matrix (following the TensorFlow convention, stored in a np.array this time).
- Rows of the embedding matrix: the total number of rows is 50; the first row holds a zero-vector representing empty tokens, and the remaining 50-1=49 rows hold the tokens chosen in the tokenization step. (You will see why we put a zero-vector in the first row in the next section, when we set up the Embedding layer.)
- Columns of the embedding matrix: The word2vec dimensionality is 96 (when using “en_core_web_sm”), so the number of columns is 96.
Here we have the embedding matrix (i.e., a 2-d array) with the shape (50, 96). This embedding matrix will be fed into the TensorFlow Embedding layer in the last step of this NLP model preparation pipeline.
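Before moving on, a quick sanity check can catch wiring mistakes. A minimal sketch, assuming the tokenizer, nlp_sm, and embedding_matrix built above: each non-zero row of the embedding matrix should match the spaCy vector of the token that owns that token-id.
import numpy as np
# row i of the embedding matrix should hold the vector of the token with token-id i
token_id = tokenizer.word_index['the']   # token-id 1
print(np.allclose(embedding_matrix[token_id], nlp_sm('the').vector))  # expect True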
NOTES: You might notice that all the is_oov values are True, yet you still get non-zero embedding vectors. This happens when using the spaCy “en_core_web_sm” model.
Tips: how to treat OOV in word embeddings
Unlike “en_core_web_md”, which returns a zero-vector when a token is not in the embedding model, “en_core_web_sm” always returns some non-zero vector. However, according to the spaCy documentation, the vectors returned by “en_core_web_sm” are not “as precise as” those of larger models like “en_core_web_md” or “en_core_web_lg”.
Depending on the application, it is your decision whether to choose the “not-very-precise” embedding model that always gives non-zero vectors, or a model that returns “more precise” vectors but sometimes zero-vectors for OOV words.
In the demo, I’ve chosen the “en_core_web_sm” model, which always gives me some non-zero embedding vectors. One such strategy is to use vectors learned for subword fragments during training, similar to how people can often work out the gist of a word from familiar word roots. Some people call this strategy “better something-not-precise than nothing-at-all”. (Though I am not sure exactly how spaCy assigns non-zero values to OOVs.)
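Whichever model you lean towards, it can help to measure how much OOV you would actually face before building the embedding matrix. A minimal sketch, assuming the fitted tokenizer and the nlp_md model loaded above:
# count how many tokens in the token dictionary are OOV for a given embedding model
oov_tokens = [t for t in tokenizer.word_index if nlp_md(t)[0].is_oov]
print(len(oov_tokens), 'OOV tokens out of', len(tokenizer.word_index))
print(oov_tokens[:10])  # peek at a few of them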
Finally, Embedding layer setups
So far, we have the padded token sequences to represent the original text data. Also, we have created an embedding matrix, with each row associated with a token. Now it is time to set up the TensorFlow Embedding layer.
The Embedding layer mechanism is summarized in the following illustration.
Explanation:
The Embedding layer builds the bridge between the token sequences (as inputs) and the word embedding representation (as output) through an embedding matrix (as weights).
- Input of an Embedding layer: the padded sequences are fed in as input to the Embedding layer; each position of a padded sequence holds a token-id.
- Weights of an Embedding layer: by looking up the embedding matrix, the Embedding layer finds the word2vec representation of the word (token) associated with each token-id. Note that padded sequences use 0 to indicate empty tokens, which should result in zero embedding vectors; that’s why we saved the first row of the embedding matrix for the empty tokens.
- Output of an Embedding layer: after going through the input padded sequences, the Embedding layer “replaces” each token-id with its representative vector (word2vec) and outputs the embedded sequences.
Notes: the key to modern NLP feature extraction: if everything works, the output of the Embedding layer should represent the original text well, with all the features stored in the word embedding weights; this is the key idea of modern NLP feature extraction. You will see very soon that we can fine-tune these weights by setting trainable=True for the Embedding layer.
Also note that, in this example, we explicitly set the empty token’s word2vec to a zero-vector for demonstration purposes. Strictly speaking, when the Embedding layer sees the 0-token-id, it still looks up row 0 of the embedding matrix; keeping that row as a zero-vector (together with mask_zero=True, which we set below) ensures the padded positions do not inject meaningful values into the model.
In TensorFlow
The following example code shows how the Embedding layer is set up in TensorFlow.
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

# MAX_NUM_TOKENS = 50
EMBEDDING_DIM = embedding_matrix.shape[1]
# MAX_SEQUENCE_LENGTH = 10
embedding_layer = Embedding(input_dim=MAX_NUM_TOKENS,
output_dim=EMBEDDING_DIM,
embeddings_initializer=Constant(embedding_matrix),
input_length=MAX_SEQUENCE_LENGTH,
mask_zero=True,
trainable=False)
Explanation:
- The dimensionality-related parameters are “input_dim”, “output_dim”, and “input_length”. You should have a good intuition of how to set these parameters by referring to the illustration.
- When using a pre-trained word embedding model, we need to use tensorflow.keras.initializers.Constant to feed the embedding matrix into the Embedding layer. Otherwise, the weights of the Embedding layer will be initialized with random numbers, which is referred to as “training word embeddings from scratch”.
- trainable= is set to False in this example, so that the word2vec weights will not change during neural network training. This helps to prevent overfitting, especially when training on a relatively small dataset. But if you want to fine-tune the weights, you know what to do (set trainable=True).
- mask_zero= is another argument you should pay attention to. Masking is a way to tell sequence-processing layers that certain positions in an input are missing and thus should be skipped when processing the data. By setting mask_zero=True, we not only speed up training but also get a better representation of the original text.
We can check the output of the Embedding layer using a test case. The output Tensor of the Embedding layer should be in the shape [num_sequence, padded_sequence_length, embedding_vector_dim].
# output
embedding_output = embedding_layer(trainvalid_data_post)
# result = embedding_layer(inputs=trainvalid_data_post[0])
embedding_output.shape
>>TensorShape([18, 10, 96])

# check if tokens and embedding vectors match
print(trainvalid_data_post[1])
embedding_output[1]
>>[21 4 2 12 22 23 13 2 24 6]
>><tf.Tensor: shape=(10, 96), dtype=float32, numpy=
array([[-1.97691965e+00, -1.57780576e+00, 1.16705105e-01,
-2.46210432e+00, 1.27643692e+00, 4.56989884e-01,
...
[ 2.83787537e+00, 1.16508913e+00, 1.27923262e+00,
-1.44432998e+00, -7.07145482e-02, -1.63411784e+00,
...
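Because we set mask_zero=True, the layer also produces a boolean mask that downstream layers use to skip the padded positions. A minimal sketch of inspecting it (compute_mask is a standard Keras layer method; the shapes in the comments follow the outputs above):
# positions holding the 0-token-id are marked False in the mask
mask = embedding_layer.compute_mask(trainvalid_data_post)
print(mask.shape)  # (18, 10)
print(mask[1])     # False at the trailing padded positions of the 2nd sequence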
And that’s it. You are ready to train your text data. (You can refer to the notebook to see training using RNN and CNN.)
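To see how the pieces fit together, here is a minimal sketch of plugging embedding_layer into a small RNN classifier. The LSTM size, the random placeholder labels, and the training settings below are illustrative assumptions, not taken from the notebook:
import numpy as np
import tensorflow as tf

# a tiny binary classifier on top of the (masked) embedded sequences
model = tf.keras.Sequential([
    embedding_layer,                          # (batch, 10) -> (batch, 10, 96)
    tf.keras.layers.LSTM(32),                 # the mask lets the LSTM skip padded steps
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

toy_labels = np.random.randint(0, 2, size=(trainvalid_data_post.shape[0],))  # placeholder labels
model.fit(trainvalid_data_post, toy_labels, epochs=2, batch_size=4)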
Summary
We have come a long way to prepare data for NLP deep learning. Use the following checklist to test your understanding:
Tokenization: train on a corpus to create a token dictionary and represent the original text with tokens (or token-ids) by referring to the token dictionary created. In TensorFlow, we can use Tokenizer for tokenization.
- Preprocessing is often required in the tokenization process. While using TensorFlow’s Tokenizer with its default settings helps to start the pipeline, it is almost always recommended to perform advanced preprocessing and/or postprocessing during tokenization.
- Out-of-vocabulary (OOV) is a common issue in tokenization. Potential solutions include training on a larger corpus or using a pre-trained tokenizer.
- In the TensorFlow convention, the 0 token-id is reserved for empty tokens, while other NLP packages might assign real tokens to the 0 id. Watch out for such conflicts and adjust the token-id assignments if needed.
Padding: pad or truncate sequences to the same length, i.e., the padded sequences have the same number of tokens (including empty-tokens). In TensorFlow, we can use pad_sequences for padding.
- It is recommended to pad and truncate sequences at the end (set to “post”) for RNN architectures.
- The padded sequence length can be set to be the mean or median of the sequences before padding (or truncating).
Word embeddings: the tokens can be mapped to vectors by referring to an embedding model, e.g., word2vec. The embedding vectors possess information that both humans and a machine can understand. We can use spaCy “en_core_web_sm”, “en_core_web_md”, or “en_core_web_lg” for word embeddings.
- It is a good start to use a pre-trained word embedding model. There is no need to find the “perfect” pre-trained model; just take one to begin with. Since TensorFlow doesn’t have a word embeddings API yet, choose a package that can be applied easily in the deep learning pipeline. At this stage, it is more important to build the pipeline than to achieve better performance.
- OOV is also an issue for word embeddings when using pre-trained models. A potential way to treat OOV is to use vectors learned for subword fragments during training. If available, such “guesses” usually give better results than using zero-vectors for OOVs, which bring noise into the model.
Embedding layer in TensorFlow: to take advantage of pre-trained word embeddings, the inputs of an Embedding layer in TensorFlow include the padded sequences represented by token-ids and an embedding matrix storing the embedding vectors associated with the tokens in those padded sequences. The output is a 3-d tensor with the shape [num_sequence, padded_sequence_length, embedding_vector_dim].
- There are many parameter settings for the Embedding layer. Use a toy dataset to make sure the Embedding layer’s behavior matches your understanding. Special attention should be given to the shapes of the input and output tensors.
- We can fine-tune the embedding matrix by setting trainable=True.
- Setting mask_zero=True speeds up training and also gives a better representation of the original text, especially when using RNN-type architectures. e.g., the machine will skip the zero-padded positions and keep the associated weights at 0 no matter what, even with trainable=True.
If you haven’t checked the notebook, here is the link:
I hope you like this post. See you next time.
References:
- Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O’Reilly Media (2020)
- Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python. Manning Publications (2019)
- Deep Learning with Python. Manning Publications (2018)