Encoding Text for NLP Tasks


Encoding texts is one of the most important steps in all natural language applications. For starters, it is the process of converting natural language texts into numeric values that computers can understand. This is especially critical in deep learning since neural networks operate on numbers, vectors, tensors, and not characters, words, and sentences.

Consider the following question: What is the best way to encode the word “go”? This is not easy to answer, but we can start by thinking about the following factors.

  • Verb tenses: “go”, “going”, “gone”, “went” should have similar representations in our encoding
  • Semantics: “travel”, “move”, “proceed” should have similar representations; “stop”, “stay”, “halt” should be far away
  • Context: “Go” can be referring to an action, a programming language, or a board game depending on the context, and they should be represented differently

Encoding text is not an easy problem, and there are many possible ways to do it. Using an effective representation to properly capture the syntax, semantics, and contexts of words can be extremely important for all downstream natural language tasks.

For ease of visualization, text is often shown directly as the input and output of computational models, but there is always an intermediate step that encodes the text into a more useful numeric representation.

In this post, I will share some of the most common word-level text encoding methods. Before we dive into the details, here are some of the frequently mentioned NLP terminologies and concepts used in this post.

  • Corpus: the entire dataset; usually a collection of documents
  • Document: an article such as a journal paper, a story, a web page, or simply a collection of paragraphs
  • Sentence: a single sentence; one or more sentences form a paragraph in a document
  • Type and token: A type is a distinct word in a corpus, whereas each word is always a token. For example, the sentence “A factory is a factory of a factory.” contains four types and eight tokens. The types are the unique words, i.e. “a”, “factory”, “is”, and “of”.

There can be minor differences in terminology from implementation to implementation, especially regarding the definition of types and tokens. For example, types are sometimes case-insensitive, and contractions, possessives, and hyphenations are handled differently depending on the tokenizer. For simplicity, let us assume for the moment that our tokenizer removes punctuation and then splits a sentence into tokens on whitespace, as in the sketch below.
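As a rough illustration of that assumption, such a tokenizer can be written in a few lines of Python (the accompanying source code may well use a library tokenizer instead; this simplified version is only for illustration):

import re

# Minimal tokenizer matching the assumption above:
# lowercase, strip punctuation, then split on whitespace.
def simple_tokenize(text):
    text = re.sub(r"[^\w\s]", "", text.lower())
    return text.split()

print(simple_tokenize("A factory is a factory of a factory."))
# ['a', 'factory', 'is', 'a', 'factory', 'of', 'a', 'factory']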

Text data hierarchy

Index-Based Encoding

The simplest approach to encode text is index-based encoding. The idea is to map each type with an index so that we can uniquely identify and represent all tokens in the corpus by using the respective index. Let’s assume our corpus has three documents, and each document only has one sentence:

corpus = ["This is our first Medium article, it is about NLP and we love it!",
          "I would like to share how to encode text and how to apply it",
          "It is said that Vancouver Woman in Data Science has interesting work!"]

In this example, after a little preprocessing (e.g., lowercasing and punctuation removal), there are 31 types (unique tokens). They are numbered based on their hash values, as shown below.

{'about': 10, 'and': 18, 'apply': 2, 'article': 27, 'data': 16, 'encode': 4, 'first': 28, 'has': 17, 'how': 29, 'i': 19, 'in': 30, 'interesting': 25, 'is': 26, 'it': 22, 'like': 6, 'love': 13, 'medium': 15, 'nlp': 23, 'our': 12, 'said': 31, 'science': 1, 'share': 9, 'text': 11, 'that': 8, 'this': 5, 'to': 3, 'vancouver': 24, 'we': 14, 'woman': 21, 'work': 20, 'would': 7}

The corpus is then encoded as a list of lists of indices:

[[5, 26, 12, 28, 15, 27, 22, 26, 10, 23, 18, 14, 13, 22],
 [19, 7, 6, 3, 9, 29, 3, 4, 11, 18, 29, 3, 2, 22],
 [22, 26, 31, 8, 24, 21, 30, 16, 1, 17, 25, 20]]

We have transformed our documents into lists of numbers, but there is still a vital step in the encoding. Most machine learning models require fixed-length input vectors, so we need to convert our jagged array into a rectangular one. We first find the longest document in the corpus, then pad every shorter document by appending 0s to the end of its sequence. In the example above, the first and second documents each have 14 tokens, so we pad document 3 with two additional zeros to make its representation a length-14 array. Our final encoded corpus becomes:

[[5, 26, 12, 28, 15, 27, 22, 26, 10, 23, 18, 14, 13, 22],
 [19, 7, 6, 3, 9, 29, 3, 4, 11, 18, 29, 3, 2, 22],
 [22, 26, 31, 8, 24, 21, 30, 16, 1, 17, 25, 20, 0, 0]]

Padding at the end of the sequence is called “post-sequence padding”; alternatively, we could have padded at the beginning. Besides padding shorter sequences, we sometimes need to truncate long sequences down to a predefined length so that the model input fits in memory. In practice, padding and truncation are often used in combination.
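Here is a minimal Python sketch of index-based encoding with post-sequence padding, reusing simple_tokenize and the corpus list from above. Note that it assigns indices by order of first appearance rather than by hash value, so the exact numbers will differ from the example:

# Index-based encoding with post-sequence padding (index 0 reserved for padding).
def index_encode(docs):
    vocab, encoded = {}, []
    for doc in docs:
        ids = []
        for tok in simple_tokenize(doc):
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1   # new index for each unseen type
            ids.append(vocab[tok])
        encoded.append(ids)
    max_len = max(len(ids) for ids in encoded)
    # Pad every shorter document with trailing zeros up to the longest length.
    return [ids + [0] * (max_len - len(ids)) for ids in encoded], vocab

padded, vocab = index_encode(corpus)
print(padded[2])   # document 3, padded with two trailing zeros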

Index-based encoding is easy to understand. However, arbitrarily assigning numeric values to tokens can have unintended consequences: the numerical distances between indices carry no meaning, yet machine learning models may treat them as if they did. For instance, semantically, would you say “science” (1) is closer to “medium” (15) than to “interesting” (25)?

One-Hot Encoding

The one-hot encoding uses 0s and 1s to indicate whether individual tokens appear in a sentence or not. As before, the first step to create a one-hot encoding is to number the unique tokens in the corpus. Then, instead of constructing an index sequence for each document, as in index-based encoding, we would create a sequence of 0s and 1s to mark the presence of each unique token.

Using this approach, we can compare two documents by computing their similarity in the vector space. Notice that the vectors representing each document are of equal length by construction, so the padding and truncation steps are not necessary. However, one disadvantage of one-hot encoding is that the representation is sparse, meaning the vectors contain a large percentage of zeros. Sparse representations are generally inefficient, especially in terms of storage. Furthermore, since the size of a one-hot encoding grows proportionally to the number of unique tokens in the corpus, it can quickly get out of hand when the corpus is large. For efficiently handling sparse vectors and matrices, here is a good read I recommend. Last but not least, one-hot encoding is also widely used to encode categorical variables.
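As a rough sketch, a one-hot document representation can be built with scikit-learn’s CountVectorizer by setting binary=True (get_feature_names_out assumes scikit-learn 1.0 or later; this is just one convenient tool, not necessarily the one used in the accompanying source code):

from sklearn.feature_extraction.text import CountVectorizer

# binary=True marks only the presence or absence of each type.
# Note: the default tokenizer drops single-character tokens such as "i",
# so the vocabulary may differ slightly from the hand-built one above.
onehot = CountVectorizer(binary=True)
X = onehot.fit_transform(corpus)

print(onehot.get_feature_names_out())   # the vocabulary (types)
print(X.toarray()[0])                   # 0/1 vector for document 1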

Bag of Words (BOW)

Bag of words is a common technique to encode text. Instead of using binary values to indicate the presence of a token in a document, it uses an integer to mark the number of times a token appears in a sentence. Overall, the construction process is similar to that of the one-hot encoding.

The BOW vector for the first example document would then contain, for each type in the vocabulary, the number of times it appears in the document (for instance, 2 for “is” and 2 for “it”); a sketch of how to compute it follows the next paragraph.

As the name suggests, this method treats each sentence as a bag of words. It considers the number of times each word appears in a sentence but does not consider their order, which means we lose the context of the words. Despite this, it remains a practical technique in natural language processing due to its simplicity and effectiveness.
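A minimal sketch using scikit-learn’s CountVectorizer, whose default behaviour is exactly this count-based encoding (again just one possible tool):

from sklearn.feature_extraction.text import CountVectorizer

# Without binary=True, CountVectorizer counts occurrences per document.
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(counts.toarray()[0])   # "is" and "it" both appear twice in document 1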

Term Frequency — Inverse Document Frequency (TF-IDF)

Stop words are tokens that are common in every sentence or document; they generally do not carry too much meaning and can add noise to the corpus when performing text analysis. Some examples of common stop words are and, the, and a. In some domain-specific settings, we can also treat some terminologies as stop words. For example, “interest” and “bank” in financial news, “review” and “director” in movie reviews, etc. Since stop words do not bring uniqueness to a document, using them as a feature, as in one-hot encoding and bag of words, can reduce the effectiveness of the representation. To address this, we can use TF-IDF which penalizes words that are prominent in all documents.

TF-IDF is short for term frequency–inverse document frequency. In TF-IDF, each term is weighted by how frequent it is in a given document relative to how common it is across the whole corpus. It can be considered an extension of the bag-of-words encoding.

There are two components in TF-IDF. Term Frequency (TF) measures how common a term (token) is within a given document. There are a variety of ways to define term frequency, such as Boolean “frequencies”, augmented frequency, etc. The most common definition is the number of occurrences of a term in a given document divided by the total number of terms in that document. Let t denote a term and d a document; the term frequency tf_{t,d} is then defined as:

tf_{t,d} = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

where the numerator f_{t,d} is the number of times term t appears in document d, and the denominator is the total number of terms in d.

Inverse Document Frequency (IDF) measures how rare a term is across the corpus: it is the logarithm of the total number of documents divided by the number of documents that contain the given term, so it is very low for stop words. Let N denote the total number of documents in the corpus D; the inverse document frequency is then defined as:

idf_{t,D} = \log \frac{N}{|\{d \in D : t \in d\}|}

where the denominator inside the logarithm is the number of documents in which the term t appears.

A term/token t in document d can then be represented as the product of TF and IDF:

tfidf_{t,d,D} = tf_{t,d} \times idf_{t,D}

For instance, consider document 1 of our sample corpus:

corpus = ["This is our first Medium article, it is about NLP and we love it!",
          "I would like to share how to encode text and how to apply it",
          "It is said that Vancouver Woman in Data Science has interesting work!"]

The TF-IDF vector for the first document is obtained by multiplying each term’s TF in that document by its IDF across the corpus.

Note: some text representations have been cut off; please check the source code for the full representations.
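As a rough sketch, the TF-IDF vectors for the sample corpus can be computed with scikit-learn’s TfidfVectorizer. Its defaults apply IDF smoothing and L2 normalization, so the values will differ slightly from the plain formula above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings: smoothed IDF and L2-normalized document vectors.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(weights.toarray()[0])   # TF-IDF vector for document 1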

TF-IDF has been widely used in text mining and information retrieval. Though TF-IDF considers the token frequency, it does not consider the order, the co-occurrence, or the semantics of words in documents.

Distributed Word Representation

The methods above treat words as atomic units. They are easy to understand and, on small datasets, can outperform more complex methods. However, many NLP tasks, such as machine translation and speech recognition, often require working with large corpora in a variety of languages.

One of the most successful breakthroughs in NLP is the distributional representation of words. Harris [1] proposed a distributional hypothesis: words occurring in similar contexts tend to have a similar meaning, e.g., the vector space representation of the word “mother” in English is very close to that in Japanese after training. By aligning the word embedding spaces of two different languages, we can infer the meaning of some words without learning the vocabulary. In addition, people also found that the meaning of words can sometimes be interpreted geometrically[2]: vector(”King”) — vector(”Man”) + vector(”Woman”) results in a vector that is closest to the vector representation of the word Queen. This was exploited in Word2vec [3, 4], GloVe [5], and FastText [6].

Geometric Relationship of Word Embedding

Word2Vec

Word2Vec is a framework for learning high-quality word vectors from millions of words. The idea is: given a large corpus, every word in a fixed vocabulary is represented by a vector of numbers. The algorithm goes through each position t in the text, which has a center word c and context words o, and we use the similarity of the word vectors for c and o to calculate the probability of o given c or the other way around. The algorithm keeps adjusting the word vectors to maximize this probability.

The model is trained on a large corpus using a neural network to learn semantic and syntactic word similarities. Architecture-wise, there are two ways to train Word2Vec: CBOW (continuous bag-of-words) and skip-gram. The CBOW model predicts the target word from its surrounding context words, while the skip-gram model predicts the words around a given word, with more weight given to nearby context words than to more distant ones. According to the authors, training the CBOW model is faster, while the skip-gram model does a better job for infrequent words [3, 4].

For all positions t = 1, …, T, Word2Vec tries to maximize the likelihood of the context words within a window of fixed size m given the center word w_t [7]:

L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t; \theta)

and the objective function is the average negative log-likelihood [7]:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t; \theta)

Word2Vec is trained using a softmax as the output layer. Let v_c and u_o denote the word vectors of the center word c and a context word o; the model gives the probability of o given c by taking the dot product of the two vectors and normalizing over the entire vocabulary V [8]:

P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}

This softmax makes the model expensive to train, since the normalization runs over the entire vocabulary. There are two common approaches to approximate the conditional log-likelihood in the equation above: hierarchical softmax and negative sampling. Hierarchical softmax uses a Huffman tree to reduce the computation [8]. Negative samples are randomly selected words from the corpus that are not in the target word’s context window; negative sampling makes training Word2Vec much faster.

Word2Vec can be trained in an online fashion with relatively little preprocessing. However, it does not take the internal structure of words into account and cannot deal with out-of-vocabulary (OOV) words.
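Here is a minimal training sketch using the gensim library (the 4.x API is assumed; this is not necessarily what the accompanying source code uses). Our three-sentence corpus is far too small to learn meaningful vectors, so this only illustrates the workflow:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Tokenize the sample corpus and train a tiny skip-gram model with negative sampling.
sentences = [simple_preprocess(doc) for doc in corpus]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50)

print(model.wv["nlp"])                      # 50-dimensional word vector
print(model.wv.most_similar("nlp", topn=3))

Setting sg=1 selects the skip-gram architecture; sg=0 would train CBOW instead.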

GloVe

GloVe is similar to Word2Vec in the sense that both capture semantic similarity between words. However, GloVe is trained in a way that leverages the co-occurrence of words over the entire corpus, instead of only the co-occurrence within a local context window as Word2Vec does.

GloVe is based on matrix factorization. It first builds a large co-occurrence matrix over the entire corpus: for each type (row), it counts how frequently tokens of that type appear in each context (column).

The co-occurrence matrix is then decomposed into two latent factors: a word × latent_factor matrix and a latent_factor × context matrix. Each row of the word × latent_factor matrix is the vector representation of a word. Multiplying the two factors back together reconstructs an approximation of the co-occurrence matrix. Since the decomposition is not exact, matrix factorization looks for the two latent factors that yield the minimal reconstruction loss (for example, the mean squared error between the ground-truth counts and the predicted co-occurrence counts). I encourage you to read more about matrix factorization if you want to get into the details.
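The following toy sketch illustrates the count-then-factorize idea with truncated SVD, reusing simple_tokenize and the corpus list from earlier sketches. It is not GloVe itself: the real model minimizes a weighted least-squares loss on the logarithm of the co-occurrence counts rather than a plain reconstruction error:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Build a symmetric co-occurrence matrix with a small context window.
window = 2
types = sorted({tok for doc in corpus for tok in simple_tokenize(doc)})
index = {tok: i for i, tok in enumerate(types)}
cooc = np.zeros((len(types), len(types)))

for doc in corpus:
    toks = simple_tokenize(doc)
    for i, tok in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                cooc[index[tok], index[toks[j]]] += 1

# Factorize the matrix; each row of W is a low-dimensional word vector.
svd = TruncatedSVD(n_components=10, random_state=0)
W = svd.fit_transform(cooc)
print(W[index["nlp"]])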

FastText

FastText was developed by a group of researchers at FAIR. The primary difference between FastText and Word2Vec or GloVe is that FastText trains on character n-grams of each word, and the final vector of a word is the sum of its n-gram vectors. With n = 3, the word “awesome” is the sum of the n-grams <aw, awe, wes, eso, som, ome, me>, where the symbols ‘<’ and ‘>’ denote word boundaries.

Using n-gram features makes word representations robust to out-of-vocabulary situations such as misspellings, slang, and abbreviations. For example, when we see an OOV word such as “awme”, its representation is the vector sum of the subwords <aw, awm, wme, me>. The resulting embedding will still be closer to “awesome” than to unrelated words like “anthropology”.
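A minimal sketch of the subword extraction (the real FastText combines n-grams of several lengths, typically 3 to 6, together with the full word itself):

def char_ngrams(word, n=3):
    # Add boundary symbols, then slide an n-character window across the word.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("awesome"))  # ['<aw', 'awe', 'wes', 'eso', 'som', 'ome', 'me>']
print(char_ngrams("awme"))     # ['<aw', 'awm', 'wme', 'me>']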

Distributed word representations were the state-of-the-art way to learn word vectors for many years. In addition, this framework offers a flexible trade-off between training time and accuracy through the dimensionality of the word vectors and the amount of training data [4].

Pre-trained Language Model

Pre-trained language models are trained with self-supervised learning on large text corpora, such as millions of Wikipedia articles. The enormous amount of data allows a language model to learn both lower-level features (parts of speech, tense, etc.) and higher-level features (syntax, semantics, etc.). Traditionally, researchers trained language models to predict the next word in a sentence, as in a sentence-completion task. Starting in 2018, researchers at Google developed the masked language model, which predicts masked-out words anywhere in the sentence. Language models have been shown to perform exceedingly well on many downstream tasks when fine-tuned on smaller task-specific datasets; some examples are SQuAD for question answering, GLUE for sequence classification, and the European Parliament Proceedings Parallel Corpus for machine translation.

Fine-tuning a classifier with a pre-trained language model

A few of the most popular pre-trained language models are ELMo, BERT, GPT, XLNet, and T5. Among them, the most popular is BERT, which stands for Bidirectional Encoder Representations from Transformers [9]. Unlike Word2Vec, BERT also uses subword-level embeddings as part of its architecture, but that is not its most distinctive feature. More importantly, BERT encodes entire sentences or documents into contextual vector representations, which is extremely helpful for distinguishing the meanings of polysemous words. For instance, Word2Vec produces a single representation for the word “tangerine”, while BERT produces different representations depending on whether it refers to a fruit or a financial institution.
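As a rough sketch with the Hugging Face transformers library, we can compare the contextual vectors BERT produces for the same word in two different sentences. I use “bank” here instead of “tangerine” because it is guaranteed to be a single token in the bert-base-uncased vocabulary; the polysemy point is the same:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    # Contextual vector of `word`, assumed to be a single WordPiece token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)
    position = inputs.input_ids[0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = word_vector("i sat on the river bank.", "bank")
v2 = word_vector("i deposited money at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))   # noticeably below 1.0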

The convenience of fine-tuning large pre-trained language models has facilitated the adoption of BERT in a wide variety of applications. However, enormous language models are not without their drawbacks. For example, overfitting during fine-tuning can be an issue, especially when task-specific data is limited. In addition, training a competitive language model is computationally expensive and environmentally costly.

Energy and Policy Considerations for Deep Learning in NLP. Strubell et al. ACL 2019

Strubell et al. compared the carbon emission of training NLP models to that of the average American lifestyle [10]. Although language models are incredible technologies, we must also be aware of their negative impacts on the environment. We have only scratched the surface of BERT. Look out for our next post where we discuss it in more detail, until next time!

Special thanks to Ben Ling for the insightful comments on post structure and academic content! Many thanks to Duong Vu for her careful comments on the final draft and continuous support for DataCan! This post is contributed by Celine Liu. The sample code can be found here.

This article was originally published on @DataCanOrg. Stay connected with DataCan & Woman in Data Science Vancouver!

References

[1] Harris, Z. (1954). Distributional structure. Word, 10(23): 146–162.

[2] Mikolov, Tomas, Yih, Wen-tau, Zweig, Geoffrey (2013). “Linguistic Regularities in Continuous Space Word Representations”. HLT-Naacl: 746–751.

[3] Mikolov, Tomas, et al. (2013). “Efficient Estimation of Word Representations in Vector Space”. arXiv:1301.3781 [cs.CL].

[4] Mikolov, Tomas (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. arXiv:1310.4546.

[5] Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014). GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/pubs/glove.pdf

[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, (2016). Enriching Word Vectors with Subword Information, https://arxiv.org/abs/1607.04606

[7] Christopher Manning, 2021. Stanford University, CS224n: Natural Language Processing with Deep Learning.

[8] Wikipedia. Word2vec. https://en.wikipedia.org/wiki/Word2vec

[9] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019. https://arxiv.org/abs/1810.04805

[10] Emma Strubell, Ananya Ganesh, Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. https://arxiv.org/abs/1906.02243
