How do computers understand human languages?

Keremaydin
6 min read · Mar 23, 2023


How to represent text for Natural Language Processing

Natural language processing (NLP) is the link between human languages and computer languages. Any task that involves human language is an NLP task: speech recognition, text classification, named entity recognition, part-of-speech tagging, text generation, question answering, summarization, and translation. However, computers cannot understand raw text or sound; they only understand numbers. Therefore, we must represent text with numerical values.

Researchers have developed many ways to do this, and I will go through them in order from the most basic to the most advanced. But first, we will cover the building blocks of language and the common preprocessing operations.

Image from Unsplash

Building blocks of language

Building blocks of language (image by the author)

The smallest unit of language is the character. White spaces, punctuation marks, and letters are all considered characters, and any written language can be described with them. Character n-grams are groups of consecutive characters, where "n" is the number of characters in each group.
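For example, a minimal sketch in plain Python of how character n-grams can be extracted from a word:

```python
def char_ngrams(text, n):
    """Return every group of n consecutive characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("apple", 2))  # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("apple", 3))  # ['app', 'ppl', 'ple']
```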

Another building block of language is the word. Words are built from characters; the rules for arranging words are called syntax, and the idea or meaning a word conveys is called semantics. Morphemes are another building block: they are the smallest units of meaning, and a word is made of one or more of them, such as a root plus prefixes and suffixes. For instance, the word "apples" consists of two morphemes, "apple" and "s". The last building block is the sentence. These building blocks let us communicate with each other by combining words into sentences.

Preprocessing steps

Preprocessing steps are vital, especially for traditional approaches. One example is removing unnecessary characters such as punctuation marks. For some natural language processing tasks, common words such as "the" or "a" have no effect on the outcome. These words are called stopwords and are removed in some NLP tasks; in a task like text generation, however, they should not be removed.
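As an illustration, here is a minimal sketch of stopword removal with NLTK, assuming the package is installed and the "stopwords" and "punkt" resources have been fetched with nltk.download():

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("The cat sat on the mat")
filtered = [t for t in tokens if t.lower() not in stop_words]

print(filtered)  # ['cat', 'sat', 'mat']
```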

Difference between stemming and lemmatization (image by the author)

Stemming and lemmatization are two more important preprocessing steps in NLP. Stemming reduces a word to its root by stripping affixes; for example, it turns the word "meeting" into "meet". Lemmatization is similar, but instead of crudely stripping affixes it maps the word to its base dictionary form, called the lemma. It leaves a word like "meeting" unchanged when used as a noun, but changes "better" to "good".
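A minimal sketch with NLTK, assuming the "wordnet" corpus has been downloaded, shows the difference:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("meeting"))                   # 'meet'    (suffix stripped)
print(lemmatizer.lemmatize("meeting", pos="n"))  # 'meeting' (already a dictionary form)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'    (mapped to its lemma)
```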

Tokenization

Tokenization is the most critical step before any NLP task, and it is a family of methods in its own right, so it deserves its own section. The main types of tokenization are:

  • Word tokenization
  • Character tokenization
  • Subword tokenization

Types of tokenization (image by the author)

Tokenization is the segmentation of a text into tokens. As the name suggests, word tokenization segments the text into words. It is the most popular tokenization technique, but it has its drawbacks: sparse data and the out-of-vocabulary (OOV) token problem are the major ones. If not enough data is present, a large vocabulary can hurt the performance of the model, which is why word tokenizers usually offer a parameter to limit the size of the vocabulary. The OOV problem occurs when a given text contains a word that is not present in the vocabulary; that word becomes an unknown token, which can harm the model when a text contains many unknown words.

Character tokenization, on the other hand, does not have this problem, because it segments a text into individual characters and even the rarest word is made of known characters. But it has its own disadvantage: the model becomes much slower, since a text always contains far more characters than words. Subword tokenization is the bridge between character and word tokenization: it uses chunks of characters as tokens. Byte-pair encoding (BPE) is the original subword tokenization method.
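As a quick illustration, here is a toy sketch of word and character tokenization in plain Python; a real word tokenizer would also handle punctuation, casing, and vocabulary limits:

```python
text = "the dog bites the man"

# Word tokenization: split on whitespace.
word_tokens = text.split()
print(word_tokens)      # ['the', 'dog', 'bites', 'the', 'man']

# Character tokenization: every character (including spaces) becomes a token.
char_tokens = list(text)
print(char_tokens[:8])  # ['t', 'h', 'e', ' ', 'd', 'o', 'g', ' ']

# OOV handling for word tokens: words outside the vocabulary become "<unk>".
vocab = {"the", "dog", "man"}
print([t if t in vocab else "<unk>" for t in word_tokens])
# ['the', 'dog', '<unk>', 'the', 'man']
```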

In byte-pair encoding, the first step is to append a "</w>" token to the end of each word so the algorithm knows where one word ends and the next begins. BPE then repeatedly counts the frequency of each adjacent pair of symbols and merges the most frequent pair to build its tokens. The modern subword tokenization methods are WordPiece and SentencePiece. WordPiece is similar to byte-pair encoding, except that it chooses merges by likelihood rather than by raw pair frequency. SentencePiece is a framework that implements several subword algorithms, including BPE and a unigram language model, and works directly on raw text.
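A toy sketch of the core BPE loop, following the counting-and-merging idea described above (the example corpus and the number of merges are arbitrary):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count each adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy word counts: characters separated by spaces, "</w>" marks the end of each word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair gets merged
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```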

Now that we have covered the preprocessing steps, we can examine the representation methods.

Traditional Representation Methods

The traditional methods are bag of words, n-grams, and tf-idf. The bag of words method characterizes a text by the collection of words inside it: the idea is that if two texts contain the same words with similar frequencies, the texts are considered similar. The only thing to do is count each word in a text. An n-gram model is very similar; the main difference is that we count groups of neighboring words rather than single words. If n equals 2, the algorithm counts pairs of neighboring words such as {(dog, bites), (cat, sleeps)}. This captures some local word order, but it typically needs more memory than bag of words, because the number of distinct n-grams grows quickly. The tf-idf method, on the other hand, can be thought of as an improved version of bag of words: it gives a word more importance if it appears often in a document, but less importance if it appears in many documents.
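A minimal sketch with scikit-learn, assuming it is installed, shows all three representations on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the dog bites the man", "the cat sleeps", "the man sleeps"]

# Bag of words: count each vocabulary word in each document.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# Bigrams (n = 2): count pairs of neighboring words instead of single words.
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(corpus)
print(bigrams.get_feature_names_out())  # e.g. 'the dog', 'dog bites', 'cat sleeps', ...

# TF-IDF: reweight counts so words that appear in many documents matter less.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```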

Modern Representation Methods

The modern approach involves deep learning and embeddings. Embeddings are fixed-size vectors that represent tokens, and the values in those vectors are learned through deep learning. Researchers started from the distributional hypothesis: words that appear in similar contexts have similar meanings. This is how humans capture semantics too. What do we do when we do not know the meaning of a word in a sentence? We look at the neighboring words to infer the meaning. This is how Word2vec works. There are two ways to do it: either the model predicts the context words from the center word, or it predicts the center word from the context words. These are called skip-gram and continuous bag of words (CBOW), respectively.
The whole training process works like this. First, the embedding vectors of the words are initialized randomly and all the words are one-hot encoded. Second, the target is predicted from the feature vectors, and the loss is computed by comparing the predicted vector with the one-hot encoded target (in practice, a softmax followed by cross-entropy). Finally, the embedding vectors are updated through backpropagation.
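Here is a toy sketch of a single skip-gram training step over a tiny vocabulary, using a full softmax for clarity (real implementations train over large corpora in batches and use tricks such as negative sampling, discussed below):

```python
import numpy as np

vocab = ["the", "dog", "bites", "man"]
V, D = len(vocab), 8                           # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))      # input embeddings (one row per word)
W_out = rng.normal(scale=0.1, size=(D, V))     # output weights

center, context = vocab.index("dog"), vocab.index("bites")
lr = 0.05

# Forward pass: look up the center word's embedding and score every vocabulary word.
h = W_in[center]
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

# One-hot target and the gradient of the cross-entropy loss w.r.t. the scores.
target = np.zeros(V)
target[context] = 1.0
error = probs - target

# Backpropagation: update the output weights and the center word's embedding.
grad_out = np.outer(h, error)   # shape (D, V)
grad_h = W_out @ error          # shape (D,)
W_out -= lr * grad_out
W_in[center] -= lr * grad_h
```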

Skip-gram is usually preferred over the continuous bag of words method because it performs better most of the time: CBOW loses many nuances by averaging over all the context words instead of treating them one by one as skip-gram does. On the other hand, CBOW performs better when little data is available. Newer variants of skip-gram and CBOW also include parameters for the positions of the words to improve performance.

There are other techniques, such as negative sampling, that speed up training. Rather than predicting a word over the whole vocabulary, the model predicts whether a given word is a neighbor of the center word or not. We do this for all the actual neighbors and also train on a few non-neighbor words, which are called negative samples.
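For a practical sketch, the Word2vec implementation in Gensim (assuming version 4.x is installed) exposes these choices as parameters; the values below are illustrative, not tuned:

```python
from gensim.models import Word2Vec

sentences = [["the", "dog", "bites", "the", "man"],
             ["the", "cat", "sleeps"],
             ["the", "man", "sleeps"]]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context words considered on each side of the center word
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples per positive pair
    min_count=1,
    epochs=50,
)

print(model.wv["dog"][:5])            # first few dimensions of the learned embedding
print(model.wv.most_similar("dog"))   # nearest neighbors in embedding space
```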

