Text Tokenization and Vectorization in NLP

Wojtek Fulmyk, Data Scientist
4 min read · Aug 1, 2023


Article level: Intermediate

My clients often ask me about the specifics of certain data preprocessing methods, why they’re needed, and when to use them. I will discuss a few common (and not-so-common) preprocessing methods in a series of articles on the topic.

In this preprocessing series:

Data Standardization — A Brief Explanation — Beginner
Data Normalization — A Brief Explanation — Beginner
One-hot Encoding — A Brief Explanation — Beginner
Ordinal Encoding — A Brief Explanation — Beginner
Missing Values in Dataset Preprocessing — Intermediate
Text Tokenization and Vectorization in NLP — Intermediate
Outlier Detection in Dataset Preprocessing — Intermediate
Feature Selection in Data Preprocessing — Advanced

In this short write-up I will explain how to tokenize and vectorize text. Some understanding of certain terms will be helpful, so I have attached short explanations of the more complicated terminology. Give it a go, and if you need more info, just ask in the comments section!

NLP — Using algorithms to analyze and process human language.

tokenization — Splitting text into smaller units such as words or phrases.

vectorization — Converting text into numerical representations for ML models.

reformatting — Changing the structure or representation of data.

ML model — An algorithm that can learn patterns from data.

whitespace — Blank spaces between words and characters.

linguistic — Relating to human language and its structure.

prefix — A group of letters at the start of a word.

semantics — The meaning of words, phrases, sentences.

Text Tokenization

Text tokenization is the process of reformatting a piece of text into smaller units called “tokens.” It transforms unstructured text into structured data that models can understand. The goal of tokenization is to break text down into meaningful units (words, phrases, sentences, etc.) that can then be fed into machine learning models. It is one of the first and most important steps in text preprocessing for NLP, and it often goes hand-in-hand with text vectorization.

Tokenization enables natural language processing tasks like part-of-speech tagging (identifying verbs vs nouns, etc.), named entity recognition (categories like person, organization, location), and relationship extraction (family relationships, professional relationships, etc.).
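To give a feel for one of these tasks, here is a minimal part-of-speech tagging sketch using NLTK; the sentence is made up for illustration, and the exact resource names and tags can vary slightly between NLTK versions.

import nltk

# one-time downloads: the tokenizer model and the POS tagger model
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.tokenize.word_tokenize("Dogs chase the red ball.")
print(nltk.pos_tag(tokens))
# e.g. [('Dogs', 'NNS'), ('chase', 'VBP'), ('the', 'DT'), ('red', 'JJ'), ('ball', 'NN'), ('.', '.')]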

There are a number of different tokenization methods; some of the simpler ones include splitting text on whitespace or punctuation. Advanced techniques use language rules to identify word boundaries and tokenize text into linguistic units; this can split words into sub-word tokens (such as prefixes, or based on syllables), or even combine certain tokens into larger units based on language semantics. The goal is to produce tokens that best represent the original text for ML purposes.
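To make the difference concrete, here is a minimal sketch (plain Python plus NLTK, on a made-up sentence) comparing a naive whitespace split with a rule-based tokenizer; notice how punctuation and the contraction are handled.

import nltk

nltk.download('punkt')  # one-time download of the tokenizer model

sentence = "Don't split me badly, please!"

# naive approach: split on whitespace only
print(sentence.split())
# ["Don't", 'split', 'me', 'badly,', 'please!']

# rule-based approach: NLTK separates punctuation and contractions
print(nltk.tokenize.word_tokenize(sentence))
# ['Do', "n't", 'split', 'me', 'badly', ',', 'please', '!']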

Text Vectorization

Text vectorization is the process of converting text into numerical representations (or “vectors”) that can be understood by ML models. It transforms unstructured text into structured numeric data, with the goal of representing the semantic meaning of text in a mathematical format.

Text vectorization allows for a variety of NLP tasks like document classification (checking whether something is an email or an essay, etc.), sentiment analysis (opinions or attitudes of the text, etc.), enhancing search engines, and so on.

Common text vectorization methods include one-hot encoding (assigning a unique integer value to each word), bag-of-words (counting the occurrences of words within each document), and word embeddings (mapping words to vectors so as to capture meaning). The vector space allows words with similar meanings to have similar representations.
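As a quick example, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer on two made-up "documents" (this assumes a reasonably recent scikit-learn version); each row counts how often each vocabulary word appears in one document.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'mat' 'on' 'sat' 'the']
print(counts)
# [[1 0 1 1 1 2]
#  [0 1 0 0 1 1]]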

Useful Python Code

To give you some understanding of the code involved in this kind of preprocessing, I will show you how to tokenize text using the NLTK library (a popular toolkit used by scientists and analysts who work with natural language data), and then perform a simple text vectorization using scikit-learn.

import nltk
from sklearn.feature_extraction.text import CountVectorizer

# this downloads a separate resource that enables the tokenize functionality;
# you only need to do this once, so comment it out after the first download.
nltk.download('punkt')

# sample text
text = "I would really like to tokenize and vectorize this sentence!"

# tokenize the text
tokenized = nltk.tokenize.word_tokenize(text)

# vectorize the tokens; note that CountVectorizer treats each token here
# as its own tiny "document", so each output row corresponds to one token
vectorizer = CountVectorizer()
vectorized = vectorizer.fit_transform(tokenized).toarray()

print(tokenized)
print(vectorized)

This will output the following:

['I', 'would', 'really', 'like', 'to', 'tokenize', 'and', 'vectorize', 'this', 'sentence', '!']

[[0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1]
 [0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 1 0 0]
 [1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]]
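
Two things are worth noting about this output. First, CountVectorizer's default token pattern ignores single characters and bare punctuation, which is why the rows for "I" and "!" are all zeros. Second, because we passed in a list of tokens, each token was vectorized as if it were its own tiny document. If you instead want a single count vector for the whole sentence, a minimal follow-up sketch (reusing the text and vectorizer from above) looks like this:

# treat the entire sentence as one document instead of one document per token
vectorized_sentence = vectorizer.fit_transform([text]).toarray()

print(vectorizer.get_feature_names_out())
# ['and' 'like' 'really' 'sentence' 'this' 'to' 'tokenize' 'vectorize' 'would']
print(vectorized_sentence)
# [[1 1 1 1 1 1 1 1 1]]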

And that’s all! I will leave you with some “fun” trivia 😊

Trivia

  • The early natural language program SHRDLU, developed by Terry Winograd in the late 1960s and early 1970s, used text tokenization to understand commands about a simulated world of blocks. By splitting input into word and punctuation tokens, SHRDLU could extract enough syntax and semantics to interpret the meaning of the text.
  • Studies show that the average English word is about 4.7 characters long, but individual words vary greatly, from 1 letter up to 28 letters. Tokenization accounts for this by splitting text into individual words rather than fixed-length units, providing a flexible representation that retains semantics and allows models to handle the estimated million-plus words of the English vocabulary.
