NLP — Text PreProcessing — Part 2 (Tokenization)

Chandu Aki
Published in The Deep Hub
Feb 16, 2024

In the previous article (NLP — Text PreProcessing — Part 1), we delved into the world of text removal. Now, in this sequel, our quest continues as we unravel more enchanting techniques of text preprocessing:

Tokenization:

Tokenization is the process of breaking down a text into smaller units, often words or phrases, known as tokens.

Examples of tokens can be words, characters, numbers, symbols, or n-grams.

The most common tokenization process is whitespace (unigram) tokenization. In this process, the entire text is split into words at whitespace boundaries. For example, the phrase "Natural Language Processing" is split into the unigrams "Natural", "Language", "Processing", as in the sketch below.
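A minimal sketch of whitespace tokenization using Python's built-in split(), which is covered in more detail later in this article:

text = "Natural Language Processing"
unigrams = text.split()   # no argument: split on any run of whitespace
print(unigrams)

Output: ['Natural', 'Language', 'Processing']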

Why do we use it? Tokenization is crucial for various NLP tasks: it helps in understanding the structure of the text, aiding analysis and feature extraction.

How does it work? Let’s consider the sentence: “Magical adventures await us!” Tokenization transforms it into individual tokens: [“Magical”, “adventures”, “await”, “us”, “!”].
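Note that a plain whitespace split would keep the exclamation mark attached to "us!". The sketch below uses NLTK's word_tokenize() (introduced later in this article) to separate punctuation into its own token:

from nltk.tokenize import word_tokenize  # requires the Punkt models, e.g. nltk.download('punkt')

print(word_tokenize("Magical adventures await us!"))

Output: ['Magical', 'adventures', 'await', 'us', '!']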

Applications & Use Cases:

  • Text analysis
  • Sentiment analysis
  • Named Entity Recognition (NER)

Different Techniques to perform Tokenization:

Tokenization Using Python’s split() function:

  • What is it? Tokenization using Python's split() function involves breaking a string into tokens based on a specified delimiter (any whitespace by default).
  • Why do we use it? It is a simple and quick way to tokenize text when the delimiter is known.
  • How does it work? The string is split into tokens wherever the specified delimiter occurs.
  • Applications & Use Cases: Handy for text where tokens are separated by spaces or other consistent delimiters.

Python Code and Output:

text = "Tokenization using Python's split() function is straightforward."
tokens = text.split()   # no argument: split on any whitespace
print(tokens)

Output: ['Tokenization', 'using', "Python's", 'split()', 'function', 'is', 'straightforward.']
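split() also accepts an explicit delimiter, which is handy for consistently delimited text; the comma-separated sample string below is made up purely for illustration:

csv_line = "apple,banana,cherry"
tokens = csv_line.split(",")   # split on commas instead of whitespace
print(tokens)

Output: ['apple', 'banana', 'cherry']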

Tokenization Using Regular Expressions (RegEx):

  • What is it? Tokenization using regular expressions involves defining patterns that split the text into tokens based on specific criteria.
  • Why do we use it? Regular expressions offer the flexibility to define custom tokenization rules based on patterns in the text.
  • How does it work? For example, using \w+ as the pattern with Python's re module matches runs of word characters, effectively tokenizing at word boundaries.
  • Applications & Use Cases: Useful when text follows specific patterns, such as hashtags or mentions in social media data.

Python Code and Output:

import re

text = "Tokenization using regular expressions is powerful! #NLP"
tokens = re.findall(r'\w+', text)   # \w+ matches runs of word characters
print(tokens)

Output: ['Tokenization', 'using', 'regular', 'expressions', 'is', 'powerful', 'NLP']
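Since the use cases above mention hashtags and mentions, here is a hedged sketch of a custom pattern; the sample tweet and the pattern itself are illustrative assumptions, not a general-purpose social media tokenizer:

import re

tweet = "Loving #NLP tutorials by @chandu!"
tokens = re.findall(r'[#@]?\w+', tweet)   # keep an optional leading # or @ attached to each token
print(tokens)

Output: ['Loving', '#NLP', 'tutorials', 'by', '@chandu']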

Tokenization Using NLTK (Natural Language Toolkit):

NLTK provides a tokenize module, which includes two commonly used functions:

  • Word tokenize: we use the word_tokenize() method to split a sentence into tokens or words.
  • Sentence tokenize: we use the sent_tokenize() method to split a document or paragraph into sentences.

word_tokenize():

  • What is it? NLTK's word_tokenize() breaks a text into words, treating punctuation marks as separate tokens.
  • Why do we use it? It provides more sophisticated tokenization than basic splitting methods.
  • How does it work? Words are identified using patterns that emit punctuation as separate tokens.
  • Applications & Use Cases: Suitable for comprehensive word-level analysis and processing.

Python Code and Output:

import nltk
nltk.download('punkt')   # one-time download of the tokenizer models (newer NLTK versions may need 'punkt_tab')
from nltk.tokenize import word_tokenize

text = "NLTK's word_tokenize() is useful for advanced word-level processing."
tokens = word_tokenize(text)
print(tokens)

Output: ["NLTK's", 'word_tokenize', '(', ')', 'is', 'useful', 'for', 'advanced', 'word-level', 'processing', '.']

sent_tokenize():

  • What is it? sent_tokenize() from NLTK splits text into sentences based on punctuation and context.
  • Why do we use it? It is essential for tasks requiring sentence-level understanding.
  • How does it work? Sentence boundaries are identified from punctuation and linguistic context using a pre-trained Punkt model.
  • Applications & Use Cases: Crucial for tasks like machine translation, summarization, or sentiment analysis.

Python Code and Output:

import nltk
nltk.download('punkt')   # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize

text = "NLTK's sent_tokenize() can handle complex sentence structures. For example, this is a sentence."
sentences = sent_tokenize(text)
print(sentences)

Output: ["NLTK's sentence_tokenize() can handle complex sentence structures.", 'For example, this is a sentence.']

Tokenization is the cornerstone of many NLP tasks, and these methods provide diverse approaches to extract meaningful units from text data based on specific requirements and contexts.

You can also achieve tokenization with alternative NLP libraries such as Gensim or spaCy, each offering its own features and capabilities in the realm of Natural Language Processing.
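For instance, here is a minimal spaCy sketch using a blank English pipeline (tokenizer only, so no statistical model download is needed); it is an illustrative snippet rather than the only way to tokenize with spaCy:

import spacy

nlp = spacy.blank("en")   # blank pipeline: language-specific tokenizer, nothing else
doc = nlp("Tokenization with spaCy is straightforward!")
print([token.text for token in doc])

Output: ['Tokenization', 'with', 'spaCy', 'is', 'straightforward', '!']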
