Text Preprocessing Steps: Text Lowercasing and Tokenization

TejasH MistrY

In this article, I explain common text preprocessing steps, text lowercasing, what tokenization is, sentence tokenization, and word tokenization, in detail with practical implementations.

Text Preprocessing Steps

What is Natural Language Processing?

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. It involves the development of algorithms and computational models that enable computers to understand, interpret, and generate human language in a meaningful way.

In this article, we use NLTK (Natural Language Toolkit), a widely used Python library for Natural Language Processing (NLP) tasks. It provides a wide range of tools and resources for working with human language data.

1. Text Lowercasing:

Text lowercasing is an essential step in natural language processing (NLP) tasks.

By converting all text to lowercase, we ensure consistency in processing. Words with different cases (e.g., “apple” and “Apple”) are treated as the same token, avoiding discrepancies in analysis.

Lowercasing helps standardize text data, reducing the complexity introduced by varying cases of words. This simplification aids in text comparison, analysis, and modeling.

Text Lowercasing with Code.

# Sample text
text = "The quick BROWN fox JUMPS over the LAZY dog."

# Convert text to lowercase
lowercased_text = text.lower()

# Print the lowercased text
print(lowercased_text)
Output:

the quick brown fox jumps over the lazy dog.
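
As a quick illustration of the "same token" point above, here is a minimal sketch (the words chosen are my own example) showing how lowercasing makes two differently cased spellings compare as equal:

word_a = "Apple"
word_b = "apple"

# Without lowercasing, the two spellings are treated as different tokens
print(word_a == word_b)                  # False

# After lowercasing, they compare as the same token
print(word_a.lower() == word_b.lower())  # True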

2. What is Tokenization?

Tokenization is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph.

In Natural Language Processing (NLP), three of the most common types of tokenization are:

  1. Sentence tokenization.
  2. Word tokenization.
  3. Regular expression tokenization.

1. Sentence Tokenization:

Sentence tokenization involves splitting text into individual sentences, where each sentence becomes a separate token. This is particularly useful for tasks that require analyzing text at the sentence level or segmenting text for specific applications, such as machine translation, text summarization, and sentiment analysis.

For example, consider this paragraph:

Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language.

After sentence tokenization, this paragraph becomes:

[“Natural language processing (NLP) is a fascinating field.”, “It deals with how computers understand and interact with human language.”]

Here, each sentence is considered a token.

Let's perform tokenization by writing some code.

Note: To perform tokenization, we first have to install the NLTK library. To install NLTK, open your terminal or command prompt and run the following command:

pip install nltk
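
The sentence and word tokenizers used below also rely on NLTK's Punkt tokenizer models, which have to be downloaded once. Here is a minimal one-time setup step (if your NLTK version reports a different missing resource name, download the name it asks for instead):

import nltk

# One-time download of the Punkt tokenizer models used by sent_tokenize and word_tokenize
nltk.download('punkt')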

This code demonstrates sentence tokenization using NLTK:

import nltk
from nltk.tokenize import sent_tokenize

# Sample text containing multiple sentences
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Print the tokenized sentences
print(sentences)
Output:

['Natural language processing (NLP) is a fascinating field.',
'It deals with how computers understand and interact with human language.',
'Sentence tokenization is one of the basic tasks in NLP.']

Explanation of the provided code, step by step:

1. Importing NLTK and Sentence Tokenizer:

  • First, we import the NLTK library, which provides various tools and resources for natural language processing tasks.
  • Then, we specifically import the sent_tokenize function from the nltk.tokenize module. This function is designed for sentence tokenization.

2. Providing Text Input:

  • We define a sample text containing multiple sentences. This text serves as the input for the sentence tokenization process.

3. Tokenizing Text into Sentences:

  • We use the sent_tokenize() function to tokenize the input text into individual sentences. This function breaks down the text into a list of sentences based on punctuation marks that denote sentence boundaries.

4. Printing Tokenized Sentences:

  • Finally, we print the tokenized sentences to observe the result of the sentence tokenization process. Each sentence is printed separately, as they are stored in a list after tokenization.

2. Word Tokenization:

Word Tokenization is like cutting the sentence into smaller pieces, where each piece is called a token. In this case, each word in the sentence is a token. So, when we tokenize the sentence, we split it into individual words.

Word tokenization is widely used in NLP tasks because it provides a granular representation of text, making it easier to analyze and process.

For example, consider the sentence: “Hello there, how are you?”

Now, let’s tokenize this sentence.

After word tokenization, our sentence becomes:

[“Hello”, “there”, “,”, “how”, “are”, “you”, “?”]

Here, each word (“Hello”, “there”, “how”, “are”, “you”) and even punctuation marks like the comma (“,”) and the question mark (“?”) are considered tokens.

The following code demonstrates word tokenization using NLTK:

import nltk
from nltk.tokenize import word_tokenize

# Sample sentence
sentence = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."

# Tokenize the sentence into words
words = word_tokenize(sentence)

# Print the tokenized words
print(words)
Output:

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', '.', 'It', 'deals', 'with', 'how', 'computers', 'understand', 'and', 'interact', 'with', 'human', 'language', '.', 'Sentence', 'tokenization', 'is', 'one', 'of', 'the', 'basic', 'tasks', 'in', 'NLP', '.']

Explanation of the provided code, step by step:

1. Importing the word_tokenize Function

  • The code imports the word_tokenize function from the nltk.tokenize module. This function is used specifically for tokenizing sentences into words.

2. Providing a Sample Sentence

  • A sample sentence is defined and stored in the variable sentence.

3. Tokenizing the Sentence into Words:

  • The word_tokenize() function is applied to the sentence variable. This function tokenizes the input sentence into individual words based on whitespace and punctuation.

4. Storing Tokenized Words:

  • The resulting tokens (words) are stored in the variable words.

5. Printing Tokenized Words:

  • The tokenized words are printed using the print() function. This allows us to observe the individual words extracted from the sentence.

Each word in the sentence is separated and stored as an individual element in the list. This process enables further analysis or processing of the text data at the word level.

Tokenization is like breaking down text into smaller parts, whether they are words in a sentence or sentences in a paragraph. These smaller parts are called tokens, and they help us analyze and understand text better in computer programs.
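
Because the tokens end up in an ordinary Python list, they can be fed directly into further word-level analysis. As a small illustrative sketch (the sample text and the Counter-based counting are my own example, not part of the NLTK demo above), here is how you could count how often each token appears:

from collections import Counter
from nltk.tokenize import word_tokenize

# Small invented sample; lowercase first so "NLP" and "nlp" count as one token
text = "NLP is fun. NLP is useful."
tokens = word_tokenize(text.lower())

# Count how often each token occurs
word_counts = Counter(tokens)
print(word_counts.most_common(3))  # e.g. [('nlp', 2), ('is', 2), ('.', 2)]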

3. Tokenization Using Regular Expressions:

Regular expressions can be used if you want complete control over how to tokenize text. Regular expressions allow for the creation of custom rules or patterns to perform sentence and word tokenization based on specific criteria or requirements. This flexibility is one of the key advantages of using regular expressions for text-processing tasks.

Before diving into practical implementations of word and sentence tokenization using regular expressions, let’s take a moment to understand the re module in Python:

The re module in Python provides support for working with regular expressions, which are powerful tools for pattern matching and text manipulation.

Some common use cases of the re module are listed below; a short example follows the list.

  1. Pattern Matching: The primary use of the re module is to search for specific patterns within strings. You can define patterns using regular expressions and then use functions like re.search() or re.match() to find matches in text data.
  2. String Extraction: Regular expressions can be used to extract substrings from text data based on predefined patterns. Functions like re.findall() or re.finditer() can be used to extract all occurrences of a pattern in a string or iterate over them, respectively.
  3. String Splitting: The re.split() function allows you to split strings based on a specified pattern. This is particularly useful for tokenization tasks, where you want to split text into words or sentences based on specific delimiters or patterns.
  4. String Replacement: Regular expressions can be used to replace substrings in text data with other strings. The re.sub() function allows you to search and replace occurrences of a pattern with a specified replacement string.
  5. Validation: Regular expressions are commonly used for data validation tasks, such as validating email addresses, phone numbers, or other structured data formats. You can define patterns that match valid formats and use functions like re.match() to validate input strings against these patterns.
  6. Text Cleaning: Regular expressions are useful for cleaning and preprocessing text data by removing unwanted characters, formatting inconsistencies, or special symbols. You can define patterns that match specific patterns of characters to be removed or replaced.
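
Here is a brief, self-contained sketch of these functions in action (the sample string and patterns below are invented purely for illustration):

import re

# Invented sample string for demonstration
text = "Call 555-0101 or email support@example.com; office hours are 9 to 5."

# Pattern matching: find the first run of digits
print(re.search(r'\d+', text).group())       # 555

# String extraction: find all runs of digits
print(re.findall(r'\d+', text))              # ['555', '0101', '9', '5']

# String splitting: split on a semicolon followed by optional whitespace
print(re.split(r';\s*', text))

# String replacement: mask the email address
print(re.sub(r'[\w.]+@[\w.]+', '[email]', text))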

Sentence Tokenization with Regular Expressions

import re

# Sample text containing multiple sentences
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."

# Define regular expression pattern for sentence tokenization
pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s'

# Tokenize the text into sentences using regular expressions
sentences = re.split(pattern, text)

# Print the tokenized sentences
print(sentences)
Output:

['Natural language processing (NLP) is a fascinating field.', 
'It deals with how computers understand and interact with human language.',
'Sentence tokenization is one of the basic tasks in NLP.']

Explanation of the provided code, step by step:

1. Importing the re Module:

  • The code starts by importing the re module, which provides support for working with regular expressions in Python.

2. Defining the Sample Text:

  • A sample text containing multiple sentences is defined and stored in the variable text. This text serves as the input for sentence tokenization.

3. Defining the Regular Expression Pattern:

  • The code defines a regular expression pattern pattern for sentence tokenization.

Let's break down this pattern (a short demonstration follows this step-by-step explanation):

  • (?<!\w\.\w.): Negative lookbehind assertion that ensures the matched position is not preceded by an abbreviation with internal periods (e.g., "e.g." or "U.S.").
  • (?<![A-Z][a-z]\.): Negative lookbehind assertion that ensures the matched position is not preceded by a title such as "Mr." or "Dr.".
  • (?<=\.|\?|\!): Positive lookbehind assertion that ensures the matched position is preceded by a period, question mark, or exclamation mark.
  • \s: Matches any whitespace character (space, tab, newline).

4. Tokenizing the Text into Sentences:

  • The re.split() function is used to tokenize the text into sentences based on the specified regular expression pattern. This function splits the text wherever the pattern matches, effectively segmenting it into individual sentences.
  • The pattern and text variables are passed as arguments to re.split() to perform the tokenization.

5. Storing Tokenized Sentences:

  • The resulting tokenized sentences are stored in the variable sentences. Each sentence in the text is represented as an individual element in the list sentences.

6. Printing Tokenized Sentences:

  • The tokenized sentences are printed using the print() function. This allows us to observe the individual sentences extracted from the input text.
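
As promised above, here is a small demonstration of what the two negative lookbehinds buy us. The sample sentence is invented for this sketch and contains a title ("Mr.") and an abbreviation ("e.g."); the pattern skips those periods but still splits at real sentence boundaries:

import re

# Same sentence-splitting pattern as above
pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s'

# Invented sample containing a title ("Mr.") and an abbreviation ("e.g.")
text = "Mr. Smith studies NLP, e.g. tokenization. It is useful! Is it hard?"

print(re.split(pattern, text))
Output:

['Mr. Smith studies NLP, e.g. tokenization.', 'It is useful!', 'Is it hard?']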

Word Tokenization with Regular Expressions

import re

# Sample text containing multiple sentences
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."

# Define regular expression pattern for word tokenization
pattern = r'\b\w+\b'

# Tokenize the text into words using regular expressions
words = re.findall(pattern, text)

print(words)
Output:

['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'fascinating', 'field', 'It', 'deals', 'with', 'how', 'computers', 'understand', 'and', 'interact', 'with', 'human', 'language', 'Sentence', 'tokenization', 'is', 'one', 'of', 'the', 'basic', 'tasks', 'in', 'NLP']

1. Importing the re Module:

  • The code starts by importing the re module, which provides support for working with regular expressions in Python.

2. Defining the Sample Text:

  • A sample text containing multiple sentences is defined and stored in the variable text. This text serves as the input for word tokenization.

3. Defining the Regular Expression Pattern:

  • The code defines a regular expression pattern r'\b\w+\b' for word tokenization. Let's break down this pattern:
  • \b: Denotes a word boundary.
  • \w+: Matches one or more word characters (letters, digits, or underscores).
  • \b: Denotes another word boundary.
  • This pattern effectively matches individual words in the text.

4. Tokenizing the Text into Words:

  • The re.findall() function is used to tokenize the text into words based on the specified regular expression pattern. This function finds all non-overlapping matches of the pattern in the input text and returns them as a list.
  • The pattern and text variables are passed as arguments to re.findall() to perform the tokenization.

5. Storing Tokenized Words:

  • The resulting tokens (words) are stored in the variable words. Each word in the text is represented as an individual element in the list words.

6. Printing Tokenized Words:

  • The tokenized words are printed using the print() function. This allows us to observe the individual words extracted from the input text.
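
One difference worth noting between this output and the earlier word_tokenize output: the \b\w+\b pattern only keeps word characters, so punctuation such as "(", ")", and "." does not appear as tokens. A minimal side-by-side sketch (the short sample sentence is invented for this comparison):

import re
from nltk.tokenize import word_tokenize

sample = "NLP (tokenization) is fun."

# NLTK keeps punctuation as separate tokens
print(word_tokenize(sample))             # ['NLP', '(', 'tokenization', ')', 'is', 'fun', '.']

# The \b\w+\b pattern drops punctuation entirely
print(re.findall(r'\b\w+\b', sample))    # ['NLP', 'tokenization', 'is', 'fun']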
