Text Preprocessing Steps: Text Lowercasing and Tokenization
In this article, I explain common text preprocessing steps, including text lowercasing, sentence tokenization, and word tokenization, in detail with practical implementations.
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. It involves the development of algorithms and computational models that enable computers to understand, interpret, and generate human language in a meaningful way.
We use NLTK (Natural Language Toolkit), a widely used Python library for Natural Language Processing (NLP) tasks. It provides a wide range of tools and resources for working with human language data.
1. Text Lowercasing:
Text lowercasing is an essential step in natural language processing (NLP) tasks.
By converting all text to lowercase, we ensure consistency in processing. Words with different cases (e.g., “apple” and “Apple”) are treated as the same token, avoiding discrepancies in analysis.
Lowercasing helps standardize text data, reducing the complexity introduced by varying cases of words. This simplification aids in text comparison, analysis, and modeling.
Text lowercasing with code:
# Sample text
text = "The quick BROWN fox JUMPS over the LAZY dog."
# Convert text to lowercase
lowercased_text = text.lower()
# Print the lowercased text
print(lowercased_text)
Output: the quick brown fox jumps over the lazy dog.
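As a quick illustration of why this matters, here is a minimal sketch (standard library only, with a made-up sample text) showing how lowercasing collapses case variants into a single token when counting word frequencies:

```python
from collections import Counter

text = "Apple is a company. An apple is a fruit. APPLE again."

# Without lowercasing, each case variant is counted separately
raw_counts = Counter(text.split())
print(raw_counts["Apple"])    # 1 -- only the exact-case form

# With lowercasing, "Apple", "apple", and "APPLE" collapse into one token
lower_counts = Counter(text.lower().split())
print(lower_counts["apple"])  # 3 -- all case variants together
```

This is exactly the discrepancy the lowercasing step avoids in downstream analysis.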
2. What is Tokenization?
Tokenization is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph.
In Natural Language Processing (NLP), three of the most common types of tokenization are:
- Sentence tokenization.
- Word tokenization.
- Regular expression tokenization.
1. Sentence Tokenization:
Sentence tokenization involves splitting text into individual sentences, where each sentence becomes a separate token. This is particularly useful for tasks that require analyzing text at the sentence level or segmenting text for specific applications, such as machine translation, text summarization, and sentiment analysis.
For example, consider this paragraph:
“Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language.”
After Sentence tokenization, this paragraph would become:
[“Natural language processing (NLP) is a fascinating field.”, “It deals with how computers understand and interact with human language.”]
Here, each sentence is considered a token.
Let's perform tokenization by writing some code.
Note: To perform tokenization, we have to install the NLTK library on our device. To install NLTK, open your terminal or command prompt and run the following command:
pip install nltk
This code demonstrates sentence tokenization using NLTK:
import nltk
from nltk.tokenize import sent_tokenize
# Download the Punkt tokenizer models (only needed once)
nltk.download('punkt')
# Sample text containing multiple sentences
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."
# Tokenize the text into sentences
sentences = sent_tokenize(text)
# Print the tokenized sentences
print(sentences)
Output:
['Natural language processing (NLP) is a fascinating field.',
'It deals with how computers understand and interact with human language.',
'Sentence tokenization is one of the basic tasks in NLP.']
Explanation of the code, step by step:
1. Importing NLTK and Sentence Tokenizer:
- First, we import the NLTK library, which provides various tools and resources for natural language processing tasks.
- Then, we specifically import the sent_tokenize function from the nltk.tokenize module. This function is designed for sentence tokenization.
2. Providing Text Input:
- We define a sample text containing multiple sentences. This text serves as the input for the sentence tokenization process.
3. Tokenizing Text into Sentences:
- We use the sent_tokenize() function to tokenize the input text into individual sentences. This function breaks the text into a list of sentences based on punctuation marks that denote sentence boundaries.
4. Printing Tokenized Sentences:
- Finally, we print the tokenized sentences to observe the result of the sentence tokenization process. Each sentence is printed separately, as they are stored in a list after tokenization.
2. Word Tokenization:
Word Tokenization is like cutting the sentence into smaller pieces, where each piece is called a token. In this case, each word in the sentence is a token. So, when we tokenize the sentence, we split it into individual words.
Word tokenization is widely used in NLP tasks because it provides a granular representation of text, making it easier to analyze and process.
For example, consider the sentence: “Hello there, how are you?”
Now, let’s tokenize this sentence.
After word tokenization, our sentence becomes:
[“Hello”, “there”, “,”, “how”, “are”, “you”, “?”]
Here, each word (“Hello”, “there”, “how”, “are”, “you”) and even punctuation marks like the comma (“,”) and question mark (“?”) are considered tokens.
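For contrast, Python's built-in str.split() only splits on whitespace, so punctuation stays attached to the neighboring word. This is one reason dedicated word tokenizers are useful; a minimal sketch:

```python
# Naive whitespace splitting leaves punctuation attached to words
sentence = "Hello there, how are you?"
naive_tokens = sentence.split()
print(naive_tokens)  # ['Hello', 'there,', 'how', 'are', 'you?']
```

Here “there,” and “you?” are single tokens, whereas a proper word tokenizer separates the comma and question mark.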
This code demonstrates word tokenization using NLTK:
import nltk
from nltk.tokenize import word_tokenize
# Sample sentence
sentence = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."
# Tokenize the sentence into words
words = word_tokenize(sentence)
# Print the tokenized words
print(words)
Output:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', '.', 'It', 'deals', 'with', 'how', 'computers', 'understand', 'and', 'interact', 'with', 'human', 'language', '.', 'Sentence', 'tokenization', 'is', 'one', 'of', 'the', 'basic', 'tasks', 'in', 'NLP', '.']
Explanation of the code, step by step:
1. Importing the word_tokenize Function
- The code imports the word_tokenize function from the nltk.tokenize module. This function is used specifically for tokenizing sentences into words.
2. Providing a Sample Sentence
- A sample sentence is defined and stored in the variable sentence.
3. Tokenizing the Sentence into Words:
- The word_tokenize() function is applied to the sentence variable. This function tokenizes the input sentence into individual words based on whitespace and punctuation.
4. Storing Tokenized Words:
- The resulting tokens (words) are stored in the variable words.
5. Printing Tokenized Words:
- The tokenized words are printed using the print() function. This allows us to observe the individual words extracted from the sentence.
Each word in the sentence is separated and stored as an individual element in the list. This process enables further analysis or processing of the text data at the word level.
Tokenization is like breaking down text into smaller parts, whether they are words in a sentence or sentences in a paragraph. These smaller parts are called tokens, and they help us analyze and understand text better in computer programs.
3. Tokenization Using Regular Expressions
Regular expressions can be used if you want complete control over how to tokenize text. Regular expressions allow for the creation of custom rules or patterns to perform sentence and word tokenization based on specific criteria or requirements. This flexibility is one of the key advantages of using regular expressions for text-processing tasks.
Before diving into practical implementations of word and sentence tokenization using regular expressions, let's take a moment to understand the re module in Python.
The re module provides support for working with regular expressions, which are powerful tools for pattern matching and text manipulation.
Some common use cases of the re module:
- Pattern Matching: The primary use of the re module is to search for specific patterns within strings. You can define patterns using regular expressions and then use functions like re.search() or re.match() to find matches in text data.
- String Extraction: Regular expressions can be used to extract substrings from text data based on predefined patterns. Functions like re.findall() or re.finditer() can be used to extract all occurrences of a pattern in a string or iterate over them, respectively.
- String Splitting: The re.split() function allows you to split strings based on a specified pattern. This is particularly useful for tokenization tasks, where you want to split text into words or sentences based on specific delimiters or patterns.
- String Replacement: Regular expressions can be used to replace substrings in text data with other strings. The re.sub() function allows you to search and replace occurrences of a pattern with a specified replacement string.
- Validation: Regular expressions are commonly used for data validation tasks, such as validating email addresses, phone numbers, or other structured data formats. You can define patterns that match valid formats and use functions like re.match() to validate input strings against these patterns.
- Text Cleaning: Regular expressions are useful for cleaning and preprocessing text data by removing unwanted characters, formatting inconsistencies, or special symbols. You can define patterns that match specific characters to be removed or replaced.
Sentence Tokenization with Regular Expressions
import re
# Sample text containing multiple sentences
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."
# Define regular expression pattern for sentence tokenization
pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s'
# Tokenize the text into sentences using regular expressions
sentences = re.split(pattern, text)
# Print the tokenized sentences
print(sentences)
Output:
['Natural language processing (NLP) is a fascinating field.',
'It deals with how computers understand and interact with human language.',
'Sentence tokenization is one of the basic tasks in NLP.']
Explanation of the code, step by step:
1. Importing the re
Module:
- The code starts by importing the re module, which provides support for working with regular expressions in Python.
2. Defining the Sample Text:
- A sample text containing multiple sentences is defined and stored in the variable text. This text serves as the input for sentence tokenization.
3. Defining the Regular Expression Pattern:
- The code defines a regular expression pattern, pattern, for sentence tokenization. Let's break down this pattern:
- (?<!\w\.\w.) : a negative lookbehind assertion that ensures the matched position is not preceded by an abbreviation of the form "e.g." or "i.e.".
- (?<![A-Z][a-z]\.) : a negative lookbehind assertion that ensures the matched position is not preceded by a title such as "Mr." or "Dr.".
- (?<=\.|\?|\!) : a positive lookbehind assertion that ensures the matched position is preceded by a period, question mark, or exclamation mark.
- \s : matches the whitespace character at which the text is split.
4. Tokenizing the Text into Sentences:
- The re.split() function is used to tokenize the text into sentences based on the specified regular expression pattern. This function splits the text wherever the pattern matches, effectively segmenting it into individual sentences.
- The pattern and text variables are passed as arguments to re.split() to perform the tokenization.
5. Storing Tokenized Sentences:
- The resulting tokenized sentences are stored in the variable sentences. Each sentence in the text is represented as an individual element in the list sentences.
6. Printing Tokenized Sentences:
- The tokenized sentences are printed using the print() function. This allows us to observe the individual sentences extracted from the input text.
Word Tokenization with Regular Expressions
import re
# Sample text containing multiple sentences
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."
# Define regular expression pattern for word tokenization
pattern = r'\b\w+\b'
# Tokenize the text into words using regular expressions
words = re.findall(pattern, text)
print(words)
Output:
['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'fascinating', 'field', 'It', 'deals', 'with', 'how', 'computers', 'understand', 'and', 'interact', 'with', 'human', 'language', 'Sentence', 'tokenization', 'is', 'one', 'of', 'the', 'basic', 'tasks', 'in', 'NLP']
Explanation of the code, step by step:
1. Importing the re Module:
- The code starts by importing the re module, which provides support for working with regular expressions in Python.
2. Defining the Sample Text:
- A sample text containing multiple sentences is defined and stored in the variable text. This text serves as the input for word tokenization.
3. Defining the Regular Expression Pattern:
- The code defines the regular expression pattern r'\b\w+\b' for word tokenization. Let's break down this pattern:
- \b : denotes a word boundary.
- \w+ : matches one or more word characters (letters, digits, or underscores).
- \b : denotes another word boundary.
- This pattern effectively matches individual words in the text.
4. Tokenizing the Text into Words:
- The re.findall() function is used to tokenize the text into words based on the specified regular expression pattern. This function finds all non-overlapping matches of the pattern in the input text and returns them as a list.
- The pattern and text variables are passed as arguments to re.findall() to perform the tokenization.
5. Storing Tokenized Words:
- The resulting tokens (words) are stored in the variable words. Each word in the text is represented as an individual element in the list words.
6. Printing Tokenized Words:
- The tokenized words are printed using the print() function. This allows us to observe the individual words extracted from the input text.