Text-PreProcessing — Removing Punctuation and Special Characters
Apr 5, 2024 · 2 min read
To remove punctuation and special characters from text data, we aim to clean the text and retain only alphanumeric characters and possibly spaces between words. This step is essential in text preprocessing as punctuation marks and special characters often do not carry significant semantic meaning and can introduce noise in the data.
Code demonstrating how to remove punctuation and special characters:
import re
# Sample text with punctuation and special characters
text = "Natural ##language proc##essing## (NLP) is a %%fascinating field. It @@deals with how computers understand and interact with human language***. ***Sentence &*tokenization is **one of the *&*basic tasks in NLP."
# Remove punctuation and special characters using regular expressions
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
# Print the cleaned text
print(clean_text)
Output:
Natural language processing NLP is a fascinating field It deals with how computers understand and interact with human language Sentence tokenization is one of the basic tasks in NLP
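When only standard punctuation needs to go (rather than arbitrary symbols), Python's standard library offers an alternative that avoids regular expressions entirely: str.translate combined with string.punctuation. A minimal sketch, using a made-up sample sentence:

```python
import string

text = "Hello, world! Text #cleaning is 100% useful."

# Build a translation table that maps every ASCII punctuation
# character (string.punctuation) to None, i.e. deletes it
table = str.maketrans('', '', string.punctuation)
clean_text = text.translate(table)

print(clean_text)  # Hello world Text cleaning is 100 useful
```

str.translate is typically faster than re.sub for this simple case, but note that string.punctuation covers only ASCII punctuation, so the regex approach remains the more general tool.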
Explanation of the code:
- We start by importing the re module, which lets us work with regular expressions in Python.
- We define a sample text containing punctuation marks and special characters.
- Using the re.sub() function, we specify the regular expression pattern r'[^a-zA-Z0-9\s]' to match any character that is not alphanumeric (a-zA-Z0-9) or a whitespace character (\s). The ^ inside the square brackets indicates negation.
- We replace every matched character with an empty string '', effectively removing it from the text.
- The cleaned text is stored in the variable clean_text.
- Finally, we print the cleaned text.