15 Examples for Text Processing using Regex

Monica Pérez Nogueras
7 min read · Jun 12, 2023

Regular expressions (regex) are a powerful tool for pattern matching and text processing in Python. While the basics of regex are well-known, there are advanced techniques and complex patterns that can take your text-processing skills to new heights. In this article, we will explore 15 complex examples of using regex in Python, demonstrating its versatility and effectiveness in solving various text-processing challenges.

Example 1: Extracting URLs

Extracting URLs from a given text is a common task in web scraping and data extraction.

import re

text = "Visit my website at https://example.com or check out the latest news at http://news.example.com"

url_pattern = r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+"
urls = re.findall(url_pattern, text)

print(urls)

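In this example, we define a regex pattern r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+" to match URLs with either the http or https scheme. Note that the pattern accepts only hyphens, word characters, dots, and percent-encoded bytes after the scheme, so each match ends at the first slash: it covers the scheme and host name, not the path or query string. A period that directly follows a URL in running text would also be included in the match. By using re.findall(), we extract all URLs from the given text.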
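
If the path and query string are needed as well, one option is to match everything up to the next whitespace and then trim trailing punctuation. The following is a minimal sketch under that assumption; the name url_with_path_pattern and the set of stripped characters are illustrative choices, not a full URL grammar.

```python
import re

text = (
    "Docs live at https://example.com/docs/intro?lang=en. "
    "Mirror: http://news.example.com/latest, updated daily."
)

# Scheme, then any run of non-whitespace characters (host, path, query).
# Hypothetical name, introduced only for this sketch.
url_with_path_pattern = r"https?://\S+"

# \S+ also swallows sentence punctuation that follows a URL, so strip it.
urls = [u.rstrip(".,;:!?)") for u in re.findall(url_with_path_pattern, text)]

print(urls)
```

The trade-off is that a URL that legitimately ends in one of the stripped characters would be shortened, so the character set should be adjusted to the data at hand.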

Example 2: Validating Email Addresses

Validating email addresses based on specific criteria can be crucial in many applications.

import re

def is_valid_email(email):
    # [A-Za-z]{2,} matches the top-level domain (a "|" inside brackets would be a literal pipe)
    email_pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
    return re.match(email_pattern, email) is not None

print(is_valid_email("john.doe@example.com"))  # True
print(is_valid_email("invalid_email"))  # False

In this code snippet, we define a regex pattern r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$" to validate email addresses. The pattern checks for a local part, an @ sign, a domain name, and a top-level domain of at least two letters. It is a practical format check rather than a complete implementation of the email address standard. The re.match() function checks whether the provided email matches the pattern from the start of the string, and the $ anchor requires the match to reach the end.
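
One subtlety: in Python, $ also matches just before a trailing newline, so a string ending in "\n" can pass a check built on re.match(). A small variation, assuming the whole string must match, uses re.fullmatch() instead. The names EMAIL_RE and is_valid_email_strict are introduced here for illustration only.

```python
import re

# Compiled once; no ^ or $ needed because fullmatch() anchors both ends.
# EMAIL_RE and is_valid_email_strict are hypothetical names for this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email_strict(email):
    """Return True only if the entire string matches the pattern."""
    return EMAIL_RE.fullmatch(email) is not None

print(is_valid_email_strict("john.doe@example.com"))    # True
print(is_valid_email_strict("john.doe@example.com\n"))  # False
print(is_valid_email_strict("invalid_email"))           # False
```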

Example 3: Extracting Hashtags

Extracting hashtags from social media content or text data is a common requirement.

import re

text = "Just enjoying a #beautiful day with #friends in #nature"

hashtag_pattern = r"#\w+"
hashtags = re.findall(hashtag_pattern, text)

print(hashtags)

In this example, we define a regex pattern r"#\w+" to match hashtags. The pattern starts with the # symbol and matches one or more word characters (\w+), meaning letters, digits, or underscores. By using re.findall(), we extract all hashtags from the given text.
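
The simple pattern also matches a # that appears in the middle of other text, such as a URL fragment. A possible refinement, sketched below, adds a negative lookbehind so the # must not directly follow a word character. The name standalone_hashtag_pattern and the sample text are assumptions made for this example.

```python
import re

text = "Loving #Python3 and #data_science! Docs: example.com/page#section"

# (?<!\w) is a negative lookbehind: the # must not directly follow a word character.
# standalone_hashtag_pattern is a hypothetical name for this sketch.
standalone_hashtag_pattern = r"(?<!\w)#\w+"

all_matches = re.findall(r"#\w+", text)
hashtags = re.findall(standalone_hashtag_pattern, text)

print(all_matches)  # includes '#section' from the URL fragment
print(hashtags)     # only the two standalone hashtags
```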

Example 4: Parsing HTML Attributes

Parsing HTML attributes from tags can be useful in web scraping and extracting specific information. Let’s extract the href attribute from an <a> tag:

import re

html = '<a href="https://example.com">Visit our website</a>'

href_pattern = r'href="([^"]*)"'
href = re.search(href_pattern, html).group(1)

print(href)

In this code snippet, we use the regex pattern r'href="([^"]*)"' to match the href attribute value within the <a> tag. The ([^"]*) captures any character except a double quote, allowing us to extract the attribute value. By using re.search() and accessing the matched group with .group(1), we retrieve the href value.
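
re.search() returns None when nothing matches, so calling .group(1) directly raises an AttributeError on input without an href. The sketch below, using made-up sample HTML, shows one way to guard against that and to collect several links at once with re.findall().

```python
import re

html = (
    '<a href="https://example.com">Home</a> '
    '<a class="nav" href="/about">About</a> '
    "<span>No link</span>"
)

href_pattern = r'href="([^"]*)"'

# findall returns the captured group for every match (an empty list if none).
all_hrefs = re.findall(href_pattern, html)
print(all_hrefs)

# search returns None when nothing matches, so check before calling .group().
match = re.search(href_pattern, "<span>No link</span>")
first_href = match.group(1) if match else None
print(first_href)
```

For anything beyond simple, well-formed snippets, an HTML parser is usually the more robust choice; the regex approach here assumes double-quoted attribute values.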

Example 5: Tokenizing Words

Tokenizing words is a fundamental step in natural language processing (NLP) tasks. Regex can help us tokenize words based on specific criteria. Let’s tokenize words excluding numbers and punctuation marks:

import re

text = "Hello! How are you today 06/12/2023? I hope everything is going well. Have a great day."

word_pattern = r"\b(?![\d.!?])\w+\b"
words = re.findall(word_pattern, text)

print(words)

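In this example, we define a regex pattern r"\b(?![\d.!?])\w+\b" to match words. The negative lookahead (?![\d.!?]) inspects only the first character of each candidate, so it rejects tokens that begin with a digit, such as 06 or 2023; a token like abc123 would still match. Punctuation is excluded simply because \w never matches it, which makes the .!? part of the lookahead redundant. The \b markers ensure that we match complete words. By using re.findall(), we extract all words from the given text.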
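
When tokens that contain digits anywhere should be dropped, a stricter alternative is to match runs of letters only. This is a sketch assuming ASCII text; the name letters_only_pattern and the sample sentence are illustrative.

```python
import re

text = "Room B12 opens at 9am on 06/12/2023, right?"

lookahead_pattern = r"\b(?![\d.!?])\w+\b"
# Hypothetical alternative: whole tokens made of ASCII letters only.
letters_only_pattern = r"\b[A-Za-z]+\b"

lookahead_words = re.findall(lookahead_pattern, text)
letter_words = re.findall(letters_only_pattern, text)

print(lookahead_words)  # keeps 'B12' because it starts with a letter
print(letter_words)     # drops every token that contains a digit
```

Note that [A-Za-z] excludes accented letters, so this variant is not suitable for text in languages that use them.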

Example 6: Removing Extra Whitespaces

Cleaning and normalizing text often involve removing extra whitespaces. Regex provides an efficient way to achieve this. Let’s remove extra whitespaces from a sentence:

import re

text = "This    sentence   has    extra     whitespaces."

clean_text = re.sub(r"\s+", " ", text)

print(clean_text)

In this code snippet, we use the regex pattern r"\s+" to match one or more consecutive whitespace characters, including spaces, tabs, and newlines. The re.sub() function replaces each run with a single space, removing the extra whitespace from the sentence.

Example 7: Splitting Text into Sentences

Splitting text into sentences is a common task in natural language processing. Let’s split a paragraph into sentences using regex:

import re

text = "Hello! How are you today? I hope everything is going well. Have a great day."

sentence_pattern = r"(.*?[.!?])"
sentences = re.findall(sentence_pattern, text)

print(sentences)

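In this example, we define a regex pattern r"(.*?[.!?])" to match sentences. The (.*?[.!?]) lazily captures as few characters as possible up to and including the next period, exclamation mark, or question mark. re.findall() returns each sentence as a list item; every sentence after the first keeps the space that preceded it, so call .strip() on the results if you need clean strings. Because . does not match newlines by default, a sentence that spans a line break is not captured as a whole unless re.DOTALL is set.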
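
An alternative sketch uses re.split() with a lookbehind, which splits on the whitespace after sentence-ending punctuation and therefore returns sentences without surrounding spaces. It shares the usual limitation of punctuation-based splitting: abbreviations such as "Dr." would also trigger a split.

```python
import re

text = "Hello! How are you today? I hope everything is going well. Have a great day."

# Split at whitespace that follows ., ! or ?
# The lookbehind (?<=[.!?]) checks the punctuation without consuming it.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

print(sentences)
```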

Example 8: Extracting Dates

Extracting dates from text data can be valuable in various applications. Let’s see how regex can help us extract dates in the format “MM-DD-YYYY”:

import re

text = "The event will take place on 06-12-2023. Don't miss it!"

date_pattern = r"\b\d{2}-\d{2}-\d{4}\b"
dates = re.findall(date_pattern, text)

print(dates)

In this code snippet, we define a regex pattern r"\b\d{2}-\d{2}-\d{4}\b" to match dates in the format "MM-DD-YYYY." The pattern consists of \b markers to match complete dates, followed by \d{2}-\d{2}-\d{4} to capture the month, day, and year. By using re.findall(), we extract all dates from the given text.
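
The pattern checks only the shape of the string, so a value like 13-45-2023 matches even though it is not a real date. One way to filter such candidates, sketched here, is to pass each match through datetime.strptime(). The helper name parse_date and the sample text are assumptions made for this example.

```python
import re
from datetime import datetime

text = "Valid: 06-12-2023. Looks like a date but is not: 13-45-2023."

date_pattern = r"\b\d{2}-\d{2}-\d{4}\b"

def parse_date(candidate):
    """Return a date for a real MM-DD-YYYY string, or None otherwise.

    Hypothetical helper for this sketch.
    """
    try:
        return datetime.strptime(candidate, "%m-%d-%Y").date()
    except ValueError:
        return None

candidates = re.findall(date_pattern, text)
valid_dates = [c for c in candidates if parse_date(c) is not None]

print(candidates)
print(valid_dates)
```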

Example 9: Extracting IP Addresses

Extracting IP addresses from text data is often required in network-related tasks. Let’s extract IP addresses from a log file using regex:

import re

log = "Client IP: 192.168.0.1 - Request received from 10.0.0.1 - Server IP: 172.16.0.1"

ip_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"
ips = re.findall(ip_pattern, log)

print(ips)

In this example, we define a regex pattern r"\b(?:\d{1,3}\.){3}\d{1,3}\b" to match IP addresses. The pattern captures four sets of digits (1 to 3 digits each) separated by periods. By using re.findall(), we extract all IP addresses from the log file.
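
Since \d{1,3} accepts any value from 0 to 999, the pattern also matches strings such as 999.300.1.1. A possible follow-up step, assuming only valid IPv4 addresses are wanted, is to check each candidate with the standard library's ipaddress module. The helper name is_ipv4 and the sample log line are illustrative.

```python
import re
import ipaddress

log = "Client IP: 192.168.0.1 - bad value 999.300.1.1 - Server IP: 172.16.0.1"

ip_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

def is_ipv4(candidate):
    """Return True if the string parses as an IPv4 address (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False

candidates = re.findall(ip_pattern, log)
valid_ips = [c for c in candidates if is_ipv4(c)]

print(candidates)
print(valid_ips)
```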

Example 10: Finding Duplicate Words

Identifying duplicate words in text data can be helpful in various text analysis tasks. Let’s find duplicate words in a sentence using regex:

import re

text = "This is is a test sentence to find duplicate duplicate words."

duplicate_pattern = r"\b(\w+)\b(?=.*\b\1\b)"
duplicates = re.findall(duplicate_pattern, text)

print(duplicates)

In this code snippet, we use the regex pattern r"\b(\w+)\b(?=.*\b\1\b)" to match duplicate words. The pattern captures individual words using (\w+) and utilizes positive lookahead (?=.*\b\1\b) to ensure that the same word appears again later in the sentence. By using re.findall(), we extract all duplicate words from the given sentence.
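
The lookahead version reports a word whenever it reappears anywhere later in the text. If only immediately repeated words matter, for example to clean up typing slips, a variation like the following sketch may be a better fit. The name adjacent_pattern is introduced here for illustration, and the match is case-sensitive unless re.IGNORECASE is passed.

```python
import re

text = "This is is a test sentence to find duplicate duplicate words."

# A word followed by one or more repetitions of itself, separated by whitespace.
# adjacent_pattern is a hypothetical name for this sketch.
adjacent_pattern = r"\b(\w+)(?:\s+\1\b)+"

adjacent = re.findall(adjacent_pattern, text)
print(adjacent)

# Replace each run of repeated words with a single copy.
collapsed = re.sub(adjacent_pattern, r"\1", text)
print(collapsed)
```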

Example 11: Removing HTML Tags

Removing HTML tags from text data is a common preprocessing step in web scraping and data cleaning. Let’s remove HTML tags from a given string using regex:

import re

html = "<h1>Welcome to the Regex World</h1><p>Enjoy the power of regex!</p>"

clean_text = re.sub(r"<.*?>", " ", html)

print(clean_text)

In this example, we use the regex pattern r"<.*?>" to match HTML tags. Between < and >, the . matches any character except a newline, * repeats it zero or more times, and the trailing ? makes the repetition lazy so that each match stops at the nearest >. The re.sub() function replaces every tag with a single space, which keeps the text of adjacent elements from running together but can leave extra spaces in the result.
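
To tidy the output, the tag removal can be combined with whitespace normalization. The sketch below chains the two substitutions; it assumes simple, well-formed markup, and an HTML parser remains the safer choice for real-world pages.

```python
import re

html = "<h1>Welcome to the Regex World</h1><p>Enjoy the power of regex!</p>"

# Step 1: replace every tag with a space so adjacent text stays separated.
no_tags = re.sub(r"<.*?>", " ", html)

# Step 2: collapse runs of whitespace and trim the ends.
clean_text = re.sub(r"\s+", " ", no_tags).strip()

print(clean_text)
```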

Example 12: Extracting Quoted Text

Extracting quoted text from a given string can be useful in various applications. Let’s extract the quoted text using regex:

import re

text = 'She said, "Life is short, enjoy every moment"'

quote_pattern = r'"([^"]*)"'
quotes = re.findall(quote_pattern, text)

print(quotes)

In this code snippet, we define a regex pattern r'"([^"]*)"' to match quoted text. The pattern captures any character except a double quote ([^"]*) between double quotes. By using re.findall(), we extract all quoted text from the given string.

Example 13: Extracting Time from Text

Extracting time information from text data can be valuable in various applications. Let’s extract time information in the format “HH:MM” from a given text:

import re

text = "The meeting will start at 14:30. Please be on time."

time_pattern = r"\b\d{2}:\d{2}\b"
times = re.findall(time_pattern, text)

print(times)

In this example, we define a regex pattern r"\b\d{2}:\d{2}\b" to match time in the format "HH:MM." The pattern captures two sets of digits (\d{2}) separated by a colon. By using re.findall(), we extract all time information from the given text.
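
Because \d{2} accepts any two digits, a string such as 25:99 also matches. A stricter sketch, assuming 24-hour times, restricts hours to 00 through 23 and minutes to 00 through 59. The name strict_time_pattern and the sample text are illustrative.

```python
import re

text = "The meeting will start at 14:30, not at 25:99. Doors open at 09:05."

# Hours: 00-19 via [01]\d or 20-23 via 2[0-3]; minutes: 00-59 via [0-5]\d.
# strict_time_pattern is a hypothetical name for this sketch.
strict_time_pattern = r"\b(?:[01]\d|2[0-3]):[0-5]\d\b"

times = re.findall(strict_time_pattern, text)

print(times)
```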

Example 14: Removing Non-Alphanumeric Characters

Removing non-alphanumeric characters from text can be helpful in data cleaning and preprocessing tasks. Let’s remove non-alphanumeric characters from a sentence using regex:

import re

text = "This sentence includes !@#$% special characters *&^."

clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)

print(clean_text)

In this code snippet, we use the regex pattern r"[^a-zA-Z0-9\s]" to match non-alphanumeric characters. The ^ within the square brackets negates the character class, effectively matching anything that is not a letter, digit, or whitespace. The re.sub() function replaces all occurrences of the pattern with an empty string, effectively removing non-alphanumeric characters from the sentence.

Example 15: Extracting Social Security Numbers

Extracting social security numbers (SSN) from text data can be necessary in certain applications.

import re

text = "The SSN of John Doe is 123-45-6789."

ssn_pattern = r"\d{3}-\d{2}-\d{4}"
ssns = re.findall(ssn_pattern, text)

print(ssns)

In this example, we define a regex pattern r"\d{3}-\d{2}-\d{4}" to match social security numbers in the format "XXX-XX-XXXX." The pattern captures three digits (\d{3}), followed by a hyphen, two digits (\d{2}), another hyphen, and four digits (\d{4}). By using re.findall(), we extract all SSNs from the given text.
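
Without word boundaries, the pattern can also match inside a longer run of digits, such as an order number. The following sketch adds \b on both sides and, since SSNs are sensitive data, shows one way to mask all but the last four digits. The sample text and the names bounded_ssn_pattern and masked are assumptions made for this example.

```python
import re

text = "The SSN of John Doe is 123-45-6789. Order 9123-45-67890 is not an SSN."

# \b on both sides keeps the pattern from matching inside longer digit runs.
# bounded_ssn_pattern is a hypothetical name for this sketch.
bounded_ssn_pattern = r"\b\d{3}-\d{2}-(\d{4})\b"

unbounded = re.findall(r"\d{3}-\d{2}-\d{4}", text)
bounded = [m.group(0) for m in re.finditer(bounded_ssn_pattern, text)]

print(unbounded)  # also picks up part of the order number
print(bounded)

# Keep only the last four digits visible.
masked = re.sub(bounded_ssn_pattern, r"***-**-\1", text)
print(masked)
```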

These examples demonstrate the power and flexibility of regex in Python for advanced text-processing tasks. By mastering regex and leveraging its capabilities, you can effectively handle complex text patterns and extract valuable information from various sources.

Conclusion

Regex is a powerful tool that empowers developers to handle complex text-processing tasks efficiently. In this article, we explored 15 examples of using regex in Python, showcasing its versatility in tasks such as extracting URLs, validating email addresses, tokenizing words, removing HTML tags, and more.

If you are interested in this matter, you can consult the article Regex: Exploring Patterns and Character Classes, Quantifiers, Groups, Look-ahead, and Look-behind.

Let’s connect on LinkedIn!

Monica Pérez Nogueras

Automation Developer | Data Analyst | Business Intelligence Analyst | The Dow Chemical Company