Replacing Words Matching using Regular Expressions

TejasH MistrY
3 min read · Apr 7, 2024

In this article, we will learn how to replace words matching specific patterns using regular expressions. For instance, we’ll explore replacing “can’t” with “cannot” or “would’ve” with “would have”.

Now, we are going to get into the process of replacing words.

If stemming and lemmatization are a kind of linguistic compression, then word replacement can be thought of as error correction or text normalization.

In this article, we will replace words based on regular expressions, with a focus on expanding contractions.

Remember when we were tokenizing words in Tokenizing Text, and it was clear that most tokenizers had trouble with contractions? Here we aim to fix that by replacing contractions with their expanded forms, for example, by replacing “can’t” with “cannot” or “would’ve” with “would have”.

Understanding how this works requires a basic knowledge of regular expressions and Python's re module.

Let's first look at what regular expressions are, and then at the re module.

1. Regular Expressions.

  • Regular expressions, often abbreviated as regex or regexp, are sequences of characters that define a search pattern.
  • They are used for pattern matching within strings. With regular expressions, you can search for specific patterns, extract information, and manipulate text.
  • Regular expressions consist of a combination of literal characters (such as letters and digits) and special characters (metacharacters) that have special meanings.
  • Regular expressions are versatile and powerful: a single short pattern can replace many lines of manual string handling.
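To make the difference between literal characters and metacharacters concrete, here is a minimal sketch. The sample sentence and the variable names are made up for illustration; only the standard re module is used.

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "Order 66 shipped on 2024-04-07; order 7 is pending."

# Literal characters match themselves.
words = re.findall(r"pending", text)
print(words)      # ['pending']

# \d is a metacharacter meaning "a digit"; + means "one or more".
numbers = re.findall(r"\d+", text)
print(numbers)    # ['66', '2024', '04', '07', '7']

# Literals and metacharacters combined: [Oo] is a character class,
# and the parentheses capture only the digits after "order ".
order_ids = re.findall(r"[Oo]rder (\d+)", text)
print(order_ids)  # ['66', '7']
```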

2. The re Module.

  • In Python, the re module provides support for working with regular expressions.
  • It offers functions and methods for pattern matching, searching, and substitution operations using regular expressions.

Some of the commonly used functions and methods in the re module include:

  • re.search(pattern, string): Searches for the first occurrence of the pattern in the string.
  • re.match(pattern, string): Matches the pattern at the beginning of the string.
  • re.findall(pattern, string): Finds all occurrences of the pattern in the string and returns them as a list.
  • re.sub(pattern, repl, string): Substitutes occurrences of the pattern in the string with the replacement text.
  • re.compile(pattern): Compiles the regular expression pattern into a pattern object, which can be used for efficient pattern matching.
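The functions listed above can be tried out side by side. This is a small sketch on an example sentence of my own; the variable names are illustrative.

```python
import re

sentence = "I can't go, and she can't either."

# re.search: first match anywhere in the string (match object or None).
first = re.search(r"can't", sentence)
print(first.start())      # 2

# re.match: succeeds only if the pattern matches at the start of the string.
at_start = re.match(r"can't", sentence)
print(at_start)           # None, because the sentence starts with "I"

# re.findall: every non-overlapping match, returned as a list.
all_matches = re.findall(r"can't", sentence)
print(all_matches)        # ["can't", "can't"]

# re.sub: replace every match with the replacement text.
replaced = re.sub(r"can't", "cannot", sentence)
print(replaced)           # I cannot go, and she cannot either.

# re.compile: build the pattern object once, then reuse its methods.
contraction = re.compile(r"can't")
replaced_again = contraction.sub("cannot", sentence)
print(replaced_again == replaced)  # True
```

Compiling is mostly worthwhile when the same pattern is applied many times, which is the situation in the replacer class that follows.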

Code Demonstrating Replacing Words Matching Using Regular Expressions

import re

# Specific contractions come first, generic patterns after them.
# Replacement strings are raw strings so that \g<1> is not treated
# as an (invalid) string escape.
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),  # matches lowercase "i'm" only
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would')
]

class RegexpReplacer:
    def __init__(self, patterns=replacement_patterns):
        # Compile each regex once so it can be reused on every call.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        result = text
        # Apply the patterns one after another, in list order.
        for (pattern, repl) in self.patterns:
            result = pattern.sub(repl, result)
        return result

replacer = RegexpReplacer()
text = replacer.replace("I can't believe you're going on vacation without me.")
text1 = replacer.replace("I won't be attending the meeting tomorrow.")
print(text)
print(text1)
Output:

I cannot believe you are going on vacation without me.
I will not be attending the meeting tomorrow.
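One detail of the replacement_patterns list is worth checking for yourself: the specific entries such as can't are listed before the generic n't pattern. The sketch below suggests why. The helper apply_in_order and the two-entry pattern lists are hypothetical, written only for this comparison.

```python
import re

specific = (r"can't", "cannot")
generic = (r"(\w+)n't", r"\g<1> not")

def apply_in_order(text, patterns):
    # Hypothetical helper: apply each (regex, replacement) pair in list order.
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

# Specific pattern first: "can't" is handled before the generic rule sees it.
specific_first = apply_in_order("I can't go", [specific, generic])
print(specific_first)   # I cannot go

# Generic pattern first: (\w+) captures "ca", leaving "n't" to become " not".
generic_first = apply_in_order("I can't go", [generic, specific])
print(generic_first)    # I ca not go
```

So if you add new special cases to the list, it is safer to place them above the generic patterns.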

The RegexpReplacer.replace() method works by applying each replacement pattern in turn and substituting every match with its corresponding replacement string.

In replacement_patterns, we have defined tuples such as (r'(\w+)\'ve', r'\g<1> have').

The first element matches one or more word characters (letters, digits, or underscores; in Python 3 this includes non-ASCII letters by default) followed by 've. Because the characters before 've are wrapped in parentheses, they form a match group that can be referenced in the replacement string with \g<1>. So, we keep everything before 've and replace 've with the word have, preceded by a space. This is how should've becomes should have.
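The group mechanics described above can be inspected directly. This is a minimal sketch using an example sentence of my own; the variable names are illustrative.

```python
import re

pattern = re.compile(r"(\w+)'ve")
sample = "You should've called."

match = pattern.search(sample)
whole = match.group(0)   # the full match
stem = match.group(1)    # what the parentheses captured
print(whole)             # should've
print(stem)              # should

# \g<1> inserts group 1 into the replacement string.
expanded = pattern.sub(r"\g<1> have", sample)
print(expanded)          # You should have called.

# \1 is the shorter form of the same reference. \g<1> is unambiguous when
# the reference is followed directly by a digit.
expanded_short = pattern.sub(r"\1 have", sample)
print(expanded_short == expanded)  # True
```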
