Text-PreProcessing-Removing Repeating Characters

TejasH MistrY
2 min read · Apr 7, 2024

In this article, we will explore how to remove repeating characters from words. For instance, we'll handle cases like “I looooooove it,” where a letter is repeated many times to emphasize a word such as “love.”

Removing Repeating Characters

In everyday language, people are often not strictly grammatical. They will write things such as “I looooooove it” in order to emphasize the word “love.” However, computers don’t know that “looooooove” is a variation of “love” unless they are told.

To remove these repeating characters and end up with a proper English word, we will use Python's re module and, more specifically, backreferences.

A backreference is a way to refer to a previously matched group in a regular expression. This will allow us to match and remove repeating characters.
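As a minimal sketch of a backreference on its own, consider the pattern `(\w)\1`, which matches any word character immediately followed by a copy of itself. The pattern name `double_char` and the sample strings are illustrative assumptions and are not part of the class used in this article.

```python
import re

# (\w) captures one word character; \1 requires that same character again.
double_char = re.compile(r'(\w)\1')

match = double_char.search("hello")
print(match.group(0))  # the doubled pair that was matched: ll
print(match.group(1))  # the single captured character: l

# A word with no adjacent duplicate characters does not match at all.
print(double_char.search("world"))  # None
```

The class below uses `\2` rather than `\1` because the repeated character is captured by the second group of its pattern.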

import re

class RepeatReplacer:
    def __init__(self):
        # Group 2 is a single character; \2 requires it to appear twice in a row.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep all three groups, dropping the duplicate matched by \2.
        self.repl = r'\1\2\3'

    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)

        # Recurse until a pass removes nothing.
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

replacer = RepeatReplacer()
text = replacer.replace("looooove")
text1 = replacer.replace("gooose")
text2 = replacer.replace("ooooooohh")
print(text)
print(text1)
print(text2)
Output: 

love
gose
oh

The RepeatReplacer class starts by compiling a regular expression for matching and defining a replacement string with backreferences.

The repeat_regexp pattern matches three groups:

  • 0 or more starting characters (\w*)
  • A single character (\w) that is followed by another instance of that character (\2)
  • 0 or more ending characters (\w*)

The replacement string is then used to keep all the matched groups while discarding the backreference to the second group. So, the word looooove gets split into (looo)(o)o(ve) and then recombined as loooove, discarding the last o. This continues until only one o remains, when repeat_regexp no longer matches the string and no more characters are removed.
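The pass-by-pass behaviour described above can be made visible with a small tracing helper. This is only a sketch: the function name `trace` is hypothetical, and it uses a loop instead of the class's recursion, but it applies the same pattern and replacement string.

```python
import re

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

def trace(word):
    """Return every intermediate form produced while collapsing repeats."""
    steps = [word]
    while True:
        match = repeat_regexp.search(word)
        if match is None:
            break  # no character is followed by a copy of itself
        # Show how the current word is split into the three groups.
        print(word, '->', match.groups())
        word = repeat_regexp.sub(r'\1\2\3', word)
        steps.append(word)
    return steps

steps = trace("looooove")
print(steps)
# ['looooove', 'loooove', 'looove', 'loove', 'love']
```

Each pass removes exactly one character, which is why a word with five o's takes four passes to reach love.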

In the previous examples, you can see that the RepeatReplacer class is a bit too greedy: it turns gooose into gose instead of stopping at goose. To correct this issue, we can augment the replace() function with a WordNet lookup. If WordNet recognizes the word, then we can stop replacing characters.

Here is the WordNet-augmented version.

import re
# Requires the WordNet corpus; run nltk.download('wordnet') once beforehand.
from nltk.corpus import wordnet

class RepeatReplacer:
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)

        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

replacer = RepeatReplacer()
text = replacer.replace("goose")
text1 = replacer.replace("oooooh")
text2 = replacer.replace("looooove")

print(text)
print(text1)
print(text2)
Output:

goose
ooh
love

Now, goose will be found in WordNet, and no character replacement will take place. Also, oooooh will become ooh instead of oh because ooh is actually a word in WordNet, defined as an expression of admiration or pleasure.
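The same stop-early idea can be tried without downloading the WordNet corpus. The sketch below is a variant under stated assumptions, not the class shown above: `collapse_repeats` is a hypothetical helper, `known_words` is a tiny made-up vocabulary standing in for the `wordnet.synsets` check, and the logic is written as a loop rather than recursion.

```python
import re

REPEAT_REGEXP = re.compile(r'(\w*)(\w)\2(\w*)')

def collapse_repeats(word, known_words):
    """Drop one repeated character per pass until the word is known
    or no repeated characters remain."""
    while word not in known_words:
        shorter = REPEAT_REGEXP.sub(r'\1\2\3', word)
        if shorter == word:
            break  # nothing left to remove
        word = shorter
    return word

# Hypothetical vocabulary, used here in place of a WordNet lookup.
known_words = {"goose", "ooh", "love"}

inputs = ("goose", "oooooh", "looooove", "gooose")
results = [collapse_repeats(w, known_words) for w in inputs]
print(results)
# ['goose', 'ooh', 'love', 'goose']
```

Because the lookup happens before every pass, gooose now stops at goose instead of continuing to gose. The quality of the result depends entirely on the vocabulary: a word missing from it will still be collapsed all the way down.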
