Four Pattern Occurrences That Can Save Your Life During Text Mining Projects

Arunkumar N
Published in Variablz Academy
Aug 20, 2022
Credits to Aatomz

In this data-driven world, regular expressions (regex) let us extract the information we need efficiently in almost any programming language. Suppose you are given a book full of contact information and you want to extract only the users' email addresses. Regex can do that job in a few minutes.

However, to retrieve exactly the information you want, you need to practice the regex pattern occurrences (also called quantifiers). So, in this article, I explain the four pattern occurrences from regular expressions that can save your life during text mining projects.

Pattern 1: Zero or more occurrences (*)

As the name suggests, it matches the preceding character or pattern zero or more times.

Before writing any code, we have to import Python's regex module, re.

In the first example, we take some text as input and create a pattern to find something in it. To search, we use re.findall(pattern, txt), which returns all the matches of the pattern in the text. I have created a pattern that starts with ‘94’, followed by a dot (.), the wildcard that matches any character except a newline. Our hero (*) says that the wildcard can occur zero or more times, and the pattern must end with ‘8’. For this pattern, we got 9488488458, which appears in the text, as the output. Here the pattern matches exactly the number in the text. Because the wildcard accepts any character and (*) is greedy, the match runs from ‘94’ to the last ‘8’ on the line, so any letters between ‘94’ and ‘8’ are included too.

In the second example, we leave out the wildcard dot (.). In this case, (*) indicates that the digit ‘4’ can occur zero or more times between ‘9’ and ‘8’. We got two results: ‘944448’ has several occurrences of ‘4’, and ‘98’ has zero occurrences of ‘4’.
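The first example can be sketched roughly as follows. The sample text is an assumption made for illustration; only the pattern and the matched number come from the description above.

```python
import re

# Sample input (assumed for illustration; not the article's original text)
txt = "Call me on 9488488458 tomorrow"

# '94', then any character zero or more times, then '8'
pattern = r"94.*8"

matches = re.findall(pattern, txt)
print(matches)  # ['9488488458']
```

Since (*) is greedy, the match extends to the last ‘8’ it can reach, which here is the final digit of the number.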
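A minimal sketch of the second example, again with an assumed sample text. The extra value 9458 is included only to show a non-match, because ‘5’ is not allowed between ‘9’ and ‘8’.

```python
import re

# Sample input (assumed for illustration)
txt = "Numbers: 944448, 98 and 9458"

# '9', then the digit '4' zero or more times, then '8'
pattern = r"94*8"

matches = re.findall(pattern, txt)
print(matches)  # ['944448', '98']
```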

Pattern 2: One or more occurrences (+)

It matches the preceding character or pattern one or more times.

In this example, ‘\d’ represents a digit (0–9) in a string, and ‘+’ indicates one or more occurrences of a digit. The given input contains many numbers, and with this pattern we extract each of them as a whole. If ‘+’ is not included after ‘\d’, every digit is returned as a separate single-character match instead of as part of a complete number.
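To make the difference between ‘\d+’ and ‘\d’ concrete, here is a small sketch. The input sentence is an assumption, not the article's original text.

```python
import re

# Sample input (assumed for illustration)
txt = "Order 66 shipped 120 items in 2022"

# One or more digits: each run of digits is returned as one match
numbers = re.findall(r"\d+", txt)
print(numbers)  # ['66', '120', '2022']

# A single digit: every digit becomes its own match
digits = re.findall(r"\d", txt)
print(digits)  # ['6', '6', '1', '2', '0', '2', '0', '2', '2']
```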

Pattern 3: Zero or one occurrence (?)

It matches the preceding character or pattern zero times or exactly one time.

In this example, (?) indicates zero or one occurrence of ‘i’ in the text. The first result, ‘fire’, is an example of one occurrence, since one ‘i’ is present between ‘f’ and ‘r’. The result ‘free’ is an example of zero occurrences, since there is no ‘i’ between ‘f’ and ‘r’.
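The exact pattern used for this example is not shown, so the following is only one possible sketch. It assumes the pattern ‘fi?ree?’ and a made-up input. The word ‘fiire’ is added to show that two occurrences of ‘i’ do not match.

```python
import re

# Sample input (assumed for illustration)
txt = "fire free fiire"

# 'f', an optional 'i', 'r', 'e', and an optional second 'e'
# (assumed pattern; the article's original pattern is not shown)
pattern = r"fi?ree?"

matches = re.findall(pattern, txt)
print(matches)  # ['fire', 'free']
```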

Pattern 4: The exact number of occurrences ({ })

We use this to specify exactly how many times a character or pattern must be repeated.

In this example, we will collect all the mobile numbers from the given input. To do this, we first have to recognize the patterns the mobile numbers follow in the text. In the given text, a mobile number is either a 10-digit number, represented by ‘\d{10}’, or a 10-digit number with the code ‘+91-’ in front, represented by ‘\+91-\d{10}’. Here ‘|’ means "either or". To treat ‘+’ as a literal character, we put ‘\’ before it, since ‘+’ has a special meaning in regex, as we saw in the previous examples.

I hope I have given you some valuable insights into regex. There are still many incredible things to learn about it. Learn regex and code more efficiently. See you all in the upcoming article. Keep your heads high. Keep rising.
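Putting the two alternatives together, a sketch of the mobile-number extraction might look like this. The input text and the phone numbers in it are made up for illustration.

```python
import re

# Sample input (assumed for illustration; the numbers are made up)
txt = "Call 9876543210 or +91-9123456780; office extension 12345."

# Either exactly 10 digits, or '+91-' followed by exactly 10 digits
pattern = r"\d{10}|\+91-\d{10}"

matches = re.findall(pattern, txt)
print(matches)  # ['9876543210', '+91-9123456780']
```

The 5-digit extension is skipped because it is shorter than 10 digits. Note that ‘\d{10}’ on its own would also match the first 10 digits of a longer run of digits, so real data may need stricter boundaries.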

Cheers!

By the way, would you like to ask for more info? Here is my LinkedIn:

https://www.linkedin.com/in/arunkumar-data-scientist/
