The power of ReGex | Python

Shivam Dutt Sharma
Published in Analytics Vidhya
12 min read · Jun 17, 2020
The power of ReGex is world-known

Short for Regular Expressions, ReGex or Regexp is known by the world of Computer Science & Linguistics as a string of text that allows you to create patterns that are used to match, locate and manage texts. To give a simpler definition of it — ReGex is a sequence of characters that forms a search pattern. It is further used to check if a string contains that specified search pattern.

You will generally find such search patterns in string-searching algorithms, prominently for “find” or “find and replace” operations on strings, and their use also extends to input validation in programming. Regex is a technique developed in theoretical computer science, with strong roots in the Theory of Automata and Formal Grammar / Language.

Who invented Regular Expressions?

Stephen Cole Kleene

The concept of Regular Expressions emerged in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language. It came into common use with Unix text-processing utilities.

Today, ReGex is used prominently in various areas of Computer Science & Linguistics.

To name a few such areas / applications of ReGex :-

  • Search engines : Everyone knows how important web-based search engines are in the current Information Age. The biggest problem their users face is separating the information they need from the vast pool that they do not. Pattern matching of the kind ReGex provides helps those “query-submitters” get the most relevant results for their queries.
  • Search and replace utility of word processors and text editors : wherein a user can search for a string and replace it with another string. I bet you see and use this on a regular basis. Don’t you? All you Word & Excel fans 😉
  • Text processing utilities such as Stream Editor (sed) & AWK : The former is a Unix utility that parses and transforms text, and the latter is a domain-specific language designed for text processing that is popularly used as a reporting and data extraction tool.
  • Lexical analysis : Lexical analysis converts a sequence of characters (what we know as a string) into tokens. ReGex helps here by scanning for and identifying only the valid strings / tokens / lexemes that belong to a certain language, that is, only the patterns defined by that language’s rules.

And, you should know that,

Many programming languages provide ReGex capabilities either built-in or via libraries. ReGex support is part of the standard library of many programming languages of which Java and Python would probably be the best known to the world. And, at the same time ReGex is also built into the syntax of others, including Perl and ECMAScript.

If it has got a little boring for you so far, let’s light up the reading a bit :-
Did you know that, to a beginner, a ReGex expression looks like a “Lingua Franca” invented for communication between humans and aliens, should there ever be an alien invasion in the future?

😆 😆 😆 😆 👽 👽 😯 😯 👽 👽 😖 😖👽 👽 😯 😯 👽 👽 😖 😖👽 👽 😆 😆 😆 😆

Like, imagine you ever come across a sequence like /^([A-Z0-9_\.- and you are like, that’s too cryptic for me to grasp. What on this planet does this mean?

On a light note : “ReGex” syntax is like a cat playing on your keyboard.

Now, on a serious note :-

Ask computer scientists and programmers around the world why ReGex is so important, and you will hear a common verdict: ReGex makes life easy by automating pattern searches and the cleaning of text-based data, thus saving you the time and trouble of searching through mountains of text manually.

Today, we will get our hands dirty with ReGex, using Python3 🙌

A lay-man connotation to how ReGex is performed using Python

Simply, cut to the chase…

At this point, I would want to give a new and more evolved definition of ReGex, in context to Python (you can find the same on python.org too) :-

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module.

The kind of questions, a programmer would want to answer using ReGex + Python?

  • Does this string contain a set of patterns that I want to search?
  • Does this string in its entirety match a pattern that I have?
  • Do my search patterns follow all the pre-defined rules?
  • Can I set some new ground rules for my ReGex?
  • Does my corpus accommodate all the possible search patterns of the consumer / user?
  • How relevant are my documents for the most common types of consumer keyword searches?
  • Does my text corpus contain any irrelevant characters / symbols / keywords, that I can remove so as to keep my documents all relevant and precise?
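A few of the questions above map directly onto functions of the re module. The snippet below is only a minimal sketch; the sample string and the variable names are assumptions made for illustration.

```python
import re

text = "Order #4521 shipped!!"  # hypothetical sample string

# Does the string contain the pattern anywhere? -> re.search
contains_digits = re.search(r"\d+", text) is not None

# Does the string match the pattern in its entirety? -> re.fullmatch
is_all_digits = re.fullmatch(r"\d+", text) is not None

# Remove symbols that are neither word characters nor white-space -> re.sub
cleaned = re.sub(r"[^\w\s]", "", text)

print(contains_digits)  # True
print(is_all_digits)    # False
print(cleaned)          # Order 4521 shipped
```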

Well, yes ReGex gives answers to all those questions. I will try to put it all in a practical sense now using Python3.

And right away, let’s do the one unavoidable thing first: import the right Python module, “re”, which we need for anything and everything we would like to do in ReGex :-

Importing the re module from Python

re, is a built-in package in Python, which can be used to work with Regular Expressions.

We are now going to try out a very basic example of ReGex to begin with. This is basically a demonstration of how you import the re module and perform ReGex operations on a sample string. It is only an iota of the vast potential that ReGex has, so please don’t get overwhelmed by what you see; we will do much more interesting stuff as we progress.

So……
Open your Jupyter notebook / Anaconda prompt / or anywhere you generally write your Python scripts, and begin by simply importing the re module.

Once you are ready, write the following lines of code on your editor / terminal / notebook. Just check, if you get the same output as I did. The ideal output to the code below is- “Platform: Medium.com”. Please confirm that you also see this getting printed on your output console.

A basic ReGex example
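A minimal sketch of such an example is given here. The input statement and the pattern are assumptions chosen for illustration; any sentence that mentions the platform would work the same way.

```python
import re

# Hypothetical statement; the exact wording is an assumption
statement = "I am going to publish my article on Medium.com next week"

# \w+ matches the site name, \. a literal dot, and com the suffix
match = re.search(r"\w+\.com", statement)
platform = match.group() if match else None

print("Platform:", platform)  # Platform: Medium.com
```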

What did just happen here?

Someone just told me that he is going to publish an article on a certain online platform. I have his statement written somewhere with me. All I want to do now is simply extract the platform name from that statement (string). If you will see in the code above, I have done exactly the same. I have defined a search pattern that extracts the sequence “Medium.com” from the original input string.

How did it actually happen?

Well this is a question which we can answer once we dive a little deeper into the basic fundamentals of ReGex- its functions, characters, meta-characters, sets, sequences, etc.

ReGex Functions

The re module in Python offers a range of functions which allow a programmer to search a string for a match:-

  1. findall() : This ReGex function returns a list of all matches. It takes two parameters: the 1st is the ReGex pattern that you want to find, and the 2nd is the main string in which you want to find all the matches of that pattern.
An example demonstrating the use of findall( )

In the above example, I am trying to find a list of all those words in the given sentence, that have “ing” as a substring(part) in them. Hence, if you will see, the list returned is [‘Channing’, ‘swimming’, ‘dancing’, ‘dining’], wherein apparently each word has “ing” as a part.
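A sketch of a findall() call that gives this result is shown here. The sentence and the pattern are assumptions; only the returned list comes from the example described above.

```python
import re

# Hypothetical sentence containing the four words from the result
sentence = "Channing loves swimming, dancing and dining out"

# \w*ing\w* : any run of word characters that contains "ing"
words = re.findall(r"\w*ing\w*", sentence)

print(words)  # ['Channing', 'swimming', 'dancing', 'dining']
```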

2. search() : This ReGex function returns a match object if a match is found anywhere in the string, and None otherwise. It takes two parameters: the 1st is the ReGex pattern that you want to search for, and the 2nd is the main string in which you want to search. If there are two or more matches, only the first occurrence is returned.
I will show you both the cases here :-

1st case : Single occurrence of a match/search string. It’s quite a simple one, and you will see that the word “going” occurs at the 15th position knowing that the string’s starting index is 0. Hence, x.start() returns 15.

When I wanted to match “going” and there existed only one “going”

2nd case : Double (multiple) occurrence of a match/search string. It’s a case when I want to match “going” and I can see that there exist two occurrences of “going” in the sentence. Here, by the rule, I get the position of the first occurrence only, if you will see. x.start() returns 9 as the first going word is positioned at 9th index. It does not take into account the other “going” word at the end of the string.

When I wanted to match “going” and there existed two occurrences of “going”
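Both cases can be sketched as follows. The two sentences are assumptions, built so that the first “going” starts at index 15 and index 9 respectively; re.finditer() is added only to show how the later occurrences could be reached.

```python
import re

# 1st case: a single occurrence (hypothetical sentence)
one = "I am certainly going to Paris"
x = re.search("going", one)
print(x.start())  # 15

# 2nd case: two occurrences, search() reports only the first one
two = "They are going where everyone is going"
x2 = re.search("going", two)
print(x2.start())  # 9

# finditer() yields every match, should all positions be needed
positions = [m.start() for m in re.finditer("going", two)]
print(positions)  # [9, 33]
```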

3. split() : This ReGex function returns a list in which the string has been split at each match. It takes two parameters: the 1st is the character / pattern at which to split, and the 2nd is the main string that you want to split. This also gives you a one-liner for counting the words in a sentence, next time you need that in a program. 😁 😀 Just import re, split at white-space (“\s”) and take the length of the resulting list.
Voila!

I split a sentence at white space.

Let’s look at a case wherein we want to split on a certain letter or word rather than on white-space (which is the usual case). Below, I am trying to split the sentence at the letter “i”. The functionality remains the same, obviously :-

I split the same sentence as above at letter ‘i’
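The two splits can be sketched as below, assuming a sample sentence of my own choosing. Note that “\s” splits at every single white-space character, so a sentence with double spaces would produce empty strings; “\s+” is a common way around that.

```python
import re

sentence = "The rain in Spain"  # hypothetical sentence

# Split at white-space; the length of the list is the word count
words = re.split(r"\s", sentence)
print(words)       # ['The', 'rain', 'in', 'Spain']
print(len(words))  # 4

# Split the same sentence at the letter "i"
parts = re.split("i", sentence)
print(parts)       # ['The ra', 'n ', 'n Spa', 'n']
```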

4. sub() : This ReGex function replaces a match with a text of your own choice. It takes three parameters: the 1st is the pattern that you want to replace, the 2nd is the sub-string that you want to replace it with, and the 3rd is the main string in which you want to do the replacement.

If you will see below, in the string “We are discussing ReGex”, I have matched the pattern “ReGex” and replaced it with “ReGex with Python”, so that the resultant string comes out to be “We are discussing ReGex with Python”.

Using sub() in ReGex
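A minimal sketch of that replacement, using the strings quoted above (the variable names are my own):

```python
import re

original = "We are discussing ReGex"

# re.sub(pattern, replacement, string)
result = re.sub("ReGex", "ReGex with Python", original)

print(result)  # We are discussing ReGex with Python
```

By default sub() replaces every occurrence of the pattern; the optional count argument limits how many are replaced.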

Python ReGex characters Cheat-sheet :-

Regular Expression Basics :-

. (dot) — Any character except new line
a — The character a
ab — The string ab
a|b — a or b
a* — 0 or more a’s
\ — Escapes a special character

Regular Expression Character Classes :-

[ab-d] — One character of: a, b, c, d
[^ab-d] — One character except: a, b, c, d
[\b] — Backspace character
\d — One digit
\D — One non-digit
\s — One white-space
\S — One non-white-space
\w — One word character
\W — One non-word character
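To see a few of these character classes at work, here is a small sketch; the sample strings are assumptions.

```python
import re

sample = "Room 42b, floor_3!"  # hypothetical sample string

in_set = re.findall(r"[ab-d]", "abcdef")       # one of a, b, c, d
not_in_set = re.findall(r"[^ab-d]", "abcdef")  # anything but a, b, c, d
digits = re.findall(r"\d", sample)             # single digits
non_word = re.findall(r"\W", sample)           # underscore counts as a word character

print(in_set)      # ['a', 'b', 'c', 'd']
print(not_in_set)  # ['e', 'f']
print(digits)      # ['4', '2', '3']
print(non_word)    # [' ', ',', ' ', '!']
```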

Regular Expression Flags :-

i — Ignore case
m — ^ and $ match start and end of line
s — . matches newline as well
x — Allow spaces and comments
L — Locale character classes
u — Unicode character classes
(?iLmsux) — Set flags within regex
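In Python these flags are passed as re.I, re.M, re.S and so on, or set inline at the start of the pattern. A short sketch on an assumed three-line string:

```python
import re

text = "Python\npython\nPYTHON"  # hypothetical multi-line string

# i: ignore case
any_case = re.findall(r"python", text, flags=re.I)

# m: ^ and $ also match at the start and end of each line
with_m = re.findall(r"^python$", text, flags=re.M)
without_m = re.findall(r"^python$", text)

# s: the dot matches a newline as well
dot_plain = re.search(r"Python.python", text)
dot_all = re.search(r"Python.python", text, flags=re.S)

# (?i): the same flag set inside the pattern
inline = re.findall(r"(?i)python", text)

print(any_case)                    # ['Python', 'python', 'PYTHON']
print(with_m, without_m)           # ['python'] []
print(dot_plain, dot_all.group())  # None, then the first two lines
print(inline)                      # ['Python', 'python', 'PYTHON']
```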

Regular Expression Quantifiers :-

* — 0 or more
+ — 1 or more
? — 0 or 1
{2} — Exactly 2
{2,5} — Between 2 and 5
{2,} — 2 or more
{,5} — Up to 5
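A quick sketch of the quantifiers on assumed inputs. One thing worth knowing: Python's re module does not treat a brace quantifier containing a space, such as “{2, 5}”, as a quantifier at all; it matches those characters literally.

```python
import re

numbers = "1 12 123456"  # hypothetical input

two_to_five = re.findall(r"\d{2,5}", numbers)  # greedy: takes up to 5 digits
two_or_more = re.findall(r"\d{2,}", numbers)
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]
up_to_five = [bool(re.fullmatch(r"\d{,5}", s)) for s in ("12345", "123456")]

# With a space inside the braces the pattern is taken literally
spaced = re.search(r"\d{2, 5}", "12345")

print(two_to_five)  # ['12', '12345']
print(two_or_more)  # ['12', '123456']
print(optional_u)   # [True, True, False]
print(up_to_five)   # [True, False]
print(spaced)       # None
```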

Regular Expression Assertions :-

^ — Start of string
\A — Start of string, ignores m flag
$ — End of string
\Z — End of string, ignores m flag
\b — Word boundary
\B — Non-word boundary
(?=…) — Positive look-ahead
(?!…) — Negative look-ahead
(?<=…) — Positive look-behind
(?<!…) — Negative look-behind
(?(Y)…|…) — Conditional: first branch if group Y matched, otherwise the second
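Assertions match a position rather than a character, which is easiest to see on an example. The text below is an assumption.

```python
import re

text = "price: $45, discount: 10%, total: $40"  # hypothetical text

# Positive look-behind: digits that come right after a dollar sign
dollars = re.findall(r"(?<=\$)\d+", text)

# Positive look-ahead: digits that are followed by a percent sign
percents = re.findall(r"\d+(?=%)", text)

# Negative look-ahead: whole numbers that are NOT followed by a percent sign
not_percent = re.findall(r"\b\d+\b(?!%)", text)

# Word boundary: "cat" only as a whole word
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(dollars)      # ['45', '40']
print(percents)     # ['10']
print(not_percent)  # ['45', '40']
print(whole_word)   # ['cat']
```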

Regular Expression Groups :-

(…) — Capturing group
(?P<Y>…) — Capturing group named Y
(?:…) — Non-capturing group
\Y — Match the Y’th captured group
(?P=Y) — Match the named group Y
(?#…) — Comment

Regular Expression Special Characters :-

\n — Newline
\r — Carriage return
\t — Tab
\YYY — Octal character YYY
\xYY — Hexadecimal character YY

Regular Expression Replacement :-

\g<0> — Insert entire match
\g<Y> — Insert match Y (name or number)
\Y — Insert group numbered Y
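Groups and replacement references work hand in hand. Here is a small sketch that rearranges a date; the date string and the group names are assumptions for illustration.

```python
import re

date = "2020-06-17"  # hypothetical date string

# Named capturing groups, read back by name or by number
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date)
year, month = m.group("year"), m.group(2)

# \1, \2, \3 insert numbered groups into the replacement
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

# \g<name> inserts a named group, \g<0> the entire match
dotted = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})", r"\g<d>.\g<m>.\g<y>", date)
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")

# A back-reference inside the pattern: the same character twice in a row
doubled = re.search(r"(\w)\1", "hello").group()

print(year, month)  # 2020 06
print(swapped)      # 17/06/2020
print(dotted)       # 17.06.2020
print(bracketed)    # a[1]b[22]
print(doubled)      # ll
```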

I think we have done a fantastic job so far on apprising ourselves of the fundamentals of ReGex with Python and sort of know the concepts in depth.
Probably, the right time to look at some real world application of using ReGex (with Python)….

Have you heard of Credit Card Frauds?

ReGex has the magic to identify Credit Card frauds…

I have a friend named Mohan and he works for a Credit Card company. He was given a huge number of Credit Cards by his manager, and he was asked to validate the credit card numbers. My friend researched about it and found that the knowledge of ReGex can help him out.

He knew that I know ReGex, so he reached out and asked whether I could help him check that the bunch of Credit Cards he had carried valid Credit Card numbers.

He gave me the following rules / assertions as the basic characteristics of a genuine Credit Card number :-

► It must start with a 4, 5 or 6.
► It must contain exactly 16 digits.
► It must only consist of digits (0–9).
► It may have digits in groups of 4, separated by one hyphen “-”.
► It must NOT use any other separator like ' ' , '_', etc.
► It must NOT have 4 or more consecutive repeated digits.

He then gave me examples of a few valid credit card numbers :-

6954627879619786
5126424824942942
4127-2267-7924-3914

He then gave me examples of a few invalid / fraud credit card numbers :-

62531254792151867 #17 digits in card number → Hence Invalid 
6924444424441344 #Consecutive digits >= 4 times → Hence Invalid
5122-2368-7954 - 3214 #Separators other than '-' are used → Invalid
65244x4521242547 #Contains non digit characters('x') → Hence Invalid
9625312582963578 #Doesn't start with 4, 5 or 6 → Hence Invalid

I said - “Ok I shall try my best to help you”.

It did not take me much time to get back to him with the following code. This also shows how easily and quickly you can execute ReGex using Python and its effective module “re”. You just need to know your ReGex character set well, that’s it!

I told my friend: “I believe this code that I am sharing with you serves the purpose of validating whether your Credit Cards have genuine credit card numbers or not. Just input your credit card numbers and check for yourself”.

Can you (reader) have a look at this code of mine, and tell me what exactly is happening here ?

*Please keep the indentation of the functions, if/else blocks and for statements intact when you copy this script, since Python relies on it to identify blocks of code.

import re

def cc_rules(cc_num_):
    pass_ = 0  # number of rules passed so far

    # Rule 1: must start with 4, 5 or 6
    rule1 = r'^[456].+'
    if re.search(rule1, cc_num_):
        pass_ += 1

    # Rule 2: must contain exactly 16 digits (separators are stripped first)
    rule2 = ''.join(filter(lambda i: i.isdigit(), cc_num_))
    if len(rule2) == 16:
        pass_ += 1

    # Rule 3: every extracted character is a digit 0-9 (always true for
    # rule2; stray non-digit characters are rejected by rule 4)
    rule3 = all(item in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] for item in (int(x) for x in rule2))
    if rule3:
        pass_ += 1

    # Rule 4: four groups of 4 digits, optionally separated by one hyphen;
    # the trailing $ stops extra characters after the last group from passing
    rule4 = r'^[456][0-9]{3}-?[0-9]{4}-?[0-9]{4}-?[0-9]{4}$'
    if re.search(rule4, cc_num_):
        pass_ += 1

    # Rule 5: no white-space or underscore used as a separator
    rule5 = r'^[456][0-9]{3}[\s_][0-9]{4}[\s_]?[0-9]{4}[\s_]?[0-9]{4}'
    if re.search(rule5, cc_num_) is None:
        pass_ += 1

    # Rule 6: no digit repeated 4 or more times in a row
    rule6 = r'(\d)\1{3,}'
    if re.search(rule6, rule2) is None:
        pass_ += 1

    if pass_ == 6:
        return "Valid"
    else:
        return "Invalid"


if __name__ == '__main__':
    N = int(input())  # how many card numbers will follow
    assert 0 < N < 100, "0<N<100"
    for i in range(N):
        str_ = input()
        ans_ = cc_rules(str_)
        print(ans_)
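As an aside, the six rules can also be folded into a single pattern. The sketch below is an alternative of my own, not the function above; the names CARD_RE and is_valid_card are hypothetical. It is checked against the sample numbers Mohan gave me. Like the function above, it lets each hyphen be present or absent independently.

```python
import re

# Hypothetical single-pattern alternative to the rule-by-rule function
CARD_RE = re.compile(
    r"(?!.*(\d)(?:-?\1){3})"  # no digit repeated 4+ times, hyphens ignored
    r"[456]\d{3}"             # starts with 4, 5 or 6, then 3 more digits
    r"(?:-?\d{4}){3}"         # three more groups of 4, optional single hyphen
)

def is_valid_card(number):
    # fullmatch() requires the whole string to fit the pattern
    return CARD_RE.fullmatch(number) is not None

samples = [
    "6954627879619786",
    "5126424824942942",
    "4127-2267-7924-3914",
    "62531254792151867",
    "6924444424441344",
    "5122-2368-7954 - 3214",
    "65244x4521242547",
    "9625312582963578",
]

results = ["Valid" if is_valid_card(s) else "Invalid" for s in samples]
for s, r in zip(samples, results):
    print(s, "->", r)
```

The first three numbers come out as Valid and the remaining five as Invalid, matching the examples listed earlier.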

If you think you have got the idea of what has happened in this code and feel much more confident about ReGex after reading this article, then please leave a clap.

And, if you think you have some doubts anywhere, please leave your question in the comments below, and I would love to answer those.

THANK YOU!

Happy ReGex-ing!
Don’t forget to reach out to your Python for better and quicker execution 😉
