Regex in Python 101
Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the [re](<https://docs.python.org/3.6/library/re.html#module-re>) module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”.
Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. There is a beautiful theory underlying regular expressions, and efficient regular expression processing is regarded as one of the classic problems of computer science. To learn more, see this article: https://swtch.com/~rsc/regexp/regexp1.html
Regular Expressions in Python
Pattern matching with regular expressions has 3 steps:
- You come up with a pattern to find.
- You compile it into a pattern object.
- You apply the pattern object to a string, to find matches, i.e., instances of the pattern within the string. You can perform any of the below operations:
  - match(): Determine if the RE matches at the beginning of the string.
  - search(): Scan through a string, looking for any location where this RE matches.
  - findall(): Find all substrings where the RE matches, and return them as a list.
  - finditer(): Find all substrings where the RE matches, and return them as an iterator.
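As a quick illustration of how the four operations differ, here is a minimal sketch. The sample strings are made up for this example and are not part of the attendee data discussed later.

```python
import re

# Hypothetical sample text, chosen only to show how the four methods differ.
text = 'cat bat cat'
pattern = re.compile('cat')

at_start = pattern.match(text)            # succeeds: text starts with 'cat'
not_at_start = pattern.match('bat cat')   # None: 'bat cat' does not start with 'cat'
anywhere = pattern.search('bat cat')      # finds the first match at any position

all_matches = pattern.findall(text)                      # list of matched strings
all_spans = [m.span() for m in pattern.finditer(text)]   # iterator of match objects

print(at_start.span())
print(not_at_start)
print(anywhere.span())
print(all_matches)
print(all_spans)
```

Note that match() and search() return a single match object (or None), while findall() returns plain strings and finditer() yields match objects one at a time.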
Basics
Let’s see how this scheme works for the simplest case, in which the pattern is an exact substring:
import re
pattern = 'prize'
pattern_matcher = re.compile(pattern)
input_1 = 'Cash prize of $20,000'
matches_1 = pattern_matcher.search(input_1)
print(matches_1)
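You can also query the match object for more information:
print(matches_1.group())   # the matched text
print(matches_1.start())   # index where the match starts
print(matches_1.end())     # index just past the end of the match
print(matches_1.span())    # (start, end) as a tuple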
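These methods exist only on a match object. When nothing matches, search() returns None, so calling group() on the result would raise an AttributeError. A minimal sketch of a guard follows; the helper name describe is hypothetical and used only for illustration.

```python
import re

input_1 = 'Cash prize of $20,000'   # same sample string as above

def describe(pattern, text):
    """Return a short description of the first match, or a 'no match' note."""
    found = re.search(pattern, text)
    if found is None:                # search() returns None when nothing matches
        return 'no match'
    return '{!r} at {}'.format(found.group(), found.span())

print(describe('prize', input_1))
print(describe('lottery', input_1))
```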
Module Level: For infrequently used patterns, you can also skip creating the pattern object and just call the module-level search function, re.search()
matches_2 = re.search('prize', input_1)
# If you want to ignore case
matches_2 = re.search('cash', input_1, re.IGNORECASE)
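To see what the flag changes, the sketch below runs the same lowercase pattern with and without re.IGNORECASE. It also shows that the flag can be passed to re.compile() when a reusable pattern object is preferred. The variable names are chosen for this example only.

```python
import re

input_1 = 'Cash prize of $20,000'

# Without the flag, the lowercase pattern does not match the capital 'C'.
case_sensitive = re.search('cash', input_1)

# The same flag can be given to re.compile() for a reusable pattern object.
cash_matcher = re.compile('cash', re.IGNORECASE)
case_insensitive = cash_matcher.search(input_1)

print(case_sensitive)
print(case_insensitive.group())
```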
Most letters and characters will simply match themselves as we have seen above.
There are exceptions to this rule; some characters are special metacharacters, “. ^ $ * + ? { } [ ] \ | ( )”, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. To match one of these metacharacters literally, precede it with a backslash to remove its special meaning: \[ or \\.
# The r prefix makes a raw string, so the backslash reaches the regex engine unchanged
print(re.search(r'\$', input_1))
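The difference between an escaped and an unescaped metacharacter is easy to check. The sketch below also uses re.escape(), a standard library helper not mentioned above, which escapes a literal string for you.

```python
import re

input_1 = 'Cash prize of $20,000'

# Unescaped, '$' is the end-of-string anchor, so it matches an empty
# string at the very end of the input.
anchor = re.search('$', input_1)

# Escaped, it matches the literal dollar sign.
literal = re.search(r'\$', input_1)

# re.escape() builds a pattern that matches an arbitrary string literally.
amount = re.search(re.escape('$20,000'), input_1)

print(anchor.span())
print(literal.span())
print(amount.group())
```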
Problem: Domains of Email Addresses
Let’s say you have been given a file which contains the first name, last name, email address and contact number of every attendee of an event. If you want to analyse which organisations these participants represent, or, more simply, clean this data up, regular expressions can help. Let’s take a look at how we can clean up the email addresses while also analysing the domain of each.
Sample Data:
Identify any non-alphabetic characters present:
Character Classes: [ ]
The metacharacters [ ] are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters.
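Most metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature. A '^' placed first inside the brackets complements the set, so [^a-z] matches any character that is not a lowercase letter.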
# Character classes: [...]
email_id = 'veleti3@deloitte.com'
chars = '[^a-z]'
print (re.findall (chars, email_id))
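For comparison, here is a small sketch that runs a few other character classes against the same address: a range, a digit class, the complemented class from above, and a class containing characters that would be metacharacters elsewhere. The variable names are chosen for this example.

```python
import re

email_id = 'veleti3@deloitte.com'

in_range = re.findall('[a-c]', email_id)       # any of a, b, or c
digits = re.findall('[0-9]', email_id)         # any digit
non_lower = re.findall('[^a-z]', email_id)     # complement: anything that is not a-z
punctuation = re.findall('[.@$]', email_id)    # '.', '@' and '$' are literal in a class

print(in_range)
print(digits)
print(non_lower)
print(punctuation)
```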
Special Character Classes: \
Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace. The equivalences below hold for ASCII text; for str patterns, Python 3 also matches the corresponding Unicode characters unless the re.ASCII flag is set.
- \d: Matches any decimal digit; this is equivalent to the class [0-9].
- \D: Matches any non-digit character; this is equivalent to the class [^0-9].
- \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
- \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
- \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
- \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
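A short sketch of these sequences in action follows. The record string is a made-up attendee row, not real data, and the patterns are written as raw strings (r'...') so that Python passes the backslash through to the regex engine unchanged.

```python
import re

# Hypothetical attendee record, made up for illustration.
record = 'John Davis, johndavis@nasa.gov, 555-0100'

digits = re.findall(r'\d', record)          # every decimal digit
spaces = re.findall(r'\s', record)          # every whitespace character
separators = re.findall(r'[\s,.]', record)  # whitespace, ',' or '.'
words = re.findall(r'\w+', record)          # runs of alphanumerics/underscore

print(digits)
print(len(spaces))
print(separators)
print(words)
```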
Identify any leading spaces present:
Zero or More: *
* specifies that the previous character can be matched zero or more times, instead of exactly once.
Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions.
print ("Testing '*'...")
assert re.match ('\\s*', ' johndavis@ nasa.gov') is not None
assert re.match ('\\s*', 'veleti3@deloitte.com') is not None
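Two things are worth checking by hand here: why the second assertion passes even though the address has no leading space, and what greedy matching with backtracking looks like. The sketch below is illustrative only, and the address in it is hypothetical.

```python
import re

# '\s*' may match zero times, so it succeeds with an empty match
# when there is no leading whitespace at all.
empty = re.match(r'\s*', 'veleti3@deloitte.com')
print(repr(empty.group()))

# Greedy matching: '.*' first consumes the whole string, then backs up
# one character at a time until the literal '.' that follows can match.
email_id = 'first.last@example.com'   # hypothetical address
greedy = re.match(r'.*\.', email_id)
print(greedy.group())                 # ends at the LAST dot, not the first
```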
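One or More: +
+ specifies that the previous character must be matched one or more times, so, unlike *, it fails when the character is absent.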
print ("Testing '+'...")
assert re.match ('\\s+', ' johndavis@ nasa.gov') is not None
assert re.match ('\\s+', 'veleti3@deloitte.com') is None
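Building on these checks, one possible way to measure and remove stray whitespace is sketched below. The helper names are hypothetical, and re.sub() is a standard re function that has not been introduced in this article; it replaces every match of a pattern with the given string.

```python
import re

def leading_spaces(text):
    """Count whitespace characters at the start of text."""
    # '\s*' always matches (possibly an empty string), so group() is safe here.
    return len(re.match(r'\s*', text).group())

def strip_whitespace(text):
    """Remove every run of whitespace, wherever it appears."""
    return re.sub(r'\s+', '', text)

print(leading_spaces(' johndavis@ nasa.gov'))
print(leading_spaces('veleti3@deloitte.com'))
print(strip_whitespace(' johndavis@ nasa.gov'))
```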
Identify the Domain:
Everything: .
Dot matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. . is often used where you want to match “any character”.
# re.search is needed here: re.match only matches at the start of the string, which is not '@'
print(re.search('@.*', 'veleti@deloitte.com'))
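The effect of re.DOTALL shows up as soon as the input contains a newline. In this sketch two of the sample addresses are joined into one two-line string purely for illustration; re.search is used because the '@' is not at the start of the string.

```python
import re

two_lines = 'veleti@deloitte.com\nmichelle.pratt@anthem.com'

# By default '.' stops at the newline, so the match ends with the first line.
default_mode = re.search('@.*', two_lines)

# With re.DOTALL, '.' also matches the newline and runs to the end of the string.
dotall_mode = re.search('@.*', two_lines, re.DOTALL)

print(repr(default_mode.group()))
print(repr(dotall_mode.group()))
```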
Grouping: ( )
Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. Groups are marked by the '(' and ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab.
# Group 1 captures the user id, group 2 the domain
re_email = re.compile(r'\s*(\w+\.*\w*)\@(\w+\-*\w+[.\w+]+)')
print(re_email.match('veleti3@deloitte.com').groups())
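Individual groups can also be read by number: group(0) is the whole match, and group(1), group(2), and so on are the parenthesised parts in order. The sketch below reuses the same pattern (written as a raw string) and adds a repeated-group example.

```python
import re

re_email = re.compile(r'\s*(\w+\.*\w*)\@(\w+\-*\w+[.\w+]+)')
found = re_email.match('michelle.pratt@anthem.com')

print(found.group(0))   # the entire match
print(found.group(1))   # first group: the user id
print(found.group(2))   # second group: the domain

# A repeated group matches as many copies as it can; the group itself
# reports only the text of its last repetition.
repeated = re.match('(ab)*', 'ababab')
print(repeated.group(0))
print(repeated.group(1))
```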
To retrieve groups easily, you can also name them using the (?P<name>...) syntax:
re_names3 = re.compile(r'''
    \s*(?P<userid>\w+\.*\w*)          # optional leading spaces, then the user id
    \@                                # the separator
    \s*(?P<domain>\w+\-*\w+[.\w+]+)   # optional spaces, then the domain
    ''',
    re.VERBOSE)
print (re_names3.match ('veleti3@deloitte.com').group ('domain'))
print (re_names3.match ('michelle.pratt@anthem.com').group ('domain'))
print (re_names3.match (' johndavis@ nasa.gov').group ('domain'))
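Putting the pieces together, one possible way to clean the addresses and tally their domains is sketched below. It is only a sketch: the emails list stands in for the attendee file (the last entry is a hypothetical extra so that one domain repeats), the names re_names, cleaned and domains are chosen for this example, and collections.Counter is a standard library class not used elsewhere in this article.

```python
import re
from collections import Counter

re_names = re.compile(r'''
    \s*(?P<userid>\w+\.*\w*)
    \@
    \s*(?P<domain>\w+\-*\w+[.\w+]+)
    ''', re.VERBOSE)

# The three sample addresses from above, plus one hypothetical repeat domain.
emails = [
    'veleti3@deloitte.com',
    'michelle.pratt@anthem.com',
    ' johndavis@ nasa.gov',
    'jane.doe@anthem.com',
]

cleaned = []
domains = Counter()
for raw in emails:
    found = re_names.match(raw)
    if found is None:            # skip rows the pattern cannot parse
        continue
    parts = found.groupdict()    # {'userid': ..., 'domain': ...}
    cleaned.append(parts['userid'] + '@' + parts['domain'])
    domains[parts['domain']] += 1

print(cleaned)
print(domains.most_common(1))
```

Rebuilding each address from its named groups drops the stray spaces, and the counter gives a quick view of which organisations are represented most often.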