Python Regular Expression

Regular expressions can be thought of as a mini-language for specifying text patterns.

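re.compile(): create a regex object
re.search(): find the first occurrence of a pattern anywhere in a string
re.match(): check whether the pattern matches at the start of the string (use re.fullmatch() to require the entire string to match)
re.findall(): find every match in the string and return them all as a list, not just the first match
match.group(): get the matched text from the match object returned by search() or match()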
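
The difference between these functions is easiest to see side by side. Here is a small sketch; the sample string and the variable names are made up for illustration.

```python
import re

text = "cat bat cat"  # made-up sample string

# search() scans the whole string and stops at the first hit
first = re.search(r"bat", text)

# match() only succeeds if the pattern matches at position 0
at_start = re.match(r"bat", text)

# fullmatch() requires the pattern to cover the entire string
whole = re.fullmatch(r"cat bat cat", text)

# findall() returns every non-overlapping match as a list of strings
every = re.findall(r"cat", text)

print(first.group())   # bat
print(at_start)        # None
print(whole.group())   # cat bat cat
print(every)           # ['cat', 'cat']
```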

Searching with Regex

 match = re.search(pattern,string)

Pattern types (Character Classes)

\w: any word character [a-zA-Z0-9_]
\d: any numeric digit [0-9]
\s: any whitespace character (space, newline, tab)
\D: any character that is NOT a numeric digit
\W: any character that is NOT a letter, digit, or underscore
\S: any character that is NOT whitespace (space, tab, or newline)
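
To get a feel for these classes, it helps to run each one over the same string. This is only a quick sketch, and the sample text is made up.

```python
import re

sample = "Room 42, Floor_7\tEast"  # made-up sample text

digits = re.findall(r"\d", sample)       # single digits
spaces = re.findall(r"\s", sample)       # whitespace characters
words = re.findall(r"\w+", sample)       # runs of word characters
non_words = re.findall(r"\W", sample)    # everything \w does not match

print(digits)     # ['4', '2', '7']
print(spaces)     # [' ', ' ', '\t']
print(words)      # ['Room', '42', 'Floor_7', 'East']
print(non_words)  # [' ', ',', ' ', '\t']
```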

Repetition and Other Special Characters

+: 1 or more
*: 0 or more
?: 0 or 1
{k}: exactly k occurrences
{m,n}: from m to n occurrences, inclusive
.: matches any character except the newline (\n)
^: start of the string
$: end of the string
\: escape character
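
A compact way to compare the quantifiers is to test several patterns against the same short word list. The list and the matching() helper below are made up for this sketch; they are not part of the re module.

```python
import re

words = ["ct", "cat", "caat", "caaat"]  # made-up test words

def matching(pattern):
    """Return the words that the pattern matches in full (illustrative helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching(r"ca+t"))      # ['cat', 'caat', 'caaat']
print(matching(r"ca*t"))      # ['ct', 'cat', 'caat', 'caaat']
print(matching(r"ca?t"))      # ['ct', 'cat']
print(matching(r"ca{2}t"))    # ['caat']
print(matching(r"ca{2,3}t"))  # ['caat', 'caaat']

# ^ and $ pin a search() to the start and end of the string
anchored_hit = bool(re.search(r"^cat$", "cat"))
anchored_miss = bool(re.search(r"^cat$", "concat"))
print(anchored_hit, anchored_miss)  # True False

# A backslash turns the special "." into a literal dot
literal_dots = re.findall(r"\.", "a.b.c")
print(literal_dots)  # ['.', '.']
```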

Example

# The re module contains all the regular expression functions
>>> import re
>>> example = "Welcome to the world of Python"
>>> pattern = r'Python'
>>> match = re.search(pattern, example)
>>> print(match)
<re.Match object; span=(24, 30), match='Python'>
>>> if match:
...     print("found", match.group())
... else:
...     print("No match found")
...
found Python

NOTE: the r prefix marks a raw string. Regex patterns often contain backslashes (\w, \d), and a raw string such as r'\d' keeps Python from treating them as string escape sequences.
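
A short sketch of what the r prefix changes. The variable names and the sample sentence are made up.

```python
import re

# Without the r prefix, the backslash has to be doubled to survive as a backslash
same_pattern = (r"\d" == "\\d")

# r"\n" is two characters (backslash and n); "\n" is a single newline character
raw_len, plain_len = len(r"\n"), len("\n")

numbers = re.findall(r"\d+", "call 911 or 112")

print(same_pattern)        # True
print(raw_len, plain_len)  # 2 1
print(numbers)             # ['911', '112']
```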

The most popular example is finding a phone number :-)

>>> import re
>>> message = "my number is 510-123-4567"
# Here we create a regex object, which defines the pattern we are looking for
>>> myregex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# Then we search for the pattern in the string
>>> match = myregex.search(message)
# group() returns the actual matched text
>>> print(match.group())
510-123-4567

If the string contains multiple phone numbers, use findall

>>> import re
>>> message = "my number is 510-123-4567 and my office number is 510-987-1234"
>>> myregex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# Find every occurrence of the pattern and return them as a list of strings
>>> print(myregex.findall(message))
['510-123-4567', '510-987-1234']

Let's use groups to separate the area code from the rest of the phone number. Here parentheses have a special meaning: they mark where a group starts and where it ends.

>>> import re
>>> myregex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> match = myregex.search("My number is 510-123-4567")
>>> match
<re.Match object; span=(13, 25), match='510-123-4567'>
# This returns the full matched string
>>> match.group()
'510-123-4567'
# Returns only the first group (the area code)
>>> match.group(1)
'510'
# Returns the second group (the rest of the phone number, without the area code)
>>> match.group(2)
'123-4567'
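
Two related behaviours are worth knowing once a pattern contains groups: match.groups() returns all the groups at once, and findall() switches to returning tuples of groups instead of whole matches. A small sketch, where phone_regex is just an illustrative name:

```python
import re

phone_regex = re.compile(r"(\d\d\d)-(\d\d\d-\d\d\d\d)")
match = phone_regex.search("My number is 510-123-4567")

# groups() returns every captured group as a tuple, handy for unpacking
area_code, local_number = match.groups()
print(area_code)     # 510
print(local_number)  # 123-4567

# With groups in the pattern, findall() returns one tuple per match
pairs = phone_regex.findall("510-123-4567 and 510-987-1234")
print(pairs)  # [('510', '123-4567'), ('510', '987-1234')]
```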

To match literal parentheses in a string, we need to escape them with a backslash: \( and \)

>>> myregex = re.compile(r'\(\d\d\d\)-(\d\d\d-\d\d\d\d)')
>>> match = myregex.search("My number is (510)-123-4567")
>>> match.group()
'(510)-123-4567'

The pipe character (|) matches one of several possible alternatives

>>> lang = re.compile(r'Pyt(hon|con|mon)')
>>> match = lang.search("Python is a wonderful language")
>>> match.group()
'Python'
>>> match = lang.search("Pytcon is a wonderful language")
>>> match.group()
'Pytcon'
>>> match = lang.search("Pytmon is a wonderful language")
>>> match.group()
'Pytmon'

If the regular expression cannot find the pattern, search() returns None. To verify that:

>>> match = lang.search("Pytut is a wonderful language")
>>> match is None
True

?: zero or one time

>>> import re
# Here "ho" is optional: it may occur zero times or one time
>>> myexpr = re.compile(r'Pyt(ho)?n')
>>> match = myexpr.search("Python a wonderful language")
>>> match.group()
'Python'
>>> match = myexpr.search("Pytn a wonderful language")
>>> match.group()
'Pytn'

So if "ho" appears twice, the search fails and returns None, and calling group() on None raises an error

>>> match = myexpr.search("Pythohon a wonderful language")
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> match is None
True
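
A common way to avoid this AttributeError is to test the result before calling group(). The first_match() function below is a hypothetical helper written for this sketch, not something the re module provides.

```python
import re

myexpr = re.compile(r"Pyt(ho)?n")

def first_match(text):
    """Return the matched text, or None when nothing matches (illustrative helper)."""
    match = myexpr.search(text)
    return match.group() if match is not None else None

print(first_match("Python a wonderful language"))    # Python
print(first_match("Pythohon a wonderful language"))  # None
```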

In the same way, we can make the area code optional in our earlier phone number example

>>> myphone = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
>>> match = myphone.search("My phone number is 123-4567")
>>> match.group()
'123-4567'
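
To see what the optional group captures in each case, here is a small sketch that runs the same pattern on a number with an area code and one without. The variable names are made up.

```python
import re

myphone = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")

with_area = myphone.search("My phone number is 510-123-4567")
without_area = myphone.search("My phone number is 123-4567")

# When the optional group takes part in the match, group(1) holds its text
print(with_area.group(), with_area.group(1))        # 510-123-4567 510-

# When it is skipped, group(1) is None
print(without_area.group(), without_area.group(1))  # 123-4567 None
```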

"*": zero or more times

>>> import re
>>> myexpr = re.compile(r'Pyth(on)*')
>>> match = myexpr.search("Welcome to the world of Pythononon")
>>> match.group()
'Pythononon'

"+": one or more times (the group must appear at least once)

>>> myexpr = re.compile(r'Pyth(on)+')
>>> match = myexpr.search("Welcome to the world of Pyth")
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> match = myexpr.search("Welcome to the world of Python")
>>> match.group()
'Python'
>>> match = myexpr.search("Welcome to the world of Pythonononon")
>>> match.group()
'Pythonononon'

Now, if we want to match a specific number of times

>>> myregex = re.compile(r'(Re){3}')
>>> match = myregex.search("My matching string is ReReRe")
>>> match.group()
'ReReRe'
# Range of repetitions
>>> myregex = re.compile(r'(Re){3,5}')
>>> match = myregex.search("My matching string is ReReReRe")
>>> match.group()
'ReReReRe'

Regular expressions in Python are greedy by default, i.e. they try to match the longest possible string

# Instead of stopping at the minimum (the first 3 digits), it matches the first 5
>>> mydigit = re.compile(r'(\d){3,5}')
>>> match = mydigit.search('123456789')
>>> match.group()
'12345'

To do a non-greedy match, put a question mark after the curly braces; the pattern then matches the shortest possible string

>>> mydigit = re.compile(r'(\d){3,5}?')
>>> match = mydigit.search('123456789')
>>> match.group()
'123'
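
One detail worth knowing: non-greedy means "try the fewest repetitions first", not "always stop at the minimum". If the rest of the pattern cannot match yet, the engine still takes more. A small sketch with a made-up input string:

```python
import re

digits = "12345-"  # made-up input

# On its own, the lazy quantifier stops at the minimum of 3 digits
lazy_alone = re.search(r"\d{3,5}?", digits).group()

# Followed by a literal "-", it has to keep going until the "-" can match
lazy_then_dash = re.search(r"\d{3,5}?-", digits).group()

print(lazy_alone)      # 123
print(lazy_then_dash)  # 12345-
```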

Let's take a look at a few more examples that involve character classes

\w: any word character [a-zA-Z0-9_]
\d: any numeric digit [0-9]
\s: any whitespace character (space, newline, tab)

Let's say I need to match this address

>>> import re
>>> address = "123 fremont street"
>>> myregex = re.compile(r'\d+\s\w+\s\w+')
>>> myregex.findall(address)
['123 fremont street']

We can create our own character class

# Let's create our own character class which matches all lowercase vowels
# (to match uppercase vowels as well, use r'[aeiouAEIOU]')
>>> myregex = re.compile(r'[aeiou]')
>>> mypat = "Welcome to the world of Python"
>>> myregex.findall(mypat)
['e', 'o', 'e', 'o', 'e', 'o', 'o', 'o']

Now, if we want to match two vowels in a row

>>> myregex = re.compile(r'[aeiouAEIOU]{2}')
>>> mypat = "Welcome to the world of Python ae"
>>> myregex.findall(mypat)
['ae']

Negative Character Class (a ^ right after the opening bracket means match every character except the ones listed, here the vowels)

>>> myregex = re.compile(r'[^aeiouAEIOU]')
>>> mypat = "Welcome to the world of Python ae"
>>> myregex.findall(mypat)
['W', 'l', 'c', 'm', ' ', 't', ' ', 't', 'h', ' ', 'w', 'r', 'l', 'd', ' ', 'f', ' ', 'P', 'y', 't', 'h', 'n', ' ']

Let's take a look at the dot (.), which matches any character except the newline (\n)

>>> myregex = re.compile(r'.x')
>>> mypat = "Linux Unix Minix"
>>> myregex.findall(mypat)
['ux', 'ix', 'ix']

The dot is most often used together with *

*: 0 or more

Now, if we change our regex to include both

>>> myregex = re.compile(r'.*x')
>>> mypat = "Linux Unix Minix"
>>> myregex.findall(mypat)
['Linux Unix Minix']

NOTE

.*: performs a greedy match of any characters (except newline)
.*?: adding ? makes the match non-greedy

Let's look at the difference with the help of an example

>>> mystr = '"Welcome to the world of Python" great language to learn"'
>>> mypat = re.compile(r'"(.*?)"')
# Because the match is non-greedy, it stops at the first closing " it encounters
>>> mypat.findall(mystr)
['Welcome to the world of Python']

But in the case of a greedy match

>>> mypat = re.compile(r'"(.*)"')
# It matches everything up to the last " in the string
>>> mypat.findall(mystr)
['Welcome to the world of Python" great language to learn']
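
The same greedy versus non-greedy difference shows up whenever a delimiter repeats in the text. Here is a short sketch using angle brackets; the sample string is made up.

```python
import re

markup = "<b>bold</b> and <i>italic</i>"  # made-up sample

# Greedy: .* runs to the last ">" in the string, so there is a single match
greedy = re.findall(r"<.*>", markup)

# Non-greedy: .*? stops at the nearest ">", so each tag is matched separately
lazy = re.findall(r"<.*?>", markup)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```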

Now, as we mentioned above, .* matches everything except a newline

>>> myexpr = "Welcome to the \n world of \n Python"
>>> print(myexpr)
Welcome to the 
 world of 
 Python
>>> mypat = re.compile(r'(.*)')
>>> mypat.search(myexpr)
<re.Match object; span=(0, 15), match='Welcome to the '>

If we want the dot to match newlines as well, pass re.DOTALL as the second argument to re.compile()

>>> mypat = re.compile(r'.*', re.DOTALL)
>>> mypat.search(myexpr)
<re.Match object; span=(0, 34), match='Welcome to the \n world of \n Python'>
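
The flag can also be passed straight to re.search() without compiling first. A minimal side-by-side sketch, with a made-up two-line string:

```python
import re

text = "line one\nline two"  # made-up two-line string

# By default the dot stops at the first newline
default = re.search(r".*", text).group()

# With re.DOTALL the dot also matches "\n", so the match spans both lines
dotall = re.search(r".*", text, re.DOTALL).group()

print(repr(default))  # 'line one'
print(repr(dotall))   # 'line one\nline two'
```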

The second argument is really useful, especially if we want to perform a case-insensitive search (re.I)

>>> import re
>>> mystr = "Why Linux Is Such An Awesome Platform"
>>> mypat = re.compile(r'[aeiou]', re.I)
>>> mypat.findall(mystr)
['i', 'u', 'I', 'u', 'A', 'A', 'e', 'o', 'e', 'a', 'o']
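
Flags can be combined with the | operator when more than one is needed. The sketch below combines re.I and re.DOTALL; the sample text and the pattern are made up for illustration.

```python
import re

text = "Start\nMIDDLE\nend"  # made-up sample spanning three lines

# re.I ignores case, re.DOTALL lets the dot cross newlines
both = re.compile(r"start.*end", re.I | re.DOTALL)
with_flags = both.search(text)

# Without the flags, "start" does not match "Start", so there is no match
without_flags = re.search(r"start.*end", text)

print(with_flags.group() == text)  # True
print(without_flags)               # None
```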