Python RegEx
2 ways to find a match
RegEx, short for regular expression, is a mini language that uses a string pattern to search for one or more substrings in a string.
After importing the re module, there are four functions we can use to make queries:
- match()
- search()
- findall()
- finditer()
All four can be called in two ways: at the module level, or from a compiled pattern object:
- Module-Level Functions
import re
string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
pattern = r'\wo'
res = re.search(pattern, string)
- Compilation
import re
string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
pattern = re.compile(r'\wo')
res = pattern.search(string)
The two approaches differ in two respects:
- how to define the pattern
- how to call the search() function
Module-Level Functions
Taking search() as an example, we can call it directly from the re module as re.search():
re.search(r'\wo', "Lorem ipsum dolor sit amet")
The first argument is the pattern, which can be a pattern string or a compiled pattern. The second argument is the string to be searched. Of course, both can be replaced with variables, as in the example above.
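To make the point about the first argument concrete, here is a small sketch showing that re.search() accepts either form. The variable names (text, from_string, compiled, from_compiled) are only for illustration.

```python
import re

text = "Lorem ipsum dolor sit amet"

# Pattern given as a raw string literal
from_string = re.search(r"\wo", text)

# Pattern given as an already compiled pattern object
compiled = re.compile(r"\wo")
from_compiled = re.search(compiled, text)

print(from_string.group(), from_compiled.group())  # Lo Lo
```

Both calls find the same first match, so which form to pass is mostly a matter of taste.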
Compilation
Compiling a regex means creating a pattern object first with the compile() function, then calling a method such as search() on that pattern.
import re
pattern = re.compile(r'\wo', re.IGNORECASE)
res = pattern.search("Lorem ipsum dolor sit amet, consectetur adipiscing elit")
With compile(), one or more flags can be passed as the second argument to refine the search. (The module-level functions accept the same flags through their optional flags argument.)
- re.A / re.ASCII
- re.S / re.DOTALL
- re.I / re.IGNORECASE
- re.L / re.LOCALE
- re.M / re.MULTILINE
- re.X / re.VERBOSE
Check the Python docs for the full definition of each flag.
You can combine multiple flags with the pipe operator: re.I|re.X ignores case and enables verbose mode, which allows whitespace and comments inside the pattern.
Personally, I prefer the compiled approach: I always forget the order of the two arguments to re.search(), and I like to define the pattern first anyway, so it is more convenient to call search() on the pattern.
Some other notes for regex
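As a rough sketch of combining flags, the example below compiles a verbose, case-insensitive pattern and then does the equivalent search at the module level. The sample text and variable names are my own, chosen only for illustration.

```python
import re

text = "Lorem ipsum dolor sit amet"

# re.X lets the pattern span several lines and carry comments;
# re.I makes the match case-insensitive.
pattern = re.compile(r"""
    l      # the letter l, upper or lower case because of re.I
    \w+    # followed by one or more word characters
""", re.I | re.X)

compiled_result = pattern.search(text)

# Module-level functions take the same flags via the flags argument.
module_result = re.search(r"l\w+", text, flags=re.I)

print(compiled_result.group())  # Lorem
print(module_result.group())    # Lorem
```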
About match(), search(), findall(), finditer()
- match() and search() return at most one match, while findall() and finditer() return all matches.
- match() and search() return a re.Match object (or None if nothing matches), while findall() returns a list and finditer() returns an iterator of re.Match objects.
- match() only finds a match at the start of the string, while search() scans through the whole string.
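The differences listed above can be seen side by side in a short sketch. The sample text is shortened so that it does not start with a match, and the variable names are just for illustration.

```python
import re

text = "ipsum dolor sit amet"
pattern = re.compile(r"\wo")

at_start = pattern.match(text)        # None: "ip" at index 0 does not fit \wo
first = pattern.search(text)          # first match anywhere in the string
all_strings = pattern.findall(text)   # list of matched substrings
all_spans = [m.span() for m in pattern.finditer(text)]  # built from Match objects

print(at_start)        # None
print(first.group())   # do
print(all_strings)     # ['do', 'lo']
print(all_spans)       # [(6, 8), (8, 10)]
```

Because match() and search() can return None, it is worth checking the result before calling .group() on it.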
Use raw string
As mentioned at the beginning, regex is a mini language inside Python. Python interprets backslashes in ordinary string literals differently from how the regex engine does, so it is better to write patterns as raw strings (r'...').
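A quick sketch of why this matters, using \b as the example (my own choice of escape, not from the text above): Python reads "\b" in a normal string as a backspace character, while the regex engine expects a backslash followed by b to mean a word boundary.

```python
import re

# Normal string: "\b" is one character (backspace).
# Raw string: r"\b" is two characters, a backslash and "b".
lengths = (len("\b"), len(r"\b"))
print(lengths)  # (1, 2)

text = "cat concat"
raw_hits = re.findall(r"\bcat\b", text)        # word-boundary pattern
plain_hits = re.findall("\bcat\b", text)       # looks for backspace characters
escaped_hits = re.findall("\\bcat\\b", text)   # works, but harder to read

print(raw_hits)      # ['cat']
print(plain_hits)    # []
print(escaped_hits)  # ['cat']
```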
One special character “?”
- with or without it
p = re.compile(r'\d\d\d-?\d\d\d\d') #with or without "-"
p = re.compile(r'(\d{2})?\d') # with or without (\d{2})
- match the subpattern but don’t capture it: (?:pattern)
>>> p1 = re.compile(r'(\we)+')
>>> p2 = re.compile(r'(?:\we)+')
>>> m1 = p1.search("references")
>>> m2 = p2.search("references")
>>> m1.groups()
('re',)
>>> m2.groups()
()
- name the matched substring: (?P<name>pattern)
>>> p = re.compile(r'name is (?P<name>\w+)(\.| )')
>>> m = p.search('Hi, my name is Jack.')
>>> m.group("name")
'Jack'
>>> p = re.compile(r'am (?P<fname>\w+) (?P<lname>\w+)(\.| )')
>>> m = p.search('Hi, I am Jack London.')
>>> m.groupdict()
{'fname': 'Jack', 'lname': 'London'}
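P.S. The ? is followed by a capital P.
- lookahead assertion: require or forbid what comes next, without consuming it
(?!...) # negative lookahead: must not be followed by ...
(?=...) # positive lookahead: must be followed by ...
re.compile(r"""
.* # zero or more characters
[.] # a literal "."
(?!exe$) # what follows the "." must not be "exe" at the end of the string; this is a lookahead assertion and consumes no characters
[^.]*$ # zero or more non-"." characters up to the end of the string
""", re.X)
Methods of the re.Match object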
methods for re.Match object
- .group(), .group(i), .groups(), .groupdict()
.group() == .group(0)
>>> p = re.compile(r"c(o)?(a)t")
>>> m = p.search('a cat in coat')
>>> m1 = p.search("a coat on cat")
>>> print(m.group(), m.groups())
cat (None, 'a') #(o)? returns None ...
>>> print(m1.group(), m1.groups())
coat ('o', 'a')
>>> print(p.findall('a cat in coat'))
[('', 'a'), ('o', 'a')] #(o)? returns "" ...
- .span(), .start(), .end()
>>> p = re.compile(r"c(o)?(a)t")
>>> m = p.search('a cat in coat')
>>> print(m.span(), m.start(), m.end())
(2, 5) 2 5
That’s all for today.