Non-capturing groups in Python’s regular expressions
For the past two days I’ve been watching Programming Languages (CS 262) on Udacity and have been using regular expressions quite a bit.
Say you want to write a regular expression that recognizes restricted email addresses of the form alphanumeric@example.org and alphanumeric@long.subdomain.more.subdomain.domain.org. One possible solution looks like this:
import re
text1 = "alphanumeric@example.org"
text2 = "alphanumeric@long.subdomain.more.subdomain.domain.org"
r1 = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+(\.[a-zA-Z]+)*')
print(r1.match(text1).group())
print(r1.match(text2).group())
>>> print(r1.match(text1).group())
alphanumeric@example.org
>>> print(r1.match(text2).group())
alphanumeric@long.subdomain.more.subdomain.domain.org
Great. The regular expression looks good, or at least these tiny test cases seem to say so. But Problem Set 2.2, Email Addresses And Spam, uses re.findall() to find all the email addresses in a text. Continuing from the previous snippet, I added:
r2 = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+(?:\.[a-zA-Z]+)*')
print(r1.findall(text2))
print(r2.findall(text2))
Note that r2 is identical to r1 from the previous snippet except for the additional ?: at the start of the group.
>>> print(r1.findall(text2))
['.org']
>>> print(r2.findall(text2))
['alphanumeric@long.subdomain.more.subdomain.domain.org']
Why? It turns out that when re.findall sees a group in a regular expression pattern, the findall method will return the matches for the group. I will quote the documentation:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
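To make the quoted rule concrete, here is a small sketch of my own (the sample text and the simplified patterns are made up for illustration and are not from the problem set). It runs re.findall with zero, one, and two capturing groups:

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "alice@example.org bob@test.com"

# No groups: findall returns the whole matches.
no_groups = re.findall(r'[a-z]+@[a-z]+\.[a-z]+', text)
print(no_groups)   # ['alice@example.org', 'bob@test.com']

# One group: findall returns only what that group captured.
one_group = re.findall(r'[a-z]+@([a-z]+)\.[a-z]+', text)
print(one_group)   # ['example', 'test']

# Two groups: findall returns a list of tuples, one tuple per match.
two_groups = re.findall(r'([a-z]+)@([a-z]+)\.[a-z]+', text)
print(two_groups)  # [('alice', 'example'), ('bob', 'test')]
```

As soon as the pattern contains a capturing group, the whole match disappears from the result.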
The (?: stuff goes inside ) syntax is a non-capturing group. It still groups the sub-pattern, so you can apply an operator such as * to it, but the text it matched is not saved, so you can’t retrieve it via a .groups() method call.
The ( stuff goes inside ) syntax is a capturing group. If you call .groups() on a match object, you can examine what each capturing group in your pattern matched.
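Here is a quick sketch of the difference as seen through .groups(). The two patterns are simplified ones I made up for this comparison; they differ only in whether the first group captures:

```python
import re

# Both patterns match the same strings; only the first group differs.
capturing = re.compile(r'([a-z]+)@([a-z]+)\.org')
non_capturing = re.compile(r'(?:[a-z]+)@([a-z]+)\.org')

m1 = capturing.match("alice@example.org")
m2 = non_capturing.match("alice@example.org")

print(m1.groups())  # ('alice', 'example')
print(m2.groups())  # ('example',)

# group(0) is always the entire match, whatever groups the pattern has.
print(m2.group(0))  # alice@example.org
```

The non-capturing group still takes part in matching; it just doesn't get a slot in .groups().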
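In our example, r1 has exactly one capturing group, the trailing (\.[a-zA-Z]+)*, which matches zero or more extra dot-separated labels such as .domain or .org. Because the group is repeated, it only remembers its last repetition, so findall returns .org instead of the entire matched string. Switching to (?: ... ) in r2 leaves the pattern with no capturing groups, so findall goes back to returning the whole match.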
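The sketch below reuses r1 to check both points: the repeated group keeps only its last repetition, and the whole match is still available through group(0). Using re.finditer this way is my own suggestion for cases where you want to keep the capturing group; it is not part of the problem set's solution.

```python
import re

text = "alphanumeric@long.subdomain.more.subdomain.domain.org"
r1 = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+(\.[a-zA-Z]+)*')

m = r1.match(text)
print(m.group(0))  # alphanumeric@long.subdomain.more.subdomain.domain.org
print(m.group(1))  # .org  (only the last repetition of the group is kept)

# finditer yields match objects, so group(0) gives each whole match
# even though the pattern contains a capturing group.
whole_matches = [match.group(0) for match in r1.finditer(text)]
print(whole_matches)

# When the group matches zero times, findall reports an empty string.
print(r1.findall("alphanumeric@example.org"))  # ['']
```

If you only need the full matches, the non-capturing version r2 with findall is the simpler choice.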
The full test code is available on this gist.
As I was writing this post, I’ve found a few helpful discussions. If you think I’ve made any mistake, let me know so I can learn!