Regex (Regular Expressions) Demystified

Munish Goyal
Published in The Startup
23 min read · Aug 3, 2020

To fully utilize the power of shell scripting (and programming), one needs to master Regular Expressions. Certain commands and utilities commonly used in scripts, such as grep, expr, sed and awk use REs.

In this article we are going to talk about Regular Expressions. The table of contents below gives the gist of what is covered and helps with navigation:

· What is Regex?
· Regex Metacharacters
· How a Regex Engine works internally?
· Character Sets (or Classes): [ ]
· Word Sets (Alternation): |
· The Dot
· Anchors
· Repetition (?, *, +)
· Grouping and Back-references
∘ Back-references
∘ Named Capturing Groups
∘ Branch Reset Groups
· Regex Matching Modes
· Mode Modifiers
· Atomic Grouping & Rest of the Topics
· Regex Performance
· Using Regex in Linux Shell
· References

What is Regex?

Regular Expressions are sets of characters and/or metacharacters that match (or specify) patterns. The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters — a string or a part of a string.

Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

Regex Pattern:

Generally you define a Regex pattern by enclosing that pattern (without any additional quotes) within two forward-slashes. For example, /\w/, and /[aeiou]/.

Case Sensitivity:

Note that regex engines are case sensitive by default, unless you tell the regex engine to ignore the differences in case.

Regex uses:

When you scan a string (which may span multiple lines) with a regex pattern, you can get the following information:

  • Whether there is any match or not
  • Matched substrings within the given string
  • Positions of these substrings within the given string
  • Group back references for every substring
  • When used with \A, and \Z, rather than a matching substring, we can match whole of the given string as a unit
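As a minimal sketch of the list above, here is how each kind of information can be pulled out with Python's re module. The sample text and the pattern are made up for illustration.

```python
import re

text = "first line\nsecond line"
pattern = re.compile(r"(\w+) line")

# Is there any match at all?
has_match = pattern.search(text) is not None

# Matched substring, its start position, and its first captured group.
matches = [(m.group(0), m.start(), m.group(1)) for m in pattern.finditer(text)]

# \A and \Z tie the pattern to the whole string instead of a substring,
# so this fails on a two-line string.
whole = re.search(r"\A\w+ line\Z", text) is not None

print(has_match, matches, whole)
```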

Regex Metacharacters

Inside a pattern, all characters except (, ), [, ], {, }, |, \, ?, *, +, ., ^, and $ match themselves. If you want to match one of the special characters literally in a pattern, precede it with a backslash.

Note: When the pattern is delimited by forward slashes, a literal / cannot appear unescaped inside it; escape it by preceding it with a backslash.

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like {1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. An exception to this rule is the java.util.regex package which requires all literal braces to be escaped.

Escaping a Metacharacter:

The \ (backslash) is used to escape special characters and to give special meaning to some normal characters. For example, \1 is a back-reference to the first capturing group, \d means a digit character, and \D means a non-digit character. It is also used to specify non-printable characters such as \n (LF), \r (CR), and \t (tab).

Note: You can also escape backslash with backslash.

Escaping a single meta-character with a backslash works in all regular expression flavors.

All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. For example, \d will match a single digit from 0 to 9.

As a programmer, you may be surprised that characters like the single quote and double quote are not special characters.

Special characters and programming languages:

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string.
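To illustrate in Python (used here only as an example language): a normal string literal consumes one level of backslashes before the regex library sees the pattern, while a raw string passes them through untouched. The Windows-style path is a made-up sample.

```python
import re

# Normal literal: four backslashes in source become two in the pattern,
# which the regex engine reads as one literal backslash.
escaped = "\\\\"
# Raw literal: two backslashes in source stay two in the pattern.
raw = r"\\"

same_pattern = escaped == raw
found = re.findall(raw, r"C:\temp\file")
print(same_pattern, found)
```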

Non-Printable Characters:

You can use special character sequences to put non-printable characters in your regular expression.

  • Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A).
  • More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B).
  • Remember that Windows text files use \r\n to terminate lines, UNIX text files (Linux and Mac OS X) use \n (LF), and older versions of Mac OS use \r (CR).
  • You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9.
  • If your regular expression engine supports Unicode, use \uFFFF rather than \xFF to insert a Unicode character. The euro currency sign occupies code point 0x20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC.
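A short sketch of these escapes in Python's re module, which accepts \t, \r, \n, \xHH and \uHHHH inside patterns. The sample text is invented for the demonstration.

```python
import re

text = "price:\t100\u20ac \xa9 2020\r\n"

tab_found = re.search(r"\t", text) is not None      # tab character
copyright_pos = re.search(r"\xA9", text).start()    # Latin-1 code 0xA9
euro = re.search(r"\u20AC", text).group(0)          # Unicode code point 0x20AC
crlf = re.search(r"\r\n", text) is not None         # Windows line ending

print(tab_found, text[copyright_pos], euro, crlf)
```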

Basic vs. Extended Regular Expressions:

Refer: http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

Portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {. Some implementations support \{ as a meta-character.

How a Regex Engine works internally?

Knowing how the regex engine works will enable you to craft better regexes more easily.

The regex-directed engines are more powerful:

There are two kinds of regular expression engines:

  • text-directed engines, and
  • regex-directed (important) engines.

Certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex /regex|regex not/ to the string regex not. If the resulting match is only regex, the engine is regex-directed. If the result is regex not, then it is text-directed. The reason behind this is that the regex-directed engine is eager.

The Regex-Directed Engine Always Returns the Leftmost Match:

This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
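Both behaviours can be observed in Python, whose re module is a backtracking, regex-directed engine. This is a small sketch; the second sample string is made up.

```python
import re

# Eager: the first alternative that succeeds wins, although a longer
# match ("regex not") exists at the same position.
eager = re.search(r"regex|regex not", "regex not").group(0)

# Leftmost: the single digit found first is returned, although a longer
# run of digits appears later in the string.
leftmost = re.search(r"\d+", "a1 b22222").group(0)

print(eager, leftmost)
```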

Character Sets (or Classes): [ ]

With a “character set”, you can tell the regex engine to match only “one” out of several characters (order in which characters are listed in character-set doesn’t matter). Simply place the characters you want to match between square brackets ([ and ]). If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either ‘gray’ or ‘grey’.

A character class matches only a single character. gr[ae]y will not match ‘graay’, ‘graey’ or any such thing. The order of the characters inside a character class does not matter. The results are identical.

Ranges:

You can use a hyphen (-) inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. You can have multiple ranges within a character set (to match a single character from any of those ranges). Again, the order of the characters and the ranges does not matter.

Negated Character Classes:

Typing a caret (^) after the opening square bracket will negate the character set. The result is that the character set will match any character that is not in the character set. Unlike the dot, negated character classes also match spaces and (invisible) line break characters.

Note that a caret after the opening square bracket negates the “whole character set”, not just the character immediately after it.

It is important to remember that a negated character class still must match a character. q[^ut] does not mean: “a q not followed by a u or t”. It means: “a q followed by a character that is not a u or t”. It will not match the q in the string “Iraq”. It will match the q and the space after the q in “Iraq is a country”. Indeed: the space will be part of the overall match, because it is the “character that is not a u or t” that is matched by the negated character class in the above regexp.
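The two cases from this paragraph can be checked with a minimal Python sketch:

```python
import re

pattern = re.compile(r"q[^ut]")

# No character follows the q, so the negated class has nothing to match.
no_match = pattern.search("Iraq")

# Here the space after the q is consumed by the negated class.
with_space = pattern.search("Iraq is a country").group(0)

print(no_match, repr(with_space))
```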

Metacharacters Inside Character Sets:

Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*].

Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x.

The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. It is recommended to follow the latter method, since it improves readability.

  • To include a caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret.
  • You can put the closing bracket right after the opening bracket, or the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x.
  • The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
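The positional rules above can be tried out in Python, whose re module follows the same conventions for the caret, the closing bracket and the hyphen. The test strings are arbitrary.

```python
import re

caret = re.findall(r"[x^]", "x^y")          # caret not in first position
bracket = re.findall(r"[]x]", "a]x")        # ] right after the opening [
not_bracket = re.findall(r"[^]x]", "a]x")   # ] right after the negating ^
hyphen_first = re.findall(r"[-x]", "a-x")   # hyphen right after [
hyphen_last = re.findall(r"[x-]", "a-x")    # hyphen right before ]

print(caret, bracket, not_bracket, hyphen_first, hyphen_last)
```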

You can use all non-printable characters in character classes just like you can use them outside of character classes. E.g. [$\u20AC] matches a dollar or euro sign, assuming your regex flavor supports Unicode.

POSIX regular expression:

POSIX regular expressions treat the backslash as a literal character inside character classes. This means you can’t use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-). To use these characters, position them as explained above in this section. This also means that special tokens like shorthands are not available in POSIX regular expressions.

Shorthand Character Classes:

Since certain character classes are used often, a series of shorthand character classes are available:

  • The \d is short for [0-9].
  • The \w stands for word character, usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
  • The \s stands for whitespace character. Again, which characters this actually includes, depends on the regex flavor. In general, it includes [ \t\r\n]. That is: \s will match a space, a tab or a line break. Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.

Shorthand character classes can be used both inside and outside the square brackets. For example, \s\d matches a whitespace character followed by a digit, and [\s\d] matches a single character that is either whitespace or a digit.

Negated Shorthand Character Classes:

The above three shorthands also have negated versions.

  • The \D is the same as [^\d],
  • The \W is short for [^\w] and
  • The \S is the equivalent of [^\s].
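A quick Python sketch of the shorthands and one negated version. The sample strings are made up, and in Python the exact characters covered by \d, \w and \s depend on whether Unicode or ASCII matching is in effect.

```python
import re

text = "Room 42,\tfloor_3"

digits = re.findall(r"\d", text)           # single digits
words = re.findall(r"\w+", text)           # runs of word characters
spaces = re.findall(r"\s", text)           # whitespace characters
non_digits = re.findall(r"\D+", "12ab34")  # runs of non-digits
either = re.findall(r"[\s\d]", "a 1")      # shorthands inside a class

print(digits, words, spaces, non_digits, either)
```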

Character Class Subtraction:

Character class subtraction makes it easy to match any single character present in one list (the character class), but not present in another list (the subtracted class). The syntax for this is:

[class-[subtract]]

Character class subtraction is supported by the XML Schema, XPath, .NET and JGsoft flavors. If the character after a hyphen (-) is an opening bracket, these flavors interpret the hyphen as the subtraction operator rather than the range operator. You can use the full character class syntax within the subtracted character class.

For example, the character class [a-z-[aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single consonant.

Nested Character Class Subtraction:

Since you can use the full character class syntax within the subtracted character class, you can subtract a class from the class being subtracted. [0-9-[0-6-[0-3]]] first subtracts 0-3 from 0-6, yielding [0-9-[4-6]], or [0-37-9], which matches any character in the string 0123789.

Note that class subtraction must always be the last element in the character set. For example, [0-9-[4-6]a-f] is not a valid regular expression. It should be rewritten as [0-9a-f-[4-6]].

Negation takes precedence over subtraction:

The character class [^1234-[3456]] is both negated and subtracted from. In all flavors that support character class subtraction, the base class is negated before it is subtracted from. This class should be read as “(not 1234) minus 3456”. Thus this character class matches any character other than the digits 1, 2, 3, 4, 5, and 6.
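Python's re module has no class subtraction syntax, so the sketch below emulates the consonant and nested-digit examples with a negative lookahead placed in front of the base class. The emulation is an assumption of this sketch, not a feature of the flavors discussed here.

```python
import re

# Emulates [a-z-[aeiuo]]: a lowercase letter that is not a vowel.
consonant = re.compile(r"(?![aeiou])[a-z]")
consonants = consonant.findall("regex")

# Emulates [0-9-[0-6-[0-3]]], which reduces to [0-9-[4-6]].
digits = re.findall(r"(?![4-6])[0-9]", "0123456789")

print(consonants, digits)
```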

Character Class Intersection:

Character class intersection is supported by Java and Ruby. It makes it easy to match any single character that must be present in two sets of characters. The syntax for this is:

[class&&[intersect]]

You can use the full character class syntax within the intersected character class. If the intersected class does not need a negating caret, then you can omit nested square brackets:

[class&&intersect]

For example, the character class [a-z&&[^aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single consonant.

Intersection of Multiple Classes:

You can intersect the same class more than once. [0-9&&0-6&&4-9] is the same as [4-6] as those are the only digits present in all three parts of the intersection. You can write the same regex as [0-9&&[0-6]&&[4-9]], [0-9&&[0-6&&4-9]], [0-9&&[0-6]&&4-9], [0-9&&0-6&&[4-9]], or [0-9&&[0-6&&[4-9]]]. The nested square brackets are only needed if one of the parts of the intersection is negated.

Intersection in Negated Classes:

The character class [^1234&&3456] is both negated and intersected. In Java, negation takes precedence over intersection (similar to its precedence over subtraction). Java reads this regex as “(not 1234) and 3456”. Thus in Java this class is the same as [56] and matches the digits 5 and 6. In Ruby, intersection takes precedence over negation. Ruby reads [^1234&&3456] as “not (1234 and 3456)”. Thus in Ruby this class is the same as [^34] which matches anything except the digits 3 and 4.

Repeating Character Classes:

If you repeat a character class by using the ?, * or + operators, you will repeat the entire character class, and not just the character that it matched.

For example,

The regex [0-9]+ can match “837” as well as “222”.

If you want to repeat the matched character, rather than the class, you will need to use backreferences. ([0-9])\1+ will match “222” but not “837”.
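The difference between repeating the class and repeating the matched character, as a short Python sketch:

```python
import re

samples = ("837", "222")

# [0-9]+ repeats the class: any digits may follow each other.
class_repeat = [re.fullmatch(r"[0-9]+", s) is not None for s in samples]

# ([0-9])\1+ repeats whatever digit the group actually matched.
same_char = [re.fullmatch(r"([0-9])\1+", s) is not None for s in samples]

print(class_repeat, same_char)
```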

Word Sets (Alternation): |

Refer: http://www.regular-expressions.info/alternation.html

We already discussed on how we can use character classes to match a single character out of several possible characters. Alternation is similar. We can use alternation (using pipe symbol |) to match a single regular expression out of several possible regular expressions.

For example, if you want to search for the literal text “cat” or “dog”, separate both options with a vertical bar or pipe symbol: "cat|dog". If you want more options, simply expand the list: cat|dog|mouse|fish.

Alternation operator priority:

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping.

If we want to improve the first example to match whole words only, we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either “cat” or “dog”, and then another word boundary. If we had omitted the round brackets, the regex engine would have searched for “a word boundary followed by cat”, or, “dog” followed by a word boundary.
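A small Python sketch of the grouped and ungrouped variants. The sample sentence is invented to show where the ungrouped version goes wrong.

```python
import re

text = "concatenate hotdog cat"

# Grouped: both word boundaries apply to either alternative.
grouped = [m.group(0) for m in re.finditer(r"\b(cat|dog)\b", text)]

# Ungrouped: reads as "\bcat" OR "dog\b", so the "dog" at the end of
# "hotdog" also matches.
ungrouped = [m.group(0) for m in re.finditer(r"\bcat|dog\b", text)]

print(grouped, ungrouped)
```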

Note, you must be careful about operator precedence. For example,

# Ruby
"A good thing" =~ /^[Gg]ood|[Bb]ad|best/ #=> nil
"A bad thing" =~ /^[Gg]ood|[Bb]ad|best/ #=> 2
"A good thing" =~ /^([Gg]ood|[Bb]ad)|best/ #=> nil
"good thing" =~ /^([Gg]ood|[Bb]ad)|best/ #=> 0

Note that here we used parentheses to override the operator precedence, but they have another use too. Parentheses can be used to remember the matching text.

The Dot

The Dot Matches (Almost) Any Character:

The dot matches a single character, without caring what that character is. The only exceptions are newline characters. In most regex flavors, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).

In almost all programming languages and regex libraries, activating single-line mode has no effect other than making the dot match newlines.

JavaScript originally had no option to make the dot match line break characters; the s (dotAll) flag was added in ES2018. In older JS engines, you can use a character class such as [\s\S] to match any character.

Remember: The dot is not a meta-character inside a character class, so to match a literal . inside a character class we do not need to escape it with a backslash.

Use Negated Character Classes Instead of the Dot:

A negated character class is often more appropriate than the dot. For example,

Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on Put a “string” between double quotes, it matches “string” just fine. Now go ahead and test it on

Houston, we have a problem with "string one" and "string two". Please respond.

Ouch. The regex matches "string one" and "string two". Definitely not what we intended. The reason for this is that the star is greedy.

So, our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is "[^"\r\n]*".
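The greedy dot and the negated character class can be compared side by side in Python, using the test sentence from above:

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot: runs from the first quote to the last one.
greedy = re.findall(r'".*"', text)

# Negated class: cannot cross a quote, so each string is matched on its own.
negated = re.findall(r'"[^"\r\n]*"', text)

print(greedy)
print(negated)
```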

Anchors

Anchors do not match any character at all. Instead, they match a position before, after or between characters.

Line Boundaries: ^ and $:

The character ^ is the “start of line” anchor, and the character $ is the “end of line” anchor. They can be used on single-line and multi-line strings.

Note that in POSIX basic regular expressions, ^ is an anchor only if it is the first character of the pattern, and $ only if it is the last; elsewhere they are literals. In most other flavors, ^ and $ act as anchors wherever they appear outside a character class.

Applying ^a to “abc” matches a. ^b will not match “abc” at all, because the b cannot be matched right after the start of the string, matched by ^. Similarly, c$ matches c in “abc”, while a$ does not match at all.

Zero-Length Matches:

We saw that the anchors match at a position, rather than matching a character. This means that when a regex only consists of one or more anchors, it can result in a zero-length match. Depending on the situation, this can be very useful or undesirable.

Strings, Lines, and Multi-line mode:

Note that a given string may contain multiple lines.

If you have a string consisting of multiple lines, like first line\nsecond line (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string.

Almost all the regex engines have the option to expand the meaning of both anchors. ^ can then match at the start of the string, as well as after each line break. Likewise, $ will still match at the end of the string, and also before every line break. In text editors like EditPad Pro or GNU Emacs, and regex tools like PowerGREP, the caret and dollar always match at the start and end of each line. This makes sense because those applications are designed to work with entire files, rather than short strings.

In most programming languages and libraries, except Ruby, you have to explicitly activate this extended functionality. It is traditionally called multi-line mode. In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m.
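In Python, the equivalent switch is the re.MULTILINE flag. A minimal sketch using the two-line string from above:

```python
import re

text = "first line\nsecond line"

# Default: ^ matches only at the start of the string.
default = re.findall(r"^\w+", text)

# Multi-line mode: ^ also matches after each line break,
# and $ also matches before each line break.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(default, line_starts, line_ends)
```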

String Boundaries: \A and \Z

\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string. These two tokens never match at line breaks. This is true even when you turn on multiline mode.

In EditPad Pro and PowerGREP, where the caret and dollar always match at the start and end of lines, \A and \Z only match at the start and the end of the entire file.

JavaScript, POSIX and XML do not support \A and \Z. You’re stuck with using the caret and dollar for this purpose.

Strings Ending with a Line Break:

Even though \Z and $ only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception: If the string ends with a line break, then \Z and $ will match at the position before that line break, rather than at the very end of the string.
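Flavors differ on this point, so it is worth testing in the one you use. As a sketch in Python: $ matches before a final line break as described, but Python's \Z is stricter and matches only at the very end of the string.

```python
import re

text = "end\n"

# $ matches just before the trailing line break.
dollar = re.search(r"end$", text) is not None

# Python's \Z does not: it only matches at the absolute end of the string.
strict = re.search(r"end\Z", text) is not None

print(dollar, strict)
```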

Word Boundaries:

What is a word character?

In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Some flavors recognize only these ASCII characters as word characters; others also recognize letters and digits from other languages or all of Unicode. Notice that Java supports Unicode for \b but not for \w. Python offers flags to control which characters are word characters (affecting both \b and \w).

A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.

The anchor \b:

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b.

Negated Word Boundary:

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
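A short Python sketch of \b and \B on a made-up sentence:

```python
import re

text = "The cat scattered the catalog."

# Whole word only: "cat" inside "scattered" and "catalog" is skipped.
whole = re.findall(r"\bcat\b", text)

# Without boundaries, all three occurrences are found.
anywhere = re.findall(r"cat", text)

# \B on both sides: only the "cat" buried inside "scattered" qualifies.
inside = re.findall(r"\Bcat\B", text)

print(whole, len(anywhere), inside)
```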

The regex engine is eager:

As we already discussed, the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters.

Suppose you want to use a regex to match a list of function names in a programming language: ‘Get’, ‘GetValue’, ‘Set’ or ‘SetValue’. The obvious solution is Get|GetValue|Set|SetValue. Let’s see how this works out when the string is “SetValue”. It matches “Set” in “SetValue”.
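The sketch below reproduces the problem in Python and shows three common ways around it: listing the longer alternatives first, adding word boundaries, or making the suffix optional. The rewritten patterns are suggestions of this sketch, not the only options.

```python
import re

naive = re.search(r"Get|GetValue|Set|SetValue", "SetValue").group(0)

# Longer alternatives first, so they are tried before their prefixes.
reordered = re.search(r"GetValue|Get|SetValue|Set", "SetValue").group(0)

# Word boundaries force the engine to backtrack past the short "Set".
bounded = re.search(r"\b(?:Get|GetValue|Set|SetValue)\b", "SetValue").group(0)

# Optional suffix: the greedy ? tries "Value" before giving up on it.
optional = re.search(r"(?:Get|Set)(?:Value)?", "SetValue").group(0)

print(naive, reordered, bounded, optional)
```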

Repetition (?, *, +)

Where can they be used?

They apply to the immediately preceding token: a single character, a character class, or a group enclosed in parentheses.

Their Nature:

Consider the following cases:

  • The .* does not mean that the same character is repeated; each repetition of the dot can match a different character. To find a repeated character, you need back-references.
  • The \d+ matches a string of one or more digits. For example 123, 1, 22422
  • The 4+ matches 4, 44, 444, etc.

The Question Mark (?):

The question mark makes the preceding token in the regular expression optional (that is, zero or once).

For example,

  • colou?r matches both ‘colour’ and ‘color’.
  • Nov(ember)? matches ‘Nov’ and ‘November’.

The Asterisk (*):

The asterisk or star tells the engine to attempt to match the preceding token zero or more times.

For example, <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. The angle brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. We could also have used <[A-Za-z0-9]+>. We did not, because this regex would match <1>, which is not a valid HTML tag.

The Plus (+):

The plus tells the engine to attempt to match the preceding token once or more.

Repetition operator ({min, max}):

Modern regex flavors have an additional repetition operator that allows you to specify how many times a token can be repeated.

The syntax is: {min,max}

Here, min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches.

If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +.

Omitting both the comma and max tells the engine to repeat the token exactly min times.

For example, you could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
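Both patterns can be checked against numbers just inside and just outside their ranges. A minimal Python sketch with made-up inputs:

```python
import re

# Exactly four digits, the first one non-zero: 1000 to 9999.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")

# Three to five digits, the first one non-zero: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(four_digit, three_to_five)
```

Without the word boundaries, the first pattern would also match the leading four digits of 10000.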

Their Greediness:

The ‘question mark’, ‘star’, ‘plus’, and ‘repetition operator’ all are greedy.

For example, the question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex Feb 23(rd)? to the string “Today is Feb 23rd, 2003”, the match will always be “Feb 23rd” and not “Feb 23”. Similarly, the * always matches the longest possible string that satisfies the regular expression. For example, ^s(.*)s will match all of “sitting at starbucks”, not just “sitting at s”.

Making them Lazy:

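Lazy quantifiers are sometimes also called ungreedy or reluctant. You can make a quantifier lazy by putting a question mark after the ‘question mark’, ‘plus’, ‘star’, or ‘repetition operator’ in the regex: ??, +?, *?, and {min,max}?. A lazy quantifier matches as few repetitions as possible and expands only when the rest of the regex fails to match.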
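A brief Python comparison of greedy and lazy quantifiers. The HTML snippet is a made-up sample.

```python
import re

html = "<em>first</em>"

greedy_plus = re.search(r"<.+>", html).group(0)   # as much as possible
lazy_plus = re.search(r"<.+?>", html).group(0)    # as little as possible

date = "Feb 23rd, 2003"
greedy_opt = re.search(r"Feb 23(rd)?", date).group(0)   # takes "rd" if it can
lazy_opt = re.search(r"Feb 23(rd)??", date).group(0)    # skips "rd" if it can

print(greedy_plus, lazy_plus, greedy_opt, lazy_opt)
```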

Negative Lookahead:

Refer: http://www.regular-expressions.info/lookaround.html

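Negative lookahead, written (?!pattern), lets you assert that a pattern does not match at the current position without consuming any characters. It is not available everywhere: POSIX basic and extended regular expressions (plain grep and grep -E, for example) do not support it.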
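As a sketch in Python, which supports the (?!...) syntax: unlike the negated character class q[^u], the lookahead does not need a character after the q, so it also matches the q at the end of "Iraq".

```python
import re

q_without_u = re.compile(r"q(?!u)")

at_end = q_without_u.search("Iraq").group(0)   # nothing follows the q
before_u = q_without_u.search("quit")          # the q is followed by u

print(at_end, before_u)
```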

Grouping and Back-references

By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group.

Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special “repetition operator”.

Back-references

Round Brackets create a Back-reference:

Besides grouping part of a regular expression together, round brackets also create a back-reference. A back-reference stores the part of the string matched by the part of the regular expression inside the parentheses.

For example, the regex Set(Value)? matches ‘Set’ or ‘SetValue’. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain Value.

If you do not use the backreference, you can optimize this regular expression into Set(?:Value)?. The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a back-reference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex.

Unless you use non-capturing parentheses, remembering part of the regex match in a back-reference slows down the regex engine, because it has more work to do. If you do not use the back-reference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.
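The difference between the two kinds of parentheses is easy to see in Python. Note that Python reports a group that did not take part in the match as None rather than as an empty string.

```python
import re

# Capturing group: the optional part is remembered when it matches.
with_value = re.search(r"Set(Value)?", "SetValue").group(1)
without_value = re.search(r"Set(Value)?", "Set").group(1)

# Non-capturing group: same match, but nothing is stored.
non_capturing = re.search(r"Set(?:Value)?", "SetValue")

print(with_value, without_value, non_capturing.group(0), non_capturing.groups())
```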

How to Use Back-references:

Back-references allow you to reuse part of the regex match. You can reuse it inside the regular expression, or afterwards.

What you can do with it afterwards, depends on the tool or programming language you are using. The most common usage is in search-and-replace operations. The replacement text will use a special syntax to allow text matched by capturing groups to be reinserted. This syntax differs greatly between various tools and languages, far more than the regex syntax does. Please check the replacement text reference for details.

Using Back-references within The Regular Expression:

Back-references can not only be used after a match has been found, but also during the match. Suppose you want to match a pair of opening and closing HTML tags, and the text in between.

For example,

grep -E '^.*(\b\w+\b).*\1.*$' language python.md

It prints every line in which the text of some word occurs again later on the same line.

For example, by putting the opening tag into a back-reference, we can reuse the name of the tag for the closing tag. Consider <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> regex. This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first back-reference. This back-reference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
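The tag-matching regex can be tried in Python as follows. The HTML fragment is invented for the demonstration and includes one mismatched pair that must not match.

```python
import re

tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

html = "<B>bold</B> and <H1 class='x'>title</H1> but <I>broken</B>"

# \1 must repeat the tag name captured by the opening tag,
# so <I>...</B> is rejected.
pairs = [m.group(0) for m in tag_pair.finditer(html)]
names = [m.group(1) for m in tag_pair.finditer(html)]

print(pairs, names)
```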

For example,

# Ruby
MAX_ALLOWED_REPETITION = 3
input_roman =~ /(.)\1{#{MAX_ALLOWED_REPETITION},}/

The pattern above matches if the string in input_roman contains any character repeated more than 3 times in a row (that is, at least 4 consecutive occurrences).

To figure out the number of a particular back-reference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts back-reference number one, the second number two, etc. Non-capturing parentheses are not counted.

This fact means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the back-references. This can be very useful when modifying a complex regular expression.

Using same back-reference multiple times:

You can reuse the same back-reference more than once. The ([a-c])x\1x\1 will match “axaxa”, “bxbxb” and “cxcxc”.

Last match saved is used:

The regex engine does not permanently substitute back-references in the regular expression. It will use the last match saved into the back-reference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten.

For example, see the difference between ([abc]+) and ([abc])+. Though both successfully match cab, the first regex will put ‘cab’ into the first back-reference, while the second regex will only store ‘b’.
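Both points, reusing a back-reference several times and the group keeping only its last match, can be verified with a short Python sketch:

```python
import re

# The same back-reference used twice.
reuse_ok = re.fullmatch(r"([a-c])x\1x\1", "bxbxb") is not None
reuse_bad = re.fullmatch(r"([a-c])x\1x\1", "axbxa") is not None

# Quantifier inside the group: the group holds the whole run.
whole_run = re.search(r"([abc]+)", "cab").group(1)

# Quantifier outside the group: each iteration overwrites the previous one.
last_iteration = re.search(r"([abc])+", "cab").group(1)

print(reuse_ok, reuse_bad, whole_run, last_iteration)
```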

Named Capturing Groups

Refer: http://www.regular-expressions.info/named.html

Branch Reset Groups

Refer: http://www.regular-expressions.info/branchreset.html

Regex Matching Modes

Most regular expression engines support the following five matching modes:

  • /i makes the regex match case insensitive.
  • /g performs a global match (find all matches rather than stopping after the first match).
  • /s enables single-line mode. In this mode, the dot matches newlines.
  • /m enables multi-line mode. In this mode, the caret matches after and the dollar matches before each newline in the subject string.
  • /x enables free-spacing mode. In this mode, whitespace between regex tokens is ignored, and an unescaped # starts a comment.

Two languages that don’t support all five are JavaScript and Ruby. Some regex flavors also have additional modes or options that have single letter equivalents. These are very implementation-dependent.
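For comparison, here is roughly how these modes map onto Python's re module. This is a sketch: Python has no /g flag, and findall or finditer play that role instead.

```python
import re

ignore_case = re.search(r"regex", "REGEX", re.IGNORECASE) is not None  # /i
all_matches = re.findall(r"\d", "a1b2")                                # /g
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None             # /s
line_starts = re.findall(r"^\w", "x\ny", re.MULTILINE)                 # /m

# /x: whitespace is ignored and # starts a comment.
year_month = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", re.VERBOSE)
verbose_ok = year_month.fullmatch("2020-08") is not None
```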

Mode Modifiers

Normally, matching modes are specified outside the regular expression. In a programming language, you pass them as a flag to the regex constructor or append them to the regex literal.

Sometimes, the tool or language does not provide the ability to specify matching options. In those situations, you can add the mode modifiers to the start of the regex.

Refer: http://www.regular-expressions.info/modifiers.html
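Python accepts such inline modifiers at the very start of the pattern. A minimal sketch:

```python
import re

# (?i) switches on case-insensitive matching from inside the pattern.
inline_i = re.search(r"(?i)regex", "REGEX").group(0)

# (?m) switches on multi-line mode, so ^ matches at each line start.
inline_m = re.findall(r"(?m)^\w+", "one\ntwo")

print(inline_i, inline_m)
```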

Atomic Grouping & Rest of the Topics

Refer: http://www.regular-expressions.info/atomic.html

The remaining topics from http://www.regular-expressions are not covered here.

Regex Performance

Backtracking regex engines can be slow and can have exponential running time on certain patterns. RE2 is an open source implementation that prevents slow matches by guaranteeing running time linear in the input size, at the cost of not supporting features such as backreferences.

Its Python wrapper is pre2 (but note that it is not stable and not necessarily faster than re for every specific regex). Also, there is a Ruby wrapper around the existing regex engine, safe_regexp, which fails a regex if it takes longer than a given timeout.

Also read Catastrophic Backtracking.

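Consider the following example using Python's re module:

import re
# Nested quantifiers: (\w+)+ can split a run of word characters in exponentially many ways
template = re.compile(r"(\w+)+\.")
# 30 word characters and no dot, so every possible split is tried before the search fails
target = "a" * 30
template.search(target)

Here, search() takes minutes to complete and CPU usage goes to 100%. Interestingly, if the string is shorter than about 20-25 characters, search() returns almost instantly, because the number of ways to split the string roughly doubles with each added character.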
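One way to avoid the blow-up is to remove the nested quantifier. The sketch below assumes the captured group is not needed: under that assumption, \w+\. matches the same text as (\w+)+\. but gives the engine only one way to split a run of word characters.

```python
import re

# Flat version of (\w+)+\. with no nested quantifier.
flat = re.compile(r"\w+\.")

# The failing input from the example above now returns immediately.
no_dot = flat.search("a" * 30)

# A matching input still matches as before.
with_dot = flat.search("a" * 30 + ".").group(0)

print(no_dot, len(with_dot))
```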

Using Regex in Linux Shell

Bash:

As of version 3, Bash has acquired its own RE-match operator: =~.

Refer: http://www.tldp.org/LDP/abs/html/regexp.html

But, I find the easiest way to be using grep -E, with format such as:

# prints matched lines with matching words colored
echo "some string in double quotes" | grep --color=always -E "SOME_REGEX_PATTERN"
# prints only matching words
echo "some string in double quotes" | grep --color=always -E "SOME_REGEX_PATTERN" -o

Zsh:

The operator =~ works in Zsh as well.

References
