Regular Expressions (Regex)
Introduction :
A regular expression is a pattern describing a certain amount of text. In this tutorial, regular expressions are printed between guillemets: «medium».
- «medium» is a valid regex, which matches the “medium” literal text in any given string or text.
- «\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b» is a more complex pattern. It describes a series of letters, digits, dots, underscores, percentage signs, plus signs and hyphens, followed by an at sign, followed by another series of letters, digits, dots and hyphens, finally followed by a single dot and between two and four letters. In other words: this pattern describes an email address. Because the pattern lists only upper-case letters, it has to be applied case insensitively to cover lower-case addresses.
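To make the email pattern concrete, here is a minimal sketch using Python's `re` module. The choice of Python, the `re.IGNORECASE` flag and the sample address are assumptions made for illustration; they are not part of the original pattern description.

```python
import re

# The email pattern from the text, written with plain ASCII hyphens.
# The pattern only lists upper-case letters, so re.IGNORECASE is assumed
# here to let it match lower-case addresses too.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

sample = "Contact jane.doe@example.com for details."  # made-up address
found = EMAIL.search(sample)

# Prints the matched address, or None when nothing matches.
print(found.group() if found else None)

# A string without an at sign cannot match at all.
no_match = EMAIL.search("no address here")
print(no_match)
```

The pattern is a rough description of an email address, not a validator: it will accept some invalid addresses and reject some valid ones.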
Simple examples :
- The most basic regular expression consists of a single literal character, e.g.: «k». It will match the first occurrence of that character in the string. If the string is “Ashoka is the greatest among all kings”, it will match the “k” after the “o” in “Ashoka”.
- Similarly, the regex «cat» will match “cat” in “About cats and dogs”. This regular expression consists of a series of three literal characters. This is like saying to the regex engine: find a «c», immediately followed by an «a», immediately followed by a «t». Note that regex engines are case sensitive by default. «cat» does not match “Cat”, unless you tell the regex engine to ignore differences in case.
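The two literal examples above can be tried out as follows. This is a small sketch in Python (an assumption, since the tutorial is not tied to one language); `re.IGNORECASE` is Python's way of telling the engine to ignore differences in case.

```python
import re

# A single literal character matches its first occurrence in the string.
text = "Ashoka is the greatest among all kings"
first_k = re.search(r"k", text)
print(first_k.start())  # 0-based index of the "k" in "Ashoka"

# Three literal characters in a row: "c", then "a", then "t".
cat = re.search(r"cat", "About cats and dogs")
print(cat.span())  # (start, end) of the match inside the string

# Matching is case sensitive by default...
case_sensitive = re.search(r"cat", "Cat")
# ...unless the engine is told to ignore case.
case_insensitive = re.search(r"cat", "Cat", re.IGNORECASE)
print(case_sensitive, case_insensitive.group())
```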
→ Regex Characters :
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use.
They are,
- The opening square bracket «[»
- The backslash «\»
- The caret «^»
- The dollar sign «$»
- The period or dot «.»
- The vertical bar or pipe symbol «|»
- The question mark «?»
- The asterisk or star «*»
- The plus sign «+»
- The opening round bracket «(»
- The closing round bracket «)».
These special characters are often called “meta-characters”.
To use metacharacters in your regex strings as literals, you need to escape them with a backslash (\). If you want to match “1+3=4”, the correct regex is «1\+3=4». Otherwise, the plus sign will have a special meaning.
- Escaping a single metacharacter with a backslash works in all regex engines.
- Many regex engines, though not all, support the \Q … \E escape sequence (Perl, PCRE and Java do, for example). All the characters between the \Q and the \E are interpreted as literal characters. E.g. «\Q*\d+*\E» matches the literal text „*\d+*”. The \E may be omitted at the end of the regex, so «\Q*\d+*» is the same as «\Q*\d+*\E».
- You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use «\xA9».
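The escaping rules above can be sketched in Python as follows. One caveat: Python's `re` module does not support «\Q … \E», so the sketch uses `re.escape` as a stand-in that produces the same effect. That substitution is an assumption of this example, not something the escape sequence itself does.

```python
import re

# Escaped plus sign: matches the literal text "1+3=4".
escaped = re.search(r"1\+3=4", "so 1+3=4 holds")

# Unescaped, "+" is a quantifier: the regex means one or more "1", then "3=4".
unescaped = re.search(r"1+3=4", "so 1+3=4 holds")   # no match in this string
unescaped_other = re.search(r"1+3=4", "113=4")      # matches "113=4" instead

# Stand-in for \Q...\E: re.escape backslash-escapes every metacharacter,
# so the whole text is treated as literal characters.
literal = re.escape(r"*\d+*")
quoted = re.search(literal, r"x*\d+*y")

# A character given by its hexadecimal code: 0xA9 is the copyright symbol.
copyright_sign = re.search(r"\xA9", "Copyright \u00a9 by the author")

print(escaped.group(), unescaped, unescaped_other.group())
print(quoted.group(), copyright_sign.group())
```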
Note :
In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler before the regex library sees the string. So the regex «1\+1=2» must be written as “1\\+1=2” in C++, Objective-C and Swift source code. The compilers for these languages turn the escaped backslash in the source code into a single backslash in the string that is passed on to the regex library. To match „c:\temp”, you need to use the regex «c:\\temp». As a string in the source code of the languages mentioned above, this regex becomes “c:\\\\temp”. Four backslashes to match a single one indeed.
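The same two layers of escaping can be seen in Python, used here only as an illustration because its ordinary string literals treat the backslash much like the languages named above. Python's raw string literals (the `r"..."` prefix) are an extra convenience that skips the first layer.

```python
import re

# The text we want to find: c:\temp (7 characters, one backslash).
path = "c:\\temp"
print(len(path))

# Ordinary literal: four backslashes in the source become two in the string,
# and those two are the regex for one literal backslash.
plain = "c:\\\\temp"

# Raw literal: the source text is passed to the regex library unchanged.
raw = r"c:\\temp"

print(plain == raw)                    # both spell the regex c:\\temp
same_regex = plain == raw
match = re.search(plain, path)
print(match.group() == path)

# The simpler example from the text: regex 1\+1=2 as an ordinary literal.
plus = re.search("1\\+1=2", "1+1=2")
```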
→ Regex Engine :
A regular expression “engine” is a piece of software that can process regular expressions, trying to match the pattern to the given string.
There are two kinds of regular expression engines.
- Text-directed Engine
- Regex-directed Engine, which is more popular because it supports lazy quantifiers and back-references.
You can easily find out whether the regex flavour you intend to use has a text-directed or regex-directed engine. If back-references and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex «regex|regex not» to the string “regex not”. If the resulting match is only „regex”, the engine is regex-directed. If the result is „regex not”, then it is text-directed. The reason behind this is that the regex-directed engine is “eager”.
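The test described above takes only a few lines to run. The sketch below applies it to Python's `re` module, which supports back-references and lazy quantifiers and therefore should behave as a regex-directed engine.

```python
import re

# Apply «regex|regex not» to the string "regex not".
# A regex-directed engine is eager: it takes the first alternative that
# works and stops. A text-directed engine would return the longer match.
result = re.search(r"regex|regex not", "regex not").group()

kind = "regex-directed" if result == "regex" else "text-directed"
print(repr(result), "->", kind)
```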
Note :
The regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
When applying «cat» to “He captured a catfish for his cat.”, the engine will try to match the first token in the regex «c» to the first character in the string “H”. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the «c» with the “e”. This fails too, as does matching the «c» with the space. Arriving at the 4th character in the string, «c» matches „c”. The engine will then try to match the second token «a» to the 5th character, „a”. This succeeds too. But then, «t» fails to match “p”. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: “a”. Again, «c» fails to match here and the engine carries on. At the 15th character in the string, «c» again matches „c”. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that «a» matches „a” and «t» matches „t”.
The entire regular expression could be matched starting at character 15. The engine is “eager” to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.
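The walk-through above can be checked with a short Python sketch. Note that Python counts positions from 0, so the 15th character of the text is index 14.

```python
import re

sentence = "He captured a catfish for his cat."

# search() reports the leftmost match and then stops looking.
first = re.search(r"cat", sentence)
print(first.span())                 # 0-based (start, end) of the match
print(sentence[first.start():first.start() + 7])  # the word it was found in

# The later, stand-alone word "cat" is only seen if we explicitly ask
# the engine to keep going after the first match.
all_starts = [m.start() for m in re.finditer(r"cat", sentence)]
print(all_starts)
```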
→ Character Classes or Character Sets :
With a “character class”, also called “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use «[ae]».
You could use this in «gr[ae]y» to match either „gray” or „grey”. A character class matches only a single character. «gr[ae]y» will not match “graay”, “graey” or any such thing. The order of the characters inside a character class does not matter. The results are identical.
You can use a hyphen inside a character class to specify a range of characters. «[0-9]» matches a single digit between 0 and 9. You can use more than one range. «[0-9a-fA-F]» matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. «[0-9a-fxA-FX]» matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.
Note : The only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
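A brief Python sketch of the character classes and ranges described above; the test strings are made up for illustration.

```python
import re

# A character class matches exactly one character from the set.
words = "gray grey graay graey"
spellings = re.findall(r"gr[ae]y", words)
print(spellings)

# The order inside the class makes no difference.
same = re.findall(r"gr[ea]y", words) == spellings
print(same)

# Several ranges in one class: a single hexadecimal digit.
hex_digit = re.fullmatch(r"[0-9a-fA-F]", "B")
not_hex = re.fullmatch(r"[0-9a-fA-F]", "G")

# Ranges combined with single characters: a hexadecimal digit or the letter X.
hex_or_x = re.findall(r"[0-9a-fxA-FX]", "0x1G")
print(hex_digit.group(), not_hex, hex_or_x)
```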
→ Negated Character Classes :
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class.
It is important to remember that a negated character class still must match a character. «q[^u]» does not mean: “a q not followed by a u”. It means: “a q followed by a character that is not a u”. It will not match the q in the string “Iraq”. It will match the q and the space after the q in “Iraq is a country”. Indeed: the space will be part of the overall match, because it is the “character that is not a u” that is matched by the negated character class in the above regex.
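The behaviour of «q[^u]» is easy to confirm with a short Python sketch:

```python
import re

# The negated class must consume a character, so a "q" at the very end
# of the string cannot match.
at_end = re.search(r"q[^u]", "Iraq")
print(at_end)

# Here the class consumes the space, which becomes part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country")
print(repr(with_space.group()))

# A "q" followed by "u" is rejected by the class.
followed_by_u = re.search(r"q[^u]", "queen")
print(followed_by_u)
```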
As noted above, the usual metacharacters are normal characters inside a character class and do not need to be escaped by a backslash. To search for a star or plus, use «[+*]». Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. «[\\x]» matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. «[x^]» matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. «[]x]» matches a closing bracket or an x. «[^]x]» matches any character that is not a closing bracket or an x. The hyphen can be placed right after the opening bracket, right after the negating caret, or right before the closing bracket. «[-x]» and «[x-]» both match an x or a hyphen.
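The placement rules above are sketched below in Python. They were only checked against Python's `re` module here; other flavours may treat a leading closing bracket differently (JavaScript, for instance, reads «[]» as an empty class), so treat the results as specific to this engine.

```python
import re

# Usual metacharacters are literal inside a class: a plus or a star.
star_or_plus = re.findall(r"[+*]", "a+b*c")

# A backslash inside a class still has to be escaped.
backslash_or_x = re.findall(r"[\\x]", r"a\x")

# A caret anywhere except right after the opening bracket is literal.
x_or_caret = re.findall(r"[x^]", "x^y")

# A closing bracket right after the opening bracket (or after the
# negating caret) is literal.
bracket_or_x = re.findall(r"[]x]", "a]x")
neither = re.findall(r"[^]x]", "a]x")

print(star_or_plus, backslash_or_x, x_or_caret, bracket_or_x, neither)
```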
→ Shorthand Character Classes :
Since certain character classes are used often, a series of shorthand character classes are available.
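- «\d» is short for «[0-9]».
- «\w» stands for “word character”. Exactly which characters it matches differs between regex flavours. In all flavours, it will include «[A-Za-z]».
- «\s» stands for “whitespace character”. Again, which characters this includes depends on the flavour, but it always covers the space and the tab.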
Shorthand character classes can be used both inside and outside the square brackets. «\s\d» matches a whitespace character followed by a digit. «[\s\d]» matches a single character that is either whitespace or a digit. When applied to “1 + 2 = 3”, the former regex will match „ 2” (space two), while the latter matches „1” (one).
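The difference between the two regexes can be seen directly in a small Python sketch:

```python
import re

expr = "1 + 2 = 3"

# Outside brackets: a whitespace character followed by a digit.
sequence = re.search(r"\s\d", expr)
print(repr(sequence.group()))

# Inside brackets: one character that is either whitespace or a digit.
either = re.search(r"[\s\d]", expr)
print(repr(either.group()))
```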
→ Negated Shorthand Character Classes :
The above shorthands also have negated versions.
- «\D» is the same as «[^\d]».
- «\W» is short for «[^\w]».
- «\S» is the equivalent of «[^\s]».
We need to be careful when using the negated shorthands inside square brackets. «[\D\S]» is not the same as «[^\d\s]». The latter will match any character that is not a digit or whitespace. So it will match „x”, but not “8”. The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, «[\D\S]» will match any character, digit, whitespace or otherwise.
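A quick Python sketch of the pitfall described above, using a made-up three-character string that contains a letter, a space and a digit:

```python
import re

sample = "x 8"

# Neither a digit nor whitespace: only the letter qualifies.
strict = re.findall(r"[^\d\s]", sample)
print(strict)

# "Not a digit" OR "not whitespace": every character satisfies at least
# one of the two, so everything matches.
loose = re.findall(r"[\D\S]", sample)
print(loose)
```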
→ Repeating Character Classes :
If you repeat a character class by using the «?», «*» or «+» operators, you will repeat the entire character class, and not just the character that it matched. The regex «[0-9]+» can match „837” as well as „222”.
If you want to repeat the matched character, rather than the class, you will need to use back-references. «([0-9])\1+» will match „222” but not “837”. When applied to the string “833337”, it will match „3333” in the middle of this string.
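Repeating the class versus repeating the matched character can be compared with this Python sketch:

```python
import re

# Repeating the class: each repetition may match a different digit.
mixed = re.search(r"[0-9]+", "837").group()
uniform = re.search(r"[0-9]+", "222").group()

# Repeating the matched character: \1 refers back to whatever the
# first group captured, so all digits must be identical.
same_digit = re.search(r"([0-9])\1+", "222").group()
no_repeat = re.search(r"([0-9])\1+", "837")
middle = re.search(r"([0-9])\1+", "833337").group()

print(mixed, uniform, same_digit, no_repeat, middle)
```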