Working with Regular Expressions

Sachini Navodani
SachiniNP
5 min read · Mar 24, 2019


In a regular expression (RE), everything we write is essentially a character. We write patterns to match sequences of characters, and those patterns are themselves strings. Patterns normally use ASCII characters, but Unicode characters can also be used. REs are particularly useful when we need to extract information from text. The text can be in any format, such as source code, log files, spreadsheets or other text documents.

Letters Matching

To match a sequence of letters, we can simply type those letters. For example, to select text containing the sequence ‘abc’ we can use the pattern ‘abc’. All three lines below are matched by the pattern ‘abc’.
abcdej
abcklsi
abc
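
As a minimal sketch of the idea above, Python's standard re module can check which lines contain the literal sequence. The extra line 'xyz' is a hypothetical non-matching input added for contrast.

```python
import re

# The three lines from the example, plus one hypothetical line that should not match.
lines = ["abcdej", "abcklsi", "abc", "xyz"]

# A plain sequence of letters matches itself wherever it appears in the text.
pattern = re.compile(r"abc")

matched = [line for line in lines if pattern.search(line)]
print(matched)  # ['abcdej', 'abcklsi', 'abc']
```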

Digits Matching

Sequences of digits can be matched literally, in the same way as letters. In addition, \d matches any single digit from 0 to 9, and \D matches any single non-digit character.
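
A short sketch of \d and \D using Python's re module; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text containing a few digits.
text = "Order 66 shipped in 3 days"

# \d matches one digit at a time, so findall returns each digit separately.
digits = re.findall(r"\d", text)

# \D matches every non-digit character; replacing those with "" keeps only digits.
only_digits = re.sub(r"\D", "", text)

print(digits)       # ['6', '6', '3']
print(only_digits)  # 663
```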

Matching any Character — The Dot (.)

To match text whose exact content we do not know, the dot (.) can be used. The dot matches any single character (in most engines, any character except a newline), while \. matches a literal period (.) character.
example-:
match-:
agh.
743.
?=+.
skip-:
abc1
Here we can use the pattern ...\. (three dots followed by an escaped period)
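
The following sketch assumes the pattern is three dots followed by an escaped period, and uses Python's re.fullmatch so that the whole string has to fit the pattern.

```python
import re

# Three "any character" dots, then a literal period.
pattern = re.compile(r"...\.")

candidates = ["agh.", "743.", "?=+.", "abc1"]

# fullmatch succeeds only if the entire string fits the pattern.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)  # ['agh.', '743.', '?=+.']
```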

Matching Specific Characters

We can match specific characters by listing them inside square brackets. For example, [abc] matches a single a, b or c.
example-:
match-:

run
sun
fun
skip-:
lun
bun
cun
Here we can use the pattern [rsf]un
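
A quick way to try the character class above is the following sketch with Python's re module.

```python
import re

# [rsf] matches exactly one character, which must be r, s or f.
pattern = re.compile(r"[rsf]un")

words = ["run", "sun", "fun", "lun", "bun", "cun"]
matched = [w for w in words if pattern.fullmatch(w)]
skipped = [w for w in words if not pattern.fullmatch(w)]

print(matched)  # ['run', 'sun', 'fun']
print(skipped)  # ['lun', 'bun', 'cun']
```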

Excluding Specific Characters

To match any single character except a, b or c, [^abc] can be used.
example-:
match-:

car
far
skip-:
war
Here we can use the pattern [^w]ar
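
One detail worth checking is that a negated class still consumes exactly one character. The sketch below uses Python's re module; the bare string 'ar' is a hypothetical extra input added to show this.

```python
import re

# [^w] matches any single character other than 'w'.
pattern = re.compile(r"[^w]ar")

words = ["car", "far", "war", "ar"]  # "ar" is a hypothetical extra case
matched = [w for w in words if pattern.fullmatch(w)]

# "war" is rejected because of the 'w'; "ar" is rejected because
# [^w] needs some character to be present before "ar".
print(matched)  # ['car', 'far']
```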

Matching Character Ranges

To match a character from a sequential range of characters, we can use square brackets with a dash to indicate the range. For example, we can use [0-5] to match any single digit from 0 to 5 and [^a-d] to match any character except the letters a to d.
example-:
match-:

Apl
Bqm
Crn
skip-:
aax
bby
ccz
Here we can use the pattern [A-C][p-r][l-n]

To match any alphanumeric character (or underscore), \w can be used; for ASCII text it is equivalent to the longer form [A-Za-z0-9_]. To match any character that \w does not match, \W can be used.
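
The sketch below tries the range pattern from the example and compares \w with its longer form. It assumes Python's re module, where the re.ASCII flag restricts \w to ASCII characters; the sample strings in the second half are made up.

```python
import re

# One character from each range, in order.
range_pattern = re.compile(r"[A-C][p-r][l-n]")
words = ["Apl", "Bqm", "Crn", "aax", "bby", "ccz"]
matched = [w for w in words if range_pattern.fullmatch(w)]
print(matched)  # ['Apl', 'Bqm', 'Crn']

# With re.ASCII, \w and [A-Za-z0-9_] should accept the same strings.
short_form = re.compile(r"\w+", re.ASCII)
long_form = re.compile(r"[A-Za-z0-9_]+")
samples = ["abc_123", "hello world", "x-1"]  # hypothetical inputs
short_results = [bool(short_form.fullmatch(s)) for s in samples]
long_results = [bool(long_form.fullmatch(s)) for s in samples]
print(short_results == long_results)  # True
```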

Matching Repetitions

Curly braces are used to specify the number, or the range, of repetitions. For example, to match ‘a’ exactly 6 times a{6} is used, and to match ‘a’ at least 3 but no more than 6 times a{3,6} can be used. This notation can also be used with metacharacters: [abc]{5} will match 5 characters, each of which can be an ‘a’, ‘b’ or ‘c’. To match any character 3 to 6 times, the notation .{3,6} can be used.
example-:
match-:

abccccccef
abcccef
skip-:
abcef
Here we can use the pattern abc{3,6}ef
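
A small sketch of the repetition example using Python's re module. The quantifier is written without a space inside the braces, and it applies only to the ‘c’ immediately before it.

```python
import re

# "ab", then between 3 and 6 occurrences of "c", then "ef".
pattern = re.compile(r"abc{3,6}ef")

candidates = ["abccccccef", "abcccef", "abcef"]
matched = [s for s in candidates if pattern.fullmatch(s)]

# "abcef" has only one "c", which is fewer than the minimum of 3.
print(matched)  # ['abccccccef', 'abcccef']
```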

Matching Zero or more Repetitions with Kleene Star (*)

To match zero or more of any character, the notation .* is used, and to match any number of digits (including none), the notation \d* is used.

Matching 1 or more Repetitions with Kleene Plus (+)

To ensure that the input string has at least one digit, the notation \d+ can be used. Similarly, a+ matches one or more ‘a’s and [abc]+ matches one or more of any of ‘a’, ‘b’ or ‘c’.
example-:
match-:

aaaabcc
aabbbbc
aacc
skip-:
a
Here we can use the pattern aa+b*c+
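
The pattern above can be checked with the following sketch in Python's re module.

```python
import re

# "a", then one or more "a", then zero or more "b", then one or more "c".
pattern = re.compile(r"aa+b*c+")

candidates = ["aaaabcc", "aabbbbc", "aacc", "a"]
matched = [s for s in candidates if pattern.fullmatch(s)]

# "aacc" matches because b* is allowed to match nothing;
# "a" fails because the pattern needs at least two "a"s and one "c".
print(matched)  # ['aaaabcc', 'aabbbbc', 'aacc']
```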

Matching Optional Character

The ? (question mark) denotes optionality. It matches either zero or one of the preceding character or group of characters. For example, ‘b’ is optional in the pattern ab?c, so it matches either the string ‘abc’ or ‘ac’. As ? is a special character, to match a plain ? in text we have to use the notation \?.
example-:
match-:

6 text there?
89 texts there?
673 texts there?
skip-:
No texts there.
Here we can use the pattern \d+ texts? there\?
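
As a sketch, the same pattern can be tried in Python's re module. Note the two different question marks: one makes the ‘s’ optional, the other is escaped and matches a literal ‘?’.

```python
import re

# "s?" makes the plural "s" optional; "\?" matches a real question mark.
pattern = re.compile(r"\d+ texts? there\?")

lines = [
    "6 text there?",
    "89 texts there?",
    "673 texts there?",
    "No texts there.",
]
matched = [line for line in lines if pattern.fullmatch(line)]
print(matched)  # ['6 text there?', '89 texts there?', '673 texts there?']
```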

Matching Whitespace

The most common forms of whitespace are the space ( ), the tab (\t), the newline (\n) and the carriage return (\r). The notation \s matches any of the whitespace characters mentioned above, and \S matches any non-whitespace character.
example-:
match-:

1. abc
2. abc
3. abc
skip-:
4.abc
Here we can use the pattern \d\.\s+abc
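
The sketch below tries the pattern with different kinds of whitespace using Python's re module. The tab and the run of spaces are assumptions made for illustration, since the example lines above show only a single space.

```python
import re

# A digit, a literal period, one or more whitespace characters, then "abc".
pattern = re.compile(r"\d\.\s+abc")

lines = ["1. abc", "2.\tabc", "3.    abc", "4.abc"]
matched = [line for line in lines if pattern.fullmatch(line)]

# "4.abc" is skipped because \s+ needs at least one whitespace character.
print(matched)  # ['1. abc', '2.\tabc', '3.    abc']
```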

Matching Starts and Ends

It is important to write REs that are as specific as possible, so a pattern that pins down both the start and the end of the text is very useful. Using ^ (hat) to denote the start and $ (dollar sign) to denote the end, we can write a pattern that matches one particular line in full. These notations are especially important for pattern matching in log files.
example-:
match-:

Test: passed
skip-:
Final Test: failed
Next Test: passed only for case1
Here we can use the pattern ^Test: passed$
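
To see what the anchors add, the sketch below compares the anchored pattern with the same pattern without ^ and $, using re.search from Python's re module.

```python
import re

lines = [
    "Test: passed",
    "Final Test: failed",
    "Next Test: passed only for case1",
]

anchored = re.compile(r"^Test: passed$")
unanchored = re.compile(r"Test: passed")

# Without anchors, the text may appear anywhere inside a longer line.
loose_hits = [line for line in lines if unanchored.search(line)]
# With anchors, the line must consist of exactly this text.
strict_hits = [line for line in lines if anchored.search(line)]

print(loose_hits)   # ['Test: passed', 'Next Test: passed only for case1']
print(strict_hits)  # ['Test: passed']
```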

Matching Groups

When extracting information for further processing, it is useful to define groups of characters. Groups are defined inside parentheses ( ). For example, to list all the image files in a particular directory, a pattern like ^(IMG\d+\.png)$ can be used. If we want to extract only the filenames without the extension, the pattern ^(IMG\d+)\.png$ can be used; here the group captures only the part before the period.
example-:
match-:
text_daily.pdf
text_73638.pdf
skip-:
file_filename.pdf.tmp
Here we can use the pattern ^(text.+)\.pdf$
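
A minimal sketch of extracting the captured part with Python's re module, where group(1) returns the text matched by the first pair of parentheses.

```python
import re

# The parentheses capture everything before the ".pdf" extension.
pattern = re.compile(r"^(text.+)\.pdf$")

names = ["text_daily.pdf", "text_73638.pdf", "file_filename.pdf.tmp"]

stems = []
for name in names:
    match = pattern.search(name)
    if match:
        stems.append(match.group(1))

print(stems)  # ['text_daily', 'text_73638']
```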

Nested Groups
Captured groups are numbered in the order of their opening parentheses. For example, to capture both the filenames and the image numbers of image files in a particular directory, the pattern ^(IMG(\d+))\.png$ can be used.
example-:
match-:

March 2006
April 2010
May 2018
Here we can use the pattern (\w+ (\d+)) to capture the month and year as one group and the year again in a separate group.
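
The numbering of nested groups can be inspected with the following sketch in Python's re module, where group(0) is the whole match.

```python
import re

# Group 1 is the outer parentheses (month and year); group 2 is the inner one (year).
pattern = re.compile(r"(\w+ (\d+))")

match = pattern.search("March 2006")
whole = match.group(0)
month_and_year = match.group(1)
year = match.group(2)

print(whole)           # March 2006
print(month_and_year)  # March 2006
print(year)            # 2006
```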

All the quantifiers like the star *, plus +, repetition {m,n} and the question mark ? can be used in capturing groups.
example-:
match-:

1234x567
6789x4739
4327x832
Here we can use the pattern (\d{4})x(\d+) to capture both the length and the width in each line.
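
As a sketch, the two captured groups can be read out together with groups() in Python's re module and converted to integers. Treating the first number as the length and the second as the width is an assumption taken from the sentence above.

```python
import re

# Group 1: exactly four digits; group 2: one or more digits.
pattern = re.compile(r"(\d{4})x(\d+)")

lines = ["1234x567", "6789x4739", "4327x832"]

sizes = []
for line in lines:
    match = pattern.fullmatch(line)
    if match:
        length, width = match.groups()
        sizes.append((int(length), int(width)))

print(sizes)  # [(1234, 567), (6789, 4739), (4327, 832)]
```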

Using the | (logical OR, aka. the pipe) with Groups
The | is used to separate alternative sets of characters. These alternatives can be metacharacters or sequences of characters. For example, to match ‘car’, ‘far’, ‘van’ or ‘man’, the pattern ([cf]ar|[vm]an) can be used.
example-:
match-:

They are dolphins
They are whales
skip-:
They are turtles
They are seals
Here we can use the pattern They are (dolphin|whale)s
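
The alternation example can be tried with this sketch in Python's re module. The group also captures which alternative was found.

```python
import re

# The group matches either "dolphin" or "whale"; the trailing "s" is outside it.
pattern = re.compile(r"They are (dolphin|whale)s")

lines = [
    "They are dolphins",
    "They are whales",
    "They are turtles",
    "They are seals",
]

animals = []
for line in lines:
    match = pattern.fullmatch(line)
    if match:
        animals.append(match.group(1))

print(animals)  # ['dolphin', 'whale']
```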

Additional Metacharacters
\b matches the boundary between a word character and a non-word character. This can be useful for capturing entire words, for example with \w+\b.
When referencing captured groups, the following metacharacters can be used.
\0 for referencing the full matched text
\1 for referencing the first group
\2 for referencing the second group, and so on
These metacharacters are really useful in a situation like search and replace in a text editor, for example to swap two numbers. We can search for (\d+)-(\d+) and replace it with \2-\1.
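
The swap described above can be sketched with re.sub from Python's re module; the sample strings are made up. One caveat for Python specifically: its replacement strings refer to the full match as \g<0> rather than \0, while \1 and \2 work as described.

```python
import re

# Swap the two numbers around the dash using group references.
text = "10-20 and 300-4"  # hypothetical input
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", text)
print(swapped)  # 20-10 and 4-300

# \g<0> inserts the whole match, here wrapped in square brackets.
bracketed = re.sub(r"\d+-\d+", r"[\g<0>]", text)
print(bracketed)  # [10-20] and [300-4]

# \b keeps \w+ aligned to whole words.
words = re.findall(r"\b\w+\b", "swap these words")
print(words)  # ['swap', 'these', 'words']
```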
