Regex — The Good, the Bad and the Basics

Computers, being computers, are great at most things. They allow us to store and display a vast amount of information, they speedily connect us with others and they are fantastic number crunchers. However, they are pony and trap at identifying patterns. This is where Regular Expressions, otherwise known as Regex, come in.

What is Regex?

Regex is a way of describing complex search patterns using a sequence of characters.

So... what does that mean?

Before deciding to write this blog I’d only really seen Regex being passed as an argument when using the #split method on a String, like so:

str = "This string contains: a colon, a comma and an exclamation mark!"
str.split(/\W+/)
# => ["This", "string", "contains", "a", "colon", "a", "comma", "and", "an", "exclamation", "mark"]

There it was, my first Regex: /\W+/. Not so scary, is it? However, this is only a very simple Regex which, in pseudo-English, states [start of Regex][one or more non-word characters][end of Regex]. The resulting array comes from the #split method dividing the string wherever the pattern matches, so in this instance at every run of characters that aren't word characters. Each division (sub-string) is then shovelled into an array.

As you can imagine, the complexity of a Regex grows quickly the more specific the pattern you are trying to match. Here is an example of a Regex used to identify valid UK postcodes:

/^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$/

Don’t worry if this looks complex. It’s supposed to, and it’s the main reason why a lot of developers stay well clear of Regex. However, by the time you finish reading this, expressions like this one will be far less incomprehensible.
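
To see a pattern like this in action, here is a minimal Ruby sketch that runs a few strings past the postcode Regex. The constant UK_POSTCODE and the helper postcode? are names made up for illustration, and passing the pattern only means a string has the right shape, not that the postcode really exists.

```ruby
# The postcode pattern from above, written with plain ASCII hyphens.
UK_POSTCODE = /^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$/

# Hypothetical helper: true when the text has the shape the pattern describes.
def postcode?(text)
  UK_POSTCODE.match?(text)
end

puts postcode?("SW1A 1AA") # prints true
puts postcode?("M1 1AE")   # prints true
puts postcode?("GIR 0AA")  # prints true (the special case after the |)
puts postcode?("QQ1 1AA")  # prints false (Q is not allowed as the first letter)
puts postcode?("hello")    # prints false
```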

The Basics

The observant among you may have already noticed that Regexes are easy to spot in Ruby, as they are usually written between forward slashes, like this: /pattern/. They are widely used to grab text from files, validate text input and search through text files. They can also be used in almost any programming language!

I’ll start by going through the fundamentals and looking at some examples.

Character Classes

A character class allows a set of possible characters, rather than a single character, to match at a particular point in a Regex. Character classes are denoted by [...] with a set of characters to be possibly matched inside.

/[abc]/    - matches a single character of: a, b or c
/[^abc]/   - matches any single character except for: a, b or c
/[a-z]/    - matches any single character in the range a-z
/[a-zA-Z]/ - matches any single character in the range a-z or A-Z

If you want to match an a or an 8, use [a8]. The order of the characters inside the character class does not matter; the results are identical!

str = "I am 8 years old"
str.scan(/[a8]/)
# => ["a", "8", "a"]
str.scan(/[^a8\s]/)
# => ["I", "m", "y", "e", "r", "s", "o", "l", "d"]

The second example includes a caret, which turns the character class into one that matches any character not listed inside it, known as a negated character class. It is also important to note that a negated character class will happily match whitespace unless we tell it not to. We do this by including \s (any whitespace character) in the negated character class.

Here are some examples of character classes:

/dog/          # matches 'dog'
/[dlf]og/      # matches 'dog', 'log' and 'fog'
/[Dd][Oo][Gg]/ # matches 'dog' in a case-insensitive way
/dog/i         # also matches 'dog' in a case-insensitive way

The last example includes the i modifier, which is a handy trick for making the match case-insensitive.
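
As a quick check of the difference, the sketch below tries three spellings of 'dog' against the patterns above. The constant names are my own and only there for illustration.

```ruby
# Three ways of matching 'dog'; only the last two ignore case.
EXACT      = /dog/
BY_CLASSES = /[Dd][Oo][Gg]/
BY_FLAG    = /dog/i

%w[dog Dog DOG].each do |word|
  puts "#{word}: #{EXACT.match?(word)} #{BY_CLASSES.match?(word)} #{BY_FLAG.match?(word)}"
end
# prints:
# dog: true true true
# Dog: false true true
# DOG: false true true
```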

Special Character Classes

Programmers, being programmers, are lazy, and so special character classes were developed as shorthand for commonly used character classes. The character class that each special character class stands for is shown in brackets below.

/./  - matches any character except newline (\n)
/\d/ - matches a single character that is a digit (/[0-9]/)
/\w/ - matches a single character that is a word character: letter, digit or underscore (/[a-zA-Z0-9_]/)
/\s/ - matches a single whitespace character (/[ \t\r\n\f\v]/)

\d, \w and \s each have a negation: \D, \W and \S respectively. For example, \D will match any non-digit, thus performing the inverse of \d.
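
Here is a small sketch of each shorthand next to its negation, run over the same made-up string (SAMPLE is just an illustrative name):

```ruby
SAMPLE = "Room 101!"

p SAMPLE.scan(/\d/) # => ["1", "0", "1"]
p SAMPLE.scan(/\D/) # => ["R", "o", "o", "m", " ", "!"]
p SAMPLE.scan(/\w/) # => ["R", "o", "o", "m", "1", "0", "1"]
p SAMPLE.scan(/\W/) # => [" ", "!"]
p SAMPLE.scan(/\s/) # => [" "]
p SAMPLE.scan(/\S/) # => ["R", "o", "o", "m", "1", "0", "1", "!"]
```

Notice that each pair splits the string cleanly in two: every character lands in exactly one of the two arrays.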

These special character classes can be used both inside and outside of character classes ([...]) like so:

/..ng/           # matches any two characters followed by 'ng'
/\d\d:\d\d:\d\d/ # matches a hh:mm:ss time format
/\w\W\d/         # matches a word character, followed by a non-word character, followed by a digit
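
To try these out, here is a rough sketch against a made-up string (NOTE is an illustrative name). It also shows a special character class sitting inside square brackets, using the + quantifier we met in the #split example.

```ruby
NOTE = "Meeting at 09:30:00, room B-12"

# Outside brackets: String#[] returns the first substring that matches.
p NOTE[/\d\d:\d\d:\d\d/] # => "09:30:00"
p NOTE[/\w\W\d/]         # => "t 0" (the 't' of 'at', a space, then '0')

# Inside brackets: one or more characters that are each a digit or a colon.
p NOTE.scan(/[\d:]+/)    # => ["09:30:00", "12"]
```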

WARNING!!! The . metacharacter, also known as a period, should be used with care. On its own it matches any character, so it should not be used bare if you want your Regex pattern to match only a full stop. Instead, place the period within a character class, [.], or escape it with a backslash, \.
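
A quick sketch of the difference, using a made-up string (VERSIONS is an illustrative name):

```ruby
VERSIONS = "3.14 3x14 3-14"

# A bare period matches any character, so all three substrings match.
p VERSIONS.scan(/3.14/)   # => ["3.14", "3x14", "3-14"]

# Escaped or wrapped in a character class, it only matches a full stop.
p VERSIONS.scan(/3\.14/)  # => ["3.14"]
p VERSIONS.scan(/3[.]14/) # => ["3.14"]
```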

Repetition Cases

Repetition operators repeat the preceding part of the regular expression a specified number of times. Explicit repetition operators have the format {min,max}, with min and max both being integers. However, if you want the pattern to match an exact number of occurrences, a single number can be used instead of a range, as shown:

/\d{3}/   - matches exactly 3 digits (/\d\d\d/)
/\d{3,}/  - matches 3 or more digits
/\d{3,5}/ - matches 3, 4 or 5 digits
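
Here is a small sketch of those three patterns run over a made-up string of numbers (NUMBERS is an illustrative name). Keep an eye on the first result: with nothing else in the pattern to stop it, \d{3} will happily match the first three digits of a longer number.

```ruby
NUMBERS = "7 42 365 2024 90210"

p NUMBERS.scan(/\d{3}/)   # => ["365", "202", "902"]
p NUMBERS.scan(/\d{3,}/)  # => ["365", "2024", "90210"]
p NUMBERS.scan(/\d{3,5}/) # => ["365", "2024", "90210"]
```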

Yet developers, being developers, also devised three shorthand quantifiers for the most common cases. Like {min,max}, they are greedy: they tell the engine to match as many instances of the quantified pattern as possible, hence the term greedy. The three shorthand quantifiers are shown below:

/abc?/ - matches a string that has 'ab' followed by zero or one 'c'
/abc*/ - matches a string that has 'ab' followed by zero or more 'c'
/abc+/ - matches a string that has 'ab' followed by one or more 'c'

So {0,1} is the same as ?, {0,} is the same as * and {1,} is the same as +. These quantifiers can be confusing, as both ? and * will happily match zero occurrences of the pattern, while + needs at least one. Here are some examples of them in action:

str = "My cat is a black cat called Catfish Meowcat"
str.scan(/cat?/i)
# => ["cat", "cat", "ca", "Cat", "cat"]
str.scan(/cat*/i)
# => ["cat", "cat", "ca", "Cat", "cat"]
str.scan(/cat+/i)
# => ["cat", "cat", "Cat", "cat"]

In the examples above, the quantifiers are acting on the ‘t’ in ‘cat’, not the entire word. Therefore, both the ? and * operators will match instances of ‘ca’ and ‘cat’, unlike the + operator, which will only match instances of ‘cat’.
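
The cat string never has more than one 't' in a row, so ? and * end up giving the same answer. The sketch below first checks that the shorthand quantifiers really do behave like their {min,max} equivalents, then uses a made-up string with repeated letters to pull ? and * apart. Both constant names are my own.

```ruby
CATS = "My cat is a black cat called Catfish Meowcat"

# Each shorthand quantifier gives the same result as its {min,max} form.
puts CATS.scan(/cat?/i) == CATS.scan(/cat{0,1}/i) # prints true
puts CATS.scan(/cat*/i) == CATS.scan(/cat{0,}/i)  # prints true
puts CATS.scan(/cat+/i) == CATS.scan(/cat{1,}/i)  # prints true

# With repeated 't's the three quantifiers stop agreeing.
TEES = "ca cat catt cattt"

p TEES.scan(/cat?/) # => ["ca", "cat", "cat", "cat"]
p TEES.scan(/cat*/) # => ["ca", "cat", "catt", "cattt"]
p TEES.scan(/cat+/) # => ["cat", "catt", "cattt"]
```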

Regular Regex Expressions

Now I’ll briefly take you through a couple of widely used Regular Expressions and apply some of the aforementioned basic functions.

1. Matching a Username

/^[a-z0-9_-]{3,16}$/

The first character used is a caret ^. However, it is not located within a character class and therefore does not negate anything. Instead, it is known as an anchor. The Regex also ends with a $ sign, which is another anchor. In Ruby, ^ and $ mark the beginning and end of a line, which for a single-line string such as a username are also the beginning and end of the string. To anchor to the whole string even when it contains newlines, Ruby provides \A and \z.

What these anchors wrap is the pattern used to determine if the username is valid. It states that the username must contain between 3 and 16 characters. The acceptable characters include lowercase letters, numbers, underscores and hyphens.

string that matches: us3r_nam3
string that doesn't: Wayt00long_and_hasCap1tals
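
Here is a minimal sketch of the username check in Ruby. USERNAME is the pattern from above, while USERNAME_STRICT is a variant I've added for comparison that uses Ruby's \A and \z whole-string anchors. The difference only shows up when the input contains a newline.

```ruby
USERNAME        = /^[a-z0-9_-]{3,16}$/   # the pattern from this post
USERNAME_STRICT = /\A[a-z0-9_-]{3,16}\z/ # assumed variant for comparison

puts USERNAME.match?("us3r_nam3")                  # prints true
puts USERNAME.match?("Wayt00long_and_hasCap1tals") # prints false

# ^ and $ work per line in Ruby, so one good line is enough to match.
TWO_LINES = "not valid!\nvalid_user"
puts USERNAME.match?(TWO_LINES)        # prints true
puts USERNAME_STRICT.match?(TWO_LINES) # prints false
```

If the username arrives from a form or an API, the stricter variant is probably the safer choice.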

2. Matching an Email Address

/^([a-z0-9_\.-]+)@([\da-z_\.-]+)\.([a-z\.]{2,6})$/

Again, our Regex pattern is wrapped in anchors which signify the start and end of the line. Inside the first group, (...), we match one or more lowercase letters, numbers, hyphens, dots and underscores. Directly after this, an @ symbol must be matched. The next group is the domain name, which can include one or more numbers, lowercase letters, underscores, dots or hyphens. Following this must be a literal dot (full stop) rather than the match-anything period, because the dot comes after a backslash. The last group must contain between 2 and 6 lowercase letters or dots. This is because of country-specific TLDs like .co.uk and .my.us.

string that matches: random_email@gmail.com
string that doesn't match: email@domain.some_thing
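
To see what each group actually captures, here is a rough sketch using MatchData#captures. EMAIL is the pattern from above; the helper email_parts and the example.co.uk address are made up for illustration.

```ruby
EMAIL = /^([a-z0-9_\.-]+)@([\da-z_\.-]+)\.([a-z\.]{2,6})$/

# Hypothetical helper: returns the three captured groups, or nil on no match.
def email_parts(address)
  match = EMAIL.match(address)
  match && match.captures
end

p email_parts("random_email@gmail.com")  # => ["random_email", "gmail", "com"]
p email_parts("someone@example.co.uk")   # => ["someone", "example.co", "uk"]
p email_parts("email@domain.some_thing") # => nil
```

In the second address the greedy domain group swallows as much as it can, so 'example.co' ends up in the second group and only 'uk' is left for the last one.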

See, that wasn’t too bad now, was it? However, these are only some basic Regex patterns, and we still have a long way to go before being able to write and implement far longer and more specific regular expressions.

Hopefully, after reading this, you will feel more comfortable when coming face to face with Regex. Instead of just copying and pasting some absurd pattern from Stack Overflow, with the wild hope that this weird set of characters will somehow magically divide your string into the required sub-strings, you will be able to work out why the Regex is grouped as it is and recognise the character classes used.


Thank you for reading,

< Harry >