regex or ‘what the f- am I looking at’

Intro to Regex

If you are new to regular expressions, you may feel a sense of panic and dread when you first encounter them. A regex is not the prettiest or most intuitive search string to grasp at first sight, but once you get a handle on the modifiers, operators and syntax you’ll learn to appreciate all it has to offer and find that it’s a very useful skill to have in your dev toolkit.

Regular expressions, also known as regex or regexp, are patterns that describe the text you want to match. They are a great tool for searching and manipulating text-based data in a succinct way.

Pros:

Many languages provide regex capabilities either as built-in syntax (JavaScript, Ruby, Perl) or through a standard library (Python, Java, C++), making the basic features pretty universal.

Just like any other language, once you get a handle on the syntax and character meanings, regex is pretty simple to write.

It’s very concise.

Cons:

Though it may be simple to write once you get the hang of it, regex can be a bit difficult to read and the nature of regex structure can obfuscate the logic being used.

Overuse of regexes can have a negative impact on performance.

Some rules may differ from language to language.

Regex Basics:

The basic structure of a regex is a pattern contained within an opening and closing forward slash, known as delimiters. The pattern within the delimiters is the regex; any characters after the closing forward slash are known as modifiers. The regex pattern is matched against text, which is known as the subject string. (This slash-delimited form is how JavaScript, Perl and Ruby write regex literals; in Python the pattern is passed as a string.)

/<regex>/<modifiers>
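
To try the idea of a pattern and a subject string, here is a minimal Python sketch. Python has no /.../ literals, so the pattern is passed as a raw string to the standard re module; the pattern and subject string are made up for illustration.

```python
import re

pattern = r"cat"          # the regex, written as a raw string
subject = "concatenate"   # hypothetical subject string

# re.search scans the subject string for the first place the pattern matches.
match = re.search(pattern, subject)
found = match.group(0) if match else None
position = match.start() if match else None

print(found, position)  # cat 3
```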

Regexes can contain the following: line anchors, character classes, quantifiers and groupings.

Line Anchors or Positioning ‘^’ ‘$’

Anchors in a regular expression pin a pattern to a position: a caret (^) marks where the line must start and a dollar ($) marks where it must end.

If we searched for art with the regex /art/, we would also get partial matches inside ‘start’, ‘mart’ and ‘cart’. By including line anchors, as in /^art$/, we specify the start and end of the pattern so that we only match that exact sequence of characters from start to finish.
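
A small Python sketch of the difference, assuming a hypothetical word list and using the anchored form ^art$:

```python
import re

words = ["art", "start", "mart", "cart"]  # hypothetical inputs

# Unanchored: "art" may appear anywhere inside the word.
unanchored = [w for w in words if re.search(r"art", w)]

# Anchored: the string must start and end with exactly "art".
anchored = [w for w in words if re.search(r"^art$", w)]

print(unanchored)  # ['art', 'start', 'mart', 'cart']
print(anchored)    # ['art']
```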

Character classes and range ‘[ ]’ ‘-’

Brackets indicate a character class; a character class matches any one of the characters listed inside it.

[0123456789] represents a character class of digits; we can also write this as [0-9].

The dash has a special meaning within a character class, representing a range.
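
As a quick check that the listed form and the range form behave the same, here is a Python sketch with a made-up subject string. Note that the range uses a plain hyphen.

```python
import re

subject = "Room 42, floor 7"  # hypothetical subject string

# Each match is a single character taken from the class.
digits_listed = re.findall(r"[0123456789]", subject)
digits_ranged = re.findall(r"[0-9]", subject)
vowels = re.findall(r"[aeiou]", subject)

print(digits_ranged)  # ['4', '2', '7']
print(vowels)         # ['o', 'o', 'o', 'o']
```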

Quantifiers ‘+’ ‘*’

Quantifiers match repeated instances of a pattern. A character class on its own matches a single character; by adding the quantifier + we can match a pattern 1 or more times, and with the quantifier * we can match a pattern 0 or more times.

/[0-9]+/ can match ‘1’, ‘10’, ‘100’ and so on.
/[0-9]*/ can match ‘’ (nothing), ‘1’, ‘10’, ‘100’ and so on.
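
The difference between + and * shows up on the empty string. A Python sketch, assuming a few sample inputs and using re.fullmatch so the whole string has to fit the pattern:

```python
import re

samples = ["", "1", "10", "100", "abc"]  # hypothetical inputs

# + needs at least one digit, * is satisfied by zero digits.
one_or_more = [s for s in samples if re.fullmatch(r"[0-9]+", s)]
zero_or_more = [s for s in samples if re.fullmatch(r"[0-9]*", s)]

print(one_or_more)   # ['1', '10', '100']
print(zero_or_more)  # ['', '1', '10', '100']
```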

Alternation ‘|’

The pipe metacharacter is the ‘or’ operator, allowing you to alternate between patterns. For example, if we wanted to find pancakes or hotcakes we could use the following regex:

 /pancakes|hotcakes/
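
In Python this alternation could look like the sketch below; the menu sentence is invented for the example.

```python
import re

menu = "We serve pancakes, waffles and hotcakes."  # hypothetical subject string

# Either alternative counts as a match.
found = re.findall(r"pancakes|hotcakes", menu)

print(found)  # ['pancakes', 'hotcakes']
```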

Wildcard ‘.’

The dot will match any character at a given point except newline characters. Adding a star (*) is an easy way to capture all possible entries (without newline characters) in a given location.

/.*\.txt/ can match any txt file (more on that backslash in a bit!)
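
A hedged Python sketch of the wildcard, with made-up file names. It uses re.fullmatch so the name has to end in .txt, and it also shows that the dot skips newlines by default.

```python
import re

files = ["notes.txt", "report.pdf", "todo.txt", "txt"]  # hypothetical file names

# .* takes any run of characters, \. is a literal dot.
txt_files = [f for f in files if re.fullmatch(r".*\.txt", f)]

# By default the dot does not match a newline character.
dot_vs_newline = re.search(r"a.b", "a\nb")

print(txt_files)       # ['notes.txt', 'todo.txt']
print(dot_vs_newline)  # None
```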

Interval Quantifiers ‘{}’

An interval quantifier specifies the minimum and maximum number of times the preceding pattern may repeat, allowing for more precise matching. If only one number is given, the pattern must repeat exactly that many times.

/[0-9]{5}/ matches exactly five digits in a row.
/[0-9]{1,4}/ matches one to four digits in a row.
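
Here is a Python sketch of both interval forms on invented inputs. One thing worth seeing: without anchors (or re.fullmatch) a five-digit pattern will happily match the first five digits of a longer number.

```python
import re

candidates = ["1234", "12345", "123456"]  # hypothetical inputs

# fullmatch forces the whole string to fit the interval.
exactly_five = [c for c in candidates if re.fullmatch(r"[0-9]{5}", c)]
one_to_four = [c for c in candidates if re.fullmatch(r"[0-9]{1,4}", c)]

# Unanchored search finds five digits inside a six-digit string.
partial = re.search(r"[0-9]{5}", "123456").group(0)

print(exactly_five)  # ['12345']
print(one_to_four)   # ['1234']
print(partial)       # 12345
```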

Grouping and Matching ‘()’

Parentheses serve two purposes: grouping patterns and capturing matches.

There will be times when you need to group part of a pattern for the expression to make sense, even though you don’t necessarily want that group returned as a match. For example:

/http:\/\/www\.[a-z]+\.(com|net|org)/ is different from /http:\/\/www\.[a-z]+\.com|net|org/

If we were trying to match a website that ended in ‘.net’ or ‘.org’, we wouldn’t (read: shouldn’t) be able to do so with the second, ungrouped pattern. Because of the way it’s organized, it searches for everything before the first or operator (the full address ending in ‘com’), or the individual word ‘net’, or the individual word ‘org’. We need a group in order to interchange ‘com’, ‘net’ and ‘org’.

/http:\/\/www\.[a-z]+\.(com|net|org)/

Putting parentheses around the alternatives will now match sites ending with ‘com’, ‘net’ or ‘org’.
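
To see the grouped and ungrouped versions side by side, here is a Python sketch. Two assumptions: the site name is written as [a-z]+ so that names longer than one letter match, and the forward slashes are left unescaped because Python patterns are plain strings with no slash delimiters. The URL is made up.

```python
import re

grouped = r"http://www\.[a-z]+\.(com|net|org)"
ungrouped = r"http://www\.[a-z]+\.com|net|org"

url = "http://www.example.net"  # hypothetical subject string

# Grouped: the alternation only swaps the ending, so the whole URL matches.
grouped_hit = re.search(grouped, url).group(0)

# Ungrouped: the alternatives are "<full .com URL>", "net" and "org",
# so only the bare word "net" is found.
ungrouped_hit = re.search(ungrouped, url).group(0)

print(grouped_hit)    # http://www.example.net
print(ungrouped_hit)  # net
```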

Parentheses allow you to group and capture specific patterns. You can nest groups to capture several patterns within one expression. For example, a university might assign every student an id composed of their initials (four letters in this example) and a unique four-digit number. If you wanted to capture both the entire id and the initials, you can do this by using parentheses:

/(([a-z]{4})[0-9]{4})/

We can also capture the sites returned by our earlier regex by including parentheses around the entire regex:

/(http:\/\/www\.[a-z]+\.(com|net|org))/

In this case, we’ll get the site returned as well as the individual word ‘com’, ‘net’ or ‘org’. Every group is returned as a captured match, which isn’t always needed or desired.
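
A Python sketch of what nested groups hand back, using an invented student id and URL. Groups are numbered by the position of their opening parenthesis, so the outer group comes first. The site name is written as [a-z]+ here, an assumption so that multi-letter names match.

```python
import re

# Outer group = whole id, inner group = the four letters.
id_match = re.fullmatch(r"(([a-z]{4})[0-9]{4})", "abcd1234")
id_groups = id_match.groups()

# Outer group = whole site, inner group = the ending.
site_match = re.search(
    r"(http://www\.[a-z]+\.(com|net|org))",
    "visit http://www.example.org today",
)
site_groups = site_match.groups()

print(id_groups)    # ('abcd1234', 'abcd')
print(site_groups)  # ('http://www.example.org', 'org')
```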

Optional Matching ‘?:’ ‘?’

To avoid capturing unnecessary groups we can include ‘?:’ right after the opening parenthesis, which makes the group non-capturing.

/(http:\/\/www\.[a-z]+\.(?:com|net|org))/ will return the matched sites only.
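
In Python the effect of ?: is easy to see in what groups() returns. A minimal sketch with a made-up URL, again assuming [a-z]+ for the site name:

```python
import re

text = "visit http://www.example.org today"  # hypothetical subject string

capturing = re.search(r"(http://www\.[a-z]+\.(com|net|org))", text)
non_capturing = re.search(r"(http://www\.[a-z]+\.(?:com|net|org))", text)

# The (?:...) group still groups the alternatives but is not returned.
print(capturing.groups())      # ('http://www.example.org', 'org')
print(non_capturing.groups())  # ('http://www.example.org',)
```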

In other cases, where a pattern is optional, we can follow it with a ?. This matches the preceding pattern 0 or 1 times, and it works with individual characters, groupings and character classes.

/colou?r/
/y(es)?/
/ID-[0-9]{4}[a-z]?/
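
The three optional patterns can be tried out with a Python sketch like this one; the test words are invented, and re.fullmatch is used so each word has to fit the pattern exactly.

```python
import re

def matching(pattern, words):
    """Return the words that fit the pattern from start to finish."""
    return [w for w in words if re.fullmatch(pattern, w)]

colours = matching(r"colou?r", ["color", "colour", "colouur"])
answers = matching(r"y(es)?", ["y", "yes", "ye"])
ids = matching(r"ID-[0-9]{4}[a-z]?", ["ID-1234", "ID-1234a", "ID-123"])

print(colours)  # ['color', 'colour']
print(answers)  # ['y', 'yes']
print(ids)      # ['ID-1234', 'ID-1234a']
```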

Escapes ‘\’

The backslash is used to escape metacharacters that have special meanings so they can represent their literal selves. If we wanted to search for a website we would need to use the escape character:

/http:\/\/www\.[a-z]+\.com/

A backslash is needed in front of each forward slash because the forward slash marks the beginning and end of a regex.

Another backslash is needed in front of each dot because an unescaped dot is the match-anything wildcard.
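
Here is a Python sketch of why the dot needs escaping, using a made-up lookalike string. Python patterns have no slash delimiters, so only the dots are escaped. The standard re.escape helper is shown as one way to add the backslashes automatically.

```python
import re

escaped = r"www\.example\.com"
unescaped = r"www.example.com"

lookalike = "wwwXexampleXcom"  # hypothetical string with no dots in it

# The unescaped dots act as wildcards and accept the X characters.
escaped_hit = re.fullmatch(escaped, lookalike) is not None
unescaped_hit = re.fullmatch(unescaped, lookalike) is not None

# re.escape backslash-escapes the characters that would otherwise be special.
auto = re.escape("www.example.com")

print(escaped_hit, unescaped_hit)  # False True
print(auto)                        # www\.example\.com
```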

Generic Character Types

These are shorthand escapes that stand for commonly used character classes. Note that they start with a backslash, not a forward slash.

\d: any decimal digit

\D: any character that is not a decimal digit

\s: any whitespace character

\S: any character that is not a whitespace character

\w: any ‘word’ character (this includes a-z, A-Z, 0-9 and _)

\W: any non-word character
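
A short Python sketch of the backslash shorthands on an invented subject string:

```python
import re

subject = "Order 66 shipped_to Bob"  # hypothetical subject string

digits = re.findall(r"\d+", subject)       # runs of digits
non_digits = re.findall(r"\D+", subject)   # runs of anything but digits
words = re.findall(r"\w+", subject)        # runs of word characters
chunks = re.findall(r"\S+", subject)       # runs of non-whitespace

print(digits)      # ['66']
print(non_digits)  # ['Order ', ' shipped_to Bob']
print(words)       # ['Order', '66', 'shipped_to', 'Bob']
print(chunks)      # ['Order', '66', 'shipped_to', 'Bob']
```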

Context Matters!

Inside a character class ‘[ ]’ the caret ‘^’ has a different meaning: placed first, it negates the class, so the class matches any single character that is not listed.

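For example, let’s get rid of Edwin! We might try to do this by searching for names that start with ‘Ed’ followed by a character that is not ‘w’, ‘i’ or ‘n’.

/Ed[^win]/

Eddard and Eddy would be returned, but not Edwin and not Ed. Ed isn’t included because the class has to match a character: the pattern needs something after ‘Ed’, and that something must not be w, i or n. Note that the class excludes each listed letter individually rather than the sequence ‘win’, so Edward is rejected as well because of its ‘w’.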
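
To check how a negated class treats these names, here is a Python sketch that uses the bracketed form Ed[^win]. The second pattern uses a negative lookahead, (?!win), which this article does not cover; it is included only as one possible way to exclude the sequence ‘win’ rather than the individual letters.

```python
import re

names = ["Ed", "Edwin", "Edward", "Eddard", "Eddy"]

# [^win] needs one character after "Ed" that is not w, i or n.
negated_class = [n for n in names if re.match(r"Ed[^win]", n)]

# (?!win) only rejects names where "Ed" is followed by the letters w-i-n.
lookahead = [n for n in names if re.match(r"Ed(?!win)", n)]

print(negated_class)  # ['Eddard', 'Eddy']
print(lookahead)      # ['Ed', 'Edward', 'Eddard', 'Eddy']
```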

Modifiers

Modifiers change how the regex within the delimiters is applied. They are placed after the closing slash, and several modifiers can be listed at once.

These differ from language to language, so make sure you refer to the documentation of your language of choice. Below are a few for the languages I’m more familiar with, JavaScript and Python:

JavaScript

i: case-insensitive search.

g: global search, finds all matches rather than stopping at the first.

m: multi-line matching, makes the caret and dollar match at the start and end of every line rather than only at the start and end of the subject string.

Python

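Python has no slash delimiters, so these are passed as flags (for example re.S) or written inline at the start of the pattern (for example (?s)).

s (re.S): makes the wildcard ‘.’ match any character, including newline

i (re.I): ignores case

L (re.L): locale-aware matching, makes \w, \b and case-insensitive matching depend on the current locale

m (re.M): multi-line matching

x (re.X): enables verbose regex, which lets you add whitespace and comments to the pattern for better readability

u (re.U): makes escapes such as \w and \d follow the Unicode database; this is already the default for string patterns in Python 3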
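
As a rough illustration of the Python flags, here is a sketch using the standard re module and a made-up two-line subject string. The flags are passed with the flags= argument, and the phone-number pattern is purely hypothetical.

```python
import re

text = "First line\nsecond LINE"  # hypothetical two-line subject string

# i: ignore case, as a flag and as an inline (?i) prefix.
ignore_case = re.findall(r"line", text, flags=re.I)
inline_ignore_case = re.findall(r"(?i)line", text)

# m: ^ matches at the start of every line instead of only the first.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, flags=re.M)

# s: the dot is allowed to cross the newline.
dot_default = re.search(r"First.*LINE", text)
dot_all = re.search(r"First.*LINE", text, flags=re.S)

# x: whitespace and comments inside the pattern are ignored.
phone = re.compile(
    r"""
    [0-9]{3}   # first three digits
    -          # a literal hyphen
    [0-9]{4}   # last four digits
    """,
    flags=re.X,
)
verbose_hit = phone.search("call 555-0199").group(0)

print(ignore_case)            # ['line', 'LINE']
print(first_words_multiline)  # ['First', 'second']
print(verbose_hit)            # 555-0199
```

Several flags can be combined with the | operator, for example flags=re.I | re.M.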

In part two we’ll go through how to put these into use!