Regex for Dummies. Part 5: Lookaround Assertions — Lookaheads and Lookbehinds

NALSengineering
7 min read · Oct 17, 2023

Credit: Nguyễn Thành Minh (Android Developer)

See Part 1 here: Quantifiers

See Part 2 here: Flavors, Flags, and Assertions

See Part 3 here: Character Set, Or condition, and Word Boundary Assertion

See Part 4 here: Capturing Groups and Backreferences

In this part, we will explore an advanced concept in regex: lookaround assertions, which come in two kinds, lookaheads and lookbehinds.

Lookaround Assertions: Lookaheads and Lookbehinds

Lookaround assertions are non-capturing groups that allow a match only if the target string is followed or preceded by a particular pattern.

Like other assertions, such as the Input Boundary Assertions (^ and $) and the Word Boundary Assertions (\b and \B), lookaround assertions do not consume characters in the input string.

There are two types of lookaround assertions: lookahead and lookbehind. The two also have their positive and negative forms, so there are positive lookahead, negative lookahead, positive lookbehind, and negative lookbehind assertions.

1. Positive Lookahead (?=chars)

A positive lookahead checks whether a particular pattern can be matched after the current position in the string, without consuming characters from the string. In simpler terms, it asserts that a certain expression can be found at the right side of the current position. It is represented using the syntax:

(?=chars)

For example, the pattern x(?=y) means match x only if it is followed by y.

Another example, xyz(?=123) means match xyz only if it is followed by 123.

Another one: apple (?=pie) matches ‘apple ’ (including the trailing space) only if it is followed by ‘pie’.

Note that it matches only ‘apple ’, not the entire ‘apple pie’, because lookaround assertions do not consume characters in the input string.
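To make the "does not consume" point concrete, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are made up for illustration; the patterns are the ones discussed above.

```python
import re

# x(?=y): match 'x' only when the very next character is 'y'.
found = re.findall(r"x(?=y)", "xy xz xy")
print(found)  # ['x', 'x']

# The lookahead checks for 'pie' but does not consume it,
# so the match is only 'apple ' (six characters, with the space).
m = re.search(r"apple (?=pie)", "apple pie")
print(repr(m.group()), m.span())  # 'apple ' (0, 6)

# If 'pie' does not come right after the space, nothing matches.
no_match = re.search(r"apple (?=pie)", "apple tart")
print(no_match)  # None
```

The span (0, 6) shows that the match stops right before ‘pie’, even though ‘pie’ had to be there for the match to succeed.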

What if we reverse x(?=y) to (?=y)x? To understand the pattern (?=y)x, we need to comprehend the patterns (?=y) and (?=y)y first.

The pattern (?=y) on its own asserts that ‘y’ comes next, but it contains nothing that consumes characters. Therefore, it matches an "empty string" at each position immediately before a ‘y’. Note that it only matches empty strings that are followed by ‘y’, not all empty strings.

The pattern (?=y) matches an empty string that comes just before ‘y’. So, the pattern (?=y)y matches an empty string before ‘y’ and is immediately followed by the letter ‘y’. In simpler terms, it’s just the letter ‘y’.

Similarly, the pattern (?=y)x matches an empty string before ‘y’ that is immediately followed by the letter ‘x’. In other words, the character at the current position would have to be ‘y’ and ‘x’ at the same time. This scenario is impossible, so this pattern will not match any string.
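The three patterns above can be compared side by side. This is a small sketch in Python's re module; the sample text "xyzy" is an arbitrary choice for the demonstration.

```python
import re

text = "xyzy"

# (?=y) alone matches the empty string at every position right before a 'y'.
empty_matches = [(m.start(), m.group()) for m in re.finditer(r"(?=y)", text)]
print(empty_matches)  # [(1, ''), (3, '')]

# (?=y)y finds exactly the same matches as the plain pattern y.
same_as_y = re.findall(r"(?=y)y", text) == re.findall(r"y", text)
print(same_as_y)  # True

# (?=y)x needs one character to be both 'y' and 'x', so it never matches.
impossible = re.search(r"(?=y)x", "xy yx xyx")
print(impossible)  # None
```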

What if we insert .* into the pattern (?=y)x like this: (?=.*y)x? In this case, it matches an ‘x’ that has a ‘y’ somewhere after it, as in ‘xy’, ‘x123y’, and ‘x1y2z3’.

First, one thing is certain: this pattern matches exactly one character, ‘x’, because (?=.*y) is an assertion and does not consume any characters.

Then, we need to understand how the pattern (?=.*y) works.

  • .*: Matches any sequence of characters (except for a new line).
  • y: Matches the literal character 'y'.
  • Together, .*y matches any run of characters on the current line that ends with ‘y’, so it succeeds whenever at least one ‘y’ appears at or after the current position.

Therefore, (?=.*y) matches an empty string that has a ‘y’ somewhere after it.

Finally, the pattern (?=.*y)x matches the single character ‘x’ when a ‘y’ comes after it, as in ‘xy’, ‘x123y’, and ‘x1y2’.

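Because the lookahead starts at the same position as ‘x’, .* has to cover the ‘x’ itself before ‘y’ can match. In ‘xy’, .* matches ‘x’, so the entire .*y matches ‘xy’. In ‘yx’, no ‘y’ appears at or after the ‘x’, so the lookahead fails at that position. That’s why (?=.*y)x does not match ‘yx’. You can check this by capturing .* in a group like this: (?=(.*)y)x

For inputs such as ‘xy’, ‘xzy’, and ‘xbcy’, the group (.*) captures ‘x’, ‘xz’, and ‘xbc’.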
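The grouping trick can be tried out directly. The sketch below uses Python's re module; the input strings are examples chosen for this illustration, and the helper name probe is hypothetical.

```python
import re

pattern = re.compile(r"(?=(.*)y)x")

def probe(text):
    """Return (matched text, what the group inside the lookahead captured),
    or None when the pattern does not match."""
    m = pattern.search(text)
    return (m.group(), m.group(1)) if m else None

for sample in ["xy", "xzy", "xbcy", "yx"]:
    print(sample, "->", probe(sample))
# xy -> ('x', 'x')
# xzy -> ('x', 'xz')
# xbcy -> ('x', 'xbc')
# yx -> None
```

The overall match is always the single character ‘x’, while the group shows how much text the lookahead had to scan before it found ‘y’.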

In short, (?=.*y) is typically used to check if the string has at least one character ‘y’. Similarly, we have the following useful patterns:

  • (?=.*[a-z]).+ matches a string that has at least one lowercase character.
  • (?=.*[A-Z]).+ matches a string that has at least one uppercase character.
  • (?=.*\d).+ matches a string that has at least one digit.
  • (?=.*[^A-Za-z0-9\s]).+ matches a string that has at least one special character except whitespace.
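As a rough sketch, the four patterns in the list above can be put in a small table and tried against a few strings. The dictionary checks and the function classify are hypothetical names used only for this example, and re.fullmatch is used so that each pattern has to cover the whole string.

```python
import re

# The four "contains at least one ..." patterns from the list above.
checks = {
    "lowercase": r"(?=.*[a-z]).+",
    "uppercase": r"(?=.*[A-Z]).+",
    "digit": r"(?=.*\d).+",
    "special": r"(?=.*[^A-Za-z0-9\s]).+",
}

def classify(text):
    """Return the names of all checks that the text satisfies."""
    return [name for name, pat in checks.items() if re.fullmatch(pat, text)]

print(classify("abc"))   # ['lowercase']
print(classify("Abc1"))  # ['lowercase', 'uppercase', 'digit']
print(classify("a b!"))  # ['lowercase', 'special']
```

In the last example the space does not count as a special character, because \s is excluded inside the character class.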

Positive lookahead assertions can be useful for validating passwords. For instance, the pattern below validates a password with a minimum of 8 characters, including at least one uppercase letter, one lowercase letter, one numeric digit, and one special character (excluding whitespace).

^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[^A-Za-z0-9\s]).{8,}$

In which:

  • ^: anchors the pattern to the beginning of the string.
  • (?=.*[A-Z]): asserts the presence of at least one uppercase letter.
  • (?=.*[a-z]): asserts the presence of at least one lowercase letter.
  • (?=.*\d): asserts the presence of at least one numeric digit.
  • (?=.*[^A-Za-z0-9\s]): asserts the presence of at least one special character (excluding whitespace).
  • .{8,}: matches a string that has at least 8 characters.
  • $: anchors the pattern to the end of the string.
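The password pattern can be wrapped in a small helper to see which inputs pass. This is only a sketch in Python: the names PASSWORD_RE and is_valid_password and the sample passwords are invented for illustration, and re.fullmatch is an extra safeguard chosen here (in Python, $ alone would also accept a single trailing newline).

```python
import re

PASSWORD_RE = re.compile(
    r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[^A-Za-z0-9\s]).{8,}$"
)

def is_valid_password(candidate):
    """True when all four lookaheads succeed and the length is at least 8."""
    return PASSWORD_RE.fullmatch(candidate) is not None

print(is_valid_password("Passw0rd!"))   # True
print(is_valid_password("password1!"))  # False: no uppercase letter
print(is_valid_password("Password!!"))  # False: no digit
print(is_valid_password("Passw0rdX"))   # False: no special character
print(is_valid_password("Pw0!"))        # False: shorter than 8 characters
```

Each lookahead runs from the start of the string and consumes nothing, which is why the four conditions can be stacked in any order before .{8,} does the actual matching.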

2. Negative Lookahead (?!chars)

In the syntax of negative lookahead, you replace the equal sign with an exclamation mark:

(?!chars)

For example, the pattern x(?!y) matches ‘x’ only if it is NOT followed by ‘y’.

Another example is the pattern x(?!123), which matches ‘x’ when it is not followed by ‘123’.
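Both negative lookahead examples can be checked with a few lines of Python. The sample strings are made up, and the trailing \d* in the second pattern is an addition for this sketch so that the output shows which ‘x’ was matched.

```python
import re

# x(?!y): positions of every 'x' that is NOT followed by 'y'.
positions = [m.start() for m in re.finditer(r"x(?!y)", "xy xz x")]
print(positions)  # [3, 6]

# x(?!123): skip the 'x' that is followed by '123'.
# \d* is added only to show the digits that follow each matched 'x'.
tokens = re.findall(r"x(?!123)\d*", "x123 x124 x12")
print(tokens)  # ['x124', 'x12']
```

Note that the last ‘x’ in "xy xz x" matches even though nothing follows it: a negative lookahead also succeeds at the end of the string.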

Negative lookahead assertions can be useful for validating strings that do not start with specific words. For example, the pattern below validates a URL that does not begin with ‘http’ or ‘https’.

^(?!http|https).+$

In the pattern above, ^ asserts that the current position is the start of the string, and the negative lookahead (?!http|https) asserts that what follows the current position is not “http” or “https”. That means it will match strings that do not start with “http” or “https”. Strictly speaking, the “https” alternative is redundant, because any string that starts with “https” also starts with “http”.
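Here is a small sketch of that pattern in Python. The names NOT_HTTP_RE and lacks_http_prefix and the sample inputs are hypothetical, chosen only to show the behavior.

```python
import re

NOT_HTTP_RE = re.compile(r"^(?!http|https).+$")

def lacks_http_prefix(text):
    """True when the text is non-empty and does not start with 'http'."""
    return NOT_HTTP_RE.search(text) is not None

print(lacks_http_prefix("ftp://example.com"))    # True
print(lacks_http_prefix("http://example.com"))   # False
print(lacks_http_prefix("https://example.com"))  # False
print(lacks_http_prefix("example.com/http"))     # True: 'http' is not at the start
print(lacks_http_prefix("httpd.conf"))           # False: it is a plain prefix check
```

The last line is worth noticing: the lookahead only compares characters, so anything that begins with the letters ‘http’ is rejected, whether or not it is really a URL scheme.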

3. Positive Lookbehind (?<=chars)

A lookbehind assertion is similar to a lookahead assertion. But instead of checking whether certain characters follow what you’re trying to match, it checks whether those characters precede what you’re trying to match.

Like lookaheads, there are also positive and negative lookbehind assertions. A positive lookbehind returns a match only if the character you want to match is preceded by another character you specify in your pattern. On the other hand, a negative lookbehind returns a match only if the character you want to match is not preceded by another character.

This is the positive lookbehind’s syntax:

(?<=chars)

For example, the pattern (?<=x)y indicates you want to match y only if there's x before it. In this case, xx or yx won't match, but xy would match.

Positive lookbehind assertions can be useful for matching numbers preceded only by a certain currency symbol, for example numbers preceded by the dollar sign.

The regex pattern below has a positive lookbehind that matches a number only if it is preceded by a dollar sign:

(?<=\$)\d+

In the pattern above, the positive lookbehind (?<=\$) checks whether there's a dollar sign before one or more digits represented by \d+.
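Both positive lookbehind examples can be tried in Python as follows. The sample sentences are invented for this sketch.

```python
import re

# (?<=x)y: match 'y' only when the character right before it is 'x'.
ys = re.findall(r"(?<=x)y", "xy yx xx")
print(ys)  # ['y']

# (?<=\$)\d+: numbers directly preceded by a dollar sign.
# The dollar sign is checked but is not part of the match.
amounts = re.findall(r"(?<=\$)\d+", "Cost: $45, €30, $120 and 99")
print(amounts)  # ['45', '120']
```

Because the lookbehind consumes nothing, the results contain only the digits, which is convenient when the number is what you want to extract.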

4. Negative Lookbehind (?<!chars)

For a negative lookbehind, an exclamation mark replaces the equals sign:

(?<!chars)

For example, the pattern (?<!x)y means do not match y if there's x right before it. In this case, the y in vy would match and the y in ny would match, but never the y in xy.
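A quick sketch in Python shows which ‘y’ characters survive the negative lookbehind. The sample string is made up for this illustration.

```python
import re

# (?<!x)y: positions of every 'y' that is NOT directly preceded by 'x'.
kept = [m.start() for m in re.finditer(r"(?<!x)y", "vy ny xy y")]
print(kept)  # [1, 4, 9]

# A 'y' at the very start of the string also matches,
# because there is no 'x' (or anything else) before it.
at_start = re.search(r"(?<!x)y", "y")
print(at_start is not None)  # True
```

The ‘y’ at index 7 is the only one skipped, since it is the one preceded by ‘x’.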

Negative lookbehind assertions can be useful for validating files that don’t end with specific extensions. For example, the pattern below validates a file path that does not end with “js”, “css”, or “html”.

^.+(?<!js|css|html)$

In the pattern above, $ asserts that the current position is the end of the string, and the negative lookbehind (?<!js|css|html) asserts that what precedes the current position (the end position) is not “js”, “css”, or “html”. That means it will match strings that do not end with “js”, “css”, or “html”.
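One caveat when trying this in code: some regex flavors only accept fixed-width lookbehinds. Python's built-in re module is one of them, so the sketch below splits the alternation into three separate lookbehinds, which should be equivalent. The names NOT_WEB_ASSET_RE and has_other_ending and the sample paths are hypothetical.

```python
import re

# Python's re may reject the original form, because the alternatives
# 'js', 'css' and 'html' have different lengths.
try:
    re.compile(r"^.+(?<!js|css|html)$")
    original_accepted = True
except re.error as exc:
    original_accepted = False
    print("rejected by re:", exc)

# Equivalent form for this sketch: one fixed-width lookbehind per ending.
NOT_WEB_ASSET_RE = re.compile(r"^.+(?<!js)(?<!css)(?<!html)$")

def has_other_ending(path):
    """True when the path does not end with 'js', 'css' or 'html'."""
    return NOT_WEB_ASSET_RE.search(path) is not None

print(has_other_ending("main.py"))     # True
print(has_other_ending("app.js"))      # False
print(has_other_ending("style.css"))   # False
print(has_other_ending("index.html"))  # False
print(has_other_ending("nodejs"))      # False: only the last letters are checked
```

As the last line shows, the pattern looks only at the trailing characters, not at the dot. If real extensions are what matters, including the dot in each lookbehind, for example (?<!\.js), would be one way to tighten it.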

Conclusion

This concept is the most challenging part of regex for me. After this lesson, I believe you will be able to use regex to solve most problems. In the next part, we will apply it to programming.

