Breaking Down a Head-Scratcher Regex for Password Validation

By Andre Knob, published in The Startup, Oct 27, 2020, 8 min read

No matter how much you like or hate regular expressions, the truth is that from time to time we’re gonna need to rely on them, whether to run a simple string test, implement a search engine, or do lexical analysis.

Today I’m going to break down some concepts around a regex I needed for password validation. Like any password that you need to create nowadays, my password needed to contain:

  • at least one lowercase letter [a-z];
  • at least one uppercase letter [A-Z];
  • at least one digit [0-9];
  • six characters or more.

Just for the sake of simplicity, let’s say that this password won’t allow non-word characters.

There are a bunch of ways to create a regex for this purpose. The preferable one would fall into the best practices category, but today, for the sake of learning, we’re gonna explore a clever and interesting way of solving this problem. Because of its reduced readability, this is not a regex that I’d like to see in any implementation, but understanding how some regular expressions can be built and exploring different ways of solving the same problem should really broaden our options when faced with complex scenarios.

We’ll be using JavaScript, yet I believe you may translate the concepts learned here to the regex standards used by other languages. Before we dive deep into understanding this much anticipated regex, let’s see how we would solve the problem at hand with best practices:

Best practices for using regexes
Best practices alternative
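A minimal sketch of this divide-and-conquer approach could look like the snippet below. The function name isValidPassword and the variable names are assumptions made for illustration; only the first regex, /^[A-Za-z0-9]{6,}$/, is taken directly from this article.

```javascript
// Hypothetical helper: each rule gets its own small, readable check.
function isValidPassword(password) {
  // Only letters and digits, six characters or more.
  const allowedCharsAndLength = /^[A-Za-z0-9]{6,}$/.test(password);
  // At least one lowercase letter, one uppercase letter and one digit.
  const hasLowercase = /[a-z]/.test(password);
  const hasUppercase = /[A-Z]/.test(password);
  const hasDigit = /[0-9]/.test(password);

  return allowedCharsAndLength && hasLowercase && hasUppercase && hasDigit;
}

console.log(isValidPassword("Passw0rd")); // true
console.log(isValidPassword("password")); // false (no uppercase letter, no digit)
```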

Easy to understand, right? We divide and conquer our problem, solving each part individually. This makes it really readable and maintainable, even for those that really don’t get along with regexes. There may be, however, some points of doubt in the first regex, /^[A-Za-z0-9]{6,}$/, so let’s briefly break it down:

  • ^ asserts the position at the start of the line (in JavaScript, the start of the whole string unless the m flag is set);
  • [A-Za-z0-9] matches a single character from this set;
  • {6,} is a quantifier: it matches the preceding item at least 6 times, with no upper limit, and as many times as possible. The missing upper limit comes from the comma right after the 6. Without the comma, we’d have {6}, which matches exactly 6 times;
  • $ asserts the position at the end of the line (again, the end of the whole string unless the m flag is set).

In this particular regex, the biggest point of confusion may be the simultaneous usage of ^ and $ anchors, but it is just a small trick.

In natural language, the usage of the ^ anchor could translate to “there should be nothing before the match”. For $, similarly, we’d have “there should be nothing after the match”. Once both are combined, they translate to “there should be nothing both before and after the match”.

If you make use of both assertions, any possible match will have the same length as the line itself. At the same time, you stipulate that for a match to occur, the entire line content must match the subregex between ^ and $.
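To see the effect of the anchors in isolation, here is a small sketch comparing the anchored regex with an unanchored copy of it. The variable names and test strings are assumptions chosen for illustration.

```javascript
const anchored = /^[A-Za-z0-9]{6,}$/;
const unanchored = /[A-Za-z0-9]{6,}/;

// Without anchors, the substring "abc123" is enough for a match.
console.log(unanchored.test("abc123!")); // true

// With both anchors, the trailing "!" sits between the match and the end.
console.log(anchored.test("abc123!")); // false

// When the anchored regex does match, the match spans the whole string.
const match = "abc123".match(anchored);
console.log(match[0].length === "abc123".length); // true
```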

With that behind us, let’s finally start diving into the cool regex. It is a simple one-liner:
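Assembled from the pieces that are broken down later in this article, the one-liner reads as follows (the variable name passwordRegex is an assumption):

```javascript
const passwordRegex = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z0-9]{6,}$/;

console.log(passwordRegex.test("Passw0rd")); // true
```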

Frightening, right? What really caught my attention in this regex and made me write this article is the super cool way it uses a specific assertion, the lookahead assertion.

Lookahead Assertion

Lookahead assertions are mechanisms that make it possible to check whether a certain subpattern occurs (or not) right after the asserting position, without including this pattern in the match, i.e. without consuming characters from the string being matched. Similarly, there are also lookbehind assertions, which check whether the subpattern occurs behind the asserting position. Together, these are referred to as lookaround assertions. Lookaround assertions can be either positive or negative, establishing whether the subpattern must or must not occur for a match to happen.

Let’s see two examples of regular expressions with positive and negative lookahead assertions to better understand how these assertions work and what neat tricks we can do with them.

First, imagine we need to match only 3 letters, but these 3 letters must be followed by a number. How could we do that? Remember what assertions are about: they don’t include their matched patterns in your match, they just validate that the patterns exist without consuming them. So, doing some tests in Chrome’s console, we can write a regex like the one below.

Regex with positive lookahead assertion
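As a sketch of what such a console session might look like, the snippet below builds the regex from the description in this section: three letters from [a-zA-Z] followed by a positive lookahead for a digit. The variable names are assumptions.

```javascript
// Three letters, then a lookahead that checks for a digit without consuming it.
const threeLettersThenDigit = /[a-zA-Z]{3}(?=\d)/;

const match = "abc5".match(threeLettersThenDigit);
console.log(match[0]);    // "abc" (the 5 is checked but not included)
console.log(match.index); // 0
```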

This regex matches any character from [a-z] and [A-Z] exactly three times, and then it does a positive lookahead (a lookahead must be defined between a pair of parentheses, with ?= followed by a subregex) to check if there’s a digit right after the asserting position.

In this case, the asserting position that triggers the lookahead is right after c, the third character of the string abc5. The lookahead then succeeds, because there’s a digit right after c; however, notice how the match abc doesn’t contain 5. That’s the useful lookaround feature of not consuming characters.

The string abc_5 would fail to match, because no sequence of three letters is immediately followed by a digit:

Failed match
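A quick way to reproduce this failure, again assuming the regex described above:

```javascript
const threeLettersThenDigit = /[a-zA-Z]{3}(?=\d)/;

// The underscore sits between "abc" and "5", so the lookahead fails.
const result = "abc_5".match(threeLettersThenDigit);
console.log(result); // null
```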

And what would happen if we tried to match against a longer string, let’s say, _abc5 or abc555_?

Other matching examples
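Here is a sketch of those two cases, assuming the same unanchored regex as before. The index property shows where in the string the match was found.

```javascript
const threeLettersThenDigit = /[a-zA-Z]{3}(?=\d)/;

const first = "_abc5".match(threeLettersThenDigit);
console.log(first[0], first.index); // "abc" 1

const second = "abc555_".match(threeLettersThenDigit);
console.log(second[0], second.index); // "abc" 0
```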

It’ll still match: since there are no delimiting rules, substrings are allowed to match too.

What if I want to match only when the whole string is a match, while keeping this cool lookaround behavior of limited matching? Oh, we can use the anchors ^ and $ to achieve this, just like we’ve seen a bit ago, right?

Epic fail
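The failing attempt can be sketched like this, simply wrapping the earlier regex in ^ and $ (the variable name is an assumption):

```javascript
// The lookahead demands a digit next, while $ demands the end of the string.
const anchoredAttempt = /^[a-zA-Z]{3}(?=\d)$/;

console.log("abc5".match(anchoredAttempt)); // null
console.log("abc".match(anchoredAttempt));  // null
```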

Wait… what? Where’s the match? Oh… I see. Before you go full “you promised me this was incredible and would work” mode, let’s hold our horses and see what’s happening here. It is true that the ^ and $ duo is extremely powerful, just not for this particular case.

The thing is, with the $ anchor we’re asserting that “there must be nothing after the third character from the [a-zA-Z] set”, and with the lookahead we’re asserting “there must be a digit after the third character from the [a-zA-Z] set”. This regex is a paradox: it’ll never match anything because the two rules contradict each other. It is also worth noting that this problem shows up in any regex with this particular pattern: a positive lookahead that requires at least one character, immediately followed by a $ anchor.

We’ve created a little monster. How can we solve this and, again, keep the useful behavior of matching just a part of the string, without resorting to other tools, like capturing groups? Well, we just need to replace the $ anchor with a negative lookahead.

Negative Lookahead

Negative lookaheads are essential. They are a great way to solve the problem of “match something not followed by something else”. Let’s quickly see what our solution would be:

We have a match!
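A sketch of that fix, reconstructed from the description in this section: the $ anchor gives way to a negative lookahead that rejects two or more remaining characters. The variable name is an assumption.

```javascript
// (?=\d)     : a digit must come right after the three letters
// (?!.{2,})  : but there must NOT be two or more characters left
const threeLettersOneDigit = /^[a-zA-Z]{3}(?=\d)(?!.{2,})/;

const match = "abc5".match(threeLettersOneDigit);
console.log(match[0]); // "abc"
```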

Great! We’ve just replaced the $ anchor with a negative lookahead, keeping everything else. In this new lookahead, we’re just saying “there must NOT be two or more characters after the third character from the [a-zA-Z] set”. Still, with the first lookahead, we need to have a single digit right after the third character. After combining both lookaheads, we achieved the desired behavior. Let’s see other cases to validate our regex:

Looking good
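A few strings that should now be rejected, as a sketch using the same two-lookahead regex (the sample strings are assumptions picked to exercise each rule):

```javascript
const threeLettersOneDigit = /^[a-zA-Z]{3}(?=\d)(?!.{2,})/;

// Each of these breaks one of the rules, so match() returns null for all.
const rejected = ["abc55", "abc5_", "_abc5", "abcd5", "abc"];
for (const candidate of rejected) {
  console.log(candidate, threeLettersOneDigit.test(candidate)); // false every time
}
```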

Right, so we’ve understood some core concepts about lookarounds. For lookbehind specifically, we just need to carry over what we’ve learned from lookahead and we should be fine. To use a positive lookbehind, just use ?<= in the same structure as the lookahead. A negative lookbehind uses ?<!.
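As a small sketch of both lookbehind forms, mirroring the letters-and-digit example above. The regexes and sample strings are assumptions made for illustration, and lookbehind requires a JavaScript engine that supports the ES2018 regex features.

```javascript
// Positive lookbehind: a digit that comes right after three letters.
const digitAfterLetters = /(?<=[a-zA-Z]{3})\d/;
console.log("abc5".match(digitAfterLetters)[0]); // "5"

// Negative lookbehind: a digit that is NOT preceded by a letter.
const digitNotAfterLetter = /(?<![a-zA-Z])\d/;
const match = "abc5 7".match(digitNotAfterLetter);
console.log(match[0], match.index); // "7" 5
```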

With all the lookaround concepts out of the way, we can take a closer look at our password regex, which is now much less scary.

Back to the password regex

Let’s first remember what that regex looked like:

The now quite pleasant regex

With everything we learned at our disposal, now we can break this down easily.

  • ^ is just our anchor to the start of the line;
  • (?=.*\d) is a positive lookahead to find a digit in any position of the string. Since it is placed at the beginning of the regex, it will scan the whole string for the rule defined. Also, since we don’t know where exactly we’ll need to match the digit, we can use .* to define that anything can come before the digit itself;
  • (?=.*[a-z]) is a similar lookahead rule, but instead of a digit, it tries to find a character in the [a-z] set;
  • (?=.*[A-Z]) is the same thing but for the [A-Z] set;
  • [a-zA-Z0-9]{6,}, here we delimit what we’ll be matching, just any character in the [a-zA-Z0-9] set, six or more times. This will leave out the undesired non-word characters;
  • $, finally, we set our end of line anchor.
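To make the breakdown above concrete, the sketch below rebuilds the regex from its parts with the RegExp constructor. The constant names are assumptions, and the backslash in \d has to be doubled because the parts are written as strings.

```javascript
const hasDigit = "(?=.*\\d)";        // a digit somewhere in the string
const hasLowercase = "(?=.*[a-z])";  // a lowercase letter somewhere
const hasUppercase = "(?=.*[A-Z])";  // an uppercase letter somewhere
const allowedChars = "[a-zA-Z0-9]{6,}"; // what actually gets matched

const passwordRegex = new RegExp(
  "^" + hasDigit + hasLowercase + hasUppercase + allowedChars + "$"
);

console.log(passwordRegex.source);
// ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z0-9]{6,}$

// Each lookahead can also be checked on its own from the start of the string.
console.log(new RegExp("^" + hasUppercase).test("passw0rd")); // false
console.log(new RegExp("^" + hasDigit).test("passw0rd"));     // true
```

Splitting the pattern like this is only a reading aid; the resulting regex behaves the same as the one-liner.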

You could be asking yourself “why are we combining anchors and lookarounds now?”. Well, since this time we need to match the full string, not just one part of it, there’s no chance of putting a lookahead just before a $ anchor, so there’s no risk of falling into the same paradox as before. Our lookahead rules won’t conflict with our anchors, they will just complement each other, so this time we can safely use both ^ and $ anchors.

Combining all of the behaviors defined in this regex, we’d have something like “match a string that has at least one digit, one lowercase letter, one uppercase letter, and is composed of anything in the set [a-zA-Z0-9], with six or more characters”.

Great! Now let’s see if it works properly:

Looking good²
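A sketch of such a check, with one sample string per rule (the sample passwords are assumptions chosen for illustration):

```javascript
const passwordRegex = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z0-9]{6,}$/;

console.log(passwordRegex.test("Passw0rd"));  // true
console.log(passwordRegex.test("passw0rd"));  // false (no uppercase letter)
console.log(passwordRegex.test("PASSW0RD"));  // false (no lowercase letter)
console.log(passwordRegex.test("Password"));  // false (no digit)
console.log(passwordRegex.test("Pa5s"));      // false (fewer than six characters)
console.log(passwordRegex.test("Passw0rd!")); // false ("!" is outside the allowed set)
```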

It looks like it is working as intended. Now we’ve fully broken down and understood this scary looking regex. As I said earlier, this is not a best practices regex, yet in some cases I find real value in breaking down and learning from complex examples, not to mention that this is where the fun resides. I hope that this guide was helpful for you and that we’ve learned some useful tricks to apply in future problems we’ll be facing.
