Learning Regular Expressions with Colors
Regular expressions are a bit like a scientific calculator’s extra functions. They’re extremely powerful, and can save the day when nothing seems to be working right, but the instructions can be a bit daunting, especially if you don’t have a conceptual foundation for what you’re trying to do.
Instead of jumping right into developing Regex strings, patterns, modifiers, and using them for testing, replacing, and extracting strings, we’re going to take a gentler approach, using something many of us know and love: colors!! I’ve also tried to make my examples accessible to anyone with vision impairments by adding labels to my visualizations.
These are the colors we’ll use to make this happen
A basic match
Let’s look at a series of colors:
Based on this series, we should be able to ask some simple questions about it, like “are there any reds?” The answer is, of course, yes! There are 3. So if we define a pattern that finds a red square, it should find 3 matches. To make this easier to see, we’ll display these patterns and our color series like this (note the red boxes around the matched sequences; you’ll see more of these):
Similarly, we could check to see if the series has a blue square followed by a red like so:
Following this pattern, we find exactly 1 match (at the beginning of the series). The rule we follow with these patterns is that if two colors follow each other in the pattern, they have to follow each other the exact same way in the series, with the same quantity and ordering.
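If you’d like to try this outside of the pictures, here is a minimal sketch in Python. It assumes we write each square as a single letter (B = blue, R = red, G = green, Y = gray, P = purple). Both the letter encoding and the series itself are made up for illustration; they are not the series from the figure.

```python
import re

# Hypothetical encoding: one letter per colored square.
# B = blue, R = red, G = green, Y = gray, P = purple
series = "BRGYRPR"  # an illustrative series, not the one from the figure

# "Are there any reds?" becomes: find every R.
reds = re.findall(r"R", series)

# "A blue followed by a red" becomes: find B immediately followed by R.
blue_then_red = [m.start() for m in re.finditer(r"BR", series)]

print(len(reds))      # 3
print(blue_then_red)  # [0]
```

With this particular series we get 3 reds and a single blue-red pair starting at the first square, which mirrors the counts described above.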
A match with conditions
Now that we have our foundations, let’s add one idea: multiple colors. What if we want to adjust our pattern to allow the first color to be either blue or gray? For this, we’ll just put those colors in brackets, signifying either color can work when trying to find a match. We call this collection of colors a set.
Since the first color of the pattern can be blue or gray, we find two matches: the first at the beginning of the series, where we have a blue followed by a red, and the second later on, where we have a gray followed by a red. It’s important to note, though, that the set only matches one of the colors inside it. It does not signify a gray color followed by a blue.
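Here is the same idea as a small Python sketch, again assuming a made-up one-letter encoding (B = blue, R = red, G = green, Y = gray, P = purple) and an illustrative series. Square brackets work the same way for characters as they do for our colors.

```python
import re

# Hypothetical encoding: B = blue, R = red, G = green, Y = gray, P = purple
series = "BRGYRPR"  # illustrative series

# [BY] is a set: exactly one square that is either blue or gray.
matches = re.findall(r"[BY]R", series)
print(matches)  # ['BR', 'YR']

# The set stands for a single square, never for "gray followed by blue".
print(re.findall(r"[BY]R", "YBR"))  # ['BR']
```

In the second call only the blue-red pair is matched; the leading gray square is left out because the set consumes one square at a time.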
What if we wanted to know if there were repeated patterns? Like 3 red squares in a row. Or two red-green combinations in a row. One seemingly obvious solution would be to just check a pattern of exactly that, like below.
And that works! But let’s say you wanted to check for more repetitions, like 5. All of a sudden, writing these patterns gets more and more involved. For that reason, we have a shortcut. We can group that single red-green pattern and just add an indicator after it saying we want exactly two of that group.
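In text form, the group is written with parentheses and the “exactly two” indicator with curly braces. The sketch below assumes a made-up letter encoding (B = blue, R = red, G = green) and an illustrative series. One Python-specific detail: `re.findall` returns only the group’s contents when a pattern contains a capturing group, so this example uses the non-capturing form `(?:...)` to get the whole match back.

```python
import re

# Hypothetical encoding: B = blue, R = red, G = green
series = "BRGRG"  # illustrative series: blue, then red-green twice

# Spelling the repetition out by hand.
written_out = re.findall(r"RGRG", series)

# Grouping red-green once and asking for exactly two of the group.
grouped = re.findall(r"(?:RG){2}", series)

# The shortcut pays off as the count grows: five red-green pairs in a row.
five_pairs = re.fullmatch(r"(?:RG){5}", "RG" * 5)

print(written_out)             # ['RGRG']
print(grouped)                 # ['RGRG']
print(five_pairs is not None)  # True
```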
Ranges of occurrences
Not only can we match exact numbers like we did above, we can also match ranges of quantities. For example, we can match a range of 1 to 3 occurrences of that red-green group like this:
Since there are two occurrences of that pattern and we’re trying to find anywhere from 1 to 3, we automatically have one match with the last four colors. One that may have slipped by unnoticed, though, was the last two colors. They also form a match by themselves, since it is one occurrence of the red-green group.
But why not also form a match with the second and third squares, since they are also one occurrence of the red-green group? The answer is a little confusing, but important to grasp, so I will step through the process. When trying to find matches using this pattern, we step through the series and examine each square to see if it is the beginning of a match. If it is, we try to find the longest match possible. Once a match is completed, we move on to the next square. For this reason, there would never be two matches starting with the same square. Visually, it looks a little like this:
We would start at the first square and try to see if that square was the beginning of our pattern. Since it’s blue, and the beginning of the pattern we’re looking for is red, there’s no match, and we advance to the next square.
This time, we see that the square is red and followed by a green, so there’s one match of our pattern. But we don’t stop there. We keep pushing forward in the series until we can’t any longer. Because of this, we end up finding two occurrences of our red-green group, thus completing our first match.
We’re done looking at the second square now, so we move on to the third.
Just like the first square, there’s no match here, since the square is green, and we move on.
Here, we find exactly one occurrence of the red-green group, but there’s nothing after it, so we end the match there.
Our last square is no surprise, and there is no match. Since there are no more squares in the series to examine, we’re done here, left with two matches for our pattern.
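The square-by-square walk can be sketched in Python as well. This assumes a made-up letter encoding (B = blue, R = red, G = green) and a series reconstructed from the description above (blue, then red-green twice). The loop tries every square as a starting point, which is the model used in this walkthrough. For comparison, the sketch also runs Python’s built-in scan, which behaves a little differently: `finditer` resumes after the end of the previous match, so it does not report matches that overlap an earlier one.

```python
import re

# Hypothetical encoding: B = blue, R = red, G = green
series = "BRGRG"  # reconstructed from the walkthrough, not copied from the figure
pattern = re.compile(r"(?:RG){1,3}")  # 1 to 3 red-green groups

# The walkthrough's model: test every square as a possible start and
# keep the longest run of red-green groups that begins there.
per_square = []
for start in range(len(series)):
    m = pattern.match(series, start)  # match must begin exactly at `start`
    if m:
        per_square.append((start, m.group()))

# Python's own scan: continues from where the previous match ended.
non_overlapping = [(m.start(), m.group()) for m in pattern.finditer(series)]

print(per_square)       # [(1, 'RGRG'), (3, 'RG')]
print(non_overlapping)  # [(1, 'RGRG')]
```

The first result has the two matches from the walkthrough, one starting at the second square and one at the fourth. The second result shows that a real engine, scanning without overlaps, would report only the longer one.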
Getting a little more specific
At this point, we’ve got quite the toolset to make some pretty complicated matches. We could stop here, but I’d like to introduce a couple more things. Right now, we can’t specify where in our series a match has to be found. There’s no way for us to say “this pattern must be found at the beginning of the series” or “this pattern must be found at the end of the series”.
What we’re going to add is exactly that.
We’ll use the caret symbol (^) to indicate the start of a series. What this means in this pattern is that we’re looking for a red and an adjacent green square at the start of the series and nowhere else. That’s why we only see one match in this series, not two.
Similar to the start symbol, we have a symbol to designate the end of a sequence, and that is the dollar sign ($). In the context of this pattern it means we’re looking for a red square followed by a green square followed immediately by the end of the series.
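As a quick sketch of both symbols in Python, assume a made-up letter encoding (B = blue, R = red, G = green) and an illustrative series that has a red-green pair at each end. The caret and dollar sign are written exactly as they are in our color patterns.

```python
import re

# Hypothetical encoding: B = blue, R = red, G = green
series = "RGBRG"  # illustrative series: red-green, blue, red-green

# Without anchors, both red-green pairs are found.
anywhere = [m.start() for m in re.finditer(r"RG", series)]

# ^ pins the pattern to the start of the series.
at_start = [m.start() for m in re.finditer(r"^RG", series)]

# $ pins the pattern to the end of the series.
at_end = [m.start() for m in re.finditer(r"RG$", series)]

print(anywhere)  # [0, 3]
print(at_start)  # [0]
print(at_end)    # [3]
```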
We can use both symbols in a single pattern to match the entire series. In this example, we see that we’re looking for the start of the series followed by a single purple-gray pair, followed by the end of the series.
In this example, we have two purple-gray pairs, but we find no matches. Even though there’s a pair at the start and a pair at the end, it still doesn’t match the pattern. The pattern is very strict in saying that there should be exactly one purple square at the start, followed by one gray square, followed immediately by the end of the series. Since this isn’t the case here, there are no matches.
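The strictness of using both symbols together is easy to check in Python. This sketch assumes a made-up letter encoding (P = purple, Y = gray); the two series are illustrative.

```python
import re

# Hypothetical encoding: P = purple, Y = gray
pattern = re.compile(r"^PY$")  # start, one purple-gray pair, end

two_pairs = pattern.search("PYPY")  # two pairs: too long for the pattern
one_pair = pattern.search("PY")     # exactly one pair: fits

print(two_pairs)         # None
print(one_pair.group())  # PY
```

The series with two pairs finds nothing because the end of the series does not come immediately after the first gray square.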
Getting a lot more generic
Last thing for now, I promise.
What if we didn’t care what color a square was? What if we wanted to let it be any color? Well, we could create a set with every color in it, but that would be tedious, especially if we had dozens or hundreds of colors. Instead, we designate another character, the period/full-stop (.).
Anywhere this character is dropped in, we allow for any generic color to be matched.
In this pattern, we’re looking for a red-green pair, followed by any color, followed by a purple-gray pair.
When scanning the series, we find a red-green pair, followed by a blue, followed by purple and gray, making a match for our pattern.
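Here is that pattern as a Python sketch, assuming a made-up letter encoding (B = blue, R = red, G = green, Y = gray, P = purple) and an illustrative series. One detail that only matters for real text: by default, Python’s `.` matches any character except a newline.

```python
import re

# Hypothetical encoding: B = blue, R = red, G = green, Y = gray, P = purple
series = "BRGBPYG"  # illustrative series

# Red-green, then any one square, then purple-gray.
match = re.search(r"RG.PY", series)

print(match.group())  # RGBPY
print(match.start())  # 1
```

The blue square in the middle is accepted by the period, and any other color in that position would have matched just as well.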
In this article, we’ve covered matching basic colors, sets of colors, groupings of patterns, checking for any number of repetitions of patterns, patterns locked to one or both ends of a series, and matching generic colors (a lot, I know).
The good news is that if you came here to learn how regular expressions work, most of the groundwork has been laid here. Once you understand basic pattern matching, transitioning from colored squares in a series to characters in a string seems a little more natural.
If you made it this far, thank you! I hope you found this article useful and fun to read (I definitely had fun making the visualizations). If you enjoyed it or found it useful, show it some love or follow me on Twitter to keep up with what I’m working on!