Dipping Toes Into Regular Expressions

Kristy Parker
8 min read · Sep 10, 2020


I’ve used regular expressions only lightly in the past. I had just enough exposure to them to recognize that they are extremely powerful, and also overwhelming at first glance.

A regular expression with a border around it. It reads: ^\s*(#{1,6})\s+(.*?)\s*$
How can this possibly be meaningful?

I recently completed a code challenge that asked me to turn a string of markdown text into an HTML heading. The regular expression above made my life a lot easier, and my code a lot cleaner, than the other approaches I considered for the same task.

This post will briefly go over regular expressions, then go into more detail on how this piece of code in particular was able to pull out the exact information needed from a string input.

REGULAR EXPRESSIONS

Regular expressions are text strings for describing a search pattern. The idea is that by choosing a set of tokens to specify exactly what you are looking for, you can select one or many matches, and pull out very specific information for comparisons. And you have a lot of tools to choose from.

An assortment of tools on a table

Tokens can be divided up into several categories. Some of the most common are listed below, but there are more to work with.

Anchors will help you designate where what you’re searching for is located. Are you looking for something at the beginning or end of a string of text? Or perhaps at a word boundary? Anchors will help you specify where you expect to find something.
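As a small sketch of anchors in JavaScript (the sample string is made up for illustration), ^ ties a pattern to the start of the input, $ to the end, and \b to a word boundary:

```javascript
// Sample string invented for this illustration.
const text = "header one";

// ^ requires the match to begin at the start of the input.
const startsWithHeader = /^header/.test(text);
// $ requires the match to finish at the end of the input.
const endsWithHeader = /header$/.test(text);
// \b requires a word boundary on each side of "one".
const wholeWordOne = /\bone\b/.test(text);
// "head" is only part of the word "header", so the trailing \b fails.
const wholeWordHead = /\bhead\b/.test(text);

console.log(startsWithHeader, endsWithHeader, wholeWordOne, wholeWordHead);
```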

Meta-sequences allow you to specify the type of characters you’re looking for, and commonly have a negated version. For example, if you are looking for any whitespace character, you can use \s to specify. However, if you want anything but a whitespace character to match, you can use \S. It is similar for word characters with \w and \W.
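To make the negated pairs concrete, here is a quick sketch that counts each kind of character in a sample string (the string and variable names are assumptions for illustration):

```javascript
// Sample string invented for this illustration: 12 characters in total.
const sample = "## My header";

// The g flag collects every match instead of stopping at the first one.
const whitespace = sample.match(/\s/g);    // the two spaces
const nonWhitespace = sample.match(/\S/g); // everything that is not whitespace
const wordChars = sample.match(/\w/g);     // letters, digits, and underscore
const nonWordChars = sample.match(/\W/g);  // the hashes and the spaces

console.log(whitespace.length, nonWhitespace.length);
console.log(wordChars.length, nonWordChars.length);
```

Each pair splits the string cleanly: every character is either \s or \S, and either \w or \W.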

Quantifiers allow you to control how many characters of a certain type you want to match. This could be by using a specific number or range, or could allow you to select all of the specified characters that are present.
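A rough sketch of the most common quantifiers, using a run of hashes as the sample input (the variable names are just for illustration):

```javascript
const hashes = "####";

// {2} matches exactly two of the preceding token.
const exactlyTwo = hashes.match(/#{2}/)[0];
// {1,3} matches between one and three, taking as many as it is allowed.
const oneToThree = hashes.match(/#{1,3}/)[0];
// + matches one or more.
const oneOrMore = hashes.match(/#+/)[0];
// * matches zero or more, so it succeeds with an empty match when no hash is present.
const zeroOrMore = "abc".match(/#*/)[0];
// + needs at least one hash, so there is no match at all here.
const needsOne = "abc".match(/#+/);

console.log(exactlyTwo, oneToThree, oneOrMore, JSON.stringify(zeroOrMore), needsOne);
```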

Flags and modifiers allow you to specify how you are searching. For example, you can choose whether to stop at the first match or to find all matches. You can also choose how anchors treat a multi-line input: either the beginning and end anchors match at the beginning and end of each line, or they match only at the beginning and end of the entire block of text.
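Here is a short sketch of the two flags that matter most for this post, g (global) and m (multiline), on a two-line sample string invented for illustration:

```javascript
// Two lines of sample input, separated by a newline.
const lines = "# one\n## two";

// No flags: the search stops at the first match.
const firstOnly = lines.match(/#+/);
// g: every match in the input is returned.
const all = lines.match(/#+/g);
// g without m: ^ only matches at the very start of the whole input.
const withoutM = lines.match(/^#+/g);
// g with m: ^ also matches at the start of each line.
const withM = lines.match(/^#+/gm);

console.log(firstOnly[0], all, withoutM, withM);
```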

EXAMPLE

Ok, back to this code.

A regular expression with a border around it. It reads: ^\s*(#{1,6})\s+(.*?)\s*$

As stated in the intro, this is addressing a problem where we need to turn a string of markdown into an HTML heading. Below is an example of what that will look like.

Reads “# Header” → “<h1>Header</h1>” “## Header” → “<h2>Header</h2>” with border

In the input, we get a markdown string that contains a number of hashes. Those hashes will correspond to a header size between 1–6. These two examples are very kind versions though. We can expect that there could be spaces before or after the hashes, and before or after the text that will go inside of the heading.

Anything that doesn’t contain a hash followed by a space, followed eventually by a character, will not be considered valid markdown.

Given the above, we have three goals for our regular expression.

  1. Identify valid input
  2. Identify the number of hashes so we can assign a heading size
  3. Identify the text to go inside of the header

To be sure I was grabbing the information I wanted, I used a handy regex tool that allows you to see the effects of your changes instantly. It also has descriptive information about all of the selection tokens at your disposal.

I entered some test strings that I wanted to positively select, and some that I wanted to be certain were not selected as valid input.

Test cases split into valid and invalid input
Test Cases. The top group are valid inputs. The bottom group are invalid inputs.

Now I wanted to step through carefully to make my selections, and groups.

regular expression search bar

Since I have a bunch of test cases on different lines, and I know I want more than a single match, you can see that on the right-hand side I’m using two flags. The g is for global, and it will enable me to make more than one match. The second is m for multiline, which makes the ^ and $ anchors match at the beginning and end of each line.

Now it’s down to pattern matching. I find it helpful to put into words what I am trying to select for with regular expressions in the order that they appear.

In this case, the first thing I’m going to encounter is the potential white space at the beginning of the input. If my input is ‘ # header’, beginning with one or more spaces, I want to make sure that matches.

^\s in the regex search bar, highlighting any white space at the beginning of each line

Now this regex will match, starting at the beginning (^) of the input, a single whitespace character (\s).

But what if there’s more than one? We need a quantifier. We can use the greedy quantifier, *, which will match zero or more consecutive characters (whitespace characters, in our case).

Same example with ^\s* now in the search field, highlighting single or multiple spaces at the beginning of input
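The difference between ^\s and ^\s* can be checked in JavaScript too. This is only a sketch, and the sample strings are assumptions rather than the exact test cases from the screenshots:

```javascript
// Leading whitespace, zero or more characters, anchored to the start.
const leading = /^\s*/;

// Three leading spaces are all consumed by \s*.
const threeSpaces = "   # header".match(leading)[0].length;
// No leading spaces: \s* still matches, with an empty result.
const noSpaces = "# header".match(leading)[0].length;
// Without the *, at least one whitespace character is required.
const singleRequired = /^\s/.test("# header");

console.log(threeSpaces, noSpaces, singleRequired);
```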

Ok, the next part is one of our groups that we care about. This is information that we want to be able to pull out from the match because we need it to determine which size header to use. To make a group, we’ll put the next part in ().

This group we’re making is going to contain only hashes. That’s all we care about in that group.

Same example now with ^\s*(#) in the regex search field, and highlighting test fields that have any number of spaces then a #

Now we can see by the highlighting in green that we have formed a group. This group will be accessible to us later.

But we need more than one hash allowed in our group. We can’t use star, because only between one and six hashes will be valid since there are only six heading sizes. Instead, we’ll use {} quantifiers to state a range. Since our range is one to six, we can use {1,6}.

Same example with ^\s*(#{1,6}) in the search, highlighting that there are now groups with 1–6 hashes

Now we can see that our group contains up to six consecutive hashes. Our last line contains seven hashes, and only the first six are highlighted for our group. Notice that at this point, we still aren’t eliminating our invalid inputs.
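A quick sketch of the same step in JavaScript, with sample strings chosen for illustration. The captured group is available at index 1 of the match result:

```javascript
// Optional leading whitespace, then a captured run of one to six hashes.
const hashGroup = /^\s*(#{1,6})/;

// Three hashes after two spaces: the group holds just the hashes.
const three = "  ### header".match(hashGroup)[1];
// Seven hashes: the group stops at six, the most that {1,6} allows.
const cappedAtSix = "####### header".match(hashGroup)[1];
// No space after the hash: this invalid input still matches at this stage.
const stillMatches = "#header".match(hashGroup)[1];

console.log(three, cappedAtSix, stillMatches);
```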

To have a valid input, our one to six hashes must be followed by at least one space. We don’t want that in our group, so we’ll put it after our group parentheses.

Same example with ^\s*(#{1,6})\s now highlighting all with 0 or more spaces, followed by 1–6 hashes, followed by a space.

Great! Now we aren’t highlighting our invalid inputs because they don’t follow the pattern that we have set anymore. We are highlighting everything that starts with zero or more spaces, is followed by 1–6 hashes, and followed by a space. We’d like to allow for one or more spaces though. We’ll use the + quantifier for that.

Same example with ^\s*(#{1,6})\s+ now highlighting all with 0 or more spaces, followed by 1–6 hashes, then 1 or more spaces

*space added to show selection
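To see how requiring whitespace after the hashes filters out invalid input, here is a minimal sketch with a few made-up test strings:

```javascript
// Hashes must now be followed by at least one whitespace character.
const pattern = /^\s*(#{1,6})\s+/;

const simple = pattern.test("# header");           // one hash, one space
const extraSpaces = pattern.test("  ##    header"); // several spaces are fine
const noSpace = pattern.test("#header");           // nothing between hash and text
const sevenHashes = pattern.test("####### header"); // seventh hash sits where a space must be

console.log(simple, extraSpaces, noSpace, sevenHashes);
```

The seven-hash line fails because, after at most six hashes, the next character has to be whitespace, and it is another hash instead.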

Ok, now we can have more than one space after our hashes.

The next part is another group that is important to us. This group will contain the text that we need to insert into the heading. As you know already, we can use () to create a group that we will have access to later.

We can’t assume what the character will be, or what someone might choose for their heading, so we’ll select for any character using the . token, which matches any character except a line break.

We do know that we’ll want to allow for more than one character, up to an unlimited amount, which we already know is accomplished by using the * quantifier.

Same example with ^\s*(#{1,6})\s+(.*) in the search bar. All valid inputs now have from header text to the end grouped

Great! Except we don’t want the extra spaces after our heading title. No problem! We’ll just take care of the spaces similar to how we handled the first part. We’ll end it by selecting zero or more spaces (\s*), and instead of using a ^ to specify the beginning of the input, we’ll specify that we are accounting for all spaces attached to the end ($) of the input (or in our case each line, since we are using the m flag for multiline). The $ anchor will do it.

Same example except with ^\s*(#{1,6})\s+(.*)\s*$ in the search bar. No changes from the last image.

Except it didn’t.

We used the star in our second group for any character, which includes spaces. Since it’s greedy, it’s taking up all of the spaces up to the end of the line. We need to tell regex to be lazy and only match as few characters as it can get away with, but still to allow many characters. We don’t want our group to contain those extra spaces. We can add a ? directly after the * to make that quantifier lazy.

Same example with ^\s*(#{1,6})\s+(.*?)\s*$ in the search. hashes are contained in a group, headers in another
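The greedy and lazy versions can be compared side by side. This is a small sketch with an input string invented for illustration, showing what lands in the second group in each case:

```javascript
// Sample input with three trailing spaces.
const input = "## header   ";

// Greedy: .* grabs everything to the end, so \s* is left with nothing.
const greedy = input.match(/^\s*(#{1,6})\s+(.*)\s*$/);
// Lazy: .*? takes as little as possible, leaving the trailing spaces to \s*.
const lazy = input.match(/^\s*(#{1,6})\s+(.*?)\s*$/);

console.log(JSON.stringify(greedy[2])); // heading text with trailing spaces
console.log(JSON.stringify(lazy[2]));   // heading text only
```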

That looks right! So, now we can see that we are selecting all valid inputs, and we have separated out the hashes group, which we can use to determine the heading, and the heading title group, some of which contain hashes themselves.

Now what?

Table showing the matches and groups for each of the test code lines

Well, we have these groups! I used JavaScript for my solution, so I used match().

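For a single valid line, the return value of input.match(/^\s*(#{1,6})\s+(.*?)\s*$/) is an array of length three. The first item contains the full match, the second item contains the hashes, and the third item contains the heading text. Note that there is no g flag here: with the g flag, match() returns only the full matches and leaves out the groups.

Using the top line, that would return ['# header', '#', 'header'].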
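One detail worth checking in JavaScript is that match() only includes the capture groups when the pattern has no g flag. A minimal sketch, using the top test line as input:

```javascript
const line = "# header";

// No g flag: index 0 is the full match, then one entry per capture group.
const withGroups = line.match(/^\s*(#{1,6})\s+(.*?)\s*$/);
// With the g flag: only full matches are returned, and the groups are dropped.
const fullMatchesOnly = line.match(/^\s*(#{1,6})\s+(.*?)\s*$/g);

console.log(withGroups.length, withGroups[1], withGroups[2]);
console.log(fullMatchesOnly.length, fullMatchesOnly[0]);
```

If you need the groups for several lines at once, matchAll() with the g and m flags returns an iterator of results that each include their groups.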

So, for this problem, I set a constant to the returned array, then used a simple string interpolation to pop in the values I needed to make the correct headings. This made for a fairly concise solution.
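A minimal sketch of what that could look like is below. It is not the exact code from the challenge: the function name markdownToHeading is hypothetical, and returning invalid input unchanged is an assumption, since the expected behavior for invalid input isn't spelled out here.

```javascript
// Hypothetical helper: converts one line of markdown into an HTML heading.
function markdownToHeading(input) {
  // No g flag, so the result includes the capture groups.
  const match = input.match(/^\s*(#{1,6})\s+(.*?)\s*$/);

  // Assumption: invalid input is handed back unchanged.
  if (match === null) {
    return input;
  }

  // Skip the full match at index 0; keep the hashes and the heading text.
  const [, hashes, text] = match;
  // The number of hashes is the heading level.
  const level = hashes.length;

  return `<h${level}>${text}</h${level}>`;
}

console.log(markdownToHeading("# Header"));
console.log(markdownToHeading("  ###   Spaced out   "));
console.log(markdownToHeading("#Header"));
```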

I hope this was helpful!

RESOURCES

Here’s the regex101.com tool for crafting regex search terms

Check out the Regular Expressions documentation.

THANKS!

I’m forever learning; feel free to leave comments or feedback. I’m happy to make fixes and improve.


Kristy Parker

I’m a scientist turned software engineer who is excited to help modernize health and research. Connect on LinkedIn: www.linkedin.com/in/kristynparker/