Don’t Fear REGular EXpressions

Understand the language of Regex in your applications.

Nishant Aanjaney Jalan
CodeX
7 min read · Aug 22, 2023

Regex looks quite intimidating to new programmers; however, it is one of the more important concepts to learn. A good thing about Regex is that it is supported almost everywhere: nearly every programming language ships with a regex implementation.

Why should you learn Regex?

Regex (short for Regular Expressions) is a language for testing strings and finding patterns in them. It is useful for checking whether a string fits a particular requirement or for extracting information from long texts. Different languages expose regex through different interfaces and differ in some details; however, the core pattern syntax is largely consistent across them.

Note: My favorite website to check if I have the correct regex pattern is regex101.com. Feel free to experiment on the website.

Learning Regex by Token Types

Ranges

Anything inside [...] matches a single character: one of those listed in the brackets or, with a leading ^, one that is not listed.

// [abc] will match either a, or b, or c.
// [^abc] will match neither a, nor b, nor c.
// [a-z] will match any lowercase letter (ASCII 97-122).
// [^a-z] will match any character that isn't a lowercase letter.
// [a-z0-9A-Z] will match any letter or digit.

const word = "heylo"
const noMatch = word.match(/[abc]/) // will return null
const firstOther = word.match(/[^abc]/) // will match 'h'

const identifier = "99var"
const firstLower = identifier.match(/[a-z]/) // will match 'v'
const firstNonLower = identifier.match(/[^a-z]/) // will match '9'
const firstAlnum = identifier.match(/[a-z0-9A-Z]/) // will match '9'

Groups

As the name suggests, a group bundles several tokens together so that an operation can apply to all of them at once. Groups can be:

  1. Capturing: (...), where you can reference the group by a numeric ID.
  2. Non-capturing: (?:...), used only to apply an operation to the tokens together; there is no ID to retrieve this group.

See below to get a better idea of groups.
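
As a small preview, here is a minimal sketch of the difference between the two kinds of group. The sample string and variable names are made up for illustration; the point is only that a capturing group appears as a numbered entry in the match result, while a non-capturing group does not.

```javascript
// Hypothetical sample string, chosen only for this illustration.
const label = "id-42";

// Two capturing groups: the letters are group 1, the digits are group 2.
const captured = label.match(/([a-z][a-z])-([0-9][0-9])/);
console.log(captured[0]); // "id-42" (the whole match)
console.log(captured[1]); // "id"
console.log(captured[2]); // "42"

// Same pattern, but the letters sit in a non-capturing group,
// so the digits become group 1 and there is no group 2.
const nonCaptured = label.match(/(?:[a-z][a-z])-([0-9][0-9])/);
console.log(nonCaptured[1]); // "42"
console.log(nonCaptured.length); // 2 (whole match + one group)
```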

Repeaters (not official name)

Repeaters are quantifiers where you specify the number of times you want a pattern to repeat.

// {3} the previous token must repeat exactly 3 times.
// {2,3} the previous token must repeat at least 2 and at most 3 times.
// {0,3} the previous token must repeat at most 3 times (can be zero).
//       JavaScript treats {,3} as literal text, so write the 0 explicitly.
// {2,} the previous token must repeat at least 2 times (no upper limit)

const letters = "abbbbdd"
const threeAs = letters.match(/a{3}/) // will return null
const threeBs = letters.match(/b{3}/) // will return "bbb"
const twoPlusDs = letters.match(/d{2,}/) // will return "dd"

const mixedCase = "azAZZazZz"
const azRun = mixedCase.match(/[aA][zZ]{2,3}/) // will return "AZZ"
// match a or A followed by z or Z repeated between 2 and 3 times.

const pairs = "xyxyxyxy"
const upToThreePairs = pairs.match(/(xy){0,3}/) // will return "xyxyxy"
// can use a group to match a collective pattern.

Quantifiers

Quantifiers are shorthand for certain repeaters. We have 3 quantifiers to use.

// * the previous token must repeat zero or more times, aka {0,}
// + the previous token must repeat one or more times, aka {1,}
// ? the previous token must appear zero times or once, aka {0,1}

const str = "abbbbdd"
const zeroOrMoreA = str.match(/a*/) // will match "a"
const zeroOrMoreC = str.match(/c*/) // will match "" (zero repetitions still count)
const oneOrMoreB = str.match(/b+/) // will match "bbbb"
const optionalB = str.match(/ab?/) // will match "ab"

const pairs = "xyxyxyxy"
const allPairs = pairs.match(/(xy)+/) // will match "xyxyxyxy"
// can use a group to repeat a collective pattern.

Shorthand

There are many shorthand tokens that you can use in your regex pattern. You can find a lot more of them on regex101.com, in the bottom right corner.

// \s means any whitespace character. \S is the negation of \s
// \w means any word character. \W is the negation of \w
// \d means any digit. \D is the negation of \d

const str = "hello bye"
const whitespace = str.match(/\s+/) // will match " "
const digit = str.match(/\d/) // will return null
const digits = str.match(/\d*/) // will match ""
// (note, * means 0 occurrences too)
const wordChar = str.match(/\w/) // will match "h"

Lazy Quantifiers

The quantifiers you learned above are greedy quantifiers. They will try to match the maximum possible. Lazy quantifiers, on the other hand, will try to match the minimum possible. A great example is given in this StackOverflow thread.

Let’s say that you need to print out all text that is within double quotation marks ("). With greedy quantifiers, you would do the following:

const str = 'Regex is "awesome". Javascript is "cool". I love to "code"';
const greedyQuotes = str.match(/".+"/g); // (note the g for later)

// This matches "greedily": a single result covering everything
// from "awesome" to "code".
// Instead:

const lazyQuotes = str.match(/".+?"/g)
// now you have the desired outcome: ['"awesome"', '"cool"', '"code"']

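How does this work? Putting a ? after a quantifier makes it lazy: the engine stops at the first point where the rest of the pattern can match, instead of extending the match as far as possible. The default greedy behaviour is to consume as many characters as it can and give some back only when the rest of the pattern would otherwise fail.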

Miscellaneous

  1. . (dot): This matches any single character except a line break. In JavaScript, the s flag makes it match line breaks too.
  2. ^ : This tells the engine that the match must start at the beginning of the string.
  3. $ : This tells the engine that the match must end at the end of the string.
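
A quick sketch of these three tokens in action may help. The test strings are arbitrary examples, and test() is used here instead of match() only because a true/false answer is easier to read. One detail worth knowing: in JavaScript the dot does not match a line break unless the s flag is set.

```javascript
// "." stands in for exactly one character.
const dotMatchesLetter = /h.t/.test("hot");     // true
const dotMatchesNewline = /h.t/.test("h\nt");   // false: "." skips line breaks by default
const dotAllMatches = /h.t/s.test("h\nt");      // true with the s flag

// "^" pins the match to the start of the string, "$" to the end.
const startsWithCat = /^cat/.test("catalog");   // true
const startsWithCat2 = /^cat/.test("bobcat");   // false
const endsWithCat = /cat$/.test("bobcat");      // true

// Using both means the whole string has to fit the pattern.
const exactlyCat = /^cat$/.test("cat");         // true
const exactlyCat2 = /^cat$/.test("catalog");    // false

console.log(dotMatchesLetter, dotMatchesNewline, dotAllMatches);
console.log(startsWithCat, startsWithCat2, endsWithCat, exactlyCat, exactlyCat2);
```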

Global Modifiers

These are single letters placed after the closing slash of the pattern, as in /.../g. Several can be combined, for example /.../gi.

  1. global (‘g’): The regex returns all matches instead of stopping after the first.
  2. multiline (‘m’): ^ and $ match at the start and end of each line, not just of the whole string.
  3. case-insensitive (‘i’): lowercase and uppercase characters are considered equal.
  4. Ungreedy (‘U’): Makes all greedy quantifiers lazy. This flag exists in PCRE-based engines such as PHP; JavaScript does not support it.
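
To see how the g, i and m flags change the result, here is a minimal sketch on a made-up three-line string. The variable names are only for illustration.

```javascript
// Three lines, each spelling "cat" with different casing.
const text = "Cat\ncat\nCAT";

// No flags: only the first case-sensitive match is found.
const firstOnly = text.match(/cat/)[0];          // "cat"

// g: every case-sensitive match.
const allExact = text.match(/cat/g);             // ["cat"]

// g + i: every match, ignoring case.
const allAnyCase = text.match(/cat/gi);          // ["Cat", "cat", "CAT"]

// Without m, ^ and $ refer to the whole string, so nothing matches.
const wholeString = text.match(/^cat$/gi);       // null

// With m, ^ and $ apply to each line.
const perLine = text.match(/^cat$/gim);          // ["Cat", "cat", "CAT"]

console.log(firstOnly, allExact, allAnyCase, wholeString, perLine);
```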

Lookaheads and Lookbehinds

These are special groups that do not contribute to the match but affect it indirectly. Each of them has a positive and negative type.

Positive/Negative Lookaheads: With a pattern like posi(?=tive), the engine looks for every “posi” that is followed by “tive”. posi(?!tive) looks for every “posi” that is not followed by “tive”.

// (?=...) checks if the content ahead matches the group.
// (?!...) checks if the content ahead does not match the group.

const str = "It is positive, not negative";
const followedByTive = str.match(/posi(?=tive)/); // will match "posi"
const notFollowedByTive = str.match(/posi(?!tive)/); // will return null

Positive/Negative Lookbehinds: Similar to lookaheads, but the group is checked against the text that comes immediately before the rest of the pattern.

// (?<=...) checks if the content behind matches the group.
// (?<!...) checks if the content behind does not match the group.

const str = "It is positive, not nega-tive";
const afterNega = str.match(/(?<=nega)tive/); // will return null
const notAfterNega = str.match(/(?<!nega)tive/); // will match "tive"
// note the hyphen: the second "tive" follows "nega-", not "nega"

Note: These do not exhaust the language of Regex. But they are the most commonly used tokens.

Real-world Examples

The examples get progressively more complex.

Test if a given string is a Google email

const someString = "sudar.pichai@gmail.com"
const isGmail = someString.match(/[\w.]+@gmail\.com/)
[\w.]+  @gmail\.com
  • [\w.]+ : There should be at least one occurrence of a word character or a period. Note: . within a range [] exactly matches a dot.
  • @gmail\.com : Exactly match the characters @, g, m, a, i, l, ., c, o, m in that order.
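
One thing to keep in mind is that the pattern above is not anchored, so it succeeds whenever a Gmail-looking address appears anywhere inside the string. A possible refinement, sketched here under the assumption that the whole string should be the address, is to add ^ and $. The names looseGmail and strictGmail and the second test address are invented for this example, and neither pattern is meant as a complete email validator.

```javascript
// The pattern from the example above: matches anywhere in the string.
const looseGmail = /[\w.]+@gmail\.com/;

// Anchored variant: the whole string must be the address.
const strictGmail = /^[\w.]+@gmail\.com$/;

const genuine = "sudar.pichai@gmail.com";
const lookalike = "me@gmail.com.example.org"; // hypothetical non-Gmail address

console.log(looseGmail.test(genuine));    // true
console.log(looseGmail.test(lookalike));  // true: it contains "me@gmail.com"
console.log(strictGmail.test(genuine));   // true
console.log(strictGmail.test(lookalike)); // false: extra text after ".com"
```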

Testing if a given string is a URL

const someString = "https://www.medium.com"
const isURL = someString.match(/^https?:\/\/(?:www\.)?.+\.com/)
^  http  s?  :\/\/  (?:www\.)?  .+  \.com
  • ^ : The match must begin at the start of the string (or of a line, if the m flag is set). The match fails if the next token does not appear there.
  • http : Exactly match the characters h, t, t, p in that order.
  • s? : The character s may or may not appear; match both cases.
  • :\/\/ : Exactly match the characters :, /, / (Note, you must escape the forward slash inside a /.../ literal).
  • (?:www\.)? : Non-capturing group www\. may or may not appear; match both cases (Note, . must be escaped, because an unescaped . is the wildcard described in the next item).
  • .+ : . is a wildcard for any character. + is a quantifier saying that it must appear at least once.
  • \.com : Exactly match the characters ., c, o, m.

Convert all quoted words to uppercase

let someString = 'Regex is "awesome". Javascript is "cool". I love to "code"'
const quotedWords = someString.match(/(?<=")\w*?(?=")/g)
quotedWords.forEach(word => {
  someString = someString.replace(/(?<=")[a-z]*?(?=")/, word.toUpperCase())
})
  • (?<=")\w*?(?=") : This regex pattern returns all the words (note the ‘g’ modifier) that are preceded and followed by a double quotation mark, using a positive lookbehind and a positive lookahead.
  • (?<=")[a-z]*?(?=") : This matches the first lowercase word surrounded by double quotation marks. Each pass replaces that word with its uppercase form, so the next pass finds the next lowercase word.
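
The same result can probably be reached in a single pass, because String.prototype.replace accepts a function as its second argument and calls it once per match. The sketch below is one possible variant rather than the approach used above; the names sentence and shouted are invented for it.

```javascript
const sentence = 'Regex is "awesome". Javascript is "cool". I love to "code"';

// With the g flag, replace() visits every quoted word and swaps in
// whatever the callback returns for it.
const shouted = sentence.replace(/(?<=")\w*?(?=")/g, (word) => word.toUpperCase());

console.log(shouted);
// Regex is "AWESOME". Javascript is "COOL". I love to "CODE"
```

This avoids the second pattern entirely and does not depend on the words being lowercase to begin with.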

Retrieve package version from package.json

const fs = require("fs")
const path = require("path")

// Pass an encoding so readFileSync returns a string rather than a Buffer.
const packageJson = fs.readFileSync(path.join(__dirname, "package.json"), "utf8")
const packageVersion =
  packageJson.match(/(?<="version": ")(?:\d+\.){2}\d+(?:-(?:alpha|beta|rc)\d*)?(?=",\s*$)/m)[0];

For context, here is a basic package.json file:

{
  "name": "my-node-app",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}
(?<="version": ")  (?:\d+\.){2}\d+  (?:-(?:alpha|beta|rc)\d*)?  (?=",\s*$)

We need to extract the 1.0.0 while allowing for different channels such as alpha or beta releases.

  • (?<="version": ") : The first group is a positive lookbehind. The match must be preceded by the exact text "version": ".
  • (?:\d+\.){2}\d+ : This matches the actual version number. We open with a non-capturing group (?:\d+\.){2}, which says that we need a number and a period, occurring twice. This must be followed by a number, \d+. Hence, 1.0.0 matches this part.
  • (?:-(?:alpha|beta|rc)\d*)? : We have another non-capturing group which is optional, (?:...)?. Within this group, the pattern must start with a hyphen followed by either alpha or beta or rc. There may or may not be a number after the channel name.
  • (?=",\s*$) : We end with a lookahead, making sure the version is followed by ",. After that, only whitespace may remain before the end of the line ($ matches at line ends because of the ‘m’ flag). Since all of this sits inside the lookahead, none of it becomes part of the match.
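
To try the pattern without touching the file system, it can be run against inline strings. The sketch below wraps a variant of the pattern, one in which the trailing \s*$ check sits inside the lookahead, in a hypothetical helper named extractVersion that returns null rather than throwing when nothing matches. The sample JSON strings are invented for the demonstration.

```javascript
// Hypothetical helper: returns the version string, or null if none is found.
function extractVersion(jsonText) {
  const versionPattern =
    /(?<="version": ")(?:\d+\.){2}\d+(?:-(?:alpha|beta|rc)\d*)?(?=",\s*$)/m;
  const match = jsonText.match(versionPattern);
  return match ? match[0] : null;
}

// Made-up package.json contents for the demonstration.
const stable =
  '{\n  "name": "my-node-app",\n  "version": "1.0.0",\n  "main": "index.js"\n}';
const preRelease =
  '{\n  "name": "my-node-app",\n  "version": "2.1.0-beta2",\n  "main": "index.js"\n}';
const noVersion = '{\n  "name": "my-node-app"\n}';

console.log(extractVersion(stable));     // "1.0.0"
console.log(extractVersion(preRelease)); // "2.1.0-beta2"
console.log(extractVersion(noVersion));  // null
```

In a real project, JSON.parse(packageJson).version is usually the simpler route; the regex is shown here as an exercise in combining lookarounds, groups and quantifiers.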

Conclusion

Regex is a very useful tool that is used in many applications. It is handy for testing strings and extracting information in a very specific way. While it might look bizarre to the untrained eye, it becomes quite readable with practice.

I hope you enjoyed reading my article and learned something. Thank you! Love what I do?

Consider Buying me a coffee!
Want to connect?

My GitHub profile.
My Portfolio website.
