Regex For Dummies: Part 3: Character Set, Or condition, and Word Boundary Assertion

NALSengineering
Sep 27, 2023

Credit: Nguyễn Thành Minh

In the previous part, we learned about Flavors and Flags in Regex.

In this part, we will cover the following concepts in Regex:

  • Character Set (also called character class)
  • Or condition (pipe character)
  • Word Boundary Assertion

1. Character Set []

A character set, also called character class, is a group of characters enclosed in square brackets [ ]. Character sets are used to define a set of characters that you want to match at a specific position in the text you're searching. They provide a way to specify a range of acceptable characters or a list of characters that can appear at that position.

For example, the pattern [123] will match any of 1, 2, and 3, while [xyz] will match any of x, y, and z.

Here are some examples of character sets and what they do:

  • [abc] matches either a, b, or c
  • [aeiou] matches any vowel character
  • [a-z] matches any lowercase letter from a to z
  • [A-Z] matches any uppercase letter from A to Z
  • [0-9] matches any digit from 0 to 9
  • [A-Za-z] matches any uppercase or lowercase letter
  • [A-Za-z0-9] matches any alphanumeric character

Inside the square brackets, most metacharacters (such as ., *, +, and ?) lose their special meaning, so you don’t need to escape them. A few characters remain special: the hyphen (-), which specifies ranges as in several of the character sets above; the caret (^) when it is the first character (see the next section); the closing bracket (]); and the backslash (\). For example, [abcd] is the same as [a-d]. Both match the "b" in "brisket" and the "c" in "chop".

However, the hyphen (-) is treated as a literal character when it is the first or last character inside the square brackets. For example, [abcd-] and [-abcd] match the "b" in "brisket", the "c" in "chop", and the "-" (hyphen) in "non-profit".
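
The behavior described in this section can be checked with a short script. This is a minimal sketch using Python's built-in re module; the variable names are only for illustration, and other flavors may differ in small details.

```python
import re

# [abcd] and [a-d] describe the same set, so they find the same characters.
listed = re.findall(r"[abcd]", "brisket chop")
ranged = re.findall(r"[a-d]", "brisket chop")
print(listed, ranged)

# A hyphen placed first or last in the set is a literal hyphen, not a range.
hyphen_last = re.findall(r"[abcd-]", "non-profit")
hyphen_first = re.findall(r"[-abcd]", "non-profit")
print(hyphen_last, hyphen_first)

# Inside a set, ".", "+" and "*" are ordinary characters.
literal_meta = re.findall(r"[.+*]", "1+1=2.0")
print(literal_meta)
```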

2. Negating a Character Set [^]

We can negate a character set to match any character that is not in the specified set. To negate a character set, you typically use the caret (^) as the first character within the square brackets []. Here’s the syntax:

[^characters_to_exclude]

For example:

  • [^0-9] matches any character that is not a digit (0-9).
  • [^aeiou] matches any character that is not a lowercase vowel (a, e, i, o, or u).
  • [^A-Z] matches any character that is not an uppercase letter.
  • [^a-zA-Z] matches any character that is not an uppercase or lowercase letter.
  • [^ \t\r\n] matches any character that is not a whitespace character (space, tab, carriage return, or newline).
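
As a rough illustration, the negated sets above can be tried in Python's re module. The sample strings and variable names are made up for this sketch.

```python
import re

# [^0-9] matches every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2c3")
print(non_digits)

# Removing everything that is not a lowercase vowel leaves only the vowels.
vowels_only = re.sub(r"[^aeiou]", "", "regular expression")
print(vowels_only)

# [^ \t\r\n]+ matches runs of characters that are not whitespace.
tokens = re.findall(r"[^ \t\r\n]+", "one two\tthree\nfour")
print(tokens)
```

Note that a negated set still has to match one character: [^0-9] does not match at the end of the string or on an empty string.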

3. Shorthand Character Sets

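Shorthand character sets allow you to match common groups of characters without listing each individual character explicitly. The equivalents given below hold for ASCII text; in Unicode-aware flavors, \d and \w can also match non-ASCII digits and letters. Here are some commonly used shorthand character sets: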

  • \d matches any digit, equivalent to [0-9].
  • \D matches any non-digit character, equivalent to [^0-9].
  • \w matches any word character (alphanumeric character plus underscore), equivalent to [a-zA-Z0-9_].
  • \W matches any non-word character, equivalent to [^a-zA-Z0-9_].
  • \s matches any whitespace character (space, tab, newline, etc.).
  • \S matches any non-whitespace character.

These shorthand character sets can be very handy when you want to match specific categories of characters in your regular expressions. Here are some examples of how to use them:

  • \d{3} matches any three-digit number like 123, 555,…
  • \D{3} matches any sequence of three non-digit characters like AbC, !@#,…
  • \w+ matches one or more word characters (letters, digits, or underscores) like a_bc, A12,…
  • \W{3} matches any sequence of three non-word characters like +-=, $%^,…
  • \s* matches zero or more whitespace characters.
  • \S{3} matches any sequence of three non-whitespace characters like a2@, A#3,…
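
The shorthand examples above can be sketched in Python as follows. The input strings are invented for illustration. The last two lines assume Python's default behavior for text patterns, where \d is Unicode-aware unless the re.ASCII flag is passed.

```python
import re

three_digits = re.findall(r"\d{3}", "call 555 or 123456")
words = re.findall(r"\w+", "a_bc A12 !@#")
symbols = re.findall(r"\W{3}", "x+-=y")
fields = re.split(r"\s+", "a  b\tc\nd")
non_space = re.findall(r"\S{3}", "a2@ A#3")
print(three_digits, words, symbols, fields, non_space)

# "\u0663" is the Arabic-Indic digit three. Python's \d matches it by default,
# but not when matching is restricted to ASCII.
unicode_digit = re.findall(r"\d", "\u0663")
ascii_digit = re.findall(r"\d", "\u0663", flags=re.ASCII)
print(len(unicode_digit), len(ascii_digit))
```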

4. Word boundary assertion

A word boundary assertion is used to specify a position in the text where a word begins or ends. It does not consume any characters; instead, it asserts a position in the text where a specific condition is met. Word boundaries are typically represented using the following metacharacters:

  • \b: Represents a word boundary that asserts the position where a word begins or ends.
  • \B: Represents a non-word boundary, which asserts the position where a word does not begin or end.

Examples:

  • \bm matches the "m" in "moon".
  • oo\b does not match the "oo" in "moon", because "oo" is followed by "n" which is a word character.
  • oon\b matches the "oon" in "moon", because "oon" is the end of the string, thus not followed by a word character.
  • \w\b\w will never match anything, because a word character can never be followed by both a non-word and a word character.
  • \bpre matches "pre" only when it appears at the beginning of a word.
  • suffix\b matches "suffix" only when it appears at the end of a word.
  • \Bing matches "ing" only when it is not at the beginning of a word, as in "sing".
  • \Bon matches "on" in "at noon".
  • ye\B matches "ye" in "possibly yesterday".
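
The word boundary examples can be verified with a small Python sketch. The helper found is a hypothetical name introduced here. The patterns are written as raw strings (r"...") because in an ordinary Python string "\b" means a backspace character.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"\bm", "moon"))               # "m" starts the word
print(found(r"oo\b", "moon"))              # "oo" is followed by "n"
print(found(r"oon\b", "moon"))             # "oon" ends the word
print(found(r"\w\b\w", "moon landing"))    # can never match

# \bpre only matches "pre" at the start of a word.
pre_words = re.findall(r"\bpre\w*", "prefix express preview")
print(pre_words)

# \B requires that there is no word boundary at that position.
on_start = re.search(r"\Bon", "at noon").start()
ye_start = re.search(r"ye\B", "possibly yesterday").start()
print(on_start, ye_start)
print(found(r"\Bing", "sing"), found(r"\Bing", "ingot"))
```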

5. Or condition using pipe character (|)

| (pipe character) acts as a logical OR operator, allowing you to specify multiple alternative patterns that you want to match. It means that if any of the specified patterns is found in the text being searched, the match will succeed.

For example, green|red matches "green" in "green apple" and "red" in "red apple".

  • \.jpg|\.png matches either “.jpg” or “.png” file extensions
  • ^(I|You)\s.*$ matches sentences that start with either “I” or “You”
  • (http|https):// matches URLs that start with either “http://” or “https://”
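
Here is a minimal Python sketch of the alternation examples above. The sample inputs are assumptions chosen for illustration. The last lines show why the parentheses in ^(I|You) matter: the pipe has the lowest precedence, so without a group the anchor applies only to the first alternative.

```python
import re

colors = re.findall(r"green|red", "green apple, red apple")
extension = re.search(r"\.jpg|\.png", "photo.png").group()
print(colors, extension)

sentence = r"^(I|You)\s.*$"
starts_with_you = re.match(sentence, "You are here") is not None
starts_with_it = re.match(sentence, "It is here") is not None
print(starts_with_you, starts_with_it)

# group(1) holds whichever alternative inside the parentheses matched.
scheme = re.match(r"(http|https)://", "https://example.com").group(1)
print(scheme)

# Without the group, ^I|You means "^I" or "You" anywhere in the text.
ungrouped = re.search(r"^I|You", "Thank You") is not None
grouped = re.search(r"^(I|You)", "Thank You") is not None
print(ungrouped, grouped)
```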

6. Practice

Now we will try to solve the exercises from the previous part by applying all the metacharacters we have learned.

6.1. Hashtag Validation

  • The old regex: ^#.+$

The old answer has a mistake because it matches hashtags that contain whitespace, such as #summer vacation.

If we want to exclude hashtags with whitespace, we just need to replace the . with \S. However, the resulting regex, ^#\S+$, still matches hashtags like ##… because \S also matches “#”.

To fix it, we need to replace \S with [^\s#].

The final answer is ^#[^\s#]+$.

However, if you want your hashtag to include no special characters except for the underscore (_) character, the regex would be ^#\w+$.

If you want to exclude all special characters, including underscores, the regex would be ^#[^\W_]+$.
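
The progression of hashtag patterns can be compared side by side. The sketch below uses Python's re module; the helper is_match and the sample hashtags are hypothetical and only serve to show which inputs each pattern accepts.

```python
import re

def is_match(pattern: str, text: str) -> bool:
    """Return True if the anchored pattern accepts the whole text."""
    return re.search(pattern, text) is not None

OLD = r"^#.+$"
NO_SPACE = r"^#\S+$"
FINAL = r"^#[^\s#]+$"
WORD_ONLY = r"^#\w+$"
ALNUM_ONLY = r"^#[^\W_]+$"

print(is_match(OLD, "#summer vacation"))       # old regex accepts whitespace
print(is_match(NO_SPACE, "#summer vacation"))  # \S rejects whitespace
print(is_match(NO_SPACE, "##summer"))          # but \S still accepts "#"
print(is_match(FINAL, "##summer"))             # [^\s#] rejects a second "#"
print(is_match(FINAL, "#summer-2023"))         # other symbols are still allowed
print(is_match(WORD_ONLY, "#summer-2023"))     # \w rejects "-"
print(is_match(WORD_ONLY, "#summer_2023"))     # \w accepts "_"
print(is_match(ALNUM_ONLY, "#summer_2023"))    # [^\W_] rejects "_" as well
```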

6.2. Email Validation

  • The old regex: ^.+@.+\..+$

For a valid email in this exercise, we only allow the characters a-z, A-Z, 0-9, +, ., _, %, and - before the ‘@’ symbol, and only the characters a-z, A-Z, 0-9, and - after the ‘@’ symbol. The old regex used the metacharacter . to match any character, which is too permissive. We need to correct it as follows:

^[\w%+.-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-]+$
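
To see the difference between the old and the corrected pattern, here is a small Python sketch. The addresses are invented test inputs and is_email is a hypothetical helper, not part of any library.

```python
import re

OLD_EMAIL = r"^.+@.+\..+$"
EMAIL = r"^[\w%+.-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-]+$"

def is_email(pattern: str, text: str) -> bool:
    """Return True if the text is accepted by the given email pattern."""
    return re.search(pattern, text) is not None

print(is_email(EMAIL, "john.doe+news@example.com"))

# The old pattern accepts spaces because "." matches any character.
print(is_email(OLD_EMAIL, "not an email@x.y"))
print(is_email(EMAIL, "not an email@x.y"))

# The corrected pattern allows exactly one dot after the "@",
# so an address with a subdomain is rejected.
print(is_email(EMAIL, "user@mail.example.com"))
```

If addresses with subdomains need to be accepted, the part after the ‘@’ would have to be extended, which is outside the scope of this exercise.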

6.3. URL Validation

  • The old regex: ^https?:\/\/.+\..+$

This answer has a mistake as it cannot match URLs using other protocols like ftp or rtsp. Therefore, we need to modify it as follows:

^(https|http|ftp|rtsp):\/\/.+\..+$
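
The modified pattern can be tried out in Python as a quick check. The URLs below are made-up examples and is_url is a hypothetical helper. The escaped slashes (\/) are kept as in the article, although Python does not require them.

```python
import re

URL = r"^(https|http|ftp|rtsp):\/\/.+\..+$"

def is_url(text: str) -> bool:
    """Return True if the text is accepted by the URL pattern."""
    return re.search(URL, text) is not None

print(is_url("https://example.com"))
print(is_url("ftp://files.example.com"))
print(is_url("rtsp://cam.local/stream"))
print(is_url("mailto://a.b"))        # protocol is not in the list
print(is_url("http://localhost"))    # no dot after the protocol

# The group captures which protocol matched.
protocol = re.search(URL, "ftp://files.example.com").group(1)
print(protocol)
```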

Conclusion

While there are still advanced concepts in regex to learn, the concepts covered above are enough to solve many common problems with regex. In the next part, we will explore more advanced concepts in Regex.
