Regular Expression For Dummies. Part 1: Quantifiers
Credit: Nguyễn Thành Minh (Android Developer)
1. Why should we learn Regular Expression (Regex)?
Have you ever been involved in a project with over 500 code files and realized there are a lot of hard-coded dimensions? Your task is to find and fix them like this:
width: 40 -> width: 40.responsive()
height: 50 -> height: 50.responsive()
top: 60 -> top: 60.responsive()
bottom: 70 -> bottom: 70.responsive()
left: 40 -> left: 40.responsive()
right: 30 -> right: 30.responsive()
fontSize: 13 -> fontSize: 13.responsive()
horizontal: 10 -> horizontal: 10.responsive()
vertical: 20 -> vertical: 20.responsive()
How much time does it take to do this? It will surely take a lot of time and effort if you have to manually search and fix each place. However, it only takes me 2 minutes to complete. That’s thanks to a powerful feature in IDEs and Text Editors called Find and Replace by Regex.
Of course, you need to know Regex to be able to use this feature. That’s why you should learn Regex. It will make your life much easier.
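As a rough illustration of the idea, the same find-and-replace can be sketched in Python with re.sub. The sample source text and the pattern below are assumptions made for this sketch, not the exact regex typed into the IDE, and the pattern uses a few constructs (groups, \d, a lookahead) that this part of the series has not introduced yet.

```python
import re

# Hypothetical sample of source code containing hard-coded dimensions.
source = (
    "width: 40,\n"
    "height: 50,\n"
    "fontSize: 13,\n"
    "left: 40.responsive(),\n"  # already fixed, should stay unchanged
)

# Group 1 captures the property name, group 2 the integer value.
# (?![\d.]) rejects a value that is followed by another digit or a dot,
# so decimals and already-converted values are left alone.
pattern = re.compile(
    r"\b(width|height|top|bottom|left|right|fontSize|horizontal|vertical)"
    r": (\d+)(?![\d.])"
)

fixed = pattern.sub(r"\1: \2.responsive()", source)
print(fixed)
```

In editors such as VS Code or Android Studio, the replacement field usually refers to the captured groups as $1 and $2 rather than \1 and \2.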
In addition to Find and Replace, Regex also proves useful for us in many other scenarios such as:
- Data Validation and Parsing: Regular expressions are often used to validate user input, such as email addresses, phone numbers, and passwords, against specific patterns. They can also be used to parse structured data from unstructured text.
- Web Scraping: When extracting information from web pages or other structured documents, regular expressions can help identify and capture specific data patterns.
2. What is Regex?
[-+.A-Za-z0-9]+@[A-Za-z0-9][A-Za-z0-9-]+(\.[A-Za-z0-9][A-Za-z0-9-]+)+
If you’re anything like me, you’re probably wondering what the f❀ck I am looking at.
“Regex”, short for “regular expression,” is a powerful sequence of characters that forms a search pattern. It is a text string that describes a specific pattern of characters and is used to perform operations like search, match, replace, and validation within strings of text. Regular expressions are supported in various programming languages, text editors, and tools.
Regular expressions consist of a combination of characters, special symbols, and metacharacters, which define a set of rules for pattern matching. These patterns allow you to perform complex string manipulations with concise and flexible syntax.
For example:
- 0\d{9} matches any 10-digit phone number starting with 0, such as 0905123456, 0907987654
- [-+.A-Za-z0-9]+@[A-Za-z0-9][A-Za-z0-9-]+(\.[A-Za-z0-9][A-Za-z0-9-]+)+ matches emails such as ntminh@gmail.com, minhnt3@nal.vn, …
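To see the two patterns at work, here is a minimal sketch using Python's re module. The extra strings 12345 and no-at-sign are made-up negative cases added for contrast, and the sketch assumes Python treats these particular patterns the same way as the flavor used in the article.

```python
import re

phone = re.compile(r"0\d{9}")
email = re.compile(
    r"[-+.A-Za-z0-9]+@[A-Za-z0-9][A-Za-z0-9-]+(\.[A-Za-z0-9][A-Za-z0-9-]+)+"
)

# fullmatch succeeds only when the whole string fits the pattern.
phone_ok = [bool(phone.fullmatch(s)) for s in ("0905123456", "0907987654", "12345")]
email_ok = [
    bool(email.fullmatch(s))
    for s in ("ntminh@gmail.com", "minhnt3@nal.vn", "no-at-sign")
]

print(phone_ok)  # [True, True, False]
print(email_ok)  # [True, True, False]
```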
It may look complex, but it’s really not difficult to learn. In this series, I will introduce metacharacters from simple to complex, along with plenty of exercises for practice. These exercises will help us gain a deeper understanding of Regex.
3. Quantifiers
In Regex, a “quantifier” specifies how many times a certain character or group of characters should appear in the input string. Quantifiers control the repetition of characters and help define the flexibility of a match. They allow you to specify a range of occurrences, making your patterns more flexible and capable of matching different lengths of text.
In this article, we will look into quantifiers like {n}, {n,}, {n,m}, ?, +, * and the most popular metacharacter in Regex: . (dot).
3.1. Quantifier: {n}
{n}: “n” is a non-negative integer. This quantifier specifies that the preceding character or group should occur exactly “n” times. It allows you to specify a fixed number of repetitions for a certain pattern. For example:
- a{3} matches aaa
- abc{4} matches abcccc
- Hel{2}o matches Hello
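The examples above can be checked with a short Python sketch. The last line uses parentheses to form a group, which is an addition for contrast: it shows that without a group, the quantifier repeats only the single character right before it.

```python
import re

# {n} applies only to the element immediately before it.
exact_results = {
    pattern: bool(re.fullmatch(pattern, text))
    for pattern, text in [
        ("a{3}", "aaa"),
        ("abc{4}", "abcccc"),  # only the 'c' is repeated
        ("Hel{2}o", "Hello"),
    ]
}
print(exact_results)  # every value is True

# Too few or too many repetitions do not fit the whole string.
too_short = bool(re.fullmatch(r"a{3}", "aa"))
too_long = bool(re.fullmatch(r"a{3}", "aaaa"))
print(too_short, too_long)  # False False

# Parentheses make a group, so the whole group is repeated.
grouped = bool(re.fullmatch(r"(abc){2}", "abcabc"))
print(grouped)  # True
```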
To practice Regex while learning it, I often use the website https://regex101.com/. The interface is as follows:
First of all, you only need to know the three red-highlighted areas: the place to input regex, the place to input test cases, and the area displaying matching results. The other two areas, Regex Flavor and Regex Flag, will be explained in the next article. For the time being, we should use Regex Flavor ECMAScript and Regex Flag gm.
We can easily see that when entering the regex a{3} with the provided test cases as shown in the image, there are 2 matching results.
3.2. Quantifier: {n,}
x{n,}: “n” is a non-negative integer. This quantifier is used to match “n” or more occurrences of the preceding character or group. It specifies a lower bound for the number of times the pattern should appear. For example:
- a{3,} matches strings with three or more consecutive ‘a’ characters, such as aaa, aaaa
- abc{4,} matches abcccc, abccccc, …
- Hel{2,}o matches Hello, Hellllllo, …
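A small Python sketch of the lower bound in action. The strings aa and Helo are made-up cases that fall below the bound and therefore should not match.

```python
import re

at_least_three = re.compile(r"a{3,}")

# Only strings with three or more 'a' characters fit the whole pattern.
lower_bound = {
    s: bool(at_least_three.fullmatch(s)) for s in ("aa", "aaa", "aaaa", "aaaaaaaa")
}
print(lower_bound)  # {'aa': False, 'aaa': True, 'aaaa': True, 'aaaaaaaa': True}

# 'Helo' has a single 'l', which is below the minimum of two.
hello_like = [s for s in ("Helo", "Hello", "Hellllllo") if re.fullmatch(r"Hel{2,}o", s)]
print(hello_like)  # ['Hello', 'Hellllllo']
```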
3.3. Quantifier: {n,m}
x{n,m}: “n” and “m” are non-negative integers and n ≤ m. This quantifier is used to match a range of occurrences of the preceding character or group. It specifies that the pattern should appear at least "n" times and at most "m" times. It allows you to define a lower and upper bound for the number of repetitions. Write it without a space after the comma, as in the examples below, because many flavors do not treat {n, m} as a quantifier.
For example:
- a{2,4} matches strings with 'a' repeated 2 to 4 times consecutively, such as aa, aaa, aaaa.
- ab{3,5} matches strings with 'a' followed by 'b' repeated 3 to 5 times consecutively, such as abbb, abbbb, abbbbb.
- ab{3,5}c matches strings where the character ‘a’ is followed by ‘b’ repeated 3 to 5 times consecutively, and then followed by the character ‘c’, such as abbbc, abbbbc, abbbbbc.
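The range can be tried out with the Python sketch below. The candidate list is made up so that one string falls below the range and one above it. The last two lines check one detail in Python's re module specifically: with a space inside the braces, the braces are read as literal text rather than as a quantifier. Other flavors may behave differently.

```python
import re

candidates = ["abbc", "abbbc", "abbbbc", "abbbbbc", "abbbbbbc"]

# Only 3, 4 or 5 consecutive 'b' characters are accepted.
in_range = [s for s in candidates if re.fullmatch(r"ab{3,5}c", s)]
print(in_range)  # ['abbbc', 'abbbbc', 'abbbbbc']

# With a space after the comma, Python treats "{2, 4}" as plain characters.
space_matches_run = bool(re.fullmatch(r"a{2, 4}", "aaa"))
space_matches_literal = bool(re.fullmatch(r"a{2, 4}", "a{2, 4}"))
print(space_matches_run, space_matches_literal)  # False True
```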
3.4. Metacharacter: .
It is not a quantifier, but it is the most popular metacharacter in Regex. The . (dot) character in regular expressions matches any single character except for a newline (\n). It's a wildcard that can stand in for any character.
Here are some examples to illustrate the usage of the . character:
- .: This pattern matches any single character, such as a, á, b, c, 1, 2, 3, 世, €, 🌟, @, #
- a.b: This pattern matches any string that has an 'a', followed by any character, and then followed by a 'b', such as axb, a4b, a$b
- .{2,} matches two or more of any character (except for newline characters), such as abcd, 123, @@, :D
- a.b.{3}c: This pattern matches any string that has an 'a', followed by any character, followed by 'b', followed by any three characters, and then followed by 'c', such as a1b123c, axbxyzc, a€b世~@c
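Here is a minimal Python sketch of the dot. The strings ab and axxb are made-up negative cases: the dot stands for exactly one character, so zero or two characters in its place do not fit.

```python
import re

# One dot = exactly one character, and never a newline by default.
single = [bool(re.fullmatch(r".", s)) for s in ("a", "世", "€", "#", "\n", "ab")]
print(single)  # [True, True, True, True, False, False]

a_dot_b = [s for s in ("axb", "a4b", "a$b", "ab", "axxb") if re.fullmatch(r"a.b", s)]
print(a_dot_b)  # ['axb', 'a4b', 'a$b']

# The dot combines with the quantifiers from the previous sections.
combined = [s for s in ("a1b123c", "axbxyzc", "a1b12c") if re.fullmatch(r"a.b.{3}c", s)]
print(combined)  # ['a1b123c', 'axbxyzc']
```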
3.5. Quantifier: ?
The ? character indicates that the preceding character or group is optional, meaning it can appear zero or one time.
Here are some examples to illustrate the usage of the ? character:
- colou?r: This pattern matches either color or colour, as the u is optional.
- .? matches zero or one arbitrary character. Examples: (empty), a, b, #, 1.
- a.?z matches az, abz, a1z, a$z
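The optional behaviour can be sketched in Python as follows. The strings colouur and abcz are made-up cases that contain one repetition too many.

```python
import re

# The 'u' may appear once or not at all, but not twice.
spellings = [s for s in ("color", "colour", "colouur") if re.fullmatch(r"colou?r", s)]
print(spellings)  # ['color', 'colour']

# ".?" allows at most one arbitrary character between 'a' and 'z'.
optional_any = [s for s in ("az", "abz", "a$z", "abcz") if re.fullmatch(r"a.?z", s)]
print(optional_any)  # ['az', 'abz', 'a$z']

# ".?" on its own also accepts the empty string.
empty_ok = bool(re.fullmatch(r".?", ""))
print(empty_ok)  # True
```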
3.6. Quantifier: +
The + character indicates that the preceding character or group should appear one or more times. It's used to specify that something should be repeated at least once.
Here are some examples to illustrate the usage of the + character:
- a+ matches strings that contain one or more consecutive occurrences of the character ‘a’, such as a, aa, aaaaaaa
- .+ matches strings that have at least one character, such as a, ab, abc123
- a.+z matches abz, abcdz, aaaaaaz
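A short Python sketch of +. The empty string and az are made-up cases added to show that at least one repetition is required.

```python
import re

# The empty string has no 'a' at all, and 'ab' contains a foreign character.
one_or_more = [s for s in ("", "a", "aaaaaaa", "ab") if re.fullmatch(r"a+", s)]
print(one_or_more)  # ['a', 'aaaaaaa']

# "a.+z" needs at least one character between 'a' and 'z', so 'az' is rejected.
a_plus_z = [s for s in ("az", "abz", "abcdz", "aaaaaaz") if re.fullmatch(r"a.+z", s)]
print(a_plus_z)  # ['abz', 'abcdz', 'aaaaaaz']
```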
3.7. Quantifier: *
The * character indicates that the preceding character or group should appear zero or more times. It's used to specify that something can be repeated any number of times, including zero times.
Here are some examples to illustrate the usage of the * character:
- a* matches strings that contain zero or more consecutive occurrences of the character ‘a’, such as (empty), a, aa, aaaaaaa
- .* matches strings that contain any sequence of characters, including zero characters, such as (empty), a, abc123
- a.*z matches az, abz, abcdz, aaaaaaz
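The difference between * and + is easiest to see side by side. In this Python sketch, the strings b and a are made-up cases that should not match.

```python
import re

# Unlike a+, the pattern a* also accepts the empty string.
zero_or_more = [s for s in ("", "a", "aaaaaaa", "b") if re.fullmatch(r"a*", s)]
print(zero_or_more)  # ['', 'a', 'aaaaaaa']

# "a.*z" allows nothing at all between 'a' and 'z'.
a_star_z = [s for s in ("az", "abz", "abcdz", "a") if re.fullmatch(r"a.*z", s)]
print(a_star_z)  # ['az', 'abz', 'abcdz']

# Same input, two quantifiers: * accepts 'az', + does not.
star_accepts_az = bool(re.fullmatch(r"a.*z", "az"))
plus_accepts_az = bool(re.fullmatch(r"a.+z", "az"))
print(star_accepts_az, plus_accepts_az)  # True False
```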
3.8. Greedy/lazy quantifiers
By default, quantifiers like * and + are "greedy", meaning that they try to match as much of the string as possible. The ? character after the quantifier makes the quantifier "non-greedy" (or “lazy”), meaning that it matches as little as possible and stops as soon as the rest of the pattern can match. For example, given a string like some <foo> <bar> new </bar> </foo> thing:
- <.*> will match <foo> <bar> new </bar> </foo>
- <.*?> will match <foo>
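The same example can be reproduced with Python's re.search, which returns the first match in the string. The findall call at the end is an addition that shows the lazy pattern picking up each tag separately.

```python
import re

text = "some <foo> <bar> new </bar> </foo> thing"

# Greedy: ".*" runs to the last '>' it can reach.
greedy = re.search(r"<.*>", text).group()
print(greedy)  # <foo> <bar> new </bar> </foo>

# Lazy: ".*?" stops at the first '>' that completes the match.
lazy = re.search(r"<.*?>", text).group()
print(lazy)  # <foo>

# Collecting every lazy match yields the individual tags.
all_tags = re.findall(r"<.*?>", text)
print(all_tags)  # ['<foo>', '<bar>', '</bar>', '</foo>']
```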
4. Escaping
The backslash \ is known as the escape character. Placed before a metacharacter, it restores the literal meaning of that character.
Here are some examples:
- \. matches a literal dot (.), and does not match a, b
- \? matches a literal question mark (?)
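A minimal Python sketch of escaping. The sample string 1+1=2? is made up, and re.escape is a helper from Python's standard library that adds the backslashes for you. Other languages may or may not offer an equivalent.

```python
import re

# Unescaped, the dot is a wildcard. Escaped, it only matches a real dot.
wildcard_hits = [s for s in (".", "a", "b") if re.fullmatch(r".", s)]
literal_hits = [s for s in (".", "a", "b") if re.fullmatch(r"\.", s)]
print(wildcard_hits)  # ['.', 'a', 'b']
print(literal_hits)   # ['.']

# An escaped question mark matches the character itself.
question = bool(re.fullmatch(r"\?", "?"))
print(question)  # True

# re.escape backslash-escapes the special characters in a string,
# so the resulting pattern matches that exact text.
sample = "1+1=2?"
escaped_matches = bool(re.fullmatch(re.escape(sample), sample))
unescaped_matches = bool(re.fullmatch(sample, sample))
print(escaped_matches, unescaped_matches)  # True False
```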
Conclusion
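The regexes learned in this part can be summarized as follows:
- {n}: exactly n times
- {n,}: n or more times
- {n,m}: between n and m times
- ?: zero or one time
- +: one or more times
- *: zero or more times
- *? and +?: lazy versions of * and +, which match as little as possible
- .: any single character except a newline
- \: escapes the following character so that it is matched literally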
In the next article, we will focus on the practical part with exercises related to common programming problems.