Java RegEx: Part 6 — Group and Subgroup

Sera Ng.
Tech Training Space
5 min read · Oct 18, 2020

In previous parts, we worked with grouping: wrapping an entire pattern in a group so that users must either enter the whole matched string or leave it out entirely.

A group can also predefine a list of items that users must choose from, such as:

String pattern = "(music|sport|book|movie)";

With the above-defined group, users must select one of the items: music, sport, book, or movie.
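As a minimal sketch of how such a list could be checked, the snippet below tests whether an input is exactly one of the listed items. The class name InterestCheck and the helper isValidInterest are hypothetical names chosen for illustration; Pattern.matches() requires the whole input to match the pattern.

```java
import java.util.regex.Pattern;

public class InterestCheck {
    // Hypothetical helper: true only if the whole input is one of the listed items.
    static boolean isValidInterest(String input) {
        return Pattern.matches("(music|sport|book|movie)", input);
    }

    public static void main(String[] args) {
        System.out.println(isValidInterest("music"));   // true
        System.out.println(isValidInterest("cooking")); // false
    }
}
```

Because the whole input must match, a longer word such as "musical" is rejected even though it starts with one of the items.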

Also, in the last session, we invoked the group() method in the Matcher class to extract matched digits.

In this session, we are going to explore more on groups and sub-groups.

As a reminder, a group is a way to treat a sequence of characters as a single unit.

The main purpose of groups and subgroups is to split the input string into sections and extract those sections when they match the defined pattern.

In Java regular expressions, every pattern can have groups, and groups are indexed like array elements: the first group has index 0.

There are two types of groups: explicit and implicit groups:

- Characters placed in parentheses (()) are called explicit groups. These are the groups we used in the previous examples.

- Characters NOT placed in parentheses belong to the implicit group.

The implicit group always covers the entire pattern and has index 0. When a pattern contains no parentheses at all, the implicit group is its only group.

Observe the following pattern:

In the above pattern, there is one implicit group with index 0.

Observe another pattern:

In the above pattern, there are two groups:

- The first group (index 0) is the implicit group, which contains the entire pattern.

- The second group (index 1) is the explicit group, which is the one in parentheses: (\\d{3}).

Take a look at another example:

The above pattern has 3 groups, 1 implicit and 2 explicit:

- The first group (index 0) is the implicit group, which contains the entire pattern.

- The second group (index 1) is explicit: the one in the first pair of parentheses, (\\d{3}).

- The third group (index 2) is explicit: the one in the second pair of parentheses, (\\d{3}).
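To make the indexing concrete, here is a small sketch. The pattern (\\d{3})-(\\d{3}) and the input 123-456 are assumptions chosen for illustration, as are the class name GroupIndexDemo and the helper groupAt; they are not the patterns discussed above, but they have the same shape: one implicit group and two explicit ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {
    // Hypothetical helper: text captured by the given group index
    // in the first match, or null if the input does not match at all.
    static String groupAt(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String regex = "(\\d{3})-(\\d{3})"; // assumed pattern, for illustration only
        String input = "123-456";           // assumed input
        System.out.println(groupAt(regex, input, 0)); // 123-456 (implicit group)
        System.out.println(groupAt(regex, input, 1)); // 123     (first explicit group)
        System.out.println(groupAt(regex, input, 2)); // 456     (second explicit group)
    }
}
```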

Let’s study the following code:

As always, I have a pre-defined pattern:

searchString = "\\w+ \\d{1,2}-\\d{1,2}-\\d{4}";

The pattern matches a date string that:

- Starts with one or more word characters (letters a-z or A-Z, digits 0-9, or the underscore _), followed by a single space.

- Continues with 1 or 2 digits and a dash (-), which represent the day.

- Continues with 1 or 2 digits and a dash (-), which represent the month.

- Ends with 4 digits, which represent the year.

Then, we need to compile the pattern to make sure that it is syntactically valid:

pattern = Pattern.compile(searchString, Pattern.CASE_INSENSITIVE);

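Note that I supply a second argument to the Pattern.compile() method:

Pattern.CASE_INSENSITIVE

That flag makes letter matching case-insensitive, so letters in the input string can be upper or lower case. (Here \\w already accepts both cases, so the flag matters mainly when a pattern contains literal letters.)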
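A short sketch of what the flag changes. The class name CaseFlagDemo, the helper matchesWith, and the literal pattern "monday" are assumptions made for illustration; passing 0 as the flags argument means "no flags".

```java
import java.util.regex.Pattern;

public class CaseFlagDemo {
    // Hypothetical helper: does the whole input match the regex under the given flags?
    static boolean matchesWith(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Literal letters are case-sensitive unless the flag is set.
        System.out.println(matchesWith("monday", "MONDAY", 0));                        // false
        System.out.println(matchesWith("monday", "MONDAY", Pattern.CASE_INSENSITIVE)); // true
        // \w accepts upper and lower case letters even without the flag.
        System.out.println(matchesWith("\\w+", "MONDAY", 0));                          // true
    }
}
```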

I have also defined an input string; alternatively, you can read the input from users with the Scanner class as we have done previously:

text = "Monday 12-9-2013";

And just like in the previous example, I use the find() method to scan through the entire string. For each match, the group() method is invoked to extract the matched string:

while (matcher.find()) {
    System.out.println("found: " + matcher.group(0));
}

Note that I passed 0, the index of the implicit group, as the argument to the group() method.

So, the method call:

System.out.println("found: " + matcher.group(0));

is equivalent to:

System.out.println("found: " + matcher.group());

Now, if we run the program:

found: Monday 12-9-2013

We get the original string back because the entire input matched the pattern, and the whole match is the implicit group with index 0.
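Putting the fragments above together, a complete program might look like the sketch below. The class name WholeMatchDemo and the helper firstMatch are assumptions added so the example is self-contained; the pattern, flag, and loop come from the fragments above. The input uses plain ASCII hyphens so that it matches the - in the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeMatchDemo {
    static final String SEARCH_STRING = "\\w+ \\d{1,2}-\\d{1,2}-\\d{4}";

    // Hypothetical helper: group 0 of the first match, or null when nothing matches.
    static String firstMatch(String text) {
        Pattern pattern = Pattern.compile(SEARCH_STRING, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(0) : null;
    }

    public static void main(String[] args) {
        String text = "Monday 12-9-2013";
        Pattern pattern = Pattern.compile(SEARCH_STRING, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // group(0) and group() both return the whole match
            System.out.println("found: " + matcher.group(0));
        }
    }
}
```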

Suppose now I want to extract:

  • The day part
  • The month part
  • The year part

To do that, I need to place each part in a group so that I can extract it when there is a match. I change the pattern in the previous program so that each part sits in an explicit group:

searchString = "(\\w+ \\d{1,2})-(\\d{1,2})-(\\d{4})";

So, there are 4 groups in total:

  • The first group (index 0) is the implicit group: the entire pattern
  • The second group (index 1) is explicit, the one in the first pair of parentheses: (\\w+ \\d{1,2})
  • The third group (index 2) is explicit, the one in the second pair of parentheses: (\\d{1,2})
  • The fourth group (index 3) is explicit, the one in the third pair of parentheses: (\\d{4})

Once those groups are matched, we can extract each one by supplying the corresponding index to the group() method:

while (matcher.find()) {
    System.out.println("found: " + matcher.group(1));
    System.out.println("found: " + matcher.group(2));
    System.out.println("found: " + matcher.group(3));
}

We can also count the number of groups in the pattern by invoking the groupCount() method of the Matcher class:

System.out.println("There are " + matcher.groupCount() + " groups in the pattern!");

Note that groupCount() returns only the number of explicit groups, which is 3 in our example. The implicit group with index 0 is not counted.

Run the code and we have the outputs:

found: Monday 12

found: 9

found: 2013

There are 3 groups in the pattern!
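Since groupCount() reports the highest explicit group index, it can also drive a loop instead of hard-coding each group() call. The following is a sketch under that idea; the class name DatePartsDemo and the helper explicitGroups are hypothetical names, not part of the program above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DatePartsDemo {
    static final String SEARCH_STRING = "(\\w+ \\d{1,2})-(\\d{1,2})-(\\d{4})";

    // Hypothetical helper: the explicit groups (indexes 1..groupCount) of the
    // first match, or an empty array when the text does not match.
    static String[] explicitGroups(String text) {
        Matcher matcher = Pattern.compile(SEARCH_STRING, Pattern.CASE_INSENSITIVE).matcher(text);
        if (!matcher.find()) {
            return new String[0];
        }
        String[] parts = new String[matcher.groupCount()];
        for (int i = 1; i <= matcher.groupCount(); i++) {
            parts[i - 1] = matcher.group(i); // array slot 0 holds group 1, and so on
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : explicitGroups("Monday 12-9-2013")) {
            System.out.println("found: " + part);
        }
    }
}
```

The loop starts at 1 and ends at groupCount() inclusive, because index 0 is the implicit group and is not part of the count.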

The Java regular expression engine also supports nested groups, that is, groups inside groups. The inner groups are called sub-groups.

For instance, in the above example, suppose I want to extract the weekday from the first explicit group (index 1), which should produce "Monday". I then need to put the weekday part in another group, which becomes a sub-group of that first explicit group:

searchString = "((\\w+) \\d{1,2})-(\\d{1,2})-(\\d{4})";

Now, there are 5 groups in the pattern. Explicit groups are numbered by the position of their opening parenthesis, from left to right:

  • The first group (index 0) is the implicit group: the entire matched pattern.
  • The second group (index 1) is explicit, the one in the outer first pair of parentheses: ((\\w+) \\d{1,2}).
  • The third group (index 2) is now the sub-group nested inside the group with index 1: (\\w+)
  • The fourth group (index 3) is explicit, the one in the next pair of parentheses: (\\d{1,2})
  • The fifth group (index 4) is explicit, the one in the last pair of parentheses: (\\d{4})

Since we now have one more group, the sub-group, we need to invoke the group() method one more time, with index 4, to extract all grouped data:

while (matcher.find()) {
    System.out.println("found: " + matcher.group(1));
    System.out.println("found: " + matcher.group(2));
    System.out.println("found: " + matcher.group(3));
    System.out.println("found: " + matcher.group(4));
}

Run the program and we have the outputs as follows:

found: Monday 12

found: Monday

found: 9

found: 2013

There are 4 groups in the pattern!
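For reference, a self-contained sketch of the nested-group version is shown below. The class name NestedGroupDemo and the helper weekday are hypothetical; the pattern is the nested one discussed above, and the input uses plain ASCII hyphens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroupDemo {
    static final String SEARCH_STRING = "((\\w+) \\d{1,2})-(\\d{1,2})-(\\d{4})";

    // Hypothetical helper: the weekday captured by the nested sub-group (index 2),
    // or null when the text does not match.
    static String weekday(String text) {
        Matcher matcher = Pattern.compile(SEARCH_STRING, Pattern.CASE_INSENSITIVE).matcher(text);
        return matcher.find() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile(SEARCH_STRING, Pattern.CASE_INSENSITIVE)
                                 .matcher("Monday 12-9-2013");
        while (matcher.find()) {
            // Print every group, including the implicit group 0, with its index.
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println("group " + i + ": " + matcher.group(i));
            }
        }
        System.out.println("weekday: " + weekday("Monday 12-9-2013")); // weekday: Monday
    }
}
```

Printing the index next to each value makes it easier to see that the sub-group takes index 2 and pushes the month and year groups to indexes 3 and 4.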
