Java RegEx: Part 8 — Backreferences

Published in Tech Training Space, Oct 19, 2020

In this session, I’m going to discuss backreferences in regular expressions.

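When dealing with groups in a regular expression, backreferences provide a way to match the same text that a group has already captured. This is different from writing the group's pattern a second time, which would accept any text of the same shape rather than the identical text.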

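A backreference is specified in the expression pattern by using the form \x somewhere after the group it refers to, where x is the index of that group in the whole pattern. Groups are numbered from 1, counting their opening parentheses from left to right. In a Java string literal the backslash itself must be escaped, so \1 is written as "\\1".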
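To make the numbering concrete, here is a small sketch with two groups. The class name BackreferenceSyntax and the method isMirrored are my own names for illustration and are not part of the original program.

```java
import java.util.regex.Pattern;

class BackreferenceSyntax {
    // Group 1 is (a) and group 2 is (b): groups are numbered by the
    // position of their opening parenthesis, starting at 1.
    // \\2 must repeat what group 2 captured, and \\1 what group 1 captured.
    static final Pattern MIRRORED = Pattern.compile("(a)(b)\\2\\1");

    // Hypothetical helper: true only if the whole input matches the pattern.
    static boolean isMirrored(String input) {
        return MIRRORED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMirrored("abba")); // true
        System.out.println(isMirrored("abab")); // false
    }
}
```

The string "abba" matches because group 2 ("b") is repeated first and group 1 ("a") second, while "abab" fails at the third character.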

One application of backreferences is checking a text for duplicated words.

Let’s explore the following example:

In the program, I have defined the pattern with a backreference as follows:

String searchPattern = "(\\w\\w)\\1";

The expression pattern consists of two word characters in sequence, placed in a group, and then followed by \1. The number 1 is the group index, which refers to the group in the parentheses. Our pattern has only one group, so the backreference refers to the first group.

What it means is that it will match any two word characters, followed by exactly those same two word characters.
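As a rough sketch of how such a pattern can be applied with java.util.regex, the snippet below searches a string for the first repeated pair. The class name RepeatedPairFinder and the method firstRepeatedPair are assumptions of mine for illustration; this is not the original listing.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedPairFinder {
    // Two word characters captured as group 1, then the same two characters again.
    static final Pattern REPEATED_PAIR = Pattern.compile("(\\w\\w)\\1");

    // Hypothetical helper: returns the first matching substring, if there is one.
    static Optional<String> firstRepeatedPair(String text) {
        Matcher matcher = REPEATED_PAIR.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {"This is a string", "This isis a string", "This is abab string"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + firstRepeatedPair(sample).orElse("no match"));
        }
    }
}
```

Note that find() looks for the pattern anywhere in the input, whereas matches() would require the whole input to be a repeated pair.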

Let’s run the program:

Enter a string to search: This is a string
Not found duplicate word
Enter a string to search: This isis a string
Found duplicate: isis
Enter a string to search: This is abab string
Found duplicate: abab

“This is a string”: no match was found because there were no two word characters followed by those same two characters. In other words, there was no duplicated pair of characters.

“This isis a string”: a match was found because there was a sub-string starting with “is” and followed by exactly “is”, which was “isis”.

“This is abab string”: a match was also found because of the sub-string “abab”.

Now let’s see another pattern with backreference.

Let me change the pattern as follows:

String searchPattern = "(\\w\\w)(\\d\\d)\\2";

I have changed the pattern so that a second group of two digits follows the two word characters, and the backreference now refers to the second group.

That means the matched string or substring must start with two word characters, followed by two digits, and followed by exactly the same two digits.
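To see which text each group captures, the following sketch prints the whole match together with group 1 and group 2. The class name GroupInspector and the method describe are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupInspector {
    // Group 1: two word characters. Group 2: two digits. \\2 repeats group 2 only.
    static final Pattern PATTERN = Pattern.compile("(\\w\\w)(\\d\\d)\\2");

    // Hypothetical helper: summarizes the first match and its captured groups.
    static String describe(String text) {
        Matcher matcher = PATTERN.matcher(text);
        if (!matcher.find()) {
            return "no match";
        }
        return "match=" + matcher.group()
                + ", group1=" + matcher.group(1)
                + ", group2=" + matcher.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("This is23 a string"));   // no match
        System.out.println(describe("This is2323 a string")); // match=is2323, group1=is, group2=23
    }
}
```

Only the digits have to repeat here. Group 1 is captured but never referenced, so the two leading word characters appear once.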

Let’s run the program and test with some data:

Enter a string to search: This is23 a string
Not found duplicate word
Enter a string to search: This is2323 a string
Found duplicate: is2323

“This is23 a string”: no match was found because there was no sub-string starting with two word characters, followed by two digits, and followed by exactly those same two digits.

“This is2323 a string”: a match was found because the sub-string “is2323” started with the two word characters “is”, followed by the two digits “23”, and followed by exactly those same two digits “23”.

Let’s see another useful example:

In the program, I have defined a pattern with the usage of backreference and a boundary matcher:

String searchPattern = "\\b(\\w+)\\s+\\1\\b";

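As you have learned in the previous part, the boundary matcher \b matches a word boundary, that is, a position between a word character and a non-word character (or the start or end of the input). It does not consume any characters itself. In our case here, \w+ matches one or more word characters (a-z, A-Z, 0–9, and underscore), and \s+ matches one or more whitespace characters.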

And the backreference in this pattern refers to the explicit group with index 1, which is the one with (\\w+).

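So, with those definitions, the pattern will match any whole word that is followed by whitespace and then by exactly the same whole word again. The two boundaries prevent partial matches: without the leading \b, “This is” would match because the “is” at the end of “This” is followed by “ is”.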
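A minimal sketch of collecting every duplicated word with this pattern is shown below. The class name DuplicateWordFinder and the method findDuplicates are names I made up for the example, so treat it as one possible way to drive the pattern rather than the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWordFinder {
    // A whole word, one or more whitespace characters, then the same whole word.
    static final Pattern DUPLICATE_WORD = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Hypothetical helper: returns every "word word" occurrence, in order.
    static List<String> findDuplicates(String text) {
        List<String> duplicates = new ArrayList<>();
        Matcher matcher = DUPLICATE_WORD.matcher(text);
        while (matcher.find()) {          // each call continues after the previous match
            duplicates.add(matcher.group());
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String text = "This string contains contains some duplicate word word";
        for (String duplicate : findDuplicates(text)) {
            System.out.println("Found duplicate: " + duplicate);
        }
    }
}
```

Because of the word boundaries, inputs such as "This is fine" or "the then" produce no matches, even though "is" and "the" are repeated as parts of longer words.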

Let’s run the program to see the result:

Enter a string to search: This string contains contains some duplicate word word
Found duplicate: contains contains
Found duplicate: word word

As the output shows, both duplications were found: “contains contains” and “word word”.


Written by Sera Ng.