Java RegEx: Part 10 — Look-Behind
In the last part, we had some examples using look-ahead techniques.
Like look-ahead, look-behind comes in two types: positive and negative.
Positive look-behind confirms the presence of a certain pattern to the left of the current position in the input string.
Negative look-behind confirms the absence of a certain pattern to the left of the current position in the input string.
Put simply, look-behind checks whether a piece of text is (or is not) preceded by a certain pattern.
Also note that, like look-ahead, look-behind does not consume or capture the text it checks; it only confirms the presence or absence of that pattern in the input string.
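To make the two types concrete before the longer example, here is a minimal sketch. The class name LookBehindBasics, the helper firstMatch, and the sample text "xa ya" are my own assumptions for illustration. It reports where the first match starts and what text was matched, so you can see that the character checked by the look-behind is not part of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehindBasics {
    // Returns "index:text" for the first match of regex in input, or "none".
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.start() + ":" + matcher.group() : "none";
    }

    public static void main(String[] args) {
        String input = "xa ya"; // hypothetical sample text

        // Positive look-behind: an "a" that IS preceded by "x".
        System.out.println(firstMatch("(?<=x)a", input)); // 1:a

        // Negative look-behind: an "a" that is NOT preceded by "x".
        System.out.println(firstMatch("(?<!x)a", input)); // 4:a
    }
}
```

In both cases the matched text is just "a": the look-behind only inspects what comes before it.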
Positive Look-Behind
Let’s take an example of a positive look-behind.
Suppose you have an address book and you want to search for all people whose first name is “David”.
In the program, I used an ArrayList to keep a list of sample names:
List<String> documentList = new ArrayList<>();
documentList.add("David Anderson");
documentList.add("Jonathan David San");
documentList.add("David Gorge");
documentList.add("David M. Sandre");
documentList.add("Peter Tran");
Then I defined the search pattern with the name “David” hard-coded; in a real program, you might want to get the name from the user.
searchString = "(?<=^David).*";
Notice how we used the positive look-behind technique in the pattern:
(?<=^David)
As with look-ahead, we start the look-behind by opening a group, followed by the question mark (?), a less-than character (<), an equal sign (=), and our desired pattern, which is ^David in our case. Then we close the group.
I used the caret sign (^) before the word “David” to make sure that no characters are allowed to the left of this word.
Following the positive look-behind is a familiar pattern, .*, which matches any characters that come after the first name. You can change it to fit your needs.
Now, let’s run the program:
Matched: David Anderson
Matched: David Gorge
Matched: David M. Sandre
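If you want to reproduce this output, here is a minimal, self-contained sketch of such a program. The class name PositiveLookBehindDemo and the helper findMatches are my own assumptions, not the original source code. I also assume the pattern is applied with Matcher.find(): matches() would require the whole name to match starting at position 0, where the look-behind can never succeed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositiveLookBehindDemo {
    // Returns every name in which the pattern is found at least once.
    static List<String> findMatches(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            Matcher matcher = pattern.matcher(name);
            if (matcher.find()) { // find() searches anywhere in the name
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> documentList = new ArrayList<>();
        documentList.add("David Anderson");
        documentList.add("Jonathan David San");
        documentList.add("David Gorge");
        documentList.add("David M. Sandre");
        documentList.add("Peter Tran");

        String searchString = "(?<=^David).*";
        for (String name : findMatches(documentList, searchString)) {
            System.out.println("Matched: " + name);
        }
    }
}
```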
It is worth looking in more detail at how look-behind works here:
First, the first name in the list, “David Anderson”, is taken for checking. The regex engine starts at the beginning of the string, at the word “David”, and sees that this position is not preceded by “David”, so the look-behind fails there and the engine moves forward.
Next, the engine reaches the white space and sees that it is preceded by the word “David” at the very start of the string, so the look-behind succeeds and .* matches the rest of the name. So, the engine has found the first match.
Next, the second name on the list is checked. The same process is repeated:
- The first word “Jonathan” is not preceded by “David”, so the regular expression engine moves on.
- Next is the white space character, which does not match.
- Next is the word “David”, which does not match either because it is not preceded by “David”.
However, when the engine reaches the next white space, which is preceded by the word “David”, it still does not find a match, because in our look-behind pattern I used a caret sign (^) in front of the word “David” to indicate that there should be no characters before this word. This ensures that only names whose first name is “David” are found.
The rest of the names are checked in the same manner, and finally we get the results you have seen.
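The walkthrough above can be checked with a small sketch that prints where the match starts and which text is actually matched. The class name LookBehindTrace and the helper describe are my own assumptions, and the pattern without the caret, (?<=David).*, is a variation I added only to show what the caret changes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehindTrace {
    // Describes the first match of regex in input, or reports that there is none.
    static String describe(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            return "start=" + matcher.start() + ", text=\"" + matcher.group() + "\"";
        }
        return "no match";
    }

    public static void main(String[] args) {
        // The look-behind succeeds right after the leading "David" (index 5).
        System.out.println(describe("(?<=^David).*", "David Anderson"));     // start=5, text=" Anderson"

        // "David" is present, but not at the start of the string.
        System.out.println(describe("(?<=^David).*", "Jonathan David San")); // no match

        // Without the caret, the second "word" position is accepted.
        System.out.println(describe("(?<=David).*", "Jonathan David San"));  // start=14, text=" San"
    }
}
```

Notice that the matched text never includes “David” itself: the look-behind only checks what is to the left of the match.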
Negative Look-Behind
The other kind of look-behind is called negative look-behind, which works in the opposite way.
That means the negative look-behind confirms that a piece of text is not preceded by a certain pattern.
I reuse the example from the last part with a minor change in the pattern:
Suppose now I want to list all the names containing the sub-string “an”, but the sub-string “an” must not be preceded by either the letter “r” or the letter “s”.
Based on those requirements, I have defined a pattern utilizing the negative look-behind technique:
searchString = "(?<![rs])an";
In the pattern, I have replaced the equal sign (=) with the exclamation mark (!) to form a negative look-behind, which contains the letters “r” and “s” in square brackets and is followed by the sub-string “an”. That means that, to find a match, the name must contain a sub-string “an” that is not immediately preceded by either “r” or “s”.
If we execute the program, we have the following outputs:
Matched: David Anderson
Matched: Jonathan David San
These two names matched because each contains the sub-string “an” at a position where it is not preceded by the letter “r” or “s”.
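The name “David M. Sandre” did not satisfy the pattern: although it contains the sub-string “an”, that sub-string is preceded by the letter “S”. (An uppercase “S” is rejected by [rs] only when the pattern is matched case-insensitively, which the output above implies.)
The name “Peter Tran” was not matched either, for a similar reason: its “an” is preceded by “r”.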
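Here is a minimal sketch that reproduces the negative look-behind results. The class name NegativeLookBehindDemo and the helper findMatches are my own assumptions. I also assume the pattern is compiled with Pattern.CASE_INSENSITIVE, because that is the only way the output above comes out: “Anderson” then counts as containing “an”, and the “S” in “Sandre” counts as an “s”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NegativeLookBehindDemo {
    // Returns every name in which the pattern is found, using the given Pattern flags.
    static List<String> findMatches(List<String> names, String regex, int flags) {
        Pattern pattern = Pattern.compile(regex, flags);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).find()) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> documentList = new ArrayList<>();
        documentList.add("David Anderson");
        documentList.add("Jonathan David San");
        documentList.add("David Gorge");
        documentList.add("David M. Sandre");
        documentList.add("Peter Tran");

        String searchString = "(?<![rs])an";
        // Assumption: case-insensitive matching, as implied by the article's output.
        for (String name : findMatches(documentList, searchString, Pattern.CASE_INSENSITIVE)) {
            System.out.println("Matched: " + name);
        }
    }
}
```

If you pass 0 instead of Pattern.CASE_INSENSITIVE, the result changes: “David Anderson” no longer matches because it has no lowercase “an”, while “David M. Sandre” does match because the uppercase “S” is no longer excluded by [rs].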