Java RegEx: Part 9 — Look-Ahead

Sera Ng.
Tech Training Space
Oct 19, 2020

In this part, we are going to discuss a very powerful technique in regular expressions called look-ahead.

The basic idea of look-ahead resembles some situations in real life. For instance, you want to visit a friend but are not sure whether he is at home, so you call ahead to check. If he is not at home, you do not make the trip.

Look-ahead in regular expressions works on the same idea.

It allows us to check that the input, at the current position, is followed by a certain pre-conditioned pattern before the engine continues with the rest of the expression. The check is a zero-width assertion: it inspects the upcoming characters but does not consume them. If the pre-condition is not satisfied, the match attempt at that position fails right away. This can save time when we search a long text, although the actual gain depends on the pattern and the input.

There are two types of look-ahead: positive and negative.

Positive Look-Ahead

Let’s take an example:

Suppose I want to check for the validity of a book’s ISBN. The ISBN needs to conform to the following format:

ISBN-\\d{3}-\\d{4}

With the defined expression pattern, the matched strings must:

  • start with “ISBN”, followed by a dash (-) character
  • followed by 3 digits and a dash (-) character
  • followed by 4 digits

The positive look-ahead technique checks the input string as follows:

  • If the input does not have “ISBN” at the current position, the match attempt fails there without evaluating the rest of the pattern.
  • If “ISBN” is present, the engine carries on with the remaining parts of the pattern. If any of them is not satisfied, the whole match attempt fails.

Let’s see how we write a positive look-ahead pattern:

In the pattern, I have used the positive look-ahead technique for the string “ISBN”:

searchString = "(?=ISBN)\\w{4}\\-\\d{3}-\\d{4}";

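The positive look-ahead pattern must be placed in a group. The group starts with a question mark (?), followed by an equal sign (=), and then the pattern that must appear next in the input. Because the look-ahead does not consume any characters, the \\w{4} that follows it still has to match the four characters of “ISBN”.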
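To see that a look-ahead only checks the upcoming text and does not consume it, here is a small sketch. The class name LookAheadZeroWidth and the helper firstMatchSpan are hypothetical names chosen for illustration; they are not taken from the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAheadZeroWidth {

    // Hypothetical helper: returns "start-end" of the first match, or "none".
    static String firstMatchSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "none";
    }

    public static void main(String[] args) {
        String input = "ISBN-123-4567";

        // The look-ahead alone succeeds at index 0 but matches zero characters.
        System.out.println(firstMatchSpan("(?=ISBN)", input));       // 0-0

        // \w{4} then consumes the same four characters the look-ahead inspected.
        System.out.println(firstMatchSpan("(?=ISBN)\\w{4}", input)); // 0-4

        // \d{4} cannot match "ISBN", so there is no match even though the look-ahead passes.
        System.out.println(firstMatchSpan("(?=ISBN)\\d{4}", input)); // none
    }
}
```

The first result is an empty match (start and end are both 0), which is what "zero-width" means in practice.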

The pattern matches strings that:

  • start with 4 word characters (a-z, A-Z, 0-9, or underscore). Since we specify ISBN in the look-ahead, those 4 characters must be the text ISBN. If they are not, the match attempt fails without looking further
  • followed by a dash (-) character and 3 digits
  • followed by a dash (-) character and 4 digits

In the program, I have also defined a sample input:

String isbn = "ISBN-123-4567";

And if we run the program, we will have the output:

Matched: ISBN-123-4567
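A minimal sketch of such a program could look like the following. The pattern and the sample input come from the text; the class name IsbnLookAhead, the helper check, the use of matches(), the second sample input, and the "Not matched" label are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsbnLookAhead {

    // Pattern from the article: look-ahead for "ISBN", then the full format.
    static final String SEARCH_STRING = "(?=ISBN)\\w{4}\\-\\d{3}-\\d{4}";

    // Assumed helper: checks the whole input against the pattern.
    static String check(String input) {
        Matcher matcher = Pattern.compile(SEARCH_STRING).matcher(input);
        return matcher.matches() ? "Matched: " + input : "Not matched: " + input;
    }

    public static void main(String[] args) {
        String isbn = "ISBN-123-4567";
        System.out.println(check(isbn));            // Matched: ISBN-123-4567

        // Hypothetical counter-example: the look-ahead fails on "BOOK".
        System.out.println(check("BOOK-123-4567")); // Not matched: BOOK-123-4567
    }
}
```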

Again, it is definitely possible to achieve the same result without the positive look-ahead technique. But if we have to scan through a very long text in which no match is found, failing early can save time, though how much depends on the pattern and the regex engine.
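As a quick check of that equivalence, the sketch below compares the look-ahead pattern with the plain format pattern shown earlier on a few inputs. The class name, the helper agree, and the extra sample inputs are hypothetical; the sketch only compares results and makes no claim about speed.

```java
import java.util.regex.Pattern;

class WithAndWithoutLookAhead {

    static final Pattern WITH_LOOK_AHEAD = Pattern.compile("(?=ISBN)\\w{4}\\-\\d{3}-\\d{4}");
    static final Pattern PLAIN = Pattern.compile("ISBN-\\d{3}-\\d{4}");

    // True when both patterns give the same verdict for the whole input.
    static boolean agree(String input) {
        return WITH_LOOK_AHEAD.matcher(input).matches() == PLAIN.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"ISBN-123-4567", "BOOK-123-4567", "ISBN-12-4567"};
        for (String input : inputs) {
            System.out.println(input
                    + " -> plain: " + PLAIN.matcher(input).matches()
                    + ", same verdict: " + agree(input));
        }
    }
}
```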

Another application of positive look-ahead is checking that a user’s password is strong enough.

Let’s define our password policy.

When users register a new account in our system, a strong password is required for security reasons. The strong password policy needs to satisfy the following conditions:

  • at least 4 characters
  • max of 8 characters
  • at least 1 digit from 0 to 9
  • at least 1 upper case letter

Based on the requirements, I have defined the password pattern as in the following program:

Pay attention to the pattern:

String passwordPattern = "^(?=.*\\d+.*)(?=.*[A-Z]+.*)\\w{4,8}$";

The pattern includes 2 positive look-ahead patterns:

  • (?=.*\\d+.*): this look-ahead requires that the password contain at least one digit from 0 to 9. The trailing +.* is redundant; (?=.*\\d) expresses the same condition.
  • (?=.*[A-Z]+.*): this look-ahead requires that the password contain at least one upper case letter from A to Z.

And to restrict the min and max number of characters, we use the familiar pattern: \\w{4,8}. Note that \\w also limits the password to letters, digits, and the underscore.

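Also, note the caret sign (^) at the beginning and the dollar sign ($) at the end of the pattern. They make sure that the entire input password, not just a part of it, is matched against the pattern. The anchors are mandatory when the program searches with find(); with matches(), Java already requires the whole input to match, so they are harmless but redundant.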
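The effect of the anchors can be sketched as follows, assuming the program searches with find(). The class name AnchorDemo, the helper found, and the 12-character sample password are hypothetical.

```java
import java.util.regex.Pattern;

class AnchorDemo {

    static final String UNANCHORED = "(?=.*\\d+.*)(?=.*[A-Z]+.*)\\w{4,8}";
    static final String ANCHORED = "^" + UNANCHORED + "$";

    // find() succeeds if any part of the input matches the pattern.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String tooLong = "8abCdefghijk"; // 12 characters, exceeds the max of 8

        // Without anchors, the first 8 characters are enough for a match.
        System.out.println(found(UNANCHORED, tooLong)); // true

        // With anchors, the whole input must fit into \w{4,8}.
        System.out.println(found(ANCHORED, tooLong));   // false
        System.out.println(found(ANCHORED, "8abCd"));   // true
    }
}
```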

Let’s run the program to see if our pattern works as expected:

Enter your password: ajkd3

Invalid password!

Enter your password: Akijj

Invalid password!

Enter your password: 8Aj

Invalid password!

Enter your password: 8abCd

Password is valid!

“ajkd3”: this password was not valid because there was no upper case letter

“Akijj”: this password was not valid because there was no digit

“8Aj”: this password was not valid because it did not meet the min length

“8abCd”: this password was completely valid
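The checks above can be reproduced with a sketch like this one. The pattern and the messages come from the text; the class name PasswordCheck and the helper isValid are assumptions, and the sketch loops over the four sample passwords instead of reading them from the keyboard as the original program appears to do.

```java
import java.util.regex.Pattern;

class PasswordCheck {

    // Pattern from the article: one digit, one upper case letter, 4 to 8 word characters.
    static final Pattern PASSWORD_PATTERN =
            Pattern.compile("^(?=.*\\d+.*)(?=.*[A-Z]+.*)\\w{4,8}$");

    static boolean isValid(String password) {
        return PASSWORD_PATTERN.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"ajkd3", "Akijj", "8Aj", "8abCd"};
        for (String password : samples) {
            String verdict = isValid(password) ? "Password is valid!" : "Invalid password!";
            System.out.println(password + ": " + verdict);
        }
    }
}
```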

Negative Look-Ahead

The look-ahead pattern in the above examples is called positive look-ahead; it checks that a certain section exists in the input string.

The other type of look-ahead is called negative look-ahead.

A negative look-ahead is similar to a positive look-ahead in that both check a certain pattern before continuing with the rest of the input string.

The difference is that a negative look-ahead confirms that a certain section does not exist in the input string, whereas a positive look-ahead confirms that it does.

A common application of negative look-ahead is searching for text that does not contain a certain word.

For instance, I want to filter the content of certain websites so that text containing the word sex is accepted unless it also contains the word porn, which is likely to indicate unhealthy content.

Let’s see an example:

In the program, I have defined a negative look-ahead pattern as follows:

searchString = "(?!.*porn).*sex.*";

We start a negative look-ahead by opening a group with a question mark and an exclamation mark, followed by the pattern that we do not want to exist in the input string.

In our pattern, we apply a negative look-ahead for the word “porn”. The dot and star characters before the word “porn” mean that the word can appear at any position after the point where the look-ahead is tested. For the check to cover the whole input, the pattern has to be applied from the beginning of the string, for example with matches() or a leading caret (^).

Next in our pattern is the word “sex”, which means that the input string must contain this word. The dot and star characters on either side of the word “sex” mean that this word can appear at any position in the input string. Since the pattern is written in lower case, matching a capitalized “Sex” requires case-insensitive matching, for example via Pattern.CASE_INSENSITIVE.
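The position at which the negative look-ahead is tested matters. The sketch below, with a hypothetical class name, helper names, and sample sentence, shows how an unanchored find() can slip past a forbidden word that appears before the required one, while matches() does not.

```java
import java.util.regex.Pattern;

class NegativeLookAheadAnchoring {

    static final String SEARCH_STRING = "(?!.*porn).*sex.*";

    // find() may start the match at any position in the input.
    static boolean findAnywhere(String input) {
        return Pattern.compile(SEARCH_STRING).matcher(input).find();
    }

    // matches() must match the whole input, so the look-ahead is tested at index 0.
    static boolean matchWhole(String input) {
        return Pattern.compile(SEARCH_STRING).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "porn and sex websites";

        // Starting at index 1, the remaining text no longer contains "porn".
        System.out.println(findAnywhere(text)); // true

        // Tested from the start, the look-ahead sees "porn" and rejects the input.
        System.out.println(matchWhole(text));   // false
    }
}
```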

I created a list to store some sample data:

List<String> documentList = new ArrayList<>();
documentList.add("My sex is female");
documentList.add("Sex education can be taught in high school");
documentList.add("This sentence contain a porn word");
documentList.add("These are sex and porn websites");

As you may have noticed, the sample list has strings that contain only the word “sex”, a string that contains only the word “porn”, and a string that contains both.

And if we run the program, we have the following results:

Matched: My sex is female

Matched: Sex education can be taught in high school

Strings containing the word “porn” were not listed since they did not match the pattern.
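A sketch of a program that produces this output is shown below. The pattern and sample data come from the text. The class name DocumentFilter, the helpers filter and sampleDocuments, the use of matches(), and the Pattern.CASE_INSENSITIVE flag are assumptions; the flag is assumed because the sample output includes the capitalized “Sex education ...” sentence, which a lower-case pattern would not otherwise match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DocumentFilter {

    // Assumption: case-insensitive, so that "Sex" matches the lower-case "sex".
    static final Pattern SEARCH =
            Pattern.compile("(?!.*porn).*sex.*", Pattern.CASE_INSENSITIVE);

    // Keeps only the documents that match the whole pattern.
    static List<String> filter(List<String> documents) {
        List<String> matched = new ArrayList<>();
        for (String document : documents) {
            if (SEARCH.matcher(document).matches()) {
                matched.add(document);
            }
        }
        return matched;
    }

    static List<String> sampleDocuments() {
        List<String> documentList = new ArrayList<>();
        documentList.add("My sex is female");
        documentList.add("Sex education can be taught in high school");
        documentList.add("This sentence contain a porn word");
        documentList.add("These are sex and porn websites");
        return documentList;
    }

    public static void main(String[] args) {
        for (String document : filter(sampleDocuments())) {
            System.out.println("Matched: " + document);
        }
    }
}
```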
