Java RegEx: Part 4— Replacing and Extracting Text

Sera Ng.
Tech Training Space
10 min read · Oct 18, 2020

In this part, we are going to explore methods that can be used to replace and extract text from a string, which are very common tasks in programming in general, not just in Java.

Replacing Text

There are many cases where we need to replace a substring or certain characters in a string. For instance, we may need to replace wrong or redundant characters with the right ones.

The java.lang.String class in Java provides 2 methods for these purposes: replace() and replaceAll().

For instance, I have a string as follows:

This is a strIng wIth some typos

In the string, there are some upper case I letters and I want to replace them with a lower case i letter.

The code:
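A minimal sketch of such a program could look like the following. The class name ReplaceDemo and the helper method fixTypos() are assumptions made for illustration; only the input string and the replace() call come from the text.

```java
// Minimal sketch; the class and method names are assumptions, not from the article.
class ReplaceDemo {

    // Replaces every upper case I with a lower case i (literal match, no regex).
    static String fixTypos(String text) {
        return text.replace("I", "i");
    }

    public static void main(String[] args) {
        String text = "This is a strIng wIth some typos";
        text = fixTypos(text);
        System.out.println(text); // This is a string with some typos
    }
}
```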

In the above code, I invoked the replace() method in the String class:

text = text.replace("I", "i");

The replace() method takes 2 parameters:

  • the first is the character or substring you want to replace
  • the second is the new character or substring

The replace() method replaces all occurrences of the old character or substring with the new one. In our example code, the method replaces all the upper case I characters with lower case i characters.

Let’s run the program and see the result:

This is a string with some typos

However, the replace() method does not take a regular expression as its first parameter; it treats that parameter as literal text.

Therefore, it will not work in cases where we want to replace characters based on a particular pattern.

For instance, if we have a string like the following:

This is 34a string 23that contain56s many typos

In the above string, we want to remove all the digits. It would be much more convenient if we could specify \\d as the pattern to be removed.

Of course, we can use the replace() method to achieve the task, but then we need to call it once for every individual digit we want to remove.

We have a better solution: the replaceAll() method.

The replaceAll() method works similarly to its replace() counterpart with one significant difference: it takes a regular expression as its first parameter.

Let’s see the code:
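The listing can be sketched as below. The class name RemoveDigitsDemo and the helper method removeDigits() are assumptions for illustration; the input string and the replaceAll() call are the ones discussed in the text.

```java
// Minimal sketch; the class and method names are assumptions.
class RemoveDigitsDemo {

    // Removes every digit; "\\d" is the regex \d written as a Java string literal.
    static String removeDigits(String text) {
        return text.replaceAll("\\d", "");
    }

    public static void main(String[] args) {
        String text = "This is 34a string 23that contain56s many typos";
        text = removeDigits(text);
        System.out.println(text); // This is a string that contains many typos
    }
}
```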

In the above code, I have invoked the String.replaceAll() method as follows:

text = text.replaceAll("\\d", "");
  • The first parameter should be \\d because we want to remove all the digits in the input string.
  • The second parameter is the replacement string. Since we want to remove the digits, an empty string should be used here.

Let’s run the program and we’ll see all the digits are gone:

This is a string that contains many typos

Let’s take another example.

I have another text containing some un-intended special characters: the dollar sign, the percentage sign, the star character, and the exclamation mark. Those special characters need to be removed.

Since there are many special characters, it is easier to specify in the patterns what we want to keep, rather than what we want to remove.

So in the pattern, I use the caret character followed by \w and a space, all inside square brackets: [^\\w ]. That pattern means: remove everything except the characters matched by \w and the space.

Why the space? Because \w matches only upper and lower case letters from a to z, digits from 0 to 9, and the underscore; it does not match whitespace. So if we don’t include the space in our pattern, all spaces will be removed as well.
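Here is a minimal sketch of that idea. The input string below is hypothetical (it is only constructed so that it contains the four special characters mentioned above), and the class and method names are assumptions as well.

```java
// Minimal sketch; the input text and all names are assumptions for illustration.
class RemoveSpecialCharsDemo {

    // Keeps word characters (letters, digits, underscore) and spaces; removes everything else.
    static String keepWordsAndSpaces(String text) {
        return text.replaceAll("[^\\w ]", "");
    }

    public static void main(String[] args) {
        // Hypothetical input containing $, %, * and !
        String text = "Th$is is te%xt with so*me special! characters in it";
        System.out.println(keepWordsAndSpaces(text));
        // This is text with some special characters in it
    }
}
```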

Run the program and you’ll see the output:

This is text with some special characters in it

Extracting text

Extracting text with java.util.Scanner class

In this section, I’m going to show you how we can use the Scanner class in the package java.util to extract text from a string.

Before going to examples, let’s discuss tokens and delimiters.

In a string, there are 2 types of text: tokens and delimiters. Tokens are meaningful words, and delimiters are characters that separate tokens.

For instance, I have a string: I love you so much.

So, in my string:

  • tokens are I, love, you, so, and much
  • delimiters are whitespace characters.

However, what counts as a token and what counts as a delimiter largely depends on our purpose. For instance, we can use whitespace as the delimiter, or we can specify any other characters to work as delimiters. Several of the string splitting tools in Java, such as Scanner and StringTokenizer, use whitespace as the default delimiter.

In this section, we will use the Scanner class to extract text based on particular delimiters, or based on a pattern.

Very often, we use the Scanner class to get input from users via a console user interface by passing System.in as its constructor parameter.

Actually, the Scanner class can take its source of input as system input, or a string variable, or a file.

The first method we can use to split a string into tokens is the next() method in the Scanner class, which uses whitespace as the default delimiter.

Let’s see some code:
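The program can be sketched as follows. The class name ScannerTokensDemo and the helper method tokens() are assumptions; collecting the tokens into a list is also just a convenience of this sketch so that they can be reused after the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Minimal sketch; the class and method names are assumptions.
class ScannerTokensDemo {

    // Reads all tokens from the string using the default (whitespace) delimiter.
    static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(s)) {
            while (sc.hasNext()) {          // is there another token?
                String token = sc.next();   // read and consume it
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you so much. I want to marry you";
        for (String token : tokens(s)) {
            System.out.println(token);
        }
    }
}
```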

In the above code, first, an instance of the Scanner was created. And instead of passing System.in object to the constructor, I passed the string variable s because we needed to read and then manipulate the string.

String s = "I love you so much. I want to marry you";
Scanner sc = new Scanner(s);

Next, I used a while loop to traverse all the tokens in the string. In the while loop’s condition, the hasNext() method was invoked to check whether there was another token:

while (sc.hasNext())

The hasNext() method returns true if there is at least one more token; otherwise it returns false, which also indicates that the end of the string has been reached.

By default, the Scanner uses whitespace as the delimiter to separate tokens.

Note that the hasNext() method does not consume the token; it only checks whether any tokens are left.

The one that actually reads and returns the token is the next() method. It reads a token and stops when it reaches a delimiter. Each further call reads the next token and stops at the next delimiter, and the whole process repeats until there are no more tokens.

String token = sc.next();

Run the program and we will get the outputs:

I
love
you
so
much.
I
want
to
marry
you

We’ve got the above results because, as mentioned earlier, the Scanner uses whitespace as the delimiter by default.

However, we can tell the Scanner to use any characters as delimiters.

For instance, now I want to use both whitespace and dot (.) characters as delimiters.

That can be done like this:
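A minimal sketch, assuming a helper method tokens() that takes the delimiter pattern as a second argument (the helper and the class name are not from the original listing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Minimal sketch; the class and method names are assumptions.
class ScannerDelimiterDemo {

    // Reads all tokens from the string using the given regex as the delimiter.
    static List<String> tokens(String s, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(s)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you so much. I want to marry you";
        // A space or a dot, one character at a time, acts as the delimiter.
        for (String token : tokens(s, "[ .]")) {
            System.out.println(token);
        }
    }
}
```

Because the pattern matches a single character at a time, the dot and the space after the word much count as two separate delimiters, so the sketch returns 11 tokens, one of which is empty.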

In the code, I have added the following method call:

sc.useDelimiter("[ .]");

The useDelimiter() method tells the Scanner what to use as delimiters. As you can observe, I have specified the space and dot characters.

Note that when we set custom delimiters with the useDelimiter() method, whitespace characters are no longer used by default. Therefore, if you still want whitespace to act as a delimiter, you need to include it explicitly as we just did.

Now it’s time to run the program:

I
love
you
so
much

I
want
to
marry
you

In the output, you can see there is an empty line. That’s because we have used both space and dot characters as delimiters, and at one point these 2 characters come right next to each other (between the words much and I), which leaves an empty token between them.

If we want to treat 2 (or more) adjacent delimiter characters as a single delimiter, we need to add a quantifier as follows:

sc.useDelimiter("[ .]+");

Run the program again:

I
love
you
so
much
I
want
to
marry
you

And the empty line has been removed.

Besides listing specific characters, we can also use other regular expression constructs as the delimiter.

Suppose I have the following string:

I love you 4 so much. 34 I 23 want to marry you

There are digits in the string and I want to break the string into substrings based on those digits.

I can achieve the task as follows:
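A minimal sketch of this step is shown below. The class name, the helper method splitOnDigits(), and the call to trim() are assumptions of this sketch; trim() is only there to drop the spaces that surround the digits in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Minimal sketch; the class and method names are assumptions.
class DigitDelimiterDemo {

    // Uses runs of digits as the delimiter and returns the text between them.
    static List<String> splitOnDigits(String s) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(s)) {
            sc.useDelimiter("\\d+");
            while (sc.hasNext()) {
                // trim() drops the spaces that surrounded the digits
                result.add(sc.next().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        for (String token : splitOnDigits(s)) {
            System.out.println(token);
        }
    }
}
```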

As you may notice, I have used a digit pattern as the parameter of the useDelimiter() method:

sc.useDelimiter("\\d+");

Also notice that we need the plus (+) sign so that digits right next to each other are treated as a single delimiter.

Run the program and we will have:

I love you
so much.
I
want to marry you

With the above case, we can also do the reverse, which means we can retrieve all the numbers: 4, 34, and 23. To do that, I will use all characters except digits as delimiters.

To achieve the task, all we need to do is to make a minor change in the pattern:
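The changed pattern is sketched below as [^\\d]+, which is an assumption based on the description that follows (a caret placed right before \\d, plus the quantifier used earlier). The class and method names are hypothetical too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Minimal sketch; the exact pattern and all names are assumptions.
class ExtractNumbersDemo {

    // Treats every run of non-digit characters as a delimiter, so only numbers remain.
    static List<String> numbers(String s) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(s)) {
            sc.useDelimiter("[^\\d]+");
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        for (String number : numbers(s)) {
            System.out.println(number);
        }
    }
}
```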

As you can see, I have used the caret sign (^) right before \\d inside square brackets. As you may remember, the caret sign at the start of a character class means ‘except’.

So, the pattern means that any character except a digit will be used as a delimiter.

Run the program and we got:

4
34
23

Extracting text with String.split()

The String class provides developers with another option to split a string into words based on certain delimiters: the split() method.

The split() method can break up a string into tokens with certain delimiters, just like the Scanner class we have seen above.

However, there are 2 main differences from the Scanner class:

  • The split() method has no default delimiter. Therefore, if we wish to use whitespace characters as delimiters, we need to specify so explicitly.
  • The split() method returns a string array containing the extracted tokens. This is very convenient if we plan to process those tokens later on.

Let’s see an example:

In the program, I have the following string:

String s = "I love you so much! But I cannot marry you.";

I want to break the string into substrings or tokens based on space characters. I can achieve the task as follows:

String[] tokens = s.split("[ ]");

Since the split() method returns an array of extracted tokens, we need a loop or something similar to read those tokens:

for (String token : tokens) {
    System.out.println(token);
}

Run the program and we have outputs:

I
love
you
so
much!
But
I
cannot
marry
you.

In case you want to specify more characters as delimiters, you can do as below, where I use both the space and the exclamation mark (!):

tokens = s.split("[ !]");

The complete example:
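A minimal sketch of the complete program, assuming a class named SplitDemo and a small helper method splitOn() (both hypothetical):

```java
// Minimal sketch; the class and method names are assumptions.
class SplitDemo {

    // Splits the string on the given regex and returns the tokens.
    static String[] splitOn(String s, String regex) {
        return s.split(regex);
    }

    public static void main(String[] args) {
        String s = "I love you so much! But I cannot marry you.";
        // A space or an exclamation mark, one character at a time, is the delimiter.
        String[] tokens = splitOn(s, "[ !]");
        for (String token : tokens) {
            System.out.println(token);
        }
        System.out.println("Number of tokens: " + tokens.length); // Number of tokens: 11
    }
}
```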

Note that in the above example, I also printed out the length of the token array, which of course is the number of extracted tokens.

I have the output:

I
love
you
so
much

But
I
cannot
marry
you.
Number of tokens: 11

As you can see in the output, we have a total of 11 tokens, which means there is an empty token.

That’s because I have used both the space and the exclamation mark as delimiters, and at one point these two characters appear right next to each other. The split() method treats the empty string between them as a token.

If we want to remove the empty token, which means treating adjacent delimiters as one, we just need to add the plus (+) sign at the end of the pattern, like below:

tokens = s.split("[ !]+");

Run the program and we have the following result:

I
love
you
so
much
But
I
cannot
marry
you.
Number of tokens: 10

And as you can see, we now have only 10 tokens.

Apart from using specific characters as delimiters, we can supply regular expressions to the split() method as parameters.

Let’s see the following program:
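The program can be sketched as follows; the class name and the helper method splitOnSpacesAndDigits() are assumptions for illustration.

```java
// Minimal sketch; the class and method names are assumptions.
class SplitRegexDemo {

    // Splits on runs of whitespace and/or digits.
    static String[] splitOnSpacesAndDigits(String s) {
        return s.split("[\\s\\d]+");
    }

    public static void main(String[] args) {
        String s = "I love you 4 so much. 34 I 23 want to marry you";
        String[] tokens = splitOnSpacesAndDigits(s);
        for (String token : tokens) {
            System.out.println(token);
        }
        System.out.println("Number of tokens: " + tokens.length); // Number of tokens: 10
    }
}
```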

In the program, I have the following string:

String s = "I love you 4 so much. 34 I 23 want to marry you";

And I want to retrieve tokens based on digits and whitespace characters. I can write the split() call as follows:

String[] tokens = s.split("[\\s\\d]+");

In the pattern, I have:

  • \s: represents whitespace characters. Also keep in mind that whitespace characters include: space, tab (\t), newline (\n), form feed (\f), and carriage return (\r)
  • \d: represents digits, as you should be familiar with already

Run the program and we have:

I
love
you
so
much.
I
want
to
marry
you
Number of tokens: 10

Extracting text with java.util.StringTokenizer

Previously, you learned how to use the Scanner class and the String.split() method to break up a string into words or tokens.

Another way to accomplish such tasks is to use the StringTokenizer class, which is also located in the package java.util. This is one of the oldest classes in Java; it has been available since JDK 1.0.

However, the StringTokenizer class can only read from a string. Unlike the Scanner class, it cannot read from the system input (the console) or from a file.

It has one similarity to the Scanner class: it also uses whitespace as the default delimiter and supports custom delimiters.

Let’s go for an example of how and why to use the old java.util.StringTokenizer class:
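A minimal sketch of the example, assuming a class named TokenizerDemo with a helper method tokens() (both hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Minimal sketch; the class and method names are assumptions.
class TokenizerDemo {

    // Reads all tokens using StringTokenizer's default (whitespace) delimiters.
    static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        StringTokenizer stk = new StringTokenizer(s);
        while (stk.hasMoreTokens()) {
            result.add(stk.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you so much! But I cannot marry you.";
        for (String token : tokens(s)) {
            System.out.println(token);
        }
    }
}
```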

I have a string in the program and this string needs to be passed as a parameter in the constructor:

String s = "I love you so much! But I cannot marry you.";
StringTokenizer stk = new StringTokenizer(s);

Then, we need to use a while loop and invoke the method:

stk.hasMoreTokens()

in order to check whether there are any more tokens. By default, the StringTokenizer uses whitespace characters as delimiters.

If there is a token, the following method is used to read and return the token:

stk.nextToken()

It’s just as simple as that. Now let’s run the program:

I
love
you
so
much!
But
I
cannot
marry
you.

If you want to specify a list of characters as custom delimiters, you need to supply them as the second parameter of the constructor:
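A minimal sketch, again with hypothetical class and method names. The second constructor argument is a plain list of delimiter characters, not a pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Minimal sketch; the class and method names are assumptions.
class TokenizerDelimiterDemo {

    // Reads all tokens; every character in 'delimiters' is a literal delimiter character.
    static List<String> tokens(String s, String delimiters) {
        List<String> result = new ArrayList<>();
        StringTokenizer stk = new StringTokenizer(s, delimiters);
        while (stk.hasMoreTokens()) {
            result.add(stk.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I love you so much! But I cannot marry you.";
        // Space and exclamation mark are the delimiters.
        for (String token : tokens(s, " !")) {
            System.out.println(token);
        }
    }
}
```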

As in the above code, I have used both the space and the exclamation mark as delimiters:

stk = new StringTokenizer(s, " !");

Here is the output if we execute the program:

I
love
you
so
much
But
I
cannot
marry
you.

Note that in the above output, although both delimiters (space and exclamation mark) come right next to each other at one point (between the words much and But), the StringTokenizer treats them as one delimiter. We did not have to supply the plus (+) quantifier as we did in the previous examples with Scanner and String.split().

Even if we had supplied the plus (+) sign, the StringTokenizer would have used it as a delimiter character, not as a regular expression quantifier.

That’s because StringTokenizer does not support regular expressions.

And this is the biggest difference from the Scanner class and the String.split() method.

The reason is historical: StringTokenizer was introduced in JDK 1.0, while regular expression support (the java.util.regex package, together with String.split()) only arrived in JDK 1.4, and the Scanner class followed in JDK 1.5.

Because it has no regular expression support at all, StringTokenizer does not have the overhead of compiling and matching regular expression patterns, which can make it faster than the other two (Scanner and String.split()) when processing a very long text.

So, when you have to process a very long text and do not need regular expressions, StringTokenizer is worth considering. Keep in mind, though, that its Javadoc describes it as a legacy class and recommends String.split() or the java.util.regex package for new code.
