Published in Javarevisited

Making Regex your friend in Java

Have you ever seen this meme and sort of agreed with it?

Regex meme

Well, then you’re like me. And I, for one, would like to change that. This article aims to make regex another tool in your developer arsenal. I believe regex can definitely be overused; however, there are situations where it is called for.

Let’s dive into it.

Regex constructs

First, let’s talk about some constructs that are helpful when building your regular expression. Note that these are described as they work in Java, and I believe they can differ across languages.

Square brackets define a character class, while parentheses define a capturing group [1].

A character class is a way to say which characters you want to match. For instance.

Here the output will be the following.
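As a minimal sketch of a character class in action, assume a hypothetical pattern “[bcr]at” and a few made-up inputs (neither is taken from the original listing). The class name and helper method are illustrative only, and the printed results are shown as comments.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {

    // Hypothetical pattern: exactly one of 'b', 'c' or 'r', followed by "at".
    static boolean isMatch(String input) {
        return Pattern.matches("[bcr]at", input);
    }

    public static void main(String[] args) {
        System.out.println(isMatch("bat")); // true
        System.out.println(isMatch("cat")); // true
        System.out.println(isMatch("hat")); // false: 'h' is not in the class
        System.out.println(isMatch("at"));  // false: the class must consume one character
    }
}
```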

A capturing group is something completely different. This is a way for you to say that you want your regex to save what it found so that you can retrieve it later. In other words, you’re telling the regex to “match against my regex, but save these sub-results for me”. Let’s look at an example.

Here we are trying to match a string of the format: DIGIT+DIGIT=DIGIT. And we want to extract the digits.

The first pattern uses only character classes, “+” (which is escaped because it has another meaning in regex) and “=”. The second pattern uses unnamed capturing groups. The third and final pattern uses named capturing groups.

Note the syntax for named capturing groups: (?<NAME>pattern)

The output from this is the following.
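The three patterns described above could look roughly like the following sketch. The group names “left”, “right” and “sum”, the class name and the helper methods are assumptions made for illustration. The printed results are shown as comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturingGroupDemo {

    // Character classes only: validates the format, captures nothing.
    static final Pattern PLAIN = Pattern.compile("[0-9]\\+[0-9]=[0-9]");
    // Unnamed capturing groups: the digits are retrieved by index.
    static final Pattern UNNAMED = Pattern.compile("([0-9])\\+([0-9])=([0-9])");
    // Named capturing groups: the digits are retrieved by name.
    static final Pattern NAMED =
            Pattern.compile("(?<left>[0-9])\\+(?<right>[0-9])=(?<sum>[0-9])");

    static String byIndex(String input) {
        Matcher m = UNNAMED.matcher(input);
        return m.matches()
                ? m.group(1) + "," + m.group(2) + "," + m.group(3)
                : "no match";
    }

    static String byName(String input) {
        Matcher m = NAMED.matcher(input);
        return m.matches()
                ? m.group("left") + "," + m.group("right") + "," + m.group("sum")
                : "no match";
    }

    public static void main(String[] args) {
        System.out.println(PLAIN.matcher("1+2=3").matches()); // true
        System.out.println(byIndex("1+2=3"));                 // 1,2,3
        System.out.println(byName("1+2=3"));                  // 1,2,3
        System.out.println(byName("1-2=3"));                  // no match
    }
}
```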

This is a good time to discuss the differences between matches and find.

Essentially, there are two ways you can use your regex. One is to match against the whole string (matches), and the other is to find sub-strings (find).

Let’s use the previous example of trying to get the digits from strings. But in this case, we simply want to get the digits; we don’t really care about the general format of the string. For that, we can use named capturing groups again but with the find method instead. Like this.

And the output from this is the following.
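A minimal sketch of the find approach, assuming a hypothetical group name “digit” and made-up input strings. It also contrasts find with matches on the same input. The printed results are shown as comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {

    static final Pattern DIGIT = Pattern.compile("(?<digit>[0-9])");

    // Collects every digit found anywhere in the input.
    static List<String> digits(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGIT.matcher(input);
        while (m.find()) { // each call advances to the next match
            found.add(m.group("digit"));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(digits("1+2=3"));         // [1, 2, 3]
        System.out.println(digits("a1, b2 and c3")); // [1, 2, 3]
        // matches() requires the whole input to be a single digit.
        System.out.println(DIGIT.matcher("1+2=3").matches()); // false

        Matcher m = DIGIT.matcher("1+2=3");
        if (m.find()) {
            System.out.println(m.group(0));     // 1 (group 0 is the whole match)
            System.out.println(m.groupCount()); // 1 (group 0 is not counted)
        }
    }
}
```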

Note that group number 0 is always the whole match of the original pattern, so the total number of groups is 1 + the number of defined groups (Matcher.groupCount() reports only the defined groups and leaves group 0 out).

This example simply finds digits in a string and ignores everything else. Therefore, a simple rule is to use find if you are looking for a specific substring and matches if you want to ensure that a string has a specific format.

So far I’ve only shown examples that match single occurrences. But it’s sometimes helpful to be able to specify repeated occurrences. There are actually three types of quantifiers: greedy, reluctant and possessive.

I will only touch on greedy quantifiers in this article, but feel free to look up the other two if needed. Let’s say you are interested in finding digit sequences, and we want to generalise our previous example of “DIGIT+DIGIT=DIGIT” to allow multi-digit numbers, so that “100+50=150” will also work.

I have also used a pre-defined character class, “\d”, which simply means a digit. You can find more pre-defined classes here [1].

“\\d{1,}” means that there must be at least one digit. You can give it a second argument to set a maximum, for instance at most 3 digits, like so: “\\d{1,3}”.
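The generalised pattern could be sketched as follows. The group names, class name and helper method are assumptions made for illustration. Note that “\\d{1,}” is equivalent to the more common “\\d+”. The printed results are shown as comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // One or more digits on each side of "+" and "=".
    static final Pattern SUM =
            Pattern.compile("(?<left>\\d{1,})\\+(?<right>\\d{1,})=(?<sum>\\d{1,})");
    // Same format, but every number may have at most three digits.
    static final Pattern SHORT_SUM = Pattern.compile("\\d{1,3}\\+\\d{1,3}=\\d{1,3}");

    static String operands(String input) {
        Matcher m = SUM.matcher(input);
        return m.matches()
                ? m.group("left") + "," + m.group("right") + "," + m.group("sum")
                : "no match";
    }

    public static void main(String[] args) {
        System.out.println(operands("100+50=150")); // 100,50,150
        System.out.println(operands("1+2=3"));      // 1,2,3
        System.out.println(operands("+2=3"));       // no match: the left number is missing
        System.out.println(SHORT_SUM.matcher("100+50=150").matches());  // true
        System.out.println(SHORT_SUM.matcher("1000+1=1001").matches()); // false: four digits
    }
}
```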

The OR operator (“|”) can be helpful if you need a conditional match. Let’s say we have strings like “Rating 5”, “Score 2”, “Scores 3”, and “Point 9”, and we need to make sure that the format is correct while extracting the digit. To do this, we can use the OR operator.

The output here is as expected.

I added the “s{0,1}” to show that you can make the expressions inside the OR as complex as you like.
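One way to write such a pattern is sketched here. The exact regex, the group name “digit” and the helper method are assumptions made for illustration. The printed results are shown as comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrOperatorDemo {

    // Accepts "Rating", "Score", "Scores" or "Point", a space, then one digit.
    static final Pattern LABELLED_DIGIT =
            Pattern.compile("(Rating|Scores{0,1}|Point) (?<digit>\\d)");

    static String digit(String input) {
        Matcher m = LABELLED_DIGIT.matcher(input);
        return m.matches() ? m.group("digit") : "no match";
    }

    public static void main(String[] args) {
        System.out.println(digit("Rating 5")); // 5
        System.out.println(digit("Score 2"));  // 2
        System.out.println(digit("Scores 3")); // 3
        System.out.println(digit("Point 9"));  // 9
        System.out.println(digit("Grade 4"));  // no match: "Grade" is not an alternative
    }
}
```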

Let’s say you get some strings like the following.

  • Apple: 3 Apples
  • Orange: 4 Oranges
  • Banana: 1 Banana

Essentially you want to match something like: “WORD: DIGIT WORD(s)”

For this, you need to be able to reference an earlier match from inside the pattern itself. This is done using back-references, like so.

You can see that I have referenced the named capturing group “fruit” at the end of the pattern using the syntax “\\k<NAME>”.
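A sketch of such a back-reference, assuming the format is exactly “WORD: DIGIT WORD” with an optional trailing “s”. The exact regex, class name and helper method are illustrative assumptions. The printed results are shown as comments.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {

    // \\k<fruit> must repeat exactly what the group "fruit" captured earlier.
    static final Pattern FRUIT_LINE =
            Pattern.compile("(?<fruit>\\w+): \\d \\k<fruit>s{0,1}");

    static boolean isValid(String line) {
        return FRUIT_LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Apple: 3 Apples"));   // true
        System.out.println(isValid("Orange: 4 Oranges")); // true
        System.out.println(isValid("Banana: 1 Banana"));  // true
        System.out.println(isValid("Apple: 3 Oranges"));  // false: the two words differ
    }
}
```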

That’s the final item for this article. I hope these tools can give you some more confidence when using regular expressions in Java. I want to end the article with a note on performance and a tip when using regex in the wild.

Compiling Patterns

Using the matches function on Pattern (see the first example) actually compiles the regex and then matches against it in the background. The other examples called the compile method on Pattern explicitly. Compilation is quite expensive, which I can demonstrate below.

So you want to compile the pattern once and reuse it across method calls, for example by storing it in a static final field.
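A minimal sketch of that idea, assuming the fruit pattern from the back-reference example. The class and method names are hypothetical. Pattern instances are immutable and safe to share between threads, while each call should create its own Matcher.

```java
import java.util.regex.Pattern;

class FruitValidator {

    // Compiled once when the class is initialised and reused by every call.
    private static final Pattern FRUIT_LINE =
            Pattern.compile("(?<fruit>\\w+): \\d \\k<fruit>s{0,1}");

    // Cheap per call: only a new Matcher is created.
    static boolean isValid(String line) {
        return FRUIT_LINE.matcher(line).matches();
    }

    // For comparison: Pattern.matches compiles the regex on every call.
    static boolean isValidRecompiling(String line) {
        return Pattern.matches("(?<fruit>\\w+): \\d \\k<fruit>s{0,1}", line);
    }

    public static void main(String[] args) {
        System.out.println(isValid("Apple: 3 Apples"));            // true
        System.out.println(isValidRecompiling("Apple: 3 Apples")); // true
    }
}
```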

I will reuse the last example with the fruits and do some refactoring into methods. See the example below.

Here are the results. On my machine the difference is substantial, and you should probably get similar results.

When doing microbenchmarks in Java you need to be careful because of how the JIT compiler works and all the optimizations that it performs at runtime. Therefore, it is usually recommended to do a warm-up, as I’ve done in the first loop.
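A rough sketch of such a comparison, with a warm-up loop first. The iteration counts, input strings and method names are assumptions, and the timings depend entirely on your machine and JVM, so no numbers are shown here. For serious measurements, a dedicated harness such as JMH is generally more reliable than hand-rolled loops.

```java
import java.util.regex.Pattern;

class CompileBenchmark {

    static final String REGEX = "(?<fruit>\\w+): \\d \\k<fruit>s{0,1}";
    static final Pattern PRECOMPILED = Pattern.compile(REGEX);
    static final String[] INPUTS = {
        "Apple: 3 Apples", "Orange: 4 Oranges", "Banana: 1 Banana"
    };

    // Compiles the pattern again for every single input.
    static int countCompiling(int rounds) {
        int matches = 0;
        for (int i = 0; i < rounds; i++) {
            for (String input : INPUTS) {
                if (Pattern.compile(REGEX).matcher(input).matches()) {
                    matches++;
                }
            }
        }
        return matches;
    }

    // Reuses the pattern that was compiled once in the static field.
    static int countPrecompiled(int rounds) {
        int matches = 0;
        for (int i = 0; i < rounds; i++) {
            for (String input : INPUTS) {
                if (PRECOMPILED.matcher(input).matches()) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Warm-up: gives the JIT compiler a chance to optimise both methods.
        for (int i = 0; i < 10; i++) {
            countCompiling(10_000);
            countPrecompiled(10_000);
        }

        long start = System.nanoTime();
        int a = countCompiling(100_000);
        long compilingNanos = System.nanoTime() - start;

        start = System.nanoTime();
        int b = countPrecompiled(100_000);
        long precompiledNanos = System.nanoTime() - start;

        // Printing the match counts keeps the results in use.
        System.out.println("compile per call: " + compilingNanos / 1_000_000 + " ms, matches=" + a);
        System.out.println("precompiled:      " + precompiledNanos / 1_000_000 + " ms, matches=" + b);
    }
}
```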

I have talked about this in previous articles such as when I compared LinkedList to ArrayList and my top tips when doing micro benchmarking in Java, see the links below.
