Java RegEx: Part 3— Matching text for validation (Cont.)

Sera Ng.
Tech Training Space
6 min read · Oct 17, 2020

In this session, we are going to use regular expressions to validate some common inputs: phone numbers and email addresses.

Case 1: Checking phone numbers

Phone number formats vary from country to country, so the appropriate regular expression for validating them depends on your case.

Let’s say I want to check a phone number with the following format:

Country code (3 digits) - area code (2 digits) - individual phone number (7 digits)

With that in mind, users are required to input:

  • Country code: must input 3 digits, followed by a dash (-) character
  • Area code: must input 2 digits, followed by a dash (-) character
  • Individual phone number: must input 7 digits

Let’s see the following code:

Based on the required format, I have defined the pattern as follows:

String phonePattern = "\\d{3}-\\d{2}-\\d{7}";

  • The first part is for the country code, which requires 3 digits, so I have used \\d{3} for the task
  • The second part is for the area code, which contains 2 digits, so \\d{2} should be applied
  • The last part is for the individual phone number, which contains 7 digits, hence \\d{7}
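As a rough sketch, a program built around this pattern could look like the following. The class name PhoneCheck and the method isValidPhone are placeholders chosen for illustration, and the inputs are hard-coded rather than read from the console, so treat it as one possible shape rather than the exact listing.

```java
import java.util.regex.Pattern;

class PhoneCheck {
    // Country code (3 digits) - area code (2 digits) - individual number (7 digits)
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{2}-\\d{7}");

    // matches() succeeds only when the whole input fits the pattern
    static boolean isValidPhone(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs standing in for what a user would type
        String[] samples = {"084-888-1234567", "084-88-123456", "084-38-1234567"};
        for (String s : samples) {
            System.out.println(s + ": " + (isValidPhone(s) ? "Valid data" : "Invalid data!"));
        }
    }
}
```

Because matches() tests the entire string, there is no need to add ^ and $ anchors to the pattern.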

Now run and test your program:

Input your phone(xxx-xx-xxxxxxx): 084-888-1234567

Invalid data!

Input your phone(xxx-xx-xxxxxxx): 084-88-123456

Invalid data!

Input your phone(xxx-xx-xxxxxxx): 084-38-1234567

Valid data

084-888-1234567: invalid because the area code contained 3 digits while this part requires only 2 digits

084-88-123456: invalid because the individual phone number contained only 6 digits, while this part requires 7 digits

084-38-1234567: completely matched the pattern

Case 2: Checking phone number format with optional part using grouping technique

Usually, when we make a phone call inside a country (at least in my country) we do not need to dial the country code. Therefore, it is convenient for users if they could optionally input the country code.

To achieve the task, we can utilize a technique called grouping. Grouping lets a sequence of regular expression characters be treated as a single unit. By placing certain pattern characters in a group, we can allow users either to input the whole group or to skip it entirely.

Therefore, to make the country code optional, we will place the country code part into a group followed by the appropriate quantifier character.

Let’s check out the following code:

I have defined the pattern as follows:

String phonePattern = "(\\d{3}-)?\\d{2}-\\d{7}";

Basically, it is the same pattern as before, but I have placed the first part (\\d{3}-) into a group using parentheses.

Every character placed inside a pair of parentheses belongs to a group. You can have as many groups as you see fit in a pattern.

Following our group here is the question mark (?). This is when the optional part comes in.

The question mark (?) means 0 or 1, which means users can ignore the entire group; or they can input the whole group but only one time. That makes sense because we don’t want the country code to appear more than one time.
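A small sketch of the optional group in action, under the same assumptions as before: the class name OptionalCountryCode and the method isValidPhone are illustrative placeholders, and the inputs are hard-coded instead of being read from the console.

```java
import java.util.regex.Pattern;

class OptionalCountryCode {
    // (\d{3}-)? makes the whole "country code plus dash" unit optional
    static final Pattern PHONE = Pattern.compile("(\\d{3}-)?\\d{2}-\\d{7}");

    static boolean isValidPhone(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs
        String[] samples = {"123-1234567", "12-12-1234567", "084-38-1234567", "38-1234567"};
        for (String s : samples) {
            System.out.println(s + ": " + (isValidPhone(s) ? "Valid data" : "Invalid data!"));
        }
    }
}
```

The quantifier applies to the group as a whole, so a partial country code such as "12-" cannot satisfy it.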

It’s time to run the program:

Input your phone(xxx-xx-xxxxxxx): 123-1234567

Invalid data!

Input your phone(xxx-xx-xxxxxxx): 12-12-1234567

Invalid data!

Input your phone(xxx-xx-xxxxxxx): 084-38-1234567

Valid data

123-1234567: invalid because the area code had 3 digits

12-12-1234567: invalid as well because the country code had only 2 digits. Note that, since we put the country code in a group followed by the question mark, users can skip the group. But if they choose to input it, the entire group must be provided.

084-38-1234567: completely matched the pattern

Let’s run the program again:

Input your phone(xxx-xx-xxxxxxx): 38-1234567

Valid data

38-1234567: also valid because we can skip the country code, and the other parts matched the pattern

Case 3: Checking email formats

In real-life applications, different software providers require different email formats. There is no single pattern that can be used to validate all email formats, so each case needs its own approach.

Let’s start with a simple email format:

email@address.com

We can split this simple email sample into 5 parts:

  • The first part is the user name (email): this part can contain alphabetical characters and digits with a min of 3 and a max of 15 characters allowed
  • The second part is the at (@) sign
  • The third part is the domain name (address): this part can contain alphabetical characters and digits with a min of 3 and a max of 15 characters allowed
  • The fourth part is the dot (.) character
  • The fifth part is the domain extension (com): this part can contain alphabetical characters only with a min of 2 and a max of 5 characters allowed

The program:

In the program, I have used the pattern:

String emailPattern = "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}[.][a-zA-Z]{2,5}";

  • The first part can contain letters and digits, so [a-zA-Z0-9] is applied. Note that I did not use \w because \w includes the underscore (_) as well
  • The second part is the at (@) sign, so I just placed it there
  • The third part is the domain name. This one is similar to the first part
  • The fourth part is the dot (.) character. Pay attention to this one. In a regular expression, the dot represents any character. But in this part, we want users to input a literal dot, not any character, so we need to place it in square brackets to treat it as a normal character. Another way to achieve this is to escape the dot character: \\.
  • The fifth part is the domain extension, which allows only alphabetical characters. So, the pattern [a-zA-Z] is enough.
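The five parts can be sketched in code as follows. The class name EmailCheck, the constant names, and the helper matches are placeholders for illustration. The third constant is deliberately wrong: it is included only to show what happens when the dot is left unescaped.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // user name, @, domain name, literal dot, extension
    static final String EMAIL = "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}[.][a-zA-Z]{2,5}";
    // Same pattern with the dot escaped instead of placed in square brackets
    static final String EMAIL_ESCAPED = "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}\\.[a-zA-Z]{2,5}";
    // Deliberately wrong: a bare dot accepts any character in that position
    static final String EMAIL_ANY_CHAR = "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}.[a-zA-Z]{2,5}";

    // Pattern.matches tests the whole input against the regex
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs; the last one has no dot at all
        String[] samples = {"email@gmail.", "emailgmail.com", "email@gmail.com", "email@gmailXcom"};
        for (String s : samples) {
            System.out.println(s
                    + " | [.]: " + matches(EMAIL, s)
                    + " | \\.: " + matches(EMAIL_ESCAPED, s)
                    + " | bare dot: " + matches(EMAIL_ANY_CHAR, s));
        }
    }
}
```

The bracketed and escaped forms behave the same, while the bare-dot variant accepts "email@gmailXcom" because "X" is consumed by the dot.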

Now let’s run the program:

Input your email(email@address.com): email@gmail.

Invalid data!

Input your email(email@address.com): emailgmail.com

Invalid data!

Input your email(email@address.com): email@gmail.com

Valid data

email@gmail.: invalid because the domain extension after the dot was missing

emailgmail.com: invalid because there was no @ sign

email@gmail.com: completely matched the pattern.

Now let’s upgrade our email pattern a little bit.

Some people have email addresses like:

email@gmail.com.us

As you can see, this email address has a second domain extension (.us), and that raises another requirement: we should allow users to optionally input another domain extension. But remember that users must either input the whole second extension or leave it out entirely.

Now it’s time to apply the grouping technique again.

There are 2 ways to achieve the task using the grouping technique.

The first and long way:

String emailPattern = "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}[.][a-zA-Z]{2,5}([.][a-zA-Z]{2,5})?";

The second and short way:

String emailPattern = "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}([.][a-zA-Z]{2,5}){1,2}";

I’ll leave you to test the program yourself here.
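If you want a starting point, here is one possible test harness that runs both patterns side by side. The class name EmailExtensionCheck, the constant names, and the sample addresses are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

class EmailExtensionCheck {
    // Long way: one required extension followed by an optional second one
    static final Pattern LONG_FORM = Pattern.compile(
            "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}[.][a-zA-Z]{2,5}([.][a-zA-Z]{2,5})?");
    // Short way: the extension group must appear once or twice
    static final Pattern SHORT_FORM = Pattern.compile(
            "[a-zA-Z0-9]{3,15}@[a-zA-Z0-9]{3,15}([.][a-zA-Z]{2,5}){1,2}");

    static boolean longForm(String input) {
        return LONG_FORM.matcher(input).matches();
    }

    static boolean shortForm(String input) {
        return SHORT_FORM.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs: zero, one, two and three extensions
        String[] samples = {"email@gmail", "email@gmail.com", "email@gmail.com.us", "email@gmail.com.us.vn"};
        for (String s : samples) {
            System.out.println(s + " | long: " + longForm(s) + " | short: " + shortForm(s));
        }
    }
}
```

Both forms should agree on every input; the short one simply expresses "one or two extensions" with a single quantified group.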

More on grouping techniques

Previously, we used the grouping technique to optionally allow users to provide inputs. In this part, you will see that the grouping technique can be applied in other cases.

Another use of the grouping technique is to provide a list of items from which users can select.

Let’s see the following code:

In the code, I have used the pattern:

String emailPattern = "(sport|music|book|movie)";

To provide a list of items for users to choose from, we put those items in a group and separate them with the vertical bar (|). That means users can input only one of the items in the group.
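A brief sketch of such a list check follows. The class name FavoriteCheck and the method isValidFavorite are placeholders chosen for this example, and the inputs are hard-coded.

```java
import java.util.regex.Pattern;

class FavoriteCheck {
    // The vertical bar separates the alternatives inside the group
    static final Pattern FAVORITE = Pattern.compile("(sport|music|book|movie)");

    static boolean isValidFavorite(String input) {
        return FAVORITE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs
        String[] samples = {"reading", "sport", "Sport", "sports"};
        for (String s : samples) {
            System.out.println(s + ": " + (isValidFavorite(s) ? "Valid data" : "Invalid data!"));
        }
    }
}
```

Matching is case-sensitive by default and must cover the whole input, so "Sport" and "sports" are rejected along with "reading".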

Let’s run the program and see:

Input a favorite (sport,music,book,movie): reading

Invalid data!

Input a favorite (sport,music,book,movie): sport

Valid data

“reading”: invalid because it was not in the list

“sport”: valid because it was in the list

Another case is that we can apply the grouping technique in checking date-time formats which are very common in data inputs.

Let’s see the following code:

In the code, I have used the following pattern to check birthday format:

String dobPattern = "([1-9]{1,2}/)?([1-9]{1,2})/([0-9]{4})";

In the pattern, I have 3 groups:

Group 1: ([1-9]{1,2}/)?

This is for the day. In this group, users must input at least 1 and at most 2 digits, each from 1 to 9, followed by / to separate the day from the month. Because of the question mark, users can skip this part entirely.

Group 2: ([1-9]{1,2})

This is for the month. It works like group 1, except that it is required and the / separator comes after the group rather than inside it.

Group 3: ([0-9]{4})

This is for the year. Users must input all 4 digits.
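The three groups can be tried out with a sketch like the one below. The class name DobCheck and the method isValidDob are illustrative placeholders, and the last two sample inputs are additions of mine that probe the limits of the pattern.

```java
import java.util.regex.Pattern;

class DobCheck {
    // Optional day group, required month group, required 4-digit year group
    static final Pattern DOB = Pattern.compile("([1-9]{1,2}/)?([1-9]{1,2})/([0-9]{4})");

    static boolean isValidDob(String input) {
        return DOB.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs
        String[] samples = {"1212/1980", "12/1980", "1/12/1980", "10/12/1980", "99/99/1980"};
        for (String s : samples) {
            System.out.println(s + ": " + (isValidDob(s) ? "Valid data" : "Invalid data!"));
        }
    }
}
```

Note that the pattern checks only the shape of the input. Since [1-9] excludes 0, a day such as 10 is rejected, while an out-of-range value such as 99/99/1980 is accepted, so a real application would likely need a stricter pattern or a separate range check.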

Let’s run and check:

Input day of birth: 1212/1980

Invalid data!

Input day of birth: 12/1980

Valid data

Input day of birth: 1/12/1980

Valid data

1212/1980: invalid because there was no / between date and month

12/1980: completely valid because users can ignore the date

1/12/1980: also valid

Matching text for validation is only one of the applications of regular expressions. There are countless other ways in which we can apply them.

Let’s proceed to the next part to see how we can apply regular expressions for string manipulation.
