Regular Expression in R: How to Capture Canadian Postal Code

Michael T Vu
Nov 3, 2022

If you are looking for a regex to capture Canadian postal codes, you are in the right place. There are many reasons to validate a postal code: it might come from a registration form, or you might want to fill in missing data (such as the province) from the postal code.

In data analysis, you will often need to clean data and deal with missing values. Fields such as personal addresses are frequently only partly filled in. If we understand how the Canadian postal code works, we can make the data cleaner than the original.

Regex in R to capture a Canadian postal code

What is a Canadian postal code made of? It contains six characters (plus a separating space) in the following pattern:

A1A 1A1
A: represents a letter
1: represents a digit

The first letter of the postal code identifies the province or territory. In other words, each province has its own letter or letters. For example, K1N 9C3 is in Ontario because the letter K belongs to Ontario. Similarly, J8Y 1W5 can be identified as a Quebec postal code.
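The first-letter idea can be sketched as a small lookup table. This is only an illustration in base R: the names `first_letter_region` and `region_from_code()` are made up for this example, and because X is shared by the Northwest Territories and Nunavut, the first letter alone cannot tell those two apart.

```r
# Hypothetical lookup: first letter of a postal code -> province or territory.
first_letter_region <- c(
  A = "Newfoundland and Labrador", B = "Nova Scotia",
  C = "Prince Edward Island",      E = "New Brunswick",
  G = "Quebec", H = "Quebec", J = "Quebec",
  K = "Ontario", L = "Ontario", M = "Ontario", N = "Ontario", P = "Ontario",
  R = "Manitoba", S = "Saskatchewan", T = "Alberta", V = "British Columbia",
  X = "Northwest Territories or Nunavut", Y = "Yukon"
)

# Hypothetical helper: returns NA when the first character is not a known letter.
region_from_code <- function(code) {
  first <- substr(toupper(trimws(code)), 1, 1)
  unname(first_letter_region[first])
}

region_from_code(c("K1N 9C3", "J8Y 1W5", "93301"))
# "Ontario" "Quebec" NA
```

A lookup like this only reads the first character; it does not check that the rest of the string is a well-formed postal code.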

A province can have one or more first letters. In Ontario, the first letter of the postal code can be K, L, M, N, or P. There are also many exceptions. To learn more about postal codes, see the references from Wikipedia, the Canadian Government, and Canada Post, as well as the following summary:

Canadian Postal Codes:
A Canadian postal code is a six-character string that forms part of a postal address in Canada. Canada’s postal codes are alphanumeric. They are in the format A9A 9A9, where A is a letter and 9 is a digit, with a space separating the third and fourth characters.

There are some exceptions: the letters D, F, I, O, Q and U never appear in a postal code because of their visual similarity to 0, E, 1, 0, 0, and V respectively. In addition to these six, the letters W and Z do not appear as the first letter of a postal code.

Source: https://subscribe.hollywoodreporter.com/sub/validPostalCodeFormats.htm
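The exclusion rules quoted above translate directly into regex character classes. As a rough sketch in base R (the names `never_used`, `any_letter`, `first_letter`, `as_class()` and `pc_pattern` are made up for this example), the pattern can be assembled from the rules instead of being typed by hand:

```r
# Letters that never appear anywhere in a postal code.
never_used <- c("D", "F", "I", "O", "Q", "U")

# Letters allowed in positions 3 and 5, and the narrower set for position 1.
any_letter   <- setdiff(LETTERS, never_used)
first_letter <- setdiff(any_letter, c("W", "Z"))

# Collapse a set of letters into a regex character class such as "[ABC]".
as_class <- function(x) paste0("[", paste(x, collapse = ""), "]")

# Letter-digit-letter, a space, then digit-letter-digit.
pc_pattern <- paste0("^", as_class(first_letter), "[0-9]", as_class(any_letter),
                     " [0-9]", as_class(any_letter), "[0-9]$")
pc_pattern
# "^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"
```

Building the classes this way makes it easier to see where each letter set comes from, at the cost of a few extra lines.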

Now that we have the hang of the Canadian postal code, it’s time to look at some scenarios where a regular expression can capture it. Let’s look at the sample data below.

SCENARIO 1:

Your task is to validate each Canadian postal code and return TRUE or FALSE.

library(dplyr)    # mutate(), case_when(), %>%
library(stringr)  # str_detect(), str_squish(), regex()

ca_pc <- data.frame(
  NAME = c("Peter", "Hannah", "John", "Michelle"),
  CODE = c("k1n 9c3", "R2X 1C1 ", "T6C 4E3", "93301"),  # trailing whitespace after "R2X 1C1" is intentional
  PRO_STATE = c("Ontario", "Manitoba", "Alberta", "California")
)

Here is how I capture the postal code:

ca_pc_vali <- ca_pc %>%
  mutate(VALIDATE = str_detect(toupper(str_squish(CODE)),
    regex("^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$")))

regex("^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$") means: the postal code begins with one of the letters in the first pair of square brackets, followed by a digit, a letter, a space, a digit, a letter, and a final digit. The ^ and $ anchors make sure nothing else comes before or after.
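To see each part of the pattern at work, it can help to try it on strings that break exactly one rule. The sketch below uses base R's grepl() in place of str_detect() so it runs without extra packages; the test strings are made up for illustration.

```r
pattern <- "^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"

# Made-up test strings, each probing one rule of the pattern.
candidates <- c(
  "K1N 9C3",  # valid
  "W1N 9C3",  # W is not allowed as the first letter
  "K1D 9C3",  # D never appears in a postal code
  "K1N9C3",   # missing the space
  "K1N 9C"    # too short
)

# grepl(pattern, x) is the base R counterpart of stringr::str_detect(x, pattern).
data.frame(CODE = candidates, VALID = grepl(pattern, candidates))
```

Only the first row comes back TRUE; every other string fails on the rule noted in its comment.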

str_squish(): removes leading and trailing whitespace and collapses repeated whitespace inside the string into a single space

toupper(): converts all lowercase letters to uppercase

str_detect(): checks whether the string matches the pattern and returns TRUE or FALSE

As the output shows, the pattern captures Canadian postal codes in both lowercase and uppercase, and the trailing whitespace after R2X 1C1 is removed only for the check (the CODE column itself is unchanged). If you run the code again without str_squish(), the R2X 1C1 row returns FALSE.

In the same way, if you remove toupper(), the lowercase k1n 9c3 row returns FALSE in the VALIDATE column.
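The effect of each cleaning step can be compared side by side. This is a sketch in base R, not the stringr code used in the scenario: `squish()` is a hypothetical stand-in for str_squish(), and grepl() stands in for str_detect().

```r
pattern <- "^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"
codes <- c("k1n 9c3", "R2X 1C1 ", "T6C 4E3", "93301")

# Stand-in for str_squish(): trim the ends, then collapse inner runs of whitespace.
squish <- function(x) gsub("\\s+", " ", trimws(x))

checks <- data.frame(
  CODE        = codes,
  RAW         = grepl(pattern, codes),                  # no cleaning at all
  UPPER_ONLY  = grepl(pattern, toupper(codes)),         # fixes case, not whitespace
  SQUISH_ONLY = grepl(pattern, squish(codes)),          # fixes whitespace, not case
  BOTH        = grepl(pattern, toupper(squish(codes)))  # the combination used in Scenario 1
)
checks
```

Only the BOTH column accepts all three Canadian codes; the US ZIP code 93301 is rejected in every column.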

SCENARIO 2:

The PRO_STATE column has missing values, specifically Canadian provinces in this case. Fortunately, you have the postal codes, and you already know the basic rules of the Canadian postal code. How do you fill in these missing values?

Here is how:

ca_pc_vali <- ca_pc %>%
  mutate(VALIDATE = str_detect(toupper(CODE), regex("^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"))) %>%
  # Ontario postal codes start with K, L, M, N, or P
  mutate(PRO_STATE = case_when(
    str_detect(toupper(CODE), regex("^[KLMNP][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$")) ~ "Ontario",
    TRUE ~ PRO_STATE))

The pattern this time is slightly different. If the postal code starts with one of Ontario's first letters (K, L, M, N, or P; M covers Toronto), the province is set to Ontario; every other row keeps its existing PRO_STATE value. It returns as expected.
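If existing values must never be overwritten, one option is to restrict the update to rows where PRO_STATE is actually missing. The sketch below uses base R rather than dplyr, and the `people` data frame is made up for illustration. The Ontario pattern here includes M, which Canada Post uses for Toronto.

```r
# Hypothetical data in which two provinces are missing (NA).
people <- data.frame(
  NAME = c("Peter", "Hannah", "John"),
  CODE = c("k1n 9c3", "R2X 1C1 ", "m5v 2t6"),
  PRO_STATE = c(NA, "Manitoba", NA),
  stringsAsFactors = FALSE
)

# Normalise whitespace and case before matching.
clean <- toupper(gsub("\\s+", " ", trimws(people$CODE)))
is_ontario <- grepl("^[KLMNP][0-9][ABCEGHJKLMNPRSTVWXYZ] [0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$", clean)

# Overwrite only the rows that are both missing and recognisably Ontario.
fill <- is.na(people$PRO_STATE) & is_ontario
people$PRO_STATE[fill] <- "Ontario"
people
```

Hannah's row is left alone because it already has a value, while the two NA rows with Ontario codes are filled in.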

To learn more about case_when(), see the dplyr documentation.

Today I showcased just a few scenarios where this pattern can capture Canadian postal codes. There is much more you can do with it. If you like this post, hit the clap button and follow me to get my latest updates.
