Data Cleaning with Regular Expressions in Amazon Redshift

Ling Hong
Published in Analytics Vidhya · 5 min read · Feb 26, 2020

We all know the importance of data cleaning. You can’t get good results from incorrect data, even with a fancy machine learning model.

However, data cleaning can be daunting, especially with regular expressions. This article is a guide to get you started on data cleaning in Amazon Redshift. I will walk through my code in detail, so no worries if you have never worked with regular expressions before.


Data Validation

The very first step of data cleaning is to know what is in your data.

Most of the time, I want all data to be numeric, without any word characters or punctuation mixed in (except for decimal points).

We can use Redshift’s REGEXP_COUNT function, or the SIMILAR TO and LIKE operators, here.

REGEXP_COUNT searches for a pattern and returns the number of times that the pattern occurs.
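For instance, a quick way to see the counting behavior (the example string here is my own):

-- returns 3: there are three separate runs of digits
select REGEXP_COUNT('a1b22c333', '[0-9]+');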

SIMILAR TO also matches a string using regular expressions, but it only returns true if the pattern matches the entire string. Also, there is only a limited set of pattern-matching metacharacters that SIMILAR TO supports.

Pattern-matching metacharacters for SIMILAR TO (source: Amazon Redshift documentation)
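To see the whole-string behavior in action, here is a small check of my own:

-- returns FALSE: the pattern matches '123' but not the entire string
select '123abc' SIMILAR TO '[0-9]+';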

However, regular expression matching is expensive. Instead of regular expressions, LIKE uses the wildcard characters % (percent) and _ (underscore). If you have a very specific substring to match, the LIKE operator is recommended.
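As a quick illustration of the two wildcards (examples of my own):

-- % matches any sequence of zero or more characters
select '98 mg/dl' LIKE '%mg/dl';
-- _ matches exactly one character, so this returns TRUE
select 'cat' LIKE 'c_t';
-- and this returns FALSE
select 'coat' LIKE 'c_t';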

Below are some examples. To break the regular expression down, we can split it into four parts: ^, [0-9], +, and $. [0-9] matches one single digit from 0123456789. ^ and $ define the boundaries: ^ followed by [0-9] means the string should start with a digit, and $ after [0-9] means the string should end with a digit. +, as a quantifier, says that the preceding item should appear one or more times, so [0-9]+ matches all integers.

-- returns 1 when the string is an integer, 0 when it's not
select REGEXP_COUNT('41', '^[0-9]+$');

-- returns FALSE because ^ and $ are not supported as matching metacharacters; they are treated as literal characters
select '123' SIMILAR TO '^[0-9]+$';

-- returns TRUE
select '123' SIMILAR TO '[0-9]+'
-- match any string that starts with 1
select '123' LIKE '1%';
select '1a' LIKE '1%';

Furthermore, we may want to include decimal numbers. {0,} is another quantifier that matches the previous item zero or more times. In general, {m,n} matches the previous item m to n times. {1,} does the same job as +.
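A small sketch of {m,n}, with example values of my own:

-- returns 1: '2020' is two to four digits
select REGEXP_COUNT('2020', '^[0-9]{2,4}$');
-- returns 0: five digits exceed the upper bound
select REGEXP_COUNT('12345', '^[0-9]{2,4}$');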

() is used to include a capturing group. (\\.[0-9]+) captures the fractional part of a decimal number. Note that the decimal point has to be escaped as \\., because an unescaped . matches any single character. () is often used with quantifiers to specify the occurrence of that pattern.

Notice that this is different from []. [] means range, so [0-9] matches any digit from 0 to 9. However, (0-9) will only match the exact string “0-9”.
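To make the difference concrete (examples of my own):

-- returns 1: [0-9] matches the single digit 5
select REGEXP_COUNT('5', '^[0-9]$');
-- returns 0: (0-9) only matches the literal string '0-9'
select REGEXP_COUNT('5', '^(0-9)$');
-- returns 1
select REGEXP_COUNT('0-9', '^(0-9)$');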

-- to include decimal numbers
select REGEXP_COUNT('1.233', '^[0-9]+(\\.[0-9]+){0,}$');

Sometimes decimal numbers appear like .11, without any digits before the decimal point. To capture this, we can use |, the OR operator.

-- to include decimal numbers like .11
select REGEXP_COUNT('.11', '^[0-9]+(\\.[0-9]+){0,}$|^([0-9]+){0,}\\.[0-9]+$');
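In practice, these validation patterns usually end up in a WHERE clause to flag dirty rows. A minimal sketch, assuming a hypothetical table glucose_raw with a varchar column reading:

-- list rows whose reading is neither an integer nor a decimal number
select reading
from glucose_raw
where REGEXP_COUNT(reading, '^[0-9]+(\\.[0-9]+){0,}$|^([0-9]+){0,}\\.[0-9]+$') = 0;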

Sometimes the task can be more complex: we want to accept the data even though it’s alphanumeric. For example, for height, we want to accept “5 ft 4 in”. To allow for “ft” and “in”, we need to include them in the pattern.

Here \\s matches one whitespace character, and ? is a quantifier that matches the previous item zero or one time, which makes the “in” part optional. Pay attention that in Amazon Redshift, you need to escape the backslash with a second backslash.

-- returns 1
select REGEXP_COUNT('5 ft 1 in', '^[0-9]+\\sft(\\s[0-9]+\\sin)?$');

-- returns 1
select REGEXP_COUNT('5 ft', '^[0-9]+\\sft(\\s[0-9]+\\sin)?$');

-- returns True
select '5 ft' SIMILAR TO '[0-9]+\\sft(\\s[0-9]+\\sin)?';

-- returns True
select '5 ft 1 in' SIMILAR TO '[0-9]+\\sft(\\s[0-9]+\\sin)?';

Data Extraction

Data extraction helps us get the desired information from the data, focus on the important parts, and avoid unnecessary data cleaning.

Let’s take glucose level as an example. Oftentimes, glucose data comes in with the measurement unit, e.g. “98 mg/dl”, “101 mg/dL”, “89 MG/DL”, or even messier, “100 mg./dl.”

However, we only care about the numeric part. It doesn’t matter how dirty the unit part is.

We can use SPLIT_PART to split on whitespace and get the numeric reading.

-- returns 98
select SPLIT_PART('98 mg/dl', ' ', 1)
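The third argument of SPLIT_PART picks which field to return, so the unit is just as easy to grab (an extra example of my own):

-- returns mg/dl, the second whitespace-delimited field
select SPLIT_PART('98 mg/dl', ' ', 2);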

What if the data is “98mg/dl”? We need to turn to another function, REGEXP_SUBSTR.

REGEXP_SUBSTR searches for a regular expression pattern and returns the matching substring. [0-9]+ matches “98”, regardless of what comes after it.

-- returns 98mg/dl, not working
select SPLIT_PART('98mg/dl', ' ', 1)
-- returns 98
select REGEXP_SUBSTR('98mg/dl', '[0-9]+')
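If the reading can be a decimal, the decimal pattern from the validation section works here too; a sketch with a made-up value:

-- returns 98.5
select REGEXP_SUBSTR('98.5mg/dl', '[0-9]+(\\.[0-9]+)?');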

Data Conversion & Correction

Although this article is about how to use regular expressions, it’s equally important to know when it’s not necessary to use them.

For case conversion (upper to lower and vice versa), LOWER, UPPER can handle it easily.

-- returns 100 mg/dl
select LOWER('100 MG/DL')

For mapping values in SQL, CASE WHEN is a good option.

select CASE WHEN value='60 min' THEN '1 hour'
WHEN value='120 min' THEN '2 hours'
WHEN value='180 min' THEN '3 hours'
ELSE value END AS mapped_value
from ...

For more complex tasks, like replacing multiple whitespaces with one, we can use REGEXP_REPLACE. Notice that a regular expression is used to specify the matching pattern, but regular expression syntax cannot be used in the replacement string.

-- returns 5 ft 8 in
select REGEXP_REPLACE('5  ft  8  in', '\\s+', ' ');

-- returns 5sfts8sin, because \s has no special meaning in the replacement string
select REGEXP_REPLACE('5  ft  8  in', '\\s+', '\\s');
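REGEXP_REPLACE also works in the other direction: replacing unwanted characters with an empty string deletes them. A sketch using the messy glucose reading from earlier (here ^ inside the brackets negates the character class):

-- returns 100: drop every character that is not a digit
select REGEXP_REPLACE('100 mg./dl.', '[^0-9]', '');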

To sum up, in Redshift we can perform many data cleaning tasks, from data validation and data extraction to data conversion and correction. Several functions can be used to complete these tasks. Sometimes we can do without regular expressions, but as the data gets messier, regular expressions stand out by offering us more flexibility.

Thanks for reading. I hope you find this article useful! Feel free to drop a comment if you have any suggestions!

You can also reach out to me on LinkedIn.
