Data Cleaning with SQL 🧼
The foundational preprocessing step explained the easy way!
The pandemic has taught us a lot about cleaning. Let’s go ahead and put those lessons to use on our data!
Curious about SQL? Check out the entire series here:
CAST
A Data Type gives the database information about how the information in a column is intended to be used and what operations can be performed on it. For instance, COUNT
works with any data type, but SUM
only works for numerical data.
Often while importing or creating the database, numbers and dates are treated as strings. This limits the functions that can be performed on them. To improve this, we can modify the data type as :
CAST(column_name AS data_type)
or
column_name :: data_type
TRIM
Sometimes, we are met with a situation where we need to remove certain characters from the fields. The TRIM
function is used to remove characters from the beginning and end of a string. Here's an example:
SELECT column,
TRIM(position 'characters' FROM column)
FROM Tablename
The TRIM
function takes 3 arguments. First, the position, whether you want to remove characters from the beginning ('leading'), the end ('trailing'), or both ('both'). Next, specify all characters to be trimmed. Any characters included in the single quotes will be removed from the beginning, end, or both sides of the string. The last argument is the column name.
LEFT / RIGHT
To retrieve a given length of characters from either the left or right side of the string and present as it a separate string.
Usage:
LEFT(column, number_of_characters)
SUBSTR
LEFT
and RIGHT
both create substrings of a specified length, but they only do so starting from either side of an existing string. If one wants to start in the middle of a string, SUBSTR
can be used. The syntax is -
SUBSTR(column, starting_character_position, number_of_characters)
POSITION / STRPOS
POSITION
and STRPOS
take a substring and return the position (counting from left) where that substring first appears in the target string.
POSITION( substring IN column )
OR
STRPOS( column , substring )
CONCAT
Strings can be combined from several columns together using CONCAT
. Simply order the values you want to concatenate and separate them with commas as:
CONCAT ( col1,col2,col3…..)
Hardcoded values can be concatenated by enclosing them inside single quotes as:
CONCAT( Numerator , '/' , Denominator )
CONCAT’s objective can also be achieved by ||
operator.
col1 || col2 || col3 ………
Parsing Dates
If the date is written as a string we can surely convert it to date data type using CAST
. But this only works when it is in an SQL identified format. But what if it is other formats like MM/DD/YYYY? We first need to convert it to a string in an acceptable format and then cast it. For example,
(SUBSTR(date, 7, 4) || ‘-’ || LEFT(date, 2) || ‘-’ || SUBSTR(date, 4, 2))::date AS cleaned_date
EXTRACT
A lot of times we need to take into account a specific part of the date, like sales in this month, admissions in this year. In such cases, we extract information from the date column as :
EXTRACT('year' FROM cleaned_date) AS year,
EXTRACT('month' FROM cleaned_date) AS month,
EXTRACT('day' FROM cleaned_date) AS day,
EXTRACT('hour' FROM cleaned_date) AS hour,
EXTRACT('minute' FROM cleaned_date) AS minute, EXTRACT('second' FROM cleaned_date) AS second, EXTRACT('decade' FROM cleaned_date) AS decade,
EXTRACT('dow' FROM cleaned_date) AS day_of_week
NOW
SQL provides a wide variety of functions to retrieve the current date, time, and timestamp. Fun fact: They can be printed without the FROM clause.
SELECT CURRENT_DATE AS date,
CURRENT_TIME AS time,
CURRENT_TIMESTAMP AS timestamp,
LOCALTIME AS localtime,
LOCALTIMESTAMP AS localtimestamp,
CURRENT_TIME AT TIME ZONE 'PST' AS time_pst,
NOW() AS now
Handling Missing Values
The most commonly found discrepancy in data is missing values. There are a lot of ways to handle this, a few listed here :
❓ Delete the rows with missing values
❓ Replace with mean/median/mode
❓ Assign a unique category
❓ Predict the missing values
We’ll focus on first and third here,
NOT NULL
When we want to remove the rows with empty fields from our result we use NOT NULL that only prints the rows that have some value for the column specified.
SLELCT *
FROM Tablename
WHERE column NOT NULL
COALESCE
COALESCE
is used to replace the NULL values with a specific value so they don’t hinder further analysis.
COALESCE(column, new value)
This was it for this iteration of the series, next part will explore Windows Functions in SQL.
Happy cleaning! ✨
Next up in the series :
Interested in daily data analysis content? Follow me on Twitter and Medium.