Convert NORMALIZE (Teradata) to Bigquery
Introduction
NORMALIZE is a Teradata extension to the ANSI SQL standard. When a SQL table includes a column that has a PERIOD data type, NORMALIZE combines values that meet or overlap in that column, to form a single period that consolidates multiple, individual period values.
To use NORMALIZE, at least one column in the SQL SELECT list must be of Teradata’s temporal PERIOD data type.
Bigquery currently doesn’t support the PERIOD data type in a table. Instead, we need to divide a Teradata PERIOD data field into two parts: eg) start_date and end_date
Prerequisites
- Basic knowledge of the Teradata data warehouse and its SQL syntax
- Good understanding of Bigquery and its SQL syntax
Conversion Logic
Example Scenario:
Assume in the below example, we are going to Normalize on the employee_id.Here period column is split into two parts (start_date and end_date)
The sample data looks as below
Teradata SQL:
SELECT NORMALIZE ON MEETS OR OVERLAPS emp_id, duration
FROM poc_aa_cs.Normalize_Emp_Details
ORDER BY 1,2;
Output:
Bigquery SQL:
SELECT S2.emp_id, MIN(S2.start_date) as new_start_date, MAX(S2.end_date) as new_end_date
FROM
( SELECT S1.*, SUM(GroupStartFlag) OVER (PARTITION BY emp_id ORDER BY start_date ROWS UNBOUNDED PRECEDING) As GroupID
FROM ( SELECT emp_id, start_date, end_date,(CASE WHEN start_date <= LAG(max_end_date) OVER (PARTITION BY emp_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM (select emp_id, start_date, end_date, max(end_date) OVER (PARTITION BY emp_id ORDER BY start_date, end_date ROWS UNBOUNDED PRECEDING) as max_end_date from `poc-env-aks.poc_aa_cs.Normalize_Emp_Details` )) S1
) S2
GROUP BY S2.emp_id, S2.GroupID
ORDER BY 1,2;
Translation Explanation:
- This LAG function returns the values for a row at a given offset above (before) the current row in the partition.
- You can use the LAG function to identify each row that begins a new period by determining if a period meets or overlaps with the previous period (0 if yes and 1 if no )
- When this flag is cumulatively summed up, it provides a group identifier that can be used in the outer Group By clause to arrive at the desired result.
- The output of S1 is as follows
- The output of S2 is as follows
After these changes, Bigquery SQL generates the same output as Teradata.