Modern SQL — Not so commonly used, yet powerful features

Anup Moncy · Published in Data Engineering · 3 min read · Oct 30, 2023

Most data platforms have evolved, and so has the way we write SQL, although the basic fundamentals remain the same. I thought it would be useful to write about some SQL features that are not talked about as often.

Please note: not every function is available in every database, and there can also be syntax differences between engines.

TL;DR:

  1. Sequence
  2. Explode (and Explode with sequence)
  3. Lag/Lead
  4. Try Divide (and more)
  5. Recursive CTEs
  6. Pivot/Unpivot

Sequence

If you have a date range or a list of items and need them as separate values, this option generates a sequence of values as an array, given a range as input.

Traditional option (a sequence object, e.g. in SQL Server):

CREATE SEQUENCE My_Sequence AS INT
START WITH 1
INCREMENT BY 1
MINVALUE 1
MAXVALUE 7
CYCLE;

-- Consuming the sequence returns one value per call:
-- 1, 2, 3, 4, 5, 6, 7 (then CYCLE wraps back to 1)
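
To actually pull values from a sequence object you query it explicitly; a minimal sketch, assuming SQL Server, where NEXT VALUE FOR is the usual way to consume a sequence:

-- Each call advances the sequence and returns the next value
SELECT NEXT VALUE FOR My_Sequence AS next_id;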

Databricks option:

sequence(start, stop [, step] )

Example:

> SELECT sequence(1, 5);
[1,2,3,4,5]

> SELECT sequence(5, 1);
[5,4,3,2,1]

> SELECT sequence(DATE'2023-01-01', DATE'2023-03-01', INTERVAL 1 MONTH);
[2023-01-01,2023-02-01,2023-03-01]
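
The optional step argument works for plain numbers too, for example counting by 2:

> SELECT sequence(1, 10, 2);
[1,3,5,7,9]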

Explode

If we have a range of values, such as dates, and need a row for every value in the range, this option is best suited. Given an input array or map, explode creates a row for each element in the collection.

> SELECT explode(array(10, 20)) AS elem, 'Spark';
10 Spark
20 Spark

> SELECT explode(map(1, 'a', 2, 'b')) AS (num, val), 'Spark';
1 a Spark
2 b Spark

> SELECT explode(array(1, 2)), explode(array(3, 4));
Error: UNSUPPORTED_GENERATOR.MULTI_GENERATOR

-- Reference: Databricks documentation
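
The last example fails because only one generator is allowed per SELECT list. One way around this in Spark/Databricks SQL is LATERAL VIEW, which applies each explode in the FROM clause instead; a small sketch with made-up arrays:

SELECT a, b
FROM (SELECT array(1, 2) AS arr1, array(3, 4) AS arr2) t
LATERAL VIEW explode(arr1) e1 AS a
LATERAL VIEW explode(arr2) e2 AS b;
-- Produces the cross product: (1,3), (1,4), (2,3), (2,4)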

Explode with sequence:

> SELECT explode(sequence(DATE'2023-01-01', DATE'2023-03-01', INTERVAL 1 MONTH)), 'Test';

Result:
2023-01-01 Test
2023-02-01 Test
2023-03-01 Test
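
Combining the two is a handy way to build a date spine, for example to surface days with no activity. A minimal sketch, assuming a hypothetical orders table with order_id and order_date columns:

WITH date_spine AS (
  SELECT explode(sequence(DATE'2023-01-01', DATE'2023-01-31', INTERVAL 1 DAY)) AS cal_date
)
SELECT d.cal_date, COUNT(o.order_id) AS orders_placed
FROM date_spine d
LEFT JOIN orders o ON o.order_date = d.cal_date
GROUP BY d.cal_date
ORDER BY d.cal_date;
-- Days with no orders still appear, with a count of 0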

Lead and Lag analytic window functions

LAG returns a value from the previous row within the partition; LEAD returns a value from the next row. Both accept an offset and a default value for rows that have no previous/next row, as in the examples below.

-- Lag
SELECT deptno,
       empno,
       ename,
       job,
       sal,
       LAG(sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal) AS sal_prev
FROM emp;

DEPTNO EMPNO ENAME JOB SAL SAL_PREV
---------- ---------- ---------- --------- ---------- ----------
10 7934 MILLER CLERK 1300 0
10 7782 CLARK MANAGER 2450 1300
10 7839 KING PRESIDENT 5000 2450
20 7369 SMITH CLERK 800 0
20 7876 ADAMS CLERK 1100 800
20 7566 JONES MANAGER 2975 1100
20 7788 SCOTT ANALYST 3000 2975
20 7902 FORD ANALYST 3000 3000
30 7900 JAMES CLERK 950 0
30 7654 MARTIN SALESMAN 1250 950
30 7521 WARD SALESMAN 1250 1250
30 7844 TURNER SALESMAN 1500 1250
30 7499 ALLEN SALESMAN 1600 1500
30 7698 BLAKE MANAGER 2850 1600

-- Lead
SELECT deptno,
       empno,
       ename,
       job,
       sal,
       LEAD(sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal) AS sal_next
FROM emp;

DEPTNO EMPNO ENAME JOB SAL SAL_NEXT
---------- ---------- ---------- --------- ---------- ----------
10 7934 MILLER CLERK 1300 2450
10 7782 CLARK MANAGER 2450 5000
10 7839 KING PRESIDENT 5000 0
20 7369 SMITH CLERK 800 1100
20 7876 ADAMS CLERK 1100 2975
20 7566 JONES MANAGER 2975 3000
20 7788 SCOTT ANALYST 3000 3000
20 7902 FORD ANALYST 3000 0
30 7900 JAMES CLERK 950 1250
30 7654 MARTIN SALESMAN 1250 1250
30 7521 WARD SALESMAN 1250 1500
30 7844 TURNER SALESMAN 1500 1600
30 7499 ALLEN SALESMAN 1600 2850
30 7698 BLAKE MANAGER 2850 0
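
A common use is computing deltas between consecutive rows. A minimal sketch, assuming a hypothetical daily_sales table with sale_date and revenue columns:

SELECT sale_date,
       revenue,
       revenue - LAG(revenue, 1, 0) OVER (ORDER BY sale_date) AS day_over_day_change
FROM daily_sales;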

Try Divide:

Returns NULL instead of raising an error when the divisor is zero.

> SELECT try_divide(3, 2);
1.5

> SELECT try_divide(2L, 2L);
1.0

> SELECT try_divide(INTERVAL '3:15' HOUR TO MINUTE, 3);
0 01:05:00.000000

> SELECT try_divide(3, 0);
NULL
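
try_divide is one of a family of try_ functions in Databricks SQL; try_cast, for example, returns NULL instead of failing on a bad conversion (availability of individual try_ functions depends on the runtime version):

> SELECT try_cast('abc' AS INT);
NULL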

Recursive CTEs

Recursive CTEs let you write a CTE that references itself, which is useful for walking hierarchies such as category trees or org charts.

WITH RECURSIVE category_tree AS (
  SELECT category_id, parent_category_id
  FROM categories
  WHERE parent_category_id IS NULL
  UNION ALL
  SELECT c.category_id, c.parent_category_id
  FROM categories c
  JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT category_id, parent_category_id FROM category_tree;

This query starts from the root categories (those with no parent) and repeatedly joins back to pick up their children, returning the full hierarchy.
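
A common variation carries a level counter so the depth of each row is explicit; a sketch against the same categories table (keyword support varies: some engines require WITH RECURSIVE, others just WITH):

WITH RECURSIVE category_tree AS (
  SELECT category_id, parent_category_id, 1 AS level
  FROM categories
  WHERE parent_category_id IS NULL
  UNION ALL
  SELECT c.category_id, c.parent_category_id, ct.level + 1
  FROM categories c
  JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT category_id, parent_category_id, level
FROM category_tree
ORDER BY level, category_id;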

Pivot and Unpivot

The PIVOT operator transforms one table into another by rotating row data into column data. This can be useful for summarizing data by category or dimension.

SELECT *
FROM (SELECT customer_id, region, sales FROM orders)
PIVOT (
  SUM(sales)
  FOR region IN ('North America' AS north_america, 'Europe' AS europe, 'Asia' AS asia)
);
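
The result has one row per customer_id and one sales column per region (north_america, europe and asia in this sketch).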

The UNPIVOT operator is the opposite of PIVOT: it rotates column data back into row data. This can be useful for normalizing data or preparing it for visualization.

-- Assuming pivoted_orders has columns customer_id, north_america, europe, asia
SELECT customer_id, region, sales
FROM pivoted_orders
UNPIVOT (
  sales FOR region IN (north_america, europe, asia)
);

Looking forward to: AI in SQL (a Databricks feature with limited availability)

I am yet to explore this, but I will surely write a full article about it.
