Modern SQL — Not so commonly used, yet powerful features

Anup Moncy · Published in Data Engineering · 3 min read · Oct 30, 2023

Most data platforms have evolved, and so has the way we write SQL, although the basic fundamentals remain the same. I thought it would be useful to write about some SQL features that are not talked about as often.

Please note: not every function is available in every database, and there can also be syntax differences between engines.

TL;DR:

  1. Sequence
  2. Explode (and Explode with sequence)
  3. Lag/Lead
  4. Try Divide (and more)
  5. Recursive CTEs
  6. Pivot/Unpivot

Sequence

If you have a date range or a list of items and need them as separate values, this option generates a sequence of values as an array, given a range as input.

Traditional option (a sequence object, e.g. in SQL Server):

CREATE SEQUENCE My_Sequence AS INT
START WITH 1
INCREMENT BY 1
MINVALUE 1
MAXVALUE 7
CYCLE;

-- Consuming the sequence returns one value per call:
-- 1, 2, 3, 4, 5, 6, 7 (then CYCLE wraps back to 1)
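
To actually pull values from a sequence object you query it explicitly; a minimal sketch, assuming SQL Server, where NEXT VALUE FOR is the usual way to consume a sequence:

-- Each call advances the sequence and returns the next value
SELECT NEXT VALUE FOR My_Sequence AS next_id;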

Databricks option:

sequence(start, stop [, step] )

Example:

> SELECT sequence(1, 5);
[1,2,3,4,5]

> SELECT sequence(5, 1);
[5,4,3,2,1]

> SELECT sequence(DATE'2023-01-01', DATE'2023-03-01', INTERVAL 1 MONTH);
[2023-01-01,2023-02-01,2023-03-01]
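
The optional step argument works for plain numbers too, for example counting by 2:

> SELECT sequence(1, 10, 2);
[1,3,5,7,9]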

Explode

If we have a range of values, such as dates, and need a row for every value in the range, this option is best suited. Given an input array or map, explode creates a row for each element in the collection.

> SELECT explode(array(10, 20)) AS elem, 'Spark';
10 Spark
20 Spark

> SELECT explode(map(1, 'a', 2, 'b')) AS (num, val), 'Spark';
1 a Spark
2 b Spark

> SELECT explode(array(1, 2)), explode(array(3, 4));
Error: UNSUPPORTED_GENERATOR.MULTI_GENERATOR

-- Reference: Databricks documentation
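
The last example fails because only one generator is allowed per SELECT list. One way around this in Spark/Databricks SQL is LATERAL VIEW, which applies each explode in the FROM clause instead; a small sketch with made-up arrays:

SELECT a, b
FROM (SELECT array(1, 2) AS arr1, array(3, 4) AS arr2) t
LATERAL VIEW explode(arr1) e1 AS a
LATERAL VIEW explode(arr2) e2 AS b;
-- Produces the cross product: (1,3), (1,4), (2,3), (2,4)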

Explode with sequence:

> SELECT explode(sequence(DATE'2023-01-01', DATE'2023-03-01', INTERVAL 1 MONTH)), 'Test';

Result:
2023-01-01 Test
2023-02-01 Test
2023-03-01 Test
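
Combining the two is a handy way to build a date spine, for example to surface days with no activity. A minimal sketch, assuming a hypothetical orders table with order_id and order_date columns:

WITH date_spine AS (
  SELECT explode(sequence(DATE'2023-01-01', DATE'2023-01-31', INTERVAL 1 DAY)) AS cal_date
)
SELECT d.cal_date, COUNT(o.order_id) AS orders_placed
FROM date_spine d
LEFT JOIN orders o ON o.order_date = d.cal_date
GROUP BY d.cal_date
ORDER BY d.cal_date;
-- Days with no orders still appear, with a count of 0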

Lead and Lag analytic window functions

LAG returns a value from the previous row within the partition; LEAD returns a value from the next row. Both accept an offset and a default value for rows that have no previous/next row, as in the examples below.

-- Lag
SELECT deptno,
       empno,
       ename,
       job,
       sal,
       LAG(sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal) AS sal_prev
FROM emp;

DEPTNO EMPNO ENAME JOB SAL SAL_PREV
---------- ---------- ---------- --------- ---------- ----------
10 7934 MILLER CLERK 1300 0
10 7782 CLARK MANAGER 2450 1300
10 7839 KING PRESIDENT 5000 2450
20 7369 SMITH CLERK 800 0
20 7876 ADAMS CLERK 1100 800
20 7566 JONES MANAGER 2975 1100
20 7788 SCOTT ANALYST 3000 2975
20 7902 FORD ANALYST 3000 3000
30 7900 JAMES CLERK 950 0
30 7654 MARTIN SALESMAN 1250 950
30 7521 WARD SALESMAN 1250 1250
30 7844 TURNER SALESMAN 1500 1250
30 7499 ALLEN SALESMAN 1600 1500
30 7698 BLAKE MANAGER 2850 1600

-- Lead
SELECT deptno,
       empno,
       ename,
       job,
       sal,
       LEAD(sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal) AS sal_next
FROM emp;

DEPTNO EMPNO ENAME JOB SAL SAL_NEXT
---------- ---------- ---------- --------- ---------- ----------
10 7934 MILLER CLERK 1300 2450
10 7782 CLARK MANAGER 2450 5000
10 7839 KING PRESIDENT 5000 0
20 7369 SMITH CLERK 800 1100
20 7876 ADAMS CLERK 1100 2975
20 7566 JONES MANAGER 2975 3000
20 7788 SCOTT ANALYST 3000 3000
20 7902 FORD ANALYST 3000 0
30 7900 JAMES CLERK 950 1250
30 7654 MARTIN SALESMAN 1250 1250
30 7521 WARD SALESMAN 1250 1500
30 7844 TURNER SALESMAN 1500 1600
30 7499 ALLEN SALESMAN 1600 2850
30 7698 BLAKE MANAGER 2850 0
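
A common use is computing deltas between consecutive rows. A minimal sketch, assuming a hypothetical daily_sales table with sale_date and revenue columns:

SELECT sale_date,
       revenue,
       revenue - LAG(revenue, 1, 0) OVER (ORDER BY sale_date) AS day_over_day_change
FROM daily_sales;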

Try Divide:

Returns NULL instead of raising an error when the divisor is zero.

> SELECT try_divide(3, 2);
1.5

> SELECT try_divide(2L, 2L);
1.0

> SELECT try_divide(INTERVAL '3:15' HOUR TO MINUTE, 3);
0 01:05:00.000000

> SELECT try_divide(3, 0);
NULL
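
try_divide is one of a family of try_ functions in Databricks SQL; try_cast, for example, returns NULL instead of failing on a bad conversion (availability of individual try_ functions depends on the runtime version):

> SELECT try_cast('abc' AS INT);
NULL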

Recursive CTEs

Recursive CTEs let you write a CTE that references itself, which is useful for walking hierarchies such as category trees or org charts.

WITH RECURSIVE category_tree AS (
  SELECT category_id, parent_category_id
  FROM categories
  WHERE parent_category_id IS NULL
  UNION ALL
  SELECT c.category_id, c.parent_category_id
  FROM categories c
  JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT category_id, parent_category_id FROM category_tree;

This query starts from the root categories (those with no parent) and repeatedly joins back to pick up their children, returning the full hierarchy.
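
A common variation carries a level counter so the depth of each row is explicit; a sketch against the same categories table (keyword support varies: some engines require WITH RECURSIVE, others just WITH):

WITH RECURSIVE category_tree AS (
  SELECT category_id, parent_category_id, 1 AS level
  FROM categories
  WHERE parent_category_id IS NULL
  UNION ALL
  SELECT c.category_id, c.parent_category_id, ct.level + 1
  FROM categories c
  JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT category_id, parent_category_id, level
FROM category_tree
ORDER BY level, category_id;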

Pivot and Unpivot

The PIVOT operator transforms one table into another by rotating row data into column data. This can be useful for summarizing data by category or dimension.

SELECT *
FROM (SELECT customer_id, region, sales FROM orders)
PIVOT (
  SUM(sales)
  FOR region IN ('North America' AS north_america, 'Europe' AS europe, 'Asia' AS asia)
);
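
The result has one row per customer_id and one sales column per region (north_america, europe and asia in this sketch).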

The UNPIVOT operator is the opposite of PIVOT: it rotates column data back into row data. This can be useful for normalizing data or preparing it for visualization.

-- Assuming pivoted_orders has columns customer_id, north_america, europe, asia
SELECT customer_id, region, sales
FROM pivoted_orders
UNPIVOT (
  sales FOR region IN (north_america, europe, asia)
);

Looking forward to: AI in SQL (a Databricks feature with limited availability)

I am yet to explore this, but I will surely write a full article about it.
