Advanced SQL Techniques for Data Engineers: Window Functions, CTEs, and Recursive Queries

Sallu Muharomah
EDTS
Published in
4 min readJul 29, 2024

SQL remains a cornerstone of data engineering, offering powerful capabilities to manipulate and analyze data. For data engineers, mastering advanced SQL techniques like window functions, Common Table Expressions (CTEs), and recursive queries can enhance the efficiency and effectiveness of data workflows. This article explores these advanced techniques, providing practical examples and use cases that demonstrate their value in real-world scenarios.

Window Functions

Window functions perform calculations across a set of table rows related to the current row, allowing for complex analyses without collapsing rows into a single summary. This is particularly useful for generating running totals, ranking, and calculating moving averages.

Key Window Functions:

ROW_NUMBER(): Assigns a unique sequential integer to rows within a partition.

This query provides insights into how employees are positioned salary-wise compared to their peers in the same department.

SELECT 
employee_id,
employee_name,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM
employees;
Use Case: Determine the rank of employees by salary within each department.
Use Case: Determine the rank of employees by salary within each department.

RANK() and DENSE_RANK(): Provide ranking with handling for ties.

SELECT 
employee_id,
employee_name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM
employees
ORDER BY
department, rank;
SELECT 
employee_id,
employee_name,
department,
salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM
employees
ORDER BY
department, dense_rank;

Use Case: Rank employees by salary, allowing for duplicate ranks when salaries are equal

SUM(), AVG(), MIN(), MAX(): Aggregate functions applied over a window.

SELECT 
employee_id,
employee_name,
department,
salary,
SUM(salary) OVER (PARTITION BY department) AS total_department_salary,
AVG(salary) OVER (PARTITION BY department) AS avg_department_salary
FROM
employees;
Use Case: Calculate the total salary per department for comparison with individual salaries.

LEAD() and LAG(): Access data from subsequent or preceding rows.

SELECT 
employee_id,
salary,
LAG(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS prev_salary
FROM
employees;
Use Case: Compare current salary with the previous salary within the same department.

Practical Applications:

  • Running Totals: Useful for financial data where cumulative totals are needed.
  • Comparative Analysis: Evaluate performance changes over time or within groups.
  • Rankings and Percentiles: Essential for performance metrics, like sales leaderboards.

Common Table Expressions (CTEs)

CTEs simplify complex queries by allowing you to define temporary result sets that can be referenced within a single query. This modular approach enhances readability and maintainability.

Syntax:

Syntax:
WITH cte_name AS (
SELECT columns
FROM table
WHERE conditions
)
SELECT columns
FROM cte_name
WHERE conditions;

Benefits:

  • Modular Query Design: Break complex queries into manageable parts.
  • Improved Readability: Make queries easier to understand and maintain.
  • Reusability: Reuse CTEs within a query for complex operations.

Example:

To compare employees’ salaries to the average salary in their department:

WITH department_averages AS (
SELECT
department,
AVG(salary) AS avg_salary
FROM
employees
GROUP BY
department
)
SELECT
e.employee_id,
e.employee_name,
e.salary,
d.avg_salary
FROM
employees e
JOIN
department_averages d ON e.department = d.department
WHERE
e.salary > d.avg_salary;

Practical Applications:

  • Data Filtering: Refine data based on intermediate calculations.
  • Hierarchical Data: Simplify hierarchical queries.
  • Aggregations: Compute and use aggregate metrics in complex queries.

Recursive Queries

Recursive queries, enabled through recursive CTEs, are essential for querying hierarchical or tree-structured data. They allow for traversing relationships and aggregating hierarchical data.

Syntax:

Example:

To compare employees’ salaries to the average salary in their department:

WITH RECURSIVE cte_name AS (
-- Anchor member: base case
SELECT columns
FROM table
WHERE initial_condition
UNION ALL
-- Recursive member: recursive case
SELECT columns
FROM table
JOIN cte_name ON table.column = cte_name.column
)
SELECT columns
FROM cte_name;

Example:

To retrieve an organizational hierarchy:

WITH RECURSIVE org_chart AS (
SELECT
employee_id,
employee_name,
manager_id
FROM
employees
WHERE
manager_id IS NULL
UNION ALL
SELECT
e.employee_id,
e.employee_name,
e.manager_id
FROM
employees e
JOIN
org_chart o ON e.manager_id = o.employee_id
)
SELECT * FROM org_chart;

Practical Applications:

  • Hierarchical Data: Query organizational structures, file systems, or product categories.
  • Pathfinding: Identify paths or relationships within graph-like data.
  • Hierarchical Aggregation: Aggregate data at various levels of a hierarchy.

Conclusion

Advanced SQL techniques such as window functions, CTEs, and recursive queries are powerful tools for data engineers, enabling complex data manipulations and analyses. Mastery of these techniques not only enhances query performance but also simplifies the management and analysis of complex datasets. By incorporating these techniques into your data engineering practices, you can handle sophisticated data tasks with greater efficiency and accuracy.

These advanced SQL tools will help you unlock new levels of data insight, streamline complex queries, and ultimately drive more informed decision-making in your data-driven projects.

--

--