Analytics Engineer Journey #2-Database II &III : Deep Dive SQL: Aggregation,DDL, DML, Metadata, Conditional Query, Subquery and CTE

Ashhab Karami
The Deep Hub
Published in
7 min readMay 1, 2024

OLTP & OLAP

OLTP (Online Transaction Processing):

  • Function:
  • OLTP systems handle day-to-day business operations and transactions.
  • They manage real-time data processing, such as recording sales, updating inventory, processing orders, and managing customer accounts.
  • OLTP ensures data integrity, consistency, and accuracy.

OLAP (Online Analytical Processing):

  • Function:
  • OLAP systems support complex data analysis and decision-making.
  • They provide aggregated, summarized, and multidimensional views of data.
  • OLAP helps users explore trends, patterns, and relationships in data.

At the simplest architecture, OLTP and OLAP takes place in here

In summary:

OLTP handles day-to-day transactions.

OLAP supports data analysis and decision-making.

Both serve different purposes and complement each other in an organization’s data ecosystem. 📊💼

OLTP Tools
OLAP Tools

BASIC SQL COMMAND

  1. IN and NOT IN: These commands are used to filter a column based on multiple conditions.
  • Example: SELECT * FROM students WHERE grade IN ('A', 'B', 'C'); This query selects all students whose grades are either 'A', 'B', or 'C'.

2. DISTINCT: This command is used to display only unique elements in a column.

  • Example: SELECT DISTINCT department FROM employees; This query retrieves the unique department names from the employees table.

3. Alias (AS): Alias is used to give a new name to a column or a table.

  • Example: SELECT first_name AS 'First Name', last_name AS 'Last Name' FROM employees; This query renames the 'first_name' column as 'First Name' and 'last_name' column as 'Last Name'.

4. EXCEPT: This command is used to retrieve all rows from one query that are not present in the result set of another query.

  • Example: SELECT * FROM students WHERE grade = 'A' EXCEPT SELECT * FROM students WHERE grade = 'B'; This query selects all students who got grade 'A' but not grade 'B'.

Aggregation Functions

In SQL, aggregation functions are used to perform calculations on a set of values and return a single result. These functions allow you to summarize data and gain insights into your dataset.

Here are some common aggregation functions:

  1. SUM: Calculates the sum of values in a column.
  • Example: SELECT SUM(sales_amount) FROM sales; This query returns the total sales amount from the sales table.

2. MIN: Finds the minimum value in a column.

  • Example: SELECT MIN(age) FROM employees; This query retrieves the youngest age from the employees table.

3. MAX: Retrieves the maximum value in a column.

  • Example: SELECT MAX(price) FROM products; This query returns the highest price among all products.

4. COUNT: Counts the number of rows in a result set.

  • Example: SELECT COUNT(*) FROM customers; This query counts the total number of customers in the customers table.

5. AVG: Calculates the average value of a column.

  • Example: SELECT AVG(salary) FROM employees; This query retrieves the average salary of all employees.

When using aggregation functions with other columns in SQL, it’s necessary to include a GROUP BY clause to specify how the data should be grouped. This ensures that the aggregation functions are applied to the appropriate subsets of data.

Group By Clause

The GROUP BY clause is used in conjunction with aggregation functions to group the result set by one or more columns. This allows you to perform aggregate calculations on each group separately.

Example: SELECT department, AVG(salary) FROM employees GROUP BY department; This query calculates the average salary for each department in the employees table.

In BigQuery, you can also use GROUP BY ALL to include all possible combinations of group-by columns, even those with NULL values.

Understanding aggregation functions and the GROUP BY clause is essential for performing insightful analysis and gaining valuable insights from your data in SQL.

SELECT
prod_code AS prod_code, -- Extract the first 3 characters from the left
SUM(partnerorder.shop_data.total_revenue) AS total_revenue
FROM gybyt
GROUP BY prod_id
-- Use HAVING (similar to WHERE, but for aggregated results)
HAVING total_revenue > 5000
ORDER BY total_revenue DESC;
query result

Explanation:

We’re selecting the prod_code (extracting the first 3 characters) and the total revenue.

The SUM function aggregates the total revenue for each product.

The GROUP BY clause groups the data by prod_id.

The HAVING clause filters the aggregated results to only include products with total revenue greater than 5000.

Finally, we order the results in descending order of total revenue.

DDL & DML

  • DDL(Data Definition Language) creates and manages the database structure.
  • DML (Data Manipulation Language) interacts with the data stored in the database.

Data Definition Language doesn’t change the objects content.

DDL Syntax

While Data Definition Language does not change the objects content, Data Manipulation Language takes part on changing/manipulating the objects content i.e changing value of specific rows, adding new rows to the table.

DML Syntax

CONDITIONAL QUERY

Suppose we have a table called students with columns student_id, name, and score. We want to categorize students as ‘Pass’ or ‘Fail’ based on their scores. If the score is greater than or equal to 60, they pass; otherwise, they fail.

#QUERY1

SELECT
student_id,
name,
IF(score >= 60, 'Pass', 'Fail') AS result
FROM students;

Explanation:

  • We use the IF function to create a custom column called result.
  • If the score is greater than or equal to 60, it’s categorized as ‘Pass’.
  • Otherwise, it’s categorized as ‘Fail’.

#QUERY2

SELECT
score_eko,
CASE
WHEN LENGTH(score_geo) < 60 THEN 'Ekonomi bagus'
ELSE 'Ekonomi jelek :('
END AS kategori_eko,
nama_geb_bau,
nilai_eko_geb_buruk
FROM par_table_raw_score_humanities;

Explanation:

  • We use the CASE statement to create a custom category (kategori_eko) based on the length of the score_geo.
  • If the length of score_geo is less than 60 characters, it’s categorized as ‘Ekonomi bagus’ (good economy).
  • Otherwise, it’s categorized as ‘Ekonomi jelek :(’ (poor economy).
  • The query retrieves other relevant columns (score_eko, nama_geb_bau, and nilai_eko_geb_buruk).

CTE & SUBQUERY

  1. Subquery:

Explanation:

  • A subquery is a query embedded within another query.
  • It’s enclosed in parentheses and used within the main query.
  • Subqueries can appear in various parts of a query, such as the SELECT, FROM, or WHERE clauses.
  • They help retrieve data based on intermediate results or conditions.

Example:

  • Suppose you want to find all customers who have made orders in the last month. You can use a subquery to filter orders based on their order dates.

2. CTE (Common Table Expression):

Explanation:

  • A CTE is a named temporary result set that you can reference within a query.
  • It’s defined using the WITH keyword and provides a cleaner way to break down complex queries.
  • CTEs are especially useful for recursive queries or when you need to reuse the same intermediate result multiple times.

Example:

  • Imagine you want to calculate the total revenue for each product category. You can create a CTE that computes the revenue per category and then join it with other tables.

Difference:

Usage:

  • Subqueries are used within a single query to filter, aggregate, or compare data.
  • CTEs are used to create named temporary result sets that can be referenced multiple times within a query.

Readability:

  • CTEs are often more readable than subqueries, especially for complex queries.
  • Subqueries can clutter the main query, making it harder to understand.

Reuse:

  • CTEs can be referenced multiple times within the same query.
  • Subqueries can only be used once in a query.

Performance:

  • Both subqueries and CTEs can impact performance, but it depends on the specific use case and database engine.

In summary:

  • Subqueries are embedded queries used within a main query.
  • CTEs are named temporary result sets that enhance query readability and reusability.

Remember, subqueries are like little helpers, while CTEs are like named assistants! 📊🔍

example subquery

A subquery is a query nested within another query. It’s typically enclosed within parentheses and used to return a single value or set of values that can be used by the outer query.

SELECT * 
FROM employees
WHERE department_id = (
SELECT department_id
FROM departments
WHERE department_name = 'Sales'
);

In this example, the inner subquery (SELECT department_id FROM departments WHERE department_name = 'Sales') retrieves the department ID for the 'Sales' department. The outer query then uses this department ID to find all employees working in the 'Sales' department.

example CTE

A Common Table Expression (CTE) is a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. It’s defined using the WITH keyword and provides a more readable and modular approach to complex queries.

WITH SalesDepartment AS (
SELECT department_id
FROM departments
WHERE department_name = 'Sales'
)
SELECT *
FROM employees
WHERE department_id = (SELECT department_id FROM SalesDepartment);

In this example, the CTE named SalesDepartment retrieves the department ID for the 'Sales' department. We then reference this CTE in the main query to find all employees working in the 'Sales' department.

Key Differences:

  1. Usage: Subqueries are used within a SELECT, INSERT, UPDATE, or DELETE statement, whereas CTEs are defined using the WITH keyword at the beginning of a SELECT, INSERT, UPDATE, or DELETE statement.
  2. Clarity and Readability: CTEs provide a more readable and modular approach to complex queries, making them easier to understand and maintain, especially for larger queries with multiple subqueries.
  3. Scope: CTEs can be referenced multiple times within the same query, whereas subqueries can only be referenced once in the outer query.

--

--