Data Scientist Interview Guide: SQL Concepts and Commands

The minimal set of commands you need to be familiar to clear a data science interview and be competent in a DS role

Published in

CodeX

21 min readMay 17, 2023

Photo by Leif Christoph Gottwald on Unsplash

The initial interviews for a data science role would involve a SQL-based round that tests your understanding of basic commands and your ability to extract, transform, and create datasets that can then be used for data science or be used to build models.

This post will cover some of the basic SQL concepts and commands a data scientist should know, and you can be expected to be tested on. The post includes

SQL and its relevance in data science
Understanding relational databases
SQL basics: SELECT, FROM, and WHERE clauses
Sorting data: ORDER BY
Limiting the number of rows: LIMIT
Aggregating data: COUNT, SUM, AVG, MIN, and MAX
Grouping data: GROUP BY
Filtering grouped data: HAVING
Combining data: JOIN
Set based operations: UNION, INTERSECT, and EXCEPT
Subqueries: Writing queries within queries
Conditional statements: CASE statements
String operations and functions
Date and time operations and functions
Window Functions: RANK
Data manipulation: INSERT, UPDATE, DELETE statements
Creating and Modifying Database Tables
Query Execution plan
Optimizing SQL Queries
Next Steps

So let's dive right into it!

SQL and its relevance in data science

SQL serves as a bridge between data and insights. It allows you to interact with databases and uncover meaningful patterns. As a data scientist, SQL provides you a rich toolkit to filter, aggregate, combine, and analyze data.

In addition, SQL is ubiquitous. It is widely adopted across industries and integrated into numerous database systems. Whether you’re working with MySQL, PostgreSQL, Oracle, SQL Server, or any other database system, the core principles of SQL remain consistent, allowing for seamless knowledge transfer and collaboration.

As such learning SQL empowers you with a highly transferable skillset that can be used to analyze data.

Understanding relational databases

Relational databases are built on the concepts of tables, rows and columns.

A table represents a specific entity. E.g. in a retail database a table might represent the customer, products, or orders.
A row represents a single record in the table. For instance in the customer table, each row would represent a single customer.
A column represents value of a field for a customer. For instance, in the customer table the columns such as Customer ID, Name, Email — represent different attributes of a customer.

Here is a visual representation of two tables in the retail space — Customers, and Orders.

Table: Customers
+-----------+-----------+-------------------+------------------+
| CustomerID | Name      | Email             | Address          |
+-----------+-----------+-------------------+------------------+
| 1          | John Doe  | johndoe@email.com | 123 Main St      |
| 2          | Jane Smith| janesmith@email.com| 456 Elm St       |
| 3          | Alex Wong | alexwong@email.com | 789 Oak Ave      |
+-----------+-----------+-------------------+------------------+

Table: Orders
+----------+-----------+-----------+----------+------------+
| OrderID  | CustomerID| ProductID | Quantity | TotalPrice |
+----------+-----------+-----------+----------+------------+
| 1001     | 1         | 101       | 2        | $50.00     |
| 1002     | 2         | 102       | 1        | $20.00     |
| 1003     | 1         | 103       | 3        | $75.00     |
+----------+-----------+-----------+----------+------------+

In this example, the “Customers” table has three rows, each representing a different customer, and the “Orders” table has three rows, each representing a specific order made by a customer.

In addition, the “CustomerID” column serves as a link between the two tables, establishing a relationship between customers and their orders. These are typically referred to as key fields and are used to ‘relate’ different tables. These keys also allow us to combine information. For instance, we can retrieve orders along with the corresponding customer information by joining the “Orders” table with the “Customers” table using the common “CustomerID” column.

By visualizing tables and identifying relationships you will be able to transform and derive better insights. There are more details of relational database concepts you can dive into such as schemas, index, view etc. However, a basic understanding of relations between table would suffice for most DS/ML interviews.

SQL basics: SELECT, FROM, and WHERE clauses

SELECT, FROM, and WHERE clauses form the core of querying data. These allow data scientists to retrieve specific data from a database based on certain criteria. Lets look at them one-by-one —

SELECT Clause
It allows us to choose which information we want to include in the result set. We can select multiple columns, or perform calculations/ transformations on the selected data.

FROM Clause
It tells SQL where to look for the specified columns or expressions mentioned in the SELECT clause

For instance, if we have a “Customers” table with columns like CustomerID, Name, Email, and Address, and we want to retrieve only the Name and Email columns, the query would look like this:

SELECT Name, Email
FROM Customers;

Using the same “Customers” table, if we want to retrieve all columns, the query would be:

SELECT *
FROM Customers;

WHERE Clause
It enables us to extract only the rows that meet the specified criteria. We can use various comparison operators such as “=”, “<”, “>”, “<=”, “>=”, “<>”, and logical operators such as AND, OR to define the conditions.

For instance, if we want to retrieve customers who have placed orders with a total price above $100, we can use the following query

SELECT *
FROM Customers
WHERE CustomerID IN (
    SELECT CustomerID
    FROM Orders
    WHERE TotalPrice > 100
);

Consider another example — a “Products” table with columns like ProductID, ProductName, and Price. We want to retrieve products with a price greater than $50 and quantity less than 10. The corresponding SQL query would be:

SELECT *
FROM Products
WHERE Price > 50 AND Quantity < 10;

By utilizing these SQL syntax basics, data scientists can extract meaningful insights from databases.

Sorting data: ORDER BY

The ORDER BY clause allows us to sort the retrieved data in ascending or descending order, providing a structured view of the information. The basic syntax of the ORDER BY clause is as follows:

SELECT column1, column2, ...
FROM table_name
ORDER BY column_name [ASC|DESC];

Here, the column_name refers to the column by which we want to sort the data. The optional ASC (ascending) or DESC (descending) keyword specifies the sorting order, with ASC being the default if not specified.

Let’s consider an example where we have a “Products” table with columns like ProductID, ProductName, Price, and Quantity. We want to retrieve the product names and prices in descending order of prices. The corresponding SQL query would be:

SELECT ProductName, Price
FROM Products
ORDER BY Price DESC;

Limiting the number of rows: LIMIT

LIMIT clause allows us to restrict the number of rows returned in the result set. It is particularly useful when dealing with large tables or when we only need a specific number of rows for analysis or display purposes.

The basic syntax of the LIMIT clause is as follows:

SELECT column1, column2, ...
FROM table_name
LIMIT number_of_rows;

Here, “number_of_rows” represents the maximum number of rows we want to retrieve from the table.

For example, let’s consider a “Customers” table with columns like CustomerID, Name, Email, and Address. Suppose we want to retrieve only the first five customers from the table. The corresponding SQL query using the LIMIT clause would be:

SELECT *
FROM Customers
LIMIT 5;

In addition to specifying a fixed number, we can also use the LIMIT clause to retrieve a range of rows. For instance, to retrieve rows 6 to 10 from the “Customers” table, we can use the OFFSET clause in conjunction with LIMIT:

SELECT *
FROM Customers
LIMIT 5 OFFSET 5;

In this query, the OFFSET clause skips the first five rows, and the LIMIT clause retrieves the next five rows. This allows us to paginate through the data and retrieve rows in smaller chunks.

Aggregating data: COUNT, SUM, AVG, MIN, MAX

SQL provides a range of powerful aggregate functions such as COUNT, SUM, AVG, MIN, and MAX to perform calculations on groups of rows.

COUNT Function
The COUNT function allows us to determine the number of rows that match a specific condition or the total number of rows in a table. For instance, let’s say we have an “Orders” table with columns like OrderID, CustomerID, and TotalPrice. We want to know the total number of orders in the table. The corresponding SQL query using the COUNT function would be:

SELECT COUNT(OrderID)
FROM Orders;

SUM Function
The SUM function allows us to calculate the sum of a numeric column’s values. Suppose we have a “Sales” table with columns like SaleID, ProductID, and Quantity. We want to calculate the total quantity of products sold. The SQL query using the SUM function would be:

SELECT SUM(Quantity)
FROM Sales;

AVG Function
The AVG function calculates the average value of a numeric column. It is often used to determine average prices, ratings, or other measures. Here’s an example. Consider a “Products” table with columns like ProductID, Price, and Rating. We want to find the average price of the products. The corresponding SQL query using the AVG function would be:

SELECT AVG(Price)
FROM Products;

MIN and MAX Functions
The MIN and MAX functions allow us to retrieve the minimum and maximum values of a column, respectively. They are commonly used to identify the lowest or highest values in a dataset. Suppose we have a “Employees” table with columns like EmployeeID, Name, and Salary. We want to find the minimum and maximum salaries among the employees. The SQL queries using the MIN and MAX functions would be:

SELECT MIN(Salary),MAX(Salary)
FROM Employees;

Combining data: JOIN

JOINs allow data scientists to bring together related information from different tables based on common columns or keys. There are different types of joins you can perform based on how you want the data to be combined.

Inner Join
The INNER JOIN is the most commonly used type of join. It returns only the rows that have matching values in both tables being joined. Suppose we have two tables: “Customers” with columns CustomerID, Name, and “Orders” with columns OrderID, CustomerID, TotalPrice. We want to retrieve the customer details along with their order information. The corresponding SQL query using an INNER JOIN would be:

SELECT Customers.CustomerID, Customers.Name, Orders.OrderID, Orders.TotalPrice
FROM Customers
INNER JOIN Orders
   ON Customers.CustomerID = Orders.CustomerID;

Left Join
The LEFT JOIN returns all the rows from the left (or first) table and the matching rows from the right (or second) table. If there are no matching rows in the right table, NULL values are included. Let’s consider the same “Customers” and “Orders” tables. We want to retrieve all customers and their associated orders, including those who haven’t placed any orders yet. The SQL query using a LEFT JOIN would be:

SELECT Customers.CustomerID, Customers.Name, Orders.OrderID, Orders.TotalPrice
FROM Customers
LEFT JOIN Orders
   ON Customers.CustomerID = Orders.CustomerID;

Right Join
The RIGHT JOIN is similar to the LEFT JOIN but returns all the rows from the right (or second) table and the matching rows from the left (or first) table. If there are no matching rows in the left table, NULL values are included. Continuing with the “Customers” and “Orders” tables, suppose we want to retrieve all orders and their associated customers, including orders without corresponding customer information. The SQL query using a RIGHT JOIN would be:

SELECT Customers.CustomerID, 
Customers.Name, 
Orders.OrderID, 
Orders.TotalPrice
FROM Customers
RIGHT JOIN Orders
   ON Customers.CustomerID = Orders.CustomerID;

Full Outer Join
The FULL OUTER JOIN returns all rows from both tables, regardless of whether there are matches or not. If there are no matches, NULL values are included. Let’s say we have two tables: “Customers” with columns CustomerID, Name and “Orders” with columns OrderID, CustomerID. We want to retrieve all customers and their associated orders, including those without any matching customers. The corresponding SQL query using a full outer join would be:

SELECT Customers.CustomerID, Customers.Name, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders
   ON Customers.CustomerID = Orders.CustomerID;

Handling Null Values in Join

When performing joins in SQL, it’s important to consider the presence of NULL values in the columns used for matching. NULL represents the absence of a value, and it can impact the results of a join operation. Let’s explore how to handle NULL values in joins with an example.

Consider two tables, “Customers” and “Orders,” with the following columns:

Customers Table

CustomerID
Name

Orders Table

OrderID
CustomerID
TotalPrice

Let’s say we want to retrieve all customers and their associated orders. However, there might be cases where a customer has no corresponding order, resulting in a NULL value in the OrderID and TotalPrice columns. We can handle these NULL values using the COALESCE function or by explicitly filtering them out.

COALESCE function
The COALESCE function allows us to replace NULL values with a specified alternative value. In the context of joins, we can use COALESCE to substitute NULL values in the columns being matched. In the following example, we use the COALESCE function to replace NULL values in the OrderID column with ‘N/A’ and NULL values in the TotalPrice column with 0. This allows us to provide meaningful information even when there is no matching order for a customer.

SELECT Customers.CustomerID, 
Customers.Name, 
COALESCE(Orders.OrderID, 'N/A') AS OrderID, 
COALESCE(Orders.TotalPrice, 0) AS TotalPrice
FROM Customers
LEFT JOIN Orders
   ON Customers.CustomerID = Orders.CustomerID;

Filtering out NULL values
Another approach is to explicitly filter out NULL values using the WHERE clause. This ensures that only the rows with non-NULL values in the matching columns are included in the result. In the following query, we include the condition “Orders.OrderID IS NOT NULL” in the WHERE clause. This ensures that only the rows with non-NULL OrderID values are returned, effectively filtering out the customers without any orders.

SELECT Customers.CustomerID, Customers.Name, Orders.OrderID, Orders.TotalPrice
FROM Customers
LEFT JOIN Orders
   ON Customers.CustomerID = Orders.CustomerID
WHERE Orders.OrderID IS NOT NULL;

Set based operations: UNION, INTERSECT, EXCEPT

In SQL, set-based operations provide powerful ways to combine, compare, and manipulate data from multiple tables or queries. Lets look at some of them —

UNION
The UNION operation combines the result sets of two or more SELECT statements into a single result set. It merges the rows from each SELECT statement while removing duplicates. The columns in the SELECT statements must have compatible data types. Here’s an example:

SELECT column1, column2 FROM table1
UNION
SELECT column1, column2 FROM table2;

In this example, the UNION operation combines the rows from table1 and table2 and retrieves the unique rows based on the selected columns. The result set will contain all distinct rows from both tables.

INTERSECT
The INTERSECT operation returns the common rows that exist in the result sets of two or more SELECT statements. It retrieves only the rows that appear in all SELECT statements, eliminating any duplicates. The columns and their data types must be the same in all SELECT statements. Here’s an example:

SELECT column1, column2 FROM table1
INTERSECT
SELECT column1, column2 FROM table2;

In this example, the INTERSECT operation compares the rows from table1 and table2 based on the selected columns and returns only the rows that exist in both tables.

EXCEPT
The EXCEPT operation (also known as MINUS in some databases) returns the rows that exist in the result set of the first SELECT statement but not in the result set of the subsequent SELECT statements. It retrieves the unique rows from the first SELECT statement while eliminating any duplicates. The columns and their data types must be the same in all SELECT statements. Here’s an example:

SELECT column1, column2 FROM table1
EXCEPT
SELECT column1, column2 FROM table2;

In this example, the EXCEPT operation compares the rows from table1 and table2 based on the selected columns and returns the rows that exist in table1 but not in table2.

Subqueries: Writing queries within queries

In SQL, subqueries allow us to write queries within queries, enabling us to perform more complex and advanced data retrieval and analysis tasks.

Subqueries in the SELECT Clause
These allow us to retrieve data based on the results of an inner query. The inner query, also known as the subquery, is executed first, and its results are then used in the outer query. Here’s an example:

SELECT CustomerID, Name, (SELECT COUNT(*) FROM Orders WHERE Customers.CustomerID = Orders.CustomerID) AS OrderCount
FROM Customers;

In this query, we have a subquery within the SELECT clause. The subquery (SELECT COUNT(*) FROM Orders WHERE Customers.CustomerID = Orders.CustomerID) calculates the count of orders for each customer. The result of the subquery is then aliased as OrderCount and included as a column in the outer query, along with the CustomerID and Name columns from the Customers table.

Subqueries in the FROM Clause
Also known as derived tables or inline views, allow us to treat the subquery as a temporary table within the main query. We can perform additional operations on the subquery’s results, such as joining with other tables. Here’s an example:

SELECT Customers.CustomerID, Customers.Name, OrderSummary.TotalPrice
FROM Customers
JOIN (SELECT CustomerID, SUM(TotalPrice) AS TotalPrice FROM Orders GROUP BY CustomerID) AS OrderSummary
   ON Customers.CustomerID = OrderSummary.CustomerID;

In this query, the subquery (SELECT CustomerID, SUM(TotalPrice) AS TotalPrice FROM Orders GROUP BY CustomerID) calculates the total price of orders for each customer. The subquery is treated as a temporary table called OrderSummary in the main query, which is then joined with the Customers table based on the CustomerID column.

Subqueries in the WHERE Clause
These allow us to filter data based on the results of a subquery. The subquery is evaluated for each row in the outer query, and the rows that meet the subquery’s condition are included in the result. In the follwing query, the subquery (SELECT CustomerID FROM Orders) retrieves the CustomerID values from the Orders table. The main query then selects the CustomerID and Name columns from the Customers table but only includes the rows where the CustomerID is found in the subquery's result set.

SELECT CustomerID, Name
FROM Customers
WHERE CustomerID IN (SELECT CustomerID FROM Orders);

Conditional statements: CASE statements

In SQL, conditional statements are used to perform different actions based on specified conditions. One common type of conditional statement is the CASE statement, which allows us to evaluate conditions and return different values or execute different logic accordingly.

In the following example, the CASE statement evaluates the TotalPrice column.

SELECT OrderID, TotalPrice,
  CASE
    WHEN TotalPrice >= 1000 THEN 'High Value'
    WHEN TotalPrice >= 500 THEN 'Medium Value'
    ELSE 'Low Value'
  END AS ValueCategory
FROM Orders;

If the TotalPrice is greater than or equal to 1000, it assigns the label ‘High Value’. If it’s greater than or equal to 500, it assigns the label ‘Medium Value’. For any other values, it assigns the label ‘Low Value’. The result is displayed as the ValueCategory column in the output.

String operations and functions

In SQL, string operations and functions are essential for manipulating and extracting information from text data. The most common ones are

Concatenation
Concatenation is the process of combining strings. In SQL, the concatenation operator is typically the plus symbol (+) or the double pipe symbol (||). Here’s an example:

SELECT FirstName || ' ' || LastName AS FullName
FROM Employees;

In this query, the columns FirstName and LastName are concatenated with a space in between to form the FullName.

Length
The LENGTH function returns the number of characters in a string. It can be used to determine the length of a string column or a literal value. Here's an example:

SELECT ProductName, LENGTH(ProductName) AS NameLength
FROM Products;

This query retrieves the ProductName column and calculates the length of each product name using the LENGTH function, which is aliased as NameLength.

Substring Extraction
The SUBSTRING or SUBSTR function allows you to extract a portion of a string based on specified starting position and length. Here's an example:

SELECT SUBSTRING(ProductName, 1, 3) AS Abbreviation
FROM Products;

In this query, the SUBSTRING function extracts the first three characters from the ProductName column, resulting in an abbreviation.

Case Conversion
SQL provides functions to convert the case of strings. UPPER converts all characters to uppercase, while LOWER converts all characters to lowercase. Here's an example:

SELECT ProductName, UPPER(ProductName) AS UppercaseName, LOWER(ProductName) AS LowercaseName
FROM Products;

This query retrieves the ProductName column and creates two additional columns: UppercaseName and LowercaseName, which contain the product name in uppercase and lowercase, respectively.

Pattern Matching
SQL offers powerful pattern matching capabilities using the LIKE operator and wildcard characters such as % (matches any sequence of characters) and _ (matches any single character). Here's an example:

SELECT ProductName
FROM Products
WHERE ProductName LIKE '%apple%';

This query selects the ProductName where the string "apple" appears anywhere within it. The % wildcard matches any characters before or after "apple". Some addition wildcard you show know about are given below —

Date and time operations and functions

In data science, working with dates and times is a common task, and SQL provides a variety of operations and functions to handle date and time data effectively. Let’s explore some essential date and time operations and functions in SQL.

Date Functions
SQL offers several built-in functions to work with dates. Here are some commonly used functions:

CURRENT_DATE: Returns the current date.
CURRENT_TIME: Returns the current time.
CURRENT_TIMESTAMP: Returns the current date and time.

Example usage:

SELECT CURRENT_DATE AS CurrentDate, CURRENT_TIME AS CurrentTime, CURRENT_TIMESTAMP AS CurrentDateTime
FROM dual;

This query retrieves the current date, time, and timestamp using the respective functions.

Date Formatting
SQL provides functions to format dates according to specific patterns. The most commonly used function is TO_CHAR, which converts a date or timestamp to a formatted string. Here's an example:

SELECT OrderDate, TO_CHAR(OrderDate, 'YYYY-MM-DD') AS FormattedDate
FROM Orders;

In this query, the TO_CHAR function is used to format the OrderDate column as 'YYYY-MM-DD', resulting in a formatted date string.

Date Arithmetic
SQL allows us to perform arithmetic operations on dates. We can add or subtract intervals from dates to calculate new dates. Here’s an example:

SELECT OrderDate, OrderDate + INTERVAL '7' DAY AS NewDate
FROM Orders;

In this query, the INTERVAL keyword is used to add seven days to the OrderDate column, resulting in a new date.

Date Comparison
SQL enables us to compare dates using comparison operators such as =, >, <, >=, <=, and <> (not equal to). Here's an example:

SELECT OrderID, OrderDate
FROM Orders
WHERE OrderDate > '2022-01-01';

This query retrieves the OrderID and OrderDate from the Orders table, selecting only the rows where the OrderDate is greater than '2022-01-01'.

Date Extraction
SQL provides functions to extract specific components from a date or timestamp, such as the year, month, day, hour, minute, or second. Here’s an example:

SELECT OrderDate, EXTRACT(YEAR FROM OrderDate) AS OrderYear, EXTRACT(MONTH FROM OrderDate) AS OrderMonth
FROM Orders;

In this query, the EXTRACT function is used to extract the year and month from the OrderDate column, resulting in separate columns for the year and month values.

Window Functions

Window functions in SQL provide a powerful way to perform calculations and analysis on a specific set of rows within a defined window. They allow data scientists to perform calculations that involve aggregations, rankings, and other operations within a partitioned subset of data. Let’s explore window functions and how to use them with a detailed example.

Syntax of Window Functions
The general syntax of a window function is as follows:

<window_function> OVER (
  PARTITION BY <partition_column(s)> 
  ORDER BY <order_column(s)> 
  <window_frame_clause>
)

<window_function>: The specific window function you want to use, such as SUM, AVG, ROW_NUMBER, RANK, etc.
PARTITION BY: Specifies how to partition the data into subsets based on one or more columns. Each subset will have its own set of calculations.
ORDER BY: Specifies the order in which the rows should be considered within each partition.
<window_frame_clause>: Defines the window frame or range of rows within the partition to be included in the calculation. It can be ROWS BETWEEN <start> AND <end>, RANGE BETWEEN <start> AND <end>, or UNBOUNDED PRECEDING/FOLLOWING.

Calculating Running Total
Let’s consider a scenario where we have a table named Sales with columns Date, Product, and Revenue. We want to calculate the running total revenue for each product over time. Here's an example:

SELECT Date, Product, Revenue,
       SUM(Revenue) OVER (PARTITION BY Product ORDER BY Date) AS RunningTotal
FROM Sales;

In this query, the SUM window function is used to calculate the running total of revenue for each product. The PARTITION BY Product clause ensures that the calculation is performed separately for each product. The ORDER BY Date clause defines the order in which the rows should be considered within each partition. The result is displayed as the RunningTotal column.

Finding Rank within Categories
Suppose we have a table named Scores with columns Student, Subject, and Score. We want to find the rank of each student within their respective subject. Here's an example:

SELECT Student, Subject, Score,
       RANK() OVER (PARTITION BY Subject ORDER BY Score DESC) AS SubjectRank
FROM Scores;

In this query, the RANK window function is used to determine the rank of each student within their subject. The PARTITION BY Subject clause ensures that the ranking is done separately for each subject. The ORDER BY Score DESC clause defines the order of ranking based on the descending order of scores. The result is displayed as the SubjectRank column.

Data manipulation: INSERT, UPDATE, DELETE statements

In addition to querying data, SQL provides powerful statements for manipulating data within a database. The three primary data manipulation statements are INSERT, UPDATE, and DELETE. Let’s explore each of these statements and their usage in SQL.

INSERT Statement
The INSERT statement is used to add new records into a table. It allows you to specify the values to be inserted into specific columns. Here’s an example:

INSERT INTO Customers (FirstName, LastName, Email)
VALUES ('John', 'Doe', 'johndoe@example.com');

In this example, the INSERT statement adds a new record into the Customers table. It specifies the values for the FirstName, LastName, and Email columns.

UPDATE Statement
The UPDATE statement is used to modify existing records in a table. It allows you to update one or more columns based on specified conditions. Here’s an example:

UPDATE Customers
SET Email = 'newemail@example.com'
WHERE CustomerID = 123;

In this example, the UPDATE statement modifies the Email column for the record with CustomerID equal to 123.

DELETE Statement
The DELETE statement is used to remove records from a table based on specified conditions. It allows you to selectively delete one or more records. Here’s an example:

DELETE FROM Customers
WHERE CustomerID = 123;

In this example, the DELETE statement removes the record from the Customers table with CustomerID equal to 123.

Query Execution Order

The order of execution of commands in a SQL query typically follows a specific sequence. Here’s a general overview of the order of execution:

FROM clause:

The FROM clause specifies the tables or views from which the data will be retrieved.
The database engine identifies the source tables and establishes the initial data set for further processing.

2. JOIN/ON clause

Join data sources based on condition specified in ON clause

3. WHERE clause

The WHERE clause filters the data based on specified conditions.
Rows that satisfy the conditions in the WHERE clause are retained, and others are discarded.

4. GROUP BY clause:

The GROUP BY clause is used to group rows based on specified columns.
The data is divided into groups based on the grouping columns.

5. HAVING clause:

The HAVING clause filters the grouped data based on specified conditions.
It operates similarly to the WHERE clause but works on grouped data.

6. SELECT clause:

The SELECT clause specifies the columns to be retrieved from the resulting data set.
Expressions, functions, aliases, and calculations can be performed on the selected columns.

7. ORDER BY clause:

The ORDER BY clause sorts the result set based on specified columns.
The data is arranged in ascending or descending order according to the specified criteria.

8. LIMIT/OFFSET clause:

The LIMIT clause (sometimes accompanied by the OFFSET clause) restricts the number of rows returned from the result set.
It allows you to retrieve only a specified number of rows or skip a certain number of rows.

Creating and Modifying Database Tables

In SQL, creating and modifying database tables is a fundamental aspect of database management. Tables provide the structure to organize and store data effectively. Let’s explore how to create and modify database tables using SQL.

Creating a Table
To create a new table, you need to define its structure by specifying the table name and the columns it should contain, along with their data types and any constraints. Here’s an example:

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Department VARCHAR(100),
    Salary DECIMAL(10, 2)
);

In this example, the CREATE TABLE statement creates a table named Employees. It defines columns such as EmployeeID, FirstName, LastName, Department, and Salary with their respective data types. The PRIMARY KEY constraint is applied to the EmployeeID column to ensure its uniqueness.

Modifying a Table
You can modify an existing table by altering its structure. Common modifications include adding columns, modifying column definitions, or applying constraints. Here’s an example:

ALTER TABLE Employees
ADD COLUMN Email VARCHAR(100),
MODIFY COLUMN Salary DECIMAL(12, 2),
ADD CONSTRAINT CK_Salary CHECK (Salary >= 0);

In this example, the ALTER TABLE statement is used to modify the Employees table. It adds a new column Email of data type VARCHAR(100). It modifies the Salary column to have a precision of 12 and a scale of 2. Finally, a CHECK constraint is added to ensure that the Salary value is greater than or equal to 0.

Dropping a Table
To remove an entire table and its data from the database, you can use the DROP TABLE statement. Here’s an example:

DROP TABLE Employees;

This DROP TABLE statement deletes the Employees table and permanently removes all the data it contains. Exercise caution when using this statement, as data loss is irreversible.

By creating and modifying database tables, data scientists can design and adapt the structure of the database to meet their specific needs. It allows for efficient organization, storage, and retrieval of data.

Optimizing SQL Queries

Some common tips to keep in your mind if an inteviewer asks to optimizer your query is —

Minimize the Use of Wildcards in SELECT Statements

Avoid using wildcard characters (e.g., *) in SELECT statements unless absolutely necessary.
Explicitly specify the columns you need, as fetching unnecessary columns can increase query execution time and resource usage.

Avoid SELECTing Unnecessary Data

Select only the columns you need rather than retrieving all columns from a table.
Retrieving excess data consumes additional resources and can impact query performance, especially when dealing with large datasets.

Optimize JOIN Operations

Use appropriate JOIN types (e.g., INNER JOIN, LEFT JOIN) based on the relationship between tables.
Consider using appropriate JOIN hints or techniques (e.g., EXISTS, IN) when applicable.

Use WHERE Clause Effectively

Place the most selective conditions in the WHERE clause to filter out irrelevant rows early in the query execution.
Use appropriate comparison operators and functions to optimize the search conditions.

Minimize Subqueries

Evaluate if subqueries can be rewritten or replaced with JOIN operations or other optimized techniques.
Subqueries can sometimes be resource-intensive and impact query performance if used excessively.

Use Set-based Operations

Leverage the power of set-based operations like UNION, INTERSECT, and EXCEPT to combine or compare data from multiple tables efficiently.

Additionally you can also mention you would —

Analyze Query Performance — Use database profiling tools to analyze query performance and identify areas for optimization.
Use Proper Indexing — Identify the columns used frequently in search conditions and JOIN clauses and consider adding indexes on those columns.

Next Steps

SQL is a fundamental skill that empowers data scientists to extract, manipulate, and analyze data efficiently. In this blog post, we explored various essential concepts of SQL for data scientists, covering everything from basic syntax (SELECT, FROM, WHERE, etc.) to advanced techniques (subqueries, data manipulation etc).

What’s next? Get some hands on experience by solving SQL questions. Head to one of the following sites to get some practice.

Leetcode: https://leetcode.com/problemset/database/
Data Lemur: https://datalemur.com/sql-interview-questions

Credits
This post was written with help from ChatGPT. Some of the prompts used are
Write a blog post introducing SQL and its relevance in data science.
Ok, next section — Understanding relational databases and their structure. Make sure to use exmpale or diagrams if required
Ok, next section. -SQL syntax basics: SELECT, FROM, WHERE clauses
Explain set based operations: UNION, INTERSECT, EXCEPT