Advanced SQL For Data Science: A Complete Roadmap To Be A Data Scientist
This blog is a continuation of the previous blog I published last week, Introduction to SQL for Data Science: A Complete Roadmap To Be A Data Scientist.
Last week we discussed the DDL( Data Definition Language ). We learned all the commands we use mostly on a daily basis. And now let’s get into DML( Data Manipulation Language ).
Why DML?
As you already know, DML is used to manipulate data to find insights from it. So here based on our requirements we insert data into the database, retrieve existing data, delete data from existing tables, change or manipulate the data to get the results we want to find.
There are multiple commands we need to learn here, I’ll be going through them one by one.
The Basic 4:
These 4 statements are like the pillar while performing Data Manipulation in SQL. The 4 commands are: SELECT, INSERT, UPDATE, DELETE.
SELECT:
It is used to select data from a database.
# select clause
#1
SELECT CustomerName, City FROM Customers;#2
SELECT * FROM Customers;
This is how we can use the SELECT command to get the columns from a table.
Things to notice here is that in the #1, we fetch the CustomerName and City column from the Customers table, so to get data from tables, we use the FROM clause( or statement or keyword ). Also let’s say there are many columns in a table, and we want to see them all; then rather than typing all the column names, we can use an “*” with a SELECT statement, and we’ll get all the columns from the table. That is what we did in #2.
INSERT INTO:
This statement is used to insert any data or records into an existing table.
# insert into clause
INSERT INTO Customers (CustomerName, ContactName, Address, City, PostalCode, Country)
VALUES ('Amal', 'Vinod', 'Behala Bus Stand', 'Kolkata', '700034', 'India');
So when there is an existing table and we want to insert more new data into it, we use INSERT INTO. And while inserting the values into each column, we use the VALUES clause.
One thing to remember here, is that while inserting those values, we should always keep those values in the same order as the columns are. We cannot give PhoneNo as an input in the CustomerName column. Because while creating the CustomerName column varchar data type is set, and thus a varchar datatype cannot take any integer as an input. Also even if the data gets inserted into the table with the wrong order, later on while performing analysis, it might lead us to a wrong result.
UPDATE:
This is used when there is an existing table and within that table we need to modify any column or rows.
# update clause
UPDATE Fish
SET ContactName = 'Nemo The Fish', City= 'Ocean City'
WHERE FishID = 1;
This is how we can update existing rows/columns of a table.
Two things to notice here, we used SET and WHERE clauses here.
SET clause is used to set new values to the columns.
WHERE is used to find where actually we need to update the values we give with the SET clause. Soon, we’ll talk about it.
DELETE:
The DELETE statement is used to delete existing records from a table.
# delete clause
#1
DELETE
FROM Customers
WHERE CustomerName='Amal';#2
DELETE FROM Customers;
This is how we can delete any record from a table. While if we want to delete all the rows from a table, we can simply go for the #2 example. In the #2 example, all the rows from the Customers table will be omitted, so we should be very careful before running these commands.
The Filters:
In case of data analysis, there will be a lot of time when we’ll need to filter the data. In SQL, we can do filtering by using these clauses: Where, Order by, Group by, Limit, Having.
WHERE:
As we saw earlier, we use WHERE to find any particular record from a table. So basically It is used to filter records from a table, be it searching or updating or deleting.
# where clause
SELECT *
FROM Fish
WHERE FishName='Nemo';
So here we’re trying to fetch everything from ‘Fish’ table, where the FishName is ‘Nemo’.
ORDER BY:
This is used to sort the results in ascending or descending order.
# order by clause
#1 Ascending
SELECT *
FROM Fish
ORDER BY Country;#2 Descending
SELECT *
FROM Fish
ORDER BY Country DESC;
By running this command we’ll be getting the complete Fish table, but it will be sorted by the Country column.
By default the order we get here is ascending; to get descending order, we need to add DESC in the end.
LIMIT:
This clause is used to specify the number of records we want to get in the output.
# limit clause
SELECT *
FROM Customers
WHERE Country='Germany'
LIMIT 1;
Basically we use the LIMIT clause when we want some specific output like top 1 customer in German, or maybe the lowest 3 results, or the second highest mark in the class.
To get more ideas about LIMIT, check here.
Group by:
This statement groups rows that have the same values into summary rows, like find the number of students in each class. This clause is most of the time used with Aggregate Function( which we’ll be discussing in a bit ).
# group by clause
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;
This command will group all the CustomerID based on different Countries. And this is the power of GROUP BY; in the life of a data analyst, this is one of the most used statements.
To learn more about GROUP BY, go here.
HAVING:
This clause works just the same as the WHERE clause. So why keep both?
It is just because while using aggregate functions, we cannot use the WHERE clause.
# having clause
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country
HAVING COUNT(CustomerID) > 5
ORDER BY COUNT(CustomerID) DESC;
This is how we can use the HAVING clause. One thing to remember, HAVING clause is always used after using GROUP BY clause, while WHERE clause is used before using GROUP BY.
In this example, we’re fetching the number of customers in each country, sorted high to low (Only include countries with more than 5 customers).
To learn more about HAVING clause, go here.
Aggregate Function:
In database management an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning.
There are different aggregate functions in SQL: count(), sum(), avg(), min(), max().
COUNT():
When we want to get the total number of rows of a specified column from a table, we use the COUNT function.
# count func
SELECT COUNT(ProductID)
FROM Products;
Like this, we can count the total number of rows in the ProductID column from the Products table.
AVG():
We use this function to get the average value of any numeric column from a table.
# avg func
SELECT AVG(Price)
FROM Products;
This command will tell us the average price in the Product table.
SUM():
This function returns the total sum of a numeric column.
# sum func
SELECT SUM(Quantity)
FROM OrderDetails;
This command tells us the total Quantity in the OrderDetails table.
MAX() and MIN():
These functions are used when we want to get the max or min of any column from a table.
# max and min func
#1 MAX
SELECT MAX(Price) AS MaxPrice
FROM Products;#2 MIN
SELECT MIN(Price) AS MinPrice
FROM Products;
Thus, we can get the max or min price from the Price column.
One thing to notice here, we used AS, a keyword. It is used to rename a column. Here the output will be an integer( as we’re getting the max or min value in the price column ), so to give the output a better format, we use AS. So that the output has a header and it becomes easier to understand.
To learn more about the AS keyword, here.
To learn more about Aggregate Function, go here. The aggregate function may seem confusing at first, so I’ll suggest you practice as much as possible .. then only the concept will be clear.
Joins:
This clause is used to combine two or more tables together, to perform different operations.
There are different types of joins: Inner, Full Outer, Left, Right.
Inner join:
This returns records only that have matching values in both tables.
# inner join
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;
This command joins the Orders and the Customers table. We can see the OrderID is from the Orders table, as we are selecting that column using Orders.OrderID; just like that the CustomerName is from the Customers table.
Now each join occurs on the basis of a particular column on both the tables, a column which exists in both the tables. Here it is CustomerID. Both the Orders and Customers table has the same column, and based on that we’re performing the inner join over here.
Also if there is the same column in both the tables, however there are no values within them which actually match in both the tables, then with the INNER JOIN we’ll not get any answer. For that kind of situation, it’s better opt for FULL OUTER JOIN.
FULL OUTER JOIN:
This join returns all records when there is a match in left (table1) or right (table2) table records.
# full outer join
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
FULL OUTER JOIN
Customers ON Orders.CustomerID = Customers.CustomerID;
This is how we can use FULL OUTER JOIN.
LEFT JOIN:
The LEFT JOIN returns all records from the left table (table1), and the matching records from the right table (table2). The result is 0 records from the right side, if there is no match.
# left join
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
LEFT JOIN
Customers ON Orders.CustomerID = Customers.CustomerID;
RIGHT JOIN:
This is the opposite of LEFT JOIN. The RIGHT JOIN returns all records from the right table (table1), and the matching records from the left table (table2). The result is 0 records from the left side, if there is no match.
# right join
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
RIGHT JOIN
Customers ON Orders.CustomerID = Customers.CustomerID;
Windows Function:
Window functions applies aggregate and ranking functions over a particular window( set of rows ).
OVER clause is used with window functions to define that window. OVER clause does two things:
- Partitions rows into a form set of rows. (PARTITION BY clause is used)
- Orders rows within those partitions into a particular order. (ORDER BY clause is used)
# windows function
SELECT duration_seconds,
SUM(duration_seconds) OVER (ORDER BY start_time) AS
running_total
FROM table_2012;
What this command does is it creates an aggregation on running_total without using GROUP BY.
We can see after OVER we used another command within brackets, this is called a subquery; which means a query inside a query. Here the whole command is a query, and within that we used this query — thus, subquery.
I’m very sure, this windows function is going to confuse a lot, so here’s some reference. I’m very sure once you check these out, you’ll be super confident about Windows Function.
First check this out, video.
Then, check this one out, video.
The Conditional:
There will be some time, when we’ll have some condition in front of us to get some insights. And at that time, we should go for the CASE WHEN statement of SQL.
The CASE expression goes through conditions and returns a value when the first condition is met (like an if-then-else statement).
Once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the else clause. If there is no else part and no conditions are true, it returns NULL.
# case when
SELECT OrderID, Quantity,
CASE
WHEN Quantity > 30 THEN 'The quantity is greater than 30'
WHEN Quantity = 30 THEN 'The quantity is 30'
ELSE 'The quantity is under 30'
END AS QuantityText
FROM OrderDetails;
This statement goes through the OrderDetails table and based on the conditions given, it returns the result.
So we start giving the conditions by using CASE, then give the conditions using WHEN..THEN, and finally end the conditions using END. In case we want our output to have a header, we use the AS keyword( like in this example ).
To learn more about CASE WHEN, check here.
STORED PROCEDURE:
A STORED PROCEDURE is a prepared SQL code that we can save, so the code can be reused over and over again. So if we have an SQL query that we write over and over again, we can save it as a stored procedure, and then just call it to execute it.
# stored procedure
#1 creating a stored procedure
CREATE PROCEDURE SelectAllCustomers @City nvarchar(30)
AS
SELECT * FROM Customers WHERE City = @City
GO;#2 executing the stored procedure
EXEC SelectAllCustomers @City = 'London';
The #1 command creates a stored procedure which selects Customers from a particular City from the “Customers” table.
The #2 command execute that first command, also it takes an input. Like here as we gave London as city, it’ll run the #1 command on city = London only, and get us the output.
To learn more about the STORED PROCEDURE, check here.