(Simple) SQL

I did the mistake of jumping into backend coding using the Python Django Framework. This framework is amazing and handles a lot of stuff for you. And it contains one of the most versatile ORM (object related mappers — https://www.fullstackpython.com/django-orm.html) shipped with a framework. This helps to have an easy onboarding into the world of data management with a data layer built on top of a database. On the other hand i had no impetus to actually look at the SQL queries.

This had two implications:

  1. I couldn’t solve problems on the SQL statement level and
  2. didn’t understand how my django queries worked from a performance perspective and why they did so.

After going into a new job i had to use Spring, a commonly used web of the Java enterprise world. Interestingly enough they didn’t trust the underlying ORM enough to generate and manage the migration files through it. That meant that i finally had to learn how SQL works. In this Blog post i want to give you three different common structures about how you can manipulate and analyse data stored in a SQL based database. I will start with a common one table manipulation.

Common Version

humans
+----------+--------+---------+
| name | age | gender |
+----------+--------+---------+
| Julia | 4 | f |
| Karl | 12 | m |
| Ylvi | 3 | f |
| Ludwig | 19 | m |
+----------+--------+---------+

There are two commonly used ways of manipulating the data of one table to get insights. A standard select statement as well as a group by version where we can group rows based on a column.

Statement:

SELECT *
FROM humans
ORDER BY age asc
LIMIT 2

Output:

Ylvi 3 f
Julia 4 f

The above statement executed the following steps:

  1. We select every column (because of the asterisk ) from the table humans
  2. We then order it by the age of them in an ascending order
  3. We limit the output to two rows, so the top two rows will be shown

Grouping helps us to aggregate data or to find min/max values.

Statement:

SELECT gender, MIN(age), AVG(age)
FROM humans
GROUP BY gender

Output:

f 3 3.5
m 12 15.5

The above statement executed the following steps:

  1. From the humans table we group all rows based on the gender column
  2. We then select the groups gender, the minimum and average age of the groups members.

Joining

With the JOIN function you build bigger tables by combining two (or more) tables. It is based on a join condition (like a common identifier). You have several options to join tables together:

  1. INNER JOIN

The Inner Join is used to join the intersection of two tables. Only those rows which fulfill the given condition will be part of the new table.

  1. LEFT JOIN or RIGHT JOIN

The join statement follows directly the from statement from table1 (left/right) join table2 on… and if you imagine written like this, the LEFT or RIGHT points toward the table left or right from the JOIN statement. The effect of the statement is the following:

It will join all rows as the same as INNER JOIN does plus all other rows from the left or right table (as specified).

  1. FULL OUTER JOIN

The full outer join just joins all rows without exception.

lastName
+--------------+----------+
| lastName | userId |
+--------------+----------+
| Krueger | 1 |
| Schippel | 1 |
| Markle | 3 |
| Merkel | 5 |
+--------------+----------+
firstName
+--------------+----------+
| firsName | userId |
+--------------+----------+
| Julia | 1 |
| Karl | 2 |
| Ylvi | 3 |
| Ludwig | 4 |
+--------------+----------+

The following statements will show you how the statements join both tables together. It is important to understand that the order of execution of the statement is the following: Get the data FROM table1 and JOIN it with table2. Afterwards we select the columns we want to display.

Statement

SELECT firstName.firstName, lastName.lastName
FROM firstName
INNER JOIN lastName
ON firstName.userId = lastName.userId

Output

Julia Krueger
Julia Schippel
Ylvie Markle

The above statement joins only the rows of both tables which fulfill the given condition. It will only return valid firstname + lastname combinations.

Statement:

SELECT firstName.firstName, lastName.lastName
FROM firstName
RIGHT JOIN lastName
ON firstName.userId = lastName.userId

Output:

Julia Krueger
Julia Schippel
Ylvie Markle
Null Merkel

In the above statement, besides joining the rows based on the condition it also joins every other row from the (literally) right table. It fills the column given by the left table with NULL values.

Statement

SELECT firstName.firstName, lastName.lastName
FROM firstName
FULL OUTER JOIN lastName
ON firstName.userId = lastName.userId

Output:

Julia Krueger
Julia Schippel
Ylvie Markle
Karl NULL
Ludwig NULL
Null Merkel

The FULL OUTER JOIN joins every row (fulfilling the condition or not) of the two tables. It will fill empty columns with NULL again.

Windowing

The GROUP BY function will help you to aggregate data of a specific group but has some side effects. I.e. you can’t easily create a new table with all original rows and an extra column with the aggregated values (It is possible with a subquery structure). But Modern SQL languages provide a solution for this problem. With Windowing you can assign aggregated data for groups to the respective row. In the following examples we try to rank students of different schools based on their scores.

students
+----------+----------+-------+
| name | school | score |
+----------+----------+-------+
| Lydia | Oxford | 15 |
| Dave | Oxford | 30 |
| Tanja | Eton | 45 |
| Carol | Eton | 5 |
+----------+----------+-------+

Statement

SELECT name, school, score, 
RANK() OVER (ORDER BY score) as rank
FROM students
ORDER BY 4

Output

Tanja Eton 45 1
Dave Oxford 30 2
Lydia Oxford 15 3
Carol Eton 5 4

To use the windowing function you normally have to use a windowing function like RANK() which outputs the rank of the row based on a value (like the score in our example). The rank is created while it iterates over the ordered score. Afterwards we order it by the newly created rank column.

Statement

SELECT name, school, score, 
RANK() OVER (PARTITION BY school ORDER BY score) as rank
FROM students
ORDER BY school, 4

Output

Tanja Eton 45 1
Carol Eton 5 2
Dave Oxford 30 1
Lydia Oxford 15 2

The PARTITION BY basically works similar to the GROUP BY and just resets the rank for every group.

Statement

SELECT name, school, score, 
AVG(score) OVER (PARTITION BY school) as average_score
FROM students
ORDER BY school, name

Output

Carol Eton 45 25
Tanja Eton 5 25
Dave Oxford 30 22.5
Lydia Oxford 15 22.5

If the PARTITION BY is used alone we can compute the average of the scores of every school and add it to every row connected to the respective school.

Conclusion

The shown examples only give a small picture about the functionality of the language. It can become fairly complex when you create multi level statements to query and analyse data from different tables. If you want to have an in-depth look at how to use SQL for analytical purposes i can recommend you the following website: https://community.modeanalytics.com/sql/