Useful Database/SQL Tricks for Data Engineering

Dakai Zhou
Towards Dev
Published in
4 min readFeb 22, 2022

--

Writing SQL statements is part of the daily routine for data analysts, data engineers, and many data scientists. Here are some SQL tricks that might be helpful for you. All the tricks will be introduced based on PostgreSQL.

STRING_AGG()

STRING_AGG(expression, separator [order_by_clause]) is an aggregate function that concatenates a list of strings(string cells in one column) and places a separator between them, where [order_by_clause] is optional and it is for specifying the order of concatenated results.

For example, you have a table as displayed on the left, and you want to merge the second row and the third row.

For this transformation, we can use STRING_AGG() to achieve this. You will get a table shown upright.

Copy Data Between Remote DB Servers Without Creating Temporary File

Apart from a common task to copy data from one database to another within the same database server, you may across a situation where you want to copy data from table-a in remote server-A to table-b in remote server-B. Furthermore, you might also not want to create a temporary file in the system. If you are facing such a problem, you can find a solution here.

Instead of creating a file in a file system to store data, you can use in-memory text streams buffer io.StringIo() in Python.

At the very first step, the connection to the remote database server needed to be set. Also, two built-in functions copy_to() and copy_from() in package psycopg2 will be used. Functions copy_to() and copy_from() require file-like object to write data into and file-like object to read data from. This is the intuition we use io.StringIo() here, because it has read(), readline() and write() methods. Then, we can construct our functions:

Multiple Counts in One SQL Query

If you want to do multiple counts on one table with no indexes, say a count for total, a count for the owner as employ_1, a count for the owner as employ_2, and a count for the owner as employ_3. Your first thought might be a nested select…count… SQL query like the one blew:

This SQL query is correct and has no obvious drawbacks when my_table is small. However, when my_table becomes a large table, the SQL query above will be slow, because it will scan my_table from the first record to the last record multiple times, in this case, 4 times. This will significantly negatively influence your efficiency when the table gets larger and larger. We should reduce the number of table scans to 1 to let our SQL query more scalable.

In order to reduce table scans, we can use SUM() to replace COUNT(). In the SQL query below, the query scans the table only once.

ROLLUP()

In some cases, if you want to count total and subtotals, ROLLUP() is a better option, especially when you want to aggregate with date/time.

ROLLUP() is an extension of the GROUP BY, it allows you to include extra rows that represent the subtotals. For example, you have a table with columns year, month, day, and the number of products, the following query will give you the total number of products for each day, each month, each year, and the whole time period in the table:

CUBE()

If you want more grouping sets, you can use CUBE() instead. Just replace the ROLLUP() with CUBE() in the above query, you will get all combinations.

GROUPING SETS()

Not everyone needs all those grouping sets, you can use GROUPING SETS() to create custom ones. Just replace COBE()/ROLLUP() with GROUPING SETS((year, month), (month)), you will only get subtotal for each month of each year and subtotal for each month regardless of year.

LAG()

You might want to compare the number of sales this month with the previous month. An additional column for the number of sales in the previous month needs to be added. LAG() will do this for you. The following query will give you a table with columns years, month, num_of_sales, and pmonth_num_of_sales.

LAG(column, number of rows back, default value) OVER(partition, order). LAG() takes three parameters. The column to be returned, the number of rows back from the current row from which to obtain a value(default 1), the default value for value to be returned if row back goes beyond the scope of the partition(default null). It follows a OVER() function, where you define the partition and the orders. If no partition is provided, it will consider the whole data set as one partition.

DISTINCT Clause

What about you want to select multiple columns, but group only on one column? For example, given a table like below, and want to select the latest sick leave of each person(note: person 1 changed department in the time period). MAX() can be used to select the latest date, but a GROUP BY must be used afterward. Columns person, department, seniority have to be put after GROUP BY. In this way, three records will be returned, person 1 will have 2 records, one from when he was in department B, one from the time after he changed to department A. This result is not what we want.

What we want is to ignore the difference of department of the same person. For similar situations, we can use the DISTINCT clause to achieve this.

DISTINCT ON (expression) only keeps the first row of each set of rows where the given expression evaluates to equal(here is person). If ORDER BY is not used, the first row of each set would be unpredictable. In this example code, because of the DESC, only the row from each set that has the latest sick_leave date is kept.

--

--