Data Engineering Good Practices to Database Queries
Optimization is essential for reducing the usage of system resources that are
required in order to fulfill a task while also allowing your application to handle more queries simultaneously. Thus, you can improve the overall system performance and help users get results faster. Poorly written queries can cause loss of service for the users, consume too many system resources or take too long to execute.
This article shows a collection of good practices to help you make better project decisions, some of their drawbacks and real use cases.
Indexes
Use relevant indexes
This technique is extremely useful when the table contains a specific column that is used as a reference (source of truth) for further queries. In that scenario, the column is a good candidate to have an index applied.
CREATE INDEX index_name ON table_name (column);
With this simple example is already possible to obtain significant gains in speed, but keep in mind that for this and other examples showed in this article there is a multitude of variations and fine-grained parameters that can be applied.
Drawbacks
It is important to mention that there is no silver bullet solution for optimization problems.
Creating more indexes in a table can speed up when reading data, but will slow down the writing of new rows to the table. Hence, it is important to understand what kind of operations will be performed.
Table Join Ordering
As many other topics in this article, deciding the order of join clauses is hard.
The general rule
It is best to join tables starting with the one which will produce the smallest amount of rows after the filtering is applied. More details for this technique can be found in the book SQL Tuning: Generating Optimal Execution
Plans.
Complexity
In simpler systems it is easier to keep track of the tables or leave to the RDBMS automatically optimize join clauses, but as the number of tables being joined grows, the quantity of potential query plans — a sequence of steps used to access data in a RDBMS — grows exponentially and in fact, finding the optimal order to join a set of tables is NP-hard.
There is an ongoing effort in database research field to find good heuristics when building a query plan, you can read more in-depth about what how to manipulate it in the Docs.
Remove Order By statements
When developing long and complex queries it is common to use ORDER BY
statements to visualize the return of intermediary query results. If that is the case, after the development phase is finished remember to remove all the ORDER BY
statements from sub-queries since it will only slow down your query execution time. In later sections we will discuss exceptional scenarios where this clause is useful.
String Standardization
String transformations are a common operation in databases specially when trying to compare/filter out some values in a query. Generally, this kind of operation is not costly, but it is not advisable to use them in large amount of values or in sub-queries that are called frequently.
This article won’t go deep into the importance of standards in software engineering, but a real example might be enough to convince you to adopt a standard when storing strings in your database: It was observed a speed up of around 6 times after removing the transformations regarding string conversion to lowercase.
Always adding UPPER()
or LOWER()
to convert text during a query will heavily impact performance. But how to apply this transformation after the project is deployed? How to change large amount of values that are already inserted in the database with minimal changes? The next topic shows an approach to handle this situation.
Anonymous Code Blocks
DEF: Anonymous blocks are PL/SQL blocks which do not have any names assigned to them. They need to be created and used in the same session therefore they will not be stored in the server as database objects. Since they do not need store in the database, they need no compilation steps. They are written and executed directly, and compilation and execution happen in the same process.
Sometimes there are unexpected outcomes that are not mapped during the planning phase of a project. In those situations anonymous code blocks can be helpful to apply data transformations without making big changes to the database. The following query is used to exemplify a real case:
DO $$
DECLARE ROW RECORD;
BEGIN
FOR ROW IN
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_SCHEMA = 'public'
AND TABLE_NAME = ANY('{"table0", "table1", "table2"}')
AND DATA_TYPE = 'text'
AND COLUMN_NAME != 'sku' -- columns to ignore
LOOP
EXECUTE FORMAT('UPDATE %I.%I SET %I = LOWER(%I)',
ROW.TABLE_SCHEMA,ROW.TABLE_NAME,ROW.COLUMN_NAME,ROW.COLUMN_NAME);
END LOOP;
END
$$;
The above query is used to set columns of text
type from table0
, table1
, and table2
that are not sku
to lowercase.
Part of the refactor on string standardization mentioned in the previous topic came from applying this anonymous code block given that the requirement of storing lowercase strings in the database was not described during the planning phase of the project.
With Clauses
When does one actually need to use a `WITH` clause? Well, there are a few unique use cases. Most of them are geared towards ease of query development and maintenance. The main use cases for the `WITH` clause are twofold:
1. You want to use the result from a sub-query multiple times
WITH CTE AS
(SELECT FOO FROM BAR)
SELECT
*
FROM T1
JOIN T2 ON T1.ID = T2.ID
WHERE T1.COL1 IN (SELECT FOO FROM CTE)
AND T2.COL2 NOT IN (SELECT FOO FROM CTE)
2. You want to build cascading sub-queries
WITH CTE AS
(SELECT FOO FROM BAR)
, CTE2 AS
(SELECT
T2.*,
CTE.FOO
FROM CTE
JOIN T2 ON T2.ID = CTE.ID)
, CTE3 AS
(SELECT
T3.*,
CTE2.*
FROM CTE2
JOIN T3 ON T3.COL2 = CTE2.SOMETHING)
, ...
The second approach is tempting and might be useful for implementing complex business logic in pure SQL. But it can lead to a procedural mindset and lose the power of sets and joins.
The first approach can be used to both improve code readability, since the way to read it gets closer to a sequential manner as in Literate programming as well as code maintainability, given that the queries and the database tends to scale up with time and breaking large computations into smaller parts makes it easier to troubleshoot.
It is important to keep in mind that the result of a WITH
clause is not stored anywhere in the database schema. It only exists while the query is being executed.
Materialized Views
When you need an intermediate computation to be persistent, i.e. stored in the database, it might be interesting to use a materialized view. The main thing that sets a materialized view apart is that it is a copy of a query result that does not run in real-time.
The drawbacks are that it will use a little more storage space, since it creates a new table, and you have to make sure that the view has the most recent data. You can refresh it manually, or set it to refresh on schedule or by triggers in order to reduce the risk of viewing obsolete data.
Another real example is that using a materialized view combined with good indexes led to a speed up of around 10 times.
Window Functions
Similar to an aggregate function — a function that transforms a set of rows into a single row — , a window function operates on a set of rows. However, it does not reduce the amount of rows returned by the query.
The term window define the set of rows on which the window function is executed. This is useful especially when you have to apply calculations over data with intrinsic ordering like a time-series.
It also allows you to control how the window will be applied over the partition of rows selected. The figure bellow shows the possible values for delimiting a window frame:
Here is a simplification of a real use case where the goal is to calculate the total sales for each product cumulatively, suppose you have a table PAST_SALES
like the following:
+-----+---------+-------+
| SKU | MONTH | SALES |
+-----+---------+-------+
| A | 2021-10 | 100 |
| A | 2021-11 | 200 |
| A | 2021-12 | 300 |
| B | 2021-10 | 150 |
| B | 2021-11 | 250 |
| B | 2021-12 | 350 |
+-----+---------+-------+
Applying a query like this would return us the cumulative values of SALES
in the CUMULATIVE_SALES
column. Note the way that the rows are delimiting the window frame.
SELECT
SKU
MONTH
SUM(SALES) OVER (
PARTITION BY
SKU
ORDER BY
SKU,
MONTH ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS CUMULATIVE_SALES
FROM PAST_SALES
Query Result
+-----+---------+------------------+
| SKU | MONTH | CUMULATIVE_SALES |
+-----+---------+------------------+
| A | 2021-10 | 100 |
| A | 2021-11 | 300 |
| A | 2021-12 | 600 |
| B | 2021-10 | 150 |
| B | 2021-11 | 400 |
| B | 2021-12 | 750 |
+-----+---------+------------------+
With this we are able to solve complex business logic with SQL since you can apply many aggregate functions to a window and handle the ordering of the rows as you wish.
Conclusion
In this article, we’ve discussed some basic query optimization techniques in PostgreSQL. The most important thing is to learn the principles on how to use them and to grasp the nuances of working with main database objects.
At Datarisk, we have a dedicated development team keen on custom data processing solutions and technologies. We are ready to help you create a robust solution. You can check our other posts here , and also visit our site to see how we solve real world problems.
The examples in this article were written in PostgreSQL.