Understanding Algorithmic Time Efficiency in SQL Queries

Luke SJ Howard · Published in Learning Data
13 min read · Dec 13, 2023

By Luke SJ Howard (https://www.linkedin.com/in/lukesjhoward/)

In the dynamic realm of database management and query optimisation, the efficiency of SQL queries plays a pivotal role in determining the overall performance of applications. At the heart of this efficiency lies the concept of algorithmic complexity, a measure that evaluates how an algorithm’s execution time or space requirements scale with input size. Understanding these complexities is crucial for database administrators, developers, and anyone keen on unlocking the true potential of their data.

Top Secret Tip: When an interviewer asks you how you would improve the performance of a query, form your answer around the contents of this article, and you will blow them away.

It is well known that most data professionals are self-taught in their field. I am self-taught and have relied on courses and certifications to get me where I am today. However, what if we were not self-taught? What separates those who studied computer science at university from those of us who are self-taught? What do FAANG internships teach that propels interns into those six-figure salaries? The answer is the ability to communicate technical concepts to another person without having to walk them through any dataset, codebase or algorithm.

In this article, I want to delve into the intricacies of algorithmic efficiency, with a particular focus on Big O notation, a concise language used within the software development field to express the upper bounds of an algorithm’s performance. By exploring the nuances of different time complexities, I aim to show why specific SQL queries are considered more optimal than others and how this understanding can impact the speed and scalability of database operations.

The Basics of Algorithmic Complexity

Big O notation provides a standard way to describe the upper bound of an algorithm’s growth rate, focusing on its worst-case scenario. Expressing efficiency in terms of Big O allows us to abstract away the fine details, focusing solely on the algorithm’s scalability concerning input size. Common notations include O(1), O(log n), O(n), O(n log n), and more, each representing different growth rates.

(Image credit: frontendly.io)

Time Complexity vs. Space Complexity

In the SQL universe, two fundamental aspects of algorithmic complexity are time complexity and space complexity. Time complexity gauges the relationship between the input size and the time it takes for an algorithm to complete, while space complexity focuses on the algorithm’s memory or storage requirements. Both considerations are equally important when crafting SQL queries for optimal performance; however, this article will focus primarily on time complexity.

Modern data warehouse platforms like Snowflake, Redshift and BigQuery are ushering in a new approach to data management and querying. Suddenly, data storage is cheap, while compute is noticeably more expensive. Optimising one's queries is therefore now an essential part of a data professional's role in enterprise-level organisations, reducing the cost of running a data estate.

The duration of time that a query takes to execute is directly proportional to the financial cost of the query.

As I delve into the intricacies of SQL query efficiency, keep in mind that the goal is to strike a balance between minimising execution time and conserving compute resources, ultimately delivering a responsive and scalable database system.

Constant Time Complexity: O(1)

Ok, let’s start with the ‘Holy Grail’ — the Grand Architect of the Universe’s gift to us humble SQL developers. In the context of an SQL statement, let’s consider a scenario where you have a database table and want to retrieve a specific record based on its primary key. The primary key uniquely identifies each record in the table.

(Image credit: frontendly.io)

If you use a SQL statement like:
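
A minimal sketch of such a lookup, assuming a hypothetical `employees` table whose primary key is `employee_id`:

```sql
-- Illustrative sketch: 'employees' and 'employee_id' are assumed names
SELECT *
FROM employees
WHERE employee_id = 12345;
```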

This query, designed to fetch records with a specific primary key value, exemplifies O(1) time complexity. Here’s why:

Direct Index Access: When a table has a primary key, the database typically creates an index on that column. This index allows for rapid, direct access to the record associated with the specified primary key. The database engine doesn’t need to scan the entire table; it can pinpoint the exact location of the desired record.

Constant Execution Time: Regardless of how many records populate the table, the execution time remains constant. Whether the table has 100 or 100 million records, the database engine can swiftly locate the requested record using the primary key index, resulting in a consistently quick response.

Therefore, this query type has O(1) constant time complexity. Regardless of how many records are in the table, the execution time remains constant because the database engine can directly locate the record using the indexed primary key.

The time the database takes to retrieve any record doesn’t increase proportionately to the table size.

In this case, having O(1) time complexity is optimal because the query’s efficiency is not dependent on the number of records in the table. It’s a constant time operation, making it highly efficient, especially when fast and predictable query performance is essential.

Logarithmic Time Complexity: O(log n)

The significance of indexes cannot be overstated. When a search operation uses a standard B-tree index on the filtered column, the time complexity typically takes the form of O(log n), where 'n' represents the size of the dataset. This logarithmic time complexity indicates a remarkable efficiency, especially in comparison to linear searches.

(Image credit: frontendly.io)

Consider the following SQL query as an example:
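
A minimal sketch, assuming a hypothetical `orders` table where `indexed_column` carries a standard B-tree index:

```sql
-- Illustrative sketch: 'orders' is an assumed table name; 'indexed_column' is indexed
SELECT *
FROM orders
WHERE indexed_column = 'some_value';
```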

The database engine efficiently locates the desired record by employing an index on the relevant column (in our case, `indexed_column`), achieving logarithmic time complexity. This approach is particularly advantageous for scenarios where rapid and precise data retrieval is essential.

Swift Data Retrieval: Leveraging indexes intelligently allows the database engine to navigate the data precisely. Rather than scanning the entire dataset sequentially (row by row until it reaches the end of the table), it can strategically ‘divide and conquer’ through the indexed structure. This intelligent search methodology enables swift retrieval of specific data, making O(log n) time complexity a highly favourable choice for particular queries.

Logarithmic Growth: The logarithmic time complexity suggests that the search time increases at a slower logarithmic rate as the dataset grows. In practical terms, as the volume of data expands, the additional time required for a search operation is significantly less than one would experience with a linear search (sneaky handover to ‘O(n)’ queries).

Linear Time Complexity: O(n)

The concept of linear time complexity, denoted as O(n), introduces us to a straightforward yet critical aspect of algorithmic efficiency. Unlike the logarithmic growth seen in O(log n) searches with indexes, O(n) signifies that execution time scales linearly with the size of the dataset. Let me show you a scenario.

(Image credit: frontendly.io)

Take a look at this simple query:
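
A minimal sketch, reusing the hypothetical `orders` table but filtering on a column that has no index:

```sql
-- Illustrative sketch: 'non_indexed_column' has no supporting index, forcing a full table scan
SELECT *
FROM orders
WHERE non_indexed_column = 'some_value';
```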

Without the aid of an index, the database engine resorts to a linear scan, sequentially examining each record in `non_indexed_column` to find the specified value. While suitable for smaller datasets, this approach becomes progressively less efficient as the dataset grows in size.

Linear Growth: With O(n) time complexity, the execution time of a query increases in direct proportion to the size of the dataset. Each additional data point contributes incrementally to the overall time required to complete the operation. This linear relationship is a fundamental characteristic of queries that involve full table scans or operations without the benefit of indexes.

Practicality for Smaller Datasets: Linear scans are often practical and efficient when dealing with smaller datasets. In these cases, the incremental increase in execution time is manageable, and the simplicity of the linear relationship ensures straightforward query processing.

Inefficiency with Large Datasets: However, as the volume of data grows, the efficiency of linear scans diminishes. The time required to examine each record in the dataset sequentially becomes a limiting factor, making these queries inefficient for larger datasets.

Understanding the nuances of O(n) time complexity is crucial for making informed decisions about a query plan, especially when dealing with datasets of varying scales and future-proofing the scalability of your applications’ back end.

Linearithmic Time Complexity: O(n log n)

As I take us further down the rabbit hole of time complexities in SQL queries, the concept of O(n log n) introduces us to an intriguing balance between logarithmic and linear efficiency. This time complexity frequently emerges in sorting and searching algorithms, showcasing an intermediate level of efficiency that combines elements of both logarithmic and linear scales.

(Image credit: frontendly.io)

Here is another SQL query I would like to share:
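
A minimal sketch, again assuming the hypothetical `orders` table, where sorting on an unindexed column forces the engine to sort the whole result set:

```sql
-- Illustrative sketch: 'orders' and its columns are assumed names
SELECT order_id, order_date, amount
FROM orders
ORDER BY amount DESC;  -- sorting an unindexed column typically costs O(n log n)
```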

Sorting operations often involve O(n log n) time complexity, where the dataset is divided, sorted, and merged in a manner that balances the efficiency of logarithmic searches and the practicality of linear processing.

Combining Logarithmic and Linear Scales: O(n log n) time complexity arises when an algorithm strategically combines logarithmic and linear operations. In the context of SQL queries, this often occurs in sorting and searching scenarios, where the efficiency of logarithmic searches is harmoniously merged with the linear aspects of data processing.

Sorting Algorithms: Sorting a dataset is a common scenario where O(n log n) complexity is prevalent. The sort algorithms behind ORDER BY (typically merge sort or a comparable comparison sort) exhibit this time complexity as they divide the dataset into smaller parts, sort them individually, and then merge them back together.

Moderate Level of Efficiency: O(n log n) strikes a balance between the efficiency of O(log n) and the practicality of O(n). While not as swift as O(log n) in certain operations, it remains more efficient than pure linear scans, especially for larger datasets. This intermediate level of efficiency makes O(n log n) a favourable choice for diverse sorting and searching tasks.

Polynomial Time Complexity: O(n²), O(n³), …

As I take you ever closer to the Area-51 equivalent of algorithmic complexities, the emergence of polynomial time complexities introduces a complex dimension to our understanding.

A polynomial is an expression composed of variables and constants, combined using addition, subtraction and multiplication, with variables raised to non-negative integer powers.

Polynomial notations, represented as O(n²), O(n³), and beyond, signify a substantial increase in execution time as the dataset grows.

(Image credit: frontendly.io)

Rapid Growth with Power Terms: The notation O(n²) implies quadratic complexity, O(n³) cubic, and so on. The exponent represents the degree of the polynomial and indicates the relationship between the input size ('n') and the growth rate of execution time. As the exponent increases, efficiency deteriorates sharply.

Nested Loops and Multiple Joins: When operations involve nested loops or multiple joins, polynomial time complexities often arise in SQL queries. These scenarios lead to a compounding effect on execution time, especially as the complexity of nested operations increases. Each join introduces an additional level of nesting, and the impact on complexity depends on how the joins are structured and the conditions under which they are applied.

JOINs are one of the most compute-hungry operations and will likely be your most frequent culprit for high compute time.

Diminished Efficiency, Especially for Large Datasets: Unlike logarithmic or linear growth, polynomial growth is less sustainable, particularly for large datasets. While certain operations may naturally result in polynomial complexities, the associated inefficiency becomes more pronounced as the dataset expands.

Let me give you another example:
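
A minimal sketch of a nested-loop pattern, assuming the hypothetical `orders` table has `customer_id` and `amount` columns and no supporting indexes; the correlated subquery is re-evaluated for every outer row:

```sql
-- Illustrative sketch: table and column names are assumed
SELECT o1.order_id
FROM orders o1
WHERE o1.amount > (
    SELECT AVG(o2.amount)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id  -- correlated: rescanned for each row of o1
);
```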

As the example shows, queries with nested loop operations can exhibit quadratic or higher polynomial complexities. The compounding effect of nested operations contributes to a substantial increase in execution time, especially as the dataset grows.

Warranting Careful Query Optimisation

The presence of polynomial time complexities underscores the importance of meticulous query optimisation. You must carefully evaluate and optimise queries so they scale efficiently with increasing dataset sizes. Techniques such as index optimisation, query refactoring and a well-thought-out database design become crucial in mitigating the impact of polynomial complexities on overall performance. The consequence of neglecting them is underperforming queries that burn more compute time and, ultimately, more money.
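
As a simple illustration of the indexing technique, and sticking with the hypothetical `orders` example above, a composite index on the correlated columns lets the engine answer the inner query from the index instead of rescanning the table:

```sql
-- Illustrative sketch: the index name and column choice are assumptions
CREATE INDEX idx_orders_customer_amount
    ON orders (customer_id, amount);
```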

Exponential Time Complexity: O(2^n)

As we pass through the dark and evil parts of the SQL universe, we encounter the nightmare realm of exponential time complexity, denoted as O(2^n). This horrifying form of query time complexity represents a significant departure from polynomial, logarithmic, or linear growth, showcasing a rapid and often impractical increase in execution time as the dataset size grows.

(Image credit: frontendly.io)

Doubling Impact: In O(2^n) complexity, the execution time doubles with each additional element in the dataset. This results in an exponential increase in the time required for query execution, making it particularly challenging for large datasets.

Common in Recursive Operations: Exponential complexity is often associated with recursive algorithms where the operation repeatedly branches into two or more subproblems. Recursive SQL queries or operations involving a recursive structure can lead to exponential time complexities.

Practical Challenges for Large Datasets: While exponential time complexities might be manageable for small datasets, they pose significant challenges for larger datasets. The rapid growth in execution time makes queries with O(2^n) complexity impractical in scenarios where efficiency and scalability are paramount.

Here, I have written a pretty simple SQL query as an example, showing exponential complexity:
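
A minimal sketch of that kind of query, written as a recursive CTE that counts from 1 to 10 (PostgreSQL-style syntax; the CTE name is my own):

```sql
-- Illustrative sketch of a recursive CTE counting from 1 to 10
WITH RECURSIVE counter AS (
    SELECT 1 AS n        -- anchor member
    UNION ALL
    SELECT n + 1         -- recursive member, applied until the condition fails
    FROM counter
    WHERE n < 10
)
SELECT n
FROM counter;
```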

Given that a lot of people might get confused by this query, I’ve run it inside my database to show you what the output would look like:
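
Assuming the sketch above, the result set is simply a single column of the integers 1 through 10:

```
 n
---
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
```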

As you can see, the recursive statement loops through and counts from 1 to 10.

The query uses a recursive common table expression (CTE) to generate a sequence of numbers. A simple counter like this grows only linearly, but a recursive structure that branches into two or more subproblems at each step is what leads to an exponential increase in the number of rows and in execution time.

Given the challenges associated with exponential time complexity, you should cautiously approach queries with O(2^n) complexity. In some cases, alternative algorithms or optimisations should be considered to reduce the impact on performance. Recursive operations should be carefully evaluated and avoided where possible, and the query structure should be optimised to minimise exponential growth if you really must write queries like this.

Factorial Time Complexity: O(n!)

If you have ever wondered what the deepest, darkest fear of a SQL Administrator or SQL Developer looks like, this is it. Factorial time complexity, denoted as O(n!), introduces us to a computational challenge characterised by an astronomical increase in execution time as the dataset size 'n' grows. The factorial function multiplies all positive integers up to a given value: for example, 5! is 5 × 4 × 3 × 2 × 1 = 120.

(Image credit: frontendly.io)

So, what does that even look like as a query? In fairness, the query does not look too scary:
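
A minimal sketch of the pattern, using placeholder table names:

```sql
-- Illustrative sketch: table1 .. table4 (and tableN) are placeholder names
SELECT *
FROM table1, table2, table3, table4;  -- an implicit CROSS JOIN of every table listed
```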

The query involves what is called a Cartesian product, essentially every possible pairing, or permutation, of rows across four tables. (I have written 'tableN' to denote that the list can continue for as many 'N' tables as you have.) The factorial pattern here multiplies the work on the order of 4 × 3 × 2 × 1, resulting in factorial time complexity.

Explosive Growth: The factorial function results in explosive growth as ’n’ increases. Each additional element in the dataset multiplies the execution time by the entire product of preceding positive integers, leading to an impractical and often unmanageable rise in complexity.

Common in Permutation Problems: Factorial time complexity is frequently encountered in algorithms dealing with permutations, where all possible arrangements of a set are considered. In SQL, operations involving Cartesian products or permutations of multiple tables may exhibit factorial time complexities.

Impracticality for Large Datasets: The sheer magnitude of growth makes queries with O(n!) complexity impractical for large datasets. The computations required become astronomical, rendering such queries unfeasible and expensive for real-world scenarios.

Mitigating Factorial Complexity

Mitigating the impact of factorial time complexity involves careful query design and optimisation. You should make every effort to explore alternative algorithms, indexing strategies and result-set limitations to rein in the explosive growth in execution time associated with factorial complexities.
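
As one illustration of those mitigations, and reusing the placeholder tables from the sketch above, explicit join conditions and a row limit stop the engine from materialising the full Cartesian product:

```sql
-- Illustrative sketch: join columns and the row cap are assumptions; LIMIT syntax varies by engine
SELECT t1.*, t2.*
FROM table1 t1
JOIN table2 t2
  ON t2.table1_id = t1.id  -- explicit join condition instead of a cross join
LIMIT 100;                 -- cap the result set while developing and testing
```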

My advice is never to perform this type of query in a production environment.

Let’s Summarise

An optimal Big O notation refers to the most favourable or efficient time or space complexity an algorithm can achieve in the worst-case scenario. In Big O notation, smaller complexities are generally considered more optimal because they indicate better performance.

For time complexity, the order of optimality is usually in the following hierarchy, from most optimal to least optimal:

1. O(1) — Constant time complexity

2. O(log n) — Logarithmic time complexity

3. O(n) — Linear time complexity

4. O(n log n) — Linearithmic time complexity

5. O(n²), O(n³), … — Polynomial time complexity

6. O(2^n) — Exponential time complexity

7. O(n!) — Factorial time complexity

An algorithm with O(1) time complexity is considered optimal for many scenarios because it indicates that the execution time does not depend on the input size.

For space complexity, similar optimality considerations apply, with O(1) being the most optimal, followed by O(log n), O(n), and so on.

It’s important to note that the definition of “optimal” can vary depending on the specific requirements and constraints of a problem. The optimal Big O notation for a given situation is often influenced by factors such as available resources, the nature of the problem, and the desired trade-offs between time and space efficiency.

The best query is never the one that performs the fastest but rather the one that scales the best as its input size increases.

The contents of external submissions are not necessarily reflective of the opinions or work of Maven Analytics or any of its team members.

We believe in fostering lifelong learning and our intent is to provide a platform for the data community to share their work and seek feedback from the Maven Analytics data fam.

Happy learning!

-Team Maven
