SQL: From Fear To Love. A Data Scientist’s Journey

Vachan Anand
Published in DataJoey · 7 min read · Nov 15, 2022

As a fresh graduate planning a career in data science, I thought proficiency in Python was the core programming skill I needed to launch a successful career. I avoided SQL wherever I could and always preferred manipulating data in Python.

However, my way of working changed as soon as I gained some industry experience. I realised why SQL is still one of the most widely used languages and, more importantly, where it fits into a data scientist's workflow.

Photo by Ashkan Forouzani on Unsplash

In this blog, we will go through the importance of SQL in data science and look at a strategy for using it effectively.

SQL for Data Science

One of the key lessons of my career has been to step back from any particular technology (or programming language) and focus on solving problems effectively. As a data scientist, my work usually involves exploring datasets and building machine learning models. Although Python works well for developing machine learning models, SQL shines in exploration and feature engineering, a crucial step before building any model, for the following reasons:

  • Firstly, manipulating large datasets, terabytes worth of data, in Jupyter notebooks using Python can be very slow.
  • We could speed up data manipulation in Python with technologies like Spark on cloud platforms. However, these technologies come at a price. They are great if the organisation has the appetite and, more importantly, the willingness to adopt them; otherwise, they require setting up and managing infrastructure to support the project, which incurs additional costs.
  • In contrast, most organisations already have a database or warehouse holding their datasets. Traditionally, the compute capacity of such systems scales with the size of the data, i.e., on-premise databases/warehouses holding large datasets tend to have the computational power needed to process them.
  • On most cloud platforms today, databases/warehouses separate storage from compute, so data manipulation can be done on these systems with minimal additional cost.

Now that we understand where SQL fits into a data science project, let us look at a strategy that helps in writing the complex queries needed to create essential features.

Divide and Conquer

As a student, my biggest misconception was that SQL was only a bunch of SELECT and JOIN statements. I always leaned on programming languages like Python, C++ or even Java to manipulate data as desired. However, working on real-world problems, I realised the importance of SQL in solving complex issues, and one strategy that came in handy was divide and conquer.

Photo by Anthony Intraversato on Unsplash

There are a couple of ways we can employ this strategy to our advantage. In this blog, we are going to look at two such practices:

  • Common Table Expressions (CTE)
  • Temporary Tables

These objects are beneficial in a couple of ways, as described below:

  1. Rather than solving the entire problem at once, which can be difficult, we break a complex query into small chunks that are easy to solve. CTEs and temporary tables make this easy to achieve in SQL.
  2. As a data scientist, the role is not just about performing experiments and testing hypotheses but also about productionising the machine learning model. That means we should be able to develop and maintain the code with ease. Maintaining scripts is easier when the code is reader-friendly. Because CTEs and temporary tables break a complex query into small, more readable queries, the scripts become easier to maintain and, more importantly, to collaborate on and hand over to other data scientists.

Now that we understand the strategy’s importance, let us look at each object in greater detail.

Common Table Expressions (CTE)

A common table expression is one of the handiest SQL objects I like to use. A CTE is a temporary, named result set that we can create in SQL for ease of use. To see how it helps, consider one of the HackerRank challenges:

  • Write a query to print the hacker_id, name, and the total number of challenges created by each student. Sort your results by the total number of challenges in descending order.
  • If more than one student created the same number of challenges, then sort the result by hacker_id.
  • If more than one student created the same number of challenges and the count is less than the maximum number of challenges created, then exclude those students from the result.
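
Below is a sketch of the CTE-based query. It assumes the standard schema for this challenge, a Hackers table (hacker_id, name) and a Challenges table (challenge_id, hacker_id), so treat it as an illustration rather than the only possible solution.

```sql
WITH STUDENT_CHALLENGE AS (
    -- Number of challenges created by each hacker
    SELECT h.hacker_id,
           h.name,
           COUNT(c.challenge_id) AS challenge_count
    FROM Hackers h
    JOIN Challenges c
        ON h.hacker_id = c.hacker_id
    GROUP BY h.hacker_id, h.name
),
CHALLENGE_COUNT_ONE AS (
    -- How many hackers share each challenge count
    SELECT challenge_count,
           COUNT(*) AS hackers_with_count
    FROM STUDENT_CHALLENGE
    GROUP BY challenge_count
)
SELECT sc.hacker_id,
       sc.name,
       sc.challenge_count
FROM STUDENT_CHALLENGE sc
JOIN CHALLENGE_COUNT_ONE cc
    ON sc.challenge_count = cc.challenge_count
-- Keep a student if their count is unique, or if it equals the maximum
WHERE cc.hackers_with_count = 1
   OR sc.challenge_count = (SELECT MAX(challenge_count)
                            FROM STUDENT_CHALLENGE)
ORDER BY sc.challenge_count DESC, sc.hacker_id;
```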

In the SQL above, we create two CTEs as follows:

  1. Firstly, we use STUDENT_CHALLENGE to get the number of challenges per hacker.
  2. Then we create CHALLENGE_COUNT_ONE, which uses the preceding CTE to count how many students share each challenge count, since we need to remove students with duplicate counts. We use this result in the filter condition towards the end of the query.

As we can see, CTEs let us build complex queries one step at a time, which makes the code modular and reader-friendly.

Before we build our own CTEs, we must understand the rules for defining them. Microsoft's documentation lists the rules for common table expressions on SQL Server.
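
For instance, two rules worth remembering are that multiple CTEs in one WITH clause are separated by commas (without repeating WITH), and that the WITH clause must be followed immediately by a single statement that references the CTEs. A minimal sketch:

```sql
WITH first_cte AS (
    SELECT 1 AS n
),                        -- additional CTEs are comma-separated
second_cte AS (
    SELECT n + 1 AS n
    FROM first_cte        -- a CTE may reference CTEs defined before it
)
SELECT n
FROM second_cte;          -- exactly one statement follows the WITH clause
```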

Temporary Tables

Another way to solve a complex problem with the divide-and-conquer approach is by using temporary tables. As the name suggests, temporary tables are physical tables created in the database, but with a limited lifecycle. We can solve the same HackerRank problem as follows:
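
Again, the sketch below is illustrative: it uses SQL Server syntax, where SELECT ... INTO #table creates a local temporary table, and assumes the same Hackers and Challenges schema as before.

```sql
-- Query 1: number of challenges created by each hacker
SELECT h.hacker_id,
       h.name,
       COUNT(c.challenge_id) AS challenge_count
INTO #STUDENT_CHALLENGE
FROM Hackers h
JOIN Challenges c
    ON h.hacker_id = c.hacker_id
GROUP BY h.hacker_id, h.name;

-- Query 2: how many hackers share each challenge count
SELECT challenge_count,
       COUNT(*) AS hackers_with_count
INTO #CHALLENGE_COUNT_ONE
FROM #STUDENT_CHALLENGE
GROUP BY challenge_count;

-- Query 3: filter and sort the final result
-- (local temp tables are dropped automatically when the session ends)
SELECT sc.hacker_id,
       sc.name,
       sc.challenge_count
FROM #STUDENT_CHALLENGE sc
JOIN #CHALLENGE_COUNT_ONE cc
    ON sc.challenge_count = cc.challenge_count
WHERE cc.hackers_with_count = 1
   OR sc.challenge_count = (SELECT MAX(challenge_count)
                            FROM #STUDENT_CHALLENGE)
ORDER BY sc.challenge_count DESC, sc.hacker_id;
```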

We can see from the SQL above that the script is now split into three small queries, each easy to understand on its own.

One important thing to know when using temporary tables is their scope. The visibility of a temporary table depends on its kind:

  • A temporary table created within a stored procedure is visible only while the stored procedure is running.
  • Local temporary tables are visible in the current session and are dropped automatically when it ends.
  • Global temporary tables, on the other hand, are visible to all active sessions. They are dropped automatically once the session that created them ends and every other session that referenced them has stopped using them. A quick illustration follows this list.
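
As a minimal sketch (SQL Server syntax, where a single # prefix marks a local temporary table and ## marks a global one; the table names here are hypothetical):

```sql
-- Local temp table: visible only to the session that creates it
CREATE TABLE #local_counts (hacker_id INT, challenge_count INT);

-- Global temp table: visible to all active sessions until the creating
-- session ends and no other session still references it
CREATE TABLE ##shared_counts (hacker_id INT, challenge_count INT);
```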

To know more about temporary tables, check the following document.

CTE vs Temporary Table

Now that we have looked into CTEs and temporary tables, let us discuss when to use one over the other. The rules of thumb I use when deciding are as follows:

Case 1: Generally speaking, common table expressions perform better than temporary tables. For instance, on the examples above, the average query runtime was 158 ms with the CTE versus 173.2 ms with the temporary tables.

  • We evaluated the CTE and the temporary tables on the same dataset to get a fair performance estimate.
  • Additionally, we averaged the runtime over ten runs to reduce noise.
  • Finally, we cleared the cache after each run to get an accurate performance measure.
  • It is also worth noting from the queries above that where the CTE requires only read IOPS, temporary tables, being physical tables stored in the database, need both read and write IOPS.

Case 2: Although CTEs tend to perform better at a basic level, temporary tables have advantages too. With complex queries on large datasets, the intermediate result sets can be significant, and several of these intermediate sets often need to be joined with each other to get the desired results. With such complex joins, partitioning the dataset becomes crucial to optimise performance.

Since CTEs are not physical tables, they cannot be partitioned. In such cases, temporary tables tend to perform better owing to partitioning.

Case 3: While building a feature set for a machine learning problem, or pipelines for a data engineering project, we often need to reuse a portion of a query several times. In such cases, it is optimal to compute the result of a base query once and use it in conjunction with other data, rather than running the same query several times.

Since the scope of a CTE is a single execution of a SELECT, INSERT, UPDATE, DELETE, or MERGE statement, the query has to be written and executed again whenever a different feature or part of the data pipeline needs it. With temporary tables, the scope is the current session, and it can extend across sessions depending on the kind of temporary table.
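
A hypothetical sketch of this pattern: the base aggregation is materialised once per session and then reused by several downstream feature queries without being recomputed.

```sql
-- Materialise the base query once per session
SELECT hacker_id,
       COUNT(challenge_id) AS challenge_count
INTO #base_counts
FROM Challenges
GROUP BY hacker_id;

-- Downstream feature queries reuse the stored result instead of
-- re-running the base aggregation
SELECT AVG(1.0 * challenge_count) AS avg_challenges FROM #base_counts;
SELECT MAX(challenge_count)       AS max_challenges FROM #base_counts;
```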

Conclusion

Generally speaking, developers and data scientists can use CTEs and temporary tables together for optimal performance. Both objects have their advantages and become powerful when used in combination. Additionally, both constructs keep queries maintainable and reader-friendly.

In this blog, we looked into using CTEs and temporary tables to split a problem into sub-problems and solve them individually. In the next blog, we will look into ways to optimise query performance using techniques like partitioning and indexing.

Vachan Anand is a consultant with an interest in Data Science, Data Engineering and Cloud Technology.