Best Practices for Gnarly SQL Queries

Sep 8, 2016

by Michael Discenza, Senior Data Science Manager

At RUN, many of our batch data processing jobs and data analysis projects require us to write complex SQL. We have a lot of normalized tables, which reduce the amount of storage space we need for our data but increase the complexity of the queries we end up writing.

For example, a table of “events” collected from our pixel servers might have both ad impression events and ad click events, and another table might have information about the campaigns those impressions were a part of. To get any information about the click-through rate of impressions with certain characteristics, we would have to 1) self-join the events table, then 2) join those impressions to the campaign table that has information about the settings used to serve them, and finally 3) aggregate the clicks and impressions matched to each other along the variables of interest.
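
To make the shape of that query concrete, here is a minimal sketch, assuming hypothetical events and campaigns tables (the column names below are illustrative, not our actual schema):

-- Hypothetical schema: events(event_id, event_type, impression_id, campaign_id)
--                      campaigns(campaign_id, targeting_setting)
-- 1) self-join events to match clicks to their impressions,
-- 2) join to campaigns for the serving settings,
-- 3) aggregate clicks and impressions along the variable of interest.
SELECT
  c.targeting_setting,
  COUNT(i.event_id) AS impressions,
  COUNT(k.event_id) AS clicks,
  CAST(COUNT(k.event_id) AS FLOAT) / COUNT(i.event_id) AS ctr
FROM events i
LEFT JOIN events k
  ON k.impression_id = i.event_id
 AND k.event_type = 'click'
JOIN campaigns c
  ON c.campaign_id = i.campaign_id
WHERE i.event_type = 'impression'
GROUP BY c.targeting_setting;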

These joins have been the basis for all of our data preparation for building models that optimize for click-through rate, among other KPIs. If we want to incorporate additional information into the analysis about the users, user segments, time of day, etc., we need additional joins with other dimension tables, aggregations followed by self-joins, and more aggregations. As you might imagine, layering on all of these additional operations can quickly lead to some pretty lengthy (and gnarly) queries. Though this isn’t necessarily something we’re happy about, we’ve got a bunch of 100+ line queries in our code base.

Long queries take a lot of time to write and a lot of time to debug. For batch processing jobs that do aggregations for surfacing in a UI or prepare data for automated model fitting, writing the queries is a healthy, if time-consuming, investment. When we’re writing queries for data analysis, though, getting to a working query that surfaces the insights we want can often suck up an entire morning or afternoon of coding time.

In our effort to minimize the time needed to write and debug these complex queries, we’ve compiled a series of best practices. Most of our SQL queries run in Hive or Amazon’s Redshift managed data warehouse, so we’ll stick to that syntax for our examples, but these principles apply to whatever flavor of SQL you’re using.

1. Comments are your friend: even though SQL is meant to be readable, it is verbose, and leaving comments that describe what you’re doing and how it relates to the end goal of the query can really help you out in the end. Naming a .sql file something descriptive yet concise can also help you and others know what a query does without having to read through and understand every operation.
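
As a small, hypothetical illustration of the kind of commenting that pays off later:

-- ctr_by_campaign.sql
-- Purpose: click-through rate per campaign, used as input to the CTR model.
SELECT
  campaign_id,
  -- clicks were matched to their impressions in the self-join upstream
  SUM(CASE WHEN click_id IS NOT NULL THEN 1 ELSE 0 END) AS clicks,
  COUNT(*) AS impressions
FROM impression_clicks  -- hypothetical output of the events self-join
GROUP BY campaign_id;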

2. Develop your queries on smaller chunks of data — Toy examples are your friend. One of the most annoying things about doing analysis and work with “big data” is waiting to get results. You can test your queries on small subsets of data rather than using cluster resources and wasting time waiting for queries to run across entire datasets.
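
For example, one way to carve out a cheap toy dataset to develop against might look like this (Redshift-flavored syntax; table names are hypothetical):

-- Work against a small slice of the events table instead of the whole thing.
CREATE TABLE events_sample AS
SELECT *
FROM events
WHERE event_time BETWEEN '2016-09-01' AND '2016-09-02'
LIMIT 100000;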

3. Do sanity checks. It’s great when you get a 200 line SQL query to finally work, but even then you’re not out of the woods: you want to make sure that you’re getting the result that you expect. One of the best ways to check that queries with complicated joins and group-bys make sense is to figure out, for a dummy set of data, how many entries or rows you would expect in the results table. This will help you make sure that you’re doing the right kinds of joins. You can create a dummy table by loading a CSV of fake data, or by modifying some of the real data, so that you can anticipate the results of joins and group-bys.
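
A sketch of that kind of dummy table, with entirely made-up values (Redshift-flavored syntax):

-- Two impressions and one click for a single fake campaign:
-- after the self-join and group-by we should see exactly one row
-- with impressions = 2 and clicks = 1.
CREATE TABLE events_dummy (
  event_id      INT,
  event_type    VARCHAR(20),
  impression_id INT,
  campaign_id   INT
);
INSERT INTO events_dummy VALUES
  (1, 'impression', NULL, 42),
  (2, 'impression', NULL, 42),
  (3, 'click',      1,    42);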

4. Yes, you can do test-driven SQL development — Especially if your queries are part of application code where there is a test framework in place, you can and should assert on the results of the queries run on the toy datasets from the “sanity checks” in the previous tip. This is not a bad idea if you’re developing mission-critical functionality that might need to be modified in the future.
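
If you don’t have a test framework handy, you can even express the assertion in SQL itself. A rough sketch against the dummy data above, assuming a hypothetical results table built from it:

-- Should return 'pass' if the join and aggregation behave as expected.
SELECT
  CASE WHEN impressions = 2 AND clicks = 1 THEN 'pass' ELSE 'fail' END AS ctr_join_check
FROM ctr_by_campaign_dummy  -- hypothetical results table built from events_dummy
WHERE campaign_id = 42;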

5. Use temporary tables — Rather than trying to get all of your queries correct at once and debugging them as subqueries, work incrementally. This way you will have an easier time knowing where you introduced an error. You can also examine each result set as you go and determine whether the results are as expected, say the number of rows given an input, and you can track that these checks pass with comments. The more transformations and aggregations you layer on the data, the harder it is to anticipate the expected result and validate that it is correct.
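
A minimal sketch of that incremental style, with illustrative names:

-- Step 1: match clicks to impressions, then stop and inspect.
CREATE TEMPORARY TABLE impression_clicks AS
SELECT
  i.event_id AS impression_id,
  i.campaign_id,
  k.event_id AS click_id
FROM events i
LEFT JOIN events k
  ON k.impression_id = i.event_id
 AND k.event_type = 'click'
WHERE i.event_type = 'impression';

-- Expect one row per impression; on the dummy data above that is 2 rows.
SELECT COUNT(*) FROM impression_clicks;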

6. Use common table expressions only later in the game — It’s better to develop and debug with temporary tables, but if you want to combine all of your operations into one single query, you can fold those statements into common table expressions afterwards. I actually often keep the temporary-table versions of the queries around as well. For example, a set of temporary-table queries like the one above can be refactored into a single common-table-expression query, as in the sketch below.
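
Roughly, the temporary-table step above folds into a single query like this:

WITH impression_clicks AS (
  SELECT
    i.event_id AS impression_id,
    i.campaign_id,
    k.event_id AS click_id
  FROM events i
  LEFT JOIN events k
    ON k.impression_id = i.event_id
   AND k.event_type = 'click'
  WHERE i.event_type = 'impression'
)
SELECT
  campaign_id,
  COUNT(*) AS impressions,
  COUNT(click_id) AS clicks
FROM impression_clicks
GROUP BY campaign_id;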

7. Adhere to indentation conventions — Capitalizing keywords and indenting properly, though not strictly speaking necessary for your queries to run, can make spotting errors easier and save time in the long run. Agree with your team on what to do with parentheses (opening them on the previous line, closing them on a new line, etc.) so that you have visual consistency.
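
The exact convention matters less than agreeing on one. For instance, one possible style:

-- Keywords capitalized, one clause per line,
-- opening parenthesis on the line that introduces the subquery,
-- closing parenthesis on its own line.
SELECT
  campaign_id,
  COUNT(*) AS impressions
FROM (
  SELECT *
  FROM events
  WHERE event_type = 'impression'
) imp
GROUP BY campaign_id;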

8. Name your temporary tables well — It’s always good to name your tables descriptively, albeit without getting too long (like Java class names). You can quite literally describe a table by the operations that were used to produce it. These tables will eventually be refactored into common table expressions, as in “WITH table_a AS (<query goes here>)”, where table_a is replaced with the meaningful name of your temp table so it can be referenced later in the query.

9. Use views: In both Redshift and Postgres, views are abstractions of queries that are run every time the view is referenced in a query. This is a good way to encapsulate instruction-heavy but compute-light operations and pre-processing. It breaks up complicated queries and makes them easier to read.
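
A sketch of that kind of view, again with hypothetical names:

-- Encapsulate the "only impressions" pre-processing so downstream
-- queries don't have to repeat the filter.
CREATE VIEW impression_events AS
SELECT
  event_id,
  campaign_id,
  event_time
FROM events
WHERE event_type = 'impression';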

10. Rely on your scripting language or workflow manager to break long queries into different queries. Views are helpful organizational devices, but if intermediate parts of compute-heavy queries are used for other operations, splitting long queries into intermediate result tables can be useful. In our workflow manager, Luigi, we might make the creation of a source table for two different queries a separate task, so both of the queries can be set to depend on it in the dependency graph, as sketched below. You might have redundant copies of the data in slightly different forms, but this intermediate data can be removed once it reaches a certain age, and the additional storage is a small price to pay for increased development speed and avoiding duplicate queries in your codebase.
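
In SQL terms, the shared piece that an upstream task might materialize once could look something like this (table and column names are hypothetical); the downstream queries then read from the intermediate table instead of each re-running the expensive join:

-- Materialized once by an upstream task.
CREATE TABLE daily_impression_clicks AS
SELECT
  i.event_id AS impression_id,
  i.campaign_id,
  k.event_id AS click_id
FROM events i
LEFT JOIN events k
  ON k.impression_id = i.event_id
 AND k.event_type = 'click'
WHERE i.event_type = 'impression';

-- Downstream task 1: aggregation surfaced in a UI.
SELECT campaign_id, COUNT(*) AS impressions, COUNT(click_id) AS clicks
FROM daily_impression_clicks
GROUP BY campaign_id;

-- Downstream task 2: per-impression rows for model-fitting data prep.
SELECT impression_id, campaign_id,
  CASE WHEN click_id IS NOT NULL THEN 1 ELSE 0 END AS clicked
FROM daily_impression_clicks;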

Michael Discenza — LinkedIn

