Redshift Code Compilation

Philip Ingram
2 min readJul 27, 2015

--

At Curriculet, we recently added a new product to our lineup providing teachers with greater insight into what their students are reading and how they are mastering different Common Core standards. We decided after much research that Redshift would allow us to aggregate large quantities of data on the fly, avoiding the need for a complicated caching system. Along the way, however, we ran into some weird performance issues due to the way Redshift compiles queries.

What it is

As documented by Amazon:

Amazon Redshift generates and compiles code for each query execution plan. The compiled code segments are stored in a least recently used (LRU) cache and shared across sessions in a cluster. Thus, subsequent executions of the same query, even in different sessions and often even with different query parameters, will run faster because they can skip the initial generation and compilation steps. The LRU cache persists through cluster reboots, but is wiped by maintenance upgrades.

How it performs

Turns out, it varies greatly. On our system and for our largest table, a query could take as long as 17.04 seconds or as quick as 109 milliseconds. What mattered more than the number of rows or columns being queried was whether Redshift had to compile the query before running it. For us, 17 seconds or 5 seconds or even 1 second was too long to keep our users waiting (remember that the time measured is the time for Redshift to execute the query and does not include the roundtrip to our server, parsing of the data, or the template compilation).

As Amazon documents, Redshift can reuse compiled code from one query to the next even with different parameters. So… what the hell does that mean?

I had to find out.

This is what one of our typical queries looks like:

These are the things that caused a re-compilation:

  • Adding /removing a column to the SELECT clause
  • Adding / removing a condition to the WHERE clause

… and the ones that didn’t:

  • Changing the metric_date in the WHERE clause
  • Changing the quantity of student_ids in the WHERE clause by less than 10%

The solution

In the end, what made the most sense was to cause the query to compile before the user made the query. In our case, there is one entry point for a teacher browsing to the Redshift data pages. When the teacher hits that entry page, we send a request to the service below, which calls the Redshift fetching service for each of the teacher’s classrooms. Of course, this solution isn’t perfect since the teacher could browse directly to the Redshift data pages without triggering the pre-compilation, but this method sufficiently solved our performance problem.

Please post any comments or suggestions below! You can also find out more about me at http://pingram.co/

--

--