Querying Your (Data) Universe on Redshift
It has certainly been a while since my last post. Fortunately, my current role has given me the opportunity to work with a cool new tool, Amazon Redshift. At its core, it's a column store, with similar products including BigQuery, Druid, and Vertica. That said, its pricing model, platform support, and scalability make it a great addition to a startup's stack. As there is already plenty of coverage on Redshift, I will focus on query optimization in this post.
As a column store, Redshift keeps values in the same column of a table adjacent to each other on disk. As a result, or rather by design, it's much faster to access a single column of values, for averages, adjustments, summation, and so on. This behavior encourages its use in analytics applications. By the same token, it is also much faster to retrieve a few columns of all rows than to retrieve a few rows of all columns.
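To make this concrete, here is a sketch against a hypothetical `events` table (the table and column names are my own invention, not from a real schema):

```sql
-- Columnar layout shines here: only the blocks holding the revenue
-- column need to be read from disk to compute this aggregate.
SELECT AVG(revenue)
FROM events;

-- By contrast, SELECT * forces a read of every column's blocks,
-- even though we only want a single row back.
SELECT *
FROM events
WHERE event_id = 12345;
```

The first query is the shape a column store rewards; the second is the shape a row store rewards.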
Another key feature that makes Redshift a great analytics data store is how it handles queries. Naively, Redshift can be thought of as a cluster of EC2 instances coupled together by a leader node, with all other instances acting as workers. The leader node compiles each query into binaries and distributes them to the workers, which each retrieve their own portion of the data. Finally, the result set is aggregated and delivered by the leader node. My headcanon analogizes this structure to a single-layer neural network, though that may be overkill. As we scale to more data and more nodes, distributed queries scale effectively as well.
As with any distributed data store, how the data is sharded greatly affects how Redshift performs. While users can define a distribution key at CREATE TABLE time, I would in fact recommend against it for general cases. We saw great query and loading improvements using the default even distribution, which utilizes all nodes during queries and minimizes the chance of any one node becoming a bottleneck. There is an exception to this rule: if a table is used as a reference and is often fully loaded into memory, improvements can be gained by retrieving it whole from a single node and skipping the aggregation step.
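As a sketch, Redshift exposes these choices through the DISTSTYLE clause of CREATE TABLE (the table names below are hypothetical; DISTSTYLE ALL replicates the table to every node, one way to serve a small reference table without redistribution):

```sql
-- Default behavior: rows are spread round-robin across all nodes,
-- so every node participates in a scan.
CREATE TABLE page_views (
    view_id   BIGINT,
    user_id   BIGINT,
    viewed_at TIMESTAMP
)
DISTSTYLE EVEN;

-- A small, frequently joined lookup table can be replicated in full
-- with DISTSTYLE ALL, avoiding cross-node shuffles at join time.
CREATE TABLE countries (
    country_code CHAR(2),
    name         VARCHAR(64)
)
DISTSTYLE ALL;
```

A user-chosen DISTKEY is the third option, but as noted above, even distribution was the safer default for our workloads.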
If you are sharper than me, you may have noticed a bottleneck in this design: there is a constant overhead on every query as the leader node creates the binary. This is unavoidable behavior. To reduce its effects, Redshift caches compiled queries with a Least Recently Used (LRU) policy. The cache is also variable agnostic, so that WHERE username = 'Superman' and WHERE username = 'Batman' both use the same binary. To my disappointment, the cache sizing is controlled by an internal algorithm that the Amazon team has not published. What we can confirm, regardless, is that the most frequently run queries will perform better than the less frequent ones.
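In practice, that means these two queries (against a hypothetical `users` table) should pay the compilation cost only once between them, since they differ only in a literal:

```sql
-- First run: leader node compiles the query and caches the binary.
SELECT user_id, created_at
FROM users
WHERE username = 'Superman';

-- Same shape, different literal: the cached binary is reused,
-- skipping the compilation overhead.
SELECT user_id, created_at
FROM users
WHERE username = 'Batman';
```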
As our database clients grew, we found the need to reduce the impact of background operations on customer-facing queries. Redshift provides a wonderful way to parallelize and scope queries through its Workload Management (WLM). It is a configuration that divides memory resources into query groups, as well as multiple queues within each group. This allows parallelization between and within queues, depending on the query group you assign at query time. It does take some tuning, however, as memory is distributed based on the WLM hierarchy. If you introduce too many groups and queues, a query can exhaust its allocated memory, forcing Redshift to write intermediate results to disk and greatly slowing it down.
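Assigning a query to a group at query time looks like the following sketch; the group name 'etl' is an assumption and would have to match a query group defined in your WLM configuration:

```sql
-- Route subsequent statements in this session to the WLM queue
-- matched by the (hypothetical) 'etl' query group.
SET query_group TO 'etl';

-- A background maintenance operation that should not compete with
-- customer-facing queries for memory.
VACUUM page_views;

-- Return to the session's default queue.
RESET query_group;
```

This is how we kept background operations like loads and vacuums out of the queues serving our customer-facing traffic.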
Despite the stumbles, I'm enjoying Redshift so far. As we use it to power our analytics product, I will share other tools and tips that I learn along the way. In the meantime, have fun playing with it.
Originally published at www.whosbacon.com.