Building A Scalable Interactive Analytics Backend

By Anurag Nilesh, Principal Engineer at Eightfold.ai


According to a study by Gartner, diverse organizations perform 12% better than non-diverse ones, showing more innovation and better financial returns. Eightfold.ai offers a Talent Diversity solution that lets our customers track and analyze their diversity goals and check for existing bias in the hiring process across steps like recruiter screening, hiring manager screening, and onsite interviews. We built an interactive analytics dashboard that allows users to slice and dice the hiring process data by gender, ethnicity, candidate quality, and more for a department or business unit, with subsecond response times.

The requirements we had when choosing the backend infrastructure for this dashboard were:

  • A scalable system able to handle petabytes of data
  • Subsecond latency for most queries
  • Efficient aggregation over large amounts of data

With these requirements, the data load pattern, and the query patterns for the dashboard in mind, we looked at different options such as MySQL, PostgreSQL, and AWS Redshift. We chose AWS Redshift because its Massively Parallel Processing (MPP) architecture and columnar storage make it a scalable, high-performance solution. Redshift distributes data across multiple nodes, and when fetching data it can run tasks in parallel to speed up queries.

A column-oriented DBMS stores table data by column rather than by row. Columnar databases are well suited for OLAP workloads, where you fetch only certain columns at a time rather than entire rows. Predicate filtering and joins can be done much faster without loading entire rows into memory. Another benefit is that column data compresses efficiently, since the values in a column are of a uniform type.
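For instance, a typical dashboard aggregation touches only a couple of columns. In the sketch below (the hiring_funnel table and its columns are hypothetical, not our actual schema), a columnar engine reads just the two referenced columns instead of materializing whole rows:

    -- Illustrative OLAP-style aggregation on a hypothetical table.
    -- A columnar engine scans only the gender and stage columns;
    -- the rest of each row is never read from disk.
    SELECT gender, stage, COUNT(*) AS candidates
    FROM hiring_funnel
    GROUP BY gender, stage;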

While working with Redshift, we learned a few tips and tricks on how to make sure that we deliver on our product goals.

Tip #1: Use sortkey for faster range filtering and joins.

While creating a table, a sortkey can be specified as a combination of multiple columns. When data is initially loaded into an empty table, the rows are stored on disk in the sorted order of the sortkey columns. This speeds up queries that do range or equality filtering on a sortkey column, since Redshift can efficiently skip over large amounts of data during table scans. Another benefit appears in joins with other tables: if a sortkey column is also one of the join columns, Redshift can do a sort-merge join instead of a slower hash join.
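As a minimal sketch (the table and column names are illustrative, not our actual schema), a sortkey is declared in the CREATE TABLE statement:

    -- Hypothetical fact table: rows are stored on disk sorted by
    -- (customer_id, application_date), so range or equality filters
    -- on these columns can skip large portions of the table scan.
    CREATE TABLE applications (
        application_id   BIGINT,
        customer_id      BIGINT,
        candidate_id     BIGINT,
        stage            VARCHAR(64),
        application_date DATE
    )
    SORTKEY (customer_id, application_date);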

Tip #2: Use distkey to make use of the parallel computing features of Redshift.

While creating a table, a distkey can be specified, which must be a single column. The table's rows are distributed among the nodes in the Redshift cluster based on the value of the distkey, so choose a column with high cardinality. Avoid a column like customer_id as the distkey: all the data for a single customer would be colocated on a single node, and Redshift would not be able to use parallel computation for that customer's queries.

If a small dimension table needs to be joined with fact tables often, set its dist style to ALL to make sure a copy of the entire table is placed on every node; this avoids shuffling data across nodes and reduces processing time. The sketch below shows both patterns.
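Continuing the hypothetical schema from the previous sketch:

    -- Fact table: distribute rows by a high-cardinality column
    -- (candidate_id) so work spreads evenly across nodes; the
    -- sortkey from Tip #1 can be combined with the distkey.
    CREATE TABLE applications (
        application_id   BIGINT,
        customer_id      BIGINT,
        candidate_id     BIGINT,
        department_id    INT,
        stage            VARCHAR(64),
        application_date DATE
    )
    DISTKEY (candidate_id)
    SORTKEY (customer_id, application_date);

    -- Small dimension table joined often with the fact table:
    -- replicate a full copy to every node to avoid shuffling.
    CREATE TABLE departments (
        department_id INT,
        name          VARCHAR(128)
    )
    DISTSTYLE ALL;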

Tip #3: Consider pre-computing a dynamic case clause as an additional column.

In one of our queries, we used a case clause to aggregate data dynamically based on customer configuration, which resulted in a different query plan per customer even though the query structure was the same. We optimized this by pre-computing the value of the case clause and storing it as a new column in the table. Every customer now executes the same query plan, which cuts the compilation time out of query execution.

Example:
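(The column names and thresholds below are hypothetical, for illustration only.)

    -- Before: the bucket thresholds come from per-customer
    -- configuration, so the generated query text (and therefore the
    -- compiled plan) differs from customer to customer.
    SELECT gender,
           CASE WHEN quality_score >= 0.8 THEN 'strong'
                WHEN quality_score >= 0.5 THEN 'average'
                ELSE 'weak'
           END AS quality_bucket,
           COUNT(*) AS candidates
    FROM applications
    GROUP BY 1, 2;

    -- After: quality_bucket is pre-computed at load time and stored
    -- as a column, so every customer issues identical query text and
    -- Redshift can reuse the compiled plan.
    SELECT gender, quality_bucket, COUNT(*) AS candidates
    FROM applications
    GROUP BY 1, 2;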

Tip #4: Consider using a SQL view to optimize the compilation time of the query.

If multiple queries join the same set of tables, consider creating a view over those tables and using the view in the queries instead, to optimize the compilation time of those queries.
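A minimal sketch, assuming the hypothetical applications and departments tables from earlier:

    -- Define the shared join once...
    CREATE VIEW applications_with_department AS
    SELECT a.application_id, a.candidate_id, a.stage,
           d.name AS department
    FROM applications a
    JOIN departments d ON a.department_id = d.department_id;

    -- ...then write each dashboard query against the view.
    SELECT department, stage, COUNT(*) AS candidates
    FROM applications_with_department
    GROUP BY 1, 2;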

Tip #5: Inspect the query plan to design faster queries.

Look at the output of 'explain <query>' to see how the tables are being joined and whether a better sortkey/distkey choice could replace hash joins with merge joins and avoid sequential scans over entire tables.

For example, see the difference in the query plans below: table1_optimized specifies its sortkey as (join_col1, join_col2), whereas table1 specifies only join_col1.
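An illustrative sketch of the comparison (the plan fragments in the comments are representative of Redshift's EXPLAIN output, not verbatim):

    EXPLAIN
    SELECT *
    FROM table1 t1
    JOIN table2 t2
      ON t1.join_col1 = t2.join_col1
     AND t1.join_col2 = t2.join_col2;
    -- With table1 sorted only on join_col1, the planner falls back
    -- to a hash join:
    --   XN Hash Join DS_DIST_NONE ...

    EXPLAIN
    SELECT *
    FROM table1_optimized t1
    JOIN table2 t2
      ON t1.join_col1 = t2.join_col1
     AND t1.join_col2 = t2.join_col2;
    -- With a sortkey covering both join columns, the planner can
    -- use the faster merge join:
    --   XN Merge Join DS_DIST_NONE ...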

Tip #6: Debug the query execution using system tables like svl_compile.

We were seeing some queries take multiple seconds end to end, even though their execution time was still under a second.

(Figure: timeline and execution-time graph of the query, as seen in the AWS Redshift console)

While debugging this issue, we looked at how the Redshift query planner translates a query plan into steps, segments, and streams. Redshift maintains a compiled-query cache to save on compilation time, and it recompiles only the segments that have changed relative to a previously compiled similar query.

You can query a Redshift system table to find out whether a particular query was recompiled. If the compile column has the value 1 for a segment, that segment was compiled rather than served from the cache.

select * from svl_compile where query = 2963480;

Tip #7: Reclaim space with the VACUUM command.

During data load, we first store the data in a temporary table, then start a transaction in which we empty the target table and copy the data over from the temporary table. After the transaction ends, we run the VACUUM command on the table, which re-sorts rows and reclaims the space held by deleted rows. This ensures that all queries after the data load can still run efficiently.
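A sketch of this load pattern (the table name, S3 path, and IAM role are placeholders):

    -- Stage the fresh data in a temporary table.
    CREATE TEMP TABLE applications_staging (LIKE applications);
    COPY applications_staging
    FROM 's3://example-bucket/applications/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS PARQUET;

    -- Swap the data inside a single transaction.
    BEGIN;
    DELETE FROM applications;
    INSERT INTO applications SELECT * FROM applications_staging;
    COMMIT;

    -- Re-sort rows and reclaim the space held by deleted rows.
    VACUUM applications;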

Tip #8: Optimize the query plan by running the ANALYZE command.

Redshift automatically runs ANALYZE on tables to generate the statistical metadata the query planner uses to produce optimal plans. It is still recommended to run the command routinely, so a good habit is to run it at the end of each data load, alongside VACUUM.
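Continuing the load sketch above:

    -- Refresh planner statistics after the load so the optimizer
    -- works with up-to-date row counts and value distributions.
    ANALYZE applications;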

We hope these tips help you make the most of Amazon Redshift for your analytics use cases. AWS publishes further documentation on Redshift best practices, and we recommend reading through it. Let us know if these tips helped you, or if you have other tips you would like to share with us :)

About Eightfold.ai

Eightfold.ai delivers the Talent Intelligence Platform™, the most effective way for companies to identify promising candidates, meet diversity hiring goals, retain top performers, and serve their recruitment needs at lower total cost. Eightfold’s patented Artificial Intelligence-based platform enables more than 100 enterprises to turn talent management into a competitive advantage. Built by top engineers out of Facebook, Google and other leading technology companies, Eightfold is based in Mountain View, California.
