Optimizing the performance and the scalability of embedded dashboards

Migrating our data warehouse to Snowflake

Yuting Huang
Shippeo Tech Blog
Jun 13, 2022



As the data team of a SaaS company offering real-time transportation visibility and advanced analytics, we are committed to constantly improving the performance of the dashboards embedded in our platform. In this article, we discuss the challenges we faced, why we decided to migrate our data warehouse from PostgreSQL to Snowflake, the lessons we learned about data modeling and configuration on Snowflake, and the results we achieved.

1. Our business context and analytics architecture

In the Data-as-a-Service team, we build multiple data products, among which are the dashboards embedded on our platform for our end users. The challenge is to offer a variety of analytics products for monitoring their supply chain from numerous perspectives (including operational performance, tracking compliance and data quality, and feature usage and value), with a high data refresh rate, all relying on a single data source: our data warehouse.

Here is an overview of our embedded analytics architecture:

Overview of Shippeo’s embedded Analytics Architecture

In a nutshell, we load data from the operational database into the data warehouse (previously PostgreSQL, now Snowflake) through a Debezium and Kafka pipeline (you can find more details on the data loading process in this article from our data engineer). Within the data warehouse, the data marts are prepared with the help of dbt (a data transformation tool). We then connect Tableau to the data warehouse and build all the dashboards in Tableau. Finally, we embed those dashboards on our platform, where customers can access them.
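To make the data mart step a bit more concrete, here is a minimal sketch of the kind of SQL transformation dbt materializes in the warehouse. The schema, table, and column names (raw.deliveries, marts.fact_deliveries, delay_minutes, and so on) are illustrative assumptions, not our actual models.

-- Hypothetical, simplified data mart model of the kind dbt materializes in the warehouse.
CREATE OR REPLACE TABLE marts.fact_deliveries AS
SELECT
    d.delivery_id,
    d.customer_id,
    d.carrier_id,
    d.planned_arrival_at,
    d.actual_arrival_at,
    DATEDIFF('minute', d.planned_arrival_at, d.actual_arrival_at) AS delay_minutes
FROM raw.deliveries AS d
WHERE d.deleted_at IS NULL;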

2. What are the pain points and where are they coming from?

There are three main pain points that we pay the most attention to in order to offer the best data products to our customers.

❗️Dashboard performance with higher demand for complex analysis

With the continuous deployment of new features and transportation flows on our platform, coupled with the increasing maturity of our customers, the demand for delivering more complex analytics is growing. Inevitably, more complex analyses mean more elaborate modeling of the data warehouse to ensure dashboard responsiveness.

❗️Scalability with fast-growing data volume

Business expansion keeps bringing in more customers, so the number of users and the volumes of ingested and generated data are increasing accordingly. We need to make sure that our analytics architecture can scale seamlessly.

❗️Data operations optimization

With the growing number of external and internal users, data operations have become crucial to avoid slow or unavailable dashboards. We therefore need a tool that simplifies data warehouse management by allocating the proper resources to each kind of usage, ensuring enough compute capacity for many simultaneous dashboard sessions and for data loading running in parallel.

❓Where are the pain points coming from?

After performing all the possible optimizations on the Tableau dashboards and on the Tableau Server settings (to process queries more quickly and with greater parallelism), the improvements were hardly noticeable, confirming that the bottleneck was indeed the performance limitations of PostgreSQL. Of course, increasing PostgreSQL CPU resources indefinitely could not be a long-term solution, and limiting the data horizon could not be considered either.

PostgreSQL was originally built as an OLTP database, meaning it is optimized for operational workloads, not for the complex analytical queries our dashboards generate. Therefore, migrating from an OLTP database to an OLAP database dedicated to analytical queries seemed to be the best solution.

3. Migration from PostgreSQL to Snowflake ❄️

Snowflake is built as an OLAP database. As a fully managed cloud service, it can scale up resources automatically when needed and comes with a powerful engine for optimizing query performance.

So how did we come up with an optimized data infrastructure on Snowflake?

The art of balancing cost and performance

1. Why is denormalizing not always the best solution on Snowflake?

Snowflake is an OLAP system and, in our case, the data is mostly used for analysis in dashboards, so we assumed that a denormalized model, namely a flat table containing all columns and rows, would be faster and more efficient than a dimensional model (also known as a star or snowflake schema).

Dimensional models reduce redundancy and eliminate inconsistent dependencies by splitting the data into multiple tables. Conversely, denormalized models introduce redundancy by combining the data into a single table to make retrieval faster. Denormalization is commonly used in OLAP systems because it avoids complex joins.

The results were interesting: the dashboard loading time was around 10% faster with the denormalized model in Shippeo's use case. However, the denormalized model took almost 1500% more time than the dimensional model when generating or updating the data marts in Snowflake.

Considering the high update frequency of our dashboard service, striking a better balance between performance and cost was the main principle to apply. In fact, joins are not as expensive as we imagined on Snowflake. Instead, table scanning consumes the most resources (when no clustering is used), and clustering smaller fact tables and then joining them to dimension tables is actually less costly than clustering a fully denormalized table. We therefore decided to go with the dimensional model, which offers a better balance between performance and cost in our case.
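To make the comparison concrete, here is a hypothetical sketch of the two options (table and column names are illustrative assumptions): the dimensional model keeps a slim fact table that is joined to dimension tables at query time, while the denormalized model materializes everything into one wide table that is cheap to query but expensive to rebuild and re-cluster.

-- Dimensional model (hypothetical names): join a slim fact table to dimensions at query time.
SELECT f.delivery_id,
       c.customer_name,
       ca.carrier_name,
       f.delay_minutes
FROM marts.fact_deliveries f
JOIN marts.dim_customer c  ON c.customer_id = f.customer_id
JOIN marts.dim_carrier  ca ON ca.carrier_id = f.carrier_id;

-- Denormalized model: one wide table, no joins at query time, but costly to regenerate.
CREATE OR REPLACE TABLE marts.deliveries_flat AS
SELECT f.*, c.customer_name, ca.carrier_name
FROM marts.fact_deliveries f
JOIN marts.dim_customer c  ON c.customer_id = f.customer_id
JOIN marts.dim_carrier  ca ON ca.carrier_id = f.carrier_id;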

2. What to cluster and why?

To be more specific about the costs generated by the denormalized table, we want to emphasize that it is the clustering that drives most of the cost.

According to the Snowflake documentation:

A clustering key is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table in the same micro-partitions.

Source: https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html

For example, if we create a clustering key on a date-based field, queries running time-series analysis over a specific horizon become more efficient. However, although clustering can substantially improve the performance and reduce the cost of some queries, the compute resources used to perform the clustering consume credits, and generating the table with clustering takes more time.

Besides, micro-partitions have a minimum size, which can make clustering on finer-grained keys useless. For instance, when there is already a clustering key on a date-based field, adding a clustering key on the date plus another string field may not change the number of micro-partitions at all, meaning the second field does not really help. Therefore, we need to limit the number of clustered columns while still ensuring good query performance.
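As a minimal sketch (again with hypothetical table and column names), a clustering key is declared on the table itself, and Snowflake's SYSTEM$CLUSTERING_INFORMATION function can then be used to check whether a candidate key actually changes the micro-partition layout:

-- Cluster the (hypothetical) fact table on a date expression for time-series queries.
ALTER TABLE marts.fact_deliveries CLUSTER BY (TO_DATE(planned_arrival_at));

-- Inspect clustering quality (partition counts, average depth, overlaps) for a candidate key.
SELECT SYSTEM$CLUSTERING_INFORMATION('marts.fact_deliveries', '(TO_DATE(planned_arrival_at))');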

3. How to optimize warehouse size and warehouse clusters?

A Snowflake warehouse provides the required resources, such as CPU, memory, and temporary storage, to perform operations such as executing SQL queries and DML in a Snowflake session.

Source: https://docs.snowflake.com/en/user-guide/warehouses.html

Warehouse size specifies the set of compute resources available in a warehouse. Resizing a warehouse (scaling up) means using bigger machines, but that is not to say bigger is always better. After testing different sizes, we found that even though a single query runs faster on a bigger warehouse, the total loading time when opening a dashboard is actually longer, because the bigger the machines are, the longer the provisioning time is.

A multi-cluster warehouse (scaling out) is designed to handle queuing and performance issues caused by large numbers of concurrent users and queries, which is exactly the situation for embedded analytics serving a large volume of users. Since this option is not included in the base contract with Snowflake, there is again a balance to strike: more clusters enable more parallel queries but also consume more credits. Therefore, the number of clusters should always be customized based on the real business need. In our case, we took the option, and it has turned out to be both useful and cost-effective.

Snowflake warehouse default configuration interface
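For illustration, a multi-cluster warehouse dedicated to the embedded dashboards could be declared roughly as follows. The name and values are assumptions for the example, not our production settings, and multi-cluster warehouses require a Snowflake edition that includes the feature.

-- Hypothetical warehouse for embedded dashboards: modest size, scale-out for concurrency.
CREATE WAREHOUSE IF NOT EXISTS embedded_dashboards_wh
  WAREHOUSE_SIZE    = 'SMALL'    -- scale-up lever: bigger machines per cluster
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3          -- scale-out lever: extra clusters under concurrent load
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60         -- seconds of inactivity before suspension
  AUTO_RESUME       = TRUE;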

4. Optimize auto suspension: user access habits analysis

Snowflake works with a pay-per-use pricing model and offers the possibility to suspend a warehouse when it has been inactive for a specific period (1 minute by default). Auto-suspend ensures that you do not leave a warehouse running, and consuming credits, when there are no incoming queries. On the other hand, suspending the warehouse drops the existing cache layers, which are what make queries faster. So in our use case, an active warehouse means faster queries, while after a period without activity the first queries are relatively slower.

Given Snowflake's pricing methodology, one of our recommendations would be to analyze users' access habits (peak access hours and access frequency) before setting up auto-suspension. This is what we did to better assess when the user experience needs to be kept at a certain level of satisfaction.
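Once those access patterns are known, the auto-suspend delay can be tuned per warehouse. The sketch below simply extends the hypothetical warehouse from the previous example, and the 600-second value is an illustrative assumption rather than our actual setting.

-- Keep the (hypothetical) dashboard warehouse, and therefore its caches, warm for longer
-- between queries; 600 seconds is an illustrative value, not our production setting.
ALTER WAREHOUSE embedded_dashboards_wh SET AUTO_SUSPEND = 600;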

4. Where do we stand now after the migration?

Now if we look back again at our three pain points: dashboard performance, scalability, and data operations, what are the achievements after the migration?


✅ One of the most obvious results is that the average loading time when opening a dashboard was cut in half for all of our customers, and even more noticeably for large-scale customers. This contributes significantly to customer satisfaction and strengthens user stickiness to our data product.

Dashboard average loading time evolution

✅ Our mastery of the data warehouse's various properties and the significant performance improvement for large-scale customers are also strong proof of our ability to scale as the company expands, with more clients, more data, more use cases, and more complex analytics.

✅ By establishing dedicated warehouses with customized configurations for each use case, we are no longer exposed to interference between use cases, because the data loading processes are all independent. In other words, we can freely work with production data without compromising any key feature on the platform.
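As a sketch of that isolation (warehouse and role names are hypothetical), separate warehouses can be created per workload and each tool or role granted usage on its own warehouse, so data loading, internal analytics, and embedded dashboards never compete for the same compute:

-- Hypothetical split: one warehouse per use case so workloads never compete for compute.
CREATE WAREHOUSE IF NOT EXISTS loading_wh
  WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS internal_analytics_wh
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Route each consumer to its own warehouse, e.g. the BI tool's role to the dashboard warehouse.
GRANT USAGE ON WAREHOUSE embedded_dashboards_wh TO ROLE tableau_role;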

📝 After a thorough phase of investigation and testing, we have confirmed that a new-generation data warehouse, combined with an agile and efficient analytics infrastructure, can give businesses a competitive advantage and provide customers with fast and easy access to business intelligence, while scaling seamlessly to meet growing demand. At the same time, we need to keep in mind that the current pay-per-use model has some limitations that require extra care. As of now, we continuously track the performance of the dashboards and the cost spent on Snowflake with our internal monitoring tool, to keep delivering data products in an efficient, safe, and convenient way.
