StarRocks Engineering

Dive deep into the world of data engineering, analytics, and open-source analytical frameworks. Explores the trailblazing technology behind StarRocks’ exceptional performance.

How to build an extremely fast analytical database - Part 4

--

In the realm of databases, there are primarily two categories: transactional databases that focus on handling transactions and analytical databases that prioritize analysis. For an analytical database, performance is of utmost importance. In this article, I will share insights on how to create an exceptionally fast analytical database based on CPU technology. (Note: this article does not discuss heterogeneous computing such as GPU or FPGA, nor does it delve into new hardware accelerators like RDMA or NVM.)

This article will explore nine perspectives: precomputation vs. on-the-fly computation, scalability, data streaming, resource utilization, optimization, resilience, architecture vs. details, approximation vs. precision, and performance testing. Through these angles, we will delve into the process of building a lightning-fast analytical database.

Part 4: Resource Utilization

The essence of performance optimization is to optimize the utilization of various system resources. When we review the various methods for query optimization discussed in Part Three, they are essentially focused on optimizing the usage of CPU, memory, I/O, and network resources:

  • “Read Data Less And Fast” at the storage layer is essentially optimizing I/O resources.
  • “Transfer Data Less And Fast” at the computing layer is essentially optimizing network resources.
  • “Process Data Less And Fast” at the computing layer is essentially optimizing CPU and memory resources.

If two implementations solve the same problem with the same algorithmic complexity, the better one is the one that does less unnecessary work and consumes fewer system resources.
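To make this concrete, here is a toy illustration in C++ (not StarRocks code): both functions below compute a filtered sum with the same O(n) complexity, but the first materializes an intermediate column while the second fuses the filter and the sum into a single pass, so it allocates nothing extra and reads each value exactly once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy illustration: both functions compute the sum of the selected rows of a
// column and are O(n), but they consume very different amounts of resources.

// Version A: materializes a temporary filtered column, paying for extra
// allocations, extra writes, and a second pass over the data.
int64_t filtered_sum_materialized(const std::vector<int64_t>& col,
                                  const std::vector<uint8_t>& selection) {
    std::vector<int64_t> tmp;
    for (std::size_t i = 0; i < col.size(); ++i) {
        if (selection[i]) tmp.push_back(col[i]);
    }
    int64_t sum = 0;
    for (int64_t v : tmp) sum += v;
    return sum;
}

// Version B: fuses the filter and the sum into one pass. No allocation, each
// value is read exactly once, and the branch-free form vectorizes well.
int64_t filtered_sum_fused(const std::vector<int64_t>& col,
                           const std::vector<uint8_t>& selection) {
    int64_t sum = 0;
    for (std::size_t i = 0; i < col.size(); ++i) {
        sum += col[i] * selection[i];  // selection[i] is 0 or 1
    }
    return sum;
}
```

On the same hardware, the fused version typically uses less memory bandwidth and fewer CPU cycles, even though the two are identical in big-O terms.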

Part 5: Manual Tuning vs. Adaptive Tuning vs. AI Tuning

Due to the diverse hardware conditions, modeling approaches, usage scenarios, and query patterns of users, as well as the varying characteristics of user data, optimizing a database can have different outcomes in different scenarios. What may be beneficial for scenario A could be detrimental for scenario B. Therefore, database developers often introduce various configurations and variables to provide tuning flexibility to DBAs or users. The same query can exhibit several times or even dozens of times difference in performance under different configurations.

However, manual tuning not only incurs high labor costs but is also genuinely difficult: the tuner must deeply understand both database internals and the business scenario to find a near-optimal configuration. To address this challenge, databases adopt two main strategies. One is for the database kernel to adapt autonomously, minimizing the configurations and variables exposed to users; adaptive algorithms then automatically choose the most suitable strategy for each context. The other is to introduce machine learning, leveraging large amounts of historical data to automatically determine the most appropriate configurations or variables. For instance, indexes, execution algorithms, scheduling policies, cardinality estimation in the optimizer, cost models, partitioning strategies, and sorting key selection can all, in theory, be automatically optimized and continuously improved based on historical data. Professor Andy Pavlo from CMU has even founded a startup, OtterTune, specifically to address this challenge.

Both internal adaptive strategies and external machine learning-based tuning can be employed simultaneously. In essence, they both aim to gather more information based on users’ real data and queries to narrow down the context and achieve more accurate and targeted optimization. StarRocks has made various attempts in adaptive strategies and will consider leveraging machine learning for tuning in the future.
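As a sketch of what a kernel-side adaptive strategy can look like (illustrative only, not StarRocks' actual implementation; the class name, sampling window, and threshold are invented for this example), the operator below samples the data it actually processes and then decides whether local pre-aggregation is worth keeping, instead of depending on a session variable the user must set:

```cpp
#include <cstddef>

// Illustrative sketch of a kernel-side adaptive strategy (not StarRocks'
// actual implementation; the class name, sampling window, and threshold are
// invented for this example). The operator watches the data it actually
// processes and decides whether local pre-aggregation is worth keeping,
// instead of depending on a session variable the user must set.
class AdaptivePreAggregator {
public:
    // Called for each input chunk during the sampling phase.
    void observe_chunk(std::size_t input_rows, std::size_t distinct_groups) {
        sampled_rows_ += input_rows;
        sampled_groups_ += distinct_groups;
        if (sampled_rows_ >= kSampleRows) decided_ = true;
    }

    // Keep pre-aggregation only if it actually reduces the row count enough.
    bool use_pre_aggregation() const {
        if (!decided_) return true;  // default until enough data has been observed
        double reduction =
            1.0 - static_cast<double>(sampled_groups_) / static_cast<double>(sampled_rows_);
        return reduction >= kMinReduction;  // skip pre-agg for near-unique group-by keys
    }

private:
    static constexpr std::size_t kSampleRows = 65536;  // assumed sampling window
    static constexpr double kMinReduction = 0.3;       // assumed usefulness threshold
    std::size_t sampled_rows_ = 0;
    std::size_t sampled_groups_ = 0;
    bool decided_ = false;
};
```

The key point is that the decision is made from observed data characteristics at runtime rather than from a configuration the user has to guess in advance.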

Part 6: Extreme Elasticity in Serverless

The various optimization approaches mentioned earlier share a common assumption: that physical resources remain relatively fixed over a short period. However, with the growing popularity and maturity of serverless databases, we can now rapidly scale computing resources up and out within a short time, increasing them several times or even dozens of times over. Keeping the total query cost constant, we can accelerate queries simply by quickly provisioning additional resources, because Cost(10 computing nodes, each taking 100 seconds) = Cost(100 computing nodes, each taking 10 seconds).
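To make the arithmetic explicit, assume a simple linear pricing model in which p is the price of one node-second (an assumption for illustration; real billing models vary): Cost = nodes × seconds × p, so 10 × 100 × p = 100 × 10 × p = 1,000p. The total bill is unchanged, while the wall-clock latency drops from 100 seconds to 10 seconds.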

To achieve this, a database needs to have certain characteristics. These include being cloud-native, separating storage and compute, providing sub-second elasticity, possessing robust multi-core and multi-node scalability, offering resource isolation capabilities, enabling workload prediction, and automatically resolving data skew.

In my opinion, optimizing extreme elasticity will be crucial in the future. The ability to rapidly scale resources up or out to meet demand can significantly enhance query performance and overall system efficiency.

Part 7: Architecture vs. Details

If we truly want to build an ultra-high-performance system or database, it is crucial to have both a well-designed architecture and impeccable engineering implementation. Exceptional performance comes from a combination of excellent architecture and meticulous engineering details. As mentioned earlier in the article, the importance lies not only in the vectorized execution architecture but also in the engineering details. The MPP (Massively Parallel Processing) distributed computing mechanism primarily focuses on the architectural aspect, while the CBO (Cost-Based Optimizer) and the Pipeline single-node parallel execution framework require both a reasonable architecture and meticulous engineering details.

Some may argue that this is obvious, but it is not always apparent to everyone. If you agree with this viewpoint, then when building an OLAP database you would not choose a language like Java to implement the query execution and storage layers, nor would you design the entire system bottom-up, focusing only on details and algorithms, as ClickHouse does.

Part 8: Approximate vs. Exact

For an analytical database, another commonly used acceleration technique is approximate computation. Approximate computation can satisfy user requirements for two main reasons:

  1. Exploratory Analysis: When analysts are conducting exploratory analysis, they may only need a general understanding of data scale, characteristics, and trends. They do not necessarily require precise data.
  2. Large Data Volume: When dealing with large datasets, the cost of performing exact computations can be prohibitively high. For example, calculating exact distinct counts or percentiles can be computationally expensive.

In OLAP databases, there are generally two implementation approaches for approximate computation:

  1. Using Approximate Data Structures: This approach involves leveraging approximate data structures to perform calculations. For example, HyperLogLog can be used to estimate approximate distinct counts, and TDigest can be used to estimate approximate percentiles (Link: https://blog.bcmeng.com/post/tdigest.html). A minimal HyperLogLog sketch follows this list.
  2. Sampling-based Statistical Estimation: In this approach, a subset of the original data is sampled and used to estimate metrics for the entire dataset. For example, if the original dataset has 100 million rows, a sample of 1 million rows can be used to estimate metrics such as maximum values or cardinality for the entire 100 million rows. One drawback of this approach is that different data distributions have different statistical characteristics, leading to potential estimation errors.
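To make the first approach concrete, below is a minimal, self-contained HyperLogLog sketch in C++20 (illustrative only; it is not StarRocks' implementation, and std::hash is used here purely for brevity). With 2^12 registers the standard error is roughly 1.04 / sqrt(4096), or about 1.6%:

```cpp
#include <algorithm>
#include <bit>
#include <cmath>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Minimal HyperLogLog sketch (illustrative only, not StarRocks' implementation).
// Uses 2^b registers; the standard error is roughly 1.04 / sqrt(2^b), so b = 12
// (4096 registers, a few KB of memory) gives about 1.6% error. Assumes a 64-bit
// std::hash, which is used here purely for brevity.
class HyperLogLog {
public:
    explicit HyperLogLog(int b = 12) : b_(b), m_(1u << b), registers_(m_, 0) {}

    void add(const std::string& value) {
        uint64_t h = std::hash<std::string>{}(value);
        uint32_t idx = static_cast<uint32_t>(h >> (64 - b_));  // top b bits pick a register
        uint64_t rest = (h << b_) | (1ull << (b_ - 1));        // caps the rank if the rest is all zeros
        uint8_t rank = static_cast<uint8_t>(std::countl_zero(rest) + 1);
        registers_[idx] = std::max(registers_[idx], rank);
    }

    double estimate() const {
        double alpha = 0.7213 / (1.0 + 1.079 / m_);
        double sum = 0.0;
        int zeros = 0;
        for (uint8_t r : registers_) {
            sum += std::ldexp(1.0, -r);  // 2^-r
            if (r == 0) ++zeros;
        }
        double e = alpha * m_ * m_ / sum;
        if (e <= 2.5 * m_ && zeros > 0) {
            e = m_ * std::log(static_cast<double>(m_) / zeros);  // small-range correction
        }
        return e;
    }

private:
    int b_;
    uint32_t m_;
    std::vector<uint8_t> registers_;
};

int main() {
    HyperLogLog hll;
    for (int i = 0; i < 1000000; ++i) {
        hll.add("user_" + std::to_string(i));
    }
    std::cout << "approximate distinct count: " << hll.estimate() << "\n";  // close to 1,000,000
}
```

With 4096 registers the sketch occupies only a few kilobytes no matter how many distinct values are added, which is exactly why approximate structures are attractive at large data volumes.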

Part 9: Performance Testing

For a large-scale open-source database, given the collaborative nature of development and the varying knowledge backgrounds of individual developers, it is inevitable that changes causing performance regressions will occasionally be introduced. Therefore, to maintain a continuous performance advantage, thorough performance testing is essential to catch regressions early. StarRocks has encountered numerous such cases during its development. Database performance testing can be divided into two major categories:

  1. Self-performance regression testing or benchmark testing:
  • On one hand, this type of testing helps prevent performance degradation between versions, ensuring that each new version maintains or improves performance.
  • On the other hand, it can help identify new optimization opportunities by analyzing the cause of each performance regression that occurs. (A toy regression check is sketched after this list.)
  2. Competitor performance comparison testing:
  • This type of testing aims to assess the performance of the database in comparison to its competitors.
  • By comparing performance with competitor databases, it is possible to identify areas where the database excels and achieves world-leading performance in certain query types.
  • Additionally, it can highlight query types that require further improvement and uncover potential performance optimization opportunities.
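For the first category, a regression gate can be as simple as comparing per-query latencies between a baseline build and a candidate build and failing CI when any query slows down beyond a noise threshold. The sketch below is a toy example under assumed inputs (a CSV of query name and latency per build; the 5% threshold is made up), not StarRocks' actual testing infrastructure:

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Toy regression gate (not StarRocks' actual testing infrastructure). It
// compares per-query latencies from two CSV files, one produced by the
// baseline build and one by the candidate build, each containing lines of
// the form "query_name,milliseconds", and flags queries that slowed down
// beyond a noise threshold. The file format and the 5% threshold are
// assumptions made for this sketch.

std::map<std::string, double> load_latencies(const std::string& path) {
    std::map<std::string, double> result;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::string name, millis;
        if (std::getline(row, name, ',') && std::getline(row, millis)) {
            result[name] = std::stod(millis);
        }
    }
    return result;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: regression_check baseline.csv candidate.csv\n";
        return 2;
    }
    const double kThreshold = 1.05;  // flag anything more than 5% slower
    auto baseline = load_latencies(argv[1]);
    auto candidate = load_latencies(argv[2]);
    int regressions = 0;
    for (const auto& [query, base_ms] : baseline) {
        auto it = candidate.find(query);
        if (it == candidate.end()) continue;  // query not present in the new run
        if (it->second > base_ms * kThreshold) {
            std::cout << query << " regressed: " << base_ms << " ms -> "
                      << it->second << " ms\n";
            ++regressions;
        }
    }
    return regressions == 0 ? 0 : 1;  // a non-zero exit code fails the CI job
}
```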

Written by Kaisen Kang (kangkaisen.com): StarRocks Query Team Leader, StarRocks PMC, Apache Kylin PMC (Inactive), Apache Doris PMC (Retired).