Vivo’s Journey to a High-Performance Lakehouse with StarRocks
Author: Xiaolong Guo, Senior Data Engineer, vivo
Vivo is a leading global technology company dedicated to creating innovative smart devices and intelligent services that enhance users’ digital lives. The company operates globally, with a strong presence in markets like China, India, Southeast Asia, and Europe.
Vivo’s big data platform leverages StarRocks as its next-gen analytics engine to overcome limitations in Presto and ClickHouse, achieving 3–5x faster queries. The migration from Presto to StarRocks has improved ad-hoc analysis performance by 65%, boosted query success rates to 98%, and unified syntax compatibility while enabling real-time data processing with Flink CDC and materialized views.
Challenges in Vivo’s Big Data Multidimensional Analysis Scenarios
For ad-hoc analysis, our solution was built primarily on two computing engines: Spark and Presto. The ad-hoc analysis module is a heavily used, critical component of Vivo's big data development and governance platform. Through user research, we found that the existing system suffered from long query response times, directly impacting work efficiency, and that insufficient syntax compatibility further inconvenienced users. Optimizing this module to improve query speed and syntax compatibility therefore became a top priority for our team.
For agile BI, where Presto served as the query engine, users reported long query response times, and we struggled to optimize these queries because no effective solution was available.
The R&D efficiency tool platform initially ran on MySQL combined with an in-memory computing model while data volumes were relatively small. As data volumes grew, however, this approach could no longer meet business needs. We attempted to adopt ClickHouse, but encountered numerous issues during implementation and ultimately abandoned the attempt. Specific problems included low query performance, frequent memory overflows in the in-memory computing layer, excessively long data cleaning and processing times (often several hours or even a full day), and complex real-time update and deletion workflows. The computing logic was also cumbersome: users had to write extensive Java code to build reports, consuming substantial development effort. This became a common pain point.
Additionally, due to historical reasons, we have long lacked effective authentication and control mechanisms for user permission management and computing resource allocation.
Pain Points of Presto
The main issues we encountered with Presto include:
- Weak multi-level caching capabilities;
- Limited Cost-Based Optimizer (CBO) capabilities;
- Lack of materialized views and table-level acceleration solutions;
- No physical isolation mechanism;
- Coordinator is deployed as a single point, posing a single point of failure risk;
- Issues recur periodically even after JVM tuning and cluster expansion;
- Low community activity with few available performance optimization solutions.
Pain Points of ClickHouse
Although ClickHouse performs well in wide-table scenarios, its acceleration capabilities under a Lakehouse architecture are limited, and the high storage cost of ingesting data into it has become a major concern for users. ClickHouse also lacks strong Join processing: critical optimizations such as cost-based optimization (CBO), join reordering, and runtime filters have yet to be implemented. As a result, users must tune SQL by hand, significantly degrading the experience.
Furthermore, while ClickHouse offers some real-time update and deletion features, their performance has never fundamentally improved, which further hurts usability. SQL compatibility is only moderate, and cluster scaling is complex, especially when expanding or shrinking execution nodes or adjusting replicas. Recovering and replacing failed machines requires extensive manual intervention. Even with workarounds available, these challenges significantly affect operational efficiency and the user experience.
Advantages of StarRocks
After extensive research on various components, functional and performance testing, industry case studies, and technical guidance from the StarRocks community, we ultimately selected StarRocks as our next-generation multidimensional analysis engine, establishing it as the sole standard for Lakehouse acceleration.
StarRocks offers significant advantages in Lakehouse acceleration, supporting standard SQL and being compatible with the MySQL protocol. Its powerful Join processing capabilities, default-enabled CBO optimizations, multi-level caching mechanisms, and intelligent materialized view acceleration make it excel in complex query scenarios. Additionally, StarRocks provides a comprehensive resource isolation mechanism, and version 3.3.5 introduces cgroup-based hard CPU isolation, further enhancing the stability and controllability of resource management.
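To make the resource isolation model concrete, here is a minimal sketch of a StarRocks resource group; the group name, user, and limits are illustrative assumptions rather than Vivo's actual configuration, and the hard CPU isolation mentioned above involves additional group properties not shown here.

```sql
-- Hypothetical resource group for ad-hoc users, so heavy exploratory
-- queries cannot starve BI workloads. Names and limits are illustrative.
CREATE RESOURCE GROUP adhoc_rg
TO (user = 'adhoc_user')          -- classifier: route this user's queries to the group
WITH (
    "cpu_core_limit"    = "8",    -- CPU share of the group on each BE
    "mem_limit"         = "30%",  -- fraction of per-BE query memory the group may use
    "concurrency_limit" = "20"    -- max concurrent queries admitted to the group
);
```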
StarRocks is highly user-friendly for operations and maintenance (O&M). Whether scaling the cluster up or down, the process is simplified to just executing a few commands and monitoring the system, greatly reducing O&M costs.
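To give a flavor of the "few commands" involved in scaling, the sketch below shows typical backend (BE) scale-out and scale-in statements; hostnames and ports are placeholders, not Vivo's cluster.

```sql
-- Scale out: register a new BE node (default heartbeat port 9050)
ALTER SYSTEM ADD BACKEND "be-host-04:9050";

-- Scale in: decommission a node; data is drained before removal
ALTER SYSTEM DECOMMISSION BACKEND "be-host-01:9050";

-- Monitor node state and tablet migration progress
SHOW BACKENDS;
```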
It is evident that StarRocks effectively compensates for the shortcomings of Presto and ClickHouse, addressing the OLAP challenges and pain points we faced.
Our OLAP stack has evolved through multiple components, including Druid, Kylin, Presto, and ClickHouse, and the team currently maintains several OLAP systems, which spreads development resources thin. To improve R&D efficiency, we will focus on StarRocks and ClickHouse going forward and require every developer to learn and master these two core components in depth.
Technical Solution for StarRocks Service Implementation
The diagram above illustrates the overall architecture of the Vivo Big Data Platform, where StarRocks plays a key role in both the query layer and the data storage layer.
- Query Layer: This layer primarily supports applications such as ad-hoc analysis, BI reporting, and Lakehouse integration, covering both infrastructure and business systems. The current query engines include Spark, ClickHouse, Druid, Presto, and StarRocks, with StarRocks set to fully replace Presto in the future.
- Data Processing Layer: Spark is used for offline computing, while Flink is used for real-time computing. In terms of data storage, the platform utilizes Hive, Paimon, ClickHouse, StarRocks, Druid, and HBase. Currently, Druid is only used for infrastructure monitoring and certain advertising services, with plans to gradually migrate these workloads to StarRocks or ClickHouse.
- Data Storage Medium Layer: This includes HDFS, a self-developed object storage system, and HDD/SSD file systems. On the right side of the diagram is the company’s self-developed service platform, which handles data development, task scheduling, and real-time computing. It is also responsible for database creation, table creation, and materialized view management in StarRocks.
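For illustration, the DDL such a platform might issue over the MySQL protocol could look like the following sketch; the database, table, and column names are hypothetical.

```sql
-- Hypothetical DDL issued by the scheduling platform over the MySQL protocol.
CREATE DATABASE IF NOT EXISTS demo_dw;

CREATE TABLE IF NOT EXISTS demo_dw.dws_user_event_daily (
    dt          DATE        NOT NULL,
    user_id     BIGINT      NOT NULL,
    event_type  VARCHAR(64),
    event_cnt   BIGINT
)
DUPLICATE KEY (dt, user_id)                 -- detail model: rows kept as ingested
PARTITION BY RANGE (dt) (
    START ("2024-07-01") END ("2024-07-08") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH (user_id) BUCKETS 8;    -- hash-bucketed across BE nodes
```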
Lakehouse Query Acceleration Optimization
Our data warehouse primarily uses Hive (90% ORC format, migrating remaining Text tables to achieve 99% ORC adoption) and Paimon (100% ORC), with StarRocks as the core query accelerator. We conducted extensive ORC optimizations — including data reading logic, predicate pushdown, and cache mechanisms — alongside HDFS slow-node fixes and metadata caching improvements, achieving significant performance gains while maintaining 100% Presto and 85% Spark syntax compatibility.
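A rough sketch of this lakehouse query path, under assumed names, is shown below: an external Hive catalog is registered once, and the ORC data is then queried in place, with predicate pushdown and caching doing the acceleration. The metastore URI and table names are placeholders, not Vivo's environment.

```sql
-- Hypothetical Hive catalog over HDFS; URI and names are placeholders.
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Query ORC data in place; predicate pushdown and the data cache keep
-- repeated scans off HDFS where possible.
SELECT dt, count(*) AS pv
FROM hive_catalog.ods.user_event            -- catalog.database.table
WHERE dt = '2024-07-01'
GROUP BY dt;
```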
We upgraded StarRocks from 3.2.5 to 3.3.5, resolving deadlocks and materialized view refresh issues, improving multi-fact table support, and introducing a partition-based refresh strategy for better performance. To enhance stability, we implemented system process management, automated crash recovery, and a robust monitoring & alerting system. Additionally, we optimized authentication, SQL compatibility, and audit logging, while enabling secure encrypted data processing, accelerating StarRocks’ replacement of Presto in analytical workloads.
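As an illustration of a partition-based refresh strategy, the sketch below defines an asynchronous materialized view whose partitions align with the base table, so only changed partitions need refreshing. It assumes the hypothetical hive_catalog table from the previous sketch and is not Vivo's actual view definition.

```sql
-- Hypothetical partition-aligned async materialized view: StarRocks can
-- refresh only the partitions whose base data changed.
CREATE MATERIALIZED VIEW dws_event_daily_mv
PARTITION BY dt                             -- align MV partitions with the base table
DISTRIBUTED BY HASH (event_type)
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)    -- background refresh on a schedule
AS
SELECT dt, event_type, count(*) AS event_cnt
FROM hive_catalog.ods.user_event
GROUP BY dt, event_type;
```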
Outcomes and Benefits of Adopting StarRocks
Ad-Hoc Analysis Migration Journey
The migration progressed from planning (Jan 2024) through performance tuning and phased rollouts to full Presto replacement (Jul 2024), reaching 70–80% StarRocks adoption with continuous optimizations in metadata caching, resource isolation, and stability.
Benefits of Introducing StarRocks for Ad-Hoc Analytics
StarRocks now handles 70% of ad-hoc queries, significantly outperforming Spark (30%) in query speed and stability. P50 latency has improved by 65.06%, reducing from 63.77s to 22.30s, with an overall 3x speed increase. As adoption grows to 80%, we anticipate 4–5x performance improvements. Additionally, query success rates have reached 98%, far surpassing Presto in stability.
Benefits of Introducing StarRocks for Agile BI
We have migrated 25% of Agile BI workloads to StarRocks, achieving 250,000+ successful queries per month with a 99%+ success rate. StarRocks now supports 12 business domains and 600+ users, significantly enhancing efficiency:
- Slow queries (>30s) dropped from 2.99% to 1.32%
- P90 query latency reduced from 16s (Presto) to under 5s, a 4x performance boost
As StarRocks adoption expands, we expect further improvements in query performance and system stability, reinforcing our strategy to fully replace Presto.
Benefits of Introducing StarRocks for R&D Efficiency Tool Platform
We built a near real-time business database, reducing data availability from hours (or T+1) to under 3 minutes. Leveraging materialized views, development efficiency improved by 30%, with potential gains of 50%+ as users become more proficient. Previously, report development required extensive Java coding, whereas now it only requires configuring real-time tasks and materialized views, significantly reducing development costs.
Using Flink CDC, real-time data is synchronized to StarRocks raw data and incrementally processed through different data warehouse layers for efficient report querying. P95 query latency is now just 400ms, and Flink-StarRocks integration further enhances real-time metric capabilities. This initiative has been highly recognized within the company, optimizing costs and enhancing enterprise data infrastructure value.
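A minimal sketch of this pipeline in Flink SQL is shown below, assuming a MySQL source and the Flink StarRocks connector; hostnames, credentials, and table names are placeholders rather than Vivo's actual configuration.

```sql
-- Hypothetical Flink SQL job: capture changes from an upstream MySQL table
-- and write them into a StarRocks raw-data table.
-- Requires flink-connector-mysql-cdc and flink-connector-starrocks on the classpath.
CREATE TABLE src_orders (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(10, 2),
    updated_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql-host',
    'port'          = '3306',
    'username'      = 'reader',
    'password'      = '***',
    'database-name' = 'biz',
    'table-name'    = 'orders'
);

CREATE TABLE sink_orders (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(10, 2),
    updated_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'starrocks',
    'jdbc-url'      = 'jdbc:mysql://fe-host:9030',
    'load-url'      = 'fe-host:8030',
    'database-name' = 'ods',
    'table-name'    = 'orders',
    'username'      = 'etl',
    'password'      = '***'
);

-- Continuously apply the change stream to StarRocks
INSERT INTO sink_orders SELECT * FROM src_orders;
```

In practice the StarRocks target would be a Primary Key table so that CDC updates and deletes are applied as upserts rather than appends.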
Future Plans
We plan to complete the full migration by the end of 2025.
During the migration, we will continue enhancing StarRocks on Kubernetes (K8s) and refining best practices from Agile BI and ad-hoc analytics, expanding their adoption across more business scenarios. Ultimately, we plan to replace Presto clusters in advertising, AI, DMP, and other business areas with StarRocks. Based on the success of our ad-hoc analytics and Agile BI projects, we are confident in fully transitioning from Presto to StarRocks.
Want to learn more about StarRocks? Join our Slack community and start the conversation!