Jian Wang, Jiaqi Gu, Yi Yang, Isabel Tallam, Lakshmi Narayana Namala, Kapil Bajaj | Real Time Analytics Team
In this blog post series, we’ll discuss Pinterest’s Analytics as a Platform on Druid and share some learnings from using Druid. This first post covers a short history of our switch to Druid, the system architecture built around Druid, and learnings on optimizing host types for mmap.
A Short History on Switching to Druid
Historically, most of the analytical use cases at Pinterest were powered by HBase, which was then a well-supported key-value store in the company. All the reporting metrics were precomputed in an hourly or daily batch job, transformed into a key-value data model, and stored in HBase. This approach worked fine for a while, but eventually the cons of the HBase-based precomputed key-value lookup system became increasingly visible:
- The key value data model doesn’t naturally fit into the analytics query pattern, and more work is needed on the application side to do aggregation
- Cardinality explodes any time a new column is added
- It’s too expensive to precompute all filter combinations, leading to limited filter choices in the UI
- HBase cluster stability and operational cost became increasingly unmanageable as data sets grew larger
For all these reasons, we evaluated and decided to adopt Druid as Pinterest’s next-gen analytical data store. Since then, we have onboarded many critical use cases, including reporting partner and advertiser business metrics, organic Pin stats, experiment metrics, spam metrics analysis, etc. The infrastructure consists of more than 2,000 nodes in a multi-cluster setup, with the largest offline use case covering over 1 PB of data and the largest online use case serving 1000+ QPS at under 250 ms P99 latency.
The ingestion is a lambda architecture where:
- the daily/hourly batch pipeline transforms input data on S3 into Druid formatted index files and writes back to S3 with corresponding metadata records inserted into MySQL
- a streaming pipeline consumes Kafka topics into which upstream client services publish events in real time
Queries come through a proxy service called Archmage. Most of the microservices in the company communicate with each other via Thrift RPC calls, while Druid only accepts HTTP, so we added Archmage, a Thrift server, on top of Druid to bridge the gap. The proxy also makes it easier to emit service metrics and to support query rewriting, dark reads, rate limiting, etc.
Learnings on Optimizing Host Types for mmap
As we know, when a query comes in, the historical process retrieves the portions of segment data it needs through mmap. mmap presents an idealized abstraction of a byte array backed by the segment file: the process reads at some offset and length, and the operating system is responsible for making sure the pages referenced are in RAM. If they aren’t, a major page fault is triggered to load one or a series of pages containing the data into RAM, which is a costly disk operation. On the other hand, if the referenced offset has been accessed before and not evicted, it resides in the disk cache (also called the page cache) part of RAM. Then a minor page fault is triggered to link the page in the page cache to the page in the virtual memory of the Druid process, which is a much more lightweight operation and doesn’t involve a disk read. While onboarding different use cases, we evaluated host types for varying workloads and decided to use either memory optimized or IO optimized host types, depending on how much disk access is needed.
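The mechanics above can be sketched in a few lines of Python. This is an illustrative toy, not Druid’s actual code: the file name, sizes, and offsets are made up, and a real segment read goes through Druid’s own buffer abstractions.

```python
import mmap
import os

# Hypothetical stand-in for a segment file (names and sizes are made up).
path = "segment.bin"
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of dummy segment data

with open(path, "rb") as f:
    # Map the whole file; the OS loads pages lazily, on first touch.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The first touch of these pages may trigger a major page fault (a disk
    # read); later touches find them in the page cache via a minor fault.
    offset, length = 4096, 128
    data = mm[offset:offset + length]
    mm.close()

os.remove(path)
```

The key point is that the process only ever expresses "give me `length` bytes at `offset`"; whether that costs a disk read or a pointer fix-up is entirely the kernel’s decision.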
Memory Optimized Host Types
For low latency use cases at Pinterest, we try to minimize disk reads by provisioning hosts with enough memory to hold all frequently accessed segments in the page cache. We pick memory optimized hosts, e.g., the R series on AWS. The rule of thumb is to have at least a 1:1 ratio between the size of segments on disk and the RAM available for the page cache. Each historical node’s available page cache size is calculated as the total RAM in the host, minus the max heap and direct memory allocated to the Druid process, minus the RAM used by other sidecars on the host. Over time, as queries come in, the initial disk reads caused by major page faults go away as all the segments are gradually cached in the page cache.
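As a back-of-the-envelope check of the sizing rule above (all numbers here are hypothetical, in GiB, and not from any real Pinterest cluster):

```python
# Hypothetical host sizing, in GiB.
total_ram = 768           # e.g., a large memory optimized host
druid_max_heap = 96       # -Xmx for the historical process
druid_direct_memory = 128 # -XX:MaxDirectMemorySize
sidecar_ram = 16          # metrics agents, log shippers, etc.

# Available page cache = total RAM minus what Druid and sidecars claim.
available_page_cache = (
    total_ram - druid_max_heap - druid_direct_memory - sidecar_ram
)

segments_on_disk = 500    # segment bytes assigned to this historical node
# Rule of thumb: at least 1:1 between segments on disk and page cache.
fits = available_page_cache >= segments_on_disk
print(available_page_cache, fits)  # 528 True
```

If `fits` is false, either the node needs more RAM or it needs to hold fewer segments, because queries will start paying for major page faults.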
When new segments are downloaded each hour or day, this warm-up process kicks in again, and we see degradation until those segments are fully cached. To make performance more predictable, we adopted a pre-cache approach. Because of the way the OS manages the page cache, if a segment file on disk is traversed by read system calls from any process (even a non-Druid process), the OS will load it into the page cache part of RAM (if there is space left), so that any later read of the segment file triggers only a minor page fault and avoids a disk read. We leveraged this mechanism by having a thread pool in the historical process traverse each segment file as soon as it is first downloaded to local disk.
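A minimal sketch of that pre-cache idea (the actual implementation lives inside the historical process and is in Java; the function names and pool size here are assumptions for illustration):

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # read in 1 MiB chunks

def warm(path: str) -> int:
    """Sequentially read a segment file so the OS pulls its pages into
    the page cache ahead of query traffic. Returns bytes touched."""
    read = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(CHUNK):
            read += len(chunk)
    return read

def warm_all(paths) -> int:
    # Small thread pool, invoked right after each segment download finishes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(warm, paths))
```

The bytes read are thrown away; the only purpose of the traversal is the side effect of populating the page cache, so subsequent query reads hit minor rather than major page faults.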
IO Optimized Host Types
Memory optimized host types are not a good fit when:
- Data volume is too large for us to afford to provision enough RAM to cache every segment
- Use case itself is offline and doesn’t require sub second latency
We arrived at a best practice for use cases with the above characteristics when we onboarded an offline use case for experiment metrics that contained ~1 PB of data.
Initially we tried memory optimized host types with large EBS volumes. We thought that with larger RAM we could at least benefit from caching more segments than host types with smaller RAM could, but the performance turned out to be unsatisfying. Digging deeper, we found that each query scanned many segments, far more than could fit into the page cache, and they were different almost every time. So whatever was in the page cache before would be evicted, and performance was mainly determined by disk read performance. Druid abstracts a segment into a byte array and relies on mmap to fetch, on demand, only the specific portion of the byte array from disk rather than all of the segment data (e.g., a certain portion of the bitmap index for a certain dimension given a filter). This translates to loading one or a series of pages for one segment and then moving on to another segment until all segments needed for the query are processed. On most computer architectures, 4 KB is the minimum page size.
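The resulting access pattern looks less like one big sequential scan and more like many small, scattered page reads. A toy sketch of that pattern (hypothetical file and counts, no real benchmarking):

```python
import os
import random

PAGE = 4096  # minimum page size on most architectures

def random_page_reads(path: str, n: int, seed: int = 0) -> int:
    """Touch n random page-aligned offsets in a file, mimicking cold
    scans that hop between many segments. Returns total bytes read."""
    rng = random.Random(seed)
    pages = os.path.getsize(path) // PAGE
    total = 0
    with open(path, "rb", buffering=0) as f:
        for _ in range(n):
            # Each seek+read lands on a different page; when the page is
            # not cached, it costs one random 4 KB disk read.
            f.seek(rng.randrange(pages) * PAGE)
            total += len(f.read(PAGE))
    return total
```

When the working set dwarfs RAM, roughly every one of these touches is a major fault, which is why throughput ends up bounded by the disk’s 4 KB random read IOPS rather than by its sequential bandwidth.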
In summary, the rule of thumb is to use host types with high 4 KB random read IOPS when disk accesses are expected. On AWS, host types with instance-attached SSDs work best: i3 is better than i3en, and both are far better than any other instance type with an attached EBS disk. We eventually picked i3en, which balances price and performance, making this one of the most cost-effective use cases in the company.
We are going to continuously improve the architecture to meet new use case requirements:
- add AZ aware segment assignment to data nodes to increase availability
- evaluate alternatives of ZK based segment discovery to scale better for large use cases
- evaluate alternative ingestion solutions to increase efficiency or lower maintenance cost
We also plan to explore more host types to increase cost effectiveness for different scenarios. We will be exploring host types with enhanced EBS throughput, host types with AWS Graviton processors, etc.
We have learned a lot from the discussions in the Druid Guild with Ads Data team and the feedback from the open source community when we started contributing back our work. We would also like to thank all the teams that have worked with us to onboard their use cases to the unified analytics platform: Insights team, Core Product Data, Measurements team, Trust & Safety team, Ads Data team, Signal Platform team, Ads Serving, etc. Every use case is different and the platform has evolved a lot since its inception.