Getafix : Using Popularity to Curtail Memory Costs in Interactive Analytics Engines
by Mainak Ghosh, Ashwini Raina, Le Xu, Xiaoyao Qian, Indranil Gupta, Himanshu Gupta
Joint work between University of Illinois Urbana-Champaign and Oath
In Proceedings of EuroSys’18: Thirteenth EuroSys Conference 2018
Getafix is a data replication scheme that cuts memory costs in interactive analytics engines.
Getafix tracks data segment popularity in a cluster and runs a modified best-fit algorithm to find a memory efficient allocation. It is built upon our optimal solution to the static version of the scheduling problem, which yields a solution that has the least makespan and replication factor. In practice, it can cut cluster memory costs by 2X without sacrificing query performance and save firms as much as $10 million annually. It also automatically tiers popular data to powerful nodes in a heterogeneous cluster and greatly simplifies sys admin work.
Frequently Asked Questions
What are interactive analytics engines?
Interactive analytics engines are used for running real-time analytics on massive scale. Applications include usage analytics, revenue reporting, spam analytics, ad feedback, and others. Some of the popular interactive analytics engines are Druid, Amazon’s Redshift, Google’s Mesa, Presto and Linkedin’s Pinot.
In interactive data analytics engines, data is continuously ingested from multiple pipelines including batch and streaming sources, and then indexed and stored in a data warehouse. This data is immutable. The data warehouse resides in a backend tier, e.g., HDFS or Amazon S3. As data is being ingested, users (or programs) submit aggregation based queries to analyze data in an interactive way.
What are the tradeoffs with using less memory?
Reducing memory in these systems impacts performance. We measure this tradeoff by running experiments with an uniform replication strategy where all segment are replicated 4, 7 and 10 times. The curve has a knee when all segments are replicated 7 times. We do not observe a steady improvement in performance because over-replication often leads to query hotspots because of data colocation. Selective Replication schemes like Scarlett improve on both performance and memory because popular segments now have more replicas to divide the query load. Getafix improves on Scarlett by reducing memory by 2X with less than 5% impact on performance at median.
How does Getafix do it?
Getafix smartly colocates popular segment replicas with unpopular segments, thereby ensuring it gets a larger share of cluster compute. This is unlike Scarlett which alleviates query hotspots by creating more replicas of popular segments. Getafix models the problem as a colored variant of the well-known balls and bins problem. There are several other system design decisions in the paper involving: query routing, segment load balancing, minimizing network transfers, segment replica deletion and garbage collection..
How much dollar cost can Getafix save?
Scarlett, a popular data replication scheme used in batch systems like Hadoop, has a data replication factor of 4.2 in interactive analytics engines. Getafix has a replication factor of 1.9. In a public cloud deployment, where working set data size is 100 TeraBytes, Getafix can reduce memory usage by approximately 230 TB (100 TB × (4.2–1.9)). This amounts to cost savings of 230 × 1000 GB × $0.005/GB/hour = $1150 per hour ($0.005/GB/hour is the approximate pricing of memory on AWS). Annually, this would amount to $10 million worth of savings.
How does Getafix automatically tier a cluster?
Today system administrators manually configure clusters into tiers by grouping machines with similar hardware characteristics into a single tier. They use hardcoded rules for placing segments within these tiers, with recent (popular) segments assigned to the hot tier. Eschewing this manual approach, Getafix continuously tracks changes in segment popularity and cluster configuration, to automatically move popular replicas to powerful HNs, thereby creating its own tiers. Below figure shows the auto-tiering efficiency of Getafix on a heterogeneous cluster. Plot on the right is using Getafix with heterogeneity optimizations which automatically tiers the cluster into three tiers with 75% accuracy.