Cost Reduction in Goku

Pinterest Engineering
Pinterest Engineering Blog
Dec 17, 2021

Monil Mukesh Sanghavi | Software Engineer, Real Time Analytics Team

Rui Zhang | Software Engineer, Real Time Analytics Team

Hao Jiang | Software Engineer, Real Time Analytics Team

Miao Wang | Software Engineer, Real Time Analytics Team

In 2018, we launched Goku, a scalable and high-performance time series database system, which served as the storage and query serving engine for short term metrics (less than one day old). In early 2020, we launched GokuL (Goku long term), which extended Goku’s capability by supporting long term metrics data (i.e. data older than a day and up to a year old). Together they completely replaced OpenTSDB. For GokuL, we used three clusters of i3.4xlarge SSD-backed EC2 instances, which, over time, we realized were very costly. Reducing this cost was one of our primary aims going into 2021. This blog post will cover the approach we took to achieve that goal.

Background

We use a tiered approach to segregate the long term data and store it in the form of buckets.

| Tier | Bucket size | Raw data | Rollup data | Rollup time interval | Rollup aggregators | TTL |
|------|-------------|----------|-------------|----------------------|--------------------|-----|
| 0 | 2 hours | yes | no | N/A | N/A | 24 hours (12 buckets) |
| 1 | 6 hours | yes | yes | 15 min | sum, avg, count, max, min | 30 hours (5 buckets) |
| 2 | 1 day | yes | yes | 15 min | sum, avg, count, max, min | 5 days (5 buckets) |
| 3 | 4 days | yes | yes | 15 min | sum, avg, count, max, min | 24 days (6 buckets) |
| 4 | 16 days | no | yes | 15 min | sum, avg, count, max, min | 80 days (5 buckets) |
| 5 | 64 days | no | yes | 1 hour | sum, avg, count, max, min | 384 days (6 buckets) |
Table 1: the tiered approach to storing long term data

Tiers 1–5 contain the data stored on the GokuL (long term) clusters. GokuL uses RocksDB to store its long term data, which is ingested in the form of SST files.
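As a rough sketch of that ingestion path: RocksDB supports bulk-loading prebuilt SST files through its IngestExternalFile API, which is presumably how bucketed long term data lands in the store. The helper and file layout below are hypothetical, not Goku's actual code:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <string>
#include <vector>

// Hypothetical helper: bulk-load the prebuilt SST files for one long term
// bucket into RocksDB, instead of replaying the data points as writes.
rocksdb::Status IngestBucket(rocksdb::DB* db,
                             const std::vector<std::string>& sst_paths) {
  rocksdb::IngestExternalFileOptions opts;
  opts.move_files = true;  // move files into the DB rather than copying them
  return db->IngestExternalFile(sst_paths, opts);
}
```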

Query Analysis

We analyzed the queries going to the long term cluster and observed the following:

  1. Very few metrics (approximately 6K out of a total of ~10B) had data points older than three months queried from GokuL.
  2. More than half of the GokuL queries specified rollup intervals of one day or more.

Tier 5 Data Analysis

We randomly selected a few shards in GokuL and analyzed the data. We observed that the disk consumption of tier 5 data was much higher than that of all the other tiers (1–4) combined. This was despite the fact that tier 5 contains only one hour rolled up data, whereas the other tiers contain a mix of raw and 15 minute rolled up data.

| Shard Id | Tier 1 (5 × 6 hour buckets) | Tier 2 (5 × 1 day buckets) | Tier 3 (6 × 4 day buckets) | Tier 4 (5 × 16 day buckets) | Tier 5 (6 × 64 day buckets) | Percentage of tier 5 over sum of others |
|----------|------|------|------|------|------|---------|
| xx | 47 | 155 | 556 | 950 | 2048 | 135.09% |
| xx | 54 | 183 | 670 | 1331 | 3379 | 170.94% |
| xx | 37 | 118 | 420 | 816 | 2048 | 166.62% |
| xx | 46 | 168 | 613 | 1126 | 3072 | 177.61% |
| xx | 41 | 138 | 507 | 1126 | 3277 | 205.52% |
| xx | 49 | 154 | 550 | 1126 | 2355 | 142.08% |
| xx | 32 | 108 | 386 | 698 | 1638 | 151.06% |
| xx | 44 | 154 | 554 | 1126 | 3072 | 185.36% |
| xx | 63 | 179 | 587 | 1229 | | |
Table 2: SST file size for each bucket, in MiB

Solutions

From the query and tier 5 analyses, we inferred that tier 5 data (which holds six buckets of 64 days each) was both the least queried and the most disk consuming. We targeted our solutions at this tier, as it would give us the most benefit. Some of the solutions we discussed are described below.

Namespace

A namespace would store configurations like TTL, rollup interval, and tier settings for the set of metrics belonging to that namespace. Uber’s M3 has a similar feature. This would help us set appropriate configurations for select sets of metrics (e.g. a lower TTL for metrics that do not require longer retention). Since the time to production for this feature was longer, we decided to pursue it as a separate project in the future; it is now being actively worked on.
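As an illustration, a namespace configuration might look something like the sketch below. All names and fields here are hypothetical, not Goku's actual schema:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-tier settings a namespace could override.
struct TierConfig {
  int tier;
  int64_t bucket_size_sec;
  int64_t rollup_interval_sec;  // 0 = raw data only
  int64_t ttl_sec;
};

// Hypothetical namespace: every metric matching the prefix inherits
// this TTL, rollup, and tier configuration.
struct NamespaceConfig {
  std::string metric_prefix;
  int64_t ttl_sec;                // overall retention for the namespace
  std::vector<TierConfig> tiers;  // tiers this namespace actually keeps
};

// Example: host-level debug metrics kept only 5 days, skipping tiers 3-5.
const NamespaceConfig kShortRetention{
    "host.debug.",
    5 * 24 * 3600,
    {{1, 6 * 3600, 15 * 60, 30 * 3600},
     {2, 24 * 3600, 15 * 60, 5 * 24 * 3600}}};
```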

Rollup Interval Adjust for Tier 5 Data

We experimented with changing the rollup interval of tier 5 data from one hour to one day and observed the change in the final SST file size for the tier 5 bucket.

Table 3: tier 5 SST file sizes after changing the rollup interval from one hour to one day

The savings from this solution were not significant enough to justify putting it into production.
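For concreteness, rolling up is just re-aggregating raw points into coarser fixed-width intervals. A minimal sketch, with hypothetical types, of how a one hour (3600 s) or one day (86400 s) rollup can be computed:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical types: a raw point and the per-interval rollup aggregates.
struct DataPoint { int64_t ts_sec; double value; };
struct Rollup {
  double sum = 0;
  double max = 0;
  double min = 0;
  int64_t count = 0;  // avg is derivable as sum / count
};

// Bucket raw points by interval start and fold them into the aggregates.
std::map<int64_t, Rollup> RollUp(const std::vector<DataPoint>& points,
                                 int64_t interval_sec) {
  std::map<int64_t, Rollup> buckets;
  for (const auto& p : points) {
    int64_t start = p.ts_sec / interval_sec * interval_sec;
    auto [it, inserted] = buckets.try_emplace(start);
    Rollup& r = it->second;
    if (inserted) { r.max = r.min = p.value; }
    r.sum += p.value;
    r.max = std::max(r.max, p.value);
    r.min = std::min(r.min, p.value);
    ++r.count;
  }
  return buckets;
}
```

A one day interval stores one point where a one hour interval stores 24, which is why we expected (but did not see) a large reduction in SST file size.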

On Demand Loading of Tier 5 Data

On startup, GokuL clusters would store only data from tiers 1–4, loading tier 5 buckets on demand (based on queries). The cons of this solution were:

  • Users would have to wait and retry the query once the corresponding tier 5 bucket from S3 had been ingested by the GokuL host.
  • Once ingested, the bucket would remain in GokuL until thrown away by an eviction algorithm.

We decided not to go with this solution because it was not user friendly.
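For illustration, this scheme would have amounted to something like the sketch below: a lazily populated cache of tier 5 buckets with LRU eviction. All names are hypothetical, and the S3 ingestion call is a placeholder:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical sketch: lazily load tier 5 buckets from S3 on first query,
// evicting the least recently used bucket when the cache is full.
class Tier5BucketCache {
 public:
  explicit Tier5BucketCache(size_t capacity) : capacity_(capacity) {}

  // Returns true if the bucket is resident. Otherwise starts a load; the
  // caller asks the user to retry once ingestion finishes (the first con).
  bool GetOrLoad(const std::string& bucket_id) {
    auto it = index_.find(bucket_id);
    if (it != index_.end()) {
      lru_.splice(lru_.begin(), lru_, it->second);  // mark most recent
      return true;
    }
    if (lru_.size() >= capacity_) {  // evict the coldest bucket (second con)
      index_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(bucket_id);
    index_[bucket_id] = lru_.begin();
    // StartAsyncIngestFromS3(bucket_id);  // hypothetical ingestion call
    return false;
  }

 private:
  size_t capacity_;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```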

Tiered Storage

We decided to move tier 5 data onto a separate HDD-based cluster. While there was a notable difference in query latency, it was acceptable because relatively few queries hit this tier. We calculated that tier 5 was consuming approximately 1 TB on each of the 650 hosts in the GokuL cluster, and decided to use d2.2xlarge instances to store and serve the tier 5 data.

Table 4: instance cost comparison (i3.4xlarge vs. d2.2xlarge)

The cost savings from this solution were substantial: we replaced around 325 i3.4xlarge instances with 111 d2.2xlarge instances, reducing our costs by nearly 30–35%.

To support this, we designed and implemented tier-based routing in the Goku root cluster, which routes queries to the short term and long term leaf clusters.
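A minimal sketch of what such tier-based routing can look like, using the retention cutoffs from Table 1 (tier 0 covers the first day, tiers 1–4 up to 80 days). The names are hypothetical, and a real query spanning multiple ranges would fan out to several clusters rather than picking one:

```cpp
#include <cstdint>

// Hypothetical leaf clusters behind the root: short term SSD (Goku),
// long term SSD (GokuL tiers 1-4), and long term HDD (tier 5 on d2).
enum class Cluster { kShortTerm, kLongTermSsd, kLongTermHdd };

// Route by the age of the oldest data the query touches.
Cluster RouteQuery(int64_t query_start_ts_sec, int64_t now_sec) {
  const int64_t kDay = 24 * 3600;
  const int64_t age_sec = now_sec - query_start_ts_sec;
  if (age_sec <= kDay)      return Cluster::kShortTerm;    // tier 0
  if (age_sec <= 80 * kDay) return Cluster::kLongTermSsd;  // tiers 1-4
  return Cluster::kLongTermHdd;                            // tier 5
}
```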

In the future, we can evaluate whether to reduce the number of replicas, trading some availability for cost given the low query volume on this tier.

RocksDB Tuning

As mentioned above, GokuL uses RocksDB to store the long term data. We observed that the RocksDB options we were using were not optimal for GokuL’s workload, which has high data volume but low QPS.

We experimented with a stronger compression algorithm (ZSTD at level 5), which reduced disk usage by 40%. In addition, we enabled partitioned index/filters, so that only the top-level index is loaded into memory. On top of this, we enabled caching of filter and index blocks in the same block cache as the data blocks, at higher priority, to minimize the performance impact.
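A minimal sketch of how these options can be set through RocksDB’s C++ API; the cache size below is illustrative, not Goku’s actual setting:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeTunedOptions() {
  rocksdb::Options options;

  // Stronger compression: ZSTD at level 5.
  options.compression = rocksdb::kZSTD;
  options.compression_opts.level = 5;

  rocksdb::BlockBasedTableOptions table_options;

  // Partitioned index/filters: only the top-level index is pinned in
  // memory; index and filter partitions are paged in via the block cache.
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  table_options.partition_filters = true;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  table_options.pin_top_level_index_and_filter = true;

  // Cache index and filter blocks in the same block cache as data blocks,
  // at high priority so hot metadata is not evicted by data scans.
  table_options.cache_index_and_filter_blocks = true;
  table_options.cache_index_and_filter_blocks_with_high_priority = true;
  table_options.block_cache = rocksdb::NewLRUCache(
      1024 * 1024 * 1024 /* 1 GiB, illustrative */,
      /*num_shard_bits=*/-1, /*strict_capacity_limit=*/false,
      /*high_pri_pool_ratio=*/0.5);

  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```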

With both of the above changes, the latency difference was small and the reduction in data space usage was approximately 50%. We immediately put this into production and shrunk the size and cost of our GokuL clusters by another half.

What’s Next

Namespace

As mentioned, we are actively working on the namespace feature, which will help us reduce long term cluster costs even further by lowering the TTL for the majority of current metrics that do not need such high retention anyway.

Acknowledgements

Huge thanks to Brian Overstreet, Wei Zhu, and the observability team for providing and supporting the solutions on the table.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
