Scaling Cache Infrastructure at Pinterest

Pinterest Engineering Blog · Dec 10, 2020

Kevin Lin | Software Engineer, Storage and Caching

Demand on Pinterest’s core infrastructure systems is accelerating faster than ever as more Pinners come to Pinterest to find inspiration. A distributed cache layer fronting many services and databases is one of our core storage systems that sits near the bottom of Pinterest’s infrastructure stack, responsible for absorbing the vast majority of backend traffic driven by this growth.

Pinterest’s distributed cache fleet spans an EC2 instance footprint consisting of thousands of machines, caching hundreds of terabytes of data served at more than 150 million requests per second at peak. This cache layer optimizes top-level performance by driving down latency across the entire backend stack and provides significant cost efficiency by reducing the capacity required for expensive backends. We’ll do a technical deep dive into the infrastructure that supports Pinterest’s cache fleet at scale.

Application Data Caching

Every API request incoming to Pinterest internally fans out to a complex tree of RPCs through the stack, hitting tens of services before completion of its critical path. This can include services for querying core data like boards and Pins, recommendation systems for serving related Pins, and spam detection systems. At many of these layers, the result of a discrete unit of work can be cached in a transient store for future reuse, as long as its input data can be described by a unique key.

At Pinterest, the most common use of the distributed cache layer is storing such results of intermediate computations with lookaside semantics. This allows the cache layer to absorb a significant share of traffic that would otherwise be destined for compute-expensive or storage-expensive services and databases. With both single-digit millisecond tail latency and an extraordinarily low infrastructure dollar cost per request served, the distributed cache layer offers a performant and cost-efficient mechanism to scale a variety of backends to meet growing Pinterest demand.
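
To make the lookaside semantics concrete, here is a minimal Python sketch. The key scheme, the `compute_related_pins` helper, and the endpoint address are hypothetical stand-ins for illustration, not Pinterest’s actual code.

```python
import json

from pymemcache.client.base import Client  # any memcached-protocol client works

# Hypothetical cache endpoint; at Pinterest this would be the local routing proxy.
cache = Client(("127.0.0.1", 11211))

def related_pins(pin_id: int) -> list[int]:
    """Lookaside read: consult the cache first, recompute and backfill on a miss."""
    key = f"related_pins:v1:{pin_id}"                # unique key derived from the input
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                    # hit: skip the expensive backend

    result = compute_related_pins(pin_id)            # miss: do the expensive work
    cache.set(key, json.dumps(result), expire=3600)  # backfill with a TTL
    return result

def compute_related_pins(pin_id: int) -> list[int]:
    # Placeholder for an expensive recommendation-service or database call.
    return [pin_id + i for i in range(1, 4)]
```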

Figure 1: Highly simplified lifecycle of an API request through Pinterest’s main API service, its dependency service backends, and the distributed cache layer.

Offering a distributed cache tier as a service allows application developers to focus on implementing business logic without worrying about distributed data consistency, high availability, or memory capacity. Cache clients use a universal routing abstraction layer that ensures applications have a fault-tolerant and consistent view of data. The server fleet can additionally be scaled out independently of the application layer to transparently adjust memory or throughput capacity to accommodate changes in resource usage profiles.

The Backbone of Distributed Caching: Memcached and Mcrouter

Memcached and mcrouter form the backbone of Pinterest’s distributed caching infrastructure and play a critical role in Pinterest’s storage infrastructure stack. Memcached is an open source, highly efficient, in-memory key-value store written in pure C. Mcrouter is a layer 7 memcached protocol proxy that sits in front of the memcached fleet and provides powerful high availability and routing features.

Memcached is an attractive choice as a caching solution:

  • Thanks in part to its asynchronous event-driven architecture and multithreaded processing model, memcached is extremely efficient and easily amenable to horizontal scaling to meet capacity demands.
  • Extstore helps realize incredible storage efficiency wins with a secondary, warm storage tier located on the instance’s NVMe flash disk.
  • Memcached’s deliberately simple architecture provides flexibility in building abstractions on top of it, as well as easy horizontal scalability to meet increased demand. A single memcached process is a simple key-value store and deliberately has no knowledge of its peers, or even of the notion of a memcached cluster.
  • Memcached has been battle-tested for accuracy and performance over decades of development and is surrounded by an active open source community (which has also accepted several Pinterest patches into upstream).
  • Memcached ships with native support for TLS termination, allowing us to secure the entire fleet with mutual TLS-authenticated traffic (with additional SPIFFE-based authorization access controls built in-house).

Mcrouter was open sourced by Facebook in 2014 and played a crucial role in scaling their memcached deployment. It fits well into Pinterest’s architecture for similar reasons:

  • Mcrouter acts as an effective abstraction of the entire memcached server fleet by providing application developers a single endpoint for interacting with the entire cache fleet. Additionally, using mcrouter as the single interface to the system ensures universal, globally consistent traffic behavior across all services and machines at Pinterest.
  • Mcrouter presents a decoupled control plane and data plane: the entire topology of the memcached server fleet is organized into “pools” (logical clusters), while all request routing policies and behaviors dictating the interaction between clients and server pools are managed independently.
  • Mcrouter’s configuration API provides robust building blocks for complex routing behaviors, including zone-affinity routing, replication for data redundancy, multi-level cache tiers, and shadow traffic.
  • As a layer 7 proxy that speaks memcached’s ASCII protocol, mcrouter exposes intelligent protocol-specific features like request manipulation (TTL modification, in-flight compression, and more).
  • Rich observability features are provided out of the box at no cost to the client application, providing detailed visibility into memcached traffic across all of our infrastructure. Most important to us are percentile request latency, throughput sliced along individual client and server dimensions, request trends by key prefixes and key patterns, and error rates for detecting misbehaving servers.

Figure 2: Overview of request routing from mcrouter to memcached. Every key prefix is associated with a routing policy; two examples are shown.

In practice, mcrouter is deployed as a service-colocated, out-of-process proxy sidecar. As shown in Figure 2, applications (written in any language) send memcached protocol requests to mcrouter on loopback, and mcrouter proxies those requests to thousands of upstream memcached servers. This architecture allows us to build out robust features in a fully managed cache server fleet, entirely transparent to consuming services.
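
Because mcrouter is wire-compatible with memcached’s ASCII protocol, any memcached client (or even a raw socket) can talk to the local sidecar unchanged. The sketch below shows that protocol exchange directly; the loopback port and the single-`recv` response handling are simplifying assumptions for illustration.

```python
import socket

MCROUTER_ADDR = ("127.0.0.1", 5000)  # hypothetical sidecar listen address

def set_key(key: str, value: bytes, ttl: int = 300) -> bool:
    with socket.create_connection(MCROUTER_ADDR) as sock:
        # "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n"
        sock.sendall(f"set {key} 0 {ttl} {len(value)}\r\n".encode() + value + b"\r\n")
        return sock.recv(1024).startswith(b"STORED")

def get_key(key: str) -> bytes | None:
    with socket.create_connection(MCROUTER_ADDR) as sock:
        sock.sendall(f"get {key}\r\n".encode())
        response = sock.recv(4096)                 # assumes the reply fits in one recv
        if response.startswith(b"END"):
            return None                            # miss
        # "VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
        header, _, rest = response.partition(b"\r\n")
        length = int(header.split()[3])
        return rest[:length]
```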

While memcached has been a part of the Pinterest infrastructure stack ever since the early days, our strategy around scaling its client-side counterpart has evolved significantly over the years. In particular, routing and discovery were first handled in client-side libraries (which were brittle and coupled to binary deploys), then by an in-house routing proxy (which didn’t provide extensible building blocks for high availability), and finally by mcrouter.

Compute and Storage Efficiency

Memcached is highly efficient: a single r5.2xlarge EC2 instance is capable of sustaining in excess of 100K requests per second and tens of thousands of concurrent TCP connections without tangible client-side latency degradation, making memcached Pinterest’s most throughput-efficient production service. This is due in part to its well-written C implementation and in part to its architecture, which uses multiple worker threads, each independently running a `libevent`-driven event loop to service incoming connections.

At Pinterest, memcached’s extstore drives huge wins in storage efficiency for use cases ranging from Visual Search to personalized search recommendation engines. Extstore expands cached data capacity onto a locally mounted NVMe flash disk in addition to DRAM, which increases available per-instance storage capacity from ~55 GB (r5.2xlarge) to nearly 1.7 TB (i3.2xlarge) for a fraction of the instance cost. In practice, extstore has benefitted data capacity-bound use cases without sacrificing end-to-end latency, despite the several-orders-of-magnitude difference between DRAM and SSD response times. Extstore’s built-in tuning knobs have allowed us to identify a sweet spot that balances disk I/O, disk-to-memory recache rate, compaction frequency and aggressiveness, and client-side tail response time.

High Availability

All infrastructure systems at Pinterest are highly available, and our caching systems are no exception. Leveraging rich routing features in mcrouter, our memcached fleet has a wide array of fault tolerance features:

  • Automatic failover for partially degraded or completely offline servers. Networks are inherently flaky and lossy; the entire caching stack assumes this to be a non-negotiable fact and is designed to maintain availability when servers are unavailable or slow. Fortunately, cache data is transient by nature, which relaxes the durability requirements that would otherwise apply to persistent stores like databases. Within Pinterest, mcrouter automatically fails over requests to a globally shared cluster when individual servers are offline or responding too slowly, and it automatically brings servers back into the serving pool through active health checks (a conceptual sketch of this behavior follows this list). Combined with rich proxy-layer instrumentation on individual server failures, this allows operators to identify and replace misbehaving servers with minimal production downtime.
  • Data redundancy through transparent cross-zone replication. Critical use cases are replicated across multiple clusters spanning distinct AWS Availability Zones (AZs). This allows total loss of an AZ with zero downtime: all requests are automatically redirected to a healthy replica located in a separate AZ, where a complete redundant copy of the data is available.
  • Isolated shadow testing against real production traffic. Traffic routing features within mcrouter allow us to perform a variety of resiliency exercises including cluster-to-cluster dark traffic and artificial latency and downtime injection against real production requests without impacting production.
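
Failover itself happens inside mcrouter and is transparent to clients; the Python sketch below only illustrates the concept from the first bullet (fail over to a shared pool on errors or slow responses, then reinstate the primary after a health probe succeeds). Class names, thresholds, and the probe key are all hypothetical.

```python
import time

class FailoverRoute:
    """Conceptual sketch of failover routing (not mcrouter's implementation)."""

    def __init__(self, primary, failover, slow_ms=50.0, probe_interval_s=30.0):
        self.primary, self.failover = primary, failover
        self.slow_ms, self.probe_interval_s = slow_ms, probe_interval_s
        self.primary_down = False
        self.next_probe = 0.0

    def get(self, key):
        now = time.monotonic()
        if self.primary_down and now >= self.next_probe:
            self.next_probe = now + self.probe_interval_s
            self.primary_down = not self._probe()       # active health check
        if not self.primary_down:
            start = time.monotonic()
            try:
                value = self.primary.get(key)
                if (time.monotonic() - start) * 1000 <= self.slow_ms:
                    return value
            except OSError:
                pass
            # Error or slow response: mark the primary down and fall through.
            self.primary_down = True
            self.next_probe = time.monotonic() + self.probe_interval_s
        return self.failover.get(key)                   # globally shared failover pool

    def _probe(self) -> bool:
        try:
            self.primary.get("__health_check__")        # hypothetical probe key
            return True
        except OSError:
            return False
```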

Load Balancing and Data Sharding

One of the key features of a distributed system is horizontal scalability — the ability to scale out rather than scale up to accommodate additional traffic growth. At Pinterest, the vast majority of our caching workloads are throughput-bound, requiring the number of instances in a cluster to scale roughly in proportion to the volume of inbound requests. However, memcached itself is an extremely simple key-value store that has no knowledge of other peers in the cluster. How are hundreds of millions of requests per second actually routed through the network to the right servers?

Mcrouter applies a hashing algorithm against the cache key of every incoming request to deterministically shard the request to one host within a pool. This works well for evenly distributing traffic among servers, but memcached has a unique requirement that its clusters need to be arbitrarily scalable — operators need to be able to freely adjust cluster capacity in response to changing traffic demands while minimizing the client-side impact.

Consistent hashing ensures that most of a keyspace partition maps to the same server even as the total number of eligible shards increases or decreases. This allows the system to scale out transparently to the client layer due to highly localized and predictable hit rate impact, thus reducing the chance that small changes in capacity cause catastrophic drops in cluster-wide hit rate.

Figure 3: Consistent hashing keeps most of the keyspace server allocation intact when one node is added to an existing pool.
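
A toy consistent-hash ring in Python illustrates why capacity changes stay localized; this is a generic sketch, not mcrouter’s actual hash functions or pool configuration.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring (a sketch, not mcrouter's algorithm).

    Each server maps to many points on a hash ring; a key is owned by the first
    server point at or after the key's hash. Adding or removing one server only
    remaps the keyspace slices adjacent to that server's points.
    """

    def __init__(self, servers, points_per_server=100):
        self._ring = []                                  # sorted (hash, server) points
        for server in servers:
            for i in range(points_per_server):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

# Adding one node leaves most keys mapped to the same server as before.
old = ConsistentHashRing(["memc-001", "memc-002", "memc-003"])
new = ConsistentHashRing(["memc-001", "memc-002", "memc-003", "memc-004"])
keys = [f"pin:{i}" for i in range(10_000)]
moved = sum(old.server_for(k) != new.server_for(k) for k in keys)
print(f"{moved / len(keys):.1%} of keys moved")          # roughly 1/4, not 3/4
```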

The client-side routing layer maps a single key prefix to one or more such consistently hashed pools behind one of several routing policies, including AZ-affinity preference routing for cross-AZ replicated clusters, L1L2 routing for in-memory clusters backed by a fallthrough flash-based capacity cluster, and more. This allows isolating traffic and thus allocating capacity by client use case, and it ensures consistent cache routing behavior from any client machine in Pinterest’s fleet.
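
The sketch below makes the prefix-to-pool mapping concrete; the prefixes, pool names, and policy labels are invented for illustration and are not mcrouter’s configuration syntax.

```python
# Illustrative routing table: key prefix -> (policy, pools). All names are invented.
ROUTE_TABLE = {
    "ads:":    ("az_affinity", ["ads-use1a", "ads-use1b", "ads-use1c"]),
    "search:": ("l1_l2",       ["search-memory-l1", "search-extstore-l2"]),
    "":        ("single_pool", ["shared-default"]),   # catch-all default
}

def route(key: str):
    """Pick the routing policy and candidate pools for a key by longest matching prefix."""
    for prefix in sorted(ROUTE_TABLE, key=len, reverse=True):
        if key.startswith(prefix):
            return ROUTE_TABLE[prefix]
    raise KeyError(key)

# An "az_affinity" policy would prefer the replica in the caller's own AZ and fail
# over to the others; an "l1_l2" policy would read the in-memory L1 first and fall
# through to the flash-backed L2 capacity pool on a miss.
print(route("search:user:42"))   # -> ('l1_l2', ['search-memory-l1', 'search-extstore-l2'])
```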

Tradeoffs and Considerations

All sufficiently complex infrastructure systems are characterized by (often highly nuanced) trade-offs. In the course of building out and scaling our caching systems, we weighed the costs and benefits of many trade-offs. A few are highlighted below:

  • An intermediary proxy layer presents a non-trivial additional amount of both compute and I/O overhead, especially for a performance-critical system with tight latency SLOs. However, the high availability abstractions, flexible routing behaviors, and many other features provided by mcrouter far outweigh the performance penalties.
  • A globally shared proxy configuration presents risks in change rollouts, since all control plane changes are applied across the entire fleet of tens of thousands of machines at Pinterest on deployment. However, this also ensures globally consistent knowledge of the memcached fleet topology and associated routing policies, regardless of where or how a client is deployed within Pinterest.
  • We operate ~100 distinct memcached clusters, many of which have different tenancy characteristics (dedicated versus shared), hardware instance types, and routing policies. While this presents a sizable maintenance burden on the team, it also allows for effective performance and availability isolation per use case, while also providing opportunities for efficiency optimization by choosing parameters and instance types most appropriate for a particular workload’s usage profile.
  • Leveraging a consistent hashing scheme for load distribution among a pool of upstream servers works well for the majority of cases, even if the keyspace is characterized by clusters of similarly-prefixed keys. However, this doesn’t solve problems with hot keys — an abnormal increase in request volume for a specific set of keys still results in load imbalance caused by hot shard(s) in the server cluster.

Future Work

Looking forward, we hope to continue delivering improved efficiency, reliability, and performance of Pinterest’s cache infrastructure. This includes experimental projects like embedding memcached core directly into a host application process for performance-critical use cases (allowing memcached to share memory space with the service process and eliminating network and I/O overhead), and reliability projects like designing a robust solution for multi-region redundancy.

Thanks to the entire Storage and Caching team at Pinterest for supporting this work, especially Ankita Girish Wagh and Lianghong Xu.
