How Pinterest runs Kafka at scale

Yu Yang | Pinterest engineer, Data Engineering

Pinterest runs one of the largest Kafka deployments in the cloud. We use Apache Kafka extensively as a message bus to transport data and to power real-time streaming services, ultimately helping more than 250 million Pinners around the world discover and do what they love.

As mentioned in an earlier post, we use Kafka to transport data to our data warehouse, including critical events like impressions, clicks, close-ups, and repins. We also use Kafka to transport visibility metrics for our internal services. If the metrics-related Kafka clusters have any glitches, we can’t accurately monitor our services or generate alerts that signal issues. On the real-time streaming side, Kafka is used to power many streaming applications, such as fresh content indexing and recommendation, spam detection and filtering, real-time advertiser budget computation, and so on.

We’ve shared our experiences at Kafka Summit 2018 on incremental database ingestion using Kafka and on building real-time ads platforms using Kafka Streams. With >2,000 brokers running on Amazon Web Services, transporting >800 billion messages and >1.2 petabytes per day, and handling >15 million messages per second during peak hours, we’re often asked about our Kafka setup and how to operate Kafka reliably in the cloud. We’re taking this opportunity to share our learnings.
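
As a quick sanity check on these figures, the short Python calculation below derives the average message rate, average message size, and peak-to-average ratio they imply. It treats the quoted ">" values as exact lower bounds and assumes decimal petabytes, so the results are rough estimates rather than measured numbers.

```python
# Back-of-the-envelope arithmetic on the lower-bound figures quoted above.
MESSAGES_PER_DAY = 800e9         # >800 billion messages per day
BYTES_PER_DAY = 1.2e15           # >1.2 petabytes per day (decimal PB assumed)
PEAK_MESSAGES_PER_SEC = 15e6     # >15 million messages per second at peak
SECONDS_PER_DAY = 24 * 60 * 60

avg_messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_message_bytes = BYTES_PER_DAY / MESSAGES_PER_DAY
avg_bytes_per_sec = BYTES_PER_DAY / SECONDS_PER_DAY
peak_to_average = PEAK_MESSAGES_PER_SEC / avg_messages_per_sec

print(f"average rate:    {avg_messages_per_sec / 1e6:.2f} million messages/s")
print(f"average size:    {avg_message_bytes:.0f} bytes/message")
print(f"average volume:  {avg_bytes_per_sec / 1e9:.2f} GB/s")
print(f"peak-to-average: {peak_to_average:.2f}x")
```

Under these assumptions the averages come out to roughly 9.3 million messages per second and about 1,500 bytes per message, so the quoted peak is around 1.6 times the daily average.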

Pinterest Kafka setup

Figure 1 shows the Pinterest Kafka service setup. Currently we have Kafka in three AWS regions. Most of the Kafka brokers are in the us-east-1 region, and we have smaller footprints in us-east-2 and eu-west-1. We use MirrorMaker to transport data among these three regions. In each region, we spread the brokers among multiple clusters for topic-level isolation, so that one cluster failure only affects a limited number of topics. We limit the maximum size of each cluster to 200 brokers.

We currently use d2.2xlarge as the default broker instance type, which works well for most Pinterest workloads. We also have a few small clusters that use d2.8xlarge instances for high-fanout reads. Before settling on d2 instances with local storage, we experimented with using Elastic Block Store st1 (throughput optimized hard drives) for our Kafka workloads. We found that the d2 instances with local storage performed better than EBS st1 storage.

Figure 1. Pinterest Kafka setup

We have default.replication.factor set to 3 to protect us against up to two broker failures in one cluster. As of November 2018, AWS Spread Placement Groups limit running instances per availability zone per group to seven. Because of this limit, we cannot leverage spread placement groups to guarantee that replicas are allocated to different physical hosts in the same availability zone. Instead, we spread the brokers in each Kafka cluster among three availability zones, and ensure that replicas of each topic partition are spread among the availability zones to withstand up to two broker failures per cluster.
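
The zone-aware placement described above can be pictured with the following minimal Python sketch. It is not Pinterest's assignment code: the function name, zone names, and broker ids are hypothetical, and the rotation scheme is just one simple way to guarantee that each partition's replicas land in distinct availability zones. Kafka's own rack awareness (the broker.rack broker setting) is one existing mechanism for this kind of placement; the post does not say which mechanism is used.

```python
def assign_replicas(brokers_by_zone, num_partitions, replication_factor=3):
    """Return {partition: [broker_id, ...]} with every replica in a distinct zone.

    brokers_by_zone maps a zone name to a non-empty list of broker ids.
    """
    zones = sorted(brokers_by_zone)
    if replication_factor > len(zones):
        raise ValueError("need at least one availability zone per replica")
    if any(not brokers_by_zone[z] for z in zones):
        raise ValueError("every zone needs at least one broker")

    assignment = {}
    for partition in range(num_partitions):
        replicas = []
        for i in range(replication_factor):
            # Rotate the starting zone so first replicas spread across zones.
            zone = zones[(partition + i) % len(zones)]
            zone_brokers = brokers_by_zone[zone]
            # Rotate within the zone so load spreads across its brokers.
            replicas.append(zone_brokers[(partition // len(zones)) % len(zone_brokers)])
        assignment[partition] = replicas
    return assignment


# Hypothetical cluster: two brokers in each of three zones.
brokers = {"us-east-1a": [1, 2], "us-east-1b": [3, 4], "us-east-1c": [5, 6]}
for partition, replicas in assign_replicas(brokers, num_partitions=6).items():
    print(partition, replicas)
```

With one replica per zone, losing any two brokers (or even an entire zone) still leaves at least one replica of every partition.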

Kafka cluster auto-healing

With thousands of brokers running in the cloud, we have broker failures almost every day. Handling each failure used to require manual work, which added significant operational overhead to the team. In 2017, we built and open-sourced DoctorKafka, a Kafka operations automation service that performs partition reassignment when brokers fail.

It turned out that partition reassignment alone is not sufficient. In January 2018, we encountered broker failures caused by degraded hardware that partition reassignment could not heal. When the underlying physical machines were degraded, the brokers ran into unexpected bad states. Although DoctorKafka can assign topic partitions on the failed brokers to other brokers, producers and consumers from dependent services may still try to talk to the failed or degraded broker, resulting in issues in the dependent services. Replacing failed brokers quickly is important for guaranteeing Kafka service quality.
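
To make the partition reassignment step concrete, here is a deliberately naive Python sketch that swaps a failed broker out of every replica list and emits a plan in the JSON layout accepted by Kafka's kafka-reassign-partitions tool. The function name, topic names, and broker ids are hypothetical. DoctorKafka's actual algorithm is not described in this post and may differ substantially; in particular, this sketch ignores availability zones and network or disk load.

```python
import json


def plan_reassignment(assignment, failed_broker, healthy_brokers):
    """Replace failed_broker in each replica list with the least-loaded healthy broker.

    assignment maps (topic, partition) -> list of broker ids.
    """
    healthy = [b for b in healthy_brokers if b != failed_broker]
    # Count replicas already hosted by each healthy broker.
    load = {b: 0 for b in healthy}
    for replicas in assignment.values():
        for b in replicas:
            if b in load:
                load[b] += 1

    moves = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if failed_broker not in replicas:
            continue
        # A broker may hold at most one replica of a given partition.
        candidates = [b for b in healthy if b not in replicas]
        if not candidates:
            raise RuntimeError(f"no spare broker for {topic}-{partition}")
        target = min(candidates, key=lambda b: (load[b], b))
        load[target] += 1
        new_replicas = [target if b == failed_broker else b for b in replicas]
        moves.append({"topic": topic, "partition": partition, "replicas": new_replicas})
    return {"version": 1, "partitions": moves}


# Hypothetical cluster state: broker 3 has failed, broker 5 is a spare.
current = {("events", 0): [1, 2, 3], ("events", 1): [2, 3, 4], ("clicks", 0): [3, 4, 1]}
plan = plan_reassignment(current, failed_broker=3, healthy_brokers=[1, 2, 4, 5])
print(json.dumps(plan, indent=2))
```

A plan like this only moves data. As the paragraph above notes, clients may still try to reach the degraded broker, which is why reassignment alone did not fully heal the cluster.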

In Q1 2018, we improved DoctorKafka with a broker replacement feature that allows it to replace failed brokers automatically using user-provided scripts, which has helped us protect the Kafka clusters against unforeseeable issues. Replacing too many brokers in a short period of time can cause data loss, as our clusters only store three replicas of data. To address this issue, we built a rate limiting feature in DoctorKafka that allows it to replace at most one broker per cluster within a given period of time.
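
The rate limiting idea can be illustrated with a small Python sketch that permits at most one replacement per cluster within a configurable interval. The class name, method name, cluster names, and one-hour interval are all hypothetical; this is not DoctorKafka's implementation, only a minimal example of the technique.

```python
import time


class ReplacementRateLimiter:
    """Allow at most one broker replacement per cluster per interval."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self._min_interval = min_interval_seconds
        self._clock = clock              # injectable so the limiter can be tested
        self._last_replacement = {}      # cluster name -> time of last replacement

    def try_acquire(self, cluster):
        """Return True and record the time if a replacement may proceed now."""
        now = self._clock()
        last = self._last_replacement.get(cluster)
        if last is not None and now - last < self._min_interval:
            return False                 # too soon since the last replacement
        self._last_replacement[cluster] = now
        return True


# Hypothetical usage: one replacement per cluster per hour.
limiter = ReplacementRateLimiter(min_interval_seconds=3600)
for cluster in ("metrics-01", "metrics-01", "logging-01"):
    if limiter.try_acquire(cluster):
        print(f"{cluster}: replacement allowed")
    else:
        print(f"{cluster}: replacement deferred")
```

Tracking the limit per cluster keeps a burst of failures in one cluster from triggering several replacements there at once, while still letting other clusters heal independently.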

It’s also worth noting that the AWS EC2 API allows users to replace instances while keeping hostnames and IP addresses unchanged, which enables us to minimize the impact of broker replacement on dependent services. We’ve since been able to reduce Kafka-related alerts by >95% and keep >2,000 brokers running in the cloud with minimal human intervention. See here for our broker replacement configuration in DoctorKafka.

Working with the Kafka open source community

The Kafka open source community has been active in developing new features and fixing known issues. We set up an internal build to continuously pull the latest Kafka changes in release branches and push them into production on a monthly cadence.

We’ve also improved Kafka ourselves and contributed the changes back to the community. Recently, Pinterest engineers have made the following contributions to Kafka:

  • KIP-91 Adding delivery.timeout.ms to Kafka producer
  • KIP-245 Use Properties instead of StreamsConfig in KafkaStreams constructor
  • KAFKA-6896 Export producer and consumer metrics in Kafka Streams
  • KAFKA-7023 Move prepareForBulkLoad() call after customized RocksDBConfigSetters
  • KAFKA-7103 Use bulk loading for RocksDBSegmentedBytesStore during init

We’ve also proposed several Kafka Improvement Proposals that are under discussion:

  • KIP-276 Add config prefix for different consumers
  • KIP-300 Add windowed KTable API
  • KIP-345 Reduce consumer rebalances through static membership

Next steps

Although we’ve made improvements to scale the Kafka service at Pinterest, many interesting problems need to be solved to bring the service to the next level. For instance, we’ll be exploring Kubernetes as an abstraction layer for Kafka at Pinterest.

We’re currently investigating using two availability zones for Kafka clusters to reduce interzone data transfer costs, since the chance of two simultaneous availability zone failures is low. AWS’s latest-generation instance types are EBS optimized, with dedicated EBS bandwidth and better network performance than previous generations. As such, we’ll evaluate these instance types with EBS storage for faster Kafka broker recovery.

Pinterest engineering has many interesting problems to solve, from building scalable, reliable, and efficient infrastructure to applying cutting edge machine learning technologies to help Pinners discover and do what they love. Check out our open engineering roles and join us!

Acknowledgements: Huge thanks to Henry Cai, Shawn Nguyen, Yi Yin, Liquan Pei, Boyang Chen, Eric Lopez, Robert Claire, Jayme Cox, Vahid Hashemian, and Ambud Sharma, who improved the Kafka service at Pinterest.

Appendix:

1. The Kafka broker settings that we use with d2.2xlarge instances. We list only the settings that differ from the Kafka default values.

2. The following are the Pinterest Kafka Java parameters.

We enable TLS access for Kafka at Pinterest. As of Kafka 2.0.0, each KafkaChannel with an SSL connection costs ~122 KB of memory, and Kafka may accumulate a large number of unclosed KafkaChannels due to frequent reconnection (see KAFKA-7304 for details). We use an 8GB heap size to minimize the risk of having Kafka run into long-pause GC. We used a 4GB heap size for the Kafka process before enabling TLS.

Figure 2. The size of a KafkaChannel object with an SSL connection.
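
To see why the per-channel cost matters for heap sizing, the following Python arithmetic estimates how much memory a given number of lingering SSL KafkaChannels would occupy. The channel counts are hypothetical examples, not measurements from Pinterest's clusters, and interpreting the ~122K figure as 122 KiB is an assumption.

```python
# Rough heap arithmetic based on the ~122K-per-channel figure quoted above.
BYTES_PER_SSL_CHANNEL = 122 * 1024   # assumption: "K" means KiB
GIB = 1024 ** 3


def channel_memory_bytes(num_channels):
    """Estimated memory held by num_channels SSL KafkaChannel objects."""
    return num_channels * BYTES_PER_SSL_CHANNEL


def channels_to_fill(heap_bytes):
    """Number of channels whose combined size would equal the whole heap."""
    return heap_bytes // BYTES_PER_SSL_CHANNEL


# Hypothetical channel counts, for illustration only.
for n in (1_000, 10_000, 30_000):
    used = channel_memory_bytes(n)
    print(f"{n:>6} channels: {used / GIB:.2f} GiB "
          f"({used / (4 * GIB):.0%} of a 4 GiB heap, {used / (8 * GIB):.0%} of an 8 GiB heap)")

print("channels to fill a 4 GiB heap:", channels_to_fill(4 * GIB))
print("channels to fill an 8 GiB heap:", channels_to_fill(8 * GIB))
```

Under these assumptions, ten thousand unclosed channels already occupy more than a gigabyte, which is consistent with the move from a 4GB to an 8GB heap after enabling TLS.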