Scaling Apache Kafka at Pinterest

Pinterest uses Apache Kafka to transport data for real-time streaming applications, logging, and visibility metrics for monitoring. Hosted on AWS, Pinterest's Kafka installation uses the MirrorMaker and DoctorKafka tools for replication and high availability.

Pinterest's Kafka installation runs on more than 2,000 brokers spread over three AWS regions, wrote Yu Yang, tech lead at Pinterest, and handles more than 800 billion messages and 1.2 petabytes per day. Its primary Kafka toolset includes Kafka's MirrorMaker and Pinterest's own DoctorKafka. MirrorMaker consumes data from a source cluster and publishes it to a target cluster, effectively creating a replica of the source cluster. Pinterest's team uses it to spread data among the three AWS regions. Most of the brokers are in the us-east-1 region, which, being the oldest AWS region, has its share of issues. Kafka brokers in each cluster are spread out among three availability zones, and the replicas of each topic partition are placed in different zones so that a cluster can withstand the failure of up to two brokers.
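The zone-aware replica placement described above can be illustrated with a minimal sketch. It is not Pinterest's or Kafka's actual assignment code; the function name `assign_replicas`, the broker IDs, and the zone labels are hypothetical, and the sketch assumes a replication factor of three with at least one broker per zone.

```python
from itertools import cycle


def assign_replicas(brokers_by_zone, num_partitions, replication_factor=3):
    """Return {partition: [broker_id, ...]} with every replica in a distinct zone."""
    zones = sorted(brokers_by_zone)
    if replication_factor > len(zones):
        raise ValueError("need at least one zone per replica")
    # One round-robin cursor per zone so load spreads over that zone's brokers.
    cursors = {z: cycle(sorted(brokers_by_zone[z])) for z in zones}
    assignment = {}
    for p in range(num_partitions):
        # Rotate the starting zone so the first replica also moves between zones.
        chosen = [zones[(p + i) % len(zones)] for i in range(replication_factor)]
        assignment[p] = [next(cursors[z]) for z in chosen]
    return assignment


# Hypothetical cluster: two brokers in each of three availability zones.
brokers_by_zone = {
    "us-east-1a": [1, 2],
    "us-east-1b": [3, 4],
    "us-east-1c": [5, 6],
}
assignment = assign_replicas(brokers_by_zone, num_partitions=6)
for partition, replicas in assignment.items():
    print(partition, replicas)
```

Because each partition has three replicas on three different brokers, any two broker failures still leave one replica available, and the loss of a whole zone removes at most one replica per partition.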

Kafka broker failures are common. Replacing failed brokers and rebalancing workloads "requires carefully creating and editing the partition reassignment files and manually executing Kafka script commands", wrote Yang in a previous article. The result was DoctorKafka, an open source tool that automates these steps. DoctorKafka can detect failures and automatically reassign workloads to healthy brokers. It is based on a master-agent model: an agent runs on each broker and collects metrics, and a central master server analyzes them, determines which brokers have failed, and runs commands for corrective action. DoctorKafka is "conservative" in the sense that it takes corrective action only when it is confident; otherwise, it sends out alerts. Most large Kafka deployments use a replication strategy, based on either MirrorMaker or similar tools.
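To make the manual step concrete, here is a minimal sketch of generating a partition reassignment file that swaps a failed broker out of every replica list, together with a toy "conservative" policy. It is not DoctorKafka's implementation: `plan_reassignment`, `decide`, the topic name, and the missed-heartbeat threshold are hypothetical. The output follows the JSON layout that Kafka's `kafka-reassign-partitions.sh` tool reads.

```python
import json


def plan_reassignment(topic, assignment, failed_broker, healthy_brokers):
    """Build a reassignment plan that replaces failed_broker wherever it appears.

    assignment: {partition: [broker_id, ...]}
    healthy_brokers: replacement candidates, assumed sorted least loaded first.
    """
    partitions = []
    for partition, replicas in sorted(assignment.items()):
        if failed_broker not in replicas:
            continue
        # First healthy broker that does not already hold this partition.
        replacement = next((b for b in healthy_brokers if b not in replicas), None)
        if replacement is None:
            raise ValueError(f"no replacement available for partition {partition}")
        new_replicas = [replacement if b == failed_broker else b for b in replicas]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": partitions}


def decide(missed_heartbeats, threshold=3):
    """Toy conservative policy: act only after sustained silence, else alert."""
    if missed_heartbeats == 0:
        return "healthy"
    return "reassign" if missed_heartbeats >= threshold else "alert"


# Hypothetical state: broker 5 has stopped reporting metrics.
assignment = {0: [1, 3, 5], 1: [4, 6, 2], 2: [5, 1, 3]}
plan = plan_reassignment("events", assignment, failed_broker=5,
                         healthy_brokers=[2, 4, 6, 1, 3])
print(json.dumps(plan, indent=2))
print(decide(missed_heartbeats=1), decide(missed_heartbeats=3))
```

The sketch ignores availability-zone constraints when picking a replacement; a real tool would also have to keep the replicas of each partition in separate zones.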

Pinterest runs Kafka on AWS d2.2xlarge instances. According to Yang, the team moved from c3.2xlarge instances with throughput-optimized st1 EBS disks to d2 instances with local storage because of performance issues caused by EBS contention. Others, however, have reported the opposite result in their benchmarks. Kafka also forms the backbone of Pinterest's logging infrastructure, which handles 100+ TB of data per day. Services write data to disk, from where it is picked up by a log agent called Singer, which writes it to Kafka. Secor, another custom tool, picks up the log messages from Kafka and persists them to S3 to overcome S3's "weak eventual consistency model".
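The first hop of that pipeline, a log agent that picks up what services have written to disk, can be sketched as follows. This is not Singer's code: `ship_new_lines` is a hypothetical helper, and the `publish` callback stands in for a Kafka producer's send call. The sketch ships only complete lines and remembers a byte offset so that a later pass resumes where the previous one stopped.

```python
import os
import tempfile


def ship_new_lines(path, offset, publish):
    """Publish complete lines appended to path since byte offset; return new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            # Stop at end of file or at a partial line still being written.
            if not line.endswith(b"\n"):
                break
            publish(line.rstrip(b"\n"))
            offset = f.tell()
    return offset


sent = []  # stand-in for messages handed to a Kafka producer
with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "service.log")
    with open(log, "wb") as f:
        f.write(b"first\nsecond\npart")  # last line is incomplete
    offset = ship_new_lines(log, 0, sent.append)
    with open(log, "ab") as f:
        f.write(b"ial\n")  # the service finishes the line
    offset = ship_new_lines(log, offset, sent.append)

print(sent, offset)
```

Persisting the offset between runs is what would let such an agent restart without re-sending or skipping lines; that bookkeeping is left out here.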

In the future, Pinterest aims to explore Kubernetes as an abstraction layer for its Kafka deployments, which some organizations are already doing. Some services at Pinterest have already moved to containers. Another goal is to explore EBS again for storage, as the newer EBS offerings are better optimized.
