
Grab Reduces Traffic Cost for Kafka Consumers on AWS to Zero


Grab took advantage of the ability of Apache Kafka consumers to connect to the broker node in the same availability zone (AZ) introduced in Kafka 2.3 and reduced the traffic cost on AWS to zero for reconfigured consumers. The change has substantially reduced overall infrastructure costs for running Apache Kafka on AWS.

Grab has created a streaming data platform around Apache Kafka that supports all of the company's products. Following Kafka best practices, their initial configuration used three replicas for each Kafka partition, spanning three different availability zones within the AWS region. The team responsible for this platform observed that cross-AZ traffic accounted for half the cost of their Kafka platform, since AWS charges for cross-AZ data transfer.

Fabrice Harbulot and Quang Minh Tran remark on the cost considerations of the initial setup:

The problem with this design is that it generates staggering cross-AZ network traffic. This is because, by default, Kafka clients communicate only with the partition leader, which has a 67% probability of residing in a different AZ.

The overall cross-AZ traffic combines newly-published messages, data replication between brokers, and messages fetched by consumers.

The default consumer configuration where consumers fetch data from the partition leader (Source: Grab Engineering Blog)

Since Apache Kafka 2.3, it’s possible to configure consumers to fetch from partition replicas. That way, if consumers fetch messages only from the brokers in the same AZ, they can do so without incurring data transfer costs.

This feature requires that both the Kafka brokers and consumers are made aware of the AZ in which they reside. For Kafka brokers, the team configured broker.rack with the value of the AZ ID (az1, az2, az3, etc.), which differs from the AZ name (1a, 1b, 1c, etc.), as the latter is inconsistent across AWS accounts. They also set the parameter replica.selector.class to org.apache.kafka.common.replica.RackAwareReplicaSelector.
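A minimal broker-side sketch of this setup, as it might appear in each broker's server.properties (the AZ ID value az1 is illustrative; on AWS, full AZ IDs take forms like use1-az1):

```properties
# Broker running in availability zone with AZ ID "az1".
# The value must match the client.rack value consumers in that AZ use.
broker.rack=az1

# Enables rack-aware replica selection so consumers with client.rack set
# are directed to the replica in their own AZ (available since Kafka 2.4,
# building on the fetch-from-follower capability introduced in 2.3).
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
```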

On the consumer side, the team updated the internal Kafka SDK to configure the client.rack parameter with the AZ ID based on EC2 host metadata so that application teams can enable the functionality by exporting an environment variable.
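Grab's internal SDK is not public, but the same idea can be sketched in Python using only the standard library. The KAFKA_CLIENT_RACK environment variable name is a hypothetical stand-in for whatever override the SDK exposes; the metadata lookup uses the standard EC2 IMDSv2 endpoints, and the resulting dict follows the librdkafka-style configuration accepted by clients such as confluent-kafka:

```python
import os
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def resolve_az_id() -> str:
    """Return this host's AZ ID (e.g. 'use1-az1').

    Prefers a KAFKA_CLIENT_RACK environment variable (a hypothetical
    override name for this sketch) so application teams can opt in by
    exporting a variable; otherwise falls back to the EC2 instance
    metadata service (IMDSv2).
    """
    az_id = os.environ.get("KAFKA_CLIENT_RACK")
    if az_id:
        return az_id
    # IMDSv2 requires a session token before reading metadata.
    token_req = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    token = urllib.request.urlopen(token_req, timeout=2).read().decode()
    az_req = urllib.request.Request(
        f"{IMDS_BASE}/meta-data/placement/availability-zone-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return urllib.request.urlopen(az_req, timeout=2).read().decode()


def consumer_config(bootstrap_servers: str, group_id: str) -> dict:
    """Build a consumer config with client.rack set to the local AZ ID."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Setting client.rack enables fetching from the closest replica
        # when brokers are configured with the RackAwareReplicaSelector.
        "client.rack": resolve_az_id(),
    }
```

With the broker's broker.rack and the consumer's client.rack values matching, the broker steers fetch requests to the in-AZ replica, avoiding cross-AZ transfer charges for consumption.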

The custom consumer configuration where consumers fetch data from the closest replica (Source: Grab Engineering Blog)

After rolling out the new setup to some services, the team observed cross-AZ traffic costs go down and discovered some noteworthy side effects. Firstly, end-to-end latency increased by up to 500 milliseconds, which is expected: most consumers now fetch messages from replicas, so replication time adds to the delivery delay. Latency-sensitive data flows should therefore always fetch from the partition leader, even with the extra cost involved.

Secondly, in the case of broker maintenance, consumers fetching messages directly from replicas may experience brokers being unavailable during downtime, so they should wait/retry until the broker in the same AZ comes back online. Lastly, the team observed a skew in the brokers' load related to the number of consumers across AZs, which means that even distribution of consumers is essential to ensure a balanced load on brokers.


Community comments

  • Can the replication cost be reduced?

    by Dravid Bhushan,

    So, as I understand it, this will reduce the cost of consumers accessing cross-region data; however, the replication cost will still be incurred. Is there any way to reduce that?

  • Re: Can the replication cost be reduced?

    by Rafal Gancarz,

    Hi Dravid, thanks for your question. The technique described here applies only to Kafka consumer cross-AZ traffic (so within the region rather than between regions) when running Kafka on EC2 in AWS or another cloud provider that charges for cross-AZ traffic between compute instances.

    If cross-AZ replication traffic cost is a big concern for you, you can reconsider your availability/resiliency requirements to determine if, for some topics, you could use a replication factor of 2 rather than the recommended 3, which comes with some caveats (see docs.aws.amazon.com/msk/latest/developerguide/b...).

    Another option to consider, assuming you are on AWS, is to use AWS MSK, where inter-broker traffic is free (or included in the price of the MSK service; see aws.amazon.com/msk/pricing/). Lastly, you could look into products compatible with Kafka that offer storage in S3 or other blob storage cloud services (with those, you will trade increased end-to-end latency and the lack of advanced Kafka features for lower operational costs).

  • Re: Can the replication cost be reduced?

    by Srinivas Devaki,

    WarpStream seems to remove the inter-AZ cost as well, but it's not exactly Kafka; rather, it's a Kafka-protocol-compatible gateway/proxy.
