Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage News Allegro Uses Control Theory for Workload Balancing in its Apache Kafka PubSub Platform

Allegro Uses Control Theory for Workload Balancing in its Apache Kafka PubSub Platform

Allegro, the largest eCommerce platform in Poland, implemented dynamic workload balancing in Hermes, its open-source publish-subscribe message broker built on top of Apache Kafka. The new workload balancing algorithm achieves more uniform resource utilization and lower infrastructure costs across heterogeneous hardware.

Allegro has created Hermes to support its messaging needs across hundreds of polyglot microservices that make its eCommerce platform. Hermes provides the means for publishing messages via the REST interface (Hermes Frontend) but internally stores messages in Apache Kafka. It also handles message consumption from Kafka with Hermes Consumers component, providing reliability with retries, backpressure, and rate limiting. Consumers can be configured with a list of subscriber services and their HTTP endpoints that will be receiving messages from specified topics. Hermes then handles Apache Kafka consumer group subscriptions, ensuring messages from Kafka topic partitions are delivered to a configurable number of consumer instances.

Hermes Architecture (Source: Allegro Technology Blog)

One of the key features of Hermes Consumers is the ability to enable horizontal scalability for message consumption, given the unpredictable load patterns and ephemeral nature of cloud computing resources. A mechanism called a "workload balancer" is responsible for monitoring the state of the cluster and proposing adjustments in the consumer distribution.

Piotr Rżysko, senior software engineer at Allegro, explains the deficiencies of the initial design of the workload balancer:

Our first implementation of the workload balancer aimed to always assign the same number of subscriptions to each consumer. This strategy is easy to understand and performs optimally when subscriptions are equal with respect to their load. However, this is not always the case.

Specifically, the team has observed that the CPU usage across Hermes Consumer instances wasn’t evenly distributed, resulting in significant variations and forcing the team to use the highest utilization for scaling purposes to provide sufficient headroom necessary to absorb traffic fluctuations. This meant that many instances were underutilized, wasting resources and generating costs.

Uneven CPU Utilization for Hermes Consumers (Source: Allegro Technology Blog)

The team decided to improve the workload balancing algorithm while preserving the core responsibilities of the original balancer around redistributing for added/removed consumers and subscriptions. They also wanted to ensure an even number of subscriptions per consumer to support the threading model used in Hermes Consumers. Lastly, the team aimed to minimize the number of Kafka consumer group rebalances to avoid gaps in message delivery.

In the first attempt, the developers introduced weighting and stabilization delay into the algorithm and used exponentially weighted moving average (EWMA) to smooth out traffic fluctuations. The initial set of changes has narrowed down the variation in CPU utilization but fell short of the desired outcome.

The team discovered that the uneven CPU utilization was down to differences in instance types used. They employed the Control Theory concept, called a proportional controller, that works by gradually adjusting the applied change while gauging the effect on the system. After applying the feedback-based approach, the revised algorithm was able to eventually work out the appropriate weights for subscription-consumer allocations to achieve even CPU utilization across heterogeneous hardware.

About the Author

Rate this Article