Dealing with Thundering Herd at Braintree

Braintree engineer Anthony Ross explained in a recent article how introducing random jitter into the retry intervals of failed tasks solved a thundering herd issue that was impacting the efficiency of their payment dispute management API.

The thundering herd problem can occur when multiple processes are waiting on a single event. When the event happens, the processes are all awakened at more or less the same time. As a result, even though only one of them will eventually handle the event, all of them compete for the available resources, impairing the system's overall efficiency.
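As a rough illustration, the following plain-Ruby sketch (not Braintree's code) reproduces the pattern: several worker threads block on the same event, and a single broadcast wakes them all at once, so they immediately contend for the same resources even though one of them would have sufficed.

    # Thundering herd in miniature: five workers wait on one condition variable,
    # and a single broadcast wakes every one of them simultaneously.
    mutex = Mutex.new
    event = ConditionVariable.new
    ready = false

    workers = 5.times.map do |i|
      Thread.new do
        mutex.synchronize do
          event.wait(mutex) until ready   # all workers park on the same event
        end
        puts "worker #{i} woke at #{Time.now.strftime('%H:%M:%S.%L')}"
      end
    end

    sleep 0.5
    mutex.synchronize do
      ready = true
      event.broadcast                     # one event, every worker wakes
    end
    workers.each(&:join)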

In Braintree's case, a problem similar to thundering herd originated in the way their system dealt with failed jobs. In short, when the Disputes API received too many requests, it triggered the auto-scaling mechanism, but while the system was still scaling out, a high number of jobs ended up in the "dead letter queue" (DLQ). Jobs in the DLQ are normally retried automatically, but that was not always the case:

Why can’t our jobs reach the processor service at certain points in the day? In fact, we had retries for these exact errors so why didn’t the retries work? Once we double-checked that jobs were in fact retrying, we realized something else was going on.

It turned out that two factors were at play: the use of a static retry interval for jobs in the DLQ, and the coupling between the Disputes API and the processor service.

Using a static retry interval was the primary cause of the thundering herd behaviour. While a static retry interval works fine under low resource contention, it does not when too many jobs are retried at the same time, which means a portion of the jobs will fail and go back to the DLQ. The solution was to add some randomness to the retry interval, which required a patch to Ruby on Rails. Adding jitter made retries more efficient, says Ross, and helped the DLQ drain sooner.
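For illustration only, the sketch below shows what a jittered retry interval can look like with ActiveJob's retry_on in a Rails application; the job class, error class, and interval values are hypothetical and not taken from Braintree's codebase, which, as the article notes, needed a patch to Rails to achieve the same effect.

    # Hypothetical ActiveJob sketch: the wait is computed per attempt and a
    # random jitter is added, so jobs that failed together do not all retry
    # at exactly the same moment.
    class ProcessDisputeJob < ApplicationJob
      queue_as :disputes

      # Errors::ProcessorUnavailable stands in for whatever error the
      # downstream service raises when it is overloaded.
      retry_on Errors::ProcessorUnavailable,
               attempts: 10,
               wait: ->(executions) { (30 * executions) + rand(0..15) } # seconds plus jitter

      def perform(dispute_id)
        # dispute processing logic goes here
      end
    end

More recent Rails releases also expose a jitter: option directly on retry_on, which bakes the same idea into the framework.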

Static retry intervals were not the only factor hampering efficiency. The second one was the interplay between retry intervals and the scaling in/out policy adopted.

Scale in and out policies are a trade-off of time and money. The faster it can scale in and out, the more cost effective it is. But the trade-off is that we can be under-provisioned for a period.

Briefly, what happened is that retries were executed after such a long interval that the system had already scaled in the number of available service processes and was thus unable to handle the retry load. To address this, the Braintree team re-architected their system, getting rid of the independent processor service and folding part of its logic into the Disputes API itself, which enabled it to scale out efficiently.

Do not miss the original article for the full details, including a more thorough description of the Disputes API and the overall architecture Braintree implemented to deal with retries.
