BT

Chaos Engineering at Twilio

| by Hrishikesh Barua Follow 14 Followers on Dec 25, 2017. Estimated reading time: 2 minutes |

The Twilio team describes their foray into Chaos Engineering where they use Gremlin to inject failures into their homegrown queuing system shards to test for automated recovery.

Twilio provides SMS and phone gateway services via APIs that application developers can invoke from their code. A core part of Twilio's architecture is their distributed queuing and rate limiting system. It provides persistent queueing to handle system failures and latency in processing messages, avoiding message loss. The system is called Ratequeue, built by the Twilio team. Ratequeue rate limits the dequeue rates for large numbers - every phone number has its own queue - of ephemeral queues. Rate limiting is required because developers can invoke the Twilio APIs as fast as possible, but the rate at which Twilio pushes these into the phone networks has to be controlled. Ratequeue is built on top of Redis, and is horizontally sharded for isolation and load balancing. Failure of one shard does not impact the other shards. Additionally, each shard is composed of a master and its replica for HA.

When a primary shard fails in Ratequeue, the previous way to recover was for a human to intervene and manually promote the replica to master. This involved locating the host which has the same shard number as the primary and adding it to the LB. The team built two systems to remove this manual process - an automated failover system and a fault injection system to test the former. The fault injection system is presented as an example of chaos engineering, where a system is subjected to random faults at various levels to test its ability to recover from failure.

Zero data loss was a primary goal of the exercise - the others being automatic detection of failure and promotion of the new master. The team built a custom solution using Amazon Kinesis, Nagios and Lazarus - their cluster automation service. Each Ratequeue replica pushes heartbeats about its master’s health to Nagios, which in turn pushes notifications to Kinesis if a threshold is breached.  Lazarus listens on Kinesis for these events, performs its own checks for the health of the cluster and starts the process of failover if required.

To test this automated failure recovery, the team created a tool called Ratequeue Chaos that would pick a shard, kill the primary and monitor its recovery.  A service called Gremlin is used to generate and inject faults into the system, triggering the failover. Gremlin allows for controlled fault injections across the stack and is invoked by Ratequeue Chaos via its API. This process runs every four hours on Twilio’s staging environment.

The team also shared learnings during this process - have a hypothesis based testing model, a framework to run the test, and a rollback plan if this is done on production.

Rate this Article

Adoption Stage
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss

Educational Content

Login to InfoQ to interact with what matters most to you.


Recover your password...

Follow

Follow your favorite topics and editors

Quick overview of most important highlights in the industry and on the site.

Like

More signal, less noise

Build your own feed by choosing topics you want to read about and editors you want to hear from.

Notifications

Stay up-to-date

Set up your notifications and don't miss out on content that matters to you

BT