Testing Resiliency at PagerDuty Without a Simian Army

by Manuel Pais on Nov 12, 2013. Estimated reading time: 1 minute

Doug Barth of PagerDuty spoke at DevOps Days London about how the company started testing its systems for resilience without a large upfront investment in automation. The goal was to start learning about failure points quickly and to discuss fixes openly, within a time box of one hour per week.

Automating failure testing with the same coverage as Netflix’s famous Simian Army was not possible because of PagerDuty’s multi-cloud environment, and building in-house automation tools would have delayed initial results. The team therefore opted for a manual failure testing approach nicknamed “Failure Friday”: one hour each Friday is spent running a list of “attacks” (provoked failures) and checking how the “victim” (the system under test) reacts.

Between attacks the system is returned to a normal working state. If things break badly (for example, requests sent to the victim are not picked up by other service instances after the failure), the session is halted and the system is recovered manually; a permanent fix is then tested the following Friday. Otherwise the attacks continue until the hour-long session is over.
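
The session flow described above can be sketched in a few lines of Python. This is only an illustration, not PagerDuty's tooling (the talk describes a manual process): the `Attack` type, the `run_session` function and the `healthy` check are hypothetical names introduced here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attack:
    """One provoked failure (hypothetical structure, for illustration)."""
    name: str
    inject: Callable[[], None]    # provoke the failure on the victim
    restore: Callable[[], None]   # put the system back to a normal state
    healthy: Callable[[], bool]   # did the rest of the system cope?


def run_session(attacks: List[Attack],
                budget_seconds: float = 3600.0,
                clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Run attacks in order until one breaks badly or the time box ends."""
    deadline = clock() + budget_seconds
    log: List[str] = []
    for attack in attacks:
        if clock() >= deadline:
            log.append("time box over")
            break
        attack.inject()
        if not attack.healthy():
            # Things broke badly: stop here; recovery is done by hand.
            log.append(f"{attack.name}: broke badly, halt and recover manually")
            break
        attack.restore()
        log.append(f"{attack.name}: survived")
    return log
```

Passing the clock in as a parameter is a design choice of this sketch; it lets the one-hour time box be exercised in a test without waiting an hour.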

Attack strategies ranged from quick failure simulations, such as stopping one Cassandra database instance or rebooting a server instance, to more complicated simulations of network isolation (iptables rules deliberately misconfigured to drop packets arriving on specific ports) or slow nodes (using netem network emulation).
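
As a rough idea of what the network attacks might look like on a Linux host, the sketch below builds (but does not run) `iptables` and `tc ... netem` command lines. The exact commands PagerDuty used are not given in the talk summary; the function names, the port `9042`, the interface `eth0` and the 500 ms delay are all placeholder assumptions, and the commands would need root privileges to execute.

```python
from typing import List


def drop_port(port: int) -> List[str]:
    """iptables rule that silently drops inbound TCP packets on one port."""
    return ["iptables", "-A", "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


def undo_drop_port(port: int) -> List[str]:
    """Delete the matching rule to end the isolation attack."""
    return ["iptables", "-D", "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


def slow_node(interface: str, delay_ms: int) -> List[str]:
    """netem qdisc that delays every outgoing packet on an interface."""
    return ["tc", "qdisc", "add", "dev", interface,
            "root", "netem", "delay", f"{delay_ms}ms"]


def undo_slow_node(interface: str) -> List[str]:
    """Remove the netem qdisc to restore normal latency."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


if __name__ == "__main__":
    # Print the commands instead of running them; values are placeholders.
    for cmd in (drop_port(9042), undo_drop_port(9042),
                slow_node("eth0", 500), undo_slow_node("eth0")):
        print(" ".join(cmd))
```

Pairing every attack with its undo command reflects the rule that the system is put back to a normal state between attacks.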

The expected benefits included fixing issues in the system and raising overall awareness of the need to handle and test for failure. Doug also highlighted unexpected benefits. New on-call people (dev or ops) ramp up more easily once they have seen and understood the provoked failures, rather than relying on theoretical knowledge that will likely be outdated or inaccurate by the time a real, unprovoked failure happens. Another unplanned benefit was the discovery of component failures that were hard to simulate, which led to changes in the system architecture that increased its overall testability.

In terms of practical organization, Doug stressed the importance of keeping logs with the time of each action, tracking discoveries and issues, and sharing dashboards and metrics. He also recommends leaving alarms on during the session, to check that monitoring works as expected, while announcing the attack sessions to everyone so that the provoked failures do not cause alarms to be escalated.
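
Keeping a timed record of actions could be as simple as the following sketch. The `ActionLog` class and its line format are assumptions made for illustration; the talk summary does not say which tool PagerDuty used for this.

```python
from datetime import datetime, timezone
from typing import Callable, List


class ActionLog:
    """Collects timestamped notes during a session (hypothetical helper)."""

    def __init__(self,
                 now: Callable[[], datetime] = lambda: datetime.now(timezone.utc)):
        self._now = now
        self.entries: List[str] = []

    def record(self, action: str) -> str:
        """Store the action prefixed with the current UTC time."""
        line = f"{self._now().strftime('%H:%M:%S')} {action}"
        self.entries.append(line)
        return line
```

Timestamps make it possible to line up each attack with what the shared dashboards and alarms showed at that moment.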
