Testing Resiliency at PagerDuty Without a Simian Army

by Manuel Pais on Nov 12, 2013 |

Doug Barth, from PagerDuty, talked at DevOps Days London about their approach to start testing their systems for resilience without dedicating a lot of automation effort upfront. The goal was to quickly start learning about failure points and openly discuss how to fix them with a one hour per week time box.

Automating failure testing with the same coverage as Netflix’s famous simian army was not possible due to PagerDuty’s multi-cloud environment and investing in in-house automation tools would delay initial results. Thus they opted for a manual failure testing approach nicknamed “Failure Friday”. It consists of spending one hour each Friday trying out a list of “attacks” (provoked failures) and checking how the “victim” (system being tested) reacts.

Between attacks the system is put back to a normal working state. Attacks stop if things break badly (for example requests sent to the victim not being picked up by other service instances after failure). In such case the session is halted and the system is recovered manually. A permanent fix gets tested the following Friday. Otherwise the attacks continue till the hour long session is over.

Attack strategies went from quick failure simulations such as stopping one Cassandra database instance or rebooting a server instance to more complicated simulations of network isolation (misconfigured IP tables dropping packets coming in on specific ports) or slow nodes (using netem’s network emulation).

Fixing issues in the system and raising overall awareness of the need to handle and test for failure were some of the benefits expected. But Doug also highlighted unexpected benefits such as the ease to ramp up new on-call people (dev or ops) after their exposure and understanding of the provoked failures, as opposed to just theoretical knowledge that will likely be outdated or inaccurate by the time a real non-provoked failure happens. Another unplanned benefit was the uncovering of hard to simulate component failures which led to changes in the system architecture that increase its overall testability.

In terms of practical organization Doug mentioned the importance of keeping logs and action times, tracking discoveries and issues as well as sharing dashboards and metrics. He also recommends not turning off alarms during the session in order to check that monitoring is working as expected but announce the attack sessions to everyone in order to avoid alarm escalation due to the provoked failures.

Rate this Article


Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

General Feedback
Marketing and all content copyright © 2006-2016 C4Media Inc. hosted at Contegix, the best ISP we've ever worked with.
Privacy policy

We notice you're using an ad blocker

We understand why you use ad blockers. However to keep InfoQ free we need your support. InfoQ will not provide your data to third parties without individual opt-in consent. We only work with advertisers relevant to our readers. Please consider whitelisting us.