Failure Testing of Microservices

by Jan Stenberg on Feb 29, 2016

Failure testing should be a critical part of running your microservices, Kolton Andrus stated in his presentation at the recent Microservices Practitioner Summit. Verifying that your services behave as you expect is something you should do to prevent outages.

Andrus, a software engineer formerly at Netflix, compares failure testing to a vaccine: you inject a small amount of something harmful into a body in order to build an immunity to it. For Andrus this translates very well into the microservice world. We inject a little bit of something harmful into a microservice to see how it behaves, and then we try to build an immunity to it.
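
As a rough illustration of the vaccine idea, the following Python sketch wraps a dependency call so that a small, configurable fraction of requests fail on purpose, and then checks that the caller still returns a usable response. It is not taken from the presentation: the names `with_fault_injection`, `fetch_recommendations` and `recommendations_with_fallback` are hypothetical, and the downstream service is a stand-in function rather than a real network call.

```python
import random


class InjectedFailure(Exception):
    """Raised by the injector in place of a real dependency error."""


def with_fault_injection(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    # Hypothetical downstream service; a real one would be a network call.
    return ["item-%d" % user_id, "item-popular"]


def recommendations_with_fallback(fetch, user_id):
    """Behaviour under test: degrade to a static list when fetch fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        # A real service would also catch timeouts and connection errors.
        return ["item-popular"]


def run_experiment(failure_rate, requests, seed=0):
    """Send `requests` calls through the injector; count degraded responses."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    flaky_fetch = with_fault_injection(fetch_recommendations, failure_rate, rng)
    degraded = 0
    for user_id in range(requests):
        result = recommendations_with_fallback(flaky_fetch, user_id)
        if not result:
            raise RuntimeError("caller returned an empty response")
        if result == ["item-popular"]:
            degraded += 1
    return degraded
```

Starting with a small `failure_rate` keeps the "dose" low: most requests are unaffected, while the degraded ones show whether the fallback really works.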

The downside of failure testing is its potential impact: it can break things or affect customers. But if we can arrange things so that the worst thing that can happen isn’t that bad and the best thing is pretty good, e.g. preventing an outage, Andrus thinks the downside is manageable.

When doing failure testing, Andrus prefers working with failure scenarios, thinking about what can go wrong and how systems can fail. By asking questions like "What are we worried about?" or "What could go wrong?" he thinks we’ll be a little bit better prepared. By thinking about how likely it is that a failure occurs, you can find the common events within the infrastructure that deserve your time. He notes, though, that we can’t prepare for everything. There will always be failures that we can’t see coming, but he believes that by being prepared we will be better able to mitigate the problems.

Another question Andrus thinks helps in prioritization and risk assessment is "What’s the cost of being wrong?". You can then run a cost-benefit analysis thinking not only about what could go wrong, but also what’s likely to go wrong, which can help in deciding where to spend time and money to get the best outcome.
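
One simple way to sketch such a cost-benefit ranking is to multiply how often a failure is expected by what it costs when it happens, and sort by the result. The Python below is an illustration only: the `Scenario` fields, the scenario names and all the numbers are made-up assumptions, not figures from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    likelihood_per_year: float  # assumed: expected occurrences per year
    cost_if_wrong: float        # assumed: cost of one occurrence, any unit

    @property
    def expected_cost(self):
        # Expected yearly cost = how often it happens * what it costs.
        return self.likelihood_per_year * self.cost_if_wrong


def prioritize(scenarios):
    """Order scenarios so the highest expected cost comes first."""
    return sorted(scenarios, key=lambda s: s.expected_cost, reverse=True)


# Illustrative numbers only.
scenarios = [
    Scenario("instance restart", likelihood_per_year=12, cost_if_wrong=2_000),
    Scenario("region outage", likelihood_per_year=0.1, cost_if_wrong=500_000),
    Scenario("slow dependency", likelihood_per_year=4, cost_if_wrong=1_000),
]

ranked = [s.name for s in prioritize(scenarios)]
```

With these made-up inputs the rare but expensive scenario ranks first, which is the kind of result that asking only "what is likely?" would miss.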

Andrus emphasizes the importance of also testing in production. When you test only in a test environment, none of the production configuration, network or hardware is exercised. He quotes James Hamilton, Distinguished Engineer at Amazon Web Services (AWS):

Those unwilling to test in Production aren’t yet confident that their service will continue operating through failures. And without production testing, recovery won't work when called upon.

If you put a lot of work into creating mitigations but fail to test them in production, you may later discover in production that they don't work, or that they make an outage even worse, and that's not a pleasant situation to be in.

Couldn't agree more by David Pitt

On our team we introduced "Failure as a use case" and have even created an open source solution to help with Failure Testing. Here's a link to learn more.

keyholesoftware.com/2015/12/15/failure-as-a-use...

Thanks,
David

InfoQ.com and all content copyright © 2006-2016 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.