Adrian Cockcroft Discusses Chaos Architecture: "Four Layers, Two Teams, and an Attitude"

by Daniel Bryant on Nov 17, 2017

At QCon San Francisco, Adrian Cockcroft presented "Chaos Architecture", discussing the evolution of cloud native architecture and how chaos engineering can be applied to produce better and safer systems. He summarised effective chaos architecture and engineering as "four layers, two teams, and an attitude": the layers are infrastructure, switching, applications, and people; the teams are chaos engineering and security red teams; and the attitude is "break it to make it better".

Cockcroft, VP Cloud Architecture Strategy at AWS, began the talk by presenting the evolution of software architecture, from monolithic applications, through microservices, and ultimately to functions ("serverless"). The first attempts at splitting monolithic code bases, classical Service Oriented Architecture (SOA), resulted in coarse-grained mini-monoliths that communicated infrequently with large payloads (e.g., XML via SOAP). The emergence of the microservices architectural style brought with it more frequent communication, often via a REST-like interface, typically involving a concise JSON or binary-encoded payload over HTTP.

Microservices to functions

Five years ago, microservices were constructed from standard building blocks, often services and platform utilities provided by cloud vendors (for example, DBaaS, MQaaS and NoSQL database services), and the microservices themselves, which encapsulated the business logic, were the glue between these bricks. Since the emergence of Function-as-a-Service (FaaS) serverless platforms such as AWS Lambda three years ago, these business logic glue components have increasingly become ephemeral functions. This approach fundamentally alters the way system architectures are designed: when the system is idle it shuts down, and the customer pays nothing.

The co-evolution of best-practice application design alongside this change in architecture is often referred to as "cloud native". Cockcroft presented a series of cloud native architecture principles: pay-as-you-go OPEX rather than upfront CAPEX; self-service consumption and automated provisioning via APIs; global distribution by default; cross-zone and cross-region availability models; high utilisation, meaning that systems running below 40% utilisation should be scaled down; and immutable code deployments via robust continuous delivery pipelines.
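The 40% rule of thumb can be expressed as a simple scaling decision. The sketch below is illustrative only: it assumes load is spread evenly across identical instances, and both the 60% post-scaling target (`TARGET_UTILISATION`) and the function name `desired_instances` are hypothetical choices, not figures from the talk.

```python
import math

# Threshold from the talk: systems under 40% utilisation should be scaled down.
SCALE_DOWN_THRESHOLD = 0.40
# Hypothetical utilisation to aim for after scaling down (not from the talk).
TARGET_UTILISATION = 0.60


def desired_instances(current_instances: int, utilisation: float, min_instances: int = 1) -> int:
    """Return a new instance count, shrinking the fleet when utilisation is below 40%.

    Assumes load is spread evenly, so total load measured in "instance units"
    is current_instances * utilisation.
    """
    if utilisation >= SCALE_DOWN_THRESHOLD:
        return current_instances  # busy enough: leave the fleet alone
    load = current_instances * utilisation
    # Smallest fleet that keeps utilisation at or below the target.
    needed = math.ceil(load / TARGET_UTILISATION)
    # Never shrink below the configured floor (e.g. when load is near zero).
    return max(min_instances, needed)
```

Scaling towards a target above 40%, rather than exactly to the threshold, leaves headroom for load spikes, while the `min_instances` floor stops an idle fleet from shrinking to zero.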

Cockcroft switched gears halfway through the presentation to focus on chaos engineering and chaos architecture, the essence of which he believes is captured by "four layers, two teams, and an attitude". The first two layers are infrastructure and switching. Here, customer requests should be routed to specific local regions and services; during an incident, data should already be replicated so that requests can be re-routed to active services; and the switching mechanism must be more reliable than the redundant components it switches between.
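The routing and failover behaviour described for the infrastructure and switching layers can be sketched as a small router. This is a hypothetical, in-memory illustration: the region names, the `preferences` failover order and the `RegionRouter` class are assumptions, and a production switching layer (for example, one based on DNS or load balancers) would need to be far more reliable than this.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRouter:
    """Route each request to the caller's local region, failing over to the
    next healthy region during an incident. Region names are hypothetical."""

    # Ordered failover preferences per home region (an assumption of this sketch).
    preferences: dict[str, list[str]]
    # Regions currently marked as down.
    unhealthy: set[str] = field(default_factory=set)

    def mark_down(self, region: str) -> None:
        self.unhealthy.add(region)

    def mark_up(self, region: str) -> None:
        self.unhealthy.discard(region)

    def route(self, home_region: str) -> str:
        # Try the local region first, then each configured fallback in order.
        for region in [home_region, *self.preferences.get(home_region, [])]:
            if region not in self.unhealthy:
                return region
        raise RuntimeError(f"no healthy region available for {home_region}")
```

For example, with `RegionRouter({"eu-west": ["eu-central", "us-east"]})`, requests from `eu-west` stay local until that region is marked down, and then move to `eu-central`. Re-routing like this only helps if the data has already been replicated to the fallback region.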

The next layer, applications, can be made more resilient by designing microservices to limit the "blast radius" of any incident: circuit breakers limit the damage, and bulkheads prevent it from spreading. Quoting Greg Hawkins of Starling Bank, a UK-based challenger bank, Cockcroft stated that the Do Idempotent Things To Others (DITTO) architecture, combined with avoiding update and delete semantics, makes implementing a resilient system much easier.
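As an illustration of DITTO combined with append-only semantics, the sketch below records payments keyed by a caller-supplied idempotency key, so a client that times out can simply resend the same request. The `Payment` and `PaymentLedger` names and the pence-denominated amounts are hypothetical; the talk does not describe Starling Bank's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    idempotency_key: str  # chosen by the caller and reused on every retry
    account: str
    amount_pence: int


class PaymentLedger:
    """Append-only ledger that applies each request at most once."""

    def __init__(self) -> None:
        self._entries: list[Payment] = []    # only ever appended to
        self._seen: dict[str, Payment] = {}  # idempotency key -> recorded entry

    def record(self, payment: Payment) -> Payment:
        existing = self._seen.get(payment.idempotency_key)
        if existing is not None:
            return existing  # duplicate delivery: return the original result
        self._entries.append(payment)
        self._seen[payment.idempotency_key] = payment
        return payment

    def balance(self, account: str) -> int:
        # State is derived from the log rather than updated in place.
        return sum(p.amount_pence for p in self._entries if p.account == account)
```

Because nothing is ever updated or deleted, a correction is recorded as a new (for example, negative) entry, and retrying any request is harmless.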

The fourth layer, people, is a core component of resilient systems: unexpected application behaviour often causes people to intervene and make the situation worse. Fire drills save lives in a real fire because people have been trained in how to react, but who runs the "fire drill" for IT? The "two teams" presented were the chaos engineering team and the security red team.

Chaos Architecture

The chaos engineering team uses tools such as Netflix's Simian Army, Failure Injection Testing (FIT) and ChAP, along with Gremlin, and drills people on how to react to disaster scenarios by running game days. The security red team uses tools like Safestack AVA, Metasploit, AttackIQ and SafeBreach to proactively attempt to break into systems and to coerce engineers into performing inappropriate (insecure) actions in a controlled environment. The core attitude of both teams should be "break it to make it better".
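The core mechanism behind failure injection can be sketched in a few lines: wrap a dependency call and make it fail at a configured rate during a controlled experiment such as a game day. This does not reproduce the API of Simian Army, FIT, ChAP or Gremlin; `with_fault_injection`, `InjectedFault` and `fetch_recommendations` are hypothetical names, and the seeded `random.Random` is passed in only so that experiments are repeatable.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real call when the injector decides to fail it."""


def with_fault_injection(call: Callable[[], T], failure_rate: float, rng: random.Random) -> T:
    """Run `call`, but fail it with probability `failure_rate`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng.random() < failure_rate:
        raise InjectedFault("injected failure")
    return call()


def fetch_recommendations() -> list[str]:
    # Hypothetical downstream call.
    return ["title-a", "title-b"]


def recommendations_with_fallback(rng: random.Random, failure_rate: float = 0.3) -> list[str]:
    # What the experiment checks: the caller degrades gracefully instead of failing.
    try:
        return with_fault_injection(fetch_recommendations, failure_rate, rng)
    except InjectedFault:
        return []  # fall back to an empty list
```

During a game day, the team would raise `failure_rate` for a chosen dependency and watch whether fallbacks, alerts and the people on call respond as expected.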

Cockcroft concluded the talk by discussing risk tolerance, asking which matters more to an organisation: availability, by being permissive and "failing open", or consistency and security, with the associated downtime. The mantra "break it to make it better" can also become "break it to make it safer", and additional resources on this approach include Todd Conklin's Pre-Accident Investigation Podcast, John Allspaw's stella.report and Sidney Dekker's Drift into Failure. Failures are a system problem, reflecting a lack of safety margin, and the notion of an issue being caused by (a single) human error is not valid.
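The availability-versus-security trade-off often surfaces as a choice of failure mode when a dependency, such as an authorisation service, is unreachable. The sketch below makes that choice explicit; `FailureMode`, `is_allowed` and the use of `ConnectionError` to signal an outage are assumptions made for illustration.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"      # favour availability: allow when unsure
    FAIL_CLOSED = "fail_closed"  # favour consistency and security: deny when unsure


def is_allowed(check: Callable[[], bool], mode: FailureMode) -> bool:
    """Ask a (possibly unavailable) policy service whether to allow a request."""
    try:
        return check()
    except ConnectionError:
        # The dependency is down, so the configured failure mode decides.
        return mode is FailureMode.FAIL_OPEN
```

Which mode is right depends on the organisation's risk tolerance: failing open keeps serving customers at the risk of allowing something it should not, while failing closed protects consistency and security at the cost of downtime.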

The slides for Adrian Cockcroft's "Chaos Architecture" (PDF, 8MB) talk can be found on the QCon SF website. The video will be made available on InfoQ over the coming months.

Failure as a Use Case by David Pitt

Agree with Mr. Cockcroft's assertion that the movement towards highly distributed stateless architectures requires failure to be a feature of a platform. We've built free, MIT-licensed software that introduces failure injection into the Java/Spring stack.

keyholelabs.com/trouble-maker