Adrian Cockcroft Discusses Chaos Architecture: "Four Layers, Two Teams, and an Attitude"

At QCon San Francisco, Adrian Cockcroft presented "Chaos Architecture", discussing the evolution of cloud native architecture and how chaos engineering can be applied to produce better and safer systems. Effective chaos architecture and engineering was presented as consisting of "four layers, two teams, and an attitude": the layers are infrastructure, switching, applications, and people; the teams are the chaos engineering and security red teams; and the attitude is "break it to make it better".

Cockcroft, VP Cloud Architecture Strategy at AWS, began the talk by presenting the evolution of software architecture, from monolithic applications, through microservices, and ultimately to functions ("serverless"). The first attempts at splitting monolithic code bases -- classical Service Oriented Architecture (SOA) -- resulted in coarse-grained mini-monoliths that communicated infrequently and with large payloads, for example XML via SOAP. The emergence of the microservices architectural style brought with it an increased frequency of communication, often via a REST-like interface and typically involving a concise JSON or binary-encoded payload over HTTP.

Microservices to functions

Five years ago microservices were being constructed using standard building blocks -- often services and platform utilities provided by cloud providers, for example DBaaS, MQaaS and NoSQL DB services -- and the services themselves were the glue between these bricks, encapsulating the business logic. Since the emergence of Function-as-a-Service (FaaS) serverless platforms, such as AWS Lambda, three years ago, these business logic glue components have evolved into ephemeral functions. This approach fundamentally alters the way system architectures are designed: when the system is idle it shuts down, and the customer pays nothing.

The co-evolution of best practice application design alongside this change in architecture is often referred to as "cloud native". Cockcroft presented a series of cloud native principles for architecture: pay-as-you-go OPEX, rather than upfront CAPEX; self-service consumption and automated provisioning via APIs; globally distributed by default; cross-zone and geographic region availability models; high utilisation (systems under 40% utilisation should be scaled down); and immutable code deployments via robust continuous delivery pipelines.
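To make the utilisation principle above concrete, the following is a minimal Python sketch of a right-sizing rule. Only the 40% threshold comes from the talk; the 60% target utilisation, the function name `right_sized_count` and its parameters are assumptions made purely for illustration.

```python
SCALE_DOWN_THRESHOLD_PCT = 40  # from the talk: under 40% utilisation, scale down
TARGET_UTILISATION_PCT = 60    # assumption for illustration, not from the talk


def right_sized_count(instances: int, utilisation_pct: int,
                      target_pct: int = TARGET_UTILISATION_PCT,
                      minimum: int = 1) -> int:
    """Return how many instances to run for the current average utilisation."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if not 0 <= utilisation_pct <= 100:
        raise ValueError("utilisation_pct must be between 0 and 100")
    if not SCALE_DOWN_THRESHOLD_PCT <= target_pct <= 100:
        raise ValueError("target_pct must be between the threshold and 100")

    if utilisation_pct >= SCALE_DOWN_THRESHOLD_PCT:
        return instances  # busy enough: leave the fleet as it is

    # Total work, expressed in "instance-percent" units.
    busy = instances * utilisation_pct
    # Ceiling division: the fewest instances that keep load at or below target.
    needed = -(-busy // target_pct)
    return max(minimum, needed)


print(right_sized_count(10, 30))  # 10 instances at 30% -> 5 instances at 60%
print(right_sized_count(10, 55))  # above the threshold -> unchanged, 10
```

Integer percentages are used so that the arithmetic is exact; a real autoscaling policy would also need to consider peak load and the availability requirements discussed in the other principles.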

Cockcroft switched gears halfway through the presentation and focused on chaos engineering and chaos architecture, the essence of which he believes is captured by "four layers, two teams, and an attitude". The first two layers are infrastructure and switching. Here, customer requests should be routed to specific local regions and services, data should be replicated, and requests should be re-routed to active services during an incident. The switching mechanism must also be more reliable than the redundant components being switched between.
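The switching behaviour described above can be sketched in a few lines of Python. This is an illustrative sketch only, not something shown in the talk: the region names, the `healthy` map and the `route_request` function are hypothetical.

```python
from typing import Dict, List


def route_request(home_region: str, healthy: Dict[str, bool],
                  failover_order: List[str]) -> str:
    """Pick the region that should serve a request.

    Prefer the customer's local (home) region; during an incident,
    re-route to the first healthy region in the failover order.
    """
    if healthy.get(home_region, False):
        return home_region
    for region in failover_order:
        if region != home_region and healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical health state during an incident in region-a.
healthy = {"region-a": False, "region-b": True, "region-c": True}
order = ["region-a", "region-b", "region-c"]

print(route_request("region-a", healthy, order))  # re-routed to region-b
print(route_request("region-c", healthy, order))  # served locally by region-c
```

The routing function is deliberately a pure function with no dependencies of its own, which is one simple way to keep the switch less likely to fail than the components it switches between.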

The next layer, applications, can be made more resilient by designing microservices to limit the "blast radius" of any incident. Circuit breakers limit damage and bulkheads prevent it from spreading. Quoting Greg Hawkins from Starling Bank -- a UK-based challenger bank -- Cockcroft stated that the Do Idempotent Things To Others (DITTO) architecture, in combination with avoiding update and delete semantics, makes implementing a resilient system much easier.
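As a rough illustration of how a circuit breaker limits damage, the following minimal Python sketch fails fast once a dependency has failed repeatedly, and allows a trial call after a cool-down. It is not taken from the talk or from any particular library; the class name, the thresholds and the injectable `clock` parameter are assumptions for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` is a hypothetical downstream request. While the circuit is open the failing dependency receives no traffic, which keeps one broken service from exhausting its callers.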

The fourth layer, people, is a core component of implementing resilient systems: unexpected application behaviour often causes people to intervene and make the situation worse. Fire drills save lives in the event of a real fire, because people are trained how to react -- but who runs the "fire drill" for IT? The "two teams" presented were the chaos engineering team and the security red team.

Chaos Architecture

The chaos engineering team utilise tools such as Netflix's Simian Army, Failure Injection Testing, ChAP and Gremlin, and drill people on how to react to disaster scenarios by running game days. The security red team use tools like Safestack AVA, Metasploit, AttackIQ and SafeBreach, and proactively attempt to break into systems and coerce engineers to perform inappropriate (insecure) actions in a controlled environment. The core attitude of both these teams should be "break it to make it better".
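The idea behind failure injection can be illustrated with a small Python sketch. It does not use or imitate the API of any of the tools named above; `with_chaos`, `InjectedFailure` and the failure rate are hypothetical names and values chosen for illustration.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(RuntimeError):
    """A deliberate failure raised by the chaos wrapper."""


def with_chaos(func: Callable[..., T], failure_rate: float,
               rng: random.Random) -> Callable[..., T]:
    """Wrap func so that a fraction of calls fail on purpose."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise InjectedFailure("chaos experiment: simulated outage")
        return func(*args, **kwargs)

    return wrapper


# A seeded generator makes a game-day experiment repeatable.
rng = random.Random(42)
flaky_lookup = with_chaos(lambda key: key.upper(), failure_rate=0.2, rng=rng)

failed = 0
for _ in range(100):
    try:
        flaky_lookup("order-1")
    except InjectedFailure:
        failed += 1
print(f"{failed} of 100 calls failed on purpose")
```

In a game day, the interesting observation is not the injected error itself but how the callers, dashboards and on-call engineers respond to it.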

Cockcroft concluded the talk by discussing risk tolerance, and asking what is more important for an organisation: availability, being permissive and "failing open"; or consistency and security, with the associated downtime. The mantra of "break it to make it better" can actually become "break it to make it safer", and additional resources on this approach include Todd Conklin's Pre Accident Investigation Podcast, John Allspaw's stella.report and Sidney Dekker's Drift into Failure. Failures are a system problem -- a lack of safety margin -- and the notion of an issue being caused by (a single) human error is not valid.
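The trade-off between "failing open" and failing closed can be made explicit in code. The following Python sketch is an assumption-laden illustration rather than anything shown in the talk: `FailureMode`, `is_allowed` and the simulated permission check are hypothetical.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    OPEN = "open"      # favour availability: allow when the check is down
    CLOSED = "closed"  # favour consistency and security: deny when it is down


def is_allowed(check: Callable[[], bool], mode: FailureMode) -> bool:
    """Run a permission check, applying the chosen policy if it cannot run."""
    try:
        return check()
    except Exception:
        # The dependency failed, so the risk-tolerance decision applies here.
        return mode is FailureMode.OPEN


def broken_check() -> bool:
    raise ConnectionError("permission service unreachable")


print(is_allowed(broken_check, FailureMode.OPEN))    # True: stay available
print(is_allowed(broken_check, FailureMode.CLOSED))  # False: accept downtime
print(is_allowed(lambda: False, FailureMode.OPEN))   # False: a real denial stands
```

Making the mode an explicit parameter turns the risk-tolerance question into a visible, reviewable decision for each system rather than an accident of error handling.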

The slides for Adrian Cockcroft's "Chaos Architecture" (PDF, 8MB) talk can be found on the QCon SF website. The video will be made available on InfoQ over the coming months.
