Resilience Content on InfoQ
-
Why the World Needs More Resilient Systems: Tammy Butow Discusses Chaos Engineering at QCon London
At QCon London, Tammy Butow explained why the world needs more resilient systems and how this can be achieved with the practice of chaos engineering. She named three primary prerequisites for chaos engineering -- high severity “SEV” incident management, monitoring, and measuring the impact -- and presented a series of guidelines, tools and practices.
-
Bloomberg Releases Open Source “PowerfulSeal” Kubernetes-Specific Chaos Testing Tool
At the recent KubeCon North America conference, Bloomberg presented their new open source “PowerfulSeal” tool, which enables chaos testing within Kubernetes clusters via the termination of targeted pods and underlying node infrastructure.
-
Chaos Engineering at Twilio
The Twilio team describes their foray into Chaos Engineering where they use Gremlin to inject failures into their homegrown queuing system shards to test for automated recovery.
-
Werner Vogels on “21st Century [Cloud] Architectures”: Availability, Reliability and Resilience
At the AWS re:Invent 2017 conference, Werner Vogels, CTO of Amazon, presented a keynote that discussed core concepts required for building “21st Century Architectures” on the cloud. Highlights of the talk included discussion of the emerging practices of evolutionary and “cloud native” architectures, security becoming everyone’s responsibility, and the benefits of chaos engineering.
-
Serverless Challenges in Hybrid Environments
Sam Newman, independent consultant and author of the book "Building Microservices", talked at the Velocity conference in London on the challenges faced when hybrid systems rely on both serverless architectures and traditional infrastructure. In particular, Newman discussed how serverless changes our notion of resiliency and how the two paradigms clash at times of high load in the system.
-
Expedia's Journey toward Site Resiliency: Embracing Chaos Testing in Dev and Production at QCon SF
At QCon SF, Sahar Samiei and Willie Wheeler presented “Expedia’s Journey Toward Site Resiliency”, and discussed the building of a community of practice around resilience testing within Expedia. The results have generally been positive: Netflix’s Chaos Monkey has been running daily in production since May 15th; and resilience tests have been added to four Tier 1 service pipelines.
-
Adrian Cockcroft Discusses Chaos Architecture: "Four Layers, Two Teams, and an Attitude"
At QCon San Francisco, Adrian Cockcroft presented “Chaos Architecture”, and discussed the evolution of cloud native architecture, and how chaos engineering can be applied to produce better and safer systems. Effective chaos architecture and engineering was presented as consisting of “four layers, two teams, and an attitude”.
-
Designing Services for Resilience: Nora Jones Discusses Netflix Chaos Engineering at QCon SF
At QCon SF, Nora Jones presented “Designing Services for Resilience Experiments: Lessons from Netflix”. Key takeaways from the talk included: the customer experience is a priority; designing for resiliency testability is a shared responsibility; configuration changes can cause outages; and engineers should have explicit monitoring in place to detect antipatterns in configuration changes.
-
Choose Your Own Adventure: Chaos Engineering at QCon New York 2017
Nora Jones, senior chaos engineer at Netflix, talked about chaos engineering at QCon New York 2017. She presented different stages of chaos engineering adoption and shared stories from her previous experiences at Jet and Netflix.
-
Netflix Engineer Lorin Hochstein on Chaos Monkey 2.0
Netflix made waves when it initially announced Chaos Monkey, a tool that would terminate normally healthy VM instances in production. The goal was to embrace failure and thereby increase resiliency. Rags Srinivas caught up with Lorin Hochstein at Netflix regarding the recent upgrade to Chaos Monkey.
-
Chaos Monkey 2.0 Runs via Spinnaker
Netflix has recently made available the source code of Chaos Monkey 2.0. The latest iteration of the resilience tool is fully integrated with Spinnaker and event tracking systems, but SSH support has been removed.
-
Google Kick-Starts Git Ketch: A Fault-Tolerant Git Management System
Although development has only started, Google has announced their first commits of Git Ketch, a multi-master Git management system that replicates information across multiple Git servers for resilience and scalability. The changes are based on JGit, a Java-based Git server, although other Git servers may be part of the multi-master cluster.
-
Microsoft Makes Available Their Platform for Building Microservices
Microsoft has announced and made available the preview of Azure Service Fabric (ASF), a cloud platform including a runtime and lifecycle management tools for creating, deploying, running and managing microservices. ASF microservices can be deployed on Azure or on-premises on Windows Server private or hosted clouds. Support for Linux is to come in the future.
-
Anti-patterns for Handling Failure
Oliver Hankeln shares the anti-patterns he found for handling failure in organizations: hiding mistakes, engaging in the blame game, the arc of escalation, and cowardice. He then suggests corrective actions for each of them.