
Chaos Engineering Content on InfoQ

  • Improving Incident Management through Role Assignments and Game Days

    John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.

  • Exploring Costs of Coordination During Outages with Laura Maguire at QCon London

    Laura Maguire talked at QCon London about how coordinating a response during outages imposes a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies used to control the cost of coordination adapt to the type of incident. Moreover, tooling adds its own coordination costs.

  • Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF

    At the recent QCon San Francisco event, Adrian Cockcroft shared his thoughts on how to produce resilient systems that operate successfully despite failures. He also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.

  • Gremlin Releases Native Kubernetes Chaos Testing

    Chaos engineering platform Gremlin released native Kubernetes support for identifying, targeting, and experimenting on Kubernetes objects in order to proactively identify service weaknesses.

  • How to Integrate Infosec and DevOps Using Chaos Engineering

    Kelly Shortridge from Capsule8 talked at the Velocity conference in Berlin about how chaos engineering can help to integrate Infosec within a DevOps culture. Shortridge discussed how distributed, immutable, and ephemeral infrastructure, or the D.I.E. model, is an organizationally friendly way to build security by design. With this model, users can continuously raise the cost of the attack

  • Gremlin Introduces Scenarios, Enabling Real-World Chaos Experiments

    The Gremlin team announced the addition of Scenarios, which allow for the simulation of real-world outages. Scenarios allow for planning and tracking complex chaos experiments that more closely mimic real-world outages. The release includes prepared Scenarios that can be run out of the box or used as a starting template to build custom incidents.

  • How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

    At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.

  • Solo.io Announces Service Mesh Hub and Chaos Engineering Tool

    Solo.io, a cloud native software company, launched the industry's first service mesh hub. The hub provides resources to help users adopt service mesh technology in hybrid and multi-cloud environments and features tools such as Istio, Linkerd, Envoy, AWS App Mesh, and HashiCorp Consul.

  • Summary of Chaos Community Day v4.0: Resilience, Observability, and Gamedays

    Earlier in the year, the fourth edition of “Chaos Community Day” was held at Work-Bench in New York City. Key takeaways from the day included: the topic of chaos engineering draws heavily from other domains, which software engineers can also learn from; understanding systems, and communicating and exchanging the related mental models, is vital for establishing resilience.

  • Chaos Engineering Kubernetes with the Litmus Framework

    Litmus is an open source chaos engineering framework for Kubernetes environments running stateful applications. Created by MayaData, Litmus enables users to run test suites, capture logs, generate reports, and perform chaos experiments.

  • QCon NY (Jun 24-28): New Talks, a Focus on the Skills That Matter & Why You Should Join Us This Year

    In Stack Overflow's recent 9th annual survey of over 90,000 software developers, we learned that non-development work remains a productivity challenge for software managers and leaders. At QCon New York, the conference for senior software developers, we have many sessions to help you learn how others have overcome those challenges.

  • Mature Microservices and How to Operate Them: QCon London Q&A

    Microservices is an architectural approach that keeps systems decoupled so that many changes can be released each day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.

  • Amplifying Sources of Resilience: John Allspaw at QCon London

    At QCon London John Allspaw presented “Amplifying Sources of Resilience: What Research Says”. Key takeaways from the talk included: that resilience is something a system does, not what a system has; creating and sustaining “adaptive capacity” within an organisation is resilient action; and learning about how people cope with surprise is the path to finding sources of resilience.

  • Gremlin Announces Free Tier for Their Chaos Experimentation Platform

    Gremlin has announced “Gremlin Free”, which provides the ability to run chaos engineering experiments on a free tier of their failure-as-a-service SaaS platform. The current version of the free tier allows the execution of shutdown and CPU attacks on hosts or containers, which can be controlled via a simple web-based user interface, API or CLI.

  • Chaos Engineering Observability: Q&A with Russ Miles

    In a new O’Reilly report, “Chaos Engineering Observability: Bringing Chaos Experiments into System Observability”, the author, Russ Miles, explores why he believes the topics of observability and chaos engineering “go hand in hand”. He argues that as engineers begin to run chaos experiments, they will need to be able to ask many questions about the underlying system being experimented on.
