Resilience Content on InfoQ
-
Gremlin Announces Free Tier for Their Chaos Experimentation Platform
Gremlin has announced “Gremlin Free”, which provides the ability to run chaos engineering experiments on a free tier of their failure-as-a-service SaaS platform. The current version of the free tier allows the execution of shutdown and CPU attacks on hosts or containers, which can be controlled via a simple web-based user interface, API or CLI.
-
Designing and Building a Resilient Serverless System: John Chapin at QCon London
In a presentation at QCon London 2019, John Chapin explained the basics of serverless technologies and how to architect and build a resilient serverless system. He also ran a demo of how a globally distributed, highly available application can be built and run in multiple regions on AWS.
-
Chaos Engineering Observability: Q&A with Russ Miles
In a new O’Reilly report, “Chaos Engineering Observability: Bringing Chaos Experiments into System Observability”, the author, Russ Miles, explores why he believes the topics of observability and chaos engineering “go hand in hand”. He argues that as engineers begin to run chaos experiments, they will need to be able to ask many questions about the underlying system being experimented on.
-
Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn
At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.
-
An Evolution of Chaos Experimentation: Kolton Andrus at ChaosConf 2018
At the inaugural ChaosConf, held in San Francisco, USA, Kolton Andrus presented an evolution of chaos experimentation over the past eight years. He argued that the human and organisational aspects of dealing with failure should not be ignored, and also suggested that tooling should support application- and request-level targeting of failure injection tests in order to minimise the blast radius.
-
Gremlin Releases Application Level Fault Injection (ALFI) Platform for Targeted Chaos Experiments
Gremlin Inc has released their second product offering in the “Failure-as-a-Service” domain: Application-Level Fault Injection (ALFI). Building upon their initial platform, which helped engineers create and run chaos experiments at the infrastructure level, ALFI enables failure injection at the application level via a native language library.
-
Russ Miles: Ignored Architects and Chaos Engineering
At the recent Event-Driven Microservices Conference in Amsterdam, Russ Miles claimed that the biggest challenge for an architect is that you get ignored. You have great ideas like event-driven microservices, but too often the reaction is that the idea sounds good yet is overly complicated for the needs at hand.
-
How to Achieve a Resilient Architecture
To manage systems at scale you must embrace failures, pushing your system almost to the breaking point while still being able to recover, Adrian Hornsby writes in two blog posts. In them he shares his experiences from working with large-scale systems for more than a decade, and the patterns he has found useful.
-
Ben Gracewood on Learning from an Organisational Train Wreck
At the recent JAFAC conference, Ben Gracewood told the story of how POS developer Vend transformed their development organisation following catastrophic disruption and losses. He explored what happened after they reduced headcount by over 30%, what they had in place that enabled them to survive, and what they did differently as a result of the changes.
-
Learning to Bend But Not Break at Netflix: Haley Tucker Discusses Chaos Engineering at QCon NY
At QCon New York, Haley Tucker presented “UNBREAKABLE: Learning to Bend But Not Break at Netflix” and discussed her experience with chaos engineering while working across a number of roles at Netflix. Key takeaways included: use functional sharding for fault isolation; continually tune RPC calls; run chaos experiments with small iterations; and apply the “principles of chaos”.
-
Chaos Engineering at LinkedIn: The “LinkedOut” Failure Injection Testing Framework
The LinkedIn Engineering team has recently discussed their “LinkedOut” failure injection testing framework. Hypotheses about service resilience can be formulated and failure triggers injected via the LinkedIn LiX A/B testing framework or via data in a cookie that is passed through the call stack using the Invocation Context (IC) framework. Failure scenarios include errors, delays and timeouts.
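The request-level idea described above, where a failure trigger travels with the request and is acted on at the point of a downstream call, can be illustrated with a minimal sketch. This is not the LinkedOut, LiX or Invocation Context API: the cookie format, the `failure_cookie` context key, and the names `FailureSpec`, `parse_failure_cookie` and `call_downstream` are all hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class FailureSpec:
    """Hypothetical failure trigger carried with a request."""
    target: str        # downstream service the failure applies to
    kind: str          # "error", "delay" or "timeout"
    delay_ms: int = 0  # only used when kind == "delay"


def parse_failure_cookie(value: str) -> Optional[FailureSpec]:
    """Parse an illustrative value like 'target=profile;kind=delay;delay_ms=50'."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    if "target" not in fields or fields.get("kind") not in {"error", "delay", "timeout"}:
        return None  # no usable trigger: behave normally
    return FailureSpec(fields["target"], fields["kind"], int(fields.get("delay_ms", "0")))


def call_downstream(service: str,
                    context: Dict[str, str],
                    real_call: Callable[[], str],
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Invoke real_call, injecting a failure if the request context targets this service."""
    spec = parse_failure_cookie(context.get("failure_cookie", ""))
    if spec is None or spec.target != service:
        return real_call()
    if spec.kind == "error":
        raise RuntimeError(f"injected error for {service}")
    if spec.kind == "timeout":
        raise TimeoutError(f"injected timeout for {service}")
    # Remaining case is "delay": wait, then make the real call.
    sleep(spec.delay_ms / 1000.0)
    return real_call()
```

Because the trigger is scoped to a single request and a single named target, only traffic that carries the trigger is affected, which keeps the blast radius of an experiment small.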
-
From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project
IT Revolution recently published an audiobook with nearly eight hours of conversation between Gene Kim and John Willis: “Beyond the Phoenix Project – the Origins and Evolution of DevOps”.
-
Increasing the Resilience of APIs with Chaos Engineering
The Gremlin team has described a simple chaos experiment as a method of validating that an organisation’s APIs are resilient. Using the principles of chaos engineering and techniques like running “game days” (a fire drill for IT systems and people) can provide value, as can the appropriate use of commercial and open source tooling emerging within this space.
-
What Resiliency Means at Sportradar
At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, talked about the practices and procedures in place at Sportradar to ensure their systems meet expected resiliency levels. Jensen mentioned that reliability is influenced not only by technical concerns but also by organizational structure and governance and by client support, and that it requires ongoing effort to continuously improve.
-
Why the World Needs More Resilient Systems: Tammy Butow Discusses Chaos Engineering at QCon London
At QCon London, Tammy Butow explained why the world needs more resilient systems, and how this can be achieved with the practice of chaos engineering. Three primary prerequisites for chaos engineering were provided (high severity “SEV” incident management, monitoring, and measuring the impact), and a series of guidelines, tools and practices was presented.