Resilience Content on InfoQ
-
Gremlin Adds Automated Service Discovery for Targeting Chaos Experiments
Gremlin, a chaos engineering platform, recently announced automated service discovery. This new feature automatically discovers services running within dynamic environments and makes them available as targets for chaos experiments. Gremlin has also added role-based access control for its API keys.
-
Cheryl Hung on Trends in Cloud Native and DevOps for 2021
In a recent keynote for The DEVOPS Conference, Cheryl Hung, VP of ecosystem for the Cloud Native Computing Foundation (CNCF), shared her top 10 predictions for cloud native in the upcoming year. These include improvements in cross-cloud support, growth in GitOps and chaos engineering practices, and an increase in the adoption of FinOps.
-
InfoQ Live March 16: Explore Ways of Reducing Uncertainty in Software Delivery
InfoQ Live, the one-day virtual event for software engineers and architects, returns on March 16th with a new edition, this time focusing on ways to reduce the uncertainty of your software development cycle.
-
Gremlin Aims to Reduce Kubernetes Noisy Neighbours through Chaos Engineering
Gremlin has released enhancements to its Chaos Engineering platform aimed at DevOps engineers interested in future-proofing Kubernetes clusters by isolating "noisy neighbours". On Kubernetes, the noisy neighbour issue occurs when multiple applications sharing a cluster compete for resources, leading to degraded performance.
-
Gremlin Releases State of Chaos Engineering 2021 Report
Gremlin released their State of Chaos Engineering 2021 report based on a community survey and their own product data. The key findings include a positive correlation between running chaos engineering experiments and increased availability.
-
Uber Implements Disaster Recovery for Multi-Region Kafka
In a recent blog post, Uber engineers highlight how they use a replication platform to implement disaster recovery at scale with a multi-region Kafka deployment. Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. The multi-region setup is intended to provide business resilience and continuity in the face of natural and human-made disasters.
-
AWS Announces Chaos Engineering as a Service Offering
AWS has announced the upcoming release of its chaos-engineering-as-a-service offering. The Fault Injection Service (FIS) will provide fully managed chaos experiments across a number of AWS services. The service includes pre-built templates that generate disruptions mimicking common real-world events. It can be integrated into CI pipelines via its API.
-
Chaos Engineering on Kubernetes: Chaos Mesh Generally Available with v1.0
The Chaos Mesh team announced the general availability (GA) of Chaos Mesh 1.0 after it was accepted as a CNCF sandbox project in July 2020. Chaos Mesh is a tool to perform chaos engineering experiments on Kubernetes applications.
-
Chaos Conf Q&A: Adrian Cockcroft & Yury Niño Roa
In preparation for ChaosConf 2020, InfoQ sat down with Adrian Cockcroft and Yury Niño Roa to explore topics of interest in the chaos engineering community. Key takeaways included: there are clear benefits to running “game days” to develop psychological safety, and the future of chaos engineering points toward incorporating security and scaling up experiments to test larger failure modes.
-
Navigating Complex Software Projects and Leading in Uncertain Times: InfoQ Live, Sept 23rd
InfoQ Live brings together world-class practitioners such as John Willis, senior director in Red Hat's Global Transformation Office, and Sarah Wells, technical director for operations and reliability at the FT, to share their insights and practical advice on software engineering leadership.
-
Delivering Technology through Software Engineering Leadership: Upcoming InfoQ Live Event
InfoQ Live, the interactive virtual event designed for the modern software practitioner, returns on Sept 23rd with a new topic focus: delivering technology through software engineering leadership and empowered teams. Join world-class practitioners and dive deep into best practices for leading tech projects, analyzing team data dynamics, and leading teams in uncertain times.
-
An Open Source Chaos Engineering Library from AWS
AWS engineers recently wrote about an open source chaos engineering tool called AWSSSMChaosRunner that they used to test fault injection in Prime Video. The tool is built on AWS Systems Manager, which can execute arbitrary commands on EC2 instances, and the team used it to mitigate latency-related issues.
-
Gremlin Announces General Availability of Status Checks
Gremlin recently announced the general availability of Status Checks. This new feature automatically validates that systems are healthy and ready before chaos experiments run in production.
-
Chaos and Resilience Engineering: Mental Models, Tools and Experiments
In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering, and offered advice on planning and running effective chaos experiments and on learning from incidents.
-
Applying Observability to Ship Faster
To get fast feedback, ship work often, as soon as it is ready, and use automated systems in the live environment to test the changes. Monitoring can be used to verify that things are working and to raise an alarm if they are not. Shipping fast in this way can reduce the number of tests you need and make you more resilient to problems.