Resilience Content on InfoQ
-
Microsoft Announces Azure Chaos Studio in Public Preview
At the recent Ignite conference, Microsoft announced the public preview of Azure Chaos Studio, a fully managed experimentation service that helps customers track, measure, and mitigate faults through controlled chaos engineering, improving the resilience of their cloud applications.
-
Why the Most Resilient Companies Want More Incidents
According to John Egan, incident management is meant to be a cycle that covers not just the response, but also an account of the root cause and updates to internal processes and to practices across the industry. He advises lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems.
-
Uber Implements Disaster Recovery for Multi-Region Kafka
In a recent blog post, Uber engineers highlight how they use a replication platform to implement disaster recovery at scale with a multi-region Kafka deployment. Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. Uber's engineers provided business resilience and continuity in the face of natural and human-made disasters.
-
Navigating Complex Software Projects and Leading in Uncertain Times: InfoQ Live, Sept 23rd
InfoQ Live brings together world-class practitioners such as John Willis, senior director in Red Hat's Global Transformation Office, and Sarah Wells, technical director for operations and reliability @FT, to share their valuable insights and practical advice on software engineering leadership.
-
Delivering Technology through Software Engineering Leadership: Upcoming InfoQ Live Event
InfoQ Live, the interactive virtual event designed for the modern software practitioner, returns on Sept 23rd with a new topic focus: delivering technology through software engineering leadership and by empowering teams. Join world-class practitioners and dive deep into best practices for leading tech projects, analyzing team data dynamics, and leading teams in uncertain times.
-
Chaos and Resilience Engineering: Mental Models, Tools and Experiments
In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering, and offered advice on planning and running effective chaos experiments and on learning effectively from incidents.
-
Applying Observability to Ship Faster
To get fast feedback, ship work often, as soon as it is ready, and use automated systems in the live environment to test the changes. Monitoring can be used to verify that things are working and to raise an alarm if they are not. Shipping fast in this way can result in having fewer tests and can make you more resilient to problems.
-
Improving Incident Management through Role Assignments and Game Days
John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.
-
Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF
Adrian Cockcroft recently shared his thoughts on how to produce resilient systems that operate successfully despite the presence of failures. At the recent QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.
-
Spring Cloud Introduces Pluggable Circuit-Breaker Interface
The Spring Cloud incubator has introduced a new project called Spring Cloud Circuit Breaker that provides a pluggable circuit-breaker interface. This will help systems fail fast and prevent cascading failures and system overload.
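
To illustrate the fail-fast behaviour described above, here is a minimal, generic circuit-breaker sketch in Python. It is not the Spring Cloud Circuit Breaker API; the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`, and `clock` are assumptions made for this example only.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the target."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a failing dependency, which is what limits cascading failures and overload.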
-
Mature Microservices and How to Operate Them: QCon London Q&A
Microservices is an architectural approach to keep systems decoupled for releasing many changes a day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.
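
The "backoff and retry" pattern mentioned above can be sketched as follows. This is a generic illustration, not code from the talk; the function name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `max_delay`, `sleep`, `rng`) are assumptions for this example, and the jitter strategy is one common choice among several.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(), retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Randomise the wait so many clients do not retry in lockstep.
            sleep(ceiling * rng())
```

The `sleep` and `rng` parameters are injectable only so the behaviour can be checked without real waiting; in normal use the defaults apply.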
-
Amplifying Sources of Resilience: John Allspaw at QCon London
At QCon London, John Allspaw presented “Amplifying Sources of Resilience: What Research Says”. Key takeaways from the talk included: resilience is something a system does, not something a system has; creating and sustaining “adaptive capacity” within an organisation is resilient action; and learning how people cope with surprise is the path to finding sources of resilience.
-
Failsafe 2.0 Released with Composable Resilience Policies
Failsafe, a zero-dependency Java library for handling failures, has released version 2.0 with support for resilience policy composition and a pluggable architecture that enables custom policy service providers.
-
Designing and Building a Resilient Serverless System: John Chapin at QCon London
In a presentation at QCon London 2019, John Chapin explained the basics of serverless technologies and how to architect and build a resilient serverless system. He also ran a demo of how a globally distributed, highly available application can be built and run in multiple regions on AWS.
-
Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn
At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.