Resilience Content on InfoQ
-
Navigating Complex Software Projects and Leading in Uncertain Times: InfoQ Live, Sept 23rd
InfoQ Live brings together world-class practitioners such as John Willis, senior director in Red Hat's Global Transformation Office, and Sarah Wells, technical director for operations and reliability @FT, to share their valuable insights and practical advice on software engineering leadership.
-
Delivering Technology through Software Engineering Leadership: Upcoming InfoQ Live Event
InfoQ Live, the interactive virtual event designed for the modern software practitioner, returns on Sept 23rd with a new topic focus: delivering technology through software engineering leadership and by empowering teams. Join world-class practitioners for a deep dive into best practices for leading tech projects, analyzing team data dynamics, and leading teams in uncertain times.
-
Chaos and Resilience Engineering: Mental Models, Tools and Experiments
In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering, and offered advice on planning and running effective chaos experiments and on learning from incidents.
-
Applying Observability to Ship Faster
To get fast feedback, ship work often, as soon as it is ready, and use automated systems in the live environment to test the changes. Monitoring can verify that things are working and raise an alarm if they are not. Shipping fast in this way can result in fewer tests and can make you more resilient to problems.
-
Improving Incident Management through Role Assignments and Game Days
John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.
-
Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF
Adrian Cockcroft recently shared his thoughts on how to produce resilient systems that operate successfully in spite of failures. At the recent QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.
-
Spring Cloud Introduces Pluggable Circuit-Breaker Interface
The Spring Cloud incubator has introduced a new project, Spring Cloud Circuit Breaker, that provides a pluggable circuit-breaker interface. Circuit breakers help systems fail fast and prevent cascading failures and system overload.
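To illustrate the general circuit-breaker pattern behind "fail fast", here is a minimal, library-independent sketch in Python. It is not the Spring Cloud Circuit Breaker API; the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical and chosen only for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustration only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of calling the struggling dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A successful call closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than waiting on a failing dependency, which is what limits cascading failures and overload.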
-
Mature Microservices and How to Operate Them: QCon London Q&A
Microservices is an architectural approach to keep systems decoupled for releasing many changes a day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.
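As a rough illustration of the "backoff and retry" item mentioned above, the following Python sketch retries a failing call with capped exponential backoff and jitter. It is not taken from the talk; `retry_with_backoff` and its parameters are hypothetical names used only for this example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(), retrying on any exception with capped exponential backoff.

    `sleep` and `rng` are injectable so the behaviour can be tested
    without real delays or randomness.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(ceiling * rng())
```

The jitter is a design choice rather than a requirement: without it, many clients that failed at the same moment would all retry at the same moment too.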
-
Amplifying Sources of Resilience: John Allspaw at QCon London
At QCon London, John Allspaw presented “Amplifying Sources of Resilience: What Research Says”. Key takeaways from the talk included that resilience is something a system does, not what a system has; that creating and sustaining “adaptive capacity” within an organisation is resilient action; and that learning about how people cope with surprise is the path to finding sources of resilience.
-
Failsafe 2.0 Released with Composable Resilience Policies
Failsafe, a zero-dependency Java library for handling failures, has released version 2.0 with support for resilience policy composition and a pluggable architecture that enables custom policy service providers.
-
Designing and Building a Resilient Serverless System: John Chapin at QCon London
In a presentation at QCon London 2019, John Chapin explained the basics of serverless technologies and how to architect and build a resilient serverless system. He also ran a demo of how a globally distributed, highly available application can be built and run in multiple regions on AWS.
-
Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn
At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.
-
Gremlin Releases Application Level Fault Injection (ALFI) Platform for Targeted Chaos Experiments
Gremlin Inc has released its second product offering in the “Failure-as-a-Service” domain: Application-Level Fault Injection (ALFI). Building upon the initial platform, which helped engineers create and run chaos experiments at the infrastructure level, ALFI enables failure injection at the application level via a native language library.
-
How to Achieve a Resilient Architecture
To manage systems at scale you must embrace failures, pushing your system almost to the breaking point while still being able to recover, Adrian Hornsby writes in two blog posts that share his experiences from more than a decade of working with large-scale systems and the patterns he has found useful.
-
Ben Gracewood on Learning from an Organisational Train Wreck
At the recent JAFAC conference, Ben Gracewood told the story of how POS developer Vend transformed their development organisation following catastrophic disruption and losses. He explored what happened after they reduced headcount by over 30%, what they had in place that enabled them to survive, and what they did differently as a result of the changes.