Resilience Content on InfoQ

News

Cloud

Microsoft Announces Azure Chaos Studio in Public Preview

At the recent Ignite, Microsoft announced the public preview of Azure Chaos Studio, a fully-managed experimentation service to help customers track, measure, and mitigate faults with controlled chaos engineering to improve the resilience of their cloud applications.

Steef-Jan Wiggers
on Nov 10, 2021
Culture & Methods

Why the Most Resilient Companies Want More Incidents

According to John Egan, the incident management process is meant to be a cycle of not just the response, but also the account of root cause and the updating of internal processes and practices across the industry. Lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems is what he advises.

Ben Linders
on Jun 10, 2021
Architecture & Design

Uber Implements Disaster Recovery for Multi-Region Kafka

In a recent blog post, Uber engineers highlight how they use a replication platform to implement disaster recovery at scale with a multi-region Kafka deployment. Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. Uber's engineers provided business resilience and continuity in the face of natural and human-made disasters.

Eran Stiller
on Jan 04, 2021
Culture & Methods

Navigating Complex Software Projects and Leading in Uncertain Times: InfoQ Live, Sept 23rd

InfoQ Live brings together world-class practitioners such as John Willis, senior director in Red Hat's Global Transformation Office, and Sarah Wells, technical director for operations and reliability @FT, to share their valuable insights and practical advice on software engineering leadership.

Adelina Turcu
on Sep 17, 2020
Culture & Methods

Delivering Technology through Software Engineering Leadership: Upcoming InfoQ Live Event

InfoQ Live, the interactive virtual event designed for the modern software practitioner, returns on Sept 23rd with a new topic focus: delivering technology by software engineering leadership and by empowering teams. Join world-class practitioners and deep-dive into best practices for leading tech projects, analyzing team data dynamics, and leading teams in uncertain times.

Adelina Turcu
on Sep 10, 2020
DevOps

Chaos and Resilience Engineering: Mental Models, Tools and Experiments

In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering, and offered advice on planning and running effective chaos experiments and on learning effectively from incidents.

Daniel Bryant
on Jul 09, 2020
Culture & Methods

Applying Observability to Ship Faster

To get fast feedback, ship work often, as soon as it is ready, and use automated systems in the live environment to test the changes. Monitoring can be used to verify that things are working and to raise an alarm if they are not. Shipping fast in this way can result in having fewer tests and can make you more resilient to problems.

Ben Linders
on Jun 18, 2020
DevOps

Improving Incident Management through Role Assignments and Game Days

John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.

Matt Campbell
on Mar 25, 2020
DevOps

Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF

Adrian Cockcroft recently shared his thoughts on how to produce resilient systems that operate successfully in spite of failures. At the recent QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.

Matt Campbell
on Dec 18, 2019
Java

Spring Cloud Introduces Pluggable Circuit-Breaker Interface

Spring Cloud incubator has introduced a new project called Spring Cloud Circuit Breaker that provides a pluggable circuit-breaker interface. This will help systems to fail fast and prevent cascading failures and system overload.

Uday Tatiraju
on May 07, 2019
Culture & Methods

Mature Microservices and How to Operate Them: QCon London Q&A

Microservices is an architectural approach that keeps systems decoupled so that many changes can be released each day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.

Ben Linders
on Apr 25, 2019
DevOps

Amplifying Sources of Resilience: John Allspaw at QCon London

At QCon London, John Allspaw presented “Amplifying Sources of Resilience: What Research Says”. Key takeaways from the talk included: resilience is something a system does, not what a system has; creating and sustaining “adaptive capacity” within an organisation is resilient action; and learning about how people cope with surprise is the path to finding sources of resilience.

Daniel Bryant
on Apr 23, 2019
Java

Failsafe 2.0 Released with Composable Resilience Policies

Failsafe, a zero-dependency Java library for handling failures, has released version 2.0 with support for resilience policy composition and a pluggable architecture that enables custom policy service providers.

Uday Tatiraju
on Apr 11, 2019
Architecture & Design

Designing and Building a Resilient Serverless System: John Chapin at QCon London

In a presentation at QCon London 2019, John Chapin explained the basics of serverless technologies and how to architect and build a resilient serverless system. He also ran a demo of how a globally distributed, highly available application can be built and run in multiple regions on AWS.

Jan Stenberg
on Mar 12, 2019
DevOps

Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn

At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.

Daniel Bryant
on Nov 12, 2018
