Fault Tolerance Content on InfoQ
-
When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World
Sovereign fault domains are failure boundaries defined by legal, political, or physical jurisdiction rather than hardware topology. The article maps geopolitical events to known distributed-systems failure modes, argues multi-region should replace multi-AZ as the HA baseline for systems crossing jurisdictions, and outlines design patterns, chaos experiments, and an ALE model to justify the spend.
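As a rough illustration of how an ALE model can frame the spend, the sketch below assumes ALE stands for annualized loss expectancy in its usual risk-analysis form (single loss expectancy multiplied by annual rate of occurrence). The function names and every figure are hypothetical and are not taken from the article.

```python
def annualized_loss_expectancy(single_loss_expectancy: float,
                               annual_rate_of_occurrence: float) -> float:
    """ALE = SLE * ARO (assumed standard definition, not quoted from the article)."""
    return single_loss_expectancy * annual_rate_of_occurrence


def mitigation_is_justified(ale_before: float, ale_after: float,
                            annual_mitigation_cost: float) -> bool:
    """A mitigation pays off when the loss it avoids per year exceeds its yearly cost."""
    return (ale_before - ale_after) > annual_mitigation_cost


# Hypothetical figures: a regional loss costing 400,000 that is expected once
# every four years, reduced to once every sixteen years by going multi-region.
ale_single_region = annualized_loss_expectancy(400_000, 0.25)
ale_multi_region = annualized_loss_expectancy(400_000, 0.0625)
worth_it = mitigation_is_justified(ale_single_region, ale_multi_region, 60_000)
```

With these made-up inputs the avoided loss is 75,000 per year against a 60,000 yearly cost, so the check passes; real inputs would have to come from an organization's own incident and cost data.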
-
Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies
This article analyzes the dynamics of Apache Kafka stretch clusters, assesses the impact of WAN disruptions, and outlines Disaster Recovery (DR) strategies. It focuses on maintaining high availability and data integrity across multi-region deployments and on protecting vital services against service level agreement violations.
-
Designing Resilient Event-Driven Systems at Scale
Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.
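Shuffle sharding, one of the patterns named above, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the article's implementation: the `shuffle_shard` function, the worker names, and the shard size are all hypothetical.

```python
import hashlib
import random


def shuffle_shard(customer_id: str, workers: list[str], shard_size: int) -> list[str]:
    """Deterministically assign a customer a small pseudo-random subset of workers.

    Two customers rarely share the exact same subset, so a customer that
    overloads its own workers only partially affects most other customers.
    """
    # Derive a stable seed from the customer ID so the assignment is repeatable.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(workers, shard_size))


# Hypothetical pool of eight workers with two workers per customer.
workers = [f"worker-{i}" for i in range(8)]
shard_a = shuffle_shard("customer-a", workers, 2)
shard_b = shuffle_shard("customer-b", workers, 2)
```

With eight workers and shards of two there are 28 possible shards, so the chance that two customers overlap completely stays small, and it shrinks further as the pool grows.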
-
Cell-Based Architecture Adoption Guidelines
The challenges in building modern, reliable, and understandable distributed systems continue to grow, and cell-based architecture is a valuable way to accept failures, isolate them, and stay reliable when they occur. Organizations must ensure that cell-based architecture is the right fit for them and that the migration will not cause more problems than it solves.
-
Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems
Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell manages its resources and makes decisions autonomously. Observability for cell-based architecture requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design.
-
Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems
In this article series, we take readers on a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.
-
Implementing Microservicilities with Quarkus and MicroProfile
Microservicilities is a list of cross-cutting concerns that a service must implement apart from the business logic. These concerns include invocation, elasticity and resiliency, among others. This article describes how Quarkus and MicroProfile may be used to implement these concerns.
-
Designing Chaos Experiments, Running Game Days, and Building a Learning Organization: Chaos Conf Q&A
The second Chaos Conf event is taking place in San Francisco over 25-26 September. In preparation for the conference, InfoQ sat down with a number of the presenters and discussed topics such as the evolution and adoption of chaos engineering, key people and process lessons learned from running chaos experiments, and the biggest blockers to mainstream adoption.
-
Resilient Systems in Banking
Resilience is about tolerating failure, not eliminating it. To build a resilient system, you must build a system that absorbs shocks, and continues or recovers. Following best practices for resilient architecture, including established cloud patterns, allowed Starling Bank to build a bank, from scratch, in a year, against a backdrop of highly public outages amongst incumbent banks.
-
Service Mesh: Promise or Peril?
Service meshes such as Istio, Linkerd, and Cilium are gaining increased visibility as companies adopt microservice architectures. The arguments for a service mesh are compelling: full-stack observability, transparent security, systems resilience, and more. But is a service mesh really the right solution for you? This article examines when a service mesh makes sense and when it might not.
-
Six Tips for Running Scalable Workloads on Kubernetes
Tips to ensure Kubernetes knows what is happening with your deployment: where best to schedule it, when it is ready to serve requests, and how to spread work across as many nodes as possible.
-
A Comparison between Rust and Erlang
This article compares Erlang and Rust, detailing their similarities and differences. It may interest both Erlang developers looking into Rust and Rust developers looking into Erlang. A final section covers each language's capabilities and shortcomings in more detail and argues for leveraging the strengths of both languages in the same project.