Fault Tolerance Content on InfoQ
Articles
Cell-Based Architecture Adoption Guidelines
The challenges in building modern, reliable, and understandable distributed systems continue to grow, and cell-based architecture is a valuable way to accept failures, isolate them, and stay reliable in spite of them. Organizations must ensure that a cell-based architecture is the right fit for them and that the migration will not cause more problems than it solves.
-
Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems
Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell manages its resources and makes decisions autonomously. Observability for cell-based architecture requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design.
-
Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems
In this article series, we take readers on a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.
-
Implementing Microservicilities with Quarkus and MicroProfile
Microservicilities are the cross-cutting concerns that a service must implement in addition to its business logic. These concerns include invocation, elasticity, and resiliency, among others. This article describes how Quarkus and MicroProfile may be used to implement these concerns.
-
Designing Chaos Experiments, Running Game Days, and Building a Learning Organization: Chaos Conf Q&A
The second Chaos Conf event is taking place in San Francisco over 25-26 September. In preparation for the conference, InfoQ sat down with a number of the presenters and discussed topics such as the evolution and adoption of chaos engineering, key people and process lessons learned from running chaos experiments, and the biggest blockers to mainstream adoption.
-
Resilient Systems in Banking
Resilience is about tolerating failure, not eliminating it. To build a resilient system, you must build a system that absorbs shocks, and continues or recovers. Following best practices for resilient architecture, including established cloud patterns, allowed Starling Bank to build a bank, from scratch, in a year, against a backdrop of highly public outages amongst incumbent banks.
-
Service Mesh: Promise or Peril?
Service meshes such as Istio, Linkerd, and Cilium are gaining increased visibility as companies adopt microservice architectures. The arguments for a service mesh are compelling: full-stack observability, transparent security, systems resilience, and more. But is a service mesh really the right solution for you? This article examines when a service mesh makes sense and when it might not.
-
Six Tips for Running Scalable Workloads on Kubernetes
Tips to ensure Kubernetes knows what is happening with your deployment: where best to schedule it, when it is ready to serve requests, and how to spread work across as many nodes as possible.
-
A Comparison between Rust and Erlang
This article compares Erlang and Rust, detailing their similarities and differences. It may be interesting to both Erlang developers looking into Rust and Rust developers looking into Erlang. A final section details each language's capabilities and shortcomings and argues for the possibility of leveraging both languages' strengths in the same project.
-
When Streams Fail: Implementing a Resilient Apache Kafka Cluster at Goldman Sachs
At QCon New York, Anton Gorshkov presented “When Streams Fail: Kafka Off the Shore”. The talk shared insight into how a platform team at a large financial institution designs and operates shared internal messaging clusters like Apache Kafka, and also how they plan for, and resolve, the inevitable failures that occur.
-
But is it Safe?
While it is rare to hear the question, "Is this software safe?", the safety aspects of software are becoming increasingly important. The proliferation of IoT devices widens the impact that a small problem can have. Several techniques exist to help developers analyze and improve the safety of the software they create.
-
Storm Applied Review and Q&A with the Authors
Storm is a distributed, fault-tolerant, real-time computation system that was originally developed at BackType and later open sourced by Twitter. Storm Applied is a new book from Manning that aims to provide a practical guide on using Storm, both in a development and in a production setting. InfoQ has spoken with two of the book’s authors, Sean T. Allen and Matthew Jankowski.