Fault Tolerance Content on InfoQ
-
Architecting for High Availability
Attila Narin discusses AWS concepts: Availability Zones, RDS Multi-AZ deployments, SQS and Auto Scaling, Elastic IP, load balancing, DNS, DynamoDB, Amazon S3, etc., and EC2 best practices.
-
Designing Fault Tolerant Distributed Applications
Scott Andreas discusses creating fault-tolerant distributed applications and demos Ordasity, a framework for building self-organizing systems with services.
-
Runaway Complexity in Big Data, and a Plan to Stop It
Nathan Marz outlines several sources of complexity introduced in data systems (lack of human fault tolerance, conflation of data and queries, schemas done wrong) and what can be done to avoid them.
-
Erlang's Open Telecom Platform (OTP) Framework
Steve Vinoski introduces Erlang’s OTP Framework, outlining some of its main features, including several behaviors: implementations of common patterns useful for concurrent, fault-tolerant applications.
-
Storm: Distributed and Fault-tolerant Real-time Computation
Nathan Marz discusses Storm concepts (streams, spouts, bolts, topologies) and explains how to use Storm’s Clojure DSL for real-time stream processing, distributed RPC, and continuous computation.
-
Anomaly Detection, Fault Tolerance and Anticipation Patterns
John Allspaw discusses fault tolerance, anomaly detection, and anticipation patterns that help create highly available and resilient systems.
-
Techniques for Scaling the Netflix API
Daniel Jacobson covers the history of Netflix’s APIs, adaptation for the cloud, development and testing, resiliency, and the future of their APIs.
-
Architecting for Failure at the Guardian.co.uk
Michael Brunton-Spall talks about various types of system failure that can happen, sharing the lessons learned at the Guardian and measures taken to prevent and mitigate failure.
-
Building Highly Available Systems in Erlang
Joe Armstrong discusses highly available (HA) systems, introducing different types of HA systems and data, HA architecture and algorithms, 6 rules of HA, and how HA is done with Erlang.
-
Storm: Distributed and Fault-tolerant Real-time Computation
Nathan Marz explains Storm, a distributed, fault-tolerant, real-time computation system currently used by Twitter to keep statistics on user clicks for every URL and domain.
-
Above the Clouds: Introducing Akka
Jonas Bonér introduces Akka, a JVM platform that wants to address the complex problems of concurrency, scalability and fault tolerance using Actors, STM and self-healing from crashes.
-
Things Break, Riak Bends
Justin Sheehy talks about failure and the need to prepare for it, giving some real life examples along with techniques implemented in Riak to make it resilient to faults.