Reliability Engineering Matters, Except When It Doesn't
Michael Nygard shares essential Reliability Engineering techniques that can keep systems from falling apart, but the discipline has some limitations to be considered.
Anyone who has wondered what kind of software Santa Claus & Co. use got a hint recently on YouTube. This might irritate some software engineers who assumed Santa Claus would only use open source software.
As announced on 18 August 2011, the Irish Software Engineering Research Centre (Lero) has signed a €300,000 contract for a research project with the European Space Agency (ESA). The goal of the research is to provide a solution framework for future space missions.
Debugging event-driven applications has always been notoriously difficult. The Footsteps research project seeks to address the problem of reproducibility by offering a logging and replay framework that records non-deterministic events such as mouse clicks and random number generation. No plugins or special browsers are needed; this is done entirely in JavaScript.
MongoDB's new journaling feature improves reliability with write-ahead redo logs: log entries are written before permanent storage is updated. When a server restarts after a crash, outstanding journal files are replayed before the server goes online. Other changes include sharding performance boosts, shell tab completion, and the addition of covering and sparse indexes.
This article draws an analogy between QoS for networks and for applications, resulting in a mapping guide between the two and introducing a production solution for Java, (J)Ruby, and (J)Python apps.
Blake Mizerany presents various ways in which distributed systems can fail and how to recover using Doozer, a highly available, consistent data store.
Steve Vinoski explains how to avoid some of the Erlang errors that can bring down a system, starting from the premise that not all crashes are welcome, as the “Let It Crash” philosophy might suggest.
Arnon Rotem-Gal-Oz discusses creating a SOA implementation that maintains good overall reliability despite using a larger number of smaller components.
Jonas Bonér and Kresten Krab Thorup discuss key aspects of Erlang, such as fault tolerance and reliability, and how the Akka and Erjang projects try to bring them to the JVM.
In this interview at Agile 2011, Jez Humble discusses continuous delivery and the deployment pipeline, emphasizing the importance of feedback and of automating tests at every level to validate deployments. Gone are the days of massive acceptance test scripts. He also talks about the evils of feature branching and about DevOps practices for collaborating throughout the delivery cycle.