The InfoQ eMag - Taming Complex Systems in Production


Software systems are becoming more complex, outages are becoming more expensive, and consumers are becoming less tolerant of downtime. Together, these trends place a heavy mental (and physical) strain on operations engineers. Acknowledging the fragility of complex systems is the first step in building resilience into systems and people.

Further, to tame complexity and its effects, organizations need a structured, multi-pronged, human-focused approach that makes operations work sustainable, centers decisions around customer experience (tip: that's what SLOs are for), uses continuous testing (yes, including in production), and includes chaos engineering and system observability.

In this eMag, we cover all of these topics to help you tame the complexity in your system and create a healthy working environment for people handling production.

The InfoQ eMag - Taming Complex Systems in Production includes:

  • An Engineer’s Guide to a Good Night’s Sleep - Nicky Wrightson shares five tried-and-true techniques that helped her team at The Financial Times nearly eliminate out-of-hours support calls. This is not to say their system was fault-free, but rather that they were able to build a resilient technical and operational system. Wrightson highlights the importance of having teams that own their system, not just from a delivery point of view, but also the operational model and all the aspects supporting it. For instance, regularly practicing recovery from (injected) failure scenarios is fundamental to increasing the confidence of those supporting the system.
  • Designing Chaos Experiments, Running Game Days, and Building a Learning Organization - Speaking of failure scenarios and recovery, Daniel Bryant’s Q&A with prominent early adopters of chaos engineering from multiple organizations gives an overview of the benefits, challenges, and practices in this critical area for companies that are serious about improving incident response and learning from it. Reading this piece will help you separate the myth (chaos engineering is running random attacks in production) from the truth (chaos engineering is a principled practice of experimentation and information sharing that includes testing in production).
  • Sustainable Operations in Complex Systems with Production Excellence - Liz Fong-Jones stresses that production ownership (putting the team who develops a service on call) is not equivalent to production excellence. Without proper training and safeguards to ensure people’s well-being, production ownership can actually have a negative effect on service reliability and, even worse, demoralize the team. Tooling helps automate processes, but teams also need a roadmap for developing the necessary skills for production excellence. These include measuring what matters (SLOs), observability-enabled quick diagnosis of unknown issues, inter-team collaboration, blameless retrospectives, and risk analysis.
  • Unlocking Continuous Testing: The Four Best Practices Necessary for Success - Continuous testing is critical for assessing the quality and reliability of systems, among other "-ilities". But without a strong focus on the quality of the tests themselves, testing can become costly and ineffective, slowing down delivery. Lubos Parobek shares four key practices to ensure high test quality: investing in a high pass rate, keeping tests short and atomic (which leads to shorter test execution and more reliable tests), testing across multiple platforms, and leveraging parallelization (so that test suites can grow without compromising speed of execution).
  • Testing in Production—Quality Software Faster - What if we could do continuous testing reliably and safely, not only during delivery but also in production? While chaos engineering focuses on failure injection, Michael Bryzek’s talk (of which we include a summary) focuses on testing happy-path scenarios in production, dramatically increasing the confidence that business-critical services are working as expected every day, every hour, and every minute. Bryzek shows through practical examples that implementing safeguards for testing in production can be much simpler than most people expect. What tends to be more challenging is architecting and delivering distributed services with minimal dependencies and learning to develop and trust high-quality automated tests.

InfoQ eMags are professionally designed, downloadable collections of popular InfoQ content - articles, interviews, presentations, and research - covering the latest software development technologies, trends, and topics.