Reliability Content on InfoQ

Articles

Cloud

Engineering Principles for Building a Successful Cloud-Prem Solution

Discover how Cloud-Prem solutions combine cloud efficiency with on-premise control, meeting data sovereignty and compliance demands while optimizing operational costs and enhancing customer security.

Satyam Dhar
on Jun 26, 2025
DevOps

Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

This article analyzes the dynamics of Apache Kafka stretch clusters, assessing WAN disruptions and failure scenarios and outlining Disaster Recovery (DR) strategies. The aim is high availability and data integrity across multi-region deployments, improving operational resilience and safeguarding vital services against service level agreement violations.

Srikanth Daggumalli Nishchai Jayanna Manjula
on Jun 20, 2025
Cloud

Designing Resilient Event-Driven Systems at Scale

Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.

Rajesh Kumar Pandey
on May 30, 2025
Culture & Methods

Reaching Your Automatic Testing Goals by Enhancing Your Test Architecture

If you have automatic end-to-end tests, you have test architecture, even if you’ve never given it a thought. Test architecture encompasses everything from code to more theoretical concerns like enterprise architecture, but with concrete, immediate consequences. Let's explore how you can achieve the goals you have for your automatic testing effort.

James Bornefelt Westfall
on Dec 04, 2024
Architecture & Design

How Cell-Based Architecture Enhances Modern Distributed Systems

Cell-based architecture has emerged as a response to many challenges associated with distributed systems. It employs the bulkhead pattern to isolate failures to a fraction of the affected infrastructure footprint and prevent widespread impact. Cells can also help organize large architectures into domain-bound deployment and delivery units, which provides essential sociotechnical benefits.

Erica Pisani, Rafal Gancarz
on Oct 14, 2024
Culture & Methods

Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses

Incidents are often perceived as extraordinary aberrations, unconnected to "normal" work. For over twenty years, the field of Resilience Engineering has aimed at flipping this approach around — by understanding what makes incidents so rare (relative to when and how they do not happen) and so minor (relative to how much worse they can be) and deliberately enhancing what makes that possible.

John Allspaw
on Aug 12, 2024
Culture & Methods

Data-Driven Decision Making - Software Delivery Performance Indicators at Different Granularities

Optimizing a software delivery organization is not a straightforward process standardized in the software industry. Getting the organization to analyze the data and act on it is a difficult undertaking. This article presents insights into how a socio-technical framework for optimizing a software delivery organization has been set up and brought to the point of regular use.

Vladyslav Ukis
on May 23, 2023
DevOps

AIOps: Site Reliability Engineering at Scale

AIOps can simplify and streamline processes, which can reduce the mental burden on employees while improving communication and collaboration between departments.

Dominick Blue
on May 02, 2023
Culture & Methods

Assessing Organizational Culture to Drive SRE Adoption

SRE adoption is greatly influenced by the organizational culture at hand. This article describes how to assess the organizational culture in terms of production operations at the beginning of the SRE transformation. It provides a roadmap of small culture changes accumulating over time, and shows how the leadership facilitated the necessary culture changes.

Vladyslav Ukis
on Apr 04, 2023
Development

The Service and the Beast: Building a Windows Service that Does Not Fail to Restart

Windows Services play a key role in the Microsoft Windows operating system, and support the creation and management of long-running processes. When “Fast Startup” is enabled and the PC is started after a regular shutdown, though, services may fail to restart. The aim of this article is to create a persistent service that will always run and restart after Windows restarts, or after shutdown.

Michael Haephrati, Ruth Haephrati
on Dec 28, 2022
Architecture & Design

Building & Operating High-Fidelity Data Streams

At QCon Plus 2021 last November, Sid Anand, chief architect at Datazoom and PMC Member at Apache Airflow, presented on building high-fidelity nearline data streams as a service within a lean team. In this talk, Anand provides a master class on building high-fidelity data streams from the ground up.

Sid Anand
on Sep 30, 2022
Culture & Methods

Employing Team-Based Agile Coaching to Establish SRE in an Organization

Establishing SRE in a software delivery organization typically requires a socio-technical transformation. Operations teams need to learn how to provide a scalable SRE infrastructure to enable development teams to run their services efficiently. This paper presents how agile coaching has been employed to run an SRE transformation in a 25-team product delivery organization.

Philipp Gündisch, Vladyslav Ukis
on Aug 23, 2022
