Reliability Content on InfoQ
-
Engineering Principles for Building a Successful Cloud-Prem Solution
Discover how Cloud-Prem solutions combine cloud efficiency with on-premise control, meeting data sovereignty and compliance demands while optimizing operational costs and enhancing customer security.
-
Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies
This article analyzes the dynamics of Apache Kafka stretch clusters, assesses the impact of WAN disruptions, and outlines Disaster Recovery (DR) strategies. It focuses on maintaining high availability and data integrity across multi-region deployments, with the aim of improving operational resilience and safeguarding vital services against service level agreement violations.
-
Designing Resilient Event-Driven Systems at Scale
Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.
-
Reaching Your Automatic Testing Goals by Enhancing Your Test Architecture
If you have automatic end-to-end tests, you have test architecture, even if you’ve never given it a thought. Test architecture encompasses everything from code to more theoretical concerns like enterprise architecture, but with concrete, immediate consequences. Let's explore how you can achieve the goals you have for your automatic testing effort.
-
How Cell-Based Architecture Enhances Modern Distributed Systems
Cell-based architecture has emerged as a response to many challenges associated with distributed systems. It employs the bulkhead pattern to isolate failures to a fraction of the affected infrastructure footprint and prevent widespread impact. Cells can also help organize large architectures into domain-bound deployment and delivery units, which provides essential sociotechnical benefits.
-
Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses
Incidents are often perceived as extraordinary aberrations, unconnected to "normal" work. For over twenty years, the field of Resilience Engineering has aimed at flipping this approach around — by understanding what makes incidents so rare (relative to when and how they do not happen) and so minor (relative to how much worse they can be) and deliberately enhancing what makes that possible.
-
Data-Driven Decision Making - Software Delivery Performance Indicators at Different Granularities
Optimizing a software delivery organization is not a straightforward process standardized in the software industry. Getting the organization to analyze the data and act on it is a difficult undertaking. This article presents insights into how a socio-technical framework for optimizing a software delivery organization has been set up and brought to the point of regular use.
-
AIOps: Site Reliability Engineering at Scale
AIOps can simplify and streamline processes, which can reduce the mental burden on employees while improving communication and collaboration between departments.
-
Assessing Organizational Culture to Drive SRE Adoption
SRE adoption is greatly influenced by the organizational culture at hand. This article describes how to assess the organizational culture in terms of production operations at the beginning of the SRE transformation. It provides a roadmap of small culture changes accumulating over time, and shows how the leadership facilitated the necessary culture changes.
-
The Service and the Beast: Building a Windows Service that Does Not Fail to Restart
Windows Services play a key role in the Microsoft Windows operating system, and support the creation and management of long-running processes. When “Fast Startup” is enabled and the PC is started after a regular shutdown, though, services may fail to restart. The aim of this article is to create a persistent service that will always run and restart after Windows restarts, or after shutdown.
-
Building & Operating High-Fidelity Data Streams
At QCon Plus 2021 last November, Sid Anand, chief architect at Datazoom and PMC Member at Apache Airflow, presented on building high-fidelity nearline data streams as a service within a lean team. In this talk, Anand provides a master class on building high-fidelity data streams from the ground up.
-
Employing Team-Based Agile Coaching to Establish SRE in an Organization
Establishing SRE in a software delivery organization typically requires a socio-technical transformation. Operations teams need to learn how to provide a scalable SRE infrastructure to enable development teams to run their services efficiently. This paper presents how agile coaching has been employed to run an SRE transformation in a 25-team-strong product delivery organization.