Reliability Content on InfoQ
Articles
The Service and the Beast: Building a Windows Service that Does Not Fail to Restart
Windows Services play a key role in the Microsoft Windows operating system and support the creation and management of long-running processes. When “Fast Startup” is enabled and the PC is started after a regular shutdown, however, services may fail to restart. This article aims to create a persistent service that always runs and that restarts whenever Windows is restarted or started again after a shutdown.
-
Building & Operating High-Fidelity Data Streams
At QCon Plus 2021 last November, Sid Anand, chief architect at Datazoom and PMC Member at Apache Airflow, presented on building high-fidelity nearline data streams as a service within a lean team. In this talk, Anand provides a master class on building high-fidelity data streams from the ground up.
-
Employing Team-Based Agile Coaching to Establish SRE in an Organization
Establishing SRE in a software delivery organization typically requires a socio-technical transformation. Operations teams need to learn how to provide a scalable SRE infrastructure that enables development teams to run their services efficiently. This paper presents how agile coaching was employed to run an SRE transformation in a product delivery organization of 25 teams.
-
Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops
This article explores an SRE implementation where the operations team builds and runs the SRE infrastructure and the development teams build and run the services leveraging the SRE infrastructure. This SRE solution enables the software delivery organization to scale the number of services in operation without linearly scaling the number of people required to operate the services.
-
Building Tech at Presidential Scale
Dan Woods discusses the unique challenges of building and running tech for a presidential cycle. Woods also describes how ML was applied at foundational points to reduce operating costs, along with some of the architectural choices made.
-
Improving Speed and Stability of Software Delivery Simultaneously at Siemens Healthineers
In this article, we focus on the software delivery process at Siemens Healthineers Digital Health. The process is subject to the strict regulations that apply in the medical industry. We describe our journey of transforming the process to improve both speed and stability. Both measures improved at the same time during the transformation, confirming research from the “Accelerate” book.
-
Site Reliability Engineering for Native Mobile Apps
In this article, we will describe how we can apply Site Reliability Engineering (SRE) principles to mobile app development. First, we will describe the key SRE tenets and what tools can be used to implement them. Then, we will delve into organization topology, i.e. how an organization can be designed to adopt SRE for mobile app development.
-
Software Architecture and Design InfoQ Trends Report—April 2021
An overview of how the InfoQ editorial team sees the Software Architecture and Design topic evolving in 2021, with a focus on what architects are designing for today.
-
Shifting Modes: Creating a Program to Support Sustained Resilience
The second article in a series on how software companies have adapted, and continue to adapt, to enhance their resilience. It explores how organizations can shift to a Learn & Adapt safety mode and compares the traits of organizations that are well poised to sustain this mode shift. The shift will not only make them safer but will also give them a competitive advantage.
-
Failover Conf Q&A on Building Reliable Systems: People, Process, and Practice
One of the biggest engineering challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy. InfoQ recently sat down with several engineers and technical leaders who are involved with the upcoming Failover Conf virtual event and asked for their opinions on the best practices for building and running reliable systems.
-
Towards Successful Resilient Software Design
In this article, Uwe Friedrichsen explains the “why” and “what” of resilient software design, discusses the challenges he has met most often in recent years, and shares his thoughts on how to implement resilient software design in your organisation.
-
QoS for Applications: A Resource Management Framework for Runtimes
This article draws an analogy between QoS for networks and for applications, resulting in a mapping guide between the two and introducing a production solution for Java, (J)Ruby, and (J)Python apps.