Observability Content on InfoQ
-
Analyzing Incident Data across Organizations: Courtney Nash on the VOID
The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. InfoQ talks with Courtney Nash about their recent findings, including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.
-
Embracing Cloud-Native for Apache DolphinScheduler with Kubernetes: a Case Study
This article shares how Apache DolphinScheduler was updated to use a more modern, cloud-native architecture, which included moving to Kubernetes and integrating with Argo CD and Prometheus. These changes substantially improve the user experience of deploying, operating, and monitoring DolphinScheduler.
-
DevOps and Cloud InfoQ Trends Report – June 2022
This article summarizes how we see the "cloud computing and DevOps" space in 2022, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
How to Fight Climate Change as a Software Engineer
We need to reduce and eliminate greenhouse gas emissions to stop climate change. But what role does software play, and what can software engineers do? Let’s take a look under the hood to uncover the relationship between greenhouse gas emissions and software, learn about the impact that we can have, and identify concrete ways to reduce emissions when creating and running software.
-
Chaos Engineering and Observability with Visual Metaphors
This article introduces a new actor for visualising chaos engineering and observability: metaphors. It provides the conceptual foundations of chaos engineering and observability, surveys the state of the art in visualisation techniques available on the market, and shows how treemaps, gauge charts, geocentric and city metaphors can enrich the spectrum of visual strategies for observing chaos.
-
How to Best Use MTT* Metrics to Optimize Your Incident Response
Selecting the correct MTT* metric to improve your incident response is important. If the wrong metric is chosen, the improvements may get lost in the noise of a multivariable equation. This article reviews the various MTT* metrics available and discusses the best scenarios for selecting each one.
-
Why Change Intelligence is Necessary to Effectively Troubleshoot Modern Applications
Change Intelligence is often a missing component in incident management. Successfully correlating monitoring and observability data allows engineers to arrive at the root cause more rapidly. Telemetry provides the building blocks that enable change intelligence to identify and map the root cause, based on changes in the system and their broader impact.
-
Why the Future of Monitoring Is Agentless
Traditionally, monitoring software has relied heavily on agent-based approaches for extracting telemetry data from systems. Observability requires better telemetry than agents currently provide. OpenTelemetry is driving advances in this area by creating a standard format and APIs to create, transmit, and store telemetry data. This unlocks new opportunities in observability.
-
DevOps and Cloud InfoQ Trends Report - July 2021
This article summarizes how we see the "cloud computing and DevOps" space in 2021, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
Solving Mysteries Faster with Observability
At QCon Plus, a virtual conference for senior software engineers and architects covering the trends, best practices, and solutions leveraged by the world's most innovative software organizations, Elizabeth Carretto discussed observability at Netflix and how their internal tool, Edgar, comes into play.
-
Cloud Native and Kubernetes Observability: Expert Panel
InfoQ recently caught up with observability experts to discuss several topics, including fundamental questions about what observability really entails, the misconceptions and challenges that users are facing, the open standards that are influencing the industry in general, and why there has been more interest in this area of late.
-
Site Reliability Engineering Experiences at Instana
With the popularity of distributed architectures, distributed databases, containers, and container orchestrators, an approach that emphasizes automation and a culture of collaboration is a natural fit for modern-day operations. Site Reliability Engineering takes practices that have been established and proven in software engineering and applies them to the field of operations.