Incident Response Content on InfoQ

News

DevOps

Open Source Security Tool Trivy Hit by Supply Chain Attack, Prompting Urgent Industry Response

A major security incident affecting the widely used open source vulnerability scanner Trivy has exposed critical weaknesses in software supply chain security, after maintainers confirmed that a malicious release was briefly distributed to users.

Craig Risi
on Apr 03, 2026
DevOps

From Paging to Postmortem: Google Cloud SREs on Using Gemini CLI for Outage Response

A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. The approach improves reliability in critical infrastructure operations and reduces incident response time by integrating intelligent reasoning directly into terminal-based operational tools.

Renato Losio
on Feb 14, 2026
Development

How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage

In a recent article titled "What came first- the CNAME or the A record?" Cloudflare explains how an unclear RFC specification caused its popular 1.1.1.1 service to break. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.

Renato Losio
on Feb 07, 2026
DevOps

Cloudflare Launches "Code Orange: Fail Small" Resilience Plan after Multiple Global Outages

Cloudflare recently published a detailed resilience initiative called Code Orange: Fail Small, outlining a comprehensive plan to prevent large-scale service disruptions after two major network outages in the past six weeks.

Craig Risi
on Jan 16, 2026
DevOps

How Authress Designed for Resilience and Survived a Major AWS Outage

Identity and authentication services company Authress shared its strategy to stay operational during major cloud infrastructure outages like the massive October 2025 AWS outage that disrupted many major services. According to Authress CTO Warren Parad, the company's resilience architecture relies on strategies like multi-region deployment and minimizing reliance on AWS control plane services.

Sergio De Simone
on Dec 28, 2025
DevOps

AWS Debuts “DevOps Agent” to Automate Incident Response and Improve System Reliability

AWS recently announced the public preview of AWS DevOps Agent, a new "frontier agent" that aims to help organizations react more quickly to production incidents, identify root causes, and proactively strengthen system reliability.

Craig Risi
on Dec 17, 2025
DevOps

Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management

A study by ClickHouse found that large language models (LLMs) can't yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents. The study tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues.

Matt Saunders
on Sep 27, 2025
DevOps

PagerDuty's Kafka Outage Silences Alerts for Thousands of Companies

PagerDuty, the incident management platform used by thousands of organisations to alert them to problems on their systems, suffered a major outage itself on 28th August, 2025. In a comprehensive outage report, the company detailed the scope of the problem, the customer impact, and how it is working to prevent a recurrence.

Matt Saunders
on Sep 16, 2025
Architecture & Design

Datadog Employs LLMs for Assisting with Writing Accident Postmortems

Datadog combined structured metadata from its incident management app with Slack messages to create an LLM-driven feature that assists engineers in composing incident postmortems. While working on this solution, the company dealt with the challenges of using LLMs outside interactive dialog systems and of ensuring that the generated content was high quality.

Rafal Gancarz
on Apr 13, 2025
DevOps

How SREs and GenAI Work Together to Decrease eBay's Downtime: an Architect's Insights at KubeCon EU

During his KubeCon EU keynote, Vijay Samuel, Principal MTS Architect at eBay, shared his team’s experience of enhancing incident response capabilities by incorporating ML and LLM building blocks. They realised that GenAI is not a silver bullet but can help engineers work through complex incident investigations by explaining logs, traces, and dashboards.

Olimpiu Pop
on Apr 05, 2025
DevOps

Atlassian Announces Opsgenie Consolidation into JIRA Service Management

Atlassian recently announced that it is consolidating its IT Operations offering and transitioning Opsgenie’s capabilities into JIRA Service Management and Compass.

Aditya Kulkarni
on Mar 29, 2025
DevOps

How Locking, Saturation and CDN Network Issues Brought down Canva

The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.

Renato Losio
on Feb 08, 2025
DevOps

Cloudflare Experiences Major Incident in November, Resulting in Log Loss

Cloudflare has recently confirmed that on November 14th it experienced an incident affecting Cloudflare Logs in which 55% of logs were lost over a 3.5-hour period. The incident impacted most customers using the service: a misconfiguration triggered a cascading series of system failures and exposed weaknesses in handling unexpected spikes in demand.

Renato Losio
on Dec 07, 2024
DevOps

Grafana Frees up Engineers to Fix Problems with Improved Incident Management

Grafana Labs, a leading provider of observability solutions, has unveiled significant enhancements to its Incident Response and Management (IRM) platform. These changes help teams manage and respond to incidents more effectively by streamlining incident management processes and reducing response times.

Matt Saunders
on May 15, 2024
DevOps

Grafana Introduces ML Tool Sift to Improve Incident Response

Grafana Labs has introduced "Sift," a feature for Grafana Cloud designed to enhance incident response management (IRM) by automating system checks and expediting issue resolution. Sift automates various aspects of incident investigation and provides insights into potential issues within Kubernetes environments, helping engineers focus on resolving incidents.

Matt Saunders
on Sep 28, 2023
