Incident Response Content on InfoQ

News

Architecture & Design

Expedia Uses AI-Driven Service Telemetry Analyzer to Accelerate Incident Investigation

Expedia Group has introduced STAR, an internal AI-assisted observability platform that helps engineers investigate production incidents using service telemetry and LLMs. Built with FastAPI, Datadog, Celery, Redis, and Langfuse, STAR follows structured workflows to analyze telemetry, generate root cause assessments, and support incident response while keeping engineers in the loop.

Leela Kumili
on Jul 23, 2026
Architecture & Design

Cloudflare Processes 10M+ Daily Insights with New Security Overview Dashboard

Cloudflare has launched a Security Overview dashboard that consolidates security signals into prioritized action items. It surfaces millions of daily insights, helping teams identify and remediate critical risks faster. Built on distributed checkers and real-time event processing, it integrates analytics workflows to reduce investigation overhead and improve response efficiency.

Leela Kumili
on May 04, 2026
DevOps

AWS Announces General Availability of DevOps Agent for Automated Incident Investigation

AWS has announced the general availability of DevOps Agent, a generative AI–powered assistant designed to help developers and operators troubleshoot issues, analyze deployments, and automate operational tasks across AWS environments.

Renato Losio
on Apr 18, 2026
DevOps

Open Source Security Tool Trivy Hit by Supply Chain Attack, Prompting Urgent Industry Response

A major security incident affecting the widely used open source vulnerability scanner Trivy has exposed critical weaknesses in software supply chain security, after maintainers confirmed that a malicious release was briefly distributed to users.

Craig Risi
on Apr 03, 2026
DevOps

From Paging to Postmortem: Google Cloud SREs on Using Gemini CLI for Outage Response

A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. The approach improves reliability in critical infrastructure operations and reduces incident response time by integrating intelligent reasoning directly into terminal-based operational tools.

Renato Losio
on Feb 14, 2026
Development

How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage

In a recent article titled "What came first- the CNAME or the A record?", Cloudflare explains how an unclear RFC specification caused its popular 1.1.1.1 service to break. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.

Renato Losio
on Feb 07, 2026
DevOps

Cloudflare Launches "Code Orange: Fail Small" Resilience Plan after Multiple Global Outages

Cloudflare recently published a detailed resilience initiative called Code Orange: Fail Small, outlining a comprehensive plan to prevent large-scale service disruptions after two major network outages in the past six weeks.

Craig Risi
on Jan 16, 2026
DevOps

How Authress Designed for Resilience and Survived a Major AWS Outage

Identity and authentication services company Authress shared its strategy for staying operational during major cloud infrastructure outages such as the October 2025 AWS outage that disrupted many major services. According to Authress CTO Warren Parad, the company's resilience architecture relies on strategies like multi-region deployment and minimizing reliance on AWS control plane services.

Sergio De Simone
on Dec 28, 2025
DevOps

AWS Debuts “DevOps Agent” to Automate Incident Response and Improve System Reliability

AWS recently announced the public preview of AWS DevOps Agent, a new "frontier agent" that aims to help organizations react more quickly to production incidents, identify root causes, and proactively strengthen system reliability.

Craig Risi
on Dec 17, 2025
DevOps

Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management

A study by ClickHouse found that large language models (LLMs) can't yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents. The study tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues.

Matt Saunders
on Sep 27, 2025
DevOps

PagerDuty's Kafka Outage Silences Alerts for Thousands of Companies

PagerDuty, the incident management platform used by thousands of organisations to alert them to problems on their systems, itself suffered a major outage on 28 August 2025. In a comprehensive outage report, the company detailed the scope of the problem, the customer impact, and how it is working to prevent a recurrence.

Matt Saunders
on Sep 16, 2025
Architecture & Design

Datadog Employs LLMs for Assisting with Writing Accident Postmortems

Datadog combined structured metadata from its incident management app with Slack messages to create LLM-driven functionality that assists engineers in composing incident postmortems. While working on this solution, the company dealt with the challenges of using LLMs outside of interactive dialog systems and of ensuring that high-quality content was produced.

Rafał Gancarz
on Apr 13, 2025
DevOps

How SREs and GenAI Work Together to Decrease eBay's Downtime: an Architect's Insights at KubeCon EU

During his KubeCon EU keynote, Vijay Samuel, Principal MTS Architect at eBay, shared his team’s experience of enhancing incident response capabilities by incorporating ML and LLM building blocks. They realised that GenAI is not a silver bullet but can help engineers work through complex incident investigations by explaining logs, traces, and dashboards.

Olimpiu Pop
on Apr 05, 2025
DevOps

Atlassian Announces Opsgenie Consolidation into JIRA Service Management

Atlassian recently announced that it is consolidating its IT Operations offering and transitioning Opsgenie’s capabilities into JIRA Service Management and Compass.

Aditya Kulkarni
on Mar 29, 2025
DevOps

How Locking, Saturation and CDN Network Issues Brought down Canva

The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.

Renato Losio
on Feb 08, 2025
