Post-Mortems Content on InfoQ

News

Development

How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage

In a recent article titled "What came first: the CNAME or the A record?", Cloudflare explains how an unclear RFC specification caused its popular 1.1.1.1 service to break. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.

Renato Losio
on Feb 07, 2026
Cloud

Cloudflare Global Outage Traced to Internal Database Change

Cloudflare’s recent global outage, linked to a database update, caused widespread disruption and highlighted the risks of single-vendor reliance. While service was restored, the incident sparked discussions on the importance of multi-vendor strategies in tech. Cloudflare's CEO vowed to enhance system resilience, emphasizing that outages can impact even the largest providers.

Steef-Jan Wiggers
on Nov 22, 2025
Architecture & Design

Datadog Employs LLMs for Assisting with Writing Accident Postmortems

Datadog combined structured metadata from its incident management app with Slack messages to create an LLM-driven feature that assists engineers in composing incident postmortems. While working on this solution, the company dealt with the challenges of using LLMs outside interactive dialog systems and of ensuring that the generated content was of high quality.

Rafal Gancarz
on Apr 13, 2025
Culture & Methods

How External IT Providers Can Adopt DevOps Practices

IT suppliers can follow the “you build it, you run it” mantra by working in small batches, using an experimental approach to product development, and validating small product increments in production. To work collaboratively, the supplier has to find out what the client’s goal is and adopt it as its own.

Ben Linders
on Aug 19, 2021
Architecture & Design

PayPal Engineering Teams Implement Premortem Analysis

In a recent blog post, the PayPal engineering team described how it uses premortem analysis as part of its regular software design process. The team adopted a customized version of premortem analysis last year, which greatly benefited PayPal engineering. Premortem is a strategy in which a team imagines that a project has failed and then works backward to determine what could have led to this failure.

Eran Stiller
on Jul 22, 2021
Culture & Methods

Why the Most Resilient Companies Want More Incidents

According to John Egan, the incident management process is meant to be a cycle of not just the response, but also the account of root cause and the updating of internal processes and practices across the industry. Lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems is what he advises.

Ben Linders
on Jun 10, 2021
DevOps

How to Embrace “You Build It, You Run It” with Paul Hammant at QCon London

Paul Hammant talked at QCon London about having developers responsible for the first line of support in production, as the saying goes, “if you build it, you run it.” Hammant recommends following this practice only if there are proper support levels and escalation policies defined. As a result, companies could reduce the chances of burnout or staff quitting.

Christian Melendez
on Mar 05, 2020
DevOps

Blameless Post-Mortems and On-Call Gamification at 1st DevOpsDays Portugal (Day 2)

Ten years after the first DevOpsDays conference in Ghent, the evolution of DevOps and organizations trying to adopt it was at the forefront of the first DevOpsDays conference in Portugal. On the second day, a mix of local and international speakers covered topics such as learning from incidents without blame, gamifying on-call, modern pipelines, and more.

Manuel Pais
on Jul 14, 2019
Culture & Methods

Atlassian Announces Solutions for Incident Management

Atlassian announced on September 4 that they have launched a new product called Jira Ops and that they will acquire OpsGenie. Organizations can use Jira Ops for resolving incidents and doing post-mortems to learn from them. OpsGenie adds prompt and reliable alerting to Jira Ops.

Ben Linders
on Sep 20, 2018
Culture & Methods

Psychological Safety in Post-Mortems

Emotions often come to the fore when there is an incident; psychological safety in blameless post-mortems is essential for the learning process to happen. The post-mortem session must be fairly moderated, preferably by an outsider, giving everyone a turn to speak without criticism. Don’t start the analysis of the incident before there is a clear and common understanding of what actually happened.

Ben Linders
on Sep 06, 2018
DevOps

How ING Bank Does SRE

Janna Brummel and Robin van Zijll, from ING Netherlands, talked at the Velocity conference in London about how poor availability of their internet banking systems prompted the bank to implement an SRE culture. A centralized SRE team was set up in the Netherlands to provide tooling, consulting and education on reliability to product teams (known as BizDevOps squads internally).

Manuel Pais
on Dec 30, 2017
DevOps

Post-Mortems Trends and Behaviors

Eric Siegler presented his findings at Velocity from analyzing data from 1000 post-mortems run by 125 different organizations over a six-month period. Main trends include the prevalence of blameless post-mortems; the fact that only 1 in 100 post-mortems refers to "human error"; and that analyzing the lifecycle of incidents can provide useful insights on weaknesses in the incident response process.

Manuel Pais
on Nov 29, 2017
DevOps

John Willis Talks DevOps Superpatterns at DOES17 London

John Willis, co-author of The DevOps Handbook, spoke about the emerging DevOps Superpattern at the 2017 DevOps Enterprise Summit, held June 5th and 6th in London.

Helen Beal
on Jun 26, 2017

Handling Incidents and Outages

David Mytton, CEO at Server Density, shared with the devopsdays Amsterdam 2015 crowd how the company handles incidents and outages. The process is grounded in a key set of principles: frequent public updates; exhaustive logging of the response activities; team effort and effective escalation. Server Density draws a lot of inspiration from the aviation industry, renowned for its safety procedures.

João Miranda
on Jun 29, 2015
