Incident Response Content on InfoQ
-
Can MTTR Be an Effective Business Metric?
In a recent blog post, Sidu Ponnappa argued that MTTR should be a key business metric for measuring engineering efficiency. Ponnappa notes that tracking uptime alone provides no targets for improvement. In a recent talk at SREcon22, Courtney Nash, senior research analyst at Verica, countered that MTTR can misrepresent what actually happens during incidents and can be an unreliable metric.
-
NCC Group Dissect Aims to Scale Incident Response to Thousands of Systems
Developed at Fox-IT, part of NCC Group, Dissect is a recently open-sourced toolset. According to Fox-IT, it aims to enable incident response on thousands of systems at a time by analyzing large volumes of forensic data at high speed.
-
Standardising Observability and Incident Management at Miro
The Miro Data Engineering team recently discussed how they systematised alerts and incident management. Along with standardising observability metrics and alert definitions, the team started using OpsGenie for incident management. This helped the team address scaling challenges around standard formats for metric labelling, alert definitions, and on-call duties.
-
Lightstep Adds Incident Response to Their Observability Platform
Lightstep has announced the addition of incident response management to their observability platform. The general availability of Lightstep Incident Response provides integrations with common collaboration tools, rotation scheduling, escalation policies, APIs, and a CLI.
-
Grafana Cloud Adds Incident and On-Call Management Solutions
Grafana has announced the addition of incident management and on-call support to their Grafana Cloud offering. Grafana Incident, currently in preview, generates meeting spaces, integrates with Slack, and constructs incident timelines with information pulled from Grafana dashboards. Grafana OnCall provides on-call rotation scheduling and notification from connected monitoring systems.
-
Google Cloud Embraces Security Orchestration through Siemplify Acquisition
Google has announced the acquisition of security orchestration, automation, and response (SOAR) provider Siemplify, with the aim of integrating SOAR capabilities into its own Google Chronicle security solution.
-
Incorrect IAM Policy Raised Questions about AWS Access to S3 Data
An unexpected change in the policy used by AWS Support raised concerns about access to customers' S3 data. The cloud provider reverted the change, stated that the permissions were not and could not be used, and published a security bulletin. Security experts suggest steps to detect and prevent similar issues in the future.
-
AWS US-EAST-1 Outage: Postmortem and Lessons Learned
On December 7th, AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, which sparked community discussion about redundancy on AWS and multi-region approaches.
-
Why the Most Resilient Companies Want More Incidents
According to John Egan, incident management is meant to be a cycle that covers not just the response but also accounting for the root cause and updating internal processes and practices across the industry. He advises lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems.
-
Amazon Introduces Incident Manager for Automated Response Plans
AWS recently introduced Incident Manager, a new capability of AWS Systems Manager that helps customers prepare for and respond to application and infrastructure incidents.
-
AWS Releases Health Aware Providing Automated Health Alerts for Accounts
AWS recently announced the release of AWS Health Aware (AHA), an incident management and communications framework. AHA is an automated notification tool that sends AWS Health Alerts to a variety of endpoints. AHA is able to integrate with AWS Organizations to provide aggregated alerts across all accounts within the organization.
-
PagerDuty Adds AWS DevOps Guru and Microsoft Teams Integrations
PagerDuty has released a number of updates and enhancements to their incident response platform. These include new integrations with Amazon DevOps Guru, AWS Control Tower, and Microsoft Teams. Other changes include improvements to mapping failures back to changes, automatic triggers, and content-based alert grouping.
-
Netflix Presents Telltale, an Application Health Monitoring Tool
The Netflix Engineering team recently blogged about Telltale, a monitoring and alerting tool that utilizes a variety of data sources to learn the typical health of an application. Telltale shows only the relevant data from the application. It also shows information about important events, such as nearby deployments and regional traffic evacuations.
-
GitHub Availability Report: Monthly Report Examining Incidents
Going beyond publishing postmortems of major incidents, GitHub recently introduced the Availability Report. This report will not only describe incidents but also highlight what is being done to advance GitHub's engineering systems and practices.
-
Cloudflare’s 27 Minutes Outage Explained
Cloudflare recently suffered a partial outage that lasted for 27 minutes. The outage caused a 50% drop in traffic across the network.