Incident Response Content on InfoQ

News

DevOps

Cloudflare Experiences Major Incident in November, Resulting in Log Loss

Cloudflare has recently confirmed that on November 14th an incident affecting Cloudflare Logs caused the loss of 55% of logs over a 3.5-hour period. The incident impacted most customers using the service: a misconfiguration triggered a cascading series of system failures and exposed weaknesses in handling unexpected spikes in demand.

Renato Losio
on Dec 07, 2024
DevOps

Grafana Frees up Engineers to Fix Problems with Improved Incident Management

Grafana Labs, a leading provider of observability solutions, has unveiled significant enhancements to its Incident Response and Management (IRM) platform. These changes help teams manage and respond to incidents more effectively by streamlining incident management processes and reducing response times.

Matt Saunders
on May 15, 2024
DevOps

Grafana Introduces ML Tool Sift to Improve Incident Response

Grafana Labs has introduced "Sift," a feature for Grafana Cloud designed to enhance incident response and management (IRM) by automating system checks and expediting issue resolution. Sift automates various aspects of incident investigation and provides insights into potential issues within Kubernetes environments, helping engineers focus on resolving incidents.

Matt Saunders
on Sep 28, 2023
Culture & Methods

How Resilience Can Help to Get Better at Resolving Incidents

Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.

Ben Linders
on Jun 15, 2023
DevOps

Can MTTR Be an Effective Business Metric?

In a recent blog post, Sidu Ponnappa argued that MTTR should be a key business metric for measuring engineering efficiency. Ponnappa notes that tracking uptime alone provides no targets for improvement. By contrast, in a recent talk at SREcon22, Courtney Nash, senior research analyst at Verica, shared that MTTR can misrepresent what is actually happening during incidents and can be an unreliable metric.

Matt Campbell
on Oct 26, 2022
DevOps

NCC Group Dissect Aims to Scale Incident Response to Thousands of Systems

Developed at Fox-IT, part of NCC Group, Dissect is a recently open-sourced toolset that aims to enable incident response on thousands of systems at a time by analyzing large volumes of forensic data at high speed, says Fox-IT.

Sergio De Simone
on Oct 14, 2022
DevOps

Standardising Observability and Incident Management at Miro

The Miro Data Engineering team recently discussed how they systematised alerts and incident management. Along with standardising the observability metrics and alert definitions, the team started using OpsGenie for incident management. This helped the team address scaling challenges such as a standard format for metric labelling, alert definitions, and on-call duties.

Aditya Kulkarni
on Sep 05, 2022
DevOps

Lightstep Adds Incident Response to Their Observability Platform

Lightstep has announced the addition of incident response management to their observability platform. The general availability of Lightstep Incident Response provides integrations with common collaboration tools, rotation scheduling, escalation policies, APIs, and a CLI.

Matt Campbell
on Mar 23, 2022
DevOps

Grafana Cloud Adds Incident and On-Call Management Solutions

Grafana has announced the addition of incident management and on-call support to their Grafana Cloud offering. Grafana Incident, currently in preview, generates meeting spaces, integrates with Slack, and constructs incident timelines with information pulled from Grafana dashboards. Grafana OnCall provides on-call rotation scheduling and notification from connected monitoring systems.

Matt Campbell
on Feb 26, 2022
Cloud

Google Cloud Embraces Security Orchestration through Siemplify Acquisition

Google has announced the acquisition of security orchestration, automation, and response (SOAR) provider Siemplify, with the aim of integrating SOAR capabilities into its own Google Chronicle security solution.

Sergio De Simone
on Jan 06, 2022
Cloud

Incorrect IAM Policy Raised Questions about AWS Access to S3 Data

An unexpected change in the policy used by AWS Support raised concerns about access to customers' S3 data. The cloud provider reverted the change, stated that the permissions were not and could not be used, and published a security bulletin. Security experts suggest steps to detect and prevent similar issues in the future.

Renato Losio
on Jan 06, 2022
Cloud

AWS US-EAST-1 Outage: Postmortem and Lessons Learned

On December 7th AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, which sparked community discussion about redundancy on AWS and multi-region approaches.

Renato Losio
on Dec 18, 2021
Culture & Methods

Why the Most Resilient Companies Want More Incidents

According to John Egan, the incident management process is meant to be a cycle that covers not just the response, but also accounting for the root cause and updating internal processes and practices across the industry. He advises lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems.

Ben Linders
on Jun 10, 2021
Cloud

Amazon Introduces Incident Manager for Automated Response Plans

AWS recently introduced Incident Manager, a new capability of AWS Systems Manager that helps customers prepare for and respond to application and infrastructure incidents.

Renato Losio
on May 21, 2021
DevOps

AWS Releases Health Aware Providing Automated Health Alerts for Accounts

AWS recently announced the release of AWS Health Aware (AHA), an incident management and communications framework. AHA is an automated notification tool that sends AWS Health Alerts to a variety of endpoints. AHA is able to integrate with AWS Organizations to provide aggregated alerts across all accounts within the organization.

Matt Campbell
on Mar 28, 2021
