Incident Response Content on InfoQ
-
Datadog Employs LLMs for Assisting with Writing Accident Postmortems
Datadog combined structured metadata from its incident management app with Slack messages to create an LLM-driven feature that helps engineers compose incident postmortems. While working on this solution, the company had to address the challenges of using LLMs outside interactive dialog systems and of ensuring that the generated content was of high quality.
-
How SREs and GenAI Work Together to Decrease eBay's Downtime: an Architect's Insights at KubeCon EU
During his KubeCon EU keynote, Vijay Samuel, Principal MTS Architect at eBay, shared his team’s experience of enhancing incident response capabilities by incorporating ML and LLM building blocks. They realised that GenAI is not a silver bullet but can help engineers through complex incident investigations by explaining logs, traces, and dashboards.
-
Atlassian Announces Opsgenie Consolidation into JIRA Service Management
Atlassian recently announced that it is consolidating its IT Operations offering and transitioning Opsgenie’s capabilities into JIRA Service Management and Compass.
-
How Locking, Saturation and CDN Network Issues Brought down Canva
The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.
-
Cloudflare Experiences Major Incident in November, Resulting in Log Loss
Cloudflare has confirmed that on November 14th it experienced an incident affecting Cloudflare Logs, during which 55% of logs were lost over a 3.5-hour period. The incident impacted most customers using the service: a misconfiguration triggered a cascading series of system failures and exposed weaknesses in handling unexpected spikes in demand.
-
Grafana Frees up Engineers to Fix Problems with Improved Incident Management
Grafana Labs, a leading provider of observability solutions, has unveiled significant enhancements to its Incident Response and Management (IRM) platform. These changes help teams manage and respond to incidents more effectively by streamlining incident management processes and reducing response times.
-
Grafana Introduces ML Tool Sift to Improve Incident Response
Grafana Labs has introduced "Sift," a feature for Grafana Cloud designed to enhance incident response management (IRM) by automating system checks and expediting issue resolution. Sift automates various aspects of incident investigation and provides insights into potential issues within Kubernetes environments, helping engineers focus on resolving incidents.
-
How Resilience Can Help to Get Better at Resolving Incidents
Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.
-
Can MTTR Be an Effective Business Metric?
In a recent blog post, Sidu Ponnappa argued that MTTR should be a key business metric for measuring engineering efficiency. Ponnappa notes that tracking uptime alone provides no goals to target for improvements. In a talk at SREcon22, however, Courtney Nash, senior research analyst at Verica, shared that MTTR can misrepresent what is actually happening during incidents and can be an unreliable metric.
-
NCC Group Dissect Aims to Scale Incident Response to Thousands of Systems
Developed at Fox-IT, part of NCC Group, Dissect is a recently open-sourced toolset that aims to enable incident response on thousands of systems at a time by analyzing large volumes of forensic data at high speed, says Fox-IT.
-
Standardising Observability and Incident Management at Miro
The Miro Data Engineering team recently discussed how they systematised alerts and incident management. Along with standardising the observability metrics and alert definitions, the team started using Opsgenie for incident management. This helped the team address scaling challenges such as standardising formats for metric labelling, alert definitions, and on-call duties.
-
Lightstep Adds Incident Response to Their Observability Platform
Lightstep has announced the addition of incident response management to their observability platform. The general availability of Lightstep Incident Response provides integrations with common collaboration tools, rotation scheduling, escalation policies, APIs, and a CLI.
-
Grafana Cloud Adds Incident and On-Call Management Solutions
Grafana has announced the addition of incident management and on-call support to their Grafana Cloud offering. Grafana Incident, currently in preview, generates meeting spaces, integrates with Slack, and constructs incident timelines with information pulled from Grafana dashboards. Grafana OnCall provides on-call rotation scheduling and notification from connected monitoring systems.
-
Google Cloud Embraces Security Orchestration through Siemplify Acquisition
Google has announced the acquisition of security orchestration, automation, and response (SOAR) provider Siemplify, with the aim of integrating SOAR capabilities into its own Google Chronicle security solution.
-
Incorrect IAM Policy Raised Questions about AWS Access to S3 Data
An unexpected change in the policy used by AWS Support raised concerns about access to customers' S3 data. The cloud provider reverted the change, stated that the permissions were not and could not be used, and published a security bulletin. Security experts suggest steps to detect and prevent similar issues in the future.