
OpsRamp Introduces AI-Driven Suggestions for Incident Remediation

OpsRamp, a SaaS platform for hybrid infrastructure discovery, monitoring, management, and automation, has launched OpsQ Recommend Mode, a capability for incident remediation. OpsQ Recommend Mode provides predictive analytics to digital operations teams with the goal of reducing mean time to resolution (MTTR).

OpsQ is OpsRamp's event management, alert correlation, and remediation engine. New artificial intelligence for IT operations (AIOps) capabilities help teams ingest, analyse and extract insights for event and incident management. In the new Recommend Mode, the OpsQ Bot auto-suggests actions alongside alert escalation policies. Other new AIOps capabilities in the release include visualisation of alert similarity patterns and new alert stats widgets that provide transparency into machine learning-driven decisions.

The alert seasonality patterns feature in OpsQ can learn which alerts in an environment recur at a predictable frequency (a seasonality pattern) and automatically suppress them. Teams can visualise the seasonality patterns that OpsQ has learned, which helps them understand the auto-suppress decisions that OpsQ makes and trace recurring alert patterns to underlying activity. The new Alert Stats widget shows the total number of raw events, correlated alerts, inference alerts, auto-ticketed alerts, and auto-suppressed alerts handled by the OpsQ event management engine. This widget shows how OpsQ reduces event volume at each stage so that IT teams can build confidence in machine learning-based techniques for alert optimisation.
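To make the idea of a seasonality pattern concrete, here is a minimal Python sketch. It is not OpsRamp's algorithm, which the article does not describe. The function names, the hour-of-week bucketing, the `min_weeks` threshold and the sample alert key are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def hour_of_week(ts: datetime) -> int:
    """Map a timestamp to a bucket from 0 (Monday 00:00) to 167 (Sunday 23:00)."""
    return ts.weekday() * 24 + ts.hour

def learn_seasonal_hours(history, min_weeks=3):
    """Return {alert_key: set of hour-of-week buckets} for buckets in which
    the alert fired in at least `min_weeks` distinct calendar weeks.

    `history` is an iterable of (alert_key, datetime) pairs.
    """
    weeks_seen = defaultdict(set)  # (key, bucket) -> {(iso_year, iso_week), ...}
    for key, ts in history:
        iso = ts.isocalendar()
        weeks_seen[(key, hour_of_week(ts))].add((iso[0], iso[1]))
    seasonal = defaultdict(set)
    for (key, bucket), weeks in weeks_seen.items():
        if len(weeks) >= min_weeks:
            seasonal[key].add(bucket)
    return dict(seasonal)

def should_suppress(seasonal, key, ts):
    """True if this alert falls in a bucket learned as seasonal for its key."""
    return hour_of_week(ts) in seasonal.get(key, set())

# Hypothetical data: a weekly backup raises a disk I/O alert every Sunday at 02:00.
first = datetime(2020, 1, 5, 2, 15)  # 5 January 2020 is a Sunday
history = [("vm-42:disk_io_high", first + timedelta(weeks=w)) for w in range(4)]
history.append(("vm-42:disk_io_high", datetime(2020, 1, 8, 14, 0)))  # one-off, Wednesday

seasonal = learn_seasonal_hours(history)
print(should_suppress(seasonal, "vm-42:disk_io_high", datetime(2020, 2, 2, 2, 40)))  # True
print(should_suppress(seasonal, "vm-42:disk_io_high", datetime(2020, 2, 5, 14, 5)))  # False
```

The Sunday 02:00 alert recurs in four separate weeks, so a later alert in the same bucket is suppressed. The Wednesday alert has been seen only once and is let through.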

The release drives full-stack visibility for multi-cloud workloads with nineteen new cloud monitoring integrations (added to the existing one hundred and twenty), including: AWS Transit Gateway, AppSync, CloudSearch, and DocumentDB; Azure Application Insights, Traffic Manager, Virtual Network, Route Table, Virtual Machine Scale Sets, SQL Elastic Pool, and Service Bus; and GCP Cloud Bigtable, Cloud Composer, Cloud Filestore, Firebase, Cloud Memorystore for Redis, Cloud Run, Cloud TPU, and Cloud Tasks.

In addition to AWS cloud topology maps, OpsRamp now offers topology discovery and mapping for Azure and GCP. Teams can apply cloud topology maps to analyse the impact of changes in their multi-cloud environments. Cloud topology is also applied in OpsQ's event correlation engine to increase the accuracy of machine learning models.

OpsRamp offers agentless discovery for Linux and VMware compute, network, and storage resources, and the new release introduces agentless discovery and monitoring for Windows compute resources. OpsRamp's enhanced synthetic monitoring provides insights and analysis for troubleshooting multi-step transactions. Application owners can break down each synthetic transaction and gain visibility into the performance of each step in a web transaction. InfoQ spoke with Michael Fisher, product manager at OpsRamp, about the new release.

InfoQ: What are some examples of typical seasonality patterns teams experience?

Michael Fisher: Seasonality patterns are frequently rooted in human routines. These routines are generally expressed in daily, weekly, monthly or yearly patterns. For our customers, the most common patterns are daily or weekly. For example, they might see high spikes in network traffic when their end users log in to the network in the morning, or increased disk reads and writes when they are performing their weekly backup jobs on their virtual machines. OpsRamp has the ability to learn these seasonal patterns and suppress alerts that occur seasonally, thus reducing false-positive alerts and alert fatigue.

InfoQ: How does OpsRamp provide insights to Kubernetes, containers and microservices?

Fisher: OpsRamp has a variety of mechanisms to provide insight into Kubernetes environments. Native Kubernetes instrumentation allows teams to gain insight into the overall health of the cluster, down to the individual container runtimes. This monitoring visibility is coupled with our Kubernetes Topology, which maps the cluster to the nodes and the nodes to the containers. This topological context is fed into our machine learning models to enhance correlation and alert deduplication, which aids Site Reliability Engineers (SREs) when troubleshooting Kubernetes-related events.

InfoQ: How does OpsRamp handle security-related incidents?

Fisher: OpsRamp provides the capability to monitor common firewalls, or other security-centric hardware, for overall health and performance. On top of this, OpsRamp has a generic webhook API framework which can be leveraged to ingest security events from various vendors, which are fed through OpsRamp's correlation models for further analysis.

InfoQ: Does OpsRamp have any features that help teams perform blameless retrospectives post incident?

Fisher: OpsRamp's native help desk, dashboarding and reporting allow teams to track the lifecycle of an incident as it moves from incident creation to incident resolution.

InfoQ: If a team has a "we build it, we own it" mentality, how might this change the way in which OpsRamp is used?

Fisher: As an extensible platform, teams that seek to build their own custom monitoring, or integrations, are encouraged to do so. The strength in the OpsRamp platform is not only what we provide out of the box, but what we enable teams to do with the tools that we provide.

InfoQ: Can teams extract business metrics relating to web-based customer journeys from the tool?

Fisher: OpsRamp provides various synthetic offerings, from monitoring the round trip time (RTT) between an SMTP server and OpsRamp's globally located data centres, to creating a synthetic transaction modelling a user's flow within your application. These synthetic options provide businesses with visibility into their critical applications whilst helping them stay ahead of outages that may affect their end users' experience.
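As a rough illustration of breaking a multi-step synthetic transaction down into per-step timings, the Python sketch below runs a list of named steps in order and records how long each one takes. It does not use any OpsRamp API. The `run_transaction` helper and the step names are hypothetical, and the steps are stand-in functions rather than real web requests.

```python
import time

def run_transaction(steps, clock=time.perf_counter):
    """Run (name, callable) steps in order, timing each one.

    Stops at the first step that raises, so later steps are not reported.
    """
    results = []
    for name, action in steps:
        started = clock()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        results.append({"step": name, "ok": ok, "seconds": clock() - started})
        if not ok:
            break
    return results

def failing_checkout():
    # Stand-in for a step that breaks, e.g. a payment page returning an error.
    raise RuntimeError("simulated checkout failure")

# Hypothetical user flow; sleeps stand in for real page loads.
steps = [
    ("load home page", lambda: time.sleep(0.01)),
    ("log in", lambda: time.sleep(0.02)),
    ("check out", failing_checkout),
    ("log out", lambda: None),  # never reached because the previous step fails
]

for r in run_transaction(steps):
    print(f"{r['step']:<15} ok={r['ok']} {r['seconds']:.3f}s")
```

Reporting each step separately shows which part of the flow is slow or failing, rather than only that the transaction as a whole did not complete.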

InfoQ: What is a machine learning model?

Fisher: For OpsRamp, a machine learning model represents OpsRamp's ability to ingest data and then interpret it using various features, such as topology relationships, resource attributes, time, etc. OpsRamp has several different models, each with a different intended purpose and degree of automation. For example, OpsRamp's new Recommend Mode provides businesses with the ability to stay informed of the machine learning model's action ("analyst in the loop") and be the final decision maker on whether, for example, an alert should be suppressed or turned into an incident. Recommend provides an opinion on how to handle alerts, but leaves it to the operator to "push the button", providing a blend of automated response and operator control.
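The "analyst in the loop" idea can be sketched in a few lines of Python. This is an illustration only, not OpsRamp's implementation: the `Recommendation` class, the `resolve` function, the mode names and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Recommendation:
    alert_id: str
    action: str        # e.g. "suppress" or "create_incident"
    confidence: float  # hypothetical model score between 0 and 1

def resolve(rec: Recommendation, mode: str,
            operator_approves: Callable[[Recommendation], bool]) -> str:
    """Return the action that is actually carried out for one alert."""
    if mode == "auto":
        # Fully automated: the model's suggestion is applied directly.
        return rec.action
    if mode == "recommend":
        # The model only suggests; the operator makes the final call.
        return rec.action if operator_approves(rec) else "leave_open"
    raise ValueError(f"unknown mode: {mode}")

rec = Recommendation("alert-1001", "suppress", 0.93)
print(resolve(rec, "auto", lambda r: False))       # suppress
print(resolve(rec, "recommend", lambda r: False))  # leave_open
print(resolve(rec, "recommend", lambda r: True))   # suppress
```

In recommend mode the same suggestion is produced, but nothing happens to the alert until the operator approves it.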

InfoQ: How does OpsRamp perform topology discovery and visualisation?

Fisher: OpsRamp's topology discovery spans from L2 to L7. At the bottom of the stack, OpsRamp leverages various discovery protocols (such as CDP, LLDP, OSPF, etc.) to map the relationships between infrastructure components. Moving up the stack, OpsRamp is also able to model business applications, public cloud services and Kubernetes workloads. In addition to the ability to visualise these services, OpsRamp's machine learning models are able to train on the discovered relationship data to more accurately correlate and deduplicate alerts, which can reduce alert fatigue and MTTR for operators.
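One simple way to picture topology-aware correlation is to group alerts whose resources are directly linked in the discovered topology. The Python sketch below does this with a breadth-first search over connected components. It is a deliberately simplified stand-in for the machine learning models Fisher describes, and the `correlate` function, resource names and alert messages are invented for the example.

```python
from collections import defaultdict, deque

def correlate(alerts, edges):
    """Group alerting resources that are linked to each other in the topology.

    `alerts` is a list of (resource, message) pairs; `edges` is a list of
    (resource_a, resource_b) topology links. Only links between two alerting
    resources are followed, which is a simplifying assumption.
    """
    alerting = {resource for resource, _ in alerts}
    neighbours = defaultdict(set)
    for a, b in edges:
        if a in alerting and b in alerting:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, groups = set(), []
    for resource in sorted(alerting):
        if resource in seen:
            continue
        component, queue = [], deque([resource])
        seen.add(resource)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(neighbours[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(component))
    return groups

# Hypothetical topology: switch-1 feeds host-a and host-b; vm-1 runs on host-a.
edges = [("switch-1", "host-a"), ("switch-1", "host-b"),
         ("host-a", "vm-1"), ("switch-2", "host-c")]
alerts = [("switch-1", "link down"), ("host-a", "unreachable"),
          ("vm-1", "unreachable"), ("host-c", "cpu high")]

print(correlate(alerts, edges))
# [['host-a', 'switch-1', 'vm-1'], ['host-c']]
```

Four raw alerts collapse into two groups: one that likely shares a root cause at `switch-1`, and one unrelated alert on `host-c`.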

Learn more about the OpsRamp Winter 2020 release here.
