A Brief Introduction to Incident.MOOG with Rob Markovich

Recently we caught up with Rob Markovich, CMO of Moogsoft, to talk about the new version of their flagship product, Incident.MOOG.

InfoQ: Can you give a brief overview of what Incident.MOOG does?

a. Moogsoft's flagship product is Incident.MOOG. It automates real-time early warning of application-affecting problems as they unfold, and it streamlines the workflow that cross-domain support teams need to resolve problems collaboratively and quickly. It acts much like a “manager of managers”: it ingests real-time telemetry (e.g. flows of events, logs, alerts, alarms, traps, messages) from tools, infrastructure, and humans, uses machine learning to separate the signal from the noise, and then correlates the result into a manageable set of “situations” that can be remediated faster.

b. Incident.MOOG is used by production engineering and support teams at larger enterprises and service providers, i.e. those that have complex IT environments and need support systems that can scale. This includes IT Operations Management (ITOM), IT Service Management (ITSM), Site Reliability, and DevOps teams.

c. Incident.MOOG solves key business challenges such as: quickly making sense of the high volume of IT events, alerts, and messages that are generated; providing immediate visibility of transient, outage-triggering alarms that other tools can’t see; and providing full situational awareness of applications across any mix of infrastructures: private and public cloud, and virtualized SDN and NFV infrastructures. Incident.MOOG addresses these through:

A single pane of glass to manage IT incidents holistically

Automation that reduces the volume of actionable work items (alerts/tickets)

Facilitation of IT incident-aware collaboration across Dev and Ops teams for major issues

Automatic IT incident awareness for all fault stakeholders

Consumption of any event telemetry, scaling without models and rules

Industrialization of infrastructure fabric change and transformation (e.g. hybrid cloud, SDN, NFV)

Understanding of relationships between alerts, domains and services

Optimization of workflows, reducing manual effort

InfoQ: How does machine learning differ from traditional techniques?

Machine learning has many applications, and is defined as software algorithms that automatically identify meaning in data (such as machine-generated events and alerts, and human messages), in real time and without being explicitly programmed to understand what is normal and abnormal. Working in real time is critical, because the algorithms must process huge data streams as they flow in order to provide early warning as an IT incident starts to unfold, well before customers complain.

Incident.MOOG uses machine learning to process large volumes of IT telemetry (e.g. flows of events, logs, alerts, alarms, traps, messages from tools, infrastructure, and humans), separating the signal from the noise, and then correlating it into a manageable set of “situations” that can be remediated faster.

Here are the key benefits:

a. Some of the leading machine learning tools can reduce IT event alerts by 99 percent, using de-dupe, blacklist, noise reduction and other capabilities that present only “real” alerts to IT Dev and Ops teams, all in real-time.

b. By determining relationships between thousands of alerts across all operational domains, machine-learning algorithms cluster and correlate alerts into a small subset of meaningful situations.

c. With machine learning, all events can be automatically analyzed and scored, allowing IT operations teams to see a list of past situations with significant degrees of similarity, providing access to root causes and resolutions where knowledge was successfully used.
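As a rough illustration of point (c), the sketch below ranks past situations by how many alert signatures they share with the current one, using Jaccard similarity. This is only an assumed stand-in for the idea of "significant degrees of similarity"; the interview does not say how Incident.MOOG scores similarity, and the function names and the `min_score` threshold are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Fraction of alert signatures two situations share (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar(
    current: Set[str],
    past: Dict[str, Set[str]],
    min_score: float = 0.5,  # hypothetical cut-off, not a product setting
) -> List[Tuple[str, float]]:
    """Return past situations at or above min_score, most similar first."""
    scored = [(name, jaccard(current, sigs)) for name, sigs in past.items()]
    kept = [item for item in scored if item[1] >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    past_situations = {
        "storage-outage": {"db_timeout", "disk_full"},
        "network-flap": {"link_down"},
    }
    current_situation = {"db_timeout", "disk_full", "cpu_high"}
    for name, score in rank_similar(current_situation, past_situations):
        print(f"{name}: {score:.2f}")
```

A team could then attach the recorded root cause and resolution of each matching past situation to the new one, which is the benefit described above.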

Note that Splunk ITSI and its new Adaptive Thresholding and Anomaly Detection will generate new and richer IT telemetry, allowing Moogsoft to provide even greater situational awareness. Feeding more intelligent events and alerts to Moogsoft allows operational teams to detect and resolve problems faster. Most of Moogsoft’s customers are also Splunk customers, hence our partnership (Moogsoft is part of SplunkBase) and the bi-directional integration of our products.

InfoQ: When filtering out the noise, how do you know that you aren’t also hiding real problems?

The short answer is that you eliminate the event duplicates. Anomalies by nature are exceptions, not the norm. By watching all the event telemetry over time, you can learn what is normal and what is not. Machine learning makes this scale without the need to make manual changes as the environment changes. Likewise, in today’s open, multi-vendor environments, “de-duplication” needs to be much more than simply deleting exact copies of events; depending on what is generating the events, the semantics can be slightly different, and you need natural language processing to make sense of these nuances and still perform the de-duplication. Finally, your de-duping criteria can’t be too tight, because you don’t want to miss pieces that may point to real problems. The next step, clustering related events into situations, allows you to reduce the actionable workload even further while relaxing the de-dupe criteria.
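To make the point about de-duplication beyond exact copies concrete, here is a minimal sketch that collapses events whose text differs only in case, numbers, or punctuation. It uses a crude regular-expression normalization as a stand-in for the natural language processing mentioned above; it is an assumption for illustration, not how Incident.MOOG works, and the `normalize` and `dedupe` names are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple


def normalize(message: str) -> str:
    """Reduce a message to a comparable form: lowercase, numbers masked."""
    text = message.lower()
    text = re.sub(r"\d+", "<n>", text)        # mask counters, percentages, ids
    text = re.sub(r"[^a-z<>\s]", " ", text)   # drop punctuation
    return " ".join(text.split())             # collapse whitespace


def dedupe(events: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count events per (source, normalized message) key."""
    counts: Dict[Tuple[str, str], int] = {}
    for source, message in events:
        key = (source, normalize(message))
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        ("host1", "Disk usage at 91% on /dev/sda1"),
        ("host1", "disk usage at 93% on /dev/sda1"),
        ("host1", "Link down on eth0"),
    ]
    for (source, text), count in dedupe(sample).items():
        print(f"{source} | {text} | seen {count}x")
```

What `normalize` chooses to mask is the trade-off discussed above: each extra rule changes which events are treated as the same, so the criteria need to be chosen with the later clustering step in mind.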

InfoQ: How do Adaptive Thresholding and Anomaly Detection differ from how products like yours worked in the past?

The key difference with Moogsoft is that we apply a machine-learning, non-deterministic approach to detecting anomalies and we relax the constraints in trying to find single root causes to anomalies. Moogsoft applies a situation-based approach by de-duping the noise and then clustering related events into situations. We apply a variety of mostly non-deterministic, machine learning algorithms to cluster around anomalies; meaning that we don’t try to point to a specific root cause of anomalies, but instead cluster all the related events around the anomaly (almost always capturing the root causes for that anomaly within the situation). This non-deterministic approach, while not 100% perfect all the time, is highly agile and adaptable to constant change going on in the environment, i.e. it has no reliance on static rules or models. This is very different from tools in the past that depended on static rules and models, which assumed that environments didn’t change very much.

InfoQ: What types of things do you look for when deciding if a set of alerts should be clustered together instead of being listed as distinct events?

We have six different machine learning algorithms to identify related events and perform event clustering, and we apply multiple algorithms in parallel to do so. These algorithms are the secret sauce of our product and where we have most of our 15 patents pending. The general variables are:

Time.

Linguistic.

Topology.

Ops-Team-Defined Template.

Moogsoft Machine-Learned Feedback.

Deterministic Cookbook-Based (optional).

Descriptions of these variables can be found here:
https://www.moogsoft.com/product/machine-learning/
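As a hedged illustration of the first variable only (time), the sketch below groups events into the same cluster when each one arrives within a fixed gap of the previous one. This is a simple gap-based grouping chosen for clarity; it is not one of Moogsoft's algorithms, and `cluster_by_time` and `max_gap` are hypothetical names.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, alert text)


def cluster_by_time(events: List[Event], max_gap: float = 60.0) -> List[List[Event]]:
    """Group events whose timestamps are at most max_gap seconds apart."""
    clusters: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Compare with the last event of the most recent cluster.
        if clusters and event[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return clusters


if __name__ == "__main__":
    alerts = [
        (500.0, "login failures"),
        (0.0, "db timeout"),
        (30.0, "app latency high"),
        (50.0, "queue backlog"),
    ]
    for number, cluster in enumerate(cluster_by_time(alerts), start=1):
        print(number, [text for _, text in cluster])
```

Time alone would group unrelated events that happen to coincide, which suggests why the answer above describes applying several variables and algorithms in parallel.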
