AIOps Strategies for Augmenting Your IT Operations

Key Takeaways

  • It’s better to identify specific near-term goals that you can pursue, step by step, as you carefully build up your AIOps capability. 
  • Use domain-centric AIOps features built into a monitoring tool for a one-off, specific use case, and deploy a domain-agnostic stand-alone solution that can straddle multiple use cases over time.
  • Choose AIOps tools that provide built-in, scalable enrichment, and you’ll drive intelligence throughout your operations.
  • Apart from driving down the workload of your IT Ops teams and increasing the speed of incident or outage resolution, automating your process frees up your operations teams to focus on high-value, challenging work that both drives innovation for your business and improves their productivity.
  • By setting KPIs, you can track your progress and identify where the delays and performance gaps are.

If you’re part of a modernizing enterprise, you are probably looking to AIOps to enhance your IT operations by cutting costs, improving performance, preventing IT incidents, and increasing the agility of your business. But with a wide range of AIOps options on the market, how do you make sure you’re going down the right path? And once you have decided on an AIOps approach, how do you make the most of it?

We’d like to share five strategies that will help you make sure you build the right AIOps plan for your business. But first, let’s take a moment to define what the term ‘AIOps’ actually means.

Coined by Gartner in 2016, the term ‘AIOps’ refers to the combining of big data, AI, and machine learning to automate and improve IT operations processes. Back then, this very broad definition led to some confusion, with different IT vendors characterizing AIOps differently, depending on what they were actually offering.

As the years went by, leading vendors defined the reality of what AIOps actually was through their products, which aimed to answer the challenges their customers were facing. As a result, AIOps is now more clearly understood and its definition is more focused, with practical applications and trends.

Whilst AIOps platforms enhance a broad range of IT practices and functions, including Infrastructure and Operations (I&O), DevOps, SRE, and service management, it is specifically in the I&O domain that the real benefits lie: anomaly detection, diagnostics, event correlation, and root cause analysis (RCA), all of which work to improve monitoring, service management, and automation tasks across the board.

Now that we have defined what we mean by AIOps, here are the five strategies we promised you earlier.

Be practical, not aspirational

In most things, it’s great to think big, the idea being that if you shoot for the stars, even if you fall short, you’re still getting far. When it comes to implementing an AIOps solution though, biting off more than you can chew by taking too general an approach can delay your project - often by months or even years. 

There may be a top-down edict by the company execs to push forward and implement AI and ML throughout the organization, with no clear definition of what specific needs are to be addressed. But, in fact, it’s better to identify specific near-term goals, not visionary ones, that you can pursue, step by step, as you carefully build up your AIOps capability. 

Let’s take your alarm-to-ticket flow, for example. Here, it’s good to take a gradual approach to adopting your AIOps platform, keeping your existing alarm-to-ticket flow infrastructure in place, while, in parallel, implementing one new AIOps capability at a time. So, you could start by feeding some of your monitoring alerts into an AIOps event correlation platform, and then feed the output back into your ticketing system. This provides a baseline that enables you to compare results before going into production. Once you are satisfied, you can incrementally add more of your tools into the AIOps platform, until you have fully integrated your monitoring and observability layer. Only then should you start looking into adding additional AIOps capabilities, such as root cause changes, remediation automation and more. 
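The parallel-run comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alert fields (`host`, `ts`), the function names, and the five-minute window are all hypothetical, and the simple same-host time-window grouping stands in for whatever correlation logic a real AIOps platform applies. The point is the baseline: run the same alerts through both flows and compare ticket counts before changing anything in production.

```python
def legacy_flow(alerts):
    """Existing flow (assumed): every alert becomes its own ticket."""
    return [[alert] for alert in alerts]


def pilot_flow(alerts, window_seconds=300):
    """Pilot flow (stand-in for an AIOps correlation step).

    Alerts from the same host are merged into one ticket when they arrive
    within `window_seconds` of the first alert in that ticket.
    """
    tickets = []
    latest_ticket = {}  # host -> most recent ticket (list of alerts) for that host
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        current = latest_ticket.get(alert["host"])
        if current and alert["ts"] - current[0]["ts"] <= window_seconds:
            current.append(alert)
        else:
            current = [alert]
            latest_ticket[alert["host"]] = current
            tickets.append(current)
    return tickets


def compare_flows(alerts):
    """Return (tickets_before, tickets_after, fractional_reduction)."""
    before = len(legacy_flow(alerts))
    after = len(pilot_flow(alerts))
    reduction = 0.0 if before == 0 else 1 - after / before
    return before, after, reduction


# Illustrative alerts only; `ts` is seconds since the start of the sample.
sample_alerts = [
    {"host": "router-a", "ts": 0},
    {"host": "router-a", "ts": 60},
    {"host": "router-b", "ts": 30},
    {"host": "router-a", "ts": 120},
    {"host": "router-a", "ts": 1000},
]

before, after, reduction = compare_flows(sample_alerts)
print(f"tickets before: {before}, after: {after}, reduction: {reduction:.0%}")
```

Because the legacy flow is left untouched, the comparison can be repeated each time another monitoring source is added to the pilot.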

In addition to making sure that your AIOps platform has proven itself before you begin to fully rely on it, this step-by-step approach gives your team the chance to accumulate the skills they need over time, rather than having to learn everything at once, which can be overwhelming and maybe even counter-productive.

Domain-centric or domain-agnostic? Choose wisely.

In its recent Market Guide for AIOps [1], Gartner identifies two categories of AIOps solutions: domain-centric and domain-agnostic. Domain-centric AIOps capabilities are added on top of data that is specific to a domain or practice, such as network, application, infrastructure, or cloud monitoring. In contrast, the best domain-agnostic AIOps solutions work across domains to pull in data from multiple sources and IT technologies from multiple vendors, along with data describing changes happening in your environment, and then combine and correlate it all to derive insights.

As discussed in a recent webinar on AIOps, a good strategy here is to use domain-centric AIOps features built into a monitoring tool for a one-off, specific use case, and deploy a domain-agnostic stand-alone solution that can straddle multiple use cases over time. For example, if you are monitoring the signal quality in optical infrastructure, a domain-centric AIOps tool may help you understand a connection loss. But if you are in charge of maintaining high-quality video calls running on top of this infrastructure, a domain-agnostic AIOps tool should be your choice: a drop in service level can have many causes, spanning the different domains and technologies that comprise the service, and you need to tie them all together to understand the root cause.

It should be noted that, in general, Gartner states that “as organizations mature in AIOps adoption, they require a single domain-agnostic platform across I&O, DevOps, SRE and, in some cases, security practices”.

Use enrichment, drive intelligence

Enrichment is the unsung hero of the entire event correlation process. Raw alarm data is a start, but it’s not sufficient to be able to pinpoint the root cause and enable an effective fix. When you have alerts coming in from a variety of domains, it can be difficult to correlate them to produce a fine-tuned set of tickets. You can use timestamps or point of origin, but that will provide limited insight, and you'll miss connections between related alerts coming from other sources or from other time windows. 

Easy-to-deploy alert enrichments add value to every single alert, providing the extra layer of understanding needed to determine which alerts are interrelated, and in what way. This enables you to focus on high-level correlated incidents, instead of following every low-level alert that comes into the AIOps platform. Done right, this process of enrichment reduces the ‘noise’, and helps you bring in topology information from your CMDB, APM, and orchestration tools, change information from your change management and CI/CD pipelines, and business context from your team’s knowledge and procedures.
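To make the idea concrete, here is a minimal Python sketch of topology-based enrichment. Everything in it is an assumption for illustration: the `TOPOLOGY` mapping stands in for an extract from a CMDB, and the alert fields and function names are hypothetical rather than any vendor's API. It shows how tagging each alert with the business service it belongs to lets alerts from different monitoring sources be grouped into a single incident, something timestamps or point of origin alone would not reveal.

```python
from collections import defaultdict

# Hypothetical CMDB extract: host -> business service (illustrative data only).
TOPOLOGY = {
    "db-01": "checkout",
    "web-07": "checkout",
    "cache-02": "search",
}


def enrich(alert, topology):
    """Return a copy of the alert with a 'service' tag looked up from topology."""
    enriched = dict(alert)
    enriched["service"] = topology.get(alert["host"], "unknown")
    return enriched


def correlate(alerts, topology):
    """Group enriched alerts into incidents keyed by business service."""
    incidents = defaultdict(list)
    for alert in alerts:
        enriched = enrich(alert, topology)
        incidents[enriched["service"]].append(enriched)
    return dict(incidents)


alerts = [
    {"host": "db-01", "check": "disk_latency", "source": "infra-monitor"},
    {"host": "web-07", "check": "http_5xx", "source": "apm"},
    {"host": "cache-02", "check": "evictions", "source": "infra-monitor"},
]

incidents = correlate(alerts, TOPOLOGY)
for service, related in incidents.items():
    print(service, [a["check"] for a in related])
```

In this sample, the database and web alerts come from different tools but land in the same "checkout" incident because the enrichment supplies the shared context. Change information and business context could be attached as further tags in the same way.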

Choose AIOps tools that provide built-in, scalable enrichment, and you’ll drive intelligence throughout your operations.

Automate your processes

Automation delivers many benefits, including consistency, saving time, and minimizing errors. When your AIOps platform automates ticketing, you can potentially reduce your Mean Time to Acknowledge (MTTA) to just milliseconds! 

Incorporating your runbooks into your ticketing system means that, when a specific alarm comes in, a specific workflow is triggered. Runbook automation takes care of all the technical steps that don’t require any thinking – such as checking the status of a network resource, or grabbing information from a server or system – putting it all into the ticket and taking it as far as possible before human intervention is required, if at all, to identify and apply the necessary fix. 
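A runbook trigger of this kind can be sketched as a lookup from alarm type to an ordered list of diagnostic steps. The sketch below is hypothetical throughout: the alarm types, step functions, and ticket fields are invented for illustration, and the steps return placeholder strings where a real runbook would query a device or server. It shows the workflow shape only: collect what can be collected automatically, write it into the ticket, and flag the alarms that have no runbook for immediate human triage.

```python
def check_interface_status(alarm):
    # Placeholder step: a real runbook would query the network device here.
    return f"interface status for {alarm['resource']}: (collected)"


def collect_recent_logs(alarm):
    # Placeholder step: a real runbook would pull logs from the server or system.
    return f"recent logs for {alarm['resource']}: (collected)"


# Hypothetical mapping of alarm type -> ordered diagnostic steps.
RUNBOOKS = {
    "link_down": [check_interface_status, collect_recent_logs],
    "high_cpu": [collect_recent_logs],
}


def open_ticket(alarm, runbooks):
    """Create a ticket and run every automated step registered for the alarm type."""
    ticket = {
        "alarm": alarm["type"],
        "resource": alarm["resource"],
        "notes": [],
        "runbook_applied": False,
    }
    steps = runbooks.get(alarm["type"])
    if steps is None:
        # No runbook for this alarm type: leave the ticket for manual triage.
        return ticket
    for step in steps:
        ticket["notes"].append(step(alarm))
    ticket["runbook_applied"] = True
    return ticket


ticket = open_ticket({"type": "link_down", "resource": "edge-router-3"}, RUNBOOKS)
for note in ticket["notes"]:
    print(note)
```

Keeping the steps as plain functions in a table means a new runbook can be added without touching the ticketing logic.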

In addition to driving down the workload of your IT Ops teams and increasing the speed of incident or outage resolution, automation frees up your operations teams to focus on high-value, challenging work that both drives innovation for your business and improves their productivity.

Drive continuous insights 

The maximal value of implementing an AIOps solution goes beyond just improving ad-hoc resolution of performance issues. It also drives continuous process improvement over time, by enabling you to analyze every single stage and understand how long each one takes, from incident detection, to investigation and root cause analysis, to remediation and resolution. 

By setting KPIs, you can both track your progress and identify where the delays and performance gaps are. This, in turn, gives you specific areas to focus on in your quest to make your processes work more efficiently, helping you to determine what next steps will deliver the most value and further improve your team’s productivity. For example, tracking and identifying the applications or business services that are most impacted by IT incidents over time provides you with a bird’s-eye view of where your operational hotspots are. Further tracking of the top checks, top alert categories and their MTBF (Mean Time Between Failures) can help you pinpoint their exact location. Your overall operational efficiency can be measured and improved by tracking incident assignments over time (between L1s, L2s, L3s and specific groups within your organization). Tracking your MTTA (Mean Time to Acknowledge), MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) KPIs over time can help you analyze and improve each stage of your incident management lifecycle.
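As a rough illustration of how these KPIs fall out of incident timestamps, the Python sketch below computes MTTD, MTTA and MTTR in minutes. The field names and sample incidents are hypothetical, and definitions vary between organizations. Here MTTD is assumed to run from the start of the problem to detection, while MTTA and MTTR are both measured from detection. Adjust the start points to match your own definitions.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


def incident_kpis(incidents):
    """Compute mean detection, acknowledgement and resolution times in minutes."""
    return {
        "MTTD": mean(minutes_between(i["started"], i["detected"]) for i in incidents),
        "MTTA": mean(minutes_between(i["detected"], i["acknowledged"]) for i in incidents),
        "MTTR": mean(minutes_between(i["detected"], i["resolved"]) for i in incidents),
    }


# Illustrative incident records only.
incidents = [
    {
        "started": datetime(2021, 6, 1, 9, 0),
        "detected": datetime(2021, 6, 1, 9, 4),
        "acknowledged": datetime(2021, 6, 1, 9, 10),
        "resolved": datetime(2021, 6, 1, 10, 4),
    },
    {
        "started": datetime(2021, 6, 2, 14, 0),
        "detected": datetime(2021, 6, 2, 14, 2),
        "acknowledged": datetime(2021, 6, 2, 14, 4),
        "resolved": datetime(2021, 6, 2, 14, 32),
    },
]

kpis = incident_kpis(incidents)
for name, value in kpis.items():
    print(f"{name}: {value:.1f} min")
```

Computing the same figures per week or per month, and per application or assignment group, gives the trend view that shows which stage of the lifecycle is holding you back.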

Do bear in mind that, as with any strategy, your IT Ops teams are critical partners in this process. Stay in close communication with them to make sure your AIOps solution is easing their workload, and not creating more work for them. Perhaps you've got correlation patterns that need to be updated or better tuned, or perhaps the teams could benefit from additional enrichment. Whatever it may be, you need to work with them to identify and address pain points and, where things are going well, make sure they know about it and can get the most value from it.

The world of AIOps is rapidly evolving. This makes it challenging to chart a course, and ensure that you can wisely choose from the many AIOps platforms that are available in the market. By defining what AIOps means for the future of your business, and adopting the five strategies outlined above, you will find that implementing an AIOps platform can deliver exceptional benefits and efficiencies that help to truly transform your operations.

 

[1] Gartner Market Guide for AIOps, Pankaj Prasad, Padraig Byrne, Josh Chessman, April 6, 2021

About the Author

Yoram Pollack is a Director of Product Marketing at BigPanda. His main interests are in emerging technologies in IT operations and security, with a focus on AIOps: exploring how bringing Machine Learning and AI into IT Ops can help reduce IT noise, detect and surface probable root cause, and automate manual aspects of IT incident management. Combining his engineering background with over 20 years of storytelling expertise, Yoram has been working in his recent roles to help enterprises understand how technology can meet their needs and help their businesses grow.
