
How AI Supports IT Operators to Resolve Issues Faster and Keep Systems Running

AIOps equips IT teams with algorithms that use historical data to speed up evaluation and remediation and to surface actionable insights, without the need to solicit feedback from users directly. AI can help IT operators work smarter, resolve issues faster, and keep systems up and running to deliver a great end-user experience.

Rajalakshmi Srinivasan spoke about the impact that artificial intelligence has on IT operation management at DevOps Summit, Canada 2021.

Artificial Intelligence (AI) and Machine Learning (ML) techniques have impacted almost every field and industry around us, particularly those that deal with massive amounts of data, Srinivasan mentioned. Among those, IT Operation Management (ITOM) has been one of the earlier adopters of AI & ML due to the sheer volume of data generated by IT operations, she said.

One of the primary objectives of ITOM is to alert IT teams when a server or application goes down, when response time exceeds a defined threshold, or when any other incident occurs in the system. Srinivasan gave an example of how AI can help operators manage a flurry of alerts:

Alerts should be categorized and assigned to relevant technicians based on severity and degree of business impact. At Site24x7, these alerts are automated and managed by AI algorithms that continuously train on and learn from various user actions, thus reducing the Mean Time To Repair (MTTR).

With the wide adoption of the cloud, the industry is working towards 99.999% uptime of all its resources, Srinivasan said. This involves many meticulous and mundane processes, such as running scripts to perform corrective actions like restarting a process, clearing logs, stopping a service, invoking a URL/REST API to create a ticket/incident, rebooting a virtual/cloud VM, and many more. AI can be used for causal analysis and corrective action, as Srinivasan explained:

Thresholds need not be fixed user configurations; instead they are AI-enabled custom values that vary for various parameters. The criteria for each of these actions are also different. If disk utilization spikes above its usual range, the AI engine detects this as an anomaly and invokes the disk clearing action. If a new alarm/event is found, the AI engine identifies this and triggers the ticket/incident creation task.
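To make the idea concrete, here is a minimal Python sketch of anomaly-driven corrective actions. It is not Site24x7's implementation: the function names (`usual_range`, `handle_disk_reading`, `handle_alarm`), the action strings, and the choice of "mean plus three standard deviations" as the usual range are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical corrective actions; real ones would run a script or call an API.
def clear_disk(resource):
    return f"clear-disk:{resource}"

def create_ticket(resource):
    return f"create-ticket:{resource}"

def usual_range(history, k=3.0):
    """Return (low, high) as mean +/- k standard deviations of past readings."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def handle_disk_reading(resource, value, history):
    """Invoke the disk clearing action when usage spikes above its usual range."""
    _, high = usual_range(history)
    if value > high:
        return clear_disk(resource)
    return None

def handle_alarm(alarm_key, resource, seen_alarms):
    """Create a ticket only the first time a given alarm is observed."""
    if alarm_key in seen_alarms:
        return None
    seen_alarms.add(alarm_key)
    return create_ticket(resource)

disk_history = [61, 63, 60, 62, 64, 61, 63]  # made-up disk utilization (%)
seen = set()
print(handle_disk_reading("db-01", 91, disk_history))
print(handle_alarm("db-01:replication-lag", "db-01", seen))
```

In this sketch the upper bound is learned from the history rather than configured by hand, so each resource ends up with its own cut-off.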

Most corrective actions can be automated based on various predefined criteria, and AI can provide relief in such repetitive scenarios, Srinivasan concluded.

InfoQ interviewed Rajalakshmi Srinivasan about applying AI in IT operations management.

InfoQ: What is the state of practice of AI in IT operations management? What are the possibilities?

Rajalakshmi Srinivasan: Anomaly detection, outage prediction, natural language processing, root cause analysis, seasonality trend analysis, and capacity planning are a few AI & ML techniques that come in handy in ITOM. Let me explain with some practical use cases.

For capacity planning, predictions of disk usage can be made based on past disk utilization values and how they grow. We have had instances where AI simply outperformed the regular static approach of extrapolating the data, because it provided seasonality trends and insights in the data. What we observed is that the values will not always be increasing; they may show a decreasing trend towards the end of every month. Sometimes the value will increase only during the weekends and return to normal during the weekdays. These details have been captured in our AI-based forecasting, which helped us smooth out the irregularities in the data collected, leading to more precise predictions.
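The difference between plain extrapolation and a seasonality-aware forecast can be illustrated with a small Python sketch. It fits a straight-line trend and then adds the average residual for each position in a weekly cycle. This is a deliberately simple stand-in, not the forecasting model described in the interview; the function names, the 7-day period, and the sample data are assumptions.

```python
from statistics import mean

def fit_trend(values):
    """Least-squares slope and intercept over indices 0..n-1."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    slope = num / den
    return slope, y_bar - slope * x_bar

def seasonal_offsets(values, slope, intercept, period=7):
    """Average residual from the trend line for each position in the cycle."""
    buckets = [[] for _ in range(period)]
    for i, v in enumerate(values):
        buckets[i % period].append(v - (slope * i + intercept))
    return [mean(b) for b in buckets]

def forecast(values, steps, period=7):
    """Trend plus seasonal offset; needs at least one full period of history."""
    slope, intercept = fit_trend(values)
    offsets = seasonal_offsets(values, slope, intercept, period)
    n = len(values)
    return [slope * i + intercept + offsets[i % period]
            for i in range(n, n + steps)]

# Made-up daily disk usage (%): steady growth plus a bump on days 5 and 6
# of each week, standing in for the weekend pattern.
usage = [50 + 0.5 * i + (5 if i % 7 in (5, 6) else 0) for i in range(28)]
next_week = forecast(usage, steps=7)
print([round(v, 1) for v in next_week])
```

A pure straight-line extrapolation would miss the weekend bump entirely; the per-weekday offsets are what carry it into the forecast.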

AI techniques have also helped us with automatic anomaly detection whenever there is a drastic deviation in the metrics collected, for reasons such as a sudden increase in the number of requests to the website, the response time of a web transaction spiking to 4x its usual range, a high JavaScript (JS) error count from a particular geographical region, a drop in the number of archiving tasks below the normal count, and more. In all these situations, we greatly depend on AI techniques for automatic anomaly detection in our monitoring systems.
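As a rough illustration of how a deviation from the usual range can be flagged in either direction, the following Python sketch uses a robust z-score based on the median and the median absolute deviation (MAD). The technique is a common textbook choice, not necessarily what Site24x7 uses; the function names, the 3.5 cut-off, and the sample response times are assumptions.

```python
from statistics import median

def robust_score(value, history):
    """Modified z-score: distance from the median, scaled by the MAD."""
    med = median(history)
    mad = median([abs(v - med) for v in history])
    if mad == 0:
        # All history identical: any different value is maximally unusual.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad

def is_anomaly(value, history, cutoff=3.5):
    """Flag values far above or far below the usual range."""
    return abs(robust_score(value, history)) > cutoff

response_ms = [210, 190, 205, 200, 195, 198, 202]  # made-up response times
print(is_anomaly(800, response_ms))  # roughly 4x the usual value
print(is_anomaly(207, response_ms))  # within normal variation
```

Because the absolute score is used, the same check catches both a spike (response time) and a drop (for example, fewer archiving tasks than normal).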

InfoQ: How does AI support IT operators?

Srinivasan: There are numerous ways AI and ML techniques support IT operators. Let me explain a few use cases in our day-to-day work.

Applying dynamic thresholds: Defining thresholds for the metrics collected (response time, CPU usage, request count) can be easily automated with help from historical data. This dynamic thresholding helps us not only with operational accuracy and efficiency, but also with optimal resource allocation.

For instance, Site24x7’s web application has multiple distinct grids for client access, REST API requests, archiving, data collection, data processing, and so on. The response time will vary for each of these grids. A background scheduled task in an archiving grid will take more time, but customer-facing client requests will need an instant response. And even within the same grid, different transactions will have different benchmarked response time values.

In these scenarios, we cannot define a constant threshold for all the grids or for all transactions within the same grid. This is a use case where AI & ML has helped us with dynamic thresholds without any user intervention.
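One simple way to picture per-grid, per-transaction thresholds is to derive each one from that transaction's own history. The Python sketch below uses the 95th percentile plus a safety margin. The grid and transaction names, the margin, and the percentile choice are illustrative assumptions, not details from Site24x7.

```python
from collections import defaultdict
from statistics import quantiles

def learn_thresholds(samples, margin=1.2):
    """Build one threshold per (grid, transaction) from past response times.

    samples: iterable of (grid, transaction, response_ms) tuples.
    Each key needs at least two samples for the percentile to be computed.
    """
    grouped = defaultdict(list)
    for grid, txn, ms in samples:
        grouped[(grid, txn)].append(ms)
    thresholds = {}
    for key, values in grouped.items():
        p95 = quantiles(values, n=20)[-1]  # last of 19 cut points = 95th pct
        thresholds[key] = p95 * margin
    return thresholds

# Made-up history: fast client requests, slow background archiving tasks.
history = (
    [("client", "login", 100 + i) for i in range(20)]
    + [("archiving", "nightly", 5000 + i) for i in range(20)]
)
thresholds = learn_thresholds(history)
for key, limit in thresholds.items():
    print(key, round(limit, 1))
```

With these numbers, a 150 ms response would breach the client threshold while being far below the archiving one, which is the behavior a single constant threshold cannot provide.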

Enhanced communication using chatbots: Gone are the days when we had to log in to various monitoring tools to know the status of the system. Today, with chatbots, these communications are enhanced and integrated in such a way that we can seamlessly make a simple natural language query from our chat application (Microsoft Teams, Slack, Zoho Cliq) and get the status.

Natural Language Processing (NLP) is the AI technique used in combination with API calls to fetch the required data and act on it wherever required.
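The flow of turning a chat message into a status lookup might be sketched as follows in Python. Simple keyword matching stands in for real NLP here, and an in-memory dictionary stands in for the monitoring API call; the monitor names, `parse_query`, and `answer` are all hypothetical.

```python
import re

# Hypothetical status data standing in for a call to a monitoring API.
STATUS = {"checkout": "up", "archiving": "down"}

STATUS_WORDS = {"status", "up", "down", "running"}

def parse_query(text):
    """Extract (intent, monitor) from a chat message using keyword matching."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    monitor = next((w for w in words if w in STATUS), None)
    if monitor and STATUS_WORDS & set(words):
        return "get_status", monitor
    return "unknown", monitor

def answer(text, fetch_status=STATUS.get):
    """Resolve the intent and fetch data; fetch_status is swappable for an API."""
    intent, monitor = parse_query(text)
    if intent == "get_status":
        return f"{monitor} is {fetch_status(monitor)}"
    return "Sorry, I did not understand that."

print(answer("Is checkout up right now?"))
print(answer("status of archiving"))
```

A production chatbot would replace `parse_query` with a trained language model and `fetch_status` with an authenticated API request, but the two-step shape (understand the query, then call the API) stays the same.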

InfoQ: What do you expect that the future will bring for AI in operations?

Srinivasan: Like any other developing technology, AI in IT operations is an ongoing process of improving the system to make it more efficient and productive. Some of the enhancements will be aimed at:

  1. Achieving accuracy in anomaly detection and reducing false alerts
  2. Being proactive and preventing an issue from occurring, rather than being reactive and resolving the issue after it has occurred
  3. Self-training systems to make precise predictions
  4. Increasing use of Deep Neural Network Algorithms in place of machine learning techniques to narrow down and pin-point problems/issues
  5. Making use of AI and ML as a Service to extract meaning from the enormous volume of data collected
