Lowering Recovery Time through AI-Enabled Troubleshooting


Machine learning algorithms for anomaly detection can assist DevOps in daily working routines: generalized ML models are trained and applied to detect hidden patterns and identify suspicious behaviour. Applied machine learning for IT operations (AIOps) is starting to move from research environments into production environments in companies.

Florian Schmidt, postdoctoral researcher at Technische Universität Berlin, spoke about AI-driven Support for Log-file Troubleshooting at DevOpsCon Berlin 2021.

According to Schmidt, companies that rely on experts for log-file troubleshooting face high costs, as there is a shortage of DevOps engineers/SREs, while the number of individually hosted components per application keeps growing through containerized services and functions.

Schmidt explained how machine learning can be used to reduce troubleshooting time:

I see the major role of machine learning models as assisting DevOps/SREs in detecting anomalies, combined with insightful reporting. This process includes identifying root cause components (e.g. a network switch configuration, a memory leak inside a running service, or a hardware issue), delivering the key abnormal log messages, prioritizing incidents when many happen at the same time, and enriching reports with further information, like variable analysis, to help solve the concrete issue.

Machine learning models can detect hidden patterns and identify suspicious behaviour within log data, as Schmidt explained:

In more detail, there exist two types of anomalies within log data. The first type is called flow anomalies. Flow anomalies capture problems that show up in the frequency and sequence of arriving log messages. ML models learn the frequency, ratio, and sequence of arriving log message templates to detect missing expected log messages, newly appearing messages, and changes in message counts.
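As an illustration, the frequency side of flow-anomaly detection can be sketched as a per-window comparison of log-template counts against a learned baseline. The template names, baseline counts, and tolerance below are invented for the example; a real system would learn them from historical log data:

```python
from collections import Counter

# Hypothetical baseline: expected per-window counts of log templates,
# learned from normal operation (numbers are illustrative).
baseline = {"conn_open": 100, "conn_close": 100, "heartbeat": 60}

def flow_anomalies(window_templates, baseline, tolerance=0.5):
    """Flag templates whose count deviates from the baseline by more than
    `tolerance` (relative), expected templates that are missing entirely,
    and templates never seen during normal operation."""
    counts = Counter(window_templates)
    anomalies = []
    for template, expected in baseline.items():
        observed = counts.get(template, 0)
        if observed == 0:
            anomalies.append((template, "missing"))
        elif abs(observed - expected) / expected > tolerance:
            anomalies.append((template, "count-change"))
    for template in counts:
        if template not in baseline:
            anomalies.append((template, "new-template"))
    return anomalies

window = ["conn_open"] * 30 + ["conn_close"] * 100 + ["oom_kill"]
# flags conn_open (count change), heartbeat (missing), oom_kill (new template)
print(flow_anomalies(window, baseline))
```

Sequence-aware models go further by learning which templates follow which, but the count-based check above already covers the missing, new, and changed-frequency cases Schmidt describes.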

The second type is called cognitive anomalies. Cognitive anomalies represent problems identified within the log message itself. As log messages are typically unstructured text written by developers for developers, these anomalies are represented within the semantics of the text. ML models learn these semantics through NLP techniques to detect groups of words that are typically associated with abnormal behaviour, like: exception, timeout, and failed. Additionally, variables inside a message provide valuable insights (like HTTP response codes) indicating anomalies. Such anomalies are also classified as cognitive anomalies but require additional types of ML models capable of detecting variables within log messages and applying time series analysis.
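A minimal sketch of the cognitive case: a fixed keyword set stands in for the semantics an NLP model would learn, and a simple variable check flags HTTP 4xx/5xx response codes. Both the term list and the regexes are illustrative assumptions, not part of any real model:

```python
import re

# Stand-in for learned semantics: terms typically associated with
# abnormal behaviour (illustrative, not a trained model).
ANOMALY_TERMS = {"exception", "timeout", "failed", "error", "refused"}

def cognitive_anomaly(message):
    """Return True if the message contains abnormal terms or a suspicious
    variable, such as an HTTP 4xx/5xx response code."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    if tokens & ANOMALY_TERMS:
        return True
    # Variable analysis: extract three-digit HTTP-style status codes.
    for code in re.findall(r"\b([1-5]\d{2})\b", message):
        if code.startswith(("4", "5")):
            return True
    return False
```

In practice the keyword lookup would be replaced by learned word embeddings, and the status-code check by time series analysis over extracted variables, as described above.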

InfoQ interviewed Florian Schmidt about AI-enabled troubleshooting.

InfoQ: What is the state of practice in troubleshooting complex applications using logs?

Florian Schmidt: Companies leverage log management frameworks like the Elastic Stack to systematically monitor all application components, store the log data in a data warehouse, visualize application-specific performance KPIs, and apply configurable alerting capabilities.

Such frameworks make it possible to systematically automate application troubleshooting. DevOps/SREs can add self-defined queries to automatically look for suspicious regex patterns within the log messages, and additionally add thresholds to get alerted.
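The kind of self-defined query plus threshold rule Schmidt mentions can be sketched in a few lines; the pattern, log lines, and threshold below are invented for illustration, standing in for a rule registered in a log management framework:

```python
import re

def check_alert(log_lines, pattern, threshold):
    """Count lines matching `pattern`; raise an alert flag when the
    match count reaches `threshold`. A stand-in for a self-defined
    query + threshold rule in a log management framework."""
    rx = re.compile(pattern)
    matches = [line for line in log_lines if rx.search(line)]
    return len(matches) >= threshold, matches

logs = [
    "INFO request ok",
    "ERROR db connection timeout",
    "ERROR db connection timeout",
    "INFO request ok",
]
alert, hits = check_alert(logs, r"ERROR.*timeout", threshold=2)
print(alert, len(hits))  # two matching lines trigger the alert
```

A production rule would run continuously over a rolling time window rather than a fixed list, but the query-then-threshold shape is the same.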

Still, there are many companies that have not integrated any log management framework into their operational processes yet, but instead try to resolve problems by having many experts search log files manually in order to achieve a quick mean time to recovery (MTTR).

InfoQ: What are the advantages and disadvantages of these approaches?

Schmidt: Companies that have already established the key infrastructure components of log management are able to build upon them by adding analytical tools. The advantages are clearly in automating the alerting process: identifying root cause components of an application and delivering the most suspicious log messages to the expert 24 hours a day, 7 days a week. Such assistance enables the expert to concentrate on fixing the problem rather than losing valuable time in the identification process.

I believe in the movement to log management frameworks, as they provide standardized APIs to interact with the log data and enable the integration of further plugins capable of applying even more complex analysis through machine learning. This can additionally assist DevOps to quickly determine the correct root cause and accelerate MTTR.

InfoQ: What role can machine learning models play in troubleshooting?

Schmidt: In my PhD research on anomaly detection in cloud computing environments, I showed that generalized ML models for troubleshooting can be trained and applied in production environments.

We conducted a case study in which we were able to show that ML-driven anomaly detection can reduce the search time by 98% compared to manual search. The key idea of anomaly detection is capturing the "normal" behavior of the monitored service during daily operation as a high-dimensional distribution. The distribution can be learned automatically (with AutoAD4j, a framework for unsupervised anomaly detection) while operating a service, alerting on abnormal/untypical situations (data that does not fit the learned "normal" operation).

The distribution for time series data like monitoring metrics (CPU, memory, network, etc.) can be captured by reconstruction models like autoencoders and forecasting models like ARIMA, while log data is typically described by autoencoder models through word appearances within the log messages over time. When alerting on abnormal behavior through the deviation from the "normal" distribution, a reconstruction error is computed indicating the severity of the anomaly. The most severe anomalies are then reported to the DevOps by indicating the concrete service and the log messages.
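The learn-the-normal-distribution-then-score-deviations idea can be sketched with a deliberately simple stand-in: a per-metric Gaussian replaces the autoencoder or forecasting model, and the deviation score plays the role of the reconstruction error. All metric names and numbers below are invented for illustration:

```python
import numpy as np

def fit_normal(train):
    """Learn the 'normal' distribution (mean/std per metric) from windows
    of healthy operation. A stand-in for training an autoencoder or
    forecasting model."""
    return train.mean(axis=0), train.std(axis=0) + 1e-9

def anomaly_score(x, mean, std):
    """Deviation from the learned distribution; plays the role of the
    reconstruction error indicating anomaly severity."""
    return float(np.abs((x - mean) / std).max())

rng = np.random.default_rng(0)
# Two synthetic metrics per sample: CPU utilisation (%) and memory (GB).
train = rng.normal(loc=[50.0, 2.0], scale=[5.0, 0.5], size=(500, 2))
mean, std = fit_normal(train)

print(anomaly_score(np.array([52.0, 2.1]), mean, std))  # small score: normal
print(anomaly_score(np.array([95.0, 2.1]), mean, std))  # large score: CPU spike
```

A real autoencoder would capture correlations between metrics instead of scoring each independently, but the workflow is the same: fit on normal operation, score new data, report the most severe deviations.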

InfoQ: What have you learned?

Schmidt: Key learnings when trying to deliver machine learning to IT-operations are:

  1. Building up a team of both DevOps and Data Scientists to efficiently learn from each other.
  2. Data is a key resource. As in most ML domains, labelled data is the most valuable asset for creating applicable models. Ask your DevOps engineers to label the concrete log messages that most help to identify the underlying problem or solve the issue. Furthermore, building testing environments with chaos engineering techniques can additionally help to capture problematic behavior more efficiently.
  3. In practice it is important to focus on the simplicity of ML models and their management. Models that require long training times or frequent retraining with DevOps feedback are hard to maintain. Models that generalize and use AutoML or unsupervised techniques are easier to maintain when operated within the infrastructure.

InfoQ: What do you expect the future will bring for AI applied to troubleshooting?

Schmidt: Current research and early applications show that flow anomalies and cognitive anomalies within logs can be detected with very precise results. For logs, the future goes in the direction of structured logging to standardize how logs are written, while companies additionally leverage more complex deep learning models to automatically detect anomalies.

In the long term, I expect that at some point there will be a fully automated self-healing pipeline, which will not only be able to detect anomalies, but also to recover and mitigate any anomalies. This would be an end-to-end solution, which I see as an immune system built for computers.
