
Instrumenting the Network for Successful AIOps

Key Takeaways

  • AIOps platforms empower IT teams to quickly find the root issues that originate in the network and disrupt running applications.
  • Determining the root cause requires answering the four Ws: what went wrong, when it happened, why it happened, and where it happened.
  • High-quality network data based on real-time packet analysis is the key ingredient for AI/ML algorithms to answer what went wrong and where it happened.
  • Approaching an AIOps platform from a network visibility standpoint builds on a foundation common to NetOps, SecOps and CloudOps. On top of network visibility, teams can add application instrumentation and other logs tailored to their specific needs.
  • Network visibility starts with TAPs around key middleboxes (firewalls, load balancers) and extends to SPANs and flow data at less critical points.


In 2018, one of the big consumer financial services companies experienced a major slowdown in user-experience responsiveness after routine system maintenance. Its application consisted of 27 microservices deployed across a couple of managed data centers and a public cloud. After a weeklong troubleshooting effort, the team realized that one of the authentication services had been migrated to a different data center, and each authentication transaction was now taking 10 milliseconds instead of the usual sub-millisecond. The microservices were very liberal in the number of authentication transactions they performed, so moving this single service added multiple seconds to every call, frustrating users to the point that some abandoned the service.

This example demonstrates why DevOps teams shouldn’t limit themselves to application performance monitoring tools and should treat network visibility as part of overall user-experience visibility. The good news is that more and more organizations understand this imperative and are finding ways to merge network visibility with application performance monitoring. The trend gains urgency as many enterprises upgrade their networks to 100Gbps and consolidate data centers and infrastructure. Because most applications today consist of tens or, in some cases, hundreds of microservices, migrating services can significantly degrade overall responsiveness and user experience.

One of the promising technology advancements that can help a CIO/CISO prepare for and manage such transitions is the ability to leverage artificial intelligence (AI) and machine learning (ML) to turbocharge IT infrastructure and operations (I&O). Known as AIOps, this practice increases IT operational accuracy, agility and efficiency.

The challenge is that it’s very easy for AIOps initiatives to go wrong if the IT team skips over the necessary data considerations, for example by focusing on application visibility only, without understanding the impact of the network on the user experience. For AIOps to be put into action by the operational side of IT, it must rely on accurate, high-resolution and consistent data from the IT infrastructure, servers and the network.

The Benefits and Challenges of AIOps

AIOps sits at the crossroads of NetOps, SecOps, DevOps, BizOps and ITOps and benefits all of these teams, but for the purposes of this article we’ll focus on how network visibility should be deployed and managed by NetOps teams. There are a few reasons for picking this as an entry point for tackling AIOps. First, application performance monitoring has received a lot of attention over the last few years, and because it relies on instrumentation by developers, it does a good job of identifying expected failures but struggles to identify unexpected ones.

Unexpected failures occur when systems from multiple vendors or developers have compatibility issues; in these cases, packet captures provide a source of truth for understanding what went wrong. Second, high-fidelity network data is a critical building block for AI/ML algorithms to correlate failures: get the right data at the right resolution at the right time and you’re in good shape; get it wrong and it’s garbage in, garbage out for the AIOps platform.

AIOps platforms require correlation across multiple sources and types of data to detect abnormal network behavior that may indicate security risks or application performance issues. By sending high-quality, relevant alerts along with important contextual information to security tools, network controllers and corporate dashboards, the network intelligence platform lets NetOps, DevOps and SecOps teams proactively identify and resolve issues, reducing mean time to resolution.

Assuming data quality is high, AI/ML can automate much of the process of understanding which alerts matter and what best to do about them. This requires contextual information around the four Ws: what happened, when it happened, why it happened and where it happened (the last two are the difficult ones). Lack of access to high-fidelity network data makes it difficult or impossible to answer these questions, irrespective of whether artificial intelligence is involved.

Barriers to Successful AIOps

If data is what fuels network domain-centric AI/ML and analytics, then when it comes to AIOps it’s the network data in the purest form – i.e., network packet data – that matters. Specifically, high-resolution network packet data can feed AIOps with insight on problems; information flows for users, applications, cloud, IoT, etc.; security threats, including malicious activity, and of course application performance and the related end user experience.  

By far, the most common issue for organizations starting an AIOps program is that they don’t have access to the data they need to quantify the problems they are facing. Rather than using data to establish baselines and understand where systems fall outside desired parameters, many organizations, even sophisticated ones, rely on subjective complaints from users to identify network issues. There’s no way to automate a solution to “the network feels slow.”  Training machine learning models on something as complex as network or application performance is quite difficult in the best circumstances – trying to do this without a solid baseline backed up by high-fidelity network data is impossible. 

Data from as many locations in the network as possible is important in order to:

  1. Reduce or eliminate blind spots (for example, in public cloud infrastructure)
  2. Better triangulate where an issue occurred

Rich and expansive network information, matched with events and logs from servers, storage and other hardware, provides important context to complement packet data and gives a holistic view of the operational health of the service. To give one example: if the AI/ML system is monitoring throughput over the course of a day and a server goes down in the afternoon, having telemetry from that server will help the system understand the sudden drop in throughput.

The Path to High-Quality Network Data

The highest quality network data is obtained by deploying devices such as network TAPs that mirror the raw network traffic. Many vendors offer physical and virtual versions of these to gather packet data from the data center as well as virtualized segments of the network. AWS and Google Cloud have both launched Virtual Private Cloud (VPC) traffic mirroring features in the last year that allow users to duplicate traffic to and from their applications and forward it to cloud-native performance and security monitoring tools, so there are solid options for gathering packet data from cloud-hosted applications too.

Network TAPs let monitoring tools view the raw data without impacting the actual data plane. For high-sensitivity applications such as ultra-low-latency trading, high-quality network monitoring tools use nanosecond-accurate timestamping to identify bursts at millisecond resolution, bursts that can cause packet drops that normal SNMP-type counters can’t explain. This fidelity of data is also relevant in other demanding applications such as real-time video decoding, gaming multicast servers, HPC and critical IoT control systems. Purpose-built network probes can identify gaps and errors and measure jitter, providing the operations team with a real-time view of data quality. The accurately timestamped data can then be used to extract application-level metrics, such as response time, round-trip time, retransmissions, duplications and missing messages, which relate more directly to user-level applications.
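As a minimal sketch of the kind of metric a probe derives from timestamped traffic, the function below computes mean round-trip time and jitter (mean absolute deviation of the RTTs) from hypothetical (sent, received) timestamp pairs; the data shape is an assumption for illustration, not a real probe API:

```python
from statistics import mean

def rtt_stats(timestamps):
    """timestamps: list of (sent, received) pairs in seconds, as a
    probe might record them from mirrored request/response traffic.
    Returns (mean RTT, jitter), where jitter is the mean absolute
    deviation of the individual RTTs from their average."""
    rtts = [rx - tx for tx, rx in timestamps]
    avg = mean(rtts)
    jitter = mean(abs(r - avg) for r in rtts)
    return avg, jitter
```

A steady rise in the jitter value, even while the mean RTT looks normal, is exactly the kind of early signal SNMP counters miss.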

Networks generate a huge amount of raw data. A single 100Gbps port can pass 50 million packets every second, and a mid-sized data center can have hundreds of monitored ports. Sending all of this data to a centralized location isn’t practical, for the simple reason that it would require another overlay network as complex as the original. To address this challenge, the next step is to summarize, aggregate and consolidate the collected data as close as possible to the monitoring point and use Network Packet Brokers (NPBs) to move the relevant data to further analysis tools. A packet broker is often required to send packet data to multiple ITOps/AIOps, network monitoring or security tools. It is also important to collect the data at an appropriate interval; for modern high-speed networks this means sub-second measurement, fine-grained enough to catch the short traffic spikes known as “microbursts.”
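To make the sub-second measurement point concrete, here is a toy sketch of microburst detection: bucket packet arrival timestamps into millisecond windows and flag windows whose packet count exceeds a threshold. The bucket size and threshold are illustrative assumptions; real probes do this in hardware at line rate:

```python
from collections import Counter

def find_microbursts(packet_times_ns, bucket_ms=1, threshold_pkts=100):
    """Bucket packet arrival timestamps (nanoseconds) into sub-second
    windows and return the start times of windows whose packet count
    exceeds the threshold, i.e. candidate microbursts that a
    per-minute SNMP counter would average away."""
    bucket_ns = bucket_ms * 1_000_000
    counts = Counter(t // bucket_ns for t in packet_times_ns)
    return sorted(b * bucket_ns for b, c in counts.items() if c > threshold_pkts)
```

A one-minute average utilization of 20% can hide a 1ms window at 100% that overflows a switch buffer; this is what bucketing at millisecond granularity exposes.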

Different tools that connect to the network packet broker fabric extract relevant Key Performance Indicators (KPIs) from the raw packets. These KPIs include flow data covering throughput and, more importantly, analysis of latency, retransmissions and other protocol-specific errors.
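A simplified sketch of this per-flow KPI extraction is shown below. The input tuples stand in for already-decoded packet headers (the field set is an assumption for illustration); the output is the kind of flow record an analysis tool would emit:

```python
from collections import defaultdict

def flow_kpis(packets):
    """packets: iterable of (src, dst, size_bytes, is_retransmit)
    tuples, a simplified stand-in for decoded packet headers.
    Aggregates per-flow byte/packet counts and retransmission rate,
    the kind of KPI record exported for downstream analysis."""
    stats = defaultdict(lambda: {"bytes": 0, "pkts": 0, "retx": 0})
    for src, dst, size, retx in packets:
        f = stats[(src, dst)]
        f["bytes"] += size
        f["pkts"] += 1
        f["retx"] += int(retx)
    return {flow: {**s, "retx_rate": s["retx"] / s["pkts"]}
            for flow, s in stats.items()}
```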

The transformation of raw packets into KPIs can be thought of as similar to the process by which a digital camera translates a real-life scene into pixels. These KPIs become the raw data that AIOps platforms can further analyze, employing AI/ML algorithms to extract insight. The next step in the process is collecting the data from multiple sources, normalizing it, and adding context. Most network data is best described as time-series data, where KPIs are collected over time, so databases optimized for time-series processing, such as InfluxDB or TimescaleDB, are best suited for the task. The storage design should take into account the data's cardinality and its enrichment with metadata, such as mapping IP addresses from packets to FQDNs and/or host names from HTTPS handshakes. All this data has to be stored in a well-designed schema to enable the next steps.
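As a minimal sketch of what writing an enriched KPI sample to such a database looks like, the helper below serializes a point into InfluxDB's line protocol (measurement, tag set, field set, timestamp). The hostname `auth.example.com` is a hypothetical enrichment value; in a real schema, the choice of what goes into tags versus fields is exactly where the cardinality concern lives:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Serialize one KPI sample into InfluxDB line protocol.
    Tags carry enrichment (e.g. an FQDN resolved from a packet's IP)
    and are indexed; keeping unbounded values like raw client IPs out
    of tags keeps series cardinality under control."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"
```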

On top of this metadata, collected from network monitoring points, DNS servers, and other systems, the AIOps platform employs algorithms to create meaningful insights. It is important to note that there isn’t a single algorithm that solves all problems. The AI/ML toolbox is more like a Swiss Army knife, where different algorithms are used for specific tasks and multiple algorithms may be needed to complete a larger mission. For example, to identify a network anomaly, one algorithm may first need to create a baseline, learning what a normal network looks like.

Many networks show periodic behavior. On a stock-exchange network, for example, a big spike occurs five days a week at 9AM EST, traffic drops during the day, and spikes again near closing time. By employing an algorithm that learns this behavior, the platform can identify a large deviation from the norm, alerting the security team if a spike occurs over the weekend, or the network team if the network is quiet at 9:02AM EST. Other algorithms need to run in parallel, watching different parts of the network and different metrics: a baseline for throughput is different from a baseline for errors, which is different from a baseline for latency. The AIOps platform has to allow these algorithms to run in tandem and cooperate with each other.
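The baseline-and-deviation idea above can be sketched with a very simple seasonal model: learn per-hour mean and standard deviation of a metric over several weeks, then flag samples outside n standard deviations. Real platforms use far richer models; this is only meant to show the shape of the approach, and the 3-sigma threshold is an illustrative assumption:

```python
from statistics import mean, pstdev

def build_baseline(history):
    """history: {hour_of_week: [metric samples]} gathered over several
    weeks; returns per-hour (mean, stddev) describing 'normal'."""
    return {h: (mean(v), pstdev(v)) for h, v in history.items()}

def is_anomalous(baseline, hour, value, n_sigma=3):
    """Flag a sample deviating more than n_sigma standard deviations
    from the learned norm for that hour of the week, e.g. a traffic
    spike on a weekend or silence at the 9AM open."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > n_sigma * max(sigma, 1e-9)
```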

Another family of algorithms worth mentioning are those that correlate data from multiple sources, for example between a server’s CPU load and the network load, to identify whether one is the cause of the other. There are many more types and families of algorithms required to understand and automate the predictive and reactive roles of ITOps teams, which is why a well-designed AIOps platform must be continuously extensible with new algorithms.
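The simplest member of this family is a Pearson correlation between two equally sampled series, sketched below. Correlation alone doesn't establish which side is the cause, which is why production platforms layer lag analysis and causal inference on top:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sampled
    time series, e.g. server CPU load vs. link utilization.
    Returns a value in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)
```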

The last consideration for a well-deployed AIOps platform is the ability to “close the loop” and let humans provide guidance. While AIOps algorithms learn automatically, human input, such as identifying groups of IPs that are meaningful to the operations team, helps the algorithms converge much faster and adjust to the actual needs of their operators by producing alerts that are useful and actionable.

The Crawl-Walk-Run Approach

AIOps platforms are a technology answer to the complex problems today's IT operations teams face, where moving one small service from one location to another can impact a company's critical service for weeks without resolution. AIOps platforms use AI/ML algorithms to empower IT teams to answer the four Ws mentioned earlier.

AIOps platforms are meant to answer these questions as quickly as possible, preferably before the user notices that something is wrong. However, deployment and operation of such platforms is a journey. Beginning an AIOps pilot deployment by focusing on network data acquisition makes for a solid crawl-walk-run approach. Network TAPs and packet brokers are readily available, so acquiring the data is not particularly complex or risky. This approach builds a solid foundation by making sure the network data is consistent and complete. As AIOps shows value, teams can layer in other data sources to grow the program. It may also be helpful to start with products and services from a single vendor to simplify the early rollout.

AIOps can provide clarity on key IT issues, make IT teams more efficient and speed time to resolution by eliminating manual workflows. But the key to AIOps success and its ROI across the board for IT functions (including DevOps, AppOps, SecOps, CloudOps and NetOps) is to start with a solid foundational network-centric data approach. As with most AI/ML based systems, the quality of the data – its consistency, reliability, completeness, accuracy and precision – matters as much as the data itself. Without full visibility into a network, teams can’t build appropriate baselines to generate useful alerts, and they can’t provide enough context to make those alerts actionable. This instrumentation should be step one to set any AIOps program up for success.  

About the Author

Ron Nevo serves as Chief Technology Officer at cPacket Networks. Ron has over 25 years of experience leading engineering teams through the creation and development of complex networking systems. Ron began his career at Qualcomm, where he was a lead system engineer for mobile telephony systems and was responsible for the creation of IP that is part of the core 3G and 4G systems. Ron was also the co-founder of Mobilian, a wireless semiconductor company. Ron joined Intel through the acquisition of Mobilian, where he successfully led engineering teams in the wireless group as well as Intel’s new business group. Ron holds a B.Sc. in Electrical and Computer Engineering from The Technion in Israel, and holds over 15 granted US patents.

 
