AIOps: Site Reliability Engineering at Scale

Key Takeaways

  • AIOps can simplify and streamline processes, which can reduce the mental burden on employees.
  • Another benefit is improved communication and collaboration between departments, leading to more efficient use of resources and reduced budget overhead.
  • AIOps can simplify implementing measures to minimize downtime, such as improving maintenance schedules or upgrading equipment.
  • AIOps can improve customer satisfaction and enhance customer trust while reducing service disruptions.

It was in the 20th century when software began eating the world. In today’s 21st-century environment, its appetite has turned to humans.

Whether it is financial systems, governmental software, or business-to-business applications, one thing remains: these systems are critical to revenue, and in some cases, to human safety. They must remain highly available in the face of technological, natural, and human-made adversity. Enter the Site Reliability Engineer or SRE.

The SRE model was born out of Google when Ben Treynor Sloss established the first team in 2003:

Fundamentally, it’s what happens when you ask a software engineer to design an operations function ... So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.[1]

Since the model's inception, engineering organizations have adopted it in various ways, yet one fact remains the same: these engineers support revenue and business-critical operations 24x7x365.

It is challenging to locate, hire, and train SREs. In an ever-changing landscape of infrastructure and buzzwords, this raises the question of how to scale these teams sustainably to ensure the well-being of the team and the continuity of operations. Enter AIOps.

AIOps, or artificial intelligence for IT operations, is a set of technologies and practices that use artificial intelligence, machine learning, and big data analytics to improve the reliability of software systems. AIOps enables cognitive stress reduction, increased cross-functional collaboration, decreased downtime, increased customer satisfaction, and reduced cost overhead.

Reducing Cognitive Overload

The on-call engineer’s cognitive stress problem comes in two forms: separating signal from noise in alerts, and information retrieval.

For anyone who has ever held the proverbial pager (we don’t still use real pagers, do we?), the noise versus signal problem immediately comes to mind when considering cognitive stress factors. The problem is one of balance between actionable alerts and alerts that are too sensitive or noisy. When the noise wins, the result is alert fatigue.[2]

One of the critical benefits of AIOps is cognitive stress reduction. AIOps systems can automatically identify and diagnose issues and can even predict potential problems before they occur. This can reduce the cognitive load on SRE teams, allowing them to focus on more business-aligned project work rather than spending their time troubleshooting issues. 

Additionally, AIOps systems can assist with the "front door problem" associated with incident triage. Monitoring systems collect millions of data points, yet the quality of the information attached to an alert depends on the humans who configured it. Often, this generates a single question when an SRE begins system triage:

"Where do I begin looking to understand the potential blast radius better?"

AIOps systems can assist with this initial triage measure by analyzing potential anomalies in system state or telemetry data and providing both potential areas to focus on and complementary documentation sourced from the organizational intranet.
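As a rough illustration of that first triage pass, the sketch below scores each service by how far its latest reading sits from its own recent history and lists the most anomalous services first. It is a minimal example under stated assumptions, not how any particular AIOps product works: the z-score approach, the threshold of 3.0, and the service names and numbers are all hypothetical.

```python
from statistics import mean, stdev


def anomaly_score(history, current):
    """Return how many standard deviations `current` is from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any change at all is treated as maximally anomalous.
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


def rank_triage_candidates(telemetry, threshold=3.0):
    """Return (service, score) pairs at or above `threshold`, most anomalous first."""
    scored = [
        (name, anomaly_score(history, current))
        for name, (history, current) in telemetry.items()
    ]
    flagged = [(name, score) for name, score in scored if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical latency samples in milliseconds: (recent history, latest reading).
telemetry = {
    "checkout-api": ([120, 118, 125, 122, 119], 310),
    "search": ([80, 82, 79, 81, 80], 90),
    "auth": ([40, 42, 41, 39, 40], 41),
}

candidates = rank_triage_candidates(telemetry)
for service, score in candidates:
    print(f"start with {service}: {score:.1f} standard deviations from baseline")
```

A real system would draw on far richer signals, but even this simple ranking gives the on-call engineer an ordered list of places to look instead of a wall of dashboards.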

SREs must begin thinking about how to empower the adoption of AIOps in their organizations. While this is yet another tech stack SREs need to learn, the payoff in reduced cognitive load can be substantial.

Enhancing the Cross-Team Engagement Model

AIOps can significantly improve cross-functional engagement in a business. In traditional IT operations, different teams may work in silos, resulting in communication gaps, misunderstandings, and delays in issue resolution. AIOps can help bridge these gaps and facilitate collaboration between different teams.

One way AIOps improves cross-functional engagement is through its ability to provide real-time insights and analytics into various IT processes. This enables different teams to access the same information, which can help improve communication and reduce misunderstandings. For example, the data provided by AIOps can help IT teams and business stakeholders identify potential issues and proactively take action to prevent them from occurring, leading to better outcomes and higher customer satisfaction.

Another way AIOps improves cross-functional engagement is through its ability to automate various IT processes. By automating routine tasks, AIOps can free up time for IT teams to focus on strategic initiatives, such as improving customer experiences and innovating new solutions. This can lead to improved collaboration between IT teams and business stakeholders. Both groups can work together to identify areas where automation can be implemented to improve efficiency and reduce costs.

Overall, AIOps can improve cross-functional engagement by providing real-time insights and analytics, automating routine tasks, and enabling collaboration between different teams. By breaking down silos and improving communication between IT and business stakeholders, AIOps can help businesses deliver more reliable and efficient IT services, leading to better outcomes and higher customer satisfaction.

Reducing Downtime Throughout the SDLC

Another critical benefit of AIOps is decreased downtime. Diagnosing a system degradation or failure means reasoning about the performance of computing systems within a constrained environment. With thousands of data inputs, humans-in-the-loop (HIL) have to design additional systems that alert an engineer based on a given set of metrics. The process extends further when an engineer has to read and interpret the data presented after an alert is triggered.

Metrics such as time-to-detection and time-to-resolution are an aggregate evaluation of an engineering team’s effectiveness at receiving, interpreting, triaging, and resolving such incidents. All of this can be drastically improved upon by implementing an AIOps system. An AIOps system can consistently analyze the streams of data points it ingests, auto-remediate less critical issues without human intervention, and alert only for the highest-severity issues. In critical environments, it may still be necessary to maintain a HIL to decide what actions to take inside a company’s infrastructure.
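The split between auto-remediation and human escalation can be pictured as a small routing rule. The following is a minimal sketch, assuming a four-level severity scale and a lookup table of known fixes; the `Alert` fields, the alert kinds, and the remediation names are hypothetical and stand in for whatever an organization's own tooling provides.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Alert:
    service: str
    kind: str
    severity: Severity


# Hypothetical mapping from alert kind to an automated fix.
REMEDIATIONS = {
    "disk_full": "rotate_logs",
    "pod_crashloop": "restart_pod",
}


def route(alert, page_at=Severity.HIGH):
    """Decide what happens to an alert: page a human, auto-remediate, or file a ticket."""
    if alert.severity >= page_at:
        # High-severity issues always keep a human in the loop.
        return ("page", None)
    action = REMEDIATIONS.get(alert.kind)
    if action is not None:
        return ("auto_remediate", action)
    # No known fix and not urgent: record it without waking anyone up.
    return ("ticket", None)


print(route(Alert("billing", "disk_full", Severity.LOW)))
print(route(Alert("billing", "disk_full", Severity.CRITICAL)))
print(route(Alert("search", "slow_queries", Severity.MEDIUM)))
```

Checking severity before looking up a remediation is a deliberate choice in this sketch: a critical alert pages a human even when an automated fix exists.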

Happy Customers, Happy Life

From a customer perspective, AIOps can have a significant impact on their satisfaction with the services they receive. For example, AIOps can help businesses proactively identify and resolve issues before they impact customers. This means that customers are less likely to experience service disruptions or downtime, resulting in improved availability and reliability of services. Additionally, AIOps can help businesses improve the speed and accuracy of incident resolution, which can help minimize the impact of incidents on customers.

Another benefit of AIOps is that it can help businesses identify and resolve issues more quickly, leading to shorter resolution times. This can be particularly important for customers who are experiencing critical issues or downtime. By resolving these issues faster, businesses can minimize the impact on customers and reduce the risk of customer churn.

Overall, AIOps has the potential to significantly improve customer satisfaction by helping businesses deliver more reliable and available IT services and resolve incidents faster. As a senior software engineer, I believe that AIOps is a powerful approach to IT operations that can help businesses stay ahead of the curve in today's fast-paced and competitive market.

Patching the Leaky Bucket

AIOps can help automate and optimize various IT processes, including monitoring, event correlation, and incident resolution. By automating these processes, AIOps can reduce the need for manual intervention, which can help reduce labor costs. Additionally, by optimizing these processes, AIOps can help companies reduce the time and resources required to manage IT operations, leading to overall cost savings.
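Event correlation is one of the easier pieces to picture in code. The sketch below, offered only as an illustration, collapses repeated alerts for the same service and check into a single group when they arrive within a time window of the group's first alert. The tuple layout, the 300-second window, and the sample alerts are assumptions made for the example.

```python
def correlate(alerts, window_seconds=300):
    """Group alerts sharing (service, check) that fall within a window of the first one.

    `alerts` is an iterable of (timestamp_seconds, service, check) tuples.
    """
    groups = []
    open_groups = {}  # (service, check) -> index of the most recent group for that key
    for timestamp, service, check in sorted(alerts):
        key = (service, check)
        index = open_groups.get(key)
        if index is not None and timestamp - groups[index]["first_seen"] <= window_seconds:
            # Same problem, still inside the window: fold into the existing group.
            groups[index]["count"] += 1
            groups[index]["last_seen"] = timestamp
        else:
            open_groups[key] = len(groups)
            groups.append({
                "service": service,
                "check": check,
                "first_seen": timestamp,
                "last_seen": timestamp,
                "count": 1,
            })
    return groups


# Hypothetical alert stream: five raw alerts.
raw_alerts = [
    (0, "db", "latency"),
    (30, "db", "latency"),
    (60, "db", "latency"),
    (90, "api", "5xx"),
    (400, "db", "latency"),
]

grouped = correlate(raw_alerts)
print(f"{len(raw_alerts)} raw alerts collapsed into {len(grouped)} groups")
```

Fewer, richer notifications are the point: the engineer sees one entry saying the database latency check fired three times, not three separate pages.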

AIOps can also help companies reduce the number of service disruptions and outages, which can lead to significant cost savings. Downtime and service disruptions can be costly for businesses, resulting in lost productivity, revenue, and customer satisfaction. By detecting and resolving issues before they impact services, AIOps can help minimize the risk of service disruptions and downtime, leading to cost savings for the business.

Additionally, AIOps can help businesses improve their overall IT infrastructure and application performance. By providing real-time insights into application and infrastructure performance, AIOps can help companies optimize their resources and reduce inefficiencies. This can lead to cost savings by reducing the need for additional hardware and software resources.

A quick internet search will reveal the average salary of a Software Engineer in the United States is $90,000 - $110,000 USD. Assuming about 1,920 working hours a year, this roughly equates to $47 - $57 an hour. Imagine, on average, your incidents involve five engineers, and it takes you three hours to resolve an issue. That means your incidents cost you $705 - $855 per incident. Now imagine you have three incidents a month, or 36 a year, bringing you to approximately $25,380 - $30,780 a year in costs. This doesn’t include any customer revenue loss or the intangible costs of losing customer trust. There are a few essential questions to ask yourself to get a rough estimate of how much an incident costs your company.

  1. How much are engineers paid at my company?
  2. How many incidents do we have in a year?
  3. How long does it take us to resolve those incidents?
  4. What are the intangible costs to our company because of incidents?

Once you do this back-of-envelope math, you’ll quickly understand how even a 10% decrease in incidents will save your company an impressive amount of money on the bottom line.
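The back-of-envelope math above fits in a few lines of code. This sketch uses the hourly rates, team size, and incident duration from the example in this article, with three incidents a month taken as 36 a year; the function names are illustrative, and the calculation covers engineering time only, not lost revenue or customer trust.

```python
def incident_cost(hourly_rate, engineers, hours_to_resolve):
    """Labor cost of a single incident."""
    return hourly_rate * engineers * hours_to_resolve


def annual_incident_cost(hourly_rate, engineers, hours_to_resolve, incidents_per_year):
    """Labor cost of all incidents in a year."""
    return incident_cost(hourly_rate, engineers, hours_to_resolve) * incidents_per_year


def savings_from_reduction(annual_cost, reduction=0.10):
    """Money saved if the number of incidents drops by `reduction` (0.10 = 10%)."""
    return annual_cost * reduction


# Figures from the example: $47 - $57 an hour, five engineers, three hours, 36 incidents.
low_per_incident = incident_cost(47, engineers=5, hours_to_resolve=3)    # 705
high_per_incident = incident_cost(57, engineers=5, hours_to_resolve=3)   # 855

low_annual = annual_incident_cost(47, 5, 3, incidents_per_year=36)       # 25380
high_annual = annual_incident_cost(57, 5, 3, incidents_per_year=36)      # 30780

print(f"Per incident: ${low_per_incident} - ${high_per_incident}")
print(f"Per year: ${low_annual} - ${high_annual}")
print(
    "10% fewer incidents saves about "
    f"${savings_from_reduction(low_annual):,.0f} - ${savings_from_reduction(high_annual):,.0f} a year"
)
```

Swapping in your own answers to the four questions gives a first estimate for your organization.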

Where to Start

The truth is, adopting AIOps is a long journey for any organization. However, with persistence and focus, a company can realize the benefits discussed earlier in this article. Here are a few considerations to get started on your adoption of AIOps. 

  1. Define your goals: The first step is to determine what you want to achieve with AIOps. This can include reducing downtime, improving incident response times, or optimizing resource utilization.
  2. Assess your current IT infrastructure: Before implementing AIOps, you need to understand your existing IT infrastructure, including the tools and technologies you currently use. This will help you identify any gaps that AIOps can fill and ensure that your AIOps program integrates smoothly with your existing systems.
  3. Choose an AIOps platform: There are many AIOps platforms available in the market. Evaluate different options and choose a platform that aligns with your goals and IT infrastructure. Look for features such as automated root cause analysis, anomaly detection, and machine learning algorithms.
  4. Identify data sources: AIOps platforms require a significant amount of data to operate effectively. Identify the data sources you will need to collect, such as log files, performance metrics, and configuration data.
  5. Develop a data strategy: Determine how you will collect, store, and manage the data required for AIOps. This includes deciding on data retention policies, data security measures, and data access controls.
  6. Train your AIOps platform: Once you have set up your AIOps platform and data strategy, you will need to train the platform to recognize patterns and anomalies in your IT infrastructure. This involves feeding historical data into the platform and tweaking the algorithms to optimize performance.
  7. Integrate with your IT operations: Finally, you will need to integrate your AIOps program with your IT operations. This includes setting up workflows for incident management, change management, and resource allocation.


In conclusion, AIOps is a set of technologies and practices that use artificial intelligence, machine learning, and big data analytics to improve the reliability of software systems. AIOps enables cognitive stress reduction, increased cross-functional collaboration, decreased downtime, increased customer satisfaction, and reduced cost overhead. These benefits can be achieved by automating incident management processes, providing real-time visibility into the performance of software systems, and optimizing resource allocation.


  1. Google Interview
  2. "Want to Solve Over-Monitoring and Alert Fatigue? Create the Right Incentives!" [Kishore Jalleda, Yahoo, USENIX SREcon17]
