
KubeCon NA 2025 - Salesforce’s Approach to Self-Healing Using AIOps and Agentic AI

AIOps and agentic AI technologies can help teams build solutions that analyze Kubernetes cluster health, automatically diagnose platform problems, and orchestrate issue resolution with minimal human intervention. Vikram Venkataraman from AWS and Srikanth Rajan from Salesforce spoke on Tuesday at the KubeCon + CloudNativeCon North America 2025 conference about Salesforce’s approach to self-healing systems using AIOps and AI agents.

The AIOps architecture was developed at Salesforce by the team that builds and supports the infrastructure management software behind the Hyperforce Kubernetes Platform, a managed Kubernetes platform built on multiple clouds (AWS, GCP, and Alibaba Cloud) that provides namespace-as-a-service. The platform operates at a scale of 1,400 K8s clusters, millions of pods, thousands of compute nodes, 40+ operators and integrations, and 200+ monitoring plugins. The speakers estimate that capacity will grow fivefold in the next couple of years. The overall goal of the solution is to let application teams focus on business requirements rather than being bogged down by infrastructure overhead.

They discussed approaches to Kubernetes platform operations that use generative AI and multi-agent collaboration to create a cluster management system that troubleshoots Kubernetes clusters, reducing the mean time to identify (MTTI) and mean time to resolve (MTTR) for critical cluster issues. An agentic AI solution comprises AI agents, each with a specific goal in support of the AIOps platform, and tools that retrieve data from the telemetry platform. Agents can also perform actions against the K8s environment, such as rolling back an upgrade if issues arise during the upgrade process.
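To make the two metrics concrete, here is a minimal sketch of how MTTI and MTTR could be computed from incident timestamps. It assumes both are measured from the moment an incident starts; definitions vary between organizations, and the `Incident` type and function names are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps the metrics need."""
    started: datetime      # when the issue began
    identified: datetime   # when the cause was identified
    resolved: datetime     # when the issue was resolved


def _mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtti_minutes(incidents: list[Incident]) -> float:
    """Mean time to identify: average of (identified - started)."""
    return _mean_minutes([i.identified - i.started for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: average of (resolved - started) (an assumed definition)."""
    return _mean_minutes([i.resolved - i.started for i in incidents])
```

With two incidents identified after 10 and 20 minutes and resolved after 40 and 60 minutes, this yields an MTTI of 15 minutes and an MTTR of 50 minutes.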

Venkataraman and Rajan discussed the challenges of building AI for intelligent operations, including how different agents should communicate with each other and what guardrails and security permissions the agents need in order to operate within the guidelines. They discussed the details of the solution architecture, hosted on the AWS cloud platform, which consists of an AIOps UI for engineers, a Collaborator Agent, Amazon Prometheus and its agent, Amazon EKS, the k8sgpt Operator, which helps with MTTI metrics, and the ArgoCD Controller.

The speakers then shared the details of their tech stack, showing different layers with open source technologies as well as home-grown tools:

  • Substrate: Kubernetes cloud platforms like Amazon EKS, self-managed K8s, Google GKE, and Alicloud ACK.
  • Standard Capabilities: Storage, networking, autoscaling, DNS, load balancing, mesh, and Ingress. Technologies used in this layer include Istio, Cluster Autoscaler, CSI, OPA, Ingress, CNI, LBC, and CoreDNS.
  • Custom Integrations: Capabilities like identity, secrets management, guardrails, and log collection.
  • Platform Capabilities: Components for platform abstractions, deployment orchestration, lifecycle automation, visibility & observability, resiliency, cost management, and best practices enforcement. Tools in this layer include Argo, Kyverno, Spinnaker, Helm, Kube Magic Mirror, Sloop, and Periscope.
  • API: Provides customer access services and hosts the Control Plane, APIs, and self-service portals.

To solve problems like siloed tools, static workflows, and limited feedback loops, the team developed an infrastructure management solution based on AI agents. They started small with a few agents, including the AIOps agent (an on-call report agent) and the Kubectl agent, which integrates with team channels in Slack, translates natural language questions into kubectl commands, and returns the debugging information in Slack. There is also the Live Site Analysis Agent, which automates the weekly platform availability review process by analyzing metrics such as SLA misses and generating root cause analysis (RCA) insights.
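The article does not describe how the Kubectl agent validates the commands it generates. As an illustration of the kind of guardrail such an agent could apply before running anything, the sketch below accepts a generated command only if it is a plain kubectl call with a read-only verb. The function name, the allowlist, and the rule that the verb must be the second token are all assumptions made for this example.

```python
import shlex

# Assumed allowlist of kubectl verbs that only read cluster state.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top"})

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = frozenset(";|&`$<>")


def is_read_only_kubectl(command: str) -> bool:
    """Return True only for a simple `kubectl <read-only verb> ...` command.

    Deliberately conservative: the verb must be the second token, so a
    command with global flags before the verb is rejected, not parsed.
    """
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar malformed input.
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    return tokens[1] in READ_ONLY_VERBS
```

Under these rules, `kubectl get pods -n payments` is accepted, while `kubectl delete pod web-0` and `kubectl get pods; kubectl delete ns prod` are rejected. A check like this would complement, not replace, Kubernetes RBAC permissions scoped to the agent, since even a read-only verb can expose sensitive resources.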

The speakers suggested progressive autonomy when adopting AI-based solutions within your own organization. Their initial approach was to include humans in the loop to ensure the safety and accuracy of the issue resolutions. Once the team gained confidence with AI agents, they started granting more autonomy to agentic solutions.
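One way to picture progressive autonomy is as a policy that maps an agent's current autonomy level and the risk of a proposed action to a decision. The sketch below is illustrative only: the levels, risk classes, and decision strings are hypothetical and do not come from the talk.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Hypothetical autonomy levels, granted as confidence in the agent grows."""
    SUGGEST_ONLY = 0    # agent proposes a fix, a human carries it out
    APPROVE_EACH = 1    # agent executes only after explicit human approval
    AUTO_LOW_RISK = 2   # agent executes low-risk actions on its own


class Risk(IntEnum):
    """Hypothetical risk classes for a remediation action."""
    LOW = 0
    HIGH = 1


def decide(level: Autonomy, risk: Risk, approved: bool = False) -> str:
    """Return "suggest", "await_approval", or "execute" for a proposed action."""
    if level == Autonomy.SUGGEST_ONLY:
        return "suggest"
    if level == Autonomy.AUTO_LOW_RISK and risk == Risk.LOW:
        return "execute"
    # Every remaining case keeps a human in the loop.
    return "execute" if approved else "await_approval"
```

In this sketch, raising an agent from `APPROVE_EACH` to `AUTO_LOW_RISK` removes the approval step for low-risk actions only; high-risk actions still wait for a human at every level.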

They concluded the talk by saying the team has just scratched the surface of what AI technologies can do, and that AI agents can be useful in several other use cases. Their AIOps program roadmap highlights scaling AI agents to eliminate 80% of manual work, building a knowledge graph that contains the information needed to connect the dots across different components in the overall system, and using AI to detect and troubleshoot severe performance issues.

 
