
When AIOps Meets MLOps: What it Takes to Deploy ML Models at Scale



Ghida Ibrahim introduces the concept of AIOps, which refers to using AI and data-driven tooling to provision, manage, and scale distributed IT infrastructure.


Ghida Ibrahim is the Chief Technology Architect and head of Data at Sector Alarm. Prior to that, she spent 5+ years as a technical lead at Meta/Facebook building AI tools. Ghida’s experience also includes working for major European Telcos in roles at the intersection of distributed computing and advanced analytics. She also occasionally lectures at university and speaks at conferences.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Ibrahim: My name is Ghida. I'm a chief architect and head of data at Sector Alarm, which is one of Europe's leading providers of smart alarm solutions. It's a Norwegian company, not present in the UK, in case you haven't heard of it. Before that, I spent 6 years at Meta/Facebook as a technical lead working on infrastructure optimization.

I'm going to talk about how AIOps, which refers to using AI and advanced analytics for managing IT operations at scale, is really useful in the context of ML operations. This talk is based on my previous experience, prior to my current role, especially at Meta, even though I have to emphasize that this is not necessarily a Meta-related talk or a talk sponsored by Meta. It's just inspired by my work there and also my work during my PhD on multi-cloud optimization.


First of all, I'll introduce the concept of AIOps: what do we mean by AIOps? Then I will talk about the transition that we have seen from DevOps towards MLOps, and how MLOps differs in some ways from DevOps; where the orchestrators used for both DevOps and MLOps fall short, and why AIOps could actually be the solution here. Then, we'll dive deeper into specific use cases of how to use AIOps for improving ML operations at scale, specifically use cases related to trend forecasting, workload orchestration, anomaly detection, and root causing.

What is AIOps?

What is AIOps? AIOps is AI operations. It refers to the use of AI, machine learning, and advanced analytics to enhance and automate various aspects related to IT operations and infrastructure management. You can think of anything related to forecasting trends, whether this is the compute needs of a given application, the storage needs, or the network bandwidth needs. It could refer to orchestrating load. If you have different workloads, how do you decide which type of infrastructure to use for a specific workload, and where to direct this workload in real time as well? Also, there is a lot of work related to anomaly detection: how can we detect that something went wrong, whether it is at the application level or at the infrastructure level, and how do we root cause it efficiently without flooding development teams with noisy, low-signal alerts?

From DevOps to MLOps: Where Do Current Orchestrators Fall Short?

In this section, I'll try to give an overview of the evolution from DevOps towards MLOps, and where current orchestrators, which introduce some level of intelligence here, fall short. DevOps is a set of practices that combine software development and IT operations. The goal here is to shorten the system's development lifecycle and provide continuous delivery of high-quality software. I'm sure you're all familiar with this diagram: starting with planning what to code, coding, building, testing, releasing, deploying, operating, monitoring. This is an example using Azure as a platform. Here you see that as a developer, you could be using a tool like Visual Studio to do your coding. Then you would be committing it to GitHub, for instance. You would be using GitHub Actions to do things like building your code or testing it as well. Then, once it's ready, you can maybe transform it into a container. You can push it into a container registry, or you can deploy it to an orchestrator, like Kubernetes, for instance. Kubernetes is not the only choice here, it's just an illustrative example. You can deploy it also to other platforms or other services, like App Service, for instance. After that you have a bunch of monitoring capabilities and tools: Azure Monitor, App Insights. In every cloud provider environment, you would find something similar to what I have just presented here in terms of giving developers tools to code, to test, to build, to deploy, to monitor.

What is MLOps? MLOps is a set of practices and tools that aim at combining machine learning and development operations to streamline the deployment, scaling, monitoring, and management of machine learning models in production environments. What does this mean? It means that we need to cover different steps, starting with data management. Machine learning is very heavy on data. It really relies on identifying patterns in data. Data management, data processing, getting the data ready for training, and then streaming data in real time for inference and serving purposes is a very important component of MLOps. You have the part related to feature engineering: deciding on the features, creating new features, doing a bunch of transformations on the data to create these features. You have the model building itself, which we refer to as ML training. Really deciding on the model that we want to use, the hyperparameters, maybe fine-tuning, maybe comparing different models and deciding at the end, what do we want to go for? Of course, the evaluation part of this. Once we have made this decision, you have the ML model deployment or serving, monitoring in production, and optimization and fine-tuning. Of course, this is a cycle, because how the model performs in production will influence what kind of data we decide to collect in the future. We would need to make sure that the collection of data is always up to date, feature engineering is revisited, the model is retrained. This is really a loop here. MLOps is quite a multi-step process, a multi-step workflow.

What's the difference between MLOps and DevOps? DevOps mainly enables the deployment of a given application, probably a containerized application, within a given cluster. Whereas MLOps deals with an entire end-to-end workflow that has a lot of dependencies and can have complex logic. Most importantly, the different steps that I just presented, from data management to model training to feature engineering to model serving, have different infrastructure and provisioning requirements. In a way, MLOps is quite a bit more complex than DevOps. Of course, it can leverage DevOps to handle individual ML tasks. For instance, we can use it for training or for serving. It's not only DevOps; DevOps is just an enabler, just a part of the entire workflow, of the entire equation. This is an example of an MLOps workflow that I took from a very interesting book by Chip Huyen, Designing Machine Learning Systems. I really recommend reading this book. It could be something like this. We could decide to pull data from a data warehouse. We featurize the data. We train two different models. We compare these two models, and we evaluate them in real time. Then we decide which one to use. This logic is really not simple enough to be deployed with DevOps alone, so we really need to support complex logic and complex workflows here.
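The kind of data-dependent branching described above is easy to express in plain Python, which is what newer orchestrators try to preserve. Below is a hypothetical sketch, where the task names and the toy "models" are invented for illustration: featurize the data, train one or two candidates depending on how much data there is, evaluate each, and promote the better one.

```python
# Hypothetical sketch of a conditional ML workflow: the branch on data
# size is exactly the if-then logic a static DAG struggles to express.

def featurize(rows):
    # Toy featurization: keep the first value as the feature, last as label.
    return [(r[0], r[-1]) for r in rows]

def train(model_name, data):
    # Stand-in for real training: a "model" that predicts the mean label.
    mean_label = sum(label for _, label in data) / len(data)
    return {"name": model_name, "prediction": mean_label}

def evaluate(model, data):
    # Mean absolute error of the constant prediction (lower is better).
    return sum(abs(label - model["prediction"]) for _, label in data) / len(data)

def pipeline(rows):
    data = featurize(rows)
    # Branch on the data itself: with little data, skip the second candidate.
    candidates = ["model_a"] if len(data) < 3 else ["model_a", "model_b"]
    models = [train(name, data) for name in candidates]
    scores = {m["name"]: evaluate(m, data) for m in models}
    best = min(scores, key=scores.get)
    return best, scores
```

Because the workflow is ordinary code, the same branch could just as well depend on row counts, schema checks, or live evaluation results, which is the flexibility the talk argues monolithic DAG definitions lack.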

Where do current orchestrators fall short? Here, I have taken a bunch of orchestrators, the most popular ones, at least, for both DevOps and MLOps. I did a lightweight comparative study of these different orchestrators and where they fall short in terms of enabling this entire workflow to operate smoothly. Kubernetes would be the most popular DevOps orchestrator. Of course, it has many advantages in terms of enabling and orchestrating IT applications in a given cluster and not worrying about a bunch of stuff, from scaling to failure management, to updating the code, all of these things. Still, it has its own limitations. Some of these limitations are related to the fact that we're dealing with a mono-cluster deployment. This cluster has intrinsic and similar capabilities: it's a CPU cluster or a GPU cluster, or a certain type of GPU or a certain type of CPU. It's really a mono-cluster configuration deployment here. You still have to configure the container yourself as a software engineer and decide which containers would go together into one pod. Autoscaling is not always supported in Kubernetes. Failure can be detected at nodes, but it's not necessarily root caused and it's not detected at the application level, which in itself can be a limitation. Because if there is a problem that is impacting all nodes, and all nodes keep failing, we can redirect the load to another node, but we're really not resolving the problem here. At the end, cost efficiency or resource efficiency and service QoS are not really a priority here. It's really not a priority to decide, is this cluster best suited from a latency perspective for this kind of workload or not? This extra logic is not really there. Now comes a series of MLOps orchestrators that have evolved over time. We started with Airflow, which enables what we call configuration as code.
Basically, you can code your workflow, or configure it: start with this task and then continue to that task, and for this task, use this infrastructure. The problem with Airflow is that it enables a monolithic deployment of workflows. An entire workflow, including all the steps that I just mentioned, would be deployed in just one application, or one container, which is not always the most optimal thing to do, given that every step is separate. We want to limit this high dependency, this tight coupling between all these steps. Also, because these different steps have different infrastructure requirements, maybe this is not the best solution from a provisioning perspective. Also, Airflow has limitations in terms of workflow customization and handling complex workflows. For instance, if we look at what I just presented here, this kind of workflow, here we have a certain complex logic, which is an if-then logic. This would not be supported in Airflow. Same thing, for instance, if we have a workflow that depends on how many lines of data we have. Depending on the number of instances we have in the data, we decide to go for one model over another. Airflow would not enable this. These are the main limitations of Airflow that led to a new generation of MLOps orchestrators, like Prefect and Argo, which obviously introduced a lot of improvements in terms of customization, in terms of supporting more dynamic workflows, more parameterized workflows. Still, they had their own limitations, especially in terms of messy configuration. For instance, if we look at an orchestrator like Prefect, you still have to write your own YAML file yourself, and you have to also attach Dockerfiles to run specific steps in your workflow on a Docker container, for instance. It's really a lot of configuration. It requires a lot of expertise. It requires a lot of knowledge. The infrastructure configuration is still pretty much manual.
You need to decide how many GPUs or TPUs or CPUs you need, and what type. Also, these two orchestrators have limited testing capabilities, in the sense that workflows only run in production. They didn't offer developers that much flexibility in terms of testing their workflows. This is when a new generation of MLOps orchestrators came, like Kubeflow and Metaflow. Kubeflow is more popular than Metaflow. They solved the problem of limited testing capabilities by enabling developers to run workflows from a notebook, both in production and in dev environments. A really smooth experience here for developers. There are still messy configuration components remaining here, especially for Kubeflow, where you really need to create a different YAML file and Dockerfile for each part of your process, and stitch them together in a Python file. This is quite a lot of configuration. The infrastructure configuration as well remains pretty much manual. This is where current orchestrators, even though they brought the developer community a long way and adapted to ML needs, fall short. This is where introducing an extra abstraction level that we refer to as AIOps could be very beneficial.

AIOps for MLOps: Relevant Use Cases

In this section, we're going to dive a bit into AIOps: what we mean by it, what this extra layer is that we're referring to, what kind of functionalities it enables, and what kind of abstraction it provides. Most importantly, what AI techniques or analytics techniques are relevant there. For instance, if we look at ML pipelines. As I mentioned, we have different components, from data extraction, transformation, and loading, to feature engineering, ML training, and ML serving. We have the underlying resources, which can be on-premise or in the cloud. It could actually be a hybrid setting where a part of the resources is on-premise and a part is in the cloud. It could be a multi-cloud native setup. It could be really anything. You don't really need to be dependent on one cloud in this instance. The resources here could be physical infrastructure resources, like EC2 instances for AWS, like virtual machines. It could be platform resources like EKS or AKS, basically Kubernetes orchestrators, or other tools like the AWS batch processing tool. It could be Software as a Service resources like Vertex AI or Amazon SageMaker that offer a really high level of abstraction. Still, for us, as developers trying to deploy machine learning models, these are all resources that we need to access, and they can be anywhere, in any cloud provider. Actually, some cloud providers could be better for ETL, others could be better for ML training. We need this layer in the middle to do this kind of matching between the needs of our ML pipeline and what resources are best suited for them, both from a performance perspective, but also from a cost perspective for us as a business client.

The AIOps layer is really this layer in the middle doing this abstraction and doing this matching between the needs and the resources. It's doing a bunch of things, from capacity scaling to workload orchestration, and anomaly detection and root causing. We will dive next into each of these use cases, and see what kind of AI techniques could be used there. If we start with capacity scaling, I think the first use case would be what we call workload profile prediction. What do we mean by workload profile prediction? It's basically being able to tell, based on the definition or the description of a given task, and on any past demand patterns or data that we have, what the capacity needs of this particular task in our ML workflow would be. It would look something like this. You would have the task SLAs, so service level agreements that are dependent on the task. For instance, if we're talking about data transformation, it will be about the data volume, the original schema of the data, the target schema of the data. We'll see an example. If we're talking about ML training, it could be about the data itself, but also about the models that we want to experiment with. If it is about ML serving, it could be about the number of concurrent requests that we're looking at, and what kind of latency expectations or other expectations we have in mind, plus past demand patterns. What this engine should be able to tell us would be: how much compute do we need? What type of compute do we need? Is it CPU, is it GPU? Within CPU and GPU, we have different options. Even NVIDIA has different machines available for these purposes. What kind of machine is best suited for our workload? How many cores do we need? How much data capacity do we need? Is it best to have it in memory or on disk? Is it best to structure this data in a SQL or non-SQL format?
Most importantly, the network, so the input-output bandwidth, whether it is within a given instance, so between the memory and the compute, or between the disk and a GPU or CPU instance. This engine will really do this. For doing that, we can use a number of techniques. One technique would be time-series forecasting, especially if we have previous patterns of our application or of our task. We can use this to predict the future. The other thing would be doing some inference. We already have a lot of ML tasks deployed in production by engineers and developers, and a lot of testing has taken place around what the best deployment choice would be. We can use this knowledge to do inference in the future about similar tasks, in the same way Netflix works, in the sense that based on what you have watched in the past, or based on what other people have watched, you get a recommendation. We can use the same logic here to infer that for this specific task, which is similar to that task we know of from the past, we think these resources and this infrastructure are best suited. These are techniques that we can use in this context.
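As a rough sketch of the time-series side, simple exponential smoothing over a task's historical peak demand is one of the lightest-weight techniques that could be used here. The demand numbers and the 20% headroom factor below are illustrative assumptions, not recommendations.

```python
# Minimal capacity-forecasting sketch: smooth past peak CPU demand and
# provision the forecast plus headroom.
import math

def smooth_forecast(history, alpha=0.5):
    """One-step-ahead forecast via simple exponential smoothing."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def cores_to_provision(history, headroom=1.2):
    # Headroom absorbs forecast error; round up to whole cores.
    return math.ceil(smooth_forecast(history) * headroom)

# Daily peak core usage for a (made-up) training job over the past week.
peaks = [40, 42, 41, 45, 44, 47, 46]
print(cores_to_provision(peaks))  # forecast ~45.6 cores -> provision 55 with headroom
```

A production engine would use richer models (seasonality, trend, similarity-based inference across tasks), but the shape of the decision, forecast demand and then pick a capacity level, is the same.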

If we go into specific examples, let's take a task that consists of transforming medical training data for ML training purposes. We would have the initial data state, the volume, and the target data state or schema. Ideally, this engine would output the storage capacity required in terabytes or petabytes. The storage capacity type: it could be that we want to use a data warehouse infrastructure for storing original and target data, and we want to use memory in between in the transformation process. Also, we could need compute for doing these ETL jobs. For instance, here, the prediction could be the number of data warehouse units that we need, and what we think the data warehouse units' expected utilization over time would be, so that we don't provision a given number at all times. We can just scale it up and down, or basically schedule an autoscaling operation. Another example of a task would be: now we have transformed the data, let's do the training. Let's predict cancer odds based on patient data. Here, we can say we want to experiment with XGBoost, or a tree-based model. Or we can say, actually, we don't know what the best model to use here would be. If we know, we can provide model hyperparameters. We can say that we want to repeat this training every day, or every week, or at any other cadence that we prefer. We can give a description of the training data, which is the data that we just transformed in the previous step. This forecasting engine will tell us the compute type: we need to use GPUs, let's use NVIDIA H100s, and this is how many instances we need for enabling these operations. And what the usage at different time slots would be. Because, for instance, training is something that we don't do online, it's an offline operation, so it doesn't really require capacity usage at all times. Now that we have built our model, we want to deploy it to production.
The task description here would be to deploy a cancer odds prediction model in production, or what we refer to as ML serving. Here, again, it could be that the compute type is CPU, let's use something like NVIDIA Triton for serving, and this is how much compute we need. Also, here, one component would be the compute location, because here we're talking about serving. Maybe latency is an important component, or having a real-time reply, maybe, in this use case, is an important component. The capacity forecasting engine would be able to tell from the task description what the requirements of this task are in terms of latency or other QoS requirements, and take this into account when suggesting or outputting a certain infrastructure that can support that.
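A back-of-the-envelope version of the serving estimate could look like the sketch below. The request rate, per-instance throughput, and target utilization are all made-up figures; the point is only the shape of the calculation.

```python
# Serving capacity sketch: size the fleet from peak request rate and
# measured per-instance throughput, keeping utilization below a target
# so queueing delay stays compatible with the latency SLA.
import math

def serving_instances(peak_rps, per_instance_rps, target_utilization=0.6):
    """Instances needed to absorb peak load at the target utilization."""
    return math.ceil(peak_rps / (per_instance_rps * target_utilization))

# Assumed: 1,200 requests/sec at peak; one instance sustains 100 rps
# within the latency SLA; aim for 60% utilization.
print(serving_instances(1200, 100))  # -> 20 instances
```

Keeping target utilization well under 100% is the usual hedge here: response times degrade sharply as a server approaches saturation, so the headroom is what actually protects the latency SLA.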

The second application or use case where we can use AI for enabling MLOps operations at scale is optimizing workload placement. Now that we have the workload profiles associated with the different tasks in our end-to-end ML workflow, we want to do what we call orchestration. Orchestration here is a bit different than the orchestration that we talk about when we talk about Kubernetes, for instance. What we mean by orchestration here is that we have different types of workloads, from ETL, to feature engineering, to training an ML model, to ML serving. These workloads have different requirements. Also, they run at different time slots. For instance, ML training happens once a day. ML serving happens in real time, like every second. What we're trying to do here is: we have all these workloads, we know what their requirements are. How can we aggregate these workloads and decide where to direct them in real time? The first step here is to do workload aggregation, which really refers to multiplexing workloads with similar capacity type and geo needs. For instance, if we know that these different workloads all need the same type of CPU, and we can use the same CPU cluster, and actually one workload needs these CPU machines at certain times, but another workload needs them at other times when they're not needed by the first task, then this constitutes the perfect use case for multiplexing these two workloads together and optimizing our infrastructure usage. This is the multiplexing that happens here, really based on similar capacity type, and also geolocation needs. The other step would be to assign the aggregated workloads to clusters. Now we know that workload 1 and workload 3 I can batch together, because they have similar needs. Now, this is my infrastructure, these are my clusters. I have two AWS clusters, three Azure clusters, whatever; how can I best do this assignment of workloads to clusters?
In this case, you can use something like operations research or constrained optimization for assigning these aggregated workloads to a cluster. What optimization here refers to is trying to maximize the number of workloads we're serving, and also the fact that we're serving them with a high level of SLA, ideally, really close to 100%, while being aware of the infrastructure requirement, and while trying also to keep the cost below a certain level. Here, what we're trying to achieve is to really have a certain objective, which could be, let's achieve all our SLAs, or let's maximize our SLA achievement. Our constraints would be the capacity that we have, and the cost. It could be the other way around, maybe our objective is to minimize the cost, and the constraint that we have is more that our SLA should be more than this. Maybe we're happy with only 99% SLA, we're not really interested with 99.999%, or something like that. It really depends on our priority as a company, as a user. We really can use here some kind of optimization to do this assignment. Then, based on the outcome of this assignment, we might decide that, for ETL, let's use AWS clusters in Brazil. For ML serving, let's use Azure in Western Europe. Based on this offline calculation, we do the direction of the workload in real time. Of course, things can change. Of course, our decision on how to direct workloads could be influenced by the fact that some clusters suddenly became unavailable. We need to react to this as well in real time. At least this provides us with a high-level direction on what to do if everything were to be fine, and what is the most optimal thing to do here.
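As a toy illustration of the constrained-optimization step, the sketch below brute-forces every workload-to-cluster assignment and keeps the cheapest one that fits within each cluster's capacity. A real system would use an ILP or a dedicated solver, and the workloads, capacities, and prices here are invented for illustration.

```python
# Toy workload-to-cluster assignment: minimize cost subject to capacity.
import itertools

workloads = {"etl": 40, "training": 80, "serving": 30}         # required cores
clusters = {"aws-brazil": (100, 1.0), "azure-eu": (120, 1.5)}  # (capacity, $/core)

def best_assignment(workloads, clusters):
    names, caps = list(workloads), list(clusters)
    best, best_cost = None, float("inf")
    # Enumerate every possible placement of each workload on each cluster.
    for combo in itertools.product(caps, repeat=len(names)):
        used = {c: 0 for c in caps}
        for w, c in zip(names, combo):
            used[c] += workloads[w]
        # Feasibility constraint: no cluster over capacity.
        if any(used[c] > clusters[c][0] for c in caps):
            continue
        cost = sum(workloads[w] * clusters[c][1] for w, c in zip(names, combo))
        if cost < best_cost:
            best, best_cost = dict(zip(names, combo)), cost
    return best, best_cost

assignment, cost = best_assignment(workloads, clusters)
print(assignment, cost)  # training lands on the cheap cluster; total cost 185.0
```

Flipping the formulation, minimizing SLA violations subject to a cost ceiling instead, is the same machinery with the objective and constraint swapped, which matches the talk's point that the choice depends on company priorities.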

The third use case is: now we predicted the workload profiles, and then we decided how to aggregate the workloads and where to direct them, to which type of cluster. Now we have all this in place; things can go wrong, things can fail. How do we detect that things fail as soon as possible, preferably before end users detect this themselves? Most importantly, how do we root cause? A bunch of motivations exist here, and I'll try to group them into clusters. First of all, we want to reduce the on-call fatigue. We don't want 1000 alarms about a similar thing. We want to group similar alarms; we want to minimize the false positives. If there is a small alarm about something, but actually the end user will not be impacted, maybe we don't want to bother the engineering teams with it yet. Second thing, we want to reduce the time to detection, ideally detecting issues before users are impacted. Most importantly, we want to understand what has gone wrong, so we can learn from it and make sure that it doesn't propagate, so we can stop it. What can be done here? We can be looking at a bunch of signals, from cluster health, so really physical infrastructure metrics, or KPIs, to more application-level or task-specific KPIs that are relevant for different tasks. For instance, if we're looking at ETL, it could be the number of data rows that we have. If it suddenly increased too much or suddenly decreased too much, it could be an indicator that something wrong has happened. If we're looking at feature engineering, if suddenly we see a big drop or increase in our features, this also could be an indication that something wrong has happened. If we're looking at ML training, the change in the model precision could also be an important metric to look at. We don't need to wait until the entire model has been retrained and we've lost precision. Even if we lost precision on a batch, we can use this as a signal.
Finally, for something like ML serving, we can look at the number of concurrent requests. If we see that the number of concurrent requests has dropped with respect to the same time of day, a day back, or a week back, or a year back, then we can also detect this outlier behavior.
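The same-time-of-day comparison just described can be sketched with a z-score against the historical baseline. The traffic numbers below are invented for illustration.

```python
# Outlier detection sketch: compare today's request rate at a given hour
# against the same hour on previous days using a z-score.
import statistics

def zscore(current, history):
    mean = statistics.fmean(history)
    std = statistics.stdev(history)
    return (current - mean) / std if std else 0.0

# Requests/sec at 14:00 on the last seven days, then today's value.
same_hour_history = [1000, 1020, 980, 1010, 990, 1005, 995]
today = 700
score = zscore(today, same_hour_history)
print(round(score, 1))  # strongly negative -> today's traffic is an outlier
if abs(score) > 3:
    print("anomaly: concurrent requests far below historical baseline")
```

The threshold of 3 standard deviations is a common but tunable choice; picking it too low recreates exactly the alert fatigue this use case is trying to avoid.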

Here, all of these kinds of elements or components will generate alarm signals when something goes wrong. We can use anomaly detection and leverage something like hypothesis testing to ask ourselves: is this behavior of the cluster, or of the model, or of the data normal given what we have seen in the past, or not? We compute what we call a z-score. If the score is high, this means that something wrong is happening, and this is an outlier behavior. The second thing would be maybe to implement in our logic some kind of human knowledge saying, "This is normal, this is not very normal." This can be hard coded, or can be expert knowledge put into our code or into our engine. Then the third step would be to do anomaly clustering. We have all these alarms: are there any similarities between the alarms that we're seeing? Can we refer them to one task, or can we refer them to one type of user device, or one type of infrastructure, like a given provider, Azure, for instance, or a given cluster in particular? Can we do this kind of grouping, which can help us really simplify the way we're doing things and reduce the fatigue of our on-call team? Then, this would lead to what we call a super alarm, an alarm that really aggregates many alarms and has a summary of many wrong things going on. We can do what we call alarm root causing. Mainly the idea here is a tree-based model: what is the most common thing between these different alarms? We've seen this alarm for all users that have iOS devices; what could have gone wrong? Let's start with hypothesis 1. If this is not the case, let's move to hypothesis 2, and so on. This root causing can help us boil down what exactly went wrong, and how to fix it.
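The clustering-into-a-super-alarm step can be sketched minimally as grouping raw alarms on a shared attribute. The alarm fields and the grouping key below are illustrative assumptions.

```python
# Alarm-grouping sketch: cluster raw alarms on a shared attribute (here,
# device type) and emit one "super alarm" per sufficiently large group.
from collections import Counter

alarms = [
    {"task": "serving", "device": "ios", "region": "eu", "metric": "latency"},
    {"task": "serving", "device": "ios", "region": "eu", "metric": "errors"},
    {"task": "serving", "device": "ios", "region": "us", "metric": "latency"},
    {"task": "etl", "device": "n/a", "region": "eu", "metric": "row_count"},
]

def super_alarms(alarms, key="device", min_size=2):
    groups = Counter(a[key] for a in alarms)
    return [
        {"shared": {key: value}, "count": count}
        for value, count in groups.items()
        if count >= min_size
    ]

print(super_alarms(alarms))  # one super alarm: 3 alarms share device=ios
```

In practice the grouping key would itself be learned (or tried across several dimensions: task, device, provider, cluster), and the super alarm's shared attribute becomes the first hypothesis the root-causing tree tests.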
Here, as you can see, we have used a bunch of techniques in this process, from hypothesis testing, to clustering, to tree-based models, to expert knowledge as well, for automating the generation and root causing of high-signal, low-noise alarms. This is just an illustration of how we can leverage AI, advanced analytics, and ML to improve the way we deploy ML workflows.


As we have seen in this presentation, AIOps is key for tackling the increasing complexity of ML workflows. Given how complex ML workflows are, given the dependencies, given the different infrastructure requirements, we need this extra level of abstraction for tackling this increasing complexity and making things easier for developers. It enables abstracting complex decisions and operations, from capacity scaling to workload orchestration and performance monitoring. I personally believe that when implemented correctly by existing cloud providers, or even new platform providers (because this really creates a space for new kinds of operators beyond cloud providers to provide this service, which can be multi-cloud or hybrid), AIOps has the potential to further democratize ML and LLM adoption. As we have seen in the past year or so, LLMs really democratized the deployment of applications. You can simply deploy applications by creating smart prompts. Why not do the same for deploying ML, abstracting all this complexity that we have around ML operations by creating this AIOps-empowered layer, where we can use a bunch of techniques from inference, to time-series forecasting, to optimization.

Questions and Answers

Participant 1: Is this an intended direction that you see that the industry should take, or are you aware of existing solutions or platforms that already propose these functionalities?

Ibrahim: I think that this is a direction that we should be taking. I'm not aware of public solutions that do this kind of abstraction. What I know is that, for instance, at Facebook where I was, there were dedicated teams for creating an extra intelligence beyond the cluster level around how we should be scaling capacity, where we should actually be placing our data centers or our points of presence, how we detect alarms better, and how we best orchestrate different workloads in a way that really allows us to manage the infrastructure better and reduce the cost. Because reducing costs has obviously been a very important theme for many big tech companies over the past years. I'm not aware of solutions that do this at this level of abstraction. Obviously, we have the orchestrators which I presented, and probably there are many more that I haven't necessarily used or am not completely aware of, but I think this is still pretty technical, this is still pretty low level. There is another layer of abstraction that can be added here. This is exactly what I'm talking about here.

Nardon: In my company, we actually developed ML models to predict capacity and scale the cluster up when we anticipated needing more power. You have experience from Facebook, but do you know what the landscape for AIOps looks like at other companies? Is this something that big companies are looking at and building teams for? It seems each company is building their own models. It's a landscape to invest in, it seems?

Ibrahim: I think that maybe cloud providers are best positioned to do this, and obviously some big tech companies that have their own clouds, like Facebook, which has an internal cloud. They are best positioned to do that. I think there is also a role for new players that are cloud agnostic or cloud native to emerge within this landscape, especially as these different steps, as I mentioned, could lead to decisions like: we don't want to use just one cloud provider; we want our solution to be multi-cloud or hybrid. I think there is room for new types of players to come in and establish themselves beyond a given cloud provider or big tech company.

Nardon: It's a good business to invest in for those who are thinking about starting a startup.




Recorded at:

Jun 27, 2024