Developing and Deploying ML across Teams with MLOps Automation Tool



Fabio Grätz and Thomas Wollmann discuss the MLOps Automation tool, and how it can be used to perform DevOps tasks on ML across teams.


Fabio Grätz is senior machine intelligence engineer @Merantix. Thomas Wollmann is VP engineering @Merantix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Wollmann: Fabio and I will talk about developing and deploying machine learning across teams with our MLOps automation tool. First of all, we will explain how Merantix Labs fits into the Merantix ecosystem, and how we do machine learning at Merantix Labs. After that, we will go into more detail about the infrastructure we use for conducting deep learning projects, and how the Merantix devtool helps us to have best practices as code and infrastructure as code.

Merantix Labs

Merantix Labs is part of Merantix, which is a venture builder for AI companies in various domains. Within Merantix, we have ventures in medicine, automotive, but also in business intelligence. Merantix Labs itself is a solution provider, where we create bespoke solutions for our clients across industries and machine learning use cases.

Chameleon Ecosystem

For these very different machine learning projects, we don't always write everything from scratch, but rely on an ecosystem of reusable components. It goes from the low level, which we call our toolchain for data access and MLOps, through our platforms for specific areas of machine learning and our solutions for machine learning use cases, up to actual services. All of the layers in this hierarchy are needed for successful machine learning projects: data access, a platform for developing your models, a solution for your use case, and the services around it to bring it into production.

Chameleon Core Platform

Within that, we have, for example, Chameleon, which is our computer vision platform. It serves the needs of creating, for example, models for object detection and segmentation. Our philosophy here is that we try to automate everything that can be automated, through heuristics or optimization, and make everything flexible and adaptable that can't be automated, for example, the special needs of a machine learning model for a client. Within Chameleon, we don't just write everything from scratch again, but rely on several packages from the open source community. I want to share a few interesting things that we use. For ETL, we use Apache Beam. For machine learning, we use PyTorch. For configuration management, Hydra. For our unified data format, we use Zarr, and we use fsspec for cloud-agnostic data access, so you're not tied to one cloud. Finally, for experiment tracking, we use MLflow. All of these technologies bring several requirements to our machine learning infrastructure.

Chameleon in Client Projects

How is Chameleon now used in client projects? In comparison to a product company, we have to create Chinese walls between our clients. Even if two projects have a very similar infrastructure, we have to create it over again for every client, so we have separate environments in terms of code, but also in terms of the compute resources. These reusable components like Chameleon can be used as Python packages within these projects. Custom code is injected for the project. For example, the obvious things are for data formatting or preprocessing, you'll usually have custom code because the data of the client is always a little bit different. Also, for tackling the challenging machine learning tasks of new clients, we have a plugin system for custom models and much more.

It doesn't end there. If you look at the whole workflow of a machine learning project with Chameleon, you also need infrastructure for uploading datasets, labeling the data, but also an experiment infrastructure so you can develop the model. When you have your model, you also need infrastructure to serve that model in your cluster with an API. You can't be sure if the model works in production so you have to monitor it to see for example, domain shifts, but also detect attacks. For that, we need a very elaborate machine learning infrastructure that we developed at Labs.

Software Infrastructure for AI Development is Complex

Grätz: The infrastructure that we would use in a client project could look, for example, like this. By default, we have a GCP project that acts as our Chinese wall, separating different clients. We have a Kubernetes cluster in our projects that we use for model training, as well as for the deployment of models and also, for example, a labeling tool. We have, of course, storage configured so that we can store our datasets and artifacts. We have ETL configured, for example, using Apache Beam. We need artifact stores for Docker images and Python packages, and many more things to make a project successful. The task of an engineer could be that the project starts in a week, and they need to create all of this so that they can start working on that project. We believe in a process where the ML engineers themselves get the tooling to manage that infrastructure, instead of having a dedicated DevOps engineer on every project team.

I'm going to give you a walkthrough now of what the process looks like that an engineer goes through to create such an infrastructure for her or his client project.


We have the MX devtool, which is a Python Click CLI. The core functionality that we will talk about is the generate functionality, which we adapted from Ruby and which allows us to generate certain parts of a project, including a skeleton of a project, using so-called generators. The engineer would start by calling, for example, the MLProjectSkeleton generator and entering the information that is required, like the project name, the compute region, and the other engineers that will work on this project. After that, the generator generates the skeleton of this project, which includes infrastructure configuration, the documentation skeleton, the README, and the serialized state. All the information that the engineer entered a second ago is serialized there, so that when we apply other generators in the future, the engineer doesn't have to enter this information again.
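
The generator idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual MX devtool: the function names (`load_state`, `run_generator`), the state file name `.mx_state.json`, the rendered file names, and the example region are all hypothetical. It only shows the mechanism described in the talk, namely that answers entered once are serialized into the project so that later generators can reuse them.

```python
import json
import tempfile
from pathlib import Path

STATE_FILE = ".mx_state.json"  # hypothetical name for the serialized state


def load_state(project_dir: Path) -> dict:
    """Return previously serialized answers, or an empty dict for a new project."""
    state_path = project_dir / STATE_FILE
    if state_path.exists():
        return json.loads(state_path.read_text())
    return {}


def run_generator(project_dir: Path, files: dict, answers: dict) -> dict:
    """Render `files` (relative path -> template) and persist the merged answers."""
    state = {**load_state(project_dir), **answers}
    for relative_path, template in files.items():
        target = project_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(template.format(**state))
    (project_dir / STATE_FILE).write_text(json.dumps(state, indent=2))
    return state


# Hypothetical skeleton: a README plus a placeholder infrastructure file.
SKELETON = {
    "README.md": "# {name}\n",
    "infrastructure/terraform/variables.txt": "region = {region}\n",
}

project = Path(tempfile.mkdtemp()) / "demo-project"
run_generator(project, SKELETON, {"name": "demo-project", "region": "europe-west4"})

# A later generator needs no new answers; it reads them from the state file.
state = run_generator(project, {"docs/index.txt": "Docs for {name}\n"}, {})
```

Keeping the state inside the project directory means it is versioned together with the rendered files, which is one plausible way to make repeated generator runs reproducible.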

In the infrastructure folder, there's, for example, a Dockerfile, because every project has its dedicated Docker image, so that every project can have its own dependencies. One client project might require PyTorch 1.7, and another one, for some reason, the newest version of PyTorch. Every project has its own base image. There's a Terraform configuration to create the infrastructure that I just showed you. There's Kubernetes configuration to manage experimentation and deployment infrastructure within the Kubernetes cluster.

The next step in the workflow would be to apply this Terraform configuration to create the cluster, IAM configuration, and storage buckets. To do that, the engineer goes to the infrastructure Terraform directory, does a Terraform init, which I omitted here, then a Terraform apply, and waits a few minutes. Then everything in the outline here has been created: the Google Cloud project and, in this case, the Kubernetes cluster. The Apache Beam API, for example, has been activated. Storage buckets have been created. The Docker registry has been created.

The next step in the setup is that the engineer creates a GitHub repository for version control and activates CI/CD, which in a Google Cloud project is by default Cloud Build. The engineer creates the repository and then, to set this up, calls the so-called CloudBuildDefaultConfig generator. It adds the Terraform configuration to activate Cloud Build, as well as the actual manifest that tells Cloud Build which steps to follow: building the Docker image for that project, testing the code, linting the code, publishing the documentation, and, if desired, pushing the Python package that comes out of this project to our pypiserver. The engineer would navigate again to the infrastructure Terraform folder and run Terraform apply to set up CI/CD.

Now that this is done, the next thing the engineer might want to do is create the base Docker image for this project. Since Cloud Build CI/CD has already been configured, we just need to push to the newly created repository so that Cloud Build will build the project base image. We decided that our tests should only pass if the code follows our style guide. Instead of writing pages about how we want the code to look, we incorporated all of that into the devtool, so the engineers just run mx lint on a file or an entire folder. The devtool checks whether all the files adhere to our style guide, and tells the engineers what needs to be changed if they don't. Since this is a newly rendered project, everything is already linted, so we can initialize the Git repo, connect it to the remote GitHub repository, and push. Cloud Build will then run and build the base image for the project.
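
As an illustration of "style guide as code", a lint command for a file or folder could look roughly like the sketch below. It is not the real mx lint: the two rules (a maximum line length and no trailing whitespace), the limit of 100 characters, and the function names are assumptions chosen only to show how a tool can report what needs to change instead of pointing to a written style guide.

```python
import tempfile
from pathlib import Path

MAX_LINE_LENGTH = 100  # hypothetical rule, not the real style guide


def lint_file(path: Path) -> list:
    """Return human-readable style violations for one file."""
    problems = []
    for number, line in enumerate(path.read_text().splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"{path.name}:{number}: line longer than {MAX_LINE_LENGTH} characters")
        if line != line.rstrip():
            problems.append(f"{path.name}:{number}: trailing whitespace")
    return problems


def lint(target: Path) -> list:
    """Lint a single file or every .py file below a folder."""
    files = [target] if target.is_file() else sorted(target.rglob("*.py"))
    return [problem for file in files for problem in lint_file(file)]


# Tiny demo folder with one clean and one non-conforming file.
folder = Path(tempfile.mkdtemp())
(folder / "clean.py").write_text("x = 1\n")
(folder / "messy.py").write_text("y = 2  \n")
report = lint(folder)
```

A CI step can then fail the build whenever the returned list is non-empty, which matches the idea that tests only pass if the code follows the style guide.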

Since version control and CI/CD have been configured and we have a base image that contains all the requirements for our project, we can now go on to setting up the experimentation and model training infrastructure in Kubernetes. First, before the engineers use anything, they need to connect to the new project. For that there's also functionality in the devtool that configures gcloud correctly, and also configures kubectl to have the credentials for the newly generated Kubernetes cluster. To manage deployments within the Kubernetes cluster, we use a really cool program called devspace. Devspace manages deployments, and also the development of projects on Kubernetes. We really like it. The main functionality we use is that it allows us to group certain manifests together in so-called deployments. For example, there are two here. One deployment is the training infrastructure, which includes a Kubernetes manifest for an MLflow tracking server that we want to install in our cluster. Another devspace deployment could be the CUDA drivers that we need to train models using GPUs. This devspace.yaml here is a manifest file that our engineers don't edit themselves. Instead, the generators that the engineers execute add the deployments, so the project skeleton generator added the training infrastructure and CUDA driver deployments. Other generators that we'll see later will add their own deployments here.
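
The idea that generators, not engineers, maintain the list of deployments can be sketched as follows. The dictionary layout, the deployment names, and the manifest file names are illustrative assumptions and do not reproduce the real devspace.yaml schema; the point is only that each generator adds or replaces its own named entry and leaves the others untouched.

```python
import json


def add_deployment(config: dict, name: str, manifests: list) -> dict:
    """Return a copy of `config` with the named deployment added or replaced."""
    deployments = dict(config.get("deployments", {}))
    deployments[name] = {"manifests": list(manifests)}
    return {**config, "deployments": deployments}


config = {}
# What a project skeleton generator might register (file names are made up).
config = add_deployment(config, "training-infrastructure", ["k8s/mlflow_tracking_server.yaml"])
config = add_deployment(config, "cuda-drivers", ["k8s/cuda_drivers.yaml"])
# A later generator adds its own deployment without touching the others.
config = add_deployment(config, "distributed-training", ["k8s/pytorch_operator.yaml"])
# Re-running a generator is idempotent: the entry is replaced, not duplicated.
config = add_deployment(config, "cuda-drivers", ["k8s/cuda_drivers.yaml"])

print(json.dumps(config, indent=2))
```

Keying deployments by name is what makes repeated generator runs safe in this sketch.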

The engineers then call devspace deploy for the training infrastructure and the CUDA drivers, which will take a minute to run. After that, there is an MLflow tracking server running in the cluster that has a Postgres database configured to store experiment metadata and metrics. The MLflow tracking server has been configured to store model artifacts in a Google Cloud bucket that was created when we first called Terraform apply on the newly rendered project.

Now the engineers, following the few steps that we have seen, have installed the training infrastructure, so we can start training a model. To train models we use MLflow with a Kubernetes backend. I don't want to go too much into detail because we mostly follow the default MLflow setup. There is an MLproject file that allows us to group different project entry points, for example, the trainer. The trainer here that is rendered by the project skeleton generator is really just a minimal working example. There is not much logic in here, but it runs, and it can be configured using Hydra and the default configuration here.

To run with a Kubernetes backend, MLflow requires a train_job.yaml, which is also rendered by the default project skeleton generator. Everything we need from the MLflow side is here. We need to connect localhost:5000, which is the default URI for MLflow, from the cluster to our dev machine. We chose not to expose the tracking server, for security reasons, but instead open a Kubernetes port forward, which we wrapped inside mx flow connect so that our engineers don't have to run kubectl port-forward with the service name every time. Once that port is forwarded, our engineers can run mx flow run and choose the trainer entry point and the Kubernetes backend, just like you would do with mlflow run. mx flow run is really a light wrapper around mlflow run that sets some environment variables correctly and configures where to reach the tracking server.
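
A light wrapper of this kind could be sketched as below. This is an assumption about its shape, not the real mx flow run: the function name `build_mlflow_run` and the backend config file name `kubernetes_config.json` are hypothetical. The `mlflow run` options used (`--entry-point`, `--backend`, `--backend-config`) and the `MLFLOW_TRACKING_URI` environment variable are part of MLflow itself. The sketch only builds the command and environment, so it can be inspected without a cluster.

```python
import os

TRACKING_URI = "http://localhost:5000"  # reachable once the port forward is open


def build_mlflow_run(entry_point: str, backend_config: str, project_uri: str = "."):
    """Return the `mlflow run` command and environment a thin wrapper could use."""
    command = [
        "mlflow", "run", project_uri,
        "--entry-point", entry_point,
        "--backend", "kubernetes",
        "--backend-config", backend_config,
    ]
    # Point the MLflow client at the forwarded tracking server.
    env = {**os.environ, "MLFLOW_TRACKING_URI": TRACKING_URI}
    return command, env


command, env = build_mlflow_run("trainer", "kubernetes_config.json")
print(" ".join(command))
# To actually launch the run: subprocess.run(command, env=env, check=True)
```

Wrapping the command this way keeps the tracking server address in one place instead of in every engineer's shell configuration.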

I've shown you how we get from no infrastructure at all to configuring cloud infrastructure using Terraform, installing training infrastructure into the Kubernetes cluster using devspace, and then using MLflow to start a training. This box here is filled. What happens, for example, when the engineers want to do distributed training on their cluster? What do they need to do? There is a generator for that. The engineer would navigate back to the root directory of the project, and call the distributed training infrastructure generator. The distributed training infrastructure generator adds a new train job template for MLflow. This is not a default Kubernetes job, but a Kubeflow PyTorch job manifest. The generator also adds one new section to the devspace.yaml. This section includes a manifest that installs the Kubeflow PyTorch operator. The good thing is that engineers don't have to know every detail about how all this is set up, because they don't need to configure it from scratch. They just read in the documentation which generator they have to call, and which devspace or Terraform command they need to apply so that this manifest gets installed into the cluster. They can now start distributed training using MLflow and the custom plugin that we built to handle this distributed job manifest.

We have training and experimentation infrastructure now. I showed you how an engineer can quickly render or create the infrastructure that is required for distributed training. We do the same thing for every other component. Let's say, for example, we have a model now and we want to deploy it for a client, or we want to deploy a labeling tool, or a Streamlit app for a milestone meeting to showcase a model. For that we use Istio, which is a service mesh, and Seldon Core for model deployment. To install that into the cluster, the engineer would again run a generator, in this case the deployment infrastructure generator. This requires a Terraform apply, because it creates a static IP for the cluster, and it creates DNS records that map a domain that is already configured for this client project to that static IP. Then we can use devspace to install Istio, the service mesh that we use, and a Kubernetes manifest that configures an ingress to the cluster. They wait a few minutes, and the domain name is already configured, HTTPS is configured, and the certificate is configured. The engineers don't have to worry about that. Everything is rendered using these generators.

Engineers can then deploy the infrastructure, for example, for Seldon Core, into the cluster. Seldon Core is a Kubernetes library for model serving and model monitoring that we really like. It's very powerful, and the engineers have to follow only these steps here to install the infrastructure that is required into the cluster to make this work. The same goes for a labeling tool. Let's say the engineers are at the beginning of a project, not in the deployment phase, and client data that has been uploaded needs to be labeled. They run the CVAT labeling tool generator. In this case, the CVAT labeling tool generator will also render the configuration for the service mesh and the ingress, because to expose the labeling tool, we need that too. Since we already applied it here, we don't need to apply it again. Otherwise, engineers would have to. We can now directly proceed to deploying the CVAT manifest using devspace, so that the labeling tool is installed into the cluster. Since we already configured the ingress, we can reach the labeling tool under the project's subdomain. The engineers don't have to worry about a static IP for the cluster. They don't have to worry about managed certificates for that domain. They don't have to worry about these details, because they just render the configuration using the generators.

This is how we created, for example, the service mesh, labeling tools, and deployed models. I didn't go into detail about how the Streamlit app would be deployed, but you do the same thing. There is a generator that gives you a skeleton. The engineers run that generator and have a skeleton with the Kubernetes manifests that are required to deploy it. Also, I didn't talk about documentation. We use Sphinx for documentation, and there is also a devtool command, mx build docs, to build the documentation. Using a generator, we can also configure Cloud Build in a way that the documentation is automatically deployed every time we push to master on GitHub.

I gave you an overview of what the infrastructure looks like for a typical client project at Labs. The important part that I wanted to bring across here is the way that we automate the management of this infrastructure. To us, it is important that we don't need a DevOps engineer on every single client project, but that we build tooling that enables our machine learning engineers to manage the infrastructure themselves.

DevTool - Your ML Project Swiss Army Knife

Wollmann: We showed you how the devtool is our Swiss Army knife for spinning up infrastructure in our client projects. It provides infrastructure as code, so everything is reproducible, maintainable, and traceable. This can all be done by the machine learning engineers and data scientists themselves. They can spin up their infrastructure in the project and manage it without always having a DevOps engineer at hand. We also do this by using project templates, which have several benefits. First of all, you have a common structure across projects, which makes it easier for an engineer to get into a new project. We can use it to preserve learnings across different projects. Moreover, we can use the tool to have best practices as code, because nobody likes to read, for example, long style guides for code. All these repetitive tasks for checking these style guides, and also other repetitive tasks within your workflow, can be automated.

We try to make our tool fun to use for other engineers, so they use it voluntarily. Moreover, we didn't just build the devtool for ourselves, so we made it extendable, because we are in the Merantix ecosystem. There's a core component which gives you all the functionality for generating from templates that we just showed. Different ventures within the Merantix ecosystem can write their own extensions as plugins for their own operations. Hopefully, we could inspire you a bit today with how MLOps at Labs works. Maybe it gave you some ideas for your own operations.

Questions and Answers

Jördening: How many fights did you have about concurrent Terraform applies from machine learning engineers working on the same project?

Grätz: There's a lock. As long as people don't force unlock it, that's ok.

Jördening: You have a convention on how to write the locks too.

Grätz: Terraform does that when one engineer calls Terraform apply. One important point is that we synchronize the Terraform state in a cloud bucket, so everybody sees the same lock. As long as they don't force unlock it, nothing happens.
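
The effect of such a shared lock can be illustrated with a small analogy. The sketch below is not how Terraform or its remote state backends implement locking; the class name `StateLock` and the lock file name are illustrative assumptions. It only shows the behavior discussed here: while one holder has the lock, a second attempt is refused until the lock is released.

```python
import os
import tempfile


class StateLock:
    """Exclusive lock backed by a lock file; a second holder is rejected."""

    def __init__(self, lock_path: str):
        self.lock_path = lock_path

    def acquire(self, owner: str) -> bool:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as handle:
            handle.write(owner)
        return True

    def release(self) -> None:
        os.remove(self.lock_path)


lock = StateLock(os.path.join(tempfile.mkdtemp(), "default.tflock"))
first = lock.acquire("engineer-a")    # first apply takes the lock
second = lock.acquire("engineer-b")   # concurrent apply is refused
lock.release()                        # first apply finishes
third = lock.acquire("engineer-b")    # now the second engineer can proceed
```

A force unlock corresponds to deleting the lock file while someone else still holds it, which is why it is the one thing that can cause trouble.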

Jördening: The locks are probably then semantically generated from your project names, so that you have a unique name for each project?

Grätz: Yes.

Jördening: We already had the question regarding K8s native versus cloud native. What is your tradeoff, especially since you're working together with customers? You probably sometimes have someone saying, GCP is not really our preferred cloud provider, or is it ok for all your customers?

Grätz: We are a GCP partner, so we like to work with Google Cloud. It's true that, especially like small and medium enterprises in Germany, they start and say, yes, but only Azure. We can only use Azure, for example. Working with Kubernetes is great for us, because then we can tell them, ok, that works for us, please give us access to a managed Kubernetes service that runs in your infrastructure, and we'll take it from there. This is why Kubernetes is really great for us.

Jördening: How do you do it if you share the machine learning models between different customers? Because it seems like you have VCS for every project running on a separate GKE cluster? Is there some synchronization you have between the VCS, or how do you avoid the big, do not repeat yourself?

Wollmann: Most of our clients prefer that they are completely isolated from our other clients. There are just a few components that we share to not repeat ourselves. One is, for example, Chameleon, which we use to build computer vision models and which is completely task agnostic. It's more like a set of things to ease your work and automate what can be automated. This is distributed through a pypiserver, where we have the central storage, and the different projects can read from this pypiserver and install things. The same goes for more elaborate stuff, which is shipped, for example, in Docker containers. It's not that we have lots of reads and writes between the projects. We try to separate them. This was also a motivation for the devtool and this templating, because we have to create the same infrastructure multiple times without doing everything manually. Copy-pasting Terraform files gets infeasible at some point.

Jördening: What benefits do you see of building such an environment as opposed to getting an off-the-shelf ML Ops tool with all the elements?

Grätz: Let's say we build on Google's Vertex AI, or on SageMaker. Then the next client comes to us and says, but we have to do it on Azure for internal reasons. Then we are completely vendor locked and would have a hard time fulfilling the contract with them, because we rely on a certain platform from another cloud provider. That's one argument.

I think the other one is that it's very powerful if you can manage these things yourself, because you're not locked in by what certain managed solutions offer. That being said, I also don't want to make managed solutions sound bad. I think managed solutions are great, and you should use them wherever you can. Since this is the core of our business, I think for us having the control here is also important.

Jördening: Regarding control, you said that you basically manage the domain for the ingresses in every project itself. How do you manage the access there? Do you buy a domain for every project, because sharing domain is a sensitive topic, I feel?

Grätz: The main goal for us was that deploying a labeling tool for a client project would be easy. Not every data scientist or machine learning engineer knows exactly how DNS records work. They maybe don't want to go through the hassle with certificates so that the labeling tool, for example, is TLS secured. This had to be automated in some way, so they just deploy a few things, maybe call Terraform apply here and devspace deploy there, and then wait until it's there. The way we do it is that we have one domain, which we bought, of course. Then there are subdomains for the individual projects. When you run the deployment generator, it creates Terraform configuration that creates a static IP address for the cluster. It creates DNS records that map that subdomain. We own the domain, and the subdomain is mapped onto that static IP, so the engineer doesn't have to worry about that. Then there is Kubernetes configuration rendered that configures an ingress object that knows that the static IP exists and will use it. It will use a managed certificate into which the domain has already been rendered. Everything is there. The engineer just has to apply these configurations using Terraform, and then Kubernetes, and then wait a little. It takes a few minutes to propagate. Everything is there beforehand, and the data scientist doesn't really have to know, until it breaks, of course.
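
The mapping from project to subdomain to static IP can be sketched as follows. The base domain `example.com`, the IP address from a documentation range, the naming rule, and the names `DnsRecord`, `project_subdomain`, and `a_record` are placeholders for illustration only; the talk does not state the real domain or naming scheme.

```python
import re
from dataclasses import dataclass

BASE_DOMAIN = "example.com"  # placeholder; the real domain is not given in the talk


@dataclass(frozen=True)
class DnsRecord:
    name: str
    record_type: str
    value: str


def project_subdomain(project_name: str) -> str:
    """Turn a project name into a DNS-safe label under the shared base domain."""
    label = re.sub(r"[^a-z0-9-]+", "-", project_name.lower()).strip("-")
    if not label:
        raise ValueError("project name has no DNS-safe characters")
    return f"{label}.{BASE_DOMAIN}"


def a_record(project_name: str, static_ip: str) -> DnsRecord:
    """Build the A record that maps the project's subdomain to its static IP."""
    return DnsRecord(project_subdomain(project_name), "A", static_ip)


record = a_record("Client Project 42", "203.0.113.10")
```

Deriving the record from the project name is what would let a generator render the DNS, ingress, and certificate configuration consistently without the engineer typing the domain anywhere.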

Wollmann: This is done in Google Cloud. We have this zero trust principle. We have a proxy in the front that separates the access from the different clients between the deployments, even if they are under a single domain. This keeps it secure, at least for these tasks.

Jördening: I was more thinking about basically, if you deploy one project, and you have access to the domain that you could redirect some other traffic.

If you have the environment, you have your machine learning code, how long does it then take you to get the product ready for your customers, in terms of man hours? You can put it on a critical time path.

How long it took to get one product released, so from getting all the infrastructure up to having probably an endpoint serving it.

Grätz: Our machine learning projects are very custom, so we might have some that take two weeks, we might have some that take half a year. The infrastructure setup takes an hour or so, tops, with runtime. Getting the infrastructure ready for deployment also takes very little time. What happens in between there, there's lots of custom development happening usually and that can have very different time spans.

Was the question rather targeted to how long it took us to get to this point where the infrastructure can be managed that quickly?

Jördening: It's for getting a product ready. It was about the tool.

Grätz: We started a year ago.

Wollmann: It was more like an iterative process. We built the first prototype, then tested whether it sticks with the engineers and whether they liked it, also in terms of usability. Then we extended the functionality, especially of the templates, to add more stuff that can be automatically templated and deployed. Recently, we reworked this whole system to make it adaptable to different kinds of projects, so it's less static, because in some projects you need some of the functionality, and in some you don't. With this generator system, for example, you are very flexible. We looked at different solutions and tried different things until we got to the final thing. It wasn't that we started a waterfall development and now it's there. It was more like an iterative thing.

Jördening: Is this open source or how can I try this?

Wollmann: Primarily, we built this for ourselves to speed up our workflows and to automate whatever we can. We figured out that this is maybe also interesting for others, so that's why we presented it now to a larger audience, to see if there's interest. We are currently thinking about how to get a larger user base hands-on with it. In the Merantix ecosystem, we already made it modular, so that things can be added according to the different operations of the ventures, because not every company operates the same.

Jördening: Where do your data scientists prototype? I assume they prototype in the K8s cluster where you will deploy in the end this flow, or do you have a separate prototyping environment?

Grätz: We use MLflow to schedule trainings. In MLflow you can have different backends: you can have a local one that runs in a conda environment, you can have a local Docker environment, and you can have a Kubernetes environment. You can write plugins for your own environment, basically. The engineers we hire typically know Docker. We pay attention to that in the hiring process because we use it everywhere. Locally, the engineers develop in a Docker container that acts as the Python runtime. They attach that to their IDE, then they schedule a training on the cluster. The training runs on the cluster. We have custom plugins for distributed training in Kubernetes that use components of Kubeflow.

What we also find really great is that devspace, which I can really recommend to everyone working with Kubernetes, has a dev mode that synchronizes files to your cluster and back. We have like a debug pod configured that has every dependency that the base image of the project has, including JupyterLab and some other tools you might need for debugging. Then you can, in the terminal, just write devspace dev deployment debug pod, and that will spin up this pod running JupyterLab in the cluster, and you get the file synchronized back and forth. You change the notebook you have it on your computer, immediately. You drop an image into the folder on your dev machine, you have it in the cluster immediately. It's like having a GPU on your laptop.




Recorded at:

Apr 28, 2022